UCAS Data Mining Course HW1

Posted 2 years ago | Author: Torture_L | Category: Toy Blog

HW1

Submission requirements:

Please submit your solutions to our class website.

Q1.Suppose that a data warehouse consists of four dimensions, date, spectator, location, and game, and two measures, count and charge, where charge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate.

(a) Draw a star schema diagram for the data warehouse.

[Figure: star schema diagram for the data warehouse]

(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations should one perform in order to list the total charge paid by student spectators in Los Angeles?

step 1. Roll-up on date from date_key to all
step 2. Roll-up on spectator from spectator_key to status
step 3. Roll-up on location from location_key to location_name
step 4. Roll-up on game from game_key to all

step 5. Dice with "status=student" and "location_name=Los Angeles"
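
The five steps above can be sketched on a toy fact table. This is only an illustration under stated assumptions: the rows in `facts` are made up, and the dimension keys are assumed to be already resolved to their status and city attributes, so the roll-ups amount to grouping by those two attributes and dropping date and game.

```python
from collections import defaultdict

# Hypothetical fact rows: (date, spectator status, city, game, charge).
# The values are invented purely for illustration.
facts = [
    ("2024-01-01", "student", "Los Angeles", "G1", 10.0),
    ("2024-01-01", "adult",   "Los Angeles", "G1", 25.0),
    ("2024-01-02", "student", "Los Angeles", "G2", 12.0),
    ("2024-01-02", "student", "Chicago",     "G2", 11.0),
    ("2024-01-03", "senior",  "Los Angeles", "G1", 8.0),
]

# Steps 1-4: roll date and game up to "all" (ignore them) and keep only
# status and city, summing charge within each (status, city) cell.
cuboid = defaultdict(float)
for _date, status, city, _game, charge in facts:
    cuboid[(status, city)] += charge

# Step 5: dice on status = "student" and city = "Los Angeles".
total = cuboid[("student", "Los Angeles")]
print(total)  # 22.0
```

In a real warehouse the same result would typically come from a GROUP BY over the joined fact and dimension tables followed by a filter.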

(c) Bitmap indexing is a very useful optimization technique. Please present the pros and cons of using bitmap indexing in this given data warehouse.

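Pros

A bitmap index is an efficient index structure. Queries and filters reduce to bitwise operations, which are much faster than scanning rows and comparing values. For example, suppose the spectator table in this warehouse contains the columns status and gender:

spectator_key	status	gender
0	student	male
1	adult	female
2	student	male
3	student	female
4	senior	female

The status column yields the following bitmap index:

status="student" : 10110
status="adult" : 01000
status="senior" : 00001

The gender column yields the following bitmap index:

gender="male": 10100
gender="female": 01011

For example, to select all students, we only need to filter the original rows with the bitmap 10110.

For a combined query, such as selecting the rows with both status="student" and gender="male", a single bitwise AND is enough:

10110 & 10100 = 10100

This greatly speeds up the computation.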
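
The example above can be reproduced with a short sketch. It assumes the five-row spectator table with its values written in English; the helper names `build_bitmaps` and `bitmap_and` are illustrative and not part of any library.

```python
# Rows of the spectator table above: (spectator_key, status, gender).
rows = [
    (0, "student", "male"),
    (1, "adult", "female"),
    (2, "student", "male"),
    (3, "student", "female"),
    (4, "senior", "female"),
]

def build_bitmaps(rows, col):
    """Return {value: bit string}, one character per row, '1' where the row matches."""
    values = sorted({row[col] for row in rows})
    return {v: "".join("1" if row[col] == v else "0" for row in rows) for v in values}

def bitmap_and(a, b):
    """Bitwise AND of two equal-length bit strings."""
    width = len(a)
    return format(int(a, 2) & int(b, 2), f"0{width}b")

status_index = build_bitmaps(rows, 1)
gender_index = build_bitmaps(rows, 2)

# Combined query: status = "student" AND gender = "male".
hits = bitmap_and(status_index["student"], gender_index["male"])
print(hits)  # 10100

# Map the set bits back to spectator keys.
matching_keys = [row[0] for row, bit in zip(rows, hits) if bit == "1"]
print(matching_keys)  # [0, 2]
```

Bit strings are used here only for readability; a real index would pack the bits into machine words.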

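In addition, a bitmap index can answer some queries without touching the original data at all, which speeds up processing further. For example, to count the spectators that satisfy the condition above, we only need to count the set bits:

ans=0
x=(0b10110&0b10100)  # binary literals; a plain 10110 would be read as a decimal number
while x:
    x&=(x-1)  # clear the lowest set bit
    ans+=1
# ans is now 2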

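Cons

Bitmap indexes suit enumerated types, that is, discrete variables. They do not work well for continuous variables, which usually have to be discretized first. In this warehouse, for example, the phone number field would probably be a poor fit (and perhaps the field does not need to exist at all).

When there are many attribute columns, the cost of building bitmap indexes is also relatively high.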
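
A rough size estimate makes the cardinality problem concrete. The sketch below assumes an uncompressed index (one bit per row per distinct value) and a hypothetical table of one million spectators whose phone numbers are all distinct; real systems usually compress bitmaps, so treat the numbers as an upper bound.

```python
def bitmap_index_bits(num_rows, num_distinct_values):
    """Bits needed for an uncompressed bitmap index on one column."""
    return num_rows * num_distinct_values

n = 1_000_000  # hypothetical number of spectators

status_bits = bitmap_index_bits(n, 3)  # student / adult / senior
phone_bits = bitmap_index_bits(n, n)   # assumes every phone number is distinct

print(status_bits // 8)  # 375000 bytes
print(phone_bits // 8)   # 125000000000 bytes
```

The status index fits in a few hundred kilobytes, while the phone number index would need on the order of a hundred gigabytes before compression.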

Q2. An e-mail database stores a large number of e-mails. Design a data warehouse structure so that users can query and mine the data along multiple dimensions.

[Figure: data warehouse schema for the e-mail database]

Q3. Suppose a hospital tested the age and body fat data for 18 randomly selected adults with the following results:

age	23	23	27	27	39	41	47	49	50	52	54	54	56	57	58	58	60	61
%fat	9.5	26.5	7.8	17.8	31.4	25.9	27.4	27.2	31.2	34.6	42.5	28.8	33.4	30.2	34.1	32.9	41.2	35.7

(a) Calculate the mean, median, and standard deviation of age and %fat.

             age       %fat
mean   46.444444  28.783333
std    13.218624   9.254395
median      51.0       30.7
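
The figures in the table can be cross-checked with Python's standard `statistics` module. This is a minimal sketch, not necessarily how the table was produced. The std row appears consistent with the sample standard deviation (dividing by n - 1); the population version (dividing by n) is printed alongside for comparison.

```python
import statistics

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6,
       42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

for name, values in (("age", age), ("%fat", fat)):
    print(name,
          round(statistics.mean(values), 6),
          statistics.median(values),
          round(statistics.stdev(values), 6),   # sample std, divides by n - 1
          round(statistics.pstdev(values), 6))  # population std, divides by n
```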

(b) Draw the boxplots for age and %fat.

[Figure: boxplots of age and %fat]
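
A boxplot is drawn from the five-number summary, which can be computed without any plotting library. The sketch below uses `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points; other tools may use a different quartile convention and give slightly different Q1 and Q3. The helper name `five_number_summary` is illustrative.

```python
import statistics

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6,
       42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using linear interpolation between data points."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

print(five_number_summary(age))  # (23, 39.5, 51.0, 56.75, 61)
print(five_number_summary(fat))
```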

(c) Draw a scatter plot based on these two variables.

[Figure: scatter plot of age versus %fat]

(d) Normalize age based on min-max normalization.

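import pandas as pd
# age and %fat columns from the table above
data = pd.DataFrame({"age": [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61],
                     "%fat": [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]})
x=data["age"]
y=data['%fat']
# Min-max normalization rescales each column to the range [0, 1]
X=(x-x.min())/(x.max()-x.min())
Y=(y-y.min())/(y.max()-y.min())
print(X,Y)

The normalized age values (X) are: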

0     0.000000
1     0.000000
2     0.105263
3     0.105263
4     0.421053
5     0.473684
6     0.631579
7     0.684211
8     0.710526
9     0.763158
10    0.815789
11    0.815789
12    0.868421
13    0.894737
14    0.921053
15    0.921053
16    0.973684
17    1.000000

(e) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these two variables positively or negatively correlated?

import numpy as np
from scipy import stats

print(np.corrcoef(x,y))
print("correlation coefficient", stats.pearsonr(x,y)[0])

Result:

[[1.        0.8176188]
 [0.8176188 1.       ]]
correlation coefficient 0.8176187964565874

The coefficient is about 0.82, so age and %fat are positively correlated.
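
For reference, the same coefficient can be computed directly from the definition, without NumPy or SciPy. This is a minimal sketch; the function name `pearson` is illustrative.

```python
import math

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6,
       42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

def pearson(xs, ys):
    """Pearson's r: co-deviation sum divided by the root of the squared-deviation sums.

    The 1/n (or 1/(n-1)) factors cancel between numerator and denominator,
    so plain sums are enough.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson(age, fat)
print(round(r, 4))  # 0.8176
```

A value of r close to +1 indicates a strong positive linear relationship, which agrees with the conclusion above.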

(f) Smooth the fat data by bin means, using a bin depth of 6.

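def mean(x):
    # Bin mean, rounded to two decimal places
    return round(sum(x)/len(x),2)

N_y=sorted(y)  # equal-depth binning works on the sorted values
bins=[[]]
for j in N_y:
    bins[-1].append(j)
    if len((v:=bins[-1]))==6:
        # The bin is full: replace every value with the bin mean
        v[:]=[mean(v)]*len(v)
        bins.append([])
# The last element of bins is an empty list, so skip it
for i,j  in enumerate(bins[:-1]):
    print("bin %d is :"%(i+1),j)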

bin 1 is : [19.12, 19.12, 19.12, 19.12, 19.12, 19.12]
bin 2 is : [30.32, 30.32, 30.32, 30.32, 30.32, 30.32]
bin 3 is : [36.92, 36.92, 36.92, 36.92, 36.92, 36.92]

(g) Smooth the fat data by bin boundaries, using a bin depth of 6.

Because the values in each bin are already sorted, we can use binary search to find the split point between the values closer to the lower boundary and those closer to the upper boundary.

def close(x,a,b):
    # True if x is at least as close to the lower boundary a as to the upper boundary b
    return (x-a)<=(b-x)

def boundary(x):
    # x is one sorted bin; every value is replaced by the nearer of its min and max
    Min=x[0]
    Max=x[-1]

    # Binary search for the first index whose value is closer to Max
    l,r=0,len(x)
    while l<r:
        mid=(r-l)//2+l
        if close(x[mid],Min,Max):
            l=mid+1
        else:
            r=mid
    # Return a flat list so that v[:]=boundary(v) keeps the bin length
    return [Min]*l+[Max]*(len(x)-l)

N_y=sorted(y)
bins=[[]]
for j in N_y:
    bins[-1].append(j)
    if len((v:=bins[-1]))==6:
        v[:]=boundary(v)
        bins.append([])
for i,j  in enumerate(bins[:-1]):
    print("bin %d is :"%(i+1),j)

bin 1 is : [7.8, 7.8, 27.2, 27.2, 27.2, 27.2]
bin 2 is : [27.4, 27.4, 32.9, 32.9, 32.9, 32.9]
bin 3 is : [33.4, 33.4, 33.4, 33.4, 42.5, 42.5]
