HW1
Submission requirements:
Please submit your solutions to our class website.
Q1.Suppose that a data warehouse consists of four dimensions, date, spectator, location, and game, and two measures, count and charge, where charge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate.
(a) Draw a star schema diagram for the data warehouse.
(b) Starting with the base cuboid [date, spectator, location, game],what specific OLAP operations should one perform in order to list the total charge paid by student spectators in Los Angeles?
step 1. Roll-up on date from date_key to all
step 2. Roll-up on spectator from spectator_key to status
step 3. Roll-up on location from location_key to location_name
step 4. Roll-up on game from game_key to all
step 5. Dice with "status=student" and "location_name=Los Angeles"
? Bitmap indexing is a very useful optimization technique. Please present the pros and cons of using bitmap indexing in this given data warehouse.
優(yōu)點(diǎn)
位圖索引是一種高效的索引結(jié)構(gòu),在查詢、過濾等方面上,由于進(jìn)行的是位運(yùn)算,所以比常規(guī)的查詢方式快很多。例如在本倉庫中,假設(shè)對于spectator
表的子列status
,我們有:
spectator_key | status | gender |
---|---|---|
0 | 學(xué)生 | 男 |
1 | 成人 | 女 |
2 | 學(xué)生 | 男 |
3 | 學(xué)生 | 女 |
4 | 老人 | 女 |
status
就可以建立以下位圖索引:
status="學(xué)生" : 10110
status="成人" : 01000
status="老人" : 00001
gender
可以建立以下位圖索引:
gender="男": 10100
gender="女": 01011
例如,我們想要查詢學(xué)生,只需要用10110
去過濾原始數(shù)據(jù)就行。
我們想混合查詢,比如同時(shí)查詢status="學(xué)生"
和gender="男"
的數(shù)據(jù),只需要進(jìn)行并操作就行了:
10110 & 10100 = 10100
可以大大提高計(jì)算速度。
此外,位圖索引可以在一定程度上繞開
原始數(shù)據(jù),進(jìn)一步提高處理速度。例如,我們想統(tǒng)計(jì)滿足上面條件的人數(shù),只需要:
ans=0
x=(10110&10100)
while x:
x&=(x-1)
ans+=1
缺點(diǎn)
位圖索引比較適合枚舉類型,也就是離散型變量,對于連續(xù)變量,位圖索引并不適用,往往需要先做離散化。比如本倉庫中,phone number
字段可能就不太適合(也許這個(gè)字段沒有存在的必要?)
而當(dāng)屬性列非常多時(shí),我們做位圖索引的開銷也比較大。
Q2.某電子郵件數(shù)據(jù)庫中存儲了大量的電子郵件。請?jiān)O(shè)計(jì)數(shù)據(jù)倉庫的結(jié)構(gòu),以便用戶從多個(gè)維度進(jìn)行查詢和挖掘。
Q3. Suppose a hospital tested the age and body fat data for 18 random selected adults with the following result:
age | 23 | 23 | 27 | 27 | 39 | 41 | 47 | 49 | 50 | 52 | 54 | 54 | 56 | 57 | 58 | 58 | 60 | 61 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
%fat | 9.5 | 26.5 | 7.8 | 17.8 | 31.4 | 25.9 | 27.4 | 27.2 | 31.2 | 34.6 | 42.5 | 28.8 | 33.4 | 30.2 | 34.1 | 32.9 | 41.2 | 35.7 |
(a) Calculate the mean, median, and standard deviation of age and %fat.
age %fat
mean 46.444444 28.783333
std 13.218624 9.254395
median 51.0 30.7
(b) Draw the boxplots for age and %fat.
? Draw a scatter plot based on these two variables.
(d) Normalize age based on min-max normalization.
x=data["age"]
y=data['%fat']
X=(x-x.min())/(x.max()-x.min())
Y=(y-y.min())/(y.max()-y.min())
print(X,Y)
Result is:
0 0.000000
1 0.000000
2 0.105263
3 0.105263
4 0.421053
5 0.473684
6 0.631579
7 0.684211
8 0.710526
9 0.763158
10 0.815789
11 0.815789
12 0.868421
13 0.894737
14 0.921053
15 0.921053
16 0.973684
17 1.000000
(e) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these two variables positively or negatively correlated?
print(np.corrcoef(x,y))
print("相關(guān)系數(shù)" ,stats.pearsonr(x,y)[0])
Result is
[[1. 0.8176188]
[0.8176188 1. ]]
相關(guān)系數(shù) 0.8176187964565874
I think they are positively correlated.
(f) Smooth the fat data by bin means, using a bin depth of 6.
def mean(x):
return round(sum(x)/len(x),2)
N_y=sorted(y)
bins=[[]]
for j in N_y:
bins[-1].append(j)
if len((v:=bins[-1]))==6:
v[:]=[mean(v)]*len(v)
bins.append([])
for i,j in enumerate(bins[:-1]):
print("bin %d is :"%(i+1),j)
bin 1 is : [19.12, 19.12, 19.12, 19.12, 19.12, 19.12]
bin 2 is : [30.32, 30.32, 30.32, 30.32, 30.32, 30.32]
bin 3 is : [36.92, 36.92, 36.92, 36.92, 36.92, 36.92]
(g) Smooth the fat data by bin boundaries, using a bin depth of 6.文章來源:http://www.zghlxwxcb.cn/news/detail-635554.html
這里因?yàn)槲覀兪菍ε藕眯虻臄?shù)據(jù)做處理,所以可以通過二分法進(jìn)行優(yōu)化,獲取中間分界。文章來源地址http://www.zghlxwxcb.cn/news/detail-635554.html
def close(x,a,b):
# 是否靠近下界
return (x-a)<=(b-x)
def boundary(x):
Min=x[0]
Max=x[-1]
l,r=0,len(x)-1
while l<=r:
mid=(r-l)//2+l
if close(x[mid],Min,Max):
if not close(x[mid+1],Min,Max):
l=mid
break
l=mid+1
else:
if close(x[mid-1],Min,Max):
l=mid
break
r=mid-1
return [[Min]*l+[Max]*(len(x)-l)]
N_y=sorted(y)
bins=[[]]
for j in N_y:
bins[-1].append(j)
if len((v:=bins[-1]))==6:
v[:]=boundary(v)
bins.append([])
for i,j in enumerate(bins[:-1]):
print("bin %d is :"%(i+1),j)
bin 1 is : [[7.8, 7.8, 27.2, 27.2, 27.2, 27.2]]
bin 2 is : [[27.4, 27.4, 32.9, 32.9, 32.9, 32.9]]
bin 3 is : [[33.4, 33.4, 33.4, 33.4, 42.5, 42.5]]
到了這里,關(guān)于國科大數(shù)據(jù)挖掘課程HW1的文章就介紹完了。如果您還想了解更多內(nèi)容,請?jiān)谟疑辖撬阉鱐OY模板網(wǎng)以前的文章或繼續(xù)瀏覽下面的相關(guān)文章,希望大家以后多多支持TOY模板網(wǎng)!