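I. What is feature dimensionality reduction
Dimensionality reduction is the process of reducing the number of random variables (features), under certain constraints, to obtain a set of "uncorrelated" principal variables.
1. Dimensionality reduction
Reducing the number of dimensions.
ndarray
    Number of dimensions: the depth of nesting
    0-D: a scalar, i.e. a single number such as 0, 1, 2, 3...
    1-D: a vector
    2-D: a matrix
    3-D: several 2-D arrays nested together
    n-D: keep nesting further
2. What feature dimensionality reduction actually reduces
It operates on a 2-D array: the feature matrix has some number of rows and columns, where each row is a sample and each column is a feature.
What is reduced is the number of features (that is, the number of columns).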
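The nesting levels listed above can be checked directly with NumPy. The following is a minimal sketch; the array values are made up for illustration, and it only assumes that NumPy is installed.

```python
import numpy as np

scalar = np.array(3)                        # 0-D: a single number
vector = np.array([1, 2, 3])                # 1-D: a vector
matrix = np.array([[1, 2, 3], [4, 5, 6]])   # 2-D: 2 samples x 3 features
cube = np.array([matrix, matrix])           # 3-D: two 2-D arrays nested

for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("cube", cube)]:
    print(name, "ndim =", arr.ndim, "shape =", arr.shape)

# Feature reduction drops columns (features), not rows (samples)
reduced = matrix[:, :2]
print(reduced.shape)  # (2, 2)
```

Here `ndim` reports the nesting depth, while `shape` on the 2-D array gives (number of samples, number of features).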
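II. Two approaches to dimensionality reduction
1. Feature selection
2. Principal component analysis (which can be understood as a form of feature extraction)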
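III. What is feature selection
1. Definition
Data often contains redundant or correlated variables (also called features, attributes, or indicators). Feature selection aims to pick out the main features from the original ones.
2. Example: classifying birds by species
Which features?
(1) Feather color
(2) Eye width
(3) Eye length
(4) Claw length
(5) Body size
Other possible features, such as whether the bird has feathers or whether it has claws, are meaningless here because every bird shares them.
3. Methods
Filter: mainly examines the properties of each feature itself, the association between features, and the association between features and the target value
(1) Variance threshold: low-variance feature filtering, which removes features whose variance is low
(2) Correlation coefficient: the degree of correlation between features
(3) Variance thresholding performs very poorly in text classification: it has almost no ability to handle noise, and it also removes useful features
Embedded: the algorithm selects features automatically (based on the association between features and the target value)
(1) Decision trees: information entropy, information gain
(2) Regularization: L1, L2
(3) Deep learning: convolution, etc.
(4) The Embedded methods can only be introduced together with the corresponding algorithms, where they are easier to understand
4. Module
sklearn.feature_selection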
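As a small illustration of the Embedded idea with this module, the sketch below lets an L1-regularized linear model decide which features to keep. It is not taken from the article: the data is synthetic, and the choice of `Lasso(alpha=0.1)` wrapped in `SelectFromModel` is an assumption made for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Synthetic target: only the first two features influence y
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty pushes coefficients of unhelpful features towards zero;
# SelectFromModel keeps the features whose coefficients stay non-zero
selector = SelectFromModel(Lasso(alpha=0.1))
X_new = selector.fit_transform(X, y)
selected = selector.get_support()
print(selected, X_new.shape)
```

The selection happens as a by-product of fitting the model, which is what distinguishes Embedded methods from Filter methods.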
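IV. Low-variance feature filtering
1. Remove features with low variance. The meaning of variance was covered earlier; here we look at this method in terms of how large the variance is.
(1) Small feature variance: most samples have similar values for the feature
(2) Large feature variance: many samples have different values for the feature
2. API
sklearn.feature_selection.VarianceThreshold(threshold=0.0)
Removes all low-variance features: set a threshold, and every feature whose variance is below it is removed
Variance: the variance of a feature
Threshold: the cut-off value
3. VarianceThreshold.fit_transform(X)
X: data in numpy array format, shape [n_samples, n_features]
Return value: the data with every feature whose training-set variance is below threshold removed. The default keeps all features with non-zero variance, i.e. it removes features that have the same value in all samples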
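Before the stock data is used, the behaviour of the threshold can be seen on a tiny array. This is a minimal sketch with made-up numbers: the first column is constant, the second varies slightly, and the third varies a lot.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [0, 2.0, 10.0],
    [0, 2.1, 20.0],
    [0, 1.9, 30.0],
    [0, 2.0, 40.0],
])

# Default threshold=0.0: only the constant column is removed
default_selector = VarianceThreshold()
X_default = default_selector.fit_transform(X)
print(default_selector.variances_)
print(X_default.shape)  # (4, 2)

# threshold=1.0: the slightly varying column is removed as well
strict_selector = VarianceThreshold(threshold=1.0)
X_strict = strict_selector.fit_transform(X)
print(strict_selector.get_support())
print(X_strict.shape)  # (4, 1)
```

Because variance depends on the scale of each feature, a single threshold treats a column measured in billions very differently from a ratio near 1, which is worth keeping in mind when choosing the value.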
4. Computing on data
We filter the indicator features of some stocks. The data is in the file factor_returns.csv; the index, date, and return columns are excluded (their types do not fit, and they are not the indicators we need)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.feature_selection import VarianceThreshold
import jieba
import pandas as pd

def datasets_demo():
    """
    Using sklearn datasets
    """
    # Load the dataset
    iris = load_iris()
    print("Iris dataset:\n", iris)
    print("Dataset description:\n", iris["DESCR"])
    print("Feature names:\n", iris.feature_names)
    print("Shape of the feature values (rows, columns):\n", iris.data.shape)
    # Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=22)
    print("Training-set feature values:\n", x_train, x_train.shape)
    return None

def dict_demo():
    """
    Feature extraction from dictionaries
    """
    data = [{'city': '北京','temperature':100},{'city': '上海','temperature':60},{'city': '深圳','temperature':30}]
    # 1. Instantiate a transformer
    transfer = DictVectorizer(sparse=False)
    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    print("Feature names:\n", transfer.get_feature_names_out())
    return None

def count_demo():
    """
    Text feature extraction
    """
    data = ["life is short,i like like python", "life is too long,i dislike python"]
    # 1. Instantiate a transformer
    transfer = CountVectorizer()
    # 2. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names_out())
    return None

def count_chinese_demo():
    """
    Chinese text feature extraction
    """
    data = ["我 爱 北京 天安门", "天安门 上 太阳 升"]
    # 1. Instantiate a transformer
    transfer = CountVectorizer()
    # 2. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())
    print("Feature names:\n", transfer.get_feature_names_out())
    return None

def cut_word(text):
    """
    Segment Chinese text into words
    """
    # jieba.cut returns a generator of tokens; convert it to a list, then join it into a string
    return " ".join(list(jieba.cut(text)))

def count_chinese_demo2():
    """
    Chinese text feature extraction with automatic word segmentation
    """
    # 1. Segment the Chinese text
    data = ["今天很残酷,明天更残酷,后天很美好,但绝对大部分是死在明天晚上,所以每个人不要放弃今天。",
            "我们看到的从很远星系来的光是在几百万年前之前发出的,这样当我们看到宇宙时,我们是在看它的过去。",
            "如果只用一种方式了解某样事物,你就不会真正了解它。了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系。"]
    data_new = []
    for sent in data:
        data_new.append(cut_word(sent))
    print(data_new)
    # 2. Instantiate a transformer
    transfer = CountVectorizer()
    # 3. Call fit_transform()
    data_final = transfer.fit_transform(data_new)
    print("data_final:\n", data_final.toarray())
    print("Feature names:\n", transfer.get_feature_names_out())
    return None

def tfidf_demo():
    """
    Text feature extraction using tf-idf
    """
    # 1. Segment the Chinese text
    data = ["今天很残酷,明天更残酷,后天很美好,但绝对大部分是死在明天晚上,所以每个人不要放弃今天。",
            "我们看到的从很远星系来的光是在几百万年前之前发出的,这样当我们看到宇宙时,我们是在看它的过去。",
            "如果只用一种方式了解某样事物,你就不会真正了解它。了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系。"]
    data_new = []
    for sent in data:
        data_new.append(cut_word(sent))
    print(data_new)
    # 2. Instantiate a transformer
    transfer = TfidfVectorizer()
    # 3. Call fit_transform()
    data_final = transfer.fit_transform(data_new)
    print("data_final:\n", data_final.toarray())
    print("Feature names:\n", transfer.get_feature_names_out())
    return None

def minmax_demo():
    """
    Min-max normalization
    """
    # 1. Load the data
    data = pd.read_csv("dating.txt")
    #print("data:\n", data)
    data = data.iloc[:, 0:3]  # all rows, first 3 columns
    print("data:\n", data)
    # 2. Instantiate a transformer
    transfer = MinMaxScaler()
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None

def stand_demo():
    """
    Standardization
    """
    # 1. Load the data
    data = pd.read_csv("dating.txt")
    #print("data:\n", data)
    data = data.iloc[:, 0:3]  # all rows, first 3 columns
    print("data:\n", data)
    # 2. Instantiate a transformer
    transfer = StandardScaler()
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None

def variance_demo():
    """
    Filter out low-variance features
    """
    # 1. Load the data
    data = pd.read_csv("factor_returns.csv")
    #print("data:\n", data)
    data = data.iloc[:, 1:-2]  # drop the index column and the last two columns (date, return)
    print("data:\n", data)
    # 2. Instantiate a transformer
    transfer = VarianceThreshold(threshold=3)
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new, data_new.shape)
    return None

if __name__ == "__main__":
    # Code 1: using sklearn datasets
    #datasets_demo()
    # Code 2: feature extraction from dictionaries
    #dict_demo()
    # Code 3: text feature extraction
    #count_demo()
    # Code 4: Chinese text feature extraction
    #count_chinese_demo()
    # Code 5: Chinese text feature extraction with automatic word segmentation
    #count_chinese_demo2()
    # Code 6: testing Chinese word segmentation with jieba
    #print(cut_word("我爱北京天安门"))
    # Code 7: text feature extraction using tf-idf
    #tfidf_demo()
    # Code 8: min-max normalization
    #minmax_demo()
    # Code 9: standardization
    #stand_demo()
    # Code 10: low-variance feature filtering
    variance_demo()
Output:
data:
pe_ratio pb_ratio market_cap return_on_asset_net_profit du_return_on_equity ev earnings_per_share revenue total_expense
0 5.9572 1.1818 8.525255e+10 0.8008 14.9403 1.211445e+12 2.0100 2.070140e+10 1.088254e+10
1 7.0289 1.5880 8.411336e+10 1.6463 7.8656 3.002521e+11 0.3260 2.930837e+10 2.378348e+10
2 -262.7461 7.0003 5.170455e+08 -0.5678 -0.5943 7.705178e+08 -0.0060 1.167983e+07 1.203008e+07
3 16.4760 3.7146 1.968046e+10 5.6036 14.6170 2.800916e+10 0.3500 9.189387e+09 7.935543e+09
4 12.5878 2.5616 4.172721e+10 2.8729 10.9097 8.124738e+10 0.2710 8.951453e+09 7.091398e+09
... ... ... ... ... ... ... ... ... ...
2313 25.0848 4.2323 2.274800e+10 10.7833 15.4895 2.784450e+10 0.8849 1.148170e+10 1.041419e+10
2314 59.4849 1.6392 2.281400e+10 1.2960 2.4512 3.810122e+10 0.0900 1.731713e+09 1.089783e+09
2315 39.5523 4.0052 1.702434e+10 3.3440 8.0679 2.420817e+10 0.2200 1.789082e+10 1.749295e+10
2316 52.5408 2.4646 3.287910e+10 2.7444 2.9202 3.883803e+10 0.1210 6.465392e+09 6.009007e+09
2317 14.2203 1.4103 5.911086e+10 2.0383 8.6179 2.020661e+11 0.2470 4.509872e+10 4.132842e+10
[2318 rows x 9 columns]
data_new:
[[ 5.95720000e+00 1.18180000e+00 8.52525509e+10 ... 1.21144486e+12
2.07014010e+10 1.08825400e+10]
[ 7.02890000e+00 1.58800000e+00 8.41133582e+10 ... 3.00252062e+11
2.93083692e+10 2.37834769e+10]
[-2.62746100e+02 7.00030000e+00 5.17045520e+08 ... 7.70517753e+08
1.16798290e+07 1.20300800e+07]
...
[ 3.95523000e+01 4.00520000e+00 1.70243430e+10 ... 2.42081699e+10
1.78908166e+10 1.74929478e+10]
[ 5.25408000e+01 2.46460000e+00 3.28790988e+10 ... 3.88380258e+10
6.46539204e+09 6.00900728e+09]
[ 1.42203000e+01 1.41030000e+00 5.91108572e+10 ... 2.02066110e+11
4.50987171e+10 4.13284212e+10]] (2318, 8)
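V. Correlation coefficient
1. Pearson correlation coefficient
A statistic that reflects how closely two variables are linearly related
2. Worked example of the formula
(1) Formula
(2) Suppose we want to relate annual advertising spend to average monthly sales
(3) How is the correlation coefficient between them calculated?
(4) Final calculation
(5) Result = 0.9942
We therefore conclude that advertising spend and average monthly sales are highly positively correlated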
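The formula and the worked table for this example were images and are not reproduced in the text. For reference, and assuming the standard definition is the one that was shown, the Pearson correlation coefficient of n paired observations (x_i, y_i) can be written as:

```latex
r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
         {\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\;
          \sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}
```

The value of r always lies between -1 and 1: values near 1 indicate a strong positive linear relationship (as with the 0.9942 above), values near -1 a strong negative one, and values near 0 little or no linear relationship.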
4. API
from scipy.stats import pearsonr
x: (N,) array_like
y: (N,) array_like
Returns: (Pearson's correlation coefficient, p-value), i.e. two values
Note: pandas also provides a method for computing correlation coefficients
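A minimal sketch of both routes, on made-up numbers in which y is roughly linear in x. It assumes SciPy and pandas are installed; `Series.corr` and `DataFrame.corr` use the Pearson method by default.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical data: y grows almost linearly with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# scipy: returns the coefficient and the p-value
r, p_value = pearsonr(x, y)
print("r =", r, "p-value =", p_value)

# pandas: coefficient between two columns, or the full correlation matrix
df = pd.DataFrame({"x": x, "y": y})
r_pandas = df["x"].corr(df["y"])
print(r_pandas)
print(df.corr())
```

The scipy function is convenient for a single pair of variables, while `DataFrame.corr()` gives every pairwise coefficient at once, which is useful for spotting redundant features.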
5. Case study: correlation between financial indicators of stocks
Compute the correlation coefficient between two particular variables
The keys used in data[...] must be the column names of your own table
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.feature_selection import VarianceThreshold
from scipy.stats import pearsonr
import jieba
import pandas as pd
def variance_demo():
    """
    Filter out low-variance features
    """
    # 1. Load the data
    data = pd.read_csv("factor_returns.csv")
    #print("data:\n", data)
    data = data.iloc[:, 1:-2]  # drop the index column and the last two columns (date, return)
    print("data:\n", data)
    # 2. Instantiate a transformer
    transfer = VarianceThreshold(threshold=3)
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new, data_new.shape)
    # 4. Compute the correlation coefficient between two variables
    r = pearsonr(data["pe_ratio"], data["pb_ratio"])
    print("Correlation coefficient:\n", r)
    return None

if __name__ == "__main__":
    # Code 10: low-variance feature filtering
    variance_demo()
Output (the data and data_new printouts are identical to those of the low-variance filtering run above and are omitted):
Correlation coefficient:
(-0.004389322779936261, 0.8327205496564927)
The first value is the correlation coefficient. It is close to 0, which indicates that the two variables have almost no linear correlation.
The second value is the p-value. Under the null hypothesis H0 that x and y are uncorrelated, the p-value is the probability of observing a correlation at least this strong purely by chance. A small p-value (commonly below 0.05) is evidence against H0; a large one means H0 cannot be rejected.
Here the p-value is about 0.83, so the data are consistent with pe_ratio and pb_ratio being uncorrelated, which agrees with the near-zero coefficient.
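6. What to do when features are highly correlated with each other
(1) Keep only one of them
(2) Take a weighted sum
For example, revenue and total_expense are highly correlated; give each a weight of 50%
(3) Principal component analysis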
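Options (1) and (2) can be sketched with pandas as follows. The numbers are hypothetical stand-ins for the stock data, and the combined column name `revenue_expense_mix` is invented for this example.

```python
import pandas as pd

# Hypothetical values standing in for two highly correlated columns
data = pd.DataFrame({
    "revenue": [2.0e10, 2.9e10, 1.2e7, 9.2e9],
    "total_expense": [1.1e10, 2.4e10, 1.2e7, 7.9e9],
    "pe_ratio": [5.9, 7.0, -262.7, 16.5],
})

# Option (1): keep only one of the two correlated columns
keep_one = data.drop(columns=["total_expense"])

# Option (2): replace both columns with a 50/50 weighted sum
combined = data.drop(columns=["revenue", "total_expense"])
combined["revenue_expense_mix"] = 0.5 * data["revenue"] + 0.5 * data["total_expense"]

print(keep_one.columns.tolist())
print(combined.columns.tolist())
```

Either way the feature count drops by one; the weighted sum keeps a little information from both columns at the cost of a less interpretable feature.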
7. Showing the correlation in a plot
Install matplotlib
(1) First install Pillow
Reference: https://pillow.readthedocs.io/en/latest/installation.html
python3 -m pip install --upgrade pip
python3 -m pip install --upgrade Pillow
(2) Then install matplotlib
pip3 install matplotlib
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.feature_selection import VarianceThreshold
from scipy.stats import pearsonr
import jieba
import pandas as pd
import matplotlib.pyplot as plt
def variance_demo():
    """
    Filter out low-variance features
    """
    # 1. Load the data
    data = pd.read_csv("factor_returns.csv")
    #print("data:\n", data)
    data = data.iloc[:, 1:-2]  # drop the index column and the last two columns (date, return)
    print("data:\n", data)
    # 2. Instantiate a transformer
    transfer = VarianceThreshold(threshold=3)
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new, data_new.shape)
    # 4. Compute the correlation coefficient between two variables
    r1 = pearsonr(data["pe_ratio"], data["pb_ratio"])
    print("Correlation coefficient:\n", r1)
    r2 = pearsonr(data["revenue"], data["total_expense"])
    print("Correlation between revenue and total_expense:\n", r2)
    # Show the correlation in a scatter plot
    plt.figure(figsize=(20, 8), dpi=100)
    plt.scatter(data['revenue'], data['total_expense'])
    plt.show()
    return None

if __name__ == "__main__":
    # Code 10: low-variance feature filtering
    variance_demo()
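VI. Principal component analysis
1. What is principal component analysis (PCA)
Definition: the process of transforming high-dimensional data into low-dimensional data, during which some of the original data may be discarded and new variables created
Purpose: compress the dimensionality of the data, reducing the dimensionality (complexity) of the original data as far as possible while losing only a small amount of information
Applications: regression analysis and cluster analysis
2. How best to represent a three-dimensional object in two dimensions
In reality the object is a kettle; once photographed it becomes flat
This amounts to reducing three dimensions to two, and some information may be lost in the process
How do we measure how much information is lost? An intuitive check is whether the two-dimensional image still lets us recognize the object as a kettle
Of the four pictures, only the last one can be recognized as a kettle, which means that this projection from three dimensions to two loses the least information
3. How PCA is computed
Find a suitable line (direction), and obtain the PCA result through a matrix operation
PCA is a dimensionality reduction technique. It does not fit the data to a model; instead it uses a linear transformation to project the original high-dimensional data onto a low-dimensional subspace, so that the projected data retains as much of the original information as possible while the number of features and their redundancy are reduced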
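The linear projection described above can be sketched in a few lines of NumPy. This is an illustrative implementation via the eigen-decomposition of the covariance matrix, not the routine scikit-learn uses internally; the helper name `pca_project` is made up for the example.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)            # 1. center each feature
    cov = np.cov(X_centered, rowvar=False)     # 2. covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)     # 3. eigen-decomposition (ascending eigenvalues)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]             # 4. directions of largest variance
    return X_centered @ components             # 5. linear projection onto those directions

data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]
projected = pca_project(data, 2)
print(projected.shape)
print(projected.var(axis=0, ddof=1))  # variance captured by each component
```

The signs of the projected columns may differ from what a library returns, since an eigenvector and its negation describe the same direction.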
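4. API
sklearn.decomposition.PCA(n_components=None)
Decomposes the data into a lower-dimensional space
n_components:
If a float is passed: the fraction of information (variance) to retain
If an integer is passed: the number of features to reduce to
5. PCA.fit_transform(X)
X: data in numpy array format, shape [n_samples, n_features]
Return value: an array with the specified number of dimensions after the transformation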
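The two ways of passing n_components can be compared on synthetic data. In this sketch the data is generated so that most of the variance lies along two directions; the shapes, scales, and the 0.95 target are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 100 samples, 5 features, driven mostly by 2 latent directions
latent = rng.normal(size=(100, 2)) * [5.0, 2.0]
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(scale=0.1, size=(100, 5))

# Integer: reduce to exactly 2 features
pca_int = PCA(n_components=2)
X_int = pca_int.fit_transform(X)
print(X_int.shape, pca_int.explained_variance_ratio_)

# Float: keep as many components as needed to retain 95% of the variance
pca_float = PCA(n_components=0.95)
X_float = pca_float.fit_transform(X)
print(X_float.shape, pca_float.explained_variance_ratio_.sum())
```

The `explained_variance_ratio_` attribute shows how much of the total variance each kept component accounts for, which is the "information retained" that the float form of n_components refers to.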
6. Computing on data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.feature_selection import VarianceThreshold
from scipy.stats import pearsonr
from sklearn.decomposition import PCA
import jieba
import pandas as pd
import matplotlib.pyplot as plt
def pca_demo():
    """
    PCA dimensionality reduction
    """
    data = [[2,8,4,5], [6,3,0,8], [5,4,9,1]]
    # 1. Instantiate a transformer that keeps 3 components
    # (with only 3 samples the centered data has rank 2, so the third component is numerically zero)
    transfer = PCA(n_components=3)
    # 2. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    # 1. Instantiate a transformer that retains 90% of the variance
    transfer2 = PCA(n_components=0.9)
    # 2. Call fit_transform
    data_new2 = transfer2.fit_transform(data)
    print("data_new2:\n", data_new2)
    return None

if __name__ == "__main__":
    # Code 10: low-variance feature filtering
    #variance_demo()
    # Code 11: PCA dimensionality reduction
    pca_demo()
Output:
data_new:
[[ 1.28620952e-15 3.82970843e+00 5.26052119e-16]
[ 5.74456265e+00 -1.91485422e+00 5.26052119e-16]
[-5.74456265e+00 -1.91485422e+00 5.26052119e-16]]
data_new2:
[[ 1.28620952e-15 3.82970843e+00]
[ 5.74456265e+00 -1.91485422e+00]
[-5.74456265e+00 -1.91485422e+00]]