

[Introduction to Artificial Intelligence] XGBoost in Practice: Feature Selection

2 years ago · Author: 小白的努力探索 · Category: Toy Blog



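Change the evaluation metric and the feature ranking changes with it; even the same method run repeatedly on the same data does not always give exactly the same result. Feature selection is therefore partly a matter of judgment and should be read with caution, but it does offer a useful way to cross-check conclusions.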

1. How does gradient boosting compute feature importance?

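One benefit of gradient boosting is that, once the boosted trees have been built, an importance score for each feature can be obtained fairly directly.
Generally, the more often a feature is used to build decision trees in the model, the higher its relative importance.
Within a single decision tree, a feature's importance is computed from the amount by which each split on that feature improves the performance measure, weighted by the number of records the node is responsible for. In other words, the more a feature's splits improve the performance measure (or the closer they are to the root), the larger the weight; and the more boosted trees select the feature, the more important it is. The performance measure can be the Gini impurity used to choose split points, or another measure.
Finally, a feature's results are summed with these weights across all boosted trees and then averaged to give its importance score.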
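To make "the amount by which a split improves the performance measure" concrete, here is a minimal sketch that computes the Gini impurity decrease of a single split, with each child weighted by its share of the records. The node contents and the helper names `gini` and `split_improvement` are made up for illustration. XGBoost's own gain is derived from gradient statistics of the loss rather than from Gini impurity, so this only illustrates the general idea.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_improvement(parent, left, right):
    """Impurity decrease of a split; children are weighted by their share of records."""
    n = len(parent)
    weighted_children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted_children


# Hypothetical node: 10 records split by some feature into two children.
parent = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
left = [0, 0, 0, 0, 1]
right = [0, 1, 1, 1, 1]

improvement = split_improvement(parent, left, right)
print(round(improvement, 3))  # 0.5 - 0.32 = 0.18
```

Summing such improvements per feature within each tree and averaging over the trees gives a score of the kind described above.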

2. Plotting feature importance

2.1 Feature importance scores: feature_importances_

Build an XGBoost model with the xgboost package; during training the model automatically scores the importance of each feature.
These scores are available through the model attribute feature_importances_.
They can be printed:

print(model.feature_importances_)

or plotted as a bar chart:

from matplotlib import pyplot as plt
plt.bar(range(len(model.feature_importances_)), model.feature_importances_)
plt.show()

2.2 Example

Train an XGBoost model and display the feature importances.

from numpy import loadtxt
from xgboost import XGBClassifier
from matplotlib import pyplot as plt

# load data
dataset = loadtxt('machine-1-1.csv', delimiter=",")

# split data into X (columns 1-38) and y (column 39)
X = dataset[:,1:39]
y = dataset[:,39]

# fit the model on all of the data
model = XGBClassifier()
model.fit(X, y)

# feature importance
print(model.feature_importances_)

# plot
plt.bar(range(len(model.feature_importances_)), model.feature_importances_)
plt.show()

(Figure: the printed scores and the resulting bar chart.)

2.3 Plotting sorted feature importance: plot_importance

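The bar chart above works well, but the bars are not sorted by importance. Fortunately, xgboost has a built-in plotting function for this.
The plot_importance() function in the xgboost library plots the features in order of importance.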

# plot feature importance
from matplotlib import pyplot as plt
from xgboost import plot_importance
plot_importance(model)
plt.show()
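If a plain bar chart is still preferred, the bars can be sorted by hand before plotting. The sketch below shows the sorting step with NumPy; the `scores` array holds placeholder values, and in practice it would be `model.feature_importances_`. The `f0`, `f1`, ... labels are assumed names for illustration.

```python
import numpy as np

# Placeholder scores for five features; in practice use model.feature_importances_.
scores = np.array([0.10, 0.40, 0.05, 0.30, 0.15])
labels = [f"f{i}" for i in range(len(scores))]

# Indices that order the scores from most to least important.
order = np.argsort(scores)[::-1]
sorted_labels = [labels[i] for i in order]
sorted_scores = scores[order]

for name, score in zip(sorted_labels, sorted_scores):
    print(f"{name}: {score:.2f}")
```

Passing `sorted_labels` and `sorted_scores` to `plt.bar` then draws the bars in descending order.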

2.4 Example

Using the same data as above:

# plot feature importance using built-in function
from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot as plt
# load data
dataset = loadtxt('machine-1-1.csv', delimiter=",")
# split data into X (columns 1-38) and y (column 39)
X = dataset[:,1:39]
y = dataset[:,39]
# fit the model on all of the data
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model,importance_type='gain')
plt.show()

(Figure: the feature importance scores.)

2.5 Why plot_importance and feature_importances_ can give different rankings

The feature rankings obtained from xgboost's plot_importance and from feature_importances_ may not agree.
Why does this happen?

Because plot_importance defaults to importance_type='weight', whereas feature_importances_ defaults to importance_type='gain' for tree boosters.

What can be done?

Use the same importance_type for both.

How does xgboost compute feature importance?

Three importance types are commonly used: weight, gain, and cover (xgboost also provides total_gain and total_cover).

weight is the number of times a feature appears in the boosted trees, that is, the number of times it is used as a split node across all trees.

gain is the average gain (the improvement in the objective) brought by splits on the feature across all trees.

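cover is the relative number of records (observations) related to a feature. For example, with 100 observations, 4 features, and 3 trees, suppose feature 1 decides the leaf node for 10, 5, and 2 observations in tree 1, tree 2, and tree 3 respectively; the cover of that feature is then counted as 10 + 5 + 2 = 17 observations. The same is computed for all 4 features, and the 17 is expressed as a percentage of the total cover over all features.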
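The three definitions can be sketched on a handful of invented split records. Everything below (feature names, gains, record counts) is hypothetical, and the cover computed here follows the summed, relative description in the paragraph above; xgboost's own 'cover' option reports the average per split, and 'total_cover' the sum.

```python
from collections import defaultdict

# Hypothetical split records: (feature, gain of the split, records covered by the split).
splits = [
    ("f0", 9.0, 100),
    ("f1", 2.0, 60),
    ("f1", 1.0, 40),
    ("f1", 1.5, 30),
    ("f2", 4.0, 50),
    ("f2", 2.0, 20),
]

count = defaultdict(int)
gain_sum = defaultdict(float)
cover_sum = defaultdict(float)
for feature, gain, cover in splits:
    count[feature] += 1
    gain_sum[feature] += gain
    cover_sum[feature] += cover

total_cover = sum(cover_sum.values())
weight = dict(count)                                        # number of splits
avg_gain = {f: gain_sum[f] / count[f] for f in count}       # mean gain per split
rel_cover = {f: cover_sum[f] / total_cover for f in count}  # share of covered records


def ranking(scores):
    """Feature names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


print("by weight:", ranking(weight))
print("by gain:  ", ranking(avg_gain))
print("by cover: ", ranking(rel_cover))
```

On these made-up records the three metrics produce three different orderings, which is the effect described in section 2.5.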

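Changing the metric changes the result. The lesson is that feature selection is partly a matter of judgment and should be read with caution, but it does offer a useful way to cross-check conclusions.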

3. Score-based feature selection

3.1 Basic principle

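Feature importance scores can be used for feature selection in scikit-learn.
This is done with the SelectFromModel class, which takes a model and can transform a dataset into a subset containing only the selected features.
The class can take a pre-trained model, such as one trained on the entire training set.
It then uses a threshold to decide which features to select. When transform() is called on the SelectFromModel instance, that threshold is applied so that the same features are selected consistently on the training set and the test set.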
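The thresholding step can be sketched in plain NumPy: features whose importance is at or above the threshold are kept, and the same column mask is applied to both matrices. The importances and the tiny matrices below are placeholders, not values from the article.

```python
import numpy as np

# Placeholder importances for four features, plus tiny train/test matrices.
importances = np.array([0.50, 0.05, 0.30, 0.15])
X_train = np.arange(12, dtype=float).reshape(3, 4)
X_test = np.arange(8, dtype=float).reshape(2, 4)

threshold = 0.15
mask = importances >= threshold  # features at or above the threshold are kept

X_train_sel = X_train[:, mask]
X_test_sel = X_test[:, mask]     # same mask, so the columns line up

print(mask)
print(X_train_sel.shape, X_test_sel.shape)
```

Reusing one mask for both sets is what keeps the trained model and the test data on the same feature layout.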

3.2 Worked example

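First, train an xgboost model on the training set and evaluate it on the test set.
Next, wrap the model in a SelectFromModel instance and use it, together with the feature importance scores, to select features.
Finally, train a model on the selected feature subset and evaluate it on the test set using the same subset.
Core code: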

# select features using a threshold
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train the model on the selected features
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
# evaluate the model on the same feature subset
select_X_test = selection.transform(X_test)
y_pred = selection_model.predict(select_X_test)

By testing several thresholds you obtain several feature subsets and can weigh performance against cost.
Full code:

# use feature importance for feature selection
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel

# load data
dataset = loadtxt('machine-1-1.csv', delimiter=",")

# split data into X and y
X = dataset[:,1:39]
Y = dataset[:,39]

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)

# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)

# make predictions for test data and evaluate
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:

    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

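Results:
Intuitively, as the threshold grows the number of features shrinks, and the model's accuracy should fall with it.
That is reasonable, so a trade-off is needed between model complexity (number of features) and accuracy.
In some cases (as in the run above), however, reducing the number of features actually raises accuracy, perhaps because the removed features were noise.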
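One simple way to act on this trade-off is to keep the smallest feature subset whose accuracy stays within a tolerance of the best one. The sketch below assumes rows shaped like the loop's printed output; the numbers and the tolerance are invented for illustration and are not results from the article.

```python
# Hypothetical (threshold, n_features, accuracy) rows, shaped like the loop's printed output.
# The numbers are invented for illustration and are not results from the article.
results = [
    (0.00, 8, 0.770),
    (0.05, 6, 0.775),
    (0.10, 4, 0.768),
    (0.20, 2, 0.720),
]

tolerance = 0.01  # assumed acceptable drop from the best accuracy
best_accuracy = max(acc for _, _, acc in results)

# Among rows within the tolerance, prefer the one with the fewest features.
candidates = [row for row in results if best_accuracy - row[2] <= tolerance]
chosen = min(candidates, key=lambda row: row[1])

print(chosen)
```

The tolerance encodes how much accuracy one is willing to give up for a simpler model, so it is a judgment call rather than a fixed rule.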

4. XGBoost for regression

Overall it is much the same as classification; only some details need adjusting.
XGBClassifier -> XGBRegressor; accuracy_score -> mean_squared_error

from numpy import loadtxt
from numpy import sort
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import SelectFromModel
import pandas as pd

# load data
dataset = pd.read_csv('diabetes.csv', header=0)
dataset = dataset.iloc[:,1:].values

# split data into X and y
X = dataset[:,:-2]
Y = dataset[:,-2]

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)

# fit model on all training data
model = XGBRegressor()
model.fit(X_train, y_train)

# make predictions for test data and evaluate (MSE: lower is better)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("MSE: %.2f" % mse)

# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    
    # train model
    selection_model = XGBRegressor()
    selection_model.fit(select_X_train, y_train)
    
    # eval model (MSE: lower is better)
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    mse = mean_squared_error(y_test, y_pred)
    print("Thresh=%.3f, n=%d, MSE: %.2f" % (thresh, select_X_train.shape[1], mse))

The feature importances can be inspected here as well:

from matplotlib import pyplot as plt
from xgboost import plot_importance

# feature importance
print(model.feature_importances_)

# plot
plt.bar(range(len(model.feature_importances_)), model.feature_importances_)
plt.show()

# plot feature importance
plot_importance(model,importance_type='gain')
plt.show()

5. Other topics

5.1 Hyperparameters

XGBoost has many hyperparameters that strongly affect model performance; tuning them is something of an art.

5.2 Grid search

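xgboost can be used for binary classification, multi-class classification, and regression. Besides the data features, hyperparameter tuning is a key factor in model performance; the usual approach is to search a grid for the best parameters in a fixed sequence of steps. The following two articles, one a classification case and one a numeric prediction case, give the tuning steps and code in detail:
Tuning the XGBClassifier classifier
Tuning the XGBRegressor regressor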

5.3 Random seed

random_state is accepted by many functions, for example when splitting data into training and test sets, building a decision tree, or building a random forest.
Fixing the random seed makes results reproducible.
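A small sketch of the effect, using scikit-learn's `train_test_split` on toy arrays (the arrays and seed values are arbitrary): the same `random_state` reproduces the same split, while a different seed generally does not.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same seed twice: the split is identical.
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=7)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=7)
print(np.array_equal(a_test, b_test))

# A different seed generally gives a different split.
c_train, c_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=8)
print(np.array_equal(a_test, c_test))
```

For fully repeatable experiments the seed also needs to be fixed in the model itself, for example through the `random_state` argument of XGBClassifier.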
