The variable names in this walkthrough are consistent throughout: once the training preparation below succeeds, every model in the later sections will run without errors.
1. Training Preparation (x_train, x_test, y_train, y_test)
1.1 Installing the package
Install scikit-learn, here via the Douban PyPI mirror:
pip3 install --index-url https://pypi.douban.com/simple scikit-learn
1.2 Data requirements
All columns must be numeric and free of missing values before training. Non-numeric columns need to be converted first, for example with one-hot encoding or label encoding.
This article demonstrates the operations on a pandas DataFrame; the same ideas apply to other data types.
1.2.1 Loading the data
import pandas as pd

df = pd.read_csv('data.csv')
df.head(5)  # inspect the first five rows
1.2.2 Inspecting and converting data types
1. Use df.info() to check the column types and the missing-value situation:
df.info()
2. Label encoding
sklearn's LabelEncoder class assigns a label to each distinct category of a categorical variable and converts it to an integer code:
from sklearn.preprocessing import LabelEncoder
Label_df[i] = LabelEncoder().fit_transform(Label_df[i])  # Label_df: your DataFrame, i: a categorical column name
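A minimal self-contained sketch of the loop implied above (the name Label_df and the object-column selection are illustrative, assuming df from Section 1.2.1):
from sklearn.preprocessing import LabelEncoder

Label_df = df.copy()
for i in Label_df.select_dtypes(include='object').columns:
    # replace each string category with an integer code, column by column
    Label_df[i] = LabelEncoder().fit_transform(Label_df[i])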
3. One-hot encoding
pd.get_dummies is the pandas function for one-hot encoding. It converts a categorical variable into one new binary feature per category, each indicating whether that category is present in the original column. This is very useful when feeding categorical data to machine learning models.
For example, suppose a categorical feature "color" has the three categories red, blue, and green. pd.get_dummies turns it into three new features "color_red", "color_blue", and "color_green", each taking the value 0 or 1 to indicate whether the row had that color.
df_one_hot = pd.get_dummies(df, columns=['color'])
df_one_hot = df_one_hot.replace({False: 0, True: 1})  # map the boolean dummies to 0/1 (note the assignment)
4. Handling missing values
Dropping rows:
# drop rows whose '身份證號' (ID number) column is missing
df.dropna(subset=['身份證號'], inplace=True)
# drop rows containing NaN
df.dropna(axis=0, inplace=True)
# drop a row only if all of its values are missing
df.dropna(axis=0, how="all", inplace=True)
# drop a row if any of its values is missing
df.dropna(axis=0, how='any', inplace=True)
Filling:
# fill missing values
df.fillna(method='pad', inplace=True)    # fill with the previous row's value (forward fill)
df.fillna(method='bfill', inplace=True)  # fill with the next row's value (backward fill)
df['cname'] = df['cname'].fillna(df['cname'].mean())  # fill a column ('cname') with its own mean
# note: method='pad'/'bfill' is deprecated in pandas >= 2.1; prefer df.ffill() / df.bfill()
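Alternatively, a small sketch using scikit-learn's SimpleImputer for mean imputation (assumes the imputed columns are numeric):
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')            # replace each NaN with its column's mean
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = imputer.fit_transform(df[num_cols])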
5. A detection function (my own quick, convenient helper)
The detection function takes a DataFrame, loops over every column, automatically detects missing values and object-typed columns, and applies the default operations below:
df.fillna(method='pad', inplace=True)    # fill with the previous row's value
df.fillna(method='bfill', inplace=True)  # fill with the next row's value
plus one-hot encoding:
df_one_hot = pd.get_dummies(df, columns=['color'])
and returns the processed DataFrame:
def process_dataframe(df):
    df.fillna(method='pad', inplace=True)    # fill with the previous row's value
    df.fillna(method='bfill', inplace=True)  # fill with the next row's value
    df_one_hot = df.copy()
    for i in df.columns:
        if df[i].dtype == object:
            # one-hot encode each object column; encoding df_one_hot (not df)
            # keeps the columns already encoded in earlier iterations
            df_one_hot = pd.get_dummies(df_one_hot, columns=[i])
    return df_one_hot
For more DataFrame operations, see this modest collection of small tricks I put together:
http://t.yssmx.com/iRbFj

1.2.3 Splitting the data
from sklearn.model_selection import train_test_split

x_data = df.iloc[:, 0:-1]
y_data = df.iloc[:, -1]
# split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3, random_state=42)
2. Regression
2.1 Linear Regression
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression
Training and a simple prediction:
from sklearn.linear_model import LinearRegression
from sklearn import metrics
# load and train the model
Linear_R = LinearRegression()
Linear_R.fit(x_train, y_train)
# predict
y_pred = Linear_R.predict(x_test)
# evaluate
MAE_lr = metrics.mean_absolute_error(y_test, y_pred)
MSE_lr = metrics.mean_squared_error(y_test, y_pred)
RMSE_lr = metrics.mean_squared_error(y_test, y_pred, squared=False)  # squared=False returns RMSE; scikit-learn >= 1.4 also offers metrics.root_mean_squared_error
R2_Score_lr = metrics.r2_score(y_test, y_pred)
print("LinearRegression 評估")
print("MAE: ", MAE_lr)
print("MSE: ", MSE_lr)
print("RMSE: ", RMSE_lr)
print("R2 Score: ", R2_Score_lr)
2.2 Random Forest Regression
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
# load and train the model
RandomForest_R = RandomForestRegressor()
RandomForest_R.fit(x_train, y_train)
# predict
y_pred = RandomForest_R.predict(x_test)
# evaluate
MAE_Forest= metrics.mean_absolute_error(y_test, y_pred)
MSE_Forest = metrics.mean_squared_error(y_test, y_pred)
RMSE_Forest = metrics.mean_squared_error(y_test, y_pred, squared=False)
R2_Score_Forest = metrics.r2_score(y_test, y_pred)
print("LinearRegression 評估")
print("MAE: ", MAE_Forest)
print("MSE: ", MSE_Forest)
print("RMSE: ", RMSE_Forest)
print("R2 Score: ", R2_Score_Forest)
Model tuning
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
# a parameter grid defining the hyperparameters to tune and their candidate values
param_grid = {
    'n_estimators': [100, 200, 300],   # number of trees
    'max_depth': [None, 5, 10, 15],    # maximum tree depth
    'min_samples_split': [2, 5, 10],   # minimum samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]      # minimum samples required at a leaf node
}
# create a random forest regressor
rf = RandomForestRegressor()
# search the parameter space with RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_grid, n_iter=100, cv=5, verbose=2, random_state=42, n_jobs=-1)
# fit and search for the best parameter combination
random_search.fit(x_train, y_train)
# print the best parameter combination and its score
print("Best Parameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)
# predict with the model refit on the best parameters
best_model = random_search.best_estimator_
y_pred = best_model.predict(x_test)
# evaluate model performance
MAE_Forest = metrics.mean_absolute_error(y_test, y_pred)
MSE_Forest = metrics.mean_squared_error(y_test, y_pred)
RMSE_Forest = metrics.mean_squared_error(y_test, y_pred, squared=False)
R2_Score_Forest = metrics.r2_score(y_test, y_pred)
print("\nRandom Forest Regression Evaluation with Best Parameters:")
print("MAE: ", MAE_Forest)
print("MSE: ", MSE_Forest)
print("RMSE: ", RMSE_Forest)
print("R2 Score: ", R2_Score_Forest)
2.3 GradientBoostingRegressor (gradient-boosted tree regression)
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
Gradient boosting is an ensemble learning method: it builds many weak predictors (usually decision trees) and combines them into a single strong predictor. The trees are trained iteratively, each round fitting the residuals left by the previous rounds, so the model improves step by step in the manner of gradient descent to minimize the loss function.
In each boosting round, a new weak model is fitted to correct the errors of the current ensemble: the negative gradient of the loss (the residuals) is computed, and the new weak model is fitted to it so that the ensemble's residual error shrinks. The accumulated weak models form a strong model usable for both regression and classification. In scikit-learn, GradientBoostingRegressor is the gradient-boosted-tree regressor; hyperparameters such as the number of trees, the tree depth, and the learning rate control its complexity and generalization. Gradient-boosted trees perform well on many kinds of data and are a common choice for regression problems.
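To see this round-by-round improvement concretely, here is a minimal sketch (reusing the x_train/x_test split from Section 1) that uses staged_predict to watch the test RMSE fall as trees are added:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import metrics

gbr = GradientBoostingRegressor(n_estimators=100)
gbr.fit(x_train, y_train)
# staged_predict yields the ensemble's prediction after each boosting round
for n, y_stage in enumerate(gbr.staged_predict(x_test), start=1):
    if n % 20 == 0:
        rmse = metrics.mean_squared_error(y_test, y_stage, squared=False)
        print(f"trees={n}  test RMSE={rmse:.4f}")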
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import metrics
# load and train the model
GradientBoosting_R = GradientBoostingRegressor()
GradientBoosting_R.fit(x_train, y_train)
# predict
y_pred = GradientBoosting_R.predict(x_test)
# evaluate
MAE_GradientBoosting= metrics.mean_absolute_error(y_test, y_pred)
MSE_GradientBoosting = metrics.mean_squared_error(y_test, y_pred)
RMSE_GradientBoosting = metrics.mean_squared_error(y_test, y_pred, squared=False)
R2_Score_GradientBoosting = metrics.r2_score(y_test, y_pred)
print("GradientBoostingRegressor 評估")
print("MAE: ", MAE_GradientBoosting)
print("MSE: ", MSE_GradientBoosting)
print("RMSE: ", RMSE_GradientBoosting)
print("R2 Score: ", R2_Score_GradientBoosting)
2.4 Lasso Regression
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso
Lasso regression (Least Absolute Shrinkage and Selection Operator) is a linear regression method that uses L1 regularization to constrain the size of the model coefficients, and it tends to produce sparse models. Unlike ordinary least squares, Lasso's optimization objective includes not only the data-fitting term but also a penalty on the coefficients.
The Lasso objective is the ordinary least-squares loss plus an L1-norm penalty.
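In scikit-learn's parameterization (per the Lasso documentation linked above), the objective minimized is

$$\min_{w}\ \frac{1}{2 n_{\text{samples}}}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1$$

where α is the constructor's alpha parameter (default 1.0), controlling the penalty strength.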
from sklearn.linear_model import Lasso
from sklearn import metrics
# load and train the model
Lasso_R = Lasso()
Lasso_R.fit(x_train, y_train)
# predict
y_pred = Lasso_R.predict(x_test)
# evaluate
MAE_Lasso= metrics.mean_absolute_error(y_test, y_pred)
MSE_Lasso = metrics.mean_squared_error(y_test, y_pred)
RMSE_Lasso = metrics.mean_squared_error(y_test, y_pred, squared=False)
R2_Score_Lasso = metrics.r2_score(y_test, y_pred)
print("Lasso 評估")
print("MAE: ", MAE_Lasso)
print("MSE: ", MSE_Lasso)
print("RMSE: ", RMSE_Lasso)
print("R2 Score: ", R2_Score_Lasso)
2.5 Ridge Regression
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge
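Ridge regression is ordinary least squares with an L2 penalty on the coefficients, which shrinks them toward zero and mitigates multicollinearity; per the documentation linked above, scikit-learn's Ridge minimizes

$$\min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_2^2$$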
from sklearn.linear_model import Ridge
from sklearn import metrics
# load and train the model
Ridge_R = Ridge()
Ridge_R.fit(x_train, y_train)
# predict
y_pred = Ridge_R.predict(x_test)
# evaluate
MAE_Ridge= metrics.mean_absolute_error(y_test, y_pred)
MSE_Ridge = metrics.mean_squared_error(y_test, y_pred)
RMSE_Ridge = metrics.mean_squared_error(y_test, y_pred, squared=False)
R2_Score_Ridge = metrics.r2_score(y_test, y_pred)
print("RidgeCV 評估")
print("MAE: ", MAE_Ridge)
print("MSE: ", MSE_Ridge)
print("RMSE: ", RMSE_Ridge)
print("R2 Score: ", R2_Score_Ridge)
2.6 Elastic Net Regression
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html
Elastic Net regression is a linear regression model that combines ridge regression and Lasso regression. By mixing the L1 and L2 regularization penalties it overcomes the limitations of each method alone and can achieve better predictive performance. Ridge regression uses L2 regularization, adding a penalty term to the loss function that limits the size of the coefficients and prevents overfitting; Lasso uses L1 regularization and tends to produce sparse models, shrinking the coefficients of uninformative features to zero.
Elastic Net combines the advantages of both: it can produce a sparse model while also reducing the impact of multicollinearity. Its loss function consists of a data-fitting term plus a regularization term that is a linear combination of the L1 and L2 norms. Elastic Net is especially useful when the feature dimension is high and the features are correlated; it serves both feature selection and regression, and suits the complex problems found in real-world datasets.
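Per the scikit-learn documentation linked above, with ρ denoting the l1_ratio parameter, the objective minimized is

$$\min_{w}\ \frac{1}{2 n_{\text{samples}}}\,\lVert y - Xw \rVert_2^2 + \alpha\rho\,\lVert w \rVert_1 + \frac{\alpha(1-\rho)}{2}\,\lVert w \rVert_2^2$$

so ρ = 1 recovers the Lasso penalty and ρ = 0 a ridge-style penalty.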
from sklearn.linear_model import ElasticNet
from sklearn import metrics
# fit the model on the training data
elastic_net = ElasticNet()
elastic_net.fit(x_train, y_train)
# predict
y_pred = elastic_net.predict(x_test)
# evaluate
MAE_ElasticNet= metrics.mean_absolute_error(y_test, y_pred)
MSE_ElasticNet = metrics.mean_squared_error(y_test, y_pred)
RMSE_ElasticNet = metrics.mean_squared_error(y_test, y_pred, squared=False)
R2_Score_ElasticNet = metrics.r2_score(y_test, y_pred)
print("ElasticNet 評估")
print("MAE: ", MAE_ElasticNet)
print("MSE: ", MSE_ElasticNet)
print("RMSE: ", RMSE_ElasticNet)
print("R2 Score: ", R2_Score_ElasticNet)
2.7 DecisionTreeRegressor (decision tree regression)
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics
decision_tree = DecisionTreeRegressor()
decision_tree.fit(x_train, y_train)
y_pred = decision_tree.predict(x_test)
# evaluate
MAE_decision_tree= metrics.mean_absolute_error(y_test, y_pred)
MSE_decision_tree = metrics.mean_squared_error(y_test, y_pred)
RMSE_decision_tree = metrics.mean_squared_error(y_test, y_pred, squared=False)
R2_Score_decision_tree = metrics.r2_score(y_test, y_pred)
print("DecisionTreeRegressor 評估")
print("MAE: ", MAE_decision_tree)
print("MSE: ", MSE_decision_tree)
print("RMSE: ", RMSE_decision_tree)
print("R2 Score: ", R2_Score_decision_tree)
Automated training and evaluation of all the models above
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
modellist = [LinearRegression, RandomForestRegressor, GradientBoostingRegressor, Lasso, Ridge, ElasticNet, DecisionTreeRegressor]
namelist = ['LinearRegression','RandomForest','GradientBoosting','Lasso','Ridge','ElasticNet','DecisionTree']
RMSE = []
R2_Score = []
for i in range(len(modellist)):
    mymodel = modellist[i]
    tr_model = mymodel()
    tr_model.fit(x_train, y_train)
    # training-set metrics (printed for reference)
    y_pred = tr_model.predict(x_train)
    print(f'{namelist[i]} model evaluation \n MAE:{mean_absolute_error(y_train, y_pred)} MSE:{mean_squared_error(y_train, y_pred)} RMSE:{mean_squared_error(y_train, y_pred, squared=False)} R2 Score:{r2_score(y_train, y_pred)}')
    # test-set metrics, collected for the summary table
    y_pred = tr_model.predict(x_test)
    RMSE.append(mean_squared_error(y_test, y_pred, squared=False))
    R2_Score.append(r2_score(y_test, y_pred))
data_show = pd.concat([pd.DataFrame(RMSE),pd.DataFrame(R2_Score),pd.DataFrame(namelist)],axis=1)
data_show.columns = ['RMSE','R2_Score','model']
data_show
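A handy follow-up: sort the summary so the model with the lowest test RMSE comes first.
data_show.sort_values('RMSE')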
3. Classification
... to be continued