1 Scenario Analysis
1.1 Project Background
Describe the context and factors surrounding the model being developed, including the problem, requirements, opportunity, market environment, and competitive landscape.
1.2 Problems to Solve
Traditional machine learning applied to real-world problems falls into two broad categories:
- Supervised learning: learning from known input-output relationships to produce a model that yields an appropriate output for a given input. These algorithms are widely used in image classification, speech recognition, natural language processing, recommender systems, and related fields
- Unsupervised learning: learning from inputs with no output labels, training a model that uncovers latent features and patterns in the data. These algorithms are widely used in data mining, image processing, natural language processing, and related fields
The goals of traditional machine learning fall into two categories:
- Analyzing the main factors that influence an outcome
- Predicting the outcome when sufficient and necessary conditions are met
In real development, traditional machine learning algorithms fall into two categories:
- Tree-based algorithms
- Non-tree-based algorithms
2 Overall Data Profile
2.1 Data Loading
The three musketeers of data analysis: numpy, pandas, matplotlib
# Import the required packages
import os
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', None)
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import seaborn as sns
import plotly.express as px
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
1. Load data with pandas: pd.read_csv(). Training data is usually loaded from a CSV file; reading returns a DataFrame, and df.head() shows the first 5 rows.
# Load the data
df = pd.read_csv('./xxx.csv')
df.head()
2. View the overall data information
df.info()
3. View descriptive statistics
# Count, mean, standard deviation, min/max, and 25%/50%/75% quantiles
df.describe().T
4. Count missing values
df.isnull().sum()
5. View the data shape
df.shape
6. View the data types
df.dtypes
2.2 Is the Sample Balanced?
What to do when positive and negative samples are imbalanced?
- Shrink the large class: downsampling
- Grow the small class: upsampling
- In practice, upsampling is more common: duplicate real samples to add redundancy (see the sketch below)
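A minimal sketch of both directions with sklearn.utils.resample, assuming df has a binary label column named 'label' (the column name is hypothetical):
from sklearn.utils import resample
# Check the class balance first
print(df['label'].value_counts())
majority = df[df['label'] == 0]
minority = df[df['label'] == 1]
# Upsampling: duplicate minority rows (with replacement) until they match the majority
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
df_upsampled = pd.concat([majority, minority_up])
# Downsampling: keep only a random subset of the majority
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
df_downsampled = pd.concat([majority_down, minority])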
2.3 Data Analysis
The following is a worked example:
2.3.1 Univariate Analysis
- Plot histograms
fig = px.histogram(df, x='COLUMN_NAME', hover_data=df.columns, title='XXX distribution', barmode='group')
fig.show()
fig = px.histogram(df, x='TPC_LIP', color='TPC_LIP', hover_data=df.columns, title='Ladle lid distribution', barmode='group')
fig.show()
- Plot distribution plots
hv.Distribution(np.round(df['COLUMN_NAME'])).opts(title='Title', color='green', xlabel='x label', ylabel='y label')\
.opts(opts.Distribution(width=1000, height=600, tools=['hover'], show_grid=True))
hv.Distribution(df['BF_IRON_DUR']).opts(title='XXX duration', color='red', xlabel='Duration (seconds)', ylabel='Density')\
.opts(opts.Distribution(width=1000, height=600, tools=['hover'], show_grid=True))
2.3.2 Multivariate Analysis
- Plot bar charts
temp_agg = df.groupby('OUTER_TEMPERATURE').agg({'TEMPERATURE': ['min', 'max']})
temp_maxmin = pd.merge(temp_agg['TEMPERATURE']['max'],temp_agg['TEMPERATURE']['min'],right_index=True,left_index=True)
temp_maxmin = pd.melt(temp_maxmin.reset_index(), ['OUTER_TEMPERATURE']).rename(columns={'OUTER_TEMPERATURE':'OUTER_TEMPERATURE', 'variable':'Max/Min'})
hv.Bars(temp_maxmin, ['OUTER_TEMPERATURE', 'Max/Min'], 'value').opts(title="Temperature by OUTER_TEMPERATURE Max/Min", ylabel="TEMPERATURE")\
.opts(opts.Bars(width=1000, height=700,tools=['hover'],show_grid=True))
- Examine feature skewness and kernel density estimates (KDE)
plt.figure(figsize=(15, 10))
for i, col in enumerate(df.columns, 1):  # assumes at most 15 columns (5x3 grid)
    plt.subplot(5, 3, i)
    plt.title(f"Distribution of {col} Data")
    sns.histplot(df[col], kde=True)
plt.tight_layout()
plt.show()
- Plot line charts
iron_temp = df['IRON_TEMPERATURE'].iloc[:300]
temp = df['TEMPERATURE'].iloc[:300]
(hv.Curve(iron_temp, label='XXX') * hv.Curve(temp, label='XXX')).opts(title="XXXX temperature comparison", ylabel="IRON_TEMPERATURE", xlabel='TEMPERATURE')\
.opts(opts.Curve(width=1500, height=500,tools=['hover'], show_grid=True))
3 Data Processing
3.1 Data Cleaning
3.1.1 Outliers
Use a box plot to find outliers, which can then be filtered out; a numeric filter using the same quartiles is sketched after the plot. The box plot's five-number summary:
- Minimum
- First quartile (Q1, 25th percentile)
- Median
- Third quartile (Q3, 75th percentile)
- Maximum
Examples:
- XXX outlier 1
- XXX outlier 2
fig = px.box(df, y='XXX', title='XXXXX')
fig.show()
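As a sketch, the quartile logic the box plot visualizes can also filter outliers numerically (the column name 'XXX' is a placeholder):
# IQR rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
q1 = df['XXX'].quantile(0.25)
q3 = df['XXX'].quantile(0.75)
iqr = q3 - q1
df_clean = df[df['XXX'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]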
3.1.2 Handling Missing Data
If the dataset is large, simply find and drop the rows or columns containing missing values; otherwise, treasure the samples you have.
Either of the following two methods can fill in the gaps:
- Random-forest imputation
# Import the random forest model
from sklearn.ensemble import RandomForestRegressor
# known_X, known_y: rows where the feature being imputed is present; unKnown_X: rows where it is missing
rfr = RandomForestRegressor(random_state=None, n_estimators=500, n_jobs=-1)
# Train the model on the known inputs and outputs
rfr.fit(known_X, known_y)
# Print the model score
score = rfr.score(known_X, known_y)
print('model score', score)
# Predict the missing feature values and fill them in
unknown_predict = rfr.predict(unKnown_X)
- Simple imputation
# Import SimpleImputer
from sklearn.impute import SimpleImputer
# Fill the missing values of a column with its mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit and transform in one step
imputer = imputer.fit_transform(df[['TEMPERATURE']])
# Show the result
imputer
3.2 Feature Engineering
Feature derivation, selection, scaling, distribution, and importance.
- Feature derivation: feature transformation and feature combination (see the sketch after this list)
  - Feature transformation: transforming a single feature by itself, e.g. taking the absolute value or applying a power function
  - Feature combination: combining several features, e.g. arithmetic operations, cross combinations, or group-by statistics
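A minimal sketch of both ideas, assuming two hypothetical numeric columns a and b:
# Feature transformation: single-column transforms
df['a_abs'] = df['a'].abs()
df['a_squared'] = df['a'] ** 2
df['a_log'] = np.log1p(df['a'].clip(lower=0))  # log transform, guarded against negative values
# Feature combination: multi-column transforms
df['a_plus_b'] = df['a'] + df['b']
df['a_over_b'] = df['a'] / (df['b'] + 1e-9)  # small epsilon avoids division by zero
# Group-by statistics: each row receives its group's mean
df['a_mean_by_b'] = df.groupby('b')['a'].transform('mean')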
3.2.1 Feature Selection
Use the corr correlation matrix to remove strongly correlated, redundant features; this matters when analyzing feature weights (a drop-by-threshold sketch follows the heatmap).
# Light colors indicate positive correlation, dark colors negative
plt.figure(figsize=(16, 16))
sns.heatmap(df.corr(), cmap='BrBG', annot=True, linewidths=.5)
_ = plt.xticks(rotation=45)
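A common follow-up, sketched here with an assumed threshold of 0.9, is to drop one feature from every highly correlated pair:
# Upper triangle of the absolute correlation matrix (ignores self-correlation)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
# Drop any column that correlates above the threshold with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_selected = df.drop(columns=to_drop)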
3.2.2 Feature Scaling
- Affected by feature scaling: distance-based algorithms such as KNN, K-means, and SVM
- Not affected by feature scaling: tree-based algorithms
Scaling methods:
- Normalization (min-max): maps values to the 0-1 range using the minimum and maximum; suits non-Gaussian distributions and algorithms such as K-Nearest Neighbors and neural networks
- Standardization: suits Gaussian distributions, though the data need not be Gaussian; rescales to mean 0 and standard deviation 1, and is less sensitive to outliers than min-max scaling
- Robust Scaler: subtracts the lower quartile (Q1) from each data point and divides by the interquartile range (Q3 - Q1), so outliers have little effect
# data
x = pd.DataFrame({
# Distribution with lower outliers
'x1': np.concatenate([np.random.normal(20, 2, 1000), np.random.normal(1, 2, 25)]),
# Distribution with higher outliers
'x2': np.concatenate([np.random.normal(30, 2, 1000), np.random.normal(50, 2, 25)]),
})
scaler = preprocessing.RobustScaler()
robust_df = scaler.fit_transform(x)
robust_df = pd.DataFrame(robust_df, columns =['x1', 'x2'])
scaler = preprocessing.StandardScaler()
standard_df = scaler.fit_transform(x)
standard_df = pd.DataFrame(standard_df, columns =['x1', 'x2'])
scaler = preprocessing.MinMaxScaler()
minmax_df = scaler.fit_transform(x)
minmax_df = pd.DataFrame(minmax_df, columns =['x1', 'x2'])
fig, (ax1, ax2, ax3, ax4) = plt.subplots(ncols = 4, figsize =(20, 5))
ax1.set_title('Before Scaling')
sns.kdeplot(x['x1'], ax = ax1, color ='r')
sns.kdeplot(x['x2'], ax = ax1, color ='b')
ax2.set_title('After Robust Scaling')
sns.kdeplot(robust_df['x1'], ax = ax2, color ='red')
sns.kdeplot(robust_df['x2'], ax = ax2, color ='blue')
ax3.set_title('After Standard Scaling')
sns.kdeplot(standard_df['x1'], ax = ax3, color ='black')
sns.kdeplot(standard_df['x2'], ax = ax3, color ='g')
ax4.set_title('After Min-Max Scaling')
sns.kdeplot(minmax_df['x1'], ax = ax4, color ='black')
sns.kdeplot(minmax_df['x2'], ax = ax4, color ='g')
plt.show()
3.2.3 Handling Categorical Features
- Best approach for non-tree-based algorithms: one-hot encoding
# One-hot encoding
feature_col_nontree = ['TPC_AGE','TPC_LID','BF_START_WAITING', 'BF_IRON_DUR', 'BF_END_WAITING', 'BF_RAIL_DUR', 'RAIL_STEEL_DUR',
'EMPTY_START_WAITING', 'EMPTY_DUR', 'EMPTY_END_WAITING', 'STEEL_RAIL_DUR', 'RAIL_BF_DUR','TOTAL_TIME','OUTER_TEMPERATURE']
fullSel = pd.get_dummies(df[feature_col_nontree])  # encode the selected columns, not the list of names
- Best approach for tree-based algorithms: label encoding
df_tree = df.apply(LabelEncoder().fit_transform)
df_tree.head()
3.2.4 Feature Importance
Note: analyzing feature importance is only meaningful when the features are neither redundant nor split apart.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X, y)
clf.feature_importances_
plt.rcParams['figure.figsize'] = (12, 6)
plt.style.use('fivethirtyeight')
feature = list(X.columns)
importances = clf.feature_importances_
feat_name = np.array(feature)
index = np.argsort(importances)[::-1]
plt.bar(range(len(index)), importances[index], color='lightblue')
plt.step(range(len(index)), np.cumsum(importances[index]))
_ = plt.xticks(range(len(index)), labels=feat_name[index], rotation='vertical', fontsize=14)
4 Model Building
4.1 Data Splitting
80% of the data for training, 20% for testing.
The 80% training portion is then split again: 80% for training, 20% for validation.
from sklearn.model_selection import train_test_split
X = df.drop('TEMPERATURE', axis=1)
y = df['TEMPERATURE']
X_train_all, X_test, y_train_all, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_all, y_train_all, test_size=0.2)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
print(X_valid.shape, y_valid.shape)
4.2 Algorithm Selection
Non-tree-based algorithms:
- LinearRegression
- LogisticRegression
- Naive Bayes
- SVM
- KNN
- K-Means
Tree-based algorithms:
- Decision Trees
- Extra Trees
- Random Forest
- XGBoost
- GBM
- LightGBM
4.3 Cross-Validation
- k-fold cross-validation: split the data into k disjoint subsets; in each round one subset is the test set and the rest form the training set, repeated k times
- stratified k-fold cross-validation: used when the sample distribution is imbalanced
A usage sketch of both splitters follows.
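The sketch assumes the feature DataFrame X from section 4.1 and, for the stratified variant, categorical labels y:
from sklearn.model_selection import KFold, StratifiedKFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
# StratifiedKFold preserves the class ratio in every fold (classification targets only)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]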
4.4 Algorithm Comparison and Selection
# Import the candidate models, using regression as the example
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
# Set up k-fold splitting of the dataset (plain KFold: the target here is continuous, so StratifiedKFold would fail)
kfold = KFold(n_splits=10)
# Collect the candidate models
regressors = []
regressors.append(SVR())
regressors.append(DecisionTreeRegressor())
regressors.append(RandomForestRegressor())
regressors.append(ExtraTreesRegressor())
regressors.append(GradientBoostingRegressor())
regressors.append(KNeighborsRegressor())
regressors.append(LinearRegression())
regressors.append(LinearDiscriminantAnalysis())
regressors.append(XGBRegressor())
# Cross-validation results for each model
cv_results = []
for regressor in regressors:
    cv_results.append(cross_val_score(estimator=regressor, X=X_train, y=y_train,
                                      scoring='neg_mean_squared_error',
                                      cv=kfold, n_jobs=-1))
# Mean and standard deviation of each model's scores
cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())
# Summarize the results
cvResDf = pd.DataFrame({'cv_mean': cv_means,
                        'cv_std': cv_std,
                        'algorithm': ['SVR', 'DecisionTreeReg', 'RandomForestReg', 'ExtraTreesReg',
                                      'GradientBoostingReg', 'KNN', 'LR', 'LDA', 'XGB']})
cvResDf
# Scores are negative MSE, so bars closer to 0 are better
bar = sns.barplot(data=cvResDf.sort_values(by='cv_mean', ascending=False),
                  x='cv_mean', y='algorithm', **{'xerr': cv_std})
4.5 Deep Learning Comparison
TensorFlow
import keras
d_model = keras.models.Sequential()
d_model.add(keras.layers.Dense(units=256, activation='relu', input_shape=(X_train_scaler.shape[1:])))
d_model.add(keras.layers.Dense(units=128, activation='relu'))
d_model.add(keras.layers.Dense(units=1))
out_put_dir = './'
if not os.path.exists(out_put_dir):
    os.mkdir(out_put_dir)
out_put_file = os.path.join(out_put_dir, 'model.keras')
callbacks = [
keras.callbacks.TensorBoard(out_put_dir),
keras.callbacks.ModelCheckpoint(out_put_file, save_best_only=True, save_weights_only=True),
keras.callbacks.EarlyStopping(patience=5, min_delta=1e-3)
]
d_model.compile(optimizer='Adam', loss='mean_squared_error', metrics=['mse'])
history = d_model.fit(X_train_scaler, y_train, epochs=100, validation_data=(X_valid_scaler, y_valid), callbacks=callbacks)
PyTorch
import pandas as pd
import torch
from torch import nn
data = pd.read_csv('XXX.csv', header=None)
print(data.head())
X = data.iloc[:, :-1]
print(X.shape)
Y = data.iloc[:, -1]
Y.replace(-1, 0, inplace=True)  # map -1/1 labels to 0/1 for BCELoss
print(Y.value_counts())
X = torch.from_numpy(X.values).type(torch.FloatTensor)
Y = torch.from_numpy(Y.values.reshape(-1, 1)).type(torch.FloatTensor)
model = nn.Sequential(
    nn.Linear(15, 1),  # 15 input features, one output
    nn.Sigmoid()
)
print(model)
loss_fn = nn.BCELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.0001)
batch_size = 32
steps = X.shape[0] // batch_size
for epoch in range(1000):
    for batch in range(steps):
        start = batch * batch_size
        end = start + batch_size
        x = X[start:end]
        y = Y[start:end]
        y_pred = model(x)
        loss = loss_fn(y_pred, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
print(model.state_dict())
accuracy = ((model(X).data.numpy() > 0.5) == Y.numpy()).mean()
print('accuracy = ', accuracy)
5 Model Optimization
Take the models that performed relatively well and optimize them; after tuning and repeated application in practice, choose the best one.
5.1 Grid Search
- DecisionTreeRegressor model
# DecisionTreeRegressor model
GTR = DecisionTreeRegressor()
gb_param_grid = {
'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
'splitter': ['best', 'random'],
'max_depth': [4, 8],
'min_samples_leaf': [100,150],
'max_features': [0.3, 0.1]
}
modelgsGTR = GridSearchCV(GTR, param_grid=gb_param_grid, cv=kfold,
                          scoring="neg_mean_squared_error", n_jobs=-1, verbose=1)
modelgsGTR.fit(X_train, y_train)
modelgsGTR.best_score_
- XGBoost
import xgboost as xgb
params = {'objective': 'reg:squarederror',  # 'reg:linear' is deprecated in recent XGBoost releases
          'booster': 'gbtree',
          'eta': 0.03,
          'max_depth': 10,
          'subsample': 0.9,
          'colsample_bytree': 0.7,
          'seed': 10}  # the deprecated 'silent' flag is dropped; use 'verbosity' to control logging
num_boost_round = 6000
dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test, y_test)
evals = [(dtrain, 'train'), (dtest, 'validation')]
gbm = xgb.train(params,                     # model parameters
                dtrain,                     # training data
                num_boost_round,            # number of boosting rounds (trees)
                evals=evals,                # evaluation data
                early_stopping_rounds=100,  # stop when the validation score has not improved for n consecutive rounds
                verbose_eval=True)          # print a log line for every round
5.2 Regularization
Purpose:
- Restrains unbounded growth of the weights w, preventing overflow
- Shrinks the gap between training-set and test-set results, preventing overfitting
- Affects training-set performance to some degree
L2 makes all weights smaller; L1 shrinks the weights of the least important features, improving generalization and also acting as dimensionality reduction. L1 is widely used in practice. A Ridge/Lasso sketch follows.
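A minimal sketch with scikit-learn's penalized linear models, where alpha (values assumed here) controls the regularization strength:
from sklearn.linear_model import Ridge, Lasso
# L2 (Ridge): shrinks all coefficients toward zero
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
# L1 (Lasso): drives unimportant coefficients exactly to zero, yielding a sparse model
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
print('ridge coefs:', ridge.coef_)
print('lasso coefs:', lasso.coef_)  # many entries end up exactly 0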
6 Model Evaluation
- Accuracy: the proportion of all predictions the model gets right; easily distorted when positive and negative samples are imbalanced
- Precision: the proportion of predicted positives that are actually positive; sensitive to the chosen threshold. Favor precision when the flagged events must be exactly right (e.g. serving ads)
- Recall: the proportion of actual positives the model catches; sensitive to the chosen threshold. Favor recall when bad events must not slip through (e.g. screening articles for pornography, gambling, or drug content)
- F1 score: a weighted (harmonic) mean of precision and recall; its maximum is 1 and its minimum is 0
- ROC/AUC (Receiver Operating Characteristic curve, Area Under Curve): the ROC curve is traced out over all threshold points theta, and the larger the area under it, the better the classifier
from sklearn.metrics import confusion_matrix, roc_curve, auc

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
recalls = []     # recall at each threshold
precisions = []  # precision at each threshold
aucs = []        # area under the curve at each threshold
# grid_search: a fitted classifier exposing predict_proba (e.g. a GridSearchCV result)
y_pred_proba = grid_search.predict_proba(X_test)
for threshold in thresholds:
    y_ = y_pred_proba[:, 1] >= threshold
    cm = confusion_matrix(y_test, y_)
    # TP / (TP + FN)
    recalls.append(cm[1, 1] / (cm[1, 0] + cm[1, 1]))     # recall
    # TP / (TP + FP)
    precisions.append(cm[1, 1] / (cm[0, 1] + cm[1, 1]))  # precision
    fpr, tpr, _ = roc_curve(y_test, y_)
    auc_ = auc(fpr, tpr)
    aucs.append(auc_)
plt.figure(figsize=(12, 6))
plt.plot(thresholds, recalls, label='Recall')
plt.plot(thresholds, aucs, label='auc')
plt.plot(thresholds, precisions, label='precision')
plt.legend()
plt.xlabel('thresholds')
- Log loss: the training loss function (a short sketch follows)
  - Linear regression: MSE (mean squared error)
  - Logistic regression: cross-entropy
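A minimal sketch of computing both with sklearn.metrics, on small made-up arrays:
from sklearn.metrics import log_loss, mean_squared_error
# Regression loss: mean squared error
print('MSE:', mean_squared_error([3.0, -0.5, 2.0], [2.5, 0.0, 2.1]))
# Classification loss: cross-entropy (log loss) on predicted probabilities
print('log loss:', log_loss([0, 1, 1], [0.1, 0.8, 0.7]))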