Data Mining Starter Project: Used-Car Price Prediction (Modeling and Hyperparameter Tuning)

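The dataset comes from Alibaba Tianchi: https://tianchi.aliyun.com/competition/entrance/231784/information
This walkthrough mainly follows Datawhale's workflow: https://tianchi.aliyun.com/notebook/95460
This is my first time doing data mining, so I worked through the Datawhale tutorial step by step and added my own notes where something was unclear. Thanks, Datawhale!

Goal

Get to know the commonly used machine learning models, and learn the workflow for building and tuning them.

Steps

1. Adjust data types to reduce the memory footprint

The method is defined as follows:
Loop over the columns and cast each one to the smallest data type whose range covers its values, so that every column of the DataFrame takes as little memory as possible. Integer values are preserved exactly; float columns narrowed to float16 or float32 keep fewer significant digits.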

import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    # memory_usage() returns the bytes used by each column and sum() adds them up;
    # dividing by 1024**2 converts bytes to MB so the printed unit is correct.
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    # Loop over every column
    for col in df.columns:
        col_type = df[col].dtype # dtype of the column
        if col_type != object:
            # Minimum and maximum of the current column
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # np.int8 is NumPy's 8-bit integer type.
                # np.iinfo(np.int8) returns an object describing that type;
                # its .min and .max attributes are the smallest and largest representable values.
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category') # convert the column to the category dtype to save memory
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

Call the function to see the effect.
data_for_tree.csv holds the features that were lightly processed in the feature engineering step:

sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))
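
To see where the savings come from, here is a minimal sketch on a made-up DataFrame (the column names and values are hypothetical, not taken from data_for_tree.csv). It casts two columns by hand and compares the byte counts, and it also measures the rounding error that float16 introduces.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: small integers and moderate floats, both 8 bytes per value.
toy = pd.DataFrame({
    "kilometer": np.arange(1000, dtype=np.int64) % 16,  # values 0..15
    "v_0": np.linspace(0.0, 50.0, 1000),                 # float64
})

before = toy.memory_usage(index=False).sum()  # bytes used by the columns

small = toy.copy()
small["kilometer"] = small["kilometer"].astype(np.int8)   # 8 bytes -> 1 byte per value
small["v_0"] = small["v_0"].astype(np.float16)            # 8 bytes -> 2 bytes per value

after = small.memory_usage(index=False).sum()
print(f"before: {before} bytes, after: {after} bytes")

# Integer downcasting is exact here; float16 is not.
max_err = (toy["v_0"] - small["v_0"].astype(np.float64)).abs().max()
print(f"largest float16 rounding error: {max_err:.4f}")
```

If the rounding error matters for a feature, stopping at float32 for that column is a reasonable compromise.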

2. A simple baseline with linear regression

The features above were saved for tree models, so missing values were not handled at that stage. We do some simple cleaning here:

sample_feature.head()

The notRepairedDamage column contains the placeholder value '-'. Below, rows with missing values are dropped and '-' is replaced with 0:

sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)

Build the training features and the label:

train_X = sample_feature.drop('price',axis=1)
train_y = sample_feature['price']

Fit a simple model:

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model = model.fit(train_X, train_y)
print('intercept:' + str(model.intercept_)) # the model's intercept (constant term)
# Pair each feature name with its coefficient and sort by coefficient, largest first.
# train_X.columns is used rather than sample_feature.columns, because the latter still
# contains 'price' and would not line up one-to-one with model.coef_.
# zip(...) pairs the names with the coefficients, dict(...) turns the pairs into a dictionary,
# and sorted(..., key=lambda x:x[1], reverse=True) orders the items by value, descending.
sorted(dict(zip(train_X.columns, model.coef_)).items(), key=lambda x:x[1], reverse=True)

Plot the true values against the predictions to see the gap:

from matplotlib import pyplot as plt
subsample_index = np.random.randint(low=0, high=len(train_y), size=50) # indices of 50 randomly chosen training samples
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black') # true price against feature 'v_9'
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue') # predicted price against feature 'v_9'
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price is obviously different from the true price')
plt.show()

Plotting shows that the label (price) follows a long-tailed distribution, which makes it harder to model.
Let's look at the label more closely.
Plot its distribution: the left panel shows all labels, the right panel shows the labels after removing the largest 10%.

import seaborn as sns
print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1) # 1 row, 2 columns of subplots; make the first one current
# Histogram of the price with a density curve (histplot replaces the deprecated sns.distplot)
sns.histplot(train_y, kde=True, stat='density')
plt.subplot(1,2,2)
# np.quantile gives the 90th percentile of the price; boolean indexing keeps the prices below it
sns.histplot(train_y[train_y < np.quantile(train_y, 0.9)], kde=True, stat='density')

Apply a log(x+1) transform to the label to bring it closer to a normal distribution:

train_y_ln = np.log(train_y + 1)
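
A quick sketch of the transform and its inverse, on made-up prices (the numbers are hypothetical). The point is that predictions made in log space have to be mapped back with exp(x) - 1, and that NumPy ships `np.log1p` and `np.expm1` for exactly this pair.

```python
import numpy as np

# Hypothetical prices, including 0 to show why the "+ 1" is there: log(0) is undefined.
price = np.array([0.0, 99.0, 9999.0, 99999.0])

price_ln = np.log(price + 1)       # same transform as train_y_ln
price_ln_alt = np.log1p(price)     # NumPy's built-in equivalent

restored = np.exp(price_ln) - 1    # inverse transform
restored_alt = np.expm1(price_ln)  # NumPy's built-in equivalent

print(price_ln)
print(restored)
```

The transformed values span a far narrower range than the raw prices, which is what tames the long tail.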

Show the label distribution after the log transform:

import seaborn as sns
print('The transformed price seems like normal distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
# histplot replaces the deprecated sns.distplot
sns.histplot(train_y_ln, kde=True, stat='density')
plt.subplot(1,2,2)
sns.histplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)], kde=True, stat='density')

Now refit the model on the transformed label and plot again:

model = model.fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
print(sorted(dict(zip(train_X.columns, model.coef_)).items(), key=lambda x:x[1], reverse=True))
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
# The model now predicts log(price + 1), so invert with exp(.) - 1 to get back to prices
plt.scatter(train_X['v_9'][subsample_index], np.exp(model.predict(train_X.loc[subsample_index])) - 1, color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price seems normal after np.log transforming')
plt.show()

The predictions now track the true prices better than before.

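3. Five-fold cross-validation

When fitting a model, the available data is usually divided into three parts: a training set (train_set), a validation set (valid_set), and a test set (test_set). The split exists so that the evaluation of the trained model can be trusted.

The test set is the easy one to understand: it takes no part in training at all and is used only to measure the final performance.

In practice a model usually fits its own training data quite well, but fits data outside the training set noticeably worse. So we do not train on all of the data. We hold part of it out (it takes no part in training) and use it to check, relatively objectively, how well the fitted parameters generalize beyond the training set. This idea is called cross-validation (Cross Validation). In the k-fold version the training data is cut into k parts, each part takes one turn as the held-out set while the model is trained on the other k-1 parts, and the k scores are averaged.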
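
The mechanics can be sketched by hand in a few lines. This is only an illustration on synthetic data with a plain least-squares fit (both are assumptions made for the example); the sections below use scikit-learn's `cross_val_score`, which does the same bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic regression data: y = 2*x0 - x1 + noise
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)

k = 5
indices = rng.permutation(len(X))   # shuffle the row indices once
folds = np.array_split(indices, k)  # k disjoint blocks of indices

scores = []
for i in range(k):
    val_idx = folds[i]  # fold i is held out
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    # Ordinary least squares with an intercept column, fitted on the other k-1 folds
    A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
    coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)

    # Mean absolute error on the held-out fold
    A_val = np.column_stack([X[val_idx], np.ones(len(val_idx))])
    scores.append(np.mean(np.abs(A_val @ coef - y[val_idx])))

print("MAE per fold:", np.round(scores, 3))
print("mean MAE:", np.mean(scores))
```

Every row is used for validation exactly once, so the averaged score uses all of the data without ever scoring a model on rows it was trained on.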

(1) Five-fold cross-validation of the linear regression model on the untransformed label

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error,  make_scorer
# This function log-transforms both arguments before passing them to the metric
def log_transfer(func):
    def wrapper(y, yhat):
        # The log of a negative prediction is NaN; np.nan_to_num turns such NaN values into 0
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    # Return the inner function: a wrapper that log-transforms its inputs and then calls the original function
    return wrapper
# 5-fold cross-validation scores
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))
# model: the estimator to evaluate.
# train_X: the training features; train_y: the training target.
# verbose=1: print progress information.
# cv=5: use 5-fold cross-validation.
# scoring=make_scorer(log_transfer(mean_absolute_error)): the metric to report.
# make_scorer turns the custom function log_transfer(mean_absolute_error) into a scorer object that cross_val_score can call.
# log_transfer(mean_absolute_error) log-transforms the true and predicted values before they reach mean_absolute_error.
# mean_absolute_error is a common regression metric: the average absolute difference between predicted and true values.
print('AVG:', np.mean(scores))

(2) Five-fold cross-validation of the linear regression model on the log-transformed label:

scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))
print('AVG:', np.mean(scores))

With the log-transformed label, the error is clearly smaller.
The per-fold scores:

scores = pd.DataFrame(scores.reshape(1,-1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
scores

4. Simulating a real business setting

On time-dependent data, ordinary cross-validation can paint an unrealistic picture: we cannot use 2018 used-car prices to predict 2017 prices. In that case the dataset can be split in time order instead. Here we take the earliest 4/5 of the samples as the training set and the latest 1/5 as the validation set.
The code is as follows:

sample_feature = sample_feature.reset_index(drop=True)
split_point = len(sample_feature) // 5 * 4
# This assumes the rows are already in chronological order.
# iloc slices by position and excludes the end point, so no row ends up in both sets
# (label-based .loc[:split_point] would put row split_point into train as well as val).
train = sample_feature.iloc[:split_point].dropna()
val = sample_feature.iloc[split_point:].dropna()
train_X = train.drop('price',axis=1)
train_y_ln = np.log(train['price'] + 1)
val_X = val.drop('price',axis=1)
val_y_ln = np.log(val['price'] + 1)
model = model.fit(train_X, train_y_ln)
mean_absolute_error(val_y_ln, model.predict(val_X))
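
The split above only simulates "train on the past, validate on the future" if the rows are ordered by time. A minimal sketch of making that explicit follows; the `listing_date` column and its YYYYMMDD integer format are assumptions for the example, not something data_for_tree.csv is known to contain.

```python
import pandas as pd

# Hypothetical listings with a date column stored as YYYYMMDD integers.
listings = pd.DataFrame({
    "listing_date": [20160315, 20160101, 20160402, 20160220, 20160328],
    "price": [3500, 1200, 8000, 4100, 2600],
})

# Parse the dates, then order the rows from oldest to newest.
listings["listing_date"] = pd.to_datetime(
    listings["listing_date"].astype(str), format="%Y%m%d"
)
ordered = listings.sort_values("listing_date").reset_index(drop=True)

split_point = len(ordered) // 5 * 4
train = ordered.iloc[:split_point]  # earliest 4/5 of the rows
val = ordered.iloc[split_point:]    # latest 1/5 of the rows

print(train["listing_date"].max(), "<=", val["listing_date"].min())
```

With the data sorted first, every validation row is at least as recent as every training row.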

5. Plotting the learning curve and the validation curve

from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,n_jobs=1, train_size=np.linspace(.1, 1.0, 5 )):
    """
    estimator: the model to evaluate
    title: title of the figure
    X: feature data
    y: target data
    ylim: range of the y axis
    cv: cross-validation splitting strategy
    n_jobs: number of jobs to run in parallel
    train_size: training-set sizes to evaluate, as fractions of the data
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)  # set the y-axis range to ylim
    plt.xlabel('Training example')
    plt.ylabel('score')
    # learning_curve computes the training and cross-validation scores at each training size.
    # The score here is the MAE, so lower is better.
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring = make_scorer(mean_absolute_error))
    train_scores_mean = np.mean(train_scores, axis=1)  # mean training score
    train_scores_std = np.std(train_scores, axis=1)  # standard deviation of the training score
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()  # draw grid lines
    # Shade one standard deviation around the training score in red
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    # Shade one standard deviation around the cross-validation score in green
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1,
                     color="g")
    # Training score curve
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',
             label="Training score")
    # Cross-validation score curve
    plt.plot(train_sizes, test_scores_mean,'o-',color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
plot_learning_curve(LinearRegression(), 'Linear_model', train_X[:1000], train_y_ln[:1000], ylim=(0.0, 0.5), cv=5, n_jobs=1)

6. Embedded feature selection

In filter and wrapper methods, feature selection is clearly separated from training the learner. Embedded feature selection instead happens automatically while the learner is being trained. The most common embedded approaches are L1 and L2 regularization. Adding them to linear regression gives Lasso regression (L1) and ridge regression (L2), respectively.

Cross-validate the three models and compare the results:

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
# A list of model instances
models = [LinearRegression(),
          Ridge(),
          Lasso()]
result = dict()
for model in models:
    model_name = str(model)[:-2] # model name: the repr without its trailing "()"
    # Cross-validate the model
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    # Collect the per-fold scores of each model
    result[model_name] = scores
    print(model_name + ' is finished')
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result

Next, look at the coefficients each of the three models has learned.

Ordinary linear regression

import seaborn as sns
model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:', model.intercept_)
# Combine the absolute coefficients with the feature names
data = pd.DataFrame({'coef_abs': abs(model.coef_), 'feature': train_X.columns})
# Plot
sns.barplot(x='coef_abs', y='feature', data=data)

Ridge regression

During fitting, L2 regularization pushes the weights to be as small as possible, which produces a model in which all parameters are fairly small. Models with small parameters are generally considered simpler, adapt better to different datasets, and are to some extent less prone to overfitting. Consider a linear regression equation: if a coefficient is very large, a small shift in the data changes the result a lot; if the coefficients are small enough, even a larger shift has little effect on the result. The technical way to put it is that the model is robust to perturbation.

import seaborn as sns
model = Ridge().fit(train_X, train_y_ln)
print('intercept:', model.intercept_)
# Combine the absolute coefficients with the feature names
data = pd.DataFrame({'coef_abs': abs(model.coef_), 'feature': train_X.columns})
sns.barplot(x='coef_abs', y='feature', data=data)

Lasso regression
L1 regularization tends to produce a sparse weight vector, which can then be used for feature selection.

import seaborn as sns
model = Lasso().fit(train_X, train_y_ln)
print('intercept:', model.intercept_)
# Combine the absolute coefficients with the feature names
data = pd.DataFrame({'coef_abs': abs(model.coef_), 'feature': train_X.columns})
sns.barplot(x='coef_abs', y='feature', data=data)

Here we can see that features such as power and used_time are very important.
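
To make the difference between the two penalties concrete, here is a small sketch on synthetic data in which only two of ten features matter. The data and the `alpha` values are assumptions picked for illustration, not settings used elsewhere in this article.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Hypothetical synthetic data: 10 features, only the first two influence y.
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=0.5).fit(X, y)  # L1 penalty: can set coefficients exactly to zero

print("ridge:", np.round(ridge.coef_, 3))
print("lasso:", np.round(lasso.coef_, 3))
print("coefficients set exactly to zero by Lasso:", int(np.sum(lasso.coef_ == 0)))
```

Ridge keeps a small nonzero weight on the irrelevant features, while Lasso drops them outright, which is why Lasso doubles as a feature selector.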

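7. Nonlinear models

When a decision tree chooses its splits by information entropy or the Gini index (or, for regression trees, by the reduction in squared error), the features it picks first are also the more important ones, so this is a form of feature selection as well. The feature_importances_ attribute of XGBoost and LightGBM models is derived from the splits the trees make.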
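
As an illustration of split-based importance, the sketch below fits a scikit-learn random forest on synthetic data where a single feature drives the target. The data and feature names are hypothetical, and a random forest stands in for XGBoost or LightGBM only to keep the example dependency-light.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical synthetic data: only feature "f2" carries signal.
feature_names = ["f0", "f1", "f2", "f3"]
X = rng.normal(size=(300, 4))
y = 5.0 * X[:, 2] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ holds one non-negative value per feature; the values sum to 1.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The informative feature dominates the ranking because nearly every useful split is made on it.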

Below we compare a selection of models:

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor
models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100), 
          XGBRegressor(n_estimators = 100, objective='reg:squarederror'), 
          LGBMRegressor(n_estimators = 100)]
result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result

The random forest model achieves the best score in every fold.

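8. Hyperparameter tuning

Three tuning methods are introduced here.

(1) Greedy tuning

A greedy algorithm always makes the choice that looks best at the current step. It does not consider the problem as a whole, so what it finds is only a local optimum in some sense. Applied to tuning, this means optimizing one hyperparameter at a time, fixing it at its best value, and then moving on to the next.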
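
Stripped of any particular model, the procedure can be sketched as below. `toy_error` is a hypothetical stand-in for a cross-validated error, and the parameter names `a` and `b` are made up for the example.

```python
# Hypothetical objective standing in for a cross-validated error (lower is better).
def toy_error(params):
    return (params["a"] - 3) ** 2 + (params["b"] - 10) ** 2


def greedy_search(error_fn, search_space, defaults):
    """Tune one parameter at a time, keeping the earlier choices fixed."""
    best = dict(defaults)
    for name, candidates in search_space.items():
        scores = {}
        for value in candidates:
            trial = dict(best)
            trial[name] = value
            scores[value] = error_fn(trial)
        # Keep the candidate with the smallest error before moving to the next parameter.
        best[name] = min(scores, key=scores.get)
    return best, error_fn(best)


best_params, best_error = greedy_search(
    toy_error,
    search_space={"a": [1, 3, 5], "b": [5, 10, 20]},
    defaults={"a": 1, "b": 5},
)
print(best_params, best_error)  # {'a': 3, 'b': 10} 0
```

This run needs 6 evaluations where a full grid over the same values would need 9. The price is that interactions between parameters are ignored, so the result can be a local optimum.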

Taking the LightGBM model as an example:

## Candidate values for the LightGBM parameters:
# Loss function
objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']
# Number of leaves
num_leaves = [3,5,10,15,20,40, 55]
# Maximum depth
max_depth = [3,5,10,15,20,40, 55]
bagging_fraction = []
feature_fraction = []
drop_rate = []
best_obj = dict()
# Evaluate every candidate; the smallest score (MAE) marks the best choice
for obj in objective:
    model = LGBMRegressor(objective=obj)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_obj[obj] = score
    
best_leaves = dict()
for leaves in num_leaves:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0], num_leaves=leaves)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_leaves[leaves] = score
    
best_depth = dict()
for depth in max_depth:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0],
                          num_leaves=min(best_leaves.items(), key=lambda x:x[1])[0],
                          max_depth=depth)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_depth[depth] = score
# Plot how the loss changes after each tuning step
# (0.143 is a hard-coded starting value, presumably the MAE before tuning)
sns.lineplot(x=['0_initial','1_tuning_obj','2_tuning_leaves','3_tuning_depth'], y=[0.143 ,min(best_obj.values()), min(best_leaves.values()), min(best_depth.values())])

(2) Grid Search tuning

GridSearchCV is a tuning method to reach for when a model does not perform well enough. It loops over every combination of the candidate parameter values, evaluates each one with cross-validation, and returns the combination with the best score.
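
Trying every combination gets expensive quickly. The sketch below only enumerates the grid built from the three candidate lists used in the greedy example and counts the model fits; no model is trained.

```python
from itertools import product

# The same candidate lists as in the greedy example.
objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']
num_leaves = [3, 5, 10, 15, 20, 40, 55]
max_depth = [3, 5, 10, 15, 20, 40, 55]

# Every combination of the three lists, as GridSearchCV would try them.
grid = [
    {'objective': o, 'num_leaves': n, 'max_depth': d}
    for o, n, d in product(objective, num_leaves, max_depth)
]

cv = 5
grid_fits = len(grid) * cv
greedy_fits = (len(objective) + len(num_leaves) + len(max_depth)) * cv

print(len(grid), "combinations")              # 245 combinations
print(grid_fits, "cross-validation fits")     # 1225 cross-validation fits
print(greedy_fits, "fits for the greedy search over the same values")  # 95 fits
```

The grid costs roughly 13 times as many fits as the greedy search here, in exchange for actually checking how the parameters interact.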

from sklearn.model_selection import GridSearchCV
parameters = {'objective': objective , 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
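# No scoring argument is passed, so GridSearchCV ranks the combinations by the
# estimator's default score (R^2 for regressors), not by MAE.
clf = GridSearchCV(model, parameters, cv=5)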
clf = clf.fit(train_X, train_y_ln)
clf.best_params_

The best parameters found are: {'max_depth': 10, 'num_leaves': 55, 'objective': 'huber'}
Now train the model again with these parameters:

model = LGBMRegressor(objective='huber',
                          num_leaves=55,
                          max_depth=10)
np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))

The result is comparable to the greedy tuning above.

(3) Bayesian tuning

Bayesian optimization builds a surrogate function (a probabilistic model) from past evaluations of the objective function and uses it to find the value that minimizes the objective. Unlike random or grid search, it takes the earlier evaluation results into account when choosing the next set of hyperparameters to try, which saves a lot of wasted effort.
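
The loop behind this idea can be sketched in a few lines. This is an illustrative toy, not how the bayes_opt package is implemented: the 1-D objective is made up, the surrogate is scikit-learn's Gaussian process, and the next point is chosen with a simple upper-confidence-bound rule (an assumption made for the example).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


# Hypothetical 1-D objective to maximize; its true optimum is at x = 2.
def objective(x):
    return -(x - 2.0) ** 2


candidates = np.linspace(0.0, 5.0, 201).reshape(-1, 1)  # points we are allowed to try

# Start from three fixed evaluations.
X_seen = np.array([[0.5], [2.5], [4.5]])
y_seen = objective(X_seen).ravel()

for _ in range(10):
    # Surrogate: a Gaussian process fitted to everything evaluated so far.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6, normalize_y=True)
    gp.fit(X_seen, y_seen)
    mean, std = gp.predict(candidates, return_std=True)

    # Acquisition: upper confidence bound, favoring a high predicted value or high uncertainty.
    ucb = mean + 2.0 * std
    x_next = candidates[np.argmax(ucb)].reshape(1, 1)

    # Evaluate the real objective only at the chosen point and remember the result.
    X_seen = np.vstack([X_seen, x_next])
    y_seen = np.append(y_seen, objective(x_next).ravel())

best_x = X_seen[np.argmax(y_seen), 0]
print("best x found:", best_x, "objective:", y_seen.max())
```

Each iteration spends one expensive evaluation where the surrogate suggests it is most worthwhile, which is the sense in which earlier results guide later trials.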

from bayes_opt import BayesianOptimization
def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    # num_leaves: number of leaves per tree. Larger values make the model more complex but also more prone to overfitting.
    # max_depth: maximum depth of a tree. Limiting the depth limits model complexity and helps prevent overfitting.
    # subsample: fraction of the training data sampled for each iteration. It can speed up training and improve generalization.
    # min_child_samples: minimum number of samples required in a leaf. Limiting it controls tree growth and helps avoid overfitting.
    val = cross_val_score(
        LGBMRegressor(objective = 'regression_l1',
            num_leaves=int(num_leaves),
            max_depth=int(max_depth),
            subsample = subsample,
            min_child_samples = int(min_child_samples)
        ),
        X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)
    ).mean()
    # BayesianOptimization maximizes its target, so return 1 - MAE: a smaller error gives a larger target
    return 1 - val
    
rf_bo = BayesianOptimization(
    rf_cv,
    {
    'num_leaves': (2, 100),
    'max_depth': (2, 100),
    'subsample': (0.1, 1),
    'min_child_samples' : (2, 100)
    }
)
# Maximize the value returned by rf_cv (1 - MAE), which is the same as minimizing the MAE
rf_bo.maximize()

Result (the smallest MAE found, recovered from the best target):

1 - rf_bo.max['target']

Summary

In this article we improved prediction accuracy mainly through the log transform, regularization, model selection, and parameter tuning.
Some links for further study:
Linear regression: https://zhuanlan.zhihu.com/p/49480391
Decision trees: https://zhuanlan.zhihu.com/p/65304798
GBDT: https://zhuanlan.zhihu.com/p/45145899
XGBoost: https://zhuanlan.zhihu.com/p/86816771
LightGBM: https://zhuanlan.zhihu.com/p/89360721
Explaining overfitting in plain language: https://www.zhihu.com/question/32246256/answer/55320482
Model complexity and generalization ability: http://yangyingming.com/article/434/
An intuitive understanding of regularization: https://blog.csdn.net/jinping_shi/article/details/52433975
Greedy algorithms: https://www.jianshu.com/p/ab89df9759c8
Grid search tuning: https://blog.csdn.net/weixin_43172660/article/details/83032029
Bayesian tuning: https://blog.csdn.net/linxid/article/details/81189154
