Getting Started with Data Mining from Scratch - Used Car Price Prediction
Understanding the Competition
The competition asks participants to build a model from the given dataset and predict the transaction price of used cars.
The task is to predict used-car transaction prices. The dataset, which becomes visible and downloadable after registration, comes from the used-car transaction records of a trading platform: over 400,000 records in total with 31 columns of variables, 15 of which are anonymous. To keep the competition fair, 150,000 records are sampled as the training set, 50,000 as test set A, and 50,000 as test set B, and fields such as name, model, brand, and regionCode are anonymized.
Competition page: https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX
Data Description
The training set contains the following features:
- name - car trade code
- regDate - car registration date
- model - model code
- brand - brand
- bodyType - body type
- fuelType - fuel type
- gearbox - gearbox type
- power - engine power
- kilometer - kilometers driven
- notRepairedDamage - whether the car has unrepaired damage
- regionCode - code of the region where the car can be viewed
- seller - seller type
- offerType - offer type
- creatDate - ad posting date
- price - car price (target column)
- v_0, v_1, v_2, v_3, v_4, v_5, v_6, v_7, v_8, v_9, v_10, v_11, v_12, v_13, v_14 - embedding vectors derived from large amounts of information such as reviews and tags [hand-crafted anonymous features]
Evaluation Metric
The competition uses MAE (mean absolute error) as the evaluation metric.
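For reference, MAE is simply the average of the absolute differences between the predicted and true prices. A minimal illustration with made-up numbers (not competition data), using the sklearn implementation:
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([3000, 5200, 12000])      # hypothetical true prices
y_pred = np.array([2800, 5600, 11500])      # hypothetical predictions
print(mean_absolute_error(y_true, y_pred))  # (200 + 400 + 500) / 3 ≈ 366.67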
Implementation
Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import scipy.stats as st
import warnings
warnings.filterwarnings('ignore')
# Configure matplotlib so that Chinese characters display correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
Exploratory Data Analysis
First, read in the data:
train_data = pd.read_csv("used_car_train_20200313.csv", sep = " ")
Opening the file in Excel shows that each row is packed into a single cell with values separated by spaces, so sep must be set to a space for the data to be read correctly.
Take a quick look at the data:
train_data.head(5).append(train_data.tail(5))
Now let's start analyzing the data.
train_data.columns.values
array(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType',
'fuelType', 'gearbox', 'power', 'kilometer', 'notRepairedDamage',
'regionCode', 'seller', 'offerType', 'creatDate', 'price', 'v_0',
'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9',
'v_10', 'v_11', 'v_12', 'v_13', 'v_14'], dtype=object)
These are the features available in the dataset. Let's first explore each feature's data type and value range.
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SaleID 150000 non-null int64
1 name 150000 non-null int64
2 regDate 150000 non-null int64
3 model 149999 non-null float64
4 brand 150000 non-null int64
5 bodyType 145494 non-null float64
6 fuelType 141320 non-null float64
7 gearbox 144019 non-null float64
8 power 150000 non-null int64
9 kilometer 150000 non-null float64
10 notRepairedDamage 150000 non-null object
11 regionCode 150000 non-null int64
12 seller 150000 non-null int64
13 offerType 150000 non-null int64
14 creatDate 150000 non-null int64
15 price 150000 non-null int64
16 v_0 150000 non-null float64
17 v_1 150000 non-null float64
18 v_2 150000 non-null float64
19 v_3 150000 non-null float64
20 v_4 150000 non-null float64
21 v_5 150000 non-null float64
22 v_6 150000 non-null float64
23 v_7 150000 non-null float64
24 v_8 150000 non-null float64
25 v_9 150000 non-null float64
26 v_10 150000 non-null float64
27 v_11 150000 non-null float64
28 v_12 150000 non-null float64
29 v_13 150000 non-null float64
30 v_14 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
Except for notRepairedDamage, which has type object, every column is int or float. Several features also contain missing values, which will be an important focus later on. Let's check the missing-value counts:
train_data.isnull().sum()
SaleID 0
name 0
regDate 0
model 1
brand 0
bodyType 4506
fuelType 8680
gearbox 5981
power 0
kilometer 0
notRepairedDamage 0
regionCode 0
seller 0
offerType 0
creatDate 0
price 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
dtype: int64
Some features have a fair number of missing values, which will need to be handled. Let's visualize the missing-value counts:
missing = train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace = True)
missing.plot.bar()
We can also inspect the missing values in other ways:
msno.matrix(train_data.sample(10000))
In this plot the white lines indicate missing values; the three features in the middle show many white lines, meaning that even in a random sample of 10,000 rows they still contain quite a few missing values.
msno.bar(train_data.sample(10000))
In the bar chart above, those same three features clearly have fewer non-missing values than the others.
Going back to the dtypes above, the notRepairedDamage feature is of type object, so let's see what values it takes:
train_data['notRepairedDamage'].value_counts()
0.0 111361
- 24324
1.0 14315
Name: notRepairedDamage, dtype: int64
It contains the value "-", which can also be regarded as missing, so we convert "-" to NaN and then handle all the NaNs together later.
So that the test set receives exactly the same treatment, we read it in and concatenate it with the training set:
test_data = pd.read_csv("used_car_testB_20200421.csv", sep = " ")
train_data["origin"] = "train"
test_data["origin"] = "test"
data = pd.concat([train_data, test_data], axis = 0, ignore_index = True)
The combined data frame holds 200,000 rows. We can now process its notRepairedDamage feature in one go:
data['notRepairedDamage'].replace("-", np.nan, inplace = True)
data['notRepairedDamage'].value_counts()
0.0 148585
1.0 19022
Name: notRepairedDamage, dtype: int64
The "-" values have been replaced with NaN, so they no longer appear in the counts.
The following two features are severely imbalanced; in such cases they can be assumed to contribute essentially nothing to the prediction:
data['seller'].value_counts()
0 199999
1 1
Name: seller, dtype: int64
data["offerType"].value_counts()
0 200000
Name: offerType, dtype: int64
So both features can be dropped:
del data["seller"]
del data["offerType"]
That completes the preliminary feature analysis. Next we take a closer look at the target column, the price to be predicted, starting with its distribution:
target = train_data['price']
plt.figure(1)
plt.title('Johnson SU')
sns.distplot(target, kde=False, fit=st.johnsonsu)
plt.figure(2)
plt.title('Normal')
sns.distplot(target, kde=False, fit=st.norm)
plt.figure(3)
plt.title('Log Normal')
sns.distplot(target, kde=False, fit=st.lognorm)
The price distribution is highly uneven, which is bad for prediction: a few extreme values can have a large influence on the model, and most models and algorithms prefer the target distribution to be as close to normal as possible, so it will need processing later. We can quantify this with skewness and kurtosis, two measures of how far a distribution departs from normality:
sns.distplot(target);
print("Skewness: %f" % target.skew())
print("Kurtosis: %f" % target.kurt())
Skewness: 3.346487
Kurtosis: 18.995183
A distribution like this is usually compressed with a log transform:
# Transform the target toward a normal distribution
sns.distplot(np.log(target))
print("Skewness: %f" % np.log(target).skew())
print("Kurtosis: %f" % np.log(target).kurt())
Skewness: -0.265100
Kurtosis: -0.171801
After the log transform the distribution looks much better and is reasonably close to normal.
Next we examine the features by type, treating categorical and numeric features separately. Since the dtypes alone do not distinguish them here, we pick them manually:
numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3',
'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10',
'v_11', 'v_12', 'v_13','v_14' ]
categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'notRepairedDamage', 'regionCode',]
For the categorical features, we check how many distinct values each one has and whether it can be converted to a one-hot vector:
# For each categorical feature, count its distinct values to decide whether one-hot encoding is feasible
for feature in categorical_features:
    print(feature, "has {} distinct values".format(train_data[feature].nunique()))
    print(train_data[feature].value_counts())
name has 99662 distinct values
387 282
708 282
55 280
1541 263
203 233
...
26403 1
28450 1
32544 1
102174 1
184730 1
Name: name, Length: 99662, dtype: int64
model has 248 distinct values
0.0 11762
19.0 9573
4.0 8445
1.0 6038
29.0 5186
...
242.0 2
209.0 2
245.0 2
240.0 2
247.0 1
Name: model, Length: 248, dtype: int64
brand has 40 distinct values
0 31480
4 16737
14 16089
10 14249
1 13794
6 10217
9 7306
5 4665
13 3817
11 2945
3 2461
7 2361
16 2223
8 2077
25 2064
27 2053
21 1547
15 1458
19 1388
20 1236
12 1109
22 1085
26 966
30 940
17 913
24 772
28 649
32 592
29 406
37 333
2 321
31 318
18 316
36 228
34 227
33 218
23 186
35 180
38 65
39 9
Name: brand, dtype: int64
bodyType has 8 distinct values
0.0 41420
1.0 35272
2.0 30324
3.0 13491
4.0 9609
5.0 7607
6.0 6482
7.0 1289
Name: bodyType, dtype: int64
fuelType has 7 distinct values
0.0 91656
1.0 46991
2.0 2212
3.0 262
4.0 118
5.0 45
6.0 36
Name: fuelType, dtype: int64
gearbox has 2 distinct values
0.0 111623
1.0 32396
Name: gearbox, dtype: int64
notRepairedDamage has 2 distinct values
0.0 111361
1.0 14315
Name: notRepairedDamage, dtype: int64
regionCode has 7905 distinct values
419 369
764 258
125 137
176 136
462 134
...
7081 1
7243 1
7319 1
7742 1
7960 1
Name: regionCode, Length: 7905, dtype: int64
name and regionCode have far too many distinct values to be one-hot encoded; the other categorical features can be.
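For high-cardinality columns such as name and regionCode, one common alternative (not used in this baseline; shown only as a hedged sketch) is frequency or count encoding, which replaces each value with how often it occurs:
# Hypothetical sketch: frequency (count) encoding for a high-cardinality column.
# Not part of the original baseline.
name_counts = train_data['name'].value_counts()               # how often each name appears
train_data['name_count'] = train_data['name'].map(name_counts)
# In a full pipeline the counts would be computed on the training set only and mapped
# onto the test set, with unseen values filled with 0 or 1.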
For the numeric features, we can look at their correlation with price, which helps judge which features matter more:
numeric_features.append("price")
price_numeric = train_data[numeric_features]
correlation_score = price_numeric.corr()  # a features-by-features matrix; each element is the correlation between the row and column features
correlation_score['price'].sort_values(ascending = False)
price 1.000000
v_12 0.692823
v_8 0.685798
v_0 0.628397
power 0.219834
v_5 0.164317
v_2 0.085322
v_6 0.068970
v_1 0.060914
v_14 0.035911
v_13 -0.013993
v_7 -0.053024
v_4 -0.147085
v_9 -0.206205
v_10 -0.246175
v_11 -0.275320
kilometer -0.440519
v_3 -0.730946
Name: price, dtype: float64
Features such as v_14, v_13, v_1, and v_7 have very low correlation with price; if computing resources are limited, these could be dropped. We can also visualize the correlations directly:
fig,ax = plt.subplots(figsize = (12,12))
plt.title("Correlation heatmap")
sns.heatmap(correlation_score, square = True, vmax = 0.8)
We also care about the distribution of the numeric features. Let's compute the statistics first and then explain why the distributions matter:
# Skewness and kurtosis of each numeric feature
for col in numeric_features:
print("{:15}\t Skewness:{:05.2f}\t Kurtosis:{:06.2f}".format(col,
train_data[col].skew(),
train_data[col].kurt()))
power Skewness:65.86 Kurtosis:5733.45
kilometer Skewness:-1.53 Kurtosis:001.14
v_0 Skewness:-1.32 Kurtosis:003.99
v_1 Skewness:00.36 Kurtosis:-01.75
v_2 Skewness:04.84 Kurtosis:023.86
v_3 Skewness:00.11 Kurtosis:-00.42
v_4 Skewness:00.37 Kurtosis:-00.20
v_5 Skewness:-4.74 Kurtosis:022.93
v_6 Skewness:00.37 Kurtosis:-01.74
v_7 Skewness:05.13 Kurtosis:025.85
v_8 Skewness:00.20 Kurtosis:-00.64
v_9 Skewness:00.42 Kurtosis:-00.32
v_10 Skewness:00.03 Kurtosis:-00.58
v_11 Skewness:03.03 Kurtosis:012.57
v_12 Skewness:00.37 Kurtosis:000.27
v_13 Skewness:00.27 Kurtosis:-00.44
v_14 Skewness:-1.19 Kurtosis:002.39
price Skewness:03.35 Kurtosis:019.00
The power feature has extremely large skewness and kurtosis, so let's plot the distributions:
f = pd.melt(train_data, value_vars=numeric_features)
# f is a long-format frame with two columns: the first holds the original feature name
# and the second holds its values; e.g. power contributes one row per observation, all stacked together
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
# g is a FacetGrid object onto which plots can be mapped with .map
# the first argument must be long-format data, i.e. the output of melt
# col="variable" is the column to facet on; "variable" holds the feature names produced by melt
# while the corresponding values live in the "value" column
# col_wrap controls how many facet columns per row
g = g.map(sns.distplot, "value")
For how to use melt, see 使用Pandas melt()重塑DataFrame - 知乎 (zhihu.com), which I found very easy to follow.
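To make the wide-to-long reshaping concrete, here is a tiny self-contained example (the toy numbers are made up purely for illustration):
import pandas as pd

toy = pd.DataFrame({"power": [75, 150], "kilometer": [12.5, 15.0]})
print(pd.melt(toy, value_vars=["power", "kilometer"]))
#     variable  value
# 0      power   75.0
# 1      power  150.0
# 2  kilometer   12.5
# 3  kilometer   15.0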
The power distribution is extremely uneven. Just as with price, a few extreme power values would have an outsized effect on the results, making it hard to learn a sensible weight for power, so this will also need processing later. The anonymous features, by contrast, look fairly well behaved and may not need further treatment.
We can also use scatter plots to get a rough view of the pairwise relationships:
sns.pairplot(train_data[numeric_features], size = 2, kind = "scatter",diag_kind = "kde")
(Explore this part on your own.)
Back to the categorical features. The NaNs they contain make plotting awkward, so we first replace NaN with an explicit category:
# Prepare the categorical features for plotting
categorical_features_2 = ['model',
'brand',
'bodyType',
'fuelType',
'gearbox',
'notRepairedDamage']
for c in categorical_features_2:
    train_data[c] = train_data[c].astype("category")
    # convert these columns to the pandas category dtype instead of keeping int/float
    if train_data[c].isnull().any():
        # if the column contains NaN
        train_data[c] = train_data[c].cat.add_categories(['Missing'])
        # add a new category 'Missing' and use it to fill the NaNs,
        # which makes the plots below easier to read
        train_data[c] = train_data[c].fillna('Missing')
Next, bar plots give a direct view of price across each value of the categorical features:
def bar_plot(x, y, **kwargs):
sns.barplot(x = x, y = y)
x = plt.xticks(rotation = 90)
f = pd.melt(train_data, id_vars = ['price'], value_vars = categorical_features_2)
g = sns.FacetGrid(f, col = 'variable', col_wrap = 2, sharex = False, sharey = False)
g = g.map(bar_plot, "value", "price")
From these plots, the categorical features do not show any particularly extreme behaviour.
Feature Engineering
In feature processing, I think the most important part is handling abnormal data. We saw earlier that the power distribution is especially uneven; there are two ways to deal with it, discarding the extreme values or compressing them with a log transform, and both are covered below.
First, discarding extreme values. A box-plot (IQR) rule helps decide what counts as an outlier; the function below wraps this up:
# power is distributed so abnormally that some rows can simply be removed
# Define a function for handling outliers
def outliers_proc(data, col_name, scale=3):
    # data: the original dataframe
    # col_name: name of the column whose outliers should be handled
    # scale: controls how aggressive the cut-off is
    def box_plot_outliers(data_ser, box_scale):
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
        # quantile returns the value at the given quantile
        val_low = data_ser.quantile(0.25) - iqr  # lower bound
        val_up = data_ser.quantile(0.75) + iqr  # upper bound
        rule_low = (data_ser < val_low)  # mask of values below the lower bound
        rule_up = (data_ser > val_up)  # mask of values above the upper bound
        return (rule_low, rule_up), (val_low, val_up)
    data_n = data.copy()
    data_series = data_n[col_name]  # the column to process
    rule, values = box_plot_outliers(data_series, box_scale=scale)
    index = np.arange(data_series.shape[0])[rule[0] | rule[1]]
    # build positions 0..n-1 and keep only those flagged as outliers
    print("Delete number is {}".format(len(index)))
    data_n = data_n.drop(index)  # drop the entire rows
    data_n.reset_index(drop=True, inplace=True)  # rebuild the index
    print("Now column number is:{}".format(data_n.shape[0]))
    index_low = np.arange(data_series.shape[0])[rule[0]]
    outliers = data_series.iloc[index_low]  # values below the lower bound
    print("Description of data less than the lower bound is:")
    print(pd.Series(outliers).describe())
    index_up = np.arange(data_series.shape[0])[rule[1]]
    outliers = data_series.iloc[index_up]
    print("Description of data larger than the upper bound is:")
    print(pd.Series(outliers).describe())
    fig, axes = plt.subplots(1, 2, figsize=(10, 7))
    ax1 = sns.boxplot(y=data[col_name], data=data, palette="Set1", ax=axes[0])
    ax1.set_title("Before outlier removal")
    ax2 = sns.boxplot(y=data_n[col_name], data=data_n, palette="Set1", ax=axes[1])
    ax2.set_title("After outlier removal")
    return data_n
Let's apply it to the power column of the training data:
train_data_delete_after = outliers_proc(train_data, "power", scale =3)
Delete number is 963
Now column number is:149037
Description of data less than the lower bound is:
count 0.0
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Name: power, dtype: float64
Description of data larger than the upper bound is:
count 963.000000
mean 846.836968
std 1929.418081
min 376.000000
25% 400.000000
50% 436.000000
75% 514.000000
max 19312.000000
Name: power, dtype: float64
Over 900 rows were deleted in total, and the resulting box plot looks much more reasonable.
The other approach is log compression. Since I also want to bin power into a power-level feature, I build that feature first and compress power afterwards:
bin_power = [i*10 for i in range(31)]
data["power_bin"] = pd.cut(data["power"],bin_power,right = False,labels = False)
pd.cut splits power into segments according to bin_power. With labels=False the lowest segment is coded 0, the next 1, and so on. Values that fall outside all the bins (here, power of 300 or more) become NaN in power_bin, so we assign those a level of 31:
data['power_bin'] = data['power_bin'].fillna(31)
Now power itself can be compressed with a log transform:
data['power'] = np.log(data['power'] + 1)
Next, construct some new features.
The first is usage time, computed as creatDate minus regDate:
data["use_time"] = (pd.to_datetime(data['creatDate'],format = "%Y%m%d",errors = "coerce")
- pd.to_datetime(data["regDate"], format = "%Y%m%d", errors = "coerce")).dt.days
# errors="coerce" assigns NaT when a date fails to parse
Because some dates are missing or malformed, this feature ends up with missing values. My approach is to fill them with the mean, but the test set must also be filled with the training-set mean, so I defer this until the data are split later on.
Next, features based on per-brand sales statistics. Since quantities such as a brand's mean, maximum, and standard deviation of price must be computed on the training set only (the test set is unseen), we compute them there and then attach the values to each row by brand:
# Compute various statistics for each brand
train_gb = train_data.groupby("brand")
all_info = {}
for kind, kind_data in train_gb:
    info = {}
    kind_data = kind_data[kind_data["price"] > 0]
    # drop rows whose price is not positive (possible bad records)
    info["brand_amount"] = len(kind_data)  # number of cars of this brand
    info["brand_price_max"] = kind_data.price.max()  # maximum price for this brand
    info["brand_price_min"] = kind_data.price.min()  # minimum price for this brand
    info["brand_price_median"] = kind_data.price.median()  # median price for this brand
    info["brand_price_sum"] = kind_data.price.sum()  # total price for this brand
    info["brand_price_std"] = kind_data.price.std()  # standard deviation of prices
    info["brand_price_average"] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
    # average price (sum / (count + 1)), rounded to two decimals
    all_info[kind] = info
brand_feature = pd.DataFrame(all_info).T.reset_index().rename(columns = {"index": "brand"})
The way brand_feature is built may look a bit convoluted, so let me walk through it step by step:
brand_feature = pd.DataFrame(all_info)
brand_feature
Here the 7 statistics form the index and the 40 columns correspond to the 40 brands.
brand_feature = pd.DataFrame(all_info).T.reset_index()
brand_feature
After transposing and resetting the index:
the brand statistics become columns, and a new column named index holds the brand values.
brand_feature = pd.DataFrame(all_info).T.reset_index().rename(columns = {"index":"brand"})
brand_feature
This last step renames index to brand; that column holds the brand values and makes it easy to merge the statistics back into data:
data = data.merge(brand_feature, how='left', on='brand')
The merge matches each brand value in data with the corresponding row of brand_feature and attaches the statistics as new features.
Next, normalize most of the numeric features:
def max_min(x):
return (x - np.min(x)) / (np.max(x) - np.min(x))
for feature in ["brand_amount","brand_price_average","brand_price_max",
"brand_price_median","brand_price_min","brand_price_std",
"brand_price_sum","power","kilometer"]:
data[feature] = max_min(data[feature])
Encode the categorical features:
# One-hot encode the categorical features
data = pd.get_dummies(data, columns=['model', 'brand', 'bodyType','fuelType','gearbox',
'notRepairedDamage', 'power_bin'],dummy_na=True)
The features that are no longer needed can now be dropped:
data = data.drop(['creatDate',"regDate", "regionCode"], axis = 1)
At this point the feature processing is essentially done. This is only a simple treatment; deeper feature engineering is certainly possible (I haven't gone that far, haha).
Model Building
First prepare the datasets:
use_feature = [x for x in data.columns if x not in ['SaleID',"name","price","origin"]]
target = data[data["origin"] == "train"]["price"]
target_lg = (np.log(target+1))
train_x = data[data["origin"] == "train"][use_feature]
test_x = data[data["origin"] == "test"][use_feature]
train_x["use_time"] = train_x["use_time"].fillna(train_x["use_time"].mean())
test_x["use_time"] = test_x["use_time"].fillna(train_x["use_time"].mean())  # fill with the training-set mean
train_x.shape
(150000, 371)
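Since the model will be trained on log(price + 1), the final predictions must be mapped back with exp(...) - 1 before submission (this is done at the end). As a side note, np.log1p and np.expm1 are equivalent shorthands for this pair of transforms; a quick sanity check with made-up numbers:
prices = np.array([500.0, 3000.0, 99999.0])  # illustrative prices only
log_prices = np.log1p(prices)                # same as np.log(prices + 1)
restored = np.expm1(log_prices)              # same as np.exp(log_prices) - 1
print(np.allclose(prices, restored))         # True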
We can check whether any missing values remain:
test_x.isnull().sum()
power 0
kilometer 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
use_time 0
brand_amount 0
brand_price_max 0
brand_price_min 0
brand_price_median 0
brand_price_sum 0
brand_price_std 0
brand_price_average 0
model_0.0 0
model_1.0 0
model_2.0 0
model_3.0 0
model_4.0 0
model_5.0 0
model_6.0 0
model_7.0 0
model_8.0 0
model_9.0 0
model_10.0 0
model_11.0 0
model_12.0 0
model_13.0 0
model_14.0 0
model_15.0 0
model_16.0 0
model_17.0 0
model_18.0 0
model_19.0 0
model_20.0 0
model_21.0 0
model_22.0 0
model_23.0 0
model_24.0 0
model_25.0 0
model_26.0 0
model_27.0 0
model_28.0 0
model_29.0 0
model_30.0 0
model_31.0 0
model_32.0 0
model_33.0 0
model_34.0 0
model_35.0 0
model_36.0 0
model_37.0 0
model_38.0 0
model_39.0 0
model_40.0 0
model_41.0 0
model_42.0 0
model_43.0 0
model_44.0 0
model_45.0 0
model_46.0 0
model_47.0 0
model_48.0 0
model_49.0 0
model_50.0 0
model_51.0 0
model_52.0 0
model_53.0 0
model_54.0 0
model_55.0 0
model_56.0 0
model_57.0 0
model_58.0 0
model_59.0 0
model_60.0 0
model_61.0 0
model_62.0 0
model_63.0 0
model_64.0 0
model_65.0 0
model_66.0 0
model_67.0 0
model_68.0 0
model_69.0 0
model_70.0 0
model_71.0 0
model_72.0 0
model_73.0 0
model_74.0 0
model_75.0 0
model_76.0 0
model_77.0 0
model_78.0 0
model_79.0 0
model_80.0 0
model_81.0 0
model_82.0 0
model_83.0 0
model_84.0 0
model_85.0 0
model_86.0 0
model_87.0 0
model_88.0 0
model_89.0 0
model_90.0 0
model_91.0 0
model_92.0 0
model_93.0 0
model_94.0 0
model_95.0 0
model_96.0 0
model_97.0 0
model_98.0 0
model_99.0 0
model_100.0 0
model_101.0 0
model_102.0 0
model_103.0 0
model_104.0 0
model_105.0 0
model_106.0 0
model_107.0 0
model_108.0 0
model_109.0 0
model_110.0 0
model_111.0 0
model_112.0 0
model_113.0 0
model_114.0 0
model_115.0 0
model_116.0 0
model_117.0 0
model_118.0 0
model_119.0 0
model_120.0 0
model_121.0 0
model_122.0 0
model_123.0 0
model_124.0 0
model_125.0 0
model_126.0 0
model_127.0 0
model_128.0 0
model_129.0 0
model_130.0 0
model_131.0 0
model_132.0 0
model_133.0 0
model_134.0 0
model_135.0 0
model_136.0 0
model_137.0 0
model_138.0 0
model_139.0 0
model_140.0 0
model_141.0 0
model_142.0 0
model_143.0 0
model_144.0 0
model_145.0 0
model_146.0 0
model_147.0 0
model_148.0 0
model_149.0 0
model_150.0 0
model_151.0 0
model_152.0 0
model_153.0 0
model_154.0 0
model_155.0 0
model_156.0 0
model_157.0 0
model_158.0 0
model_159.0 0
model_160.0 0
model_161.0 0
model_162.0 0
model_163.0 0
model_164.0 0
model_165.0 0
model_166.0 0
model_167.0 0
model_168.0 0
model_169.0 0
model_170.0 0
model_171.0 0
model_172.0 0
model_173.0 0
model_174.0 0
model_175.0 0
model_176.0 0
model_177.0 0
model_178.0 0
model_179.0 0
model_180.0 0
model_181.0 0
model_182.0 0
model_183.0 0
model_184.0 0
model_185.0 0
model_186.0 0
model_187.0 0
model_188.0 0
model_189.0 0
model_190.0 0
model_191.0 0
model_192.0 0
model_193.0 0
model_194.0 0
model_195.0 0
model_196.0 0
model_197.0 0
model_198.0 0
model_199.0 0
model_200.0 0
model_201.0 0
model_202.0 0
model_203.0 0
model_204.0 0
model_205.0 0
model_206.0 0
model_207.0 0
model_208.0 0
model_209.0 0
model_210.0 0
model_211.0 0
model_212.0 0
model_213.0 0
model_214.0 0
model_215.0 0
model_216.0 0
model_217.0 0
model_218.0 0
model_219.0 0
model_220.0 0
model_221.0 0
model_222.0 0
model_223.0 0
model_224.0 0
model_225.0 0
model_226.0 0
model_227.0 0
model_228.0 0
model_229.0 0
model_230.0 0
model_231.0 0
model_232.0 0
model_233.0 0
model_234.0 0
model_235.0 0
model_236.0 0
model_237.0 0
model_238.0 0
model_239.0 0
model_240.0 0
model_241.0 0
model_242.0 0
model_243.0 0
model_244.0 0
model_245.0 0
model_246.0 0
model_247.0 0
model_nan 0
brand_0.0 0
brand_1.0 0
brand_2.0 0
brand_3.0 0
brand_4.0 0
brand_5.0 0
brand_6.0 0
brand_7.0 0
brand_8.0 0
brand_9.0 0
brand_10.0 0
brand_11.0 0
brand_12.0 0
brand_13.0 0
brand_14.0 0
brand_15.0 0
brand_16.0 0
brand_17.0 0
brand_18.0 0
brand_19.0 0
brand_20.0 0
brand_21.0 0
brand_22.0 0
brand_23.0 0
brand_24.0 0
brand_25.0 0
brand_26.0 0
brand_27.0 0
brand_28.0 0
brand_29.0 0
brand_30.0 0
brand_31.0 0
brand_32.0 0
brand_33.0 0
brand_34.0 0
brand_35.0 0
brand_36.0 0
brand_37.0 0
brand_38.0 0
brand_39.0 0
brand_nan 0
bodyType_0.0 0
bodyType_1.0 0
bodyType_2.0 0
bodyType_3.0 0
bodyType_4.0 0
bodyType_5.0 0
bodyType_6.0 0
bodyType_7.0 0
bodyType_nan 0
fuelType_0.0 0
fuelType_1.0 0
fuelType_2.0 0
fuelType_3.0 0
fuelType_4.0 0
fuelType_5.0 0
fuelType_6.0 0
fuelType_nan 0
gearbox_0.0 0
gearbox_1.0 0
gearbox_nan 0
notRepairedDamage_0.0 0
notRepairedDamage_1.0 0
notRepairedDamage_nan 0
power_bin_0.0 0
power_bin_1.0 0
power_bin_2.0 0
power_bin_3.0 0
power_bin_4.0 0
power_bin_5.0 0
power_bin_6.0 0
power_bin_7.0 0
power_bin_8.0 0
power_bin_9.0 0
power_bin_10.0 0
power_bin_11.0 0
power_bin_12.0 0
power_bin_13.0 0
power_bin_14.0 0
power_bin_15.0 0
power_bin_16.0 0
power_bin_17.0 0
power_bin_18.0 0
power_bin_19.0 0
power_bin_20.0 0
power_bin_21.0 0
power_bin_22.0 0
power_bin_23.0 0
power_bin_24.0 0
power_bin_25.0 0
power_bin_26.0 0
power_bin_27.0 0
power_bin_28.0 0
power_bin_29.0 0
power_bin_31.0 0
power_bin_nan 0
dtype: int64
No missing values remain, so we can move on to choosing models.
For practical reasons (my computer cannot handle xgboost) I picked three models, lightGBM, random forest, and gradient boosting decision trees, and then combined them with model fusion. The code is as follows:
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve, mean_squared_error,mean_absolute_error, f1_score
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor as rfr
from sklearn.model_selection import KFold, StratifiedKFold,GroupKFold, RepeatedKFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import GradientBoostingRegressor as gbr
from sklearn.linear_model import LinearRegression as lr
lightGBM:
lgb_param = {  # training parameter list
    "num_leaves": 7,
    "min_data_in_leaf": 20,  # minimum number of samples per leaf, guards against overfitting
    "objective": "regression",  # regression task
    "max_depth": -1,  # maximum tree depth; -1 means no limit
    "learning_rate": 0.003,
    "boosting": "gbdt",  # use the gbdt algorithm
    "feature_fraction": 0.50,  # use 50% of the features when building each tree, adding feature-subspace diversity
    "bagging_freq": 1,  # perform bagging at every iteration
    "bagging_fraction": 0.55,  # each bagging step samples 55% of the data without replacement
    "bagging_seed": 1,
    "metric": 'mean_absolute_error',
    "lambda_l1": 0.5,
    "lambda_l2": 0.5,
    "verbosity": -1  # verbosity of printed messages
}
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state = 4)
# Creates a splitter that shuffles and splits the data five times, giving 5-fold cross-validation
valid_lgb = np.zeros(len(train_x))
predictions_lgb = np.zeros(len(test_x))
for fold_, (train_idx, valid_idx) in enumerate(folds.split(train_x, target)):
    # indices of the training and validation parts returned for this fold
    print("fold n{}".format(fold_ + 1))  # current fold number
    train_data_now = lgb.Dataset(train_x.iloc[train_idx], target_lg[train_idx])
    valid_data_now = lgb.Dataset(train_x.iloc[valid_idx], target_lg[valid_idx])
    # take out the data and wrap it in lgb Dataset objects
    num_round = 10000
    lgb_model = lgb.train(lgb_param, train_data_now, num_round,
                          valid_sets=[train_data_now, valid_data_now], verbose_eval=500,
                          early_stopping_rounds=800)
    valid_lgb[valid_idx] = lgb_model.predict(train_x.iloc[valid_idx],
                                             num_iteration=lgb_model.best_iteration)
    predictions_lgb += lgb_model.predict(test_x,
                                         num_iteration=lgb_model.best_iteration) / folds.n_splits
    # average the test predictions over the folds
print("CV score: {:<8.8f}".format(mean_absolute_error(valid_lgb, target_lg)))
Note that I pass target to folds.split while the prices used inside the loop are target_lg. target holds the raw prices, which can be treated as discrete values, but after np.log the values in target_lg are continuous, and using target_lg for the split raises:
Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
I believe np.log turned the prices into continuous values rather than the original discrete ones, so I could only use target to generate the split indices.
CV score: 0.15345674
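Strictly speaking, StratifiedKFold is designed for classification labels; it only works here because the raw integer prices happen to be usable as class labels. A cleaner workaround, not used in this baseline and sketched only as a possible alternative, is to stratify on a binned version of the target:
# Hypothetical alternative: stratify on quantile-binned prices instead of raw values.
from sklearn.model_selection import StratifiedKFold

price_bins = pd.qcut(target, q=10, labels=False, duplicates="drop")  # 10 quantile bins
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=4)
for train_idx, valid_idx in folds.split(train_x, price_bins):
    pass  # train and validate exactly as above, still using target_lg as the regression label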
Likewise, let's look at the feature importances:
pd.set_option("display.max_columns", None)  # show all columns
pd.set_option('display.max_rows', None)  # show all rows; None means never truncate with ellipses
# set the displayed column width to 100 characters (the default is 50)
pd.set_option('display.max_colwidth', 100)
# build a dataframe whose single column lists the features used for training
df = pd.DataFrame(train_x.columns.tolist(), columns=['feature'])
df['importance'] = list(lgb_model.feature_importance())
df = df.sort_values(by='importance', ascending=False)  # sort descending
plt.figure(figsize = (14,28))
sns.barplot(x='importance', y='feature', data = df.head(50))  # plot the top 50 features
plt.title('Features importance (averaged/folds)')
plt.tight_layout()  # auto-adjust the layout
use_time is far and away the most important feature.
Random forest:
# RandomForestRegressor
folds = KFold(n_splits=5, shuffle=True, random_state=2019)
valid_rfr = np.zeros(len(train_x))
predictions_rfr = np.zeros(len(test_x))
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x, target)):
    print("fold n°{}".format(fold_ + 1))
    tr_x = train_x.iloc[trn_idx]
    tr_y = target_lg[trn_idx]
    rfr_model = rfr(n_estimators=1600, max_depth=9, min_samples_leaf=9,
                    min_weight_fraction_leaf=0.0, max_features=0.25,
                    verbose=1, n_jobs=-1)  # n_jobs=-1 builds the trees in parallel on all cores
    # verbose=0: no log output
    # verbose=1: progress messages
    # verbose=2: more detailed output
    rfr_model.fit(tr_x, tr_y)
    valid_rfr[val_idx] = rfr_model.predict(train_x.iloc[val_idx])
    predictions_rfr += rfr_model.predict(test_x) / folds.n_splits
print("CV score: {:<8.8f}".format(mean_absolute_error(valid_rfr, target_lg)))
CV score: 0.17160127
Gradient boosting:
# GradientBoostingRegressor
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2018)
valid_gbr = np.zeros(len(train_x))
predictions_gbr = np.zeros(len(test_x))
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x, target)):
print("fold n°{}".format(fold_+1))
tr_x = train_x.iloc[trn_idx]
tr_y = target_lg[trn_idx]
gbr_model = gbr(n_estimators=100, learning_rate=0.1,subsample=0.65 ,max_depth=7,
min_samples_leaf=20, max_features=0.22,verbose=1)
gbr_model.fit(tr_x,tr_y)
valid_gbr[val_idx] = gbr_model.predict(train_x.iloc[val_idx])
predictions_gbr += gbr_model.predict(test_x) / folds.n_splits
print("CV score: {:<8.8f}".format(mean_absolute_error(valid_gbr, target_lg)))
CV score: 0.14386158
Next, the three models are fused by stacking them with a linear regression:
train_stack2 = np.vstack([valid_lgb, valid_rfr, valid_gbr]).transpose()
test_stack2 = np.vstack([predictions_lgb, predictions_rfr,predictions_gbr]).transpose()
# Cross-validation: 5 folds, repeated twice
folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=7)
valid_stack2 = np.zeros(train_stack2.shape[0])
predictions_lr2 = np.zeros(test_stack2.shape[0])
for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack2, target)):
    print("fold {}".format(fold_))
    trn_data, trn_y = train_stack2[trn_idx], target_lg.iloc[trn_idx].values
    val_data, val_y = train_stack2[val_idx], target_lg.iloc[val_idx].values
    # linear regression as the stacking meta-model
    lr2 = lr()
    lr2.fit(trn_data, trn_y)
    valid_stack2[val_idx] = lr2.predict(val_data)
    predictions_lr2 += lr2.predict(test_stack2) / 10  # 10 = 5 folds x 2 repeats
print("CV score: {:<8.8f}".format(mean_absolute_error(target_lg.values, valid_stack2)))
CV score: 0.14343221
Now we just apply exp to the predictions to recover actual prices and submit!
prediction_test = np.exp(predictions_lr2) - 1
test_submission = pd.read_csv("used_car_testB_20200421.csv", sep = " ")
test_submission["price"] = prediction_test
feature_submission = ["SaleID","price"]
sub = test_submission[feature_submission]
sub.to_csv("mysubmission.csv",index = False)
The parameters above were set by hand; next I tune lightGBM to see whether the result can be improved:
# Tune lightgbm
# Build the datasets
train_y = target_lg
x_train, x_valid, y_train, y_valid = train_test_split(train_x, train_y,
random_state = 1, test_size = 0.2)
# Convert to lightgbm Dataset objects
lgb_train = lgb.Dataset(x_train, y_train, free_raw_data = False)
lgb_valid = lgb.Dataset(x_valid, y_valid, reference=lgb_train,free_raw_data=False)
# Initial parameters
params = {
"boosting_type":"gbdt",
"objective":"regression",
"metric":"mae",
"nthread":4,
"learning_rate":0.1,
"verbosity": -1
}
# Tune hyperparameters via cross-validation
print("Cross-validation")
min_mae = 10000
best_params = {}
print("Tuning step 1: improve accuracy")
for num_leaves in range(5,100,5):
for max_depth in range(3,10,1):
params["num_leaves"] = num_leaves
params["max_depth"] = max_depth
cv_results = lgb.cv(params, lgb_train,seed = 1,nfold =5,
metrics=["mae"], early_stopping_rounds = 15,stratified=False,
verbose_eval = True)
mean_mae = pd.Series(cv_results['l1-mean']).max()
boost_rounds = pd.Series(cv_results["l1-mean"]).idxmax()
if mean_mae <= min_mae:
min_mae = mean_mae
best_params["num_leaves"] = num_leaves
best_params["max_depth"] = max_depth
if "num_leaves" in best_params and "max_depth" in best_params:
params["num_leaves"] = best_params["num_leaves"]
params["max_depth"] = best_params["max_depth"]
print("Tuning step 2: reduce overfitting")
for max_bin in range(5,256,10):
for min_data_in_leaf in range(1,102,10):
params['max_bin'] = max_bin
params['min_data_in_leaf'] = min_data_in_leaf
cv_results = lgb.cv(
params,
lgb_train,
seed=1,
nfold=5,
metrics=['mae'],
early_stopping_rounds=10,
verbose_eval=True,
stratified=False
)
mean_mae = pd.Series(cv_results['l1-mean']).max()
boost_rounds = pd.Series(cv_results['l1-mean']).idxmax()
if mean_mae <= min_mae:
min_mae = mean_mae
best_params['max_bin']= max_bin
best_params['min_data_in_leaf'] = min_data_in_leaf
if 'max_bin' in best_params and 'min_data_in_leaf' in best_params:
params['min_data_in_leaf'] = best_params['min_data_in_leaf']
params['max_bin'] = best_params['max_bin']
print("Tuning step 3: reduce overfitting")
for feature_fraction in [0.6,0.7,0.8,0.9,1.0]:
for bagging_fraction in [0.6,0.7,0.8,0.9,1.0]:
for bagging_freq in range(0,50,5):
params['feature_fraction'] = feature_fraction
params['bagging_fraction'] = bagging_fraction
params['bagging_freq'] = bagging_freq
cv_results = lgb.cv(
params,
lgb_train,
seed=1,
nfold=5,
metrics=['mae'],
early_stopping_rounds=10,
verbose_eval=True,
stratified=False
)
mean_mae = pd.Series(cv_results['l1-mean']).max()
boost_rounds = pd.Series(cv_results['l1-mean']).idxmax()
if mean_mae <= min_mae:
min_mae = mean_mae
best_params['feature_fraction'] = feature_fraction
best_params['bagging_fraction'] = bagging_fraction
best_params['bagging_freq'] = bagging_freq
if 'feature_fraction' in best_params and 'bagging_fraction' in best_params and 'bagging_freq' in best_params:
params['feature_fraction'] = best_params['feature_fraction']
params['bagging_fraction'] = best_params['bagging_fraction']
params['bagging_freq'] = best_params['bagging_freq']
print("Tuning step 4: reduce overfitting")
for lambda_l1 in [1e-5,1e-3,1e-1,0.0,0.1,0.3,0.5,0.7,0.9,1.0]:
for lambda_l2 in [1e-5,1e-3,1e-1,0.0,0.1,0.4,0.6,0.7,0.9,1.0]:
params['lambda_l1'] = lambda_l1
params['lambda_l2'] = lambda_l2
cv_results = lgb.cv(
params,
lgb_train,
seed=1,
nfold=5,
metrics=['mae'],
early_stopping_rounds=10,
verbose_eval=True,
stratified=False
)
mean_mae = pd.Series(cv_results['l1-mean']).max()
boost_rounds = pd.Series(cv_results['l1-mean']).idxmax()
if mean_mae <= min_mae:
min_mae = mean_mae
best_params['lambda_l1'] = lambda_l1
best_params['lambda_l2'] = lambda_l2
if 'lambda_l1' in best_params and 'lambda_l2' in best_params:
params['lambda_l1'] = best_params['lambda_l1']
params['lambda_l2'] = best_params['lambda_l2']
print("Tuning step 5: reduce overfitting, part 2")
for min_split_gain in [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]:
params['min_split_gain'] = min_split_gain
cv_results = lgb.cv(
params,
lgb_train,
seed=1,
nfold=5,
metrics=['mae'],
early_stopping_rounds=10,
verbose_eval=True,
stratified=False
)
mean_mae = pd.Series(cv_results['l1-mean']).max()
boost_rounds = pd.Series(cv_results['l1-mean']).idxmax()
if mean_mae <= min_mae:
min_mae = mean_mae
best_params['min_split_gain'] = min_split_gain
if 'min_split_gain' in best_params.keys():
params['min_split_gain'] = best_params['min_split_gain']
print(best_params)
Note that stratified=False must be set in lgb.cv, again because of the continuous-versus-discrete issue mentioned earlier!
{'num_leaves': 95, 'max_depth': 9, 'max_bin': 215, 'min_data_in_leaf': 71, 'feature_fraction': 1.0, 'bagging_fraction': 1.0, 'bagging_freq': 45, 'lambda_l1': 0.0, 'lambda_l2': 0.0, 'min_split_gain': 1.0}
Now predict again with the tuned parameters:
best_params["verbosity"] = -1
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state = 4)
# Creates a splitter that shuffles and splits the data five times, giving 5-fold cross-validation
valid_lgb = np.zeros(len(train_x))
predictions_lgb = np.zeros(len(test_x))
for fold_, (train_idx, valid_idx) in enumerate(folds.split(train_x, target)):
    # indices of the training and validation parts returned for this fold
    print("fold n{}".format(fold_ + 1))  # current fold number
    train_data_now = lgb.Dataset(train_x.iloc[train_idx], target_lg[train_idx])
    valid_data_now = lgb.Dataset(train_x.iloc[valid_idx], target_lg[valid_idx])
    # take out the data and wrap it in lgb Dataset objects
    num_round = 10000
    lgb_model = lgb.train(best_params, train_data_now, num_round,
                          valid_sets=[train_data_now, valid_data_now], verbose_eval=500,
                          early_stopping_rounds=800)
    valid_lgb[valid_idx] = lgb_model.predict(train_x.iloc[valid_idx],
                                             num_iteration=lgb_model.best_iteration)
    predictions_lgb += lgb_model.predict(test_x,
                                         num_iteration=lgb_model.best_iteration) / folds.n_splits
    # average the test predictions over the folds
print("CV score: {:<8.8f}".format(mean_absolute_error(valid_lgb, target_lg)))
CV score: 0.14548046
Running the same model-fusion code again with these predictions gives:
CV score: 0.14071899
Done!