
Getting Started with Data Mining from Scratch: Used-Car Transaction Price Prediction (baseline)

This article walks through a baseline for the beginner-friendly data mining competition on used-car transaction price prediction. I hope it helps; if there are mistakes or things I haven't considered, feedback is very welcome.

Getting Started with Data Mining from Scratch - Used-Car Transaction Price Prediction

Understanding the Problem

The competition asks participants to build a model from the given dataset to predict the transaction prices of used cars.

賽題以預(yù)測(cè)二手車的交易價(jià)格為任務(wù),數(shù)據(jù)集報(bào)名后可見(jiàn)并可下載,該數(shù)據(jù)來(lái)自某交易平臺(tái)的二手車交易記錄,總數(shù)據(jù)量超過(guò)40w,包含31列變量信息,其中15列為匿名變量。為了保證比賽的公平性,將會(huì)從中抽取15萬(wàn)條作為訓(xùn)練集,5萬(wàn)條作為測(cè)試集A,5萬(wàn)條作為測(cè)試集B,同時(shí)會(huì)對(duì)name、model、brand和regionCode等信息進(jìn)行脫敏。

Competition page: https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX

數(shù)據(jù)形式

訓(xùn)練數(shù)據(jù)集具有的特征如下:

  • name - vehicle trade code
  • regDate - vehicle registration date
  • model - model code
  • brand - car brand
  • bodyType - body type
  • fuelType - fuel type
  • gearbox - transmission type
  • power - engine power
  • kilometer - kilometers driven
  • notRepairedDamage - whether the car has unrepaired damage
  • regionCode - region code where the car is viewed
  • seller - seller type
  • offerType - offer type
  • creatDate - time the ad went online
  • price - vehicle price (the target column)
  • v_0, v_1, ..., v_14 - embedding vectors derived from large amounts of information such as reviews and tags [manually constructed anonymous features]

預(yù)測(cè)指標(biāo)

The competition uses MAE (mean absolute error) as the evaluation metric.
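
For reference, MAE is just the mean of the absolute errors between predictions and ground truth; a minimal sketch with made-up numbers:

import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([5000, 12000, 8000])  # hypothetical true prices
y_pred = np.array([5500, 11000, 8200])  # hypothetical predictions
print(np.mean(np.abs(y_true - y_pred)))     # mean of |true - pred|, here about 566.67
print(mean_absolute_error(y_true, y_pred))  # the same value via sklearn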

Implementation

導(dǎo)入相關(guān)庫(kù)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import scipy.stats as st
import warnings
warnings.filterwarnings('ignore')
# Configure a font (SimHei) so non-ASCII plot labels render correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

數(shù)據(jù)分析

先讀入數(shù)據(jù):

train_data = pd.read_csv("used_car_train_20200313.csv", sep = " ")

If you open the file in Excel, each row sits in a single cell with fields separated by spaces, so sep must be set to a space for the file to parse correctly.

觀看一下數(shù)據(jù):

# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
pd.concat([train_data.head(5), train_data.tail(5)])

[Figure: first and last five rows of the training data]

那么下面就開(kāi)始對(duì)數(shù)據(jù)進(jìn)行分析。

train_data.columns.values
array(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType',
       'fuelType', 'gearbox', 'power', 'kilometer', 'notRepairedDamage',
       'regionCode', 'seller', 'offerType', 'creatDate', 'price', 'v_0',
       'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9',
       'v_10', 'v_11', 'v_12', 'v_13', 'v_14'], dtype=object)

以上為數(shù)據(jù)具有的具體特征,那么可以先初步探索一下每個(gè)特征的數(shù)值類型以及取值等。

train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  seller             150000 non-null  int64  
 13  offerType          150000 non-null  int64  
 14  creatDate          150000 non-null  int64  
 15  price              150000 non-null  int64  
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB

Everything is int or float except notRepairedDamage, which is object. Several features also contain missing values, an important direction for later processing. Let's look at the missing values:

train_data.isnull().sum()
SaleID                  0
name                    0
regDate                 0
model                   1
brand                   0
bodyType             4506
fuelType             8680
gearbox              5981
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
price                   0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64

Some features do have a fair number of missing values, so they will need handling. Let's visualize the missing-value counts:

missing = train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace = True)
missing.plot.bar()

[Figure: bar chart of missing-value counts per feature]

We can also inspect the missing values in other ways:

msno.matrix(train_data.sample(10000))

[Figure: missingno matrix of a 10,000-row sample]

In this plot the white lines mark missing values. The three features in the middle show many white lines, meaning that even in a 10,000-row sample they still contain many missing entries.

msno.bar(train_data.sample(10000))

[Figure: missingno bar chart of a 10,000-row sample]

In this chart, the same three features also have visibly fewer non-missing entries than the others.


再回到最開(kāi)始的數(shù)據(jù)類型處,我們可以發(fā)現(xiàn)notRepairedDamage特征的類型為object,因此我們可以來(lái)觀察其具有幾種取值:

train_data['notRepairedDamage'].value_counts()
0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64

It contains a "-" value, which can also be regarded as missing, so we can convert "-" to NaN and then handle all NaNs uniformly.

而為了測(cè)試數(shù)據(jù)集也得到了相同的處理,因此讀入數(shù)據(jù)集并合并:

test_data = pd.read_csv("used_car_testB_20200421.csv", sep = " ")
train_data["origin"] = "train"
test_data["origin"] = "test"
data = pd.concat([train_data, test_data], axis = 0, ignore_index = True)

The resulting data frame has 200,000 rows. Now we can process notRepairedDamage on the combined data:

data['notRepairedDamage'].replace("-", np.nan, inplace = True)
data['notRepairedDamage'].value_counts()
0.0    148585
1.0     19022
Name: notRepairedDamage, dtype: int64

The "-" entries have been replaced by NaN and therefore no longer show up in the counts.

The following two features have severely imbalanced classes; features like these can be assumed to contribute nothing to the prediction:

data['seller'].value_counts()
0    199999
1         1
Name: seller, dtype: int64
data["offerType"].value_counts()
0    200000
Name: offerType, dtype: int64

因此可以對(duì)這兩個(gè)特征進(jìn)行刪除:

del data["seller"]
del data["offerType"]

以上是對(duì)特征的初步分析,那么接下來(lái)我們對(duì)目標(biāo)列,也就是預(yù)測(cè)價(jià)格進(jìn)行進(jìn)一步的分析,先觀察其分布情況:

target = train_data['price']
plt.figure(1)
plt.title('Johnson SU')
sns.distplot(target, kde=False, fit=st.johnsonsu)
plt.figure(2)
plt.title('Normal')
sns.distplot(target, kde=False, fit=st.norm)
plt.figure(3)
plt.title('Log Normal')
sns.distplot(target, kde=False, fit=st.lognorm)

[Figures: price distribution with Johnson SU, normal, and log-normal fits]

我們可以看到價(jià)格的分布是極其不均勻的,這對(duì)預(yù)測(cè)是不利的,部分取值較為極端的例子將會(huì)對(duì)模型產(chǎn)生較大的影響,并且大部分模型及算法都希望預(yù)測(cè)的分布能夠盡可能地接近正態(tài)分布,因此后期需要進(jìn)行處理,那我們可以從偏度和峰度兩個(gè)正態(tài)分布的角度來(lái)觀察:

sns.distplot(target);
print("Skewness: %f" % target.skew())
print("Kurtosis: %f" % target.kurt())
Skewness: 3.346487
Kurtosis: 18.995183

[Figure: price distribution with fitted density]

對(duì)這種數(shù)據(jù)分布的處理,通??梢杂胠og來(lái)進(jìn)行壓縮轉(zhuǎn)換:

# Compress the target toward a normal distribution
sns.distplot(np.log(target))
print("Skewness: %f" % np.log(target).skew())
print("Kurtosis: %f" % np.log(target).kurt())
Skewness: -0.265100
Kurtosis: -0.171801

[Figure: distribution of log(price)]

可以看到,經(jīng)過(guò)log變換之后其分布相對(duì)好了很多,比較接近正態(tài)分布了。


接下來(lái),我們對(duì)不同類型的特征進(jìn)行觀察,分別對(duì)類別特征和數(shù)字特征來(lái)觀察。由于這里沒(méi)有在數(shù)值類型上加以區(qū)分,因此我們需要人工挑選:

numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 
                    'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10',
                    'v_11', 'v_12', 'v_13','v_14' ]

categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'notRepairedDamage', 'regionCode',]

那么對(duì)于類別型特征,我們可以查看其具有多少個(gè)取值,是否能夠轉(zhuǎn)換one-hot向量:

# 對(duì)于類別型的特征需要查看其取值有多少個(gè),能不能轉(zhuǎn)換為onehot
for feature in categorical_features:
    print(feature,"特征有{}個(gè)取值".format(train_data[feature].nunique()))
    print(train_data[feature].value_counts())
name 特征有99662個(gè)取值
387       282
708       282
55        280
1541      263
203       233
         ... 
26403       1
28450       1
32544       1
102174      1
184730      1
Name: name, Length: 99662, dtype: int64
model has 248 distinct values
0.0      11762
19.0      9573
4.0       8445
1.0       6038
29.0      5186
         ...  
242.0        2
209.0        2
245.0        2
240.0        2
247.0        1
Name: model, Length: 248, dtype: int64
brand has 40 distinct values
0     31480
4     16737
14    16089
10    14249
1     13794
6     10217
9      7306
5      4665
13     3817
11     2945
3      2461
7      2361
16     2223
8      2077
25     2064
27     2053
21     1547
15     1458
19     1388
20     1236
12     1109
22     1085
26      966
30      940
17      913
24      772
28      649
32      592
29      406
37      333
2       321
31      318
18      316
36      228
34      227
33      218
23      186
35      180
38       65
39        9
Name: brand, dtype: int64
bodyType has 8 distinct values
0.0    41420
1.0    35272
2.0    30324
3.0    13491
4.0     9609
5.0     7607
6.0     6482
7.0     1289
Name: bodyType, dtype: int64
fuelType has 7 distinct values
0.0    91656
1.0    46991
2.0     2212
3.0      262
4.0      118
5.0       45
6.0       36
Name: fuelType, dtype: int64
gearbox has 2 distinct values
0.0    111623
1.0     32396
Name: gearbox, dtype: int64
notRepairedDamage has 2 distinct values
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64
regionCode has 7905 distinct values
419     369
764     258
125     137
176     136
462     134
       ... 
7081      1
7243      1
7319      1
7742      1
7960      1
Name: regionCode, Length: 7905, dtype: int64

name and regionCode have far too many distinct values to one-hot encode; the others are fine.
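
For such high-cardinality columns, an alternative worth knowing (not used in this baseline) is frequency encoding: replace each category with how often it occurs, which keeps the column numeric without exploding the dimensionality. A minimal sketch (the *_freq column names are just illustrative):

# Frequency encoding: map each category to its count in the training set
for col in ["name", "regionCode"]:
    counts = train_data[col].value_counts()
    train_data[col + "_freq"] = train_data[col].map(counts)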


而對(duì)于數(shù)值特征,我們可以來(lái)查看其與價(jià)格之間的相關(guān)性關(guān)系,這也有利于我們判斷哪些特征更加重要:

numeric_features.append("price")
price_numeric = train_data[numeric_features]
correlation_score = price_numeric.corr() # a feature-by-feature matrix; each entry is the correlation between the row and column features
correlation_score['price'].sort_values(ascending = False)
price        1.000000
v_12         0.692823
v_8          0.685798
v_0          0.628397
power        0.219834
v_5          0.164317
v_2          0.085322
v_6          0.068970
v_1          0.060914
v_14         0.035911
v_13        -0.013993
v_7         -0.053024
v_4         -0.147085
v_9         -0.206205
v_10        -0.246175
v_11        -0.275320
kilometer   -0.440519
v_3         -0.730946
Name: price, dtype: float64

Features such as v_14, v_13, v_1, and v_7 have very low correlation coefficients with price; with limited computing resources one could consider dropping them (a sketch follows the heatmap below). We can also visualize the correlations directly:

fig, ax = plt.subplots(figsize = (12,12))
plt.title("Correlation heatmap")
sns.heatmap(correlation_score, square = True, vmax = 0.8)

[Figure: correlation heatmap]
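
Returning to the weakly correlated features above: a minimal sketch of filtering them out automatically, with 0.05 as an arbitrary cutoff (only v_13 and v_14 fall below it here):

corr_with_price = correlation_score["price"].drop("price")
weak_features = corr_with_price[corr_with_price.abs() < 0.05].index.tolist()
print(weak_features)  # ['v_13', 'v_14'] under this cutoff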

對(duì)于數(shù)值特征來(lái)說(shuō),我們同樣關(guān)心其分布,下面先具體分析再說(shuō)明分布的重要性:

# Check the skewness and kurtosis of each numeric feature
for col in numeric_features:
    print("{:15}\t Skewness:{:05.2f}\t Kurtosis:{:06.2f}".format(col,
                                                    train_data[col].skew(), 
                                                   train_data[col].kurt()))
power          	 Skewness:65.86	 Kurtosis:5733.45
kilometer      	 Skewness:-1.53	 Kurtosis:001.14
v_0            	 Skewness:-1.32	 Kurtosis:003.99
v_1            	 Skewness:00.36	 Kurtosis:-01.75
v_2            	 Skewness:04.84	 Kurtosis:023.86
v_3            	 Skewness:00.11	 Kurtosis:-00.42
v_4            	 Skewness:00.37	 Kurtosis:-00.20
v_5            	 Skewness:-4.74	 Kurtosis:022.93
v_6            	 Skewness:00.37	 Kurtosis:-01.74
v_7            	 Skewness:05.13	 Kurtosis:025.85
v_8            	 Skewness:00.20	 Kurtosis:-00.64
v_9            	 Skewness:00.42	 Kurtosis:-00.32
v_10           	 Skewness:00.03	 Kurtosis:-00.58
v_11           	 Skewness:03.03	 Kurtosis:012.57
v_12           	 Skewness:00.37	 Kurtosis:000.27
v_13           	 Skewness:00.27	 Kurtosis:-00.44
v_14           	 Skewness:-1.19	 Kurtosis:002.39
price          	 Skewness:03.35	 Kurtosis:019.00

power has extremely large skewness and kurtosis, so let's plot the distributions:

f = pd.melt(train_data, value_vars=numeric_features)
# f now has two columns: "variable" holds the original feature name and
# "value" holds its values; e.g. power with n values occupies n rows,
# and the features are stacked one after another
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
# g is a grid object; plotting functions are applied to it via .map
# The first argument to FacetGrid is the DataFrame, which must be in long
# form, i.e. the output of melt; col names the faceting column ("variable"
# holds the melted feature names, "value" their values); col_wrap sets
# how many panels per row
g = g.map(sns.distplot, "value")

關(guān)于melt的使用可以看使用Pandas melt()重塑DataFrame - 知乎 (zhihu.com),我覺(jué)得講得非常容易理解。

[Figure: distribution of each numeric feature]

power is distributed very unevenly. As with price, extreme power values would affect the result severely and make it hard to learn a good weight for power, so this also needs handling later. The anonymous features are distributed more evenly and probably won't need processing.

還可以通過(guò)散點(diǎn)圖來(lái)觀察兩兩之間大概的關(guān)系分布:

sns.pairplot(train_data[numeric_features], size = 2,  kind = "scatter",diag_kind = "kde")

[Figure: pairwise scatter plots of the numeric features]

(Explore this part on your own.)


下面繼續(xù)回到類別型特征,由于其中存在nan不方便我們畫圖展示,因此我們可以先將nan進(jìn)行替換,方便畫圖展示:

# 下面對(duì)類別特征做處理
categorical_features_2 = ['model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage']
for c in categorical_features_2:
    train_data[c] = train_data[c].astype("category")
    # 將這些的類型轉(zhuǎn)換為分類類型,不保留原來(lái)的int或者float類型
    if train_data[c].isnull().any():
        # 如果該列存在nan的話
        train_data[c] = train_data[c].cat.add_categories(['Missing'])
        # 增加一個(gè)新的分類為missing,用它來(lái)填充那些nan,代表缺失值,
        # 這樣在后面畫圖方便展示
        train_data[c] = train_data[c].fillna('Missing')

下面通過(guò)箱型圖來(lái)對(duì)類別特征的每個(gè)取值進(jìn)行直觀展示:

def bar_plot(x, y, **kwargs):
    sns.barplot(x = x, y = y)
    x = plt.xticks(rotation = 90)
    
f = pd.melt(train_data, id_vars = ['price'], value_vars = categorical_features_2)
g = sns.FacetGrid(f, col = 'variable', col_wrap = 2, sharex = False, sharey = False)
g = g.map(bar_plot, "value", "price")

[Figure: mean price for each value of the categorical features]

The categorical features show no extreme distributions.

Feature Engineering

In feature processing, I think the most important step is handling abnormal data. We saw earlier that power is distributed very unevenly. There are two ways to handle this: discard the extreme values, or compress them with a log transform. Both are covered here.

首先是對(duì)極端值進(jìn)行舍去,那么可以采用箱型圖來(lái)協(xié)助判斷,下面封裝一個(gè)函數(shù)實(shí)現(xiàn):

# power's distribution is highly abnormal, so we can remove some rows
# Helper function for handling outliers
def outliers_proc(data, col_name, scale = 3):
    # data: the original DataFrame
    # col_name: name of the column whose outliers we handle
    # scale: controls how aggressively rows are removed
    def box_plot_outliers(data_ser, box_scale):
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
        # quantile returns the value at the given quantile
        val_low = data_ser.quantile(0.25) - iqr # lower bound
        val_up = data_ser.quantile(0.75) + iqr # upper bound
        rule_low = (data_ser < val_low) # mask of values below the lower bound
        rule_up = (data_ser > val_up) # mask of values above the upper bound
        return (rule_low, rule_up),(val_low, val_up)
    
    data_n = data.copy()
    data_series = data_n[col_name]  # extract the column
    rule, values = box_plot_outliers(data_series, box_scale = scale)
    index = np.arange(data_series.shape[0])[rule[0] | rule[1]]
    # build positions 0..n-1, then keep the ones flagged as outliers
    print("Delete number is {}".format(len(index)))
    data_n = data_n.drop(index) # drop the whole rows
    data_n.reset_index(drop = True, inplace = True)  # rebuild the index
    print("Now column number is:{}".format(data_n.shape[0]))
    index_low = np.arange(data_series.shape[0])[rule[0]]
    outliers = data_series.iloc[index_low]  # values below the lower bound
    print("Description of data less than the lower bound is:")
    print(pd.Series(outliers).describe())
    index_up = np.arange(data_series.shape[0])[rule[1]]
    outliers = data_series.iloc[index_up]
    print("Description of data larger than the upper bound is:")
    print(pd.Series(outliers).describe())
    fig, axes = plt.subplots(1,2,figsize = (10,7))
    ax1 = sns.boxplot(y = data[col_name], data = data, palette = "Set1", ax = axes[0])
    ax1.set_title("Before outlier removal")
    ax2 = sns.boxplot(y = data_n[col_name], data = data_n, palette = "Set1", ax = axes[1])
    ax2.set_title("After outlier removal")
    return data_n

我們應(yīng)用于power數(shù)據(jù)集嘗試:

train_data_delete_after = outliers_proc(train_data, "power", scale =3)
Delete number is 963
Now column number is:149037
Description of data less than the lower bound is:
count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: power, dtype: float64
Description of data larger than the upper bound is:
count      963.000000
mean       846.836968
std       1929.418081
min        376.000000
25%        400.000000
50%        436.000000
75%        514.000000
max      19312.000000
Name: power, dtype: float64

[Figure: box plots of power before and after outlier removal]

More than 900 rows were deleted in total, and the resulting box plot looks much healthier.

那么另外一種方法就是采用log進(jìn)行壓縮,但這里因?yàn)槲疫€想用power進(jìn)行數(shù)據(jù)分桶,構(gòu)造出一個(gè)power等級(jí)的特征,因此我就先構(gòu)造再進(jìn)行壓縮:

bin_power = [i*10 for i in range(31)]
data["power_bin"] = pd.cut(data["power"],bin_power,right = False,labels = False)

This segments power at the edges in bin_power: the lowest segment maps to 0 in the new feature, and so on. Power values at or beyond the last edge (300) fall outside every bin and become NaN in power_bin, so we can assign them level 31:

data['power_bin'] = data['power_bin'].fillna(31)
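
As a quick sanity check with a toy series (made up for illustration), pd.cut with labels=False maps the lowest bin to 0, and anything at or beyond the last edge (300) comes out as NaN, which the fillna above then patches to 31:

demo = pd.Series([5, 25, 299, 350])
print(pd.cut(demo, bin_power, right = False, labels = False))
# 0     0.0
# 1     2.0
# 2    29.0
# 3     NaN
# dtype: float64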

那么對(duì)于power現(xiàn)在就可以用log進(jìn)行壓縮了:

data['power'] = np.log(data['power'] + 1) 

接下來(lái)進(jìn)行新特征的構(gòu)造。

首先是使用時(shí)間,我們可以用creatDate減去regDate來(lái)表示:

data["use_time"] = (pd.to_datetime(data['creatDate'],format = "%Y%m%d",errors = "coerce")
                        - pd.to_datetime(data["regDate"], format = "%Y%m%d", errors = "coerce")).dt.days
# errors="coerce" turns unparseable dates into NaT/NaN

Because some dates are malformed, this produces missing values. My plan is to fill them with the mean, but the test set must be filled with the training set's mean, so I defer this to the train/test split below.


下面是對(duì)品牌的銷售統(tǒng)計(jì)量創(chuàng)造特征,因?yàn)橐?jì)算某個(gè)品牌的銷售均值、最大值、方差等等數(shù)據(jù),因此我們需要在訓(xùn)練數(shù)據(jù)集上計(jì)算,測(cè)試數(shù)據(jù)集是未知的,計(jì)算完畢后再根據(jù)品牌一一對(duì)應(yīng)填上數(shù)值即可:

# 計(jì)算某個(gè)品牌的各種統(tǒng)計(jì)數(shù)目量
train_gb = train_data.groupby("brand")
all_info = {}
for kind, kind_data in train_gb:
    info = {}
    kind_data = kind_data[kind_data["price"] > 0]
    # 把價(jià)格小于0的可能存在的異常值去除
    info["brand_amount"] = len(kind_data) # 該品牌的數(shù)量
    info["brand_price_max"] = kind_data.price.max() # 該品牌價(jià)格最大值
    info["brand_price_min"] = kind_data.price.min() # 該品牌價(jià)格最小值
    info["brand_price_median"] = kind_data.price.median() # 該品牌價(jià)格中位數(shù)
    info["brand_price_sum"] = kind_data.price.sum() # 該品牌價(jià)格總和
    info["brand_price_std"] = kind_data.price.std() # 方差
    info["brand_price_average"] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
    # 均值,保留兩位小數(shù)
    all_info[kind] = info
brand_feature = pd.DataFrame(all_info).T.reset_index().rename(columns = {"index":"brand"})

這里的brand_feature獲得方法可能有點(diǎn)復(fù)雜,我一步步解釋出來(lái):

brand_feature = pd.DataFrame(all_info)
brand_feature

[Figure: DataFrame with the 7 statistics as rows and 40 brand columns]

這里是7個(gè)統(tǒng)計(jì)量特征作為索引,然后有40列代表有40個(gè)品牌。

brand_feature = pd.DataFrame(all_info).T.reset_index()
brand_feature

轉(zhuǎn)置后重新設(shè)置索引,也就是:

[Figure: transposed DataFrame with brands as rows and an added index column]

將品牌統(tǒng)計(jì)量作為列,然后加入一個(gè)列為index,可以認(rèn)為是品牌的取值。

brand_feature = pd.DataFrame(all_info).T.reset_index().rename(columns = {"index":"brand"})
brand_feature

這一個(gè)就是將index更名為brand,這一列就是品牌的取值,方便我們后續(xù)融合到data中:

data = data.merge(brand_feature, how='left', on='brand')

This matches each brand value in data against the statistics table and attaches the corresponding values as new features.
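
To make the left join explicit, here is a toy merge with made-up numbers: every row of the left frame is kept, and each picks up the statistics row whose brand matches.

left = pd.DataFrame({"brand": [0, 1, 0]})
stats = pd.DataFrame({"brand": [0, 1], "brand_price_max": [9500, 8000]})
print(left.merge(stats, how = "left", on = "brand"))
#    brand  brand_price_max
# 0      0             9500
# 1      1             8000
# 2      0             9500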


接下來(lái)需要對(duì)大部分?jǐn)?shù)據(jù)進(jìn)行歸一化:

def max_min(x):
    # min-max scaling to [0, 1]
    return (x - np.min(x)) / (np.max(x) - np.min(x))
for feature in ["brand_amount","brand_price_average","brand_price_max",
                "brand_price_median","brand_price_min","brand_price_std",
               "brand_price_sum","power","kilometer"]:
    data[feature] = max_min(data[feature])

對(duì)類別特征進(jìn)行encoder:

# 對(duì)類別特征轉(zhuǎn)換為onehot
data = pd.get_dummies(data, columns=['model', 'brand', 'bodyType','fuelType','gearbox', 
                                     'notRepairedDamage', 'power_bin'],dummy_na=True)

對(duì)沒(méi)用的特征可以進(jìn)行刪除了:

data = data.drop(['creatDate',"regDate", "regionCode"], axis = 1)

至此,關(guān)于特征的處理工作基本上就完成了,但是這只是簡(jiǎn)單的處理方式,可以去探索更加深度的特征信息(我不會(huì)哈哈哈哈)。

Modeling

先處理數(shù)據(jù)集:

use_feature = [x for x in data.columns if x not in ['SaleID',"name","price","origin"]]
target = data[data["origin"] == "train"]["price"]
target_lg = (np.log(target+1))

train_x = data[data["origin"] == "train"][use_feature]
test_x = data[data["origin"] == "test"][use_feature]

train_x["use_time"] = train_x["use_time"].fillna(train_x["use_time"].mean())

test_x["use_time"] = test_x["use_time"].fillna(train_x["use_time"].mean())# 用訓(xùn)練數(shù)據(jù)集的均值填充

train_x.shape
(150000, 371)

可以看看訓(xùn)練數(shù)據(jù)是否還存在缺失值:

test_x.isnull().sum()
power                    0
kilometer                0
v_0                      0
...
power_bin_29.0           0
power_bin_31.0           0
power_bin_nan            0
Length: 371, dtype: int64

(output truncated: every one of the 371 columns reports 0 missing values)

可以看到都沒(méi)有缺失值了,因此接下來(lái)可以用來(lái)選擇模型了。


由于現(xiàn)實(shí)原因(電腦跑不動(dòng)xgboost)因此我選擇了lightGBM和隨機(jī)森林、梯度提升決策樹(shù)三種,然后再用模型融合,具體代碼如下:

from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve, mean_squared_error,mean_absolute_error, f1_score
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor as rfr
from sklearn.model_selection import  KFold, StratifiedKFold,GroupKFold, RepeatedKFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import GradientBoostingRegressor as gbr
from sklearn.linear_model import LinearRegression as lr

LightGBM

lgb_param = {  # training parameter list
    "num_leaves":7,
    "min_data_in_leaf": 20,  # minimum samples per leaf, guards against overfitting
    "objective": "regression",  # regression task
    "max_depth": -1,  # maximum tree depth, -1 means unlimited
    "learning_rate": 0.003,
    "boosting": "gbdt",  # use the gbdt algorithm
    "feature_fraction": 0.50,  # use 50% of the features per iteration for feature-subspace diversity
    "bagging_freq": 1,  # perform bagging on every iteration
    "bagging_fraction": 0.55,  # each bagging step samples 55% of the data without replacement
    "bagging_seed": 1,
    "metric": 'mean_absolute_error',
    "lambda_l1": 0.5,
    "lambda_l2": 0.5,
    "verbosity": -1  # verbosity of printed messages
}
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state = 4)
# a splitter that yields 5 shuffled splits of the data, for 5-fold cross-validation
valid_lgb = np.zeros(len(train_x))
predictions_lgb = np.zeros(len(test_x))


for fold_, (train_idx, valid_idx) in enumerate(folds.split(train_x, target)):
    # indices of the training and validation parts of this split
    print("fold n{}".format(fold_+1))  # current fold number
    train_data_now = lgb.Dataset(train_x.iloc[train_idx], target_lg[train_idx])
    valid_data_now = lgb.Dataset(train_x.iloc[valid_idx], target_lg[valid_idx])
    # wrap the data as lgb Datasets
    num_round = 10000
    lgb_model = lgb.train(lgb_param, train_data_now, num_round, 
                        valid_sets=[train_data_now, valid_data_now], verbose_eval=500,
                       early_stopping_rounds = 800)
    valid_lgb[valid_idx] = lgb_model.predict(train_x.iloc[valid_idx],
                                             num_iteration=lgb_model.best_iteration)
    predictions_lgb += lgb_model.predict(test_x, num_iteration=
                                           lgb_model.best_iteration) / folds.n_splits
    # average the test-set predictions over the folds
print("CV score: {:<8.8f}".format(mean_absolute_error(valid_lgb, target_lg)))

這里需要注意我進(jìn)入訓(xùn)練時(shí)split用的是target,而在其中價(jià)格用的是target_lg,因?yàn)閠arget是原始的價(jià)格,可以認(rèn)為是離散的取值,但是我target_lg經(jīng)過(guò)np.log之后,我再用target_lg進(jìn)行split時(shí)就會(huì)報(bào)錯(cuò),為:

Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.

我認(rèn)為是np.nan將其轉(zhuǎn)換為了連續(xù)型數(shù)值,而不是原來(lái)的離散型數(shù)值取值,因此我只能用target去產(chǎn)生切片索引。

CV score: 0.15345674
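
If one did want to stratify on the log price, a common workaround (not used in this baseline) is to discretize the continuous target into quantile bins first and stratify on those; a minimal sketch, assuming target_lg and train_x as defined above:

# Bin the continuous target so StratifiedKFold accepts it as class labels
price_bins = pd.qcut(target_lg, q = 10, labels = False)
folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 4)
for fold_, (train_idx, valid_idx) in enumerate(folds.split(train_x, price_bins)):
    pass  # then train on target_lg[train_idx] exactly as above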

As before, look at the feature importances:

pd.set_option("display.max_columns", None)  # show all columns
pd.set_option('display.max_rows', None)  # show all rows instead of eliding with ...
# set the display width of values to 100 (default 50)
pd.set_option('display.max_colwidth', 100)
# build a frame whose single column holds the features used above
df = pd.DataFrame(train_x.columns.tolist(), columns=['feature'])
df['importance'] = list(lgb_model.feature_importance())
df = df.sort_values(by='importance', ascending=False)  # sort descending
plt.figure(figsize = (14,28))
sns.barplot(x='importance', y='feature', data = df.head(50)) # plot the top fifty
plt.title('Features importance (averaged/folds)')
plt.tight_layout()  # fit the layout automatically

[Figure: bar chart of the top-50 feature importances]

可以看到使用時(shí)間遙遙領(lǐng)先。

隨機(jī)森林

# RandomForestRegressor
folds = KFold(n_splits=5, shuffle=True, random_state=2019)
valid_rfr = np.zeros(len(train_x))
predictions_rfr = np.zeros(len(test_x))
 
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x, target)):
    print("fold n°{}".format(fold_+1))
    tr_x = train_x.iloc[trn_idx]
    tr_y = target_lg[trn_idx]
    rfr_model = rfr(n_estimators=1600,max_depth=9, min_samples_leaf=9, 
                  min_weight_fraction_leaf=0.0,max_features=0.25,
                  verbose=1,n_jobs=-1) # n_jobs=-1 parallelizes across all cores
    # verbose = 0: no logging to stdout
    # verbose = 1: progress output
    # verbose = 2: one log line per iteration
    rfr_model.fit(tr_x,tr_y)
    valid_rfr[val_idx] = rfr_model.predict(train_x.iloc[val_idx])
    
    predictions_rfr += rfr_model.predict(test_x) / folds.n_splits
    
print("CV score: {:<8.8f}".format(mean_absolute_error(valid_rfr, target_lg)))
CV score: 0.17160127

Gradient Boosting

# GradientBoostingRegressor
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2018)
valid_gbr = np.zeros(len(train_x))
predictions_gbr = np.zeros(len(test_x))
 
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x, target)):
    print("fold n°{}".format(fold_+1))
    tr_x = train_x.iloc[trn_idx]
    tr_y = target_lg[trn_idx]
    gbr_model = gbr(n_estimators=100, learning_rate=0.1,subsample=0.65 ,max_depth=7, 
                    min_samples_leaf=20, max_features=0.22,verbose=1)
    gbr_model.fit(tr_x,tr_y)
    valid_gbr[val_idx] = gbr_model.predict(train_x.iloc[val_idx])
    
    predictions_gbr += gbr_model.predict(test_x) / folds.n_splits
 
print("CV score: {:<8.8f}".format(mean_absolute_error(valid_gbr, target_lg)))
CV score: 0.14386158

Now fuse the three models with linear regression:

train_stack2 = np.vstack([valid_lgb, valid_rfr, valid_gbr]).transpose()
test_stack2 = np.vstack([predictions_lgb, predictions_rfr,predictions_gbr]).transpose()
# cross-validation: 5 folds, repeated 2 times
folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=7)
valid_stack2 = np.zeros(train_stack2.shape[0])
predictions_lr2 = np.zeros(test_stack2.shape[0])
 
for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack2,target)):
    print("fold {}".format(fold_))
    trn_data, trn_y = train_stack2[trn_idx], target_lg.iloc[trn_idx].values
    val_data, val_y = train_stack2[val_idx], target_lg.iloc[val_idx].values
    # linear regression as the meta-model
    lr2 = lr()
    lr2.fit(trn_data, trn_y)
    
    valid_stack2[val_idx] = lr2.predict(val_data)
    predictions_lr2 += lr2.predict(test_stack2) / 10  # 10 = 5 folds x 2 repeats
    
print("CV score: {:<8.8f}".format(mean_absolute_error(target_lg.values, valid_stack2)))
CV score: 0.14343221

那么就可以將預(yù)測(cè)結(jié)果先經(jīng)過(guò)exp得到真正結(jié)果就去提交啦!

prediction_test = np.exp(predictions_lr2) - 1
test_submission = pd.read_csv("used_car_testB_20200421.csv", sep = " ")
test_submission["price"] = prediction_test
feature_submission = ["SaleID","price"]
sub = test_submission[feature_submission]
sub.to_csv("mysubmission.csv",index = False)

The above used hand-picked parameters; next I tune LightGBM to see whether it can do better:

# 下面對(duì)lightgbm調(diào)參
# 構(gòu)建數(shù)據(jù)集
train_y = target_lg
x_train, x_valid, y_train, y_valid = train_test_split(train_x, train_y, 
                                                      random_state = 1, test_size = 0.2)
# 數(shù)據(jù)轉(zhuǎn)換
lgb_train = lgb.Dataset(x_train, y_train, free_raw_data = False)
lgb_valid = lgb.Dataset(x_valid, y_valid, reference=lgb_train,free_raw_data=False)

# 設(shè)置初始參數(shù)
params = {
    "boosting_type":"gbdt",
    "objective":"regression",
    "metric":"mae",
    "nthread":4,
    "learning_rate":0.1,
    "verbosity": -1
}

# 交叉驗(yàn)證調(diào)參
print("交叉驗(yàn)證")
min_mae = 10000
best_params = {}

print("調(diào)參1:提高準(zhǔn)確率")
for num_leaves in range(5,100,5):
    for max_depth in range(3,10,1):
        params["num_leaves"] = num_leaves
        params["max_depth"] = max_depth
        cv_results = lgb.cv(params, lgb_train,seed = 1,nfold =5,
                           metrics=["mae"], early_stopping_rounds = 15,stratified=False,
                           verbose_eval = True)
        mean_mae = pd.Series(cv_results['l1-mean']).max()
        boost_rounds = pd.Series(cv_results["l1-mean"]).idxmax()
        if mean_mae <= min_mae:
            min_mae = mean_mae
            best_params["num_leaves"] = num_leaves
            best_params["max_depth"] = max_depth
if "num_leaves" and "max_depth" in best_params.keys():
    params["num_leaves"] = best_params["num_leaves"]
    params["max_depth"] = best_params["max_depth"]

print("調(diào)參2:降低過(guò)擬合")
for max_bin in range(5,256,10):
    for min_data_in_leaf in range(1,102,10):
            params['max_bin'] = max_bin
            params['min_data_in_leaf'] = min_data_in_leaf
            
            cv_results = lgb.cv(
                                params,
                                lgb_train,
                                seed=1,
                                nfold=5,
                                metrics=['mae'],
                                early_stopping_rounds=10,
                                verbose_eval=True,
                                stratified=False
                                )
                    
            mean_mae = pd.Series(cv_results['l1-mean']).max()
            boost_rounds = pd.Series(cv_results['l1-mean']).idxmax()
 
            if mean_mae <= min_mae:
                min_mae = mean_mae
                best_params['max_bin']= max_bin
                best_params['min_data_in_leaf'] = min_data_in_leaf
if 'max_bin' and 'min_data_in_leaf' in best_params.keys():
    params['min_data_in_leaf'] = best_params['min_data_in_leaf']
    params['max_bin'] = best_params['max_bin']
    
print("調(diào)參3:降低過(guò)擬合")
for feature_fraction in [0.6,0.7,0.8,0.9,1.0]:
    for bagging_fraction in [0.6,0.7,0.8,0.9,1.0]:
        for bagging_freq in range(0,50,5):
            params['feature_fraction'] = feature_fraction
            params['bagging_fraction'] = bagging_fraction
            params['bagging_freq'] = bagging_freq
            
            cv_results = lgb.cv(
                                params,
                                lgb_train,
                                seed=1,
                                nfold=5,
                                metrics=['mae'],
                                early_stopping_rounds=10,
                                verbose_eval=True,
                                stratified=False
                                )
                    
            mean_mae = pd.Series(cv_results['l1-mean']).max()
            boost_rounds = pd.Series(cv_results['l1-mean']).idxmax()
 
            if mean_mae <= min_mae:
                min_mae = mean_mae
                best_params['feature_fraction'] = feature_fraction
                best_params['bagging_fraction'] = bagging_fraction
                best_params['bagging_freq'] = bagging_freq
                
if 'feature_fraction' and 'bagging_fraction' and 'bagging_freq' in best_params.keys():
    params['feature_fraction'] = best_params['feature_fraction']
    params['bagging_fraction'] = best_params['bagging_fraction']
    params['bagging_freq'] = best_params['bagging_freq']
    
print("調(diào)參4:降低過(guò)擬合")
for lambda_l1 in [1e-5,1e-3,1e-1,0.0,0.1,0.3,0.5,0.7,0.9,1.0]:
    for lambda_l2 in [1e-5,1e-3,1e-1,0.0,0.1,0.4,0.6,0.7,0.9,1.0]:
        params['lambda_l1'] = lambda_l1
        params['lambda_l2'] = lambda_l2
        cv_results = lgb.cv(
                            params,
                            lgb_train,
                            seed=1,
                            nfold=5,
                            metrics=['mae'],
                            early_stopping_rounds=10,
                            verbose_eval=True,
                            stratified=False
                            )
                
        mean_mae = pd.Series(cv_results['l1-mean']).max()
        boost_rounds = pd.Series(cv_results['l1-mean']).idxmax()
 
        if mean_mae <= min_mae:
            min_mae = mean_mae
            best_params['lambda_l1'] = lambda_l1
            best_params['lambda_l2'] = lambda_l2
if 'lambda_l1' and 'lambda_l2' in best_params.keys():
    params['lambda_l1'] = best_params['lambda_l1']
    params['lambda_l2'] = best_params['lambda_l2']
    

print("調(diào)參5:降低過(guò)擬合2")
for min_split_gain in [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]:
    params['min_split_gain'] = min_split_gain
    
    cv_results = lgb.cv(
                        params,
                        lgb_train,
                        seed=1,
                        nfold=5,
                        metrics=['mae'],
                        early_stopping_rounds=10,
                        verbose_eval=True,
                        stratified=False
                        )
            
    mean_mae = pd.Series(cv_results['l1-mean']).max()
    boost_rounds = pd.Series(cv_results['l1-mean']).idxmax()
 
    if mean_mae <= min_mae:
        min_mae = mean_mae
        best_params['min_split_gain'] = min_split_gain
        
if 'min_split_gain' in best_params.keys():
    params['min_split_gain'] = best_params['min_split_gain']
    
print(best_params)

Note that stratified=False must be set in lgb.cv, again because of the continuous-versus-discrete target issue from before!

{'num_leaves': 95, 'max_depth': 9, 'max_bin': 215, 'min_data_in_leaf': 71, 'feature_fraction': 1.0, 'bagging_fraction': 1.0, 'bagging_freq': 45, 'lambda_l1': 0.0, 'lambda_l2': 0.0, 'min_split_gain': 1.0}

那么再用該模型做出預(yù)測(cè):

best_params["verbosity"] = -1

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state = 4)
# a splitter that yields 5 shuffled splits of the data, for 5-fold cross-validation
valid_lgb = np.zeros(len(train_x))
predictions_lgb = np.zeros(len(test_x))


for fold_, (train_idx, valid_idx) in enumerate(folds.split(train_x, target)):
    # indices of the training and validation parts of this split
    print("fold n{}".format(fold_+1))  # current fold number
    train_data_now = lgb.Dataset(train_x.iloc[train_idx], target_lg[train_idx])
    valid_data_now = lgb.Dataset(train_x.iloc[valid_idx], target_lg[valid_idx])
    # wrap the data as lgb Datasets
    num_round = 10000
    lgb_model = lgb.train(best_params, train_data_now, num_round, 
                        valid_sets=[train_data_now, valid_data_now], verbose_eval=500,
                       early_stopping_rounds = 800)
    valid_lgb[valid_idx] = lgb_model.predict(train_x.iloc[valid_idx],
                                             num_iteration=lgb_model.best_iteration)
    predictions_lgb += lgb_model.predict(test_x, num_iteration=
                                           lgb_model.best_iteration) / folds.n_splits
    # average the test-set predictions over the folds
print("CV score: {:<8.8f}".format(mean_absolute_error(valid_lgb, target_lg)))
CV score: 0.14548046

Running the model fusion again with the same code gives:

CV score: 0.14071899

完成文章來(lái)源地址http://www.zghlxwxcb.cn/news/detail-778205.html

到了這里,關(guān)于零基礎(chǔ)入門數(shù)據(jù)挖掘——二手車交易價(jià)格預(yù)測(cè):baseline的文章就介紹完了。如果您還想了解更多內(nèi)容,請(qǐng)?jiān)谟疑辖撬阉鱐OY模板網(wǎng)以前的文章或繼續(xù)瀏覽下面的相關(guān)文章,希望大家以后多多支持TOY模板網(wǎng)!

本文來(lái)自互聯(lián)網(wǎng)用戶投稿,該文觀點(diǎn)僅代表作者本人,不代表本站立場(chǎng)。本站僅提供信息存儲(chǔ)空間服務(wù),不擁有所有權(quán),不承擔(dān)相關(guān)法律責(zé)任。如若轉(zhuǎn)載,請(qǐng)注明出處: 如若內(nèi)容造成侵權(quán)/違法違規(guī)/事實(shí)不符,請(qǐng)點(diǎn)擊違法舉報(bào)進(jìn)行投訴反饋,一經(jīng)查實(shí),立即刪除!

領(lǐng)支付寶紅包贊助服務(wù)器費(fèi)用

相關(guān)文章

  • 1.Python數(shù)據(jù)分析項(xiàng)目——二手車價(jià)格預(yù)測(cè)

    1.Python數(shù)據(jù)分析項(xiàng)目——二手車價(jià)格預(yù)測(cè)

    流程 具體操作 基本查看 查看缺失值、查看重復(fù)值、查看數(shù)值類型 預(yù)處理 缺失值處理(確定是否處理后,使用篩選方式刪除)拆分?jǐn)?shù)據(jù) 、標(biāo)簽的特征處理(處理成0/1格式)、特征工程(one-hot編碼) 數(shù)據(jù)分析 groupby分組求最值數(shù)據(jù)、seaborn可視化 預(yù)測(cè) 拆分?jǐn)?shù)據(jù)集、建立模型、

    2024年02月12日
    瀏覽(25)
  • 【Python實(shí)戰(zhàn)】Python采集二手車數(shù)據(jù)——超詳細(xì)講解

    【Python實(shí)戰(zhàn)】Python采集二手車數(shù)據(jù)——超詳細(xì)講解

    今天,我們將采集某二手車數(shù)據(jù),通過(guò)這個(gè)案例,加深我們對(duì)xpath的理解。通過(guò)爬取數(shù)據(jù)后數(shù)據(jù)分析能夠直觀的看到二手車市場(chǎng)中某一品牌的相對(duì)數(shù)據(jù),能夠了解到現(xiàn)在的二手車市場(chǎng)情況,通過(guò)分析數(shù)據(jù)看到二手車的走勢(shì),車商就可以利用這些數(shù)據(jù)進(jìn)行定價(jià),讓想買二手車卻

    2024年02月01日
    瀏覽(20)
  • 基于Python+Flask+Echart實(shí)現(xiàn)二手車數(shù)據(jù)分析展示

    基于Python+Flask+Echart實(shí)現(xiàn)二手車數(shù)據(jù)分析展示

    作者主頁(yè):編程指南針 作者簡(jiǎn)介:Java領(lǐng)域優(yōu)質(zhì)創(chuàng)作者、CSDN博客專家 、CSDN內(nèi)容合伙人、掘金特邀作者、阿里云博客專家、51CTO特邀作者、多年架構(gòu)師設(shè)計(jì)經(jīng)驗(yàn)、騰訊課堂常駐講師 主要內(nèi)容:Java項(xiàng)目、Python項(xiàng)目、前端項(xiàng)目、人工智能與大數(shù)據(jù)、簡(jiǎn)歷模板、學(xué)習(xí)資料、面試題庫(kù)

    2024年02月09日
    瀏覽(17)
  • 大數(shù)據(jù)分析案例-基于XGBoost算法構(gòu)建二手車價(jià)格評(píng)估模型

    大數(shù)據(jù)分析案例-基于XGBoost算法構(gòu)建二手車價(jià)格評(píng)估模型

    ???♂? 個(gè)人主頁(yè):@艾派森的個(gè)人主頁(yè) ???作者簡(jiǎn)介:Python學(xué)習(xí)者 ?? 希望大家多多支持,我們一起進(jìn)步!?? 如果文章對(duì)你有幫助的話, 歡迎評(píng)論 ??點(diǎn)贊???? 收藏 ??加關(guān)注+ 喜歡大數(shù)據(jù)分析項(xiàng)目的小伙伴,希望可以多多支持該系列的其他文章 大數(shù)據(jù)分析案例合集

    2023年04月09日
    瀏覽(24)
  • 天池長(zhǎng)期賽:二手車價(jià)格預(yù)測(cè)(422方案分享)

    天池長(zhǎng)期賽:二手車價(jià)格預(yù)測(cè)(422方案分享)

    前言 一、賽題介紹及評(píng)測(cè)標(biāo)準(zhǔn) 二、數(shù)據(jù)探索(EDA) 1.讀取數(shù)據(jù)、缺失值可視化 2.特征描述性統(tǒng)計(jì) 3.測(cè)試集與驗(yàn)證集數(shù)據(jù)分布 4.特征相關(guān)性 三、數(shù)據(jù)清洗 四、特征工程 1.構(gòu)建時(shí)間特征 2.匿名特征交叉 3.平均數(shù)編碼 五、建模調(diào)參 六、模型融合 總結(jié) 賽題屬于回歸類型,相比于

    2024年02月01日
    瀏覽(17)
  • 數(shù)據(jù)挖掘入門項(xiàng)目二手交易車價(jià)格預(yù)測(cè)之建模調(diào)參

    數(shù)據(jù)挖掘入門項(xiàng)目二手交易車價(jià)格預(yù)測(cè)之建模調(diào)參

    本文數(shù)據(jù)集來(lái)自阿里天池:https://tianchi.aliyun.com/competition/entrance/231784/information 主要參考了Datawhale的整個(gè)操作流程:https://tianchi.aliyun.com/notebook/95460 小編也是第一次接觸數(shù)據(jù)挖掘,所以先跟著Datawhale寫的教程操作了一遍,不懂的地方加了一點(diǎn)點(diǎn)自己的理解,感謝Datawhale! 了解

    2024年04月11日
    瀏覽(24)
  • Python二手車價(jià)格預(yù)測(cè)(二)—— 模型訓(xùn)練及可視化

    Python二手車價(jià)格預(yù)測(cè)(二)—— 模型訓(xùn)練及可視化

    一、Python數(shù)據(jù)分析-二手車數(shù)據(jù)獲取用于機(jī)器學(xué)習(xí)二手車價(jià)格預(yù)測(cè) 二、Python二手車價(jià)格預(yù)測(cè)(一)—— 數(shù)據(jù)處理 ? ? ? ? 前面分享了二手車數(shù)據(jù)獲取的內(nèi)容,又對(duì)獲取的原始數(shù)據(jù)進(jìn)行了數(shù)據(jù)處理,相關(guān)博文可以訪問(wèn)上面鏈接。許多朋友私信我問(wèn)會(huì)不會(huì)出模型,今天模型basel

    2024年02月05日
    瀏覽(21)
  • Spring Boot后端+Vue前端:打造高效二手車交易系統(tǒng)

    Spring Boot后端+Vue前端:打造高效二手車交易系統(tǒng)

    作者介紹: ??大廠全棧碼農(nóng)|畢設(shè)實(shí)戰(zhàn)開(kāi)發(fā),專注于大學(xué)生項(xiàng)目實(shí)戰(zhàn)開(kāi)發(fā)、講解和畢業(yè)答疑輔導(dǎo)。 ?? 獲取源碼聯(lián)系方式請(qǐng)查看文末 ?? ?推薦訂閱精彩專欄 ???? 避免錯(cuò)過(guò)下次更新 Springboot項(xiàng)目精選實(shí)戰(zhàn)案例 更多項(xiàng)目: CSDN主頁(yè)YAML墨韻 學(xué)如逆水行舟,不進(jìn)則退。學(xué)習(xí)如趕

    2024年04月28日
    瀏覽(23)
  • python筆記17_實(shí)例演練_二手車折舊分析p2

    …… 書接上文 探查車齡為5年的車輛,折舊價(jià)值與車輛等級(jí)的關(guān)系。 這里用到了 DataFrame 的 groupby 函數(shù) ,這個(gè)函數(shù)對(duì)于數(shù)據(jù)處理的重要程度無(wú)需贅言。 groupby 必須配合聚合函數(shù) 同時(shí)使用,否則只能得到一個(gè) DataFrameGroupBy 類型的玩意兒。 這里是可以只傳 groupby 參數(shù),不寫聚合

    2024年02月07日
    瀏覽(21)
  • python筆記16_實(shí)例練習(xí)_二手車折舊分析p1

    python筆記16_實(shí)例練習(xí)_二手車折舊分析p1

    python數(shù)據(jù)分析練習(xí),具體數(shù)據(jù)不放出。 分析實(shí)踐很簡(jiǎn)單。目的不是做完,而是講清楚每一步的目的和連帶的知識(shí)點(diǎn)(所以才叫學(xué)習(xí)筆記) 原始數(shù)據(jù)格式:csv文件 原始數(shù)據(jù)結(jié)構(gòu): 數(shù)據(jù)格式 字段名 int(無(wú)用信息) 無(wú) String che300_brand_name float new_price String maker_type float lowest_pric

    2024年02月07日
    瀏覽(24)

覺(jué)得文章有用就打賞一下文章作者

支付寶掃一掃打賞

博客贊助

微信掃一掃打賞

請(qǐng)作者喝杯咖啡吧~博客贊助

支付寶掃一掃領(lǐng)取紅包,優(yōu)惠每天領(lǐng)

二維碼1

領(lǐng)取紅包

二維碼2

領(lǐng)紅包