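Analysis and Modeling of Hotel Booking Orders [Decision Tree, XGBoost]
This project covers:
1. Data processing
2. Exploratory data analysis
3. Hyperparameter tuning of a decision tree and XGBoost with grid search
4. Prediction with decision tree and XGBoost models under five-fold cross-validation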
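Data and Background
Background
Online hotel booking channels have dramatically changed booking possibilities and customer behavior.
Typical reasons for cancelling a hotel booking include changes of plan and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost. That benefits guests, but for hotels it is an undesirable, potentially revenue-reducing factor that has to be dealt with.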
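Data Description
Column: Description
Booking_ID: unique identifier of each booking
no_of_adults: number of adults
no_of_children: number of children
no_of_weekend_nights: number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
no_of_week_nights: number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
type_of_meal_plan: type of meal plan booked by the customer
required_car_parking_space: does the customer require a parking space? (0 - no, 1 - yes)
room_type_reserved: type of room reserved by the customer; the values are ciphered (encoded) by INN Hotels Group
lead_time: number of days between the booking date and the arrival date
arrival_year: year of the arrival date
arrival_month: month of the arrival date
arrival_date: day of the month of the arrival date
market_segment_type: market segment designation
repeated_guest: is the customer a repeat guest? (0 - no, 1 - yes)
no_of_previous_cancellations: number of previous bookings canceled by the customer before the current booking
no_of_previous_bookings_not_canceled: number of previous bookings not canceled by the customer before the current booking
avg_price_per_room: average price per day of the booking; room prices are dynamic (in euros)
no_of_special_requests: total number of special requests made by the customer (e.g. high floor, room with a view)
booking_status: flag indicating whether the booking was canceled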
Loading and Checking the Data
Loading the data
import pandas as pd
df = pd.read_csv('/home/mw/input/data9304/Hotel Reservations.csv')
df.head()
Checking the data
The data has no missing values and no duplicate rows.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
Booking_ID 36275 non-null object
no_of_adults 36275 non-null int64
no_of_children 36275 non-null int64
no_of_weekend_nights 36275 non-null int64
no_of_week_nights 36275 non-null int64
type_of_meal_plan 36275 non-null object
required_car_parking_space 36275 non-null int64
room_type_reserved 36275 non-null object
lead_time 36275 non-null int64
arrival_year 36275 non-null int64
arrival_month 36275 non-null int64
arrival_date 36275 non-null int64
market_segment_type 36275 non-null object
repeated_guest 36275 non-null int64
no_of_previous_cancellations 36275 non-null int64
no_of_previous_bookings_not_canceled 36275 non-null int64
avg_price_per_room 36275 non-null float64
no_of_special_requests 36275 non-null int64
booking_status 36275 non-null object
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
df.duplicated().sum()
0
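The two checks above can also be expressed as explicit counts. The following is a minimal sketch on a small made-up frame (the `toy` values are hypothetical stand-ins, not rows from the real file); the uniqueness check on `Booking_ID` is an extra step that the original does not perform.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real booking data
toy = pd.DataFrame({
    "Booking_ID": ["INN00001", "INN00002", "INN00003"],
    "lead_time": [224, 5, 1],
    "booking_status": ["Not_Canceled", "Not_Canceled", "Canceled"],
})

missing_per_column = toy.isnull().sum()       # number of NaN cells per column
n_missing = int(missing_per_column.sum())     # total number of missing cells
n_duplicates = int(toy.duplicated().sum())    # number of fully duplicated rows
ids_unique = toy["Booking_ID"].is_unique      # True if no booking ID repeats

print(missing_per_column)
print(n_missing, n_duplicates, ids_unique)
```

On the real data, the same calls would be made on `df` instead of `toy`.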
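EDA
Meaning of the Data and Goal of the Analysis
Meaning of the data
First, let's look at what the data contains.
There are 19 columns in total:
Booking ID: a unique identifier, used only to tell records apart
Number of guests, two columns: adults and children
Nights booked, two columns: week nights and weekend nights
Guest requirements, four columns: meal plan, parking space, room type, and number of special requests
Date and time, four columns: lead time (days between booking and arrival), arrival year, arrival month, and arrival day of the month
Booking channel (online, offline, etc.)
Whether the guest is a repeat guest
Booking history before the current booking, two columns: number of cancellations and number of bookings not canceled
Average room price of the booking
Whether the booking was canceled (the target variable)
Goal
Next, the goal of the exploratory analysis: we want to find out whether cancellation is related to the other features listed above.
We can therefore compare each feature across the two booking statuses. Only a simple analysis is done here; relationships between the features themselves are not examined for now.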
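Besides plotting counts, the comparison can be summarized numerically as a cancellation rate per category. The sketch below uses a row-normalized `pd.crosstab` on a made-up frame; the `toy` values are hypothetical, and this table is an addition, not something the original analysis computes.

```python
import pandas as pd

# Hypothetical toy data: booking channel and booking outcome
toy = pd.DataFrame({
    "market_segment_type": ["Online", "Online", "Online",
                            "Offline", "Offline", "Corporate"],
    "booking_status": ["Canceled", "Canceled", "Not_Canceled",
                       "Not_Canceled", "Canceled", "Not_Canceled"],
})

# normalize="index" makes every row sum to 1, so each cell is the share
# of bookings in that category with the given status
rates = pd.crosstab(toy["market_segment_type"], toy["booking_status"],
                    normalize="index")

# Share of canceled bookings per category, highest first
cancel_rate = rates["Canceled"].sort_values(ascending=False)
print(cancel_rate)
```

A rate is easier to compare across categories of very different sizes than raw counts are.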
Adults and Children
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.figure(figsize = (16, 12))
plt.suptitle("Number of adult and child guests",fontweight="bold", fontsize=30)
plt.subplot(2,2,1)
plt.gca().set_title('Adults by booking status')
sns.countplot(x = 'booking_status', hue = 'no_of_adults', edgecolor="black", alpha=0.7, data = df)
plt.subplot(2,2,2)
plt.gca().set_title('Adults: overall distribution')
sns.countplot(x = 'no_of_adults', edgecolor="black", alpha=0.7,data = df)
plt.subplot(2,2,3)
plt.gca().set_title('Children by booking status')
sns.countplot(x = 'booking_status', hue = 'no_of_children', edgecolor="black", alpha=0.7, data = df)
plt.subplot(2,2,4)
plt.gca().set_title('Children: overall distribution')
sns.countplot(x = 'no_of_children', edgecolor="black", alpha=0.7,data = df)
Nights Booked
plt.figure(figsize = (16, 12))
plt.suptitle("Distribution of nights booked",fontweight="bold", fontsize=30)
plt.subplot(2,2,1)
plt.gca().set_title('Week nights by booking status')
sns.countplot(x = 'booking_status', hue = 'no_of_week_nights', edgecolor="black", alpha=0.7, data = df)
plt.subplot(2,2,2)
plt.gca().set_title('Week nights: overall distribution')
sns.countplot(x = 'no_of_week_nights', edgecolor="black", alpha=0.7,data = df)
plt.subplot(2,2,3)
plt.gca().set_title('Weekend nights by booking status')
sns.countplot(x = 'booking_status', hue = 'no_of_weekend_nights', edgecolor="black", alpha=0.7, data = df)
plt.subplot(2,2,4)
plt.gca().set_title('Weekend nights: overall distribution')
sns.countplot(x = 'no_of_weekend_nights', edgecolor="black", alpha=0.7,data = df)
Guest Requirement Features
plt.figure(figsize = (20, 24))
plt.suptitle("Guest requirement features",fontweight="bold", fontsize=30)
plt.subplot(4,2,1)
plt.gca().set_title('Meal plan by booking status')
sns.countplot(x = 'booking_status', hue = 'type_of_meal_plan', edgecolor="black", alpha=0.7, data = df)
plt.subplot(4,2,2)
plt.gca().set_title('Meal plan: overall distribution')
sns.countplot(x = 'type_of_meal_plan', edgecolor="black", alpha=0.7,data = df)
plt.subplot(4,2,3)
plt.gca().set_title('Parking space required by booking status')
sns.countplot(x = 'booking_status', hue = 'required_car_parking_space', edgecolor="black", alpha=0.7, data = df)
plt.subplot(4,2,4)
plt.gca().set_title('Parking space required: overall distribution')
sns.countplot(x = 'required_car_parking_space', edgecolor="black", alpha=0.7,data = df)
plt.subplot(4,2,5)
plt.gca().set_title('Room type by booking status')
sns.countplot(x = 'booking_status', hue = 'room_type_reserved', edgecolor="black", alpha=0.7, data = df)
plt.subplot(4,2,6)
plt.gca().set_title('Room type: overall distribution')
sns.countplot(x = 'room_type_reserved', edgecolor="black", alpha=0.7,data = df)
plt.subplot(4,2,7)
plt.gca().set_title('Special requests by booking status')
sns.countplot(x = 'booking_status', hue = 'no_of_special_requests', edgecolor="black", alpha=0.7, data = df)
plt.subplot(4,2,8)
plt.gca().set_title('Special requests: overall distribution')
sns.countplot(x = 'no_of_special_requests', edgecolor="black", alpha=0.7,data = df)
Date and Time Features
Four date and time columns: lead time (days between booking and arrival), arrival year, arrival month, and arrival day of the month
Columns: lead_time, arrival_year, arrival_month, arrival_date
plt.figure(figsize = (16, 12))
plt.suptitle("Date and time features",fontweight="bold", fontsize=30)
plt.subplot(2,2,1)
plt.gca().set_title('Lead time (days)')
# fill=True replaces the deprecated shade=True argument of kdeplot
sns.kdeplot(x='lead_time', hue='booking_status', fill=True, data=df)
# sns.kdeplot( data=df.lead_time,shade=True)
plt.subplot(2,2,2)
plt.gca().set_title('Arrival year')
sns.kdeplot(x='arrival_year', hue='booking_status', fill=True, data=df)
plt.subplot(2,2,3)
plt.gca().set_title('Arrival month')
sns.kdeplot(x='arrival_month', hue='booking_status', fill=True, data=df)
plt.subplot(2,2,4)
plt.gca().set_title('Arrival day of month')
sns.kdeplot(x='arrival_date', hue='booking_status', fill=True, data=df)
Other Features
Booking channel (online, offline, etc.): market_segment_type
Repeat guest or not: repeated_guest
Booking history before the current booking, two columns: no_of_previous_cancellations, no_of_previous_bookings_not_canceled
Average room price of the booking: avg_price_per_room
plt.figure(figsize = (16, 12))
plt.suptitle("Booking channel and repeat guests",fontweight="bold", fontsize=30)
plt.subplot(2,2,1)
plt.gca().set_title('Booking channel by booking status')
sns.countplot(x = 'booking_status', hue = 'market_segment_type', edgecolor="black", alpha=0.7, data = df)
plt.subplot(2,2,2)
plt.gca().set_title('Booking channel: overall counts')
sns.countplot(x = 'market_segment_type', edgecolor="black", alpha=0.7,data = df)
plt.subplot(2,2,3)
plt.gca().set_title('Repeat guest by booking status')
sns.countplot(x = 'booking_status', hue = 'repeated_guest', edgecolor="black", alpha=0.7,data = df)
plt.subplot(2,2,4)
plt.gca().set_title('Repeat guest: overall counts')
sns.countplot(x = 'repeated_guest', edgecolor="black", alpha=0.7,data = df)
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
ax = sns.catplot(x='booking_status', y='no_of_previous_cancellations',height=4, aspect=2, data=df)
ax.fig.suptitle("Previous cancellations",
                fontsize=20, fontweight="bold")
ax2 = sns.catplot(x='booking_status', y='no_of_previous_bookings_not_canceled',height=4, aspect=2, data=df)
ax2.fig.suptitle("Previous bookings not canceled",
                 fontsize=20, fontweight="bold")
ax3 = sns.catplot(x='booking_status', y='avg_price_per_room', kind="boxen",height=4, aspect=2, data=df)
ax3.fig.suptitle("Average room price",
                 fontsize=20, fontweight="bold")
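Correlation Heatmap
plt.figure(figsize=(24,16))
# Correlate the numeric columns only; the object columns are not encoded yet
ax = sns.heatmap(df.select_dtypes('number').corr(), square=True, annot=True, fmt='.2f')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
# Workaround for the top and bottom heatmap rows being cut in half,
# a problem reported for matplotlib 3.1.1; likely unnecessary on other versions
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)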
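At this point `booking_status` is still a string column, so it does not take part in the correlation matrix. One way to see how each numeric feature relates to cancellation is to add a 0/1 indicator first. This is a minimal sketch on made-up data; the `toy` values and the `is_canceled` helper column are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data with two numeric features and the string target
toy = pd.DataFrame({
    "lead_time": [200, 150, 180, 10, 5, 20],
    "no_of_special_requests": [0, 0, 1, 2, 1, 3],
    "booking_status": ["Canceled", "Canceled", "Canceled",
                       "Not_Canceled", "Not_Canceled", "Not_Canceled"],
})

# Hypothetical helper column: 1 = canceled, 0 = not canceled
toy["is_canceled"] = (toy["booking_status"] == "Canceled").astype(int)

# Pearson correlation of every numeric feature with the indicator
corr_with_target = (
    toy.select_dtypes("number")
       .corr()["is_canceled"]
       .drop("is_canceled")
       .sort_values()
)
print(corr_with_target)
```

Pearson correlation only captures linear association, so a value near zero does not rule out a relationship.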
Model Prediction
Feature Encoding
df = df.drop('Booking_ID', axis = 1)
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
# Label-encode the categorical columns, including the target booking_status
for feat in ['type_of_meal_plan', 'room_type_reserved','market_segment_type','booking_status']:
    lbl = LabelEncoder()
    lbl.fit(df[feat])
    df[feat] = lbl.transform(df[feat])
df.head()
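`LabelEncoder` assigns integer codes in sorted order of the class labels, which determines which outcome ends up as class 1. A minimal sketch, assuming the two target labels are the strings `Canceled` and `Not_Canceled`:

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label strings for the target column
labels = ["Not_Canceled", "Canceled", "Not_Canceled", "Not_Canceled"]

lbl = LabelEncoder()
encoded = lbl.fit_transform(labels)

# classes_ is sorted, and the position of a class is its integer code
mapping = {str(cls): code for code, cls in enumerate(lbl.classes_)}
print(mapping)            # {'Canceled': 0, 'Not_Canceled': 1}
print(encoded.tolist())   # [1, 0, 1, 1]

# inverse_transform turns codes back into the original labels
decoded = lbl.inverse_transform(encoded).tolist()
```

Under that assumption, class 1 in the later reports means the booking was not canceled, and class 0 means it was canceled.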
Data Processing
X = df.drop('booking_status', axis = 1)
X = X.values
y = df['booking_status']
# Share of rows labeled 1, i.e. bookings that were not canceled
y.sum()/len(y)
0.6723638869745003
Model Building
Decision Tree with Five-Fold Cross-Validation
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold,RepeatedKFold
import numpy as np
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.metrics import roc_curve,roc_auc_score
from sklearn.tree import DecisionTreeClassifier
param = {'criterion':['gini', 'entropy'],
'splitter':['best', 'random'],
'max_depth': range(1,10,2),
'min_samples_leaf': range(1,10,2)
}
gs = GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=param, cv=5, scoring="roc_auc", n_jobs=-1, verbose=10)
gs.fit(X,y)
print(gs.best_params_)
Fitting 5 folds for each of 100 candidates, totalling 500 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.6s
[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 1.7s
[Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 1.7s
[Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 1.8s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.1818s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0935s.) Setting batch_size=8.
[Parallel(n_jobs=-1)]: Done 23 tasks | elapsed: 2.0s
[Parallel(n_jobs=-1)]: Done 67 tasks | elapsed: 3.0s
[Parallel(n_jobs=-1)]: Done 139 tasks | elapsed: 5.1s
[Parallel(n_jobs=-1)]: Done 211 tasks | elapsed: 7.7s
[Parallel(n_jobs=-1)]: Done 299 tasks | elapsed: 10.1s
[Parallel(n_jobs=-1)]: Done 387 tasks | elapsed: 12.7s
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed: 17.2s finished
{'criterion': 'gini', 'max_depth': 9, 'min_samples_leaf': 7, 'splitter': 'best'}
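The grid has 2 × 2 × 5 × 5 = 100 parameter combinations, and with `cv=5` that gives the 500 fits reported in the log. Beyond `best_params_`, a fitted `GridSearchCV` also exposes the score of every candidate. The sketch below shows this on synthetic data from `make_classification`, with a smaller grid; the `*_toy` names and the grid values are illustrative assumptions, not the settings used above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the booking data
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Small illustrative grid: 3 x 2 = 6 candidates
param_toy = {"max_depth": [1, 3, 5], "min_samples_leaf": [1, 5]}
gs_toy = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid=param_toy,
    cv=5,
    scoring="roc_auc",
)
gs_toy.fit(X_toy, y_toy)

# cv_results_ holds one row per candidate with its mean and std test score
results = (
    pd.DataFrame(gs_toy.cv_results_)
      [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
      .sort_values("rank_test_score")
)
print(gs_toy.best_params_, round(gs_toy.best_score_, 4))
print(results.head())
```

Looking at `std_test_score` next to the mean helps judge whether the top candidates really differ or are within fold-to-fold noise.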
n_fold = 5
folds = KFold(n_splits=n_fold, shuffle=True, random_state=2022)
oof_dt = np.zeros(len(X))  # out-of-fold predictions, one per row
for fold_n, (train_index, valid_index) in enumerate(folds.split(X)):
    X_train, X_valid = pd.DataFrame(X).iloc[train_index], pd.DataFrame(X).iloc[valid_index]
    y_train, y_valid = y[train_index], y[valid_index]
    eval_set = [(X_valid, y_valid)]
    # Train with the parameters selected by the grid search
    model_dt= DecisionTreeClassifier(
        max_depth=9,criterion='gini',splitter='best',min_samples_leaf = 7,random_state=2022
    ).fit(X_train,y_train)
    y_pred_valid = model_dt.predict(X_valid)
    oof_dt[valid_index] = y_pred_valid.reshape(-1, )
# oof_dt holds hard 0/1 labels, so this AUC is computed from labels, not probabilities
print(roc_auc_score(y, oof_dt))
0.8417716385830245
print(classification_report(y, oof_dt))
              precision    recall  f1-score   support

           0       0.82      0.76      0.79     11885
           1       0.89      0.92      0.90     24390

    accuracy                           0.87     36275
   macro avg       0.86      0.84      0.85     36275
weighted avg       0.87      0.87      0.87     36275
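The AUC above is computed from hard class labels returned by `predict`. `roc_auc_score` also accepts continuous scores, and an alternative is to collect out-of-fold probabilities with `predict_proba`. The sketch below shows both variants side by side on synthetic data; the data, the tree settings, and the `oof_label` and `oof_proba` names are assumptions for illustration, and this is a variation on the loop above, not the code that produced the reported numbers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the booking data
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)

folds = KFold(n_splits=5, shuffle=True, random_state=2022)
oof_label = np.zeros(len(X_toy))   # out-of-fold hard labels (0 or 1)
oof_proba = np.zeros(len(X_toy))   # out-of-fold probability of class 1

for train_index, valid_index in folds.split(X_toy):
    model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=7,
                                   random_state=2022)
    model.fit(X_toy[train_index], y_toy[train_index])
    oof_label[valid_index] = model.predict(X_toy[valid_index])
    # Column 1 of predict_proba is the predicted probability of class 1
    oof_proba[valid_index] = model.predict_proba(X_toy[valid_index])[:, 1]

auc_from_labels = roc_auc_score(y_toy, oof_label)
auc_from_proba = roc_auc_score(y_toy, oof_proba)
print(auc_from_labels, auc_from_proba)
```

With hard labels the ROC curve has a single operating point, so the two values generally differ. The probability-based value uses the model's full ranking of the rows.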
XGBoost with Five-Fold Cross-Validation
from xgboost import XGBClassifier
param = {'max_depth': [9,12,15],
'learning_rate': [0.05,0.1],
'n_estimators': [500,700,900]
}
gs = GridSearchCV(estimator=XGBClassifier(), param_grid=param, cv=3, scoring="roc_auc", n_jobs=-1, verbose=10)
gs.fit(X,y)
print(gs.best_params_)
Fitting 3 folds for each of 18 candidates, totalling 54 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 26.1s
[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 1.0min
[Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 3.0min
[Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 4.5min
[Parallel(n_jobs=-1)]: Done 21 tasks | elapsed: 8.0min
[Parallel(n_jobs=-1)]: Done 28 tasks | elapsed: 12.0min
[Parallel(n_jobs=-1)]: Done 37 tasks | elapsed: 14.6min
[Parallel(n_jobs=-1)]: Done 46 tasks | elapsed: 18.7min
[Parallel(n_jobs=-1)]: Done 54 out of 54 | elapsed: 23.1min finished
{'learning_rate': 0.05, 'max_depth': 12, 'n_estimators': 500}
n_fold = 5
folds = KFold(n_splits=n_fold, shuffle=True, random_state=2022)
oof_xgb = np.zeros(len(X))  # out-of-fold predictions, one per row
for fold_n, (train_index, valid_index) in enumerate(folds.split(X)):
    X_train, X_valid = pd.DataFrame(X).iloc[train_index], pd.DataFrame(X).iloc[valid_index]
    y_train, y_valid = y[train_index], y[valid_index]
    eval_set = [(X_valid, y_valid)]
    # early_stopping_rounds and eval_metric are passed to fit() as older xgboost releases allow;
    # newer releases expect them in the XGBClassifier constructor instead
    model_xgb = XGBClassifier(
        max_depth=12,learning_rate=0.05,n_estimators=500,random_state=2022
    ).fit(X_train,y_train,early_stopping_rounds=100, eval_metric="auc",eval_set=eval_set, verbose=True)
    y_pred_valid = model_xgb.predict(X_valid)
    oof_xgb[valid_index] = y_pred_valid.reshape(-1, )
# oof_xgb holds hard 0/1 labels, so this AUC is computed from labels, not probabilities
print(roc_auc_score(y, oof_xgb))
0.8807500918930099
print(classification_report(y, oof_xgb))
              precision    recall  f1-score   support

           0       0.87      0.82      0.85     11885
           1       0.92      0.94      0.93     24390

    accuracy                           0.90     36275
   macro avg       0.89      0.88      0.89     36275
weighted avg       0.90      0.90      0.90     36275