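Data Analysis and Data Mining Practical Cases (7/16):
Practice problem A of the first DingTalk Cup College Student Big Data Challenge (2022): analyzing and predicting second-hand house prices
Key points:
1. Machine learning
2. Data mining
3. Data cleaning, analysis, and pyecharts visualization
4. Predicting house prices with a random forest regression model
Full code:
Step-by-step code:
1. Read and clean the data: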
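import pandas as pd
import numpy as np
df = pd.read_csv("data.csv", encoding='utf-8')  # adjust the path to wherever data.csv lives on your machine
df
df.info()  # inspect column types and non-null counts
df.dropna(inplace=True)  # drop rows that contain missing values
df.drop('Unnamed: 0', axis=1, inplace=True)  # drop the leftover index column
df
df = df.drop_duplicates()  # remove duplicate records
df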
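The cleaning steps above can be sketched on a tiny table to see what each one removes. The rows below are made up for illustration (the real data.csv is not reproduced here), and `clean` is a hypothetical helper name, not something from the original notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for data.csv.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "區域": ["A", "A", "B", None],
    "總價": ["100萬", "100萬", "80萬", "95萬"],
})


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop the leftover index column, then incomplete rows, then duplicates."""
    out = frame.drop(columns=["Unnamed: 0"], errors="ignore")
    out = out.dropna()            # removes the row with a missing district
    out = out.drop_duplicates()   # removes the repeated (A, 100萬) row
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Dropping the index column before `drop_duplicates` matters: while a unique row number is still present, no two rows compare as equal.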
2. Answering the analysis questions:
(1) Using the attached dataset, group the second-hand listings by the "區域" (district) attribute, compute the average price for each district, and show the districts and their average prices in a vertical bar chart:
import re
# Column names below must match the headers in data.csv exactly.
zonjia = []
for v in df['總價']:
    a = re.findall(r'\d+', str(v))[0]  # first run of digits, i.e. the integer part of the price
    # print(a)
    zonjia.append(int(a))
df['總價1'] = zonjia
df  # 總價1 now holds the total price as an integer
df1_1 = df[['區域', '總價1']].groupby('區域').mean()
df1_1.columns = ['區域均價']
df1_1['區域均價'] = df1_1['區域均價'].astype(int)
df1_1
# Plot:
from pyecharts.charts import Bar
from pyecharts import options as opts
bar = Bar()
bar.add_xaxis(list(df1_1.index))
bar.add_yaxis("Unit: 10,000 yuan", df1_1['區域均價'].tolist())  # tolist() yields plain Python ints
bar.set_global_opts(title_opts=opts.TitleOpts(title="Average price by district"))
bar.render_notebook()
# bar.render()  # writes an HTML file instead
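As an alternative to the explicit loop, the price parsing and the per-district average can be sketched with vectorized pandas calls. This is a minimal sketch on made-up rows; it assumes prices look like a number followed by a unit, and unlike the loop above it keeps a decimal part if one is present.

```python
import pandas as pd

# Hypothetical toy rows; the values are invented for illustration.
toy = pd.DataFrame({
    "區域": ["A", "A", "B"],
    "總價": ["100萬", "120.5萬", "80萬"],
})

# Pull out the first number (with an optional decimal part) from each price string.
toy["總價1"] = (
    toy["總價"].astype(str)
    .str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    .astype(float)
)

# Average price per district.
avg = toy.groupby("區域")["總價1"].mean().round(2)
print(avg)
```

With the loop's pattern `\d+`, a price such as "120.5萬" would be read as 120, so the two approaches only agree when the prices are whole numbers.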
(2) Using the attached dataset, compute each district's share of the total number of second-hand listings and draw a pie chart:
df1_2 = df.groupby('區域').size().to_frame('計數')  # number of listings per district
df1_2  # DataFrame of per-district counts
# Plot:
from pyecharts.charts import Pie
from pyecharts import options as opts
# Rich-text styles for the pie labels
rich_text = {
    "a": {"color": "#999", "lineHeight": 22, "align": "center"},
    "abg": {
        "backgroundColor": "#e3e3e3",
        "width": "100%",
        "align": "right",
        "height": 22,
        "borderRadius": [4, 4, 0, 0],
    },
    "hr": {
        "borderColor": "#aaa",
        "width": "100%",
        "borderWidth": 0.5,
        "height": 0,
    },
    "b": {"fontSize": 16, "lineHeight": 33},
    "per": {
        "color": "#eee",
        "backgroundColor": "#334455",
        "padding": [2, 4],
        "borderRadius": 2,
    },
}
# District names and listing counts taken from the table above
cate = list(df1_2.index)
data = df1_2['計數'].tolist()
# In the formatter, {b} is the slice name, {c} its count, and {d} the percentage ECharts computes for it.
pie = (Pie()
       .add('Number of listings', [list(z) for z in zip(cate, data)],
            label_opts=opts.LabelOpts(position='outside',
                                      formatter="{a|{a}}{abg|}\n{hr|}\n {b|{b}: }{c}  {per|{d}%}  ",
                                      rich=rich_text))
       )
pie.render_notebook()
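The pie chart leaves the percentage calculation to the chart library. If the percentages are also wanted as numbers, one possible sketch is `value_counts(normalize=True)`; the district labels below are placeholders, not values from the dataset.

```python
import pandas as pd

# Hypothetical district labels, one per listing.
regions = pd.Series(["A", "A", "B", "C"], name="區域")

# Share of listings per district, as a percentage rounded to two decimals.
share = (regions.value_counts(normalize=True) * 100).round(2)

# [name, percentage] pairs in the same shape the Pie chart's add() call takes.
pairs = [[name, float(pct)] for name, pct in share.items()]
print(pairs)
```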
(3) Group the listings by the "裝修" (decoration) attribute, count the listings at each decoration level, and show the counts in a bar chart:
df1_3 = df.groupby('裝修').size().to_frame('計數')  # number of listings per decoration level
df1_3
from pyecharts.charts import Bar
from pyecharts import options as opts
bar = Bar()
bar.add_xaxis(list(df1_3.index))
bar.add_yaxis("Count", df1_3['計數'].tolist())
bar.set_global_opts(title_opts=opts.TitleOpts(title="Listings by decoration level"))
bar.render_notebook()
# bar.render()  # writes an HTML file instead
3. Predicting house prices with a random forest model:
(1) Group the listings by layout ("戶型"), take the 5 most popular layouts (the 5 with the most listings), then compute the average price of those 5 layouts and plot it.
df2_1 = df.groupby('戶型').size().to_frame('計數')  # number of listings per layout
df2_1
df2_1.sort_values(by='計數', axis=0, ascending=False, inplace=True)
df2_1
names = list(df2_1.index[0:5])  # the 5 layouts with the most listings
names
df2_1_1 = df[['戶型', '總價1']].groupby('戶型').mean()
df2_1_1
datas = []
for v in names:
    datas.append(int(df2_1_1.loc[v, '總價1']))  # average price of this layout, truncated to an integer
datas
from pyecharts import options as opts
from pyecharts.charts import Line
line = Line()
line.add_xaxis(names)
line.add_yaxis("Average price (unit: 10,000 yuan)", datas)
line.set_global_opts(title_opts=opts.TitleOpts(title="Average price of the 5 most popular layouts"),
                     legend_opts=opts.LegendOpts())
line.render_notebook()
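The count, sort, and average steps above can also be sketched as a single grouped aggregation. This is a minimal sketch on invented rows, using the top 2 instead of the top 5 to keep the table small; the output column names `count` and `mean_price` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy listings; prices are made up.
toy = pd.DataFrame({
    "戶型": ["2室1廳1衛"] * 3 + ["3室2廳2衛"] * 2 + ["1室1廳1衛"],
    "總價1": [100, 110, 120, 200, 220, 60],
})

# One pass: listing count and average price for every layout.
stats = toy.groupby("戶型")["總價1"].agg(count="count", mean_price="mean")

# Keep the layouts with the most listings (2 here, 5 in the article).
top = stats.nlargest(2, "count")
print(top)
```

Computing both statistics in one `agg` call keeps the counts and the averages in the same table, so the most popular layouts and their prices cannot drift out of order.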
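(2) Select suitable attributes from the dataset and build a model to predict second-hand house prices.
Feature engineering (extract numeric values, split compound fields, encode categorical features):
df2 = df.drop(['小區名字', '計數', '總價'], axis=1, errors='ignore')  # drop columns that are clearly not useful as features; errors='ignore' skips any that are absent
df2
# Convert string and categorical columns to numeric features:
df2['建筑面積1'] = df2['建筑面積'].str[:-2].astype(float)  # strip the 2-character unit suffix, then convert to a number
df2
df2['單價1'] = df2['單價'].str[:-4].astype(float)  # strip the 4-character unit suffix, then convert to a number
df2
shi = []
ting = []
wei = []
for v in df2['戶型']:
    re_ = re.findall(r'\d+', v)
    # print(re_)
    if len(re_) >= 3:
        shi.append(int(re_[0]))   # bedrooms
        ting.append(int(re_[1]))  # living rooms
        wei.append(int(re_[2]))   # bathrooms
    else:
        # layouts without three numbers are recorded as 0/0/0
        shi.append(0)
        ting.append(0)
        wei.append(0)
df2['室'] = shi
df2['廳'] = ting
df2['衛'] = wei
df2
df2 = df2.drop(['戶型', '建筑面積', '單價'], axis=1)  # drop the raw columns that have now been parsed
df2
# Encode string labels / categories as integer codes
df2['朝向'] = pd.Categorical(df2['朝向']).codes
df2
df2['樓層'] = pd.Categorical(df2['樓層']).codes
df2['裝修'] = pd.Categorical(df2['裝修']).codes
df2['區域'] = pd.Categorical(df2['區域']).codes
df2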
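The two encoding ideas used above, splitting the layout string into room counts and turning categories into integer codes, can be sketched on a few invented rows. The sketch assumes layouts are written as three numbers separated by non-digit text; rows that do not match fall back to 0, as in the loop above.

```python
import pandas as pd

# Hypothetical toy rows; the third layout has no room numbers at all.
toy = pd.DataFrame({
    "戶型": ["3室2廳1衛", "2室1廳1衛", "車位"],
    "朝向": ["南", "南北", "南"],
})

# Three capture groups: bedrooms, living rooms, bathrooms.
parts = toy["戶型"].str.extract(r"(\d+)\D+(\d+)\D+(\d+)")
parts.columns = ["室", "廳", "衛"]
# Non-matching rows come back as NaN; convert to numbers and fill them with 0.
parts = parts.apply(pd.to_numeric).fillna(0).astype(int)

# Each distinct category gets an integer code, assigned in sorted order.
codes = pd.Categorical(toy["朝向"]).codes
print(parts)
print(list(codes))
```

The integer codes carry an arbitrary ordering. Tree-based models such as a random forest usually tolerate that, but for linear models one-hot encoding would generally be the safer choice.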
Modeling:
y = df2['單價1']  # target column: unit price
y
x = df2.drop(['單價1', '總價1'], axis=1)  # 總價1 is excluded too: total price divided by area gives the unit price, so keeping it would leak the target
x  # feature columns
# Split the dataset:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.30,
                                                    random_state=100,
                                                    )
# Show the shapes of the training and test sets
print("x_train.shape:", x_train.shape)
print("x_test.shape:", x_test.shape)
print("y_train.shape:", y_train.shape)
print('y_test.shape:', y_test.shape)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# Random forest regressor
rf = RandomForestRegressor()
# Hyperparameter grid
param = {"n_estimators": [120, 200, 300, 500, 800, 1200], "max_depth": [5, 8, 15, 25, 30]}
# Hyperparameter tuning
gc = GridSearchCV(rf, param_grid=param, cv=2)  # grid search with 2-fold cross-validation
gc.fit(x_train, y_train)  # 30 parameter combinations x 2 folds, so this takes a while
y_pre = gc.predict(x_test)
print(y_pre)  # predicted values
print("Random forest R^2 on the test set:", gc.score(x_test, y_test))  # for a regressor, score() is R^2, not accuracy
print("Best parameters:", gc.best_params_)
print("Best cross-validation score:", gc.best_score_)
print("Best estimator:", gc.best_estimator_)
print("Cross-validation results:\n", gc.cv_results_)
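Since `score` on a regressor reports R^2, it can help to look at an error in the target's own units as well. The sketch below runs on synthetic data (not the housing dataset) with a small fixed forest instead of a grid search, and shows that `score` matches `r2_score` while `mean_absolute_error` gives the average absolute miss.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: two features and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=100
)

# A small forest with a fixed seed so the run is repeatable.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)              # same value as model.score(X_test, y_test)
mae = mean_absolute_error(y_test, pred)  # average absolute error, in the units of y
print("R^2:", r2)
print("MAE:", mae)
```

On the housing data, the same two calls could be applied to `y_test` and `y_pre` from the grid search above.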
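Finally (source code):
That completes a simple data mining case study. I have collected many more cases over time and will keep writing them up and sharing them. If you found this useful, please follow; your support is my biggest motivation. If you need the source code:
Link: https://pan.baidu.com/s/1BIXUNwOrSEydEskuOB-_6g
Extraction code: 8848
Article source: http://www.zghlxwxcb.cn/news/detail-465597.html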