Analyzing and Visualizing Red Wine Data in Python with Linear Regression, Random Forest, and Other Models (Source Code and Dataset Included)



This walkthrough analyzes and mines the red wine dataset from a Tianchi project.

Implementation steps

1: Import the modules

2: Set the color palette and print precision

3: Load the data and show its dimensions

The dataset contains the following fields: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality.

The describe function then shows summary statistics for the numeric attributes.

Information about the values taken by quality is displayed after that.

A histogram is drawn for each variable.

A box plot is drawn for each variable.

Acidity-related features. The features related to acidity in this dataset are 'fixed acidity', 'volatile acidity', 'citric acid', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', and 'pH'. The first six all influence pH. Because pH is on a logarithmic scale, the six features are log-transformed (base 10) before their histograms are plotted.
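The reason for the log transform can be made concrete: pH is the negative base-10 logarithm of the hydrogen-ion concentration (strictly, its activity), so a tenfold change in concentration shifts pH by one unit. The sketch below only illustrates that relationship; the helper name ph_from_concentration and the sample concentrations are hypothetical and do not come from the dataset.

```python
import math

def ph_from_concentration(h_molar: float) -> float:
    """Return pH for a hydrogen-ion concentration given in mol/L."""
    return -math.log10(h_molar)

# Hypothetical concentrations, each ten times smaller than the previous one.
concentrations = [1e-3, 1e-4, 1e-5]
ph_values = [ph_from_concentration(c) for c in concentrations]

for c, ph in zip(concentrations, ph_values):
    print(f"[H+] = {c:.0e} mol/L -> pH = {ph:.1f}")
```

Each step down by a factor of ten raises pH by one, which is why features spanning several orders of magnitude are easier to compare on a log axis.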


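pH depends mainly on fixed acidity. Fixed acidity is 1 to 2 orders of magnitude higher than volatile acidity and citric acid (Figure 4), and 3 orders of magnitude higher than free sulfur dioxide, total sulfur dioxide, and sulphates. A new feature, total acid, is defined as the sum of the first three features (fixed acidity, volatile acidity, and citric acid).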

Sweetness. Residual sugar is the main driver of a wine's sweetness: dry (<= 4 g/L), semi-dry (4-12 g/L), semi-sweet (12-45 g/L), and sweet (>= 45 g/L). This dataset contains no sweet wines.
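The thresholds above overlap at their edges (4, 12, and 45 g/L each appear in two classes), so any implementation has to pick a convention. The sketch below assumes right-closed intervals (0, 4], (4, 12], (12, 45], with anything above 45 counted as sweet; that choice, the function name sweetness_class, and the sample values are assumptions made for illustration.

```python
from bisect import bisect_left

# Upper edges (g/L) of the right-closed intervals (0, 4], (4, 12], (12, 45].
UPPER_EDGES = [4, 12, 45]
LABELS = ["dry", "semi-dry", "semi-sweet", "sweet"]

def sweetness_class(residual_sugar: float) -> str:
    """Map residual sugar in g/L to a sweetness label."""
    if residual_sugar <= 0:
        raise ValueError("residual sugar must be positive")
    # bisect_left returns the index of the first edge >= value, so a value
    # exactly on an edge (e.g. 4.0) stays in the lower class.
    return LABELS[bisect_left(UPPER_EDGES, residual_sugar)]

samples = [1.9, 4.0, 4.1, 12.0, 15.5]  # hypothetical measurements
classes = [sweetness_class(s) for s in samples]
print(list(zip(samples, classes)))
```

Under this convention a wine at exactly 4.0 g/L is dry and one at exactly 12.0 g/L is semi-dry.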

The histogram of the sweetness classes is plotted after that.

Box plots of each attribute are then drawn for each quality level.

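The box plots show that:

Quality is positively correlated with citric acid, sulphates, and alcohol.
Quality is negatively correlated with volatile acidity, density, and pH.
Residual sugar, chlorides, and sulfur dioxide have little effect on quality.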

The relationship between density and alcohol content is analyzed in the same way.

Density and alcohol content are physically related, though not linearly. Density also depends on the other substances in the wine, but that correlation is weak.

Acid content and pH. pH and fixed acidity have a correlation of -0.68. Because fixed acidity makes up nearly all of the total acid, the total acid feature adds little information.
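One way to see why total acid adds little: when one term dominates a sum, the sum is almost perfectly correlated with that term. The sketch below uses synthetic, independently drawn columns whose means and spreads are assumed for illustration; it is not computed from the wine data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins for the three acid columns (g/dm^3); the means and
# spreads are assumed for illustration, not taken from the dataset.
fixed = rng.normal(8.3, 1.7, n)
volatile = rng.normal(0.5, 0.2, n)
citric = rng.normal(0.3, 0.2, n)

total = fixed + volatile + citric

# Pearson correlation between the sum and its largest component.
r_total_fixed = np.corrcoef(total, fixed)[0, 1]
print(f"corr(total acid, fixed acidity) = {r_total_fixed:.3f}")
```

With a correlation this close to 1, any relationship total acid has with pH is essentially the one fixed acidity already has.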


Multivariate analysis. The three features most strongly correlated with wine quality are alcohol content, volatile acidity, and citric acid. This part examines how these three features affect quality.

pH is negatively correlated with fixed acidity and citric acid.

Summary. The three most important features for wine quality are alcohol, volatile acidity, and citric acid. Wines with quality above 7 and wines with quality below 4 look linearly separable by eye, whereas wines of quality 5 and 6 are hard to separate linearly.

Random forest, linear regression, and other algorithms

The categorical data is encoded, the dataset is split into training and test sets, and so on.

Results on the raw data and on standardized data differ only slightly, so this dataset does not need standardization.
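A possible explanation for the near-identical results: for ordinary least squares with an intercept, standardization is an invertible affine change of variables, so the predictions do not change as long as the test set is scaled with the training set's mean and standard deviation. The sketch below checks this on synthetic data with NumPy's lstsq; it is not the scikit-learn pipeline used in this article, and the data and the helper fit_predict are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data with features on very different scales.
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 0.01])
y = X @ np.array([0.5, -0.02, 30.0]) + 5.0 + rng.normal(scale=0.1, size=200)
X_train, X_test = X[:140], X[140:]
y_train = y[:140]

def fit_predict(train, target, test):
    """Ordinary least squares with an intercept, solved by lstsq."""
    design = np.column_stack([np.ones(len(train)), train])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.column_stack([np.ones(len(test)), test]) @ coef

# Standardize with statistics computed on the training split only.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
pred_raw = fit_predict(X_train, y_train, X_test)
pred_std = fit_predict((X_train - mean) / std, y_train, (X_test - mean) / std)

max_diff = np.max(np.abs(pred_raw - pred_std))
print(f"largest prediction difference: {max_diff:.2e}")
```

This invariance does not carry over to regularized or distance-based models, where feature scale does change the fit.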

Comparing the prediction errors of the algorithms shows that all of them are fairly large, with the random forest's error on the high side.
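Whether an error is "large" is easier to judge against a trivial baseline. RMSE is in the same units as the quality score, so one option is to compare each model with a predictor that always outputs the mean quality of the training set. The sketch below shows the calculation; the scores and predictions are hypothetical numbers, not results from this article.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical quality scores and model predictions, for illustration only.
y_train = np.array([5, 6, 5, 7, 6, 5, 4, 6])
y_test = np.array([5, 6, 7, 5])
model_pred = np.array([5.2, 5.8, 6.1, 5.4])

# Baseline: always predict the mean quality seen in training.
baseline_pred = np.full(len(y_test), y_train.mean())

model_rmse = rmse(y_test, model_pred)
baseline_rmse = rmse(y_test, baseline_pred)
print(f"model RMSE:    {model_rmse:.3f}")
print(f"baseline RMSE: {baseline_rmse:.3f}")
```

A model whose RMSE is not clearly below the baseline has learned little beyond the average score.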


Code

Part of the code is shown below.

#!/usr/bin/env python
# coding: utf-8

# ## Data analysis

# In[1]:


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# from sklearn.datasets import load_wine


# In[2]:


# Color palette
color = sns.color_palette()
# Display precision for printed data
pd.set_option('display.precision',3)


# In[3]:


# The file is expected in the current working directory
df = pd.read_csv('winequality-red.csv',sep = ';')
display(df.head())
print('Data shape:',df.shape)


# In[4]:


df.info()


# In[5]:


df.describe()


# In[6]:


print('Values of quality:',df['quality'].unique())
print('Number of distinct quality values:',df['quality'].nunique())
print(df.groupby('quality').mean())


# Histograms of each variable

# In[7]:


color = sns.color_palette()
column= df.columns.tolist()
fig = plt.figure(figsize = (10,8))
for i in range(12):
    plt.subplot(4,3,i+1)
    df[column[i]].hist(bins = 100,color = color[3])
    plt.xlabel(column[i],fontsize = 12)
    plt.ylabel('Frequency',fontsize = 12)
plt.tight_layout()


# Box plots of each variable

# In[8]:


fig = plt.figure(figsize = (10,8))
for i in range(12):
    plt.subplot(4,3,i+1)
    # Passing the column as y draws a vertical box plot
    sns.boxplot(y = df[column[i]],width = 0.5,color = color[4])
    plt.ylabel(column[i],fontsize = 12)
plt.tight_layout()


# Acidity-related features
# The features related to acidity in this dataset are 'fixed acidity', 'volatile acidity', 'citric acid', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', and 'pH'. The first six all influence pH. Because pH is on a logarithmic scale, the six features are log-transformed before their histograms are plotted.

# In[9]:


acidityfeat = ['fixed acidity', 
				'volatile acidity', 
				'citric acid', 
				'chlorides', 
				'free sulfur dioxide', 
				'total sulfur dioxide',]

fig = plt.figure(figsize = (10,6))
for i in range(6):
    plt.subplot(2,3,i+1)
    v = np.log10(np.clip(df[acidityfeat[i]].values,a_min = 0.001,a_max = None))
    plt.hist(v,bins = 50,color = color[0])
    plt.xlabel('log('+ acidityfeat[i] +')',fontsize = 12)
    plt.ylabel('Frequency')    
plt.tight_layout()


# In[10]:


plt.figure(figsize=(6,3))

bins = 10**(np.linspace(-2,2)) # linspace returns 50 points by default
plt.hist(df['fixed acidity'], bins=bins, edgecolor = 'k', label='Fixed Acidity') # bins: an explicit sequence of bin edges
plt.hist(df['volatile acidity'], bins=bins, edgecolor = 'k', label='Volatile Acidity') # label: text shown in the legend
plt.hist(df['citric acid'], bins=bins, edgecolor = 'k', label='Citric Acid')
plt.xscale('log')
plt.xlabel('Acid Concentration(g/dm^3)')
plt.ylabel('Frequency')
plt.title('Histogram of Acid Contents') # figure title
plt.legend() # add a legend to the plot
plt.tight_layout()

print('Figure 4')
"""
pH depends mainly on fixed acidity.
Fixed acidity is 1 to 2 orders of magnitude higher than volatile acidity and citric acid (Figure 4), and 3 orders of magnitude higher than free sulfur dioxide, total sulfur dioxide, and sulphates.
A new feature, total acid, is the sum of the first three features.
"""


# Sweetness
# Residual sugar is the main driver of a wine's sweetness: dry (<= 4 g/L), semi-dry (4-12 g/L), semi-sweet (12-45 g/L), and sweet (>= 45 g/L). This dataset contains no sweet wines.

# In[11]:


df['sweetness'] = pd.cut(df['residual sugar'],bins = [0,4,12,45],labels = ['dry','semi-dry','semi-sweet'])
df.head()


# In[12]:


plt.figure(figsize = (6,4))
df['sweetness'].value_counts().plot(kind = 'bar',color = color[0])
plt.xticks(rotation = 0)
plt.xlabel('sweetness')
plt.ylabel('frequency')
plt.tight_layout()
print('Figure 5')


# In[13]:


# Create a new feature, total acid
df['total acid'] = df['fixed acidity'] + df['volatile acidity'] + df['citric acid']

columns = df.columns.tolist()
columns.remove('sweetness')
# columns

# ['fixed acidity',
#  'volatile acidity',
#  'citric acid',
#  'residual sugar',
#  'chlorides',
#  'free sulfur dioxide',
#  'total sulfur dioxide',
#  'density',
#  'pH',
#  'sulphates',
#  'alcohol',
#  'quality',
#  'total acid']
sns.set_style('ticks')
sns.set_context('notebook',font_scale = 1.1)

column = columns[0:11] + ['total acid']
plt.figure(figsize = (10,8))
for i in range(12):
    plt.subplot(4,3,i+1)
    sns.boxplot(x = 'quality',y = column[i], data = df,color = color[1],width = 0.6)
    plt.ylabel(column[i],fontsize = 12)
plt.tight_layout()

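print('Figure 7: Physicochemical Properties and Wine Quality by Boxplot')


# The box plots show that:
# 
# Quality is positively correlated with citric acid, sulphates, and alcohol
# Quality is negatively correlated with volatile acidity, density, and pH
# Residual sugar, chlorides, and sulfur dioxide have little effect on quality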

# In[14]:


sns.set_style('dark')
plt.figure(figsize = (10,8))
mcorr = df[column].corr()
mask = np.zeros_like(mcorr,dtype = bool)  # np.bool was removed from NumPy; the builtin bool works
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)
g = sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt='0.2f')

# print('Figure 8: Pairwise correlation plot')


# In[15]:


# Density and alcohol content
# Density and alcohol content are physically related, though not linearly. Density also depends on the other substances in the wine, but that correlation is weak.

sns.set_style('ticks')
sns.set_context('notebook',font_scale = 1.4)

plt.figure(figsize = (6,4))
sns.regplot(x = 'density',y = 'alcohol',data = df,scatter_kws = {'s':10},color = color[1])
plt.xlabel('density',fontsize = 12)
plt.ylabel('alcohol',fontsize = 12)

plt.xlim(0.989,1.005)
plt.ylim(7,16)

# print('Figure 9: Density vs Alcohol')


# Acid content and pH
# pH and fixed acidity have a correlation of -0.68. Because fixed acidity makes up nearly all of the total acid, the total acid feature adds little information.

# In[16]:


column


# In[17]:


acidity_related = ['fixed acidity','volatile acidity','total sulfur dioxide','chlorides','total acid']

plt.figure(figsize = (10,6))

for i in range(5):
    plt.subplot(2,3,i+1)
    sns.regplot(x = 'pH',y = acidity_related[i],data = df,scatter_kws = {'s':10},color = color[1])
    plt.xlabel('pH',fontsize = 12)
    plt.ylabel(acidity_related[i],fontsize = 12)
    
plt.tight_layout()
print('Figure 10: Correlation between acidity-related features and pH')


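# Multivariate analysis
# The three features most strongly correlated with wine quality are alcohol content, volatile acidity, and citric acid. This part examines how these three features affect quality.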

# In[18]:


plt.style.use('ggplot')

plt.figure(figsize = (6,4))
# height is the current name of the argument that older seaborn releases called size
sns.lmplot(x = 'alcohol',y = 'volatile acidity',hue = 'quality',data = df,fit_reg = False,scatter_kws = {'s':10},height = 5)


# In[19]:


sns.lmplot(x = 'alcohol', y = 'volatile acidity', col='quality', hue = 'quality', 
           data = df,fit_reg = False, height = 3,  aspect = 0.9, col_wrap=3,
           scatter_kws={'s':20})
print("Figure 11-2: Scatter Plots of Alcohol, Volatile Acid and Quality")


# pH, fixed acidity, and citric acid
# pH is negatively correlated with fixed acidity and citric acid

# In[20]:


sns.set_style('ticks')
sns.set_context("notebook", font_scale= 1.4)

plt.figure(figsize=(6,5))
cm = plt.get_cmap('RdBu')
sc = plt.scatter(df['fixed acidity'], df['citric acid'], c=df['pH'], vmin=2.6, vmax=4, s=15, cmap=cm)
bar = plt.colorbar(sc)
# The original line here was truncated; a colorbar label of 'pH' is assumed
bar.set_label('pH', rotation = 0)
plt.xlabel('fixed acidity')
plt.ylabel('citric acid')
plt.xlim(4,18)
plt.ylim(0,1)
print('Figure 12: pH with Fixed Acidity and Citric Acid')


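# Summary
# The three most important features for wine quality are alcohol, volatile acidity, and citric acid. Wines with quality above 7 and wines with quality below 4 look linearly separable by eye, whereas wines of quality 5 and 6 are hard to separate linearly.

# ## Data mining

# In[21]:


# Modeling
# Linear regression
# Ensemble methods
# Boosting
# Model evaluation
# Choosing model parameters
# 1. Splitting the dataset
# 1.1 Separate features and labels
# 1.2 Split into training and test sets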

df.head()


# In[22]:


# Data preprocessing

# Check the data for missing values
df.isnull().sum()


# In[23]:


# One-hot encode the categorical sweetness column
sweetness = pd.get_dummies(df['sweetness'])
df = pd.concat([df,sweetness],axis = 1)
df.head()


# In[24]:


df = df.drop('sweetness',axis = 1)
labels = df['quality']
features = df.drop('quality',axis = 1)
# Split the original dataset into training and test sets
from sklearn.model_selection import train_test_split
train_features,test_features,train_labels,test_labels = train_test_split(features,labels,test_size = 0.3,random_state = 0)
print('Training features shape:',train_features.shape)
print('Training labels shape:',train_labels.shape)
print('Test features shape:',test_features.shape)
print('Test labels shape:',test_labels.shape)


# In[25]:


from sklearn import svm
classifier=svm.SVC(kernel='linear',gamma=0.1)
classifier.fit(train_features,train_labels)
print('Training accuracy:',classifier.score(train_features,train_labels))
print('Test accuracy:',classifier.score(test_features,test_labels))


# In[26]:


from sklearn.linear_model import LinearRegression
LR = LinearRegression()
LR.fit(train_features,train_labels)
prediction = LR.predict(test_features)
prediction[:5]


# In[27]:


# Evaluate the model
from sklearn.metrics import mean_squared_error
RMSE = np.sqrt(mean_squared_error(test_labels,prediction))
print('Linear regression prediction error (RMSE):',RMSE)


# In[28]:


# Standardize the training and test features and compare the results

from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training features only, then apply it to both splits
scaler = StandardScaler().fit(train_features)
train_features_std = scaler.transform(train_features)
test_features_std = scaler.transform(test_features)
LR = LinearRegression()
LR.fit(train_features_std,train_labels)
prediction = LR.predict(test_features_std)

# Check the prediction error
RMSE = np.sqrt(mean_squared_error(test_labels,prediction))
print('Linear regression prediction error on standardized data (RMSE):',RMSE)


# Results on the raw data and on standardized data differ only slightly, so this dataset does not need standardization.
# 
# Ensemble method: random forest

# In[29]:


from sklearn.ensemble import RandomForestRegressor
RF = RandomForestRegressor()
RF.fit(train_features,train_labels)
prediction = RF.predict(test_features)

# In[30]:


RF.get_params()


# In[31]:


from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators':[100,200,300,400,500],
             'max_depth':[3,4,5,6],
             'min_samples_split':[2,3,4]}

RF = RandomForestRegressor()
grid = GridSearchCV(RF,param_grid = param_grid,scoring = 'neg_mean_squared_error',cv = 3,n_jobs = -1)
grid.fit(train_features,train_labels)



# In[33]:


grid.best_params_


# In[34]:


RF = RandomForestRegressor(n_estimators = 300,min_samples_split = 2,max_depth = 6)

RF.fit(train_features,train_labels)



# In[36]:


prediction = RF.predict(test_features)

RF_RMSE = np.sqrt(mean_squared_error(prediction,test_labels))
print('Random forest prediction error (RMSE):',RF_RMSE)


# Ensemble method: GBDT

# In[37]:


from sklearn.ensemble import GradientBoostingRegressor

GBDT = GradientBoostingRegressor()
GBDT.fit(train_features,train_labels)
gbdt_prediction = GBDT.predict(test_features)
GBDT.get_params()

