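Problem description
Data source: the California Housing Prices dataset from the StatLib repository, containing statistical data for California from 1990.
Goal: predict the median house value for any given district.
Framing the problem: supervised learning; multiple regression (several features are used, such as population and income); univariate regression (a single value is predicted per district); plain batch learning (the dataset is small and rarely changes).
Preparing the data
Downloading the data
The data can be downloaded manually from GitHub or fetched automatically.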
import os
import tarfile
import urllib.request
import pandas as pd

down_root = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = "datasets"
HOUSING_URL = down_root + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory first; urlretrieve fails if it does not exist
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction raises
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
Inspecting the data
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
# housing.head() prints the first 5 rows by default; middle columns may be elided
# housing.info() prints row and column counts, dtypes, and so on
housing.info() gives a quick overview of the data. It shows that total_bedrooms has missing values and that ocean_proximity has dtype object. Since the data was loaded from a CSV file, that column must hold strings.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
None
Print the categories of ocean_proximity and their counts; the column turns out to be a categorical label.
print(housing["ocean_proximity"].value_counts())
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64
housing.describe() computes, for each numeric column, the count, mean, std, min, the 25%, 50% (median) and 75% percentiles, and max. Null values are ignored in these calculations.
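To see how describe() treats missing values, here is a minimal sketch on a made-up four-row frame (the numbers are illustrative only, not taken from the housing data). The count row reports only non-null entries, and the 50% row matches the median of the remaining values.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration: one missing value in total_bedrooms
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

summary = toy.describe()
print(summary)

# count only includes the non-null entries of each column
print(summary.loc["count"])

# The 50% row is the median of the non-null values
print(summary.loc["50%", "total_bedrooms"], toy["total_bedrooms"].median())
```

On the real dataset the same effect should show up as a lower count for total_bedrooms than for the other columns.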
The data can also be examined by plotting histograms.
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()
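Histograms are worth checking because some machine learning algorithms work better on roughly normally distributed data. If a feature is tail-heavy (in this dataset, typically a long tail to the right of the median, i.e. right-skewed), it needs to be corrected with a transformation.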
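One common correction for a long right tail is a log transform. The sketch below is only an illustration under stated assumptions: it uses synthetic log-normal data as a stand-in for a column such as population, and np.log1p is one possible choice of transform rather than a step taken from this walkthrough.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed data (log-normal), standing in for a real column
values = pd.Series(rng.lognormal(mean=7.0, sigma=1.0, size=10_000))

# log1p computes log(1 + x), which stays defined when x is 0
log_values = np.log1p(values)

# Skewness near 0 indicates a roughly symmetric distribution
print("skew before:", values.skew())
print("skew after: ", log_values.skew())
```

Plotting values.hist(bins=50) and log_values.hist(bins=50) side by side makes the difference visible.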
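Splitting into training and test sets
The simplest approach is to pick rows at random. A seed must be set, though: without one, every run produces a different split, and over time the model ends up seeing the entire dataset, at which point the test set no longer means anything.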
import numpy as np

def get_train_set(data, ratio=0.2):
    # ratio is the fraction of rows held out for the test set
    # Fix the seed so the shuffled order is the same on every run
    np.random.seed(42)
    shuffled = np.random.permutation(len(data))
    test_set_size = int(len(data) * ratio)
    test_indices = shuffled[:test_set_size]
    train_indices = shuffled[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
scikit-learn provides an equivalent function; its random_state parameter plays the same role as the seed above.
from sklearn.model_selection import train_test_split
# random_state is the random seed; the same value yields the same split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
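A fixed seed stops helping once the dataset is refreshed with new rows, because the permutation then changes. One alternative, not used in the rest of this post, is to decide membership from a hash of a stable row identifier. The sketch below is an assumption-laden illustration: the `id` column, the toy frame, and the helper names `in_test_set` and `split_by_id` are all hypothetical.

```python
from zlib import crc32

import numpy as np
import pandas as pd


def in_test_set(identifier, test_ratio):
    # Hash the id to a value in [0, 2**32) and keep the lowest test_ratio share
    raw = int(identifier).to_bytes(8, "little", signed=True)
    return crc32(raw) < test_ratio * 2**32


def split_by_id(data, test_ratio, id_column):
    in_test = data[id_column].apply(lambda id_: in_test_set(id_, test_ratio))
    return data.loc[~in_test], data.loc[in_test]


# Toy frame with a hypothetical stable "id" column
toy = pd.DataFrame({"id": np.arange(1000), "value": np.arange(1000) * 2.0})
train, test = split_by_id(toy, 0.2, "id")
print(len(train), len(test))

# Appending new rows never moves an existing row between the two sets
extra = pd.DataFrame({"id": np.arange(1000, 1200), "value": np.zeros(200)})
bigger = pd.concat([toy, extra], ignore_index=True)
train2, test2 = split_by_id(bigger, 0.2, "id")
print(set(test["id"]) <= set(test2["id"]))
```

The trade-off is that the test share is only approximately test_ratio, and the approach depends on having an identifier that never changes for a given row.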
However, a purely random sample may not be representative. Suppose median income is an important feature; the split should then be stratified on it. First, bucket the incomes into categories and look at the distribution:
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
housing["income_cat"].hist()
plt.show()
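To make the binning concrete, here is a small sketch that applies the same bins and labels to a handful of made-up income values. By default pd.cut uses right-closed intervals, so a value exactly on an edge such as 1.5 falls into the lower category, and a value of exactly 0 falls outside the first interval.

```python
import numpy as np
import pandas as pd

bins = [0., 1.5, 3.0, 4.5, 6., np.inf]
labels = [1, 2, 3, 4, 5]

# Made-up incomes chosen to land on and around the bin edges
incomes = pd.Series([0.5, 1.5, 1.6, 3.0, 4.4, 6.0, 15.0])
cats = pd.cut(incomes, bins=bins, labels=labels)
print(cats.tolist())  # [1, 1, 2, 2, 3, 4, 5]

# Intervals are (0, 1.5], (1.5, 3.0], ...; 0.0 is not in (0, 1.5], so it becomes NaN
zero_cat = pd.cut(pd.Series([0.0]), bins=bins, labels=labels)
print(zero_cat.isna().all())  # True
```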
Use scikit-learn's stratified sampling class to do the split:
from sklearn.model_selection import StratifiedShuffleSplit
# n_splits is the number of splits to generate; 1 produces a single random split
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
print(strat_test_set)
The test set now looks like this:
longitude latitude ... ocean_proximity income_cat
5241 -118.39 34.12 ... <1H OCEAN 5
17352 -120.42 34.89 ... <1H OCEAN 4
3505 -118.45 34.25 ... <1H OCEAN 3
7777 -118.10 33.91 ... <1H OCEAN 3
14155 -117.07 32.77 ... NEAR OCEAN 3
... ... ... ... ... ...
12182 -117.29 33.72 ... <1H OCEAN 2
7275 -118.24 33.99 ... <1H OCEAN 2
17223 -119.72 34.44 ... <1H OCEAN 4
10786 -117.91 33.63 ... <1H OCEAN 4
3965 -118.56 34.19 ... <1H OCEAN 3
[4128 rows x 11 columns]
Check that the stratified sampling worked:
print(strat_test_set["income_cat"].value_counts() / len(strat_test_set))
3 0.350533
2 0.318798
4 0.176357
5 0.114341
1 0.039971
Name: income_cat, dtype: float64
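These proportions are only meaningful next to those of the full dataset. The sketch below shows one way to line them up, together with a purely random split for contrast. It runs on synthetic log-normal incomes rather than the real housing data, and the `proportions` helper and the `comparison` table are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for median_income (always positive, right-skewed)
toy = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})
toy["income_cat"] = pd.cut(toy["median_income"],
                           bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                           labels=[1, 2, 3, 4, 5])


def proportions(frame):
    # Share of rows in each income category, ordered by category
    return frame["income_cat"].astype(int).value_counts(normalize=True).sort_index()


splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy["income_cat"]))
strat_test = toy.iloc[test_idx]

_, random_test = train_test_split(toy, test_size=0.2, random_state=42)

comparison = pd.DataFrame({
    "overall": proportions(toy),
    "stratified": proportions(strat_test),
    "random": proportions(random_test),
})
print(comparison)
```

The stratified column should track the overall column closely, while the random column is free to drift, especially for the smaller categories.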
The final function:
def get_train_test_split(data, test_size):
    # Purely random split:
    # from sklearn.model_selection import train_test_split
    # random_state is the random seed; the same value yields the same split
    # test_size is the fraction of rows in the test set, between 0 and 1
    # train_set, test_set = train_test_split(data, test_size=test_size, random_state=42)
    # return train_set, test_set

    # Stratified split on one column:
    # work on a copy so the caller's DataFrame is not modified,
    # then label each row according to its median_income
    data = data.copy()
    data["income_cat"] = pd.cut(data["median_income"],
                                bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                                labels=[1, 2, 3, 4, 5])
    from sklearn.model_selection import StratifiedShuffleSplit
    # n_splits is the number of splits to generate
    split = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=42)
    for train_index, test_index in split.split(data, data["income_cat"]):
        # split() yields positional indices, so select rows with iloc
        strat_train_set = data.iloc[train_index]
        strat_test_set = data.iloc[test_index]
    # Drop the temporary column (axis=1 means columns)
    return (strat_train_set.drop("income_cat", axis=1),
            strat_test_set.drop("income_cat", axis=1))
Data visualization
train_set, test_set = get_train_test_split(housing, 0.2)
visual_data = train_set.copy()
# alpha=0 is fully transparent, 1 is fully opaque
visual_data.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
plt.show()
A more informative version: let the size of each point represent the population and its color the median house value.
# s sets the marker size (area, in points squared); figsize defaults to (6.4, 4.8), given as (width, height)
# c sets the marker color
# cmap is the colormap that maps data values to colors. jet is widely used, but its
# brightness changes unevenly across the range, which can make plots misleading, so it
# is no longer recommended for scientific visualization. viridis, plasma and magma are
# perceptually uniform and are better choices; viridis in particular conveys gradual
# changes in the data without distortion.
visual_data.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                 s=visual_data["population"]/100, label="population",
                 c="median_house_value", cmap=plt.get_cmap("viridis"),
                 colorbar=True,
                 figsize=(10,7))
plt.legend()
plt.show()
The figure below shows the colors of these colormaps:
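A comparable strip chart can be rendered locally with a short sketch like the following. It draws the same 0-to-1 gradient through each colormap named above; the output file name colormaps.png is an arbitrary choice for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

names = ["jet", "viridis", "plasma", "magma"]

# A single row of 256 values from 0 to 1, stretched into a horizontal strip
gradient = np.linspace(0, 1, 256).reshape(1, -1)

fig, axes = plt.subplots(nrows=len(names), figsize=(8, 3))
for ax, name in zip(axes, names):
    ax.imshow(gradient, aspect="auto", cmap=name)
    ax.set_axis_off()
    ax.set_title(name, loc="left", fontsize=10)

fig.tight_layout()
fig.savefig("colormaps.png")
```

In the resulting image the jet strip shows abrupt bright bands, whereas the other three change brightness steadily from one end to the other.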