Randomized Search
If there are only a handful of hyperparameter values to try, as in the earlier example, grid search is enough; but when the hyperparameter search space is large, use RandomizedSearchCV instead. Its advantages:
- It supports much larger parameter ranges.
- It can find a good hyperparameter combination faster, because instead of exhaustively evaluating every combination it randomly samples settings from the specified ranges and evaluates those.
- You can set the parameter bounds according to the compute budget you have, which makes it more flexible.
The downside is that it may miss the true optimum and only find an acceptable "best". If time permits, grid search is still worth using.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

forest_reg = RandomForestRegressor()
# randint(low=1, high=101).rvs(5) returns e.g. array([64, 98, 35, 2, 72]);
# pass the distribution itself instead of a fixed-size sample
param_grid = {
    # 'n_estimators': list(range(1, 200)),
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}
grid_search = RandomizedSearchCV(forest_reg, param_grid, cv=5,
                                 n_iter=20,
                                 scoring="neg_mean_squared_error",
                                 return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
print(grid_search.best_params_)
print(grid_search.best_estimator_)
print(np.sqrt(-grid_search.best_score_))
{'max_features': 6, 'n_estimators': 199}
RandomForestRegressor(max_features=6, n_estimators=199)
49012.16057617387
Here n_iter is the total number of parameter combinations to try. If n_iter is too small, the best hyperparameter combination may be missed; if it is too large, the search takes longer and consumes more compute.
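With cv=5 and n_iter=20, the search above fits 20 × 5 = 100 models in total. A toy run on synthetic data (hypothetical shapes, just to illustrate that n_iter controls how many combinations get sampled):

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# tiny synthetic regression problem, just to keep the run fast
rng = np.random.default_rng(42)
X = rng.random((100, 5))
y = rng.random(100)

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=5, random_state=42),
    {'max_features': randint(low=1, high=5)},  # sampled, not enumerated
    n_iter=4, cv=3, random_state=42)
search.fit(X, y)

# exactly n_iter parameter combinations were evaluated (each across cv folds)
print(len(search.cv_results_['params']))  # 4
```

Each sampled combination is cross-validated on cv folds, so the total number of fits is n_iter × cv.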
Evaluating the Model
Looking at how important each column is to the predictions
param_grid = [
    {'n_estimators': [3, 10, 30, 50], 'max_features': [2, 4, 6, 8, None]},
    {'bootstrap': [False], 'n_estimators': [3, 10, 30], 'max_features': [2, 3, 4, 8]}
]
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
print(grid_search.best_params_)
print(grid_search.best_estimator_)
print(np.sqrt(-grid_search.best_score_))
# get the column labels
housing_num = housing.drop("ocean_proximity", axis=1)
num_attribs = list(housing_num)
extra_attribs = ["rooms_per_household", "pop_per_household", "bedrooms_per_room"]
# relative importance of each column for accurate predictions
feature_importances = grid_search.best_estimator_.feature_importances_
# here I modified the function so it also returns full_pipeline
# get the columns fed into a specific transformer inside the pipeline
cat_encoder = full_pipeline.named_transformers_['cat']
cat_one_hot_attribs = list(cat_encoder.categories_[0])
# final column names = numeric column names + the three engineered columns + the one-hot columns
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
print(sorted(zip(feature_importances, attributes), reverse=True))
{'bootstrap': False, 'max_features': 8, 'n_estimators': 30}
RandomForestRegressor(bootstrap=False, max_features=8, n_estimators=30)
49442.37738967349
[(0.3250563395483288, 'median_income'),
(0.1633435907899842, 'INLAND'),
(0.11059555286375254, 'pop_per_household'),
(0.08114145071753134, 'longitude'),
(0.0728049997803568, 'latitude'),
(0.07264703358828413, 'bedrooms_per_room'),
(0.06346893798818128, 'rooms_per_household'),
(0.04130518938735756, 'housing_median_age'),
(0.014117547726336705, 'total_rooms'),
(0.01405138434431168, 'population'),
(0.013966918312688084, 'total_bedrooms'),
(0.013656643753704638, 'households'),
(0.009607652315968867, '<1H OCEAN'),
(0.002484053857680537, 'NEAR OCEAN'),
(0.001674961006904646, 'NEAR BAY'),
(7.774401862815335e-05, 'ISLAND')]
Knowing the importances, you can drop some of the less important columns, or rework those columns so they contribute more.
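Dropping columns could be sketched like this, using hypothetical importance values (in the real project they would come from grid_search.best_estimator_.feature_importances_): keep only the indices of the k largest importances and slice the prepared matrix down to those columns.

```python
import numpy as np

# stand-in importances; in the real pipeline these come from
# grid_search.best_estimator_.feature_importances_
importances = np.array([0.33, 0.16, 0.11, 0.08, 0.32])
k = 2

# indices of the k most important features, kept in column order
top_k_idx = np.sort(np.argpartition(importances, -k)[-k:])
print(top_k_idx)  # [0 4]

# slice the prepared feature matrix down to those columns
X_prepared = np.random.rand(10, 5)   # stand-in for housing_prepared
X_reduced = X_prepared[:, top_k_idx]
print(X_reduced.shape)  # (10, 2)
```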
Evaluating on the Test Set
from sklearn.metrics import mean_squared_error
# reuse the best estimator directly
final_model = grid_search.best_estimator_
# prepare the test set
X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()
# run the data through the full pipeline
X_test_prepared, f = transform_data(X_test)
# predict with the model
final_predictions = final_model.predict(X_test_prepared)
# compute the RMSE
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print(final_rmse)
# compute a 95% confidence interval
from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
interval = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                                    loc=squared_errors.mean(),
                                    scale=stats.sem(squared_errors)))
print(interval)
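The t-interval is computed on the squared errors and then square-rooted; since the square root is monotonic, the resulting bounds always bracket the RMSE point estimate itself. A quick sanity check on synthetic errors (hypothetical data, not the housing set):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
errors = rng.normal(0, 10, size=1000)   # synthetic prediction errors
squared_errors = errors ** 2

ci = np.sqrt(stats.t.interval(0.95, len(squared_errors) - 1,
                              loc=squared_errors.mean(),
                              scale=stats.sem(squared_errors)))
rmse = np.sqrt(squared_errors.mean())

# the RMSE point estimate lies inside its own confidence interval
print(ci[0] <= rmse <= ci[1])  # True
```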
Follow-up Work
Once the model is deployed to production, its accuracy may degrade as new data arrives, so you need to monitor prediction quality and automate some of the maintenance. Possible directions:
- Regularly collect new data and add human labels;
- Write scripts that automatically train and tune the model, and run them on a schedule;
- Write a script that automatically compares the new model against the old one on an updated test set: deploy the new model if it performs better, and investigate the cause if it performs worse;
- Monitor the quality of the model's input data;
- Back up every model and the relevant datasets so you can roll back quickly.
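The comparison script in the third bullet could look like this minimal sketch (compare_models is a hypothetical helper, not part of the project code):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def compare_models(old_model, new_model, X_test, y_test):
    """Hypothetical helper: True if the new model has a lower test RMSE."""
    old_rmse = np.sqrt(mean_squared_error(y_test, old_model.predict(X_test)))
    new_rmse = np.sqrt(mean_squared_error(y_test, new_model.predict(X_test)))
    return new_rmse < old_rmse

# example: a fitted linear model should beat a mean-only baseline on linear data
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression

X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1
old = DummyRegressor().fit(X, y)
new = LinearRegression().fit(X, y)
print(compare_models(old, new, X, y))  # True
```

In production the return value would gate the deployment step, with a "worse" result triggering investigation instead of a rollout.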
That concludes this article on the machine-learning intro example of California housing price prediction, part 4 (more tuning + evaluation).