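Building a machine learning application with Amazon SageMaker
The full deployment video is here; the original recording ran about 30 minutes, and the waiting time has been cut for easier viewing:
[Beginner] Building a machine learning application with Amazon SageMaker
1. Create a SageMaker Notebook instance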
Amazon SageMaker: https://aws.amazon.com/cn/sagemaker/
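Enter a name, choose an instance type, and set the disk size, as shown in the figure below.
Create a new role, select "Any S3 bucket", and click Create role.
Configure the VPC network: choose the VPC, subnet, and security group, then click Create notebook instance.
Wait 5-6 minutes until the status changes to InService, then click Open Jupyter.
Create a new file, as shown in the figure below.
2. Download the dataset
Enter the following code to download and unzip the dataset: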
!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
!unzip -o bank-additional.zip
Paste the code and click Run.
Display the dataset with pandas
Use the bank-additional-full.csv dataset file: read it with pandas and display it:
import numpy as np # For matrix operations and numerical processing
import pandas as pd # For munging tabular data
import os
data = pd.read_csv("./bank-additional/bank-additional-full.csv", sep=";")
pd.set_option("display.max_columns", 500) # Make sure we can see all of the columns
pd.set_option("display.max_rows", 50) # Keep the output on one page
data
The features are explained as follows:
3. Data preprocessing
Data cleaning: add two indicator variables, then convert the categorical columns to numeric columns with one-hot encoding.
data["no_previous_contact"] = np.where(
data["pdays"] == 999, 1, 0
) # Indicator variable to capture when pdays takes a value of 999
data["not_working"] = np.where(
np.isin(data["job"], ["student", "retired", "unemployed"]), 1, 0
) # Indicator for individuals not actively employed
model_data = pd.get_dummies(data) # Convert categorical variables to sets of indicators
model_data
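The encoding step above can be tried on a tiny frame first. The sketch below is only an illustration: the three rows are made-up values that mimic the `age`, `job`, and `pdays` columns, and `dtype=int` is passed so the indicator columns come out as 0/1 regardless of the pandas version.

```python
import numpy as np
import pandas as pd

# Toy rows that mimic three columns of the bank marketing data (values are made up).
toy = pd.DataFrame(
    {
        "age": [30, 45, 61],
        "job": ["student", "admin.", "retired"],
        "pdays": [999, 3, 999],
    }
)

# Indicator: 1 when pdays holds the sentinel value 999, otherwise 0.
toy["no_previous_contact"] = np.where(toy["pdays"] == 999, 1, 0)

# Indicator: 1 when the job category is one of the not-working groups.
toy["not_working"] = np.where(
    toy["job"].isin(["student", "retired", "unemployed"]), 1, 0
)

# One-hot encode the string column; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Each distinct value of `job` becomes its own `job_<value>` column, which is why the real dataset grows to several dozen columns after `pd.get_dummies`.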
Drop the mutually correlated features and the duration feature:
model_data = model_data.drop(
["duration", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"], axis=1)
model_data = model_data.drop(["y_no"], axis=1)
model_data
Split the dataset into training (90%) and test (10%) sets, and convert them into the format the algorithm expects. The training set is used during training; the test set is used to evaluate model performance after training completes.
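The original post does not show the code for this split, yet the next section uses `train_x`, `train_y`, `test_x`, and `test_y`. The following is a minimal sketch of one way to produce them, assuming the label column is `y_yes` (the column left after `y_no` is dropped) and reusing the seed 1729 that appears later in the post. The helper name `split_train_test` and the demo frame are hypothetical.

```python
import pandas as pd


def split_train_test(model_data, label_col="y_yes", train_frac=0.9, seed=1729):
    """Shuffle the rows, cut them by position, and separate features from the label."""
    shuffled = model_data.sample(frac=1, random_state=seed)
    cut = int(train_frac * len(shuffled))
    train, test = shuffled.iloc[:cut], shuffled.iloc[cut:]
    train_x, train_y = train.drop(columns=[label_col]), train[label_col]
    test_x, test_y = test.drop(columns=[label_col]), test[label_col]
    return train_x, train_y, test_x, test_y


# Made-up frame standing in for model_data, only to show the shapes.
demo = pd.DataFrame({"age": range(20, 40), "y_yes": [0, 1] * 10})
train_x, train_y, test_x, test_y = split_train_test(demo)
print(len(train_x), len(test_x))
```

In the notebook, the same helper would be called as `split_train_test(model_data)` after the drop steps above.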
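4. Train a model with XGBoost
Install XGBoost
!pip install xgboost
Use the Python XGBoost API
Start model training and save the model when it finishes. Then feed the held-out test set into the model for inference: predictions above the threshold (0.5) are treated as 1 and the rest as 0, and they are compared with the test-set labels to evaluate the model.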
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
# Split the training data with sklearn.model_selection; the train-to-validation ratio here is 8:2 and can be adjusted as needed
X, val_X, y, val_y = train_test_split(
train_x,
train_y,
test_size=0.2,
random_state=2022,
stratify=train_y
)
# Build the XGBoost DMatrix objects
xgb_val = xgb.DMatrix(val_X, label=val_y)
xgb_train = xgb.DMatrix(X, label=y)
xgb_test = xgb.DMatrix(test_x)
# XGBoost model #####################
params = {
    'booster': 'gbtree',
    'objective': 'binary:logistic',
    'eval_metric': 'auc',  # alternative: logloss
    'gamma': 0.1,  # minimum loss reduction needed to split a node (pruning control); larger is more conservative, typically 0.1 or 0.2
    'max_depth': 8,  # tree depth; deeper trees overfit more easily
    'alpha': 0,  # L1 regularization coefficient
    'lambda': 10,  # L2 regularization on the weights; larger values make overfitting less likely
    'subsample': 0.7,  # fraction of training rows sampled for each tree
    'colsample_bytree': 0.5,  # fraction of columns sampled when building each tree
    'min_child_weight': 3,
    # Defaults to 1. It is the minimum sum of h (the second-order gradient) required in a leaf.
    # For imbalanced 0-1 classification with h around 0.01, min_child_weight = 1 means a leaf
    # needs at least about 100 samples. This parameter strongly affects the result:
    # the smaller it is, the easier it is to overfit.
    'verbosity': 1,  # 0 = silent, 1 = warnings, 2 = info, 3 = debug
    'eta': 0.03,  # learning rate
    'seed': 1000,
    'nthread': -1,  # number of CPU threads
    'scale_pos_weight': (np.sum(y == 0) / np.sum(y == 1)),  # handles class imbalance; typically sum(negative cases) / sum(positive cases)
}
num_rounds = 500  # number of boosting rounds
watchlist = [(xgb_train, 'train'), (xgb_val, 'val')]
# Train the model and save it
# early_stopping_rounds: with a large number of rounds, training stops once the metric on the
# last watchlist entry ('val') has not improved for this many rounds
model = xgb.train(params, xgb_train, num_rounds, watchlist, early_stopping_rounds=200)
model.save_model('./xgb.model')  # store the trained model
preds = model.predict(xgb_test)
# Turn the predicted probabilities into 0/1 labels
threshold = 0.5
ypred = np.where(preds > threshold, 1, 0)
from sklearn import metrics
print('AUC: %.4f' % metrics.roc_auc_score(test_y, preds))  # AUC is computed from the probabilities, not the thresholded labels
print('ACC: %.4f' % metrics.accuracy_score(test_y, ypred))
print('Recall: %.4f' % metrics.recall_score(test_y, ypred))
print('F1-score: %.4f' % metrics.f1_score(test_y, ypred))
print('Precision: %.4f' % metrics.precision_score(test_y, ypred))
print(metrics.confusion_matrix(test_y,ypred))
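To make the thresholding and the printed scores less opaque, here is a small standard-library sketch of what the confusion matrix and the derived metrics measure. The labels and probabilities are made up, and the helper names `confusion_counts` and `summarize` are hypothetical; in the notebook, `sklearn.metrics` does the same work.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Threshold the probabilities and count TP, FP, TN, FN."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob > threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summarize(tp, fp, tn, fn):
    """Derive accuracy, precision, recall, and F1 from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels and predicted probabilities for six samples.
labels = [1, 0, 1, 0, 0, 1]
probs = [0.9, 0.2, 0.4, 0.6, 0.1, 0.7]

counts = confusion_counts(labels, probs)
scores = summarize(*counts)
print(counts)
print(scores)
```

On an imbalanced dataset such as this one, accuracy alone can look good while recall on the positive class stays low, which is why the post prints several metrics side by side.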
Output the importance of the different features in the model, which usually helps us better understand the model's behavior.
from xgboost import plot_importance
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = (10.0, 8.0)  # figure size in inches
# Plot the feature importances
plot_importance(model)
plt.show()
5. Train the model with the SageMaker Training API
Initialization
import sagemaker
import boto3
import numpy as np # For matrix operations and numerical processing
import pandas as pd # For munging tabular data
from time import gmtime, strftime
import os
region = boto3.Session().region_name
smclient = boto3.Session().client("sagemaker")
role = sagemaker.get_execution_role()
bucket = sagemaker.Session().default_bucket()
prefix = "sagemaker/DEMO-hpo-xgboost-dm"
Data processing
data = pd.read_csv("./bank-additional/bank-additional-full.csv", sep=";")
pd.set_option("display.max_columns", 500) # Make sure we can see all of the columns
data["no_previous_contact"] = np.where(
data["pdays"] == 999, 1, 0
) # Indicator variable to capture when pdays takes a value of 999
data["not_working"] = np.where(
np.isin(data["job"], ["student", "retired", "unemployed"]), 1, 0
) # Indicator for individuals not actively employed
model_data = pd.get_dummies(data) # Convert categorical variables to sets of indicators
model_data = model_data.drop(
["duration", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"],
axis=1,
)
Split the dataset into training (70%), validation (20%), and test (10%) sets, and convert them into the format the SageMaker built-in XGBoost algorithm expects. The training and validation sets are used during training. The test set is used to evaluate model performance after deployment to an endpoint.
train_data, validation_data, test_data = np.split(
model_data.sample(frac=1, random_state=1729),
[int(0.7 * len(model_data)), int(0.9 * len(model_data))],
)
pd.concat([train_data["y_yes"], train_data.drop(["y_no", "y_yes"], axis=1)], axis=1).to_csv(
"train.csv", index=False, header=False
)
pd.concat(
[validation_data["y_yes"], validation_data.drop(["y_no", "y_yes"], axis=1)], axis=1
).to_csv("validation.csv", index=False, header=False)
pd.concat([test_data["y_yes"], test_data.drop(["y_no", "y_yes"], axis=1)], axis=1).to_csv(
"test.csv", index=False, header=False
)
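The three `to_csv` calls above all follow the same pattern: for CSV input, the built-in XGBoost algorithm expects the label in the first column and no header row. A minimal sketch of that layout on a made-up frame follows; the helper name `to_label_first_csv` is hypothetical, and unlike the notebook code it drops only one label column.

```python
import io

import pandas as pd


def to_label_first_csv(frame, label_col):
    """Return CSV text with the label as the first column and no header or index."""
    ordered = pd.concat([frame[label_col], frame.drop(columns=[label_col])], axis=1)
    buffer = io.StringIO()
    ordered.to_csv(buffer, index=False, header=False)
    return buffer.getvalue()


# Made-up rows: two features plus the binary label.
toy = pd.DataFrame({"age": [30, 45], "campaign": [1, 2], "y_yes": [0, 1]})
csv_text = to_label_first_csv(toy, "y_yes")
print(csv_text)
```

Each output line starts with the label (0 or 1) followed by the feature values, which is the layout written to train.csv, validation.csv, and test.csv.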
Upload the generated datasets to S3 for use in the next model training step.
boto3.Session().resource("s3").Bucket(bucket).Object(
os.path.join(prefix, "train/train.csv")
).upload_file("train.csv")
boto3.Session().resource("s3").Bucket(bucket).Object(
os.path.join(prefix, "validation/validation.csv")
).upload_file("validation.csv")
from sagemaker.inputs import TrainingInput
s3_input_train = TrainingInput(
s3_data="s3://{}/{}/train".format(bucket, prefix), content_type="csv"
)
s3_input_validation = TrainingInput(
s3_data="s3://{}/{}/validation/".format(bucket, prefix), content_type="csv"
)
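The upload code builds object keys with `os.path.join`, while the `TrainingInput` URIs are built with string formatting. On the Linux-based notebook instance both produce forward slashes, so they agree. The sketch below shows the same composition with `posixpath`, which keeps the `/` separator explicit on any operating system. The bucket name is a placeholder and the helper names are hypothetical.

```python
import posixpath


def s3_key(prefix, *parts):
    """Join key segments with '/' regardless of the local operating system."""
    return posixpath.join(prefix, *parts)


def s3_uri(bucket, key):
    """Build an s3:// URI from a bucket name and an object key or key prefix."""
    return "s3://{}/{}".format(bucket, key)


bucket = "example-bucket"  # placeholder; the notebook uses the session's default bucket
prefix = "sagemaker/DEMO-hpo-xgboost-dm"

train_key = s3_key(prefix, "train", "train.csv")           # where train.csv is uploaded
train_channel = s3_uri(bucket, s3_key(prefix, "train")) + "/"  # what the train channel points at
print(train_key)
print(train_channel)
```

The channel URI is a key prefix, so every object uploaded under `train/` is picked up by the training job.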
Generate the XGBoost model training report; this step is fairly slow.
from sagemaker.debugger import Rule, rule_configs
rules=[
Rule.sagemaker(rule_configs.create_xgboost_report())
]
sess = sagemaker.Session()
container = sagemaker.image_uris.retrieve("xgboost", boto3.Session().region_name, "1.2-1")
xgb = sagemaker.estimator.Estimator(
container,
role,
instance_count=1,
instance_type="ml.m4.xlarge",
base_job_name="bank-dm-xgboost-report",
output_path="s3://{}/{}/output".format(bucket, prefix),
sagemaker_session=sess,
rules=rules
)
xgb.set_hyperparameters(
max_depth=5,
eta=0.2,
gamma=4,
min_child_weight=6,
subsample=0.8,
objective="binary:logistic",
num_round=500,
)
xgb.fit({"train": s3_input_train, "validation": s3_input_validation})
The output is shown in the figure below.
6. Managing training jobs
In the SageMaker console, go to Training > Training jobs.
S3 location of the training report
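The report location can also be derived in the notebook rather than looked up in the console. The sketch below assumes the layout used in AWS's Debugger examples, where rule output is written under `<output_path>/<training job name>/rule-output`; treat that layout as an assumption and confirm it against the path shown in the console. The job name here is a made-up placeholder and `rule_output_uri` is a hypothetical helper.

```python
def rule_output_uri(output_path, job_name):
    """Compose the assumed S3 prefix that holds the Debugger rule output."""
    return output_path.rstrip("/") + "/" + job_name + "/rule-output"


# Placeholder values; in the notebook these would come from the estimator,
# for example its output_path and latest_training_job.job_name attributes.
output_path = "s3://example-bucket/sagemaker/DEMO-hpo-xgboost-dm/output"
job_name = "bank-dm-xgboost-report-2023-01-01-00-00-00-000"

report_prefix = rule_output_uri(output_path, job_name)
print(report_prefix)
```

Listing that prefix (for example with `aws s3 ls --recursive`) should show the generated report files once the rule has finished.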
7. Train a model with AutoGluon
Install AutoGluon
# Install AutoGluon
!pip install -U setuptools wheel
!pip install -U "mxnet<2.0.0"
!pip install autogluon
Train a model with AutoGluon Tabular
from autogluon.tabular import TabularDataset, TabularPredictor
ag_data = pd.read_csv("./bank-additional/bank-additional-full.csv", sep=";")
label = 'y'
print("Summary of y variable: \n", ag_data[label].describe())
ag_train_data, ag_test_data = np.split(
ag_data.sample(frac=1, random_state=1729),
[int(0.9 * len(ag_data))],
)
With AutoGluon there is no need to preprocess the data (missing-value handling, one-hot encoding, and so on); AutoGluon does this for us automatically.
ag_test_data_X = ag_test_data.iloc[:,:-1]
ag_test_data_y =ag_test_data.iloc[:,20]
save_path = 'agModels-predictClass' # specifies folder to store trained models
learner_kwargs = {'ignored_columns': ["duration", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"]}
predictor = TabularPredictor(label=label, path=save_path,
eval_metric='recall', learner_kwargs=learner_kwargs
).fit(ag_train_data)
predictor = TabularPredictor.load(save_path) # unnecessary, just demonstrates how to load previously-trained predictor from file
ag_y_pred = predictor.predict(ag_test_data_X)
ag_y_pred_proa = predictor.predict_proba(ag_test_data_X)
print("Predictions: \n", ag_y_pred)
perf = predictor.evaluate_predictions(y_true=ag_test_data_y, y_pred=ag_y_pred, auxiliary_metrics=True)
# perf = predictor.evaluate_predictions(y_true=ag_test_data_y, y_pred=ag_y_pred_proa, auxiliary_metrics=True) #when eval_metric='auc' in TabularPredictor()
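The code above separates features from the label by position (`iloc[:, :-1]` and `iloc[:, 20]`), which works only while `y` stays the last of the 21 columns. As a hedged alternative, the sketch below selects by column name instead; the toy frame and its values are made up.

```python
import pandas as pd

label = "y"

# Made-up rows standing in for the raw bank marketing data.
toy = pd.DataFrame(
    {
        "age": [30, 45, 61],
        "job": ["student", "admin.", "retired"],
        "y": ["no", "yes", "no"],
    }
)

# Select by name: unaffected by where the label column sits.
features = toy.drop(columns=[label])
target = toy[label]

# Even if the columns are reordered, the same two lines still give the right result.
reordered = toy[["y", "age", "job"]]
features_again = reordered.drop(columns=[label])

print(features.columns.tolist())
print(target.tolist())
```

In the notebook this would read `ag_test_data.drop(columns=[label])` and `ag_test_data[label]`.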
When the experiment is finished, remember to stop the instance and delete all related resources such as roles, policies, log groups, and so on.
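Stopping the notebook instance can be scripted as well as done in the console. The helper below is a sketch that takes a SageMaker client as an argument and relies on two boto3 client calls, `describe_notebook_instance` and `stop_notebook_instance`. The helper name, the instance name, and the `FakeSageMakerClient` stand-in are hypothetical; the stand-in exists only so the example runs without touching AWS.

```python
def stop_notebook_instance_if_running(sm_client, name):
    """Stop the named notebook instance when it is InService; return the status seen."""
    response = sm_client.describe_notebook_instance(NotebookInstanceName=name)
    status = response["NotebookInstanceStatus"]
    if status == "InService":
        sm_client.stop_notebook_instance(NotebookInstanceName=name)
    return status


class FakeSageMakerClient:
    """Stand-in that mimics the two client calls used above (for illustration only)."""

    def __init__(self, status):
        self.status = status
        self.stopped = []

    def describe_notebook_instance(self, NotebookInstanceName):
        return {"NotebookInstanceStatus": self.status}

    def stop_notebook_instance(self, NotebookInstanceName):
        self.stopped.append(NotebookInstanceName)


fake = FakeSageMakerClient("InService")
seen = stop_notebook_instance_if_running(fake, "my-notebook")
print(seen, fake.stopped)
```

With real credentials, the client would be created with `boto3.client("sagemaker")`. Roles, policies, and log groups are separate resources and still have to be removed in IAM and CloudWatch.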
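8. Summary
The whole experiment was completed by following Amazon's official guide, and the process was fairly simple. Access was slow, however: execution sometimes hung halfway and had to be rerun. I hope Amazon can improve the network speed, since it hurts the user experience somewhat. SageMaker is convenient: Jupyter Notebook and the Notebook instance come with several ready-made environments and libraries such as PyTorch, NumPy, and pandas, which saves installation time, so even a beginner like me who has never studied machine learning could get up and running quickly. It lowers the barrier to machine learning. I learned a lot from this experiment. Finishing the deployment is only the first step, and I still need to read the official deployment guide several more times to digest the details.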
9. References
Reference article: Amazon Web Services (亚马逊云科技) [Cloud Exploration Lab]: Building a machine learning application with Amazon SageMaker; Building a fine-grained sentiment analysis application; Quickly building your first AIGC application based on the Stable Diffusion model
Deployment guide: https://dev.amazoncloud.cn/column/article/63ff329f4891d26f36585a9c