openGauss Study Notes-50: openGauss Advanced Features: DB4AI
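The current version of openGauss supports native DB4AI capabilities. By introducing native AI operators, it simplifies the workflow and makes full use of the optimization and execution capabilities of the database optimizer and executor to provide high-performance in-database model training. A simpler training and prediction workflow, together with better performance, lets developers spend their time on model tuning and data analysis rather than on a fragmented technology stack and redundant code.
The current version of DB4AI supports logistic regression (binary classification only at present), linear regression, and support vector machines (classification tasks), all based on the SGD operator, as well as K-means clustering based on the K-Means operator.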
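50.1 Keywords
Table 1 DB4AI syntax and keywords
Category | Name | Description |
---|---|---|
Statement | CREATE MODEL | Creates and trains a model, and saves it. |
Statement | PREDICT BY | Runs inference with an existing model. |
Keyword | TARGET | Name of the target column for a training/inference task. |
Keyword | FEATURES | Names of the feature columns for a training/inference task. |
Keyword | MODEL | Name of the model for a training task. |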
50.2 Usage Guide
- Use the CREATE MODEL statement to create and train a model.
The following shows the SQL for model training. Suppose there is a dataset kmeans_2d with the following contents:
    openGauss=# select * from kmeans_2d;
     id |              position
    ----+-------------------------------------
      1 | {74.5268815685995,88.2141939294524}
      2 | {70.9565760521218,98.8114827475511}
      3 | {76.2756086327136,23.8387574302033}
      4 | {17.8495847294107,81.8449544720352}
      5 | {81.2175785354339,57.1677675866522}
      6 | {53.97752255667,49.3158342130482}
      7 | {93.2475341879763,86.934042100329}
      8 | {72.7659293473698,19.7020415100269}
      9 | {16.5800288529135,75.7475957670249}
     10 | {81.8520747194998,40.3476078575477}
     11 | {76.796671198681,86.3827232690528}
     12 | {59.9231450678781,90.9907738864422}
     13 | {70.161884885747,19.7427458665334}
     14 | {11.1269539105706,70.9988166182302}
     15 | {80.5005071521737,65.2822235273197}
     16 | {54.7030725912191,52.151339428965}
     17 | {103.059707058128,80.8419883321039}
     18 | {85.3574452036992,14.9910179991275}
     19 | {28.6501615960151,76.6922890325077}
     20 | {69.7285806713626,49.5416352967732}
    (20 rows)
The position column of this table has the data type double precision[].
Using kmeans_2d as the training set and position as the feature column, create and save the model point_kmeans with the kmeans algorithm.
    openGauss=# CREATE MODEL point_kmeans USING kmeans FEATURES position FROM kmeans_2d WITH num_centroids=3;
    NOTICE: Hyperparameter max_iterations takes value DEFAULT (10)
    NOTICE: Hyperparameter num_centroids takes value 3
    NOTICE: Hyperparameter tolerance takes value DEFAULT (0.000010)
    NOTICE: Hyperparameter batch_size takes value DEFAULT (10)
    NOTICE: Hyperparameter num_features takes value DEFAULT (2)
    NOTICE: Hyperparameter distance_function takes value DEFAULT (L2_Squared)
    NOTICE: Hyperparameter seeding_function takes value DEFAULT (Random++)
    NOTICE: Hyperparameter verbose takes value DEFAULT (0)
    NOTICE: Hyperparameter seed takes value DEFAULT (0)
    MODEL CREATED. PROCESSED 1
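In the command above:
- The CREATE MODEL statement trains and saves the model.
- The USING keyword specifies the algorithm name.
- FEATURES specifies the features used to train the model; list them using the column names of the training table.
- TARGET specifies the training target of the model. It can be a column name of the training table or an expression, for example: price > 10000.
- WITH specifies the hyperparameters used to train the model. When the user does not set a hyperparameter, the framework uses its default value.
The framework supports a different set of hyperparameters for each operator; see Table 2.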
Table 2 Hyperparameters supported by each operator
Operator | Hyperparameters |
---|---|
GD (logistic_regression, linear_regression, svm_classification) | optimizer(char*); verbose(bool); max_iterations(int); max_seconds(double); batch_size(int); learning_rate(double); decay(double); tolerance(double). The hyperparameter lambda(double) applies to SVM only. |
Kmeans | max_iterations(int); num_centroids(int); tolerance(double); batch_size(int); num_features(int); distance_function(char*); seeding_function(char*); verbose(int); seed(int) |

The current default value and value range of each hyperparameter are listed in Table 3.
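Table 3 Default values and value ranges of the hyperparameters
Operator | Hyperparameter (default) | Value range | Description |
---|---|---|---|
GD (logistic_regression, linear_regression, svm_classification) | optimizer = gd (gradient descent) | gd / ngd (natural gradient descent) | Optimizer |
GD | verbose = false | T/F | Log display |
GD | max_iterations = 100 | (0, INT_MAX_VALUE] | Maximum number of iterations |
GD | max_seconds = 0 (no limit on run time) | [0, INT_MAX_VALUE] | Run time |
GD | batch_size = 1000 | (0, MAX_MEMORY_LIMIT] | Number of samples selected per training step |
GD | learning_rate = 0.8 | (0, DOUBLE_MAX_VALUE] | Learning rate |
GD | decay = 0.95 | (0, DOUBLE_MAX_VALUE] | Weight decay rate |
GD | tolerance = 0.0005 | (0, DOUBLE_MAX_VALUE] | Tolerance |
GD | seed = 0 (a random value is used for the seed) | [0, INT_MAX_VALUE] | Seed |
GD (SVM only) | lambda = 0.01 | (0, DOUBLE_MAX_VALUE) | Regularization parameter |
Kmeans | max_iterations = 10 | [1, INT_MAX_VALUE] | Maximum number of iterations |
Kmeans | num_centroids = 10 | [1, MAX_MEMORY_LIMIT] | Number of clusters |
Kmeans | tolerance = 0.00001 | (0, 1) | Centroid error |
Kmeans | batch_size = 10 | [1, MAX_MEMORY_LIMIT] | Number of samples selected per training step |
Kmeans | num_features = 2 | [1, GS_MAX_COLS] | Number of features per input sample |
Kmeans | distance_function = "L2_Squared" | L1 / L2 / L2_Squared / Linf | Distance metric (norm) |
Kmeans | seeding_function = "Random++" | "Random++" / "KMeans\|\|" | Method for initializing the seed points |
Kmeans | verbose = 0U | { 0, 1, 2 } | Verbosity mode |
Kmeans | seed = 0U | [0, INT_MAX_VALUE] | Seed |

MAX_MEMORY_LIMIT = maximum number of tuples that can be loaded into memory
GS_MAX_COLS = maximum number of attributes in a single database table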
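The Kmeans ranges in Table 3 can be checked on the client before a CREATE MODEL statement is sent. The following is only a sketch under stated assumptions: `check_kmeans_hyperparameters` is a hypothetical helper, not part of openGauss. `INT_MAX` is assumed to be the 32-bit signed maximum. The upper bounds MAX_MEMORY_LIMIT and GS_MAX_COLS are server-side limits not given in the text, so only the lower bounds are checked for those parameters.

```python
# Hypothetical client-side check of the Kmeans hyperparameters from Table 3.
INT_MAX = 2**31 - 1  # assumption: INT_MAX_VALUE is the 32-bit signed maximum

# Defaults as listed in Table 3.
KMEANS_DEFAULTS = {
    "max_iterations": 10,
    "num_centroids": 10,
    "tolerance": 0.00001,
    "batch_size": 10,
    "num_features": 2,
    "distance_function": "L2_Squared",
    "seeding_function": "Random++",
    "verbose": 0,
    "seed": 0,
}

# One predicate per hyperparameter, following the "Value range" column.
KMEANS_RULES = {
    "max_iterations": lambda v: 1 <= v <= INT_MAX,
    "num_centroids": lambda v: v >= 1,   # upper bound MAX_MEMORY_LIMIT not checked
    "tolerance": lambda v: 0 < v < 1,    # open interval (0, 1)
    "batch_size": lambda v: v >= 1,      # upper bound MAX_MEMORY_LIMIT not checked
    "num_features": lambda v: v >= 1,    # upper bound GS_MAX_COLS not checked
    "distance_function": lambda v: v in {"L1", "L2", "L2_Squared", "Linf"},
    "seeding_function": lambda v: v in {"Random++", "KMeans||"},
    "verbose": lambda v: v in {0, 1, 2},
    "seed": lambda v: 0 <= v <= INT_MAX,
}


def check_kmeans_hyperparameters(user_params):
    """Merge user values over the defaults and return (params, errors).

    Values are assumed to already have the right Python type.
    """
    errors = [f"unknown hyperparameter: {name}"
              for name in sorted(user_params) if name not in KMEANS_DEFAULTS]
    params = dict(KMEANS_DEFAULTS)
    params.update({k: v for k, v in user_params.items() if k in KMEANS_DEFAULTS})
    for name, in_range in KMEANS_RULES.items():
        if not in_range(params[name]):
            errors.append(f"{name}={params[name]!r} is outside the documented range")
    return params, errors


params, errors = check_kmeans_hyperparameters({"num_centroids": 3})
print(params["num_centroids"], errors)
```

The server remains the authority on what is accepted; a check like this only catches obvious mistakes earlier.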
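If the model is saved successfully, the following message is returned:
    MODEL CREATED. PROCESSED x
- View model information.
After training completes, the model is stored in the system table gs_model_warehouse, which holds information about the model itself and about the training process.
You can inspect a model by querying this system table. For example, the following SQL statement shows the model named point_kmeans: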
    openGauss=# select * from gs_model_warehouse where modelname='point_kmeans';
    -[ RECORD 1 ]---------+------------------------------------------------------------
    modelname             | point_kmeans
    modelowner            | 10
    createtime            | 2021-04-30 17:30:39.59044
    processedtuples       | 20
    discardedtuples       | 0
    pre_process_time      | 6.2001e-05
    exec_time             | .000185272
    iterations            | 5
    outputtype            | 23
    modeltype             | kmeans
    query                 | CREATE MODEL point_kmeans USING kmeans FEATURES position FROM kmeans_2d WITH num_centroids=3;
    modeldata             |
    weight                |
    hyperparametersnames  | {max_iterations,num_centroids,tolerance,batch_size,num_features,distance_function,seeding_function,verbose,seed}
    hyperparametersvalues | {10,3,1e-05,10,2,L2_Squared,Random++,0,0}
    hyperparametersoids   | {23,23,701,23,23,1043,1043,23,23}
    coefnames             | {original_num_centroids,actual_num_centroids,dimension,distance_function_id,seed,coordinates}
    coefvalues            | {3,3,2,2,572368998,"(77.282589,23.724434)(74.421616,73.239455)(18.551682,76.320914)"}
    coefoids              |
    trainingscoresname    |
    trainingscoresvalue   |
    modeldescribe         | {"id:1,objective_function:542.851169,avg_distance_to_centroid:108.570234,min_distance_to_centroid:1.027078,max_distance_to_centroid:297.210108,std_dev_distance_to_centroid:105.053257,cluster_size:5","id:2,objective_function:5825.982139,avg_distance_to_centroid:529.634740,min_distance_to_centroid:100.270449,max_distance_to_centroid:990.300588,std_dev_distance_to_centroid:285.915094,cluster_size:11","id:3,objective_function:220.792591,avg_distance_to_centroid:55.198148,min_distance_to_centroid:4.216111,max_distance_to_centroid:102.117204,std_dev_distance_to_centroid:39.319118,cluster_size:4"}
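In the record above, the last element of coefvalues holds the centroid coordinates as one string. A minimal sketch of turning that string into numeric tuples follows. `parse_centroids` is a hypothetical helper, and numbering the clusters from 1 in stored order is an assumption that merely matches the id values in modeldescribe.

```python
import re


def parse_centroids(coordinates):
    """Parse a string like '(x1,y1)(x2,y2)' into a list of float tuples."""
    groups = re.findall(r"\(([^()]*)\)", coordinates)
    return [tuple(float(value) for value in group.split(",")) for group in groups]


# The coordinates string copied from the coefvalues field shown above.
coords = "(77.282589,23.724434)(74.421616,73.239455)(18.551682,76.320914)"
centroids = parse_centroids(coords)

# Assumption: cluster ids start at 1 and follow the stored order.
for cluster_id, centroid in enumerate(centroids, start=1):
    print(cluster_id, centroid)
```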
- Run inference with an existing model.
Use the SELECT and PREDICT BY keywords to run inference with an existing model.
Query syntax: SELECT…PREDICT BY…(FEATURES…)…FROM…;
    openGauss=# SELECT id, PREDICT BY point_kmeans (FEATURES position) as pos FROM (select * from kmeans_2d limit 10);
     id | pos
    ----+-----
      1 |   2
      2 |   2
      3 |   1
      4 |   3
      5 |   2
      6 |   2
      7 |   2
      8 |   1
      9 |   3
     10 |   1
    (10 rows)
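For the same inference task, the results of a given model are stable. Models trained with the same hyperparameters and the same training set are also stable. However, AI model training has random components (the data distribution of each batch, stochastic gradient descent), so small differences in computational behavior and results between different models are acceptable.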
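To make the PREDICT BY result for point_kmeans concrete, the sketch below assigns each of the first ten rows of kmeans_2d to its nearest stored centroid. It rests on assumptions: that inference for kmeans is nearest-centroid assignment under the model's distance_function (L2_Squared here), and that cluster ids start at 1 in the order the centroids appear in coefvalues. The names `squared_l2` and `assign_cluster` are hypothetical. Under those assumptions the sketch yields the same cluster ids as the pos column in the query output above.

```python
# Centroids copied from the coefvalues field of point_kmeans.
CENTROIDS = [
    (77.282589, 23.724434),
    (74.421616, 73.239455),
    (18.551682, 76.320914),
]

# The first ten rows of kmeans_2d (id -> position).
POINTS = {
    1: (74.5268815685995, 88.2141939294524),
    2: (70.9565760521218, 98.8114827475511),
    3: (76.2756086327136, 23.8387574302033),
    4: (17.8495847294107, 81.8449544720352),
    5: (81.2175785354339, 57.1677675866522),
    6: (53.97752255667, 49.3158342130482),
    7: (93.2475341879763, 86.934042100329),
    8: (72.7659293473698, 19.7020415100269),
    9: (16.5800288529135, 75.7475957670249),
    10: (81.8520747194998, 40.3476078575477),
}


def squared_l2(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_cluster(point, centroids):
    """Return the 1-based index of the centroid closest to point."""
    distances = [squared_l2(point, c) for c in centroids]
    return distances.index(min(distances)) + 1


predicted = {pid: assign_cluster(pos, CENTROIDS) for pid, pos in POINTS.items()}
for pid in sorted(predicted):
    print(pid, predicted[pid])
```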
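- View the execution plan.
The EXPLAIN statement can be used to analyze the execution plan of model training or prediction for CREATE MODEL and PREDICT BY. The EXPLAIN keyword can be followed directly by a CREATE MODEL or PREDICT BY statement (clause), or by optional parameters first. The supported parameters are listed in Table 4.
Table 4 Parameters supported by EXPLAIN
Parameter | Description |
---|---|
ANALYZE | Boolean. Appends run time, loop counts, and similar descriptive information. |
VERBOSE | Boolean. Controls whether training run information is output to the client. |
COSTS | Boolean. |
CPU | Boolean. |
DETAIL | Boolean. Not available. |
NODES | Boolean. Not available. |
NUM_NODES | Boolean. Not available. |
BUFFERS | Boolean. |
TIMING | Boolean. |
PLAN | Boolean. |
FORMAT | Optional output format: TEXT / XML / JSON / YAML. |

Example: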
    openGauss=# Explain CREATE MODEL patient_logisitic_regression USING logistic_regression FEATURES second_attack, treatment TARGET trait_anxiety > 50 FROM patients WITH batch_size=10, learning_rate = 0.05;
    NOTICE: Hyperparameter batch_size takes value 10
    NOTICE: Hyperparameter decay takes value DEFAULT (0.950000)
    NOTICE: Hyperparameter learning_rate takes value 0.050000
    NOTICE: Hyperparameter max_iterations takes value DEFAULT (100)
    NOTICE: Hyperparameter max_seconds takes value DEFAULT (0)
    NOTICE: Hyperparameter optimizer takes value DEFAULT (gd)
    NOTICE: Hyperparameter tolerance takes value DEFAULT (0.000500)
    NOTICE: Hyperparameter seed takes value DEFAULT (0)
    NOTICE: Hyperparameter verbose takes value DEFAULT (FALSE)
    NOTICE: GD shuffle cache size 212369
                                QUERY PLAN
    -------------------------------------------------------------------
     Gradient Descent  (cost=0.00..0.00 rows=0 width=0)
       ->  Seq Scan on patients  (cost=0.00..32.20 rows=1776 width=12)
    (2 rows)
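- Exception scenarios.
- Training phase.
- Scenario 1: If a hyperparameter is set to a value outside its valid range, model training fails and an ERROR describing the problem is returned. For example: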
    openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET trait_anxiety FROM patients WITH optimizer='aa';
    NOTICE: Hyperparameter batch_size takes value DEFAULT (1000)
    NOTICE: Hyperparameter decay takes value DEFAULT (0.950000)
    NOTICE: Hyperparameter learning_rate takes value DEFAULT (0.800000)
    NOTICE: Hyperparameter max_iterations takes value DEFAULT (100)
    NOTICE: Hyperparameter max_seconds takes value DEFAULT (0)
    NOTICE: Hyperparameter optimizer takes value aa
    ERROR: Invalid hyperparameter value for optimizer. Valid values are: gd, ngd. (default is gd)
- Scenario 2: If the model name already exists, the model cannot be saved and an ERROR giving the reason is returned:
    openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET trait_anxiety FROM patients;
    NOTICE: Hyperparameter batch_size takes value DEFAULT (1000)
    NOTICE: Hyperparameter decay takes value DEFAULT (0.950000)
    NOTICE: Hyperparameter learning_rate takes value DEFAULT (0.800000)
    NOTICE: Hyperparameter max_iterations takes value DEFAULT (100)
    NOTICE: Hyperparameter max_seconds takes value DEFAULT (0)
    NOTICE: Hyperparameter optimizer takes value DEFAULT (gd)
    NOTICE: Hyperparameter tolerance takes value DEFAULT (0.000500)
    NOTICE: Hyperparameter seed takes value DEFAULT (0)
    NOTICE: Hyperparameter verbose takes value DEFAULT (FALSE)
    NOTICE: GD shuffle cache size 5502
    ERROR: The model name "patient_linear_regression" already exists in gs_model_warehouse.
- Scenario 3: If the FEATURES or TARGET clause is *, an ERROR giving the reason is returned:
    openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES * TARGET trait_anxiety FROM patients;
    ERROR: FEATURES clause cannot be *

    openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET * FROM patients;
    ERROR: TARGET clause cannot be *
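- Scenario 4: Using the TARGET keyword with an unsupervised learning method, or omitting the TARGET keyword with a supervised learning method, returns an ERROR giving the reason. As the second example shows, a supervised method without a FEATURES clause is rejected as well: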
    openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment FROM patients;
    ERROR: Supervised ML algorithms require TARGET clause

    openGauss=# CREATE MODEL patient_linear_regression USING linear_regression TARGET trait_anxiety FROM patients;
    ERROR: Supervised ML algorithms require FEATURES clause
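The clause rules from Scenarios 3 and 4 can be mirrored in a small client-side pre-check. This is a sketch, not the server's logic. `check_clauses` is a hypothetical helper, and the supervised/unsupervised grouping follows the algorithm list at the top of this article. The first four messages are copied from the transcripts above. The wording for the unsupervised case is made up here, because the article does not show the server's message for it.

```python
SUPERVISED = {"logistic_regression", "linear_regression", "svm_classification"}
UNSUPERVISED = {"kmeans"}


def check_clauses(algorithm, features, target=None):
    """Return a list of problems found in a planned CREATE MODEL statement.

    features is a list of column names or expressions (possibly empty);
    target is a column name or expression, or None when TARGET is omitted.
    """
    problems = []
    if algorithm not in SUPERVISED | UNSUPERVISED:
        problems.append(f"unknown algorithm: {algorithm}")
    if "*" in features:
        problems.append("FEATURES clause cannot be *")
    if target == "*":
        problems.append("TARGET clause cannot be *")
    if algorithm in SUPERVISED:
        if not features:
            problems.append("Supervised ML algorithms require FEATURES clause")
        if target is None:
            problems.append("Supervised ML algorithms require TARGET clause")
    if algorithm in UNSUPERVISED and target is not None:
        # Wording is illustrative; the server's actual message is not shown above.
        problems.append("unsupervised algorithms do not take a TARGET clause")
    return problems


print(check_clauses("linear_regression", ["second_attack", "treatment"]))
print(check_clauses("kmeans", ["position"]))
```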
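- Scenario 5: If the GUC parameter statement_timeout is set, a CREATE MODEL statement whose training runs past the timeout is terminated. The size of the training set, the number of training rounds (iteration), the early-termination conditions (tolerance, max_seconds), the number of parallel threads (nthread), and similar parameters all affect training time. When the training time exceeds the database limit, the statement is terminated and model training fails.
- Inference phase.
- Scenario 6: If the model name cannot be found in the system table, the database reports an ERROR: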
    openGauss=# select id, PREDICT BY patient_logistic_regression (FEATURES second_attack,treatment) FROM patients;
    ERROR: There is no model called "patient_logistic_regression".
- Scenario 7: If the dimension or data type of the FEATURES used for inference does not match the training set, an ERROR giving the reason is reported. For example:
    openGauss=# select id, PREDICT BY patient_linear_regression (FEATURES second_attack) FROM patients;
    ERROR: Invalid number of features for prediction, provided 1, expected 2
    CONTEXT: referenced column: patient_linear_regression_pred

    openGauss=# select id, PREDICT BY patient_linear_regression (FEATURES 1,second_attack,treatment) FROM patients;
    ERROR: Invalid number of features for prediction, provided 3, expected 2
    CONTEXT: referenced column: patient_linear_regression_pre
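Since gs_model_warehouse keeps the original CREATE MODEL text in its query column, the expected number of features can be recovered from that text and compared with a planned PREDICT BY call. The sketch below is a rough illustration under assumptions. `training_features` and `check_feature_count` are hypothetical helpers. Splitting on commas is naive and would miscount feature expressions that contain commas themselves. The returned message simply reuses the wording from the transcript above.

```python
import re


def training_features(create_model_sql):
    """Extract the FEATURES list from a CREATE MODEL statement (naive split)."""
    match = re.search(r"\bFEATURES\b(.*?)\b(?:TARGET|FROM)\b",
                      create_model_sql, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("no FEATURES clause found")
    return [item.strip() for item in match.group(1).split(",") if item.strip()]


def check_feature_count(create_model_sql, predict_features):
    """Return None if the counts match, else a message in the server's wording."""
    expected = len(training_features(create_model_sql))
    provided = len(predict_features)
    if provided == expected:
        return None
    return (f"Invalid number of features for prediction, "
            f"provided {provided}, expected {expected}")


query = ("CREATE MODEL patient_linear_regression USING linear_regression "
         "FEATURES second_attack,treatment TARGET trait_anxiety FROM patients;")
print(training_features(query))
print(check_feature_count(query, ["second_attack"]))
```

A count check like this does not cover data types, which the server also validates at inference time.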