Data Warehouse Performance Testing: Methodology and Toolset

Contents
Data warehouses vs. traditional databases
Data warehouse performance test case
- Performance metrics
- Test plan
  - Test scenario
  - Test dataset
  - Test cases
  - Performance metrics
  - Test scripts and tools
- Baseline environment setup
  - Hardware environment
  - Software environment
- Test procedure
  - Cloudwave steps
    - Importing the dataset
    - TestCase 1. Run the 13 standard SQL test statements
    - TestCase 2. Run the multi-table join extension SQL1 test statement
    - TestCase 3. Run the multi-table join extension SQL2 test statement
  - StarRocks steps
    - Importing the dataset
    - TestCase 1. Run the 13 standard SQL test statements
    - TestCase 2. Run the multi-table join extension SQL1 test statement
    - TestCase 3. Run the multi-table join extension SQL2 test statement
- Analysis of the test results
From data warehouses to cloud-native data warehouses

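Data warehouses vs. traditional databases

The rise of 5G networks and IoT technology, together with an increasingly complex and volatile business environment, is pushing traditional industries such as manufacturing, energy, transportation, education, and healthcare to begin their digital transformation. Because of the long-tail effect, digital transformation across so many industries will inevitably release more data than ever before. How to store, analyze, and use this data effectively, so that it helps a company optimize its operating decisions or even create new customer-acquisition and business models that make it more competitive, has become an urgent problem for business owners.

Against this background, we have seen more and more data warehouse products come onto the market in recent years. A data warehouse is a data management system for integrating, storing, and analyzing large volumes of structured and unstructured data. Compared with a traditional database, which is used only for data storage, a data warehouse is a purpose-built solution that combines data storage, data analysis, and data management. It emphasizes ease of use, analyzability, and manageability, and provides data cleansing, integration, transformation, complex queries, report generation, and data analysis, helping companies make data-driven decisions and run digital operations.

More specifically, the table below compares the two at a technical level: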

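Item	Traditional database	Cloud-native data warehouse
Target needs	Oriented toward data storage; mainly supports transaction processing for business operations.	Oriented toward large-scale data storage and high-performance data analysis; mainly used for data analysis and decision support, to meet a company's reporting, analysis, and data mining needs.
Data structure and organization	Usually organizes data as tables in a relational data model, manipulated through SQL statements.	Uses a star or snowflake schema that organizes data into fact tables and dimension tables, processed through complex query and analysis operations.
Data processing complexity	Usually handles relatively small-scale, real-time data.	Usually handles very large data volumes, involves integrating and transforming data from multiple source systems, and must handle complex query and analysis operations while remaining compatible with SQL.
Scalability	Scaling takes a long cycle, from analysis through planning to implementation.	Online horizontal scaling, completed within minutes.
Data volume	Generally performs well up to around the TB level; maintenance becomes harder as the data grows.	Supports TB to PB scale, with instance management and monitoring handled through platform management features.
DBA maintenance cost	Heavy workload; middleware, SQL tuning, and performance analysis require a DBA with extensive technical experience.	Platform-based, modular operations and maintenance make DBA work easier and more efficient.
Data sharding	A middleware layer is introduced and its sharding rules are maintained by hand; badly chosen rules easily lead to data skew.	The distributed database has its own routing and sharding algorithm; data is distributed fairly evenly and can be adjusted as needed.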

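As this shows, in an era when the value of data is exploding, data warehouses have use cases across many industries, for example:

Finance and banking: a data warehouse platform is used to analyze and model large volumes of financial data in support of risk assessment, transaction analysis, and decision-making.
Retail and e-commerce: a data warehouse platform is used for sales analysis, supply chain analysis, customer behavior analysis, and so on, helping retailers understand product sales, optimize inventory strategy, improve customer satisfaction, and run personalized recommendations and marketing campaigns.
Marketing and advertising: a data warehouse platform is used to integrate market data and customer behavior data from different channels, helping companies understand customer needs and supporting target market analysis, advertising effectiveness evaluation, customer segmentation, and similar work.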

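For these reasons, we wanted to keep pace with the market and examine the characteristics of the available data warehouse products in support of our company's technology selection. Technology selection is systematic, rigorous work that has to weigh functionality, performance, maturity, controllability, cost, and more. This article focuses on performance and explores a reusable performance testing approach covering three things: performance metrics, methodology, and toolset.

Data warehouse performance test case

Performance metrics

Performance metrics for a data warehouse have to be set for the specific application scenario, but they usually cover the following:

Read/write performance: how well the data warehouse reads and writes data, including throughput (requests processed per second), latency (response time per request), and concurrency (number of requests handled at the same time).
Horizontal scalability: how well the data warehouse scales out in a large system, that is, whether it can expand elastically as client concurrency grows and deliver a linear gain in performance.
Data consistency: how strongly the data warehouse guarantees data consistency in a distributed environment. Depending on the application scenario, the emphasis may fall on strong, weak, or eventual consistency.
Fault recovery and high availability: how well the data warehouse recovers from failures and stays available. Scenarios such as node failure or network partition can be simulated to evaluate failover and data recovery performance.
Data security: how the data warehouse performs at protecting data, including backup and restore speed, data encryption, and access control.
Cluster management and resource utilization: how the data warehouse performs at cluster management and resource use, including dynamic scale-out and scale-in of nodes, load balancing, and resource utilization.
Database management tool performance: how the data warehouse's management tools perform at configuration, monitoring, diagnosis, and optimization.

This article concentrates on hands-on testing of read and write performance.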

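Test plan

To round out the test process, and given the broader trend toward Chinese-developed data warehouses, this article tests two such products that are relatively easy to obtain and that both use Hadoop as the underlying distributed file system:

Cloudwave 4.0 (released May 2023) is a commercial cloud-native data warehouse product developed in China by 北京翰云时代数据技术有限公司.
StarRocks 3.0 (released April 2023) is an open-source data warehouse product developed in China and licensed under the Elastic License 2.0.

Both products have detailed installation, deployment, and operation documentation, so please consult it yourself. The rest of this article records the test steps and does not repeat basic installation and deployment.

Cloudwave: https://github.com/CloudwaveDatabase/cloudwave
StarRocks: https://github.com/StarRocks/starrocks

Test scenario

This article looks first at SQL reads and writes on structured data, the more widely applicable scenario.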

Test dataset

The test dataset is the widely used SSB1000 dataset (the Star Schema Benchmark at scale factor 1000). Its main contents are listed in the table below:

Table	Rows	Description
lineorder	6 billion	SSB line order table
customer	30 million	SSB customer table
part	2 million	SSB part table
supplier	2 million	SSB supplier table
dates	2,556	Date table

Test cases

TestCase 1. Run the 13 standard SQL test statements.

use ssb1000;

# 1
select sum(lo_revenue) as revenue from lineorder,dates where lo_orderdate = d_datekey and d_year = 1993 and lo_discount between 1 and 3 and lo_quantity < 25;
# 2
select sum(lo_revenue) as revenue from lineorder,dates where lo_orderdate = d_datekey and d_yearmonthnum = 199401 and lo_discount between 4 and 6 and lo_quantity between 26 and 35;
# 3
select sum(lo_revenue) as revenue from lineorder,dates where lo_orderdate = d_datekey and d_weeknuminyear = 6 and d_year = 1994 and lo_discount between 5 and 7 and lo_quantity between 26 and 35;
# 4
select sum(lo_revenue) as lo_revenue, d_year, p_brand from lineorder ,dates,part,supplier where lo_orderdate = d_datekey and lo_partkey = p_partkey and lo_suppkey = s_suppkey and p_category = 'MFGR#12' and s_region = 'AMERICA' group by d_year, p_brand order by d_year, p_brand;
# 5
select sum(lo_revenue) as lo_revenue, d_year, p_brand from lineorder,dates,part,supplier where lo_orderdate = d_datekey and lo_partkey = p_partkey and lo_suppkey = s_suppkey and p_brand between 'MFGR#2221' and 'MFGR#2228' and s_region = 'ASIA' group by d_year, p_brand order by d_year, p_brand;
# 6
select sum(lo_revenue) as lo_revenue, d_year, p_brand from lineorder,dates,part,supplier where lo_orderdate = d_datekey and lo_partkey = p_partkey and lo_suppkey = s_suppkey and p_brand = 'MFGR#2239' and s_region = 'EUROPE' group by d_year, p_brand order by d_year, p_brand;
# 7
select c_nation, s_nation, d_year, sum(lo_revenue) as lo_revenue from lineorder,dates,customer,supplier where lo_orderdate = d_datekey and lo_custkey = c_custkey and lo_suppkey = s_suppkey and c_region = 'ASIA' and s_region = 'ASIA'and d_year >= 1992 and d_year <= 1997 group by c_nation, s_nation, d_year order by d_year asc, lo_revenue desc;
# 8
select c_city, s_city, d_year, sum(lo_revenue) as lo_revenue from lineorder,dates,customer,supplier where lo_orderdate = d_datekey and lo_custkey = c_custkey and lo_suppkey = s_suppkey and  c_nation = 'UNITED STATES' and s_nation = 'UNITED STATES' and d_year >= 1992 and d_year <= 1997 group by c_city, s_city, d_year order by d_year asc, lo_revenue desc;
# 9
select c_city, s_city, d_year, sum(lo_revenue) as lo_revenue from lineorder,dates,customer,supplier where lo_orderdate = d_datekey and lo_custkey = c_custkey and lo_suppkey = s_suppkey and (c_city='UNITED KI1' or c_city='UNITED KI5') and (s_city='UNITED KI1' or s_city='UNITED KI5') and d_year >= 1992 and d_year <= 1997 group by c_city, s_city, d_year order by d_year asc, lo_revenue desc;
# 10
select c_city, s_city, d_year, sum(lo_revenue) as lo_revenue from lineorder,dates,customer,supplier where lo_orderdate = d_datekey and lo_custkey = c_custkey and lo_suppkey = s_suppkey and (c_city='UNITED KI1' or c_city='UNITED KI5') and (s_city='UNITED KI1' or s_city='UNITED KI5') and d_yearmonth  = 'Dec1997' group by c_city, s_city, d_year order by d_year asc, lo_revenue desc;
# 11
select d_year, c_nation, sum(lo_revenue) - sum(lo_supplycost) as profit from lineorder,dates,customer,supplier,part where lo_orderdate = d_datekey and lo_custkey = c_custkey and lo_suppkey = s_suppkey and lo_partkey = p_partkey and c_region = 'AMERICA' and s_region = 'AMERICA' and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2') group by d_year, c_nation order by d_year, c_nation;
# 12
select d_year, s_nation, p_category, sum(lo_revenue) - sum(lo_supplycost) as profit from lineorder,dates,customer,supplier,part where lo_orderdate = d_datekey and lo_custkey = c_custkey and lo_suppkey = s_suppkey and lo_partkey = p_partkey and c_region = 'AMERICA'and s_region = 'AMERICA' and (d_year = 1997 or d_year = 1998) and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2') group by d_year, s_nation, p_category order by d_year, s_nation, p_category;
# 13
select d_year, s_city, p_brand, sum(lo_revenue) - sum(lo_supplycost) as profit from lineorder,dates,customer,supplier,part where lo_orderdate = d_datekey and lo_custkey = c_custkey and lo_suppkey = s_suppkey and lo_partkey = p_partkey and c_region = 'AMERICA'and s_nation = 'UNITED STATES' and (d_year = 1997 or d_year = 1998) and p_category = 'MFGR#14' group by d_year, s_city, p_brand order by d_year, s_city, p_brand;

TestCase 2. Run the multi-table join extension SQL1 test statement.

select count(*) from lineorder,customer where lo_custkey = c_custkey;

TestCase 3. Run the multi-table join extension SQL2 test statement.

select count(*) from lineorder,customer,supplier where lo_custkey = c_custkey and lo_suppkey = s_suppkey;

Performance metrics

Two of the most common performance metrics are used here:

Peak CPU usage;
TestCase execution time.

To reduce noise in the results, each TestCase runs the SQL test script for 19 rounds. Note that the first round has to be discarded, because the first query is affected by variable system I/O factors such as cold caches. The remaining 18 rounds are then averaged to obtain a more accurate average SQL execution time.
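
The averaging rule above can be sketched in a few lines of shell. This is only a minimal sketch, not the tooling used in this article: the function name `avg_skip_first` and the sample timings are hypothetical, and it assumes one elapsed time in seconds per round, one per line, in run order.

```shell
#!/bin/bash
# Average per-round timings read from stdin (one number per line, in run
# order), discarding the first round as a cold warm-up run.
avg_skip_first() {
    awk 'NR > 1 { sum += $1; n++ }
         END { if (n > 0) printf "%.3f\n", sum / n; else print "n/a" }'
}

# Hypothetical timings for 4 rounds; the 12.0 is the cold first round.
printf '%s\n' 12.0 7.5 7.7 7.6 | avg_skip_first
```

With 19 rounds of input the same function averages the remaining 18, which matches the rule described above.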

Test scripts and tools

Cloudwave test script:

#!/bin/bash
# Program:
#       test ssb
# History:
# 2023/03/17    junfenghe.cloud@qq.com  version:0.0.1

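# Remove the logs of any previous run.
rm -rf ./n*txt
# Run the SQL file 19 times; round i writes its output to n<i>.txt.
for ((i=1; i<20; i++))
do
    cat sql_ssb.sql | ./cplus.sh > n${i}.txt
done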

StarRocks test script:

#!/bin/bash
# Program:
#       test ssb
# History:
# 2023/03/17    junfenghe.cloud@qq.com  version:0.0.1

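# Remove the logs of any previous run.
rm -rf ./n*txt
# Run the SQL file 19 times; round i writes its output to n<i>.txt.
for ((i=1; i<20; i++))
do
    # The verbose flags make the mysql client print per-statement timing in batch mode.
    cat sql_ssb.sql | mysql -uroot -P 9030 -h 127.0.0.1 -v -vv -vvv > n${i}.txt
done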

Result analysis script:

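#!/bin/bash
# Program:
#       analysis cloudwave/starrocks logs of base compute
# History:
# 2023/02/20     junfenghe.cloud@qq.com  version:0.0.1
# Usage:
#       ./analysis.sh <cloudwave|starrocks> "<log files>" [separator]

PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/sbin:/usr/local/bin:~/bin
export PATH

# Separator appended after each timing; override it with the third argument.
suff="(s)#####"

if [ -z "${1}" ]
then
    echo "Please input the database name"
    exit 1
fi

if [ -z "${2}" ]
then
    echo "Please input the log files to scan"
    exit 1
fi

if [ -n "${3}" ]
then
    suff=${3}
fi

# ${2} is left unquoted on purpose so that it splits into one file per word.
for current in ${2}
do
    result_time=""

    if [ "${1}" == "starrocks" ]
    then
        # Keep the number inside "(... sec)" on each timing line.
        for time in $( cat ${current} | grep sec  | awk -F '('  '{print $2}' | awk -F ' ' '{print $1}' )
        do
            result_time="${result_time}${time}${suff}"
        done
    elif [ "${1}" == "cloudwave" ]
    then
        # Take the second field of each "Elapsed" line, rewrite ':' as '*60+',
        # and drop zero-valued terms.
        for time in $( cat ${current} | grep Elapsed | awk '{print $2}'| sed 's/:/*60+/g'| sed 's/+00\*60//g ; s/+0\*60//g ; s/^0\*60+//g' )
        do
            result_time="${result_time}${time}${suff}"
        done
    fi
    # Print the timings of this file on one line, without the trailing separator.
    echo ${result_time%${suff}*}
done

exit 0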

sql_ssb.sql file: holds the SQL test statements of the TestCase being run; the test scripts read it.
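
When the analysis script is called with `+` as its separator, it prints one line per log file in which the individual timings are joined by `+`. A possible way to turn each such line into a per-round total is sketched below. The function name `sum_lines` and the sample lines are hypothetical, and the sketch assumes every term is a plain number of seconds (no `*60` terms left in the line).

```shell
#!/bin/bash
# Sum each "+"-joined line of timings read from stdin and print one total
# per input line. Assumes plain decimal seconds in every term.
sum_lines() {
    awk -F '[+]' '{ total = 0
                    for (i = 1; i <= NF; i++) total += $i
                    printf "%.3f\n", total }'
}

# Hypothetical analysis output for two rounds of three statements each.
printf '%s\n' '0.5+0.25+0.25' '1.5+0.5+1.0' | sum_lines
```

Lines that still contain `*60` terms would need a real expression evaluator such as `bc` instead of this field-splitting approach.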

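Baseline environment setup

Hardware environment

To make the test environment easy to prepare and to save cost, while staying close to a typical distributed deployment, the hardware consists of 4 Alibaba Cloud instances with 64 cores and 256 GB of memory each, forming a distributed cluster. To keep disk I/O from becoming the performance bottleneck, each instance also has an ESSD PL1 high-performance cloud disk attached.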

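Software environment

JDK 19: required by Cloudwave 4.0
JDK 8: required by StarRocks 3.0
MySQL 8: used as the client for connecting to the StarRocks FE (frontend)
Hadoop 3.2.2: distributed storage for both Cloudwave and StarRocks, with the file replication factor set to 2.

Test procedure

Cloudwave steps

Importing the dataset

Check the storage space prepared for Hadoop.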

$ ./sync_scripts.sh 'df -h' | grep home

Format the Hadoop storage.

$ hdfs namenode -format

Start HDFS and check the service status.

$ start-dfs.sh
$ ./sync_scripts.sh 'jps'

Create the upload directory for the SSB1000 dataset and upload the data.

$ hdfs dfs -mkdir /cloudwave
$ hdfs dfs -mkdir /cloudwave/uploads
$ hdfs dfs -put ssb1000 /cloudwave/uploads/

Check the upload result. The SSB1000 dataset takes up 606 GB of storage.

$ hdfs dfs -du -h /
$ du -sh /home/cloudwave/ssb-poc-0.9.3/ssb-poc/output/data_dir/ssb1000

Start Cloudwave.

$ ./start-all-server.sh

Import the SSB1000 dataset.

$ ./cplus_go.bin -s 'loaddata ssb1000'

Because the dataset is very large, the import takes a long time, about 58 minutes.
The HDFS commands show that Cloudwave compresses the data as it imports it, which is one of Cloudwave's features. SSB1000 is 606G raw and 360G after import. The 720G in the figure below is the total size of the 2 replicas in HDFS, so the compressed data comes to a respectable 59% of the original size (360G / 606G).
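
The arithmetic behind the 59% figure can be checked with a short sketch. The helper name `pct` and the variable names are hypothetical; the three input numbers are the ones reported above.

```shell
#!/bin/bash
# Print part/total as a percentage with one decimal place.
pct() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", part / total * 100 }'
}

hdfs_total_gb=720   # space reported by HDFS, all replicas included
replicas=2          # HDFS replication factor used in this test
raw_gb=606          # size of the raw SSB1000 files

compressed_gb=$(( hdfs_total_gb / replicas ))   # size of a single copy
echo "single copy: ${compressed_gb}G"
echo "compressed/raw: $(pct "$compressed_gb" "$raw_gb")%"
```

Dividing by the replication factor first matters: comparing the 720G total directly with the 606G raw size would make the imported data look larger than the source.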

TestCase 1. Run the 13 standard SQL test statements

Write the 13 standard SQL test statements of TestCase 1 into the sql_ssb.sql file, then run the Cloudwave test script while monitoring and recording CPU usage.

$ ./test_ssb.sh

The result is shown in the figure below. In TestCase 1, the peak CPU usage averaged over the 4 nodes of the Cloudwave cluster was 5763% / 6400% = 90% (note: a 64-core CPU totals 6400%).
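
The article does not show how the CPU samples were recorded, so the following is only a sketch of the normalization step under stated assumptions: the samples are `top`-style %CPU values for one node (100 means one fully used core), one per line, and the function name `peak_cpu` and the sample values other than 5763 are hypothetical.

```shell
#!/bin/bash
# Read %CPU samples (one per line, top-style: 100 = one full core) from
# stdin and print the peak sample plus that peak as a share of the node.
peak_cpu() {
    local cores="$1"
    awk -v cores="$cores" '
        NR == 1 || $1 + 0 > max { max = $1 + 0 }
        END { if (NR > 0) printf "%g %.1f\n", max, max / (cores * 100) * 100 }'
}

# Hypothetical samples taken on one 64-core node during a run.
printf '%s\n' 5100 5763 4800 | peak_cpu 64
```

The first output field is the raw peak and the second is the peak as a percentage of the node's full capacity (64 cores, 6400%).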

As shown in the figure below, running the analysis script gives an average time of 7.6s for TestCase 1.

$ ./analysis.sh cloudwave "$(ls n*txt)" +

TestCase 2. Run the multi-table join extension SQL1 test statement

Write the multi-table join extension SQL1 test statement of TestCase 2 into the sql_ssb.sql file, then run the Cloudwave test script while monitoring and recording CPU usage.

$ ./test_ex.sh

The result is shown in the figure below. In TestCase 2, the peak CPU usage averaged over the 4 nodes of the Cloudwave cluster was 0.0935% (6% / 6400%).

As shown in the figure below, running the analysis script gives an average time of 12ms for TestCase 2.

$ ./analysis.sh cloudwave "$(ls n*txt)" +

TestCase 3. Run the multi-table join extension SQL2 test statement

Write the multi-table join extension SQL2 test statement of TestCase 3 into the sql_ssb.sql file, then run the Cloudwave test script while monitoring and recording CPU usage.

$ ./test_ex.sh

The result is shown in the figure below. In TestCase 3, the peak CPU usage averaged over the 4 nodes of the Cloudwave cluster was 0.118% (7.6% / 6400%).

As shown in the figure below, running the analysis script gives an average time of 14ms for TestCase 3.

$ ./analysis.sh cloudwave "$(ls n*txt)" +

StarRocks steps

Importing the dataset

Clear the HDFS storage.

$ hdfs dfs -rm -r /cloudwave
$ hdfs dfs -ls /

Start the StarRocks FE (frontend) daemon.

$ ./fe/bin/start_fe.sh --daemon

Add the StarRocks BE (backend) nodes.

$ mysql -uroot -h127.0.0.1 -P9030
mysql> ALTER SYSTEM ADD BACKEND "172.17.161.33:9050";
mysql> ALTER SYSTEM ADD BACKEND "172.17.161.32:9050";
mysql> ALTER SYSTEM ADD BACKEND "172.17.161.31:9050";
mysql> ALTER SYSTEM ADD BACKEND "172.17.161.30:9050";

Start the StarRocks BE daemons.

$ ./sync_scripts.sh "cd $(pwd)/be/bin && ./start_be.sh --daemon && ps -ef | grep starrocks_be"

Verify the StarRocks cluster status: check each of the 4 nodes in turn until all of them report Alive=true.
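
The Alive check can also be scripted rather than read by eye. The sketch below counts the backends that report `Alive: true` in the vertical (`\G`) output of `SHOW BACKENDS`. The function name `count_alive` and the two-line sample are hypothetical stand-ins for real cluster output, and the exact layout of that output is an assumption.

```shell
#!/bin/bash
# Count backends reported as alive in "SHOW BACKENDS\G"-style output read
# from stdin (vertical format, one "Name: value" pair per line).
count_alive() {
    grep -c -E '^[[:space:]]*Alive:[[:space:]]*true[[:space:]]*$'
}

# Against a live cluster this would be fed by something like:
#   mysql -uroot -h127.0.0.1 -P9030 -e 'SHOW BACKENDS\G' | count_alive
# Here a hypothetical two-backend excerpt stands in for the real output.
printf '%s\n' \
    '           Alive: true' \
    '           Alive: false' | count_alive
```

For the cluster in this test, the import should start only once the count reaches 4.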

Create the tables.
Start importing the data. Importing SSB1000 took 112 minutes in total.

$ date && ./bin/stream_load.sh data_dir/ssb1000 && date

During the import, it turned out that although the HDFS replication factor was set to 2, StarRocks automatically changed the number of replicas to 3.
In addition, StarRocks did not appear to compress the data during import: it took up 1T of storage, so the import time was correspondingly longer.

TestCase 1. Run the 13 standard SQL test statements

Write the 13 standard SQL test statements of TestCase 1 into the sql_ssb.sql file, then run the StarRocks test script while monitoring and recording CPU usage.

$ ./test_ssb.sh

The result is shown in the figure below. In TestCase 1, the peak CPU usage averaged over the 4 nodes of the StarRocks cluster was 67% (4266% / 6400%).

As shown in the figure below, running the analysis script gives an average time of 10.39s for TestCase 1.

$ ./analysis.sh starrocks "$(ls n*txt)" +

TestCase 2. Run the multi-table join extension SQL1 test statement

Write the multi-table join extension SQL1 test statement of TestCase 2 into the sql_ssb.sql file, then run the StarRocks test script while monitoring and recording CPU usage.

$ ./test_ex.sh

The result is shown in the figure below. In TestCase 2, the peak CPU usage averaged over the 4 nodes of the StarRocks cluster was 78.7% (5037% / 6400%).

As shown in the figure below, running the analysis script gives an average time of 2.79s for TestCase 2.

$ ./analysis.sh starrocks "$(ls n*txt)" +

TestCase 3. Run the multi-table join extension SQL2 test statement

Write the multi-table join extension SQL2 test statement of TestCase 3 into the sql_ssb.sql file, then run the StarRocks test script while monitoring and recording CPU usage.

$ ./test_ex.sh

The result is shown in the figure below. In TestCase 3, the peak CPU usage averaged over the 4 nodes of the StarRocks cluster was 90.5% (5797% / 6400%).

As shown in the figure below, running the analysis script gives an average time of 4.8s for TestCase 3.

$ ./analysis.sh starrocks "$(ls n*txt)" +

Analysis of the test results

Results for the 13 standard SQL test statements:

Data warehouse	Dataset	Response time (s)	Peak CPU usage	Storage compression ratio (stored/raw)	Data import time
Cloudwave 4.0	ssb1000	7.602	90% (5763%/6400%)	59% (360G/606G)	58 minutes
StarRocks 3.0	ssb1000	10.397	66.6% (4266%/6400%)	169% (1024G/606G)	112 minutes

Results for the 2 multi-table join extension SQL test statements:

Data warehouse	Dataset	Extension SQL1 response time (s)	Extension SQL1 peak CPU usage	Extension SQL2 response time (s)	Extension SQL2 peak CPU usage
Cloudwave 4.0	ssb1000	0.012	0.0935% (6%/6400%)	0.014	0.118% (7.6%/6400%)
StarRocks 3.0	ssb1000	2.79	78.7% (5037%/6400%)	4.8	90.5% (5797%/6400%)

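These results show that the Cloudwave cloud-native data warehouse performs very strongly. In the multi-table join extension SQL scenario in particular, Cloudwave 4.0 used very little CPU and also executed very quickly.

Of course, data warehouse performance tuning and testing is a complex piece of systems engineering. For reasons of space, this article covers only a limited set of test scenarios and performance metrics, mainly for study, research, and discussion; many details are worth optimizing and extending further.

From data warehouses to cloud-native data warehouses

Finally, a few notes on what I learned. I used to think of a database as simply DBMS (Database Management System) software for storing structured or unstructured data. But as more and more users recognize the value of data mining and want to apply data to improving real-world productivity, databases built purely for storage have gradually had more and more business application features layered on top, and have evolved into data warehouses built for data analysis.

Take Cloudwave 4.0, which is built on a cloud-native architecture, as an example. As the product architecture in the figure below shows, besides the usual storage of structured and unstructured data, Cloudwave has a data service layer facing top-level applications, which uses a variety of SDK drivers to offer applications data storage, data management, platform management, service access plugins, and other capabilities.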

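Cloudwave's support for parallel full-text search impressed me in particular; this capability is essential in text-processing scenarios. Below is a passage quoted from the 《翰云数据库技术白皮书》 (the Hanyun database technical white paper). The white paper is also recommended reading for more technical detail.

Cloudwave can build full-text indexes on CLOB large-text fields and on Bfile files (e.g. common PDF, Word, Excel, PPT, Txt, and Html files). It stores Lucene indexes on HDFS, which keeps the index data safe, and automatically splits the Lucene index data into segments that are managed evenly across multiple servers. During a full-text search, multiple servers search the index segments in parallel, which improves query efficiency. When processing Bfile files, it uses existing parsing libraries to detect and extract metadata and structured content from documents in different formats.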

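In addition, the Cloudwave cloud-native data warehouse builds on the cloud-native technology stack, which brings further advantages in cluster management, for example:

Elastic scalability: it can scale elastically on demand, adjusting resources automatically as data volume and workload change. This lets the data warehouse handle large datasets and highly concurrent queries and keep up with growing business needs.
Flexibility and agility: it can adapt quickly to business changes and new data analysis needs, and integrates seamlessly with a range of analysis tools and technologies on multiple cloud-native platforms.
Strong ecosystem support: it is easy to integrate with other cloud services and tools, such as machine learning platforms and visualization platforms. Being tightly coupled with the cloud provider's ecosystem, it can pick up new technology and feature updates quickly and benefit from strong support and services.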
