1. Partitioned Tables
A partitioned table corresponds to a separate directory in the HDFS file system, and that directory holds all of the data files for the partition. A partition in Hive is simply a subdirectory: a large data set is split into smaller ones according to business needs. At query time you select only the partitions you need through expressions in the WHERE clause, which makes such queries much more efficient.
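For example, for the dept_par table created below, which is partitioned by day, the warehouse layout on HDFS would look roughly like this (the paths are illustrative and assume the default warehouse location of the hive3 database):
/user/hive/warehouse/hive3.db/dept_par/day=20200401/dept_20200401.log
/user/hive/warehouse/hive3.db/dept_par/day=20200402/dept_20200402.log
/user/hive/warehouse/hive3.db/dept_par/day=20200403/dept_20200403.log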
2. Basic Operations on Partitioned Tables
1) Why partitioned tables (logs need to be managed by date; here we simulate this with department data)
2) Syntax for creating a partitioned table
hive (hive3)> create table dept_par(deptno int, dname string, loc string) partitioned by (day string) row format delimited fields terminated by '\t';
OK
Time taken: 2.547 seconds
Note: the partition column must not be a column that already exists in the table; you can think of the partition column as a pseudo-column of the table.
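For example, the following DDL is rejected because the partition column repeats an existing table column (the table name dept_bad is made up for this illustration, and the exact error wording may vary by Hive version):
hive (hive3)> create table dept_bad(deptno int, dname string, loc string) partitioned by (deptno string);
FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns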
3) Load data into the partitioned table
hive (hive3)> load data local inpath '/home/zzdq/hive/dept_20200401.log' into table dept_par partition(day='20200401');
Loading data to table hive3.dept_par partition (day=20200401)
OK
Time taken: 1.763 seconds
hive (hive3)> load data local inpath '/home/zzdq/hive/dept_20200402.log' into table dept_par partition(day='20200402');
Loading data to table hive3.dept_par partition (day=20200402)
OK
Time taken: 0.962 seconds
hive (hive3)> load data local inpath '/home/zzdq/hive/dept_20200403.log' into table dept_par partition(day='20200403');
Loading data to table hive3.dept_par partition (day=20200403)
OK
Time taken: 0.869 seconds
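To confirm that each load created its own partition directory, you can list the table directory on HDFS (the path below assumes the default warehouse location; adjust it to your environment):
hive (hive3)> dfs -ls /user/hive/warehouse/hive3.db/dept_par;
The listing should show the subdirectories day=20200401, day=20200402 and day=20200403.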
4) Query all the data (note the extra partition column in the result; it is not stored inside the data files but encoded in the directory names, which is why filtering on the partition is much faster)
hive (hive3)> select * from dept_par;
OK
dept_par.deptno dept_par.dname dept_par.loc dept_par.day
10 ACCOUNTING 1700 20200401
20 RESEARCH 1800 20200401
30 SALES 1900 20200402
40 OPERATIONS 1700 20200402
50 TEST 2000 20200403
60 DEV 1900 20200403
NULL NULL NULL 20200403
Time taken: 3.684 seconds, Fetched: 7 row(s)
5) Query with a partition filter
hive (hive3)> select * from dept_par where day = 20200401;
OK
dept_par.deptno dept_par.dname dept_par.loc dept_par.day
10 ACCOUNTING 1700 20200401
20 RESEARCH 1800 20200401
Time taken: 2.749 seconds, Fetched: 2 row(s)
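If you want to verify that only the matching partition directory is scanned, you can inspect the query plan; only the command is shown here because the plan output depends on the Hive version:
hive (hive3)> explain select * from dept_par where day='20200401';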
In addition, the partition metadata is stored in the metastore; with a MySQL-backed metastore it lives in the PARTITIONS table.
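For example, with a MySQL-backed metastore (assumed here to be a database named metastore), the partition names can be inspected directly; this is for illustration only, and the metastore tables should never be edited by hand:
mysql> use metastore;
mysql> select PART_NAME, TBL_ID from PARTITIONS;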
3. Adding, Dropping, and Querying Partitions
1) Single-partition query
hive (default)> select * from dept_partition where day='20200401';
2) Multi-partition query (via union or OR)
hive (default)> select * from dept_partition where day='20200401'
union
select * from dept_partition where day='20200402'
union
select * from dept_partition where day='20200403';
hive (default)> select * from dept_partition where day='20200401' or day='20200402' or day='20200403';
3) Add partitions
Add a single partition
hive (default)> alter table dept_partition add partition(day='20200404');
Add multiple partitions (note that the partition clauses are separated by spaces)
hive (hive3)> alter table dept_par add partition(day='20200404') partition(day='20200405') partition(day='20200406');
OK
Time taken: 0.844 seconds
4) Drop partitions
Drop a single partition
hive (default)> alter table dept_partition drop partition (day='20200406');
Drop multiple partitions at once (here the partition clauses are separated by commas)
hive (hive3)> alter table dept_par drop partition(day='20200404'),partition(day='20200405'),partition(day='20200406');
Dropped the partition day=20200404
Dropped the partition day=20200405
Dropped the partition day=20200406
OK
Time taken: 0.981 seconds
5) Show how many partitions a partitioned table has
hive (hive3)> show partitions dept_par;
OK
partition
day=20200401
day=20200402
day=20200403
Time taken: 0.296 seconds, Fetched: 3 row(s)
6) Show the structure of a partitioned table
hive (hive3)> desc formatted dept_par;
# Partition Information
# col_name data_type comment
day string
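You can also describe a single partition to see its own metadata, such as its HDFS location (a usage sketch):
hive (hive3)> desc formatted dept_par partition(day='20200401');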
4. Two-Level Partitions
Consider: what if even a single day's log data is very large? How can the data be split further?
1) Create a two-level partitioned table
hive (hive3)> create table dept_par2(deptno int, dname string, loc string) partitioned by (day string, hour string) row format delimited fields terminated by '\t';
OK
Time taken: 0.172 seconds
2) Load data as usual
hive (hive3)> load data local inpath "/home/zzdq/hive/dept_20200401.log" into table dept_par2 partition(day='001',hour='401');
Loading data to table hive3.dept_par2 partition (day=001, hour=401)
OK
Time taken: 1.221 seconds
hive (hive3)> load data local inpath "/home/zzdq/hive/dept_20200402.log" into table dept_par2 partition(day='001',hour='402');
Loading data to table hive3.dept_par2 partition (day=001, hour=402)
OK
Time taken: 0.899 seconds
hive (hive3)> load data local inpath "/home/zzdq/hive/dept_20200403.log" into table dept_par2 partition(day='001',hour='403');
Loading data to table hive3.dept_par2 partition (day=001, hour=403)
OK
Time taken: 1.011 seconds
hive (hive3)> load data local inpath "/home/zzdq/hive/dept_20200403.log" into table dept_par2 partition(day='001',hour='404');
Loading data to table hive3.dept_par2 partition (day=001, hour=404)
OK
Time taken: 0.878 seconds
Query
hive (hive3)> select * from dept_par2 where day = '001' and hour = '404';
OK
dept_par2.deptno dept_par2.dname dept_par2.loc dept_par2.day dept_par2.hour
50 TEST 2000 001 404
60 DEV 1900 001 404
NULL NULL NULL 001 404
Time taken: 0.692 seconds, Fetched: 3 row(s)
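With two partition columns, every hour becomes a subdirectory inside its day directory; the loads above produce a layout roughly like this (paths are illustrative):
/user/hive/warehouse/hive3.db/dept_par2/day=001/hour=401/dept_20200401.log
/user/hive/warehouse/hive3.db/dept_par2/day=001/hour=402/dept_20200402.log
/user/hive/warehouse/hive3.db/dept_par2/day=001/hour=403/dept_20200403.log
/user/hive/warehouse/hive3.db/dept_par2/day=001/hour=404/dept_20200403.log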
3) Three ways to upload data directly to a partition directory and have the partitioned table pick it up
(1) Method 1: upload the data, then repair the table
Upload the data
hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
hive (default)> dfs -put /opt/module/datas/dept1.txt /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
Query the data (the newly uploaded data is not visible yet)
hive (default)> select * from dept_partition2 where day='20200401' and hour='13';
To register the partition metadata, run the repair command
hive> msck repair table dept_partition2;
Query the data again
hive (default)> select * from dept_partition2 where day='20200401' and hour='13';
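In Hive 3.0 and later, msck can also synchronize in the other direction and drop partitions whose directories no longer exist; this variant is optional and should be checked against your version's documentation:
hive> msck repair table dept_partition2 sync partitions;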
(2) Method 2: upload the data, then add the partition
Upload the data
hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
hive (default)> dfs -put /opt/module/hive/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
Add the partition
hive (default)> alter table dept_partition2 add partition(day='20200401',hour='14');
Query the data
hive (default)> select * from dept_partition2 where day='20200401' and hour='14';
(3) Method 3: create the directory, then use load
Create the target partition directory
hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=15;
Load the data
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401',hour='15');
Query the data
hive (default)> select * from dept_partition2 where day='20200401' and hour='15';
5. Dynamic Partitioning
In a relational database, when you insert into a partitioned table the database automatically routes each row to the appropriate partition based on the value of the partition column. Hive provides a similar mechanism, dynamic partitioning (Dynamic Partition), but using it requires some configuration.
1) Parameters for enabling dynamic partitioning (a combined example of setting them follows this list)
(1) Enable the dynamic partition feature (default true, i.e. enabled)
hive.exec.dynamic.partition=true
(2) Set non-strict mode (the dynamic partition mode defaults to strict, which requires at least one partition column to be specified statically; nonstrict allows all partition columns to be dynamic.)
hive (hive3)> set hive.exec.dynamic.partition.mode=nonstrict;
(3) The maximum total number of dynamic partitions that can be created across all nodes running MR. Default 1000.
hive.exec.max.dynamic.partitions=1000
(4) The maximum number of dynamic partitions that can be created on each node running MR. This must be set according to the actual data. For example, if the source data covers a full year, the day column has 365 distinct values, so this parameter must be set to more than 365; with the default of 100 the job would fail.
hive.exec.max.dynamic.partitions.pernode=100
(5) The maximum number of HDFS files that can be created in the whole MR job. Default 100000.
hive.exec.max.created.files=100000
(6) Whether to throw an exception when empty partitions are generated. Usually does not need to be set. Default false.
hive.error.on.empty.partition=false
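In practice these parameters are usually adjusted at the session level with set before running the insert; the values below are only illustrative and should be tuned to your data:
hive (hive3)> set hive.exec.dynamic.partition=true;
hive (hive3)> set hive.exec.dynamic.partition.mode=nonstrict;
hive (hive3)> set hive.exec.max.dynamic.partitions=1000;
hive (hive3)> set hive.exec.max.dynamic.partitions.pernode=400;
hive (hive3)> set hive.exec.max.created.files=100000;
hive (hive3)> set hive.error.on.empty.partition=false;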
2) Hands-on example
Requirement: insert the data from the dept table into the corresponding partitions of the target table dept_no_par, partitioned by department number (the deptno column).
(1) Create the target partitioned table
hive (hive3)> create table dept_no_par(dname string, loc string) partitioned by (deptno int) row format delimited fields terminated by '\t';
OK
Time taken: 0.185 seconds
(2) Set the dynamic partition mode to non-strict
hive (hive3)> set hive.exec.dynamic.partition.mode=nonstrict;
(3) Insert the data
hive (hive3)> insert into table dept_no_par partition(deptno) select dname,loc,deptno from dept;
OK
Time taken: 0.062 seconds
Query ID = atguigu_20211219145313_ad874c4a-da69-43f2-a0bb-6e0e9ea0ce6c
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-12-19 14:53:33,924 Stage-1 map = 0%, reduce = 0%
2021-12-19 14:53:41,638 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.93 sec
2021-12-19 14:53:51,240 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.7 sec
MapReduce Total cumulative CPU time: 4 seconds 700 msec
Ended Job = job_1639880318289_0007
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.7 sec HDFS Read: 14680 HDFS Write: 773 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 700 msec
OK
dname loc deptno
Time taken: 42.654 seconds
(4) Check the partitions of the target table
hive (hive3)> show partitions dept_no_par;
OK
partition
deptno=10
deptno=20
deptno=30
deptno=40
deptno=70
Time taken: 0.39 seconds, Fetched: 5 row(s)
Dynamic partitioning can also be written like this (new in Hive 3.0), omitting the partition clause:
hive (hive3)> create table dept_no_par2(dname string, loc string) partitioned by (deptno int) row format delimited fields terminated by '\t';
OK
Time taken: 0.188 seconds
hive (hive3)> insert into table dept_no_par2 select dname,loc,deptno from dept;
hive (hive3)> show partitions dept_no_par2;
OK
partition
deptno=10
deptno=20
deptno=30
deptno=40
Time taken: 0.137 seconds, Fetched: 4 row(s)