Apache Hive Series
1. apache-hive-3.1.2 introduction, deployment (three modes: embedded, local, and remote), and verification
2. Hive concepts in detail: architecture, file read/write mechanism, data storage
3. Hive usage examples: table creation, data types, internal vs. external tables, partitioned tables, bucketed tables
4. Hive usage examples: transactional tables, views, materialized views, DDL (database, table, and partition) management
5. Hive load, insert, and transactional table usage with examples
6. Hive select (GROUP BY, ORDER BY, CLUSTER BY, SORT BY, LIMIT, union, CTE) and join usage with examples
7. Hive shell client and property configuration, built-in operators, functions (built-in operators and user-defined UDFs)
8. Hive relational, logical, mathematical, and numeric operations, date functions, conditional functions, and string functions: syntax and examples
9. Hive explode, Lateral View, aggregate functions, window functions, and sampling functions
10. Hive worked example: multi-delimiter data (RegexSerDe), URL parsing, and common row/column transposition functions (case when, union, concat, explode)
11. Hive worked example: JSON parsing, window function applications (consecutive logins, cascading accumulation, topN), zipper (slowly changing) tables
12. Hive optimization: file storage and compression format tuning, and job execution tuning (execution plans, MR properties, join, optimizer, predicate pushdown, data skew)
13. Accessing Hive from the Java API, with examples
This article covers Hive grouping, sorting, CTEs, and joins in detail, with examples.
It assumes a working Hive environment.
It has two parts: using select, and using join.
I. Hive SQL DQL: querying data with SELECT
What is queried is determined by the table_reference after the FROM keyword: it can be a regular physical table, a view, a join result, or a subquery result. Table and column names are case-insensitive.
1. GROUP BY, ORDER BY, CLUSTER BY, SORT BY, LIMIT: syntax and examples
1) Syntax
[WITH CommonTableExpression (, CommonTableExpression)*]
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows];
2) Examples
------------Case study: querying US Covid-19 data with select---------------
--step1: create a regular table t_usa_covid19
drop table if exists t_usa_covid19;
CREATE TABLE t_usa_covid19(
count_date string,
county string,
state string,
fips int,
cases int,
deaths int)
row format delimited fields terminated by ",";
--load the source data into the path backing table t_usa_covid19
load data local inpath '/usr/local/bigdata/us-covid19-counties.dat' into table t_usa_covid19;
select * from t_usa_covid19;
--step2: create a partitioned table, partitioned by count_date (date) and state
CREATE TABLE if not exists t_usa_covid19_p(
county string,
fips int,
cases int,
deaths int)
partitioned by(count_date string,state string)
row format delimited fields terminated by ",";
--step3: populate t_usa_covid19_p with a dynamic partition insert
set hive.exec.dynamic.partition.mode = nonstrict;
insert into table t_usa_covid19_p partition (count_date,state)
select county,fips,cases,deaths,count_date,state from t_usa_covid19;
---------------Hive SQL select: basic syntax------------------
--1. select_expr
--query all columns, or specific columns
select * from t_usa_covid19_p;
select county, cases, deaths from t_usa_covid19_p;
--query all columns matching a regular expression
SET hive.support.quoted.identifiers = none; --backticked strings are no longer treated as identifiers but as regular expressions
--query the columns whose names start with c
select `^c.*` from t_usa_covid19_p;
0: jdbc:hive2://server4:10000> select `^c.*` from t_usa_covid19_p limit 3;
+-------------------------+------------------------+-----------------------------+
| t_usa_covid19_p.county | t_usa_covid19_p.cases | t_usa_covid19_p.count_date |
+-------------------------+------------------------+-----------------------------+
| Autauga | 5554 | 2021-01-28 |
| Baldwin | 17779 | 2021-01-28 |
| Barbour | 1920 | 2021-01-28 |
+-------------------------+------------------------+-----------------------------+
--query the current database
select current_database(); --the from keyword can be omitted
--call a function in a query
select count(county) from t_usa_covid19_p;
--2. ALL / DISTINCT
--return all matching rows
select state from t_usa_covid19_p;
--equivalent to
select all state from t_usa_covid19_p;
--return all matching rows, with duplicates removed
select distinct state from t_usa_covid19_p;
--distinct over multiple columns deduplicates on the whole combination
select county,state from t_usa_covid19_p;
select distinct county,state from t_usa_covid19_p;
select distinct sex from student;
0: jdbc:hive2://server4:10000> select distinct sex from student;
WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+------+
| sex |
+------+
| 女 |
| 男 |
+------+
--3. WHERE clause
select * from t_usa_covid19_p where 1 > 2; -- 1 > 2 returns false
select * from t_usa_covid19_p where 1 = 1; -- 1 = 1 returns true
--use a function in the where condition: find the states whose names are longer than 10 characters
select * from t_usa_covid19_p where length(state) > 10;
--the where clause supports subqueries
SELECT *
FROM A
WHERE A.a IN (SELECT foo FROM B);
--note: aggregate functions cannot be used in the where clause
--error: SemanticException: Not yet supported place for UDAF 'count'
--an aggregate function can only be applied once the result set is determined,
--but the where clause is still in the middle of determining that result set, so aggregates are not allowed there.
select state,count(deaths) from t_usa_covid19_p where count(deaths) >100 group by state;
0: jdbc:hive2://server4:10000> select state,count(deaths) from t_usa_covid19_p where count(deaths) >100 group by state;
Error: Error while compiling statement: FAILED: SemanticException [Error 10128]: Line 1:54 Not yet supported place for UDAF 'count' (state=42000,code=10128)
--HAVING can be used instead
select state,count(deaths)
from t_usa_covid19_p group by state
having count(deaths) > 100;
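The WHERE-versus-HAVING distinction above can be sketched outside of Hive: WHERE filters individual rows before grouping, so no aggregate value exists yet, while HAVING filters whole groups after aggregation. A minimal Python simulation, with made-up sample rows:

```python
from collections import defaultdict

rows = [
    {"state": "California", "deaths": 1600},
    {"state": "California", "deaths": 300},
    {"state": "Texas", "deaths": 40},
    {"state": "Texas", "deaths": 50},
]

# WHERE: filters individual rows -- no aggregate has been computed yet
filtered = [r for r in rows if r["deaths"] > 100]

# GROUP BY + aggregate: only now does count(deaths) have a defined value
groups = defaultdict(int)
for r in filtered:
    groups[r["state"]] += 1

# HAVING: filters on the per-group aggregate, which is now determined
result = {state: cnt for state, cnt in groups.items() if cnt >= 2}
print(result)  # {'California': 2}
```

The threshold (>= 2) is scaled down to fit the toy data; the evaluation order (WHERE, then GROUP BY, then HAVING) is what matters.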
--4. Partition queries and partition pruning
--find the counties in California with more than 1000 cumulative deaths; state is a partition column, so partition pruning avoids a full table scan
select * from t_usa_covid19_p where state ="California" and deaths > 1000;
--pruning on multiple partition columns
select * from t_usa_covid19_p where count_date = "2021-01-28" and state ="California" and deaths > 1000;
--5. GROUP BY
--group by state
--SemanticException: Expression not in GROUP BY key 'deaths'
--deaths is not a grouping column, so this errors
--state is a grouping column and may appear directly in select_expr
select state,deaths from t_usa_covid19_p where count_date = "2021-01-28" group by state;
--non-grouping columns must be wrapped in an aggregate function
select state,sum(deaths) from t_usa_covid19_p where count_date = "2021-01-28" group by state;
--6. HAVING
--find the states with more than 10000 deaths
--aggregates in the where clause are a syntax error
select state,sum(deaths) from t_usa_covid19_p where count_date = "2021-01-28" and sum(deaths) >10000 group by state;
--first where filters before grouping (here it is partition pruning), then group by groups the rows; once each group's result set is determined, having filters the groups
select state,sum(deaths) from t_usa_covid19_p
where count_date = "2021-01-28"
group by state
having sum(deaths) > 10000;
--this version is better: the aggregate has already been computed during group by, so having references that result directly instead of computing it a second time
select state,sum(deaths) as cnts from t_usa_covid19_p
where count_date = "2021-01-28"
group by state
having cnts> 10000;
--7. LIMIT
--without a limit, return every record for California on 2021-01-28
select * from t_usa_covid19_p where count_date = "2021-01-28" and state ="California";
--return the first 5 rows of the result set
select * from t_usa_covid19_p where count_date = "2021-01-28" and state ="California" limit 5;
--return 3 rows starting from the 3rd row (inclusive); the query results are compared below
--[LIMIT [offset,] rows]
select * from t_usa_covid19_p where count_date = "2021-01-28" and state ="California" order by deaths desc limit 2,3; --note: the first argument, the offset, starts at 0
0: jdbc:hive2://server4:10000> select * from t_usa_covid19_p where count_date = "2021-01-28" and state ="California" order by deaths desc ;
+-------------------------+-----------------------+------------------------+-------------------------+-----------------------------+------------------------+
| t_usa_covid19_p.county | t_usa_covid19_p.fips | t_usa_covid19_p.cases | t_usa_covid19_p.deaths | t_usa_covid19_p.count_date | t_usa_covid19_p.state |
+-------------------------+-----------------------+------------------------+-------------------------+-----------------------------+------------------------+
| Los Angeles | 6037 | 1098363 | 16107 | 2021-01-28 | California |
| Riverside | 6065 | 270105 | 3058 | 2021-01-28 | California |
| Orange | 6059 | 241648 | 2868 | 2021-01-28 | California |
| San Diego | 6073 | 233033 | 2534 | 2021-01-28 | California |
| San Bernardino | 6071 | 271189 | 1776 | 2021-01-28 | California |
| Santa Clara | 6085 | 100468 | 1345 | 2021-01-28 | California |
| Sacramento | 6067 | 85427 | 1216 | 2021-01-28 | California |
| Fresno | 6019 | 86886 | 1122 | 2021-01-28 | California |
...
0: jdbc:hive2://server4:10000> select * from t_usa_covid19_p where count_date = "2021-01-28" and state ="California" order by deaths desc limit 2,3;
WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+-------------------------+-----------------------+------------------------+-------------------------+-----------------------------+------------------------+
| t_usa_covid19_p.county | t_usa_covid19_p.fips | t_usa_covid19_p.cases | t_usa_covid19_p.deaths | t_usa_covid19_p.count_date | t_usa_covid19_p.state |
+-------------------------+-----------------------+------------------------+-------------------------+-----------------------------+------------------------+
| Orange | 6059 | 241648 | 2868 | 2021-01-28 | California |
| San Diego | 6073 | 233033 | 2534 | 2021-01-28 | California |
| San Bernardino | 6071 | 271189 | 1776 | 2021-01-28 | California |
+-------------------------+-----------------------+------------------------+-------------------------+-----------------------------+------------------------+
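`LIMIT [offset,] rows` behaves like a zero-based slice of the (ordered) result set. A quick Python sketch of the `limit 2,3` example above, using the deaths column from the descending result:

```python
# deaths column of the ordered California result above (descending)
ordered = [16107, 3058, 2868, 2534, 1776, 1345, 1216, 1122]

def hive_limit(rows, offset, n):
    # LIMIT offset, n: skip `offset` rows (offset is zero-based), keep `n`
    return rows[offset:offset + n]

print(hive_limit(ordered, 2, 3))  # [2868, 2534, 1776]
```

Those are exactly the deaths values of Orange, San Diego, and San Bernardino in the `limit 2,3` output above.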
---------------Hive SQL select: advanced syntax------------------
---1. order by
--sort by the given column
--the default is asc, nulls first; nulls last can be specified explicitly
select * from t_usa_covid19_p where count_date = "2021-01-28" and state ="California"
order by deaths ;
--specify desc (for desc, the default null ordering is nulls last)
select * from t_usa_covid19_p where count_date = "2021-01-28" and state ="California"
order by deaths desc;
--it is strongly recommended to pair ORDER BY with LIMIT, to avoid very large result sets
--when hive.mapred.mode is set to strict mode, ORDER BY without LIMIT raises an exception.
select * from t_usa_covid19_p where count_date = "2021-01-28" and state ="California"
order by deaths desc
limit 3;
--2. cluster by
--distributes the rows into groups by the given column, then sorts each group by that same column, ascending only: it distributes and sorts on one and the same column.
select * from student;
--without specifying the number of reduce tasks
--the log shows: Number of reduce tasks not specified. Estimated from input data size: 1
select * from student cluster by num;
--the distribution rule is hash partitioning (the same rule as bucketed tables): Hash_Func(col_name) % number of reduce tasks
--the number of groups therefore depends on the number of reduce tasks
--manually set the number of reduce tasks
set mapreduce.job.reduces =2;
select * from student cluster by num;
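The Hash_Func(col_name) % reducers rule above can be sketched in Python. Python's built-in `hash` stands in for Hive's hash function here, so the exact bucket assignments are illustrative only; the student numbers are made up:

```python
def partition(rows, key, num_reducers):
    # same idea as bucketing: Hash_Func(col) % number of reduce tasks
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[hash(row[key]) % num_reducers].append(row)
    # CLUSTER BY also sorts each partition on that same column, ascending
    for b in buckets:
        b.sort(key=lambda r: r[key])
    return buckets

students = [{"num": n} for n in (95004, 95001, 95003, 95002)]
for i, b in enumerate(partition(students, "num", 2)):
    print(f"reducer {i}: {[r['num'] for r in b]}")
```

With 2 reducers, even and odd student numbers land in different buckets and each bucket comes out sorted, which mirrors the per-reducer output files Hive produces.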
--3. distribute by + sort by
--case: split the student table into two partitions by sex, each partition sorted by age in descending order.
--wrong
select * from student cluster by sex order by age desc;
select * from student cluster by sex sort by age desc;
CLUSTER BY cannot do this on its own, because its distribute column and sort column must be one and the same;
ORDER BY cannot be used here either, because it is a global sort with a single output and cannot partition the data.
--correct
--DISTRIBUTE BY + SORT BY splits CLUSTER BY's behavior into its two halves
--precondition: DISTRIBUTE BY only shows an effect when there are multiple reducers
--1. DISTRIBUTE BY routes rows to partitions by the given column;
--2. SORT BY defines the sort order within each partition.
--the distribute column and the sort column may differ.
set mapreduce.job.reduces=3;
select * from student distribute by sex sort by age desc;
--the following two statements produce the same result
select * from student distribute by num sort by num;
select * from student cluster by num;
2. CLUSTER BY, DISTRIBUTE BY, SORT BY, ORDER BY: summary
- order by is a global sort, so there is only one reducer and the output lands in a single file; with large inputs it can take a long time.
- distribute by routes rows to partitions by the given column, using hash partitioning. sort by then sorts locally within each partition.
- cluster by both distributes and sorts, but on one and the same column.
When the distribute column and the sort column are the same, cluster by = distribute by + sort by.
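The distribute-by-sex, sort-by-age-descending case can be simulated the same way; note that the distribute column and the sort column differ. The sample rows are made up:

```python
from collections import defaultdict

students = [
    ("Li", "male", 20), ("Wang", "female", 22),
    ("Zhao", "male", 23), ("Chen", "female", 19),
]

# DISTRIBUTE BY sex: route rows to partitions by the sex column
parts = defaultdict(list)
for name, sex, age in students:
    parts[sex].append((name, sex, age))

# SORT BY age DESC: sort within each partition only (no global order)
for sex in parts:
    parts[sex].sort(key=lambda r: r[2], reverse=True)

print(dict(parts))
```

Each partition is internally ordered by age descending, but there is no ordering guarantee across partitions, just as with multiple reducers in Hive.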
3. UNION queries
UNION merges the results of multiple SELECT statements into a single result set.
The DISTINCT keyword has the same effect as plain UNION: both remove duplicate rows. Hive versions before 1.2.0 only support UNION ALL, which does not eliminate duplicates.
With the ALL keyword, duplicates are kept: the result set contains every matching row from every SELECT statement, including duplicates.
Each select_statement must return the same number of columns, with the same names.
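The difference between UNION ALL and UNION [DISTINCT] amounts to concatenation versus deduplication over row sets; a sketch with hypothetical rows:

```python
local_rows = [(95001, "li"), (95002, "wang")]
hdfs_rows = [(95002, "wang"), (95003, "zhao")]

# UNION ALL: simple concatenation, duplicates kept
union_all = local_rows + hdfs_rows

# UNION / UNION DISTINCT: duplicates removed
# (sorted here for readability; Hive guarantees no particular order)
union_distinct = sorted(set(local_rows) | set(hdfs_rows))

print(len(union_all), union_distinct)
```

The (95002, "wang") row appears twice under ALL and once under DISTINCT.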
---------------UNION queries----------------------------
--syntax
select_statement UNION [ALL | DISTINCT] select_statement UNION [ALL | DISTINCT] select_statement ...;
--the DISTINCT keyword has the same effect as plain UNION: duplicates are removed.
select num,name from student_local
UNION
select num,name from student_hdfs;
--same as above
select num,name from student_local
UNION DISTINCT
select num,name from student_hdfs;
--the ALL keyword keeps duplicate rows.
select num,name from student_local
UNION ALL
select num,name from student_hdfs limit 2;
--to apply ORDER BY, SORT BY, CLUSTER BY, DISTRIBUTE BY, or LIMIT to an individual SELECT,
--put the clause inside the parentheses that enclose that SELECT
SELECT num,name FROM (select num,name from student_local LIMIT 2) subq1
UNION
SELECT num,name FROM (select num,name from student_hdfs LIMIT 3) subq2;
--to apply an ORDER BY, SORT BY, CLUSTER BY, DISTRIBUTE BY, or LIMIT clause to the entire UNION result,
--put it after the last select_statement.
select num,name from student_local
UNION
select num,name from student_hdfs
order by num desc;
------------Subqueries--------------
--subqueries in the from clause
--a subquery
SELECT num
FROM (
select num,name from student_local
) tmp;
--example of a subquery containing a UNION
SELECT t3.name
FROM (
select num,name from student_local
UNION distinct
select num,name from student_hdfs
) t3;
--subqueries in the where clause
--uncorrelated subqueries behave like IN / NOT IN; the subquery may select only one column.
--(1) the subquery runs first; its result is not displayed but is passed to the outer query as its condition.
--(2) the outer query then runs and displays the full result.
SELECT *
FROM student_hdfs
WHERE student_hdfs.num IN (select num from student_local limit 2);
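The two-step evaluation described above (run the subquery first, then filter the outer query with IN) can be sketched as:

```python
student_local = [(95001, "li"), (95002, "wang"), (95003, "zhao")]
student_hdfs = [(95001, "li"), (95003, "zhao"), (95004, "qian")]

# (1) run the subquery: its result is only passed on as a condition
# (the [:2] slice stands in for "limit 2"; without an order by,
#  Hive does not guarantee WHICH two rows a limit returns)
subquery_nums = {num for num, _ in student_local[:2]}

# (2) run the outer query, filtering with IN against the subquery result
result = [row for row in student_hdfs if row[0] in subquery_nums]
print(result)  # [(95001, 'li')]
```

All rows and table contents here are hypothetical sample data.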
--correlated subqueries: the EXISTS and NOT EXISTS subqueries
--the subquery's WHERE clause may reference the outer (parent) query
SELECT A
FROM T1
WHERE EXISTS (SELECT B FROM T2 WHERE T1.X = T2.Y);
4. Common Table Expressions (CTE)
A common table expression (CTE) is a temporary result set derived from the simple query specified in a WITH clause, which immediately precedes a SELECT or INSERT keyword.
A CTE exists only within the execution scope of a single statement.
CTEs can be used in SELECT, INSERT, CREATE TABLE AS SELECT, and CREATE VIEW AS SELECT statements.
-----------------Common Table Expressions (CTE)-----------------------------------
--a CTE in a select statement
with q1 as (select num,name,age from student where num = 95002)
select * from q1;
-- from-style syntax
with q1 as (select num,name,age from student where num = 95002)
from q1 select *;
-- chaining CTEs
with q1 as ( select * from student where num = 95002),
q2 as ( select num,name,age from q1)
select * from (select num from q2) a;
-- union
with q1 as (select * from student where num = 95002),
q2 as (select * from student where num = 95004)
select * from q1 union all select * from q2;
--CTEs in views, CTAS, and insert statements
-- insert
create table s1 like student;
with q1 as ( select * from student where num = 95002)
from q1
insert overwrite table s1
select *;
select * from s1;
-- ctas
create table s2 as
with q1 as ( select * from student where num = 95002)
select * from q1;
-- view
create view v1 as
with q1 as ( select * from student where num = 95002)
select * from q1;
select * from v1;
II. Hive SQL join operations
Join syntax exists to query data jointly from two or more tables, based on relationships between their columns.
The current Hive version, 3.1.2, supports six join syntaxes in total:
inner join, left join, right join, full outer join, left semi join, and cross join (also called the Cartesian product).
1. Join syntax
join_table:
table_reference [INNER] JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
| table_reference CROSS JOIN table_reference [join_condition] (as of Hive 0.10)
join_condition:
ON expression
-- 1. table_reference: the table name used in the join query; it may also be a subquery alias (the query result participates in the join as a table).
-- 2. table_factor: same as table_reference: a table name or subquery alias used in the join query.
-- 3. join_condition: the condition the join is made on; when joining on more than one condition, combine them with the AND keyword.
2. Preparing the data
--table1: employees
CREATE TABLE employee(
id int,
name string,
deg string,
salary int,
dept string
)
row format delimited
fields terminated by ',';
--table2: employee addresses
CREATE TABLE employee_address (
id int,
hno string,
street string,
city string
)
row format delimited
fields terminated by ',';
--table3: employee contact details
CREATE TABLE employee_connection (
id int,
phno string,
email string
)
row format delimited
fields terminated by ',';
--load data into the tables
load data local inpath '/usr/local/bigdata/employee.txt' into table employee;
load data local inpath '/usr/local/bigdata/employee_address.txt' into table employee_address;
load data local inpath '/usr/local/bigdata/employee_connection.txt' into table employee_connection;
0: jdbc:hive2://server4:10000> select * from employee;
+--------------+----------------+---------------+------------------+----------------+
| employee.id | employee.name | employee.deg | employee.salary | employee.dept |
+--------------+----------------+---------------+------------------+----------------+
| 1201 | gopal | manager | 50000 | TP |
| 1202 | manisha | cto | 50000 | TP |
| 1203 | khalil | dev | 30000 | AC |
| 1204 | prasanth | dev | 30000 | AC |
| 1206 | kranthi | admin | 20000 | TP |
+--------------+----------------+---------------+------------------+----------------+
0: jdbc:hive2://server4:10000> select * from employee_address ;
+----------------------+-----------------------+--------------------------+------------------------+
| employee_address.id | employee_address.hno | employee_address.street | employee_address.city |
+----------------------+-----------------------+--------------------------+------------------------+
| 1201 | 288A | vgiri | jublee |
| 1202 | 108I | aoc | ny |
| 1204 | 144Z | pgutta | hyd |
| 1206 | 78B | old city | la |
| 1207 | 720X | hitec | ny |
+----------------------+-----------------------+--------------------------+------------------------+
0: jdbc:hive2://server4:10000> select * from employee_connection;
+-------------------------+---------------------------+----------------------------+
| employee_connection.id | employee_connection.phno | employee_connection.email |
+-------------------------+---------------------------+----------------------------+
| 1201 | 2356742 | gopal@tp.com |
| 1203 | 1661663 | manisha@tp.com |
| 1204 | 8887776 | khalil@ac.com |
| 1205 | 9988774 | prasanth@ac.com |
| 1206 | 1231231 | kranthi@tp.com |
+-------------------------+---------------------------+----------------------------+
3. inner join
The inner join is the most common join; it is also called a plain join, and the inner keyword can be omitted: inner join == join.
Only rows for which both joined tables contain data matching the join condition are kept.
--1. inner join
select e.id,e.name,e_a.city,e_a.street
from employee e
inner join employee_address e_a on e.id =e_a.id;
--equivalent: inner join = join
select e.id,e.name,e_a.city,e_a.street
from employee e
join employee_address e_a on e.id =e_a.id;
--equivalent: implicit join notation
select e.id,e.name,e_a.city,e_a.street
from employee e , employee_address e_a
where e.id =e_a.id;
-- query each employee's address
0: jdbc:hive2://server4:10000> select e.*,a.street from employee e join employee_address a on e.id = a.id;
WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+-------+-----------+----------+-----------+---------+-----------+
| e.id | e.name | e.deg | e.salary | e.dept | a.street |
+-------+-----------+----------+-----------+---------+-----------+
| 1201 | gopal | manager | 50000 | TP | vgiri |
| 1202 | manisha | cto | 50000 | TP | aoc |
| 1204 | prasanth | dev | 30000 | AC | pgutta |
| 1206 | kranthi | admin | 20000 | TP | old city |
+-------+-----------+----------+-----------+---------+-----------+
0: jdbc:hive2://server4:10000> select e.*,a.street from employee e ,employee_address a where e.id = a.id;
WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+-------+-----------+----------+-----------+---------+-----------+
| e.id | e.name | e.deg | e.salary | e.dept | a.street |
+-------+-----------+----------+-----------+---------+-----------+
| 1201 | gopal | manager | 50000 | TP | vgiri |
| 1202 | manisha | cto | 50000 | TP | aoc |
| 1204 | prasanth | dev | 30000 | AC | pgutta |
| 1206 | kranthi | admin | 20000 | TP | old city |
+-------+-----------+----------+-----------+---------+-----------+
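The inner-join semantics (keep only the ids present in both tables) can be reproduced on the sample data above, reduced here to an id-to-value mapping per table:

```python
employee = {1201: "gopal", 1202: "manisha", 1203: "khalil",
            1204: "prasanth", 1206: "kranthi"}
address = {1201: "vgiri", 1202: "aoc", 1204: "pgutta",
           1206: "old city", 1207: "hitec"}

# inner join: a row appears only when the id exists on both sides
inner = [(i, employee[i], address[i])
         for i in sorted(employee.keys() & address.keys())]
print(inner)
```

Ids 1203 (no address) and 1207 (no employee) drop out, leaving the same four rows as the beeline output above.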
4. left join
left join stands for left outer join; the outer keyword can be omitted, and left outer join is the older spelling.
The essence of a left join is the "left": the table to the left of the join keyword, the left table for short.
Informally: the join takes all of the left table's rows as the baseline and matches the right table against them. Every left-table row is returned, with the matching right-table values where a match exists and NULL where it does not.
--2. left join
select e.id,e.name,e_conn.phno,e_conn.email
from employee e
left join employee_connection e_conn on e.id =e_conn.id;
--equivalent: left outer join
select e.id,e.name,e_conn.phno,e_conn.email
from employee e
left outer join employee_connection e_conn on e.id =e_conn.id;
0: jdbc:hive2://server4:10000> select e.*,c.phno,c.email from employee e left join employee_connection c on e.id = c.id;
WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+-------+-----------+----------+-----------+---------+----------+-----------------+
| e.id | e.name | e.deg | e.salary | e.dept | c.phno | c.email |
+-------+-----------+----------+-----------+---------+----------+-----------------+
| 1201 | gopal | manager | 50000 | TP | 2356742 | gopal@tp.com |
| 1202 | manisha | cto | 50000 | TP | NULL | NULL |
| 1203 | khalil | dev | 30000 | AC | 1661663 | manisha@tp.com |
| 1204 | prasanth | dev | 30000 | AC | 8887776 | khalil@ac.com |
| 1206 | kranthi | admin | 20000 | TP | 1231231 | kranthi@tp.com |
+-------+-----------+----------+-----------+---------+----------+-----------------+
5. right join
right join stands for right outer join; the outer keyword can be omitted.
The essence of a right join is the "right": the table to the right of the join keyword, the right table for short.
Informally: the join takes all of the right table's rows as the baseline and matches the left table against them. Every right-table row is returned, with the matching left-table values where a match exists and NULL where it does not.
right join and left join are mirror images; the difference is only which side is the baseline, i.e. the direction.
--3. right join
select e.id,e.name,e_conn.phno,e_conn.email
from employee e
right join employee_connection e_conn on e.id =e_conn.id;
--equivalent: right outer join
select e.id,e.name,e_conn.phno,e_conn.email
from employee e
right outer join employee_connection e_conn on e.id =e_conn.id;
0: jdbc:hive2://server4:10000> select e.*,c.id,c.phno,c.email from employee e right join employee_connection c on e.id = c.id;
WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+-------+-----------+----------+-----------+---------+-------+----------+------------------+
| e.id | e.name | e.deg | e.salary | e.dept | c.id | c.phno | c.email |
+-------+-----------+----------+-----------+---------+-------+----------+------------------+
| 1201 | gopal | manager | 50000 | TP | 1201 | 2356742 | gopal@tp.com |
| 1203 | khalil | dev | 30000 | AC | 1203 | 1661663 | manisha@tp.com |
| 1204 | prasanth | dev | 30000 | AC | 1204 | 8887776 | khalil@ac.com |
| NULL | NULL | NULL | NULL | NULL | 1205 | 9988774 | prasanth@ac.com |
| 1206 | kranthi | admin | 20000 | TP | 1206 | 1231231 | kranthi@tp.com |
+-------+-----------+----------+-----------+---------+-------+----------+------------------+
6. full outer join
full outer join, equivalent to full join, is the full (outer) join.
It contains every row of both tables, whether or not the other table has a matching row.
Functionally it is equivalent to left-outer-joining and right-outer-joining the two data sets separately, then merging the two results into one while eliminating the duplicate rows.
--4. full outer join
select e.id,e.name,e_a.city,e_a.street
from employee e
full outer join employee_address e_a on e.id =e_a.id;
--equivalent
select e.id,e.name,e_a.city,e_a.street
from employee e
full join employee_address e_a on e.id =e_a.id;
0: jdbc:hive2://server4:10000> select e.*,c.id,c.phno,c.email from employee e full join employee_connection c on e.id = c.id;
WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+-------+-----------+----------+-----------+---------+-------+----------+------------------+
| e.id | e.name | e.deg | e.salary | e.dept | c.id | c.phno | c.email |
+-------+-----------+----------+-----------+---------+-------+----------+------------------+
| 1201 | gopal | manager | 50000 | TP | 1201 | 2356742 | gopal@tp.com |
| 1202 | manisha | cto | 50000 | TP | NULL | NULL | NULL |
| 1203 | khalil | dev | 30000 | AC | 1203 | 1661663 | manisha@tp.com |
| 1204 | prasanth | dev | 30000 | AC | 1204 | 8887776 | khalil@ac.com |
| NULL | NULL | NULL | NULL | NULL | 1205 | 9988774 | prasanth@ac.com |
| 1206 | kranthi | admin | 20000 | TP | 1206 | 1231231 | kranthi@tp.com |
+-------+-----------+----------+-----------+---------+-------+----------+------------------+
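left, right, full outer, and inner join differ only in which side's unmatched ids survive. On the employee/connection id sets from the tables above:

```python
employee_ids = {1201, 1202, 1203, 1204, 1206}
connection_ids = {1201, 1203, 1204, 1205, 1206}

left_ids = employee_ids                   # all left rows; unmatched right columns become NULL
right_ids = connection_ids                # all right rows; unmatched left columns become NULL
full_ids = employee_ids | connection_ids  # union of both sides, NULLs on either side
inner_ids = employee_ids & connection_ids # only the matching ids

print(sorted(full_ids))  # [1201, 1202, 1203, 1204, 1205, 1206]
```

This matches the full join output above: 1202 appears with NULL contact columns, 1205 with NULL employee columns.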
7. left semi join
A LEFT SEMI JOIN returns the rows of the left table whose records satisfy the ON condition against the right table.
In effect it behaves like an inner join that returns only the left table's side of the result.
--5. left semi join: returns only the left table's columns, but more efficiently
select *
from employee e
left semi join employee_address e_addr on e.id =e_addr.id;
--behaves like this inner join, only more efficiently
select e.*
from employee e
inner join employee_address e_addr on e.id =e_addr.id;
0: jdbc:hive2://server4:10000> select * from employee e left semi join employee_address e_addr on e.id =e_addr.id;
WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+-------+-----------+----------+-----------+---------+
| e.id | e.name | e.deg | e.salary | e.dept |
+-------+-----------+----------+-----------+---------+
| 1201 | gopal | manager | 50000 | TP |
| 1202 | manisha | cto | 50000 | TP |
| 1204 | prasanth | dev | 30000 | AC |
| 1206 | kranthi | admin | 20000 | TP |
+-------+-----------+----------+-----------+---------+
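A left semi join can be contrasted with a plain inner join as follows: the semi join emits each matching left row once, with only the left table's columns, even when the right side has several matches. A sketch on made-up rows, with a deliberately duplicated right-side id to show the difference:

```python
employee = [(1201, "gopal"), (1202, "manisha"), (1204, "prasanth")]
address = [(1201, "vgiri"), (1201, "288A annex"), (1204, "pgutta")]  # 1201 twice

addr_ids = {i for i, _ in address}

# LEFT SEMI JOIN: left columns only, one output row per matching left row
semi = [e for e in employee if e[0] in addr_ids]

# INNER JOIN: one output row per matching (left, right) pair
inner = [(e, a) for e in employee for a in address if e[0] == a[0]]

print(len(semi), len(inner))  # 2 3
```

The semi join stops probing once a match is found, which is also why it tends to be cheaper than the equivalent inner join in Hive.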
8. cross join
A cross join returns the Cartesian product of the two joined tables: the number of result rows equals the product of the two tables' row counts. Use cross join with great care on large tables.
In the SQL standard, a cross join is defined as an unconditional inner join: it returns the Cartesian product of the two tables, with no join key required.
In Hive SQL, a cross join may be followed by a where clause, or an on condition, for filtering.
--6. cross join
--queries A, B, and C below return the same result, but differ in efficiency
--A:
select a.*,b.* from employee a,employee_address b where a.id=b.id;
--B:
select * from employee a cross join employee_address b on a.id=b.id;
select * from employee a cross join employee_address b where a.id=b.id;
--C:
select * from employee a inner join employee_address b on a.id=b.id;
--methods A and B are generally discouraged: with a WHERE clause, a table of rows(A) x rows(B) rows may be produced first, and only then filtered by the WHERE condition.
--so if the two tables whose intersection is needed are large, this will be extremely slow; avoid it.
--A:
explain select a.*,b.* from employee a,employee_address b where a.id=b.id;
--B:
explain select * from employee a cross join employee_address b on a.id=b.id;
--C:
explain select * from employee a inner join employee_address b on a.id=b.id;
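The multiplicative cost warned about above is easy to quantify: an unfiltered cross join materializes |A| x |B| pairs before any WHERE filter runs. A tiny illustration with hypothetical row counts:

```python
import itertools

a = range(1000)   # 1,000 rows in table a
b = range(2000)   # 2,000 rows in table b

# an unfiltered cross join yields |a| * |b| pairs
# before any WHERE condition can discard them
n_pairs = sum(1 for _ in itertools.product(a, b))
print(n_pairs)  # 2000000
```

Two modest tables already produce two million intermediate rows, which is why the `explain` plans above are worth comparing before choosing a join form.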
9. Notes
1) Complex join expressions are allowed, including non-equi joins
SELECT a.* FROM a JOIN b ON (a.id = b.id)
SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)
SELECT a.* FROM a LEFT OUTER JOIN b ON (a.id <> b.id)
2) More than two tables can be joined in a single query
SELECT a.val, b.val, c.val FROM a
JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key2)
3) If every table uses the same column in its join clause, Hive converts the multi-table join into a single MR job
SELECT a.val, b.val, c.val FROM a
JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key1)
--since only b's key1 column is involved in the joins, this is executed as a single MR job
SELECT a.val, b.val, c.val FROM a
JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key2)
--this is converted into two MR jobs, because the first join condition uses column key1 of b while the second uses column key2 of b.
-- the first map/reduce job joins a with b, and its result is then joined with c in the second map/reduce job.
4) Where to place large vs. small tables
The last table in a join is streamed through the reducers while the earlier tables are buffered there, so placing the largest table last helps reduce the memory the reduce stage needs for buffering.
SELECT a.val, b.val, c.val FROM a
JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key1)
--since only b's key1 column is involved, this runs as one MR job, and for each particular join-key value the rows of a and b are buffered in reducer memory.
--then, for each row retrieved from c, the join is computed against the buffered rows.
SELECT a.val, b.val, c.val FROM a
JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key2)
--this computation takes two MR jobs. The first joins a with b, buffering a's values while streaming b's values through the reducers.
-- in the second MR job, the result of the first join is buffered while c's values are streamed through the reducers.
5) The STREAMTABLE hint can specify which table to stream during a join
--if the STREAMTABLE hint is omitted, Hive streams the rightmost table.
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a
JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key1)
--a, b, and c are joined in a single MR job, and for each particular join-key value the rows of b and c are buffered in reducer memory.
-- then, for each row retrieved from a, the join is computed against the buffered rows.
6) Joins are evaluated before WHERE conditions
7) If all but one of the tables being joined are small, the join can run as a map-only job (map join)
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a
JOIN b ON a.key = b.key
--no reducer is needed: for each mapper over A, B is read in full. The limitation is that a FULL / RIGHT OUTER JOIN against b cannot be performed this way.
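The map-join idea above amounts to loading the small table fully into memory as a hash table and probing it while the large table is merely scanned, with no reduce phase. A minimal sketch with hypothetical rows:

```python
# small table b, loaded fully into memory (the broadcast/hash side)
b = {1: "x", 2: "y"}

# large table a is only streamed; each "mapper" probes the hash table
a = [(1, "a1"), (2, "a2"), (3, "a3")]
joined = [(k, v, b[k]) for k, v in a if k in b]
print(joined)  # [(1, 'a1', 'x'), (2, 'a2', 'y')]
```

This also shows the stated limitation: unmatched rows of b (the in-memory side) are never emitted, which is why a FULL or RIGHT OUTER JOIN against b cannot be realized this way.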
This concludes the detailed operations and examples for Hive grouping, sorting, CTEs, and joins.