国产无码综合区,色欲AV无码国产永久播放,无码天堂亚洲国产AV,国产日韩欧美女同一区二区

【系統(tǒng)學(xué)習(xí)Hive常用知識(shí)】

2年前作者：zhangqw1013分類：Toy博客閱讀(12)違法舉報(bào)

這篇具有很好參考價(jià)值的文章主要介紹了【系統(tǒng)學(xué)習(xí)Hive常用知識(shí)】。希望對(duì)大家有所幫助。如果存在錯(cuò)誤或未考慮完全的地方，請(qǐng)大家不吝賜教，您也可以點(diǎn)擊"舉報(bào)違法"按鈕提交疑問(wèn)。

前言

Hive是由Facebook開(kāi)源，基于Hadoop的一個(gè)數(shù)據(jù)倉(cāng)庫(kù)工具，可以將結(jié)構(gòu)化的數(shù)據(jù)文件映射為一張表，并提供類SQL查詢功能。

一、Hive是什么？

Hive是一個(gè)Hadoop客戶端，用于將HQL（Hive SQL）轉(zhuǎn)化成MapReduce程序。
（1）Hive中每張表的數(shù)據(jù)存儲(chǔ)在HDFS
（2）Hive分析數(shù)據(jù)底層的實(shí)現(xiàn)是MapReduce（也可配置為Spark或者Tez）
（3）執(zhí)行程序運(yùn)行在Yarn上

1）用戶接口：Client
CLI（command-line interface）、JDBC/ODBC。
說(shuō)明：JDBC和ODBC的區(qū)別。
（1）JDBC的移植性比ODBC好；（通常情況下，安裝完ODBC驅(qū)動(dòng)程序之后，還需要經(jīng)過(guò)確定的配置才能夠應(yīng)用。而不相同的配置在不相同數(shù)據(jù)庫(kù)服務(wù)器之間不能夠通用。所以，安裝一次就需要再配置一次。JDBC只需要選取適當(dāng)?shù)腏DBC數(shù)據(jù)庫(kù)驅(qū)動(dòng)程序，就不需要額外的配置。在安裝過(guò)程中，JDBC數(shù)據(jù)庫(kù)驅(qū)動(dòng)程序會(huì)自己完成有關(guān)的配置。）
（2）兩者使用的語(yǔ)言不同，JDBC在Java編程時(shí)使用，ODBC一般在C/C++編程時(shí)使用。

2）元數(shù)據(jù)：Metastore
元數(shù)據(jù)包括：數(shù)據(jù)庫(kù)（默認(rèn)是default）、表名、表的擁有者、列/分區(qū)字段、表的類型（是否是外部表）、表的數(shù)據(jù)所在目錄等。
默認(rèn)存儲(chǔ)在自帶的derby數(shù)據(jù)庫(kù)中，由于derby數(shù)據(jù)庫(kù)只支持單客戶端訪問(wèn)，生產(chǎn)環(huán)境中為了多人開(kāi)發(fā)，推薦使用MySQL存儲(chǔ)Metastore。

二、Hive安裝配置

1.hive包安裝

1）把a(bǔ)pache-hive-3.1.3-bin.tar.gz上傳到Linux的/opt/software目錄下
2）解壓apache-hive-3.1.3-bin.tar.gz到/opt/module/目錄下面

 tar -zxvf /opt/software/apache-hive-3.1.3-bin.tar.gz -C /opt/module/

3）修改apache-hive-3.1.3-bin.tar.gz的名稱為hive

 mv /opt/module/apache-hive-3.1.3-bin/ /opt/module/hive

4）修改/etc/profile.d/my_env.sh，添加環(huán)境變量

 sudo vim /etc/profile.d/my_env.sh

（1）添加內(nèi)容

#HIVE_HOME
export HIVE_HOME=/opt/module/hive
export PATH=$PATH:$HIVE_HOME/bin

（2）source一下

source /etc/profile.d/my_env.sh

5）初始化元數(shù)據(jù)庫(kù)（默認(rèn)是derby數(shù)據(jù)庫(kù)）

bin/schematool -dbType derby -initSchema

2、配置Hive元數(shù)據(jù)存儲(chǔ)到MySQL

1、mysql創(chuàng)建metastore數(shù)據(jù)庫(kù)

#登錄MySQL
mysql -uroot -p****

#創(chuàng)建Hive元數(shù)據(jù)庫(kù)
mysql> create database metastore;
mysql> quit;

2）將MySQL的JDBC驅(qū)動(dòng)拷貝到Hive的lib目錄下。

 cp /opt/software/mysql-connector-java-5.1.37.jar $HIVE_HOME/lib

3）在$HIVE_HOME/conf目錄下新建hive-site.xml文件

 vim $HIVE_HOME/conf/hive-site.xml

4）添加以下內(nèi)容

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- jdbc連接的URL -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop102:3306/metastore?useSSL=false</value>
    </property>
    
    <!-- jdbc連接的Driver-->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    
	<!-- jdbc連接的username-->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>

    <!-- jdbc連接的password -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>****</value>
    </property>

    <!-- Hive默認(rèn)在HDFS的工作目錄 -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
</configuration>

5）初始化Hive元數(shù)據(jù)庫(kù)（修改為采用MySQL存儲(chǔ)元數(shù)據(jù)）

 bin/schematool -dbType mysql -initSchema -verbose

3、Hive服務(wù)部署

在遠(yuǎn)程訪問(wèn)Hive數(shù)據(jù)時(shí)，客戶端并未直接訪問(wèn)Hadoop集群，而是由Hivesever2代理訪問(wèn)。由于Hadoop集群中的數(shù)據(jù)具備訪問(wèn)權(quán)限控制，所以此時(shí)需考慮一個(gè)問(wèn)題：那就是訪問(wèn)Hadoop集群的用戶身份是誰(shuí)？是Hiveserver2的啟動(dòng)用戶？還是客戶端的登錄用戶？答案是都有可能，具體是誰(shuí)，由Hiveserver2的hive.server2.enable.doAs參數(shù)決定，該參數(shù)的含義是是否啟用Hiveserver2用戶模擬的功能。若啟用，則Hiveserver2會(huì)模擬成客戶端的登錄用戶去訪問(wèn)Hadoop集群的數(shù)據(jù)，不啟用，則Hivesever2會(huì)直接使用啟動(dòng)用戶訪問(wèn)Hadoop集群數(shù)據(jù)。模擬用戶的功能，默認(rèn)是開(kāi)啟的。生產(chǎn)環(huán)境，推薦開(kāi)啟用戶模擬功能，因?yàn)殚_(kāi)啟后才能保證各用戶之間的權(quán)限隔離。hivesever2的模擬用戶功能，依賴于Hadoop提供的proxy user（代理用戶功能），只有Hadoop中的代理用戶才能模擬其他用戶的身份訪問(wèn)Hadoop集群。因此，需要將hiveserver2的啟動(dòng)用戶設(shè)置為Hadoop的代理用戶，配置方式如下：

(1)修改配置文件core-site.xml

cd $HADOOP_HOME/etc/hadoop
vim core-site.xml

增加如下配置：

<!--配置所有節(jié)點(diǎn)的zqw用戶都可作為代理用戶-->
<property>
    <name>hadoop.proxyuser.zqw.hosts</name>
    <value>*</value>
</property>

<!--配置zqw用戶能夠代理的用戶組為任意組-->
<property>
    <name>hadoop.proxyuser.zqw.groups</name>
    <value>*</value>
</property>

<!--配置zqw用戶能夠代理的用戶為任意用戶-->
<property>
    <name>hadoop.proxyuser.zqw.users</name>
    <value>*</value>
</property>

（2）Hive端配置
在hive-site.xml文件中添加如下配置信息

vim hive-site.xml

<!-- 指定hiveserver2連接的host -->
<property>
	<name>hive.server2.thrift.bind.host</name>
	<value>hadoop102</value>
</property>

<!-- 指定hiveserver2連接的端口號(hào) -->
<property>
	<name>hive.server2.thrift.port</name>
	<value>10000</value>
</property>

三、Hive數(shù)據(jù)操作

1、DDL數(shù)據(jù)定義

創(chuàng)建數(shù)據(jù)庫(kù)

創(chuàng)建表
語(yǔ)法：

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name   
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) 
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format] 
[STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]

（1）內(nèi)部表與外部表

Hive中默認(rèn)創(chuàng)建的表都是的內(nèi)部表，有時(shí)也被稱為管理表。對(duì)于內(nèi)部表，Hive會(huì)完全管理表的元數(shù)據(jù)和數(shù)據(jù)文件。
創(chuàng)建內(nèi)部表

create table if not exists student(
    id int, 
    name string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/student';

創(chuàng)建外部表

create external table if not exists student(
    id int, 
    name string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/student';

刪除表時(shí)，內(nèi)部表會(huì)將hdfs的數(shù)據(jù)文件一并刪除，外部表不會(huì)。
(1)

(2)以json數(shù)據(jù)格式創(chuàng)建，使用專門負(fù)責(zé)JSON文件的JSON Serde，設(shè)計(jì)表字段時(shí)，表的字段與JSON字符串中的一級(jí)字段保持一致，對(duì)于具有嵌套結(jié)構(gòu)的JSON字符串，考慮使用合適復(fù)雜數(shù)據(jù)類型保存其內(nèi)容。

{
    "name": "laoshi",
    "friends": [
        "xiaoming",
        "xiaogang"
    ],
    "students": {
        "xiaoli": 18,
        "xiaozhang": 16
    },
    "address": {
        "street": "shang hai nan lu",
        "city": "beijing",
        "postal_code": 10010
    }
}

create table if not exists teacher(
    name string,
    friends array<string>,
    students map<string,int>,
    address struct<city:string,street:string,postal_code:int>
) row format serde 'org.apache.hadoop.hive.serde2.JsonSerDe'
location '/teacher';

(3)as select 與 create table like 創(chuàng)建表

create table teacher1 as select * from teacher;
create table teacher2 like teacher;

修改表

（1）表重命名

ALTER TABLE table_name RENAME TO new_table_name

（2）增加列

ALTER TABLE table_name ADD COLUMNS (col_name data_type [COMMENT col_comment], ...)

（3）修改列

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]

（4）替換列

ALTER TABLE table_name REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)

刪除表

DROP TABLE [IF EXISTS] table_name;

清空表

TRUNCATE [TABLE] table_name

truncate只能清空管理表，不能刪除外部表中數(shù)據(jù)。

2、DML數(shù)據(jù)操作

Load
Load語(yǔ)句可將文件導(dǎo)入到Hive表中。
語(yǔ)法：

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)];

（1）local：表示從本地加載數(shù)據(jù)到Hive表；否則從HDFS加載數(shù)據(jù)到Hive表。
（2）overwrite：表示覆蓋表中已有數(shù)據(jù)，否則表示追加。
（3）partition：表示上傳到指定分區(qū)，若目標(biāo)是分區(qū)表，需指定分區(qū)。
案例：

create table student(
    id int, 
    name string
) 
row format delimited fields terminated by '\t';

1）加載本地文件到hive

 load data local inpath '/opt/module/datas/student.txt' into table student;

（2）加載HDFS文件到hive中

load data inpath '/user/zqw/student.txt' into table student;
#注意：加載之后HDFS數(shù)據(jù)文件將刪除

（3）加載數(shù)據(jù)覆蓋表中已有的數(shù)據(jù)

load data inpath '/user/zqw/student.txt' overwrite into table student;

Insert
（1）查詢結(jié)果插入到表中
語(yǔ)法：

INSERT (INTO | OVERWRITE) TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement;

1）INTO：將結(jié)果追加到目標(biāo)表
2）OVERWRITE：用結(jié)果覆蓋原有數(shù)據(jù)
案例：

create table student1(
    id int, 
    name string
) 
row format delimited fields terminated by '\t';

insert overwrite table student3 select id, name from student;

（2）將給定Values插入表中

insert into table  student1 values(1,'zhangsan'),(2,'lisi');

（3）將查詢結(jié)果寫(xiě)入目標(biāo)路徑
語(yǔ)法：

INSERT OVERWRITE [LOCAL] DIRECTORY directory
  [ROW FORMAT row_format] [STORED AS file_format] select_statement;

案例：

insert overwrite local directory '/opt/module/datas/student' ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
select id,name from student;

3、Export&Import

Export導(dǎo)出語(yǔ)句可將表的數(shù)據(jù)和元數(shù)據(jù)信息一并到處的HDFS路徑，Import可將Export導(dǎo)出的內(nèi)容導(dǎo)入Hive，表的數(shù)據(jù)和元數(shù)據(jù)信息都會(huì)恢復(fù)。Export和Import可用于兩個(gè)Hive實(shí)例之間的數(shù)據(jù)遷移。

語(yǔ)法

--導(dǎo)出
EXPORT TABLE tablename TO 'export_target_path'
--導(dǎo)入
IMPORT [EXTERNAL] TABLE new_or_original_tablename FROM 'source_path' [LOCATION 'import_target_path']

案例

--導(dǎo)出
export table default.student to '/user/hive/warehouse/export/student';
--導(dǎo)入，無(wú)需提前創(chuàng)建表
import table student2 from '/user/hive/warehouse/export/student';

四、查詢

語(yǔ)法

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference       -- 從什么表查
  [WHERE where_condition]   -- 過(guò)濾
  [GROUP BY col_list]        -- 分組查詢
   [HAVING col_list]          -- 分組后過(guò)濾
  [ORDER BY col_list]        -- 排序
  [CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
 [LIMIT number]                -- 限制輸出的行數(shù)

基本語(yǔ)法與關(guān)系數(shù)據(jù)庫(kù)相同，在此不再記錄

1、Sort By

每個(gè)Reduce內(nèi)部排序
Sort By：對(duì)于大規(guī)模的數(shù)據(jù)集order by的效率非常低。在很多情況下，并不需要全局排序，此時(shí)可以使用Sort by。
Sort by為每個(gè)reduce產(chǎn)生一個(gè)排序文件。每個(gè)Reduce內(nèi)部進(jìn)行排序，對(duì)全局結(jié)果集來(lái)說(shuō)不是排序。

set mapreduce.job.reduces=3;

select * from emp sort by deptno desc;

2、分區(qū)（Distribute By）

Distribute By：在有些情況下，我們需要控制某個(gè)特定行應(yīng)該到哪個(gè)Reducer，通常是為了進(jìn)行后續(xù)的聚集操作。distribute by子句可以做這件事。distribute by類似MapReduce中partition（自定義分區(qū)），進(jìn)行分區(qū)，結(jié)合sort by使用。對(duì)于distribute by進(jìn)行測(cè)試，一定要分配多reduce進(jìn)行處理，否則無(wú)法看到distribute by的效果。
案例

--distribute by的分區(qū)規(guī)則是根據(jù)分區(qū)字段的hash碼與reduce的個(gè)數(shù)進(jìn)行相除后，余數(shù)相同的分到一個(gè)區(qū)。
--Hive要求distribute by語(yǔ)句要寫(xiě)在sort by語(yǔ)句之前。
select * from emp distribute by deptno sort by sal desc;

注意：按照部門編號(hào)分區(qū)，不一定就是固定死的數(shù)值，可以是20號(hào)和30號(hào)部門分到一個(gè)分區(qū)里面去。

3、分區(qū)排序（Cluster By）

當(dāng)distribute by和sort by字段相同時(shí)，可以使用cluster by方式。
cluster by除了具有distribute by的功能外還兼具sort by的功能。但是排序只能是升序排序，不能指定排序規(guī)則為asc或者desc。

select * from emp cluster by deptno;

五、函數(shù)

1、數(shù)值函數(shù)

1）round：四舍五入

select round(3.3);   
--3

2）ceil：向上取整

select ceil(3.1) ;   
--4

3）floor：向下取整

select floor(4.8);  
--4

2、字符串函數(shù)

1）substring：截取字符串
語(yǔ)法一：substring(string A, int start)
返回值：string
說(shuō)明：返回字符串A從start位置到結(jié)尾的字符串
語(yǔ)法二：substring(string A, int start, int len)
返回值：string
說(shuō)明：返回字符串A從start位置開(kāi)始，長(zhǎng)度為len的字符串

2）replace ：替換
語(yǔ)法：replace(string A, string B, string C)
返回值：string
說(shuō)明：將字符串A中的子字符串B替換為C。

3）repeat：重復(fù)字符串
語(yǔ)法：repeat(string A, int n)
返回值：string
說(shuō)明：將字符串A重復(fù)n遍。

4）split ：字符串切割
語(yǔ)法：split(string str, string pat)
返回值：array
說(shuō)明：按照正則表達(dá)式pat匹配到的內(nèi)容分割str，分割后的字符串，以數(shù)組的形式返回。

5）nvl ：替換null值
語(yǔ)法：nvl(A,B)
說(shuō)明：若A的值不為null，則返回A，否則返回B。

6）concat ：拼接字符串
語(yǔ)法：concat(string A, string B, string C, ……)
返回：string
說(shuō)明：將A,B,C……等字符拼接為一個(gè)字符串

7）concat_ws：以指定分隔符拼接字符串或者字符串?dāng)?shù)組
語(yǔ)法：concat_ws(string A, string…| array(string))
返回值：string
說(shuō)明：使用分隔符A拼接多個(gè)字符串，或者一個(gè)數(shù)組的所有元素。

8）get_json_object：解析json字符串
語(yǔ)法：get_json_object(string json_string, string path)
返回值：string
說(shuō)明：解析json的字符串json_string，返回path指定的內(nèi)容。如果輸入的json字符串無(wú)效，那么返回NULL。
案例：
案例實(shí)操：
（1）獲取json數(shù)組里面的json具體數(shù)據(jù)

select get_json_object('[{"name":"張三","sex":"男","age":"25"},{"name":"李四","sex":"男","age":"47"}]','$.[0].name');
--張三

（2）獲取json數(shù)組里面的數(shù)據(jù)

select get_json_object('[{"name":"張三","sex":"男","age":"25"},{"name":"李四","sex":"男","age":"47"}]','$.[0]');
--{"name":"張三","sex":"男","age":"25"}

3、日期函數(shù)

1）unix_timestamp：返回當(dāng)前或指定時(shí)間的時(shí)間戳
語(yǔ)法：unix_timestamp()
返回值：bigint

案例：

select unix_timestamp('2022/08/08 08-08-08','yyyy/MM/dd HH-mm-ss');  
--前面是日期，后面是指日期傳進(jìn)來(lái)的具體格式

3）current_date：當(dāng)前日期

select current_date;

4）current_timestamp：當(dāng)前的日期加時(shí)間，并且精確的毫秒

select current_timestamp;

5）獲取日期中的年、月、日、小時(shí)

--年
select year('2022-08-08 08:08:08');
--月
select month('2022-08-08 08:08:08');
--日
select day('2022-08-08 08:08:08');
--小時(shí)
select hour('2022-08-08 08:08:08');

6）datediff：兩個(gè)日期相差的天數(shù)（結(jié)束日期減去開(kāi)始日期的天數(shù)）
語(yǔ)法：datediff(string enddate, string startdate)
返回值：int
案例：

select datediff('2021-08-08','2022-10-09');

7）date_add：日期加天數(shù)
語(yǔ)法：date_add(string startdate, int days)
返回值：string
說(shuō)明：返回開(kāi)始日期 startdate 增加 days 天后的日期
案例實(shí)操：

select date_add('2022-08-08',2);

8）date_sub：日期減天數(shù)
語(yǔ)法：date_sub (string startdate, int days)
返回值：string
說(shuō)明：返回開(kāi)始日期startdate減少days天后的日期。
案例實(shí)操：

select date_sub('2022-08-08',2);

9）date_format:將標(biāo)準(zhǔn)日期解析成指定格式字符串

select date_format('2022-08-08','yyyy年-MM月-dd日')

4、控制函數(shù)

1）case when：條件判斷函數(shù)
語(yǔ)法一：case when a then b [when c then d]* [else e] end
返回值：T
說(shuō)明：如果a為true，則返回b；如果c為true，則返回d；否則返回 e

select case when 1=2 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;

語(yǔ)法二： case a when b then c [when d then e]* [else f] end
返回值: T
說(shuō)明：如果a等于b，那么返回c；如果a等于d，那么返回e；否則返回f

select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;

2）if: 條件判斷，類似于Java中三元運(yùn)算符
語(yǔ)法：if（boolean testCondition, T valueTrue, T valueFalseOrNull）
返回值：T
說(shuō)明：當(dāng)條件testCondition為true時(shí)，返回valueTrue；否則返回valueFalseOrNull
（1）條件滿足，輸出正確

select if(10 > 5,'正確','錯(cuò)誤');

5、集合函數(shù)

1）size：集合中元素的個(gè)數(shù)

select size(friends) from test;  
--2/2  每一行數(shù)據(jù)中的friends集合里的個(gè)數(shù)

2）map：創(chuàng)建map集合
語(yǔ)法：map (key1, value1, key2, value2, …)
說(shuō)明：根據(jù)輸入的key和value對(duì)構(gòu)建map類型

select map('xiaozhang',1,'xiaoli',2);

3）map_keys：返回map中的key

select map_keys(map('xiaozhang',1,'xiaoli',2));
--["xiaozhang","xiaoli"]

4）map_values: 返回map中的value

select map_values(map('xiaozhang',1,'xiaoli',2));
--[1,2]

5）array 聲明array集合
語(yǔ)法：array(val1, val2, …)
說(shuō)明：根據(jù)輸入的參數(shù)構(gòu)建數(shù)組array類

select array('1','2','3','4');
--["1","2","3","4"]

6）array_contains: 判斷array中是否包含某個(gè)元素

select array_contains(array('a','b','c','d'),'a');
--true

7）sort_array：將array中的元素排序

select sort_array(array('a','d','c'));
--["a","c","d"]

8）struct聲明struct中的各屬性
語(yǔ)法：struct(val1, val2, val3, …)
說(shuō)明：根據(jù)輸入的參數(shù)構(gòu)建結(jié)構(gòu)體struct類

select struct('name','age','weight');
--{"col1":"name","col2":"age","col3":"weight"}

9）named_struct聲明struct的屬性和值

select named_struct('name','xiaosong','age',18,'weight',80);
--{"name":"xiaosong","age":18,"weight":80}

6、聚合函數(shù)

多進(jìn)一出（多行傳入，一個(gè)行輸出）
（1）collect_list 收集并形成list集合，結(jié)果不去重

select deptno,collect_list(ename) from emp group by deptno;

（2）collect_set 收集并形成set集合，結(jié)果去重

select deptno,collect_set(job) from emp group by deptno;

7、炸裂函數(shù)

一進(jìn)多出（一行傳入，多行輸出）

select movie,cate 
from
(select movie, split(category,',') as cates 
from movie_info)t lateral view explode(cates) tmp as cate

8、窗口函數(shù)

窗口函數(shù)，能為每行數(shù)據(jù)劃分一個(gè)窗口，然后對(duì)窗口范圍內(nèi)的數(shù)據(jù)進(jìn)行計(jì)算，最后將計(jì)算結(jié)果返回給該行數(shù)據(jù)。
按照功能，常用窗口可劃分為如下幾類：聚合函數(shù)、跨行取值函數(shù)、排名函數(shù)。
1）跨行取值函數(shù)
（1）lead和lag
獲取當(dāng)前行的上/下邊某行、某個(gè)字段的值。
案例

select order_id,user_id,user_name,order_date,
		lag(order_date,1,null) over (partition by user_id order by order_date)as last_date,
		lead(order_date,1,null) over (partition by user_id order by order_date) as next_date
from order_info

（2）first_value和last_value
獲取窗口內(nèi)某列第一個(gè)和最后一個(gè)值

select order_id,user_id,user_name,order_date,
       first_value(order_date,false) over (partition by user_id order by order_date) as fir,
       last_value(order_date,false) over (partition by user_id order by order_date) as las
from order_info;

2）排名函數(shù)
常用窗口函數(shù)——rank、dense_rank、row_number
rank：重復(fù)值排名一樣，后續(xù)排名會(huì)不連續(xù)，比如：1，2，2，4，5，5，7
dense_rank：重復(fù)值排名一樣，后續(xù)排名會(huì)連續(xù)，比如：1，2，2，3，4，4，5
row_number：行號(hào)
案例

select ename,sal,deptno,rank() over (partition by deptno order by sal desc) as rk,
       dense_rank() over (partition by deptno order by sal desc) as dk,
       row_number() over (partition by deptno order by sal desc) as rn
       from emp;

結(jié)果：

ename	sal	depyno	rk	dk	rn
小飛	3000	10	1	1	1
趙六	3000	10	1	1	2
王五	1250	10	3	2	3
小李	1100	10	4	3	4
李四	1600	20	1	1	1
馬八	2850	30	1	1	1

3）指定行
PRECEDING：往前
FOLLOWING：往后
CURRENT ROW：當(dāng)前行
UNBOUNDED：起點(diǎn)（一般結(jié)合PRECEDING，F(xiàn)OLLOWING使用）
UNBOUNDED PRECEDING 表示該窗口最前面的行（起點(diǎn)）
UNBOUNDED FOLLOWING：表示該窗口最后面的行（終點(diǎn)）
比如說(shuō)：
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW（表示從起點(diǎn)到當(dāng)前行）
ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING（表示往前2行到往后1行）
ROWS BETWEEN 2 PRECEDING AND 1 CURRENT ROW（表示往前2行到當(dāng)前行）
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING（表示當(dāng)前行到終點(diǎn)）

select order_id,order_amount,user_id,user_name,
       sum(order_amount) over (partition by user_id order by order_date rows between unbounded preceding and current row ) as sum_so_far
from order_info;

六、分區(qū)表和分桶表

1、分區(qū)表

Hive中的分區(qū)就是把一張大表的數(shù)據(jù)按照業(yè)務(wù)需要分散的存儲(chǔ)到多個(gè)目錄，每個(gè)目錄就稱為該表的一個(gè)分區(qū)。在查詢時(shí)通過(guò)where子句中的表達(dá)式選擇查詢所需要的分區(qū)，這樣的查詢效率會(huì)提高很多。
創(chuàng)建分區(qū)表

hive (default)> 
create table dept_partition
(
    deptno int,    --部門編號(hào)
    dname  string, --部門名稱
    loc    string  --部門位置
)
    partitioned by (day string)
    row format delimited fields terminated by '\t';

裝載數(shù)據(jù)

load data local inpath '/opt/module/hive/datas/dept_20220401.log' 
into table dept_partition 
partition(day='20220401');

將day='20220401’分區(qū)的數(shù)據(jù)插入到day='20220402’分區(qū)，可執(zhí)行如下裝載語(yǔ)句

insert overwrite table dept_partition partition (day = '20220402')
select deptno, dname, loc
from dept_partition
where day = '2020-04-01';

讀數(shù)據(jù)
查詢分區(qū)表數(shù)據(jù)時(shí)，可以將分區(qū)字段看作表的偽列，可像使用其他字段一樣使用分區(qū)字段。

select deptno, dname, loc ,day
from dept_partition
where day = '2020-04-01';

增加分區(qū)

alter table dept_partition add partition(day='20220403');

同時(shí)創(chuàng)建多個(gè)分區(qū)

alter table dept_partition add partition(day='20220404') partition(day='20220405');

刪除分區(qū)

alter table dept_partition drop partition (day='20220403');

同時(shí)刪除多個(gè)分區(qū)

alter table dept_partition drop partition (day='20220404'), partition(day='20220405');

2、修復(fù)分區(qū)

Hive將分區(qū)表的所有分區(qū)信息都保存在了元數(shù)據(jù)中，只有元數(shù)據(jù)與HDFS上的分區(qū)路徑一致時(shí)，分區(qū)表才能正常讀寫(xiě)數(shù)據(jù)。若用戶手動(dòng)創(chuàng)建/刪除分區(qū)路徑，Hive都是感知不到的，這樣就會(huì)導(dǎo)致Hive的元數(shù)據(jù)和HDFS的分區(qū)路徑不一致。除了使用add partition、drop partition手動(dòng)修復(fù)分區(qū)之外，使用msck命令進(jìn)行修復(fù)，以下是該命令的用法說(shuō)明。

msck repair table table_name [add/drop/sync partitions];

說(shuō)明：

msck repair table table_name add partitions
--該命令會(huì)增加HDFS路徑存在但元數(shù)據(jù)缺失的分區(qū)信息。
msck repair table table_name drop partitions
--該命令會(huì)刪除HDFS路徑已經(jīng)刪除但元數(shù)據(jù)仍然存在的分區(qū)信息。
msck repair table table_name sync partitions
--該命令會(huì)同步HDFS路徑和元數(shù)據(jù)分區(qū)信息，相當(dāng)于同時(shí)執(zhí)行上述的兩個(gè)命令。
msck repair table table_name
--等價(jià)于msck repair table table_name add partitions命令。

3、二級(jí)分區(qū)

可以在按天分區(qū)的基礎(chǔ)上，再對(duì)每天的數(shù)據(jù)按小時(shí)進(jìn)行分區(qū)

hive (default)>
create table dept_partition2(
    deptno int,    -- 部門編號(hào)
    dname string, -- 部門名稱
    loc string     -- 部門位置
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';

load data local inpath '/opt/module/hive/datas/dept_20220401.log' 
into table dept_partition2 
partition(day='20220401', hour='12');

select * 
from dept_partition2 
where day='20220401' and hour='12';

4、動(dòng)態(tài)分區(qū)

動(dòng)態(tài)分區(qū)是指向分區(qū)表insert數(shù)據(jù)時(shí)，被寫(xiě)往的分區(qū)不由用戶指定，而是由每行數(shù)據(jù)的最后一個(gè)字段的值來(lái)動(dòng)態(tài)的決定。使用動(dòng)態(tài)分區(qū)，可只用一個(gè)insert語(yǔ)句將數(shù)據(jù)寫(xiě)入多個(gè)分區(qū)。
（1）動(dòng)態(tài)分區(qū)功能總開(kāi)關(guān)（默認(rèn)true，開(kāi)啟）

set hive.exec.dynamic.partition=true

（2）嚴(yán)格模式和非嚴(yán)格模式
動(dòng)態(tài)分區(qū)的模式，默認(rèn)strict（嚴(yán)格模式），要求必須指定至少一個(gè)分區(qū)為靜態(tài)分區(qū)，nonstrict（非嚴(yán)格模式）允許所有的分區(qū)字段都使用動(dòng)態(tài)分區(qū)。

set hive.exec.dynamic.partition.mode=nonstrict

（3）一條insert語(yǔ)句可同時(shí)創(chuàng)建的最大的分區(qū)個(gè)數(shù)，默認(rèn)為1000。

set hive.exec.max.dynamic.partitions=1000

（4）單個(gè)Mapper或者Reducer可同時(shí)創(chuàng)建的最大的分區(qū)個(gè)數(shù)，默認(rèn)為100。

set hive.exec.max.dynamic.partitions.pernode=100

（5）一條insert語(yǔ)句可以創(chuàng)建的最大的文件個(gè)數(shù)，默認(rèn)100000。

set hive.exec.max.created.files=100000

（6）當(dāng)查詢結(jié)果為空時(shí)且進(jìn)行動(dòng)態(tài)分區(qū)時(shí)，是否拋出異常，默認(rèn)false。

set hive.error.on.empty.partition=false

案例

create table dept_partition_dynamic(
    id int, 
    name string
) 
partitioned by (loc int) 
row format delimited fields terminated by '\t';

set hive.exec.dynamic.partition.mode = nonstrict;
insert into table dept_partition_dynamic 
partition(loc) 
select 
    deptno, 
    dname, 
    loc 
from dept;

5、分桶表

分區(qū)提供一個(gè)隔離數(shù)據(jù)和優(yōu)化查詢的便利方式。不過(guò)，并非所有的數(shù)據(jù)集都可形成合理的分區(qū)。對(duì)于一張表或者分區(qū)，Hive 可以進(jìn)一步組織成桶，也就是更為細(xì)粒度的數(shù)據(jù)范圍劃分，分區(qū)針對(duì)的是數(shù)據(jù)的存儲(chǔ)路徑，分桶針對(duì)的是數(shù)據(jù)文件。
分桶表的基本原理是，首先為每行數(shù)據(jù)計(jì)算一個(gè)指定字段的數(shù)據(jù)的hash值，然后模以一個(gè)指定的分桶數(shù)，最后將取模運(yùn)算結(jié)果相同的行，寫(xiě)入同一個(gè)文件中，這個(gè)文件就稱為一個(gè)分桶（bucket）。
1）語(yǔ)法

create table stu_buck(
    id int, 
    name string
)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';

分桶排序表
1）語(yǔ)法

create table stu_buck_sort(
    id int, 
    name string
)
clustered by(id) sorted by(id)
into 4 buckets
row format delimited fields terminated by '\t';

七、Hive文件格式

為Hive表中的數(shù)據(jù)選擇一個(gè)合適的文件格式，對(duì)提高查詢性能的提高是十分有益的。Hive表數(shù)據(jù)的存儲(chǔ)格式，可以選擇text file、orc、parquet、sequence file等。

1、Text File

文本文件是Hive默認(rèn)使用的文件格式，文本文件中的一行內(nèi)容，就對(duì)應(yīng)Hive表中的一行記錄。

2、ORC

ORC（Optimized Row Columnar）file format是Hive 0.11版里引入的一種列式存儲(chǔ)的文件格式。ORC文件能夠提高Hive讀寫(xiě)數(shù)據(jù)和處理數(shù)據(jù)的性能。
（1）行存儲(chǔ)的特點(diǎn)
查詢滿足條件的一整行數(shù)據(jù)的時(shí)候，列存儲(chǔ)則需要去每個(gè)聚集的字段找到對(duì)應(yīng)的每個(gè)列的值，行存儲(chǔ)只需要找到其中一個(gè)值，其余的值都在相鄰地方，所以此時(shí)行存儲(chǔ)查詢的速度更快。
（2）列存儲(chǔ)的特點(diǎn)
因?yàn)槊總€(gè)字段的數(shù)據(jù)聚集存儲(chǔ)，在查詢只需要少數(shù)幾個(gè)字段的時(shí)候，能大大減少讀取的數(shù)據(jù)量；每個(gè)字段的數(shù)據(jù)類型一定是相同的，列式存儲(chǔ)可以針對(duì)性的設(shè)計(jì)更好的設(shè)計(jì)壓縮算法。
前文提到的text file和sequence file都是基于行存儲(chǔ)的，orc和parquet是基于列式存儲(chǔ)的。

每個(gè)Orc文件由Header、Body和Tail三部分組成。
其中Header內(nèi)容為ORC，用于表示文件類型。
Body由1個(gè)或多個(gè)stripe組成，每個(gè)stripe一般為HDFS的塊大小，每一個(gè)stripe包含多條記錄，這些記錄按照列進(jìn)行獨(dú)立存儲(chǔ)，每個(gè)stripe里有三部分組成，分別是Index Data，Row Data，Stripe Footer。
Index Data：一個(gè)輕量級(jí)的index，默認(rèn)是為各列每隔1W行做一個(gè)索引。每個(gè)索引會(huì)記錄第n萬(wàn)行的位置，和最近一萬(wàn)行的最大值和最小值等信息。
Row Data：存的是具體的數(shù)據(jù)，按列進(jìn)行存儲(chǔ)，并對(duì)每個(gè)列進(jìn)行編碼，分成多個(gè)Stream來(lái)存儲(chǔ)。
Stripe Footer：存放的是各個(gè)Stream的位置以及各column的編碼信息。
Tail由File Footer和PostScript組成。File Footer中保存了各Stripe的其實(shí)位置、索引長(zhǎng)度、數(shù)據(jù)長(zhǎng)度等信息，各Column的統(tǒng)計(jì)信息等；PostScript記錄了整個(gè)文件的壓縮類型以及File Footer的長(zhǎng)度信息等。
在讀取ORC文件時(shí)，會(huì)先從最后一個(gè)字節(jié)讀取PostScript長(zhǎng)度，進(jìn)而讀取到PostScript，從里面解析到File Footer長(zhǎng)度，進(jìn)而讀取FileFooter，從中解析到各個(gè)Stripe信息，再讀各個(gè)Stripe，即從后往前讀。

建表語(yǔ)句文章來(lái)源地址http://www.zghlxwxcb.cn/news/detail-828993.html

create table orc_table
(column_specs)
stored as orc
tblproperties (property_name=property_value, ...);

到了這里，關(guān)于【系統(tǒng)學(xué)習(xí)Hive常用知識(shí)】的文章就介紹完了。如果您還想了解更多內(nèi)容，請(qǐng)?jiān)谟疑辖撬阉鱐OY模板網(wǎng)以前的文章或繼續(xù)瀏覽下面的相關(guān)文章，希望大家以后多多支持TOY模板網(wǎng)！

本文來(lái)自互聯(lián)網(wǎng)用戶投稿，該文觀點(diǎn)僅代表作者本人，不代表本站立場(chǎng)。本站僅提供信息存儲(chǔ)空間服務(wù)，不擁有所有權(quán)，不承擔(dān)相關(guān)法律責(zé)任。如若轉(zhuǎn)載，請(qǐng)注明出處：如若內(nèi)容造成侵權(quán)/違法違規(guī)/事實(shí)不符，請(qǐng)點(diǎn)擊違法舉報(bào)進(jìn)行投訴反饋，一經(jīng)查實(shí)，立即刪除！

分享到：

領(lǐng)支付寶紅包贊助服務(wù)器費(fèi)用

大數(shù)據(jù)技術(shù)之Hadoop學(xué)習(xí)(七)——Hive數(shù)據(jù)倉(cāng)庫(kù)
目錄素材一、數(shù)據(jù)倉(cāng)庫(kù)簡(jiǎn)介 1、數(shù)據(jù)倉(cāng)庫(kù)的認(rèn)識(shí) （1）數(shù)據(jù)倉(cāng)庫(kù)是面向主題的。（2）數(shù)據(jù)倉(cāng)庫(kù)是隨時(shí)間變化的。（3）數(shù)據(jù)倉(cāng)庫(kù)相對(duì)穩(wěn)定（4）OLTP和OLAP 2、數(shù)據(jù)倉(cāng)庫(kù)的結(jié)構(gòu) （1）數(shù)據(jù)源（2）數(shù)據(jù)存儲(chǔ)及管理（3）OLAP 服務(wù)器（4）前端工具 3、數(shù)據(jù)倉(cāng)庫(kù)的數(shù)據(jù)模型（1）星狀模
2024年02月17日
瀏覽(59)
大數(shù)據(jù)畢業(yè)設(shè)計(jì)Flink+Hadoop+Hive地鐵客流量可視化地鐵客流量預(yù)測(cè) 交通大數(shù)據(jù) 地鐵客流量大數(shù)據(jù) 交通可視化機(jī)器學(xué)習(xí) 深度學(xué)習(xí) 人工智能知識(shí)圖譜數(shù)據(jù)可視化計(jì)算機(jī)畢業(yè)設(shè)計(jì)
河北傳媒學(xué)院本科畢業(yè) 論文開(kāi)題報(bào)告專業(yè) 小四號(hào)宋體班級(jí) 小四號(hào)宋體姓名小四號(hào)宋體學(xué)號(hào) 小四號(hào)宋體指導(dǎo)教師小四號(hào)宋體題目基于hadoop+spark的深圳市地鐵運(yùn)營(yíng)的分析與可視化（1.內(nèi)容包括：課題的來(lái)源及意義，國(guó)內(nèi)外發(fā)展?fàn)顩r，本課題的研究目標(biāo)、內(nèi)容、方法、手
2024年03月19日
瀏覽(36)
大數(shù)據(jù)學(xué)習(xí)：Hive常用函數(shù)
1.1 Hive命令行查看hive命令的參數(shù) 語(yǔ)法結(jié)構(gòu): hive [-hiveconf x=y]* [-i filename]* [-f filename|-e query-string][-S] 說(shuō)明： -i 從文件初始化HQL。 -e從命令行執(zhí)行指定的HQL -f 執(zhí)行HQL腳本 -v 輸出執(zhí)行的HQL語(yǔ)句到控制臺(tái) -p connect to Hive Server on port number -hiveconf x=y Use this to set hive/hadoop configuration vari
2024年02月11日
瀏覽(19)
Hadoop+hive+flask+echarts大數(shù)據(jù)可視化之系統(tǒng)數(shù)據(jù)收集
? ? ? ?談到大數(shù)據(jù)的項(xiàng)目，一般以數(shù)據(jù)可視化為主體，收集大數(shù)據(jù)中的有用信息，存儲(chǔ)到分布式存儲(chǔ)系統(tǒng)hadoop中，由hive導(dǎo)入hadoop中存儲(chǔ)的數(shù)據(jù)，使用HQL語(yǔ)句對(duì)數(shù)據(jù)進(jìn)行分析，hive底層會(huì)將HQL語(yǔ)句轉(zhuǎn)化成mapreduce程序，flask作為python語(yǔ)言的后臺(tái)技術(shù)，可以連接hive將HQL語(yǔ)句的分析結(jié)
2023年04月13日
瀏覽(31)
Hadoop3.0大數(shù)據(jù)處理學(xué)習(xí)1（Haddop介紹、部署、Hive部署）
學(xué)習(xí)步驟：三大組件的基本理論和實(shí)際操作 Hadoop3的使用，實(shí)際開(kāi)發(fā)流程結(jié)合具體問(wèn)題，提供排查思路開(kāi)發(fā)技術(shù)棧： Linux基礎(chǔ)操作、Sehll腳本基礎(chǔ) JavaSE、Idea操作 MySQL Hadoop是一個(gè)適合海量數(shù)據(jù)存儲(chǔ)與計(jì)算的平臺(tái)。是基于Google的GoogleFS、Map Reduce、BigTable實(shí)現(xiàn)的。移動(dòng)數(shù)據(jù)：數(shù)據(jù)
2024年02月08日
瀏覽(55)
（最新版本）hive4.0.0 + hadoop3.3.4 集群安裝（無(wú)坑版）-大數(shù)據(jù)學(xué)習(xí)系列（一）
網(wǎng)上的找的文檔大多殘缺不靠譜，所以我整理了一份安裝最新版本的hive4..0.0+hadoop3.3.4的學(xué)習(xí)環(huán)境，可以提供大家安裝一個(gè)完整的hive+hadoop的環(huán)境供學(xué)習(xí)。由于在公司擔(dān)任大數(shù)據(jù)的培訓(xùn)工作后續(xù)還會(huì)更新一些基礎(chǔ)的文章,希望能幫助到大家。機(jī)器信息 Hostname k8s-master k8s-node1 k8
2024年02月06日
瀏覽(32)
【hive】單節(jié)點(diǎn)搭建hadoop和hive
需要使用hive遠(yuǎn)程debug，嘗試使用無(wú)hadoop部署hive方式一直失敗，無(wú)果，還是使用有hadoop方式。最終查看linux內(nèi)存占用 6GB ，還在后臺(tái)運(yùn)行docker的mysql( bitnami/mysql:8.0 )，基本滿意。版本選擇：（1）hive2 hadoop2 和hive3和hadoop3需要搭配使用，不能像chd的hive2和hadoop3搭配使用，容易出現(xiàn)
2024年04月10日
瀏覽(83)
【項(xiàng)目實(shí)戰(zhàn)】基于Hadoop大數(shù)據(jù)電商平臺(tái)用戶行為分析與可視化系統(tǒng)Hive、Spark計(jì)算機(jī)程序開(kāi)發(fā)
注意：該項(xiàng)目只展示部分功能，如需了解，評(píng)論區(qū)咨詢即可。在當(dāng)今數(shù)字化時(shí)代，電商行業(yè)成為全球商業(yè)生態(tài)系統(tǒng)的關(guān)鍵組成部分，電商平臺(tái)已經(jīng)深入各行各業(yè)，影響了人們的購(gòu)物方式和消費(fèi)習(xí)慣。隨著互聯(lián)網(wǎng)技術(shù)的不斷發(fā)展，電商平臺(tái)產(chǎn)生了大量的用戶數(shù)據(jù)，包括點(diǎn)擊、購(gòu)
2024年02月04日
瀏覽(111)
Hadoop、Hive安裝
Linux系統(tǒng)：Centos，版本7.0及以上 JDK：jdk1.8 Hadoop：3.1.3 Hive：3.1.2 虛擬機(jī)：VMware mysql：5.7.11 工具下載地址: https://pan.baidu.com/s/10J_1w1DW9GQC7NOYw5fwvg?pwd=0kdr 提取碼: 0kdr 提示：以下是本篇文章正文內(nèi)容，下面案例可供參考下載jdk-8u181-linux-x64.tar.gz包，將此包上傳至/opt 目錄下。使用
2024年02月08日
瀏覽(14)
Hadoop集成Hive
說(shuō)明：服務(wù)器已用主機(jī)名代替，可根據(jù)自己的需求，改為IP地址環(huán)境服務(wù)器組件 master NameNode、DataNode、Nodemanager、ResourceManager、Hive、Hive的metastore、Hive的hiveserver2、mysql Secondary SecondaryNameNode、DataNode、NodeManager Datanode DataNode、NodeManager、Hive的beeline訪問(wèn)方式 1、java版本1.8 2、Rs
2024年02月16日
瀏覽(17)