陌陌聊天數(shù)據(jù)分析(一)
目標(biāo)
- 基于Hadoop和Hive實現(xiàn)聊天數(shù)據(jù)統(tǒng)計分析,構(gòu)建聊天數(shù)據(jù)分析報表
需求
- 統(tǒng)計今日總消息量
- 統(tǒng)計今日每小時消息量,發(fā)送和接收用戶數(shù)量
- 統(tǒng)計今日各地區(qū)發(fā)送消息數(shù)據(jù)量
- 統(tǒng)計今日發(fā)送消息和接收消息用戶數(shù)
- 統(tǒng)計今日發(fā)送消息最多的用戶前幾名
- 統(tǒng)計今日接收消息最多的用戶前幾名
- 統(tǒng)計發(fā)送人手機型號分布情況
- 統(tǒng)計發(fā)送人設(shè)備系統(tǒng)分布情況
數(shù)據(jù)來源
- 聊天業(yè)務(wù)系統(tǒng)導(dǎo)出2021/11/01一天24小時用戶聊天數(shù)據(jù),以TSV文本形式存儲在文件中
- 數(shù)據(jù)大小:兩個文件共14萬條數(shù)據(jù)
- 列分隔符:\t
數(shù)據(jù)集及所需文件
- 鏈接:https://pan.baidu.com/s/1ToTanDrFRhAVsFTb2uclFg
提取碼:rkun
??基于Hive數(shù)倉實現(xiàn)需求開發(fā)
?建庫建表 加載數(shù)據(jù)
- 建庫建表
--創(chuàng)建數(shù)據(jù)庫
create database db_msg;
--切換數(shù)據(jù)庫
use db_msg;
--建表
create table db_msg.tb_msg_source(
msg_time string comment "消息發(fā)送時間"
, sender_name string comment "發(fā)送人昵稱"
, sender_account string comment "發(fā)送人賬號"
, sender_sex string comment "發(fā)送人性別"
, sender_ip string comment "發(fā)送人IP"
, sender_os string comment "發(fā)送人操作系統(tǒng)"
, sender_phonetype string comment "發(fā)送人手機型號"
, sender_network string comment "發(fā)送人網(wǎng)絡(luò)類型"
, sender_gps string comment "發(fā)送人GPS定位"
, receiver_name string comment "接收人昵稱"
, receiver_ip string comment "接收人IP"
, receiver_account string comment "接收人賬號"
, receiver_os string comment "接收人操作系統(tǒng)"
, receiver_phonetype string comment "接收人手機型號"
, receiver_network string comment "接收人網(wǎng)絡(luò)類型"
, receiver_gps string comment "接收人GPS定位"
, receiver_sex string comment "接收人性別"
, msg_type string comment "消息類型"
, distance string comment "雙方距離"
, message string comment "消息內(nèi)容"
)
--指定分隔符為制表符
row format delimited fields terminated by '\t';
- 加載數(shù)據(jù)
#上傳數(shù)據(jù)到node1服務(wù)器本地文件系統(tǒng)(HS2服務(wù)所在機器)
[root@node1 hivedata]# pwd
/root/hivedata
[root@node1 hivedata]# ll
total 54104
-rw-r--r-- 1 root root 28237023 Jun 13 20:24 data1.tsv
-rw-r--r-- 1 root root 27161148 Jun 13 20:24 data2.tsv
--加載數(shù)據(jù)入表
load data local inpath '/root/hivedata/data1.tsv' into table db_msg.tb_msg_source;
load data local inpath '/root/hivedata/data2.tsv' into table db_msg.tb_msg_source;
- 查詢表,查看數(shù)據(jù)是否導(dǎo)入成功
--查詢表
select * from tb_msg_source limit 5;
?ETL數(shù)據(jù)清洗
數(shù)據(jù)問題
- 當(dāng)前數(shù)據(jù),一些數(shù)據(jù)字段為空,不是合法數(shù)據(jù)。
- 需求需要統(tǒng)計每天每個小時消息量,但數(shù)據(jù)中沒有天和小時字段,只有整體時間字段,不好處理。
- 需求中,GPS對經(jīng)緯度在同一字段,不好處理。
ETL需求
- 對字段為空的不合法數(shù)據(jù)進(jìn)行過濾
- where過濾
- 通過時間字段構(gòu)建天和小時字段
- substr函數(shù)
- 從GPS經(jīng)緯度提取經(jīng)緯度
- split函數(shù)
- 將ETL以后的結(jié)果保存到一張新的Hive表中
- create table …as select…
create table db_msg.tb_msg_etl as
select *,
substr(msg_time, 0, 10) as dayinfo, --獲取天
substr(msg_time, 12, 2) as hourinfo, --獲取小時
split(sender_gps, ",")[0] as sender_lng, --經(jīng)度
split(sender_gps, ",")[1] as sender_lat --緯度
from db_msg.tb_msg_source
--過濾字段為空數(shù)據(jù)
where length(sender_gps) > 0;
select
msg_time,dayinfo,hourinfo,sender_gps,sender_lng,sender_lat
from db_msg.tb_msg_etl
limit 5;
--查詢數(shù)據(jù)
??需求指標(biāo)SQL
- 解讀需求
- 確定待查詢數(shù)據(jù)表
from
- 分析維度
group by
- 找出計算指標(biāo)
聚合
- 細(xì)節(jié)
過濾 排序
-
統(tǒng)計今日消息總量
--需求:統(tǒng)計今日總消息量 create table if not exists tb_rs_total_msg_cnt comment "今日消息總量" as select dayinfo, count(*) as total_msg_cnt from db_msg.tb_msg_etl group by dayinfo; --查詢 select * from tb_rs_total_msg_cnt ;
+------------------------------+------------------------------------+ | tb_rs_total_msg_cnt.dayinfo | tb_rs_total_msg_cnt.total_msg_cnt | +------------------------------+------------------------------------+ | 2021-11-01 | 139062 | +------------------------------+------------------------------------+
-
統(tǒng)計今日每小時消息量,發(fā)送/接收用戶數(shù)
create table tb_rs_hour_msg_cnt comment "每小時消息量趨勢" as select dayinfo, hourinfo, count(*) as total_msg_cnt, count(distinct sender_account) as sender_usr_cnt, count(distinct receiver_account)as receiver_usr_cnt from db_msg.tb_msg_etl group by dayinfo,hourinfo; select * from tb_rs_hour_msg_cnt limit 5;
+-----------------------------+------------------------------+-----------------------------------+------------------------------------+--------------------------------------+ | tb_rs_hour_msg_cnt.dayinfo | tb_rs_hour_msg_cnt.hourinfo | tb_rs_hour_msg_cnt.total_msg_cnt | tb_rs_hour_msg_cnt.sender_usr_cnt | tb_rs_hour_msg_cnt.receiver_usr_cnt | +-----------------------------+------------------------------+-----------------------------------+------------------------------------+--------------------------------------+ | 2021-11-01 | 00 | 4349 | 3520 | 3558 | | 2021-11-01 | 01 | 2892 | 2524 | 2537 | | 2021-11-01 | 02 | 882 | 842 | 838 | | 2021-11-01 | 03 | 471 | 463 | 460 | | 2021-11-01 | 04 | 206 | 202 | 205 | +-----------------------------+------------------------------+-----------------------------------+------------------------------------+--------------------------------------+
-
統(tǒng)計今日各地區(qū)發(fā)送消息數(shù)據(jù)量
create table tb_rs_loc_cnt comment "今日各地區(qū)發(fā)送總消息量" as select dayinfo, sender_gps, cast(sender_lng as double) as longitude, cast(sender_lat as double) as latitude, count(*) as total_msg_cnt from tb_msg_etl group by dayinfo, sender_gps, sender_lng,sender_lat; select * from tb_rs_loc_cnt limit 5;
+------------------------+---------------------------+--------------------------+-------------------------+------------------------------+ | tb_rs_loc_cnt.dayinfo | tb_rs_loc_cnt.sender_gps | tb_rs_loc_cnt.longitude | tb_rs_loc_cnt.latitude | tb_rs_loc_cnt.total_msg_cnt | +------------------------+---------------------------+--------------------------+-------------------------+------------------------------+ | 2021-11-01 | 100.297355,24.206808 | 100.297355 | 24.206808 | 1397 | | 2021-11-01 | 100.591712,24.004148 | 100.591712 | 24.004148 | 1406 | | 2021-11-01 | 101.62196,36.782187 | 101.62196 | 36.782187 | 1439 | | 2021-11-01 | 102.357852,23.801165 | 102.357852 | 23.801165 | 1399 | | 2021-11-01 | 102.357852,25.682909 | 102.357852 | 25.682909 | 1431 | +------------------------+---------------------------+--------------------------+-------------------------+------------------------------+
-
統(tǒng)計今日發(fā)送消息和接受消息用戶數(shù)
create table tb_rs_usr_cnt comment "今日發(fā)送消息人數(shù)、接受消息人數(shù)" as select dayinfo, count(distinct sender_account) as sender_usr_cnt, count(distinct receiver_account) as receiver_usr_cnt from db_msg.tb_msg_etl group by dayinfo; select * from tb_rs_usr_cnt ;
+------------------------+-------------------------------+---------------------------------+ | tb_rs_usr_cnt.dayinfo | tb_rs_usr_cnt.sender_usr_cnt | tb_rs_usr_cnt.receiver_usr_cnt | +------------------------+-------------------------------+---------------------------------+ | 2021-11-01 | 10008 | 10005 | +------------------------+-------------------------------+---------------------------------+
-
統(tǒng)計今日發(fā)送消息最多的Top10用戶
create table tb_rs_susr_top10 comment "發(fā)送消息條數(shù)最多的Top10用戶" as select dayinfo, sender_name as username, count(*) as sender_msg_cnt from db_msg.tb_msg_etl group by dayinfo,sender_name order by sender_msg_cnt desc limit 10; select * from tb_rs_susr_top10;
+---------------------------+----------------------------+----------------------------------+ | tb_rs_susr_top10.dayinfo | tb_rs_susr_top10.username | tb_rs_susr_top10.sender_msg_cnt | +---------------------------+----------------------------+----------------------------------+ | 2021-11-01 | 茹鴻暉 | 1466 | | 2021-11-01 | 盧高達(dá) | 1464 | | 2021-11-01 | 犁彭祖 | 1460 | | 2021-11-01 | 沐范 | 1459 | | 2021-11-01 | 夫濰 | 1452 | | 2021-11-01 | 煙心思 | 1449 | | 2021-11-01 | 稱子瑜 | 1447 | | 2021-11-01 | 麻宏放 | 1442 | | 2021-11-01 | 邴時 | 1439 | | 2021-11-01 | 養(yǎng)昆頡 | 1431 | +---------------------------+----------------------------+----------------------------------+
-
統(tǒng)計今日接受消息最多的Top10用戶
create table tb_rs_rusr_top10 comment "接受消息條數(shù)最多的Top10用戶" as select dayinfo, receiver_name as username, count(*) as receiver_msg_cnt from db_msg.tb_msg_etl group by dayinfo,receiver_name order by receiver_msg_cnt desc limit 10; select * from tb_rs_rusr_top10 limit 3;
+---------------------------+----------------------------+------------------------------------+ | tb_rs_rusr_top10.dayinfo | tb_rs_rusr_top10.username | tb_rs_rusr_top10.receiver_msg_cnt | +---------------------------+----------------------------+------------------------------------+ | 2021-11-01 | 暢雅柏 | 1539 | | 2021-11-01 | 春純 | 1491 | | 2021-11-01 | 鄺琨瑤 | 1469 | +---------------------------+----------------------------+------------------------------------+
-
統(tǒng)計發(fā)送人手機型號分布情況
create table if not exists tb_rs_sender_phone comment "發(fā)送人的手機型號分布" as select dayinfo, sender_phonetype, count(distinct sender_account) as cnt from tb_msg_etl group by dayinfo,sender_phonetype; select * from tb_rs_sender_phone limit 3;
+-----------------------------+--------------------------------------+-------------------------+ | tb_rs_sender_phone.dayinfo | tb_rs_sender_phone.sender_phonetype | tb_rs_sender_phone.cnt | +-----------------------------+--------------------------------------+-------------------------+ | 2021-11-01 | Apple iPhone 10 | 6749 | | 2021-11-01 | Apple iPhone 11 | 3441 | | 2021-11-01 | Apple iPhone 7 | 2424 | +-----------------------------+--------------------------------------+-------------------------+
-
統(tǒng)計發(fā)送人設(shè)備操作系統(tǒng)分布情況
create table tb_rs_sender_os comment "發(fā)送人的OS分布" as select dayinfo, sender_os, count(distinct sender_account) as cnt from tb_msg_etl group by dayinfo,sender_os; select * from tb_rs_sender_os;
+--------------------------+----------------------------+----------------------+ | tb_rs_sender_os.dayinfo | tb_rs_sender_os.sender_os | tb_rs_sender_os.cnt | +--------------------------+----------------------------+----------------------+ | 2021-11-01 | Android 5.1 | 5750 | | 2021-11-01 | Android 6 | 8514 | | 2021-11-01 | Android 6.0 | 9398 | | 2021-11-01 | Android 7.0 | 9181 | | 2021-11-01 | Android 8.0 | 8594 | | 2021-11-01 | IOS 10.0 | 1289 | | 2021-11-01 | IOS 12.0 | 8102 | | 2021-11-01 | IOS 9.0 | 8760 | +--------------------------+----------------------------+----------------------+
?
???FineBI實現(xiàn)可視化報表
官網(wǎng)
https://www.finebi.com/
??配置數(shù)據(jù)源及數(shù)據(jù)準(zhǔn)備
官方文檔
https://help.fanruan.com/finebi/doc-view-301.html
-
使用FineBI連接Hive,讀取Hive數(shù)據(jù)表,需要在FineBI中添加Hive驅(qū)動jar包
-
將Hive驅(qū)動jar包放入FineBI的lib目錄下
-
找到提供文件的HiveConnectDrive
- 放入安裝路徑下的
webapps\webroot\WEB-INF\lib
插件安裝
- 我們自己Hive驅(qū)動包會與FineBI自帶驅(qū)動包沖突,導(dǎo)致FineBI無法識別我們自己的驅(qū)動
- 安裝FineBI官方提供驅(qū)動包隔離插件
隔離插件:fr-plugin-hive-driver-loader-3.0.zip
-
安裝插件
文章來源:http://www.zghlxwxcb.cn/news/detail-490053.html
-
重啟FineBI文章來源地址http://www.zghlxwxcb.cn/news/detail-490053.html
到了這里,關(guān)于陌陌聊天數(shù)據(jù)分析 (一)的文章就介紹完了。如果您還想了解更多內(nèi)容,請在右上角搜索TOY模板網(wǎng)以前的文章或繼續(xù)瀏覽下面的相關(guān)文章,希望大家以后多多支持TOY模板網(wǎng)!