某互聯(lián)網(wǎng)大廠的一道比較有深度的面試題,
參考文章 :?hive|性能優(yōu)化|_Hive ROW_NUMBER TopN 性能優(yōu)化
問題
有50W個 店鋪,每個顧客訪客訪問任何一個店鋪的任何一個商品時都會產(chǎn)生一條訪問日志,
訪問日志存儲的表名為Visit,訪客的用戶id為user_id,被訪問的店鋪名稱為shop,數(shù)據(jù)如下:
? ? ? ? ? ? ? ? user_id shop
????????????????u1 ?a
? ? ? ? ? ? ? ? u2 ?b
? ? ? ? ? ? ? ? u1 ?b
? ? ? ? ? ? ? ? u1 ?a
? ? ? ? ? ? ? ? u3 ?c
? ? ? ? ? ? ? ? u4 ?b
? ? ? ? ? ? ? ? u1 ?a
? ? ? ? ? ? ? ? u2 ?c
? ? ? ? ? ? ? ? u5 ?b
請統(tǒng)計(jì):
(1)每個店鋪的UV(訪客數(shù))
(2)每個店鋪訪問次數(shù)top3的訪客信息。輸出店鋪名稱、訪客id、訪問次數(shù)
問題(1)解法
問題一的目的是考察能否在大數(shù)據(jù)量下考慮用 group做去重預(yù)聚合,而不是直接用count(distinct)
正確解法?
with?
user_dist_log as (
?? ?select?
?? ??? ?user_id,
?? ??? ?shop,
?? ??? ?count(1) as pv_cnt
?? ?from visit
?? ?group by?
?? ??? ?user_id,
?? ??? ?shop
)
select?
?? ?shop,
?? ?count(1) as uv?
from user_dist_log
group by shop;
問題(2)解法
問題二的目的是考察開窗函數(shù)情況下的數(shù)據(jù)傾斜如何解決
由于開窗函數(shù)是把數(shù)據(jù)分發(fā)到同一個 executor 進(jìn)行單點(diǎn)的數(shù)據(jù)排序,那么熱點(diǎn)數(shù)據(jù)很容易導(dǎo)致數(shù)據(jù)傾斜,所以這里的解決方案是2步聚合的方式,加鹽打散,二次聚合。文章來源:http://www.zghlxwxcb.cn/news/detail-606728.html
具體代碼文章來源地址http://www.zghlxwxcb.cn/news/detail-606728.html
with
user_dist_log as (
select
user_id,
shop,
count(1) as pv_cnt
from visit
group by
user_id,
shop
)
select
shop,
user_id,
shop_rank2
from
(
select
shop,
user_id,
row_number() over(partition by shop order by pv_cnt desc) as shop_rank2
from
(
select
shop,
rand1,
user_id,
pv_cnt
from
(
select
shop,
ceil(rand()*100) as rand1,
user_id,
pv_cnt,
row_number() over(partition by shop, ceil(rand()*100) order by pv_cnt desc) as shop_rank
from user_dist_log
) mid_tmp
where shop_rank <= 3
)
)
where shop_rank2 <= 3
測試數(shù)據(jù)構(gòu)建
with
visit as (
select
'u1' as user_id,
'a' as shop
union all
select
'u2' as user_id,
'b' as shop
union all
select
'u1' as user_id,
'b' as shop
union all
select
'u1' as user_id,
'a' as shop
union all
select
'u3' as user_id,
'c' as shop
union all
select
'u4' as user_id,
'b' as shop
union all
select
'u1' as user_id,
'a' as shop
union all
select
'u2' as user_id,
'c' as shop
union all
select
'u5' as user_id,
'b' as shop
),
到了這里,關(guān)于SQL_求店鋪的topN && 開窗函數(shù)數(shù)據(jù)傾斜的文章就介紹完了。如果您還想了解更多內(nèi)容,請?jiān)谟疑辖撬阉鱐OY模板網(wǎng)以前的文章或繼續(xù)瀏覽下面的相關(guān)文章,希望大家以后多多支持TOY模板網(wǎng)!