Hive HQL: 5 Classic Interview Questions

I've been digging deeper into Hive lately. These 5 questions from Shangguigu are classics: they are instructive, and the step-by-step solutions are very useful, so I'm recording them here for easy review later.

1. Consecutive-Days Problem

The following data shows the carbon-emission reductions users collected in Ant Forest.
Find the users whose daily reduction stayed above 100 for 3 or more consecutive days.

id dt lowcarbon
1001 2021-12-12 123
1002 2021-12-12 45
1001 2021-12-13 43
1001 2021-12-13 45
1001 2021-12-13 23
1002 2021-12-14 45
1001 2021-12-14 230
1002 2021-12-15 45
1001 2021-12-15 23

Create the table and load the data:

create table mayi
(
    id        int,
    dt        string,
    lowcarbon int
) row format delimited
    fields terminated by '\t';

load data local inpath "/home/hadoop/test/mayi.txt" into table mayi;

select * from mayi;
-- 1) Find, for each user and day, the records where lowcarbon totals more than 100
select id, dt, sum(lowcarbon) lowcarbon
from mayi
group by id, dt
having lowcarbon > 100; -- t1

-- 2) Arithmetic-sequence trick: if two arithmetic sequences share the same common
--    difference, subtracting them position by position yields a constant.
--    Group by user, order by date, and compute each row's rank.
select id,
       dt,
       lowcarbon,
       rank() over (partition by id order by dt) rk
from t1; -- t2

-- 3) Subtract the rank from each row's date
select id,
       dt,
       lowcarbon,
       date_sub(dt, rk) flag
from t2; -- t3

-- 4) Group by user and flag, count the rows in each group,
--    and keep the groups with at least 3 rows
select id,
       flag, count(*) ct
from (select id,
             dt,
             lowcarbon,
             date_sub(dt, rk) flag
      from (select id,
                   dt,
                   lowcarbon,
                   rank() over (partition by id order by dt) rk
            from (select id, dt, sum(lowcarbon) lowcarbon
                  from mayi
                  group by id, dt
                  having lowcarbon > 100) t1) t2) t3
group by id, flag
having ct >= 3;
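
To see why equal flag values identify consecutive days, here is a hand-computed trace of t1 through t3 for the sample data. Only user 1001 clears 100: 123 on 12-12, 43+45+23 = 111 on 12-13, and 230 on 12-14.

-- id      dt            lowcarbon    rk    flag = date_sub(dt, rk)
-- 1001    2021-12-12    123          1     2021-12-11
-- 1001    2021-12-13    111          2     2021-12-11
-- 1001    2021-12-14    230          3     2021-12-11

Consecutive dates and the rank sequence are both arithmetic with common difference 1, so their difference (flag) stays constant exactly when the days are consecutive. The three rows share one flag, so ct = 3 and user 1001 is the answer.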

2. Grouping Problem

The following is user-visit timestamp data from an e-commerce company.

id ts (seconds)
1001 17523641234
1001 17523641256
1002 17523641278
1001 17523641334
1002 17523641434
1001 17523641534
1001 17523641544
1002 17523641634
1001 17523641638
1001 17523641654

If the gap between a user's consecutive visits is less than 60 seconds, the visits belong to the same group. The expected result:

id ts (seconds) group
1001 17523641234 1
1001 17523641256 1
1001 17523641334 2
1001 17523641534 3
1001 17523641544 3
1001 17523641638 4
1001 17523641654 4
1002 17523641278 1
1002 17523641434 2
1002 17523641634 3

create table elec
(
    id int,
    ts bigint
) row format delimited
    fields terminated by '\t';

load data local inpath "/home/hadoop/test/elec.txt" into table elec;

select * from elec;
-- 1) Shift the previous row's timestamp down to the current row
-- lead: pulls a value from a following row
-- lag:  pulls a value from a preceding row
select id,
       ts,
       lag(ts, 1, 0) over (partition by id order by ts) lagts
from elec; -- t1

-- 2) Subtract the previous row's timestamp from the current row's
select id,
       ts,
       ts - lagts tsdiff
from t1; -- t2

select id,
       ts,
       ts - lagts tsdiff
from (select id,
             ts,
             lag(ts, 1, 0) over (partition by id order by ts) lagts
      from elec) t1;

-- 3) For each user, count from the first row to the current row how many
--    tsdiff values are >= 60; that running count is the group id
select id,
       ts, sum(if(tsdiff >= 60, 1, 0)) over (partition by id order by ts) groupid
from (select id,
             ts,
             ts - lagts tsdiff
      from (select id,
                   ts,
                   lag(ts, 1, 0) over (partition by id order by ts) lagts
            from elec) t1) t2;
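
Note that lag's default of 0 makes each user's first tsdiff equal to the raw timestamp, far above 60, so the first row always contributes 1 and group numbering starts at 1. A hand-computed trace for user 1001:

-- id      ts             tsdiff         groupid
-- 1001    17523641234    17523641234    1
-- 1001    17523641256    22             1
-- 1001    17523641334    78             2
-- 1001    17523641534    200            3
-- 1001    17523641544    10             3
-- 1001    17523641638    94             4
-- 1001    17523641654    16             4

This matches the expected result table; user 1002's gaps of 156 and 200 seconds likewise yield groups 1, 2, 3.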

3. Gapped-Continuity Problem

Daily login data recorded by a game company.
Note: when an HQL query becomes too complex and spawns too many Spark tasks, it can help to create temporary tables and run it in several steps.

id dt
1001 2021-12-12
1002 2021-12-12
1001 2021-12-13
1001 2021-12-14
1001 2021-12-16
1002 2021-12-16
1001 2021-12-19
1002 2021-12-17
1001 2021-12-20

Compute each user's longest run of consecutive login days, where a gap of one day is allowed.
Explanation: if a user logs in on days 1, 3, 5, and 6, that counts as 6 consecutive days: the gaps (2, 2, and 1 days) never exceed 2, so the whole run forms one block spanning day 1 through day 6.

create table user_login
(
    id int,
    dt string
) row format delimited
    fields terminated by '\t';

load data local inpath "/home/hadoop/test/user_login.txt" into table user_login;

select * from user_login;
-- Approach 2: grouping
-- 1001    2021-12-12
-- 1001    2021-12-13
-- 1001    2021-12-14
-- 1001    2021-12-16
-- 1001    2021-12-19
-- 1001    2021-12-20

-- 1) Shift the previous row's date down to the current row
-- 1001    2021-12-12    1970-01-01
-- 1001    2021-12-13    2021-12-12
-- 1001    2021-12-14    2021-12-13
-- 1001    2021-12-16    2021-12-14
-- 1001    2021-12-19    2021-12-16
-- 1001    2021-12-20    2021-12-19
select id,
       dt,
       lag(dt, 1, '1970-01-01') over (partition by id order by dt) lagdt
from user_login; -- t1

-- 2) Subtract the previous row's date from the current row's: datediff(dt, lagdt)
-- 1001    2021-12-12    18973 (a huge value, since lagdt defaulted to 1970-01-01)
-- 1001    2021-12-13    1
-- 1001    2021-12-14    1
-- 1001    2021-12-16    2
-- 1001    2021-12-19    3
-- 1001    2021-12-20    1
select id,
       dt,
       datediff(dt, lagdt) flag
from t1;

create table user_login_tmp as
select id,
       dt,
       datediff(dt, lagdt) flag
from (select id,
             dt,
             lag(dt, 1, '1970-01-01') over (partition by id order by dt) lagdt
      from user_login) t1;

select * from user_login_tmp; -- t2

-- 3) Group by user, order by date, and count from the first row to the current
--    row how many flag values exceed 2: sum(if(flag > 2, 1, 0))
-- 1001    2021-12-12    1
-- 1001    2021-12-13    1
-- 1001    2021-12-14    1
-- 1001    2021-12-16    1
-- 1001    2021-12-19    2
-- 1001    2021-12-20    2
select id,
       dt, sum(if(flag > 2, 1, 0)) over (partition by id order by dt) flag
from t2;

select id,
       dt, sum(if(flag > 2, 1, 0)) over (partition by id order by dt) flag2
from (select id,
             dt,
             datediff(dt, lagdt) flag
      from (select id,
                   dt,
                   lag(dt, 1, '1970-01-01') over (partition by id order by dt) lagdt
            from user_login) t1) t2;

select id,
       dt, sum(if(flag > 2, 1, 0)) over (partition by id order by dt) flag2
from user_login_tmp; -- t3

-- 4) Group by user and flag; take the maximum date minus the minimum date, then add 1
select id,
       flag,
       datediff(max(dt), min(dt)) days
from t3
group by id, flag; -- t4

-- 5) Take the maximum streak per user
select id, max(days) + 1
from t4
group by id;

-- 6) Final HQL
-- Earlier tuning attempts, kept commented out:
-- set yarn.scheduler.maximum-allocation-mb=4;
-- set spark.executor.cores=4;
-- set spark.executor.instances=6;
-- set spark.executor.memory="3.2g";
-- set spark.yarn.executor.memoryOverhead="0.8G";
-- set spark.executor.cores=1;
-- set spark.executor.instances=24;
-- set spark.default.parallelism=20;
-- set spark.driver.memory="819MB";
-- set spark.driver.memoryOverhead="205MB";
set spark.executor.instances = 20;
set spark.executor.cores = 1;
set spark.executor.memory = 1g;

select id, max(days) + 1
from (select id,
             flag,
             datediff(max(dt), min(dt)) days
      from (select id,
                   dt, sum(if(flag > 2, 1, 0)) over (partition by id order by dt) flag
            from (select id,
                         dt,
                         datediff(dt, lagdt) flag
                  from (select id,
                               dt,
                               lag(dt, 1, '1970-01-01') over (partition by id order by dt) lagdt
                        from user_login) t1) t2) t3
      group by id, flag) t4
group by id;

-- join configuration:
-- set hive.auto.convert.join=false;
-- set mapred.map.tasks.speculative.execution=true;
-- set mapred.reduce.tasks.speculative.execution=true;
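
As a sanity check, trace user 1001 through the final query: the day gaps are 18973, 1, 1, 2, 3, 1, so only the first row and 2021-12-19 have flag > 2, and the running sum assigns groups 1, 1, 1, 1, 2, 2. Group 1 spans 2021-12-12 to 2021-12-16, so datediff gives 4 and the streak is 5 days; group 2 spans 2021-12-19 to 2021-12-20, a 2-day streak. Thus 1001's answer is 5. User 1002 logs in on 12-12, 12-16, and 12-17; the 4-day gap splits the rows into streaks of 1 and 2 days, so 1002's answer is 2.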

4. Overlapping Discount Dates Problem
The following is product promotion data from a platform. The fields are brand, discount start date, and discount end date.

brand stt edt
oppo 2021-06-05 2021-06-09
oppo 2021-06-11 2021-06-21
vivo 2021-06-05 2021-06-15
vivo 2021-06-09 2021-06-21
redmi 2021-06-05 2021-06-21
redmi 2021-06-09 2021-06-15
redmi 2021-06-17 2021-06-26
huawei 2021-06-05 2021-06-26
huawei 2021-06-09 2021-06-15
huawei 2021-06-17 2021-06-21

Compute each brand's total number of discount days, being careful with overlapping dates. For example, vivo's first promotion runs from 2021-06-05 to 2021-06-15 and its second from 2021-06-09 to 2021-06-21. The 9th through the 15th overlap and are counted only once, so vivo's total covers 2021-06-05 through 2021-06-21, 17 days in all (datediff gives 16; add 1 because both endpoints count).

drop table if exists discount;

create table discount
(
    id  string,
    stt string,
    edt string
) row format delimited
    fields terminated by '\t';

load data local inpath "/home/hadoop/test/discount.txt" into table discount;

select * from discount;
-- id        stt           edt
-- oppo      2021-06-05    2021-06-09
-- oppo      2021-06-11    2021-06-21
--
-- vivo      2021-06-05    2021-06-15
-- vivo      2021-06-09    2021-06-21
--
-- redmi     2021-06-05    2021-06-21
-- redmi     2021-06-09    2021-06-15
-- redmi     2021-06-17    2021-06-26
--
-- huawei    2021-06-05    2021-06-26
-- huawei    2021-06-09    2021-06-15
-- huawei    2021-06-17    2021-06-21

-- Window-frame refresher (against a separate business table):
-- select name, orderdate, cost,
--        sum(cost) over () as sample1,  -- sum over all rows
--        sum(cost) over (partition by name order by orderdate
--                        rows between unbounded preceding and current row) as sample4,  -- running total from the first row to the current row
--        sum(cost) over (partition by name order by orderdate
--                        rows between 1 preceding and 1 following) as sample6  -- previous, current, and next row
-- from business;

-- 1) Put the maximum edt of all earlier rows onto the current row
select id,
       stt,
       edt,
       max(edt) over (partition by id order by stt
                      rows between unbounded preceding and 1 preceding) maxEdt
from discount; -- t1
-- redmi    2021-06-05    2021-06-21    null
-- redmi    2021-06-09    2021-06-15    2021-06-21
-- redmi    2021-06-17    2021-06-26    2021-06-21

-- 2) Compare the start date with maxEdt: if the start date is later, leave it as is;
--    otherwise replace it with maxEdt plus one day.
--    On the first row of each brand, maxEdt is null, so leave it as is.
select id,
       if(maxEdt is null, stt, if(stt > maxEdt, stt, date_add(maxEdt, 1))) stt,
       edt
from t1; -- t2
-- redmi    2021-06-05    2021-06-21
-- redmi    2021-06-22    2021-06-15
-- redmi    2021-06-22    2021-06-26

-- 3) Subtract the start date from the end date in each row
select id,
       datediff(edt, stt) days
from t2; -- t3
-- redmi    16
-- redmi    -7
-- redmi    4

-- 4) Group by brand and sum days + 1, skipping the negative (fully covered) rows
select id,
       sum(if(days >= 0, days + 1, 0)) days
from t3
group by id;
-- redmi    22

-- 5) Final HQL
select id,
       sum(if(days >= 0, days + 1, 0)) days
from (select id,
             datediff(edt, stt) days
      from (select id,
                   if(maxEdt is null, stt, if(stt > maxEdt, stt, date_add(maxEdt, 1))) stt,
                   edt
            from (select id,
                         stt,
                         edt,
                         max(edt) over (partition by id order by stt
                                        rows between unbounded preceding and 1 preceding) maxEdt
                  from discount) t1) t2) t3
group by id;
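
Applying the same arithmetic to every brand gives the expected final output, computed by hand from the sample data:

-- id        days
-- huawei    22
-- oppo      16
-- redmi     22
-- vivo      17

oppo's two promotions do not overlap (5 + 11 days). huawei's second and third promotions fall entirely inside the first, so their adjusted intervals come out negative and contribute 0, leaving only the 22 days of 2021-06-05 through 2021-06-26.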

5. Concurrent Online Problem

The following shows when streamers on a live-streaming platform started and stopped broadcasting. From this data, compute the peak number of streamers online at the same time.

id stt edt
1003 2021-06-14 12:12:12 2021-06-14 18:12:12
1004 2021-06-14 13:12:12 2021-06-14 16:12:12
1002 2021-06-14 15:12:12 2021-06-14 16:12:12
1005 2021-06-14 15:18:12 2021-06-14 20:12:12
1001 2021-06-14 20:12:12 2021-06-14 23:12:12
1006 2021-06-14 21:12:12 2021-06-14 23:15:12
1007 2021-06-14 22:12:12 2021-06-14 23:10:12

drop table if exists living;

create table living
(
    id  string,
    stt string,
    edt string
) row format delimited
    fields terminated by '\t';

load data local inpath "/home/hadoop/test/living.txt" into table living;

select * from living;
-- Streaming-style approach!
-- 1) Classify the events: emit +1 for each start time (a streamer goes online)
--    and -1 for each end time (a streamer goes offline)
select id, stt dt, 1 p from living
union
select id, edt dt, -1 p from living; -- t1
-- 1001    2021-06-14 20:12:12    1
-- 1001    2021-06-14 23:12:12    -1
-- 1002    2021-06-14 15:12:12    1
-- 1002    2021-06-14 16:12:12    -1
-- 1003    2021-06-14 12:12:12    1
-- 1003    2021-06-14 18:12:12    -1
-- 1004    2021-06-14 13:12:12    1
-- 1004    2021-06-14 16:12:12    -1
-- 1005    2021-06-14 15:18:12    1
-- 1005    2021-06-14 20:12:12    -1
-- 1006    2021-06-14 21:12:12    1
-- 1006    2021-06-14 23:15:12    -1
-- 1007    2021-06-14 22:12:12    1
-- 1007    2021-06-14 23:10:12    -1

-- 2) Order by time and compute the running sum of the +1/-1 column
select id,
       dt, sum(p) over (order by dt) sum_p
from (select id, stt dt, 1 p from living
      union
      select id, edt dt, -1 p from living) t1; -- t2

-- 3) Take the maximum of the running sum
select max(sum_p)
-- select *    -- swap in to inspect every moment
from (select id,
             dt, sum(p) over (order by dt) sum_p
      from (select id, stt dt, 1 p from living
            union
            select id, edt dt, -1 p from living) t1) t2;
-- where dt = '2021-06-14 20:12:12';    -- optional spot-check of a single timestamp
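
Walking the sorted events by hand: the running count climbs to 1 at 12:12:12, 2 at 13:12:12, 3 at 15:12:12, and 4 at 15:18:12, then falls as streamers go offline, so the peak concurrent count is 4. Also note that the default window frame of sum(p) over (order by dt) is range-based, so rows sharing a timestamp are aggregated together: at 20:12:12, 1005's -1 and 1001's +1 net out, and the running sum at that instant is already 1.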

Reposted from: https://blog.csdn.net/qq_32656561/article/details/122685866
Copyright belongs to the original author, Programmer-Awei. If there is any infringement, please contact us for removal.
