0


HiveSQL一天一个小技巧:如何设计一张1-180天注册、活跃留存表?

0 需求

现有一个用户活跃表user_active(user_id,active_date)、用户注册表user_regist(user_id,regist_date),

表中分区字段都为dt(yyyy-MM-dd),用户字段均为user_id;

设计一张1-180天注册活跃留存表;

1 分析

需求中要求设计留存周期为1-180天的注册、活跃留存表,即目标如下:

注册日期

留存周期

活跃数

注册数

留存率

2023-01-10

1

100

200

活跃数/注册数

2023-01-10

2

50

200

2023-01-10

3

10

200

.....

...

...

主要考察点:笛卡尔集(一对多关联)

观察表结构:我们可以看出,针对每一天来讲分母固定,分子随留存周期变动

第一步:求出注册表中,每日注册数,注册数作为分母,针对每一天来讲是个固定值,因我们用开窗求解该指标。

select user_id

             ,to_date(regist_date) as regist_date

             ,count(user_id) over(partition by to_date(regist_date)) as regist_count

 from user_regist

 where dt >= date_sub(current_date(), 180)

第二步:用户注册表作为主表,关联活跃表,关联健为user_id,由于一对多的关系,产生笛卡尔集

注意:活跃用户表,每天用户会有多次活跃注意去重

select  regist_date
       ,t1.user_id
       ,t1.regist_count  
       ,t2.user_id
       ,t2.active_date
       ,datediff(t2.active_date, t1.regist_date) as date_diff
from (

       select user_id

             ,to_date(regist_date) as regist_date

             ,count(user_id) over(partition by to_date(regist_date)) as regist_count

       from user_regist

       where dt >= date_sub(current_date(), 180)

   ) t1

   left join (

       select user_id

             ,to_date(active_date) as active_date

       from user_active

       where dt >= date_sub(current_date(), 180)

       group by user_id, to_date(active_date)

   ) t2 
on t1.user_id = t2.user_id

regist_date

t1.user_id

t1.regist_count

t2.user_id

t2.active_date

date_diff

2023-01-10

A

200

A

2023-01-11

1

2023-01-10

A

200

A

2023-01-12

2

2023-01-10

A

200

A

2023-01-13

3

2023-01-10

A

200

A

2023-01-14

4

2023-01-10

B

200

B

2023-01-13

3

2023-01-10

B

200

B

2023-01-14

4

2023-01-10

B

200

B

2023-01-15

5

2023-01-10

B

200

B

2023-01-16

6

第三步:按照注册日期和留存周期分组,计算该留存周期下、该时间点下的活跃用户数

 select t1.regist_date

         ,max(t1.regist_count) as regist_cnt --每天是固定值,用max()函数取出该值

         ,datediff(t2.active_date, t1.regist_date) as date_diff

         ,count(t1.user_id) as active_user_cnt

   from (

       select user_id

             ,to_date(regist_date) as regist_date

             ,count(user_id) over(partition by to_date(regist_date)) as regist_count

       from user_regist

       where dt >= date_sub(current_date(), 180)

   ) t1

   left join (

       select user_id

             ,to_date(active_date) as active_date

       from user_active

       where dt >= date_sub(current_date(), 180)

       group by user_id, to_date(active_date)

   ) t2 on t1.user_id = t2.user_id

   where datediff(t2.active_date, t1.regist_date) >=1

   and datediff(t2.active_date, t1.regist_date) <= 180

   group by t1.regist_date, datediff(t2.active_date, t1.regist_date)

第四步:计算率留存

select regist_date

     , date_diff
     
     , active_user_cnt

     , case when nvl(regist_cnt,0)!=0
            then active_user_cnt/regist_cnt end as retention_rate        
from 
(select t1.regist_date

         ,max(t1.regist_count) as regist_cnt --每天是固定值,用max()函数取出该值

         ,datediff(t2.active_date, t1.regist_date) as date_diff

         ,count(t1.user_id) as active_user_cnt

   from (

       select user_id

             ,to_date(regist_date) as regist_date

             ,count(user_id) over(partition by to_date(regist_date)) as regist_count

       from user_regist

       where dt >= date_sub(current_date(), 180)

   ) t1

   left join (

       select user_id

             ,to_date(active_date) as active_date

       from user_active

       where dt >= date_sub(current_date(), 180)

       group by user_id, to_date(active_date)

   ) t2 on t1.user_id = t2.user_id

   where datediff(t2.active_date, t1.regist_date) >=1

   and datediff(t2.active_date, t1.regist_date) <= 180

   group by t1.regist_date, datediff(t2.active_date, t1.regist_date)
) t

2 小结

本文给出了一种1-180天注册活跃留存表计算模型,主要采用笛卡尔集的形式求解,这也是数据报表经常采用的方法,需要掌握。

标签: sql hive

本文转载自: https://blog.csdn.net/godlovedaniel/article/details/128884709
版权归原作者 莫叫石榴姐 所有, 如有侵权,请联系我们删除。

“HiveSQL一天一个小技巧:如何设计一张1-180天注册、活跃留存表?”的评论:

还没有评论