0


大数据开发之Hive(统计影音视频网站的常规指标)

第 11 章:Hive实战

11.1 数据结构

1、视频表
字段备注详细描述videoId视频唯一id(String)11位字符串uploader视频上传者(String)上传视频的用户名Stringage视频年龄(int)视频在平台上的整天数category视频类别(Array)上传视频指定的视频分类length视频长度(Int)整形数字标识的视频长度views观看次数(Int)视频被浏览的次数rate视频评分(Double)满分5分Ratings流量(Int)视频的流量,整形数字comments评论数(Int)一个视频的整数评论数relatedId相关视频id(Array)相关视频的id,最多20个
2、用户表
字段备注字段类型uploader上传者用户名stringvideos上传视频数intfriends朋友数量int

11.2 准备工作

1、需要准备的表
1)创建原始数据表:gulivideo_ori,gulivideo_user_ori,
2)创建最终表:gulivideo_orc,gulivideo_user_orc
2、创建原始数据表
1)创建原始数据表gulivideo_ori

create external table gulivideo_ori(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)row format delimited fieldsterminatedby"\t"
collection items terminatedby"&"
stored as textfile
location '/gulivideo/video';

2)创建原始数据表:gulivideo_user_ori

create external table gulivideo_user_ori(
    uploader string,
    videos int,
    friends int)row format delimited 
fieldsterminatedby"\t" 
stored as textfile
location '/gulivideo/user';

3)创建orc存储格式带snappy压缩的表gulivideo_orc

createtable gulivideo_orc(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
stored as orc
tblproperties("orc.compress"="SNAPPY");

4)创建orc存储格式带snappy压缩的表gulivideo_user_orc

createtable gulivideo_user_orc(
    uploader string,
    videos int,
    friends int)row format delimited 
fieldsterminatedby"\t" 
stored as orc
tblproperties("orc.compress"="SNAPPY");

5)向ori表插入数据

loaddatalocal inpath "/opt/module/hive/datas/video"intotable gulivideo_ori;loaddatalocal inpath "/opt/module/hive/datas/user.txt"intotable gulivideo_user_ori;

6)向orc表插入数据

insertintotable gulivideo_orc select*from gulivideo_ori;insertintotable gulivideo_user_orc select*from gulivideo_user_ori;

11.3 业务分析

11.3.1 统计视频观看数Top10

1、思路:
使用order by按照views字段做一个全局排序即可,同时我们设置只显示前10条。
2、代码

select
    videoId,`views`from gulivideo_orc
orderby`views`desclimit10;
OK
videoid          views
dMH0bHeiRNg     425134170XxI-hvPRRA     202824641dmVU08zVpA     16087899
RB-wUgnyGv0     15712924
QjA5faZF1A8     15256922-_CSo1gOd48     1319983349IDp76kjPw     11970018
tYnn51C3X_w     11823701
pv5zWaTEVkI     11672017
D2kJZOfq7zk     11184051

11.3.2 统计视频类别热度Top10(类别热度:类别下的总视频数)

1、思路:
1)统计每个类别有多少个视频,显示出包含视频最多的前10个类别。
2)我们需要按照类别group by聚合,然后count组内的videoId个数即可。
3)因为当前表结构为:一个视频对应一个或多个类别。所以如果要group by类别,需要先将类别进行行列转化(展开),然后再进行count即可。
4)最后按照热度排序,显示前10条。
2、代码

select
    tmp01.category_col,count(tmp01.videoId) num
from(select
         videoId,
         category_col
     from gulivideo_orc
              lateral view
                  explode(category) t as category_col
) tmp01
groupby tmp01.category_col
orderby num desclimit10;// 结果显示
OK
tmp01.category_col        num
Music                      179049
Entertainment             127674
Comedy                     87818
Animation                 73293
Film                       73293
Sports                     67329
Gadgets                    59817
Games                      59817
Blogs                      48890
People                     48890

11.3.3 统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

1、思路
1)先找到观看书最高的20个视频所属条目的所有信息(主要是类目),降序排列
2)先把20条信息中的category分裂出来(列转行),形成新的字段category_name
3)在第二步的结果下,按照炸开的视频类别category_name分组,然后统计组内的个数category_count
2、最终代码

select
    table02.categroy_name,count(table02.videoId) num
from(select
         videoId,
         categroy_name
     from(select
                  videoId,`views`,
                  category
              from gulivideo_orc
              orderby`views`desclimit20) table01
              lateral view
                  explode(category) tmp as categroy_name
) table02
groupby table02.categroy_nam;// 结果显示
OK
table02.categroy_name   num
 Blogs                     2
 UNA                        1
Comedy                     6
Entertainment             6
Music                      5
People                     2

11.3.4 统计视频观看数Top50所关联视频的所属类别排序

1、思路
1)先找到观看数前50的视频信息(主要是求出关联视频)
2)炸开第一步求出的关联视频array,形成一个新字段new_relatedid
3)用new_relatedid和gulivideo_orc表进行join,求出new_relatedid的类别
4)炸开第三步结果的category,形成新字段category_name
5)按照catedory_name分组,然后求出每个分组的个数category_count
6)对category_count进行排序,利用开窗函数
2、代码

select 
    t5.category_name,
    t5.num,
    rank()over(orderby t5.num desc) rk
from(select
         t4.category_name,count(t4.realte_id) num
     from(select
                  t3.realte_id,
                  category_name
              from(select
                           t2.realte_id,
                           g.category
                       from(select
                                    realte_id
                                from(select
                                             videoId,
                                             relatedId,`views`from gulivideo_orc
                                         orderby`views`desclimit50) t1
                                         lateral view
                                             explode(t1.relatedId) tmp as realte_id
                            ) t2 join gulivideo_orc g on t2.realte_id = g.videoId
                   ) t3
                       lateral view
                           explode(t3.category) tmp as category_name
          ) t4
     groupby t4.category_name        
) t5 ;// 结果显示OK
t5.category_name        t5.num  rk
Comedy  2371
Entertainment   2162
Music   1953
People  514
Blogs   514
Animation       476
Film    476
News    248
Politics        248
Games   2210
Gadgets 2210
Sports  1912
Howto   1413
DIY     1413
UNA     1315
Travel  1216
Places  1216
Animals 1118
Pets    1118
Autos   420
Vehicles        420

11.3.5统计每个类别中的视频热度Top10,以Music为例

1、思路
1)要想统计Music类别中的视频热度Top10,需要先找到Music类别,那么就需要将category展开成新的字段categary_name。
2)然后通过category_name过滤“Music”分类的所有视频信息,按照视频观看数倒序排序,取前10
3)统计对应类别(Music)中的视频热度
2、代码

select
    videoId,`views` hot
from(select
         videoId,
         category_name,`views`from gulivideo_orc
              lateral view
                  explode(category) tmp as category_name        
) t1
where category_name ="Music"orderby hot desclimit10;// 结果显示
OK
videoid          hot
QjA5faZF1A8     15256922
tYnn51C3X_w     11823701
pv5zWaTEVkI     116720178bbTtPL1jRs     9579911
UMf40daefsI     7533070-xEzGIuY7kw     6946033
d6C0bNDqf3Y     6935578
HSoVKUVOnfQ     61930573URfWTEPmtE     5581171
thtmaZnxk_0     5142238

11.3.6 统计每个类别视频观看数Top10

1、思路
1)把每个原始表的类别炸开,形成新的字段category_name
2)按照炸裂开的类别字段category_name分区,按照视频观看数views倒叙排序进行开窗,求出每个类别下的所有视频的观看次数排序rk
3)按照rk字段对全表进行where过滤,求出每个类别观看书Top10
2、代码

select
    t2.category_name,
    t2.views,
    t2.rk
from(select
         t1.category_name,
         t1.views,
         rank()over(partitionby t1.category_name orderby t1.views desc) rk
     from(select
               category_name,`views`from gulivideo_orc
                    lateral view
                        explode(category) tmp as category_name
       ) t1
) t2
where rk <=10;
// 结果显示
OK
t2.category_name        t2.views        t2.rk
Comedy  425134171
Comedy  202824642
Comedy  119700183
Comedy  101074914
Comedy  95666095
Comedy  70666766
Comedy  63221177
Comedy  58269238
Comedy  55872999
Comedy  550807910
News    47060301
News    28993972
News    28170783
News    28035204
News    23487095
News    23350606
News    23266807
News    23187828
News    23105839
News    229136910
……
Time taken: 11.376 seconds, Fetched: 210row(s)

11.3.7 统计上传视频最多的用户Top10以及它们上传的视频观看次数在前20的视频

有三种理解
理解一:取Top10中所有人上传的视频的观看次数前20
1、思路
1)去用户表gulivideo_user_orc求出上传视频最多的十个用户
2)关联gulivideo_orc表,求出这10个用户上传的所有的视频,按照观看数取前20
2、代码

SELECT
    t1.uploader,
    t2.videoid,
    t2.views
FROM(select
            uploader,
            videos
        from gulivideo_user_orc
        orderby videos DESClimit10) t1
        JOIN
    gulivideo_orc t2
    on t1.uploader = t2.uploader
ORDERBY t2.views DESCLIMIT20;// 结果显示
OK
t1.uploader     t2.videoid      t2.views
expertvillage   -IxHBW0YpZw     39059
expertvillage   BU-fT5XI_8I     29975
expertvillage   ADOcaBYbMl0     26270
expertvillage   yAqsULIDJFE     25511
expertvillage   vcm-t0TJXNg     25366
expertvillage   0KYGFawp14c     24659
expertvillage   j4DpuPvMLF4     22593
expertvillage   Msu4lZb2oeQ     18822
expertvillage   ZHZVj44rpjE     16304
expertvillage   foATQY3wovI     13576
expertvillage   -UnQ8rcBOQs     13450
expertvillage   crtNd46CDks     11639
expertvillage   D1leA0JKHhE     11553
expertvillage   NJu2oG1Wm98     11452
expertvillage   CapbXdyv4j4     10915
expertvillage   epr5erraEp4     10817
expertvillage   IyQoDgaLM7U     10597
expertvillage   tbZibBnusLQ     10402
expertvillage   _GnCHodc7mk     9422
expertvillage   hvEYlSlRitU     7123Time taken: 57.272 seconds, Fetched: 20row(s)

理解二:取Top10中每个人上传的视频的观看次数前20
1、思路
1)去用户表gulivideo_user_orc求出上传视频最多的10个用户
2)关联gulivideo_orc表,求出这10个用户上传的所有视频id,视频观看次数,还要按照uploader分区,views倒叙排序,求出每个uploder的上传的视频的观看排名
3)按照rk进行where过滤,求出rk<=20的数据
2、代码

select
    t3.uploader,
    t3.videoId,
    t3.views,
    t3.rk
from(select
         t2.uploader,
         t2.videoId,
         t2.views,
         rank()over(partitionby uploader orderby t2.views desc) rk
     from(select
                  t1.uploader,
                  g.videoId,
                  g.`views`from(select
                           uploader
                       from gulivideo_user_orc
                       orderby videos desclimit10) t1 join gulivideo_orc g on t1.uploader = g.uploader
          ) t2    
) t3
where rk <=20;// 结果显示
OK
t3.uploader     t3.videoid      t3.views        t3.rk
expertvillage   -IxHBW0YpZw     390591
expertvillage   BU-fT5XI_8I     299752
expertvillage   ADOcaBYbMl0     262703
expertvillage   yAqsULIDJFE     255114
expertvillage   vcm-t0TJXNg     253665
expertvillage   0KYGFawp14c     246596
expertvillage   j4DpuPvMLF4     225937
expertvillage   Msu4lZb2oeQ     188228
expertvillage   ZHZVj44rpjE     163049
expertvillage   foATQY3wovI     1357610
expertvillage   -UnQ8rcBOQs     1345011
expertvillage   crtNd46CDks     1163912
expertvillage   D1leA0JKHhE     1155313
expertvillage   NJu2oG1Wm98     1145214
expertvillage   CapbXdyv4j4     1091515
expertvillage   epr5erraEp4     1081716
expertvillage   IyQoDgaLM7U     1059717
expertvillage   tbZibBnusLQ     1040218
expertvillage   _GnCHodc7mk     942219
expertvillage   hvEYlSlRitU     712320
Ruchaneewan     5_T5Inddsuo     31321
Ruchaneewan     wje4lUtbYNU     10862
Ruchaneewan     i8rLbOUhAlM     5493
Ruchaneewan     OwnEtde9_Co     4534
Ruchaneewan     5Zf0lbAdJP0     4415
Ruchaneewan     wenI5MrYT20     4266
Ruchaneewan     Iq4e3SopjxQ     4207
Ruchaneewan     3hzOiFP-5so     4207
Ruchaneewan     JgyOlXjjuw0     4189
Ruchaneewan     fGBVShTsuyo     39510
Ruchaneewan     O3aoL70DlVc     38911
Ruchaneewan     q4y2ZS5OQ88     34412
Ruchaneewan     lyUJB2eMVVg     27113
Ruchaneewan     _RF_3VhaQpw     24214
Ruchaneewan     DDl2cjI-aJs     23115
Ruchaneewan     xbYyjUdhtJw     22716
Ruchaneewan     4dkKeIUkN7E     22617
Ruchaneewan     qCfuQA6N4K0     21318
Ruchaneewan     TmYbGQaRcNM     20919
Ruchaneewan     dOlfPsFSjw0     20620Time taken: 30.772 seconds, Fetched: 40row(s

理解三:Top10用户上传的所有视频,有哪些视频是在视频观看次数前20的视频
1、思路
1)去用户表gulivideo_user_orc求出上传视频最多的10个用户
2)关联gulivideo_orc表,求出这10个用户上传的所有的视频id,视频观看次数
3)在第二步的结果上,与视频表观看次数前20的数据进行内连接,求出Top10用户上传的视频有哪些是观看次数前20的视频
2、代码

SELECT
    t3.uploader,
    t3.videoid,
    t3.views
FROM(SELECT
            t1.uploader,
            t2.videoid,
            t2.views
        FROM(select
                    uploader,
                    videos
                from gulivideo_user_orc
                orderby videos DESClimit10) t1
                JOIN
            gulivideo_orc t2
            on t1.uploader = t2.uploader
    ) t3
        JOIN(select
            videoid,`views`from gulivideo_orc
        orderby`views`desclimit20) t4
on t3.videoid = t4.videoid;
标签: 大数据 hive hadoop

本文转载自: https://blog.csdn.net/key_honghao/article/details/135579982
版权归原作者 Key-Key 所有, 如有侵权,请联系我们删除。

“大数据开发之Hive(统计影音视频网站的常规指标)”的评论:

还没有评论