第 11 章:Hive实战
11.1 数据结构
1、视频表
字段备注详细描述videoId视频唯一id(String)11位字符串uploader视频上传者(String)上传视频的用户名Stringage视频年龄(int)视频在平台上的整天数category视频类别(Array)上传视频指定的视频分类length视频长度(Int)整形数字标识的视频长度views观看次数(Int)视频被浏览的次数rate视频评分(Double)满分5分Ratings流量(Int)视频的流量,整形数字comments评论数(Int)一个视频的整数评论数relatedId相关视频id(Array)相关视频的id,最多20个
2、用户表
字段备注字段类型uploader上传者用户名stringvideos上传视频数intfriends朋友数量int
11.2 准备工作
1、需要准备的表
1)创建原始数据表:gulivideo_ori,gulivideo_user_ori,
2)创建最终表:gulivideo_orc,gulivideo_user_orc
2、创建原始数据表
1)创建原始数据表gulivideo_ori
create external table gulivideo_ori(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)row format delimited fieldsterminatedby"\t"
collection items terminatedby"&"
stored as textfile
location '/gulivideo/video';
2)创建原始数据表:gulivideo_user_ori
create external table gulivideo_user_ori(
uploader string,
videos int,
friends int)row format delimited
fieldsterminatedby"\t"
stored as textfile
location '/gulivideo/user';
3)创建orc存储格式带snappy压缩的表gulivideo_orc
createtable gulivideo_orc(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
stored as orc
tblproperties("orc.compress"="SNAPPY");
4)创建orc存储格式带snappy压缩的表gulivideo_user_orc
createtable gulivideo_user_orc(
uploader string,
videos int,
friends int)row format delimited
fieldsterminatedby"\t"
stored as orc
tblproperties("orc.compress"="SNAPPY");
5)向ori表插入数据
loaddatalocal inpath "/opt/module/hive/datas/video"intotable gulivideo_ori;loaddatalocal inpath "/opt/module/hive/datas/user.txt"intotable gulivideo_user_ori;
6)向orc表插入数据
insertintotable gulivideo_orc select*from gulivideo_ori;insertintotable gulivideo_user_orc select*from gulivideo_user_ori;
11.3 业务分析
11.3.1 统计视频观看数Top10
1、思路:
使用order by按照views字段做一个全局排序即可,同时我们设置只显示前10条。
2、代码
select
videoId,`views`from gulivideo_orc
orderby`views`desclimit10;
OK
videoid views
dMH0bHeiRNg 425134170XxI-hvPRRA 202824641dmVU08zVpA 16087899
RB-wUgnyGv0 15712924
QjA5faZF1A8 15256922-_CSo1gOd48 1319983349IDp76kjPw 11970018
tYnn51C3X_w 11823701
pv5zWaTEVkI 11672017
D2kJZOfq7zk 11184051
11.3.2 统计视频类别热度Top10(类别热度:类别下的总视频数)
1、思路:
1)统计每个类别有多少个视频,显示出包含视频最多的前10个类别。
2)我们需要按照类别group by聚合,然后count组内的videoId个数即可。
3)因为当前表结构为:一个视频对应一个或多个类别。所以如果要group by类别,需要先将类别进行行列转化(展开),然后再进行count即可。
4)最后按照热度排序,显示前10条。
2、代码
select
tmp01.category_col,count(tmp01.videoId) num
from(select
videoId,
category_col
from gulivideo_orc
lateral view
explode(category) t as category_col
) tmp01
groupby tmp01.category_col
orderby num desclimit10;// 结果显示
OK
tmp01.category_col num
Music 179049
Entertainment 127674
Comedy 87818
Animation 73293
Film 73293
Sports 67329
Gadgets 59817
Games 59817
Blogs 48890
People 48890
11.3.3 统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数
1、思路
1)先找到观看书最高的20个视频所属条目的所有信息(主要是类目),降序排列
2)先把20条信息中的category分裂出来(列转行),形成新的字段category_name
3)在第二步的结果下,按照炸开的视频类别category_name分组,然后统计组内的个数category_count
2、最终代码
select
table02.categroy_name,count(table02.videoId) num
from(select
videoId,
categroy_name
from(select
videoId,`views`,
category
from gulivideo_orc
orderby`views`desclimit20) table01
lateral view
explode(category) tmp as categroy_name
) table02
groupby table02.categroy_nam;// 结果显示
OK
table02.categroy_name num
Blogs 2
UNA 1
Comedy 6
Entertainment 6
Music 5
People 2
11.3.4 统计视频观看数Top50所关联视频的所属类别排序
1、思路
1)先找到观看数前50的视频信息(主要是求出关联视频)
2)炸开第一步求出的关联视频array,形成一个新字段new_relatedid
3)用new_relatedid和gulivideo_orc表进行join,求出new_relatedid的类别
4)炸开第三步结果的category,形成新字段category_name
5)按照catedory_name分组,然后求出每个分组的个数category_count
6)对category_count进行排序,利用开窗函数
2、代码
select
t5.category_name,
t5.num,
rank()over(orderby t5.num desc) rk
from(select
t4.category_name,count(t4.realte_id) num
from(select
t3.realte_id,
category_name
from(select
t2.realte_id,
g.category
from(select
realte_id
from(select
videoId,
relatedId,`views`from gulivideo_orc
orderby`views`desclimit50) t1
lateral view
explode(t1.relatedId) tmp as realte_id
) t2 join gulivideo_orc g on t2.realte_id = g.videoId
) t3
lateral view
explode(t3.category) tmp as category_name
) t4
groupby t4.category_name
) t5 ;// 结果显示OK
t5.category_name t5.num rk
Comedy 2371
Entertainment 2162
Music 1953
People 514
Blogs 514
Animation 476
Film 476
News 248
Politics 248
Games 2210
Gadgets 2210
Sports 1912
Howto 1413
DIY 1413
UNA 1315
Travel 1216
Places 1216
Animals 1118
Pets 1118
Autos 420
Vehicles 420
11.3.5统计每个类别中的视频热度Top10,以Music为例
1、思路
1)要想统计Music类别中的视频热度Top10,需要先找到Music类别,那么就需要将category展开成新的字段categary_name。
2)然后通过category_name过滤“Music”分类的所有视频信息,按照视频观看数倒序排序,取前10
3)统计对应类别(Music)中的视频热度
2、代码
select
videoId,`views` hot
from(select
videoId,
category_name,`views`from gulivideo_orc
lateral view
explode(category) tmp as category_name
) t1
where category_name ="Music"orderby hot desclimit10;// 结果显示
OK
videoid hot
QjA5faZF1A8 15256922
tYnn51C3X_w 11823701
pv5zWaTEVkI 116720178bbTtPL1jRs 9579911
UMf40daefsI 7533070-xEzGIuY7kw 6946033
d6C0bNDqf3Y 6935578
HSoVKUVOnfQ 61930573URfWTEPmtE 5581171
thtmaZnxk_0 5142238
11.3.6 统计每个类别视频观看数Top10
1、思路
1)把每个原始表的类别炸开,形成新的字段category_name
2)按照炸裂开的类别字段category_name分区,按照视频观看数views倒叙排序进行开窗,求出每个类别下的所有视频的观看次数排序rk
3)按照rk字段对全表进行where过滤,求出每个类别观看书Top10
2、代码
select
t2.category_name,
t2.views,
t2.rk
from(select
t1.category_name,
t1.views,
rank()over(partitionby t1.category_name orderby t1.views desc) rk
from(select
category_name,`views`from gulivideo_orc
lateral view
explode(category) tmp as category_name
) t1
) t2
where rk <=10;
// 结果显示
OK
t2.category_name t2.views t2.rk
Comedy 425134171
Comedy 202824642
Comedy 119700183
Comedy 101074914
Comedy 95666095
Comedy 70666766
Comedy 63221177
Comedy 58269238
Comedy 55872999
Comedy 550807910
News 47060301
News 28993972
News 28170783
News 28035204
News 23487095
News 23350606
News 23266807
News 23187828
News 23105839
News 229136910
……
Time taken: 11.376 seconds, Fetched: 210row(s)
11.3.7 统计上传视频最多的用户Top10以及它们上传的视频观看次数在前20的视频
有三种理解
理解一:取Top10中所有人上传的视频的观看次数前20
1、思路
1)去用户表gulivideo_user_orc求出上传视频最多的十个用户
2)关联gulivideo_orc表,求出这10个用户上传的所有的视频,按照观看数取前20
2、代码
SELECT
t1.uploader,
t2.videoid,
t2.views
FROM(select
uploader,
videos
from gulivideo_user_orc
orderby videos DESClimit10) t1
JOIN
gulivideo_orc t2
on t1.uploader = t2.uploader
ORDERBY t2.views DESCLIMIT20;// 结果显示
OK
t1.uploader t2.videoid t2.views
expertvillage -IxHBW0YpZw 39059
expertvillage BU-fT5XI_8I 29975
expertvillage ADOcaBYbMl0 26270
expertvillage yAqsULIDJFE 25511
expertvillage vcm-t0TJXNg 25366
expertvillage 0KYGFawp14c 24659
expertvillage j4DpuPvMLF4 22593
expertvillage Msu4lZb2oeQ 18822
expertvillage ZHZVj44rpjE 16304
expertvillage foATQY3wovI 13576
expertvillage -UnQ8rcBOQs 13450
expertvillage crtNd46CDks 11639
expertvillage D1leA0JKHhE 11553
expertvillage NJu2oG1Wm98 11452
expertvillage CapbXdyv4j4 10915
expertvillage epr5erraEp4 10817
expertvillage IyQoDgaLM7U 10597
expertvillage tbZibBnusLQ 10402
expertvillage _GnCHodc7mk 9422
expertvillage hvEYlSlRitU 7123Time taken: 57.272 seconds, Fetched: 20row(s)
理解二:取Top10中每个人上传的视频的观看次数前20
1、思路
1)去用户表gulivideo_user_orc求出上传视频最多的10个用户
2)关联gulivideo_orc表,求出这10个用户上传的所有视频id,视频观看次数,还要按照uploader分区,views倒叙排序,求出每个uploder的上传的视频的观看排名
3)按照rk进行where过滤,求出rk<=20的数据
2、代码
select
t3.uploader,
t3.videoId,
t3.views,
t3.rk
from(select
t2.uploader,
t2.videoId,
t2.views,
rank()over(partitionby uploader orderby t2.views desc) rk
from(select
t1.uploader,
g.videoId,
g.`views`from(select
uploader
from gulivideo_user_orc
orderby videos desclimit10) t1 join gulivideo_orc g on t1.uploader = g.uploader
) t2
) t3
where rk <=20;// 结果显示
OK
t3.uploader t3.videoid t3.views t3.rk
expertvillage -IxHBW0YpZw 390591
expertvillage BU-fT5XI_8I 299752
expertvillage ADOcaBYbMl0 262703
expertvillage yAqsULIDJFE 255114
expertvillage vcm-t0TJXNg 253665
expertvillage 0KYGFawp14c 246596
expertvillage j4DpuPvMLF4 225937
expertvillage Msu4lZb2oeQ 188228
expertvillage ZHZVj44rpjE 163049
expertvillage foATQY3wovI 1357610
expertvillage -UnQ8rcBOQs 1345011
expertvillage crtNd46CDks 1163912
expertvillage D1leA0JKHhE 1155313
expertvillage NJu2oG1Wm98 1145214
expertvillage CapbXdyv4j4 1091515
expertvillage epr5erraEp4 1081716
expertvillage IyQoDgaLM7U 1059717
expertvillage tbZibBnusLQ 1040218
expertvillage _GnCHodc7mk 942219
expertvillage hvEYlSlRitU 712320
Ruchaneewan 5_T5Inddsuo 31321
Ruchaneewan wje4lUtbYNU 10862
Ruchaneewan i8rLbOUhAlM 5493
Ruchaneewan OwnEtde9_Co 4534
Ruchaneewan 5Zf0lbAdJP0 4415
Ruchaneewan wenI5MrYT20 4266
Ruchaneewan Iq4e3SopjxQ 4207
Ruchaneewan 3hzOiFP-5so 4207
Ruchaneewan JgyOlXjjuw0 4189
Ruchaneewan fGBVShTsuyo 39510
Ruchaneewan O3aoL70DlVc 38911
Ruchaneewan q4y2ZS5OQ88 34412
Ruchaneewan lyUJB2eMVVg 27113
Ruchaneewan _RF_3VhaQpw 24214
Ruchaneewan DDl2cjI-aJs 23115
Ruchaneewan xbYyjUdhtJw 22716
Ruchaneewan 4dkKeIUkN7E 22617
Ruchaneewan qCfuQA6N4K0 21318
Ruchaneewan TmYbGQaRcNM 20919
Ruchaneewan dOlfPsFSjw0 20620Time taken: 30.772 seconds, Fetched: 40row(s
理解三:Top10用户上传的所有视频,有哪些视频是在视频观看次数前20的视频
1、思路
1)去用户表gulivideo_user_orc求出上传视频最多的10个用户
2)关联gulivideo_orc表,求出这10个用户上传的所有的视频id,视频观看次数
3)在第二步的结果上,与视频表观看次数前20的数据进行内连接,求出Top10用户上传的视频有哪些是观看次数前20的视频
2、代码
SELECT
t3.uploader,
t3.videoid,
t3.views
FROM(SELECT
t1.uploader,
t2.videoid,
t2.views
FROM(select
uploader,
videos
from gulivideo_user_orc
orderby videos DESClimit10) t1
JOIN
gulivideo_orc t2
on t1.uploader = t2.uploader
) t3
JOIN(select
videoid,`views`from gulivideo_orc
orderby`views`desclimit20) t4
on t3.videoid = t4.videoid;
版权归原作者 Key-Key 所有, 如有侵权,请联系我们删除。