一、背景
发现一个10.19号的任务下午还没跑完,正常情况下,一般一个小时就已经跑完,而今天已经超过3小时了,因此去观察实际的任务,发现9个map 其中8个已经完成,就一个还在run,说明有明显的数据倾斜
** 二、数据倾斜问题处理和Hive SQL 优化**
原始sql
insert overwrite table raw_search_behavior partition(dt='2022-10-19')
select cookie_id,event_query,count(*) as cnt,
max(from_unixtime(unix_timestamp(event_time), 'yyyyMMddHHmmss')) as last_dt
from raw_query
where dt >= '20220718' and dt <= '20221019' and event_query is not null
group by cookie_id,event_query
发现有使用group by
优化后sql
insert overwrite table raw_search_behavior partition(dt='2022-10-19')
select split(tkey,'_')[1] as cookie_id,event_query, sum(cnt) as cnt,max(last_dt) as last_dt from
(
select concat_ws('_', cast(ceiling(rand()*99) as string), cookie_id) as tkey,event_query, count(*) as cnt,
max(from_unixtime(unix_timestamp(event_time), 'yyyyMMddHHmmss')) as last_dt
from raw_query
where dt >= '20220718' and dt <= '20221019' and event_query is not null
group by concat_ws('_', cast(ceiling(rand()*99) as string), iqiyi_cookie_id),event_query
) temp
group by split(tkey,'_')[1],event_query;
三、观察任务
在一小时内结束,并且没有哪个节点耗时较长
版权归原作者 诸葛子房_ 所有, 如有侵权,请联系我们删除。