

Understanding Predicate Pushdown in Hive (the Difference Between on and where)


Scenario

In real data warehouse development you constantly join multiple tables, which means deciding between on and where. If the roles of these two still feel muddled, this article should clear them up.

First, the most intuitive differences between where and on in SQL:

  1. When used for filtering, on keeps every row whether or not it satisfies the condition (non-matching rows are padded with NULL), while where only returns the rows that satisfy the condition.
  2. on's behavior changes with the join type (inner vs. outer), while where's does not; as far as where is concerned, the join is just a join.

We won't verify those two points in detail here; our focus is how where and on affect performance in Hive. If you have an environment handy, try it yourself. Here is the data setup (a short illustration of point 1 on these tables follows right after it):

  CREATE TABLE a (id string, name string) PARTITIONED BY (dt STRING);
  CREATE TABLE b (id string, dept string) PARTITIONED BY (dt STRING);
  INSERT INTO TABLE a PARTITION (dt='2022-09-08') VALUES ("1","Daniel");
  INSERT INTO TABLE a PARTITION (dt='2022-09-08') VALUES ("2","Andy");
  INSERT INTO TABLE a PARTITION (dt='2022-09-08') VALUES ("3","Marc");
  INSERT INTO TABLE b PARTITION (dt='2022-09-08') VALUES ("1","BD");
  INSERT INTO TABLE b PARTITION (dt='2022-09-08') VALUES ("2","BE");
  SELECT * FROM a WHERE dt = '2022-09-08';
  SELECT * FROM b WHERE dt = '2022-09-08';
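As promised, here is a quick illustration of point 1 against the rows inserted above. The expected output lines are a sketch inferred from those inserts, not captured query results.

  -- Condition in ON: every row of a is kept; b's columns are NULL where the condition fails
  SELECT a.id AS a_id, a.name, b.id AS b_id, b.dept
  FROM a LEFT JOIN b
    ON a.id = b.id AND b.dept = 'BD';
  -- expected: (1, Daniel, 1, BD), (2, Andy, NULL, NULL), (3, Marc, NULL, NULL)

  -- Condition in WHERE: rows that fail the condition are dropped after the join
  SELECT a.id AS a_id, a.name, b.id AS b_id, b.dept
  FROM a LEFT JOIN b
    ON a.id = b.id
  WHERE b.dept = 'BD';
  -- expected: (1, Daniel, 1, BD)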

Let's start with a real requirement: join tables a and b, taking only the latest partition of table a.

  SELECT * FROM a
  JOIN b ON a.id = b.id
  WHERE a.dt = '2022-09-08';

Most people would write it this way. To state the conclusion up front: there is nothing wrong with it.


Problem description

Some of you might instead try this:

  SELECT * FROM a
  JOIN b ON a.id = b.id
  AND a.dt = '2022-09-08';

This is equivalent to the query above and is also fine. So where does the problem come in?

The problem shows up once table a has to be the main (preserved) table and we left-outer-join table b. At that point the different ways of writing the filter are no longer equivalent.

  • Approach 1

  SELECT * FROM a
  LEFT JOIN b ON a.id = b.id
  WHERE a.dt = '2022-09-08';

Efficient: Hive reads only the data for the specified date.

  • Approach 2

  SELECT * FROM a
  LEFT JOIN b ON a.id = b.id
  AND a.dt = '2022-09-08';

Slow: Hive first reads all of the data for the join and only restricts to the specified date afterwards.

  • Approach 3

  SELECT * FROM (SELECT * FROM a
                 WHERE dt = '2022-09-08') t1
  LEFT JOIN b ON t1.id = b.id;

Efficient: Hive reads only the data for the specified date. The subquery looks a bit clumsy, but the effect is exactly the same as Approach 1. So that you can write the cleaner version with confidence, let's first go through predicate pushdown in Hive.

To back up this claim, here are the execution plans for Approach 1 and Approach 2 (the engine here is Hive on Spark).

  • Approach 1

  Explain
  STAGE DEPENDENCIES:
    Stage-2 is a root stage
    Stage-1 depends on stages: Stage-2
    Stage-0 depends on stages: Stage-1
  STAGE PLANS:
    Stage: Stage-2
      Spark
        DagName: hive_20220909110604_3af93825-e92f-4a19-ab13-38a8d5ed0542:53374
        Vertices:
          Map 2
            Map Operator Tree:
              TableScan
                alias: b
                Statistics: Num rows: 2 Data size: 30 Basic stats: COMPLETE Column stats: NONE
                // no filtering needed here
                Spark HashTable Sink Operator
                  keys:
                    0 id (type: string)
                    1 id (type: string)
            Local Work:
              Map Reduce Local Work
    Stage: Stage-1
      Spark
        DagName: hive_20220909110604_3af93825-e92f-4a19-ab13-38a8d5ed0542:53373
        Vertices:
          Map 1
            Map Operator Tree:
              TableScan
                alias: a
                // the filter is applied right at the table scan, so the HashTable Sink Operator no longer needs to filter
                filterExpr: (dt = '2022-09-08') (type: boolean)
                Statistics: Num rows: 3 Data size: 53 Basic stats: COMPLETE Column stats: NONE
                Filter Operator
                  predicate: (dt = '2022-09-08') (type: boolean)
                  Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                  Map Join Operator
                    condition map:
                      Left Outer Join 0 to 1
                    keys:
                      0 id (type: string)
                      1 id (type: string)
                    outputColumnNames: _col0, _col1, _col6, _col7, _col8
                    input vertices:
                      1 Map 2
                    Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: _col0 (type: string), _col1 (type: string), '2022-09-08' (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string)
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                      Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
                      File Output Operator
                        compressed: false
                        Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
                        table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work
    Stage: Stage-0
      Fetch Operator
        limit: -1
        Processor Tree:
          ListSink
  • Approach 2

  Explain
  STAGE DEPENDENCIES:
    Stage-2 is a root stage
    Stage-1 depends on stages: Stage-2
    Stage-0 depends on stages: Stage-1
  STAGE PLANS:
    Stage: Stage-2
      Spark
        DagName: hive_20220909110827_88d2aa5e-449a-442f-aa51-21d6a021455d:53395
        Vertices:
          Map 2
            Map Operator Tree:
              TableScan
                alias: b
                Statistics: Num rows: 2 Data size: 30 Basic stats: COMPLETE Column stats: NONE
                Spark HashTable Sink Operator
                  // filtered once here
                  filter predicates:
                    0 {(dt = '2022-09-08')}
                    1
                  keys:
                    0 id (type: string)
                    1 id (type: string)
            Local Work:
              Map Reduce Local Work
    Stage: Stage-1
      Spark
        DagName: hive_20220909110827_88d2aa5e-449a-442f-aa51-21d6a021455d:53394
        Vertices:
          Map 1
            Map Operator Tree:
              TableScan
                // no filtering at table-scan time, so the filter has to be evaluated again in the join operators of each stage
                alias: a
                Statistics: Num rows: 3 Data size: 53 Basic stats: COMPLETE Column stats: NONE
                Map Join Operator
                  condition map:
                    Left Outer Join 0 to 1
                  // filtered a second time here
                  filter predicates:
                    0 {(dt = '2022-09-08')}
                    1
                  keys:
                    0 id (type: string)
                    1 id (type: string)
                  outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
                  input vertices:
                    1 Map 2
                  Statistics: Num rows: 3 Data size: 58 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string)
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                    Statistics: Num rows: 3 Data size: 58 Basic stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 3 Data size: 58 Basic stats: COMPLETE Column stats: NONE
                      table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work
    Stage: Stage-0
      Fetch Operator
        limit: -1
        Processor Tree:
          ListSink

As the comments above show, with Approach 1 the predicate is pushed down and the data is filtered as soon as it is scanned. With Approach 2, where the predicate is not pushed down, all of the data is read and the same filter is then evaluated repeatedly during the join.

Predicate pushdown in Hive

The concept

Predicate Pushdown (PPD): in short, move filter conditions as early in the plan as possible without changing the result. After pushdown, the filter runs on the map side, which reduces map output, cuts the amount of data shipped across the cluster, saves cluster resources, and improves job performance.

PPD configuration

PPD is controlled by the parameter hive.optimize.ppd, which is enabled by default.
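A minimal sketch for checking the current value and comparing plans with pushdown off and on (hive.optimize.ppd is the documented switch; the rest is just standard SET/EXPLAIN usage in a Hive session):

  SET hive.optimize.ppd;          -- print the current value (defaults to true)
  SET hive.optimize.ppd=false;    -- disable predicate pushdown for this session
  EXPLAIN SELECT * FROM a LEFT JOIN b ON a.id = b.id WHERE a.dt = '2022-09-08';
  SET hive.optimize.ppd=true;     -- re-enable it and compare the two plans
  EXPLAIN SELECT * FROM a LEFT JOIN b ON a.id = b.id WHERE a.dt = '2022-09-08';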

Basic concepts

Preserved Row table: in an outer join, the table whose rows must all be returned.
  In a left outer join, the left table is the preserved row table;
  in a right outer join, the right table is the preserved row table;
  in a full outer join, both tables must return all of their rows, so both are preserved row tables.

Null Supplying table: by contrast, the table whose columns are filled with NULL for rows that find no match in the outer join.
  In a left outer join, all of the left table's rows are returned and the right table's columns are NULL where no match exists, so the right table is the null supplying table;
  in a right outer join, the left table is the null supplying table;
  in a full outer join, both tables are null supplying tables, because both are padded with NULL for unmatched rows.

During Join predicate: a predicate that appears in the join's ON clause. For example, in "a join b on a.id = 1", the condition a.id = 1 is a during-join predicate.

After Join predicate: a predicate that appears in the WHERE clause.

Official explanation

The logic can be summarized by these two rules:

  1. During Join predicates cannot be pushed past Preserved Row tables. (A join-clause predicate on the preserved row table will not be pushed down.)
  2. After Join predicates cannot be pushed past Null Supplying tables. (A where-clause predicate on the null supplying table will not be pushed down.)

This is captured in the following table:

                     Preserved Row Table     Null Supplying Table
  Join Predicate     Case J1: Not Pushed     Case J2: Pushed
  Where Predicate    Case W1: Pushed         Case W2: Not Pushed

For the concrete cases, see the official wiki, which has a fairly detailed execution-plan analysis: https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior

Rule summary

  1. A predicate on the preserved row table written in the join clause is not pushed down; put it in where.
  2. A predicate on the null supplying table written after the join is not pushed down; put it in on.
  3. For an inner join, the predicate is pushed down whether it is written in the join clause or in where.
  4. For a full join, the predicate is not pushed down in either case.

Concrete cases

  Pushed      select * from a join b on a.id = b.id and a.dt = '2022-09-08';
  Pushed      select * from a join b on a.id = b.id where a.dt = '2022-09-08';
  Pushed      select * from a join b on a.id = b.id and b.dt = '2022-09-08';
  Pushed      select * from a join b on a.id = b.id where b.dt = '2022-09-08';
  Not Pushed  select * from a left join b on a.id = b.id and a.dt = '2022-09-08';
  Pushed      select * from a left join b on a.id = b.id where a.dt = '2022-09-08';
  Pushed      select * from a left join b on a.id = b.id and b.dt = '2022-09-08';
  Not Pushed  select * from a left join b on a.id = b.id where b.dt = '2022-09-08';
  Pushed      select * from a right join b on a.id = b.id and a.dt = '2022-09-08';
  Not Pushed  select * from a right join b on a.id = b.id where a.dt = '2022-09-08';
  Not Pushed  select * from a right join b on a.id = b.id and b.dt = '2022-09-08';
  Pushed      select * from a right join b on a.id = b.id where b.dt = '2022-09-08';
  Not Pushed  select * from a full join b on a.id = b.id and a.dt = '2022-09-08';
  Not Pushed  select * from a full join b on a.id = b.id where a.dt = '2022-09-08';
  Not Pushed  select * from a full join b on a.id = b.id and b.dt = '2022-09-08';
  Not Pushed  select * from a full join b on a.id = b.id where b.dt = '2022-09-08';
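If you want to verify any row above yourself, one simple check (on the same a and b tables) is to run EXPLAIN and look at where the dt condition shows up: under the TableScan as filterExpr / Filter Operator means pushed, only inside the join operator's filter predicates means not pushed. For example:

  -- expected (per the table above) to be Pushed: the dt condition should appear under b's TableScan
  EXPLAIN SELECT * FROM a LEFT JOIN b ON a.id = b.id AND b.dt = '2022-09-08';

  -- expected (per the table above) to be Not Pushed: the dt condition should only appear in the join's filter predicates
  EXPLAIN SELECT * FROM a LEFT JOIN b ON a.id = b.id WHERE b.dt = '2022-09-08';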

Rule table

                  inner join              left outer join          right outer join         full outer join
                  left       right        left         right       left        right        left        right
  on (join)       Pushed     Pushed       Not Pushed   Pushed      Pushed      Not Pushed   Not Pushed  Not Pushed
  where           Pushed     Pushed       Pushed       Not Pushed  Not Pushed  Pushed       Not Pushed  Not Pushed

Special note

Non-deterministic functions such as rand() cannot be pushed down. unix_timestamp() is an exception: its execution plan shows that it is pushed down.

  EXPLAIN SELECT * FROM a
  LEFT JOIN b ON a.id = b.id
  WHERE a.dt = unix_timestamp();
  Explain
  STAGE DEPENDENCIES:
    Stage-2 is a root stage
    Stage-1 depends on stages: Stage-2
    Stage-0 depends on stages: Stage-1
  STAGE PLANS:
    Stage: Stage-2
      Spark
        DagName: hive_20220909114638_7c328579-23dc-434b-9109-8af34c166272:53432
        Vertices:
          Map 2
            Map Operator Tree:
              TableScan
                alias: b
                Statistics: Num rows: 2 Data size: 30 Basic stats: COMPLETE Column stats: NONE
                Spark HashTable Sink Operator
                  // no filtering needed here
                  keys:
                    0 id (type: string)
                    1 id (type: string)
            Local Work:
              Map Reduce Local Work
    Stage: Stage-1
      Spark
        DagName: hive_20220909114638_7c328579-23dc-434b-9109-8af34c166272:53431
        Vertices:
          Map 1
            Map Operator Tree:
              TableScan
                alias: a
                // already filtered at table-scan time
                filterExpr: (dt = 1662522398) (type: boolean)
                Statistics: Num rows: 3 Data size: 53 Basic stats: COMPLETE Column stats: NONE
                Filter Operator
                  predicate: (dt = 1662522398) (type: boolean)
                  Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                  Map Join Operator
                    condition map:
                      Left Outer Join 0 to 1
                    keys:
                      0 id (type: string)
                      1 id (type: string)
                    outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
                    input vertices:
                      1 Map 2
                    Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string)
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                      Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
                      File Output Operator
                        compressed: false
                        Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
                        table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work
    Stage: Stage-0
      Fetch Operator
        limit: -1
        Processor Tree:
          ListSink
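For the rand() side of the note above, you can produce a plan the same way and see where the filter ends up. The predicate below is only there to exercise a non-deterministic function, not to do anything meaningful:

  EXPLAIN SELECT * FROM a
  LEFT JOIN b ON a.id = b.id
  WHERE rand() < 0.5;   -- per the note above, this should not be pushed into the TableScan; compare with the unix_timestamp() plan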

Conclusion

1. For an inner join or a full outer join, there is no performance difference between putting the condition after on and putting it after where;
2. For a left outer join, putting conditions on the right table after on and conditions on the left table after where improves performance;
3. For a right outer join, putting conditions on the left table after on and conditions on the right table after where improves performance;
4. When conditions are spread across both tables, combine rules 2 and 3; the resulting cases are shown below:

  select * from a left outer join b on (a.id = b.id and a.dt = '2022-09-08' and b.id = '2022-09-08');
      id filtered on the map side, dt filtered on the reduce side -- inefficient
  select * from a left outer join b on (a.id = b.id and b.id = '2022-09-08') where a.dt = '2022-09-08';
      both id and dt filtered on the map side -- efficient
  select * from a left outer join b on (a.id = b.id and a.dt = '2022-09-08') where b.id = '2022-09-08';
      both id and dt filtered on the reduce side -- very inefficient
  select * from a left outer join b on (a.id = b.id) where a.dt = '2022-09-08' and b.id = '2022-09-08';
      id filtered on the reduce side, dt filtered on the map side -- inefficient

Tags: hive, big data

Reposted from: https://blog.csdn.net/a805814077/article/details/126777345
Copyright belongs to the original author, DanielMaster. If there is any infringement, please contact us and it will be removed.
