从一个sql引发的hive谓词下推的全面复盘及源码分析（上）

嗨，大家好，我是小萝卜算子。（微信公众号：数据仓库践行者。感谢关注）
首先先更新一下我的VIP群里的状态
在这里插入图片描述
日常答疑

在这里插入图片描述
飞书上的视频课程也已经有条不紊的进行了。
欢迎有需要的小伙伴一起来搬砖。

下面开始今天的正题
Hive版本：hive-2.1.1
经常听到【谓词下推】这个词，却从来没有对它进行全面的深入的研究，直到前些天，我们的数据产品跑过来跟我讨论他写的一个sql，这个sql最终出现的结果并不是他想要的。看了具体的sql后，引发了我的一些思考，决定来挖一挖谓词下推。

1、那个引人思考的sql

原sql业务比较复杂，我们用一些简单的测试数据复现场景。
创建两张表：

create table test1(id int,openid string) PARTITIONED BY ( day string ) STORED AS ORC;
create table test2(id int,openid string)  PARTITIONED BY ( day string ) STORED AS ORC;

数据：

insert into table test1 partition (day='20190521') values(1,'apple');
insert into table test1 partition (day='20190521') values(2,'peach'); 
insert into table test2 partition (day='20190521') values(1,'apple');
insert into table test2 partition (day='20190521') values(3,'lemon');

数据产品跑过来跟我讨论的那个sql:

select 
  count(distinct case when b.openid is null then a.openid end) as n1,
    count(distinct case when a.openid is null then b.openid end) as n2 
from test2 a full join   test1 b on a.openid = b.openid  
where a.day = '20190521' and b.day = '20190521'

本意：想计算出20190521 这天，test1和test2表 openid字段各自己非交集的条数，如下图
在这里插入图片描述
理想中的结果应该是 1，1
但实际上输出结果是：0，0

为什么呢？其实也不难判断
是使用者对该sql的执行顺序理解有误（Joins occur BEFOREWHERE CLAUSES）

这里需要明白一点， join之后的where决定了最终要呈现的数据

hive full join不管会不会进行谓词下推， where后面的a.day = ‘20190521’ and b.day = ‘20190521’ 的条件一定会在full join之后过滤一遍，full join的中间结果如下：
在这里插入图片描述
在这样的结果中，计算a.day = ‘20190521’ and b.day = ‘20190521’ 条件下，a.openid为null的条数及b.openid为null的条数，肯定都是0了

谓词下推复盘

上面那条sql引发的场景，实际上跟有没有进行谓词下推关系不大，但是这样的一个sql却引发我们的思考。
这次复盘的hive版本是 hive 2.1.1，不同的版本，对谓词下推做的优化有所不同，尤其是不同大版本间，相差比较多，比如：hive1.x和hive2.x。当然，hive2.x相对于hive1.x是更智能，更优化了。

2.1 谓词

谓词下推概念中的谓词指返回bool值即true和false的函数，或是隐式转换为bool的函数，
比如：like ，is null,in,exists,=,!、= 这些返回bool值的函数

2.2 几个概念

2.2.1 Preserved Row table（保留表）

在outer join中需要返回所有数据的表叫做保留表，也就是说在left outer join中，左表需要返回所有数据，则左表是保留表；right outer join中右表则是保留表；在full outer join中左表和右表都要返回所有数据，则左右表都是保留表。

2.2.2 Null Supplying table（空表）

在outer join中对于没有匹配到的行需要用null来填充的表称为Null Supplying table。在left outer join中，左表的数据全返回，对于左表在右表中无法匹配的数据的相应列用null表示，则此时右表是Null Supplying table，相应的如果是right outer join的话，左表是Null Supplying table。但是在full outer join中左表和右表都是Null Supplying table，因为左表和右表都会用null来填充无法匹配的数据。

2.2.3 During Join predicate（Join中的谓词）

Join中的谓词是指 Join On语句中的谓词。如：R1 join R2 on R1.x = 5 the predicate R1.x = 5是Join中的谓词

2.2.4 After Join predicate（Join之后的谓词）

where语句中的谓词称之为Join之后的谓词

2.3 谓词下推

谓词下推的基本思想：将过滤表达式尽可能移动至靠近数据源的位置，以使真正执行时能直接跳过无关的数据。

在hive官网上给出了outer join【left outer join、right out join、full outer join】的谓下推规则：
在这里插入图片描述
翻译一下如下：

Join(只包括left join ,right join,full join)中的谓词如果是保留表的，则不会下推
Join(只包括left join ,right join,full join)之后的谓词如果是Null Supplying tables的，则不会下推

这种规则在hive2.x版本以后，就不是很准确了，hive2.x对CBO做了优化，CBO也对谓词下推规则产生了一些影响。

因此在hive2.1.1中影响谓词下推规则的，主要有两方面

Hive逻辑执行计划层面的优化
CBO 下面写例子来加深理解：

为了方便我们再往test1里插入几条数据

insert into table test1 partition (day='20190521') values(1,'pear');
insert into table test1 partition (day='20190521') values(2,'pear');
insert into table test1 partition (day='20190521') values(3,'banana');

在执行例子之前，避免map join影响到各个例子的执行计划，先关闭map join：

set hive.auto.convert.join=false;
set hive.cbo.enable=false;

case1 inner join 中的谓词

由于CBO对谓词下推也做了优化，我们在分析时，主要分析两部分，CBO功能关闭和开启，控制CBO的参数：hive.cbo.enable 在hive2版本中默认都是开启的 true

select t1.*,t2.* from test1 t1 join test2 t2 on t1.id=t2.id and t1.openid='pear' 
and t2.openid='apple';

hive> set hive.cbo.enable=false;
hive> explain select t1.*,t2.* from test1 t1 join test2 t2 on t1.id=t2.id and t1.openid='pear' and t2.openid='apple';
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((openid = 'pear') and id is not null) (type: boolean)
              Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: id (type: int)
                sort order: +
                Map-reduce partition columns: id (type: int)
                Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
                value expressions: day (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((openid = 'apple') and id is not null) (type: boolean)
              Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: id (type: int)
                sort order: +
                Map-reduce partition columns: id (type: int)
                Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                value expressions: day (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col2, _col6, _col8
          Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), 'pear' (type: string), _col2 (type: string), _col6 (type: int), 'apple' (type: string), _col8 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
            Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.682 seconds, Fetched: 59 row(s)

hive> set hive.cbo.enable=true;
hive> explain select t1.*,t2.* from test1 t1 join test2 t2 on t1.id=t2.id and t1.openid='pear' and t2.openid='apple';
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((openid = 'pear') and id is not null) (type: boolean)
              Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: id (type: int), day (type: string)
                outputColumnNames: _col0, _col2
                Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col2 (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((openid = 'apple') and id is not null) (type: boolean)
              Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: id (type: int), day (type: string)
                outputColumnNames: _col0, _col2
                Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col2 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 _col0 (type: int)
            1 _col0 (type: int)
          outputColumnNames: _col0, _col2, _col3, _col5
          Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), 'pear' (type: string), _col2 (type: string), _col3 (type: int), 'apple' (type: string), _col5 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
            Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.704 seconds, Fetched: 67 row(s)

从执行计划上来看，hive对join的谓下推优化做的很到位，不管有没有开启CBO优化，在on中的两个条件都提至table scan阶段进行过滤。实际上，这个也是因为join的特殊性，要求左表和右表必须完全匹配，这个也为做优化提供了可能性。

case2 inner join 之后的谓词

select t1.*,t2.* from test1 t1 join test2 t2 on t1.id=t2.id where t1.openid='pear'  
and t2.openid='apple';

inner join之后的谓词与 inner join之中的谓词的效果是完全一样的，很容易理解，执行计划如下：

hive> set hive.cbo.enable=false;
hive> explain select t1.*,t2.* from test1 t1 join test2 t2 on t1.id=t2.id where t1.openid='pear'  and t2.openid='apple';
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (id is not null and (openid = 'pear')) (type: boolean)
              Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: id (type: int)
                sort order: +
                Map-reduce partition columns: id (type: int)
                Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
                value expressions: day (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (id is not null and (openid = 'apple')) (type: boolean)
              Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: id (type: int)
                sort order: +
                Map-reduce partition columns: id (type: int)
                Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                value expressions: day (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col2, _col6, _col8
          Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), 'pear' (type: string), _col2 (type: string), _col6 (type: int), 'apple' (type: string), _col8 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
            Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.41 seconds, Fetched: 59 row(s)

hive> set hive.cbo.enable=true;
hive> explain select t1.*,t2.* from test1 t1 join test2 t2 on t1.id=t2.id where t1.openid='pear'  and t2.openid='apple';
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((openid = 'pear') and id is not null) (type: boolean)
              Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: id (type: int), day (type: string)
                outputColumnNames: _col0, _col2
                Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col2 (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((openid = 'apple') and id is not null) (type: boolean)
              Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: id (type: int), day (type: string)
                outputColumnNames: _col0, _col2
                Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col2 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 _col0 (type: int)
            1 _col0 (type: int)
          outputColumnNames: _col0, _col2, _col3, _col5
          Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), 'pear' (type: string), _col2 (type: string), _col3 (type: int), 'apple' (type: string), _col5 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
            Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.462 seconds, Fetched: 67 row(s)

case3 left join 中的谓词

select t1.*,t2.* from test1 t1 left join test2 t2 on t1.id=t2.id and t1.openid='pear'  
and t2.openid='apple';

hive> set hive.cbo.enable=false;
hive> explain select t1.*,t2.* from test1 t1 left join test2 t2 on t1.id=t2.id and t1.openid='pear'  
    > and t2.openid='apple';
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
              value expressions: openid (type: string), day (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (openid = 'apple') (type: boolean)
              Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: id (type: int)
                sort order: +
                Map-reduce partition columns: id (type: int)
                Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                value expressions: openid (type: string), day (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          filter predicates:
            0 {(VALUE._col0 = 'pear')}
            1 
          keys:
            0 id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
          Statistics: Num rows: 5 Data size: 511 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string), _col6 (type: int), _col7 (type: string), _col8 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
            Statistics: Num rows: 5 Data size: 511 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 5 Data size: 511 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.237 seconds, Fetched: 59 row(s)

在hive.cbo.enable=false的情况下，执行计划如上图，test2表进行了谓词下推，将predicate: (openid = ‘apple’) (type: boolean)过滤件放在table scan之后。在这里，test1为保留表，符合【Join(只包括left join ,right join,full join)中的谓词如果是保留表的，则不会下推】，test2进行了下推。

查询的过程：遍历test1所有的数据，提取test2中openid为apple的数据，该条数据即为：

1       apple        20190521

然后，拿test1中 openid为pear的数据与test2表中的这条关联，关联上的就展示，关联不上的，补null；另外，test1中的openid不为pear的数据不与test2表做关联，直接补null。。。

因此，执行结果如下：
在这里插入图片描述
由于join中谓词执行过程的特殊性，CBO也不可能再做什么优化，因此，打开CBO开关，执行计划不变：

hive> set hive.cbo.enable=true;
hive> explain select t1.*,t2.* from test1 t1 left join test2 t2 on t1.id=t2.id and t1.openid='pear'  
    > and t2.openid='apple';
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: int), openid (type: string), day (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: int)
                sort order: +
                Map-reduce partition columns: _col0 (type: int)
                Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col1 (type: string), _col2 (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (openid = 'apple') (type: boolean)
              Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: id (type: int), 'apple' (type: string), day (type: string)
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col1 (type: string), _col2 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          filter predicates:
            0 {(VALUE._col0 = 'pear')}
            1 
          keys:
            0 _col0 (type: int)
            1 _col0 (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
          Statistics: Num rows: 5 Data size: 511 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 5 Data size: 511 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.383 seconds, Fetched: 63 row(s)

case4 left join 之后的谓词

select t1.*,t2.* from test1 t1 left join test2 t2 on t1.id=t2.id where t1.openid='pear'  
and t2.openid='apple';

hive> set hive.cbo.enable=false;
hive> explain select t1.*,t2.* from test1 t1 left join test2 t2 on t1.id=t2.id where t1.openid='pear'  and t2.openid='apple';
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (openid = 'pear') (type: boolean)
              Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: id (type: int)
                sort order: +
                Map-reduce partition columns: id (type: int)
                Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
                value expressions: day (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
              value expressions: openid (type: string), day (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          keys:
            0 id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col2, _col6, _col7, _col8
          Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (_col7 = 'apple') (type: boolean)
            Statistics: Num rows: 1 Data size: 102 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: _col0 (type: int), 'pear' (type: string), _col2 (type: string), _col6 (type: int), 'apple' (type: string), _col8 (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
              Statistics: Num rows: 1 Data size: 102 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 1 Data size: 102 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.213 seconds, Fetched: 59 row(s)

屏蔽掉CBO对谓词下推的影响后，从执行计划中我们可以看到，对test1进行了下推，将predicate: (openid = ‘pear’) (type: boolean)过滤条件下推到 tablescan处进行；而test2没有下推，最后是在 join之后，才进行了 predicate: (_col7 = ‘apple’) (type: boolean)过滤。这个也符合【 Join之后的谓词如果是 Null Supplying tables的，则不会下推】。

但是，我们会发现， join之后的 where无论怎样都是要过滤一遍的，这里的test2实际上是可以下推的，下推了会更优化，也不会改变最终的结果。

hive> set hive.cbo.enable=true;
hive> explain select t1.*,t2.* from test1 t1 left join test2 t2 on t1.id=t2.id where t1.openid='pear'  and t2.openid='apple';
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((openid = 'pear') and id is not null) (type: boolean)
              Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: id (type: int), day (type: string)
                outputColumnNames: _col0, _col2
                Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col2 (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((openid = 'apple') and id is not null) (type: boolean)
              Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: id (type: int), day (type: string)
                outputColumnNames: _col0, _col2
                Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col2 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 _col0 (type: int)
            1 _col0 (type: int)
          outputColumnNames: _col0, _col2, _col3, _col5
          Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), 'pear' (type: string), _col2 (type: string), _col3 (type: int), 'apple' (type: string), _col5 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
            Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.513 seconds, Fetched: 67 row(s)

打CBO开关 set hive.cbo.enable=true 后，我们会发现，test2也进行了谓词下推，做到了更优化

case5 full join 中的谓词

select t1.*,t2.* from test1 t1 full join test2 t2 on t1.id=t2.id and t1.openid='pear'  
and t2.openid='apple';

hive> set hive.cbo.enable=false;
hive> explain select t1.*,t2.* from test1 t1 full join test2 t2 on t1.id=t2.id and t1.openid='pear'  and t2.openid='apple';
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
              value expressions: openid (type: string), day (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
              value expressions: openid (type: string), day (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          filter predicates:
            0 {(VALUE._col0 = 'pear')}
            1 {(VALUE._col0 = 'apple')}
          keys:
            0 id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
          Statistics: Num rows: 5 Data size: 511 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string), _col6 (type: int), _col7 (type: string), _col8 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
            Statistics: Num rows: 5 Data size: 511 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 5 Data size: 511 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.308 seconds, Fetched: 56 row(s)

full join 的左右表很矛盾，即是保留表，也是 Null Supplying tables。但是有一条不变，就是左右表的数据都一定是要保留表下来的，因此也不难理解，这里为什么不能进行谓词下推，只要下推了，就不能保证两个表的数据都保留。。。

这也算是符合了【 Join(只包括 left join ,right join,full join)中的谓词如果是保留表的，则不会下推】

解释一下数据结果：
在这里插入图片描述
test1, test2 两个表全遍历，test1表拿 openid为 pear的与test2表中的数据关联，其它 openid不为 pear的，test2表直接补 null。而test2 表拿 openid为 apple的与test1表中的数据关联，其它 openid不为 apple的，test1 表直接补 null。

基于我们对full join的理解，打开CBO开关，也不会做谓词下推的优化：

hive> set hive.cbo.enable=true;
hive> explain select t1.*,t2.* from test1 t1 full join test2 t2 on t1.id=t2.id and t1.openid='pear'  and t2.openid='apple';
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: int), openid (type: string), day (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: int)
                sort order: +
                Map-reduce partition columns: _col0 (type: int)
                Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col1 (type: string), _col2 (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: int), openid (type: string), day (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: int)
                sort order: +
                Map-reduce partition columns: _col0 (type: int)
                Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col1 (type: string), _col2 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          filter predicates:
            0 {(VALUE._col0 = 'pear')}
            1 {(VALUE._col0 = 'apple')}
          keys:
            0 _col0 (type: int)
            1 _col0 (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
          Statistics: Num rows: 5 Data size: 511 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 5 Data size: 511 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.321 seconds, Fetched: 60 row(s)

case6 full join 之后的谓词

select t1.*,t2.* from test1 t1 full join test2 t2 on t1.id=t2.id where t1.openid='pear'  
and t2.openid='apple';


hive> set hive.cbo.enable=false;
hive> explain select t1.*,t2.* from test1 t1 full join test2 t2 on t1.id=t2.id where t1.openid='pear'  and t2.openid='apple';
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
              value expressions: openid (type: string), day (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
              value expressions: openid (type: string), day (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
          Statistics: Num rows: 5 Data size: 511 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: ((_col1 = 'pear') and (_col7 = 'apple')) (type: boolean)
            Statistics: Num rows: 1 Data size: 102 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: _col0 (type: int), 'pear' (type: string), _col2 (type: string), _col6 (type: int), 'apple' (type: string), _col8 (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
              Statistics: Num rows: 1 Data size: 102 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 1 Data size: 102 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.274 seconds, Fetched: 56 row(s)

关掉CBO开关，我们会发现，没有进行谓词下推的优化，而是在 full join之后，做了 predicate: ((_col1 = ‘pear’) and (_col7 = ‘apple’)) (type: boolean)的过滤，这也符合【Join(只包括 left join ,right join,full join)之后的谓词如果是 Null Supplying tables的，则不会下推】这一规则，毕竟test1,test2也属于 Null Supplying tables 。

实际上，这里是可以下推的，与 left join同理，下推后，不影响最后的结果：

hive>  set hive.cbo.enable=true;
hive> explain select t1.*,t2.* from test1 t1 full join test2 t2 on t1.id=t2.id where t1.openid='pear'  and t2.openid='apple';
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 5 Data size: 465 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((openid = 'pear') and id is not null) (type: boolean)
              Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: id (type: int), day (type: string)
                outputColumnNames: _col0, _col2
                Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col2 (type: string)
          TableScan
            alias: t2
            Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((openid = 'apple') and id is not null) (type: boolean)
              Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: id (type: int), day (type: string)
                outputColumnNames: _col0, _col2
                Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col2 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 _col0 (type: int)
            1 _col0 (type: int)
          outputColumnNames: _col0, _col2, _col3, _col5
          Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), 'pear' (type: string), _col2 (type: string), _col3 (type: int), 'apple' (type: string), _col5 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
            Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 2 Data size: 204 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.606 seconds, Fetched: 67 row(s)

打开CBO开关，两表都进行了下推

结论

经过上面的案例分析，我们可以给出结论了，这里没有分析right join，因为right jion和left join原理是一样的
在这里插入图片描述

原文链接：https://mp.weixin.qq.com/s/ZwoAjkMlvAOoOI8Z-QNFsQ

标签：大数据 hadoop spark

本文转载自: https://blog.csdn.net/weixin_39282730/article/details/123871724
版权归原作者 嘉祐-小萝卜算子 所有，如有侵权，请联系我们删除。