0


Hive/Spark窗口函数

窗口函数

hive文档链接
spark文档链接

1. OVER支持的函数

  • 自然序编号 Syntax: ROW_NUMBER
  • 按等级编号 Syntax: RANK | DENSE_RANK | PERCENT_RANK
  • 分组内分桶,并返回对应桶的序号 Syntax: NTILE(n)
  • Analytic Functions(分析函数) Syntax: CUME_DIST | LAG | LEAD | NTH_VALUE | FIRST_VALUE | LAST_VALUE
  • Aggregate Functions(聚合函数) Syntax: MAX | MIN | COUNT | SUM | AVG | …

1.1. 准备工作

创建测试表并插入测试数据

CREATETABLE employees (name STRING, dept STRING, salary INT, age INT);INSERTINTO employees VALUES("Lisa","Sales",10000,35),("Evan","Sales",32000,38),("Fred","Engineering",21000,28),("Alex","Sales",30000,33),("Tom","Engineering",23000,33),("Jane","Marketing",29000,28),("Jeff","Marketing",35000,38),("Paul","Engineering",29000,23),("Chloe","Engineering",23000,25);

1.2. row_number()

row_number() over()

row_number可能是窗口函数中使用最频繁的函数,作用是在分组内按自然序进行编号,结果值为:1 2 3 4 5

select*,row_number()over(partitionby dept orderby salary)as rn from employees;+-----------------+-----------------+-------------------+----------------+-----+| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |+-----------------+-----------------+-------------------+----------------+-----+| Jane            | Marketing       |29000|28|1|| Jeff            | Marketing       |35000|38|2|| Fred            | Engineering     |21000|28|1|| Tom             | Engineering     |23000|33|2|| Chloe           | Engineering     |23000|25|3|| Paul            | Engineering     |29000|23|4|| Lisa            | Sales           |10000|35|1|| Alex            | Sales           |30000|33|2|| Evan            | Sales           |32000|38|3|+-----------------+-----------------+-------------------+----------------+-----+

1.3. rank函数

rank有等级的含义,函数作用是在分组内按照order by的结果得到有等级编号。

rank() over ()

并列有间隔,结果如:1 2 2 4 5

dense_rank() over()

dense有密集的含义,并列无间隔,结果如:1 2 2 3 4

percent_rank() over()

百分比的rank,含义是(当前行-1)/(当前组总行数-1),当前行从1开始

注意,如果order by的结果相同,则rank得到的结果都相同,在这里的语义是排序结果相同,因此等级编号也相同。

select*,rank()over(partitionby dept orderby salary)as rn from employees;-- 执行结果中,engineering中salary相同的编号相同,paul的值为4+-----------------+-----------------+-------------------+----------------+-----+| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |+-----------------+-----------------+-------------------+----------------+-----+| Jane            | Marketing       |29000|28|1|| Jeff            | Marketing       |35000|38|2|| Fred            | Engineering     |21000|28|1|| Tom             | Engineering     |23000|33|2|| Chloe           | Engineering     |23000|25|2|| Paul            | Engineering     |29000|23|4|| Lisa            | Sales           |10000|35|1|| Alex            | Sales           |30000|33|2|| Evan            | Sales           |32000|38|3|+-----------------+-----------------+-------------------+----------------+-----+select*,dense_rank()over(partitionby dept orderby salary)as rn from employees;-- 执行结果中,engineering中salary相同的编号相同,paul的值为3+-----------------+-----------------+-------------------+----------------+-----+| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |+-----------------+-----------------+-------------------+----------------+-----+| Jane            | Marketing       |29000|28|1|| Jeff            | Marketing       |35000|38|2|| Fred            | Engineering     |21000|28|1|| Tom             | Engineering     |23000|33|2|| Chloe           | Engineering     |23000|25|2|| Paul            | Engineering     |29000|23|3|| Lisa            | Sales           |10000|35|1|| Alex            | Sales           |30000|33|2|| Evan            | Sales           |32000|38|3|+-----------------+-----------------+-------------------+----------------+-----+select*,percent_rank()over(partitionby dept orderby salary)as rn from employees;-- rn的结果是每行在当前分组中的百分比,注意order by中相同值结果相同+-----------------+-----------------+-------------------+----------------+---------------------+| employees.name  | employees.dept  | employees.salary  | employees.age  |         rn          |+-----------------+-----------------+-------------------+----------------+---------------------+| Jane            | Marketing       |29000|28|0.0|| Jeff            | Marketing       |35000|38|1.0|| Fred            | Engineering     |21000|28|0.0|| Tom             | Engineering     |23000|33|0.3333333333333333|| Chloe           | Engineering     |23000|25|0.3333333333333333|| Paul            | Engineering     |29000|23|1.0|| Lisa            | Sales           |10000|35|0.0|| Alex            | Sales           |30000|33|0.5|| Evan            | Sales           |32000|38|1.0|+-----------------+-----------------+-------------------+----------------+---------------------+-- order by结果相同的情况,order by deptselect*,rank()over(partitionby dept orderby dept)as rn from employees;select*,dense_rank()over(partitionby dept orderby dept)as rn from employees;-- 二者结果相同+-----------------+-----------------+-------------------+----------------+-----+| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |+-----------------+-----------------+-------------------+----------------+-----+| Fred            | Engineering     |21000|28|1|| Tom             | Engineering     |23000|33|1|| Paul            | Engineering     |29000|23|1|| Chloe           | Engineering     |23000|25|1|| Jane            | Marketing       |29000|28|1|| Jeff            | Marketing       |35000|38|1|| Lisa            | Sales           |10000|35|1|| Evan            | Sales           |32000|38|1|| Alex            | Sales           |30000|33|1|+-----------------+-----------------+-------------------+----------------+-----+select*,percent_rank()over(partitionby dept orderby dept)as rn from employees;+-----------------+-----------------+-------------------+----------------+------+| employees.name  | employees.dept  | employees.salary  | employees.age  |  rn  |+-----------------+-----------------+-------------------+----------------+------+| Fred            | Engineering     |21000|28|0.0|| Tom             | Engineering     |23000|33|0.0|| Paul            | Engineering     |29000|23|0.0|| Chloe           | Engineering     |23000|25|0.0|| Jane            | Marketing       |29000|28|0.0|| Jeff            | Marketing       |35000|38|0.0|| Lisa            | Sales           |10000|35|0.0|| Evan            | Sales           |32000|38|0.0|| Alex            | Sales           |30000|33|0.0|+-----------------+-----------------+-------------------+----------------+------+

1.4. ntile(n)

ntile(n) over

tile有瓷砖、瓦片的含义,作用是在分组内对数据进行分桶,然后返回桶的序号。

按照order by结果将数据平均的放入到分好的桶中,如果数据无法按照桶个数均分,则将多余的数据放在第一个桶内。

使用场景:例如获取每个部门薪资前三分之一的员工数据,则按照salary降序排序,然后分成3个桶,最后取第一个桶中的数据。

select*,NTILE(2)over(partitionby dept orderby salary)as rn from employees;-- ntile(2)表示分成2桶,+-----------------+-----------------+-------------------+----------------+-----+| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |+-----------------+-----------------+-------------------+----------------+-----+| Jane            | Marketing       |29000|28|1|| Jeff            | Marketing       |35000|38|2|| Fred            | Engineering     |21000|28|1|| Tom             | Engineering     |23000|33|1|| Chloe           | Engineering     |23000|25|2|| Paul            | Engineering     |29000|23|2|| Lisa            | Sales           |10000|35|1|| Alex            | Sales           |30000|33|1|| Evan            | Sales           |32000|38|2|+-----------------+-----------------+-------------------+----------------+-----+select*,NTILE(3)over(partitionby dept orderby salary)as rn from employees;-- 分成3桶+-----------------+-----------------+-------------------+----------------+-----+| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |+-----------------+-----------------+-------------------+----------------+-----+| Jane            | Marketing       |29000|28|1|| Jeff            | Marketing       |35000|38|2|| Fred            | Engineering     |21000|28|1|| Tom             | Engineering     |23000|33|1|| Chloe           | Engineering     |23000|25|2|| Paul            | Engineering     |29000|23|3|| Lisa            | Sales           |10000|35|1|| Alex            | Sales           |30000|33|2|| Evan            | Sales           |32000|38|3|+-----------------+-----------------+-------------------+----------------+-----+

1.5. cume_dist()

cume表示累计,dist表示距离
用于求累计分布,即分组中按照order by结果的分位数

select*,cume_dist()over(partitionby dept orderby salary)from employees;-- enginnering分组中,tom和chloe中salary相同,二者的分位数结果相同+-----------------+-----------------+-------------------+----------------+---------------------+| employees.name  | employees.dept  | employees.salary  | employees.age  | cume_dist_window_0  |+-----------------+-----------------+-------------------+----------------+---------------------+| Jane            | Marketing       |29000|28|0.5|| Jeff            | Marketing       |35000|38|1.0|| Fred            | Engineering     |21000|28|0.25|| Tom             | Engineering     |23000|33|0.75|| Chloe           | Engineering     |23000|25|0.75|| Paul            | Engineering     |29000|23|1.0|| Lisa            | Sales           |10000|35|0.3333333333333333|| Alex            | Sales           |30000|33|0.6666666666666666|| Evan            | Sales           |32000|38|1.0|+-----------------+-----------------+-------------------+----------------+---------------------+select*,cume_dist()over(partitionby dept orderby age)from employees;-- 按照age排序,tom和chloe中age不同,因此分位数不同+-----------------+-----------------+-------------------+----------------+---------------------+| employees.name  | employees.dept  | employees.salary  | employees.age  | cume_dist_window_0  |+-----------------+-----------------+-------------------+----------------+---------------------+| Jane            | Marketing       |29000|28|0.5|| Jeff            | Marketing       |35000|38|1.0|| Paul            | Engineering     |29000|23|0.25|| Chloe           | Engineering     |23000|25|0.5|| Fred            | Engineering     |21000|28|0.75|| Tom             | Engineering     |23000|33|1.0|| Alex            | Sales           |30000|33|0.3333333333333333|| Lisa            | Sales           |10000|35|0.6666666666666666|| Evan            | Sales           |32000|38|1.0|+-----------------+-----------------+-------------------+----------------+---------------------+

1.6. lag和lead

lag 表示落后的含义,在使用场景中是小于的含义。
lead 表示领先的含义,在使用场景中是大于的含义。

可以实现不自连接的前提下,按照order by结果得到当前行指定列 前/后移动num行 的列值。

函数参数

LAG/LEAD(col,num,default_value)

:col表示指定的列名;num表示指定的位移行数,默认为1;default_value表示末尾或开头返回的值,默认值null。

SELECT*,
    LAG(salary)OVER(PARTITIONBY dept ORDERBY salary)AS lag,
    LEAD(salary)OVER(PARTITIONBY dept ORDERBY salary)AS lead
    FROM employees;-- 从结果看,按照order by排序结果,lag取的值是当前行前1行的值,lead取的值是当前行后1行的值,对于第1行或者最后1行,取值默认为null+-----------------+-----------------+-------------------+----------------+--------+--------+| employees.name  | employees.dept  | employees.salary  | employees.age  |  lag   |  lead  |+-----------------+-----------------+-------------------+----------------+--------+--------+| Jane            | Marketing       |29000|28|NULL|35000|| Jeff            | Marketing       |35000|38|29000|NULL|| Fred            | Engineering     |21000|28|NULL|23000|| Tom             | Engineering     |23000|33|21000|23000|| Chloe           | Engineering     |23000|25|23000|29000|| Paul            | Engineering     |29000|23|23000|NULL|| Lisa            | Sales           |10000|35|NULL|30000|| Alex            | Sales           |30000|33|10000|32000|| Evan            | Sales           |32000|38|30000|NULL|+-----------------+-----------------+-------------------+----------------+--------+--------+SELECT*,
    LAG(salary,2,0)OVER(PARTITIONBY dept ORDERBY salary)AS lag,
    LEAD(salary,1,0)OVER(PARTITIONBY dept ORDERBY salary)AS lead
    FROM employees;+-----------------+-----------------+-------------------+----------------+--------+--------+| employees.name  | employees.dept  | employees.salary  | employees.age  |  lag   |  lead  |+-----------------+-----------------+-------------------+----------------+--------+--------+| Jane            | Marketing       |29000|28|0|35000|| Jeff            | Marketing       |35000|38|0|0|| Fred            | Engineering     |21000|28|0|23000|| Tom             | Engineering     |23000|33|0|23000|| Chloe           | Engineering     |23000|25|21000|29000|| Paul            | Engineering     |29000|23|23000|0|| Lisa            | Sales           |10000|35|0|30000|| Alex            | Sales           |30000|33|0|32000|| Evan            | Sales           |32000|38|10000|0|+-----------------+-----------------+-------------------+----------------+--------+--------+-- 排序结果和取值列不同的情况SELECT*,
    LAG(salary)OVER(PARTITIONBY dept ORDERBY age)AS lag,
    LEAD(salary)OVER(PARTITIONBY dept ORDERBY age)AS lead
    FROM employees;+-----------------+-----------------+-------------------+----------------+--------+--------+| employees.name  | employees.dept  | employees.salary  | employees.age  |  lag   |  lead  |+-----------------+-----------------+-------------------+----------------+--------+--------+| Jane            | Marketing       |29000|28|NULL|35000|| Jeff            | Marketing       |35000|38|29000|NULL|| Paul            | Engineering     |29000|23|NULL|23000|| Chloe           | Engineering     |23000|25|29000|21000|| Fred            | Engineering     |21000|28|23000|23000|| Tom             | Engineering     |23000|33|21000|NULL|| Alex            | Sales           |30000|33|NULL|10000|| Lisa            | Sales           |10000|35|30000|32000|| Evan            | Sales           |32000|38|10000|NULL|+-----------------+-----------------+-------------------+----------------+--------+--------+

1.7. first_value和last_value

在分组中按照order by的结果,获取指定列的第一个或最后一个值。
注意,默认情况下last_value取的是第一行截止到当前行的最后一个值(当前行的值),并不是整个分区中排序后的最后一个值。

SELECT*,
    first_value(salary)OVER(PARTITIONBY dept ORDERBY age)ASfirst,
    last_value(salary)OVER(PARTITIONBY dept ORDERBY age)ASlastFROM employees;-- 注意last的结果+-----------------+-----------------+-------------------+----------------+--------+--------+| employees.name  | employees.dept  | employees.salary  | employees.age  |first|last|+-----------------+-----------------+-------------------+----------------+--------+--------+| Jane            | Marketing       |29000|28|29000|29000|| Jeff            | Marketing       |35000|38|29000|35000|| Paul            | Engineering     |29000|23|29000|29000|| Chloe           | Engineering     |23000|25|29000|23000|| Fred            | Engineering     |21000|28|29000|21000|| Tom             | Engineering     |23000|33|29000|23000|| Alex            | Sales           |30000|33|30000|30000|| Lisa            | Sales           |10000|35|30000|10000|| Evan            | Sales           |32000|38|30000|32000|+-----------------+-----------------+-------------------+----------------+--------+--------+SELECT*,
    first_value(salary)OVER(PARTITIONBY dept ORDERBY salary)ASfirst,
    last_value(salary)OVER(PARTITIONBY dept ORDERBY salary)ASlastFROM employees;+-----------------+-----------------+-------------------+----------------+--------+--------+| employees.name  | employees.dept  | employees.salary  | employees.age  |first|last|+-----------------+-----------------+-------------------+----------------+--------+--------+| Jane            | Marketing       |29000|28|29000|29000|| Jeff            | Marketing       |35000|38|29000|35000|| Fred            | Engineering     |21000|28|21000|21000|| Tom             | Engineering     |23000|33|21000|23000|| Chloe           | Engineering     |23000|25|21000|23000|| Paul            | Engineering     |29000|23|21000|29000|| Lisa            | Sales           |10000|35|10000|10000|| Alex            | Sales           |30000|33|10000|30000|| Evan            | Sales           |32000|38|10000|32000|+-----------------+-----------------+-------------------+----------------+--------+--------+

1.8. nth_value

nth表示第几个的含义

作用,在分组中返回order by结果中指定列的第N行值。

注意,hive中无此函数

SELECT*,nth_value(salary,2)OVER(PARTITIONBY dept ORDERBY salary)AS nth FROM employees;

2. OVER从句

  • 在hive中over语句支持仅有partition by语句,没有order by语句。当没有order by时,order by默认使用partition by指定的字段序列,
  • 在spark中如果未指定order by,将会报错Error in query: Window function row_number() requires window to be ordered, please add ORDER BY clause
select*, row_number()over(partitionby name)as rn from employees;-- 执行结果如下,从结果中得知按照name的字典序排序+-----------------+-----------------+-------------------+----------------+-----+| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |+-----------------+-----------------+-------------------+----------------+-----+| Alex            | Sales           |30000|33|1|| Evan            | Sales           |32000|38|1|| Fred            | Engineering     |21000|28|1|| Lisa            | Sales           |10000|35|1|| Tom             | Engineering     |23000|33|1|| Chloe           | Engineering     |23000|25|1|| Jane            | Marketing       |29000|28|1|| Jeff            | Marketing       |35000|38|1|| Paul            | Engineering     |29000|23|1|+-----------------+-----------------+-------------------+----------------+-----+

2.1. window specification

在OVER语句中可以带有一个window specification,支持以下格式:

格式:
(ROWS| RANGE)BETWEEN xxx AND xxx
具体支持以下3种格式
(ROWS| RANGE)BETWEEN(UNBOUNDED|[num])PRECEDINGAND([num]PRECEDING|CURRENTROW|(UNBOUNDED|[num])FOLLOWING)(ROWS| RANGE)BETWEENCURRENTROWAND(CURRENTROW|(UNBOUNDED|[num])FOLLOWING)(ROWS| RANGE)BETWEEN[num]FOLLOWINGAND(UNBOUNDED|[num])FOLLOWING
  • ROWS 表示在同分区中,按照order by结果按行进行逐行处理
  • RANGE 表示在同分区中按照order by结果的范围处理,如1,2,3,3,4,4,5排序中,3,3和4,4的排序结果相同,则3,3和4,4将分别当成整体处理,每行对应开窗函数的结果相同。
  • UNBOUNDED PRECEDING 表示未绑定当前行之前的行。整个分区中,从第一行开始处理
  • num PRECEDING 表示限制当前行的前num行。如在sum时,每行sum的结果是从前num行到当前行的累加值
  • CURRENT ROW 表示处理过程中的当前行(处理数据过程中的游标指针)
  • UNBOUNDED FOLLOWING 表示未绑定当前行之后的行。整个分区,处理到最后一行
  • num FOLLOWING 表示限制当前行的后num行。如在sum时,每行值的sum的结果将处理到后num行。

注意:在hive中函数不支持使用window specification

rank,

dense_rank

percent_rank()
cume_dist()

ntile
lead

lag
2.1.1. 显式order by下的默认值

**当指定了order by语句而未指定window specification语句时,默认的window specification语句是

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

**。

即按照order by结果的range处理,并且从分区中的第一行处理到当前行。

select*,sum(salary)over(partitionby dept orderby salary)as sum_salary from employees;select*,sum(salary)over(partitionby dept orderby salary range betweenunboundedprecedingandcurrentrow)as sum_salary from employees;-- 二者的执行结果相同+-----------------+-----------------+-------------------+----------------+-------------+| employees.name  | employees.dept  | employees.salary  | employees.age  | sum_salary  |+-----------------+-----------------+-------------------+----------------+-------------+| Jane            | Marketing       |29000|28|29000|| Jeff            | Marketing       |35000|38|64000|| Fred            | Engineering     |21000|28|21000|| Tom             | Engineering     |23000|33|67000|| Chloe           | Engineering     |23000|25|67000|| Paul            | Engineering     |29000|23|96000|| Lisa            | Sales           |10000|35|10000|| Alex            | Sales           |30000|33|40000|| Evan            | Sales           |32000|38|72000|+-----------------+-----------------+-------------------+----------------+-------------+

注意:Engineering部门的tom和chloe员工的sum_salary值相同。这是因为在Engineering分区下,二者order by salary结果相同。在range策略下,tom和chloe将当成整体处理,即46000,因此tom和chloe的累计值都为67000(21000+460000),而不是44000和67000。

在SQL中显示指定如下语句,实现分区中的按行累加效果,即

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

Engineering部门的tom和chloe员工的sum_salary结果将分别为44000和67000。

select*,sum(salary)over(partitionby dept orderby salary rowsbetweenunboundedprecedingandcurrentrow)as sum_salary from employees;-- 结果如下+-----------------+-----------------+-------------------+----------------+-------------+| employees.name  | employees.dept  | employees.salary  | employees.age  | sum_salary  |+-----------------+-----------------+-------------------+----------------+-------------+| Jane            | Marketing       |29000|28|29000|| Jeff            | Marketing       |35000|38|64000|| Fred            | Engineering     |21000|28|21000|| Tom             | Engineering     |23000|33|44000|| Chloe           | Engineering     |23000|25|67000|| Paul            | Engineering     |29000|23|96000|| Lisa            | Sales           |10000|35|10000|| Alex            | Sales           |30000|33|40000|| Evan            | Sales           |32000|38|72000|+-----------------+-----------------+-------------------+----------------+-------------+

在SQL中显示指定如下语句,实现相同分组内进行全部值求和效果。

RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
select*,sum(salary)over(partitionby dept orderby salary range betweenunboundedprecedingandunboundedfollowing)as sum_salary from employees;-- 结果如下+-----------------+-----------------+-------------------+----------------+-------------+| employees.name  | employees.dept  | employees.salary  | employees.age  | sum_salary  |+-----------------+-----------------+-------------------+----------------+-------------+| Jane            | Marketing       |29000|28|64000|| Jeff            | Marketing       |35000|38|64000|| Fred            | Engineering     |21000|28|96000|| Tom             | Engineering     |23000|33|96000|| Chloe           | Engineering     |23000|25|96000|| Paul            | Engineering     |29000|23|96000|| Lisa            | Sales           |10000|35|72000|| Alex            | Sales           |30000|33|72000|| Evan            | Sales           |32000|38|72000|+-----------------+-----------------+-------------------+----------------+-------------+

结论:

  • 情形一:在partition by和order by同时存在的情况下,对于MAX | MIN | COUNT | SUM | AVG 等函数 如果想要对相同分组中的数整体进行计算,则要显示指定RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  • 情形二:如果按在分组中实现按行处理,则要显示指定ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  • 情形三:默认情况下(即未显示指定window specification时)对于order by列相同的值处理结果相同。
2.1.2. 无order by下的默认值

**当over语句中order by和window specification都缺失时,window specification的默认值是

ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

**

结论

  • 对MAX | MIN | COUNT | SUM | AVG 等函数,在order by缺失或order by和partition by相同时,效果同上述情形一
select*,sum(salary)over(partitionby dept)as sum_salary from employees;-- 结果如下+-----------------+-----------------+-------------------+----------------+-------------+| employees.name  | employees.dept  | employees.salary  | employees.age  | sum_salary  |+-----------------+-----------------+-------------------+----------------+-------------+| Fred            | Engineering     |21000|28|96000|| Tom             | Engineering     |23000|33|96000|| Paul            | Engineering     |29000|23|96000|| Chloe           | Engineering     |23000|25|96000|| Jane            | Marketing       |29000|28|64000|| Jeff            | Marketing       |35000|38|64000|| Lisa            | Sales           |10000|35|72000|| Evan            | Sales           |32000|38|72000|| Alex            | Sales           |30000|33|72000|+-----------------+-----------------+-------------------+----------------+-------------+

结果中,对于相同dept中不同员工,效果并不是按行并逐行处理,而是对相同dept下的员工进行了统一处理。即UNBOUNDED FOLLOWING表示不跟随当前处理的行,直接对中整个分区中进行计算。

疑问:

ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

中rows的作用是什么了?

3. last_value函数的注意事项

前面在提到last_value时,特意强调了该函数的结果并不是分区中的最后一个值,结合上述介绍的window specification再来看下该函数的结果值。

SELECT*,
    first_value(salary)OVER(PARTITIONBY dept ORDERBY salary)ASfirst,
    last_value(salary)OVER(PARTITIONBY dept ORDERBY salary)ASlastFROM employees;+-----------------+-----------------+-------------------+----------------+--------+--------+| employees.name  | employees.dept  | employees.salary  | employees.age  |first|last|+-----------------+-----------------+-------------------+----------------+--------+--------+| Jane            | Marketing       |29000|28|29000|29000|| Jeff            | Marketing       |35000|38|29000|35000|| Fred            | Engineering     |21000|28|21000|21000|| Tom             | Engineering     |23000|33|21000|23000|| Chloe           | Engineering     |23000|25|21000|23000|| Paul            | Engineering     |29000|23|21000|29000|| Lisa            | Sales           |10000|35|10000|10000|| Alex            | Sales           |30000|33|10000|30000|| Evan            | Sales           |32000|38|10000|32000|+-----------------+-----------------+-------------------+----------------+--------+--------+

结果中,对于每一行的last_value的结果都是当前值,并不分区中按salary升序的最后一个值。造成这个结果的原因正是由于默认的window specification导致的(

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

)。
如想要实现得到整个分区中按salary升序的最大值,则需要显示设置window specification为

RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
select*,first_value(salary)over(partitionby dept orderby salary)as first_salary
,last_value(salary)over(partitionby dept orderby salary range betweenunboundedprecedingandunboundedfollowing)as last_salary from employees;-- 结果如下+-----------------+-----------------+-------------------+----------------+---------------+--------------+| employees.name  | employees.dept  | employees.salary  | employees.age  | first_salary  | last_salary  |+-----------------+-----------------+-------------------+----------------+---------------+--------------+| Jane            | Marketing       |29000|28|29000|35000|| Jeff            | Marketing       |35000|38|29000|35000|| Fred            | Engineering     |21000|28|21000|29000|| Tom             | Engineering     |23000|33|21000|29000|| Chloe           | Engineering     |23000|25|21000|29000|| Paul            | Engineering     |29000|23|21000|29000|| Lisa            | Sales           |10000|35|10000|32000|| Alex            | Sales           |30000|33|10000|32000|| Evan            | Sales           |32000|38|10000|32000|+-----------------+-----------------+-------------------+----------------+---------------+--------------+
标签: hive spark 大数据

本文转载自: https://blog.csdn.net/qq_33446500/article/details/140306640
版权归原作者 有数的编程笔记 所有, 如有侵权,请联系我们删除。

“Hive/Spark窗口函数”的评论:

还没有评论