0


SQL函数 - 开窗(窗口)函数

什么是开窗函数?

开窗函数对一组值进行操作,它不像普通聚合函数那样需要使用GROUP BY子句对数据进行分组,能够在同一行中同时返回基础行的列和聚合列

开窗函数的语法形式为:函数 + over(partition by <分组用列> order by <排序用列>),表示对数据集按照分组用列进行分区,并且并且对每个分区按照函数聚合计算,最终将计算结果按照排序用列排序后返回到该行。括号中的两个关键词partition by 和order by 可以只出现一个。

注意:开窗函数不会互相干扰,因此在同一个查询语句中可以同时使用多个开窗函数

开窗函数适用于 mysql 8.0以上版本, sql sever 、hive、oracle 等

开窗函数分类

窗口函数大致分为以下几类:

一、排序开窗函数

① row_number() -- 相同值排名顺延,返回结果1、2、3、4

② rank() -- 相同结果排名相同,后续排名不连续,返回结果为 1、2、2、4

③ dense_rank() -- 相同结果排名相同,后续排名顺延,返回结果为 1、2、2、3

④ ntile(n) -- 分组排名,将数据分为n组并返回对应组号1、2......n

具体例子如下:

select grades
,subjects
,results
,row_number() over(partition by grades,subjects order by results desc) as row_numbers
,rank() over(partition by grades,subjects order by results desc) as ranks
,dense_rank() over(partition by grades,subjects order by results desc) as dense_ranks
,ntile(3) over(partition by grades,subjects order by results desc) as ntiles
from test11

表示将数据集按照grades、subjects字段进行分组后,根据对应排序函数并按照results字段降序返回排名,具体结果如下

二、聚合开窗函数

① sum() -- 分组求和

② count() -- 分组求总数

③ min() -- 分组求最小值

④ max() -- 分组求最大值

⑤ avg() --分组求均值

具体例子如下:

select grades
,subjects
,results
,sum(results) over(partition by grades,subjects order by results desc) as sum聚合1
,sum(results) over(partition by grades,subjects) as sum聚合2
,count(results) over(partition by grades,subjects order by results desc) as count聚合1
,count(results) over(partition by grades,subjects) as count聚合2
,min(results) over(partition by grades,subjects order by results desc) as min聚合1
,min(results) over(partition by grades,subjects) as min聚合2
,max(results) over(partition by grades,subjects order by results desc) as max聚合1
,max(results) over(partition by grades,subjects) as max聚合2
,avg(results) over(partition by grades,subjects order by results desc) as avg聚合1
,avg(results) over(partition by grades,subjects) as avg聚合2
from test11

聚合1表示将数据集按照grades、subjects进行分组后,按照results降序排序,将每组中的results依次聚合;

聚合2表示将数据集按照grades、subjects进行分组后,将每组中的results整体聚合。因此 count(results) over(partition by grades,subjects order by results desc) 与 row_number() over(partition by grades,subjects order by results desc) 可以达到同样目的

深挖两者的差异,主要是 order by 默认统计范围是 rows between unbounded preceding and current row,也就是取当前行数据与当前行之前的数据运算。如果在聚合1 order by 条件的后面加上语句:rows between unbounded preceding and unbounded following,也就可以对分组中的所有数据进行运算,可以得到聚合1相同结果。如下

select grades
,subjects
,results
,sum(results) over(partition by grades,subjects order by results desc rows between unbounded preceding and unbounded following) as sum聚合1
,sum(results) over(partition by grades,subjects) as sum聚合2
,count(results) over(partition by grades,subjects order by results desc rows between unbounded preceding and unbounded following) as count聚合1
,count(results) over(partition by grades,subjects) as count聚合2
,min(results) over(partition by grades,subjects order by results desc rows between unbounded preceding and unbounded following) as min聚合1
,min(results) over(partition by grades,subjects) as min聚合2
,max(results) over(partition by grades,subjects order by results desc rows between unbounded preceding and unbounded following) as max聚合1
,max(results) over(partition by grades,subjects) as max聚合2
,avg(results) over(partition by grades,subjects order by results desc rows between unbounded preceding and unbounded following) as avg聚合1
,avg(results) over(partition by grades,subjects) as avg聚合2
from test11

关于这一部分下文详细描述

三、其他开窗函数

① lag(字段名,n,0) -- 移位开窗函数,表示返回向上第n行指定字段对应数据。其中n代表向上偏移n行,0代表若偏移行数超出表范围则返回0也可以改成其他值,若不写则默认null

② lead(字段名,n,0) -- 移位开窗函数,与lag()相反,表示返回向下第n行指定字段对应数据

③ first_value() -- 取分组内排序后,截止到当前行,第一个值

④ last_value() -- 取分组内排序后,截止到当前行,最后一个值

具体例子如下:

#加order by
select grades
,subjects
,results
,lag(results,1,0) over(partition by grades,subjects order by results desc) as lag移位1
,lead(results,1,0) over(partition by grades,subjects order by results desc) as lead移位1
,first_value(results) over(partition by grades,subjects order by results desc) as first_value排序1
,last_value(results) over(partition by grades,subjects order by results desc) as last_value排序1
from test11

加 order by 代表将数据集按照grades、subjects进行分组后,再根据results降序排序,然后根据函数取当前行数据与当前行之前的数据运算。若不加 order by 则是对分组后的数据直接运算

 #不加order by 
select grades
,subjects
,results
,lag(results,1,0) over(partition by grades,subjects) as lag移位2
,lead(results,1,0) over(partition by grades,subjects) as lead移位2
,first_value(results) over(partition by grades,subjects) as first_value排序2
,last_value(results) over(partition by grades,subjects) as last_value排序2
from test11

不加 order by

⑤ ratio_to_report(字段名) over(partition by 字段名) -- 百分比分析函数,rratio_to_report(字段名) 为分子,over(partition by 字段名) 为分母,若分母中partition by 字段名 省略则表示占数据集整体百分比。为Oracle数据库函数,mysql不能使用

具体例子如下:

select grades
,subjects
,results
,ratio_to_report(results) over()  百分比函数1
,ratio_to_report(results) over(partition by grades,subjects) 百分比函数2
from test11

开窗函数的定位框架

窗口函数除了经常使用的 partition by <分组用列> order by <排序用列> 外,在order by 后存在可省略的窗口框架 range/rows between x and y ,主要用于对partition by的分组结果做进一步限制,并定位出限制后的运算范围。

其中range表示按照值的范围进行范围的定义,而rows表示按照行的范围进行范围的定义。若order by 后未指定框架,那么默认框架将采用 range unbounded preceding and current row,表示从开窗后的第一行到当前行。

若窗口函数没有order by,也就不存在框架range/rows between x and y。

框架range/rows between x and y 具体x、y可取值见下表:
可取值含义unbounded precedingpartition by 分组order by后 第一行unbounded followingpartition by 分组order by后 最后一行current rowpartition by 分组order by后 当前行n precedingpartition by 分组order by后 前n行n followingpartition by 分组order by后 后n行
说明:rows between 5 preceding and current row 可缩写为 rows 5 preceding

range 只支持使用 unbounded preceding、 unbounded following、current row

具体例子如下:

select grades
,subjects
,results
,sum(results) over(partition by grades,subjects order by results desc rows between unbounded preceding and current row) as 定位窗口求和1
,sum(results) over(partition by grades,subjects order by results desc range between unbounded preceding and current row) as 定位窗口求和2
from test11

由结果可见,定位窗口求和1 中**rows **按照固定行定义,此处表示返回从当前行到分组后第一行之和

定位窗口求和2 中**range **按照值进行范围定义,此处仍表示返回从当前行到分组后第一行之和,不过由于当前行存在2个相同order by值(如上图黄框中89),则会先对这两行汇总后再向分组后第一行求和。

开窗函数使用场景

开窗函数应用比较多的场景,以下简单列举:

① 求某个分组下的最大/最小值/TOPn值对应信息,如年级中每个班级的第一名,大区中销售额最高城市,此处用排序函数row_number(),需用order by 排序

②对某个分组求和/个数/均值,如城市历史截至昨天累积销售额/营业天数/平均销售额,此处用sum()/count()/avg(),需用order by 默认的定位框架

③相邻时间求时间差,如用户复购时间周期,此处用lag(),需用order by 排序

标签: sql 大数据

本文转载自: https://blog.csdn.net/wr_java/article/details/130200374
版权归原作者 wrr-cat 所有, 如有侵权,请联系我们删除。

“SQL函数 - 开窗(窗口)函数”的评论:

还没有评论