MySQL的窗口函数

窗口函数的介绍

Posted by qin4zhang on February 15, 2020

注意

想法及时记录,实现可以待做。

官方文档

窗口函数

MySQL在 8.0开始支持窗口函数。

窗口函数介绍

本节介绍非聚合窗口函数,这些函数针对查询中的每一行,使用与该行相关的行执行计算。

名称 说明
CUME_DIST() Cumulative distribution value
DENSE_RANK() Rank of current row within its partition, without gaps
FIRST_VALUE() Value of argument from first row of window frame
LAG() Value of argument from row lagging current row within partition
LAST_VALUE() Value of argument from last row of window frame
LEAD() Value of argument from row leading current row within partition
NTH_VALUE() Value of argument from N-th row of window frame
NTILE() Bucket number of current row within its partition.
PERCENT_RANK() Percentage rank value
RANK() Rank of current row within its partition, with gaps
ROW_NUMBER() Number of current row within its partition

窗口函数的概念和语法

示例数据:

mysql> SELECT * FROM sales ORDER BY country, year, product;
+------+---------+------------+--------+
| year | country | product    | profit |
+------+---------+------------+--------+
| 2000 | Finland | Computer   |   1500 |
| 2000 | Finland | Phone      |    100 |
| 2001 | Finland | Phone      |     10 |
| 2000 | India   | Calculator |     75 |
| 2000 | India   | Calculator |     75 |
| 2000 | India   | Computer   |   1200 |
| 2000 | USA     | Calculator |     75 |
| 2000 | USA     | Computer   |   1500 |
| 2001 | USA     | Calculator |     50 |
| 2001 | USA     | Computer   |   1500 |
| 2001 | USA     | Computer   |   1200 |
| 2001 | USA     | TV         |    150 |
| 2001 | USA     | TV         |    100 |
+------+---------+------------+--------+

窗口函数对一组查询行执行类似聚合的操作。但是,尽管聚合操作将查询行分组为单个结果行,但是窗口函数为每个查询行生成结果:

  • 发生功能评估的行称为当前行。
  • 与发生功能评估的当前行相关的查询行包括当前行的窗口。

例如,使用销售信息表,这两个查询执行汇总操作,这些汇总操作将作为一组的所有行产生单个全局总和,并按国家/地区分组总和:

mysql> SELECT SUM(profit) AS total_profit
       FROM sales;
+--------------+
| total_profit |
+--------------+
|         7535 |
+--------------+
mysql> SELECT country, SUM(profit) AS country_profit
       FROM sales
       GROUP BY country
       ORDER BY country;
+---------+----------------+
| country | country_profit |
+---------+----------------+
| Finland |           1610 |
| India   |           1350 |
| USA     |           4575 |
+---------+----------------+

相比之下,窗口操作不会将查询行组折叠为单个输出行。相反,它们为每一行产生一个结果。与前面的查询类似,下面的查询使用SUM(),但是这次作为窗口函数:

mysql> SELECT
         year, country, product, profit,
         SUM(profit) OVER() AS total_profit,
         SUM(profit) OVER(PARTITION BY country) AS country_profit
       FROM sales
       ORDER BY country, year, product, profit;
+------+---------+------------+--------+--------------+----------------+
| year | country | product    | profit | total_profit | country_profit |
+------+---------+------------+--------+--------------+----------------+
| 2000 | Finland | Computer   |   1500 |         7535 |           1610 |
| 2000 | Finland | Phone      |    100 |         7535 |           1610 |
| 2001 | Finland | Phone      |     10 |         7535 |           1610 |
| 2000 | India   | Calculator |     75 |         7535 |           1350 |
| 2000 | India   | Calculator |     75 |         7535 |           1350 |
| 2000 | India   | Computer   |   1200 |         7535 |           1350 |
| 2000 | USA     | Calculator |     75 |         7535 |           4575 |
| 2000 | USA     | Computer   |   1500 |         7535 |           4575 |
| 2001 | USA     | Calculator |     50 |         7535 |           4575 |
| 2001 | USA     | Computer   |   1200 |         7535 |           4575 |
| 2001 | USA     | Computer   |   1500 |         7535 |           4575 |
| 2001 | USA     | TV         |    100 |         7535 |           4575 |
| 2001 | USA     | TV         |    150 |         7535 |           4575 |
+------+---------+------------+--------+--------------+----------------+

查询中的每个窗口操作均通过包含OVER子句来表示,该子句指定如何将查询行划分为组以供窗口函数处理:

  • 第一个OVER子句为空,它将整个查询行集视为一个分区。因此,窗口函数会产生一个全局和,但对于每一行都会如此。
  • 第二个OVER子句按国家对行进行分区,从而为每个分区(每个国家)产生一个总和。函数为每个分区行产生该总和。

仅在选择列表和ORDER BY子句中允许窗口函数。 在WHERE,GROUP BY和HAVING处理之后,从FROM子句确定查询结果行,并在ORDER BY,LIMIT和SELECT DISTINCT之前执行窗口执行。

OVER子句允许用于许多聚合函数,因此可以根据是否存在OVER子句将其用作窗口或非窗口函数:

AVG()
BIT_AND()
BIT_OR()
BIT_XOR()
COUNT()
JSON_ARRAYAGG()
JSON_OBJECTAGG()
MAX()
MIN()
STDDEV_POP(), STDDEV(), STD()
STDDEV_SAMP()
SUM()
VAR_POP(), VARIANCE()
VAR_SAMP()

MySQL还支持仅用作窗口函数的非聚合函数。对于这些,OVER子句是必需的:

CUME_DIST()
DENSE_RANK()
FIRST_VALUE()
LAG()
LAST_VALUE()
LEAD()
NTH_VALUE()
NTILE()
PERCENT_RANK()
RANK()
ROW_NUMBER()

作为这些非聚合窗口函数之一的示例,此查询使用ROW_NUMBER(),该函数将生成其分区内每一行的行号。 在这种情况下,行按国家/地区编号。 默认情况下,分区行是无序的,行编号是不确定的。 要对分区行进行排序,请在窗口定义内包含ORDER BY子句。 该查询使用无序分区和有序分区(row_num1和row_num2列)来说明省略和包括ORDER BY之间的区别:

mysql> SELECT
         year, country, product, profit,
         ROW_NUMBER() OVER(PARTITION BY country) AS row_num1,
         ROW_NUMBER() OVER(PARTITION BY country ORDER BY year, product) AS row_num2
       FROM sales;
+------+---------+------------+--------+----------+----------+
| year | country | product    | profit | row_num1 | row_num2 |
+------+---------+------------+--------+----------+----------+
| 2000 | Finland | Computer   |   1500 |        2 |        1 |
| 2000 | Finland | Phone      |    100 |        1 |        2 |
| 2001 | Finland | Phone      |     10 |        3 |        3 |
| 2000 | India   | Calculator |     75 |        2 |        1 |
| 2000 | India   | Calculator |     75 |        3 |        2 |
| 2000 | India   | Computer   |   1200 |        1 |        3 |
| 2000 | USA     | Calculator |     75 |        5 |        1 |
| 2000 | USA     | Computer   |   1500 |        4 |        2 |
| 2001 | USA     | Calculator |     50 |        2 |        3 |
| 2001 | USA     | Computer   |   1500 |        3 |        4 |
| 2001 | USA     | Computer   |   1200 |        7 |        5 |
| 2001 | USA     | TV         |    150 |        1 |        6 |
| 2001 | USA     | TV         |    100 |        6 |        7 |
+------+---------+------------+--------+----------+----------+

窗口函数窗体说明

与窗口函数一起使用的窗口的定义可以包含frame子句。窗体是当前分区的子集,而frame子句指定如何定义子集。

相对于当前行确定窗体,这使得窗体可以根据分区中当前行的位置在分区中移动。例子:

  • 通过将窗体定义为从分区开始到当前行的所有行,可以计算每一行的运行总计。
  • 通过将窗体定义为在当前行的两侧扩展N行,可以计算滚动平均值。

以下查询演示了如何使用移动窗体来计算每组按时间排序的级别值内的运行总计,以及从当前行以及紧接其前后的行计算出的滚动平均值:

mysql> SELECT
         time, subject, val,
         SUM(val) OVER (PARTITION BY subject ORDER BY time
                        ROWS UNBOUNDED PRECEDING)
           AS running_total,
         AVG(val) OVER (PARTITION BY subject ORDER BY time
                        ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
           AS running_average
       FROM observations;
+----------+---------+------+---------------+-----------------+
| time     | subject | val  | running_total | running_average |
+----------+---------+------+---------------+-----------------+
| 07:00:00 | st113   |   10 |            10 |          9.5000 |
| 07:15:00 | st113   |    9 |            19 |         14.6667 |
| 07:30:00 | st113   |   25 |            44 |         18.0000 |
| 07:45:00 | st113   |   20 |            64 |         22.5000 |
| 07:00:00 | xh458   |    0 |             0 |          5.0000 |
| 07:15:00 | xh458   |   10 |            10 |          5.0000 |
| 07:30:00 | xh458   |    5 |            15 |         15.0000 |
| 07:45:00 | xh458   |   30 |            45 |         20.0000 |
| 08:00:00 | xh458   |   25 |            70 |         27.5000 |
+----------+---------+------+---------------+-----------------+

以下查询演示了FIRST_VALUE(),LAST_VALUE()和两个NTH_VALUE()实例:

mysql> SELECT
         time, subject, val,
         FIRST_VALUE(val)  OVER w AS 'first',
         LAST_VALUE(val)   OVER w AS 'last',
         NTH_VALUE(val, 2) OVER w AS 'second',
         NTH_VALUE(val, 4) OVER w AS 'fourth'
       FROM observations
       WINDOW w AS (PARTITION BY subject ORDER BY time
                    ROWS UNBOUNDED PRECEDING);
+----------+---------+------+-------+------+--------+--------+
| time     | subject | val  | first | last | second | fourth |
+----------+---------+------+-------+------+--------+--------+
| 07:00:00 | st113   |   10 |    10 |   10 |   NULL |   NULL |
| 07:15:00 | st113   |    9 |    10 |    9 |      9 |   NULL |
| 07:30:00 | st113   |   25 |    10 |   25 |      9 |   NULL |
| 07:45:00 | st113   |   20 |    10 |   20 |      9 |     20 |
| 07:00:00 | xh458   |    0 |     0 |    0 |   NULL |   NULL |
| 07:15:00 | xh458   |   10 |     0 |   10 |     10 |   NULL |
| 07:30:00 | xh458   |    5 |     0 |    5 |     10 |   NULL |
| 07:45:00 | xh458   |   30 |     0 |   30 |     10 |     30 |
| 08:00:00 | xh458   |   25 |     0 |   25 |     10 |     30 |
+----------+---------+------+-------+------+--------+--------+

命名式窗口

可以定义Windows并为其指定名称,以便在OVER子句中引用它们。为此,请使用WINDOW子句。如果查询中存在WINDOW子句,则它位于HAVING和ORDER BY子句的位置之间,并且具有以下语法:

WINDOW window_name AS (window_spec)
    [, window_name AS (window_spec)] ...

WINDOW子句对于其中多个OVER子句否则将定义同一窗口的查询很有用。 相反,您可以定义一次窗口,为其命名,然后在OVER子句中引用该名称。 考虑一下此查询,它多次定义了相同的窗口:

SELECT
  val,
  ROW_NUMBER() OVER (ORDER BY val) AS 'row_number',
  RANK()       OVER (ORDER BY val) AS 'rank',
  DENSE_RANK() OVER (ORDER BY val) AS 'dense_rank'
FROM numbers;

通过使用WINDOW一次定义窗口并在OVER子句中按名称引用窗口,可以更简单地编写查询:

SELECT
  val,
  ROW_NUMBER() OVER w AS 'row_number',
  RANK()       OVER w AS 'rank',
  DENSE_RANK() OVER w AS 'dense_rank'
FROM numbers
WINDOW w AS (ORDER BY val);

命名窗口还使您更容易尝试使用窗口定义来查看对查询结果的影响。您只需要修改WINDOW子句中的窗口定义,而不需要修改多个OVER子句定义。

如果OVER子句使用OVER(window_name …)而不是OVER window_name,则可以通过添加其他子句来修改命名窗口。 例如,此查询定义一个包含分区的窗口,并在OVER子句中使用ORDER BY以不同的方式修改该窗口:

SELECT
  DISTINCT year, country,
  FIRST_VALUE(year) OVER (w ORDER BY year ASC) AS first,
  FIRST_VALUE(year) OVER (w ORDER BY year DESC) AS last
FROM sales
WINDOW w AS (PARTITION BY country);

窗口函数的限制

SQL标准对窗口函数施加了约束,使其不能在UPDATE或DELETE语句中用于更新行。允许在这些语句的子查询中使用此类函数(以选择行)。

MySQL不支持以下窗口功能:

  • 聚合窗口函数的DISTINCT语法。
  • 嵌套窗口函数
  • 取决于当前行的值的动态窗体端点。

解析器可以识别以下窗口构造,但仍不受支持:

  • 解析GROUPS窗体单位说明符,但会产生错误。仅支持ROWS和RANGE。
  • 解析了窗体规范的EXCLUDE子句,但会产生错误。
  • IGNORE NULLS被解析,但是产生错误。仅支持RESPECT NULLS。
  • FROM LAST被解析,但是产生错误。仅支持FROM。