一、前言
分组查询是常见的SQL查询语句。SQL分组功能主要通过GROUP BY关键字来实现,并且GROUP BY通常需要配合聚合函数来使用。比如说,分组之后可以计数(COUNT),求和(SUM),求平均数(AVG)等。
本文探讨的不是GROUP BY关键字的学习和使用,而是一种另类的“分组”查询。
有这样一个功能需求:系统中存在资讯信息模块,用于发布一些和业务相关的活动动态,其中每条资讯信息都有一个所属类型(如科技类的资讯、娱乐类、军事类•••)和浏览量字段。官网上需要滚动展示一些热门资讯信息列表(浏览量越大代表越热门),而且每个类别的相关资讯记录至多显示3条,换句话:“按照资讯分类分组,取每组的前3条资讯信息列表”。
尝试使用GROUP BY的各种方式都不能实现,最后通过使用窗口函数获得了解决方法,即子查询。
下面,将模拟一些实际的测试数据重现问题的解决过程。

二、数据准备
1.表设计

新闻分类表:
id 主键
name 分类名称
新闻信息记录表:
id 主键
title 资讯名称
views 浏览量
info_type_id 资讯类别
code 说明

初始化SQL语句:
set work_mem to ‘1GB’;
Create table info(
id numeric not null primary key ,
title varchar(100) ,
Viewnum numeric ,
info_type_id numeric ,
Code text
);
create index info_infotypeid on info (info_type_id);
Create table info_type(
Id numeric not null primary key,
Name varchar(100)
);
–插入100个新闻分类
Insert into info_type
select id, ‘TYPE’ || lpad(id::text, 5, ‘0’ ) from generate_series(1, 100) id;

–插入1000000个新闻
Insert into info_type
select id, ‘TTL’ || lpad(id::text, 20, ‘0’ ) title, ceil(random()*10000) views, ceil(random()*10000) info_type_id , md5(id) code
from generate_series(1, 1000000) id;

vacuum analyse info_type,info;

2.核心思想
通常我们在取前N条记录时候,都是根据某个业务字段进行降序排序,然后取前N条。
例如“select * from info order by views asc limit 3”,这条SQL就是取info表中的前3条记录。但是“它是在每个类型下都要取浏览量的前3条记录”。
另一种比较简单粗暴的方式是在Java代码中循环所有的资讯类型,取出每个类型的前3条记录,最后进行汇总。但是,这种方式虽然也能实现我们的要求,但存在严重的弊端,即有可能发送多次(成百上千次也有可能)sql语句。这种程序显然是有重大缺陷的。
如果能够在查询每条资讯记录时能查出其所在类型的排名,然后根据排名字段进行过滤,是否也能解决该问题?这时候我们就想到了子查询,并且MySQL是可以实现这样的功能子查询的。
要计算出某条资讯信息在同资讯分类下所有记录中排第几名,换算出有多少条浏览量比当前记录的浏览量高,然后根据具体的多少(N)条+1就是N+1就是当前记录所在其分类下的的排名。
假如以本文上面的示例数据说明:就是在计算每个资讯信息记录时,多计算出一列作为其“排名”字段,然后取“排名”字段的小于等于3的记录即可。
如果这里还不是很理解的话,就先看下面的SQL,然后根据SQL再回过头来理解这段话。

三、SQL实现
1.方法一: 分组排名,取前3名记录

explain (analyse ,buffers )
with i as (
select i.*,
row_number() over (partition by i.info_type_id order by i.viewnum desc) sn
from info i)
select *
from info_type t
left join i on i.sn <= 3 and i.info_type_id = t.id;

                                                             QUERY PLAN

Merge Right Join (cost=122990.46…123343.08 rows=3341 width=96) (actual time=1634.866…1642.284 rows=200 loops=1)
Merge Cond: (i.info_type_id = t.id)
Buffers: shared hit=13325
-> Subquery Scan on i (cost=122981.84…155481.84 rows=333333 width=81) (actual time=1634.822…1642.165 rows=201 loops=1)
Filter: (i.sn <= 2)
Rows Removed by Filter: 9863
Buffers: shared hit=13324
-> WindowAgg (cost=122981.84…142981.84 rows=1000000 width=81) (actual time=1634.821…1641.693 rows=10064 loops=1)
Buffers: shared hit=13324
-> Sort (cost=122981.84…125481.84 rows=1000000 width=73) (actual time=1634.807…1636.324 rows=10065 loops=1)
Sort Key: i_1.info_type_id, i_1.viewnum DESC
Sort Method: quicksort Memory: 165202kB
Buffers: shared hit=13324
-> Seq Scan on info i_1 (cost=0.00…23324.00 rows=1000000 width=73) (actual time=0.006…123.806 rows=1000000 loops=1)
Buffers: shared hit=13324
-> Sort (cost=5.32…5.57 rows=100 width=15) (actual time=0.040…0.044 rows=100 loops=1)
Sort Key: t.id
Sort Method: quicksort Memory: 29kB
Buffers: shared hit=1
-> Seq Scan on info_type t (cost=0.00…2.00 rows=100 width=15) (actual time=0.016…0.021 rows=100 loops=1)
Buffers: shared hit=1
Planning Time: 0.149 ms
Execution Time: 1644.193 ms
(23 行记录)

查询计划分析:Info全表读取到内存中,hit块数13325,行数1000000行。用时 1644.193 ms。

2.方法二: 按新闻分类,取第3名新闻记录(一条记录)
在方法一中,由于读取了大量数据块,耗时过多。
功能要求只需返回每组1条记录,希望避免读冗余数据块。因此新的SQL特点,每个类型使用子查询通过info表的info_type_id列的索引,就可以避免读取多余的数据。
select list的子查询作为计算列,只能返回一个值,所以使用row (i.)::info 先整合,然后使用 (inf). 再分解,同时使用 offset2 limit 1获取第三名的一行记录。

explain (analyse ,buffers )
select id, name, (inf).*
from (select t.,
(select row (i.
)::info
from info i
where i.info_type_id = t.id
order by i.viewnum desc
offset 2
limit 1) inf
from info_type t
) t;

                                                                QUERY PLAN

Seq Scan on info_type t (cost=0.00…190882.06 rows=100 width=361) (actual time=0.664…56.966 rows=100 loops=1)
Buffers: shared hit=51821
SubPlan 1
-> Limit (cost=381.76…381.76 rows=1 width=37) (actual time=0.159…0.159 rows=1 loops=100)
Buffers: shared hit=10364
-> Sort (cost=381.75…382.00 rows=100 width=37) (actual time=0.158…0.158 rows=3 loops=100)
Sort Key: i.viewnum DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=10364
-> Bitmap Heap Scan on info i (cost=5.20…380.46 rows=100 width=37) (actual time=0.018…0.134 rows=101 loops=100)
Recheck Cond: (info_type_id = t.id)
Heap Blocks: exact=10037
Buffers: shared hit=10364
-> Bitmap Index Scan on info_infotypeid (cost=0.00…5.17 rows=100 width=0) (actual time=0.011…0.011 rows=101 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=327
SubPlan 2
-> Limit (cost=381.76…381.76 rows=1 width=37) (actual time=0.103…0.103 rows=1 loops=100)
Buffers: shared hit=10364
-> Sort (cost=381.75…382.00 rows=100 width=37) (actual time=0.103…0.103 rows=3 loops=100)
Sort Key: i_1.viewnum DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=10364
-> Bitmap Heap Scan on info i_1 (cost=5.20…380.46 rows=100 width=37) (actual time=0.015…0.079 rows=101 loops=100)
Recheck Cond: (info_type_id = t.id)
Heap Blocks: exact=10037
Buffers: shared hit=10364
-> Bitmap Index Scan on info_infotypeid (cost=0.00…5.17 rows=100 width=0) (actual time=0.010…0.010 rows=101 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=327
SubPlan 3
-> Limit (cost=381.76…381.76 rows=1 width=37) (actual time=0.104…0.104 rows=1 loops=100)
Buffers: shared hit=10364
-> Sort (cost=381.75…382.00 rows=100 width=37) (actual time=0.104…0.104 rows=3 loops=100)
Sort Key: i_2.viewnum DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=10364
-> Bitmap Heap Scan on info i_2 (cost=5.20…380.46 rows=100 width=37) (actual time=0.014…0.077 rows=101 loops=100)
Recheck Cond: (info_type_id = t.id)
Heap Blocks: exact=10037
Buffers: shared hit=10364
-> Bitmap Index Scan on info_infotypeid (cost=0.00…5.17 rows=100 width=0) (actual time=0.009…0.009 rows=101 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=327
SubPlan 4
-> Limit (cost=381.76…381.76 rows=1 width=37) (actual time=0.102…0.102 rows=1 loops=100)
Buffers: shared hit=10364
-> Sort (cost=381.75…382.00 rows=100 width=37) (actual time=0.101…0.101 rows=3 loops=100)
Sort Key: i_3.viewnum DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=10364
-> Bitmap Heap Scan on info i_3 (cost=5.20…380.46 rows=100 width=37) (actual time=0.014…0.077 rows=101 loops=100)
Recheck Cond: (info_type_id = t.id)
Heap Blocks: exact=10037
Buffers: shared hit=10364
-> Bitmap Index Scan on info_infotypeid (cost=0.00…5.17 rows=100 width=0) (actual time=0.010…0.010 rows=101 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=327
SubPlan 5
-> Limit (cost=381.76…381.76 rows=1 width=37) (actual time=0.098…0.098 rows=1 loops=100)
Buffers: shared hit=10364
-> Sort (cost=381.75…382.00 rows=100 width=37) (actual time=0.098…0.098 rows=3 loops=100)
Sort Key: i_4.viewnum DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=10364
-> Bitmap Heap Scan on info i_4 (cost=5.20…380.46 rows=100 width=37) (actual time=0.013…0.073 rows=101 loops=100)
Recheck Cond: (info_type_id = t.id)
Heap Blocks: exact=10037
Buffers: shared hit=10364
-> Bitmap Index Scan on info_infotypeid (cost=0.00…5.17 rows=100 width=0) (actual time=0.009…0.009 rows=101 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=327
Planning Time: 0.269 ms
Execution Time: 57.030 ms
(74 行记录)

查询计划分析:Info表有类型列的索引,使用子查询,可以充分利用这个索引,使得每个subplan的hit数据块数10364大为减少,总hit块数51821,总用时57.030 ms 。

按viewnum列排序,但是没有相应的索引,仍然需要读表,造成hit数据块过多,创建所需索引,数据块和时间将改善。
create index info_typeview on info(info_type_id,viewnum);

                                                                  QUERY PLAN

Seq Scan on info_type t (cost=0.00…6300.70 rows=100 width=361) (actual time=0.073…4.067 rows=100 loops=1)
Buffers: shared hit=3001
SubPlan 1
-> Limit (cost=8.54…12.60 rows=1 width=37) (actual time=0.011…0.011 rows=1 loops=100)
Buffers: shared hit=600
-> Index Scan Backward using info_typeview on info i (cost=0.42…406.17 rows=100 width=37) (actual time=0.008…0.010 rows=3 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=600
SubPlan 2
-> Limit (cost=8.54…12.60 rows=1 width=37) (actual time=0.007…0.007 rows=1 loops=100)
Buffers: shared hit=600
-> Index Scan Backward using info_typeview on info i_1 (cost=0.42…406.17 rows=100 width=37) (actual time=0.006…0.007 rows=3 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=600
SubPlan 3
-> Limit (cost=8.54…12.60 rows=1 width=37) (actual time=0.007…0.007 rows=1 loops=100)
Buffers: shared hit=600
-> Index Scan Backward using info_typeview on info i_2 (cost=0.42…406.17 rows=100 width=37) (actual time=0.006…0.007 rows=3 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=600
SubPlan 4
-> Limit (cost=8.54…12.60 rows=1 width=37) (actual time=0.007…0.007 rows=1 loops=100)
Buffers: shared hit=600
-> Index Scan Backward using info_typeview on info i_3 (cost=0.42…406.17 rows=100 width=37) (actual time=0.006…0.007 rows=3 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=600
SubPlan 5
-> Limit (cost=8.54…12.60 rows=1 width=37) (actual time=0.007…0.007 rows=1 loops=100)
Buffers: shared hit=600
-> Index Scan Backward using info_typeview on info i_4 (cost=0.42…406.17 rows=100 width=37) (actual time=0.006…0.007 rows=3 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=600
Planning Time: 0.192 ms
Execution Time: 4.109 ms
(34 行记录)

查询计划分析:新的复合索引,同满足了分组和排序的需要,省略了读表数据块和时长都大为改善,hit数据块数3001,总时长4.109 ms。

3.方法三: 按新闻分类,取前3名新闻记录(三条记录)
在方法二中,限制了返回行数,仅一行,同时info表有5个列,所以有5个subplan,其中4个是冗余的。
新的SQL特点,select list的子查询作为计算列,只能返回一行值,所以使用array() 先转换成数组类型,然后使用 unnest() 再分解成多行,同时使用 offset0 limit 3获取前三名的三行记录。

explain (analyse ,buffers )
select id, name, (inf).*
from (select t.id, t.name, unnest(inf) inf
from (select t.,
array(select row (i.
)::info
from info i
where i.info_type_id = t.id
order by i.viewnum desc
offset 0
limit 3) inf
from info_type t
) t) t;
QUERY PLAN

Subquery Scan on t (cost=0.00…1277.49 rows=1000 width=361) (actual time=0.039…1.234 rows=300 loops=1)
Buffers: shared hit=601
-> ProjectSet (cost=0.00…1267.49 rows=1000 width=47) (actual time=0.037…1.156 rows=300 loops=1)
Buffers: shared hit=601
-> Seq Scan on info_type t_1 (cost=0.00…2.00 rows=100 width=15) (actual time=0.003…0.010 rows=100 loops=1)
Buffers: shared hit=1
SubPlan 1
-> Limit (cost=0.42…12.60 rows=3 width=37) (actual time=0.007…0.010 rows=3 loops=100)
Buffers: shared hit=600
-> Index Scan Backward using info_typeview on info i (cost=0.42…406.17 rows=100 width=37) (actual time=0.007…0.010 rows=3 loops=100)
Index Cond: (info_type_id = t_1.id)
Buffers: shared hit=600
Planning Time: 0.122 ms
Execution Time: 1.269 ms
(14 行记录)

查询计划分析:使用数组函数不仅解决了多行需求功能,连带将冗余的subplan也避免了,使得性能有大幅提升,hit数据块数601,时长 1.269 ms。如果每组只需要一行记录,数据块和时长相应会减少。

四、总结
子查询,在一些限定数据的需求中,会带来不可思议的性能提升。

Logo

腾讯云面向开发者汇聚海量精品云计算使用和开发经验,营造开放的云计算技术生态圈。

更多推荐