基于金仓数据库KingbaseES的SQL优化

一、前言分组查询是常见的SQL查询语句。SQL分组功能主要通过GROUP BY关键字来实现，并且GROUP BY通常需要配合聚合函数来使用。比如说，分组之后可以计数（COUNT），求和（SUM），求平均数（AVG）等。本文探讨的不是GROUP BY关键字的学习和使用，而是一种另类的“分组”查询。有这样一个功能需求：系统中存在资讯信息模块，用于发布一些和业务相关的活动动态，其中每条资讯信息都有一个所

金仓数据库

1412人浏览 · 2022-03-01 10:01:08

金仓数据库 · 2022-03-01 10:01:08 发布

一、前言
分组查询是常见的SQL查询语句。SQL分组功能主要通过GROUP BY关键字来实现，并且GROUP BY通常需要配合聚合函数来使用。比如说，分组之后可以计数（COUNT），求和（SUM），求平均数（AVG）等。
本文探讨的不是GROUP BY关键字的学习和使用，而是一种另类的“分组”查询。
有这样一个功能需求：系统中存在资讯信息模块，用于发布一些和业务相关的活动动态，其中每条资讯信息都有一个所属类型（如科技类的资讯、娱乐类、军事类•••）和浏览量字段。官网上需要滚动展示一些热门资讯信息列表（浏览量越大代表越热门），而且每个类别的相关资讯记录至多显示3条，换句话：“按照资讯分类分组，取每组的前3条资讯信息列表”。
尝试使用GROUP BY的各种方式都不能实现，最后通过使用窗口函数获得了解决方法，即子查询。
下面，将模拟一些实际的测试数据重现问题的解决过程。

二、数据准备
1．表设计

新闻分类表:
id 主键
name 分类名称
新闻信息记录表:
id 主键
title 资讯名称
views 浏览量
info_type_id 资讯类别
code 说明

初始化SQL语句:
set work_mem to ‘1GB’;
Create table info(
id numeric not null primary key ,
title varchar(100) ,
Viewnum numeric ,
info_type_id numeric ,
Code text
);
create index info_infotypeid on info (info_type_id);
Create table info_type(
Id numeric not null primary key,
Name varchar(100)
);
–插入100个新闻分类
Insert into info_type
select id, ‘TYPE’ || lpad(id::text, 5, ‘0’ ) from generate_series(1, 100) id;

–插入1000000个新闻
Insert into info_type
select id, ‘TTL’ || lpad(id::text, 20, ‘0’ ) title, ceil(random()*10000) views, ceil(random()*10000) info_type_id , md5(id) code
from generate_series(1, 1000000) id;

vacuum analyse info_type,info;

2．核心思想
通常我们在取前N条记录时候，都是根据某个业务字段进行降序排序，然后取前N条。
例如“select * from info order by views asc limit 3”，这条SQL就是取info表中的前3条记录。但是“它是在每个类型下都要取浏览量的前3条记录”。
另一种比较简单粗暴的方式是在Java代码中循环所有的资讯类型，取出每个类型的前3条记录，最后进行汇总。但是，这种方式虽然也能实现我们的要求，但存在严重的弊端，即有可能发送多次（成百上千次也有可能）sql语句。这种程序显然是有重大缺陷的。
如果能够在查询每条资讯记录时能查出其所在类型的排名，然后根据排名字段进行过滤，是否也能解决该问题？这时候我们就想到了子查询，并且MySQL是可以实现这样的功能子查询的。
要计算出某条资讯信息在同资讯分类下所有记录中排第几名，换算出有多少条浏览量比当前记录的浏览量高，然后根据具体的多少（N）条+1就是N+1就是当前记录所在其分类下的的排名。
假如以本文上面的示例数据说明：就是在计算每个资讯信息记录时，多计算出一列作为其“排名”字段，然后取“排名”字段的小于等于3的记录即可。
如果这里还不是很理解的话，就先看下面的SQL，然后根据SQL再回过头来理解这段话。

三、SQL实现
1．方法一：分组排名，取前3名记录

explain (analyse ,buffers )
with i as (
select i.*,
row_number() over (partition by i.info_type_id order by i.viewnum desc) sn
from info i)
select *
from info_type t
left join i on i.sn <= 3 and i.info_type_id = t.id;

                                                             QUERY PLAN

Merge Right Join (cost=122990.46…123343.08 rows=3341 width=96) (actual time=1634.866…1642.284 rows=200 loops=1)
Merge Cond: (i.info_type_id = t.id)
Buffers: shared hit=13325
-> Subquery Scan on i (cost=122981.84…155481.84 rows=333333 width=81) (actual time=1634.822…1642.165 rows=201 loops=1)
Filter: (i.sn <= 2)
Rows Removed by Filter: 9863
Buffers: shared hit=13324
-> WindowAgg (cost=122981.84…142981.84 rows=1000000 width=81) (actual time=1634.821…1641.693 rows=10064 loops=1)
Buffers: shared hit=13324
-> Sort (cost=122981.84…125481.84 rows=1000000 width=73) (actual time=1634.807…1636.324 rows=10065 loops=1)
Sort Key: i_1.info_type_id, i_1.viewnum DESC
Sort Method: quicksort Memory: 165202kB
Buffers: shared hit=13324
-> Seq Scan on info i_1 (cost=0.00…23324.00 rows=1000000 width=73) (actual time=0.006…123.806 rows=1000000 loops=1)
Buffers: shared hit=13324
-> Sort (cost=5.32…5.57 rows=100 width=15) (actual time=0.040…0.044 rows=100 loops=1)
Sort Key: t.id
Sort Method: quicksort Memory: 29kB
Buffers: shared hit=1
-> Seq Scan on info_type t (cost=0.00…2.00 rows=100 width=15) (actual time=0.016…0.021 rows=100 loops=1)
Buffers: shared hit=1
Planning Time: 0.149 ms
Execution Time: 1644.193 ms
(23 行记录)

查询计划分析：Info全表读取到内存中，hit块数13325，行数1000000行。用时 1644.193 ms。

2．方法二：按新闻分类，取第3名新闻记录（一条记录）
在方法一中，由于读取了大量数据块，耗时过多。
功能要求只需返回每组1条记录，希望避免读冗余数据块。因此新的SQL特点，每个类型使用子查询通过info表的info_type_id列的索引，就可以避免读取多余的数据。
select list的子查询作为计算列，只能返回一个值，所以使用row (i.)::info 先整合，然后使用 (inf). 再分解，同时使用 offset2 limit 1获取第三名的一行记录。

explain (analyse ,buffers )
select id, name, (inf).*
from (select t.,
(select row (i.)::info
from info i
where i.info_type_id = t.id
order by i.viewnum desc
offset 2
limit 1) inf
from info_type t
) t;

                                                                QUERY PLAN

Seq Scan on info_type t (cost=0.00…190882.06 rows=100 width=361) (actual time=0.664…56.966 rows=100 loops=1)
Buffers: shared hit=51821
SubPlan 1
-> Limit (cost=381.76…381.76 rows=1 width=37) (actual time=0.159…0.159 rows=1 loops=100)
Buffers: shared hit=10364
-> Sort (cost=381.75…382.00 rows=100 width=37) (actual time=0.158…0.158 rows=3 loops=100)
Sort Key: i.viewnum DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=10364
-> Bitmap Heap Scan on info i (cost=5.20…380.46 rows=100 width=37) (actual time=0.018…0.134 rows=101 loops=100)
Recheck Cond: (info_type_id = t.id)
Heap Blocks: exact=10037
Buffers: shared hit=10364
-> Bitmap Index Scan on info_infotypeid (cost=0.00…5.17 rows=100 width=0) (actual time=0.011…0.011 rows=101 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=327
SubPlan 2
-> Limit (cost=381.76…381.76 rows=1 width=37) (actual time=0.103…0.103 rows=1 loops=100)
Buffers: shared hit=10364
-> Sort (cost=381.75…382.00 rows=100 width=37) (actual time=0.103…0.103 rows=3 loops=100)
Sort Key: i_1.viewnum DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=10364
-> Bitmap Heap Scan on info i_1 (cost=5.20…380.46 rows=100 width=37) (actual time=0.015…0.079 rows=101 loops=100)
Recheck Cond: (info_type_id = t.id)
Heap Blocks: exact=10037
Buffers: shared hit=10364
-> Bitmap Index Scan on info_infotypeid (cost=0.00…5.17 rows=100 width=0) (actual time=0.010…0.010 rows=101 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=327
SubPlan 3
-> Limit (cost=381.76…381.76 rows=1 width=37) (actual time=0.104…0.104 rows=1 loops=100)
Buffers: shared hit=10364
-> Sort (cost=381.75…382.00 rows=100 width=37) (actual time=0.104…0.104 rows=3 loops=100)
Sort Key: i_2.viewnum DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=10364
-> Bitmap Heap Scan on info i_2 (cost=5.20…380.46 rows=100 width=37) (actual time=0.014…0.077 rows=101 loops=100)
Recheck Cond: (info_type_id = t.id)
Heap Blocks: exact=10037
Buffers: shared hit=10364
-> Bitmap Index Scan on info_infotypeid (cost=0.00…5.17 rows=100 width=0) (actual time=0.009…0.009 rows=101 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=327
SubPlan 4
-> Limit (cost=381.76…381.76 rows=1 width=37) (actual time=0.102…0.102 rows=1 loops=100)
Buffers: shared hit=10364
-> Sort (cost=381.75…382.00 rows=100 width=37) (actual time=0.101…0.101 rows=3 loops=100)
Sort Key: i_3.viewnum DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=10364
-> Bitmap Heap Scan on info i_3 (cost=5.20…380.46 rows=100 width=37) (actual time=0.014…0.077 rows=101 loops=100)
Recheck Cond: (info_type_id = t.id)
Heap Blocks: exact=10037
Buffers: shared hit=10364
-> Bitmap Index Scan on info_infotypeid (cost=0.00…5.17 rows=100 width=0) (actual time=0.010…0.010 rows=101 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=327
SubPlan 5
-> Limit (cost=381.76…381.76 rows=1 width=37) (actual time=0.098…0.098 rows=1 loops=100)
Buffers: shared hit=10364
-> Sort (cost=381.75…382.00 rows=100 width=37) (actual time=0.098…0.098 rows=3 loops=100)
Sort Key: i_4.viewnum DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=10364
-> Bitmap Heap Scan on info i_4 (cost=5.20…380.46 rows=100 width=37) (actual time=0.013…0.073 rows=101 loops=100)
Recheck Cond: (info_type_id = t.id)
Heap Blocks: exact=10037
Buffers: shared hit=10364
-> Bitmap Index Scan on info_infotypeid (cost=0.00…5.17 rows=100 width=0) (actual time=0.009…0.009 rows=101 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=327
Planning Time: 0.269 ms
Execution Time: 57.030 ms
(74 行记录)

查询计划分析：Info表有类型列的索引，使用子查询，可以充分利用这个索引，使得每个subplan的hit数据块数10364大为减少，总hit块数51821，总用时57.030 ms 。

按viewnum列排序，但是没有相应的索引，仍然需要读表，造成hit数据块过多，创建所需索引，数据块和时间将改善。
create index info_typeview on info(info_type_id,viewnum);

                                                                  QUERY PLAN

Seq Scan on info_type t (cost=0.00…6300.70 rows=100 width=361) (actual time=0.073…4.067 rows=100 loops=1)
Buffers: shared hit=3001
SubPlan 1
-> Limit (cost=8.54…12.60 rows=1 width=37) (actual time=0.011…0.011 rows=1 loops=100)
Buffers: shared hit=600
-> Index Scan Backward using info_typeview on info i (cost=0.42…406.17 rows=100 width=37) (actual time=0.008…0.010 rows=3 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=600
SubPlan 2
-> Limit (cost=8.54…12.60 rows=1 width=37) (actual time=0.007…0.007 rows=1 loops=100)
Buffers: shared hit=600
-> Index Scan Backward using info_typeview on info i_1 (cost=0.42…406.17 rows=100 width=37) (actual time=0.006…0.007 rows=3 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=600
SubPlan 3
-> Limit (cost=8.54…12.60 rows=1 width=37) (actual time=0.007…0.007 rows=1 loops=100)
Buffers: shared hit=600
-> Index Scan Backward using info_typeview on info i_2 (cost=0.42…406.17 rows=100 width=37) (actual time=0.006…0.007 rows=3 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=600
SubPlan 4
-> Limit (cost=8.54…12.60 rows=1 width=37) (actual time=0.007…0.007 rows=1 loops=100)
Buffers: shared hit=600
-> Index Scan Backward using info_typeview on info i_3 (cost=0.42…406.17 rows=100 width=37) (actual time=0.006…0.007 rows=3 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=600
SubPlan 5
-> Limit (cost=8.54…12.60 rows=1 width=37) (actual time=0.007…0.007 rows=1 loops=100)
Buffers: shared hit=600
-> Index Scan Backward using info_typeview on info i_4 (cost=0.42…406.17 rows=100 width=37) (actual time=0.006…0.007 rows=3 loops=100)
Index Cond: (info_type_id = t.id)
Buffers: shared hit=600
Planning Time: 0.192 ms
Execution Time: 4.109 ms
(34 行记录)

查询计划分析：新的复合索引，同满足了分组和排序的需要，省略了读表数据块和时长都大为改善，hit数据块数3001，总时长4.109 ms。

3．方法三：按新闻分类，取前3名新闻记录（三条记录）
在方法二中，限制了返回行数，仅一行，同时info表有5个列，所以有5个subplan，其中4个是冗余的。
新的SQL特点，select list的子查询作为计算列，只能返回一行值，所以使用array() 先转换成数组类型，然后使用 unnest() 再分解成多行，同时使用 offset0 limit 3获取前三名的三行记录。

explain (analyse ,buffers )
select id, name, (inf).*
from (select t.id, t.name, unnest(inf) inf
from (select t.,
array(select row (i.)::info
from info i
where i.info_type_id = t.id
order by i.viewnum desc
offset 0
limit 3) inf
from info_type t
) t) t;
QUERY PLAN

Subquery Scan on t (cost=0.00…1277.49 rows=1000 width=361) (actual time=0.039…1.234 rows=300 loops=1)
Buffers: shared hit=601
-> ProjectSet (cost=0.00…1267.49 rows=1000 width=47) (actual time=0.037…1.156 rows=300 loops=1)
Buffers: shared hit=601
-> Seq Scan on info_type t_1 (cost=0.00…2.00 rows=100 width=15) (actual time=0.003…0.010 rows=100 loops=1)
Buffers: shared hit=1
SubPlan 1
-> Limit (cost=0.42…12.60 rows=3 width=37) (actual time=0.007…0.010 rows=3 loops=100)
Buffers: shared hit=600
-> Index Scan Backward using info_typeview on info i (cost=0.42…406.17 rows=100 width=37) (actual time=0.007…0.010 rows=3 loops=100)
Index Cond: (info_type_id = t_1.id)
Buffers: shared hit=600
Planning Time: 0.122 ms
Execution Time: 1.269 ms
(14 行记录)

查询计划分析：使用数组函数不仅解决了多行需求功能，连带将冗余的subplan也避免了，使得性能有大幅提升，hit数据块数601，时长 1.269 ms。如果每组只需要一行记录，数据块和时长相应会减少。

四、总结
子查询，在一些限定数据的需求中，会带来不可思议的性能提升。

腾讯云开发者社区

腾讯云面向开发者汇聚海量精品云计算使用和开发经验，营造开放的云计算技术生态圈。

更多推荐

Elasticsearch复杂数据类型终极指南：从入门到精通

Elasticsearch作为功能强大的搜索引擎，支持多种复杂数据类型，让开发者能够灵活处理各种结构化和非结构化数据。本文将带你全面了解Elasticsearch中的复杂数据类型，从基础概念到实际应用，助你轻松掌握数据建模的核心技巧。## 内部对象：构建层级化数据结构在Elasticsearch中，对象类型（Object）是最基础的复杂数据类型之一，用于表示具有嵌套关系的数据。例如，我们可

腾讯云开发者社区

终极指南：Flink SQL连接器版本管理从混乱到有序的升级之路

Apache Flink作为流处理领域的佼佼者，其SQL连接器的版本管理一直是开发者面临的核心挑战。本文将系统讲解Flink SQL连接器版本管理的最佳实践，帮助你轻松应对版本兼容性问题，实现从混乱到有序的升级之旅。## 连接器版本管理的常见痛点 😫在Flink应用开发中，连接器版本管理常常让开发者头疼不已。不同版本的连接器可能导致各种兼容性问题，例如API变更、功能差异甚至运行时错误。

腾讯云开发者社区

如何快速搭建Neon无服务器PostgreSQL：面向初学者的完整指南

Neon是一款革命性的无服务器PostgreSQL解决方案，它通过分离存储和计算层，实现了自动扩缩容、类代码式数据库分支以及零级扩展能力。本指南将帮助你从零开始搭建Neon开发环境，体验这款创新数据库的强大功能。## 准备工作：环境要求与依赖项在开始搭建Neon环境前，请确保你的系统满足以下要求：- Linux操作系统（推荐Ubuntu 20.04+或Debian 11+）- Git