You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When using large columnar tables, window functions in subqueries greatly slow down query performance, even when the window function columns are not referenced by the outer query.
I am hoping that this could be fixed by adjusting columnar scans to ignore columns in subqueries that are unreferenced.
I am more than happy to provide additional context, including query plans, plpgsql functions to recreate test data, version info, etc.
Thanks for your time and support. 🙇
Why would someone ever need an unreferenced window function?
I work on a manufacturing analytics platform that allows users to create and update views of their time-series data with arbitrary expressions. Some of those expressions contain window functions.
Below is an example of a command to create such a view, containing a trivial expression.
CREATEVIEWv_washerAS (
SELECT*, 1AS x
FROM washer
);
After the view is created, it becomes the new basis for all users' queries on the platform, regardless of whether those queries need any of the computed expressions. This has been a fantastic way to allow users to easily customize their data.
Unfortunately, as soon as any user adds a window function to the view, it spoils load times for everyone.
What does the data look like?
Manufacturing time-series data. Each row represents the state of a machine for a given time interval. The data used for testing was structured as follows:
starttime
endtime
machine__source
...700 other columns
2021-01-01 00:00:00+00
2021-01-01 00:00:30+00
A
...
2021-01-01 00:00:30+00
2021-01-01 00:01:00+00
A
...
2021-01-01 00:01:00+00
2021-01-01 00:01:30+00
A
...
...
...
...
...
The test queries below were performed on a partitioned table with the following properties:
700 columns
12 million total rows
3 years of data, split into 36 monthly, columnar partitions
Each of the 36 partitions is about 36MB for a total of ~1.3 GB (highly compressed--the same data is ~100GB in row-based storage.)
Query performance
Below are average query speeds for each of 7 test queries. All queries were performed 3 times after throwing out the first result to remove the variance associated with caching.
Query
Time
Query1 (small subquery, no wf)
0.09s
Query2 (large subquery, no wf)
0.09s
Query3 (no subquery, wf)
0.15s
Query4 (small subquery, ignored wf)
3.0s
Query5 (small subquery, referenced wf)
6.2s
Query6 (large subquery, ignored wf)
7.5s
Query7 (large subquery, referenced wf)
9.8s
My hope is that, by fixing the query planner:
Query4 and Query6 will have times close to Query1 and Query2
Query7 will have a time close to Query5
Full Queries
Below are the full queries referenced in the table above.
Query1 (small subquery, no wf)
selectsum("inner".output2) from (
select
endtime,
machine__source,
output +1as output2
from timeseries
) as"inner"where"inner".machine__source in ('A') and"inner".endtime >'2023-09-01'::timestamptz;
Query2 (large subquery, no wf)
selectsum("inner".output2) from (
select*,
output +1as output2
from timeseries
) as"inner"where"inner".machine__source in ('A') and"inner".endtime >'2023-09-01'::timestamptz;
Query3 (no subquery, wf)
select
case
when
shift ='A'and
lag(shift) over (partition by machine__source order by endtime) !='A'
then 1
else 0
end as window_expr
from timeseries
where machine__source in ('A') and
endtime >'2023-09-01'::timestamptz;
Query4 (small subquery, ignored wf)
select"inner".output from (
select
output,
machine__source,
endtime,
case
when shift ='A'and
lag(shift) over (partition by machine__source order by endtime) !='A'
then 1
else 0
end as window_expr
from timeseries
) as"inner"where"inner".machine__source in ('A') and"inner".endtime >'2023-09-01'::timestamptz;
Query5 (small subquery, referenced wf)
select"inner".window_expr from (
select
output,
machine__source,
endtime,
case
when shift ='A'and
lag(shift) over (partition by machine__source order by endtime) !='A'
then 1
else 0
end as window_expr
from timeseries
) as"inner"where"inner".machine__source in ('A') and"inner".endtime >'2023-09-01'::timestamptz;
Query6 (large subquery, ignored wf)
select"inner".output from (
select*,
case
when
shift ='A'and
lag(shift) over (partition by machine__source order by endtime) !='A'
then 1
else 0
end as window_expr
from timeseries
) as"inner"where"inner".machine__source in ('A') and"inner".endtime >'2023-09-01'::timestamptz;
Query7 (large subquery, referenced wf)
select"inner".window_expr from (
select*,
case
when
shift ='A'and
lag(shift) over (partition by machine__source order by endtime) !='A'
then 1
else 0
end as window_expr
from timeseries
) as"inner"where"inner".machine__source in ('A') and"inner".endtime >'2023-09-01'::timestamptz;
Version
# select * from version();
version
-----------------------------------------------------------------------------------------------------------------------
PostgreSQL 14.11 (Debian 14.11-1.pgdg120+2) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
# select * from pg_extension;
oid | extname | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition
-----------+----------------+----------+--------------+----------------+------------+-----------+--------------
867184318 | citus_columnar | 10 | 11 | f | 11.3-1 | |
Note: the issue was not resolved when running the same queries in PostgreSQL 16.3
The text was updated successfully, but these errors were encountered:
srblum
changed the title
Columnar: window function in subquery greatly slows down query, even when unreferenced
[Performance] Columnar: window function in subquery greatly slows down query, even when unreferenced
Jul 19, 2024
srblum
changed the title
[Performance] Columnar: window function in subquery greatly slows down query, even when unreferenced
[Performance, Columnar] window function in subquery greatly slows down query, even when unreferenced
Jul 19, 2024
Problem
When using large columnar tables, window functions in subqueries greatly slow down query performance, even when the window function columns are not referenced by the outer query.
I am hoping that this could be fixed by adjusting columnar scans to ignore columns in subqueries that are unreferenced.
I am more than happy to provide additional context, including query plans, plpgsql functions to recreate test data, version info, etc.
Thanks for your time and support. 🙇
Why would someone ever need an unreferenced window function?
I work on a manufacturing analytics platform that allows users to create and update views of their time-series data with arbitrary expressions. Some of those expressions contain window functions.
Below is an example of a command to create such a view, containing a trivial expression.
After the view is created, it becomes the new basis for all users' queries on the platform, regardless of whether those queries need any of the computed expressions. This has been a fantastic way to allow users to easily customize their data.
Unfortunately, as soon as any user adds a window function to the view, it spoils load times for everyone.
What does the data look like?
Manufacturing time-series data. Each row represents the state of a machine for a given time interval. The data used for testing was structured as follows:
The test queries below were performed on a partitioned table with the following properties:
700 columns
12 million total rows
3 years of data, split into 36 monthly, columnar partitions
Each of the 36 partitions is about 36MB for a total of ~1.3 GB (highly compressed--the same data is ~100GB in row-based storage.)
Query performance
Below are average query speeds for each of 7 test queries. All queries were performed 3 times after throwing out the first result to remove the variance associated with caching.
My hope is that, by fixing the query planner:
Full Queries
Below are the full queries referenced in the table above.
Query1 (small subquery, no wf)
Query2 (large subquery, no wf)
Query3 (no subquery, wf)
Query4 (small subquery, ignored wf)
Query5 (small subquery, referenced wf)
Query6 (large subquery, ignored wf)
Query7 (large subquery, referenced wf)
Version
Note: the issue was not resolved when running the same queries in PostgreSQL 16.3
The text was updated successfully, but these errors were encountered: