best way to generate partition column #159

bloukanov · 2021-11-08T16:00:47Z

bloukanov
Nov 8, 2021

Hi there, what is the best way to generate a partition column if the data does not naturally contain one? I am doing this:

select *, row_number() over (order by (select 1)) as partition from my_data

also, does specifying partition_range increase the speed?

wangxiaoying · 2021-11-08T19:52:15Z

wangxiaoying
Nov 8, 2021
Maintainer

Hi @bloukanov , thanks for bringing up this question. Setting up the partition column can be tricky, and it also depends on the query and the database.

Adding a row number column can serve as a partition column, but it might compromise performance. I compared two query plans in MSSQL: the first partitions on a generated row number, the second on an existing numerical column in the table.

The second query plan pushes the filtering predicate down to the scan operator, while the first cannot. This means that if we use a generated row number as the partition column, the entire result has to be generated first and then filtered. In my benchmark environment this was 2x slower with 10 partitions than partitioning on an existing column (but still faster than no partitioning).

If you want to speed this up, one solution might be to create a materialized view that includes the row id. Fetching then becomes simply select * from view, with partition_on set to the row id column directly. However, this may require extra permissions and will use some extra resources in the database.

Another solution is to partition the query manually, using some knowledge of the data. For example, if there is a string column month and each month has a similar number of rows, we can partition the query manually as follows:

# 6 partitions, two months each; the ... stands for the four intermediate queries
# SQL string literals take single quotes (double quotes denote identifiers in many databases)
queries = ["select * from table where month in ('JAN', 'FEB')", ..., "select * from table where month in ('NOV', 'DEC')"]
df = cx.read_sql(DB_URL, queries)
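
A minimal sketch of building such a query list programmatically, so that no query has to be written by hand. The helper name `month_queries`, the table name `my_table`, and the full list of three-letter month abbreviations are assumptions for illustration; adjust them to how the month column is actually encoded.

```python
# Hypothetical helper: one query per group of month values.
# Assumed encoding of the month column (not confirmed by the original post).
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

def month_queries(table, months=MONTHS, group_size=2):
    """Return one SELECT per consecutive group of `group_size` months."""
    queries = []
    for i in range(0, len(months), group_size):
        group = months[i:i + group_size]
        # SQL string literals use single quotes.
        in_list = ", ".join(f"'{m}'" for m in group)
        queries.append(f"select * from {table} where month in ({in_list})")
    return queries

queries = month_queries("my_table")
# df = cx.read_sql(DB_URL, queries)  # each query is fetched as one partition
```

This only balances the load if each group of months really holds a similar number of rows.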

> also, does specifying partition_range increase the speed?

Here is the workflow when partition_on and partition_num are set:

1. If partition_range is also set, use its values as the min and max of the partition column; otherwise issue a query, select min(partition_column), max(partition_column) from original query, to get the range.
2. Get the schema of the query.
3. Split the original query into partition_num queries by evenly splitting the range of the partition column.
4. Issue a count query to get the number of rows in the result (for now we do this for both pandas and arrow; later we will omit this for arrow in this PR: Stream write to destination #147).
5. Fetch each partition in parallel and write the results into the result dataframe.
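
To make the splitting step concrete, here is a rough sketch of how a range can be divided evenly into per-partition queries. This is an illustration only, not ConnectorX's actual implementation; the function name `split_query`, the subquery alias `t`, and the exact boundary handling are assumptions.

```python
def split_query(query, column, lo, hi, num):
    """Split `query` into `num` range-filtered queries covering [lo, hi]
    on an integer partition column (illustrative, not ConnectorX code)."""
    total = hi - lo + 1
    step, extra = divmod(total, num)
    queries, start = [], lo
    for i in range(num):
        # Spread the remainder over the first `extra` partitions.
        width = step + (1 if i < extra else 0)
        end = start + width  # exclusive upper bound
        queries.append(
            f"select * from ({query}) as t "
            f"where {column} >= {start} and {column} < {end}"
        )
        start = end
    return queries

parts = split_query("select * from my_data", "id", 1, 100, 4)
for q in parts:
    print(q)
```

Because the split is over the value range rather than the row count, partitions are only balanced when the column's values are roughly uniformly distributed.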

Specifying partition_range lets us skip the min/max query in step 1. The performance improvement therefore depends on the speed of that min/max query. Usually it does not take long, since the database keeps statistics, but if the query that derives the partition column is complex, providing partition_range could help.
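
As a sketch, passing a known range looks like the following. `partition_on`, `partition_range` and `partition_num` are the `read_sql` parameters discussed in this thread; the wrapper `read_with_known_range` and its optional `read_sql` argument are hypothetical and exist only so the example can be exercised without a database.

```python
def read_with_known_range(conn, query, column, lo, hi, num, read_sql=None):
    """Call read_sql with an explicit partition_range so that the
    min/max query does not need to be issued."""
    if read_sql is None:
        # Imported lazily; requires connectorx and a reachable database.
        import connectorx as cx
        read_sql = cx.read_sql
    return read_sql(
        conn,
        query,
        partition_on=column,
        partition_range=(lo, hi),  # known min and max of the partition column
        partition_num=num,
    )

# Example (needs a real database):
# df = read_with_known_range(DB_URL, "select * from my_data", "id", 1, 10_000_000, 10)
```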

If you want to know how long each step takes (which may give you some hints when trying different ways to partition the query), you can print the log by setting the RUST_LOG environment variable before importing the library, as in the following example:

import os
os.environ["RUST_LOG"]="connectorx=debug,connectorx_python=debug"
import connectorx as cx

df = cx.read_sql(DB_URL, query)

0 replies

bloukanov · 2021-11-08T21:03:59Z

bloukanov
Nov 8, 2021
Author

@wangxiaoying thank you for the very thorough response!

I ended up adding the row_number variable to the tables themselves, so it is no longer in ConnectorX's select

1 reply

wangxiaoying Nov 9, 2021
Maintainer

Hi @bloukanov , no problem! May I ask what your table originally looks like, for example the types of the columns? If there is a numerical column, then maybe you can also try partitioning on that one.

bloukanov · 2021-11-09T18:48:11Z

bloukanov
Nov 9, 2021
Author

Sure! It is part of a package I am putting together to create and download model features from our raw data warehouse. So the only 2 consistent columns will be person_id, which is a uniqueidentifier, and person_cutoff_date, a date relevant to the model for this individual. All other columns are being generated ad hoc with SQL scripts in the package, with specific sets of features selected by the user. That is why I thought it would be easy to just add the row_number partition in each of these scripts, and remove the column later. What are your thoughts? For what it’s worth, I actually did not see a performance difference, comparing this method to the `select *, row_number()`. For data sets of 10M rows and 10-40 features.

3 replies

wangxiaoying Nov 9, 2021
Maintainer

Thanks for the info!

It seems there is indeed no suitable existing column to partition on. One workaround might be to convert the date into an integer in the query and partition on the converted column. But I think your current solution is better and simpler when adding an ID column is possible.
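
A sketch of the date-to-integer workaround, assuming MSSQL (where `DATEDIFF(day, start, end)` returns an integer). The helper `with_date_partition`, the alias `date_key`, and the table name `features` are made up for illustration.

```python
def with_date_partition(query, date_column, alias="date_key"):
    """Wrap `query`, adding an integer column holding the number of days
    between 1970-01-01 and `date_column` (T-SQL DATEDIFF)."""
    return (
        f"select t.*, datediff(day, '1970-01-01', t.{date_column}) as {alias} "
        f"from ({query}) as t"
    )

q = with_date_partition("select * from features", "person_cutoff_date")
print(q)
# df = cx.read_sql(DB_URL, q, partition_on="date_key", partition_num=10)
```

If the dates cluster around a few values, partitions split evenly over this integer range will be uneven in size.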

> For what it’s worth, I actually did not see a performance difference, comparing this method to the select *, row_number(). For data sets of 10M rows and 10-40 features.

That's good to know. I guess this may be related to how much of the end-to-end time is spent on query execution: if query execution already takes long, the extra overhead of generating row_number on the fly may not affect overall performance. That in turn depends on factors like query complexity, the generated plan, database performance, and network conditions.

bloukanov Nov 9, 2021
Author

got it, all makes sense thanks! appreciate the help.

armamut Feb 2, 2023

Well, I think there is an appropriate column to partition on in this situation (MSSQL specific), and that is the uniqueidentifier column. The uniqueidentifier type is comparable, so you can get results from queries like these:

SELECT Min(ProductId) MinPid, Max(ProductId) MaxPid FROM [tablename];

SELECT DISTINCT 
	PERCENTILE_DISC(0.0) WITHIN GROUP (ORDER BY ProductId) OVER () AS Pid_0,
	PERCENTILE_DISC(0.1) WITHIN GROUP (ORDER BY ProductId) OVER () AS Pid_1,
	PERCENTILE_DISC(0.2) WITHIN GROUP (ORDER BY ProductId) OVER () AS Pid_2,
	PERCENTILE_DISC(0.3) WITHIN GROUP (ORDER BY ProductId) OVER () AS Pid_3,
	PERCENTILE_DISC(0.4) WITHIN GROUP (ORDER BY ProductId) OVER () AS Pid_4,
	PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY ProductId) OVER () AS Pid_5,
	PERCENTILE_DISC(0.6) WITHIN GROUP (ORDER BY ProductId) OVER () AS Pid_6,
	PERCENTILE_DISC(0.7) WITHIN GROUP (ORDER BY ProductId) OVER () AS Pid_7,
	PERCENTILE_DISC(0.8) WITHIN GROUP (ORDER BY ProductId) OVER () AS Pid_8,
	PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY ProductId) OVER () AS Pid_9,
	PERCENTILE_DISC(1.0) WITHIN GROUP (ORDER BY ProductId) OVER () AS Pid_10
FROM [tablename]

SELECT COUNT(*) FROM [tablename];
>>> 3000

SELECT COUNT(*) FROM [tablename]
WHERE ProductId >= 'D3AC9AC6-FFFF-41AD-FFFF-08D958E35151'
AND ProductId < 'BB8A7077-EEEE-43BC-EEEE-08D95CCD37C9'; -- ID's are made up
>>> 300
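
One way to use such percentile boundaries is to turn each adjacent pair into a range query and pass the list to read_sql. This is a sketch; the helper `guid_range_queries` is hypothetical and the GUIDs below are made up. The boundaries must stay in the order the database returned them, because SQL Server compares uniqueidentifier values differently from a plain string comparison.

```python
def guid_range_queries(table, column, boundaries):
    """Build one query per adjacent pair of boundary values.
    `boundaries` must already be in the database's own sort order."""
    queries = []
    last = len(boundaries) - 2  # index of the final range
    for i, (lo, hi) in enumerate(zip(boundaries, boundaries[1:])):
        # The final range is closed so the maximum value is not dropped.
        upper_op = "<=" if i == last else "<"
        queries.append(
            f"select * from {table} "
            f"where {column} >= '{lo}' and {column} {upper_op} '{hi}'"
        )
    return queries

# Made-up boundary values, as if returned by the PERCENTILE_DISC query.
bounds = [
    "00000000-0000-0000-0000-000000000001",
    "00000000-0000-0000-0000-000000000002",
    "00000000-0000-0000-0000-000000000003",
]
queries = guid_range_queries("[tablename]", "ProductId", bounds)
# df = cx.read_sql(DB_URL, queries)
```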
