best way to generate partition column #159
-
Hi there, what is the best way to generate a partition column if the data does not naturally contain one? I am doing this:
also, does specifying |
Beta Was this translation helpful? Give feedback.
Replies: 3 comments 4 replies
-
Hi @bloukanov , thanks for bringing up this question. Setting up the partition column can be tricky, and also query and database dependent. Adding a row number column can be a solution for partition column, but it might also compromise the performance. Here is the query plan I tried using this strategy in mssql: And this is the query plan if using an existing numerical column in the table: The second query plan will push down the filtering predicate to the scan operator, while the first one cannot. Which means that if we use generated row number as a partition column, the entire result needs to be generated first and then filtered. I tested it in my benchmark environment, and it is 2x slower when using 10 partitions than partition on existing column (but still faster than no partition). If you want to speed up this, one solution might be creating a materialized view with the row id. Then fetching the query will be simply Another solution can be manually partition the query with some knowledge of the data. For example, if there is a column # 6 partitions
queries = ['select * from table where month in ("JAN", "FEB")', ..., 'select * from table where month in ("NOV", "DEC")']
cx.read_sql(DB_URL, queries)
Here is the workflow if the
Specifying the If you want to know how long does each step take (may give you some hints if you want to try different ways to partition the query.), you can print the log by setting up import os
os.environ["RUST_LOG"]="connectorx=debug,connectorx_python=debug"
import connectorx as cx
df = cx.read_sql(DB_URL, query) |
Beta Was this translation helpful? Give feedback.
-
@wangxiaoying thank you for the very thorough response! I ended up adding the |
Beta Was this translation helpful? Give feedback.
-
Sure!
It is part of a package I am putting together to create and download model
features from our raw data warehouse. So the only 2 consistent columns will
be person_id, which is a uniqueidentifier, and person_cutoff_date, a date
relevant to the model for this individual. All other columns are being
generated ad hoc with SQL scripts in the package, with specific sets of
features selected by the user. That is why I thought it would be easy to
just add the row_number partition in each of these scripts, and remove the
column later.
What are your thoughts?
For what it’s worth, I actually did not see a performance difference,
comparing this method to the `select *, row_number()`. For data sets of 10M
rows and 10-40 features.
…On Tue, Nov 9, 2021 at 1:37 PM Xiaoying Wang ***@***.***> wrote:
Hi @bloukanov <https://github.com/bloukanov> , no problem! May I ask
what's your table originally look like, for example the types of the
columns? If there is a numerical column then maybe you can also try to
partition on that one.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#159 (reply in thread)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AO4DO37D3Y6RMNFHWPDZCTLULFS6RANCNFSM5HTA3EHA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
Beta Was this translation helpful? Give feedback.
Hi @bloukanov , thanks for bringing up this question. Setting up the partition column can be tricky, and also query and database dependent.
Adding a row number column can be a solution for partition column, but it might also compromise the performance. Here is the query plan I tried using this strategy in mssql:
And this is the query plan if using an existing numerical column in the table:
The second query plan will push down the filtering predicate to the scan operator, while the first one cannot. Which means that if we use generated row number as a partition column, the entire result needs to be generated first and then filtered. I tested it in my benchmark environment, and it is 2x s…