-
Notifications
You must be signed in to change notification settings - Fork 476
Description
The preserve_insertion_order
flag is mentioned very frequently when users raise performance issues, and a recent discussion item, duckdb/duckdb#19201 , goes so far as to claim that most performance problems can be solved with this flag, yet its documentation is a bit sparse and vague.
For example, see the comments duckdb/duckdb#9987 (comment) and duckdb/duckdb#9987 (comment)
I think concrete query examples why one might want to use or not use the flag would be helpful, including discussion of performance and compatibility with other flags (I saw that there are a couple of parquet writer flags that are incompatible with the flag).
Currently,
https://duckdb.org/docs/stable/sql/dialect/order_preservation.html#insertion-order only says
read_csv/json/parquet
[...] Preservation of insertion order is controlled by thepreserve_insertion_order
configuration option
which is almost redundant and omits the effect on the COPY
command.
Confusingly, higher up on the page, the COPY
clause does link the preserve_insertion_order paragraph , but the FROM
clause doesn't.
In the opposite direction, https://duckdb.org/docs/stable/guides/performance/how_to_tune_workloads.html#the-preserve_insertion_order-option
and
https://duckdb.org/docs/stable/configuration/overview.html
say
This allows the systems to re-order any results that do not contain ORDER BY clauses, potentially reducing memory usage.
and
Whether or not to preserve insertion order. If set to false the system is allowed to re-order any results that do not contain ORDER BY clauses
respectively, which seems to omit (at least not mentioning it explicitly) the effect on read_x
functions.
Even when taking the descriptions together, I'm not entirely clear to what extent both of the write and read aspects go.
For example, I'm unsure whether the flag only affects direct reads from files into duckdb tables (that's what "insertion' sounds like), or any usage of the readers in a query (from context on the first page I'd guess it affects the implicit order of any FROM read_x
clause in the absence of an explicit order by clause, which in turn may or may not be preserved by subsequent steps depending on the rules laid out earlier on that page).
For writes, too, I'm not 100% sure if any implicit ordering is preserved, whether it originated from a read_x
function or not.
I'm also not sure whether preserve_insertion_order
applies to writing to or reading from native duckdb tables, or more generally also to any FROM
clause, including subqueries.
Certainly feels like it's be useful if it did.
I'm also not sure what it means that
The following operations do not guarantee that the row order is preserved: FROM with multiple tables and/or subqueries JOIN
Isn't that a join or a union (which are already mentioned elsewhere)?
Finally, I'm confused that even when preserve_insertion_order=false
, the rows of a parquet are read in order when row_number() OVER ()
appears in the select clause. That's super implicit behavior, and the more I think about all this, I don't understand how a preserve_insertion_order
config flag is better than if all read
and COPY
commands took a preserve_order
argument.