Skip to content

preserve_insertion_order could do with more detailed documentation #5861

@soerenwolfers

Description

@soerenwolfers

The preserve_insertion_order flag is mentioned very frequently when users raise performance issues, and a recent discussion item, duckdb/duckdb#19201 , goes so far as to claim that most performance problems can be solved with this flag, yet its documentation is a bit sparse and vague.

For example, see the comments duckdb/duckdb#9987 (comment) and duckdb/duckdb#9987 (comment)

I think concrete query examples why one might want to use or not use the flag would be helpful, including discussion of performance and compatibility with other flags (I saw that there are a couple of parquet writer flags that are incompatible with the flag).

Currently,

https://duckdb.org/docs/stable/sql/dialect/order_preservation.html#insertion-order only says

read_csv/json/parquet[...] Preservation of insertion order is controlled by the preserve_insertion_order configuration option

which is almost redundant and omits the effect on the COPY command.
Confusingly, higher up on the page, the COPY clause does link the preserve_insertion_order paragraph , but the FROM clause doesn't.

In the opposite direction, https://duckdb.org/docs/stable/guides/performance/how_to_tune_workloads.html#the-preserve_insertion_order-option
and

https://duckdb.org/docs/stable/configuration/overview.html
say

This allows the systems to re-order any results that do not contain ORDER BY clauses, potentially reducing memory usage.

and

Whether or not to preserve insertion order. If set to false the system is allowed to re-order any results that do not contain ORDER BY clauses

respectively, which seems to omit (at least not mentioning it explicitly) the effect on read_x functions.

Even when taking the descriptions together, I'm not entirely clear to what extent both of the write and read aspects go.
For example, I'm unsure whether the flag only affects direct reads from files into duckdb tables (that's what "insertion' sounds like), or any usage of the readers in a query (from context on the first page I'd guess it affects the implicit order of any FROM read_x clause in the absence of an explicit order by clause, which in turn may or may not be preserved by subsequent steps depending on the rules laid out earlier on that page).

For writes, too, I'm not 100% sure if any implicit ordering is preserved, whether it originated from a read_x function or not.

I'm also not sure whether preserve_insertion_order applies to writing to or reading from native duckdb tables, or more generally also to any FROM clause, including subqueries.

Certainly feels like it's be useful if it did.

I'm also not sure what it means that

The following operations do not guarantee that the row order is preserved: FROM with multiple tables and/or subqueries JOIN

Isn't that a join or a union (which are already mentioned elsewhere)?

Finally, I'm confused that even when preserve_insertion_order=false, the rows of a parquet are read in order when row_number() OVER () appears in the select clause. That's super implicit behavior, and the more I think about all this, I don't understand how a preserve_insertion_order config flag is better than if all read and COPY commands took a preserve_order argument.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions