Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

offset command not working as expected #9488

Open
13minutes-yt opened this issue Mar 7, 2024 · 3 comments
Open

offset command not working as expected #9488

13minutes-yt opened this issue Mar 7, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@13minutes-yt
Copy link

13minutes-yt commented Mar 7, 2024

Describe the bug

Giving wrong results for large rows

To Reproduce

If I have data of less than 753459 rows it will give this result. this is data for 753458 rows
┌────────┬────────┐
│ data │ index │
│ int64 │ int64 │
├────────┼────────┤
│ 1 │ 0 │
│ 2 │ 1 │
│ 3 │ 2 │
│ 4 │ 3 │
│ 5 │ 4 │
│ 6 │ 5 │
│ 7 │ 6 │
│ 8 │ 7 │
│ 9 │ 8 │
│ 10 │ 9 │
│ 11 │ 10 │
│ 12 │ 11 │
│ 13 │ 12 │
│ 14 │ 13 │
│ 15 │ 14 │
│ 16 │ 15 │
│ 17 │ 16 │
│ 18 │ 17 │
│ 19 │ 18 │
│ 20 │ 19 │
│ · │ · │
│ · │ · │
│ · │ · │
│ 793439 │ 793438 │
│ 793440 │ 793439 │
│ 793441 │ 793440 │
│ 793442 │ 793441 │
│ 793443 │ 793442 │
│ 793444 │ 793443 │
│ 793445 │ 793444 │
│ 793446 │ 793445 │
│ 793447 │ 793446 │
│ 793448 │ 793447 │
│ 793449 │ 793448 │
│ 793450 │ 793449 │
│ 793451 │ 793450 │
│ 793452 │ 793451 │
│ 793453 │ 793452 │
│ 793454 │ 793453 │
│ 793455 │ 793454 │
│ 793456 │ 793455 │
│ 793457 │ 793456 │
│ 793458 │ 793457 │
├────────┴────────┤
│ 793458 rows │
│ (40 shown)


datafusion-cli -q -c " select * from 'offset.csv' limit 4 offset 3"
data,index
4,3
5,4
6,5
7,6

If the data is more it gives another
this is data for 793459 rows

datafusion-cli -q -c " select * from 'offset.csv' limit 4 offset 3"
data,index
113075,113074
113076,113075
113077,113076
113078,113077

for 793460 rows

datafusion-cli -q -c " select * from 'offset.csv' limit 4 offset 3"
data,index
696266,696265
696267,696266
696268,696267
696269,696268

Expected behavior

No response

Additional context

No response

@13minutes-yt 13minutes-yt added the bug Something isn't working label Mar 7, 2024
@mustafasrepo
Copy link
Contributor

This might be related to built-in parallelism of the Datafusion. Datafusion by default executes in parallel, if query planner thinks it is helpful. You can force DataFusion to work in. single partition by executing command
set datafusion.execution.target_partitions = 1;
In this case, scan order of the file wouldn't change (At least this is what I expect).

Another option might be to specify desired order in the query. if you add ORDER BY index to the query it should produce expected result even if query works in multi partitions.

@MohamedAbdeen21
Copy link
Contributor

MohamedAbdeen21 commented Mar 22, 2024

The query in the issue occasionally returns the correct answer; however, increasing the offset to a large enough number < total rows, almost never works as expected.

First option works well, but it kills the parallelism. Second option isn't always viable especially when data is not originally sorted.

Two possible solutions I can think of:

  • Attach the partition number to the record batch and consider it when executing the limit. This can delay the early termination of the input stream. Worst case scenario, limit has to read the entire input looking for the required partition, and at that point, we are probably better off killing the parallelism anyway. But on average the solution should be faster.

  • Attach the partition number as a new column and insert a sort by that column before the limit. This doesn't look easy given that DF doesn't have a with_column that can append columns as far as I can tell.

@alamb
Copy link
Contributor

alamb commented Mar 24, 2024

I agree with @mustafasrepo 's analysis

Fundamentally, select * from 'offset.csv' limit 4 offset 3 is a non deterministic query as the output order is not defined a SQL engine is allowed to return rows in any order (there is no ORDER BY clause).

If you want to treat the file as though it is ordered, you could consder defining it with a CREATE EXTERNAL TABLE ... WITH ORDER

https://arrow.apache.org/datafusion/user-guide/sql/ddl.html#create-external-table

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants