Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sorting multi-column dataframe is slower than expected if key column is already sorted #19364

Open
Chuck321123 opened this issue Oct 21, 2024 · 6 comments · May be fixed by #19872
Open

Sorting multi-column dataframe is slower than expected if key column is already sorted #19364

Chuck321123 opened this issue Oct 21, 2024 · 6 comments · May be fixed by #19872
Labels
accepted Ready for implementation bug Something isn't working P-medium Priority: medium performance Performance issues or improvements

Comments

@Chuck321123
Copy link

Description

So when you run df.sort("Col") it sorts the column, even if the column is already sorted. This can become a problem if the df/column is very large and the process becomes quite resource heavy. Would be nice if it would be possible to check if the column was sorted before hand and not sort if it was already sorted. Example below to illustrate:

import polars as pl
import numpy as np

random_numbers = np.random.rand(10_000_000)
df = pl.DataFrame({
    "random_numbers": random_numbers
})

df = df.with_columns(pl.col("random_numbers").rle_id().alias("ID"))
print("Sorted?:", df["ID"].is_sorted())

%timeit df.sort("ID")
Sorted?: True
57.1 ms ± 3.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
@Chuck321123 Chuck321123 added the enhancement New feature or an improvement of an existing feature label Oct 21, 2024
@mcrumiller
Copy link
Contributor

mcrumiller commented Oct 22, 2024

You need a control group here, i.e. compare the sort function on an un-sorted array to a sorted array. If you do this, you'll find that sort is indeed much faster if the array is already sorted:

import numpy as np

import polars as pl

random_numbers = np.random.rand(10_000_000)
df = pl.DataFrame({
    "random_numbers": random_numbers
})

print("when unsorted")
%timeit df.sort("random_numbers")

df = df.sort("random_numbers")
print("when sorted")
%timeit df.sort("random_numbers")
when unsorted
1.6 s ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
when sorted
120 μs ± 653 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

That's 1,000,000x faster.

@orlp orlp closed this as not planned Won't fix, can't repro, duplicate, stale Oct 22, 2024
@Chuck321123
Copy link
Author

@mcrumiller Understood. Then I might have observed that the difference comes from adding an extra column in one of the dataframes (which shouldn't affect the sorting speed after being sorted?) Take a look at this:

import polars as pl
import numpy as np

np.random.seed(42)

random_numbers = np.random.rand(10_000_000)

df = pl.DataFrame({"random_numbers": random_numbers})

df2 = pl.DataFrame({"random_numbers": random_numbers})

# Creating a new column that affects sorting speed
df2 = df2.with_columns(pl.col("random_numbers").alias("New_Column"))

print("Same values?", (df["random_numbers"] == df2["random_numbers"]).all())

df = df.sort("random_numbers")
%timeit df.sort("random_numbers")

df2 = df2.sort("random_numbers")
%timeit df2.sort("random_numbers")

Results:

Same values? True
10.4 μs ± 918 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
115 ms ± 4.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The difference is enormous.

@mcrumiller
Copy link
Contributor

mcrumiller commented Oct 22, 2024

Here are my timings for df.sort("a") for a single-column frame and a two-column frame. I've put all timings in units of ms. Values in table are pre-sorted Y/N.

Expand to view timing code
import numpy as np

import polars as pl

n = 1_000_000
df = pl.DataFrame({
    "a": np.random.rand(n),
    "b": np.random.rand(n),
})


# -- One column
print("A unsorted")
df2 = df.select("a")
%timeit df2.sort("a")

print("A sorted")
df2 = df.select("a").sort("a")
%timeit df2.sort("a")

# -- Two columns
print("A unsorted, B unsorted")
df2 = df.clone()
%timeit df2.sort("a")

print("A unsorted, B sorted")
df2 = df.sort("b")
%timeit df2.sort("a")

print("A sorted, B unsorted")
df2 = df.sort("a")
%timeit df2.sort("a")


print("A sorted, B sorted")
df2 = df.sort("a", "b")
%timeit df2.sort("a")

One-column table

A Time
N 150
Y 0.00012

Two-column table

A B time
N N 150
N Y 150
Y N 40
Y Y 40

The last two rows of the second table should be no-ops but still appear to take a bit longer than they should. @orlp do you have an explanation for this, or do you think the issue should be re-opened?

@Chuck321123
Copy link
Author

@orlp

@orlp orlp reopened this Oct 23, 2024
@orlp
Copy link
Collaborator

orlp commented Oct 23, 2024

Yes, that is strange. I'll edit the title.

@orlp orlp changed the title Drop sorting of column when the column is already sorted Sorting multi-column dataframe is slower than expected if key column is already sorted Oct 23, 2024
@orlp orlp added bug Something isn't working performance Performance issues or improvements accepted Ready for implementation P-medium Priority: medium and removed enhancement New feature or an improvement of an existing feature labels Oct 23, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Oct 23, 2024
@VillePuuska
Copy link

Looks like when the dataframe has just one column it goes down a fast path here which leads to a check on whether it's already sorted correctly in this macro.

On the other hand, the path for a dataframe with more than one column leads to arg_sort_numeric which doesn't check for already being sorted and leads to calling sorting from here every time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation bug Something isn't working P-medium Priority: medium performance Performance issues or improvements
Projects
Status: Ready
Development

Successfully merging a pull request may close this issue.

4 participants