Sorting multi-column dataframe is slower than expected if key column is already sorted #19364
Comments
You need a control group here, i.e. compare against the unsorted case:

```python
import numpy as np
import polars as pl

random_numbers = np.random.rand(10_000_000)
df = pl.DataFrame({
    "random_numbers": random_numbers
})

print("when unsorted")
%timeit df.sort("random_numbers")

df = df.sort("random_numbers")
print("when sorted")
%timeit df.sort("random_numbers")
```

That's 1,000,000x faster.
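The cheap second sort appears to come from Polars marking the column as sorted rather than re-examining the data each time. A minimal way to see that metadata (a sketch, assuming the `Series.flags` property, which exposes the `SORTED_ASC`/`SORTED_DESC` flags) is:

```python
import numpy as np
import polars as pl

df = pl.DataFrame({"random_numbers": np.random.rand(10_000_000)})

# Before sorting, no sortedness metadata is attached to the column.
print(df["random_numbers"].flags)  # {'SORTED_ASC': False, 'SORTED_DESC': False}

df = df.sort("random_numbers")

# After sorting, the ascending flag is set, which is what lets a repeated
# df.sort("random_numbers") return almost immediately.
print(df["random_numbers"].flags)  # {'SORTED_ASC': True, 'SORTED_DESC': False}
```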
@mcrumiller Understood. Then I might have observed that the difference comes from adding an extra column to one of the dataframes (which shouldn't affect the sorting speed after being sorted?). Take a look at this:

```python
import polars as pl
import numpy as np

np.random.seed(42)
random_numbers = np.random.rand(10_000_000)

df = pl.DataFrame({"random_numbers": random_numbers})
df2 = pl.DataFrame({"random_numbers": random_numbers})

# Creating a new column that affects sorting speed
df2 = df2.with_columns(pl.col("random_numbers").alias("New_Column"))

print("Same values?", (df["random_numbers"] == df2["random_numbers"]).all())

df = df.sort("random_numbers")
%timeit df.sort("random_numbers")

df2 = df2.sort("random_numbers")
%timeit df2.sort("random_numbers")
```

Results:

```
Same values? True
10.4 μs ± 918 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
115 ms ± 4.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

The difference is enormous.
Here are my timings for the one- and two-column cases.

```python
import numpy as np
import polars as pl

n = 1_000_000
df = pl.DataFrame({
    "a": np.random.rand(n),
    "b": np.random.rand(n),
})

# -- One column
print("A unsorted")
df2 = df.select("a")
%timeit df2.sort("a")

print("A sorted")
df2 = df.select("a").sort("a")
%timeit df2.sort("a")

# -- Two columns
print("A unsorted, B unsorted")
df2 = df.clone()
%timeit df2.sort("a")

print("A unsorted, B sorted")
df2 = df.sort("b")
%timeit df2.sort("a")

print("A sorted, B unsorted")
df2 = df.sort("a")
%timeit df2.sort("a")

print("A sorted, B sorted")
df2 = df.sort("a", "b")
%timeit df2.sort("a")
```

One-column table
Two-column table
The last two rows of the second table should be no-ops but still appear to take a bit longer than they should. @orlp do you have an explanation for this, or do you think the issue should be re-opened?
Yes, that is strange. I'll edit the title.
Looks like when the dataframe has just one column, it goes down a fast path here, which leads to a check on whether it's already sorted correctly in this macro. The path for a dataframe with more than one column, on the other hand, leads to `arg_sort_numeric`, which doesn't check for the data already being sorted and ends up calling the sort from here every time.
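Until the multi-column path also consults the sortedness metadata, one possible user-side workaround (a sketch, assuming the flag is exposed through `Series.flags`) is to guard the sort call yourself:

```python
import numpy as np
import polars as pl

df = pl.DataFrame({
    "a": np.random.rand(1_000_000),
    "b": np.random.rand(1_000_000),
}).sort("a")

# Only re-sort when the key column is not already marked as sorted ascending.
if not df["a"].flags["SORTED_ASC"]:
    df = df.sort("a")
```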
Description
So when you run

```python
df.sort("Col")
```

it sorts the column even if the column is already sorted. This can become a problem when the dataframe/column is very large, since the process gets quite resource-heavy. It would be nice if it were possible to check beforehand whether the column is already sorted and skip the sort if it is. Example below to illustrate:
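A minimal sketch of the kind of call in question (illustrative only; a large random column sorted twice in a row):

```python
import numpy as np
import polars as pl

df = pl.DataFrame({"Col": np.random.rand(10_000_000)})

df = df.sort("Col")  # first sort does the actual work
df = df.sort("Col")  # ideally a cheap no-op, since "Col" is already sorted
```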