Sorting multi-column dataframe is slower than expected if key column is already sorted #19364

Chuck321123 · 2024-10-21T22:42:03Z

Description

So when you run df.sort("Col") it sorts the column, even if the column is already sorted. This can become a problem if the df/column is very large and the process becomes quite resource heavy. Would be nice if it would be possible to check if the column was sorted before hand and not sort if it was already sorted. Example below to illustrate:

import polars as pl
import numpy as np

random_numbers = np.random.rand(10_000_000)
df = pl.DataFrame({
    "random_numbers": random_numbers
})

df = df.with_columns(pl.col("random_numbers").rle_id().alias("ID"))
print("Sorted?:", df["ID"].is_sorted())

%timeit df.sort("ID")

Sorted?: True
57.1 ms ± 3.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The text was updated successfully, but these errors were encountered:

mcrumiller · 2024-10-22T01:22:54Z

You need a control group here, i.e. compare the sort function on an un-sorted array to a sorted array. If you do this, you'll find that sort is indeed much faster if the array is already sorted:

import numpy as np

import polars as pl

random_numbers = np.random.rand(10_000_000)
df = pl.DataFrame({
    "random_numbers": random_numbers
})

print("when unsorted")
%timeit df.sort("random_numbers")

df = df.sort("random_numbers")
print("when sorted")
%timeit df.sort("random_numbers")

when unsorted
1.6 s ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
when sorted
120 μs ± 653 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

That's 1,000,000x faster.

Chuck321123 · 2024-10-22T09:39:48Z

@mcrumiller Understood. Then I might have observed that the difference comes from adding an extra column in one of the dataframes (which shouldn't affect the sorting speed after being sorted?) Take a look at this:

import polars as pl
import numpy as np

np.random.seed(42)

random_numbers = np.random.rand(10_000_000)

df = pl.DataFrame({"random_numbers": random_numbers})

df2 = pl.DataFrame({"random_numbers": random_numbers})

# Creating a new column that affects sorting speed
df2 = df2.with_columns(pl.col("random_numbers").alias("New_Column"))

print("Same values?", (df["random_numbers"] == df2["random_numbers"]).all())

df = df.sort("random_numbers")
%timeit df.sort("random_numbers")

df2 = df2.sort("random_numbers")
%timeit df2.sort("random_numbers")

Results:

Same values? True
10.4 μs ± 918 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
115 ms ± 4.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The difference is enormous.

mcrumiller · 2024-10-22T12:41:47Z

Here are my timings for df.sort("a") for a single-column frame and a two-column frame. I've put all timings in units of ms. Values in table are pre-sorted Y/N.

Expand to view timing code

import numpy as np

import polars as pl

n = 1_000_000
df = pl.DataFrame({
    "a": np.random.rand(n),
    "b": np.random.rand(n),
})


# -- One column
print("A unsorted")
df2 = df.select("a")
%timeit df2.sort("a")

print("A sorted")
df2 = df.select("a").sort("a")
%timeit df2.sort("a")

# -- Two columns
print("A unsorted, B unsorted")
df2 = df.clone()
%timeit df2.sort("a")

print("A unsorted, B sorted")
df2 = df.sort("b")
%timeit df2.sort("a")

print("A sorted, B unsorted")
df2 = df.sort("a")
%timeit df2.sort("a")


print("A sorted, B sorted")
df2 = df.sort("a", "b")
%timeit df2.sort("a")

One-column table

A	Time
N	150
Y	0.00012

Two-column table

A	B	time
N	N	150
N	Y	150
Y	N	40
Y	Y	40

The last two rows of the second table should be no-ops but still appear to take a bit longer than they should. @orlp do you have an explanation for this, or do you think the issue should be re-opened?

Chuck321123 · 2024-10-23T08:20:44Z

@orlp

orlp · 2024-10-23T08:38:24Z

Yes, that is strange. I'll edit the title.

VillePuuska · 2024-10-27T13:10:32Z

Looks like when the dataframe has just one column it goes down a fast path here which leads to a check on whether it's already sorted correctly in this macro.

On the other hand, the path for a dataframe with more than one column leads to arg_sort_numeric which doesn't check for already being sorted and leads to calling sorting from here every time.

Chuck321123 added the enhancement New feature or an improvement of an existing feature label Oct 21, 2024

orlp closed this as not planned Won't fix, can't repro, duplicate, stale Oct 22, 2024

orlp reopened this Oct 23, 2024

orlp changed the title ~~Drop sorting of column when the column is already sorted~~ Sorting multi-column dataframe is slower than expected if key column is already sorted Oct 23, 2024

orlp added bug Something isn't working performance Performance issues or improvements accepted Ready for implementation P-medium Priority: medium and removed enhancement New feature or an improvement of an existing feature labels Oct 23, 2024

github-project-automation bot added this to Backlog Oct 23, 2024

github-project-automation bot moved this to Ready in Backlog Oct 23, 2024

siddharth-vi linked a pull request Nov 19, 2024 that will close this issue

perf: Add fast paths for series.arg_sort and dataframe.sort #19872

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sorting multi-column dataframe is slower than expected if key column is already sorted #19364

Sorting multi-column dataframe is slower than expected if key column is already sorted #19364

Chuck321123 commented Oct 21, 2024

mcrumiller commented Oct 22, 2024 •

edited

Loading

Chuck321123 commented Oct 22, 2024

mcrumiller commented Oct 22, 2024 •

edited

Loading

Chuck321123 commented Oct 23, 2024

orlp commented Oct 23, 2024

VillePuuska commented Oct 27, 2024

Sorting multi-column dataframe is slower than expected if key column is already sorted #19364

Sorting multi-column dataframe is slower than expected if key column is already sorted #19364

Comments

Chuck321123 commented Oct 21, 2024

Description

mcrumiller commented Oct 22, 2024 • edited Loading

Chuck321123 commented Oct 22, 2024

mcrumiller commented Oct 22, 2024 • edited Loading

One-column table

Two-column table

Chuck321123 commented Oct 23, 2024

orlp commented Oct 23, 2024

VillePuuska commented Oct 27, 2024

mcrumiller commented Oct 22, 2024 •

edited

Loading

mcrumiller commented Oct 22, 2024 •

edited

Loading