Skip to content

Conversation

jacob-greenfield
Copy link

When the sort utility is searching for newlines in a large buffer, skip past any previously-searched data. This fixes a quadratic-time overhead that would occur in case of a line that is far longer than the configured buffer size (and START_BUFFER_SIZE).

On my M1 MacBook Pro:

$ dd if=/dev/zero bs=1M count=4096 status=progress | tr '\0' 'A' | head -c 4294967295 > oneline_4G.txt && cargo build --release -p uu_sort && time ./target/release/sort oneline_4G.txt >out.txt
...
./target/release/sort oneline_4G.txt > out.txt  1.96s user 3.32s system 78% cpu 6.687 total

This PR also fixes a separate issue where the check last_file_target_size != leftover_len was used to determine whether a file is non-empty; however this could fail if the buffer was recently resized, since leftover_len accounts for the additional capacity but last_file_target_size does not. This can cause two files to be concatenated without a newline in between. To reproduce: run head -c 8000 /dev/zero | tr '\0' 'b' >b.txt; echo aaa >a.txt; cargo run sort b.txt a.txt. I added a new test test_sort::test_start_buffer to cover this.

Fixes #8583.

@jacob-greenfield
Copy link
Author

jacob-greenfield commented Sep 26, 2025

I am a bit embarrassed to admit I did not notice #8652 before submitting this PR. I am not sure how this affects or overlaps with 8652 but I don't doubt that 8652 is also a valid approach to the problem. :)

Edit: If nothing else, this PR still fixes the last_file_target_size issue, and those commits could be cherry-picked.

Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

Copy link

codspeed-hq bot commented Sep 26, 2025

CodSpeed Performance Report

Merging #8746 will not alter performance

Comparing jacob-greenfield:sort-fix-buffer-read (0f6f6d4) with main (5c15d79)

Summary

✅ 41 untouched
⏩ 64 skipped1

Footnotes

  1. 64 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Copy link

GNU testsuite comparison:

GNU test failed: tests/wc/wc-files0-from. tests/wc/wc-files0-from is passing on 'main'. Maybe you have to rebase?
Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

@jacob-greenfield jacob-greenfield force-pushed the sort-fix-buffer-read branch 2 times, most recently from e04e64d to d86638e Compare September 26, 2025 22:15
Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)

@jacob-greenfield
Copy link
Author

Test fixtures are now generated programmatically, let me know if I should squash to remove the hardcoded fixtures from the git history?

@kehrazy
Copy link

kehrazy commented Sep 29, 2025

I am a bit embarrassed to admit I did not notice #8652 before submitting this PR. I am not sure how this affects or overlaps with 8652 but I don't doubt that 8652 is also a valid approach to the problem. :)

Edit: If nothing else, this PR still fixes the last_file_target_size issue, and those commits could be cherry-picked.

if anything, this change is much leaner and easier to review (and passes CI, at the very least), as opposed to 8652 :)

Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)

@sylvestre sylvestre force-pushed the sort-fix-buffer-read branch from 0f6f6d4 to 15c0028 Compare October 1, 2025 06:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sort does not finish for large one line file
3 participants