Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jxl-oxide spends 62% of the time in copy_nonoverlapping #202

Closed
Shnatsel opened this issue Jan 16, 2024 · 4 comments
Closed

jxl-oxide spends 62% of the time in copy_nonoverlapping #202

Shnatsel opened this issue Jan 16, 2024 · 4 comments
Labels
optimization Something can be done faster/better

Comments

@Shnatsel
Copy link
Contributor

I have converted a large image I had lying around to a slightly lossy JPEG XL with cjxl -d 1:
gast1verandering2_015_d1.jxl.gz

Decoding it with djxl --disable_output takes 50ms. With jxl-oxide CLI, ignoring the time to encode and write the PNG, the decoding process takes 300ms, or 6x as long.

I've profiled the decoding process with samply, you can browse the results here: https://share.firefox.dev/48WOaNU

If you look at the inverted call stacks, 62% of the total time is spent in core::intrinsics::copy_nonoverlapping, literally just copying memory around.

Looking at the flame graph, the jxl_render::filter::epf::apply_epf function takes up 38% of the time, with its runtime almost entirely consists of copying and freeing memory. jxl_render::region::ImageWithRegion::try_clone also takes 30% of the time and consists purely of copying memory.

Eliminating these copies would boost performance by nearly 3x.

@tirr-c tirr-c added the optimization Something can be done faster/better label Jan 16, 2024
@tirr-c
Copy link
Owner

tirr-c commented Jan 16, 2024

Optimizing edge-preserving filter is something I've tried before (#80), but it surely seems insufficient. Maybe I'll need a different approach/strategy to apply those filters...

@Shnatsel
Copy link
Contributor Author

The actual filtering in EPF only accounts for 5% of the total decoding runtime. It's the memory copying that it performs that takes up the vast majority of the time.

Could it be changed to accept &mut input and operate in-place instead of performing multiple copies along the way?

If you cannot perform the operation entirely in-place, it's fine to use a small scratch buffer (row-sized?) and copy to it and back to &mut input, since the small scratch buffer will always be in the CPU cache.

@tirr-c
Copy link
Owner

tirr-c commented Jan 17, 2024

I mean, it needs more optimization including those copies and memory accesses. It should be more cache friendly indeed, I'll consider keeping small scratch buffer.

@Shnatsel
Copy link
Contributor Author

Shnatsel commented May 4, 2024

Now that I've profiled the execution with perf so I could see into the kernel, I see that copy_nonoverlapping itself took very little time and it was only slow because it called into the kernel that had to allocate more memory. The time spent inside copy_nonoverlapping itself is negligible. Profile showing it: https://share.firefox.dev/4dpr9Gw

This is just a symptom of #302, so I'm closing this in favor of that issue.

Sorry for the accidental misdirection!

@Shnatsel Shnatsel closed this as completed May 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimization Something can be done faster/better
Projects
None yet
Development

No branches or pull requests

2 participants