`jxl-oxide` spends 62% of the time in `copy_nonoverlapping` #202

Shnatsel · 2024-01-16T16:51:10Z

I have converted a large image I had lying around to a slightly lossy JPEG XL with cjxl -d 1:
gast1verandering2_015_d1.jxl.gz

Decoding it with djxl --disable_output takes 50ms. With the jxl-oxide CLI, ignoring the time to encode and write the PNG, decoding takes 300ms, or 6x as long.

I've profiled the decoding process with samply, you can browse the results here: https://share.firefox.dev/48WOaNU

If you look at the inverted call stacks, 62% of the total time is spent in core::intrinsics::copy_nonoverlapping, literally just copying memory around.

Looking at the flame graph, the jxl_render::filter::epf::apply_epf function takes up 38% of the time, and its runtime consists almost entirely of copying and freeing memory. jxl_render::region::ImageWithRegion::try_clone takes another 30% of the time and consists purely of copying memory.

Eliminating these copies would boost performance by nearly 3x.
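As a rough check on that estimate, here is the arithmetic under the optimistic assumption that the whole 62% could be removed and nothing else in the decoder changes (an Amdahl's-law style bound, not a measurement):

```latex
S_{\max} = \frac{1}{1 - 0.62} \approx 2.63,
\qquad
t_{\text{new}} \approx 300\,\text{ms} \times (1 - 0.62) = 114\,\text{ms}
```

Even in that best case, decoding would still take roughly 2.3x as long as the 50ms measured for djxl.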

tirr-c · 2024-01-16T16:56:47Z

Optimizing the edge-preserving filter is something I've tried before (#80), but that clearly wasn't sufficient. Maybe I'll need a different approach/strategy for applying those filters...

Shnatsel · 2024-01-16T17:18:29Z

The actual filtering in EPF only accounts for 5% of the total decoding runtime. It's the memory copying that it performs that takes up the vast majority of the time.

Could it be changed to accept &mut input and operate in-place instead of performing multiple copies along the way?

If you cannot perform the operation entirely in-place, it's fine to use a small scratch buffer (row-sized?) and copy to it and back to &mut input, since the small scratch buffer will always be in the CPU cache.
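A minimal sketch of that idea, assuming a much simpler filter than EPF: a 3-tap vertical average over a single-channel `f32` image with clamped edges. The function name `smooth_vertical_in_place` and the filter itself are hypothetical stand-ins and not jxl-oxide code. It uses two row-sized scratch buffers so that the unfiltered neighbouring rows stay available while the image is overwritten in place.

```rust
/// Hypothetical stand-in for a row-based filter: a 3-tap vertical average
/// with edge rows clamped. This is NOT jxl-oxide's EPF.
pub fn smooth_vertical_in_place(pixels: &mut [f32], width: usize, height: usize) {
    assert_eq!(pixels.len(), width * height);
    if width == 0 || height == 0 {
        return;
    }

    // Two row-sized scratch buffers holding unfiltered rows.
    // `prev` starts as a copy of row 0, which clamps the top edge.
    let mut prev: Vec<f32> = pixels[..width].to_vec();
    let mut cur: Vec<f32> = vec![0.0; width];

    for y in 0..height {
        let row_start = y * width;
        // Save the unfiltered current row before overwriting it.
        cur.copy_from_slice(&pixels[row_start..row_start + width]);

        for x in 0..width {
            // The row below has not been touched yet, so it can be read
            // straight from the image; the bottom edge is clamped.
            let below = if y + 1 < height {
                pixels[row_start + width + x]
            } else {
                cur[x]
            };
            pixels[row_start + x] = (prev[x] + cur[x] + below) / 3.0;
        }

        // The saved current row becomes the "previous" row for y + 1.
        std::mem::swap(&mut prev, &mut cur);
    }
}
```

The scratch buffers are allocated once per call and only one row is copied per iteration, so the working set beyond the image itself is two rows regardless of image height.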

tirr-c · 2024-01-17T15:20:15Z

I mean, it needs more optimization, including those copies and memory accesses. It should indeed be more cache-friendly; I'll consider keeping a small scratch buffer.

Shnatsel · 2024-05-04T00:39:52Z

Now that I've profiled the execution with perf, which lets me see into the kernel, I see that copy_nonoverlapping only looked slow because it called into the kernel, which had to allocate more memory. The time spent inside copy_nonoverlapping itself is negligible. Profile showing it: https://share.firefox.dev/4dpr9Gw
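To illustrate the distinction in general terms (this is not the change made in jxl-oxide, and both function names are hypothetical): copying into a freshly allocated buffer on every call requests new memory each time, while copying into a buffer that is kept around reuses the same allocation once it has grown large enough.

```rust
/// Hypothetical helper: every call allocates a new destination buffer,
/// so each copy writes into memory that has just been requested.
pub fn copy_into_fresh(src: &[f32]) -> Vec<f32> {
    src.to_vec()
}

/// Hypothetical helper: the caller keeps `dst` alive between calls.
/// `clear` keeps the capacity, so once `dst` is large enough the copy
/// writes into the existing allocation instead of requesting a new one.
pub fn copy_into_reused(src: &[f32], dst: &mut Vec<f32>) {
    dst.clear();
    dst.extend_from_slice(src);
}
```

With the second variant, a decoder that processes many same-sized buffers pays the allocation cost once rather than per copy.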

This is just a symptom of #302, so I'm closing this in favor of that issue.

Sorry for the accidental misdirection!

tirr-c added the "optimization" label (Something can be done faster/better) Jan 16, 2024

saschanaz mentioned this issue Mar 1, 2024 in saschanaz/jxl-winthumb#33 (open): "Starting from version 0.2, opening jxl files has become long"

Shnatsel closed this as completed May 4, 2024
