-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
jxl-oxide
spends 62% of the time in copy_nonoverlapping
#202
Comments
Optimizing edge-preserving filter is something I've tried before (#80), but it surely seems insufficient. Maybe I'll need a different approach/strategy to apply those filters... |
The actual filtering in EPF only accounts for 5% of the total decoding runtime. It's the memory copying that it performs that takes up the vast majority of the time. Could it be changed to accept If you cannot perform the operation entirely in-place, it's fine to use a small scratch buffer (row-sized?) and copy to it and back to |
I mean, it needs more optimization including those copies and memory accesses. It should be more cache friendly indeed, I'll consider keeping small scratch buffer. |
Now that I've profiled the execution with This is just a symptom of #302, so I'm closing this in favor of that issue. Sorry for the accidental misdirection! |
I have converted a large image I had lying around to a slightly lossy JPEG XL with
cjxl -d 1
:gast1verandering2_015_d1.jxl.gz
Decoding it with
djxl --disable_output
takes 50ms. Withjxl-oxide
CLI, ignoring the time to encode and write the PNG, the decoding process takes 300ms, or 6x as long.I've profiled the decoding process with
samply
, you can browse the results here: https://share.firefox.dev/48WOaNUIf you look at the inverted call stacks, 62% of the total time is spent in
core::intrinsics::copy_nonoverlapping
, literally just copying memory around.Looking at the flame graph, the
jxl_render::filter::epf::apply_epf
function takes up 38% of the time, with its runtime almost entirely consists of copying and freeing memory.jxl_render::region::ImageWithRegion::try_clone
also takes 30% of the time and consists purely of copying memory.Eliminating these copies would boost performance by nearly 3x.
The text was updated successfully, but these errors were encountered: