Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

imlib/jpege: Minor DCT speedup. #2157

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

kwagyeman
Copy link
Member

@kwagyeman kwagyeman commented Feb 18, 2024

This PR provides a minor 5% speedup on the software jpeg code. I had hoped it would provide more speed given the vertical DCT has 50% less instructions over two column passes.

This means any more software optimization of the software jpeg encoder has to improve the bit packing logic.

...

Tests are performed on the color bar image and at 90% quality to generate the most work.

On the STM32H7 with hardware jpeg disabled:

VGA, color, @ 90% quality = 84ms (before 87 ms) - 3% improvement
VGA, grayscale, @ 90% quality = 26ms (before 27 ms) - 3% improvement

SVGA, color, @ 90% quality = 130ms (before 136 ms) - 4% improvement
SVGA, grayscale, @ 90% quality = 40ms (before 42 ms) - 5% improvement

XGA, color, @ 90% quality = 216ms (before 225 ms) - 4% improvement
XGA, grayscale, @ 90% quality = 70ms (before 73 ms) - 4% improvement

HD, color, @ 90% quality = 249ms (before 259 ms) - 4% improvement
HD, grayscale, @ 90% quality = 76ms (before 80 ms) - 5% improvement

...

I found weird behavior on the RT1062. While the code provides a minor speedup on the STM32H7 it appears to cause a slow down across the board and is worse than baseline.

VGA, color, @ 90% quality = 74ms (before 70 ms) - 5% slow down
VGA, grayscale, @ 90% quality = 23ms (before 22 ms) - 5% slow down

SVGA, color, @ 90% quality = 116ms (before 110 ms) - 5% slow down
SVGA, grayscale, @ 90% quality = 36ms (before 35 ms) - 3% slow down

XGA, color, @ 90% quality = 194ms (before 184 ms) - 5% slow down
XGA, grayscale, @ 90% quality = 59ms (before 56 ms) - 5% slow down

HD, color, @ 90% quality = 217ms (before 205 ms) - 5% slow down
HD, grayscale, @ 90% quality = 67ms (before 64 ms) - 5% slow down

After applying #2142 the numbers become (thinking maybe there's something weird happening because of the interrupt overhead):

VGA, color, @ 90% quality = 70ms (before 66 ms) - 5% slow down
VGA, grayscale, @ 90% quality = 22ms (before 20 ms) - 10% slow down

SVGA, color, @ 90% quality = 109ms (before 102 ms) - 7% slow down
SVGA, grayscale, @ 90% quality = 34ms (before 32 ms) - 5% slow down

XGA, color, @ 90% quality = 184ms (before 173 ms) - 5% slow down
XGA, grayscale, @ 90% quality = 56ms (before 52 ms) - 8% slow down

HD, color, @ 90% quality = 207ms (before 195 ms) - 6% slow down
HD, grayscale, @ 90% quality = 65ms (before 61 ms) - 7% slow down

I'm not sure what's causing the slow down in performance on the RT1062. Something interesting I noticed though, changing int DU[64]; to int16_t DU[64]; causes a massive slow down. On both the STM32 and RT1062 before doing any SIMD work. Maybe the issue is that there's an overflow in the DU array causing the quantization to break.

Just for sanity sake, I added a cycle counter to the DCT loop on the RT1062. Without the new code the horizontal and vertical loops take about ~606 clock cycles on average. After the new code they take 578 clock cycles. So, the code does indeed speeup to the loop time by 5%. So... there must be some issue with using an int16_t DU[64] array.

Will continue to debug this.

@kwagyeman kwagyeman force-pushed the kwabena/minor_jpeg_du_speedup branch from 189718d to 2a3eb33 Compare February 18, 2024 22:47
@kwagyeman kwagyeman changed the title imlib/jpege: 5% software speedup compression speedup. imlib/jpege: Minor DCT speedup. Feb 18, 2024
@kwagyeman
Copy link
Member Author

With no SIMD changes on the RT1062... just changing int DU[64] to int16_t DU[64]:

time: 230 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 229 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 230 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 229 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 228 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 229 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 229 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 229 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 228 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}

Baseline is:

{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 209 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 208 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 209 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 208 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 209 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 208 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 208 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 208 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 207 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 207 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 206 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}

The image size output is the same, though.

On the H7 Plus, baseline:

time: 259 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 259 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 260 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 260 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 259 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 260 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 259 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 259 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}

And with a int16_t DU[64]:

time: 268 ms
{"w":1280, "h":720, "type":"jpeg", "size":47425}
time: 267 ms
{"w":1280, "h":720, "type":"jpeg", "size":47425}
time: 267 ms
{"w":1280, "h":720, "type":"jpeg", "size":47425}
time: 267 ms
{"w":1280, "h":720, "type":"jpeg", "size":47425}
time: 268 ms
{"w":1280, "h":720, "type":"jpeg", "size":47425}
time: 267 ms
{"w":1280, "h":720, "type":"jpeg", "size":47425}
time: 268 ms
{"w":1280, "h":720, "type":"jpeg", "size":47425}
time: 268 ms
{"w":1280, "h":720, "type":"jpeg", "size":47425}

The image size changes which is odd given I'm using colorbar to test. The slow down seems to be less for the H7 than the RT1062.

@kwagyeman
Copy link
Member Author

kwagyeman commented Feb 18, 2024

Okay, I just checked if there were any overflows beyond int16_t range by doing a comparison for an overflow after every operation and there was not. So, it's not a math issue.

Maybe it's the location of that array.

@kwagyeman
Copy link
Member Author

Okay... figured out something interesting:

Moving the DU array to .data on the H7 Plus:

time: 241 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 241 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 241 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 241 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 242 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 241 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 241 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 241 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}

And then as int16_t DU[64]:

time: 262 ms
{"w":1280, "h":720, "type":"jpeg", "size":47748}
time: 265 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 265 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 265 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 265 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}

So, it seems array access is faster in .data than on the stack. However, for whatever reason, doing 16-bit accesses is messed up. Reading/writing to ints is just faster. I guess at the cache level maybe it can't perform 16-bit/8-bit updates instantly and has to do things 32-bits at a time.

@kwagyeman
Copy link
Member Author

kwagyeman commented Feb 19, 2024

Okay, after removing 16-bit writes and ensuring DU is the first item in the stack and making sure it's aligned... there is no positive performance change. Packing the 16-bit writes into 32-bits, though, appears to reduce performance more.

So... yeah, I'm out of ideas on why the code generates JPEGs slower on the RT1062 while at the same time also knowing that the JPEG DCT cycle time was infact reduced by the optimization (measured via the processor cycle counter).

@kwagyeman
Copy link
Member Author

I noticed an issue. When you set the resolution to something less than 320x240 you noticed issues with the image. I think I can fix this and then the performance will be good.

@kwagyeman kwagyeman force-pushed the kwabena/minor_jpeg_du_speedup branch from 2a3eb33 to ce38de0 Compare February 21, 2024 01:40
@kwagyeman
Copy link
Member Author

I fixed the issue, and it's still a slowdown on the RT1062.

Real head-scratcher.

@iabdalkader iabdalkader added this to the v4.5.7 milestone Jul 19, 2024
@kwagyeman kwagyeman added the experimental Work in progress. label Aug 2, 2024
@iabdalkader iabdalkader removed this from the v4.5.9 milestone Aug 31, 2024
@CLAassistant
Copy link

CLAassistant commented Dec 20, 2024

CLA assistant check
All committers have signed the CLA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants