Deblock chroma: remove stack #278
base: deblock_asm
We only need the no_p/no_q handling (and the stack space it uses) when PLT (palette mode) is enabled.
checkasm: using random seed 1749355096
checkasm: bench runs 1024 (1 << 10)
AVX:
 - vvc_deblock.luma [OK]
 - vvc_deblock.chroma [OK]
checkasm: all 66 tests passed
benchmarking with native FFmpeg timers
nop: 28.3
vvc_h_loop_filter_chroma_8_mix_no-shift_c: 69.9
vvc_h_loop_filter_chroma_8_mix_no-shift_avx: 20.1
vvc_h_loop_filter_chroma_8_mix_shift_c: 85.4
vvc_h_loop_filter_chroma_8_mix_shift_avx: 44.9
vvc_h_loop_filter_chroma_8_one-side_no-shift_c: 92.1
vvc_h_loop_filter_chroma_8_one-side_no-shift_avx: 31.9
vvc_h_loop_filter_chroma_8_one-side_shift_c: 100.1
vvc_h_loop_filter_chroma_8_one-side_shift_avx: 27.9
vvc_h_loop_filter_chroma_8_strong_no-shift_c: 79.1
vvc_h_loop_filter_chroma_8_strong_no-shift_avx: 30.9
vvc_h_loop_filter_chroma_8_strong_shift_c: 108.1
vvc_h_loop_filter_chroma_8_strong_shift_avx: 29.4
vvc_h_loop_filter_chroma_8_weak_no-shift_c: 56.1
vvc_h_loop_filter_chroma_8_weak_no-shift_avx: 19.6
vvc_h_loop_filter_chroma_8_weak_shift_c: 68.1
vvc_h_loop_filter_chroma_8_weak_shift_avx: 19.1
vvc_h_loop_filter_chroma_10_mix_no-shift_c: 61.6
vvc_h_loop_filter_chroma_10_mix_no-shift_avx: 33.9
vvc_h_loop_filter_chroma_10_mix_shift_c: 108.1
vvc_h_loop_filter_chroma_10_mix_shift_avx: 26.4
vvc_h_loop_filter_chroma_10_one-side_no-shift_c: 70.6
vvc_h_loop_filter_chroma_10_one-side_no-shift_avx: 28.9
vvc_h_loop_filter_chroma_10_one-side_shift_c: 87.9
vvc_h_loop_filter_chroma_10_one-side_shift_avx: 27.4
vvc_h_loop_filter_chroma_10_strong_no-shift_c: 68.6
vvc_h_loop_filter_chroma_10_strong_no-shift_avx: 27.1
vvc_h_loop_filter_chroma_10_strong_shift_c: 97.6
vvc_h_loop_filter_chroma_10_strong_shift_avx: 27.1
vvc_h_loop_filter_chroma_10_weak_no-shift_c: 46.6
vvc_h_loop_filter_chroma_10_weak_no-shift_avx: 21.4
vvc_h_loop_filter_chroma_10_weak_shift_c: 54.1
vvc_h_loop_filter_chroma_10_weak_shift_avx: 20.9
vvc_h_loop_filter_chroma_12_mix_no-shift_c: 63.1
vvc_h_loop_filter_chroma_12_mix_no-shift_avx: 28.4
vvc_h_loop_filter_chroma_12_mix_shift_c: 61.6
vvc_h_loop_filter_chroma_12_mix_shift_avx: 27.4
vvc_h_loop_filter_chroma_12_one-side_no-shift_c: 64.9
vvc_h_loop_filter_chroma_12_one-side_no-shift_avx: 26.6
vvc_h_loop_filter_chroma_12_one-side_shift_c: 82.9
vvc_h_loop_filter_chroma_12_one-side_shift_avx: 40.9
vvc_h_loop_filter_chroma_12_strong_no-shift_c: 117.1
vvc_h_loop_filter_chroma_12_strong_no-shift_avx: 28.1
vvc_h_loop_filter_chroma_12_strong_shift_c: 125.4
vvc_h_loop_filter_chroma_12_strong_shift_avx: 46.6
vvc_h_loop_filter_chroma_12_weak_no-shift_c: 49.4
vvc_h_loop_filter_chroma_12_weak_no-shift_avx: 18.1
vvc_h_loop_filter_chroma_12_weak_shift_c: 53.6
vvc_h_loop_filter_chroma_12_weak_shift_avx: 17.1
vvc_h_loop_filter_luma_8_skip_c: 38.4
vvc_h_loop_filter_luma_8_skip_avx: 10.4
vvc_h_loop_filter_luma_8_strong_c: 142.6
vvc_h_loop_filter_luma_8_strong_avx: 41.4
vvc_h_loop_filter_luma_8_weak_c: 86.9
vvc_h_loop_filter_luma_8_weak_avx: 10.6
vvc_h_loop_filter_luma_10_skip_c: 46.4
vvc_h_loop_filter_luma_10_skip_avx: 9.1
vvc_h_loop_filter_luma_10_strong_c: 86.4
vvc_h_loop_filter_luma_10_strong_avx: 39.9
vvc_h_loop_filter_luma_10_weak_c: 65.1
vvc_h_loop_filter_luma_10_weak_avx: 9.6
vvc_h_loop_filter_luma_12_skip_c: 47.4
vvc_h_loop_filter_luma_12_skip_avx: 9.4
vvc_h_loop_filter_luma_12_strong_c: 126.9
vvc_h_loop_filter_luma_12_strong_avx: 39.4
vvc_h_loop_filter_luma_12_weak_c: 46.4
vvc_h_loop_filter_luma_12_weak_avx: 39.9
vvc_v_loop_filter_chroma_8_mix_no-shift_c: 64.6
vvc_v_loop_filter_chroma_8_mix_no-shift_avx: 41.9
vvc_v_loop_filter_chroma_8_mix_shift_c: 86.4
vvc_v_loop_filter_chroma_8_mix_shift_avx: 40.6
vvc_v_loop_filter_chroma_8_one-side_no-shift_c: 70.9
vvc_v_loop_filter_chroma_8_one-side_no-shift_avx: 40.9
vvc_v_loop_filter_chroma_8_one-side_shift_c: 93.6
vvc_v_loop_filter_chroma_8_one-side_shift_avx: 37.9
vvc_v_loop_filter_chroma_8_strong_no-shift_c: 78.9
vvc_v_loop_filter_chroma_8_strong_no-shift_avx: 42.1
vvc_v_loop_filter_chroma_8_strong_shift_c: 99.6
vvc_v_loop_filter_chroma_8_strong_shift_avx: 42.6
vvc_v_loop_filter_chroma_8_weak_no-shift_c: 54.6
vvc_v_loop_filter_chroma_8_weak_no-shift_avx: 27.6
vvc_v_loop_filter_chroma_8_weak_shift_c: 66.6
vvc_v_loop_filter_chroma_8_weak_shift_avx: 27.1
vvc_v_loop_filter_chroma_10_mix_no-shift_c: 60.6
vvc_v_loop_filter_chroma_10_mix_no-shift_avx: 49.4
vvc_v_loop_filter_chroma_10_mix_shift_c: 85.4
vvc_v_loop_filter_chroma_10_mix_shift_avx: 34.1
vvc_v_loop_filter_chroma_10_one-side_no-shift_c: 61.6
vvc_v_loop_filter_chroma_10_one-side_no-shift_avx: 45.4
vvc_v_loop_filter_chroma_10_one-side_shift_c: 106.6
vvc_v_loop_filter_chroma_10_one-side_shift_avx: 46.6
vvc_v_loop_filter_chroma_10_strong_no-shift_c: 45.1
vvc_v_loop_filter_chroma_10_strong_no-shift_avx: 50.6
vvc_v_loop_filter_chroma_10_strong_shift_c: 124.4
vvc_v_loop_filter_chroma_10_strong_shift_avx: 47.1
vvc_v_loop_filter_chroma_10_weak_no-shift_c: 47.1
vvc_v_loop_filter_chroma_10_weak_no-shift_avx: 37.4
vvc_v_loop_filter_chroma_10_weak_shift_c: 55.6
vvc_v_loop_filter_chroma_10_weak_shift_avx: 34.9
vvc_v_loop_filter_chroma_12_mix_no-shift_c: 66.9
vvc_v_loop_filter_chroma_12_mix_no-shift_avx: 48.1
vvc_v_loop_filter_chroma_12_mix_shift_c: 94.1
vvc_v_loop_filter_chroma_12_mix_shift_avx: 45.4
vvc_v_loop_filter_chroma_12_one-side_no-shift_c: 99.6
vvc_v_loop_filter_chroma_12_one-side_no-shift_avx: 47.6
vvc_v_loop_filter_chroma_12_one-side_shift_c: 104.1
vvc_v_loop_filter_chroma_12_one-side_shift_avx: 58.1
vvc_v_loop_filter_chroma_12_strong_no-shift_c: 65.9
vvc_v_loop_filter_chroma_12_strong_no-shift_avx: 49.1
vvc_v_loop_filter_chroma_12_strong_shift_c: 96.6
vvc_v_loop_filter_chroma_12_strong_shift_avx: 46.1
vvc_v_loop_filter_chroma_12_weak_no-shift_c: 48.1
vvc_v_loop_filter_chroma_12_weak_no-shift_avx: 34.9
vvc_v_loop_filter_chroma_12_weak_shift_c: 55.1
vvc_v_loop_filter_chroma_12_weak_shift_avx: 36.4
vvc_v_loop_filter_luma_8_skip_c: 51.6
vvc_v_loop_filter_luma_8_skip_avx: 13.6
vvc_v_loop_filter_luma_8_strong_c: 96.6
vvc_v_loop_filter_luma_8_strong_avx: 55.1
vvc_v_loop_filter_luma_8_weak_c: 51.6
vvc_v_loop_filter_luma_8_weak_avx: 14.6
vvc_v_loop_filter_luma_10_skip_c: 47.1
vvc_v_loop_filter_luma_10_skip_avx: 15.1
vvc_v_loop_filter_luma_10_strong_c: 89.6
vvc_v_loop_filter_luma_10_strong_avx: 57.6
vvc_v_loop_filter_luma_10_weak_c: 45.9
vvc_v_loop_filter_luma_10_weak_avx: 14.1
vvc_v_loop_filter_luma_12_skip_c: 47.4
vvc_v_loop_filter_luma_12_skip_avx: 15.1
vvc_v_loop_filter_luma_12_strong_c: 129.4
vvc_v_loop_filter_luma_12_strong_avx: 57.1
vvc_v_loop_filter_luma_12_weak_c: 79.9
vvc_v_loop_filter_luma_12_weak_avx: 15.6
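For readers who, like the reviewer below, need to look up "plt": PLT is VVC's palette mode, and when the block on one side of an edge is palette-coded, that side's samples must be left unmodified. That is what the no_p/no_q flags express. The following is a minimal scalar C sketch of that gating around the chroma weak filter, assuming the VVC chroma weak filter equation and 16-bit sample storage; `chroma_weak_filter` and its parameters are hypothetical names, not FFmpeg's actual API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp v to the inclusive range [lo, hi]. */
static int clip3(int lo, int hi, int v)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Hypothetical scalar sketch: VVC chroma weak filter at one edge position.
 * pix points at q0 and stride is the step from p0 to q0 (1 for a vertical
 * edge, the line stride for a horizontal edge), so
 * p1 = pix[-2*stride], p0 = pix[-stride], q0 = pix[0], q1 = pix[stride].
 * no_p / no_q are nonzero when that side is palette-coded (PLT); its
 * samples are then left untouched.
 */
static void chroma_weak_filter(uint16_t *pix, ptrdiff_t stride, int tc,
                               int bit_depth, int no_p, int no_q)
{
    assert(bit_depth >= 8 && bit_depth <= 12);
    const int max_val = (1 << bit_depth) - 1;
    const int p1 = pix[-2 * stride], p0 = pix[-stride];
    const int q0 = pix[0], q1 = pix[stride];

    /* delta = Clip3(-tc, tc, (((q0 - p0) << 2) + p1 - q1 + 4) >> 3).
     * The << 2 is written as * 4 to avoid shifting a negative value; the
     * >> 3 assumes an arithmetic right shift, as on mainstream compilers. */
    const int delta = clip3(-tc, tc, ((q0 - p0) * 4 + p1 - q1 + 4) >> 3);

    if (!no_p)
        pix[-stride] = (uint16_t)clip3(0, max_val, p0 + delta);
    if (!no_q)
        pix[0] = (uint16_t)clip3(0, max_val, q0 - delta);
}
```

For example, with p1 = p0 = 100, q0 = q1 = 120 and tc = 10, delta is 8, so p0 becomes 108 and q0 becomes 112; with no_p set, p0 stays at 100 while q0 is still filtered.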
mova m10, m1 ; save p2
mova m12, m2 ; save p1

paddw m14, m1, m2 ; p2 + p1
paddw m13, m3, m4 ; p0 + q0
paddw m13, m14 ; p2 + p1 + p0 + q0

cmp no_pq, 0
je .end_p_calcs
pand m11, [rsp + 16] ; which p
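The excerpt builds the partial sum p2 + p1 + p0 + q0 once in m13 so it can be reused. A scalar C sketch of why that pays off, assuming the VVC chroma strong filter taps for the case where both sides use three samples (transcribed from our reading of the H.266 spec, so treat them as an assumption); `strong_p_side` is a hypothetical helper, and only the p side is shown since the q side is symmetric.

```c
#include <assert.h>

/* Clamp v to the inclusive range [lo, hi]. */
static int clip3(int lo, int hi, int v)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Hypothetical sketch: p[0..3] = p0..p3, q[0..3] = q0..q3, with p0/q0
 * nearest the edge. Writes the filtered p0'..p2' into out[0..2].
 */
static void strong_p_side(const int p[4], const int q[4], int tc, int out[3])
{
    assert(tc >= 0);

    /* Shared partial sum, as computed into m13 in the asm: p2 + p1 + p0 + q0 */
    const int s = p[2] + p[1] + p[0] + q[0];

    /* p0' = Clip3(p0 - tc, p0 + tc, (p3 + p2 + p1 + 2*p0 + q0 + q1 + q2 + 4) >> 3) */
    out[0] = clip3(p[0] - tc, p[0] + tc, (p[3] + s + p[0] + q[1] + q[2] + 4) >> 3);
    /* p1' = Clip3(p1 - tc, p1 + tc, (2*p3 + p2 + 2*p1 + p0 + q0 + q1 + 4) >> 3) */
    out[1] = clip3(p[1] - tc, p[1] + tc, (2 * p[3] + s + p[1] + q[1] + 4) >> 3);
    /* p2' = Clip3(p2 - tc, p2 + tc, (3*p3 + 2*p2 + p1 + p0 + q0 + 4) >> 3) */
    out[2] = clip3(p[2] - tc, p[2] + tc, (3 * p[3] + s + p[2] + 4) >> 3);
}
```

Each output then needs only two or three extra additions on top of the shared sum, which is the same saving the paddw sequence gets in registers.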
Is it okay to remove no_p and no_q?
Just kidding, I see the commit message now. I'll have to look up PLT, but LGTM.
If palette coding is used, the deblocking filter will be disabled.
With the current code, we apply weak -> one-side -> strong filtering. It should be easy to add no_q when needed.
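SIMD code has no per-pixel branches, so "weak -> one-side -> strong" can be read as computing every candidate result and then overwriting lane by lane, with later filters winning. A minimal C emulation of that blend, assuming per-lane all-ones/all-zeros masks like those produced by SIMD compares; `select16`, `blend_lanes` and the mask names are hypothetical and do not come from the actual asm.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8 /* e.g. eight 16-bit samples in one 128-bit register */

/* Bitwise select in the pand/pandn/por style: a where the mask is all
 * ones, b where it is all zeros. */
static uint16_t select16(uint16_t mask, uint16_t a, uint16_t b)
{
    assert(mask == 0 || mask == 0xFFFF);
    return (uint16_t)((mask & a) | (~mask & b));
}

/* Hypothetical sketch: apply the weak, then one-side, then strong results,
 * so a later filter overrides an earlier one in any lane where its mask
 * is set; lanes with no mask set keep the original sample. */
static void blend_lanes(const uint16_t orig[LANES],
                        const uint16_t weak[LANES], const uint16_t weak_mask[LANES],
                        const uint16_t one_side[LANES], const uint16_t one_side_mask[LANES],
                        const uint16_t strong[LANES], const uint16_t strong_mask[LANES],
                        uint16_t out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        uint16_t v = orig[i];
        v = select16(weak_mask[i], weak[i], v);
        v = select16(one_side_mask[i], one_side[i], v);
        v = select16(strong_mask[i], strong[i], v);
        out[i] = v;
    }
}
```

Under this pattern, a palette (no_p/no_q) mask would slot in as one final select that restores orig in the affected lanes.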
Is there a way to run the fuzz regression tests locally?
The fuzz failures are not a problem: I merged new cases that the current deblock_asm branch does not fix yet.
Thank you for your great help with this! We still need to add the luma long filter code, and after that, I believe we’ll be ready to upstream. I’ll be dedicating more time to this in the coming weeks.
Oh yeah, I just assumed it was the same problem. I've switched to Windows, and it seems both the yasm and nasm builds are failing checkasm and conformance.
With this refactor, the chroma weak AVX version is about 1.5x faster than the C version, and it uses no stack.
See 31a0f2a.
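A speedup figure like this is the ratio of C time to AVX time. A minimal sketch that computes it for the chroma weak entries, using the timings from the checkasm log in the PR description (which may predate this commit); `bench_pair`, `speedup` and `print_speedups` are hypothetical helpers, not part of checkasm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One benchmark pair: C reference time and AVX time, in the units checkasm prints. */
struct bench_pair {
    const char *name;
    double c_time;
    double avx_time;
};

/* Speedup of AVX over C; values above 1.0 mean the AVX version is faster. */
static double speedup(double c_time, double avx_time)
{
    assert(avx_time > 0.0);
    return c_time / avx_time;
}

/* Chroma weak timings copied from the benchmark log in the PR description. */
static const struct bench_pair chroma_weak[] = {
    { "h_8_no-shift",  56.1, 19.6 }, { "h_8_shift",  68.1, 19.1 },
    { "h_10_no-shift", 46.6, 21.4 }, { "h_10_shift", 54.1, 20.9 },
    { "h_12_no-shift", 49.4, 18.1 }, { "h_12_shift", 53.6, 17.1 },
    { "v_8_no-shift",  54.6, 27.6 }, { "v_8_shift",  66.6, 27.1 },
    { "v_10_no-shift", 47.1, 37.4 }, { "v_10_shift", 55.6, 34.9 },
    { "v_12_no-shift", 48.1, 34.9 }, { "v_12_shift", 55.1, 36.4 },
};

/* Print one "name ratio" line per benchmark pair. */
static void print_speedups(void)
{
    for (size_t i = 0; i < sizeof(chroma_weak) / sizeof(chroma_weak[0]); i++)
        printf("%-14s %.2fx\n", chroma_weak[i].name,
               speedup(chroma_weak[i].c_time, chroma_weak[i].avx_time));
}
```

With these numbers, the ratios range from about 1.26x (v_10_no-shift) to about 3.57x (h_8_shift).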