Deblock chroma: remove stack #278

Open
wants to merge 6 commits into
base: deblock_asm
from

Conversation

nuomi2021
Member

With this refactor, the chroma weak AVX version is 1.5x faster than the C version, and it uses no stack.
See 31a0f2a.

checkasm: using random seed 1749355096
checkasm: bench runs 1024 (1 << 10)
AVX:
 - vvc_deblock.luma   [OK]
 - vvc_deblock.chroma [OK]
checkasm: all 66 tests passed
benchmarking with native FFmpeg timers
nop: 28.3
vvc_h_loop_filter_chroma_8_mix_no-shift_c: 69.9
vvc_h_loop_filter_chroma_8_mix_no-shift_avx: 20.1
vvc_h_loop_filter_chroma_8_mix_shift_c: 85.4
vvc_h_loop_filter_chroma_8_mix_shift_avx: 44.9
vvc_h_loop_filter_chroma_8_one-side_no-shift_c: 92.1
vvc_h_loop_filter_chroma_8_one-side_no-shift_avx: 31.9
vvc_h_loop_filter_chroma_8_one-side_shift_c: 100.1
vvc_h_loop_filter_chroma_8_one-side_shift_avx: 27.9
vvc_h_loop_filter_chroma_8_strong_no-shift_c: 79.1
vvc_h_loop_filter_chroma_8_strong_no-shift_avx: 30.9
vvc_h_loop_filter_chroma_8_strong_shift_c: 108.1
vvc_h_loop_filter_chroma_8_strong_shift_avx: 29.4
vvc_h_loop_filter_chroma_8_weak_no-shift_c: 56.1
vvc_h_loop_filter_chroma_8_weak_no-shift_avx: 19.6
vvc_h_loop_filter_chroma_8_weak_shift_c: 68.1
vvc_h_loop_filter_chroma_8_weak_shift_avx: 19.1
vvc_h_loop_filter_chroma_10_mix_no-shift_c: 61.6
vvc_h_loop_filter_chroma_10_mix_no-shift_avx: 33.9
vvc_h_loop_filter_chroma_10_mix_shift_c: 108.1
vvc_h_loop_filter_chroma_10_mix_shift_avx: 26.4
vvc_h_loop_filter_chroma_10_one-side_no-shift_c: 70.6
vvc_h_loop_filter_chroma_10_one-side_no-shift_avx: 28.9
vvc_h_loop_filter_chroma_10_one-side_shift_c: 87.9
vvc_h_loop_filter_chroma_10_one-side_shift_avx: 27.4
vvc_h_loop_filter_chroma_10_strong_no-shift_c: 68.6
vvc_h_loop_filter_chroma_10_strong_no-shift_avx: 27.1
vvc_h_loop_filter_chroma_10_strong_shift_c: 97.6
vvc_h_loop_filter_chroma_10_strong_shift_avx: 27.1
vvc_h_loop_filter_chroma_10_weak_no-shift_c: 46.6
vvc_h_loop_filter_chroma_10_weak_no-shift_avx: 21.4
vvc_h_loop_filter_chroma_10_weak_shift_c: 54.1
vvc_h_loop_filter_chroma_10_weak_shift_avx: 20.9
vvc_h_loop_filter_chroma_12_mix_no-shift_c: 63.1
vvc_h_loop_filter_chroma_12_mix_no-shift_avx: 28.4
vvc_h_loop_filter_chroma_12_mix_shift_c: 61.6
vvc_h_loop_filter_chroma_12_mix_shift_avx: 27.4
vvc_h_loop_filter_chroma_12_one-side_no-shift_c: 64.9
vvc_h_loop_filter_chroma_12_one-side_no-shift_avx: 26.6
vvc_h_loop_filter_chroma_12_one-side_shift_c: 82.9
vvc_h_loop_filter_chroma_12_one-side_shift_avx: 40.9
vvc_h_loop_filter_chroma_12_strong_no-shift_c: 117.1
vvc_h_loop_filter_chroma_12_strong_no-shift_avx: 28.1
vvc_h_loop_filter_chroma_12_strong_shift_c: 125.4
vvc_h_loop_filter_chroma_12_strong_shift_avx: 46.6
vvc_h_loop_filter_chroma_12_weak_no-shift_c: 49.4
vvc_h_loop_filter_chroma_12_weak_no-shift_avx: 18.1
vvc_h_loop_filter_chroma_12_weak_shift_c: 53.6
vvc_h_loop_filter_chroma_12_weak_shift_avx: 17.1
vvc_h_loop_filter_luma_8_skip_c: 38.4
vvc_h_loop_filter_luma_8_skip_avx: 10.4
vvc_h_loop_filter_luma_8_strong_c: 142.6
vvc_h_loop_filter_luma_8_strong_avx: 41.4
vvc_h_loop_filter_luma_8_weak_c: 86.9
vvc_h_loop_filter_luma_8_weak_avx: 10.6
vvc_h_loop_filter_luma_10_skip_c: 46.4
vvc_h_loop_filter_luma_10_skip_avx: 9.1
vvc_h_loop_filter_luma_10_strong_c: 86.4
vvc_h_loop_filter_luma_10_strong_avx: 39.9
vvc_h_loop_filter_luma_10_weak_c: 65.1
vvc_h_loop_filter_luma_10_weak_avx: 9.6
vvc_h_loop_filter_luma_12_skip_c: 47.4
vvc_h_loop_filter_luma_12_skip_avx: 9.4
vvc_h_loop_filter_luma_12_strong_c: 126.9
vvc_h_loop_filter_luma_12_strong_avx: 39.4
vvc_h_loop_filter_luma_12_weak_c: 46.4
vvc_h_loop_filter_luma_12_weak_avx: 39.9
vvc_v_loop_filter_chroma_8_mix_no-shift_c: 64.6
vvc_v_loop_filter_chroma_8_mix_no-shift_avx: 41.9
vvc_v_loop_filter_chroma_8_mix_shift_c: 86.4
vvc_v_loop_filter_chroma_8_mix_shift_avx: 40.6
vvc_v_loop_filter_chroma_8_one-side_no-shift_c: 70.9
vvc_v_loop_filter_chroma_8_one-side_no-shift_avx: 40.9
vvc_v_loop_filter_chroma_8_one-side_shift_c: 93.6
vvc_v_loop_filter_chroma_8_one-side_shift_avx: 37.9
vvc_v_loop_filter_chroma_8_strong_no-shift_c: 78.9
vvc_v_loop_filter_chroma_8_strong_no-shift_avx: 42.1
vvc_v_loop_filter_chroma_8_strong_shift_c: 99.6
vvc_v_loop_filter_chroma_8_strong_shift_avx: 42.6
vvc_v_loop_filter_chroma_8_weak_no-shift_c: 54.6
vvc_v_loop_filter_chroma_8_weak_no-shift_avx: 27.6
vvc_v_loop_filter_chroma_8_weak_shift_c: 66.6
vvc_v_loop_filter_chroma_8_weak_shift_avx: 27.1
vvc_v_loop_filter_chroma_10_mix_no-shift_c: 60.6
vvc_v_loop_filter_chroma_10_mix_no-shift_avx: 49.4
vvc_v_loop_filter_chroma_10_mix_shift_c: 85.4
vvc_v_loop_filter_chroma_10_mix_shift_avx: 34.1
vvc_v_loop_filter_chroma_10_one-side_no-shift_c: 61.6
vvc_v_loop_filter_chroma_10_one-side_no-shift_avx: 45.4
vvc_v_loop_filter_chroma_10_one-side_shift_c: 106.6
vvc_v_loop_filter_chroma_10_one-side_shift_avx: 46.6
vvc_v_loop_filter_chroma_10_strong_no-shift_c: 45.1
vvc_v_loop_filter_chroma_10_strong_no-shift_avx: 50.6
vvc_v_loop_filter_chroma_10_strong_shift_c: 124.4
vvc_v_loop_filter_chroma_10_strong_shift_avx: 47.1
vvc_v_loop_filter_chroma_10_weak_no-shift_c: 47.1
vvc_v_loop_filter_chroma_10_weak_no-shift_avx: 37.4
vvc_v_loop_filter_chroma_10_weak_shift_c: 55.6
vvc_v_loop_filter_chroma_10_weak_shift_avx: 34.9
vvc_v_loop_filter_chroma_12_mix_no-shift_c: 66.9
vvc_v_loop_filter_chroma_12_mix_no-shift_avx: 48.1
vvc_v_loop_filter_chroma_12_mix_shift_c: 94.1
vvc_v_loop_filter_chroma_12_mix_shift_avx: 45.4
vvc_v_loop_filter_chroma_12_one-side_no-shift_c: 99.6
vvc_v_loop_filter_chroma_12_one-side_no-shift_avx: 47.6
vvc_v_loop_filter_chroma_12_one-side_shift_c: 104.1
vvc_v_loop_filter_chroma_12_one-side_shift_avx: 58.1
vvc_v_loop_filter_chroma_12_strong_no-shift_c: 65.9
vvc_v_loop_filter_chroma_12_strong_no-shift_avx: 49.1
vvc_v_loop_filter_chroma_12_strong_shift_c: 96.6
vvc_v_loop_filter_chroma_12_strong_shift_avx: 46.1
vvc_v_loop_filter_chroma_12_weak_no-shift_c: 48.1
vvc_v_loop_filter_chroma_12_weak_no-shift_avx: 34.9
vvc_v_loop_filter_chroma_12_weak_shift_c: 55.1
vvc_v_loop_filter_chroma_12_weak_shift_avx: 36.4
vvc_v_loop_filter_luma_8_skip_c: 51.6
vvc_v_loop_filter_luma_8_skip_avx: 13.6
vvc_v_loop_filter_luma_8_strong_c: 96.6
vvc_v_loop_filter_luma_8_strong_avx: 55.1
vvc_v_loop_filter_luma_8_weak_c: 51.6
vvc_v_loop_filter_luma_8_weak_avx: 14.6
vvc_v_loop_filter_luma_10_skip_c: 47.1
vvc_v_loop_filter_luma_10_skip_avx: 15.1
vvc_v_loop_filter_luma_10_strong_c: 89.6
vvc_v_loop_filter_luma_10_strong_avx: 57.6
vvc_v_loop_filter_luma_10_weak_c: 45.9
vvc_v_loop_filter_luma_10_weak_avx: 14.1
vvc_v_loop_filter_luma_12_skip_c: 47.4
vvc_v_loop_filter_luma_12_skip_avx: 15.1
vvc_v_loop_filter_luma_12_strong_c: 129.4
vvc_v_loop_filter_luma_12_strong_avx: 57.1
vvc_v_loop_filter_luma_12_weak_c: 79.9
vvc_v_loop_filter_luma_12_weak_avx: 15.6
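
To compare the C and AVX rows above, one option is to divide the C timing by the AVX timing for each pair. The helper below is a minimal sketch of that arithmetic; the names `speedup` and `print_speedup` are hypothetical and not part of checkasm, and the timings are taken as printed without any further correction.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of the C reference timing to the SIMD timing.
 * A value above 1.0 means the SIMD version is faster. */
static double speedup(double c_time, double simd_time)
{
    assert(simd_time > 0.0); /* guard against division by zero */
    return c_time / simd_time;
}

/* Hypothetical helper: print one benchmark pair as a ratio. */
static void print_speedup(const char *name, double c_time, double simd_time)
{
    printf("%-45s %.2fx\n", name, speedup(c_time, simd_time));
}
```

For example, `print_speedup("vvc_h_loop_filter_chroma_8_weak_no-shift", 56.1, 19.6)` reports the ratio for the first weak chroma pair in the log. Single runs are noisy, so individual ratios are best read as rough indications.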
mova m10, m1 ; save p2
mova m12, m2 ; save p1

paddw m14, m1, m2 ; p2 + p1
paddw m13, m3, m4 ; p0 + q0
paddw m13, m14 ; p2 + p1 + p0 + q0

cmp no_pq, 0
je .end_p_calcs
pand m11, [rsp + 16] ; which p
Collaborator

Is it okay to remove no_p no_q?

Collaborator

Just kidding, I see the commit message now. I'll have to look up plt, but LGTM.

Member Author

If palette coding is used, the deblocking filter will be disabled.
With the current code, we apply weak -> one-side -> strong filtering. It should be easy to add no_q when needed.
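
One possible reading of the weak -> one-side -> strong order is a branch-free cascade: each filter result overwrites the previous one wherever its per-lane mask is set. The scalar C sketch below only illustrates that selection pattern for a single 16-bit lane. It is not the code in this PR; `blend`, `select_filtered`, and all mask names are hypothetical, and the filtered values are taken as inputs rather than computed.

```c
#include <assert.h>
#include <stdint.h>

/* Per-lane select: take `a` where mask is all ones, `b` where it is zero.
 * This is the scalar equivalent of a pand/pandn/por sequence. */
static uint16_t blend(uint16_t a, uint16_t b, uint16_t mask)
{
    return (uint16_t)((a & mask) | (b & (uint16_t)~mask));
}

/* Hypothetical cascade: start from the unfiltered sample, then let the
 * weak, one-side, and strong results overwrite it in that order.
 * A final no_filter_mask (the role a no_p/no_q mask would play, by
 * assumption) restores the original sample. */
static uint16_t select_filtered(uint16_t orig, uint16_t weak,
                                uint16_t one_side, uint16_t strong,
                                uint16_t weak_mask, uint16_t one_side_mask,
                                uint16_t strong_mask, uint16_t no_filter_mask)
{
    uint16_t out = blend(weak, orig, weak_mask);
    out = blend(one_side, out, one_side_mask);
    out = blend(strong, out, strong_mask);
    return blend(orig, out, no_filter_mask);
}
```

With this shape, restoring unfiltered samples on one side is one extra blend at the end, which is the sense in which adding a no_q mask later would be a small change under these assumptions.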

@stone-d-chen
Collaborator

stone-d-chen commented Feb 9, 2025

Is there a way to run the fuzz regression tests locally?
Otherwise this is much cleaner than my implementation, thank you for your refactor!

@nuomi2021
Member Author

> Is there a way to run the fuzz regression tests locally?

Fuzz is not a problem. I merged new cases that are not fixed by the current deblock_asm branch.
But the Windows failure may be an issue. I need to check why.

> Otherwise this is much cleaner than my implementation, thank you for your refactor!

Thank you for your great help with this! We still need to add the luma long filter code, and after that, I believe we’ll be ready to upstream. I’ll be dedicating more time to this in the coming weeks.

@stone-d-chen
Collaborator

stone-d-chen commented Feb 10, 2025

> > Is there a way to run the fuzz regression tests locally?
>
> Fuzz is not a problem. I merged new cases that are not fixed by the current deblock_asm branch. But the Windows failure may be an issue. I need to check why.

Oh yeah, I just assumed it was the same problem. I've switched to Windows, and it seems both the yasm and nasm builds are failing checkasm and conformance.
