Merge upstream into ROCM develop #54

Merged: 65 commits into develop, Jan 17, 2025

Merge upstream into ROCM develop #54

merged 65 commits into from
Jan 17, 2025

Conversation

qianfengz (Collaborator):

In order to make things easier when upstreaming changes from develop.

qianfengz and others added 30 commits September 10, 2024 08:43
* Building xformers using ck-tiled as default

* ensure ck_decoder does not dispatch

* Add disable_on_rocm on some test scripts

* Update to test_mem_eff_attention.py

* apply isort

* apply black

* fix flake8 suggestions

* add license headers and reapply black

* Tiny update to rocm_ci.yml

* Add conditional compiling for cuda-depending codes in ROCM

* Update to benchmark scripts

* Rename the one script file

* Revert "Add conditional compiling for cuda-depending codes in ROCM"

This reverts commit 12fb41c.

* Update to scripts

* Change and add readme for tests and benchmarks

* Remove the code for supporting old ck

* Remove old composable_kernel from submodule list

* Remove folder third_party/composable_kernel

* Rename the folder

* Remove unused script file

* apply black

* pacify mypy

* fix clang-format

* reapply black

* fix lints

* make test_splitk_reference run on cpu

* add ck modules to docs

* try fixing nvidia build by re-including sparse24 cpp folder into extension sources

* update cutlass to upstream commit

* update flash-attention to upstream commit

* simplify setup.py

* remove duplicate run_batched_infer_causalmask_attnbias_dispatched<f16, true, true, 128>

* add hip version and pytorch hip arch list to xformers build info

* fix build

* patch around the unhappy path in get_hip_version

* skip test_grad_checkpointing for triton_splitk since it doesn't have bwop

* re-enable test_mqa_forward since ck tiled is the current implementation

* make skip test_wrong_alignment more generic

* reapply black

* simplify test_decoder

* put python version check inside triton_splitk op

* fix logic

* cleanup python3.9 checks in tests

* cleanup test_attentions

* cleanup test_checkpoint as test running on cpu does not depend on gpu platform

* fix lints

* try fixing win build by conditional import of triton in triton op

* re-enable test_triton_layernorm as it passes

* re-enable test_triton_blocksparse as it passes

* cleanup test_sparse_tensors

* cleanup test_custom_ops

* reapply black

* cleanup test_core_attention

* benchmark ck ops on rocm only

* fix mypy

* fix lint: black

* fix lints: mypy

* split-k decoder: move all tunable parameters to the top of cpp file

* apply clang-format

* Rename HDim/headdim to MaxK/maxk

* Move some header files to ck examples for later reuse

* Replace the qs_ks_vs pipeline with the qr_ks_vs pipeline when HeadDim is 256, for better performance

* rm test_ck_7

* dump kernel resource usage to compilation logs similar to nv

* Add the C++ extension for the latest change of the ck_tile/dev fwd kernel (added dropout)

* Add the C++ extension to use the ck_tile/dev fmha bwd kernel

* Update to add dropout for fmha backward

* Update in attention.cpp to align efficient_attention_backward_ck interface parameters

* Enable BwdOp in ck.py

* Support grad_out having different strides than out

* Force seqstart_q/seqstart_k to be in device memory in ck.py

* Remove duplicated code in ck_tiled_fmha_grouped_forward.h/infer.h

* Use optimized async pipeline where 8x headdim length is assumed

* Fix in batched_infer

* Update to track ck_tile/opt_padding_fa_train_xformers branch

* Update rocm_ci.yml

configuring the self-hosted runner

* Update to use the newer FmhaFwdEpilogue

* Update rocm_ci.yml

add option to manually trigger workflow

* Update rocm_ci.yml

remove condition which skips ci unless github event contains string 'rocm'

* copy rocm_ci workflow from main branch

* Update rocm_ci.yml

Bump upload-artifact version

* Update to use the newer FmhaFwdEpilogue for grouped infer/forward

* Temporarily disable the use of the QRKSVSAsync() pipeline

* Update rocm_ci.yml

add a daily run

* Implement the ck_rand_uniform interface for generating random number tensor

* Add dropout to the infer path (needed by xformers test_dropout)

* Update to support test_dropout and test_dropout_backward tests

* Update the padding method in batched_backward.h

* Update the OGradDotO kernel padding method

* Change the backward padding checking condition

* Add batch_stride_lse/d parameters to adapt grouped mode forward/backward to [num_batches, H, MaxSeqlenQ] layout

* Fill the grad_bias in advance

* Add support for kHasBiasGrad as instance template

* Remove using hdim_stride_do in fmha backward

* Force kPadSeqLenQ/kPadSeqLenK to be true in batched-backward to save compiling time

* Fix missing passing of {philox_seed, philox_offset} in inference path

* Use SimplifiedGenericAttentionMask to replace GenericAttentionMask

* Shorten the instance file names

* Rename the template parameters

* Simplify the names of the dispatch class and interfaces

* Changes to reuse the kernel files under ck_tile examples/91_tile_program/fmha folder

* Update test_mem_eff_attention.py for test_dropout/test_dropout_backward/test_backward on rocm

* Tiny change to the philox_cuda_state input setting

* Allocate logsumexp to ensure aligned access by each thread-group

* Add checking of the query/key headdim size in attention_backward_generic

* Use ck_tile/opt_padding_fa_train_pr2 and synchronize the backward code with its changes

* Enable using async pipeline in the batched inference path for performance

* Re-organize cpp instances for calling fmha infer kernel

* Re-organize cpp instances for calling fmha forward kernel

* Re-organize cpp instances for calling fmha backward kernel

* Point the composable_kernel_tiled submodule to the ck_tile/opt_padding_fa_train branch

* Update to synchronize with the latest commits in ck_tile/opt_padding_fa_train

* update submodule to public

* Update to the criteria for padding seqlen_k in batched infer/forward

* Keep latest track of ck-tile commits

* Tiny fix to the decoder include

* Point the ck-tiled submodule to the ck_tile/opt_padding branch

* Enable some attn_bias types which were previously disabled by old-ck in ck.py

* Add script generate_instances.py which helps to generate instances

* Simplify logic for seqstart_q/k

Commit 566d26f put seqstart_k/q on the device, so the logic can be simplified here.

Upstream xformers doesn't have this optimization and copies seqstart_q/k on every iteration. We'd like this change to land and then be merged to upstream.

* Add Async pipeline to grouped mode inference path

* Use explicit true for the kPadSeqLenQ/kPadHeadDimQ/kPadHeadDimV templates for the Async pipeline

* Synchronize to the update of composable_kernel_tiled for better performance

* Update rocm_ci.yml - clean up dangling images after ci run

* Avoid an unused-const-variable warning

Our compiler errors on the unused-const-variable warning, so just fix it.

* Tiny change to the BlockTile/Shape setting overrides

* try to align fmha C++ extension to the ck_tile in ck develop branch

* Synchronize composable_kernel_tiled to latest ck develop

* Use FmhaFwdTilePartitioner_HBS only with seqlen_k padded cases

* Tiny fix/change to make test_forward/test_backward/test_dropout/test_dropout_backward_ck pass

* Fix compiling issue with regard to Invoker definitions in forward_decoder/forward_decoder_split operators

* Keep using -Woverloaded-virtual

* Fix clang-format for headers and cpp files

* Fix format in python scripts

* Add noqa: C801 for generate_instances.py

* Align dispatch_bw with main branch

* Align ops/fmha/common.py with main branch

* Synchronize third_party/composable_kernel_tiled to the latest ck_tile commits for better performance

* Relax the atol for test_forward and test_dropout due to the use of packed fp16_2_fp32 conversion in ck_tile

* Generate html report for tests run with rocm_ci.yml

* archive test results when tests have failed

* Always clean up dangling docker images in rocm_ci

* Bump python to 3.11 in rocm_ci.yml

* Disable flash attention tests rocm_ci.yml

Since the op is broken; TBD: either make the op work or disable it on ROCm.

* Try to fix rocm_ci.yml

Init must be called before activation

* try to fix rocm_ci.yml flow by overriding PATH

* Fix setup.py path in rocm_ci.yml

* cd to xformers dir before running install in rocm_ci.yml

* Use pip to install xformers in rocm_ci.yml

* Possibly fix python version resolution in rocm_ci.yml

* Set the correct path for pytest in rocm_ci.yml

* remove test_reference_splitk as it was moved to a different file during the first upstream

remove test_mqa_forward from develop, as the test fails in develop and doesn't run upstream

remove reference attention splitk from the test file; it exists in test_splitk_reference

sync test_mem_eff_attention with upstream

* make sure ck operators have a name to be visible in the dispatcher

* fix sm version checks to happen only on CUDA, not ROCm

* (2/n) fix sm version checks to happen only on CUDA, not ROCm

* Remove _check_large_shapes checking in fmha/ck.py (facebookresearch#1067)

* make xformers install editable to fix cpp extensions detection

* Update to use the improved fmha-bwd (compilation passes)

* Update to get 80% of the test_backward and test_dropout_backward_ck cases passing

* Replace the use of ConvertGradQ with torch tensor type conversion

* Change the tile settings for MaxK=32

* Fix padding setting bug in grouped_backward

* Change -DCK_FMHA_FWD_FAST_EXP2=1 to -DCK_TILE_FMHA_FWD_FAST_EXP2=1

* Point the composable_kernel_tiled submodule to ck_tile/fa_bwd_opt branch

* Disable flshattF and flshattB on ROCM

* Add -mllvm and -enable-post-misched=0 compiling options for ROCM on setup.py

* Disable flshattF and flshattB on ROCM

* Update to support separate grad_q_f32_strides due to the API change in the fmha_bwd_kernel

* Use old method for setting BlockDropout due to the revert in fmha_fwd_kernel

* Tiny fix in grouped_backward

* Use packed tensor allocation for grad_q_f32

* Update to the ConvertGradQ kernel calling

* Tiny update

* Fix the parameter location in grouped_backward

* Adjust headdim128 tile shapes for better performance

* Update backward kernel calling due to adding of nhead_stride_dk/nhead_stride_dv parameters

* Synchronize with CK to use separate pipelines for the kPadHeadDim true and false situations

* Use convertDQ kernel

* Update to use unpadded lse layout

* Add explicit headdim256 instances for fmha backward

* Add leaked headdim256 instance references

* Change generate.py and re-generate the instance files using it

* Change generate.py to generate instance references and use the generated reference headers

* Relax the RTOL of ckFwOp from 4e-4 to 3e-3 due to one case with large result values

* Change to use .h rather than .hpp as suffix for generated header files

* Fix in .gitignore

* Update to bwd setting to use only IGLP pipeline

* Synchronize to latest ck_tile fix and align the headdim64 tile shape setting

* Reformat the generated instances cpp files

* Fix to the backward Trait

* Set occupancy to -1 to avoid the compiling warning

* Revert "Set occupancy to -1 to avoid the compiling warning"

This reverts commit fa6d8b3.

* Add environment variable and compiler definition to control the generating of headdim256 instances

* Add --ignore-hd256 argument to generate_instance.py and some update in this script

* Add environment variable ENABLE_HIP_FMHA_RTN_BF16_CONVERT to enable using rtn bf16 conversion

* Remove commented lines in test_mem_eff_attention.py

* Synchronize to latest ck_tile commit

* apply black

* apply flake8

* fix mypy

* revert disable flash operator on rocm

* Synchronize to ck_tile latest commit again

* Re-position the composable_kernel submodule to the develop branch

* Avoid the Async pipeline when kHasBias is true

* clang-format for two files

* Change allocation of grouped mode lse from [H, M] to [1, H, M] to match the xformers scripts

* Change generate_instances.py so that the script can be called from any location

* Add manual for generate_instances.py (.md)

* Modification in GENERATE_INSTANCES.md

* Fix in GENERATE_INSTANCES.md

* Update GENERATE_INSTANCES.md

* Clean up commented code

* Revert "Change allocation of grouped mode lse from [H, M] to [1, H, M] to match the xformers scripts"

This reverts commit 7a91589.

* Synchronize to the latest ck develop to use the latest RTN bf16 conversion

* Add c++ extension compiling options for better performance on ROCM 6.2

* Use the same rocm_ci.yml as upstream

* Use the same ck.py as upstream

* Reformat setup.py

---------

Co-authored-by: Max Podkorytov <[email protected]>
Co-authored-by: carlushuang <[email protected]>
Co-authored-by: Xiaodong Wang <[email protected]>
ghstack-source-id: 6c3e9fe00b08d2157543daa1d3c2ceaf69cf3cc1
Pull Request resolved: fairinternal/xformers#1218

__original_commit__ = fairinternal/xformers@cf44fd3
ghstack-source-id: 3204616cecd297f658bda0cf31e82ba0edf491c6
Pull Request resolved: fairinternal/xformers#1215

__original_commit__ = fairinternal/xformers@adc8664
ghstack-source-id: dc28f4bb77751ffc4f8d67a5ba088b262830659c
Pull Request resolved: fairinternal/xformers#1217

__original_commit__ = fairinternal/xformers@eec3f5f
ghstack-source-id: 8bfbe7959ef67f4f01ab36ea20e12c9d89daf664
Pull Request resolved: fairinternal/xformers#1219

__original_commit__ = fairinternal/xformers@08743da
ghstack-source-id: 872cd3e587c1fd3a23e14a8682ba17f0274a0e6e
Pull Request resolved: fairinternal/xformers#1220

__original_commit__ = fairinternal/xformers@79d0b02
ghstack-source-id: 988dd690ff1011948d610a495eca9cc0b719a067
Pull Request resolved: fairinternal/xformers#1223

__original_commit__ = fairinternal/xformers@202cd4b
Fused seqpar used to have multiple staging buffers (and associated counters), each called a "stripe", and would cycle through them in successive operations.

We needed this because when the fused seqpar op returns, the staging buffer isn't immediately available for reuse: there's an asynchronous signalling kernel (write_values) in a background stream. This used to matter in benchmarks, where we run these ops back-to-back, to avoid an artificial delay due to each op waiting for the previous one's async kernel.

This feature was however disabled "in production" (since by default we have only one stripe) as the ops don't run back-to-back there.

Stripes are rather burdensome to handle inside CUDA graphs (which is what we'll enable in fairinternal/xformers#1014) because, instead of having the CPU pass different pointers to different kernels, we would need each GPU kernel to pick the pointers it wants to operate on from a tensor of pointers (rotating round-robin). This is doable-ish in custom kernels, but not in standard PyTorch ones.

Moreover, we need to make these async kernels synchronous anyway for the fused seqpar ops to be graphable (see fairinternal/xformers#1034), hence stripes won't provide any benefit anymore.

And removing stripes simplifies the code a bit too!

Truth be told, these write_values kernels weren't that long, so making them synchronous shouldn't hurt performance much. However, I'll later try to make these kernels a bit faster, such as in fairinternal/xformers#1036.
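Below is a minimal sketch (illustrative names, not the xformers code) of the round-robin stripe selection being removed; with the default of a single stripe the rotation is a no-op, so the extra buffers and counters only added complexity:

```python
class StagingStripes:
    """Hypothetical illustration of cycling through staging buffers."""

    def __init__(self, make_buffer, num_stripes: int = 1):
        self.buffers = [make_buffer() for _ in range(num_stripes)]
        self.next_stripe = 0

    def acquire(self):
        # Rotate through staging buffers so op N+1 does not have to wait for
        # op N's asynchronous write_values kernel to release its buffer.
        buf = self.buffers[self.next_stripe]
        self.next_stripe = (self.next_stripe + 1) % len(self.buffers)
        return buf
```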

ghstack-source-id: f8c1f5f0d39762a1ff0b964d18efd411ddca42ec
Pull Request resolved: fairinternal/xformers#1013

__original_commit__ = fairinternal/xformers@0be6c94
This is needed for capturing a CUDA graph (i.e., all work must be "joined back" into the capturing stream).

But it's also needed by the next commit in the stack to avoid a race condition between the seq-num increase kernel of the next iteration and the write_values kernel of the previous iteration.
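A minimal sketch of the "join back" pattern, with assumed names and a stand-in computation (not the actual xformers code):

```python
import torch

def run_with_side_stream(x: torch.Tensor) -> torch.Tensor:
    main = torch.cuda.current_stream()
    side = torch.cuda.Stream()
    side.wait_stream(main)   # the side stream observes prior work on the main stream
    with torch.cuda.stream(side):
        y = x * 2            # stand-in for the async write_values-style kernel
    main.wait_stream(side)   # join the side-stream work back into the main stream
    return y                 # safe for CUDA graph capture and for the next iteration
```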

ghstack-source-id: 4f8e9bfada42566b83b1325cb6dca0bd05374249
Pull Request resolved: fairinternal/xformers#1034

__original_commit__ = fairinternal/xformers@c57618e
ghstack-source-id: 985575fd3ca72ae6691318af80e13caed6aa2a25
Pull Request resolved: fairinternal/xformers#1224

__original_commit__ = fairinternal/xformers@5fbc8d4
ghstack-source-id: 94de742e80c9c6ca529e1cb89ec0f6101adfb1c0
Pull Request resolved: fairinternal/xformers#1035

__original_commit__ = fairinternal/xformers@97d178a
ghstack-source-id: 2eb41508d764a4db0e9ed83e16bb9924723202b1
Pull Request resolved: fairinternal/xformers#1225

__original_commit__ = fairinternal/xformers@48b7589
We had two bugs (deadlocks) due to incorrect updates to the shared counters that are used by processes to coordinate. This was because the same counter was updated at different times by different processes. This felt very error-prone (no clear owner for a counter) thus brittle.

I'm proposing here what I believe to be a clearer and more robust solution, which uses two clearly-defined counters:
- one which tracks the completion of full operations (waited on at the beginning, updated at the end)
- one which is updated mid-operation by the producer rank to signal to the consumer rank that it's ready (either it sent the data, if push-based, or it readied the data, if pull-based)

This protocol works! The tests all pass. Except... the ones based on Triton, because Triton expects the counters to be a contiguous tensor, and this PR breaks that. This can be fixed but it's too much effort for now. Hence I'm just parking this PR here as a reference for the future.
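A hedged sketch of the two-counter scheme, with illustrative names (not the actual xformers implementation); note that the two counters live in separate tensors here, which is exactly what breaks Triton's expectation of one contiguous counters tensor:

```python
import torch

class Handshake:
    def __init__(self, device: str = "cuda") -> None:
        # Tracks completion of full operations: waited on at the beginning of an
        # op, bumped only at the very end.
        self.ops_completed = torch.zeros((), dtype=torch.int64, device=device)
        # Bumped mid-operation by the producer rank to tell the consumer rank that
        # the data was sent (push-based) or readied (pull-based).
        self.data_ready = torch.zeros((), dtype=torch.int64, device=device)

    def wait_previous_op(self, expected: int) -> None:
        while int(self.ops_completed.item()) < expected:  # busy-wait for simplicity
            pass

    def producer_signal_ready(self) -> None:
        self.data_ready += 1      # only the producer writes this counter

    def finish_op(self) -> None:
        self.ops_completed += 1   # only updated at the end of the full operation
```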

ghstack-source-id: 112aa146768b4e0bd1c22afef23778f15eb29278
Pull Request resolved: fairinternal/xformers#989

__original_commit__ = fairinternal/xformers@1b62907
ghstack-source-id: f114488fe3e4b0b73df3740539220e2d61ee41ba
Pull Request resolved: fairinternal/xformers#1226

__original_commit__ = fairinternal/xformers@5ce29ac
* remove pre-dep torch

* fix lint

* isort

* silly mistake

---------

Co-authored-by: Leonid Shamis <[email protected]>

__original_commit__ = fairinternal/xformers@51111b0
ghstack-source-id: 8f2dbaabff2a5935c45b58bad4de3de803bf0fc1
Pull Request resolved: fairinternal/xformers#1238

__original_commit__ = fairinternal/xformers@f4cbd36
ghstack-source-id: c6675f5e50a7147e4c6f18a9ff2e64cecd795b18
Pull Request resolved: fairinternal/xformers#1239

__original_commit__ = fairinternal/xformers@9bed28e
…rnal/xformers#1229)

* sp24: restore __tensor_unflatten__
* sp24: Add tuning of cusparselt algorithm
* fix linter
* Address comments from lw

__original_commit__ = fairinternal/xformers@7b79453
This commit will be tagged as 0.0.28.post2, which will trigger CI jobs to build a new release which depends on the newly-released PyTorch 2.5.0.

ghstack-source-id: 2d08aac54157145d0fed78ddf4589379ea9e0d99
Pull Request resolved: fairinternal/xformers#1244

__original_commit__ = fairinternal/xformers@94d6d0e
It looks like the BlockDiagonalMask import was accidentally removed in commit 166fd2e.
This commit adds it back.
…facebookresearch#1127)

* [fix] Fix the activation checkpointing when using SwiGLUPackedFusedOp

According to the docs (https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function), the forward() method should not be called directly; the apply() method has to be used instead.
After removing the forward() call, activation checkpointing starts working.

* [fix] Fix the activation checkpointing when using SwiGLUPackedFusedOp

The IF conditional on the x.requires_grad state (to change the behavior between inference and training modes) changes the behavior of the recomputation of the forward() method, which breaks activation checkpointing
(since during the recomputation phase x is detached with requires_grad==False, and a different number of tensors is saved by save_for_backward()).

* [fix] Fix the activation checkpointing when using SwiGLUPackedFusedOp by removing the inference path.

The IF conditional on the x.requires_grad state changes the behavior of the recomputation of the forward() method, which breaks activation checkpointing
(since during the recomputation phase x is detached with requires_grad==False, and a different number of tensors is saved by save_for_backward()).
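A minimal sketch of the pattern behind this fix, using a hypothetical op rather than the real SwiGLUPackedFusedOp: .apply() routes the call through autograd so activation checkpointing can record and recompute it, while calling forward() directly bypasses that bookkeeping:

```python
import torch

class ScaleOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * 2.0

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors  # saved tensor kept only for illustration
        return grad_out * 2.0

x = torch.randn(4, requires_grad=True)
y = ScaleOp.apply(x)            # correct: goes through autograd
# y = ScaleOp.forward(None, x)  # wrong: bypasses autograd bookkeeping
```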

This commit will be tagged as 0.0.28.post3, which will trigger CI jobs to build a new release which depends on the newly-released PyTorch 2.5.1.

__original_commit__ = fairinternal/xformers@76e48ce
ghstack-source-id: 340763fa2a5d7db2545d403e64bd4f74b529301f
Pull Request resolved: fairinternal/xformers#1251

__original_commit__ = fairinternal/xformers@4987b46
ghstack-source-id: ae3c4358b3ae51c5e781ac79641cfe92a49fa542
Pull Request resolved: fairinternal/xformers#1250

__original_commit__ = fairinternal/xformers@f9e77b4
bottler and others added 28 commits December 6, 2024 14:54
ghstack-source-id: cb18cfeb9c8f2371e521220d8ff9de161f4e9b91
Pull Request resolved: fairinternal/xformers#1262

__original_commit__ = fairinternal/xformers@9c2f100
ghstack-source-id: d2d359afaa683495df4b838dae194767993a850a
Pull Request resolved: fairinternal/xformers#1263

__original_commit__ = fairinternal/xformers@3e3c06c
I had to run this on a job with 256 ranks and found it too slow. Thus I modified it to add some parallelism, leverage pandas instead of manual loops, and use jq to filter and convert JSON to CSV.
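A hedged sketch of the pandas-based approach; the file and column names here are made up for illustration, not taken from the actual script:

```python
import pandas as pd

# Load per-rank JSON records into a DataFrame and aggregate with vectorized
# operations instead of Python loops over 256 ranks.
df = pd.read_json("per_rank_stats.json")                         # hypothetical input
summary = df.groupby("rank")["duration_ms"].agg(["mean", "max"])
summary.to_csv("per_rank_summary.csv")
```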

ghstack-source-id: e10e40293afeee2f64ebfaadfbe11dda0cbdd01e
Pull Request resolved: fairinternal/xformers#1267

__original_commit__ = fairinternal/xformers@1899cdf
…l/xformers#1266)

* Introduce "conditional" version of unroll_varargs

This PR contains another implementation of "unroll_varargs" that instead replaces indexing into the varargs argument with a big conditional block. This results in smaller code size and better performance
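A hedged sketch of the idea (not the actual xformers implementation): the generated helper resolves each index with a conditional chain instead of unrolling a full copy of the body per vararg index, which keeps the generated source small:

```python
def make_conditional_picker(num_args: int) -> str:
    """Generate source for a hypothetical picker that avoids dynamic indexing."""
    lines = ["def pick(args, i):"]
    for k in range(num_args):
        kw = "if" if k == 0 else "elif"
        lines.append(f"    {kw} i == {k}:")
        lines.append(f"        return args[{k}]")
    lines.append("    raise IndexError(i)")
    return "\n".join(lines)

# e.g. exec(make_conditional_picker(4)) defines pick(args, i) with a 4-way branch.
```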

__original_commit__ = fairinternal/xformers@399a352
* Remove most of deprecated components

* Forgot test

* More fixes

__original_commit__ = fairinternal/xformers@5ec57d1
* Bump FA to 2.7.2

* Update flash.py

__original_commit__ = fairinternal/xformers@6d989fa
…earch#1166)

Test was passing when it should have failed if the output of swiglu_packedw was corrupted.
 - Instead of comparing two custom implementations of SwiGLU with each other, _eager_functional_swiglu is now used as the reference implementation.
 - Changed the RTOL tolerance for the fp16 backward test from 1e-2 to 3e-2 to make 2 failing tests pass (out of 800 passing).
 - Fixed the _eager_functional_swiglu type for mypy.
…tK (facebookresearch#1181)

Summary:
Triton 3.2 made some changes to its interpretation of constants (triton-lang/triton#4613), which make Triton more consistent with PyTorch/NumPy but cause some surprising issues with this kernel. Specifically, it seems like log2e is interpreted as float32 in one instance and float64 in another, which leads to reduced prediction accuracy in some cases.

To prevent this, let's make log2e a constant and define it as float32.
Probably we should have a test somewhere, but it becomes hard to cover all combinations of {flash version} x {GPU} x {PyTorch version} x {built-in Flash / PT flash / third-party flash}.
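An illustrative sketch of the fix described above (not the actual xformers kernel): materialize log2(e) as an explicit float32 value so newer Triton does not treat it as float64 in part of the computation:

```python
import triton
import triton.language as tl

@triton.jit
def exp_via_exp2_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    log2e = tl.full((1,), 1.4426950408889634, dtype=tl.float32)  # pinned to fp32
    y = tl.exp2(x * log2e)   # exp(x) computed as exp2(x * log2(e))
    tl.store(out_ptr + offs, y, mask=mask)
```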

__original_commit__ = fairinternal/xformers@9c39da4
ghstack-source-id: 5eb3152cb965ccb4a6227e50f026539ca5bccd76
Pull Request resolved: fairinternal/xformers#1281

__original_commit__ = fairinternal/xformers@02f4d4b
@qianfengz qianfengz merged commit 9045af7 into develop Jan 17, 2025
2 of 6 checks passed
@qianfengz qianfengz deleted the merge_upstream branch January 17, 2025 10:43