Merge upstream into ROCM develop #54

Merged: 65 commits into develop, Jan 17, 2025

Merge upstream into ROCM develop #54

merged 65 commits into from
Jan 17, 2025

Conversation

qianfengz (Collaborator):

In order to make things easier when upstreaming changes from develop.

qianfengz and others added 30 commits September 10, 2024 08:43
* Building xformers using ck-tiled as default

* ensure ck_decoder does not dispatch

* Add disable_on_rocm on some test scripts

* Update to test_mem_eff_attention.py

* apply isort

* apply black

* fix flake8 suggestions

* add license headers and reapply black

* Tiny update to rocm_ci.yml

* Add conditional compiling for cuda-depending codes in ROCM

* Update to benchmark scripts

* Rename the one script file

* Revert "Add conditional compiling for cuda-depending codes in ROCM"

This reverts commit 12fb41c.

* Update to scripts

* Change and add readme for tests and benchmarks

* Remove the code for supporting old ck

* Remove old composable_kernel from submodule list

* Remove folder third_party/composable_kernel

* Rename the folder

* Remove unused script file

* apply black

* pacify mypy

* fix clang-format

* reapply black

* fix lints

* make test_splitk_reference run on cpu

* add ck modules to docs

* try fixing nvidia build by re-including sparse24 cpp folder into extension sources

* update cutlass to upstream commit

* update flash-attention to upstream commit

* simplify setup.py

* remove duplicate run_batched_infer_causalmask_attnbias_dispatched<f16, true, true, 128>

* add hip version and pytorch hip arch list to xformers build info

* fix build

* patch around the unhappy path in get_hip_version

* skip test_grad_checkpointing for triton_splitk since it doesn't have bwop

* re-enable test_mqa_forward since ck tiled is the current implementation

* make skip test_wrong_alignment more generic

* reapply black

* simplify test_decoder

* put python version check inside triton_splitk op

* fix logic

* cleanup python3.9 checks in tests

* cleanup test_attentions

* cleanup test_checkpoint as test running on cpu does not depend on gpu platform

* fix lints

* try fixing win build by conditional import of triton in triton op

* re-enable test_triton_layernorm as it passes

* re-enable test_triton_blocksparse as it passes

* cleanup test_sparse_tensors

* cleanup test_custom_ops

* reapply black

* cleanup test_core_attention

* benchmark ck ops on rocm only

* fix mypy

* fix lint: black

* fix lints: mypy

* split-k decoder: move all tunable parameters to the top of cpp file

* apply clang-format

* Rename HDim/headdim to MaxK/maxk

* Move some header files to ck examples for later reuse

* Replace the qs_ks_vs pipeline with the qr_ks_vs pipeline when HeadDim is 256, for better performance

* rm test_ck_7

* dump kernel resource usage to compilation logs similar to nv

* Add the C++ extension for the latest change of the ck_tile/dev fwd kernel (added dropout)

* Add the C++ extension to use the ck_tile/dev fmha bwd kernel

* Update to add dropout for fmha backward

* Update in attention.cpp to align efficient_attention_backward_ck interface parameters

* Enable BwdOp in ck.py

* Support grad_out having different strides than out

* Force seqstart_q/seqstart_k to be in device memory in ck.py

* Remove duplicated code in ck_tiled_fmha_grouped_forward.h/infer.h

* Use optimized async pipeline where 8x headdim length is assumed

* Fix in batched_infer

* Update to track ck_tile/opt_padding_fa_train_xformers branch

* Update rocm_ci.yml

configuring the self-hosted runner

* Update to use the newer FmhaFwdEpilogue

* Update rocm_ci.yml

add option to manually trigger workflow

* Update rocm_ci.yml

remove condition which skips ci unless github event contains string 'rocm'

* copy rocm_ci workflow from main branch

* Update rocm_ci.yml

Bump upload-artifact version

* Update to use the newer FmhaFwdEpilogue for grouped infer/forward

* Temporarily disable the use of the QRKSVSAsync() pipeline

* Update rocm_ci.yml

add a daily run

* Implement the ck_rand_uniform interface for generating random number tensor

* Add dropout to the infer path (needed by xformers test_dropout)

* Update to support test_dropout and test_dropout_backward tests

* Update the padding method in batched_backward.h

* Update the OGradDotO kernel padding method

* Change the backward padding checking condition

* Add batch_stride_lse/d parameters to adapt grouped mode forward/backward to [num_batches, H, MaxSeqlenQ] layout

* Fill the grad_bias in advance

* Add support for kHasBiasGrad as instance template

* Remove using hdim_stride_do in fmha backward

* Force kPadSeqLenQ/kPadSeqLenK to be true in batched-backward to save compiling time

* Fix missing passing of {philox_seed, philox_offset} in inference path

* Use SimplifiedGenericAttentionMask to replace GenericAttentionMask

* Shorten the instance file names

* Rename the template parameters

* Simplify the names of the dispatch class and interfaces

* Changes to reuse the kernel files under ck_tile examples/91_tile_program/fmha folder

* Update test_mem_eff_attention.py for test_dropout/test_dropout_backward/test_backward on rocm

* Tiny change to the philox_cuda_state input setting

* Allocate logsumexp to ensure aligned access by each thread-group

* Add checking of the query/key headdim size in attention_backward_generic

* Use ck_tile/opt_padding_fa_train_pr2 and synchronize the backward code with its changes

* Enable using async pipeline in the batched inference path for performance

* Re-organize cpp instances for calling fmha infer kernel

* Re-organize cpp instances for calling fmha forward kernel

* Re-organize cpp instances for calling fmha backward kernel

* Point the composable_kernel_tiled submodule to the ck_tile/opt_padding_fa_train branch

* Update to synchronize with the latest commits in ck_tile/opt_padding_fa_train

* update submodule to public

* Update to the criteria for padding seqlen_k in batched infer/forward

* Keep latest track of ck-tile commits

* Tiny fix to the decoder include

* Point the ck-tiled submodule to the ck_tile/opt_padding branch

* Enable some attn_bias types which were previously disabled by old-ck in ck.py

* Add script generate_instances.py which helps to generate instances

* Simplify logic for seqstart_q/k

Commit 566d26f put seqstart_k/q on the device, so the logic can be simplified here.

Upstream xformers doesn't have this optimization and copies seqstart_q/k on every iteration. We'd like this change to land and then be merged to upstream.

* Add Async pipeline to grouped mode inference path

* Use explicit true for the kPadSeqLenQ/kPadHeadDimQ/kPadHeadDimV templates for the Async pipeline

* Synchronize to the update of composable_kernel_tiled for better performance

* Update rocm_ci.yml - clean up dangling images after ci run

* Avoid an unused-const-variable warning

Our compiler errors on the unused-const-variable warning, so just fix it.

* Tiny change to the BlockTile/Shape setting overrides

* try to align fmha C++ extension to the ck_tile in ck develop branch

* Synchronize composable_kernel_tiled to latest ck develop

* Use FmhaFwdTilePartitioner_HBS only with seqlen_k padded cases

* Tiny fix/change to make test_forward/test_backward/test_dropout/test_dropout_backward_ck pass

* Fix compiling issue with regard to Invoker definitions in forward_decoder/forward_decoder_split operators

* Keep using -Woverloaded-virtual

* Fix clang-format for headers and cpp files

* Fix format in python scripts

* Add noqa: C801 for generate_instances.py

* Align dispatch_bw with main branch

* Align ops/fmha/common.py with main branch

* Synchronize third_party/composable_kernel_tiled to the latest ck_tile commits for better performance

* Relax the atol for test_forward and test_dropout due to the use of packed fp16_2_fp32 conversion in ck_tile

* Generate html report for tests run with rocm_ci.yml

* archive test results when tests have failed

* Always clean up dangling docker images in rocm_ci

* Bump python to 3.11 in rocm_ci.yml

* Disable flash attention tests rocm_ci.yml

Since the op is broken; TBD: either make the op work or disable it on ROCm.

* Try to fix rocm_ci.yml

Init must be called before activation

* try to fix rocm_ci.yml flow by overriding PATH

* Fix setup.py path in rocm_ci.yml

* cd to xformers dir before running install in rocm_ci.yml

* Use pip to install xformers in rocm_ci.yml

* Possibly fix python version resolution in rocm_ci.yml

* Set the correct path for pytest in rocm_ci.yml

* remove test_reference_splitk as it was moved to a different file during the first upstream

remove test_mqa_forward from develop, as the test fails in develop and doesn't run upstream

remove reference attention splitk from the test file; it exists in test_splitk_reference

sync test_mem_eff_attention with upstream

* make sure ck operators have a name to be visible in the dispatcher

* fix sm version checks to happen only on CUDA, not ROCm

* (2/n) fix sm version checks to happen only on CUDA, not ROCm

* Remove _check_large_shapes checking in fmha/ck.py (facebookresearch#1067)

* make xformers install editable to fix cpp extensions detection

* Update to use the improved fmha-bwd (compilation passes)

* Update to get 80% of the test_backward and test_dropout_backward_ck cases passing

* Replace the use of ConvertGradQ with torch tensor type conversion

* Change the tile settings for MaxK=32

* Fix padding setting bug in grouped_backward

* Change -DCK_FMHA_FWD_FAST_EXP2=1 to -DCK_TILE_FMHA_FWD_FAST_EXP2=1

* Point the composable_kernel_tiled submodule to ck_tile/fa_bwd_opt branch

* Disable flshattF and flshattB on ROCM

* Add -mllvm and -enable-post-misched=0 compiling options for ROCM on setup.py

* Disable flshattF and flshattB on ROCM

* Update to support separate grad_q_f32_strides due to the API change in the fmha_bwd_kernel

* Use old method for setting BlockDropout due to the revert in fmha_fwd_kernel

* Tiny fix in grouped_backward

* Use packed tensor allocation for grad_q_f32

* Update to the ConvertGradQ kernel calling

* Tiny update

* Fix the parameter location in grouped_backward

* Adjust headdim128 tile shapes for better performance

* Update backward kernel calling due to adding of nhead_stride_dk/nhead_stride_dv parameters

* Synchronize with CK to use separate pipelines for the kPadHeadDim true and false situations

* Use convertDQ kernel

* Update to use unpadded lse layout

* Add explicit headdim256 instances for fmha backward

* Add leaked headdim256 instance references

* Change generate.py and re-generate the instance files using it

* Change generate.py to generate instance references and use the generated reference headers

* Relax the RTOL of ckFwOp from 4e-4 to 3e-3 due to one case with large result values

* Change to use .h rather than .hpp as suffix for generated header files

* Fix in .gitignore

* Update to bwd setting to use only IGLP pipeline

* Synchronize to latest ck_tile fix and align the headdim64 tile shape setting

* Reformat the generated instances cpp files

* Fix to the backward Trait

* Set occupancy to -1 to avoid the compiling warning

* Revert "Set occupancy to -1 to avoid the compiling warning"

This reverts commit fa6d8b3.

* Add environment variable and compiler definition to control the generating of headdim256 instances

* Add --ignore-hd256 argument to generate_instance.py and some update in this script

* Add environment variable ENABLE_HIP_FMHA_RTN_BF16_CONVERT to enable using rtn bf16 conversion

* Remove commented lines in test_mem_eff_attention.py

* Synchronize to latest ck_tile commit

* apply black

* apply flake8

* fix mypy

* revert disable flash operator on rocm

* Synchronize to ck_tile latest commit again

* Re-position the composable_kernel submodule to the develop branch

* Avoid the Async pipeline when kHasBias is true

* clang-format for two files

* Change allocation of grouped mode lse from [H, M] to [1, H, M] to match the xformers scripts

* Change generate_instances.py so that the script can be called from any location

* Add manual for generate_instances.py (.md)

* Modification in GENERATE_INSTANCES.md

* Fix in GENERATE_INSTANCES.md

* Update GENERATE_INSTANCES.md

* Clean up commented code

* Revert "Change allocation of grouped mode lse from [H, M] to [1, H, M] to match the xformers scripts"

This reverts commit 7a91589.

* Synchronize to the latest ck develop to use the latest RTN bf16 conversion

* Add c++ extension compiling options for better performance on ROCM 6.2

* Use the same rocm_ci.yml as upstream

* Use the same ck.py as upstream

* Reformat setup.py

---------

Co-authored-by: Max Podkorytov <[email protected]>
Co-authored-by: carlushuang <[email protected]>
Co-authored-by: Xiaodong Wang <[email protected]>
ghstack-source-id: 6c3e9fe00b08d2157543daa1d3c2ceaf69cf3cc1
Pull Request resolved: fairinternal/xformers#1218

__original_commit__ = fairinternal/xformers@cf44fd3
ghstack-source-id: 3204616cecd297f658bda0cf31e82ba0edf491c6
Pull Request resolved: fairinternal/xformers#1215

__original_commit__ = fairinternal/xformers@adc8664
ghstack-source-id: dc28f4bb77751ffc4f8d67a5ba088b262830659c
Pull Request resolved: fairinternal/xformers#1217

__original_commit__ = fairinternal/xformers@eec3f5f
ghstack-source-id: 8bfbe7959ef67f4f01ab36ea20e12c9d89daf664
Pull Request resolved: fairinternal/xformers#1219

__original_commit__ = fairinternal/xformers@08743da
ghstack-source-id: 872cd3e587c1fd3a23e14a8682ba17f0274a0e6e
Pull Request resolved: fairinternal/xformers#1220

__original_commit__ = fairinternal/xformers@79d0b02
ghstack-source-id: 988dd690ff1011948d610a495eca9cc0b719a067
Pull Request resolved: fairinternal/xformers#1223

__original_commit__ = fairinternal/xformers@202cd4b
Fused seqpar used to have multiple staging buffers (and associated counters), each called a "stripe", and would cycle through them in successive operations.

We needed this because when the fused seqpar op returns, the staging buffer isn't immediately available for reuse: there's an asynchronous signalling kernel (write_values) in a background stream. This used to matter in benchmarks, where we run these ops back-to-back, to avoid an artificial delay due to each op waiting for the previous one's async kernel.

This feature was however disabled "in production" (since by default we have only one stripe) as the ops don't run back-to-back there.

Stripes are rather burdensome to handle inside CUDA graphs (which is what we'll enable in fairinternal/xformers#1014) because, instead of having the CPU pass different pointers to different kernels, we would need each GPU kernel to pick the pointers it wants to operate on from a tensor of pointers (rotating round-robin). This is doable-ish in custom kernels, but not in standard PyTorch ones.

Moreover, we need to make these async kernels synchronous anyway for the fused seqpar ops to be graphable (see fairinternal/xformers#1034), hence stripes won't provide any benefit anymore.

And removing stripes simplifies the code a bit too!

Truth be told, these write_values kernels weren't that long, so making them synchronous shouldn't hurt performance much. However, I'll later try to make these kernels a bit faster, such as in fairinternal/xformers#1036.
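Below is a minimal sketch (illustrative names, not the xformers code) of the round-robin stripe selection being removed; with the default of a single stripe the rotation is a no-op, so the extra buffers and counters only added complexity:

```python
class StagingStripes:
    """Hypothetical illustration of cycling through staging buffers."""

    def __init__(self, make_buffer, num_stripes: int = 1):
        self.buffers = [make_buffer() for _ in range(num_stripes)]
        self.next_stripe = 0

    def acquire(self):
        # Rotate through staging buffers so op N+1 does not have to wait for
        # op N's asynchronous write_values kernel to release its buffer.
        buf = self.buffers[self.next_stripe]
        self.next_stripe = (self.next_stripe + 1) % len(self.buffers)
        return buf
```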

ghstack-source-id: f8c1f5f0d39762a1ff0b964d18efd411ddca42ec
Pull Request resolved: fairinternal/xformers#1013

__original_commit__ = fairinternal/xformers@0be6c94
This is needed for capturing a CUDA graph (i.e., all work must be "joined back" into the capturing stream).

But it's also needed by the next commit in the stack to avoid a race condition between the seq-num increase kernel of the next iteration and the write_values kernel of the previous iteration.
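A minimal sketch of the "join back" pattern, with assumed names and a stand-in computation (not the actual xformers code):

```python
import torch

def run_with_side_stream(x: torch.Tensor) -> torch.Tensor:
    main = torch.cuda.current_stream()
    side = torch.cuda.Stream()
    side.wait_stream(main)   # the side stream observes prior work on the main stream
    with torch.cuda.stream(side):
        y = x * 2            # stand-in for the async write_values-style kernel
    main.wait_stream(side)   # join the side-stream work back into the main stream
    return y                 # safe for CUDA graph capture and for the next iteration
```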

ghstack-source-id: 4f8e9bfada42566b83b1325cb6dca0bd05374249
Pull Request resolved: fairinternal/xformers#1034

__original_commit__ = fairinternal/xformers@c57618e
ghstack-source-id: 985575fd3ca72ae6691318af80e13caed6aa2a25
Pull Request resolved: fairinternal/xformers#1224

__original_commit__ = fairinternal/xformers@5fbc8d4
ghstack-source-id: 94de742e80c9c6ca529e1cb89ec0f6101adfb1c0
Pull Request resolved: fairinternal/xformers#1035

__original_commit__ = fairinternal/xformers@97d178a
ghstack-source-id: 2eb41508d764a4db0e9ed83e16bb9924723202b1
Pull Request resolved: fairinternal/xformers#1225

__original_commit__ = fairinternal/xformers@48b7589
We had two bugs (deadlocks) due to incorrect updates to the shared counters that are used by processes to coordinate. This was because the same counter was updated at different times by different processes. This felt very error-prone (no clear owner for a counter) thus brittle.

I'm proposing here what I believe to be a clearer and more robust solution, which uses two clearly-defined counters:
- one which tracks the completion of full operations (waited on at the beginning, updated at the end)
- one which is updated mid-operation by the producer rank to signal to the consumer rank that it's ready (either it sent the data, if push-based, or it readied the data, if pull-based)

This protocol works! The tests all pass. Except... the ones based on Triton, because Triton expects the counters to be a contiguous tensor, and this PR breaks that. This can be fixed but it's too much effort for now. Hence I'm just parking this PR here as a reference for the future.
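A hedged sketch of the two-counter scheme, with illustrative names (not the actual xformers implementation); note that the two counters live in separate tensors here, which is exactly what breaks Triton's expectation of one contiguous counters tensor:

```python
import torch

class Handshake:
    def __init__(self, device: str = "cuda") -> None:
        # Tracks completion of full operations: waited on at the beginning of an
        # op, bumped only at the very end.
        self.ops_completed = torch.zeros((), dtype=torch.int64, device=device)
        # Bumped mid-operation by the producer rank to tell the consumer rank that
        # the data was sent (push-based) or readied (pull-based).
        self.data_ready = torch.zeros((), dtype=torch.int64, device=device)

    def wait_previous_op(self, expected: int) -> None:
        while int(self.ops_completed.item()) < expected:  # busy-wait for simplicity
            pass

    def producer_signal_ready(self) -> None:
        self.data_ready += 1      # only the producer writes this counter

    def finish_op(self) -> None:
        self.ops_completed += 1   # only updated at the end of the full operation
```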

ghstack-source-id: 112aa146768b4e0bd1c22afef23778f15eb29278
Pull Request resolved: fairinternal/xformers#989

__original_commit__ = fairinternal/xformers@1b62907
ghstack-source-id: f114488fe3e4b0b73df3740539220e2d61ee41ba
Pull Request resolved: fairinternal/xformers#1226

__original_commit__ = fairinternal/xformers@5ce29ac
* remove pre-dep torch

* fix lint

* isort

* silly mistake

---------

Co-authored-by: Leonid Shamis <[email protected]>

__original_commit__ = fairinternal/xformers@51111b0
ghstack-source-id: 8f2dbaabff2a5935c45b58bad4de3de803bf0fc1
Pull Request resolved: fairinternal/xformers#1238

__original_commit__ = fairinternal/xformers@f4cbd36
ghstack-source-id: c6675f5e50a7147e4c6f18a9ff2e64cecd795b18
Pull Request resolved: fairinternal/xformers#1239

__original_commit__ = fairinternal/xformers@9bed28e
…rnal/xformers#1229)

* sp24: restore __tensor_unflatten__
* sp24: Add tuning of cusparselt algorithm
* fix linter
* Address comments from lw

__original_commit__ = fairinternal/xformers@7b79453
This commit will be tagged as 0.0.28.post2, which will trigger CI jobs to build a new release which depends on the newly-released PyTorch 2.5.0.

ghstack-source-id: 2d08aac54157145d0fed78ddf4589379ea9e0d99
Pull Request resolved: fairinternal/xformers#1244

__original_commit__ = fairinternal/xformers@94d6d0e
It looks like the BlockDiagonalMask import was accidentally removed in commit 166fd2e.
This commit adds it back.
…facebookresearch#1127)

* [fix] Fix the activation checkpointing when using SwiGLUPackedFusedOp

According to the docs (https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function), the forward() method should not be called directly; the apply() method has to be used instead.
After removing the forward() call, activation checkpointing starts working.

* [fix] Fix the activation checkpointing when using SwiGLUPackedFusedOp

The IF conditional on the x.requires_grad state (to change the behavior between inference and training modes) changes the behavior of the recomputation of the forward() method, which breaks activation checkpointing
(since during the recomputation phase x is detached with requires_grad==False, and a different number of tensors is saved by save_for_backward()).

* [fix] Fix the activation checkpointing when using SwiGLUPackedFusedOp by removing the inference path.

The IF conditional on the x.requires_grad state changes the behavior of the recomputation of the forward() method, which breaks activation checkpointing
(since during the recomputation phase x is detached with requires_grad==False, and a different number of tensors is saved by save_for_backward()).
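A minimal sketch of the pattern behind this fix, using a hypothetical op rather than the real SwiGLUPackedFusedOp: .apply() routes the call through autograd so activation checkpointing can record and recompute it, while calling forward() directly bypasses that bookkeeping:

```python
import torch

class ScaleOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * 2.0

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors  # saved tensor kept only for illustration
        return grad_out * 2.0

x = torch.randn(4, requires_grad=True)
y = ScaleOp.apply(x)            # correct: goes through autograd
# y = ScaleOp.forward(None, x)  # wrong: bypasses autograd bookkeeping
```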

This commit will be tagged as 0.0.28.post3, which will trigger CI jobs to build a new release which depends on the newly-released PyTorch 2.5.1.

__original_commit__ = fairinternal/xformers@76e48ce
ghstack-source-id: 340763fa2a5d7db2545d403e64bd4f74b529301f
Pull Request resolved: fairinternal/xformers#1251

__original_commit__ = fairinternal/xformers@4987b46
ghstack-source-id: ae3c4358b3ae51c5e781ac79641cfe92a49fa542
Pull Request resolved: fairinternal/xformers#1250

__original_commit__ = fairinternal/xformers@f9e77b4
bottler and others added 28 commits December 6, 2024 14:54
ghstack-source-id: cb18cfeb9c8f2371e521220d8ff9de161f4e9b91
Pull Request resolved: fairinternal/xformers#1262

__original_commit__ = fairinternal/xformers@9c2f100
ghstack-source-id: d2d359afaa683495df4b838dae194767993a850a
Pull Request resolved: fairinternal/xformers#1263

__original_commit__ = fairinternal/xformers@3e3c06c
I had to run this on a job with 256 ranks and found it too slow. Thus I modified it to add some parallelism, leverage pandas instead of manual loops, and use jq to filter and convert JSON to CSV.
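A hedged sketch of the pandas-based approach; the file and column names here are made up for illustration, not taken from the actual script:

```python
import pandas as pd

# Load per-rank JSON records into a DataFrame and aggregate with vectorized
# operations instead of Python loops over 256 ranks.
df = pd.read_json("per_rank_stats.json")                         # hypothetical input
summary = df.groupby("rank")["duration_ms"].agg(["mean", "max"])
summary.to_csv("per_rank_summary.csv")
```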

ghstack-source-id: e10e40293afeee2f64ebfaadfbe11dda0cbdd01e
Pull Request resolved: fairinternal/xformers#1267

__original_commit__ = fairinternal/xformers@1899cdf
…l/xformers#1266)

* Introduce "conditional" version of unroll_varargs

This PR contains another implementation of "unroll_varargs" that instead replaces indexing into the varargs argument with a big conditional block. This results in smaller code size and better performance
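A hedged sketch of the idea (not the actual xformers implementation): the generated helper resolves each index with a conditional chain instead of unrolling a full copy of the body per vararg index, which keeps the generated source small:

```python
def make_conditional_picker(num_args: int) -> str:
    """Generate source for a hypothetical picker that avoids dynamic indexing."""
    lines = ["def pick(args, i):"]
    for k in range(num_args):
        kw = "if" if k == 0 else "elif"
        lines.append(f"    {kw} i == {k}:")
        lines.append(f"        return args[{k}]")
    lines.append("    raise IndexError(i)")
    return "\n".join(lines)

# e.g. exec(make_conditional_picker(4)) defines pick(args, i) with a 4-way branch.
```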

__original_commit__ = fairinternal/xformers@399a352
* Remove most of deprecated components

* Forgot test

* More fixes

__original_commit__ = fairinternal/xformers@5ec57d1
* Bump FA to 2.7.2

* Update flash.py

__original_commit__ = fairinternal/xformers@6d989fa
…earch#1166)

Test was passing when it should have failed if the output of swiglu_packedw was corrupted.
 - Instead of comparing two custom implementations of SwiGLU with each other, _eager_functional_swiglu is now used as the reference implementation.
 - Changed the RTOL tolerance for the fp16 backward test from 1e-2 to 3e-2 to make 2 failing tests pass (out of 800 passing).
 - Fixed the _eager_functional_swiglu type for mypy.
…tK (facebookresearch#1181)

Summary:
Triton 3.2 made some changes to its interpretation of constants (triton-lang/triton#4613), which make Triton more consistent with PyTorch/NumPy but cause some surprising issues with this kernel. Specifically, it seems like log2e is interpreted as float32 in one instance and float64 in another, which leads to reduced prediction accuracy in some cases.

To prevent this, let's make log2e a constant and define it as float32.
Probably we should have a test somewhere, but it becomes hard to cover all combinations of {flash version} x {GPU} x {PyTorch version} x {built-in Flash / PT flash / third-party flash}.
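An illustrative sketch of the fix described above (not the actual xformers kernel): materialize log2(e) as an explicit float32 value so newer Triton does not treat it as float64 in part of the computation:

```python
import triton
import triton.language as tl

@triton.jit
def exp_via_exp2_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    log2e = tl.full((1,), 1.4426950408889634, dtype=tl.float32)  # pinned to fp32
    y = tl.exp2(x * log2e)   # exp(x) computed as exp2(x * log2(e))
    tl.store(out_ptr + offs, y, mask=mask)
```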

__original_commit__ = fairinternal/xformers@9c39da4
ghstack-source-id: 5eb3152cb965ccb4a6227e50f026539ca5bccd76
Pull Request resolved: fairinternal/xformers#1281

__original_commit__ = fairinternal/xformers@02f4d4b
@qianfengz qianfengz merged commit 9045af7 into develop Jan 17, 2025
2 of 6 checks passed
@qianfengz qianfengz deleted the merge_upstream branch January 17, 2025 10:43