Enable Sparse compression #822
Conversation
Review threads:
- src/llmcompressor/transformers/compression/quantization_format.py (outdated; resolved)
- src/llmcompressor/transformers/sparsification/compressed_tensors_utils.py (two resolved threads)
These changes highlight how tied sparsity and quantization are (the quantization format now depends on the sparsity structure, and the sparsity config now depends on the quantization format).
I think this is the proper way to do it (first infer the format, then use that format to generate the configs), but I hope we can eventually structure the classes so that the flow is clearer and users have an easier time understanding and using the tools.
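A minimal sketch of that "infer format first, then build configs" flow. The three inference helpers are passed in as callables because their real names and locations in llm-compressor aren't assumed here:

```python
# Hypothetical sketch of "infer the format first, then use it to build both
# configs". The helper callables are illustrative, not the actual API.

def build_compression_configs(
    model,
    infer_quantization_format,   # model -> format string, e.g. "marlin_24"
    infer_sparsity_config,       # (model, quant_format) -> sparsity config
    infer_quantization_config,   # (model, quant_format) -> quantization config
):
    # Step 1: infer the compression format from the model's sparsity
    # structure and quantization scheme together.
    quant_format = infer_quantization_format(model)

    # Step 2: derive both configs from that single format, so each config
    # can account for the other side (sparsity <-> quantization).
    sparsity_config = infer_sparsity_config(model, quant_format)
    quantization_config = infer_quantization_config(model, quant_format)
    return sparsity_config, quantization_config
```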
Force-pushed a29da0d to 86cd1d9 ("Special condition for marlin_24 compressor"; "Update tests"; Signed-off-by: Rahul Tuli <[email protected]>).
Force-pushed 86cd1d9 to 43cb1d7.
We should add a couple of e2e tests that run sample generation in llm-compressor.
We can then eventually expand them to also run in vLLM (once we've integrated).
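A rough sketch of what such an e2e test could look like. The reload path and the assertion are assumptions; the actual test harness may use different utilities:

```python
# Rough sketch of an e2e sample-generation test. Loading the compressed
# checkpoint with plain transformers is an assumption for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer


def test_sample_generation(compressed_model_path: str):
    # Reload the compressed model and verify it can still generate text.
    model = AutoModelForCausalLM.from_pretrained(compressed_model_path)
    tokenizer = AutoTokenizer.from_pretrained(compressed_model_path)

    inputs = tokenizer("Hello, my name is", return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=20)
    text = tokenizer.decode(output[0], skip_special_tokens=True)

    # Loose sanity check: generation ran end-to-end and produced new text.
    assert len(text) > len("Hello, my name is")
```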
```python
@staticmethod
def from_pretrained(
    model: Module,
    state_dict: Optional[Dict[str, Tensor]] = None,
    compress: bool = False,
    is_marlin: bool = False,
```
To make this more generic, why not pass in the quantization config? We will certainly have other compression formats that affect sparsity in the future.
Yeah, we should make this generic, as marlin will likely not be the only case.
The overall structure is good (having the quantization config depend on the sparsity structure is better than having it depend on a sparsity config).
I definitely recommend generalizing is_marlin a bit by passing the compression format directly, so it's easier to support more formats in the future.
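A sketch of that generalization, replacing the boolean with a format parameter. The parameter name and the wrapping class are assumptions for illustration:

```python
from typing import Dict, Optional

from torch import Tensor
from torch.nn import Module


class SparsityConfig:  # illustrative wrapper; the real class name may differ
    @staticmethod
    def from_pretrained(
        model: Module,
        state_dict: Optional[Dict[str, Tensor]] = None,
        compress: bool = False,
        quantization_format: Optional[str] = None,  # e.g. "marlin_24"
    ):
        # Key format-specific behavior off the format string rather than a
        # one-off boolean, so new formats don't require new flags.
        is_marlin = quantization_format == "marlin_24"
        ...
```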
Just summarizing our offline discussions:
Although marlin_24 should no longer be the default, we still need a pathway to enable it, such as adding a "format" argument to the recipe.
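Purely illustrative, one possible shape such a recipe argument could take. The "format" key is the hypothetical addition discussed above, not an existing recipe field:

```python
# Hypothetical recipe with a "format" override that opts into the marlin_24
# compressor; the key name and placement are assumptions.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            targets: ["Linear"]
            scheme: W4A16
            format: marlin_24   # hypothetical opt-in discussed above
"""
```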
This PR makes compressors composable. After this change, both sparse and quantization compression can be applied to a sparse quantized model.
Notable Changes:
- The sparse compression format is now inferred from the model's sparsity structure (sparse_bitmask or sparse_24).
- Sparse and quantization compression are applied together at save time (except when the marlin_24 compressor is used).
Dependencies:
Test Script
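The original test script isn't reproduced here, but a minimal sketch of exercising the composed flow might look like the following. The model path is a placeholder, and SparseAutoModelForCausalLM / save_compressed reflect llm-compressor's entry points around the time of this PR:

```python
# Hedged sketch of the composed flow described in the summary: saving a
# sparse quantized model so that both sparse and quantization compression
# are applied. Paths are placeholders.
from llmcompressor.transformers import SparseAutoModelForCausalLM

model = SparseAutoModelForCausalLM.from_pretrained("path/to/sparse-quantized-model")

# save_compressed triggers format inference: the sparsity structure picks
# the sparse compressor, and the quantization format is chosen to compose
# with it (or marlin_24 takes over under its special condition).
model.save_pretrained("compressed-output", save_compressed=True)
```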