Enable Sparse compression #822

Closed · rahul-tuli wants to merge 5 commits into main from make-compressors-composable

Conversation

@rahul-tuli (Collaborator) commented Oct 7, 2024

This PR makes compressors composable. After this change, both sparse and quantization compression can be applied to a sparse quantized model.

Notable Changes:

  • Reworked the logic for inferring the sparse compressor format (a short sketch follows this list):
    • Previously, if a model was quantized, the sparse compressor was always set to the Identity (dense) compressor. Now an appropriate sparse compressor is selected (currently only sparse_bitmask or sparse_24).
    • Added a special condition for Marlin-style compression: the sparse compressor is set to dense when the marlin_24 compressor is used.
    • Updated the SPARSITY_THRESHOLD to 50%, as discussed offline.
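
A minimal sketch of the selection logic described above, assuming hypothetical helper and format names (the actual implementation lives in this PR's diff and compressed-tensors):

# Hypothetical sketch of the sparse-compressor inference described above.
# The function name and format strings are illustrative, not the real API.
from typing import Optional

SPARSITY_THRESHOLD = 0.5  # threshold updated in this PR


def infer_sparse_compressor(
    sparsity: float,
    sparsity_structure: Optional[str],
    quantization_format: Optional[str],
) -> str:
    """Pick a sparse compressor format for a (possibly quantized) model."""
    # marlin_24 packs 2:4 sparsity into its own kernel format, so the
    # sparse compressor must stay dense in that case.
    if quantization_format == "marlin_24":
        return "dense"

    # Below the sparsity threshold, sparse compression is not worthwhile.
    if sparsity < SPARSITY_THRESHOLD:
        return "dense"

    # 2:4 structured sparsity gets the dedicated sparse_24 compressor;
    # everything else falls back to a bitmask representation.
    if sparsity_structure == "2:4":
        return "sparse_24"
    return "sparse_bitmask"


# Example: a 55%-sparse, int-quantized model gets the bitmask sparse compressor.
assert infer_sparse_compressor(0.55, "unstructured", "int_quantized") == "sparse_bitmask"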

Dependencies:

Test Script
from transformers import AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

MODEL_ID = "nm-testing/llama2.c-stories110M-pruned50-compressed-tensors"

# Load model
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 4
                        type: int
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 4
                        type: int
                        strategy: tensor
                        dynamic: true
                        symmetric: true
                    targets: ["Linear"]
    pruning_modifiers:
        ConstantPruningModifier:
            targets: [
                're:.*q_proj.weight',
                're:.*k_proj.weight', 
                're:.*v_proj.weight',
                're:.*o_proj.weight',
                're:.*gate_proj.weight',
                're:.*up_proj.weight',
                're:.*down_proj.weight',
            ]
            start: 0
"""

# Apply quantization while preserving the existing sparsity pattern (ConstantPruningModifier).
oneshot(model=model, recipe=recipe)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("==========================================")

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A4"  # int4 weights/activations per the recipe above
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

@rahul-tuli marked this pull request as ready for review October 7, 2024 13:57
Base automatically changed from set-sparse-compression-true to main October 8, 2024 14:22
@kylesayrs (Collaborator) previously approved these changes Oct 8, 2024 and left a comment:

These changes highlight how tightly tied sparsity and quantization are (the quantization format now depends on the sparsity structure, and the sparsity config now depends on the quantization format).

I think this is the proper way to do it (first infer the formats, then use the formats to generate the configs), but I hope that in the future we can structure the classes so that the flow is clearer and users have an easier time understanding and using the tools.
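
For illustration only, a compact, self-contained sketch of that ordering ("infer the formats first, then generate the configs from the formats"); none of these names are the real llm-compressor API:

# Illustrative ordering only; the class and function names are assumptions.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SparsityConfig:
    format: str


@dataclass
class QuantizationConfig:
    format: str


def build_configs(
    sparsity_structure: Optional[str], use_marlin_24: bool
) -> Tuple[SparsityConfig, QuantizationConfig]:
    # Step 1: infer the formats (in the real flow, each inference may consult
    # the other's inputs, which is the coupling described above).
    quant_format = "marlin_24" if use_marlin_24 else "int_quantized"
    if quant_format == "marlin_24":
        sparse_format = "dense"
    elif sparsity_structure == "2:4":
        sparse_format = "sparse_24"
    else:
        sparse_format = "sparse_bitmask"
    # Step 2: only then generate the configs from the chosen formats.
    return SparsityConfig(sparse_format), QuantizationConfig(quant_format)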

@rahul-tuli marked this pull request as draft October 22, 2024 22:31
@rahul-tuli force-pushed the make-compressors-composable branch from a29da0d to 86cd1d9 on October 22, 2024 23:42
  • Special condition for marlin_24 compressor
  • Update tests

Signed-off-by: Rahul Tuli <[email protected]>
@rahul-tuli force-pushed the make-compressors-composable branch from 86cd1d9 to 43cb1d7 on October 22, 2024 23:47
@dsikka (Collaborator) left a comment:

We should add a couple of e2e tests to run sample generation in llm-compressor.
We can then eventually expand on them to also run in vllm (once we've integrated).
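
A minimal sketch of what such an e2e sample-generation test could look like; the test location, fixture usage, and the QuantizationModifier arguments are assumptions, not the tests that were eventually added:

# Hypothetical e2e test sketch; the preset scheme and fixtures are assumptions.
from transformers import AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

MODEL_ID = "nm-testing/llama2.c-stories110M-pruned50-compressed-tensors"


def test_sparse_quantized_sample_generation(tmp_path):
    model = SparseAutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", torch_dtype="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Quantize on top of the existing 50%-pruned weights.
    recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
    oneshot(model=model, recipe=recipe)

    # Sample generation should still produce tokens after compression.
    input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
        model.device
    )
    output = model.generate(input_ids, max_new_tokens=20)
    assert len(tokenizer.decode(output[0])) > 0

    # Saving should round-trip through the composed compressors without error.
    model.save_pretrained(tmp_path / "compressed")
    tokenizer.save_pretrained(tmp_path / "compressed")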

@rahul-tuli changed the title from "Composable Compressors" to "Enable Sparse compression" Oct 23, 2024

@staticmethod
def from_pretrained(
    model: Module,
    state_dict: Optional[Dict[str, Tensor]] = None,
    compress: bool = False,
    is_marlin: bool = False,
A Collaborator commented:

To make this more generic, why not pass in the quantization config? We will surely have different compression formats that affect sparsity in the future.

A Collaborator replied:

Yeah, we should make this generic, as marlin will likely not be the only case.

@kylesayrs (Collaborator) left a comment:

Overall structure is good (having the quantization config depend on the sparsity structure is better than having it depend on a sparsity config).

I definitely recommend generalizing the is_marlin flag a bit by passing the compression format directly, so it's easier to support more formats in the future.
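
A hedged sketch of the reviewers' suggestion: pass the compression format (or the quantization config itself) instead of an is_marlin flag. The class name and new parameter below are assumptions, not the merged code:

# Sketch of the suggested, more generic signature (an assumption, not the merged code).
from typing import Dict, Optional

from torch import Tensor
from torch.nn import Module


class SparsityConfigSketch:
    @staticmethod
    def from_pretrained(
        model: Module,
        state_dict: Optional[Dict[str, Tensor]] = None,
        compress: bool = False,
        quantization_format: Optional[str] = None,  # e.g. "marlin_24", instead of is_marlin
    ):
        # The marlin special case becomes one branch on the format string, so new
        # formats can be supported later without growing the argument list.
        sparse_format = "dense" if quantization_format == "marlin_24" else "sparse_bitmask"
        ...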

@dsikka (Collaborator) left a comment:

Just summarizing offline discussions:

Although marlin_24 should no longer be the default, we still need to make sure we have a pathway to enable it, such as through the addition of a "format" argument in the recipe.
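
Purely for illustration, one place such a "format" argument could live in a recipe; the key name and placement are assumptions drawn from this discussion, not an existing option:

# Hypothetical recipe showing where an explicit "format" override could go.
recipe_with_format = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            format: marlin_24   # proposed explicit override of the inferred format
            config_groups:
                group_0:
                    weights:
                        num_bits: 4
                        type: int
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
"""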

@rahul-tuli closed this Dec 2, 2024