[Bugfix] Rename files to remove colons (#846)
* rename files to remove colons

Signed-off-by: Kyle Sayers <[email protected]>

* [Bugfix] Workaround tied tensors bug (#659)

* load offload state dict

* add test

* remove merge duplication

* prepare to fix tie_word_embeddings

* add full tests

* patch second bug

* comment out failing tests, point to next pr

* link to issue

* accommodate offloaded models in test

* add back passing test

* WIP

* add error if not in expected list

* apply style

* update passing failing list

* add shared tensors tests

* clean up

* add comment with link

* make failing tests a todo

* Remove failing tests

* explicitly set safe_serialization

* separate out gpu tests, apply style

---------

Co-authored-by: Kyle Sayers <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>

* only untie word embeddings (#839)

Signed-off-by: Kyle Sayers <[email protected]>

* check for config hidden size (#840)

Signed-off-by: Kyle Sayers <[email protected]>

* Use float32 for Hessian dtype (#847)

* use float32 for hessian dtype

* explicitly set inp dtype as well

* float precision for obcq hessian

Signed-off-by: Kyle Sayers <[email protected]>

* GPTQ: Deprecate non-sequential update option (#762)

* remove from gptq, apply style

* remove instances of sequential_update argument in GPTQ tests

* update examples

* update example tests

* documentation, remove from example

* apply style

* revert back to auto type

* apply style

---------

Co-authored-by: Dipika Sikka <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>

* Typehint nits (#826)

Signed-off-by: Kyle Sayers <[email protected]>

* [ DOC ] Remove version restrictions in W8A8 example (#849)

The latest compressed-tensors 0.8.0 removed some APIs
(https://github.com/neuralmagic/compressed-tensors/pull/156/files).
If an older llmcompressor installed from pip is used with it, an error like the
following is raised:
```
ImportError: cannot import name 'update_layer_weight_quant_params' from 'compressed_tensors.quantization'
```

Signed-off-by: Kyle Sayers <[email protected]>

* Fix inconsistency (#80)

Use group strategy with 128 group size instead of channel

Co-authored-by: Dipika Sikka <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>

* 2of4

Signed-off-by: Kyle Sayers <[email protected]>

* revert change to unrelated example

Signed-off-by: Kyle Sayers <[email protected]>

* rename test file

Signed-off-by: Kyle Sayers <[email protected]>

* fix fwd func call (#845)

Signed-off-by: Kyle Sayers <[email protected]>

---------

Signed-off-by: Kyle Sayers <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Jincheng Miao <[email protected]>
Co-authored-by: 黄石 <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
6 people committed Nov 21, 2024
1 parent 04df6ef commit 9107190
Showing 5 changed files with 13 additions and 13 deletions.
@@ -29,7 +29,7 @@ This example uses LLMCompressor and Compressed-Tensors to create a 2:4 sparse an
The model is calibrated and trained with the ultrachat200k dataset.
At least 75GB of GPU memory is required to run this example.

-Follow the steps below, or run the example with `python examples/quantization_24_sparse_w4a16/llama7b_sparse_w4a16.py`
+Follow the steps below, or run the example with `python examples/quantization_2of4_sparse_w4a16/llama7b_sparse_w4a16.py`

## Step 1: Select a model, dataset, and recipe
In this step, we select which model to use as a baseline for sparsification, a dataset to
@@ -40,7 +40,7 @@ Models can reference a local directory, or a model in the huggingface hub.
Datasets can be from a local compatible directory or the huggingface hub.

Recipes are YAML files that describe how a model should be optimized during or after training.
-The recipe used for this flow is located in [2:4_w4a16_recipe.yaml](./2:4_w4a16_recipe.yaml).
+The recipe used for this flow is located in [2of4_w4a16_recipe.yaml](./2of4_w4a16_recipe.yaml).
It contains instructions to prune the model to 2:4 sparsity, run one epoch of recovery finetuning,
and quantize to 4 bits in one shot using GPTQ.
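
For orientation, a staged recipe of this kind is typically laid out roughly as follows. This is an illustrative sketch only, assuming llmcompressor's modifier names; the stage names, targets, and values are not the verbatim contents of `2of4_w4a16_recipe.yaml`:

```yaml
# Illustrative sketch, not the actual recipe file shipped with this example.
sparsity_stage:
  sparsity_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      mask_structure: "2:4"       # prune to 2:4 structured sparsity
finetuning_stage:
  finetuning_modifiers:
    ConstantPruningModifier:
      targets: ["re:.*weight"]    # hypothetical targets; keep masks fixed during recovery finetuning
quantization_stage:
  quantization_modifiers:
    GPTQModifier:
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 4           # W4 weights
            type: "int"
            symmetric: true
            strategy: "channel"   # channel-wise quantization (see Custom Quantization below)
```

The point is that pruning, recovery finetuning, and GPTQ quantization are expressed as consecutive stages of one recipe, which `apply` then runs end to end.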

@@ -56,18 +56,18 @@ model = SparseAutoModelForCausalLM.from_pretrained(
dataset = "ultrachat-200k"
splits = {"calibration": "train_gen[:5%]", "train": "train_gen"}

recipe = "2:4_w4a16_recipe.yaml"
recipe = "2of4_w4a16_recipe.yaml"
```

## Step 2: Run sparsification using `apply`
The `apply` function applies the given recipe to our model and dataset.
The hardcoded kwargs may be altered based on each model's needs.
-After running, the sparsified model will be saved to `output_llama7b_2:4_w4a16_channel`.
+After running, the sparsified model will be saved to `output_llama7b_2of4_w4a16_channel`.

```python
from llmcompressor.transformers import apply

output_dir = "output_llama7b_2:4_w4a16_channel"
output_dir = "output_llama7b_2of4_w4a16_channel"

apply(
model=model,
@@ -98,12 +98,12 @@ run the following:
import torch
from llmcompressor.transformers import SparseAutoModelForCausalLM

compressed_output_dir = "output_llama7b_2:4_w4a16_channel_compressed"
compressed_output_dir = "output_llama7b_2of4_w4a16_channel_compressed"
model = SparseAutoModelForCausalLM.from_pretrained(output_dir, torch_dtype=torch.bfloat16)
model.save_pretrained(compressed_output_dir, save_compressed=True)
```

### Custom Quantization
The current repo supports multiple quantization techniques configured using a recipe. Supported strategies are `tensor`, `group` and `channel`.
-The above recipe (`2:4_w4a16_recipe.yaml`) uses channel-wise quantization specified by `strategy: "channel"` in its config group.
-To quantize per tensor, change the strategy from `channel` to `tensor`. To use group-size quantization, change from `channel` to `group` and specify the group size, say 128, by including `group_size: 128`. A group-size quantization example is shown in `2:4_w4a16_group-128_recipe.yaml`.
+The above recipe (`2of4_w4a16_recipe.yaml`) uses channel-wise quantization specified by `strategy: "channel"` in its config group.
+To quantize per tensor, change the strategy from `channel` to `tensor`. To use group-size quantization, change from `channel` to `group` and specify the group size, say 128, by including `group_size: 128`. A group-size quantization example is shown in `2of4_w4a16_group-128_recipe.yaml`.
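
To make the `group` variant concrete, only the weight-quantization settings of the config group change relative to the channel-wise recipe. A minimal sketch, assuming the same field names as the channel example above; this is not the verbatim contents of `2of4_w4a16_group-128_recipe.yaml`:

```yaml
# Sketch of the weight settings for group-size quantization (assumed field names).
config_groups:
  group_0:
    targets: ["Linear"]
    weights:
      num_bits: 4
      type: "int"
      symmetric: true
      strategy: "group"   # quantize weights in fixed-size groups
      group_size: 128     # each group of 128 weights shares one scale/zero-point
```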
@@ -3,7 +3,7 @@
from llmcompressor.transformers import SparseAutoModelForCausalLM, apply

# define a recipe to handle sparsity, finetuning and quantization
recipe = "2:4_w4a16_recipe.yaml"
recipe = "2of4_w4a16_recipe.yaml"

# load the model in as bfloat16 to save on memory and compute
model_stub = "neuralmagic/Llama-2-7b-ultrachat200k"
@@ -15,7 +15,7 @@
dataset = "ultrachat-200k"

# save location of quantized model
output_dir = "output_llama7b_2:4_w4a16_channel"
output_dir = "output_llama7b_2of4_w4a16_channel"

# set dataset config parameters
splits = {"calibration": "train_gen[:5%]", "train": "train_gen"}
@@ -16,14 +16,14 @@

@pytest.fixture
def example_dir() -> str:
return "examples/quantization_24_sparse_w4a16"
return "examples/quantization_2of4_sparse_w4a16"


@pytest.mark.example
@requires_gpu_count(1)
class TestQuantization24SparseW4A16:
"""
-Tests for examples in the "quantization_24_sparse_w4a16" example folder.
+Tests for examples in the "quantization_2of4_sparse_w4a16" example folder.
"""

def test_doc_example_command(self, example_dir: str, tmp_path: Path):
@@ -52,7 +52,7 @@ def test_alternative_recipe(self, example_dir: str, tmp_path: Path):
script_path = tmp_path / example_dir / script_filename
content = script_path.read_text(encoding="utf-8")
content = content.replace(
"2:4_w4a16_recipe.yaml", "2:4_w4a16_group-128_recipe.yaml"
"2of4_w4a16_recipe.yaml", "2of4_w4a16_group-128_recipe.yaml"
)
script_path.write_text(content, encoding="utf-8")

