[Release 0.3.0] Basic Readme and user-facing pathways (#30)
* initial commit
* is this a version problem
* or wrong find_packages logic
* all_right
* initial commit
* add load_compress func
* More tests (loading dense tensors)
* simplify UX
* cosmetic changes
* finishing the PR
* finalize the PR
* Update src/compressed_tensors/compressors/sparse_bitmask.py
* disable ipynb test
1 parent 67005d7, commit d4787e2
Showing 17 changed files with 725 additions and 214 deletions.
@@ -1 +1,82 @@
# compressed-tensors

This repository extends the [safetensors](https://github.com/huggingface/safetensors) format to efficiently store sparse and/or quantized tensors on disk. The `compressed-tensors` format supports multiple compression types to minimize disk space and facilitate tensor manipulation.

## Motivation

### Reduce disk space by saving sparse tensors in a compressed format

The compressed format stores the data much more efficiently by taking advantage of two properties of tensors:

- Sparse tensors -> a large number of entries are equal to zero.
- Quantized tensors -> values are stored in a low-precision representation.

### Introduce an elegant interface to save/load compressed tensors

The library gives users the ability to compress and decompress tensors. The properties of the tensors are defined by human-readable configs, allowing users to understand the compression format at a glance.

## Installation

### Pip

```bash
pip install compressed-tensors
```

### From source

```bash
git clone https://github.com/neuralmagic/compressed-tensors
cd compressed-tensors
pip install -e .
```

## Getting started

### Saving/Loading Compressed Tensors (Bitmask Compression)

The function `save_compressed` uses the `compression_format` argument to apply compression to tensors.
The function `load_compressed` reverses the process: it converts the compressed weights on disk back into decompressed weights in device memory.
```python
from compressed_tensors import save_compressed, load_compressed, BitmaskConfig
from torch import Tensor
from typing import Dict

# the example BitmaskConfig method efficiently compresses
# tensors with a large number of zero entries
compression_config = BitmaskConfig()

tensors: Dict[str, Tensor] = {"tensor_1": Tensor(
    [[0.0, 0.0, 0.0],
     [1.0, 1.0, 1.0]]
)}
# compress tensors using the BitmaskConfig compression format (save them efficiently on disk)
save_compressed(tensors, "model.safetensors", compression_format=compression_config.format)

# decompress tensors (load_compressed returns a generator for memory efficiency)
decompressed_tensors = {}
for tensor_name, tensor in load_compressed("model.safetensors", compression_config=compression_config):
    decompressed_tensors[tensor_name] = tensor
```
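
Since `load_compressed` yields `(tensor_name, tensor)` pairs, the loop above can equivalently be collapsed with `dict`, as the model example below also does:

```python
decompressed_tensors = dict(load_compressed("model.safetensors", compression_config=compression_config))
```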

### Saving/Loading Compressed Models (Bitmask Compression)

We can apply bitmask compression to a whole model. For a more detailed example, see the `example` directory.
```python
from compressed_tensors import save_compressed_model, load_compressed, BitmaskConfig
from transformers import AutoModelForCausalLM

model_name = "neuralmagic/llama2.c-stories110M-pruned50"
model = AutoModelForCausalLM.from_pretrained(model_name)

original_state_dict = model.state_dict()

compression_config = BitmaskConfig()

# save compressed model weights
save_compressed_model(model, "compressed_model.safetensors", compression_format=compression_config.format)

# load compressed model weights (`dict` turns the generator into a dictionary)
state_dict = dict(load_compressed("compressed_model.safetensors", compression_config))
```
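
The decompressed `state_dict` can then be restored into the model with the standard PyTorch API (a usage sketch based on the model loaded above):

```python
# load the decompressed weights back into the model
model.load_state_dict(state_dict)
```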
@@ -0,0 +1,252 @@
## Bitmask Compression Example

Bitmask compression allows sparse tensors to be stored efficiently on disk.

Instead of storing each zero element as an actual number, we use a bitmask to indicate which tensor entries are zero. This approach is useful when a matrix consists mostly of zeros, as it saves space by not storing those zeros explicitly.

The example below shows how to save and load sparse tensors using bitmask compression. It also demonstrates the benefits of bitmask compression over a "dense" representation and, finally, introduces the enhanced `safetensors` file format for storing sparse weights.
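
As a toy illustration of the idea (not the library's actual implementation, which packs the mask into bits, eight entries per byte), a bitmask compressor can keep just the nonzero values plus a boolean mask marking their positions:

```python
import torch

def toy_bitmask_compress(x: torch.Tensor):
    # keep only the nonzero values, plus a mask marking their positions
    mask = x != 0
    values = x[mask]  # 1-D tensor of the nonzero entries
    return values, mask

def toy_bitmask_decompress(values: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # rebuild the dense tensor by scattering the values back into place
    out = torch.zeros(mask.shape, dtype=values.dtype)
    out[mask] = values
    return out

x = torch.tensor([[0.0, 1.5, 0.0], [2.0, 0.0, 0.0]])
values, mask = toy_bitmask_compress(x)
assert torch.equal(toy_bitmask_decompress(values, mask), x)
```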
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import torch\n", | ||
"import os\n", | ||
"from safetensors import safe_open\n", | ||
"from safetensors.torch import save_model\n", | ||
"from compressed_tensors import save_compressed_model, load_compressed, BitmaskConfig\n", | ||
"from transformers import AutoModelForCausalLM" | ||
] | ||
}, | ||
```python
# load a tiny, pruned llama2 model
model_name = "neuralmagic/llama2.c-stories110M-pruned50"
model = AutoModelForCausalLM.from_pretrained(model_name)
model
```

Output:

```
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 768)
    (layers): ModuleList(
      (0-11): 12 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=768, out_features=768, bias=False)
          (k_proj): Linear(in_features=768, out_features=768, bias=False)
          (v_proj): Linear(in_features=768, out_features=768, bias=False)
          (o_proj): Linear(in_features=768, out_features=768, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=768, out_features=2048, bias=False)
          (up_proj): Linear(in_features=768, out_features=2048, bias=False)
          (down_proj): Linear(in_features=2048, out_features=768, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=768, out_features=32000, bias=False)
)
```
```python
# most of the model's weights are pruned to 50% sparsity
# (except for a few layers such as lm_head or the embeddings)
state_dict = model.state_dict()
example_layer = "model.layers.0.self_attn.q_proj.weight"
sparsity = torch.sum(state_dict[example_layer] == 0).item() / state_dict[example_layer].numel()
print(f"The example layer {example_layer} has sparsity {sparsity:.2f}")
```

Output:

```
The example layer model.layers.0.self_attn.q_proj.weight has sparsity 0.50
```
```python
# we can inspect the total sparsity of the state_dict
total_num_parameters = 0
total_num_zero_parameters = 0
for key in state_dict:
    total_num_parameters += state_dict[key].numel()
    total_num_zero_parameters += state_dict[key].eq(0).sum().item()
print(f"The model is {total_num_zero_parameters / total_num_parameters * 100:.2f}% sparse overall")
```

Output:

```
The model is 31.67% sparse overall
```
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 12, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"Compressing model: 100%|██████████| 111/111 [00:06<00:00, 17.92it/s]\n" | ||
] | ||
}, | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Size of the model's weights on disk using safetensors: 417.83 MB\n", | ||
"Size of the model's weights on disk using compressed-tensors: 366.82 MB\n", | ||
"The compression ratio is x1.14\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# let's save the model on disk using safetensors and compressed-tensors and compare the size on disk\n", | ||
"\n", | ||
"## save the model using safetensors ##\n", | ||
"save_model(model, \"model.safetensors\")\n", | ||
"size_on_disk_mb = os.path.getsize('model.safetensors') / 1024 / 1024\n", | ||
"\n", | ||
"## save the model using compressed-tensors ##\n", | ||
"save_compressed_model(model, \"compressed_model.safetensors\", compression_format=\"sparse-bitmask\")\n", | ||
"compressed_size_on_disk_mb = os.path.getsize('compressed_model.safetensors') / 1024 / 1024\n", | ||
"\n", | ||
"print(f\"Size of the model's weights on disk using safetensors: {size_on_disk_mb:.2f} MB\")\n", | ||
"print(f\"Size of the model's weights on disk using compressed-tensors: {compressed_size_on_disk_mb:.2f} MB\")\n", | ||
"print(\"The compression ratio is x{:.2f}\".format(size_on_disk_mb / compressed_size_on_disk_mb))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Storing weights with around 30% of zero entries requires significantly less disk space when using `compressed-tensors`. The compression ratio improves radically for more sparse models. \n", | ||
"\n", | ||
"We can load back the `state_dict` from the compressed and uncompressed representation on disk and confirm, that they represent same tensors in memory." | ||
] | ||
}, | ||
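
As a rough back-of-envelope estimate (an illustration under simplifying assumptions: fp32 weights, uniform sparsity across all tensors, and ignoring the per-tensor shape and row-offset overhead), bitmask compression stores the nonzero values plus one mask bit per entry, so the expected on-disk fraction at sparsity `s` is about `(1 - s) + 1/32`:

```python
def bitmask_size_fraction(sparsity: float, bytes_per_value: int = 4) -> float:
    # fraction of the dense size: nonzero values + one mask bit per entry
    return (1.0 - sparsity) + 1.0 / (8 * bytes_per_value)

for s in (0.3, 0.5, 0.9):
    print(f"sparsity {s:.0%}: ~x{1 / bitmask_size_fraction(s):.2f} compression")
```

The measured ratio above (x1.14) is lower than this estimate because some large tensors, such as the embeddings, are not pruned.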
```python
# load the safetensors file and the compressed-tensors file
# and show that they have the same representation

## load the uncompressed safetensors into memory ##
state_dict_1 = {}
with safe_open("model.safetensors", framework="pt") as f:
    for key in f.keys():
        state_dict_1[key] = f.get_tensor(key)

## load the compressed-tensors into memory ##
config = BitmaskConfig()  # we need to specify the method for decompression
state_dict_2 = dict(load_compressed("compressed_model.safetensors", config))  # load_compressed returns a generator; convert it to a dict

tensors_equal = all(torch.equal(state_dict_1[key], state_dict_2[key]) for key in state_dict_1)

print(f"Once loaded, the state_dicts from safetensors and compressed-tensors are equal: {tensors_equal}")
```

Output:

```
Once loaded, the state_dicts from safetensors and compressed-tensors are equal: True
```
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### SafeTensors File Format\n", | ||
"\n", | ||
"The reason why the introduced bitmask compression is much more efficient, is imbibing the information about the compression in the header of the `.safetensors` file.\n", | ||
"For each parameter in the uncompressed `state_dict`, we store the following attributes needed for decompression in the compressed `state_dict`:\n", | ||
"\n", | ||
"* Compressed tensor\n", | ||
"* Bitmask\n", | ||
"* Uncompressed shape\n", | ||
"* Row offsets\n", | ||
"\n", | ||
"```bash\n", | ||
"# Dense\n", | ||
"{\n", | ||
" PARAM_NAME: uncompressed_tensor\n", | ||
"}\n", | ||
"\n", | ||
"# Compressed\n", | ||
"{\n", | ||
" PARAM_NAME.compressed: compressed_tensor, # 1d tensor\n", | ||
" PARAM_NAME.bitmask: value, # 2d bitmask tensor (nrows x (ncols / 8))\n", | ||
" PARAM_NAME.shape: value, # Uncompressed shape tensor\n", | ||
" PARAM_NAME.row_offsets: value # 1d offsets tensor\n", | ||
"}\n", | ||
"```" | ||
] | ||
} | ||
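
To see this layout in practice, one can list the keys stored in a compressed file (a minimal sketch, assuming the `compressed_model.safetensors` file produced earlier; the key suffixes follow the scheme above):

```python
from safetensors import safe_open

# each original parameter expands into .compressed / .bitmask / .shape / .row_offsets entries
with safe_open("compressed_model.safetensors", framework="pt") as f:
    for key in sorted(f.keys())[:8]:
        print(key, tuple(f.get_tensor(key).shape))
```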