A lightweight tool for quantizing large language models to GGUF format with configurable bit precision.
- Support for 4-bit and 8-bit quantization
- Compatible with Hugging Face models
- Memory-efficient processing
- Simple API for custom implementations
- Built-in scaling factor calculation
- Automatic tensor type handling
Install from source:

```bash
git clone https://github.com/KevinDKao/gguf-quantization
cd gguf-quantization
pip install -r requirements.txt
```
Dependencies:

- torch
- transformers
- numpy
- gguf
Basic usage:

```python
from quantize import quantize_model

model_path = "path/to/model"
output_path = "quantized_model.gguf"

# Quantize to 4-bit
quantize_model(model_path, output_path, bits=4)
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from quantize import quantize_model

# 8-bit quantization
quantize_model("gpt2", "gpt2_quantized.gguf", bits=8)

# Custom model quantization (quantize_model loads the model and tokenizer
# from the path itself, so loading them here is optional)
model = AutoModelForCausalLM.from_pretrained("custom_model")
tokenizer = AutoTokenizer.from_pretrained("custom_model")
quantize_model("custom_model", "custom_quantized.gguf", bits=4)
```
- Loads the model and tokenizer from the specified path
- Calculates optimal scaling factors for quantization
- Converts float32 tensors to int4/int8 with scaling (see the sketch after this list)
- Preserves non-float tensors in original format
- Writes quantized model to GGUF format
- Automatically handles tokenizer configuration
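For intuition, the conversion step amounts to symmetric quantization: a scaling factor maps the largest absolute value in a tensor onto the signed integer range, values are rounded into that range, and dequantization multiplies back by the scale. The function below is an illustrative sketch of that idea, not the repository's exact implementation (it also stores int4 values in an int8 container for simplicity):

```python
import numpy as np

def quantize_symmetric(weights: np.ndarray, bits: int = 8):
    """Symmetric per-tensor quantization of float32 weights to a signed integer grid."""
    qmax = 2 ** (bits - 1) - 1              # 127 for int8, 7 for int4
    scale = np.abs(weights).max() / qmax    # per-tensor scaling factor
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate the original float32 weights from the quantized values."""
    return q.astype(np.float32) * scale

# Round-trip example: the reconstruction error shrinks as the bit width grows
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_symmetric(w, bits=4)
print(np.abs(w - dequantize(q, scale)).max())
```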
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
MIT License
- GitHub: @KevinDKao
- Issues: https://github.com/KevinDKao/gguf-quantization/issues