Skip to content

JonSnow1807/gradient-cache

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

3 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Gradient Cache - GPU Memory-Efficient Training

PyPI version Downloads Python Versions License

Gradient Cache is a production-ready PyTorch extension that reduces GPU memory usage by 90%+ during neural network training through intelligent gradient compression and CPU offloading.

πŸš€ Key Features

  • 90%+ Memory Savings: Compress gradients by 100x with minimal accuracy impact
  • Larger Batch Sizes: Train with 2-3x larger batches on the same hardware
  • Simple Integration: Just 3 lines of code to add to any training loop
  • Universal Compatibility: Works with any PyTorch model and optimizer
  • Production Ready: Tested on A100 and T4 GPUs with real models

πŸ“Š Proven Results

Model Parameters Memory Saved Compression
GPT-2 Small 124M 479 MB/step 100x
GPT-2 Medium 350M ~1.3 GB/step 100x
Custom NN 50M 144 MB/step 100x

πŸ”§ Installation

pip install gradient-cache

Or install from source:

git clone https://github.com/JonSnow1807/gradient-cache
cd gradient-cache
pip install -e .

πŸ’‘ Quick Start

Add gradient cache to any PyTorch training loop with just 3 lines:

import gradient_cache

# Create your model
model = create_your_model().cuda()

# Add gradient cache (1 line)
hook_manager = gradient_cache.create_gradient_cache(model, compression_ratio=100)

# Normal training loop
optimizer = torch.optim.Adam(model.parameters())

for batch in dataloader:
    loss = model(batch).mean()
    loss.backward()
    
    # Compress gradients (1 line)
    hook_manager.compress_and_free_gradients()
    
    # Restore gradients and update (1 line)
    hook_manager.apply_gradients()
    optimizer.step()
    optimizer.zero_grad()

🎯 Integration with Training Frameworks

Metaflow Integration

Use the decorator for automatic integration:

from metaflow import FlowSpec, step
import gradient_cache

class MyTrainingFlow(FlowSpec):
    @step
    @gradient_cache.optimize(compression_ratio=100)
    def train(self):
        # Your training code - no changes needed!
        model = create_model()
        optimizer = torch.optim.Adam(model.parameters())
        # ... rest of training

PyTorch Lightning

import pytorch_lightning as pl
import gradient_cache

class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = create_model()
        self.hook_manager = gradient_cache.create_gradient_cache(self.model)
        
    def training_step(self, batch, batch_idx):
        loss = self.model(batch).mean()
        return loss
    
    def on_after_backward(self):
        self.hook_manager.compress_and_free_gradients()
        
    def optimizer_step(self, *args, **kwargs):
        self.hook_manager.apply_gradients()
        super().optimizer_step(*args, **kwargs)

πŸ› οΈ Advanced Usage

Custom Compression Ratios

# Conservative - 10x compression (keep 10%)
hook_manager = gradient_cache.create_gradient_cache(model, compression_ratio=10)

# Aggressive - 1000x compression (keep 0.1%) 
hook_manager = gradient_cache.create_gradient_cache(model, compression_ratio=1000)

Exclude Critical Layers

# Don't compress embeddings or output layers
hook_manager = gradient_cache.GradientCacheHookManager(
    model,
    compression_ratio=100,
    exclude_layers=['embedding', 'lm_head']
)

Monitor Compression

# Enable verbose mode
hook_manager = gradient_cache.create_gradient_cache(model, verbose=True)

# Get compression statistics
stats = hook_manager.get_compression_summary()
print(f"Compression ratio: {stats['overall_compression_ratio']:.1f}x")
print(f"Memory saved: {stats['memory_saved_mb']:.1f} MB")

πŸ“ˆ How It Works

  1. Gradient Computation: Normal backward pass computes gradients
  2. Compression: Keep only top 1% of gradient values by magnitude
  3. CPU Offload: Move compressed gradients to system RAM
  4. GPU Memory Release: Free GPU memory for next batch
  5. Gradient Restoration: Restore gradients for optimizer step

πŸ† Benefits

  • Cost Savings: Use smaller, cheaper GPU instances
  • Larger Models: Train models that don't fit in GPU memory
  • Faster Research: Iterate quickly with larger batch sizes
  • Easy Integration: No model architecture changes needed

πŸ§ͺ Testing

Run the test suite:

python tests/test_gradient_cache.py

πŸ“ Citation

If you use Gradient Cache in your research, please cite:

@software{gradient_cache,
  title = {Gradient Cache: GPU Memory-Efficient Training},
  author = {Gradient Cache Contributors},
  year = {2024},
  url = {https://github.com/gradient-cache/gradient-cache}
}

πŸ“„ License

Apache License 2.0 - see LICENSE for details.

🀝 Contributing

We welcome contributions! Please submit issues and pull requests on GitHub.

πŸ“§ Support


Built with ❀️ for the ML community