GPT2-with-noise 🧩🔒

Welcome to GPT2-with-noise – a custom implementation of the GPT-2 language model enhanced with noise injection techniques for privacy preservation. This project combines language modeling with differential privacy to help keep training data confidential. Plus, we've added a handy PDF scraper to help you build your own dataset. Let's dive in!

🚀 Overview

GPT2-with-noise aims to:

  • Preserve Privacy: Incorporates noise into training and inference processes to prevent extraction of individual data points.
  • Maintain Performance: Strives to deliver high-quality text generation despite the added noise.
  • Facilitate Data Collection: Includes a PDF scraper to download and extract text data for training.

🛠 Features

  • Differential Privacy with Opacus: Utilizes Opacus to introduce differential privacy during training.
  • Noise Injection in Inference: Adds Gaussian noise to model outputs to enhance privacy.
  • Custom GPT-2 Architecture: Built from scratch using PyTorch, following GPT-2 configurations.
  • PDF Scraper: A script to download and extract text from PDFs in a specified GitHub repository.
  • Progress Tracking: Implements tqdm for progress bars during data processing and training.

📁 Project Structure

GPT2-with-noise/
├── GPT/
│   ├── __init__.py
│   ├── train.py         # Training script with privacy mechanisms
│   ├── model.ptl        # Saved model state dictionary
│   └── data/
│       ├── TinyStories-train.txt  # Sample training data
│       └── scraper.py   # PDF scraper script
├── app.py               # FastAPI application (if applicable)
├── requirements.txt     # List of dependencies
├── .gitignore           # Ignoring unnecessary files
├── LICENSE              # MIT License
└── README.md            # You're here!

🔧 Installation

  1. Clone the Repository

    git clone https://github.com/yourusername/GPT2-with-noise.git
    cd GPT2-with-noise
  2. Set Up a Virtual Environment (Optional but Recommended)

    python3 -m venv venv
    source venv/bin/activate
  3. Install Dependencies

    pip install -r requirements.txt

    If requirements.txt is not available, install the necessary packages:

    pip install torch opacus tiktoken PyPDF2 tqdm

📝 Usage

1. Data Preparation

Before training the model, you'll need some data.

Using the Scraper

The scraper downloads PDFs from the free-cybersecurity-ebooks repository and extracts their text content.

Run the scraper:

python GPT/data/scraper.py

Note: Extracted text files will be saved in the extracted_texts/ directory.
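
The heart of the scraper is the PDF-to-text step. Below is a minimal sketch of that step using PyPDF2, not the exact code from scraper.py: the pdfs/ input directory is an illustrative assumption, while extracted_texts/ matches the note above.

    # Sketch: extract text from already-downloaded PDFs with PyPDF2.
    # pdfs/ is an assumed download location; extracted_texts/ matches the scraper's output folder.
    import os
    from PyPDF2 import PdfReader

    os.makedirs("extracted_texts", exist_ok=True)

    for name in os.listdir("pdfs"):
        if not name.lower().endswith(".pdf"):
            continue
        reader = PdfReader(os.path.join("pdfs", name))
        # Pages with no extractable text return None, so fall back to "".
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        out_path = os.path.join("extracted_texts", name.rsplit(".", 1)[0] + ".txt")
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(text)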

2. Training the Model

Train the GPT-2 model with differential privacy enhancements.

Run the training script:

python GPT/train.py

Key Components in train.py:

  • Model Configuration (GPTConfig): Defines model parameters like block size, vocabulary size, number of layers, heads, embedding dimensions, and batch size.

    from dataclasses import dataclass
    
    @dataclass
    class GPTConfig:
        block_size: int = 1024
        vocab_size: int = 50257
        n_layer: int = 12
        n_head: int = 12
        n_embd: int = 768
        batch_size: int = 2
  • Differential Privacy Integration: Uses Opacus's PrivacyEngine to make the model training process private (a sketch of reading the spent privacy budget follows this list).

    from opacus import PrivacyEngine
    
    # Initialize PrivacyEngine
    privacy_engine = PrivacyEngine()
    
    # Make the model private
    model, optimizer, train_loader = privacy_engine.make_private(
        module=model,
        optimizer=optimizer,
        data_loader=train_loader,
        noise_multiplier=1.0,  # Adjust based on privacy requirements
        max_grad_norm=1.0      # Clipping threshold
    )
  • Training Loop with Gradient Accumulation and Clipping:

    for step in range(max_steps):
        optimizer.zero_grad()
        t0 = time.time()
    
        for micro_step in range(grad_accum_steps):
            x, y = train_loader.next_batch()
            x, y = x.to(device), y.to(device)
    
            with autocast("cuda", dtype=torch.bfloat16):
                logits, loss = model(x, y)
    
            loss = loss / grad_accum_steps
            loss.backward()
    
        # Gradient clipping
        norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        t1 = time.time()
    
        print(f"Step {step} | Loss: {loss.item():.4f} | Time: {(t1 - t0)*1000:.2f}ms")

3. Inference

A dedicated inference script is not included yet; as noted under Features, privacy noise is added to model outputs at inference time.
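
Below is a minimal sketch of what noisy generation could look like. It assumes the trained model can be called without targets and returns (logits, loss), that tiktoken's GPT-2 encoding is used as in training, and that noise_std is an illustrative value rather than a tuned parameter.

    # Sketch: greedy generation with Gaussian noise added to the output logits.
    # Assumes `model` is the trained GPT-2 model already loaded on `device`.
    import torch
    import tiktoken

    device = "cuda" if torch.cuda.is_available() else "cpu"
    enc = tiktoken.get_encoding("gpt2")
    tokens = torch.tensor([enc.encode("Once upon a time")], device=device)
    noise_std = 0.1
    model.eval()

    with torch.no_grad():
        for _ in range(50):
            logits, _ = model(tokens)                  # (B, T, vocab_size)
            logits = logits[:, -1, :]                  # keep the last position
            logits = logits + noise_std * torch.randn_like(logits)  # privacy noise
            next_token = torch.argmax(logits, dim=-1, keepdim=True)
            tokens = torch.cat([tokens, next_token], dim=1)

    print(enc.decode(tokens[0].tolist()))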

⚙️ Configuration

  • Model Parameters: Adjust parameters in the GPTConfig class as needed.

  • Privacy Parameters:

    • noise_multiplier: Controls the level of noise added for privacy. Higher values increase privacy but may impact performance.
    • max_grad_norm: Sets the threshold for gradient clipping to limit the influence of any single training example.
  • Learning Rate Scheduler: Implements a cosine decay learning rate scheduler.

    import math

    # warmup_steps, max_steps, max_lr and min_lr are hyperparameters defined elsewhere in train.py
    def get_lr(it):
        if it < warmup_steps:
            return max_lr * (it + 1) / warmup_steps
        elif it > max_steps:
            return min_lr
        decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
        coeff = 0.5 * (1 + math.cos(math.pi * decay_ratio))
        return min_lr + coeff * (max_lr - min_lr)
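
In the training loop the scheduled rate is then applied to every optimizer parameter group before stepping; a minimal sketch, assuming the get_lr function above and the optimizer returned by make_private:

    # Sketch: set the scheduled learning rate for the current step.
    lr = get_lr(step)
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr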

📝 Notes

  • GPU Utilization: The training script leverages GPU acceleration if available. Ensure your PyTorch installation supports CUDA.
  • Batch Size Considerations: Adjust batch_size in GPTConfig based on your hardware capabilities to prevent out-of-memory errors.
  • Balancing Privacy and Performance: Tuning noise_multiplier is crucial. Experiment with different values to find the optimal balance.
  • Data Loading: DataLoaderLite serves batches from tokenized text data efficiently (a minimal sketch follows below).
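
The training loop calls train_loader.next_batch(), which DataLoaderLite provides. The real implementation lives in train.py; the sketch below only illustrates the idea, assuming tiktoken's GPT-2 encoding and a plain-text corpus such as GPT/data/TinyStories-train.txt.

    # Sketch of a DataLoaderLite-style loader: tokenize once, then serve
    # contiguous (inputs, targets) batches, wrapping around at the end.
    import torch
    import tiktoken

    class DataLoaderLite:
        def __init__(self, file_path, batch_size, block_size):
            enc = tiktoken.get_encoding("gpt2")
            with open(file_path, "r", encoding="utf-8") as f:
                text = f.read()
            self.tokens = torch.tensor(enc.encode(text), dtype=torch.long)
            self.B, self.T = batch_size, block_size
            self.pos = 0

        def next_batch(self):
            B, T = self.B, self.T
            buf = self.tokens[self.pos : self.pos + B * T + 1]
            x = buf[:-1].view(B, T)   # inputs
            y = buf[1:].view(B, T)    # targets shifted by one token
            self.pos += B * T
            # Reset when the next slice would run past the end of the data.
            if self.pos + B * T + 1 > len(self.tokens):
                self.pos = 0
            return x, y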

🤝 Contributing

Contributions are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.


Happy coding! If you have questions or need assistance, feel free to reach out.