Merge branch 'dev'
hbaghramyan committed Nov 18, 2024
2 parents 22f288c + abad574 commit 1a5806b
Showing 13 changed files with 315 additions and 71 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/check-links.yml
@@ -29,6 +29,6 @@ jobs:
- name: Check links
run: |
pytest --check-links ./ --check-links-ignore "https://platform.openai.com/*" --check-links-ignore "https://openai.com/*" --check-links-ignore "https://arena.lmsys.org" --check-links-ignore "https://www.reddit.com/r/*" --check-links-ignore "https://code.visualstudio.com/*" --check-links-ignore https://arxiv.org/* --check-links-ignore "https://ai.stanford.edu/~amaas/data/sentiment/"
pytest --check-links ./ --check-links-ignore "https://platform.openai.com/*" --check-links-ignore "https://openai.com/*" --check-links-ignore "https://arena.lmsys.org" --check-links-ignore https://unsloth.ai/blog/gradient --check-links-ignore "https://www.reddit.com/r/*" --check-links-ignore "https://code.visualstudio.com/*" --check-links-ignore https://arxiv.org/* --check-links-ignore "https://ai.stanford.edu/~amaas/data/sentiment/"
# pytest --check-links ./ --check-links-ignore "https://platform.openai.com/*" --check-links-ignore "https://arena.lmsys.org" --retries 2 --retry-delay 5
18 changes: 18 additions & 0 deletions CITATION.cff
@@ -0,0 +1,18 @@
cff-version: 1.2.0
message: "If you use this book or its accompanying code, please cite it as follows."
title: "Build A Large Language Model (From Scratch), Published by Manning, ISBN 978-1633437166"
abstract: "This book provides a comprehensive, step-by-step guide to implementing a ChatGPT-like large language model from scratch in PyTorch."
date-released: 2024-09-12
authors:
- family-names: "Raschka"
given-names: "Sebastian"
license: "Apache-2.0"
url: "https://www.manning.com/books/build-a-large-language-model-from-scratch"
repository-code: "https://github.com/rasbt/LLMs-from-scratch"
keywords:
- large language models
- natural language processing
- artificial intelligence
- PyTorch
- machine learning
- deep learning
12 changes: 6 additions & 6 deletions README.md
@@ -101,16 +101,16 @@ Several folders contain optional materials as a bonus for interested readers:
- [Python Setup Tips](setup/01_optional-python-setup-preferences)
- [Installing Python Packages and Libraries Used In This Book](setup/02_installing-python-libraries)
- [Docker Environment Setup Guide](setup/03_optional-docker-environment)
- **Chapter 2:**
- **Chapter 2: Working with text data**
- [Comparing Various Byte Pair Encoding (BPE) Implementations](ch02/02_bonus_bytepair-encoder)
- [Understanding the Difference Between Embedding Layers and Linear Layers](ch02/03_bonus_embedding-vs-matmul)
- [Dataloader Intuition with Simple Numbers](ch02/04_bonus_dataloader-intuition)
- **Chapter 3:**
- **Chapter 3: Coding attention mechanisms**
- [Comparing Efficient Multi-Head Attention Implementations](ch03/02_bonus_efficient-multihead-attention/mha-implementations.ipynb)
- [Understanding PyTorch Buffers](ch03/03_understanding-buffers/understanding-buffers.ipynb)
- **Chapter 4:**
- **Chapter 4: Implementing a GPT model from scratch**
- [FLOPS Analysis](ch04/02_performance-analysis/flops-analysis.ipynb)
- **Chapter 5:**
- **Chapter 5: Pretraining on unlabeled data**
- [Alternative Weight Loading from Hugging Face Model Hub using Transformers](ch05/02_alternative_weight_loading/weight-loading-hf-transformers.ipynb)
- [Pretraining GPT on the Project Gutenberg Dataset](ch05/03_bonus_pretraining_on_gutenberg)
- [Adding Bells and Whistles to the Training Loop](ch05/04_learning_rate_schedulers)
@@ -119,11 +119,11 @@ Several folders contain optional materials as a bonus for interested readers:
- [Converting GPT to Llama](ch05/07_gpt_to_llama)
- [Llama 3.2 From Scratch](ch05/07_gpt_to_llama/standalone-llama32.ipynb)
- [Memory-efficient Model Weight Loading](ch05/08_memory_efficient_weight_loading/memory-efficient-state-dict.ipynb)
- **Chapter 6:**
- **Chapter 6: Finetuning for classification**
- [Additional experiments finetuning different layers and using larger models](ch06/02_bonus_additional-experiments)
- [Finetuning different models on 50k IMDB movie review dataset](ch06/03_bonus_imdb-classification)
- [Building a User Interface to Interact With the GPT-based Spam Classifier](ch06/04_user_interface)
- **Chapter 7:**
- **Chapter 7: Finetuning to follow instructions**
- [Dataset Utilities for Finding Near Duplicates and Creating Passive Voice Entries](ch07/02_dataset-utilities)
- [Evaluating Instruction Responses Using the OpenAI API and Ollama](ch07/03_model-evaluation)
- [Generating a Dataset for Instruction Finetuning](ch07/05_dataset-generation/llama3-ollama.ipynb)
27 changes: 21 additions & 6 deletions appendix-D/01_main-chapter-code/appendix-D.ipynb
@@ -203,7 +203,7 @@
"id": "5bf3a8da-abc4-4b80-a5d8-f1cc1c7cc5f3",
"metadata": {},
"source": [
"- Typically, the number of warmup steps is between 0.1% and 10% of the total number of steps\n",
"- Typically, the number of warmup steps is between 0.1% and 20% of the total number of steps\n",
"- We can compute the increment as the difference between the `peak_lr` and `initial_lr` divided by the number of warmup steps"
]
},
@@ -227,6 +227,14 @@
"print(warmup_steps)"
]
},
{
"cell_type": "markdown",
"id": "4b6bbdc8-0104-459e-a7ed-b08be8578709",
"metadata": {},
"source": [
"- Note that the print book accidentally includes a leftover code line, `warmup_steps = 20`, which is not used and can be safely ignored"
]
},
{
"cell_type": "code",
"execution_count": 6,
@@ -544,6 +552,8 @@
"source": [
"from previous_chapters import evaluate_model, generate_and_print_sample\n",
"\n",
"BOOK_VERSION = True\n",
"\n",
"\n",
"def train_model(model, train_loader, val_loader, optimizer, device,\n",
" n_epochs, eval_freq, eval_iter, start_context, tokenizer,\n",
@@ -587,9 +597,14 @@
" loss.backward()\n",
"\n",
" # Apply gradient clipping after the warmup phase to avoid exploding gradients\n",
" if global_step > warmup_steps:\n",
" torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)\n",
" \n",
"\n",
" if BOOK_VERSION:\n",
" if global_step > warmup_steps:\n",
" torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) \n",
" else:\n",
" if global_step >= warmup_steps: # the book originally used global_step > warmup_steps, which led to a skipped clipping step after warmup\n",
" torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)\n",
" \n",
" optimizer.step()\n",
" tokens_seen += input_batch.numel()\n",
"\n",
@@ -683,8 +698,8 @@
"model = GPTModel(GPT_CONFIG_124M)\n",
"model.to(device)\n",
"\n",
"peak_lr = 5e-4\n",
"optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.1)\n",
"peak_lr = 0.001 # this was originally set to 5e-4 in the book by mistake\n",
"optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1) # the book accidentally omitted the lr assignment\n",
"tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
"\n",
"n_epochs = 15\n",
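The appendix-D changes above adjust three related details: the warmup increment, the learning-rate assignment that was missing from the `AdamW` call, and the `>=` clipping condition that also clips the first step after warmup. A minimal sketch of how these pieces fit together, using a placeholder linear model and random batches instead of the notebook's GPT model and data loader:

```python
import torch

# Placeholder model; the notebook trains a GPTModel with a real data loader
model = torch.nn.Linear(10, 10)
peak_lr = 0.001
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)  # lr passed explicitly

initial_lr = 1e-5
total_steps = 100
warmup_steps = int(0.2 * total_steps)  # e.g. 20% of the total steps
lr_increment = (peak_lr - initial_lr) / warmup_steps  # increment described in the notebook

global_step = -1
for _ in range(total_steps):
    global_step += 1
    optimizer.zero_grad()

    # Linear warmup: ramp the learning rate up to peak_lr
    if global_step < warmup_steps:
        lr = initial_lr + global_step * lr_increment
    else:
        lr = peak_lr  # (the notebook switches to a cosine decay here)
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr

    loss = model(torch.randn(8, 10)).sum()
    loss.backward()

    # ">=" also clips the first step after warmup, which ">" would skip
    if global_step >= warmup_steps:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()
```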
3 changes: 2 additions & 1 deletion ch03/01_main-chapter-code/ch03.ipynb
@@ -1485,7 +1485,8 @@
"id": "5a575458-a6da-4e54-8688-83e155f2de06",
"metadata": {},
"source": [
"- If we apply a dropout rate of 0.5 (50%), the non-dropped values will be scaled accordingly by a factor of 1/0.5 = 2."
"- If we apply a dropout rate of 0.5 (50%), the non-dropped values will be scaled accordingly by a factor of 1/0.5 = 2\n",
"- The scaling is calculated by the formula 1 / (1 - `dropout_rate`)"
]
},
{
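The added bullet states the general dropout scaling rule, 1 / (1 - `dropout_rate`). A small sketch to verify it, assuming PyTorch's `nn.Dropout` as used in the chapter:

```python
import torch

torch.manual_seed(123)
dropout_rate = 0.5
dropout = torch.nn.Dropout(dropout_rate)  # modules start in training mode, so dropout is active

out = dropout(torch.ones(6))
print(out)  # roughly half the entries are zeroed; the survivors are scaled to 1 / (1 - 0.5) = 2.0
```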
2 changes: 1 addition & 1 deletion ch05/01_main-chapter-code/gpt_generate.py
@@ -270,7 +270,7 @@ def main(gpt_config, input_prompt, model_size):

token_ids = generate(
model=gpt,
idx=text_to_token_ids(input_prompt, tokenizer),
idx=text_to_token_ids(input_prompt, tokenizer).to(device),
max_new_tokens=25,
context_size=gpt_config["context_length"],
top_k=50,
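The one-line fix above moves the encoded prompt onto the same device as the model before generation. A hedged sketch of the underlying pattern, with a placeholder embedding layer standing in for the chapter's GPT model:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder for the chapter's GPT model; the point is only the device placement
model = torch.nn.Embedding(50257, 768).to(device)

idx = torch.tensor([[6109, 3626, 6100, 345]])  # token IDs start out on the CPU
out = model(idx.to(device))                    # .to(device) avoids a CPU/GPU mismatch error
print(out.shape)                               # torch.Size([1, 4, 768])
```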
10 changes: 5 additions & 5 deletions ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb
@@ -381,7 +381,7 @@
"id": "qcD8LSHNhBRW"
},
"source": [
"- Note that we also added a `dtype=cfg[\"dtype\"]` setting above, which will allow us to load the model directly in lower precision formats later to save memory (versus instantiating it in the original 32-bit precision format and then converting it)\n",
"- Note that we also added a `dtype=cfg[\"dtype\"]` setting above, which will allow us to load the model directly in lower precision formats later to reduce memory usage (versus instantiating it in the original 32-bit precision format and then converting it)\n",
"- We also set `bias=False` since Llama doesn't use any bias units"
]
},
@@ -648,7 +648,7 @@
"\n",
"mha(example_batch)\n",
"\n",
"del mha # delete to safe memory"
"del mha # delete to free up memory"
]
},
{
@@ -781,7 +781,7 @@
" self.out_head = nn.Linear(cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False, dtype=cfg[\"dtype\"])\n",
"\n",
" def forward(self, in_idx):\n",
" batch_size, seq_len = in_idx.shape\n",
" # batch_size, seq_len = in_idx.shape\n",
" tok_embeds = self.tok_emb(in_idx)\n",
" # pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))\n",
" x = tok_embeds # + pos_embeds # Shape [batch_size, num_tokens, emb_size]\n",
@@ -890,7 +890,7 @@
" \"n_heads\": 32, # Number of attention heads\n",
" \"n_layers\": 32, # Number of layers\n",
" \"hidden_dim\": 11008, # NEW: Size of the intermediate dimension in FeedForward\n",
" \"dtype\": torch.bfloat16 # NEW: Lower-precision dtype to save memory\n",
" \"dtype\": torch.bfloat16 # NEW: Lower-precision dtype to reduce memory usage\n",
"}"
]
},
@@ -1691,7 +1691,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
"version": "3.11.4"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
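The reworded note on `dtype=cfg["dtype"]` is about instantiating layers directly in a lower-precision format instead of creating them in float32 and converting afterwards. A small sketch of the idea; the config dict below is a trimmed, hypothetical subset of the notebook's Llama 2 configuration:

```python
import torch
import torch.nn as nn

cfg = {
    "emb_dim": 4096,          # embedding dimension (example value)
    "hidden_dim": 11008,      # intermediate FeedForward dimension (example value)
    "dtype": torch.bfloat16,  # lower-precision dtype to reduce memory usage
}

# The weights are allocated in bfloat16 from the start; no float32 copy is ever created
fc = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], bias=False, dtype=cfg["dtype"])

print(fc.weight.dtype)                              # torch.bfloat16
print(f"{fc.weight.numel() * 2 / 1024**2:.0f} MB")  # ~86 MB at 2 bytes per parameter
```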
14 changes: 7 additions & 7 deletions ch05/07_gpt_to_llama/converting-llama2-to-llama3.ipynb
@@ -481,7 +481,7 @@
" ):\n",
" super().__init__()\n",
" assert d_out % num_heads == 0, \"d_out must be divisible by num_heads\"\n",
" assert num_heads % num_kv_groups == 0, \"num_heads must be divisible by num_kv_groups\"\n",
" assert num_heads % num_kv_groups == 0, \"num_heads must be divisible by num_kv_groups\" # NEW\n",
"\n",
" self.d_out = d_out\n",
" self.num_heads = num_heads\n",
@@ -886,7 +886,7 @@
" \"n_heads\": 32, # Number of attention heads\n",
" \"n_layers\": 32, # Number of layers\n",
" \"hidden_dim\": 11_008, # Size of the intermediate dimension in FeedForward\n",
" \"dtype\": torch.bfloat16 # Lower-precision dtype to save memory\n",
" \"dtype\": torch.bfloat16 # Lower-precision dtype to reduce memory usage\n",
"}"
]
},
@@ -909,7 +909,7 @@
" \"n_kv_groups\": 8, # NEW: Key-Value groups for grouped-query attention\n",
" \"rope_base\": 500_000.0, # NEW: The base in RoPE's \"theta\" was increased to 500_000\n",
" \"rope_freq\": None, # NEW: Additional configuration for adjusting the RoPE frequencies\n",
" \"dtype\": torch.bfloat16 # Lower-precision dtype to save memory\n",
" \"dtype\": torch.bfloat16 # Lower-precision dtype to reduce memory usage\n",
"}"
]
},
@@ -2062,7 +2062,7 @@
" \"n_kv_groups\": 8, # Key-Value groups for grouped-query attention\n",
" \"rope_base\": 500_000.0, # The base in RoPE's \"theta\"\n",
" \"rope_freq\": None, # Additional configuration for adjusting the RoPE frequencies\n",
" \"dtype\": torch.bfloat16 # Lower-precision dtype to save memory\n",
" \"dtype\": torch.bfloat16 # Lower-precision dtype to reduce memory usage\n",
"}\n",
"\n",
"LLAMA31_CONFIG_8B = {\n",
@@ -2074,7 +2074,7 @@
" \"hidden_dim\": 14_336, # Size of the intermediate dimension in FeedForward\n",
" \"n_kv_groups\": 8, # Key-Value groups for grouped-query attention\n",
" \"rope_base\": 500_000.0, # The base in RoPE's \"theta\"\n",
" \"dtype\": torch.bfloat16, # Lower-precision dtype to save memory\n",
" \"dtype\": torch.bfloat16, # Lower-precision dtype to reduce memory usage\n",
" \"rope_freq\": { # NEW: RoPE frequency scaling\n",
" \"factor\": 8.0,\n",
" \"low_freq_factor\": 1.0,\n",
@@ -2448,7 +2448,7 @@
" \"hidden_dim\": 14_336, # Size of the intermediate dimension in FeedForward\n",
" \"n_kv_groups\": 8, # Key-Value groups for grouped-query attention\n",
" \"rope_base\": 500_000.0, # The base in RoPE's \"theta\"\n",
" \"dtype\": torch.bfloat16, # Lower-precision dtype to save memory\n",
" \"dtype\": torch.bfloat16, # Lower-precision dtype to reduce memory usage\n",
" \"rope_freq\": { # NEW: RoPE frequency scaling\n",
" \"factor\": 8.0,\n",
" \"low_freq_factor\": 1.0,\n",
@@ -2467,7 +2467,7 @@
" \"hidden_dim\": 8192, # NEW: Almost half the size of the intermediate dimension in FeedForward\n",
" \"n_kv_groups\": 8, # Key-Value groups for grouped-query attention\n",
" \"rope_base\": 500_000.0, # The base in RoPE's \"theta\"\n",
" \"dtype\": torch.bfloat16, # Lower-precision dtype to save memory\n",
" \"dtype\": torch.bfloat16, # Lower-precision dtype to reduce memory usage\n",
" \"rope_freq\": { # RoPE frequency scaling\n",
" \"factor\": 32.0, # NEW: Adjustment of the rescaling factor\n",
" \"low_freq_factor\": 1.0,\n",
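The assertion marked `# NEW` enforces that the query heads split evenly into key-value groups for grouped-query attention. A brief illustrative check; the helper function is hypothetical, and the numbers mirror the Llama 3 8B config shown in this diff:

```python
def heads_per_kv_group(num_heads: int, num_kv_groups: int) -> int:
    """Return how many query heads share each key/value projection."""
    assert num_heads % num_kv_groups == 0, "num_heads must be divisible by num_kv_groups"
    return num_heads // num_kv_groups

print(heads_per_kv_group(num_heads=32, num_kv_groups=8))  # 4 query heads per K/V group
# With num_kv_groups=8, only 8 key/value head pairs are stored instead of 32,
# which is what shrinks the KV cache in grouped-query attention
```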
4 changes: 2 additions & 2 deletions ch05/07_gpt_to_llama/standalone-llama32.ipynb
@@ -438,7 +438,7 @@
" \"hidden_dim\": 8192, # Size of the intermediate dimension in FeedForward\n",
" \"n_kv_groups\": 8, # Key-Value groups for grouped-query attention\n",
" \"rope_base\": 500_000.0, # The base in RoPE's \"theta\"\n",
" \"dtype\": torch.bfloat16, # Lower-precision dtype to save memory\n",
" \"dtype\": torch.bfloat16, # Lower-precision dtype to reduce memory usage\n",
" \"rope_freq\": { # RoPE frequency scaling\n",
" \"factor\": 32.0,\n",
" \"low_freq_factor\": 1.0,\n",
@@ -458,7 +458,7 @@
"# \"hidden_dim\": 8192, # Size of the intermediate dimension in FeedForward\n",
"# \"n_kv_groups\": 8, # Key-Value groups for grouped-query attention\n",
"# \"rope_base\": 500_000.0, # The base in RoPE's \"theta\"\n",
"# \"dtype\": torch.bfloat16, # Lower-precision dtype to save memory\n",
"# \"dtype\": torch.bfloat16, # Lower-precision dtype to reduce memory usage\n",
"# \"rope_freq\": { # RoPE frequency scaling\n",
"# \"factor\": 32.0,\n",
"# \"low_freq_factor\": 1.0,\n",
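The `rope_freq` entries in these configs (`factor`, `low_freq_factor`, `high_freq_factor`, `original_context_length`) control how RoPE's inverse frequencies are rescaled for longer contexts. A hedged sketch of the Llama 3.1/3.2-style frequency-smoothing recipe; the notebook's actual implementation may differ in details:

```python
import math
import torch

def rescale_rope_inv_freq(inv_freq, factor=32.0, low_freq_factor=1.0,
                          high_freq_factor=4.0, original_context_length=8192):
    """Rescale RoPE inverse frequencies (Llama 3.1/3.2-style smoothing)."""
    wavelen = 2 * math.pi / inv_freq
    low_freq_wavelen = original_context_length / low_freq_factor
    high_freq_wavelen = original_context_length / high_freq_factor

    # Long wavelengths (low frequencies) are slowed down by `factor`
    scaled = torch.where(wavelen > low_freq_wavelen, inv_freq / factor, inv_freq)

    # Wavelengths between the two thresholds are smoothly interpolated
    smooth = (original_context_length / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
    smoothed = (1 - smooth) / factor * inv_freq + smooth * inv_freq
    is_medium = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)
    return torch.where(is_medium, smoothed, scaled)

# head_dim here is illustrative; rope_base mirrors the config above
head_dim, rope_base = 128, 500_000.0
inv_freq = 1.0 / (rope_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
print(rescale_rope_inv_freq(inv_freq)[:4])
```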