Merge branch 'dev'
hbaghramyan committed Nov 18, 2024
2 parents 22f288c + abad574 commit 1a5806b
Showing 13 changed files with 315 additions and 71 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/check-links.yml
@@ -29,6 +29,6 @@ jobs:
- name: Check links
run: |
pytest --check-links ./ --check-links-ignore "https://platform.openai.com/*" --check-links-ignore "https://openai.com/*" --check-links-ignore "https://arena.lmsys.org" --check-links-ignore "https://www.reddit.com/r/*" --check-links-ignore "https://code.visualstudio.com/*" --check-links-ignore https://arxiv.org/* --check-links-ignore "https://ai.stanford.edu/~amaas/data/sentiment/"
pytest --check-links ./ --check-links-ignore "https://platform.openai.com/*" --check-links-ignore "https://openai.com/*" --check-links-ignore "https://arena.lmsys.org" --check-links-ignore https://unsloth.ai/blog/gradient --check-links-ignore "https://www.reddit.com/r/*" --check-links-ignore "https://code.visualstudio.com/*" --check-links-ignore https://arxiv.org/* --check-links-ignore "https://ai.stanford.edu/~amaas/data/sentiment/"
# pytest --check-links ./ --check-links-ignore "https://platform.openai.com/*" --check-links-ignore "https://arena.lmsys.org" --retries 2 --retry-delay 5
18 changes: 18 additions & 0 deletions CITATION.cff
@@ -0,0 +1,18 @@
cff-version: 1.2.0
message: "If you use this book or its accompanying code, please cite it as follows."
title: "Build A Large Language Model (From Scratch), Published by Manning, ISBN 978-1633437166"
abstract: "This book provides a comprehensive, step-by-step guide to implementing a ChatGPT-like large language model from scratch in PyTorch."
date-released: 2024-09-12
authors:
- family-names: "Raschka"
given-names: "Sebastian"
license: "Apache-2.0"
url: "https://www.manning.com/books/build-a-large-language-model-from-scratch"
repository-code: "https://github.com/rasbt/LLMs-from-scratch"
keywords:
- large language models
- natural language processing
- artificial intelligence
- PyTorch
- machine learning
- deep learning
12 changes: 6 additions & 6 deletions README.md
@@ -101,16 +101,16 @@ Several folders contain optional materials as a bonus for interested readers:
- [Python Setup Tips](setup/01_optional-python-setup-preferences)
- [Installing Python Packages and Libraries Used In This Book](setup/02_installing-python-libraries)
- [Docker Environment Setup Guide](setup/03_optional-docker-environment)
- **Chapter 2:**
- **Chapter 2: Working with text data**
- [Comparing Various Byte Pair Encoding (BPE) Implementations](ch02/02_bonus_bytepair-encoder)
- [Understanding the Difference Between Embedding Layers and Linear Layers](ch02/03_bonus_embedding-vs-matmul)
- [Dataloader Intuition with Simple Numbers](ch02/04_bonus_dataloader-intuition)
- **Chapter 3:**
- **Chapter 3: Coding attention mechanisms**
- [Comparing Efficient Multi-Head Attention Implementations](ch03/02_bonus_efficient-multihead-attention/mha-implementations.ipynb)
- [Understanding PyTorch Buffers](ch03/03_understanding-buffers/understanding-buffers.ipynb)
- **Chapter 4:**
- **Chapter 4: Implementing a GPT model from scratch**
- [FLOPS Analysis](ch04/02_performance-analysis/flops-analysis.ipynb)
- **Chapter 5:**
- **Chapter 5: Pretraining on unlabeled data**
- [Alternative Weight Loading from Hugging Face Model Hub using Transformers](ch05/02_alternative_weight_loading/weight-loading-hf-transformers.ipynb)
- [Pretraining GPT on the Project Gutenberg Dataset](ch05/03_bonus_pretraining_on_gutenberg)
- [Adding Bells and Whistles to the Training Loop](ch05/04_learning_rate_schedulers)
@@ -119,11 +119,11 @@ Several folders contain optional materials as a bonus for interested readers:
- [Converting GPT to Llama](ch05/07_gpt_to_llama)
- [Llama 3.2 From Scratch](ch05/07_gpt_to_llama/standalone-llama32.ipynb)
- [Memory-efficient Model Weight Loading](ch05/08_memory_efficient_weight_loading/memory-efficient-state-dict.ipynb)
- **Chapter 6:**
- **Chapter 6: Finetuning for classification**
- [Additional experiments finetuning different layers and using larger models](ch06/02_bonus_additional-experiments)
- [Finetuning different models on 50k IMDB movie review dataset](ch06/03_bonus_imdb-classification)
- [Building a User Interface to Interact With the GPT-based Spam Classifier](ch06/04_user_interface)
- **Chapter 7:**
- **Chapter 7: Finetuning to follow instructions**
- [Dataset Utilities for Finding Near Duplicates and Creating Passive Voice Entries](ch07/02_dataset-utilities)
- [Evaluating Instruction Responses Using the OpenAI API and Ollama](ch07/03_model-evaluation)
- [Generating a Dataset for Instruction Finetuning](ch07/05_dataset-generation/llama3-ollama.ipynb)
27 changes: 21 additions & 6 deletions appendix-D/01_main-chapter-code/appendix-D.ipynb
@@ -203,7 +203,7 @@
"id": "5bf3a8da-abc4-4b80-a5d8-f1cc1c7cc5f3",
"metadata": {},
"source": [
"- Typically, the number of warmup steps is between 0.1% and 10% of the total number of steps\n",
"- Typically, the number of warmup steps is between 0.1% and 20% of the total number of steps\n",
"- We can compute the increment as the difference between the `peak_lr` and `initial_lr` divided by the number of warmup steps"
]
},
@@ -227,6 +227,14 @@
"print(warmup_steps)"
]
},
{
"cell_type": "markdown",
"id": "4b6bbdc8-0104-459e-a7ed-b08be8578709",
"metadata": {},
"source": [
"- Note that the print book accidentally includes a leftover code line, `warmup_steps = 20`, which is not used and can be safely ignored"
]
},
{
"cell_type": "code",
"execution_count": 6,
@@ -544,6 +552,8 @@
"source": [
"from previous_chapters import evaluate_model, generate_and_print_sample\n",
"\n",
"BOOK_VERSION = True\n",
"\n",
"\n",
"def train_model(model, train_loader, val_loader, optimizer, device,\n",
" n_epochs, eval_freq, eval_iter, start_context, tokenizer,\n",
@@ -587,9 +597,14 @@
" loss.backward()\n",
"\n",
" # Apply gradient clipping after the warmup phase to avoid exploding gradients\n",
" if global_step > warmup_steps:\n",
" torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)\n",
" \n",
"\n",
" if BOOK_VERSION:\n",
" if global_step > warmup_steps:\n",
" torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) \n",
" else:\n",
" if global_step >= warmup_steps: # the book originally used global_step > warmup_steps, which led to a skipped clipping step after warmup\n",
" torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)\n",
" \n",
" optimizer.step()\n",
" tokens_seen += input_batch.numel()\n",
"\n",
@@ -683,8 +698,8 @@
"model = GPTModel(GPT_CONFIG_124M)\n",
"model.to(device)\n",
"\n",
"peak_lr = 5e-4\n",
"optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.1)\n",
"peak_lr = 0.001 # this was originally set to 5e-4 in the book by mistake\n",
"optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1) # the book accidentally omitted the lr assignment\n",
"tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
"\n",
"n_epochs = 15\n",
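The appendix-D changes above adjust three related details: the warmup increment, the learning-rate assignment that was missing from the `AdamW` call, and the `>=` clipping condition that also clips the first step after warmup. A minimal sketch of how these pieces fit together, using a placeholder linear model and random batches instead of the notebook's GPT model and data loader:

```python
import torch

# Placeholder model; the notebook trains a GPTModel with a real data loader
model = torch.nn.Linear(10, 10)
peak_lr = 0.001
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)  # lr passed explicitly

initial_lr = 1e-5
total_steps = 100
warmup_steps = int(0.2 * total_steps)  # e.g. 20% of the total steps
lr_increment = (peak_lr - initial_lr) / warmup_steps  # increment described in the notebook

global_step = -1
for _ in range(total_steps):
    global_step += 1
    optimizer.zero_grad()

    # Linear warmup: ramp the learning rate up to peak_lr
    if global_step < warmup_steps:
        lr = initial_lr + global_step * lr_increment
    else:
        lr = peak_lr  # (the notebook switches to a cosine decay here)
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr

    loss = model(torch.randn(8, 10)).sum()
    loss.backward()

    # ">=" also clips the first step after warmup, which ">" would skip
    if global_step >= warmup_steps:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()
```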
3 changes: 2 additions & 1 deletion ch03/01_main-chapter-code/ch03.ipynb
@@ -1485,7 +1485,8 @@
"id": "5a575458-a6da-4e54-8688-83e155f2de06",
"metadata": {},
"source": [
"- If we apply a dropout rate of 0.5 (50%), the non-dropped values will be scaled accordingly by a factor of 1/0.5 = 2."
"- If we apply a dropout rate of 0.5 (50%), the non-dropped values will be scaled accordingly by a factor of 1/0.5 = 2\n",
"- The scaling is calculated by the formula 1 / (1 - `dropout_rate`)"
]
},
{
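The added bullet states the general dropout scaling rule, 1 / (1 - `dropout_rate`). A small sketch to verify it, assuming PyTorch's `nn.Dropout` as used in the chapter:

```python
import torch

torch.manual_seed(123)
dropout_rate = 0.5
dropout = torch.nn.Dropout(dropout_rate)  # modules start in training mode, so dropout is active

out = dropout(torch.ones(6))
print(out)  # roughly half the entries are zeroed; the survivors are scaled to 1 / (1 - 0.5) = 2.0
```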
2 changes: 1 addition & 1 deletion ch05/01_main-chapter-code/gpt_generate.py
@@ -270,7 +270,7 @@ def main(gpt_config, input_prompt, model_size):

token_ids = generate(
model=gpt,
idx=text_to_token_ids(input_prompt, tokenizer),
idx=text_to_token_ids(input_prompt, tokenizer).to(device),
max_new_tokens=25,
context_size=gpt_config["context_length"],
top_k=50,
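The one-line fix above moves the encoded prompt onto the same device as the model before generation. A hedged sketch of the underlying pattern, with a placeholder embedding layer standing in for the chapter's GPT model:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder for the chapter's GPT model; the point is only the device placement
model = torch.nn.Embedding(50257, 768).to(device)

idx = torch.tensor([[6109, 3626, 6100, 345]])  # token IDs start out on the CPU
out = model(idx.to(device))                    # .to(device) avoids a CPU/GPU mismatch error
print(out.shape)                               # torch.Size([1, 4, 768])
```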
10 changes: 5 additions & 5 deletions ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb
@@ -381,7 +381,7 @@
"id": "qcD8LSHNhBRW"
},
"source": [
"- Note that we also added a `dtype=cfg[\"dtype\"]` setting above, which will allow us to load the model directly in lower precision formats later to save memory (versus instantiating it in the original 32-bit precision format and then converting it)\n",
"- Note that we also added a `dtype=cfg[\"dtype\"]` setting above, which will allow us to load the model directly in lower precision formats later to reduce memory usage (versus instantiating it in the original 32-bit precision format and then converting it)\n",
"- We also set `bias=False` since Llama doesn't use any bias units"
]
},
@@ -648,7 +648,7 @@
"\n",
"mha(example_batch)\n",
"\n",
"del mha # delete to safe memory"
"del mha # delete to free up memory"
]
},
{
@@ -781,7 +781,7 @@
" self.out_head = nn.Linear(cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False, dtype=cfg[\"dtype\"])\n",
"\n",
" def forward(self, in_idx):\n",
" batch_size, seq_len = in_idx.shape\n",
" # batch_size, seq_len = in_idx.shape\n",
" tok_embeds = self.tok_emb(in_idx)\n",
" # pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))\n",
" x = tok_embeds # + pos_embeds # Shape [batch_size, num_tokens, emb_size]\n",
@@ -890,7 +890,7 @@
" \"n_heads\": 32, # Number of attention heads\n",
" \"n_layers\": 32, # Number of layers\n",
" \"hidden_dim\": 11008, # NEW: Size of the intermediate dimension in FeedForward\n",
" \"dtype\": torch.bfloat16 # NEW: Lower-precision dtype to save memory\n",
" \"dtype\": torch.bfloat16 # NEW: Lower-precision dtype to reduce memory usage\n",
"}"
]
},
@@ -1691,7 +1691,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
"version": "3.11.4"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
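The reworded note on `dtype=cfg["dtype"]` is about instantiating layers directly in a lower-precision format instead of creating them in float32 and converting afterwards. A small sketch of the idea; the config dict below is a trimmed, hypothetical subset of the notebook's Llama 2 configuration:

```python
import torch
import torch.nn as nn

cfg = {
    "emb_dim": 4096,          # embedding dimension (example value)
    "hidden_dim": 11008,      # intermediate FeedForward dimension (example value)
    "dtype": torch.bfloat16,  # lower-precision dtype to reduce memory usage
}

# The weights are allocated in bfloat16 from the start; no float32 copy is ever created
fc = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], bias=False, dtype=cfg["dtype"])

print(fc.weight.dtype)                              # torch.bfloat16
print(f"{fc.weight.numel() * 2 / 1024**2:.0f} MB")  # ~86 MB at 2 bytes per parameter
```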
14 changes: 7 additions & 7 deletions ch05/07_gpt_to_llama/converting-llama2-to-llama3.ipynb
@@ -481,7 +481,7 @@
" ):\n",
" super().__init__()\n",
" assert d_out % num_heads == 0, \"d_out must be divisible by num_heads\"\n",
" assert num_heads % num_kv_groups == 0, \"num_heads must be divisible by num_kv_groups\"\n",
" assert num_heads % num_kv_groups == 0, \"num_heads must be divisible by num_kv_groups\" # NEW\n",
"\n",
" self.d_out = d_out\n",
" self.num_heads = num_heads\n",
@@ -886,7 +886,7 @@
" \"n_heads\": 32, # Number of attention heads\n",
" \"n_layers\": 32, # Number of layers\n",
" \"hidden_dim\": 11_008, # Size of the intermediate dimension in FeedForward\n",
" \"dtype\": torch.bfloat16 # Lower-precision dtype to save memory\n",
" \"dtype\": torch.bfloat16 # Lower-precision dtype to reduce memory usage\n",
"}"
]
},
@@ -909,7 +909,7 @@
" \"n_kv_groups\": 8, # NEW: Key-Value groups for grouped-query attention\n",
" \"rope_base\": 500_000.0, # NEW: The base in RoPE's \"theta\" was increased to 500_000\n",
" \"rope_freq\": None, # NEW: Additional configuration for adjusting the RoPE frequencies\n",
" \"dtype\": torch.bfloat16 # Lower-precision dtype to save memory\n",
" \"dtype\": torch.bfloat16 # Lower-precision dtype to reduce memory usage\n",
"}"
]
},
@@ -2062,7 +2062,7 @@
" \"n_kv_groups\": 8, # Key-Value groups for grouped-query attention\n",
" \"rope_base\": 500_000.0, # The base in RoPE's \"theta\"\n",
" \"rope_freq\": None, # Additional configuration for adjusting the RoPE frequencies\n",
" \"dtype\": torch.bfloat16 # Lower-precision dtype to save memory\n",
" \"dtype\": torch.bfloat16 # Lower-precision dtype to reduce memory usage\n",
"}\n",
"\n",
"LLAMA31_CONFIG_8B = {\n",
@@ -2074,7 +2074,7 @@
" \"hidden_dim\": 14_336, # Size of the intermediate dimension in FeedForward\n",
" \"n_kv_groups\": 8, # Key-Value groups for grouped-query attention\n",
" \"rope_base\": 500_000.0, # The base in RoPE's \"theta\"\n",
" \"dtype\": torch.bfloat16, # Lower-precision dtype to save memory\n",
" \"dtype\": torch.bfloat16, # Lower-precision dtype to reduce memory usage\n",
" \"rope_freq\": { # NEW: RoPE frequency scaling\n",
" \"factor\": 8.0,\n",
" \"low_freq_factor\": 1.0,\n",
@@ -2448,7 +2448,7 @@
" \"hidden_dim\": 14_336, # Size of the intermediate dimension in FeedForward\n",
" \"n_kv_groups\": 8, # Key-Value groups for grouped-query attention\n",
" \"rope_base\": 500_000.0, # The base in RoPE's \"theta\"\n",
" \"dtype\": torch.bfloat16, # Lower-precision dtype to save memory\n",
" \"dtype\": torch.bfloat16, # Lower-precision dtype to reduce memory usage\n",
" \"rope_freq\": { # NEW: RoPE frequency scaling\n",
" \"factor\": 8.0,\n",
" \"low_freq_factor\": 1.0,\n",
@@ -2467,7 +2467,7 @@
" \"hidden_dim\": 8192, # NEW: Almost half the size of the intermediate dimension in FeedForward\n",
" \"n_kv_groups\": 8, # Key-Value groups for grouped-query attention\n",
" \"rope_base\": 500_000.0, # The base in RoPE's \"theta\"\n",
" \"dtype\": torch.bfloat16, # Lower-precision dtype to save memory\n",
" \"dtype\": torch.bfloat16, # Lower-precision dtype to reduce memory usage\n",
" \"rope_freq\": { # RoPE frequency scaling\n",
" \"factor\": 32.0, # NEW: Adjustment of the rescaling factor\n",
" \"low_freq_factor\": 1.0,\n",
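The assertion marked `# NEW` enforces that the query heads split evenly into key-value groups for grouped-query attention. A brief illustrative check; the helper function is hypothetical, and the numbers mirror the Llama 3 8B config shown in this diff:

```python
def heads_per_kv_group(num_heads: int, num_kv_groups: int) -> int:
    """Return how many query heads share each key/value projection."""
    assert num_heads % num_kv_groups == 0, "num_heads must be divisible by num_kv_groups"
    return num_heads // num_kv_groups

print(heads_per_kv_group(num_heads=32, num_kv_groups=8))  # 4 query heads per K/V group
# With num_kv_groups=8, only 8 key/value head pairs are stored instead of 32,
# which is what shrinks the KV cache in grouped-query attention
```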
4 changes: 2 additions & 2 deletions ch05/07_gpt_to_llama/standalone-llama32.ipynb
@@ -438,7 +438,7 @@
" \"hidden_dim\": 8192, # Size of the intermediate dimension in FeedForward\n",
" \"n_kv_groups\": 8, # Key-Value groups for grouped-query attention\n",
" \"rope_base\": 500_000.0, # The base in RoPE's \"theta\"\n",
" \"dtype\": torch.bfloat16, # Lower-precision dtype to save memory\n",
" \"dtype\": torch.bfloat16, # Lower-precision dtype to reduce memory usage\n",
" \"rope_freq\": { # RoPE frequency scaling\n",
" \"factor\": 32.0,\n",
" \"low_freq_factor\": 1.0,\n",
@@ -458,7 +458,7 @@
"# \"hidden_dim\": 8192, # Size of the intermediate dimension in FeedForward\n",
"# \"n_kv_groups\": 8, # Key-Value groups for grouped-query attention\n",
"# \"rope_base\": 500_000.0, # The base in RoPE's \"theta\"\n",
"# \"dtype\": torch.bfloat16, # Lower-precision dtype to save memory\n",
"# \"dtype\": torch.bfloat16, # Lower-precision dtype to reduce memory usage\n",
"# \"rope_freq\": { # RoPE frequency scaling\n",
"# \"factor\": 32.0,\n",
"# \"low_freq_factor\": 1.0,\n",
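The `rope_freq` entries in these configs (`factor`, `low_freq_factor`, `high_freq_factor`, `original_context_length`) control how RoPE's inverse frequencies are rescaled for longer contexts. A hedged sketch of the Llama 3.1/3.2-style frequency-smoothing recipe; the notebook's actual implementation may differ in details:

```python
import math
import torch

def rescale_rope_inv_freq(inv_freq, factor=32.0, low_freq_factor=1.0,
                          high_freq_factor=4.0, original_context_length=8192):
    """Rescale RoPE inverse frequencies (Llama 3.1/3.2-style smoothing)."""
    wavelen = 2 * math.pi / inv_freq
    low_freq_wavelen = original_context_length / low_freq_factor
    high_freq_wavelen = original_context_length / high_freq_factor

    # Long wavelengths (low frequencies) are slowed down by `factor`
    scaled = torch.where(wavelen > low_freq_wavelen, inv_freq / factor, inv_freq)

    # Wavelengths between the two thresholds are smoothly interpolated
    smooth = (original_context_length / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
    smoothed = (1 - smooth) / factor * inv_freq + smooth * inv_freq
    is_medium = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)
    return torch.where(is_medium, smoothed, scaled)

# head_dim here is illustrative; rope_base mirrors the config above
head_dim, rope_base = 128, 500_000.0
inv_freq = 1.0 / (rope_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
print(rescale_rope_inv_freq(inv_freq)[:4])
```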