chore: set up GitHub action to install package (#44)
* set up GitHub action to install package
Fixes #27

* chore: run actions in PR

* chore: run actions in PR

* chore: run actions in PR

* chore: run actions in PR

* chore: run actions in PR

* chore: also pages

* chore: add deployment workflow

* updated constrained decoding notebook

* some more installation instructions
kjappelbaum authored Jul 2, 2024
1 parent d49ebc5 commit 844f13d
Showing 21 changed files with 661 additions and 587 deletions.
26 changes: 26 additions & 0 deletions .github/workflows/deploy.yml
@@ -0,0 +1,26 @@
name: deploy-book

on:
push:
branches: [ main ]

jobs:
check-build-book:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: s-weigand/setup-conda@v1
with:
python-version: 3.11
- name: Install dependencies
run: |
cd package && python -m pip --no-cache-dir install . && cd ..
pip install jupyter-book
- name: Build the book
run: jupyter-book build -n .
- name: Deploy Jupyter book to GitHub pages
uses: peaceiris/actions-gh-pages@v4
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
publish_dir: _build/html
force_orphan: true
74 changes: 27 additions & 47 deletions .github/workflows/publish.yml
@@ -1,60 +1,40 @@
name: deploy-book
name: build-and-check-book

# Run this when the master or main branch changes
on:
push:
branches:
- master
- main
# If your git repository has the Jupyter Book within some-subfolder next to
# unrelated files, you can make this run only if a file within that specific
# folder has been modified.
#
# paths:
# - some-subfolder/**
branches: [ main ]
pull_request:
branches: [ main ]

# This job installs dependencies, builds the book, and pushes it to `gh-pages`
jobs:
deploy-book:
check-build-book:
runs-on: ubuntu-latest
permissions:
pages: write
id-token: write
steps:
- uses: actions/checkout@v3

# Install dependencies
- name: Set up Python 3.11
uses: actions/setup-python@v4
- uses: actions/checkout@v4
- uses: s-weigand/setup-conda@v1
with:
python-version: 3.11

- name: Install dependencies
run: |
pip install -r requirements.txt
# (optional) Cache your executed notebooks between runs
# if you have config:
# execute:
# execute_notebooks: cache
- name: cache executed notebooks
uses: actions/cache@v3
with:
path: _build/.jupyter_cache
key: jupyter-book-cache-${{ hashFiles('requirements.txt') }}

# Build the book
- name: Build the book
cd package && python -m pip --no-cache-dir install . && cd ..
pip install jupyter-book
- name: Check pre-commit
run: |
jupyter-book build .
# Upload the book's HTML as an artifact
- name: Upload artifact
uses: actions/upload-pages-artifact@v2
pip install pre-commit
pre-commit run --all-files || ( git status --short ; git diff ; exit 1 )
- name: Build the book
run: jupyter-book build -n .
- name: Link Checker
uses: lycheeverse/[email protected]
continue-on-error: true
id: lc
env:
GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}
with:
path: "_build/html"

# Deploy the book's HTML to GitHub Pages
- name: Deploy to GitHub Pages
id: deployment
uses: actions/deploy-pages@v2
args: --verbose --exclude --max-redirects 100 --no-progress _build/html**/*.html
- name: Deploy Jupyter book to GitHub pages
uses: peaceiris/actions-gh-pages@v4
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
publish_dir: _build/html
force_orphan: true
10 changes: 5 additions & 5 deletions .pre-commit-config.yaml
@@ -17,8 +17,8 @@ repos:
# Run the formatter.
- id: ruff-format
types_or: [ python, pyi, jupyter ]
- repo: https://github.com/Yelp/detect-secrets
rev: v1.0.3
hooks:
- id: detect-secrets
args: [--exclude-files, ".github/workflows/"]
# - repo: https://github.com/Yelp/detect-secrets
# rev: v1.0.3
# hooks:
# - id: detect-secrets
# args: [--exclude-files, ".github/workflows/"]
7 changes: 5 additions & 2 deletions _toc.yml
@@ -1,10 +1,13 @@
format: jb-book
root: index
parts:
- caption: A. Structured Extraction Workflow
- caption: Introduction and background
numbered: true
chapters:
- file: content/background/resources_LLMs.md
- caption: A. Structured Extraction Workflow
numbered: true
chapters:
- file: content/obtaining_data/index.md
sections:
- file: content/obtaining_data/crossref_search.ipynb
@@ -18,8 +21,8 @@ parts:
- file: content/finetune/choosing_paradigm.ipynb
- file: content/beyond_text/beyond_images.ipynb
- file: content/agents/agent.ipynb
- file: content/constrained_decoding/index.ipynb
- caption: B. Case Studies
numbered: true
chapters:
- file: content/perovskite/constrained_formulas.ipynb
- file: content/constrained_decoding/index.ipynb
5 changes: 1 addition & 4 deletions content/agents/agent.ipynb
@@ -30,7 +30,6 @@
"from rdkit import Chem\n",
"\n",
"from huggingface_hub import hf_hub_download\n",
"from dotenv import load_dotenv\n",
"\n",
"from langchain import hub\n",
"from langchain.pydantic_v1 import BaseModel, Field\n",
@@ -39,9 +38,7 @@
"from langchain.agents.react.agent import create_react_agent\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"from litellm import completion\n",
"\n",
"import llmstructdata"
"from litellm import completion"
]
},
{
48 changes: 24 additions & 24 deletions content/background/resources_LLMs.md
@@ -14,27 +14,27 @@ Transformers have become the leading architecture for solving natural language p
```

* [The Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer)

This post from Harvard University presents an annotated version of the paper *"Attention is All You Need"* as a line-by-line implementation. It reorders and deletes some sections of the original paper and adds comments throughout, explaining the Transformer architecture, model training, and a real-world example. The document is a notebook containing a fully working implementation.

* [Understanding the Transformer architecture for neural networks](https://www.jeremyjordan.me/transformer-architecture)

In this article, Jeremy Jordan explains the Transformer architecture with a focus on the attention mechanism, encoder, decoder, and embeddings. His didactic approach is enriched with many explanatory schemas, making the concepts easy to understand.

* [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer)

In this post, Jay Alammar demystifies how Transformers work. He simplifies complex concepts like encoders, decoders, embeddings, self-attention mechanisms, and model training by explaining each one individually, using numerous schemas to enhance understanding.

* [The Transformer architecture of GPT models](https://bea.stollnitz.com/blog/gpt-transformer/)

This article by Bea Stollnitz explains the architecture of GPT models, which are built using a subset of the original Transformer architecture. It shows a GPT-like version of the code that can be compared with the original Transformer to understand the differences.

* [A walkthrough of transformer architecture code](https://github.com/markriedl/transformer-walkthrough)

This notebook, designed for illustration and didactic purposes, provides a comprehensive walkthrough of the Transformer architecture code. It guides you through a single forward pass, explaining each stage of the architecture with the help of a detailed computation graph.

* [LLM Visualization](https://bbycroft.net/llm)

Here, Brendan Bycroft presents an impressive interactive visualization of the LLM algorithm behind some of OpenAI's GPT models, allowing you to see the entire process in action.

🎥 You can also find a clear explanation of Transformers in the video *[The Transformer Architecture](https://www.youtube.com/watch?v=tstbZXNCfLY)* by Sebastian Raschka.
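
The resources above all revolve around the same core building block. As a rough point of reference (an illustrative sketch, not code taken from any of the linked posts, with arbitrary dimensions), a single Transformer encoder block can be written in PyTorch along these lines:

```python
import torch
from torch import nn


class EncoderBlock(nn.Module):
    """Minimal Transformer encoder block: self-attention plus feed-forward,
    each wrapped with a residual connection and layer normalization."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention: every position attends to every other position.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)      # residual connection + layer norm
        x = self.norm2(x + self.ff(x))    # feed-forward + residual + layer norm
        return x


block = EncoderBlock()
tokens = torch.randn(1, 10, 512)          # (batch, sequence length, embedding dim)
print(block(tokens).shape)                 # torch.Size([1, 10, 512])
```
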
@@ -45,19 +45,19 @@ The concept of "attention" in deep learning emerged from the need to improve Rec

```{figure} ./attention-scores.png
:width: 450px
:align: right
:align: right
```

* [Understanding the attention mechanism in sequence models](https://www.jeremyjordan.me/attention)

In this post, Jeremy Jordan explains the attention mechanism using numerous helpful diagrams and accessible language. The attention mechanism enables the decoder to search across the entire input sequence for information at each step of output generation; this is a key innovation in sequence-to-sequence neural network architectures because it significantly improves model performance.

* [Attention? Attention!](https://lilianweng.github.io/posts/2018-06-24-attention)

This post by Lilian Weng also includes detailed explanations and numerous useful diagrams to understand the attention mechanism.

* [Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch](https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html)

Here, Sebastian Raschka explains how self-attention works by coding it from scratch, step by step.
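
As a minimal sketch of the computation these posts walk through (illustrative only, with random projection matrices instead of trained weights), scaled dot-product self-attention fits in a few lines of PyTorch:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 16
x = torch.randn(5, d_model)                 # embeddings for a 5-token sequence

# Learned projections map each embedding to queries, keys, and values.
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Attention scores: similarity of each query with every key, scaled by sqrt(d).
scores = Q @ K.T / d_model**0.5
weights = F.softmax(scores, dim=-1)         # each row sums to 1
output = weights @ V                        # weighted sum of value vectors

print(weights.shape, output.shape)          # torch.Size([5, 5]) torch.Size([5, 16])
```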

🎥 You can also find valuable information about the attention mechanism in the following videos:
@@ -79,35 +79,35 @@ Tokenization and embeddings are two essential steps in the data processing pipel

```{figure} ./tiktokenizer.png
:width: 400px
:align: right
:align: right
```

* [The Technical User's Introduction to LLM Tokenization](https://christophergs.com/blog/understanding-llm-tokenization)

In this post, Christopher Samiullah delves into the mechanics of tokenization in LLMs, referencing Andrej Karpathy’s YouTube talk [*Let’s build the GPT Tokenizer*](https://www.youtube.com/watch?v=zduSFxRajkE).

* [Let’s build the GPT Tokenizer](https://www.youtube.com/watch?v=zduSFxRajkE)

If you want to dive deeper into this topic, check out the talk by Andrej Karpathy mentioned above, where he builds the tokenizer used in OpenAI's GPT series from scratch, highlighting "weird behaviors" and common issues associated with tokenization. He also created `minbpe`, a [repository](https://github.com/karpathy/minbpe) with code and exercises for further learning.

* [Tiktokenizer](https://tiktokenizer.vercel.app/)

Check out this link and try tokenization with the Tiktokenizer web app. Tokenization runs live in your browser: enter a text string on the left side and see the tokenized output on the right side in real time.
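
If you prefer to experiment outside the browser, the same tokenizers are also exposed by OpenAI's `tiktoken` library. A minimal sketch (assuming `pip install tiktoken`; the example string is arbitrary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")        # encoding used by GPT-3.5/GPT-4
tokens = enc.encode("LiFePO4 is a cathode material.")

print(tokens)                                      # list of integer token ids
print([enc.decode([t]) for t in tokens])           # the text piece behind each id
print(enc.decode(tokens))                          # round-trips to the original string
```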

* [What Are Word and Sentence Embeddings?](https://cohere.com/blog/sentence-word-embeddings)

In this post, Luis Serrano (Cohere) provides a straightforward introduction to embeddings using practical examples.
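
To connect the two ideas: once a tokenizer has turned text into integer ids, an embedding layer is simply a learned lookup table from ids to vectors. A minimal PyTorch sketch (vocabulary size, dimensions, and token ids are arbitrary assumptions):

```python
import torch
from torch import nn

vocab_size, d_model = 50_000, 64
embedding = nn.Embedding(vocab_size, d_model)     # one trainable vector per token id

token_ids = torch.tensor([[312, 1045, 77, 9021]])  # ids produced by some tokenizer
vectors = embedding(token_ids)
print(vectors.shape)                               # torch.Size([1, 4, 64])
```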

### Model training

* [nanoGPT](https://github.com/karpathy/nanoGPT/)

The simplest, fastest repository for training/finetuning medium-sized GPTs by Andrej Karpathy. It includes a minimal code implementation of a generative language model for educational purposes.

* [Building a GPT that can generate molecules from scratch](https://kjablonka.com/blog/posts/building_an_llm)

Here, Kevin M. Jablonka explains how LLMs work through a practical approach applied to materials science, guiding you through building a GPT model that can generate molecules from scratch. It includes a detailed tutorial covering tokenization, the conversion of tokens into numbers, vector embeddings and positional encoding, as well as model training and evaluation. Through this relatively simple example, you will also learn how the attention mechanism works via an exhaustive implementation of self-attention in the model.

🎥 In this video you can learn how to build a GPT model following the paper *"Attention is All You Need"*:
[Let's build GPT: from scratch, in code, spelled out](https://www.youtube.com/watch?v=kCc8FmEb1nY) by Andrej Karpathy.
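
These training resources converge on the same objective: predict the next token and minimize cross-entropy. A bare-bones sketch of that loop (illustrative only, using a toy stand-in model and random integer data rather than anything from the linked tutorials):

```python
import torch
from torch import nn

vocab_size, d_model, seq_len = 100, 32, 16
model = nn.Sequential(                          # toy stand-in for a GPT-style model
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    batch = torch.randint(0, vocab_size, (8, seq_len + 1))   # random "text"
    inputs, targets = batch[:, :-1], batch[:, 1:]            # shift targets by one token
    logits = model(inputs)                                   # (batch, seq, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
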

@@ -123,15 +123,15 @@ Tokenization and embeddings are two essential steps in the data processing pipel

* [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789)
*(Lewis Tunstall, Leandro von Werra, Thomas Wolf, 2022)*

While the overall objective of this book is to show you how to build language applications using the Hugging Face 🤗 Transformers library, it explains the Transformer architecture in a clear and detailed way. Chapter 2, *Text Classification*, shows how tokenizers work, whereas Chapter 3, *Transformer Anatomy*, takes a closer look at how transformers work for natural language processing. You will learn about the encoder-decoder architecture, embeddings, and the self-attention mechanism. The authors take a hands-on approach that makes the book easy to read and the Transformer architecture simple to understand.

* [Generative Deep Learning](https://www.oreilly.com/library/view/generative-deep-learning/9781098134174)
*(David Foster, 2023)*

Starting with an introduction to generative modeling and deep learning, this book explores the different techniques to build generative models. In Chapter 9, *Transformers*, it provides an overview of the Transformer model architecture, attention mechanism, and encoder-decoder architectures.

* [Understanding Deep Learning](https://udlbook.github.io/udlbook/)
*(Simon J.D. Prince, 2023)*

This book begins by introducing deep learning models and discusses how to train them, measure their performance, and improve it. It then presents architectures specialized for images, text, and graph data. Chapter 12, *Transformers*, explains self-attention and the transformer architecture. This is one of the most educational resources on deep learning available today.
3 changes: 0 additions & 3 deletions content/beyond_text/beyond_images.ipynb
@@ -51,8 +51,6 @@
},
"outputs": [],
"source": [
"import llmstructdata\n",
"\n",
"from pdf2image import convert_from_path\n",
"\n",
"file_path = \"../obtaining_data/PDFs/10.26434_chemrxiv-2024-1l0sn.pdf\"\n",
@@ -357,7 +355,6 @@
}
],
"source": [
"import os\n",
"from litellm import completion\n",
"\n",
"\n",