chore: set up GitHub action to install package (#44)
* set up GitHub action to install package
Fixes #27

* chore: run actions in PR

* chore: run actions in PR

* chore: run actions in PR

* chore: run actions in PR

* chore: run actions in PR

* chore: also pages

* chore: add deployment workflow

* updated constrained decoding notebook

* some more installation instructions
kjappelbaum authored Jul 2, 2024
1 parent d49ebc5 commit 844f13d
Showing 21 changed files with 661 additions and 587 deletions.
26 changes: 26 additions & 0 deletions .github/workflows/deploy.yml
@@ -0,0 +1,26 @@
name: deploy-book

on:
push:
branches: [ main ]

jobs:
check-build-book:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: s-weigand/setup-conda@v1
with:
python-version: 3.11
- name: Install dependencies
run: |
cd package && python -m pip --no-cache-dir install . && cd ..
pip install jupyter-book
- name: Build the book
run: jupyter-book build -n .
- name: Deploy Jupyter book to GitHub pages
uses: peaceiris/actions-gh-pages@v4
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
publish_dir: _build/html
force_orphan: true
74 changes: 27 additions & 47 deletions .github/workflows/publish.yml
@@ -1,60 +1,40 @@
name: deploy-book
name: build-and-check-book

# Run this when the master or main branch changes
on:
push:
branches:
- master
- main
# If your git repository has the Jupyter Book within some-subfolder next to
# unrelated files, you can make this run only if a file within that specific
# folder has been modified.
#
# paths:
# - some-subfolder/**
branches: [ main ]
pull_request:
branches: [ main ]

# This job installs dependencies, builds the book, and pushes it to `gh-pages`
jobs:
deploy-book:
check-build-book:
runs-on: ubuntu-latest
permissions:
pages: write
id-token: write
steps:
- uses: actions/checkout@v3

# Install dependencies
- name: Set up Python 3.11
uses: actions/setup-python@v4
- uses: actions/checkout@v4
- uses: s-weigand/setup-conda@v1
with:
python-version: 3.11

- name: Install dependencies
run: |
pip install -r requirements.txt
# (optional) Cache your executed notebooks between runs
# if you have config:
# execute:
# execute_notebooks: cache
- name: cache executed notebooks
uses: actions/cache@v3
with:
path: _build/.jupyter_cache
key: jupyter-book-cache-${{ hashFiles('requirements.txt') }}

# Build the book
- name: Build the book
cd package && python -m pip --no-cache-dir install . && cd ..
pip install jupyter-book
- name: Check pre-commit
run: |
jupyter-book build .
# Upload the book's HTML as an artifact
- name: Upload artifact
uses: actions/upload-pages-artifact@v2
pip install pre-commit
pre-commit run --all-files || ( git status --short ; git diff ; exit 1 )
- name: Build the book
run: jupyter-book build -n .
- name: Link Checker
uses: lycheeverse/[email protected]
continue-on-error: true
id: lc
env:
GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}
with:
path: "_build/html"

# Deploy the book's HTML to GitHub Pages
- name: Deploy to GitHub Pages
id: deployment
uses: actions/deploy-pages@v2
args: --verbose --exclude --max-redirects 100 --no-progress _build/html**/*.html
- name: Deploy Jupyter book to GitHub pages
uses: peaceiris/actions-gh-pages@v4
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
publish_dir: _build/html
force_orphan: true
10 changes: 5 additions & 5 deletions .pre-commit-config.yaml
@@ -17,8 +17,8 @@ repos:
# Run the formatter.
- id: ruff-format
types_or: [ python, pyi, jupyter ]
- repo: https://github.com/Yelp/detect-secrets
rev: v1.0.3
hooks:
- id: detect-secrets
args: [--exclude-files, ".github/workflows/"]
# - repo: https://github.com/Yelp/detect-secrets
# rev: v1.0.3
# hooks:
# - id: detect-secrets
# args: [--exclude-files, ".github/workflows/"]
7 changes: 5 additions & 2 deletions _toc.yml
@@ -1,10 +1,13 @@
format: jb-book
root: index
parts:
- caption: A. Structured Extraction Workflow
- caption: Introduction and background
numbered: true
chapters:
- file: content/background/resources_LLMs.md
- caption: A. Structured Extraction Workflow
numbered: true
chapters:
- file: content/obtaining_data/index.md
sections:
- file: content/obtaining_data/crossref_search.ipynb
@@ -18,8 +21,8 @@ parts:
- file: content/finetune/choosing_paradigm.ipynb
- file: content/beyond_text/beyond_images.ipynb
- file: content/agents/agent.ipynb
- file: content/constrained_decoding/index.ipynb
- caption: B. Case Studies
numbered: true
chapters:
- file: content/perovskite/constrained_formulas.ipynb
- file: content/constrained_decoding/index.ipynb
5 changes: 1 addition & 4 deletions content/agents/agent.ipynb
@@ -30,7 +30,6 @@
"from rdkit import Chem\n",
"\n",
"from huggingface_hub import hf_hub_download\n",
"from dotenv import load_dotenv\n",
"\n",
"from langchain import hub\n",
"from langchain.pydantic_v1 import BaseModel, Field\n",
@@ -39,9 +38,7 @@
"from langchain.agents.react.agent import create_react_agent\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"from litellm import completion\n",
"\n",
"import llmstructdata"
"from litellm import completion"
]
},
{
48 changes: 24 additions & 24 deletions content/background/resources_LLMs.md
@@ -14,27 +14,27 @@ Transformers have become the leading architecture for solving natural language p
```

* [The Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer)

This post from Harvard University presents an annotated version of the paper *"Attention is All You Need"* as a line-by-line implementation. It reorders and deletes some sections of the original paper and adds comments throughout, explaining the Transformer architecture, model training, and a real-world example. The document is a notebook containing a fully working implementation.

* [Understanding the Transformer architecture for neural networks](https://www.jeremyjordan.me/transformer-architecture)

In this article, Jeremy Jordan explains the Transformer architecture with a focus on the attention mechanism, encoder, decoder, and embeddings. His didactic approach is enriched with many explanatory schemas, making the concepts easy to understand.

* [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer)

In this post, Jay Alammar demystifies how Transformers work. He simplifies complex concepts like encoders, decoders, embeddings, self-attention mechanisms, and model training by explaining each one individually, using numerous schemas to enhance understanding.

* [The Transformer architecture of GPT models](https://bea.stollnitz.com/blog/gpt-transformer/)

This article by Bea Stollnitz explains the architecture of GPT models, which are built using a subset of the original Transformer architecture. It shows a GPT-like version of the code that can be compared with the original Transformer to understand the differences.

* [A walkthrough of transformer architecture code](https://github.com/markriedl/transformer-walkthrough)

This notebook, designed for illustration and didactic purposes, provides a comprehensive walkthrough of the Transformer architecture code. It guides you through a single forward pass, explaining each stage of the architecture with the help of a detailed computation graph.

* [LLM Visualization](https://bbycroft.net/llm)

Here, Brendan Bycroft presents an impressive interactive visualization of the LLM algorithm behind some of OpenAI's GPT models, allowing you to see the entire process in action.

🎥 You can also find a clear explanation of Transformers in the video *[The Transformer Architecture](https://www.youtube.com/watch?v=tstbZXNCfLY)* by Sebastian Raschka.
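
The resources above all revolve around the same core building block. As a rough point of reference (an illustrative sketch, not code taken from any of the linked posts, with arbitrary dimensions), a single Transformer encoder block can be written in PyTorch along these lines:

```python
import torch
from torch import nn


class EncoderBlock(nn.Module):
    """Minimal Transformer encoder block: self-attention plus feed-forward,
    each wrapped with a residual connection and layer normalization."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention: every position attends to every other position.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)      # residual connection + layer norm
        x = self.norm2(x + self.ff(x))    # feed-forward + residual + layer norm
        return x


block = EncoderBlock()
tokens = torch.randn(1, 10, 512)          # (batch, sequence length, embedding dim)
print(block(tokens).shape)                 # torch.Size([1, 10, 512])
```
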
@@ -45,19 +45,19 @@ The concept of "attention" in deep learning emerged from the need to improve Rec

```{figure} ./attention-scores.png
:width: 450px
:align: right
:align: right
```

* [Understanding the attention mechanism in sequence models](https://www.jeremyjordan.me/attention)

In this post, Jeremy Jordan explains the attention mechanism using numerous helpful diagrams and accessible language. The attention mechanism enables the decoder to search across the entire input sequence for information at each step of output generation; this is a key innovation in sequence-to-sequence neural network architectures because it significantly improves model performance.

* [Attention? Attention!](https://lilianweng.github.io/posts/2018-06-24-attention)

This post by Lilian Weng also includes detailed explanations and numerous useful diagrams to understand the attention mechanism.

* [Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch](https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html)

Here, Sebastian Raschka explains how self-attention works by coding it from scratch, step by step.
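
As a minimal sketch of the computation these posts walk through (illustrative only, with random projection matrices instead of trained weights), scaled dot-product self-attention fits in a few lines of PyTorch:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 16
x = torch.randn(5, d_model)                 # embeddings for a 5-token sequence

# Learned projections map each embedding to queries, keys, and values.
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Attention scores: similarity of each query with every key, scaled by sqrt(d).
scores = Q @ K.T / d_model**0.5
weights = F.softmax(scores, dim=-1)         # each row sums to 1
output = weights @ V                        # weighted sum of value vectors

print(weights.shape, output.shape)          # torch.Size([5, 5]) torch.Size([5, 16])
```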

🎥 You can also find valuable information about the attention mechanism in the following videos:
@@ -79,35 +79,35 @@ Tokenization and embeddings are two essential steps in the data processing pipel

```{figure} ./tiktokenizer.png
:width: 400px
:align: right
:align: right
```

* [The Technical User's Introduction to LLM Tokenization](https://christophergs.com/blog/understanding-llm-tokenization)

In this post, Christopher Samiullah delves into the mechanics of tokenization in LLMs, referencing Andrej Karpathy’s YouTube talk [*Let’s build the GPT Tokenizer*](https://www.youtube.com/watch?v=zduSFxRajkE).

* [Let’s build the GPT Tokenizer](https://www.youtube.com/watch?v=zduSFxRajkE)

If you want to dive deeper into this topic, check out the talk by Andrej Karpathy mentioned above, where he builds the tokenizer used in OpenAI's GPT series from scratch, highlighting "weird behaviors" and common issues associated with tokenization. He also created `minbpe`, a [repository](https://github.com/karpathy/minbpe) with code and exercises for further learning.

* [Tiktokenizer](https://tiktokenizer.vercel.app/)

Check out this link and try tokenization with the Tiktokenizer web app. Tokenization runs live in your browser: enter a text string on the left side and see the tokenized output on the right side in real time.
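
If you prefer to experiment outside the browser, the same tokenizers are also exposed by OpenAI's `tiktoken` library. A minimal sketch (assuming `pip install tiktoken`; the example string is arbitrary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")        # encoding used by GPT-3.5/GPT-4
tokens = enc.encode("LiFePO4 is a cathode material.")

print(tokens)                                      # list of integer token ids
print([enc.decode([t]) for t in tokens])           # the text piece behind each id
print(enc.decode(tokens))                          # round-trips to the original string
```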

* [What Are Word and Sentence Embeddings?](https://cohere.com/blog/sentence-word-embeddings)

In this post, Luis Serrano (Cohere) provides a straightforward introduction to embeddings using practical examples.
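
To connect the two ideas: once a tokenizer has turned text into integer ids, an embedding layer is simply a learned lookup table from ids to vectors. A minimal PyTorch sketch (vocabulary size, dimensions, and token ids are arbitrary assumptions):

```python
import torch
from torch import nn

vocab_size, d_model = 50_000, 64
embedding = nn.Embedding(vocab_size, d_model)     # one trainable vector per token id

token_ids = torch.tensor([[312, 1045, 77, 9021]])  # ids produced by some tokenizer
vectors = embedding(token_ids)
print(vectors.shape)                               # torch.Size([1, 4, 64])
```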

### Model training

* [nanoGPT](https://github.com/karpathy/nanoGPT/)

The simplest, fastest repository for training/finetuning medium-sized GPTs by Andrej Karpathy. It includes a minimal code implementation of a generative language model for educational purposes.

* [Building a GPT that can generate molecules from scratch](https://kjablonka.com/blog/posts/building_an_llm)

Here, Kevin M. Jablonka explains how LLMs work through a practical approach applied to materials science, guiding you through building a GPT model that can generate molecules from scratch. It includes a detailed tutorial covering tokenization, the conversion of tokens into numbers, vector embeddings and positional encoding, as well as model training and evaluation. Through this relatively simple example, you will also learn how the attention mechanism works via an exhaustive implementation of self-attention in the model.

🎥 In this video you can learn how to build a GPT model following the paper *"Attention is All You Need"*:
[Let's build GPT: from scratch, in code, spelled out](https://www.youtube.com/watch?v=kCc8FmEb1nY) by Andrej Karpathy.
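
These training resources converge on the same objective: predict the next token and minimize cross-entropy. A bare-bones sketch of that loop (illustrative only, using a toy stand-in model and random integer data rather than anything from the linked tutorials):

```python
import torch
from torch import nn

vocab_size, d_model, seq_len = 100, 32, 16
model = nn.Sequential(                          # toy stand-in for a GPT-style model
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    batch = torch.randint(0, vocab_size, (8, seq_len + 1))   # random "text"
    inputs, targets = batch[:, :-1], batch[:, 1:]            # shift targets by one token
    logits = model(inputs)                                   # (batch, seq, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
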

@@ -123,15 +123,15 @@ Tokenization and embeddings are two essential steps in the data processing pipel

* [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789)
*(Lewis Tunstall, Leandro von Werra, Thomas Wolf, 2022)*

While the overall objective of this book is to show you how to build language applications using the Hugging Face 🤗 Transformers library, it explains the Transformer architecture in a clear and detailed way. Chapter 2, *Text Classification*, shows how tokenizers work, whereas Chapter 3, *Transformer Anatomy*, takes a closer look at how transformers work for natural language processing. You will learn about the encoder-decoder architecture, embeddings, and the self-attention mechanism. The authors take a hands-on approach that makes the book easy to read and the Transformer architecture simple to understand.

* [Generative Deep Learning](https://www.oreilly.com/library/view/generative-deep-learning/9781098134174)
*(David Foster, 2023)*

Starting with an introduction to generative modeling and deep learning, this book explores the different techniques to build generative models. In Chapter 9, *Transformers*, it provides an overview of the Transformer model architecture, attention mechanism, and encoder-decoder architectures.

* [Understanding Deep Learning](https://udlbook.github.io/udlbook/)
*(Simon J.D. Prince, 2023)*

This book begins by introducing deep learning models and discusses how to train them, measure their performance, and improve it. It then presents architectures specialized for images, text, and graph data. Chapter 12, *Transformers*, explains self-attention and the transformer architecture. This is one of the most educational resources on deep learning available today.
3 changes: 0 additions & 3 deletions content/beyond_text/beyond_images.ipynb
@@ -51,8 +51,6 @@
},
"outputs": [],
"source": [
"import llmstructdata\n",
"\n",
"from pdf2image import convert_from_path\n",
"\n",
"file_path = \"../obtaining_data/PDFs/10.26434_chemrxiv-2024-1l0sn.pdf\"\n",
@@ -357,7 +355,6 @@
}
],
"source": [
"import os\n",
"from litellm import completion\n",
"\n",
"\n",