dottxt-ai · rlouf · Nov 8, 2024 · Nov 7, 2024 · Nov 7, 2024 · Nov 7, 2024
diff --git a/docs/cookbook/images/trader-joes-receipt.jpg b/docs/cookbook/images/trader-joes-receipt.jpg
diff --git a/docs/cookbook/index.md b/docs/cookbook/index.md
@@ -14,4 +14,5 @@ This part of the documentation provides a few cookbooks that you can browse to g
 - [ReAct Agent](react_agent.md): Build an agent with open weights models using regex-structured generation.
 - [Earnings reports to CSV](earnings-reports.md): Extract data from earnings reports to CSV using regex-structured generation.
 - [Vision-Language Models](atomic_caption.md): Use Outlines with vision-language models for tasks like image captioning and visual reasoning.
+- [Receipt Digitization](receipt-digitization.md): Extract information from a picture of a receipt using structured generation.
 - [Structured Generation from PDFs](read-pdfs.md): Use Outlines with vision-language models to read PDFs and produce structured output.
diff --git a/docs/cookbook/receipt-digitization.md b/docs/cookbook/receipt-digitization.md
@@ -0,0 +1,296 @@
+# Receipt Data Extraction with VLMs
+
+## Setup
+
+You'll need to install the dependencies:
+
+```bash
+pip install outlines torch==2.4.0 transformers accelerate pillow rich
+```
+
+## Import libraries
+
+Load all the necessary libraries:
+
+```python
+# LLM stuff
+import outlines
+import torch
+from transformers import AutoProcessor
+from pydantic import BaseModel, Field
+from typing import Literal, Optional, List
+
+# Image stuff
+from PIL import Image
+import requests
+
+# Rich for pretty printing
+from rich import print
+```
+
+## Choose a model
+
+This example has been tested with `mistral-community/pixtral-12b` ([HF link](https://huggingface.co/mistral-community/pixtral-12b)) and `Qwen/Qwen2-VL-7B-Instruct` ([HF link](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)).
+
+We recommend Qwen-2-VL as we have found it to be more accurate than Pixtral.
+
+If you want to use Qwen-2-VL, you can do the following:
+
+```python
+# To use Qwen-2-VL:
+from transformers import Qwen2VLForConditionalGeneration
+model_name = "Qwen/Qwen2-VL-7B-Instruct"
+model_class = Qwen2VLForConditionalGeneration
+```
+
+If you want to use Pixtral, you can do the following:
+
+```python
+# To use Pixtral:
+from transformers import LlavaForConditionalGeneration
+model_name="mistral-community/pixtral-12b"
+model_class=LlavaForConditionalGeneration
+```
+
+## Load the model
+
+Load the model into memory:
+
+```python
+model = outlines.models.transformers_vision(
+    model_name,
+    model_class=model_class,
+    model_kwargs={
+        "device_map": "auto",
+        "torch_dtype": torch.bfloat16,
+    },
+    processor_kwargs={
+        "device": "cuda", # set to "cpu" if you don't have a GPU
+    },
+)
+```
+
+## Image processing
+
+Images can be quite large. In GPU-poor environments, you may need to downscale them before passing them to the model.
+
+Here's a helper function to do that:
+
+```python
+def load_and_resize_image(image_path, max_size=1024):
+    """
+    Load and resize an image while maintaining aspect ratio
+
+    Args:
+        image_path: Path to the image file
+        max_size: Maximum dimension (width or height) of the output image
+
+    Returns:
+        PIL Image: Resized image
+    """
+    image = Image.open(image_path)
+
+    # Get current dimensions
+    width, height = image.size
+
+    # Calculate scaling factor
+    scale = min(max_size / width, max_size / height)
+
+    # Only resize if image is larger than max_size
+    if scale < 1:
+        new_width = int(width * scale)
+        new_height = int(height * scale)
+        image = image.resize((new_width, new_height), Image.Resampling.LANCZOS)
+
+    return image
+```
+
+You can change the resolution of the image by changing the `max_size` argument. Smaller values of `max_size` make the image blurrier, but processing is faster and requires less memory.
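
To make the scaling rule concrete, here is a small sketch of the dimension arithmetic that `load_and_resize_image` performs, without touching any image data. The helper name `resized_dimensions` is hypothetical; it is not part of the cookbook or of Pillow.

```python
def resized_dimensions(width, height, max_size=1024):
    """Return the (width, height) the resize helper would produce."""
    # Same scaling factor as load_and_resize_image: the tighter of the
    # two constraints wins, so the aspect ratio is preserved.
    scale = min(max_size / width, max_size / height)

    # Images that already fit within max_size are left untouched.
    if scale >= 1:
        return width, height

    return int(width * scale), int(height * scale)


# A 2048x1024 image is halved; an 800x600 image is already small enough.
print(resized_dimensions(2048, 1024))  # (1024, 512)
print(resized_dimensions(800, 600))    # (800, 600)
```

Because `int()` truncates, the shorter side can come out one pixel smaller than the exact proportional size.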
+
+## Load an image
+
+Load an image and resize it. We've provided a sample image of a Trader Joe's receipt, but you can use any image you'd like.
+
+Here's what the image looks like:
+
+![Trader Joe's receipt](./images/trader-joes-receipt.jpg)
+
+```python
+# Path to the image
+image_path = "https://dottxt-ai.github.io/outlines/main/cookbook/images/trader-joes-receipt.jpg"
+
+# Download the image
+response = requests.get(image_path)
+with open("receipt.jpg", "wb") as f:
+    f.write(response.content)
+
+# Load + resize the image
+image = load_and_resize_image("receipt.jpg")
+```
+
+## Define the output structure
+
+We'll define a Pydantic model to describe the data we want to extract from the image.
+
+In our case, we want to extract the following information:
+
+- The store name
+- The store address
+- The store number
+- A list of items, including the name, quantity, price per unit, and total price
+- The tax
+- The total
+- The date
+- The payment method
+
+Most fields are optional, as not all receipts contain all information.
+
+```python
+class Item(BaseModel):
+    name: str
+    quantity: Optional[int]
+    price_per_unit: Optional[float]
+    total_price: Optional[float]
+
+class ReceiptSummary(BaseModel):
+    store_name: str
+    store_address: str
+    store_number: Optional[int]
+    items: List[Item]
+    tax: Optional[float]
+    total: Optional[float]
+    # Date is in the format YYYY-MM-DD. We can apply a regex pattern to ensure it's formatted correctly.
+    date: Optional[str] = Field(pattern=r'\d{4}-\d{2}-\d{2}', description="Date in the format YYYY-MM-DD")
+    payment_method: Literal["cash", "credit", "debit", "check", "other"]
+```
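
The `pattern` on `date` only constrains the shape of the string, not whether it names a real calendar date. Below is a minimal sketch of a post-generation check that uses only the standard library. The helper `is_valid_receipt_date` is hypothetical and is not part of Outlines or Pydantic.

```python
import re
from datetime import date

# Same regex as the `pattern` on ReceiptSummary.date
DATE_PATTERN = r"\d{4}-\d{2}-\d{2}"


def is_valid_receipt_date(value):
    """Check both the YYYY-MM-DD shape and that the date exists."""
    # None is allowed because the field is Optional
    if value is None:
        return True
    # fullmatch rejects extra characters around the date
    if re.fullmatch(DATE_PATTERN, value) is None:
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        # Right shape, but not a real date (e.g. month 13)
        return False
    return True


print(is_valid_receipt_date("2023-11-04"))  # True
print(is_valid_receipt_date("2023-13-45"))  # False
print(is_valid_receipt_date("11/04/2023"))  # False
```

A check like this could run on `result.date` after generation, for example to flag receipts that need manual review.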
+
+## Prepare the prompt
+
+We'll use the `AutoProcessor` to convert the image and the text prompt into a format that the model can understand. In practice,
+this is the code that adds the user, system, assistant, and image tokens to the prompt.
+
+```python
+# Set up the content you want to send to the model
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                # The image is provided as a PIL Image object
+                "type": "image",
+                "image": image,
+            },
+            {
+                "type": "text",
+                "text": f"""You are an expert at extracting information from receipts.
+                Please extract the information from the receipt. Be as detailed as possible --
+                missing or misreporting information is a crime.
+
+                Return the information in the following JSON schema:
+                {ReceiptSummary.model_json_schema()}
+            """},
+        ],
+    }
+]
+
+# Convert the messages to the final prompt
+processor = AutoProcessor.from_pretrained(model_name)
+prompt = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+```
+
+If you are curious, the final prompt that is sent to the model looks (roughly) like this:
+
+```
+<|im_start|>system
+You are a helpful assistant.<|im_end|>
+<|im_start|>user
+<|vision_start|><|image_pad|><|vision_end|>
+You are an expert at extracting information from receipts.
+Please extract the information from the receipt. Be as detailed as
+possible -- missing or misreporting information is a crime.
+
+Return the information in the following JSON schema:
+
+<JSON SCHEMA OMITTED>
+<|im_end|>
+<|im_start|>assistant
+```
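
One detail worth knowing about the schema placeholder: `model_json_schema()` returns a Python dict, so interpolating it into an f-string embeds the dict's `repr` (single-quoted keys) rather than JSON text. The cookbook works as written, but if you would rather have valid JSON in the prompt, `json.dumps` is one option. The sketch below uses a small hand-written dict as a stand-in for the real schema.

```python
import json

# Stand-in for ReceiptSummary.model_json_schema(); the real schema is larger.
schema = {
    "type": "object",
    "properties": {"store_name": {"type": "string"}},
    "required": ["store_name"],
}

# What the f-string interpolation produces: a Python repr, not JSON
as_repr = f"{schema}"

# An alternative: serialize to JSON text before building the prompt
as_json = json.dumps(schema, indent=2)

print(as_repr)
print(as_json)
```

Whether this changes extraction quality is untested here. The output structure is enforced by `outlines.generate.json` either way.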
+
+## Run the model
+
+```python
+# Prepare a function to process receipts
+receipt_summary_generator = outlines.generate.json(
+    model,
+    ReceiptSummary,
+
+    # Greedy sampling is a good idea for numeric
+    # data extraction -- no randomness.
+    sampler=outlines.samplers.greedy()
+)
+
+# Generate the receipt summary
+result = receipt_summary_generator(prompt, [image])
+print(result)
+```
+
+## Output
+
+The output should look like this:
+
+```
+ReceiptSummary(
+    store_name="Trader Joe's",
+    store_address='401 Bay Street, San Francisco, CA 94133',
+    store_number=0,
+    items=[
+        Item(name='BANANA EACH', quantity=7, price_per_unit=0.23, total_price=1.61),
+        Item(name='BAREBELLS CHOCOLATE DOUG', quantity=1, price_per_unit=2.29, total_price=2.29),
+        Item(name='BAREBELLS CREAMY CRISP', quantity=1, price_per_unit=2.29, total_price=2.29),
+        Item(name='BAREBELLS CHOCOLATE DOUG', quantity=1, price_per_unit=2.29, total_price=2.29),
+        Item(name='BAREBELLS CARAMEL CASHEW', quantity=2, price_per_unit=2.29, total_price=4.58),
+        Item(name='BAREBELLS CREAMY CRISP', quantity=1, price_per_unit=2.29, total_price=2.29),
+        Item(name='SPINDRIFT ORANGE MANGO 8', quantity=1, price_per_unit=7.49, total_price=7.49),
+        Item(name='Bottle Deposit', quantity=8, price_per_unit=0.05, total_price=0.4),
+        Item(name='MILK ORGANIC GALLON WHOL', quantity=1, price_per_unit=6.79, total_price=6.79),
+        Item(name='CLASSIC GREEK SALAD', quantity=1, price_per_unit=3.49, total_price=3.49),
+        Item(name='COBB SALAD', quantity=1, price_per_unit=5.99, total_price=5.99),
+        Item(name='PEPPER BELL RED XL EACH', quantity=1, price_per_unit=1.29, total_price=1.29),
+        Item(name='BAG FEE.', quantity=1, price_per_unit=0.25, total_price=0.25),
+        Item(name='BAG FEE.', quantity=1, price_per_unit=0.25, total_price=0.25)
+    ],
+    tax=0.68,
+    total=41.98,
+    date='2023-11-04',
+    payment_method='debit',
+
+)
+```
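
A quick sanity check you can run on an extracted receipt: the item totals plus tax should add up to the receipt total. The sketch below works on plain numbers copied from the output above rather than on the `ReceiptSummary` object, so it runs without a model. The helper names `to_cents` and `totals_are_consistent` are hypothetical. Amounts are converted to integer cents to avoid floating-point rounding surprises.

```python
def to_cents(amount):
    """Convert a dollar amount to integer cents."""
    return round(amount * 100)


def totals_are_consistent(item_totals, tax, total):
    """True if the item totals plus tax equal the receipt total."""
    items_sum = sum(to_cents(t) for t in item_totals)
    return items_sum + to_cents(tax) == to_cents(total)


# total_price of each item in the sample output, in order
item_totals = [1.61, 2.29, 2.29, 2.29, 4.58, 2.29, 7.49,
               0.4, 6.79, 3.49, 5.99, 1.29, 0.25, 0.25]

print(totals_are_consistent(item_totals, tax=0.68, total=41.98))  # True
```

With a real result you could pass `[item.total_price for item in result.items]`, after handling any `None` values, since those fields are optional.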
+
+Voilà! You've successfully extracted information from a receipt using a vision-language model.
+
+## Bonus: roasting the user for their receipt
+
+You can roast the user for their receipt by adding a `roast` field to the end of the `ReceiptSummary` model.
+
+```python
+class ReceiptSummary(BaseModel):
+    ...
+    roast: str
+```
+
+This gives you a result like:
+
+```
+ReceiptSummary(
+    ...
+    roast="You must be a fan of Trader Joe's because you bought enough
+    items to fill a small grocery bag and still had to pay for a bag fee.
+    Maybe you should start using reusable bags to save some money and the
+    environment."
+)
+```
+
+Qwen is not particularly funny, but it's worth a shot.
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -122,6 +122,7 @@ nav:
       - Vision-Language Models: cookbook/atomic_caption.md
       - Structured Generation from PDFs: cookbook/read-pdfs.md
       - Earnings reports to CSV: cookbook/earnings-reports.md
+      - Digitizing receipts with vision models: cookbook/receipt-digitization.md
       - Run on the cloud:
           - BentoML: cookbook/deploy-using-bentoml.md
           - Cerebrium: cookbook/deploy-using-cerebrium.md