update docs
diptanu committed May 22, 2024
1 parent e33db8c commit 34023e9
Showing 2 changed files with 31 additions and 10 deletions.
40 changes: 30 additions & 10 deletions docs/docs/getting_started_intermediate.md
@@ -4,15 +4,29 @@ In this example, we will make an LLM answer how much someone would be paying in

While this example is simple, if you were building a production application on tax laws, you could ingest and extract information from hundreds of state-specific documents.

##### Extraction Graph Setup
## Indexify Server

Set up an extraction graph to process the PDF documents -
Download the Indexify server and run it:

```bash title="( Terminal 1 ) Download Indexify Server"
curl https://getindexify.ai | sh
./indexify server -d
```
## Download the Extractors
Before we begin, let's download the extractors:

```bash title="( Terminal 2 ) Download Indexify Extractors"
python3 -m venv venv
source venv/bin/activate
pip3 install indexify-extractor-sdk indexify wikipedia openai
indexify-extractor download hub://pdf/marker
indexify-extractor download hub://embedding/minilm-l6
indexify-extractor download hub://text/chunking
```

- Set the name of the extraction graph to "pdfqa".
- The first stage of the graph converts the PDF document into Markdown. We use the extractor `tensorlake/marker`, which uses a popular open-source PDF-to-Markdown converter model.
- The text is then chunked into smaller fragments. Chunking makes retrieval and processing by LLMs more efficient.
- The chunks are then embedded to make them searchable.
- Each stage of the pipeline is named and connected to its upstream extractor using the field `content_source`.
## Extraction Graph Setup

Set up an extraction graph to process the PDF documents -

=== "Python"
```python
@@ -42,7 +56,7 @@ Set up an extraction graph to process the PDF documents -
import { ExtractionGraph } from "getindexify";

const graph = ExtractionGraph.fromYaml(`
name: 'pdfqa'
name: 'pdfqa' #(1)!
extraction_policies:
- extractor: 'tensorlake/marker'
name: 'mdextract'
@@ -58,7 +72,13 @@ Set up an extraction graph to process the PDF documents -
`);
await client.createExtractionGraph(graph);
```
##### Document Ingestion
!!! note "The Graph"
1. Set the name of the extraction graph to "pdfqa".
2. The first stage converts the PDF document into Markdown. We use the extractor `tensorlake/marker`, which uses a popular open-source PDF-to-Markdown converter model.
3. The text is then chunked into smaller fragments. Chunking makes retrieval and processing by LLMs more efficient.
4. The chunks are then embedded to make them searchable.
5. Each stage of the pipeline is named and connected to its upstream extractor using the field `content_source`.
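
The way `content_source` links the stages can be sketched in plain Python, without calling Indexify at all. This is only an illustration under stated assumptions: the policy names `chunker` and `pdfembedding` and the extractor names for the chunking and embedding stages are placeholders chosen for this sketch, and `upstream_chain` is a hypothetical helper, not part of any Indexify client.

```python
# Illustrative sketch only: plain Python, no Indexify API calls.
# Assumptions: 'chunker', 'pdfembedding' and the last two extractor
# names are placeholders; only 'tensorlake/marker' and 'mdextract'
# appear in the graph shown in this guide.
POLICIES = [
    {"extractor": "tensorlake/marker", "name": "mdextract"},
    {
        "extractor": "example/chunking",  # placeholder extractor name
        "name": "chunker",
        "content_source": "mdextract",
    },
    {
        "extractor": "example/embedding",  # placeholder extractor name
        "name": "pdfembedding",
        "content_source": "chunker",
    },
]


def upstream_chain(policies, name):
    """Return policy names from the root stage down to `name`.

    Walks `content_source` links upward; a policy without that field
    is treated as the root, reading directly from ingested documents.
    """
    by_name = {policy["name"]: policy for policy in policies}
    chain = []
    current = name
    while current is not None:
        if current not in by_name:
            raise KeyError(f"unknown policy: {current!r}")
        if current in chain:
            raise ValueError(f"cycle detected at policy: {current!r}")
        chain.append(current)
        current = by_name[current].get("content_source")
    return list(reversed(chain))


print(" -> ".join(upstream_chain(POLICIES, "pdfembedding")))
# prints: mdextract -> chunker -> pdfembedding
```

A check like this catches a misspelled `content_source` before the graph is submitted, since a stage that points at a non-existent upstream name raises `KeyError`.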
## Document Ingestion

Add the PDF document to the "pdfqa" extraction graph
=== "Python"
@@ -86,7 +106,7 @@ Add the PDF document to the "pdfqa" extraction graph

await client.uploadFile("taxes", "taxes.pdf");
```
##### Prompting and Context Retrieval Function
## Prompting and Context Retrieval Function
We can use the same prompting and context retrieval function defined above to get context for the LLM based on the question.

=== "Python"
1 change: 1 addition & 0 deletions docs/mkdocs.yml
@@ -8,6 +8,7 @@ copyright: Copyright © 2024 Tensorlake

markdown_extensions:
- attr_list
- md_in_html
- def_list
- admonition
- pymdownx.details