Finish Chunk and Parse documentation (#712)
* Add chunk baseline code

* add test code (not complete)

* complete llama index chunk test code

* add chunker test code

* add make metadata_list function

* delete langchain at get chunk instance

* add rst file

* add langchain_chunk and its test code

* delete kiwipiepy at chunk install list at pyproject.toml

* add annotation and use LazyInit at kiwi sentence splitter

* delete unused test code

* add return "path", "start_end_idx" at chunk

* move get_start_end_idx func from llama_index_chunk.py to data/utils/util.py

* change get_start_end_idx func to use find

* add return type "path" and "start_end_idx"

* delete chunk_type parameter and add langchain embedding model at init

* add chunk = ["langchain-experimental"] at pyproject.toml

* change expect_character_idx end id

* add rst file

* delete async

* delete semantic chunking and delete langchain-experimental

* delete async logic at langchain chunk

* just commit

* add new data_creation.png

* create qa_creation folder and add qa_creation.md and answer_gen.md

* Write langchain_parse.md

* first baseline docs

* finish parse.md

* Add Run Parse Pipeline Parse.md

* finish chunk.md

* finish llama_parse.md

* finish langchain_chunk.md

* Add features at chunk.md

* finish llama_index_chunk.md

* finish clova.md

* finish table_hybrid_parse.md

* Add new data_creation.png
bwook00 authored Sep 16, 2024
1 parent 57eafc9 commit 44eaa7c
Showing 9 changed files with 439 additions and 10 deletions.
Binary file modified docs/source/_static/data_creation.png
157 changes: 156 additions & 1 deletion docs/source/data_creation/beta/chunk/chunk.md
@@ -1,8 +1,163 @@
# Chunk

In this section, we will cover how to chunk the parsed result.

It is a crucial step because if the parsed result is not chunked well, the RAG pipeline will not be optimized well.

#### Supported Modules
Using only YAML files, you can easily use the various chunk methods.
The chunked result is saved according to the data format used by AutoRAG.

## Overview

The sample chunk pipeline looks like this.

```python
from autorag.chunker import Chunker

chunker = Chunker.from_parquet(parsed_data_path="your/parsed/data/path")
chunker.start_chunking("your/path/to/chunk_config.yaml")
```

## Features

### 1. Add File Name

The `add_file_name` feature adds the file name to each chunked content.
This helps prevent hallucination caused by retrieving contents from the wrong document.
You need to set the language to either `english` or `korean`.
The default English format is `"file_name: {file_name}\n contents: {content}"`.

#### Example YAML

```yaml
modules:
  - module_type: llama_index_chunk
    chunk_method: [ Token, Sentence ]
    chunk_size: [ 1024, 512 ]
    chunk_overlap: 24
    add_file_name: english
```
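
As a rough illustration (not the exact AutoRAG implementation), the default English `add_file_name` format would produce chunked contents like this:

```python
# Hypothetical file name and content, shown only to illustrate the format string above.
file_name = "finance_report.pdf"
content = "Revenue grew 12% in Q2."

chunked_content = f"file_name: {file_name}\n contents: {content}"
print(chunked_content)
# file_name: finance_report.pdf
#  contents: Revenue grew 12% in Q2.
```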

### 2. Sentence Splitter

The following chunk methods in the `llama_index_chunk` module use a sentence splitter.

- `Semantic_llama_index`
- `SemanticDoubleMerging`
- `SentenceWindow`

These methods use `PunktSentenceTokenizer` as the default sentence splitter.

`PunktSentenceTokenizer` supports the following languages:
Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Italian, Malayalam, Norwegian, Polish, Portuguese, Russian, Slovenian, Spanish, Swedish, Turkish.

If the language you want to use is not in this list, or you want to use a different sentence splitter, you can use the `sentence_splitter` parameter.

#### Available Sentence Splitter
- [kiwi](https://github.com/bab2min/kiwipiepy): For Korean 🇰🇷

#### Example YAML

```yaml
modules:
  - module_type: llama_index_chunk
    chunk_method: [ SentenceWindow ]
    sentence_splitter: kiwi
    window_size: 3
    add_file_name: english
```

#### Using a sentence splitter that is not in the Available Sentence Splitter list

If you want to register your own splitter (here, `kiwi` is shown as an example), you can use the following code.

```python
from typing import Callable, List

from autorag.data import sentence_splitter_modules, LazyInit


def split_by_sentence_kiwi() -> Callable[[str], List[str]]:
    from kiwipiepy import Kiwi

    kiwi = Kiwi()

    def split(text: str) -> List[str]:
        kiwi_result = kiwi.split_into_sents(text)
        sentences = list(map(lambda x: x.text, kiwi_result))
        return sentences

    return split


sentence_splitter_modules["kiwi"] = LazyInit(split_by_sentence_kiwi)
```

## Run Chunk Pipeline

### 1. Set chunker instance

```python
from autorag.chunker import Chunker
chunker = Chunker.from_parquet(parsed_data_path="your/parsed/data/path")
```

```{admonition} Want to specify project folder?
You can specify project directory with `--project_dir` option or project_dir parameter.
```
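
A minimal sketch, assuming `from_parquet` accepts the `project_dir` keyword mentioned in the admonition:

```python
from autorag.chunker import Chunker

# "your/project/directory" is a placeholder; point it at the folder where results should be stored.
chunker = Chunker.from_parquet(
    parsed_data_path="your/parsed/data/path",
    project_dir="your/project/directory",
)
```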

### 2. Set YAML file

Here is an example of how to use the `llama_index_chunk` module.

```yaml
modules:
  - module_type: llama_index_chunk
    chunk_method: [ Token, Sentence ]
    chunk_size: [ 1024, 512 ]
    chunk_overlap: 24
```

### 3. Start chunking

Use the `start_chunking` function to start chunking.

```python
chunker.start_chunking("your/path/to/chunk_config.yaml")
```

### 4. Check the result

If you set `project_dir` parameter, you can check the result in the project directory.
If not, you can check the result in the current directory.

The way to check the result is the same as the `Evaluator` and `Parser` in AutoRAG.

A `trial_folder` is created in `project_dir` first.

If the chunking is completed successfully, the following three types of files are created in the `trial_folder`.

1. Chunked Result
2. Used YAML file
3. Summary file

For example, if chunking is performed using three chunk methods, the following files are created:
`0.parquet`, `1.parquet`, `2.parquet`, `parse_config.yaml`, `summary.csv`

Finally, in the `summary.csv` file, you can see information about each chunked result, such as which chunk method was used to create it.
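
A quick sketch for inspecting the results with pandas (the trial folder path is illustrative, not a fixed AutoRAG path):

```python
import os
import pandas as pd

# Hypothetical trial folder; use the one created inside your project directory.
trial_folder = "your/project/directory/0"

summary = pd.read_csv(os.path.join(trial_folder, "summary.csv"))
print(summary)  # shows which chunk method and parameters produced each parquet file
```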

## Output Columns
- `doc_id`: Document ID. The type is string.
- `contents`: The contents of the chunked data. The type is string.
- `path`: The path of the original document. The type is string.
- `start_end_idx`:
  - Stores the start and end index of the chunked contents within the original string before chunking.
  - It is used to map the `retrieval_gt` of the evaluation QA dataset across the various chunk methods.
- `metadata`: Metadata from the parsed result is carried over into each chunked passage. The type is dictionary.
  - Following the data format of AutoRAG's `Parsed Result`, the metadata should contain the keys `page`, `last_modified_datetime`, and `path`.
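
As an illustration (the file name is hypothetical), the chunked parquet can be loaded to check the columns above, and `start_end_idx` should locate each chunk inside the original parsed text:

```python
import pandas as pd

chunked = pd.read_parquet("0.parquet")  # hypothetical result file name
print(chunked.columns.tolist())
# expected: ['doc_id', 'contents', 'path', 'start_end_idx', 'metadata']

row = chunked.iloc[0]
start, end = row["start_end_idx"]  # assuming a (start, end) index pair
print(row["contents"][:80], (start, end))
```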

#### Supported Chunk Modules

📌 You can check all of our chunk modules
[here](https://edai.notion.site/Supporting-Chunk-Modules-8db803dba2ec4cd0a8789659106e86a3?pvs=4)

```{toctree}
---
46 changes: 46 additions & 0 deletions docs/source/data_creation/beta/chunk/langchain_chunk.md
@@ -1 +1,47 @@
# Langchain Chunk

Chunk parsed results using [langchain text splitters](https://api.python.langchain.com/en/latest/text_splitters_api_reference.html#).

## Available Chunk Methods

### 1. Token

- [SentenceTransformersToken](https://api.python.langchain.com/en/latest/sentence_transformers/langchain_text_splitters.sentence_transformers.SentenceTransformersTokenTextSplitter.html)

### 2. Character

- [RecursiveCharacter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html)
- [character](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.CharacterTextSplitter.html)

### 3. Sentence

- [konlpy](https://api.python.langchain.com/en/latest/konlpy/langchain_text_splitters.konlpy.KonlpyTextSplitter.html): For Korean 🇰🇷

#### Example YAML

```yaml
modules:
  - module_type: langchain_chunk
    chunk_method: konlpy
    add_file_name: korean
```

## Using a Langchain Chunk Method that is not in the Available Chunk Methods

You can find more information about the langchain chunk methods
[here](https://api.python.langchain.com/en/latest/text_splitters_api_reference.html#).

### How to Use

If you want to use `PythonCodeTextSplitter`, which is not in the available chunk methods, you can use the following code.

```python
from autorag.data import chunk_modules
from langchain.text_splitter import PythonCodeTextSplitter
chunk_modules["python"] = PythonCodeTextSplitter
```

```{attention}
The key value in chunk_modules must always be written in lowercase.
```
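
After registering, the new key can be referenced in your YAML — a sketch, assuming the registered key is usable as a `chunk_method` value:

```yaml
modules:
  - module_type: langchain_chunk
    chunk_method: python
```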
56 changes: 56 additions & 0 deletions docs/source/data_creation/beta/chunk/llama_index_chunk.md
@@ -1 +1,57 @@
# Llama Index Chunk

Chunk parsed results using [Llama Index Node_Parsers & Text Splitters](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/).

## Available Chunk Methods

### 1. Token

- [Token](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/token_text_splitter/)

### 2. Sentence

- [Sentence](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_splitter/)

### 3. Window

- [SentenceWindow](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_window/)

### 4. Semantic

- [semantic_llama_index](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/semantic_splitter/)
- [SemanticDoubleMerging](https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_double_merging_chunking/)

### 5. Simple

- [Simple](https://docs.llamaindex.ai/en/v0.10.19/api/llama_index.core.node_parser.SimpleFileNodeParser.html)

#### Example YAML

```yaml
modules:
  - module_type: llama_index_chunk
    chunk_method: [ Token, Sentence ]
    chunk_size: [ 1024, 512 ]
    chunk_overlap: 24
    add_file_name: english
```

## Using a Llama Index Chunk Method that is not in the Available Chunk Methods

You can find more information about the llama index chunk methods
[here](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/).

### How to Use

If you want to use `HTMLNodeParser`, which is not in the available chunk methods, you can use the following code.

```python
from autorag.data import chunk_modules
from llama_index.core.node_parser import HTMLNodeParser
chunk_modules["html"] = HTMLNodeParser
```

```{attention}
The key value in chunk_modules must always be written in lowercase.
```
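
Once registered, the key can then appear in your YAML — a sketch, assuming the registered key works as a `chunk_method` value:

```yaml
modules:
  - module_type: llama_index_chunk
    chunk_method: html
```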
29 changes: 29 additions & 0 deletions docs/source/data_creation/beta/parse/clova.md
@@ -1 +1,30 @@
# Clova

Parse raw documents using Naver
[Clova OCR](https://guide.ncloud-docs.com/docs/clovaocr-overview).

Clova OCR divides the document into pages for parsing.

## Table Detection

If you have tables in your raw document, set `table_detection: true` to use clova ocr table detection feature.

### Point

#### 1. HTML Parser

Clova OCR returns parsed table information in a complex JSON format.
AutoRAG converts this complex JSON representation of the table to HTML so it can be stored and passed to the LLM.

The parser was created by the AutoRAG team; you can find the detailed code in the `json_to_html_table` function in `autorag.data.parse.clova`.

#### 2. Text information comes separately from table information

If your document contains both text and tables, the text information comes separately from the table information.
So when using `table_detection`, the result is saved in the `{text}\n\ntable html:\n{table}` format.
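
A rough illustration of that stored format (the text and table contents below are made up):

```python
# Made-up text and table, purely to illustrate the stored format described above.
text = "Quarterly revenue grew by 12% compared to last year."
table_html = "<table><tr><th>Quarter</th><th>Revenue</th></tr><tr><td>Q2</td><td>120</td></tr></table>"

stored = f"{text}\n\ntable html:\n{table_html}"
print(stored)
```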

## Example YAML

```yaml
modules:
  - module_type: clova
    table_detection: true
```
8 changes: 6 additions & 2 deletions docs/source/data_creation/beta/parse/llama_parse.md
@@ -12,14 +12,18 @@ You can set the language using the `language` parameter.

## Table Extraction

If you have tables in your raw document, set `result_type: markdown` to convert them to Markdown and save them.


📌 `result_type`: You can choose from three result types:
- text
- markdown
- json

## Example YAML

```yaml
modules:
  - module_type: llama_parse
    result_type: markdown
    language: ko
    language: en
```