Finish Chunk and Parse documentation (#712)
* Add chunk baseline code

* add test code (not complete)

* complete llama index chunk test code

* add chunker test code

* add make metadata_list function

* delete langchain at get chunk instance

* add rst file

* add langchain_chunk and its test code

* delete kiwipiepy at chunk install list at pyproject.toml

* add annotation and use LazyInit at kiwi sentence splitter

* delete unused test code

* add return "path", "start_end_idx" at chunk

* move get_start_end_idx func from llama_index_chunk.py to data/utils/util.py

* change get_start_end_idx func to use find

* add return type "path" and "start_end_idx"

* delete chunk_type parameter and add langchain embedding model at init

* add chunk = ["langchain-experimental"] at pyproject.toml

* change expect_character_idx end id

* add rst file

* delete async

* delete semantic chunking and delete langchain-experimental

* delete async logic at langchain chunk

* just commit

* add new data_creation.png

* create qa_creation folder and add qa_creation.md and answer_gen.md

* Write langchain_parse.md

* first baseline docs

* finish parse.md

* Add Run Parse Pipeline Parse.md

* finish chunk.md

* finish llama_parse.md

* finish langchain_chunk.md

* Add features at chunk.md

* finish llama_index_chunk.md

* finish clova.md

* finish table_hybrid_parse.md

* Add new data_creation.png
bwook00 authored Sep 16, 2024
1 parent 57eafc9 commit 44eaa7c
Showing 9 changed files with 439 additions and 10 deletions.
Binary file modified docs/source/_static/data_creation.png
157 changes: 156 additions & 1 deletion docs/source/data_creation/beta/chunk/chunk.md
@@ -1,8 +1,163 @@
# Chunk

In this section, we will cover how to chunk the parsed result.

It is a crucial step because if the parsed result is not chunked well, the RAG pipeline will not be optimized well.

#### Supported Modules
Using only YAML files, you can easily use the various chunk methods.
The chunked result is saved according to the data format used by AutoRAG.

## Overview

The sample chunk pipeline looks like this.

```python
from autorag.chunker import Chunker

chunker = Chunker.from_parquet(parsed_data_path="your/parsed/data/path")
chunker.start_chunking("your/path/to/chunk_config.yaml")
```

## Features

### 1. Add File Name

The `add_file_name` feature adds the file name to each chunked content.
This helps prevent hallucination caused by retrieving contents from the wrong document.
You need to set the language to either `english` or `korean`.
The default English format is `"file_name: {file_name}\n contents: {content}"`.

#### Example YAML

```yaml
modules:
  - module_type: llama_index_chunk
    chunk_method: [ Token, Sentence ]
    chunk_size: [ 1024, 512 ]
    chunk_overlap: 24
    add_file_name: english
```
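
As a rough illustration (not the exact AutoRAG implementation), the default English `add_file_name` format would produce chunked contents like this:

```python
# Hypothetical file name and content, shown only to illustrate the format string above.
file_name = "finance_report.pdf"
content = "Revenue grew 12% in Q2."

chunked_content = f"file_name: {file_name}\n contents: {content}"
print(chunked_content)
# file_name: finance_report.pdf
#  contents: Revenue grew 12% in Q2.
```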

### 2. Sentence Splitter

The following chunk methods in the `llama_index_chunk` module use a sentence splitter.

- `Semantic_llama_index`
- `SemanticDoubleMerging`
- `SentenceWindow`

These methods use `PunktSentenceTokenizer` as the default sentence splitter.

`PunktSentenceTokenizer` supports the following languages:
Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Italian, Malayalam, Norwegian, Polish, Portuguese, Russian, Slovenian, Spanish, Swedish, Turkish.

If the language you want to use is not in this list, or you want to use a different sentence splitter, you can use the `sentence_splitter` parameter.

#### Available Sentence Splitter
- [kiwi](https://github.com/bab2min/kiwipiepy): For Korean 🇰🇷

#### Example YAML

```yaml
modules:
  - module_type: llama_index_chunk
    chunk_method: [ SentenceWindow ]
    sentence_splitter: kiwi
    window_size: 3
    add_file_name: english
```

#### Using a sentence splitter that is not in the Available Sentence Splitter list

If you want to register your own splitter (here, `kiwi` is shown as an example), you can use the following code.

```python
from typing import Callable, List

from autorag.data import sentence_splitter_modules, LazyInit


def split_by_sentence_kiwi() -> Callable[[str], List[str]]:
    from kiwipiepy import Kiwi

    kiwi = Kiwi()

    def split(text: str) -> List[str]:
        kiwi_result = kiwi.split_into_sents(text)
        sentences = list(map(lambda x: x.text, kiwi_result))
        return sentences

    return split


sentence_splitter_modules["kiwi"] = LazyInit(split_by_sentence_kiwi)
```

## Run Chunk Pipeline

### 1. Set chunker instance

```python
from autorag.chunker import Chunker
chunker = Chunker.from_parquet(parsed_data_path="your/parsed/data/path")
```

```{admonition} Want to specify project folder?
You can specify project directory with `--project_dir` option or project_dir parameter.
```
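
A minimal sketch, assuming `from_parquet` accepts the `project_dir` keyword mentioned in the admonition:

```python
from autorag.chunker import Chunker

# "your/project/directory" is a placeholder; point it at the folder where results should be stored.
chunker = Chunker.from_parquet(
    parsed_data_path="your/parsed/data/path",
    project_dir="your/project/directory",
)
```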

### 2. Set YAML file

Here is an example of how to use the `llama_index_chunk` module.

```yaml
modules:
  - module_type: llama_index_chunk
    chunk_method: [ Token, Sentence ]
    chunk_size: [ 1024, 512 ]
    chunk_overlap: 24
```

### 3. Start chunking

Use the `start_chunking` function to start chunking.

```python
chunker.start_chunking("your/path/to/chunk_config.yaml")
```

### 4. Check the result

If you set `project_dir` parameter, you can check the result in the project directory.
If not, you can check the result in the current directory.

The way to check the result is the same as the `Evaluator` and `Parser` in AutoRAG.

A `trial_folder` is created in `project_dir` first.

If the chunking is completed successfully, the following three types of files are created in the `trial_folder`.

1. Chunked Result
2. Used YAML file
3. Summary file

For example, if chunking is performed using three chunk methods, the following files are created:
`0.parquet`, `1.parquet`, `2.parquet`, `parse_config.yaml`, `summary.csv`

Finally, in the `summary.csv` file, you can see information about each chunked result, such as which chunk method was used to create it.
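
A quick sketch for inspecting the results with pandas (the trial folder path is illustrative, not a fixed AutoRAG path):

```python
import os
import pandas as pd

# Hypothetical trial folder; use the one created inside your project directory.
trial_folder = "your/project/directory/0"

summary = pd.read_csv(os.path.join(trial_folder, "summary.csv"))
print(summary)  # shows which chunk method and parameters produced each parquet file
```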

## Output Columns
- `doc_id`: Document ID. The type is string.
- `contents`: The contents of the chunked data. The type is string.
- `path`: The path of the original document. The type is string.
- `start_end_idx`:
  - Stores the start and end index of the chunked contents within the original string before chunking.
  - It is used to map the `retrieval_gt` of the evaluation QA dataset across the various chunk methods.
- `metadata`: Metadata from the parsed result is carried over into each chunked passage. The type is dictionary.
  - Following the data format of AutoRAG's `Parsed Result`, the metadata should contain the keys `page`, `last_modified_datetime`, and `path`.
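
As an illustration (the file name is hypothetical), the chunked parquet can be loaded to check the columns above, and `start_end_idx` should locate each chunk inside the original parsed text:

```python
import pandas as pd

chunked = pd.read_parquet("0.parquet")  # hypothetical result file name
print(chunked.columns.tolist())
# expected: ['doc_id', 'contents', 'path', 'start_end_idx', 'metadata']

row = chunked.iloc[0]
start, end = row["start_end_idx"]  # assuming a (start, end) index pair
print(row["contents"][:80], (start, end))
```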

#### Supported Chunk Modules

📌 You can check all of our chunk modules
[here](https://edai.notion.site/Supporting-Chunk-Modules-8db803dba2ec4cd0a8789659106e86a3?pvs=4)

```{toctree}
---
46 changes: 46 additions & 0 deletions docs/source/data_creation/beta/chunk/langchain_chunk.md
@@ -1 +1,47 @@
# Langchain Chunk

Chunk parsed results using [langchain text splitters](https://api.python.langchain.com/en/latest/text_splitters_api_reference.html#).

## Available Chunk Methods

### 1. Token

- [SentenceTransformersToken](https://api.python.langchain.com/en/latest/sentence_transformers/langchain_text_splitters.sentence_transformers.SentenceTransformersTokenTextSplitter.html)

### 2. Character

- [RecursiveCharacter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html)
- [character](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.CharacterTextSplitter.html)

### 3. Sentence

- [konlpy](https://api.python.langchain.com/en/latest/konlpy/langchain_text_splitters.konlpy.KonlpyTextSplitter.html): For Korean 🇰🇷

#### Example YAML

```yaml
modules:
  - module_type: langchain_chunk
    chunk_method: konlpy
    add_file_name: korean
```

## Using a Langchain Chunk Method that is not in the Available Chunk Methods

You can find more information about the langchain chunk methods
[here](https://api.python.langchain.com/en/latest/text_splitters_api_reference.html#).

### How to Use

If you want to use `PythonCodeTextSplitter`, which is not in the available chunk methods, you can use the following code.

```python
from autorag.data import chunk_modules
from langchain.text_splitter import PythonCodeTextSplitter
chunk_modules["python"] = PythonCodeTextSplitter
```

```{attention}
The key value in chunk_modules must always be written in lowercase.
```
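
After registering, the new key can be referenced in your YAML — a sketch, assuming the registered key is usable as a `chunk_method` value:

```yaml
modules:
  - module_type: langchain_chunk
    chunk_method: python
```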
56 changes: 56 additions & 0 deletions docs/source/data_creation/beta/chunk/llama_index_chunk.md
@@ -1 +1,57 @@
# Llama Index Chunk

Chunk parsed results using [Llama Index Node_Parsers & Text Splitters](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/).

## Available Chunk Methods

### 1. Token

- [Token](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/token_text_splitter/)

### 2. Sentence

- [Sentence](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_splitter/)

### 3. Window

- [SentenceWindow](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_window/)

### 4. Semantic

- [semantic_llama_index](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/semantic_splitter/)
- [SemanticDoubleMerging](https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_double_merging_chunking/)

### 5. Simple

- [Simple](https://docs.llamaindex.ai/en/v0.10.19/api/llama_index.core.node_parser.SimpleFileNodeParser.html)

#### Example YAML

```yaml
modules:
  - module_type: llama_index_chunk
    chunk_method: [ Token, Sentence ]
    chunk_size: [ 1024, 512 ]
    chunk_overlap: 24
    add_file_name: english
```

## Using a Llama Index Chunk Method that is not in the Available Chunk Methods

You can find more information about the llama index chunk methods
[here](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/).

### How to Use

If you want to use `HTMLNodeParser`, which is not in the available chunk methods, you can use the following code.

```python
from autorag.data import chunk_modules
from llama_index.core.node_parser import HTMLNodeParser
chunk_modules["html"] = HTMLNodeParser
```

```{attention}
The key value in chunk_modules must always be written in lowercase.
```
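
Once registered, the key can then appear in your YAML — a sketch, assuming the registered key works as a `chunk_method` value:

```yaml
modules:
  - module_type: llama_index_chunk
    chunk_method: html
```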
29 changes: 29 additions & 0 deletions docs/source/data_creation/beta/parse/clova.md
@@ -1 +1,30 @@
# Clova

Parse raw documents using Naver
[Clova OCR](https://guide.ncloud-docs.com/docs/clovaocr-overview).

Clova OCR divides the document into pages for parsing.

## Table Detection

If you have tables in your raw document, set `table_detection: true` to use clova ocr table detection feature.

### Point

#### 1. HTML Parser

Clova OCR returns parsed table information in a complex JSON format.
AutoRAG converts this complex JSON representation of the table to HTML so it can be stored and passed to the LLM.

The parser was created by the AutoRAG team; you can find the detailed code in the `json_to_html_table` function in `autorag.data.parse.clova`.

#### 2. Text information comes separately from table information

If your document contains both text and tables, the text information comes separately from the table information.
So when using `table_detection`, the result is saved in the `{text}\n\ntable html:\n{table}` format.
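
A rough illustration of that stored format (the text and table contents below are made up):

```python
# Made-up text and table, purely to illustrate the stored format described above.
text = "Quarterly revenue grew by 12% compared to last year."
table_html = "<table><tr><th>Quarter</th><th>Revenue</th></tr><tr><td>Q2</td><td>120</td></tr></table>"

stored = f"{text}\n\ntable html:\n{table_html}"
print(stored)
```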

## Example YAML

```yaml
modules:
  - module_type: clova
    table_detection: true
```
8 changes: 6 additions & 2 deletions docs/source/data_creation/beta/parse/llama_parse.md
@@ -12,14 +12,18 @@ You can set the language using the `language` parameter.

## Table Extraction

If you have tables in your raw document, set `result_type: markdown` to convert them to Markdown and save them.


📌 `result_type`: You can choose from three result types:
- text
- markdown
- json

## Example YAML

```yaml
modules:
  - module_type: llama_parse
    result_type: markdown
    language: ko
    language: en
```