diff --git a/docs/source/_static/data_creation.png b/docs/source/_static/data_creation.png
index 0971c508b..c55db5f33 100644
Binary files a/docs/source/_static/data_creation.png and b/docs/source/_static/data_creation.png differ
diff --git a/docs/source/data_creation/beta/chunk/chunk.md b/docs/source/data_creation/beta/chunk/chunk.md
index 44757a0a9..a2edb2f48 100644
--- a/docs/source/data_creation/beta/chunk/chunk.md
+++ b/docs/source/data_creation/beta/chunk/chunk.md
@@ -1,8 +1,163 @@
 # Chunk
+In this section, we will cover how to chunk the parsed result.
+It is a crucial step, because if the parsed result is not chunked well, the RAG pipeline cannot be optimized well.
 
-#### Supported Modules
+Using only YAML files, you can easily use the various chunk methods.
+The chunked result is saved according to the data format used by AutoRAG.
+
+## Overview
+
+The sample chunk pipeline looks like this.
+
+```python
+from autorag.chunker import Chunker
+
+chunker = Chunker.from_parquet(parsed_data_path="your/parsed/data/path")
+chunker.start_chunking("your/path/to/chunk_config.yaml")
+```
+
+## Features
+
+### 1. Add File Name
+
+The `add_file_name` feature adds the file name to the chunked contents.
+This helps prevent hallucination caused by retrieving contents from the wrong document.
+You need to set the language to one of 'english' or 'korean'.
+The default English format is `"file_name: {file_name}\n contents: {content}"`.
+
+#### Example YAML
+
+```yaml
+modules:
+  - module_type: llama_index_chunk
+    chunk_method: [ Token, Sentence ]
+    chunk_size: [ 1024, 512 ]
+    chunk_overlap: 24
+    add_file_name: english
+```
+
+### 2. Sentence Splitter
+
+The following chunk methods in the `llama_index_chunk` module use a sentence splitter.
+
+- `Semantic_llama_index`
+- `SemanticDoubling`
+- `SentenceWindow`
+
+These methods use `PunktSentenceTokenizer` as the default sentence splitter, which supports the following languages:
+
+Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Italian, Malayalam, Norwegian, Polish, Portuguese, Russian, Slovenian, Spanish, Swedish, Turkish.
+
+So if the language you want to use is not in this list, or you want to use a different sentence splitter, you can use the `sentence_splitter` parameter.
+
+#### Available Sentence Splitter
+- [kiwi](https://github.com/bab2min/kiwipiepy) : For Korean 🇰🇷
+
+#### Example YAML
+
+```yaml
+modules:
+  - module_type: llama_index_chunk
+    chunk_method: [ SentenceWindow ]
+    sentence_splitter: kiwi
+    window_size: 3
+    add_file_name: english
+```
+
+#### Using a sentence splitter that is not in the Available Sentence Splitter
+
+If you want to register a sentence splitter yourself, for example `kiwi`, you can use the following code.
+
+```python
+from typing import Callable, List
+
+from autorag.data import sentence_splitter_modules, LazyInit
+
+def split_by_sentence_kiwi() -> Callable[[str], List[str]]:
+    from kiwipiepy import Kiwi
+
+    kiwi = Kiwi()
+
+    def split(text: str) -> List[str]:
+        kiwi_result = kiwi.split_into_sents(text)
+        sentences = list(map(lambda x: x.text, kiwi_result))
+
+        return sentences
+
+    return split
+
+sentence_splitter_modules["kiwi"] = LazyInit(split_by_sentence_kiwi)
+```
+
+## Run Chunk Pipeline
+
+### 1. Set chunker instance
+
+```python
+from autorag.chunker import Chunker
+
+chunker = Chunker.from_parquet(parsed_data_path="your/parsed/data/path")
+```
+
+```{admonition} Want to specify project folder?
+You can specify the project directory with the `--project_dir` option or the `project_dir` parameter.
+```
+
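+For example, a minimal sketch that sets the project directory when creating the chunker might look like this (this assumes `from_parquet` accepts the `project_dir` keyword mentioned above; both paths are placeholders):
+
+```python
+from autorag.chunker import Chunker
+
+# Hypothetical paths; replace them with your own parsed data and project folder.
+chunker = Chunker.from_parquet(
+    parsed_data_path="your/parsed/data/path",
+    project_dir="your/project/directory",
+)
+```
+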
+### 2. Set YAML file
+
+Here is an example of how to use the `llama_index_chunk` module.
+
+```yaml
+modules:
+  - module_type: llama_index_chunk
+    chunk_method: [ Token, Sentence ]
+    chunk_size: [ 1024, 512 ]
+    chunk_overlap: 24
+```
+
+### 3. Start chunking
+
+Use the `start_chunking` function to start chunking.
+
+```python
+chunker.start_chunking("your/path/to/chunk_config.yaml")
+```
+
+### 4. Check the result
+
+If you set the `project_dir` parameter, you can check the result in the project directory.
+If not, you can check the result in the current directory.
+
+The way to check the result is the same as for the `Evaluator` and `Parser` in AutoRAG.
+
+A `trial_folder` is created in `project_dir` first.
+
+If the chunking is completed successfully, the following three types of files are created in the trial folder.
+
+1. Chunked Result
+2. Used YAML file
+3. Summary file
+
+For example, if chunking is performed using three chunk methods, the following files are created:
+`0.parquet`, `1.parquet`, `2.parquet`, `parse_config.yaml`, `summary.csv`
+
+Finally, in the `summary.csv` file, you can see information about the chunked result, such as which chunk method was used to chunk it.
+
+## Output Columns
+- `doc_id`: Document ID. The type is string.
+- `contents`: The contents of the chunked data. The type is string.
+- `path`: The path of the document. The type is string.
+- `start_end_idx`:
+  - Stores the start and end index of the chunked content (`chunked_str`) within the original content (`original_str`) before chunking.
+  - It is stored to map the `retrieval_gt` of the Evaluation QA dataset across the various chunk methods.
+- `metadata`: Metadata from the parsed result, carried over to the chunked passage. The type is dictionary.
+  - Depending on the data format of AutoRAG's `Parsed Result`, metadata should have the following keys: `page`, `last_modified_datetime`, `path`.
+
+#### Supported Chunk Modules
+
+📌 You can check all of our Chunk modules
+[here](https://edai.notion.site/Supporting-Chunk-Modules-8db803dba2ec4cd0a8789659106e86a3?pvs=4)
 
 ```{toctree}
 ---
diff --git a/docs/source/data_creation/beta/chunk/langchain_chunk.md b/docs/source/data_creation/beta/chunk/langchain_chunk.md
index 29faa8e09..c1dec9b15 100644
--- a/docs/source/data_creation/beta/chunk/langchain_chunk.md
+++ b/docs/source/data_creation/beta/chunk/langchain_chunk.md
@@ -1 +1,47 @@
 # Langchain Chunk
+
+Chunk parsed results using [langchain text splitters](https://api.python.langchain.com/en/latest/text_splitters_api_reference.html#).
+
+## Available Chunk Methods
+
+### 1. Token
+
+- [SentenceTransformersToken](https://api.python.langchain.com/en/latest/sentence_transformers/langchain_text_splitters.sentence_transformers.SentenceTransformersTokenTextSplitter.html)
+
+### 2. Character
+
+- [RecursiveCharacter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html)
+- [character](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.CharacterTextSplitter.html)
+
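+#### Example YAML
+
+A hypothetical configuration using the recursive character splitter might look like the following; the lowercase `recursivecharacter` method name and the pass-through of `chunk_size` / `chunk_overlap` to the splitter follow the conventions used in these docs, but are assumptions rather than tested values.
+
+```yaml
+modules:
+  - module_type: langchain_chunk
+    chunk_method: recursivecharacter
+    chunk_size: 512
+    chunk_overlap: 24
+    add_file_name: english
+```
+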
+### 3. Sentence
+
+- [konlpy](https://api.python.langchain.com/en/latest/konlpy/langchain_text_splitters.konlpy.KonlpyTextSplitter.html): For Korean 🇰🇷
+
+#### Example YAML
+
+```yaml
+modules:
+  - module_type: langchain_chunk
+    chunk_method: konlpy
+    add_file_name: korean
+```
+
+## Using a Langchain Chunk Method that is not in the Available Chunk Methods
+
+You can find more information about the langchain chunk methods
+[here](https://api.python.langchain.com/en/latest/text_splitters_api_reference.html#).
+
+### How to Use
+
+If you want to use `PythonCodeTextSplitter`, which is not in the available chunk methods, you can use the following code.
+
+```python
+from autorag.data import chunk_modules
+from langchain.text_splitter import PythonCodeTextSplitter
+
+chunk_modules["python"] = PythonCodeTextSplitter
+```
+
+```{attention}
+The key value in `chunk_modules` must always be written in lowercase.
+```
diff --git a/docs/source/data_creation/beta/chunk/llama_index_chunk.md b/docs/source/data_creation/beta/chunk/llama_index_chunk.md
index d8d69cfd7..922b276ad 100644
--- a/docs/source/data_creation/beta/chunk/llama_index_chunk.md
+++ b/docs/source/data_creation/beta/chunk/llama_index_chunk.md
@@ -1 +1,57 @@
 # Llama Index Chunk
+
+Chunk parsed results using [Llama Index Node_Parsers & Text Splitters](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/).
+
+## Available Chunk Methods
+
+### 1. Token
+
+- [Token](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/token_text_splitter/)
+
+### 2. Sentence
+
+- [Sentence](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_splitter/)
+
+### 3. Window
+
+- [SentenceWindow](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_window/)
+
+### 4. Semantic
+
+- [semantic_llama_index](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/semantic_splitter/)
+- [SemanticDoubleMerging](https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_double_merging_chunking/)
+
+### 5. Simple
+
+- [Simple](https://docs.llamaindex.ai/en/v0.10.19/api/llama_index.core.node_parser.SimpleFileNodeParser.html)
+
+#### Example YAML
+
+```yaml
+modules:
+  - module_type: llama_index_chunk
+    chunk_method: [ Token, Sentence ]
+    chunk_size: [ 1024, 512 ]
+    chunk_overlap: 24
+    add_file_name: english
+```
+
+## Using a Llama Index Chunk Method that is not in the Available Chunk Methods
+
+You can find more information about the llama index chunk methods
+[here](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/).
+
+### How to Use
+
+If you want to use `HTMLNodeParser`, which is not in the available chunk methods, you can use the following code.
+
+```python
+from autorag.data import chunk_modules
+from llama_index.core.node_parser import HTMLNodeParser
+
+chunk_modules["html"] = HTMLNodeParser
+```
+
+```{attention}
+The key value in `chunk_modules` must always be written in lowercase.
+```
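+
+Once registered, the new key can be referenced from YAML like any other chunk method. A hypothetical entry reusing the lowercase `html` key registered above (an illustrative sketch, not a tested configuration):
+
+```yaml
+modules:
+  - module_type: llama_index_chunk
+    chunk_method: html
+    add_file_name: english
+```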
diff --git a/docs/source/data_creation/beta/parse/clova.md b/docs/source/data_creation/beta/parse/clova.md
index 58e6e8bb3..2a58af6ff 100644
--- a/docs/source/data_creation/beta/parse/clova.md
+++ b/docs/source/data_creation/beta/parse/clova.md
@@ -1 +1,30 @@
 # Clova
+
+Parse raw documents using Naver
+[Clova OCR](https://guide.ncloud-docs.com/docs/clovaocr-overview).
+
+Clova OCR divides the document into pages for parsing.
+
+## Table Detection
+
+If you have tables in your raw document, set `table_detection: true` to use the Clova OCR table detection feature.
+
+### Point
+
+#### 1. HTML Parser
+
+Clova OCR provides parsed table information in a complex JSON format.
+The parser converts this complex JSON form of the table to HTML so that it can be stored and passed to the LLM.
+
+The parser was created by the AutoRAG team; you can find the detailed code in the `json_to_html_table` function in `autorag.data.parse.clova`.
+
+#### 2. The text information comes separately from the table information
+
+If your document contains both tables and text, the text information comes separately from the table information.
+So when using `table_detection`, the result is saved in the `{text}\n\ntable html:\n{table}` format.
+
+## Example YAML
+
+```yaml
+modules:
+  - module_type: clova
+    table_detection: true
+```
diff --git a/docs/source/data_creation/beta/parse/llama_parse.md b/docs/source/data_creation/beta/parse/llama_parse.md
index 9b7de546e..84253df0c 100644
--- a/docs/source/data_creation/beta/parse/llama_parse.md
+++ b/docs/source/data_creation/beta/parse/llama_parse.md
@@ -12,8 +12,12 @@
 You can set language to use `language` parameter.
 
 ## Table Extraction
 
+If you have tables in your raw document, set `result_type: markdown` to convert them to Markdown and save them.
+
-
+📌 `result_type`: You can set one of 3 result types.
+- text
+- markdown
+- json
 
 ## Example YAML
 
@@ -21,5 +25,5 @@
 modules:
   - module_type: llama_parse
     result_type: markdown
-    language: ko
+    language: en
 ```
diff --git a/docs/source/data_creation/beta/parse/parse.md b/docs/source/data_creation/beta/parse/parse.md
index 93c95fc30..f98fe7bb3 100644
--- a/docs/source/data_creation/beta/parse/parse.md
+++ b/docs/source/data_creation/beta/parse/parse.md
@@ -1,20 +1,101 @@
 # Parse
-Using only YAML files, you can easily use the various document loaders of the langchain.
+In this section, we will cover how to parse raw documents.
+
+It is a crucial step to parse the raw documents, because if the raw documents are not parsed well, the RAG pipeline cannot be optimized well.
+
+Using only YAML files, you can easily use the various document loaders.
 The parsed result is saved according to the data format used by AutoRAG.
+
+## Overview
+
+The sample parse pipeline looks like this.
+
+```python
+from autorag.parser import Parser
+
+parser = Parser(data_path_glob="your/data/path/*")
+parser.start_parsing("your/path/to/parse_config.yaml")
+```
+
+## Run Parse Pipeline
+
+### 1. Set parser instance
+
+```python
+from autorag.parser import Parser
+
+parser = Parser(data_path_glob="your/data/path/*")
+```
+
+#### 📌 Parameter: `data_path_glob`
+
+The Parser instance requires the `data_path_glob` parameter.
+This parameter is used to specify the path of the documents to be parsed.
+
+Only glob patterns are supported.
+
+You can use the wildcard character `*` to specify multiple files.
+
+You can specify a file extension like `*.pdf` to select specific file types.
+
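+For example, a minimal sketch that restricts parsing to PDF files might look like this (the directory is a placeholder):
+
+```python
+from autorag.parser import Parser
+
+# Glob pattern limited to PDF files; replace the directory with your own.
+parser = Parser(data_path_glob="your/data/path/*.pdf")
+```
+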
+```{admonition} Want to specify project folder?
+You can specify the project directory with the `--project_dir` option or the `project_dir` parameter.
+```
+
+### 2. Set YAML file
+
+Here is an example of how to use the `langchain_parse` module.
+
+```yaml
+modules:
+  - module_type: langchain_parse
+    parse_method: [ pdfminer, pdfplumber ]
+```
+
+### 3. Start parsing
+
+Use the `start_parsing` function to start parsing.
+
+```python
+parser.start_parsing("your/path/to/parse_config.yaml")
+```
+
+### 4. Check the result
+
+If you set the `project_dir` parameter, you can check the result in the project directory.
+If not, you can check the result in the current directory.
+
+The way to check the result is the same as for the `Evaluator` and `Chunker` in AutoRAG.
+
+A `trial_folder` is created in `project_dir` first.
+
+If the parsing is completed successfully, the following three types of files are created in the trial folder.
+
+1. Parsed Result
+2. Used YAML file
+3. Summary file
+
+For example, if parsing is performed using three parse methods, the following files are created:
+`0.parquet`, `1.parquet`, `2.parquet`, `parse_config.yaml`, `summary.csv`
+
+Finally, in the `summary.csv` file, you can see information about the parsed result, such as which parse method was used to parse it.
+
 ## Output Columns
-- texts: Parsed text from the document.
-- path: Path of the document.
-- pages: Number of pages in the document. Contains page if parsing on a per-page basis, otherwise -1.
+- `texts`: Parsed text from the document.
+- `path`: Path of the document.
+- `pages`: Number of pages in the document. Contains the page number if parsing is done on a per-page basis, otherwise -1.
   - Modules that parse per page: [ `clova`, `table_hybrid_parse` ]
   - Modules that don't parse on a per-page basis: [ `langchain_parse`, `llama_parse` ]
-- last_modified_datetime: Last modified datetime of the document.
+- `last_modified_datetime`: Last modified datetime of the document.
+
+#### Supported Parse Modules
 
-## 
+📌 You can check all of our Parse modules
+[here](https://edai.notion.site/Supporting-Parse-Modules-e0b7579c7c0e4fb2963e408eeccddd75?pvs=4)
 
-#### Supported Modules
 
 ```{toctree}
 ---
diff --git a/docs/source/data_creation/beta/parse/table_hybrid_parse.md b/docs/source/data_creation/beta/parse/table_hybrid_parse.md
index 5da7c1a21..06e172246 100644
--- a/docs/source/data_creation/beta/parse/table_hybrid_parse.md
+++ b/docs/source/data_creation/beta/parse/table_hybrid_parse.md
@@ -1 +1,52 @@
 # Table Hybrid Parse
+
+Parse raw documents using a combination of text and table parsing modules.
+
+Because OCR models are paid models, it can be expensive to OCR-parse all raw documents.
+
+OCR models are primarily used to parse raw documents that contain tables.
+Therefore, it is cost-effective to parse raw documents that do not contain tables with non-OCR methods and to parse raw documents that do contain tables with OCR.
+
+To accomplish this, the Table Hybrid Parse module performs parsing in the following steps:
+
+1. Breaks the raw document into pages.
+2. Uses table detection to distinguish between pages that contain tables and pages that do not.
+3. Parses pages that do not contain tables with the text parsing module.
+4. Parses pages that contain tables with the table parsing module.
+5. Merges the parsing results to return the final result.
+
+## Table Detection
+
+The module uses `PDFPlumber` to split pages into those with and without tables.
+
+## Table Parse Available Modules
+
+- `llama_parse`
+  - You need to add `result_type: markdown` to `table_params`.
+- `clova`
+  - You need to add `table_detection: true` to `table_params`.
+- `langchain_parse`
+  - You need to add `parse_method: upstagelayoutanalysis` to `table_params`.
+
+## Parameters
+
+- `text_parse_module`: str
+  - The module to use for text parsing.
+- `text_params`: dict
+  - Parameters for the text parsing module.
+- `table_parse_module`: str
+  - The module to use for table parsing.
+- `table_params`: dict
+  - Parameters for the table parsing module.
+
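+Putting these parameters together, a hypothetical configuration that parses text with `pdfminer` and tables with `llama_parse` might look like this (it simply follows the notes above and is not a tested configuration):
+
+```yaml
+modules:
+  - module_type: table_hybrid_parse
+    text_parse_module: langchain_parse
+    text_params:
+      parse_method: pdfminer
+    table_parse_module: llama_parse
+    table_params:
+      result_type: markdown
+```
+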
+## Example YAML
+
+If you want to use the `langchain_parse` module for text parsing and the `clova` module for table parsing, you can use the following configuration.
+
+```yaml
+modules:
+  - module_type: table_hybrid_parse
+    text_parse_module: langchain_parse
+    text_params:
+      parse_method: pdfplumber
+    table_parse_module: clova
+    table_params:
+      table_detection: true
+```
diff --git a/sample_config/parse/parse_full.yaml b/sample_config/parse/parse_full.yaml
index 1326f5cd6..adc215878 100644
--- a/sample_config/parse/parse_full.yaml
+++ b/sample_config/parse/parse_full.yaml
@@ -11,3 +11,10 @@ modules:
   - module_type: llama_parse
     result_type: markdown
     language: ko
+  - module_type: table_hybrid_parse
+    text_parse_module: langchain_parse
+    text_params:
+      parse_method: pdfplumber
+    table_parse_module: clova
+    table_params:
+      table_detection: true