Finish Chunk and Parse documentation (#712)
* Add chunk baseline code
* add test code (not complete)
* complete llama index chunk test code
* add chunker test code
* add make metadata_list function
* delete langchain at get chunk instance
* add rst file
* add langchain_chunk and its test code
* delete kiwipiepy at chunk install list at pyproject.toml
* add annotation and use LazyInit at kiwi sentence splitter
* delete unused test code
* add return "path", "start_end_idx" at chunk
* move get_start_end_idx func from llama_index_chunk.py to data/utils/util.py
* change get_start_end_idx func to use find
* add return type "path" and "start_end_idx"
* delete chunk_type parameter and add langchain embedding model at init
* add chunk = ["langchain-experimental"] at pyproject.toml
* change expect_character_idx end id
* add rst file
* delete async
* delete semantic chunking and delete langchain-experimental
* delete async logic at langchain chunk
* just commit
* add new data_creation.png
* create qa_creation folder and add qa_creation.md and answer_gen.md
* Write langchain_parse.md
* first baseline docs
* finish parse.md
* Add Run Parse Pipeline Parse.md
* finish chunk.md
* finish llama_parse.md
* finish langchain_chunk.md
* Add features at chunk.md
* finish llama_index_chunk.md
* finish clova.md
* finish table_hybrid_parse.md
* Add new data_creation.png
Showing 9 changed files with 439 additions and 10 deletions.
# Langchain Chunk

Chunk parsed results using [Langchain Text Splitters](https://api.python.langchain.com/en/latest/text_splitters_api_reference.html#).
## Available Chunk Methods

### 1. Token

- [SentenceTransformersToken](https://api.python.langchain.com/en/latest/sentence_transformers/langchain_text_splitters.sentence_transformers.SentenceTransformersTokenTextSplitter.html)

### 2. Character

- [RecursiveCharacter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html)
- [character](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.CharacterTextSplitter.html)

### 3. Sentence

- [konlpy](https://api.python.langchain.com/en/latest/konlpy/langchain_text_splitters.konlpy.KonlpyTextSplitter.html): For Korean 🇰🇷

#### Example YAML
```yaml
modules:
  - module_type: langchain_chunk
    chunk_method: konlpy
    add_file_name: korean
```
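
A chunk YAML like the one above is typically run through AutoRAG's `Chunker` class. This is a minimal sketch, assuming the standard `Chunker.from_parquet` entry point; the parquet and YAML paths are placeholders:

```python
from autorag.chunker import Chunker

# Load the parsed results (a parquet file produced by the parse pipeline).
chunker = Chunker.from_parquet(parsed_data_path="your/parsed/data.parquet")

# Run every module listed in the chunk YAML config.
chunker.start_chunking("your/path/to/chunk_config.yaml")
```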
## Using a Langchain Chunk Method that is not in the Available Chunk Methods

You can find more information about the Langchain chunk methods
[here](https://api.python.langchain.com/en/latest/text_splitters_api_reference.html#).

### How to Use

If you want to use `PythonCodeTextSplitter`, which is not in the available chunk methods, you can register it with the following code.

```python
from autorag.data import chunk_modules
from langchain.text_splitter import PythonCodeTextSplitter

# Register the splitter under a new, lowercase key.
chunk_modules["python"] = PythonCodeTextSplitter
```

```{attention}
The key value in chunk_modules must always be written in lowercase.
```
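
Once registered, the `python` key should be usable from a chunk YAML like any built-in method (a sketch, assuming the registration above runs before the pipeline starts):

```yaml
modules:
  - module_type: langchain_chunk
    chunk_method: python
```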
# Llama Index Chunk

Chunk parsed results using [Llama Index Node_Parsers & Text Splitters](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/).
## Available Chunk Methods

### 1. Token

- [Token](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/token_text_splitter/)

### 2. Sentence

- [Sentence](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_splitter/)

### 3. Window

- [SentenceWindow](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_window/)

### 4. Semantic

- [semantic_llama_index](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/semantic_splitter/)
- [SemanticDoubleMerging](https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_double_merging_chunking/)

### 5. Simple

- [Simple](https://docs.llamaindex.ai/en/v0.10.19/api/llama_index.core.node_parser.SimpleFileNodeParser.html)

#### Example YAML
```yaml
modules:
  - module_type: llama_index_chunk
    chunk_method: [ Token, Sentence ]
    chunk_size: [ 1024, 512 ]
    chunk_overlap: 24
    add_file_name: english
```
## Using a Llama Index Chunk Method that is not in the Available Chunk Methods

You can find more information about the Llama Index chunk methods
[here](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/).

### How to Use

If you want to use `HTMLNodeParser`, which is not in the available chunk methods, you can register it with the following code.

```python
from autorag.data import chunk_modules
from llama_index.core.node_parser import HTMLNodeParser

# Register the node parser under a new, lowercase key.
chunk_modules["html"] = HTMLNodeParser
```

```{attention}
The key value in chunk_modules must always be written in lowercase.
```
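
As with the Langchain example, the registered `html` key should then be available as a `chunk_method` in the YAML (again a sketch, assuming the registration runs before the pipeline):

```yaml
modules:
  - module_type: llama_index_chunk
    chunk_method: html
```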
# Clova

Parse raw documents using Naver
[Clova OCR](https://guide.ncloud-docs.com/docs/clovaocr-overview).

Clova OCR divides the document into pages for parsing.

## Table Detection

If you have tables in your raw documents, set `table_detection: true` to use the Clova OCR table detection feature.
### Points

#### 1. HTML Parser

Clova OCR provides parsed table information in a complex JSON format.
The module converts this complex JSON form of the table to HTML before storing it, so that the table can be used with the LLM.

The parser was created by our own AutoRAG team, and you can find the detailed code in the `json_to_html_table` function in `autorag.data.parse.clova`.
#### 2. The text information comes separately from the table information

If your document contains both tables and text, the text information comes back separately from the table information.
So when using `table_detection`, the result is saved in the `{text}\n\ntable html:\n{table}` format, as shown below.
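For instance, a page containing one paragraph and one detected table would be stored roughly like this (the values are purely illustrative):

```
This is the text extracted from the page.

table html:
<table><tr><td>Revenue</td><td>100</td></tr></table>
```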
## Example YAML

```yaml
modules:
  - module_type: clova
    table_detection: true
```
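
To run this config as part of the parse pipeline, AutoRAG's `Parser` class is typically used. A minimal sketch, assuming the standard `Parser` entry point; the glob and YAML paths are placeholders:

```python
from autorag.parser import Parser

# Collect the raw documents that match the glob pattern.
parser = Parser(data_path_glob="your/data/path/*.pdf")

# Run every parse module listed in the YAML config.
parser.start_parsing("your/path/to/parse_config.yaml")
```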