[NeuralChat] Enable RAG's table extraction and summary #1417

xmx-521 · 2024-03-25T07:19:30Z

Type of Change

feature
API changed

Description

Enable RAG's table extraction functionality for PDF files
Enable RAG's table summary functionality, with three modes to choose from: [none, title, llm]

Expected Behavior & Potential Risk

Users can use RAG's table extraction and summary functionality to get a better RAG experience

How has this PR been tested?

Local test and pre-CI

Dependency Change?

Add the tesseract dependency
Add the poppler dependency
Change the unstructured dependency to unstructured[all-docs]
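
Because tesseract and poppler are system-level tools rather than Python packages, a missing install tends to surface only when a PDF is parsed. The following is a minimal sketch of a pre-flight check, not part of this PR. It assumes the executables are named `tesseract` and `pdftoppm` (one of the tools shipped with poppler); the helper name `find_missing_binaries` is hypothetical.

```python
import shutil

# Assumed executable names: "tesseract" is the Tesseract OCR CLI and
# "pdftoppm" is one of the command-line tools shipped with poppler.
REQUIRED_BINARIES = {"tesseract": "tesseract", "poppler": "pdftoppm"}


def find_missing_binaries(required=REQUIRED_BINARIES, which=shutil.which):
    """Return the dependency names whose executable is not found on PATH."""
    return sorted(name for name, exe in required.items() if which(exe) is None)


if __name__ == "__main__":
    missing = find_missing_binaries()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("tesseract and poppler executables found on PATH")
```

The `which` parameter is injectable only so that the check can be exercised without touching the real PATH.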

github-actions · 2024-03-25T07:25:44Z

⚡ Required checks status: All passing 🟢

Groups summary

🟢 Format Scan Tests workflow

Check ID	Status
format-scan (pylint)	success	✅
format-scan (bandit)	success	✅
format-scan (cloc)	success	✅
format-scan (cpplint)	success	✅

These checks are required after the changes to intel_extension_for_transformers/neural_chat/assets/docs/LLAMA2_page6.pdf, intel_extension_for_transformers/neural_chat/chatbot.py, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/parser/context_utils.py, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/parser/parser.py, intel_extension_for_transformers/neural_chat/prompts/prompt.py, intel_extension_for_transformers/neural_chat/tests/ci/plugins/retrieval/test_parameters.py, intel_extension_for_transformers/neural_chat/tests/requirements.txt.

🟢 NeuralChat Unit Test

Check ID	Status
neuralchat-unit-test-baseline	success	✅
neuralchat-unit-test-PR-test	success	✅
Generate-NeuralChat-Report	success	✅

These checks are required after the changes to .github/workflows/script/unitTest/run_unit_test_neuralchat.sh, intel_extension_for_transformers/neural_chat/chatbot.py, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/parser/context_utils.py, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/parser/parser.py, intel_extension_for_transformers/neural_chat/prompts/prompt.py, intel_extension_for_transformers/neural_chat/tests/ci/plugins/retrieval/test_parameters.py, intel_extension_for_transformers/neural_chat/tests/requirements.txt.

🟢 Chat Bot Test workflow

Check ID	Status	Error details
call-inference-llama-2-7b-chat-hf / inference test	success		✅
call-inference-mpt-7b-chat / inference test	success		✅

These checks are required after the changes to intel_extension_for_transformers/neural_chat/chatbot.py, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/parser/context_utils.py, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/parser/parser.py, intel_extension_for_transformers/neural_chat/prompts/prompt.py, intel_extension_for_transformers/neural_chat/tests/requirements.txt.

Thank you for your contribution! 💜

Note
This comment is automatically generated and will be updated every 180 seconds within the next 6 hours. If you have any other questions, contact VincyZhang or XuehaoSun for help.

Liangyx2 · 2024-03-26T07:10:49Z

Please add installation and usage instructions for PDF table-to-text to intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md.

intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/parser/context_utils.py

-    return result
+
+    tables_result = []
+    def get_relation(table_coords, caption_coords, table_page_number, caption_page_number, threshold=100):



XinyuYe-Intel · 2024-04-11T05:19:57Z

intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md

@@ -92,6 +92,7 @@ Below are the description for the available parameters in `agent_QA`,
 | enable_rerank   | bool | Whether to enable retrieval then rerank pipeline |True, False|
 | reranker_model   | str | The name of the reranker model from the Huggingface or a local path |-|
 | top_n   | int | The return number of the reranker model |-|
+| table_strategy | str | The strategies to understand tables for table retrieval. As the setting progresses from "fast" to "hq" to "llm," the focus shifts towards deeper table understanding at the expense of processing speed. The default strategy is "fast" |"fast", "hq", "llm"|


From the code, it seems the "fast" table_strategy would only return None instead of table content. Is this somewhat unreasonable?

It appears the "hq" strategy uses the unstructured package to extract tables. I have also used this package and found that it actually performed worse than table-transformer.

Also, does the "llm" strategy return reliable table contents? From the code, it looks like it uses an LLM and a prompt to generate a table summary of the document, but in my previous experience, this approach sometimes generates results that deviate significantly from the table content.
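
For readers unfamiliar with the unstructured package mentioned in this review, the sketch below shows one common way to pull tables out of a PDF with it. It is an illustration, not the code in this PR: the choice of `strategy="hi_res"` with `infer_table_structure=True` is an assumption, and the helper names `collect_table_html` and `extract_tables_from_pdf` are hypothetical.

```python
from types import SimpleNamespace


def collect_table_html(elements):
    """Keep only table elements and return their HTML (or plain text) content."""
    tables = []
    for element in elements:
        if getattr(element, "category", None) != "Table":
            continue
        html = getattr(element.metadata, "text_as_html", None)
        # Fall back to the plain text when no HTML rendering is available.
        tables.append(html if html else element.text)
    return tables


def extract_tables_from_pdf(pdf_path):
    """Partition a PDF with unstructured and return its tables."""
    # Imported lazily so collect_table_html works without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(filename=pdf_path, strategy="hi_res",
                             infer_table_structure=True)
    return collect_table_html(elements)


if __name__ == "__main__":
    # Stand-in elements, used here instead of a real PDF.
    fake_elements = [
        SimpleNamespace(category="Title", text="Results",
                        metadata=SimpleNamespace()),
        SimpleNamespace(category="Table", text="a b",
                        metadata=SimpleNamespace(text_as_html="<table></table>")),
    ]
    print(collect_table_html(fake_elements))
```

Running `extract_tables_from_pdf` requires `unstructured[all-docs]` together with the tesseract and poppler system dependencies listed in the PR description.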

xmx-521 · 2024-05-08

Thanks for the insightful comments. My opinions on these issues are as follows:

> From the code, it seems the "fast" table_strategy would only return None instead of table content. Is this somewhat unreasonable?

In fact, by default our program uses OCR to extract all text in a file, including table text; this was implemented in other PRs. This PR only further enhances table understanding, so no additional content is returned in fast mode (which is also the default mode).

> It appears the "hq" strategy uses the unstructured package to extract tables. I have also used this package and found that it actually performed worse than table-transformer.

At present, we do use unstructured to extract table information, and the extraction performance is quite satisfactory. We have not tried table-transformer, but it is indeed worth considering.

> Also, does the "llm" strategy return reliable table contents? From the code, it looks like it uses an LLM and a prompt to generate a table summary of the document, but in my previous experience, this approach sometimes generates results that deviate significantly from the table content.

Your understanding of what llm mode does is correct. It is true that the LLM's table summary is not completely reliable, but according to our experimental results, overall table QA performance is much better in llm mode.
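
The behavior of the three `table_strategy` values described in this thread can be summarized with a small dispatch sketch. This is not the PR's code: the function name `table_context`, the `SUMMARY_PROMPT` text, and the `summarize` callable are hypothetical, and passing the extracted table through unchanged in "hq" mode is an assumption made for illustration.

```python
# Hypothetical prompt; the PR's actual prompt lives in prompts/prompt.py.
SUMMARY_PROMPT = "Summarize the following table in a few sentences:\n{table}"


def table_context(table_text, table_strategy="fast", summarize=None):
    """Return extra table context for retrieval, depending on the strategy."""
    if table_strategy == "fast":
        # No table-specific content is added; the regular OCR text is used as is.
        return None
    if table_strategy == "hq":
        # Assumption: the extracted table content is returned unchanged.
        return table_text
    if table_strategy == "llm":
        if summarize is None:
            raise ValueError('table_strategy="llm" needs a summarize callable')
        # The LLM sees the table inside a prompt and returns a summary.
        return summarize(SUMMARY_PROMPT.format(table=table_text))
    raise ValueError(f"unknown table_strategy: {table_strategy!r}")


if __name__ == "__main__":
    table = "model | accuracy\nA | 0.91\nB | 0.87"
    print(table_context(table))  # fast mode: None
    print(table_context(table, "hq"))
    print(table_context(table, "llm", summarize=lambda prompt: "stub summary"))
```

Keeping the LLM behind a plain callable makes the reliability concern raised above easy to probe: the same tables can be run through "hq" and "llm" and the outputs compared side by side.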

xmx-521 added the NeuralChat label Mar 25, 2024

xmx-521 requested review from VincyZhang and lvliang-intel as code owners March 25, 2024 07:19

xmx-521 force-pushed the manxin/rag_table_summary branch from 47dde9c to 22731d5 Compare March 26, 2024 02:12

xmx-521 requested review from XuhuiRen and Liangyx2 March 26, 2024 06:52

XuhuiRen reviewed Mar 27, 2024


xmx-521 and others added 7 commits March 28, 2024 10:31

enable table and table summary for rag pdf

7781a6e

Signed-off-by: Manxin Xu <[email protected]>

fix code format

222ee81

Signed-off-by: Manxin Xu <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

dc0faf8

for more information, see https://pre-commit.ci Signed-off-by: Manxin Xu <[email protected]>

fix environment issue

d5b93dd

Signed-off-by: Manxin Xu <[email protected]>

fix key error

d8687ef

Signed-off-by: Manxin Xu <[email protected]>

fix two parameters

824c48e

Signed-off-by: Manxin Xu <[email protected]>

fix line too long

81ca43d

Signed-off-by: Manxin Xu <[email protected]>

xmx-521 force-pushed the manxin/rag_table_summary branch from 8da7b00 to 81ca43d Compare March 28, 2024 02:31

xmx-521 and others added 2 commits March 28, 2024 11:09

clear code, update README

e7e5331

Signed-off-by: Manxin Xu <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

f71b602

for more information, see https://pre-commit.ci

xmx-521 requested a review from XuhuiRen March 28, 2024 06:05

ClarkChin08 and others added 4 commits March 29, 2024 16:46

polish pr

39bd8d7

Signed-off-by: Chen Xi <[email protected]>

Merge branch 'main' into manxin/rag_table_summary

dddf4de

Signed-off-by: Manxin Xu <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

460110a

for more information, see https://pre-commit.ci

polish readme

ee601db

Signed-off-by: Chen Xi <[email protected]>

hshen14 assigned XinyuYe-Intel Apr 7, 2024

XinyuYe-Intel reviewed Apr 11, 2024

xmx-521 May 8, 2024
