Commit
This is the proposal to start integrating the document conversion system DeepSearch from IBM Research with InstructLab.
Co-authored-by: Ming Zhao <[email protected]>
Co-authored-by: BJ Hargrave <[email protected]>
Signed-off-by: JJ Asghar <[email protected]>
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
# DeepSearch + InstructLab Integration Proposal | ||
<https://github.com/DS4SD> | ||
|
||
## Why is a Conversion Tool Necessary? | ||
Managing submissions for the open-source InstructLab project has revealed a significant bottleneck in processing | ||
knowledge documents. For the InstructLab backend to use these documents effectively, they must be in markdown | ||
format. Currently, we accept only Wikipedia articles, and the built-in conversion tool is inadequate. At IBM and | ||
other companies, many knowledge submissions arrive in other document formats, including PDF, and must be | ||
converted to markdown before they can be used in InstructLab. | ||
|
||
Existing open-source methods, such as Pandoc, are inconsistent. While they preserve text, they struggle to parse | ||
tables and special symbols, as evidenced by issues in PR #1154 of the taxonomy repo in the InstructLab project. Other | ||
open-source solutions have similar shortcomings. | ||
|
||
## Why DeepSearch? | ||
IBM's DeepSearch software excels in document conversion, outperforming traditional open-source methods. Utilizing a | ||
computer vision model layer, it accurately parses content in the files, including titles, headers, and tables. | ||
Additionally, it automatically implements RAG layers for models, which could benefit the InstructLab process in | ||
the future. | ||
|
||
## Integration Proposal | ||
To maintain the open-source nature of the project while leveraging the strengths of DeepSearch, we propose a | ||
two-pronged approach: | ||
1. Open-Source Conversion: | ||
- Implement a basic document conversion tool in the UI using an open-source method such as Pandoc. This tool will be | ||
lightweight and easily hosted, ensuring it can be used and improved by the community. | ||
2. DeepSearch Integration: | ||
- Enable the UI to switch the conversion endpoint to DeepSearch, allowing high-fidelity markdown conversions for | ||
backend use. This approach maintains an open-source version while benefiting from DeepSearch's superior | ||
conversion capabilities. | ||
|
||
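The endpoint switch described in the two items above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: `ConversionService`, `pandoc_stub`, and `deepsearch_stub` are hypothetical names, the stubs stand in for real calls to Pandoc and to a hosted DeepSearch endpoint, and the fallback to the open-source backend on failure is an assumption that the proposal does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A converter takes raw document bytes and returns markdown text.
Converter = Callable[[bytes], str]


@dataclass
class ConversionService:
    """Routes conversion requests to whichever backend is currently selected."""

    backends: Dict[str, Converter]
    active: str = "pandoc"    # open-source default
    fallback: str = "pandoc"  # assumption: fall back to the open-source path

    def switch(self, name: str) -> None:
        """Select a different conversion backend by name."""
        if name not in self.backends:
            raise KeyError(f"unknown conversion backend: {name}")
        self.active = name

    def convert(self, document: bytes) -> str:
        """Convert with the active backend, falling back if it fails."""
        try:
            return self.backends[self.active](document)
        except Exception:
            if self.active == self.fallback:
                raise  # nothing left to fall back to
            return self.backends[self.fallback](document)


# Hypothetical stand-ins; real backends would call Pandoc or a hosted
# DeepSearch endpoint instead of tagging the input.
def pandoc_stub(document: bytes) -> str:
    return "pandoc:" + document.decode("utf-8")


def deepsearch_stub(document: bytes) -> str:
    return "deepsearch:" + document.decode("utf-8")


if __name__ == "__main__":
    service = ConversionService(
        backends={"pandoc": pandoc_stub, "deepsearch": deepsearch_stub}
    )
    print(service.convert(b"# Title"))  # uses the open-source default
    service.switch("deepsearch")
    print(service.convert(b"# Title"))  # now routed to the DeepSearch stand-in
```

Keeping the backends behind one callable signature means the UI only needs to know a backend name, so the community could register other converters later without changing the calling code.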
IBM Research and the DeepSearch team will host the DeepSearch endpoint for the open-source community. This | ||
arrangement benefits the community by streamlining contributions and provides data and exposure for the DeepSearch | ||
project. IBM's contribution underscores its commitment to supporting and improving open-source projects. | ||
|
||
This integration will showcase the value of DeepSearch and its potential for those integrating | ||
InstructLab into their workflows. If the volume of community requests becomes unsustainable for the DeepSearch team, | ||
we hope for ample notification to allow the community to find alternative solutions. By then, we anticipate that the | ||
open-source versions will have improved sufficiently, or the value of the integration will justify continued support. | ||
|
||
By adopting this two-pronged approach, we ensure the integrity of the open-source project while leveraging IBM's | ||
advanced DeepSearch capabilities. This strategy balances community collaboration with new technology, | ||
fostering continued improvement in document processing for the InstructLab project. | ||
|