InstructLab and Deepsearch
This is the proposal to start integrating the document conversion
system Deepsearch from IBM Research with InstructLab

Co-authored-by: Ming Zhao <[email protected]>
Co-authored-by: BJ Hargrave <[email protected]>
Signed-off-by: JJ Asghar <[email protected]>
3 people committed Jun 25, 2024
1 parent b4e8df2 commit d757f4f
Showing 1 changed file with 44 additions and 0 deletions.
44 changes: 44 additions & 0 deletions docs/instructlab-deepsearch-integration.md
@@ -0,0 +1,44 @@
# DeepSearch + InstructLab Integration Proposal

<https://github.com/DS4SD>

## Why is a Conversion Tool Necessary?

Managing submissions for the open-source InstructLab project has revealed a significant bottleneck in processing
knowledge documents. For the InstructLab backend to use these documents effectively, they must be in markdown
format. Currently, we only accept Wikipedia articles, but the built-in conversion tool is inadequate. Internally at
IBM and at other companies, knowledge submissions arrive in many document formats, including PDF, and must be
converted to markdown before they can be used in InstructLab.

Existing open-source methods, such as Pandoc, produce inconsistent results. While they preserve text, they struggle
with parsing tables and special symbols, as evidenced by issues in PR #1154 of the taxonomy repo in the InstructLab
project. Other open-source solutions have similar shortcomings.

## Why DeepSearch?

IBM's DeepSearch software excels in document conversion, outperforming traditional open-source methods. Utilizing a
computer vision model layer, it accurately parses content in the files, including titles, headers, and tables.
Additionally, it automatically implements RAG layers for models, which could benefit the InstructLab process in
the future.

## Integration Proposal

To maintain the open-source nature of the project while leveraging the strengths of DeepSearch, we propose a
two-pronged approach:

1. Open-Source Conversion:
   - Implement a basic document conversion tool in the UI using an open-source method such as Pandoc. This tool
     will be lightweight and easily hosted, ensuring it can be used and improved by the community.
2. DeepSearch Integration:
   - Enable the UI to switch the conversion endpoint to DeepSearch, allowing high-fidelity markdown conversions
     for backend use. This approach maintains an open-source version while benefiting from DeepSearch's superior
     conversion capabilities.
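
The switchable endpoint described above could be sketched roughly as follows. This is a minimal illustration under
stated assumptions, not the UI's actual design: `ConversionService`, `make_remote_converter`, and the idea that a
hosted endpoint accepts raw bytes over HTTP POST and returns markdown are all hypothetical, and nothing here reflects
the real DeepSearch API. Only the `pandoc --from html --to gfm` invocation is an existing command line.

```python
import shutil
import subprocess
import urllib.request
from typing import Callable, Dict

# A converter takes raw document bytes and returns markdown text.
Converter = Callable[[bytes], str]


def pandoc_html_to_markdown(data: bytes) -> str:
    """Convert HTML to GitHub-flavored markdown with a locally installed pandoc."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed")
    result = subprocess.run(
        ["pandoc", "--from", "html", "--to", "gfm"],
        input=data,
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")


def make_remote_converter(endpoint_url: str) -> Converter:
    """Build a converter for a hosted endpoint.

    Hypothetical protocol: POST the document bytes, read markdown back as a
    UTF-8 response body. A real hosted service will define its own API.
    """

    def convert(data: bytes) -> str:
        request = urllib.request.Request(
            endpoint_url,
            data=data,
            headers={"Content-Type": "application/octet-stream"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=60) as response:
            return response.read().decode("utf-8")

    return convert


class ConversionService:
    """Holds named conversion backends and routes requests to the active one."""

    def __init__(self, backends: Dict[str, Converter], default: str) -> None:
        if default not in backends:
            raise KeyError(f"unknown backend: {default}")
        self._backends = dict(backends)
        self._active = default

    @property
    def active(self) -> str:
        return self._active

    def switch(self, name: str) -> None:
        """Select another backend; the active one is unchanged on failure."""
        if name not in self._backends:
            raise KeyError(f"unknown backend: {name}")
        self._active = name

    def convert(self, data: bytes) -> str:
        return self._backends[self._active](data)
```

Because both backends share one call signature, the UI would only need to call `switch` to move between the
open-source path and the hosted one, and the open-source path keeps working if the hosted endpoint is unavailable.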

IBM Research and the DeepSearch team will host the DeepSearch endpoint for the open-source community. This
arrangement benefits the community by streamlining contributions and provides data and exposure for the DeepSearch
project. IBM's contribution underscores its commitment to supporting and improving open-source projects.

This integration will demonstrate the value of DeepSearch and its potential for those integrating
InstructLab into their workflows. If the volume of community requests becomes unsustainable for the DeepSearch team,
we hope for ample notification to allow the community to find alternative solutions. By then, we anticipate that the
open-source versions will have improved sufficiently, or the value of the integration will justify continued support.

By adopting this two-pronged approach, we ensure the integrity of the open-source project while leveraging IBM's
advanced DeepSearch capabilities. This strategy pairs community collaboration with that technology to support
continued improvement in document processing for the InstructLab project.
