Knowledge doc ingestion #148
# Knowledge Document Ingestion Pipeline Design Proposal

**Author**: Aakanksha Duggal

## 1. Introduction

As part of extending InstructLab's capabilities, this pipeline is designed to support ingestion and processing of various document formats such as Markdown, PDF, DOCX, and more. The goal is to create a unified, modular system that integrates seamlessly with **Synthetic Data Generation (SDG)** and **Retrieval-Augmented Generation (RAG)** workflows, simplifying the process for users while maintaining flexibility for future enhancements.
## 2. Use Case

To enable the ingestion and processing of a wide range of document types, this pipeline must handle formats including:

- Markdown (MD)
- TXT
- AsciiDoc
- DOCX
- HTML
- PPTX

This proposal outlines how to build a pluggable system that accommodates these formats and integrates effectively into existing **SDG** and **RAG** workflows.
## 3. Proposed Approach

### 3.1 Custom InstructLab Schema Design

- **Objective**: Define a custom "instructlab" schema that standardizes input formats for SDG and RAG pipelines. This schema will bridge the gap between various document types and the specific formats required for further processing.
- **Modularity**: The system should allow easy extension, enabling support for new document types and processing workflows without disrupting core functionality.

The instructlab schema will serve as the intermediary format, supporting flexibility while ensuring compatibility with the ingestion process.
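
To make the intermediary format concrete, below is a minimal sketch of what a single record in the instructlab schema might look like. Every field name here is an illustrative assumption, not a finalized specification:

```python
# Hypothetical sketch of an instructlab schema record; all field names
# below are assumptions for illustration, not part of the actual schema.
from dataclasses import dataclass, field


@dataclass
class InstructLabDocument:
    source_path: str    # original document location (from qna.yaml)
    source_format: str  # "pdf", "docx", "md", "html", ...
    docling_json: dict  # structured JSON produced by Docling
    chunks: list[str] = field(default_factory=list)  # post-processed text chunks
    metadata: dict = field(default_factory=dict)     # e.g. leaf node path, timestamp
```
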
### 3.2 PDF and Document Conversion via Docling

- **Docling Integration**: We will leverage **Docling** to convert files into structured JSON, which will be the starting point for the instructlab schema. Individual components will post-process this JSON as per the requirements of the specific SDG and RAG workflows.

> **Review comment:** Could we include a link here to somewhere where folks can read more about Docling, and perhaps a bit as to why Docling is the chosen solution here?

- **Docling v2**: We have collaborated with the Docling team to extend their tool's capabilities, allowing conversion of PDF, HTML, and DOCX formats. These new document types will be supported in the ingestion pipeline via Docling's upgraded v2 release (a conversion sketch follows below).
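
As a rough illustration of this conversion step, the sketch below uses Docling v2's `DocumentConverter` to produce the structured JSON described above. Treat this as a sketch based on Docling's documented v2 API, not the pipeline's final code:

```python
# Sketch of the Docling conversion step (uses docling v2's DocumentConverter).
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("path/to/document.pdf")  # also handles DOCX, HTML, PPTX, ...

# Export the parsed document as a dict; this JSON is the starting point
# for the instructlab schema and its downstream post-processing.
doc_json = result.document.export_to_dict()
```
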
### 3.3 Introducing the Document Chunking Command

- **Command Overview**: We propose a new command, `ilab document format`, which will:
  - Take a document path (defined in `qna.yaml`).
  - Format and chunk the document into the desired schema.

> **Review comment:** Is there a particular reason for using the word …
>
> **Review comment:** I propose …
>
> **Review comment:** If you think the document-related commands are going to expand considerably, then this might be the time to create a new command group. If the document processing will remain a small subset, then integrating them under the …
>
> Command forms suggested in the thread: `ilab data [verb] [path]`, `ilab docs/document`.

- **Implementation Details**: Initially, this functionality will be integrated into the existing SDG repository. Over time, it can evolve into a standalone utility, allowing external integrations and wider usage.

> **Review comment:** What would be the motivation for moving this out of the SDG repository? "allowing external integrations and wider usage" doesn't really tell me much.

- **Example Workflow**:

  ```bash
  ilab document format --input path/to/document.pdf --output path/to/schema
  ```

This command will ingest a document, process it into the instructlab schema, and output the result for further use in SDG or RAG workflows.
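
Internally, the command might resemble the following sketch. The helper name, the naive fixed-size chunking, and the output layout are all assumptions for illustration; the real implementation would chunk the Docling JSON rather than raw text:

```python
# Hypothetical internals of `ilab document format`; names and the naive
# chunking strategy are illustrative assumptions only.
import json
from pathlib import Path


def format_document(input_path: str, output_dir: str, max_chars: int = 2000) -> Path:
    """Convert one document into a chunked, instructlab-schema JSON file."""
    text = Path(input_path).read_text(encoding="utf-8")  # stand-in for Docling output
    # Naive fixed-size chunking for illustration.
    chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    record = {"source_path": input_path, "chunks": chunks}
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{Path(input_path).stem}.json"
    out_file.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return out_file
```
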
### 3.4 Simplifying Git-Based Workflows for Users

- **Current Challenge**: Knowledge documents are stored in Git-based repositories, which may be unfamiliar to many users.
- **Proposed Solution**: Allow users to input a local directory and provide an automated script that:
  1. Initializes a Git repository.
  2. Creates the necessary branches.
  3. Organizes files into the required structure.

> **Review comment:** Rather than a "script," why not just have this be part of the code? We can detect if a given directory is git-tracked (e.g. by checking for a …

By abstracting the Git setup process, we can retain Git's benefits (version control, backups) while simplifying the interface for non-technical users.

- **Implementation Example**:

  ```bash
  ./setup_git_repo.sh --input /path/to/local/docs
  ```

This script automates the process of structuring knowledge documents for ingestion.
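
Whether this ships as a shell script or, as a reviewer suggests above, as library code, the automation could look roughly like the sketch below, which detects an existing repository before initializing one. The branch name and commit message are assumptions:

```python
# Sketch of the Git setup automation; branch name, commit message, and
# overall flow are illustrative assumptions.
import subprocess
from pathlib import Path


def setup_git_repo(docs_dir: str, branch: str = "main") -> None:
    path = Path(docs_dir)
    if not (path / ".git").exists():  # detect a directory that is already git-tracked
        subprocess.run(["git", "init", "-b", branch], cwd=path, check=True)
    subprocess.run(["git", "add", "."], cwd=path, check=True)
    subprocess.run(
        ["git", "commit", "-m", "Organize knowledge documents for ingestion"],
        cwd=path,
        check=True,
    )
```
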
### 3.5 Workflow Visualization

Here is a conceptual diagram illustrating the workflow from document ingestion to schema conversion and chunking:

![Knowledge_Document_Ingestion_Workflow](https://github.com/user-attachments/assets/06504b1b-bc8f-4909-b6a2-732a056613c5)

> **Review comment:** Very nice diagram!

The pipeline begins with the ingestion of a knowledge document, passes it through a conversion step to the instructlab schema using **Docling**, and then processes the document into a format usable by SDG and RAG workflows.
## 4. Future Enhancements

### 4.1 Support for Additional Data Types

To extend beyond text-based documents, the pipeline will explore handling other formats, including:

- **Audio and Video**: Incorporating media formats will require modifications to the schema and additional processing capabilities.
- **Visual Language Models (VLMs)**: We will collaborate with the research team to align this work with visual data processing tools and extend the pipeline to handle multimedia.

### 4.2 Refined Chunker Library

- **Standalone Library**: The document chunking functionality will eventually be refactored into a dedicated chunking library or utility within the SDG module. This separation will make it easier to maintain and extend in future iterations (see the interface sketch after this list).
- **Performance Optimizations**: Ongoing work will aim to reduce the time and resources needed for large-scale document chunking, particularly for multi-format documents like PDFs containing both text and images.
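
One plausible shape for such a library is a small chunker interface with pluggable strategies, sketched below with assumed class names and an assumed "texts" layout for the Docling JSON:

```python
# Hypothetical interface for a standalone chunker library; class names and
# the assumed "texts" layout of the Docling JSON are illustrative only.
from abc import ABC, abstractmethod


class Chunker(ABC):
    @abstractmethod
    def chunk(self, docling_json: dict) -> list[str]:
        """Split a converted document into schema-ready text chunks."""


class FixedSizeChunker(Chunker):
    def __init__(self, max_chars: int = 2000):
        self.max_chars = max_chars

    def chunk(self, docling_json: dict) -> list[str]:
        # Naive strategy: concatenate text elements, then split by size.
        text = " ".join(t.get("text", "") for t in docling_json.get("texts", []))
        return [text[i:i + self.max_chars] for i in range(0, len(text), self.max_chars)]
```
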
## 5. InstructLab Schema Overview

### Key Components
- **Docling JSON Output**: The output from Docling will be the instructlab schema, which serves as the backbone for both SDG and RAG workflows. Specific details such as the leaf node path or timestamp will be encoded in the file nomenclature (see the naming sketch below).
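
For instance, the output filename could encode the leaf node path and a timestamp along the following lines (the exact naming convention is an assumption):

```python
# Hypothetical file-naming scheme encoding leaf node path and timestamp.
from datetime import datetime, timezone


def output_filename(leaf_node_path: str, ext: str = "json") -> str:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    safe_path = leaf_node_path.replace("/", "_")
    return f"{safe_path}-{stamp}.{ext}"


# e.g. "knowledge_science_astronomy-20250101T000000Z.json" (timestamp varies)
print(output_filename("knowledge/science/astronomy"))
```
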
> **Review comment:** As part of the ingestion command, should we consider a flag where the pipeline could augment the metadata of the final output like …

This schema will standardize the data format for all supported document types, enabling consistency and modularity in the pipeline.

> **Review comment:** It is worth pointing out we have an existing [schema](https://github.com/instructlab/schema) package maintained by the @instructlab/schema-maintainers. Right now this is only for Taxonomy schema, but we could extend this with classes designed specifically for this use case.