Knowledge doc ingestion #148

Closed
docs/sdg/knowledge-doc-ingestion.md

# Knowledge Document Ingestion Pipeline Design Proposal

**Author**: Aakanksha Duggal

## 1. Introduction

As part of extending InstructLab's capabilities, this pipeline is designed to support ingestion and processing of various document formats such as Markdown, PDF, DOCX, and more. The goal is to create a unified, modular system that seamlessly integrates with **Synthetic Data Generation (SDG)** and **Retrieval-Augmented Generation (RAG)** workflows, simplifying the process for users while maintaining flexibility for future enhancements.

## 2. Use Case

To enable the ingestion and processing of a wide range of document types, this pipeline must handle formats including:

- Markdown (MD)
- PDF
- TXT
- AsciiDoc
- DOCX
- HTML
- PPTX

This proposal outlines how to build a pluggable system that accommodates these formats and integrates effectively into existing **SDG** and **RAG** workflows.

## 3. Proposed Approach

### 3.1 Custom InstructLab Schema Design

> **Review comment:** It is worth pointing out that we have an existing schema package maintained by the @instructlab/schema-maintainers: https://github.com/instructlab/schema
>
> Right now it only covers the Taxonomy schema, but we could extend it with classes designed specifically for this use case.


- **Objective**: Define a custom "instructlab" schema that standardizes input formats for SDG and RAG pipelines. This schema will bridge the gap between various document types and the specific formats required for further processing.
- **Modularity**: The system should allow easy extension, enabling support for new document types and processing workflows without disrupting core functionality.

The instructlab schema will serve as the intermediary format, supporting flexibility while ensuring compatibility with the ingestion process.
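
To make the shape of this intermediary format concrete, the following is a minimal sketch of what an instructlab schema record could look like in Python. The field names (`source_path`, `doc_format`, `docling_json`, `chunks`, `metadata`) are illustrative assumptions only, not a finalized schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class InstructLabDocument:
    """Hypothetical intermediary record produced by the ingestion pipeline."""

    source_path: str                        # original document (PDF, DOCX, MD, ...)
    doc_format: str                         # detected input format, e.g. "pdf"
    docling_json: dict[str, Any]            # structured output from Docling
    chunks: list[str] = field(default_factory=list)          # post-processed text chunks
    metadata: dict[str, str] = field(default_factory=dict)   # e.g. leaf node path, timestamp
```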

### 3.2 PDF and Document Conversion via Docling

- **Docling Integration**: We will leverage **Docling** to convert files into structured JSON, which will be the starting point for the instructlab schema. Individual components will post-process this JSON as per the requirements of the specific SDG and RAG workflows.

> **Review comment:** Could we include a link here to somewhere where folks can read more about Docling, and perhaps a bit as to why Docling is the chosen solution?


- **Docling v2**: We have collaborated with the Docling team to extend their tool’s capabilities, allowing conversion for PDF, HTML, and DOCX formats. These new document types will be supported in the ingestion pipeline via Docling's upgraded v2 release.
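
As a rough illustration of the conversion step, the sketch below uses Docling's Python API to turn a source document into a JSON-like dictionary that downstream components could post-process into the instructlab schema. The call names reflect our understanding of the Docling v2 API and should be treated as an approximation rather than a fixed contract.

```python
from docling.document_converter import DocumentConverter


def convert_to_docling_json(source: str) -> dict:
    """Convert a PDF/DOCX/HTML document into Docling's structured dict output."""
    converter = DocumentConverter()
    result = converter.convert(source)        # Docling v2 conversion entry point
    return result.document.export_to_dict()   # JSON-like representation of the document
```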

### 3.3 Introducing the Document Chunking Command

- **Command Overview**: We propose a new command, `ilab document format`, which will:
  - Take a document path (defined in `qna.yaml`).
  - Format and chunk the document into the desired schema.

> **Review comment:** Is there a particular reason for using the word "format"? The first time I read the command without reading the whole document, it gave me the impression the command was for transforming a document from one format to another (e.g. PDF to MD, or PDF to JSON). Should we consider a word or verb that more closely resembles what is actually happening in this step? For example:
>
> - `ilab docs import --input path/to/document.pdf --output path/to/schema`
> - `ilab docs ingest --input path/to/document.pdf --output path/to/schema`
> - `ilab docs process --input path/to/document.pdf --output path/to/schema`

> **Review comment:** I propose `ilab data` being used instead of `ilab docs` or `ilab document` - we already have this command group implemented in the CLI, and it would be good if users can keep their data manipulation in a single command group. I'd like some folks from UX to weigh in here as well.

> **Review comment (@JustinXHale, Oct 30, 2024):** If you think the document-related commands are going to expand considerably, then this might be the time to create a new command group. If the document processing will remain a small subset, then integrating them under the `data` group would simplify the CLI structure.
>
> - `ingest`: Provide the document path to ingest and process into the desired scheme.
> - `process`: Provide the document path to process into the scheme.
> - `import`: Specify the document path to import and format according to the scheme requirements.
> - `chunk`: Enter the document path to split and structure for the scheme.
>
> `ilab data [verb] [path]`
>
> - Pro: Keeps all data-related commands in one place, which potentially makes for a unified experience for the user and may make tasks more easily discoverable. This simplifies the CLI structure.
> - Con: The broad scope may lead to cluttering the group with varied tasks. Users who are focused on document-specific actions might find them harder to locate.
>
> `ilab docs` / `ilab document`
>
> - Pro: Creates a clear and dedicated space for document-specific commands, which potentially makes things easier for users working with document-related functions. This leaves a lot of room for scalability.
> - Con: Adds another command group, fragmenting the CLI, especially if the document tasks/commands are minimal. Users might need to switch between command groups if they are working with documents and other data types.

- **Implementation Details**: Initially, this functionality will be integrated into the existing SDG repository. Over time, it can evolve into a standalone utility, allowing external integrations and wider usage.
> **Review comment:** What would be the motivation for moving this out of the SDG repository? "Allowing external integrations and wider usage" doesn't really tell me much.


- **Example Workflow**:

```bash
ilab document format --input path/to/document.pdf --output path/to/schema
```

This command will ingest a document, process it into the instructlab schema, and output the result for further use in SDG or RAG workflows.
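
To make the command's behavior concrete, here is a hypothetical sketch of the steps it could perform internally: convert the document with Docling, split the extracted text into chunks, and write the result out for SDG/RAG consumption. The helper names, the naive fixed-size chunking, and the output layout are all illustrative assumptions, not the actual SDG implementation.

```python
import json
from pathlib import Path

from docling.document_converter import DocumentConverter


def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Naive fixed-size chunking; the real chunker would respect document structure."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def format_document(input_path: str, output_dir: str) -> Path:
    """Hypothetical flow behind `ilab document format`."""
    result = DocumentConverter().convert(input_path)   # Docling conversion (see section 3.2)
    doc = result.document.export_to_dict()

    # Assumes the Docling dict exposes extracted text items under a "texts" key.
    text = " ".join(item.get("text", "") for item in doc.get("texts", []))

    out_file = Path(output_dir) / f"{Path(input_path).stem}.json"
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text(json.dumps({"source": input_path, "chunks": chunk_text(text)}, indent=2))
    return out_file
```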

### 3.4 Simplifying Git-Based Workflows for Users

- **Current Challenge**: Knowledge documents are stored in Git-based repositories, which may be unfamiliar to many users.
- **Proposed Solution**:
  - Allow users to input a local directory and provide an automated script that:
    1. Initializes a Git repository.
    2. Creates the necessary branches.
    3. Organizes files into the required structure.

> **Review comment:** Rather than a "script," why not just have this be part of the code? We can detect if a given directory is git-tracked (e.g. by checking for a `.git` subdirectory) and do the manipulation described if not.

By abstracting the Git setup process, we can retain Git’s benefits (version control, backups) while simplifying the interface for non-technical users.

- **Implementation Example**:

```bash
./setup_git_repo.sh --input /path/to/local/docs
```

This script automates the process of structuring knowledge documents for ingestion.
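
Whether this ends up as a standalone script or as in-tree code (as one review comment above suggests), the automation could look roughly like the sketch below. The `documents/` layout, branch name, and commit message are assumptions for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def setup_knowledge_repo(input_dir: str, repo_dir: str, branch: str = "knowledge-docs") -> None:
    """Wrap a plain directory of knowledge documents in a Git repository."""
    repo = Path(repo_dir)
    docs_dir = repo / "documents"               # assumed target layout
    docs_dir.mkdir(parents=True, exist_ok=True)

    # Copy the user's local files into the required structure.
    for src in Path(input_dir).iterdir():
        if src.is_file():
            shutil.copy2(src, docs_dir / src.name)

    # Initialize Git only if the directory is not already git-tracked.
    if not (repo / ".git").is_dir():
        subprocess.run(["git", "init"], cwd=repo, check=True)

    subprocess.run(["git", "add", "."], cwd=repo, check=True)
    subprocess.run(["git", "commit", "-m", "Add knowledge documents"], cwd=repo, check=True)
    # Create (or reset) the working branch once a commit exists.
    subprocess.run(["git", "checkout", "-B", branch], cwd=repo, check=True)
```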

### 3.5 Workflow Visualization

Here is a conceptual diagram illustrating the workflow from document ingestion to schema conversion and chunking:

![Knowledge_Document_Ingestion_Workflow](https://github.com/user-attachments/assets/06504b1b-bc8f-4909-b6a2-732a056613c5)
> **Review comment:** Very nice diagram!


The pipeline begins with the ingestion of a knowledge document, passes through a conversion step to the instructlab schema using **Docling**, and then processes the document into a format usable by SDG and RAG workflows.

## 4. Future Enhancements

### 4.1 Support for Additional Data Types

To extend beyond text-based documents, the pipeline will explore handling other formats, including:

- **Audio and Video**: Incorporating media formats will require modifications to the schema and additional processing capabilities.
- **Visual Language Models (VLMs)**: We will collaborate with the research team to align this work with visual data processing tools and extend the pipeline to handle multimedia.

### 4.2 Refined Chunker Library

- **Standalone Library**: The document chunking functionality will eventually be refactored into a dedicated chunking library or utility within the SDG module. This separation will make it easier to maintain and extend in future iterations.

- **Performance Optimizations**: Ongoing work will aim to reduce the time and resources needed for large-scale document chunking, particularly for multi-format documents like PDFs containing both text and images.

## 5. InstructLab Schema Overview

### Key Components

- **Docling JSON Output**: The output from Docling will be the instructlab schema, which serves as the backbone for both SDG and RAG workflows. Specific details such as the leaf node path or timestamp will be encoded as part of the file nomenclature.


> **Review comment:** As part of the ingestion command, should we consider a flag where the pipeline could augment the metadata of the final output, like `--metadata ./path-to-metadata.json`, to add information such as attribution, timestamps, ilab version, schema version, etc.?


This schema will standardize the data format for all supported document types, enabling consistency and modularity in the pipeline.
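
As a purely illustrative example of how the leaf node path and a timestamp could be folded into the output file name (the exact nomenclature is not yet fixed), something like the following could work:

```python
from datetime import datetime, timezone


def output_filename(leaf_node_path: str, extension: str = "json") -> str:
    """Illustrative nomenclature: encode the leaf node path and a UTC timestamp."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    safe_leaf = leaf_node_path.strip("/").replace("/", "_")
    return f"{safe_leaf}__{stamp}.{extension}"


# e.g. "knowledge_science_astronomy__20241030T120000Z.json"
print(output_filename("knowledge/science/astronomy"))
```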