Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FOR SHARING PURPOSES ONLY] RAG ingestion and chat pipelines #2736

Draft
wants to merge 30 commits into
base: main
Choose a base branch
from

Conversation

dmartinol
Copy link

@dmartinol dmartinol commented Dec 3, 2024

TODO list before marking the draft as ready:

  • fix mypy static analysis
  • fix pytest
  • introduce factory for DocumentStore
  • introduce factory for Retriever
  • introduce factory for Document Embedders
  • ilab data process should use the DoclingConverter instead of sdg DocumentChunker
  • add taxonomy navigation for default ilab data ingest pipeline
  • add unit tests
  • integration tests (maybe a different PR)
  • Integrate PaRAGon

Hi team! I’ve put together some initial code that can serve as a starting point for our discussion on the topic in subject.
Given the current limitations on updating the CLI interface in the current version, the approach leverages configuration options to enable the ingestion and chat pipelines without requiring new commands
Details below.

Note:
Since we're using a development version of instructlab-sdg package, the following steps are needed for now:

  • Clone the instructlab/sdg repo
  • Import the local package like pip install -e /path/to/sdg/repo

RAG document transformation and ingestion

  • Transform user docs using available SDG modules (no need to define any qna.yaml knowledge document)
  • Generate embeddings and ingest them in a configured DB

No configuration

Flags and env vars

Reflect the design document RAG ingestion and chat pipelines

Command

From pre-processed documents:

ilab data process --rag /path/to/pre-processsed/docs

Inclusing pre-processing pipeline:

ilab data process --rag --transform --transform-output /path/to/pre-processsed/docs /path/to/user/docs

Output in the configured DB. Supported DB: MilvusLite.

RAG Chat

Configuration

chat:
  rag:
    enable: false
    retriever:
      top_k: 20
      embedder:
        model_name: sentence-transformers/all-minilm-l6-v2
    document_store:
      type: milvuslite
      uri: embeddings.db
      collection_name: Ilab

Command

ilab model chat

Includes additional marker to display the activation status of the RAG function:

% ilab model chat                        
╭───────────────────────────────────────────────────────────────────────────── system ─────────────────────────────────────────────────────────────────────────────╮
│ Welcome to InstructLab Chat w/ GRANITE-7B-LAB-Q4_K_M.GGUF (type /h for help)                                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
>>>                                                                                                                                                [RAG][S][default]

@mergify mergify bot added dependencies Relates to dependencies ci-failure PR has at least one CI failure labels Dec 3, 2024
Signed-off-by: Daniele Martinoli <[email protected]>
@mergify mergify bot removed the ci-failure PR has at least one CI failure label Dec 3, 2024
@mergify mergify bot added the ci-failure PR has at least one CI failure label Dec 9, 2024
@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 10, 2024
Signed-off-by: Daniele Martinoli <[email protected]>
@mergify mergify bot added testing Relates to testing ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 10, 2024
@nathan-weinberg nathan-weinberg added the hold In-progress PR. Tag should be removed before merge. label Dec 10, 2024
Signed-off-by: Daniele Martinoli <[email protected]>
Signed-off-by: Daniele Martinoli <[email protected]>
Signed-off-by: Daniele Martinoli <[email protected]>
@mergify mergify bot removed the ci-failure PR has at least one CI failure label Dec 13, 2024
Signed-off-by: Daniele Martinoli <[email protected]>
Signed-off-by: Daniele Martinoli <[email protected]>
Signed-off-by: Daniele Martinoli <[email protected]>
Signed-off-by: Daniele Martinoli <[email protected]>
@mergify mergify bot added the ci-failure PR has at least one CI failure label Dec 15, 2024
@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 16, 2024
@mergify mergify bot added CI/CD Affects CI/CD configuration ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 16, 2024
@mergify mergify bot removed the ci-failure PR has at least one CI failure label Dec 17, 2024
@mergify mergify bot added the ci-failure PR has at least one CI failure label Dec 17, 2024
@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 18, 2024
@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 18, 2024
Signed-off-by: Daniele Martinoli <[email protected]>
@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 18, 2024
Copy link
Contributor

@cdoern cdoern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some lengthy comments. I know this is in a draft state but I figured I would get a review in. Many of my comments are concerns over config defaulting, reading, and how we are using these new RAG options.

Also, we will need e2e tests for this since it is a major change.

@@ -37,3 +37,4 @@ wandb>=0.16.4
xdg-base-dirs>=6.0.1
psutil>=6.0.0
huggingface_hub[hf_transfer]>=0.1.8
haystack-ai>=2.8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty sure new requirements like these need to be run by a specific set of people.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please elaborate more on the "specific set of people" part? (for sure I miss something here)

@@ -0,0 +1,4 @@
# Cannot upgrade because of https://github.com/milvus-io/milvus-haystack/issues/39
milvus_haystack==0.0.11
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here for new requirements

src/instructlab/cli/data/ingest.py Outdated Show resolved Hide resolved
src/instructlab/cli/data/process.py Outdated Show resolved Hide resolved
@@ -162,6 +162,9 @@ class _chat(BaseModel):
default=1.0,
description="Controls the randomness of the model's responses. Lower values make the output more deterministic, while higher values produce more random results.",
)
rag: Optional[Dict[str, Any]] = Field(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be its own class with specific arguments we can validate and run through tests rather than a dictionary that can contain anything. If we are putting something directly into our config, we need to have control over validation.

if we were going to separate config yaml route, we could just blindly read it in, but I think this approach is better.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way it's been designed follows the initial conversation we had 2 weeks ago, when we decided not to update the configuration to avoid compatibility issues in case we would have changed any of these commands, as their definition seemed "unstable".
I'd be happy if we instead decide to adopt the regular config definitions, as it is a more straightforward implementation.

src/instructlab/haystack/docling_splitter.py Outdated Show resolved Hide resolved
@@ -172,6 +176,7 @@ def is_openai_server_and_serving_model(
"--temperature",
cls=clickext.ConfigOption,
)
@rag_options
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we are implicitly passing these in?

the way we do this with config entries is map the nested classes to click.options where the defaults are funneled in via cls=clickext.ConfigOption. I would prefer to

  1. make the config options a class not a dictionary
  2. add associated flags to override in ilab model chat where the defaults use the above class ConfigOption to read defaults from the config

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. ok, as you also requested in another comment
  2. sure, I'll use the regular design then

logger = logging.getLogger(__name__)


def rag_options(command):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this wrapper apply these to any ilab command using the @rag_options? I would really prefer for ilab commands directly using rag and allowing users to configure RAG, that they have their own validated set of options that read from the config.

A second option I would be ok with, is if these rag_options use the cls=clickext.ConfigOption class to read their defaults from the config file.

Copy link
Author

@dmartinol dmartinol Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this wrapper apply these to any ilab command using the @rag_options?

This was meant to be used only in the chat command, where it makes sense to add these options.

I would really prefer for ilab commands directly using rag and allowing users to configure RAG, that they have their own validated set of options that read from the config

👍

)(command)
command = click.option(
"--retriever-embedder-model-dir",
default=lambda: DEFAULTS.MODELS_DIR,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

things like this for example should be reading from the config, we try not to specify defaults at this level

@@ -16,6 +16,9 @@ chat:
# Model to be used for chatting with.
# Default: /cache/instructlab/models/granite-7b-lab-Q4_K_M.gguf
model: /cache/instructlab/models/granite-7b-lab-Q4_K_M.gguf
# The RAG chat configuration
# Default: {}
rag: {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this should have some sane defaults or else users will not know how to use this

@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 19, 2024
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cdoern I already noticed the mypy issues detected in this file, but I was wondering what I had to do with them. Should I try to fix them even if I did not update this code (w/o introducing any side effect, of course) or can we skip these warnings?

Signed-off-by: Daniele Martinoli <[email protected]>
@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI/CD Affects CI/CD configuration ci-failure PR has at least one CI failure dependencies Relates to dependencies hold In-progress PR. Tag should be removed before merge. testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants