[FOR SHARING PURPOSES ONLY] RAG ingestion and chat pipelines #2736

dmartinol · 2024-12-03T12:43:51Z

TODO list before marking the draft as ready:

Hi team! I've put together some initial code that can serve as a starting point for our discussion on RAG ingestion and chat pipelines.
Given the current limitations on changing the CLI interface, this approach uses configuration options to enable the ingestion and chat pipelines without requiring new commands.
Details below.

Note:
Since we're using a development version of the instructlab-sdg package, the following steps are needed for now:

Clone the instructlab/sdg repo
Install the local package in editable mode, e.g. pip install -e /path/to/sdg/repo

RAG document transformation and ingestion

Transform user docs using available SDG modules (no need to define any qna.yaml knowledge document)
Generate embeddings and ingest them in a configured DB

No configuration

Flags and env vars

These reflect the design document "RAG ingestion and chat pipelines".

Command

From pre-processed documents:

ilab data process --rag /path/to/pre-processed/docs

Including the pre-processing pipeline:

ilab data process --rag --transform --transform-output /path/to/pre-processed/docs /path/to/user/docs

The output is written to the configured DB. Supported DB: MilvusLite.
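The ingestion flow described above (split documents, generate embeddings, write them to a store) can be sketched without any dependencies. This is only an illustration of the steps, not the PR's implementation: the hashed bag-of-words `embed` function stands in for a sentence-transformers model, `InMemoryStore` stands in for MilvusLite, and every name in the block is hypothetical.

```python
from __future__ import annotations

import hashlib
import math
from dataclasses import dataclass, field

DIM = 64  # toy embedding size; a real embedding model defines its own dimension


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        idx = int(hashlib.sha256(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def split(text: str, max_words: int = 50) -> list[str]:
    """Split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for the configured document store."""

    collection_name: str
    rows: list[tuple[str, list[float]]] = field(default_factory=list)

    def write(self, chunks: list[str]) -> int:
        # Store each chunk next to its embedding, as a vector DB would.
        for chunk in chunks:
            self.rows.append((chunk, embed(chunk)))
        return len(chunks)


def ingest(docs: list[str], store: InMemoryStore, max_words: int = 50) -> int:
    """Chunk every document, embed the chunks, and write them to the store."""
    return sum(store.write(split(doc, max_words)) for doc in docs)
```

Calling `ingest(docs, InMemoryStore(collection_name="Ilab"))` returns the number of chunks written; in the PR the equivalent work is delegated to Haystack components and the Milvus document store.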

RAG Chat

Configuration

chat:
  rag:
    enable: false
    retriever:
      top_k: 20
      embedder:
        model_name: sentence-transformers/all-minilm-l6-v2
    document_store:
      type: milvuslite
      uri: embeddings.db
      collection_name: Ilab

Command

ilab model chat

The chat prompt includes an additional marker that shows whether the RAG function is active:

% ilab model chat                        
╭───────────────────────────────────────────────────────────────────────────── system ─────────────────────────────────────────────────────────────────────────────╮
│ Welcome to InstructLab Chat w/ GRANITE-7B-LAB-Q4_K_M.GGUF (type /h for help)                                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
>>>                                                                                                                                                [RAG][S][default]
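As a rough picture of what the `enable` and `top_k` settings control, the sketch below retrieves the best-matching chunks and prepends them to the user's question only when RAG is enabled. It is an assumption-laden illustration: the word-overlap `score` replaces the embedding similarity search that a real retriever would run against the document store, and the prompt template and all function names are hypothetical.

```python
from __future__ import annotations


def score(query: str, chunk: str) -> float:
    """Toy relevance score (Jaccard word overlap), standing in for vector similarity."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q | c) if q | c else 0.0


def retrieve(query: str, chunks: list[str], top_k: int) -> list[str]:
    """Return the top_k chunks ranked by relevance to the query."""
    ranked = sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, chunks: list[str], rag_enabled: bool, top_k: int = 20) -> str:
    """Prepend retrieved context to the question when RAG is enabled."""
    if not rag_enabled:
        return question
    context = "\n".join(retrieve(question, chunks, top_k))
    return f"Context:\n{context}\n\nQuestion: {question}"
```

With `rag_enabled=False` the question is passed through unchanged, which matches the default `enable: false` in the configuration above.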

Signed-off-by: Daniele Martinoli <[email protected]>

cdoern

Left some lengthy comments. I know this is in a draft state, but I figured I would get a review in. Many of my comments are concerns about config defaulting, config reading, and how we are using these new RAG options.

Also, we will need e2e tests for this since it is a major change.

cdoern · 2024-12-18T13:43:42Z

requirements.txt

@@ -37,3 +37,4 @@ wandb>=0.16.4
 xdg-base-dirs>=6.0.1
 psutil>=6.0.0
 huggingface_hub[hf_transfer]>=0.1.8
+haystack-ai>=2.8


I am pretty sure new requirements like these need to be run by a specific set of people.

Could you please elaborate on the "specific set of people" part? (I'm surely missing something here.)

cdoern · 2024-12-18T13:43:53Z

requirements/milvus.txt

@@ -0,0 +1,4 @@
+# Cannot upgrade because of https://github.com/milvus-io/milvus-haystack/issues/39
+milvus_haystack==0.0.11


same here for new requirements

src/instructlab/cli/data/ingest.py

src/instructlab/cli/data/process.py

cdoern · 2024-12-18T13:47:39Z

src/instructlab/configuration.py

@@ -162,6 +162,9 @@ class _chat(BaseModel):
        default=1.0,
        description="Controls the randomness of the model's responses. Lower values make the output more deterministic, while higher values produce more random results.",
    )
+    rag: Optional[Dict[str, Any]] = Field(


This should be its own class with specific arguments we can validate and run through tests rather than a dictionary that can contain anything. If we are putting something directly into our config, we need to have control over validation.

If we were going the separate config YAML route, we could just blindly read it in, but I think this approach is better.

The current design follows the initial conversation we had 2 weeks ago, when we decided not to update the configuration, to avoid compatibility issues in case any of these commands changed, as their definition seemed "unstable".
I'd be happy if we instead decided to adopt the regular config definitions, as that is a more straightforward implementation.
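To make the "own class with specific arguments we can validate" suggestion concrete, here is a minimal sketch of a typed replacement for `Optional[Dict[str, Any]]`. The field names and defaults are taken from the YAML in the PR description; everything else is an assumption. In particular, it uses standard-library dataclasses so that it runs on its own, whereas the project's `configuration.py` is built on pydantic `BaseModel` and `Field`, where constraints on the fields would play the role of the `__post_init__` checks.

```python
from dataclasses import dataclass, field

SUPPORTED_STORES = ("milvuslite",)  # the PR description lists MilvusLite only


@dataclass
class Embedder:
    model_name: str = "sentence-transformers/all-minilm-l6-v2"


@dataclass
class Retriever:
    top_k: int = 20
    embedder: Embedder = field(default_factory=Embedder)

    def __post_init__(self) -> None:
        if not isinstance(self.top_k, int) or self.top_k < 1:
            raise ValueError(f"top_k must be a positive integer, got {self.top_k!r}")


@dataclass
class DocumentStore:
    type: str = "milvuslite"
    uri: str = "embeddings.db"
    collection_name: str = "Ilab"

    def __post_init__(self) -> None:
        if self.type not in SUPPORTED_STORES:
            raise ValueError(f"unsupported document store type: {self.type!r}")


@dataclass
class RagConfig:
    enable: bool = False
    retriever: Retriever = field(default_factory=Retriever)
    document_store: DocumentStore = field(default_factory=DocumentStore)


def rag_from_dict(data: dict) -> RagConfig:
    """Build a RagConfig from parsed YAML; unknown keys raise TypeError."""
    data = dict(data)
    retriever = dict(data.pop("retriever", {}))
    embedder = Embedder(**retriever.pop("embedder", {}))
    store = DocumentStore(**data.pop("document_store", {}))
    return RagConfig(
        retriever=Retriever(embedder=embedder, **retriever),
        document_store=store,
        **data,
    )
```

An empty mapping yields the documented defaults, while a misspelled key or an out-of-range `top_k` fails at load time instead of surfacing later in the chat command.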

src/instructlab/haystack/docling_splitter.py

cdoern · 2024-12-18T13:54:09Z

src/instructlab/model/chat.py

@@ -172,6 +176,7 @@ def is_openai_server_and_serving_model(
    "--temperature",
    cls=clickext.ConfigOption,
 )
+@rag_options


so we are implicitly passing these in?

The way we do this with config entries is to map the nested classes to click.option declarations, where the defaults are funneled in via cls=clickext.ConfigOption. I would prefer to:

make the config options a class, not a dictionary

add associated override flags to ilab model chat, whose defaults are read from the config through the ConfigOption class above

ok, as you also requested in another comment

sure, I'll use the regular design then
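The precedence being discussed here (config values supply the defaults, command-line flags override them) can be illustrated as follows. This is not how `clickext.ConfigOption` is implemented; it is a dependency-free sketch that uses `argparse` in place of click, and the `--rag` and `--retriever-top-k` flag names as well as the `CONFIG` dictionary are assumptions made for the example.

```python
import argparse

# Hypothetical parsed config, shaped like the chat.rag section in the PR description.
CONFIG = {"chat": {"rag": {"enable": False, "retriever": {"top_k": 20}}}}


def build_parser(config: dict) -> argparse.ArgumentParser:
    """Declare flags whose defaults come from the config rather than being hard-coded."""
    rag = config["chat"]["rag"]
    parser = argparse.ArgumentParser(prog="chat-sketch")
    parser.add_argument(
        "--rag",
        dest="rag_enabled",
        action="store_true",
        default=rag["enable"],  # default read from the config
        help="Enable RAG for this chat session.",
    )
    parser.add_argument(
        "--retriever-top-k",
        type=int,
        default=rag["retriever"]["top_k"],  # default read from the config
        help="Number of documents to retrieve.",
    )
    return parser
```

Parsing an empty argument list returns the config values unchanged, and passing `--retriever-top-k 5 --rag` overrides them for that invocation only.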

cdoern · 2024-12-18T13:56:54Z

src/instructlab/rag/rag_configuration.py

+logger = logging.getLogger(__name__)
+
+
+def rag_options(command):


will this wrapper apply these to any ilab command using the @rag_options? I would really prefer for ilab commands directly using rag and allowing users to configure RAG, that they have their own validated set of options that read from the config.

A second option I would be OK with is for these rag_options to use cls=clickext.ConfigOption to read their defaults from the config file.

will this wrapper apply these to any ilab command using the @rag_options?

This was meant to be used only in the chat command, where it makes sense to add these options.

I would really prefer for ilab commands directly using rag and allowing users to configure RAG, that they have their own validated set of options that read from the config

👍

cdoern · 2024-12-18T13:57:20Z

src/instructlab/rag/rag_configuration.py

+    )(command)
+    command = click.option(
+        "--retriever-embedder-model-dir",
+        default=lambda: DEFAULTS.MODELS_DIR,


Things like this, for example, should be read from the config; we try not to specify defaults at this level.

cdoern · 2024-12-18T13:57:53Z

tests/testdata/default_config.yaml

@@ -16,6 +16,9 @@ chat:
  # Model to be used for chatting with.
  # Default: /cache/instructlab/models/granite-7b-lab-Q4_K_M.gguf
  model: /cache/instructlab/models/granite-7b-lab-Q4_K_M.gguf
+  # The RAG chat configuration
+  # Default: {}
+  rag: {}


yeah, this should have some sane defaults or else users will not know how to use this

Signed-off-by: Daniele Martinoli <[email protected]>

dmartinol · 2024-12-20T16:21:34Z

src/instructlab/model/chat.py

@cdoern I had already noticed the mypy issues detected in this file, but I was wondering what I should do about them. Should I try to fix them even though I did not touch this code (without introducing any side effects, of course), or can we skip these warnings?

Signed-off-by: Daniele Martinoli <[email protected]>

dmartinol and others added 3 commits November 30, 2024 23:24

initial commit of rag chat POC

c0a1dfe

Signed-off-by: Daniele Martinoli <[email protected]>

Merge branch 'instructlab:main' into rag_sample

e437994

RAG ingestion and chat pipelines

81e6723

Signed-off-by: Daniele Martinoli <[email protected]>

mergify bot added dependencies Relates to dependencies ci-failure PR has at least one CI failure labels Dec 3, 2024

linting

3548250

Signed-off-by: Daniele Martinoli <[email protected]>

mergify bot removed the ci-failure PR has at least one CI failure label Dec 3, 2024

Merge branch 'instructlab:main' into rag_sample

af11a91

mergify bot added the ci-failure PR has at least one CI failure label Dec 9, 2024

Merge branch 'instructlab:main' into rag_sample

03f5a46

mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 10, 2024

with data process command

0a773f0

Signed-off-by: Daniele Martinoli <[email protected]>

mergify bot added testing Relates to testing ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 10, 2024

nathan-weinberg added the hold In-progress PR. Tag should be removed before merge. label Dec 10, 2024

dmartinol added 3 commits December 13, 2024 07:43

rag and rag_configuration modules

f7d00ed

Signed-off-by: Daniele Martinoli <[email protected]>

more help command

4511190

Signed-off-by: Daniele Martinoli <[email protected]>

with RagHandler

fce81d3

Signed-off-by: Daniele Martinoli <[email protected]>

mergify bot removed the ci-failure PR has at least one CI failure label Dec 13, 2024

dmartinol added 4 commits December 13, 2024 18:49

using docling chunker

82dc6ac

Signed-off-by: Daniele Martinoli <[email protected]>

remove jq dependency

ddbc355

Signed-off-by: Daniele Martinoli <[email protected]>

removed unneeded comment

f0daa10

Signed-off-by: Daniele Martinoli <[email protected]>

added optional milvus dependencies

ed8c45f

Signed-off-by: Daniele Martinoli <[email protected]>

mergify bot added the ci-failure PR has at least one CI failure label Dec 15, 2024

dmartinol and others added 2 commits December 16, 2024 08:54

haystack components factory

ccae17a

Signed-off-by: Daniele Martinoli <[email protected]>

Merge branch 'instructlab:main' into rag_sample

f84bc01

mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 16, 2024

mergify bot added CI/CD Affects CI/CD configuration ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 16, 2024

updated option names to latest dev-docs PR

d884295

Signed-off-by: Daniele Martinoli <[email protected]>

mergify bot removed the ci-failure PR has at least one CI failure label Dec 17, 2024

dmartinol and others added 2 commits December 17, 2024 15:57

Merge branch 'instructlab:main' into rag_sample

afffbd4

updated option names to latest dev-docs PR

c23e063

Signed-off-by: Daniele Martinoli <[email protected]>

mergify bot added the ci-failure PR has at least one CI failure label Dec 17, 2024

dmartinol added 2 commits December 18, 2024 11:39

added taxonomy navigation to data process sub-command

e62fed2

Signed-off-by: Daniele Martinoli <[email protected]>

fixed config issue after latest class renaming

0ea564c

Signed-off-by: Daniele Martinoli <[email protected]>

mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 18, 2024

dmartinol and others added 3 commits December 18, 2024 12:36

restored proper logging level. using click.secho to report error

e083324

Signed-off-by: Daniele Martinoli <[email protected]>

added navigation to sdg processed documents in the ingest command

ae88f7c

Signed-off-by: Daniele Martinoli <[email protected]>

Merge branch 'instructlab:main' into rag_sample

ac55e11

mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 18, 2024

fixed linter issue

419ef6f

Signed-off-by: Daniele Martinoli <[email protected]>

mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 18, 2024

fixed on test case: torch import is required by docling

93e5369

Signed-off-by: Daniele Martinoli <[email protected]>

cdoern requested changes Dec 18, 2024


fixing some messages before recording demo

8591322

Signed-off-by: Daniele Martinoli <[email protected]>

mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 19, 2024

dmartinol commented Dec 20, 2024


integrated initial commands

55c7306

Signed-off-by: Daniele Martinoli <[email protected]>

mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Dec 20, 2024

cdoern mentioned this pull request Jan 3, 2025

Split generate_data into multiple top-level APIs and CLIs instructlab/sdg#443

Draft
