RAG ingestion and chat pipelines #161
Conversation
I think that the chat …
---
docs/cli/ilab-rag-retrieval.md (outdated):
> The script should guide users in overriding default options and generating the RAG artifacts in the configured vector database instance. By default, support for MilvusLite will be included.
>
> > **ℹ️ CLI-based alternative:** This is an alternative to run both the document transformation and the document ingestion in …
---
added one option to overcome today's limitations and run the docs and ingestion pipelines with no CLI changes
---
docs/cli/ilab-rag-retrieval.md (outdated):
> allowed. Therefore, we also propose alternative approaches to run the same RAG pipelines using existing `ilab` commands or
> other provided tools.
>
> ### 3.1 RAG Ingestion Pipeline Command
---
How will this be handled for fine-tuning?
I would expect that many of the components will end up being re-invented here.
---
Are you asking about fine-tuning of the embedding model or the response-generation model or both?
---
Both.
---
Here is the plan regarding fine-tuning of the response-generation model:
- There are no plans to make changes to the existing capability in InstructLab for synthetic data generation (SDG) and fine-tuning the response-generation model from that synthetic data.
- That existing capability includes a preprocessing step that is part of the `ilab data generate` command which fetches source documents (e.g., PDF files) and processes them using docling.
- In RAG ingestion and chat pipelines #161 we propose to separate that preprocessing into its own step.
- The outputs of that step will be used as inputs for the capabilities in RAG for vectorizing and indexing that same content (the source documents).
- Ideally there will also be some way to put documents in directly without having to run the SDG preprocessing, but that is lower priority than just getting the primary flow working.
Fine-tuning the embedding model is out of scope for the MVP, but in the future I think we expect that the outputs of SDG would also be useful as training data for an embedding model (e.g., a cross-encoder model that really needs query / response pairs for fine-tuning). Alternatively, maybe we just use the extracted text for fine tuning a basic single-text encoder.
---
docs/cli/ilab-rag-retrieval.md (outdated):
> This command processes embeddings generated from documents located in the */path/to/docs/folder* folder and stores them in a vector database. These embeddings are intended to be used as augmented context in a Retrieval-Augmented Generation (RAG) chat pipeline.
>
> #### Assumptions
> The documents must be in JSON format and pre-processed using the `docling` tool.
---
I think you only have to assume the format, not necessarily that it was processed via docling.
---
While it may be true, it's not a strict requirement.
---
Yes, I think that's right, but the format is the output format of docling, not any JSON file. It needs to match the docling JSON schema so we can use hierarchical chunking, at least for now. We can add support for other formats in the future.
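A format pre-check along these lines could enforce that assumption before ingestion. Note this is only a sketch: the `schema_name` key and `DoclingDocument` value checked here are assumptions about the docling export schema and should be verified against the docling version actually in use.

```python
import json
from pathlib import Path


def looks_like_docling_json(path: Path) -> bool:
    """Heuristic pre-check that a JSON file matches the docling export shape.

    ASSUMPTION: docling exports a top-level "schema_name" field set to
    "DoclingDocument"; verify against the docling version in use.
    """
    try:
        doc = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return False
    return isinstance(doc, dict) and doc.get("schema_name") == "DoclingDocument"
```

Rejecting files early with a clear error message is cheaper than failing mid-way through hierarchical chunking.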
---
I can clarify that it follows the `docling` format because we use the `instructlab-sdg` package, which depends on that tool. As soon as they switch to a different format/tool, we will update accordingly.
---
docs/cli/ilab-rag-retrieval.md (outdated):
> Chat pipeline enriched by RAG context:
> ![rag-chat](../images/rag-chat.png)
>
> ### 3.8 Proposed Implementation Stack
---
Feast should be considered here.
We are in the process of preparing it for Tech Preview in RHOAI and, as mentioned, many of the issues Feast handles will end up being reimplemented. In particular:
- Data Ingestion
- Data Transformation
- Data Indexing
- Data Retrieval/Serving
- Dataset Generation / Training Set Preparation
- Governance
Feast has implementations (or starter implementations) for each of these and I believe it would make sense to leverage it to ensure we can scale to meet RHOAI's needs.
The point about separating all of those components is that each one of them should be configurable and they all impact what will be injected into the context and how the model can be fine tuned/optimized and evaluated.
More fundamentally, from an architectural and engineering perspective each sub-area requires optimization for a production system (e.g., large scale indexing, low latency transformation, role based access control, etc.) and using Feast provides most of these out of the box.
---
Is there documentation for Feast? If so, can you post a link here?
---
https://docs.feast.dev/getting-started/quickstart#what-is-feast is probably a good overview
RAG support is alpha and actively being worked on. Documentation here: https://docs.feast.dev/reference/alpha-vector-database
---
I see support for indexing and retrieving in the alpha-vector-database link, but not other aspects of RAG (e.g., response generation). Is that also being worked on, or is the RAG focus specifically indexing and retrieval?
---
It is exclusively focused on indexing and retrieval. Generation (i.e., inference) would be a separate service. In RHOAI it'd be vLLM with KServe.
---
`ilab data ingest` would, in an ideal world, be Feast.
`ilab model chat` I'm not sure about. It would require KServe, potentially Feast (or whatever document serving solution we choose), and whatever chat application/service solution we choose.
---
@dmartinol writes:
> we could try to be a bit more generic in the adopted terminology. E.g. user store instead of vectordb when we reference parameter names. This can also come in handy when we integrate with embedding APIs rather than using our own embedder. @jwm4 WDYT?

Yes, this seems like a good idea; as much as possible it would be nice for parameter names to not assume implementation details. Those details are likely to change over time. The example of referring to the place we store documents as a store (or "document store"?) instead of a vectordb seems good to me. Can you find other examples too?
---
@franciscojavierarceo writes:
> When we get to the RHOAI solution we would benefit from all of the operator, RBAC, integration work we've already done, which is where my view comes from.
> If we were to go down that route, ilab data ingest would just be applications of Feast with our ILab specific business logic. I think that'd be a reasonable pattern.
> I do acknowledge though that it could be bloated for what we're doing and Feast RAG is not as strong of a community as the other nor as good of a solution. The big benefit is that Feast is not single-entity backed.
> These are all things I think are worth discussing in a doc (or maybe this ADR).

These seem like very important topics to me. I would rather have a separate doc for them than try to get it all sorted out in this doc, because they also seem very complicated. FWIW, I think I only understand bits and pieces of the vision here and not the big picture yet, but it does seem like there is a lot of value that this integration could provide for enterprise scale-out.
---
> Can you find other examples too?

For now we don't have many configurations exposed, and `document-store` can also apply to the chat pipeline.
Post-MVP, we could include `retriever` settings to specify the retriever details, e.g. a `retriever-type` option, one of:
- `default`: our default retriever implementation using the configured document store.
- `api`: a dedicated service exposing well-defined APIs for retrieving embeddings (`/query`). Includes the document store.
- `feast`: includes the document store.
Before going into the hell of overthinking, I would wait for the outcome of ET experiments and also carefully evaluate the solution designed by the IBM research team.
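For illustration only, such post-MVP retriever settings might look roughly like this in the chat configuration. The key names and nesting here are hypothetical, borrowing the `chat.rag.*` names that appear later in this discussion:

```yaml
chat:
  rag:
    enabled: true
    retriever:
      type: default        # hypothetical: one of default, api, feast
      top_k: 10
      document-store:      # hypothetical generic name instead of vectordb
        type: milvuslite
        uri: ./rag-output.db
```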
---
> `ilab data ingest` would, in an ideal world, be Feast.
> `ilab model chat` I'm not sure about. It would require KServe, potentially Feast (or whatever document serving solution we choose), and whatever chat application/service solution we choose.

Agree. That being said, how can this ADR support the request to integrate with Feast?
---
docs/cli/ilab-rag-retrieval.md (outdated):
> | **TODO** evaluation framework options | | | |
>
> Equivalent YAML document for the newly proposed options: …
---
This looks like it could pretty easily be structured in Feast.
---
This document is a good start, but needs a lot more input from a lot of stakeholders, especially stakeholders who work on the existing command-line interface.
---
docs/cli/ilab-rag-retrieval.md (outdated):
> * Internal Red Hat CI systems for products or services (e.g., Lightspeed products)
>
> ## 3. Proposed Commands
> **Note**: In the context of version 1.4, currently under development, no changes to the command-line interface should be …
---
Can you remove the references to "version 1.4"? This is a version number for a downstream consumer of this open source project so it doesn't belong in a dev-doc for the open source project. If you want to discuss downstream consumers, there are other venues for that.
Also, I think the rest of this note should be dropped too. It matches what I originally believed was a hard constraint, but now I am hearing that this constraint is being considered and also that there is a hard constraint that we not provide alternative approaches to run the same RAG pipelines, so really we need more discussion to find out which constraints are really hard and which are not.
---
docs/cli/ilab-rag-retrieval.md (outdated):
> ### 3.1 RAG Ingestion Pipeline Command
> The proposal is to add a `rag` subgroup under the `data` group, with an `ingest` command, like:
>
>     ilab data rag ingest /path/to/docs/folder
---
Some thoughts on this:
- I guess broadly speaking, I was expecting the proposal for how this should be reflected in the command-line interface to come from members of the engine team, e.g., @cdoern. However, I guess it is fine for us to propose things here and iterate with them.
- I don't like `rag ingest` here. I think we want something that describes what we're doing here, which is building an index, rather than bringing in the term "RAG", which describes the feature but not really what this specific step is doing.
- I'm not sure how to respond to the `/path/to/docs/folder` part. We definitely want some sort of affordance around a flow where you do an `ilab data generate` and then `ilab` just knows where the outputs of that step are rather than you needing to specify it. However, some other affordance for being able to override that location also makes sense to me. So maybe if the folder is optional that solves this?
- We need to figure out how this fits in with the broader refactor being considered in Refactor preprocessing and postprocessing in SDG #155.
- Also, I would like a flow some day where you can just point this step at source documents and it runs docling for you, but that's lower priority than the flow that is more tightly connected with SDG (or at least SDG preprocessing).
---
> I don't like rag ingest here

Me neither, but I was waiting for the closure of the discussion on the related command at Knowledge doc ingestion #148 that, IIUC, should be the preliminary step before running the embedding ingestion. Depending on the selected verb, we can update this proposal accordingly (maybe something like `ilab data index` or `ilab data generate index`?).

> I'm not sure how to respond to the /path/to/docs/folder part

Again, this followed the proposal for the other PR, which has both `--input` and `--output` options.

> We definitely want some sort of affordance around a flow where you do an `ilab data generate` and then ilab just knows where the outputs of that step are rather than you needing to specify it

If this is a valid use case, then yes, and the parameter will be optional. We have to think carefully about how to auto-detect the JSON docs in this case, as the `datasets` folder is "versioned" for each `data generate` execution, so I assume the requirement is to pick all the files from the latest `documents-*` subfolder.
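The auto-detection described above could be sketched as follows. Note this is only an illustration of the "latest `documents-*` subfolder wins" idea floated in this thread: the folder naming and layout are assumptions from the discussion, not an agreed design.

```python
from pathlib import Path
from typing import Optional


def latest_documents_dir(datasets_dir: Path) -> Optional[Path]:
    """Return the most recently modified `documents-*` subfolder, if any.

    ASSUMPTION: `ilab data generate` versions its outputs under
    `datasets/documents-*`; verify against the real directory layout
    before relying on this.
    """
    candidates = [p for p in datasets_dir.glob("documents-*") if p.is_dir()]
    if not candidates:
        return None
    # Pick the subfolder from the most recent generate run.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

If no candidate folder is found, the command could then fall back to requiring an explicit input path.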
---
docs/cli/ilab-rag-retrieval.md (outdated):
> add a reference document to the `qna.yaml` document(s).
>
> #### Supported Databases
> The command supports multiple vector database types. By default, it uses a local `MilvusLite` instance stored at `./rag-output.db`.
---
I think we should have a separate dev-doc for this decision.
---
docs/cli/ilab-rag-retrieval.md (outdated):
> > * Step 1: Transform the documents using the existing `SDG` modules (powered by `docling`).
> > * Step 2: Ingest the pre-processed artifacts from `generate.rag.output` into the configured database.
>
> **TODO** Introduce the evaluation framework
---
I'd prefer to have a separate dev-doc for evaluation framework because I think it is going to take longer to get that sorted out than some of these more urgent issues.
---
+1
---
Agree, after your latest messages in the Slack discussion channel it's definitely worth parking this point for now.
---
docs/cli/ilab-rag-retrieval.md (outdated):
> ### 3.8 Proposed Implementation Stack
> The following technologies form the foundation of the proposed solution:
>
> * [Haystack](https://haystack.deepset.ai/): Framework for implementing RAG pipelines and applications.
---
We need a dev-doc for this decision, I think.
---
+1
---
To be clear, I am not currently convinced that Haystack is the right choice for us. I did not see a thorough evaluation of the other open source frameworks and given that they are a private company, I have concerns. I am very comfortable with Milvus as it is a part of the LFAI foundation.
---
@franciscojavierarceo you can also comment on the related ADR #164
---
I think I need to clarify my position when I commented with a general "looks good to me": I think that you raise some very valid points, but also I am operating under the assumption that this document serves as a proposal for "directionally where to head right now" and that any less-than-high-level details can, will, and probably should change as we understand the problem domain better during execution. I think that a useful modus operandi is to get a few key stakeholders to give a general approval, that that is enough to get started, and then to have continuous feedback cycles going forward to course correct as necessary. Analysis paralysis is a real effect that is best avoided.

Not trying to beat a dead horse, but this is partly why I keep advocating for atomic decision records like ADRs over all-encompassing design docs like this. A general development roadmap is a necessary thing to have, but nobody will ever have enough information to design a full system specification, especially in the context of a marketplace and a large development organization. The only constant is change.
---
I will soon publish an updated version with the outcome of the discussion with the ilab Runtime (aka CLI) team.
---
@cdoern Could you please TAL and involve relevant people?
---
docs/cli/ilab-rag-retrieval.md (outdated):
> transformation, leveraging on the `instructlab-sdg` modules.
>
> ### Why We Need It
> This command streamlines the `ilab data generate` pipeline and eliminates the requirement to define a `qna` document, …
---
This is a really far-reaching design decision that could have a lot of consequences for the product. Looks like this came out of a meeting with the engine runtime team, was that recorded?
---
recording link shared on Slack channel
---
docs/cli/ilab-rag-retrieval.md (outdated):
> InstructLab technology stack.
>
> #### Usage
> The generated embeddings can later be retrieved to enrich the context for RAG-based chat pipelines.
---
The embeddings themselves, not text to be substituted into a prompt template?
---
docs/cli/ilab-rag-retrieval.md (outdated):
> | Option Description | Default Value | CLI Flag | Environment Variable |
> |--------------------|---------------|----------|----------------------|
> | Whether to include a transformation step. | `False` | `--transform` (boolean) | `ILAB_TRANSFORM` |
---
What would some examples of transformations be?
---
- `ilab data process --rag input` runs the embedding pipeline: it fetches pre-processed docs from the `input` folder and stores the generated embeddings in the configured vector store.
- `ilab data process --rag --transform --transform-output processed input` runs the transformation pipeline first: it fetches user docs from the `input` folder, processes them using the SDG transformation into the `processed` folder, then runs the previous pipeline from that folder.
---
docs/cli/ilab-rag-retrieval.md (outdated):
> | Option Description | Default Value | CLI Flag | Environment Variable |
> |--------------------|---------------|----------|----------------------|
> | Whether to include a transformation step. | `False` | `--transform` (boolean) | `ILAB_TRANSFORM` |
> | The output path of transformed documents (serves as input for the embedding ingestion pipeline). Mandatory when `--transform` is used. | | `--transform-output` | `ILAB_TRANSFORM_OUTPUT` |
> | How to split the documents. One of `page`, `passage`, `sentence`, `word`, `line` | `word` | `--splitter-split-by` | `ILAB_SPLITTER_SPLIT_BY` |
---
Has there been discussion about the existence of this logic with respect to the docling-based document transformation?
---
`docling` chunkers haven't yet been integrated into Haystack. Instead, we took the `DocumentSplitter` options from Haystack.
I agree that in the interim we should try not to introduce framework dependencies, so I'd remove them and use default settings for now. In the meantime, we can start exploring the docling chunkers. WDYT?
---
You may have seen on Slack that the Docling hybrid chunker is now released. I looked at the code briefly, and it looks good to me. More details:
I think this will be a good fit for the RAG chunking because (unlike their older hierarchical chunker), it provides chunks that are constrained to be no bigger than a fixed size for a given tokenizer and tries to make the chunks as big as possible within that size limit and the constraints of the structure.
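As a rough illustration of that size-bounded packing behavior, here is a stdlib sketch. This is not the docling API: word counts stand in for a real tokenizer's token counts, and passage boundaries stand in for the document structure the hybrid chunker respects.

```python
from typing import List


def greedy_chunks(passages: List[str], max_units: int) -> List[str]:
    """Pack consecutive passages into chunks, each as large as possible
    while staying within `max_units` (word count as a stand-in for a
    tokenizer's token count). A single oversized passage passes through
    alone rather than being split, to keep the sketch simple."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for passage in passages:
        n = len(passage.split())
        if current and current_len + n > max_units:
            # Adding this passage would exceed the budget: close the chunk.
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(passage)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The real hybrid chunker additionally splits oversized items and keeps structural metadata; this sketch only shows the "as big as possible under a fixed budget" greedy packing.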
---
I will sync up with ET engineers to adopt the same approach once available.
---
docs/cli/ilab-rag-retrieval.md (outdated):
> | Vector DB connection username. | | `--vectordb-username` | `ILAB_VECTORDB_USERNAME` |
> | Vector DB connection password. | | `--vectordb-password` | `ILAB_VECTORDB_PASSWORD` |
> | Name of the embedding model. | `sentence-transformers/all-minilm-l6-v2` | `--model` | `ILAB_EMBEDDING_MODEL_NAME` |
> | Token to download private models. | | `--model-token` | `ILAB_EMBEDDING_MODEL_TOKEN` |
---
This would introduce model downloading logic in a new place, while `ilab model download` already exists.
---
Ah, good point! I hadn't thought of that. Using the existing model download sounds better to me.
---
Sounds good to me.
You mean that the user should first run `ilab model download -rp sentence-transformers/all-minilm-l6-v2`, and the RAG pipelines (both) would validate that a local download exists for that model before proceeding?
Or would the RAG pipelines use the ilab download function to download the model locally?
---
Perhaps the most convenient and flexible solution would be to package in a default model, but also allow downloading a different model via `ilab model download ...` and configuring it to be used.
---
So it's a user responsibility to download it first 👍
I'll add a note and also introduce a `--model-dir` option to supply a configurable location for looking for downloaded models.
---
docs/cli/ilab-rag-retrieval.md (outdated):
> | Minimum number of units per split. | `0` | `--splitter-split-threshold` | `ILAB_SPLITTER_SPLIT_THRESHOLD` |
> | Vector DB implementation, one of: `milvuslite`, **TBD** | `milvuslite` | `--vectordb-type` | `ILAB_VECTORDB_TYPE` |
> | Vector DB service URI. | `./rag-output.db` | `--vectordb-uri` | `ILAB_VECTORDB_URI` |
> | Vector DB connection token. | | `--vectordb-token` | `ILAB_VECTORDB_TOKEN` |
---
How does this differ from username/password?
---
You are right: I took `token` from other document store examples in Haystack, but I agree it's better to focus on Milvus only for now.
WDYT about dropping the authentication part for now and then reviewing the decision once we define the supported stores and verify the available authentication methods?
Milvus seems to offer authentication via username and password, but other stores have a different authn method, or no authn at all (e.g. Chroma).
If we want to be more generic, what about a single `--vectordb-authentication` option where we can put a comma-separated list of the store-specific settings?
E.g., for Milvus it would be:
ilab data process --rag --vectordb-type milvus --vectordb-uri 'http://localhost:1234' \
--vectordb-authentication 'username=$MILVUS_USER,password=$MILVUS_PASSWORD'
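If that option were adopted, its value could be parsed with a small helper along these lines. Keep in mind `--vectordb-authentication` is only a proposal in this thread, not an implemented flag, so both the option name and the `key=value,key=value` format here are illustrative.

```python
from typing import Dict


def parse_authn_settings(raw: str) -> Dict[str, str]:
    """Parse a comma-separated list of store-specific settings, e.g.
    'username=alice,password=s3cret', into a dict.

    NOTE: illustrates the hypothetical --vectordb-authentication proposal;
    real stores would then validate which keys they accept."""
    settings: Dict[str, str] = {}
    for item in raw.split(","):
        if not item.strip():
            continue  # tolerate empty segments / empty input
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        settings[key.strip()] = value.strip()
    return settings
```

A downside of the packed format is that values containing commas or `=` would need escaping, which separate `--vectordb-username`/`--vectordb-password` flags avoid.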
---
> WDYT about dropping the authentication part for now and then review the decision once we define the supported stores and verify the available authentication methods?
Shipping something minimal that works and expanding on it in a feedback cycle sounds good to me!
---
docs/cli/ilab-rag-retrieval.md (outdated):
> | Maximum number of units in each split. | `200` | `--splitter-split-length` | `ILAB_SPLITTER_SPLIT_LENGTH` |
> | Number of overlapping units for each split. | `0` | `--splitter-split-overlap` | `ILAB_SPLITTER_SPLIT_OVERLAP` |
> | Minimum number of units per split. | `0` | `--splitter-split-threshold` | `ILAB_SPLITTER_SPLIT_THRESHOLD` |
> | Vector DB implementation, one of: `milvuslite`, **TBD** | `milvuslite` | `--vectordb-type` | `ILAB_VECTORDB_TYPE` |
---
Many (all?) databases allow multiple document collections. That should probably be a parameter as well.
---
Adding one more option:

| Vector DB collection name. | `IlabEmbeddings` | `--vectordb-collection-name` | `ILAB_VECTORDB_COLLECTION_NAME` |
---
docs/cli/ilab-rag-retrieval.md (outdated):
> |-------------------|-------------|---------------|----------|----------------------|
> | chat.rag.enabled | Enable or disable the RAG pipeline. | `false` | `--rag` (boolean) | `ILAB_CHAT_RAG_ENABLED` |
> | chat.rag.retriever.top_k | The maximum number of documents to retrieve. | `10` | `--retriever-top-k` | `ILAB_CHAT_RAG_RETRIEVER_TOP_K` |
> | chat.rag.prompt | Prompt template for RAG-based queries. | Examples below | `--rag-prompt` | `ILAB_CHAT_RAG_PROMPT` |
---
If there is a way to unify prompt templates throughout InstructLab into one place rather than adding another place to store them, that would probably be ideal.
- Would this deserve its own ADR?
- Is it something we should let the user configure, or are you just thinking of where to place a hardcoded prompt?
Yes, I will add this to the list of planned ADRs.
If there is a way to unify prompt templates throughout InstructLab into one place rather than adding another place to store them, that would probably be ideal.
+1
There are many design decisions being made here that appear to be in a bit of a vacuum and so increase complexity of product usage and configuration while there are opportunities to streamline it instead. @jwm4 I think we need to dedicate a significant effort to work through these as a group. I left comments on what I saw in a first pass.
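One way to unify prompt templates in a single place, as suggested in this thread, is a small registry keyed by template name. This is only a sketch of the idea; the registry, template names, and placeholder fields below are all hypothetical, not part of any existing InstructLab API.

```python
from string import Template

# Hypothetical central registry so RAG and other chat modes read
# templates from one place instead of scattering them across configs.
PROMPT_TEMPLATES = {
    "rag_default": Template(
        "Given the following context, answer the question.\n"
        "Context:\n$context\n\n"
        "Question: $question\n"
        "Answer:"
    ),
}

def render_prompt(name: str, **fields: str) -> str:
    """Look up a template by name and fill in its placeholders."""
    return PROMPT_TEMPLATES[name].substitute(**fields)

prompt = render_prompt(
    "rag_default",
    context="Milvus is a vector database.",
    question="What is Milvus?",
)
```

A registry like this would also give the `--rag-prompt` flag a natural meaning: the name of (or an override for) an entry in the shared table.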
This is starting to look good to me. I still have some minor disagreements about technical details (see comments below) but mostly this is feeling like it is on the right track.
docs/cli/ilab-rag-retrieval.md
Outdated
| Vector DB connection token. | | `--vectordb-token` | `ILAB_VECTORDB_TOKEN` |
| Vector DB connection username. | | `--vectordb-username` | `ILAB_VECTORDB_USERNAME` |
| Vector DB connection password. | | `--vectordb-password` | `ILAB_VECTORDB_PASSWORD` |
| Name of the embedding model. | `sentence-transformers/all-minilm-l6-v2` | `--model` | `ILAB_EMBEDDING_MODEL_NAME` |
I am planning to do a separate ADR for the default embedding model. For the purpose of this document, would it be OK to just replace `sentence-transformers/all-minilm-l6-v2` with TBD?
docs/cli/ilab-rag-retrieval.md
Outdated
| How to split the documents. One of `page`, `passage`, `sentence`, `word`, `line` | `word` | `--splitter-split-by` | `ILAB_SPLITTER_SPLIT_BY` |
| Maximum number of units in each split. | `200` | `--splitter-split-length` | `ILAB_SPLITTER_SPLIT_LENGTH` |
| Number of overlapping units for each split. | `0` | `--splitter-split-overlap` | `ILAB_SPLITTER_SPLIT_OVERLAP` |
| Minimum number of units per split. | `0` | `--splitter-split-threshold` | `ILAB_SPLITTER_SPLIT_THRESHOLD` |
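For readers unfamiliar with these knobs, a minimal word-based sketch of how `split-length`, `split-overlap`, and `split-threshold` typically interact is shown below. This is illustrative only; it is not the Haystack or Docling implementation, and real splitters also handle the other `split-by` units.

```python
def split_words(text, split_length=200, split_overlap=0, split_threshold=0):
    """Split `text` into chunks of at most `split_length` words, with
    consecutive chunks sharing `split_overlap` words; a trailing chunk
    shorter than `split_threshold` words is merged into the previous one."""
    assert 0 <= split_overlap < split_length
    words = text.split()
    step = split_length - split_overlap
    chunks = [words[i:i + split_length] for i in range(0, len(words), step)]
    if len(chunks) > 1 and len(chunks[-1]) < split_threshold:
        chunks[-2].extend(chunks[-1])
        chunks.pop()
    return [" ".join(chunk) for chunk in chunks]

# With the table's defaults (length 200, overlap 0, threshold 0) this
# degenerates to plain fixed-size word chunking.
```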
I am not sure all these splitter options make sense in the context of the Docling hierarchical splitting capability. Also, regardless of the underlying technology, the underlying embedding models only allow a certain number of tokens to encode. So if you let the users split on chunks of 2 pages (for example), what do we do when we need to create the vectors? Just take the first K tokens of each chunk? It feels like we're giving users too much freedom to do things that don't make sense here without also making it clear what the consequences of doing so would be. We should discuss this topic more.
As replied before, while we wait for integrating the docling chunkers, we can drop these settings and use some opinionated defaults for now.
The other question that @ilan-pinto raised around this topic is whether we really need any chunking at all, since the SDG formatting already chunks the original user documents into . Should we review this step?
docs/cli/ilab-rag-retrieval.md
Outdated
```

### 2.7 References

* [Haystack-DocumentSplitter](https://github.com/deepset-ai/haystack/blob/f0c3692cf2a86c69de8738d53af925500e8a5126/haystack/components/preprocessors/document_splitter.py#L55)
I think probably the Haystack splitter will wind up getting dropped from the solution in favor of something Docling-based.
Adding a note that this is a temporary (non configurable) option.
Overall I have some concerns about this approach, especially in light of the current changes happening in SDG. I think a lot of this approach is based on where SDG was and not where SDG is going, but this work wouldn't land in SDG until after we've reconciled with the research changes, have the ability to create custom Pipeline Blocks, expect users to create and execute their own Pipelines, and split out data preprocessing from data generation from data postprocessing.
I think the entire approach to generating vector embeddings and populating those in a vector database could probably be handled with the existing (post-reconcile with Research fork) SDG code along with a custom Pipeline Block implementation or two. We don't document how to do this yet, as the code is just landing, but that's our designated extension mechanism to do any random thing you want during a data generation pipeline.
docs/cli/ilab-rag-retrieval.md
Outdated
The rationale behind this choice is that the `data process` command can support future workflows, making its
introduction an investment to anticipate other needs.

Since the RAG behavior is the only functionality of this new command, executions without the `--rag` option will result
Just a note that `ilab data process` does not exist yet, so I'd be careful about designing things that layer on top of it until we see when/if that gets implemented in its current form.
docs/cli/ilab-rag-retrieval.md
Outdated
#### Assumptions
The provided documents must be in JSON format according to the InstructLab schema: this is the schema generated
when transforming knowledge documents with the `ilab data generate` command (see
`ilab data generate` does not output documents in an InstructLab schema, at least not as referenced here. Even once we separate out preprocessing from generation from postprocessing in `ilab data` commands, we may keep `ilab data generate` as it is today for backwards compatibility. I don't think that's decided yet.
- Agree on the InstructLab schema comment, but these pre-processed artifacts are identified in this way in William's document "WC - RAG Artifacts with RHEL AI (PM perspective)" (I can share a link in DM if needed)
- Of course `ilab data generate` can remain as it is today; this is outside the purpose of this design document
docs/cli/ilab-rag-retrieval.md
Outdated
transformation, leveraging the `instructlab-sdg` modules.

### Why We Need It
This command streamlines the `ilab data generate` pipeline and eliminates the requirement to define a `qna` document,
I'm not aware of any intention to remove the need for qna.yaml from `ilab data generate`. My latest understanding is that we hope to end up with separate commands for preprocessing qna.yaml into data samples, running a generation pipeline, and post-processing generated results into final mixed datasets. `ilab data generate` encompasses all 3 of those stages today, and may continue to. However, we do plan to have some step that starts at input data samples and runs generation pipelines, without a qna.yaml required. It will likely not be called `ilab data generate` but that's still undecided.
Agree, and there was no intention to remove the need for qna.yaml from `ilab data generate`.
docs/cli/ilab-rag-retrieval.md
Outdated
)
```

### 2.3 Embedding Ingestion Pipeline Options
I wonder if instead of wiring special support for all this into ilab, we should just consider generating and inserting vector embeddings stages in a data generation pipeline? We're moving to a model where users can supply their own custom pipeline and create their own custom pipeline blocks. So, we should ship (or the user could define) a RAG pipeline that handles turning the data samples into embeddings and storing them in a vector database all as part of our existing pipeline flows, without any code changes in SDG itself?
Is there any design document that you can share about this initiative?
What would the purpose of indexing generated data be?

I don't mean indexing generated data - I mean using our pipelines concept to run a RAG pipeline that generates embeddings, populates a vector db, whatever you need - as opposed to calling an LLM for inference and data generation. Pipelines take an input dataset and have a sequence of Blocks that get executed in step, with the first block getting each input sample as input; it transforms those samples in some way, outputs samples, and the next block gets those new samples as its input. Today we mostly use this for transforming data in datasets, building prompts and calling LLMs for inference, but you could also use this concept to tokenize text and insert into a vector db. A RAG pipeline just becomes another set of pipelines shipped with the product versus code custom and specific to the RAG use-case, other than perhaps some RAG-specific Blocks we'd like to ship in the product itself. It may be hard to understand how this all works without understanding the code of SDG including the upcoming changes to it, but we should at least try to use the designed SDG extension points of custom Blocks for part of this I think.
Ah, you mean creating a new pipeline for this that has nothing to do with SDG. That sounds like it might be a very flexible solution, but at the cost of understandability etc.
I very much agree with this conclusion. It also has extraordinary consequences for our enterprise customers at the RHOAI scale.
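To make the pipeline-Block idea discussed above concrete, here is a minimal sketch of a Block that embeds each input sample and writes it to a vector store. Everything here is a hypothetical stand-in: the `generate` method name, the store interface, and the toy hash-based embedder are assumptions for illustration, not the actual `instructlab-sdg` Block API.

```python
import hashlib

class InMemoryStore:
    """Toy stand-in for a vector database such as Milvus Lite."""
    def __init__(self):
        self.rows = {}

    def insert(self, id_, vector, text):
        self.rows[id_] = (vector, text)

class EmbedAndStoreBlock:
    """Hypothetical pipeline Block: samples in, samples out, with a
    side effect of persisting one embedding per sample."""
    def __init__(self, store):
        self.store = store

    def _embed(self, text: str) -> list[float]:
        # Placeholder embedding: digest bytes scaled to [0, 1); a real
        # Block would call an embedding model here.
        digest = hashlib.sha256(text.encode()).digest()[:8]
        return [b / 256 for b in digest]

    def generate(self, samples: list[dict]) -> list[dict]:
        for i, sample in enumerate(samples):
            self.store.insert(i, self._embed(sample["text"]), sample["text"])
        return samples  # pass samples through unchanged for the next Block
```

The point is structural: ingestion becomes just another pipeline stage rather than RAG-specific plumbing in the CLI.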
@jwm4 @anastasds integrated changes from yesterday's meeting. Should we move it from Draft to Ready?
docs/cli/ilab-rag-retrieval.md
Outdated
The proposal is to add an `ingest` sub-command to the `data` command group:
```
ilab data ingest /path/to/docs/folder
```
It might be better to have a default folder path, or let the user specify one?
This part has been just updated. So we can have 2 ways to run ingestion:

- from documents processed by `ilab data generate`, e.g. from the latest `.../datasets/documents-ABC/docling-artifacts` folder. In this case there is no need to specify any path
- from user documents processed with `ilab data process` with no taxonomy definitions. In this case the path is specified by the user in both commands
docs/cli/ilab-rag-retrieval.md
Outdated
### 1.2 Model Training Path
This flow is designed for users who aim to train their own models and leverage the source documents that support knowledge submissions to enhance the chat context:
![model-training](../images/rag-model-training.png)
I like these images as-is, but they do appear to violate the official dev-docs guidelines on images as specified here. My inclination is to leave them as is since they look good, but if the oversight committee decides to be very strict about this guideline then we might need to redo them.
Images were actually generated from Excalidraw, one of the recommended tools, and I also added a link to edit them for maintenance and sharing purposes: it's in another section below this paragraph, I can move it up if needed.
Could you clarify what violation you see in the guidelines? (I had already read them before; that's why I used this tool, BTW)
@jwm4 @anastasds please review the "Options to Rebuild Excalidraw Diagrams:" section at the top of the document.
Signed-off-by: Daniele Martinoli <[email protected]>
…review meeting Signed-off-by: Daniele Martinoli <[email protected]>
Force-pushed from 9f0d0bc to f292b33
ilab data process --output /path/to/processed/folder
```

For the Plag-and-Play RAG path:
"Plag" is misspelled. It should be "Plug".
**Note**: documents are processed using `instructlab-sdg` package and are defined using the docling v1 schema.

### 1.3 Tanomony Path (no Training)
"Tanomony" should be "Taxonomy"
Minor spelling issues
```

### 2.8 RAG Chat Options
As we stated in [2.1 Working Assumptions](#21-working-assumption), we will introduce new configuration options for the spceific `chat` command,
"spceific" should be "specific"
| chat.rag.enabled | Enable or disable the RAG pipeline. | `false` | `--rag` (boolean)| `ILAB_RAG` |
| chat.rag.retriever.top_k | The maximum number of documents to retrieve. | `10` | `--retriever-top-k` | `ILAB_RETRIEVER_TOP_K` |
| | Document store implementation, one of: `milvuslite`, **TBD** | `milvuslite` | `--document-store-type` | `ILAB_DOCUMENT_STORE_TYPE` |
| | Document storeservice URI. | `./embeddings.db` | `--document-store-uri` | `ILAB_DOCUMENT_STORE_URI` |
Should "storeservice" be "store service"? Or is this one word for a reason?
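To illustrate what `--retriever-top-k` controls, here is a minimal cosine-similarity retriever over an in-memory list of (vector, text) pairs. This is a sketch only: real document stores such as Milvus Lite perform this search server-side, and the function names here are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query_vec, store, top_k=10):
    """Return the `top_k` (score, text) pairs most similar to the query;
    `store` is a list of (vector, text) pairs."""
    scored = [(cosine(query_vec, vec), text) for vec, text in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

The retrieved texts would then be interpolated into the RAG prompt template before the chat model is called.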
Introducing `ilab` command changes to support the RAG ingestion and chat pipelines: