framework to be used by RAG for pipeline #164

cooktheryan · 2024-12-09T20:58:34Z

No description provided.

dhellmann · 2024-12-09T21:17:44Z

docs/retrieval-augmented-generation/rag-framework.md

+
+## Risks
+
+N/A


Bringing in a new dependency always introduces risks related to building that package (and its dependencies) downstream, as well as license compliance, long term supportability, etc. Those aren't unique to this package, but it would be good to see them addressed here anyway to show that a thorough vetting was done of the package, and its dependencies.

For example, is any part of the haystack-ai dependency list GPL or AGPL? Is its license compatible with the Instructlab license?

Does it, and all of its dependencies, appear to be actively maintained by more than one maintainer?

Are those maintainers responsive to requests to review PRs, especially for CVEs?

Is the project following good practices for releasing, including tagging releases, publishing source packages, and using trusted publishing automation?

@cooktheryan this type of analysis is fairly routine for OSPO. I'd recommend reaching out to @lhawthorn

Thanks for your comments.

None of Haystack's dependencies (jinja2, lazy-imports, more-itertools, networkx, numpy, openai, pandas, posthog, python-dateutil, pyyaml, requests, tenacity, tqdm, typing-extensions) are GPL/AGPL. All of them are Apache, MIT, BSD or PSFL.

Haystack is licensed under Apache 2.0. What else makes it compatible or non-compatible with InstructLab?

Haystack is very actively maintained and supported. As for active maintenance of the dependencies, I didn't check them all but most are very well-known and widely used libraries. Moreover, many of them are also on the dependency list of the alternatives to Haystack.

Haystack releases tagged versions. They do not publish to Github Packages.
Could you please elaborate what is considered "trusted publishing automation" and whether there are any additional releasing practices we need to verify?

That would be good information to include in the body of the document.

For trusted publisher, I mean https://docs.pypi.org/trusted-publishers/. We've recently seen an instance of a package in the AI ecosystem where that trusted publishing was set up incorrectly, which led to a third-party having the ability to not only trigger releases but to inject malicious code into multiple versions of source packages. ultralytics/ultralytics#18027 for reference.

Downstream, we're building from source packages, but if we can't trust those source packages we would not want to consume the dependency.

Got it, thanks for the clarification. It looks like Haystack indeed satisfies the requirements for trusted publishing with this Github Actions workflow.

We'll make sure to include the details from this thread in the document.

Signed-off-by: Ryan Cook <[email protected]>

erikerlandson · 2024-12-09T23:33:01Z

There is a lot of emphasis given to haystack being focused on RAG. Does it also have a good story for supporting more generalized agentic patterns?

ilya-kolchinsky · 2024-12-10T09:19:15Z

There is a lot of emphasis given to haystack being focused on RAG. Does it also have a good story for supporting more generalized agentic patterns?

@erikerlandson, well, since Haystack supports building arbitrary graphs of operators (similarly to Langgraph), it can potentially implement many of the agentic patterns. However, as you say, the main focus is on RAG. Since the decision here solely considers our RAG implementation, features and capabilities not directly related to RAG should mostly be out of scope.

erikerlandson · 2024-12-10T16:36:28Z

I believe that agentic should not be totally out of scope; mainly because soon enough the focus is going to shift to agentic, and it would be undesirable (IMO) to be in a position of wanting to add a separate framework for agentic, or having to switch to a new framework altogether.

ilya-kolchinsky · 2024-12-10T18:48:29Z

@erikerlandson, I agree. Since Haystack provides the functionality to build arbitrarily complex pipelines, it is possible to extend our RAG architecture to introduce nodes for tools, reasoning and agentic actions.

Added details regarding Haystack licensing and maintenance. Signed-off-by: Ilya Kolchinsky <[email protected]>

jwm4

I have requested one more alternative to be added to the list that we considered -- I don't think that will change the outcome but I want it included for completeness.

I understand that the discussion of the dependency risk will be added to the risk section, which sounds good to me.

I am not seeing any other edits to this document that I would recommend. So I think if these changes are made, I will approve (unless something else arises before then).

docs/retrieval-augmented-generation/rag-framework.md

ilya-kolchinsky · 2024-12-11T20:35:46Z

@jwm4, sure, I will edit the document to include the alternative you proposed.
As for the dependencies, I have already added this earlier today. To the best of my understanding, there seem to be no problems with the dependencies, therefore I added this discussion to the 'Why' section rather than to the risks. Do you think it would be better to move it under 'Risks'?

jwm4 · 2024-12-11T21:59:47Z

Oh, I missed it under "Why". Sorry! I am fine with it there. I do think there is a tail risk that new dependencies get added that are more problematic. If that happens we have the following options:

Pin to the old version that doesn't have that dependency. That's often OK for a while, but eventually we're likely to run into updates that we need (e.g., critical fixes, compatibility with new vectordbs, etc.).
Fork the project to avoid the problematic dependencies.
Move off of Haystack completely.

Any of these could be viable depending on the circumstances. Furthermore, since Haystack hasn't added any dependencies that we'd consider very problematic so far, the likelihood that they will in the future seems minimal. I'd still like this noted under Risks though for reference if you don't mind.

Expanded risks and alternatives. Signed-off-by: Ilya Kolchinsky <[email protected]>

ilya-kolchinsky · 2024-12-12T15:40:18Z

Done - the document is now updated as we discussed.

jwm4

The concerns I have about this document are all addressed. FWIW, the conclusion that we will use Haystack is not the choice I would have made, and I think in the long run it is likely we will undo this choice and eliminate the dependency on Haystack. However, I don't think that will be particularly hard to do if/when there is a strong case for doing so, and I can't deny that for now having a framework to build on is making it quicker and easier to deliver a minimal viable solution. On the whole, I would say that the argument made in this document is strong and well-reasoned, so I am approving it despite my reservations about the decision.

dmartinol

lgtm

jwm4 · 2024-12-19T22:02:52Z

@instructlab/oversight-committee -- Can someone take a look at this and see if it is ready to approve and merge? If it is not, can we get some guidance about what would be needed to get it approved?

jjasghar · 2024-12-20T20:54:26Z

As an oversight committee member I approve this. Please fix the markdown/spellcheck errors and then we can merge this as a the way forward.

Signed-off-by: Ryan Cook <[email protected]>

mairin

Requesting some changes, some minor spelling / grammar, some just pulling some items from the PR conversation into the doc.

docs/retrieval-augmented-generation/rag-framework.md

mairin

LGTM, thanks!

Signed-off-by: Ryan Cook <[email protected]>

cooktheryan force-pushed the framework branch from 2f9c442 to b617893 Compare December 9, 2024 21:01

dhellmann reviewed Dec 9, 2024

View reviewed changes

framework to be used by RAG for pipeline

fdf2852

Signed-off-by: Ryan Cook <[email protected]>

cooktheryan force-pushed the framework branch from b617893 to fdf2852 Compare December 9, 2024 21:19

Update rag-framework.md

e35d2d9

Added details regarding Haystack licensing and maintenance. Signed-off-by: Ilya Kolchinsky <[email protected]>

jwm4 requested changes Dec 11, 2024

View reviewed changes

docs/retrieval-augmented-generation/rag-framework.md Show resolved Hide resolved

Update rag-framework.md

73469dc

Expanded risks and alternatives. Signed-off-by: Ilya Kolchinsky <[email protected]>

jwm4 approved these changes Dec 16, 2024

View reviewed changes

dmartinol approved these changes Dec 16, 2024

View reviewed changes

dmartinol mentioned this pull request Dec 16, 2024

RAG ingestion and chat pipelines #161

Merged

jwm4 mentioned this pull request Dec 19, 2024

ADR: Vector store for RAG #168

Merged

cooktheryan added 3 commits January 2, 2025 09:42

linting and spellcheck ignores

eee82de

Signed-off-by: Ryan Cook <[email protected]>

Merge branch 'main' into framework

1a8cf9e

Signed-off-by: Ryan Cook <[email protected]>

spelling

304084b

Signed-off-by: Ryan Cook <[email protected]>

mairin requested changes Jan 2, 2025

View reviewed changes

cooktheryan requested a review from mairin January 2, 2025 18:07

cooktheryan force-pushed the framework branch from a969a0a to 409bd1d Compare January 2, 2025 18:10

mairin approved these changes Jan 2, 2025

View reviewed changes

suggestions based on feedback

cd22a8a

Signed-off-by: Ryan Cook <[email protected]>

cooktheryan force-pushed the framework branch from c8afd92 to cd22a8a Compare January 2, 2025 18:15

mairin merged commit 78a5891 into instructlab:main Jan 2, 2025
4 checks passed


		## Risks

		N/A

framework to be used by RAG for pipeline #164

framework to be used by RAG for pipeline #164

Uh oh!

Conversation

cooktheryan commented Dec 9, 2024

Uh oh!

dhellmann Dec 9, 2024

Choose a reason for hiding this comment

Uh oh!

wattsteve Dec 9, 2024

Choose a reason for hiding this comment

Uh oh!

ilya-kolchinsky Dec 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dhellmann Dec 10, 2024

Choose a reason for hiding this comment

Uh oh!

ilya-kolchinsky Dec 10, 2024

Choose a reason for hiding this comment

Uh oh!

erikerlandson commented Dec 9, 2024

Uh oh!

ilya-kolchinsky commented Dec 10, 2024

Uh oh!

erikerlandson commented Dec 10, 2024

Uh oh!

ilya-kolchinsky commented Dec 10, 2024

Uh oh!

jwm4 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ilya-kolchinsky commented Dec 11, 2024

Uh oh!

jwm4 commented Dec 11, 2024

Uh oh!

ilya-kolchinsky commented Dec 12, 2024

Uh oh!

jwm4 left a comment

Choose a reason for hiding this comment

Uh oh!

dmartinol left a comment

Choose a reason for hiding this comment

Uh oh!

jwm4 commented Dec 19, 2024

Uh oh!

jjasghar commented Dec 20, 2024

Uh oh!

mairin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mairin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ilya-kolchinsky Dec 10, 2024 •

edited

Loading