Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

framework to be used by RAG for pipeline #164

Merged
merged 7 commits into from
Jan 2, 2025

Conversation

cooktheryan
Copy link
Contributor

No description provided.


## Risks

N/A
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bringing in a new dependency always introduces risks related to building that package (and its dependencies) downstream, as well as license compliance, long term supportability, etc. Those aren't unique to this package, but it would be good to see them addressed here anyway to show that a thorough vetting was done of the package, and its dependencies.

For example, is any part of the haystack-ai dependency list GPL or AGPL? Is its license compatible with the Instructlab license?

Does it, and all of its dependencies, appear to be actively maintained by more than one maintainer?

Are those maintainers responsive to requests to review PRs, especially for CVEs?

Is the project following good practices for releasing, including tagging releases, publishing source packages, and using trusted publishing automation?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cooktheryan this type of analysis is fairly routine for OSPO. I'd recommend reaching out to @lhawthorn

Copy link
Contributor

@ilya-kolchinsky ilya-kolchinsky Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments.

None of Haystack's dependencies (jinja2, lazy-imports, more-itertools, networkx, numpy, openai, pandas, posthog, python-dateutil, pyyaml, requests, tenacity, tqdm, typing-extensions) are GPL/AGPL. All of them are Apache, MIT, BSD or PSFL.

Haystack is licensed under Apache 2.0. What else makes it compatible or non-compatible with InstructLab?

Haystack is very actively maintained and supported. As for active maintenance of the dependencies, I didn't check them all but most are very well-known and widely used libraries. Moreover, many of them are also on the dependency list of the alternatives to Haystack.

Haystack releases tagged versions. They do not publish to Github Packages.
Could you please elaborate what is considered "trusted publishing automation" and whether there are any additional releasing practices we need to verify?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be good information to include in the body of the document.

For trusted publisher, I mean https://docs.pypi.org/trusted-publishers/. We've recently seen an instance of a package in the AI ecosystem where that trusted publishing was set up incorrectly, which led to a third-party having the ability to not only trigger releases but to inject malicious code into multiple versions of source packages. ultralytics/ultralytics#18027 for reference.

Downstream, we're building from source packages, but if we can't trust those source packages we would not want to consume the dependency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for the clarification. It looks like Haystack indeed satisfies the requirements for trusted publishing with this Github Actions workflow.

We'll make sure to include the details from this thread in the document.

@erikerlandson
Copy link

There is a lot of emphasis given to haystack being focused on RAG. Does it also have a good story for supporting more generalized agentic patterns?

@ilya-kolchinsky
Copy link
Contributor

There is a lot of emphasis given to haystack being focused on RAG. Does it also have a good story for supporting more generalized agentic patterns?

@erikerlandson, well, since Haystack supports building arbitrary graphs of operators (similarly to Langgraph), it can potentially implement many of the agentic patterns. However, as you say, the main focus is on RAG. Since the decision here solely considers our RAG implementation, features and capabilities not directly related to RAG should mostly be out of scope.

@erikerlandson
Copy link

I believe that agentic should not be totally out of scope; mainly because soon enough the focus is going to shift to agentic, and it would be undesirable (IMO) to be in a position of wanting to add a separate framework for agentic, or having to switch to a new framework altogether.

@ilya-kolchinsky
Copy link
Contributor

@erikerlandson, I agree. Since Haystack provides the functionality to build arbitrarily complex pipelines, it is possible to extend our RAG architecture to introduce nodes for tools, reasoning and agentic actions.

Added details regarding Haystack licensing and maintenance.

Signed-off-by: Ilya Kolchinsky <[email protected]>
Copy link
Contributor

@jwm4 jwm4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have requested one more alternative to be added to the list that we considered -- I don't think that will change the outcome but I want it included for completeness.

I understand that the discussion of the dependency risk will be added to the risk section, which sounds good to me.

I am not seeing any other edits to this document that I would recommend. So I think if these changes are made, I will approve (unless something else arises before then).

@ilya-kolchinsky
Copy link
Contributor

@jwm4, sure, I will edit the document to include the alternative you proposed.
As for the dependencies, I have already added this earlier today. To the best of my understanding, there seem to be no problems with the dependencies, therefore I added this discussion to the 'Why' section rather than to the risks. Do you think it would be better to move it under 'Risks'?

@jwm4
Copy link
Contributor

jwm4 commented Dec 11, 2024

Oh, I missed it under "Why". Sorry! I am fine with it there. I do think there is a tail risk that new dependencies get added that are more problematic. If that happens we have the following options:

  1. Pin to the old version that doesn't have that dependency. That's often OK for a while, but eventually we're likely to run into updates that we need (e.g., critical fixes, compatibility with new vectordbs, etc.).
  2. Fork the project to avoid the problematic dependencies.
  3. Move off of Haystack completely.

Any of these could be viable depending on the circumstances. Furthermore, since Haystack hasn't added any dependencies that we'd consider very problematic so far, the likelihood that they will in the future seems minimal. I'd still like this noted under Risks though for reference if you don't mind.

Expanded risks and alternatives.

Signed-off-by: Ilya Kolchinsky <[email protected]>
@ilya-kolchinsky
Copy link
Contributor

Done - the document is now updated as we discussed.

Copy link
Contributor

@jwm4 jwm4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concerns I have about this document are all addressed. FWIW, the conclusion that we will use Haystack is not the choice I would have made, and I think in the long run it is likely we will undo this choice and eliminate the dependency on Haystack. However, I don't think that will be particularly hard to do if/when there is a strong case for doing so, and I can't deny that for now having a framework to build on is making it quicker and easier to deliver a minimal viable solution. On the whole, I would say that the argument made in this document is strong and well-reasoned, so I am approving it despite my reservations about the decision.

Copy link

@dmartinol dmartinol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jwm4
Copy link
Contributor

jwm4 commented Dec 19, 2024

@instructlab/oversight-committee -- Can someone take a look at this and see if it is ready to approve and merge? If it is not, can we get some guidance about what would be needed to get it approved?

@jwm4 jwm4 mentioned this pull request Dec 19, 2024
@jjasghar
Copy link
Member

As an oversight committee member I approve this. Please fix the markdown/spellcheck errors and then we can merge this as a the way forward.

Copy link
Member

@mairin mairin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting some changes, some minor spelling / grammar, some just pulling some items from the PR conversation into the doc.

Copy link
Member

@mairin mairin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@mairin mairin merged commit 78a5891 into instructlab:main Jan 2, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants