Reconcile core data generation features with latest research advances #409
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
The fact that the e2e-small test passed even though I haven't actually converted any of our default config prompt templates to Jinja syntax yet is concerning, as that means the test is extremely loose in what it considers success. We wouldn't have actually included any of the user's questions or answers in the skills or knowledge prompts we sent to the model; instead, it would have all had placeholder Python string-format tokens in it.
Force-pushed b08ac2d to 80b4bbf (Compare)
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
e2e workflow succeeded on this PR: View run, congrats!
Ok, I believe this has enough of the research code ported over, new tests created, and existing tests passing that it's ready for review. I know it's a big effort to review this, but it was also a big lift to get the core improvements from research working in our codebase.
This does change some of our public API, although at this time there are no known users of the breaking changes there - basically removal of |
I don't see any problems with any of these changes, but I am not really an expert (and don't have write access, so my approval doesn't count for much). I would note that the research code base also has a very nice README file with a lot of useful information. I would like that merged in too with edits as needed to reflect any differences between that code and the open source code. However, this PR is already plenty big, so I would recommend a separate PR for the README merge.
Thanks for this massive lift!
Left some minor comments; besides that, everything else looks good!
I agree we should pull in those changes - tracked as #428.
Thanks @bbrowning for a great PR! LGTM 🚢
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed f6b9ffe to 5d520d8 (Compare)
While this was technically part of our public Python API, it appears to be entirely unused. Let's pull it out now to make syncing with the latest research advancements easier.

Signed-off-by: Ben Browning <[email protected]>
Force-pushed 5d520d8 to 1b85ebd (Compare)
Rebased on top of latest main since CI is now fixed there. I'll let CI chew on things, and if it's good and there are no more comments, I'm going to squash a lot of these into a handful of fewer commits, add the co-authorship metadata for the upstream researchers who did much of the initial code, and get this merged. I'm not planning any functional changes to this PR from this point forward unless more review notes come up.
This stubs in support for Jinja templates in the LLMBlock prompt templates, opening us up to more expressive prompts and handling things like loops that take a variable number of input elements when rendering templates.

NOTE: This is a backwards-incompatible change in prompt templates. Any users that had custom pipelines specified will need to update their template variables to look like `{{variable}}` instead of `{variable}` as a result of this change.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
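The syntax migration the NOTE above describes can be sketched in a few lines; the variable name `question` is illustrative here, not taken from the actual shipped configs:

```python
# Old vs. new placeholder syntax for LLMBlock prompt templates.
# The variable name "question" is a made-up example.
from jinja2 import Template

# Old style: Python str.format placeholders -- {variable}
old_style = "Answer the following question: {question}"
rendered_old = old_style.format(question="What is synthetic data?")

# New style: Jinja placeholders -- {{variable}}
new_style = Template("Answer the following question: {{ question }}")
rendered_new = new_style.render(question="What is synthetic data?")

# Both render to the same string; only the template syntax changed.
```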
The Block and Prompt registries are how we keep track of what our supported Block types are and which Prompts map to which teacher models.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
This brings in changes to move our model prompt templates to Jinja templates and the HuggingFace messages formats, used by their chat templates.

Signed-off-by: Ben Browning <[email protected]>
These new blocks don't do anything yet, but I'm stubbing them into the codebase now and will continue working on figuring out what they're supposed to do and wiring things up with tests.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
In addition to updating the knowledge configs to use Jinja templates, this adds additional tests to validate that we are using Jinja templates instead of Python string formats. That also required tightening up our usage of Jinja `Template` to always prefer `StrictUndefined` behavior everywhere we use it.

Signed-off-by: Ben Browning <[email protected]>
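The `StrictUndefined` tightening matters because Jinja's default behavior silently renders a missing variable as an empty string, which is the same loose failure mode that let the earlier e2e test pass. A minimal illustration:

```python
# Default Undefined silently drops missing variables; StrictUndefined
# raises instead, so a template with an unfilled variable fails loudly.
from jinja2 import StrictUndefined, Template
from jinja2.exceptions import UndefinedError

# "question" is never supplied, but rendering still succeeds by default.
lenient = Template("Question: {{ question }}").render()

try:
    Template("Question: {{ question }}", undefined=StrictUndefined).render()
    strict_raised = False
except UndefinedError:
    strict_raised = True
```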
This also makes the test running `Block._validate` on all our shipped configs a bit more generic so that it can cover all skill and knowledge yaml files without having to keep a separate list of config files to test.

Signed-off-by: Ben Browning <[email protected]>
This gets rid of the hardcoded block types dict and drives everything off the BlockRegistry. This means I also added a functional test showing how users can create and register their own Block implementations and use those in their pipeline config files - see `tests/testdata/custom_block_pipeline.yaml` and `tests/testdata/custom_block.py` for those examples.

Signed-off-by: Ben Browning <[email protected]>
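A decorator-based registry along these lines would support the pattern described; this is an illustrative sketch, not the actual instructlab/sdg implementation - see `tests/testdata/custom_block.py` in the PR for the real example:

```python
# Hypothetical sketch of registering a custom Block by name so that a
# pipeline config file can reference it, instead of a hardcoded dict.
class BlockRegistry:
    _registry: dict = {}

    @classmethod
    def register(cls, name):
        """Class decorator that records a Block implementation by name."""
        def decorator(block_cls):
            cls._registry[name] = block_cls
            return block_cls
        return decorator

    @classmethod
    def get(cls, name):
        """Look up a previously registered Block class."""
        return cls._registry[name]


@BlockRegistry.register("EchoBlock")
class EchoBlock:
    def generate(self, samples):
        # A trivial block that passes its input samples through unchanged.
        return samples


# A pipeline runner would resolve the name from the YAML config like this:
block = BlockRegistry.get("EchoBlock")()
result = block.generate([{"msg": "hi"}])
```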
This removes the mapping of model families in SDG itself between granite, mixtral, mistral, merlinite, etc. Instead, it uses the PromptRegistry to look up chat templates based on the model family given. And, if no model family is given, it still falls back to doing a best-guess based on the file path of the selected teacher model. A simple test was added to demonstrate how to register and use custom chat templates for generating prompts via the PromptRegistry.

Signed-off-by: Ben Browning <[email protected]>
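The lookup-with-fallback behavior can be sketched as follows; the registry contents, template names, and function name here are hypothetical, not the actual PromptRegistry API:

```python
# Illustrative only: explicit model family wins; otherwise best-guess the
# family from the teacher model's file path.
PROMPT_REGISTRY = {
    "granite": "granite-chat-template",
    "mixtral": "mixtral-chat-template",
}

def chat_template_for(model_path, model_family=None):
    if model_family:
        return PROMPT_REGISTRY[model_family]
    # Fallback: scan the path for a known family name.
    for family, template in PROMPT_REGISTRY.items():
        if family in model_path.lower():
            return template
    raise ValueError(f"cannot guess model family from {model_path}")

guessed = chat_template_for("/models/granite-7b-lab.gguf")
```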
This adds a new Block type - `IterBlock` - that calls another block N times for a set of given input samples. Every iteration through the loop, the samples returned from the child block's `generate` call get added to the list of samples produced from this block. So, if you use an `IterBlock` to call an `LLMBlock` 5 times, you'll get 5 samples generated (and 5 calls to the LLM) for every sample in the source dataset. The output dataset will contain all 5 generated samples resulting from each 1 input sample.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
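The fan-out semantics can be illustrated with a toy stand-in; this is a sketch of the behavior described, not the real `IterBlock` implementation:

```python
# Toy child block that tags each output sample with the iteration count,
# standing in for an LLMBlock that would call a model.
class FakeChildBlock:
    def __init__(self):
        self.calls = 0

    def generate(self, samples):
        self.calls += 1
        return [dict(s, iteration=self.calls) for s in samples]


def iter_block(child, samples, num_iters):
    """Call the child block num_iters times, accumulating all outputs."""
    out = []
    for _ in range(num_iters):
        out.extend(child.generate(samples))
    return out


child = FakeChildBlock()
# One input sample, five iterations -> five output samples, five calls.
results = iter_block(child, [{"q": "example"}], num_iters=5)
```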
Asserts outside of tests should only be used for programming errors in our own code and not to validate user-facing things.

Signed-off-by: Ben Browning <[email protected]>
Force-pushed 1b85ebd to db3a1ad (Compare)
This PR brings in some of the latest advancements prototyped by our research team into the broader codebase for everyone's use. It's a work in progress, but also something that others may wish to follow, comment on, and contribute to as the work gets done. There are still outstanding features not yet added - some new pipeline block types, an improved skills pipeline config, LLMLogProbBlock and LLMMessagesBlock are just stubs, etc.

This may also cause the deprecation or removal of some existing functionality. That is not entirely clear yet, but will become apparent as work progresses.
See docs/upgrading_from_v0.6.x_to_v0.7.x.md for more details, although that too is still only stubbed out.