Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconcile core data generation features with latest research advances #409

Merged
merged 12 commits into from
Dec 10, 2024

Conversation

bbrowning
Copy link
Contributor

This PR brings in some of the latest advancements prototyped by our research team into the broader codebase for everyone's use. It's a work-in-progress, but also something that others may wish to follow, comment on, and contribute to as the work gets done. There are still outstanding features not yet added to this - some new pipeline block types, an improved skills pipeline config, LLMLogProbBlock and LLMMessagesBlock are just stubs, etc.

And, this may cause the deprecation / removal of some existing functionality. That is not entirely clear yet, but will become apparent as work progresses.

See docs/upgrading_from_v0.6.x_to_v0.7.x.md for more details, although that too is still only stubbed out.

@mergify mergify bot added CI/CD Affects CI/CD configuration documentation Improvements or additions to documentation testing Relates to testing labels Nov 26, 2024
Copy link
Contributor

mergify bot commented Nov 26, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @bbrowning please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added needs-rebase dependencies Pull requests that update a dependency file ci-failure labels Nov 26, 2024
@bbrowning
Copy link
Contributor Author

The fact that the e2e-small test passed even though I haven't actually converted any of our default config prompt templates to Jinja syntax yet is concerning, as that means the test is extremely loose in what it considers success. We wouldn't have actually included any of the user's question or answers in the skills or knowledge prompts we sent to the model, and instead it would have all had placeholder python String format tokens in it.

Copy link

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

Copy link

e2e workflow succeeded on this PR: View run, congrats!

@bbrowning
Copy link
Contributor Author

bbrowning commented Nov 27, 2024

Ok, I believe this has enough of the research code ported over, new tests created, and existing tests passing that it's ready for review. I know it's a big effort to review this, but it was also a big lift to get the core improvements from research working in our codebase.

BlockRegistry, PromptRegistry, IterBlock, and Jinja prompt templates should all be working. I created #413 and #414 separately to track finishing up LLMLogProbBlock and LLMMessagesBlock.

This does change some of our public API, although at this time there are no known users of the breaking changes there - basically removal of ImportBlock and reorganizing blocks under a instructlab.sdg.blocks package. The API used by the InstructLab CLI should be unchanged, and the e2e CI tests still passing without changes confirms this.

@bbrowning bbrowning marked this pull request as ready for review November 27, 2024 18:28
@mergify mergify bot added ci-failure and removed ci-failure labels Nov 27, 2024
@aakankshaduggal aakankshaduggal requested a review from a team December 2, 2024 20:19
@ktam3 ktam3 linked an issue Dec 2, 2024 that may be closed by this pull request
5 tasks
Copy link

@jwm4 jwm4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any problems with any of these changes, but I am not really an expert (and don't have write access, so my approval doesn't count for much). I would note that the research code base also has a very nice README file with a lot of useful information. I would like that merged in too with edits as needed to reflect any differences between that code and the open source code. However, this PR is already plenty big, so I would recommend a separate PR for the README merge.

Copy link
Member

@khaledsulayman khaledsulayman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for this massive lift!

left some minor comments, besides that everything else looks good!

src/instructlab/sdg/blocks/block.py Outdated Show resolved Hide resolved
src/instructlab/sdg/blocks/block.py Show resolved Hide resolved
src/instructlab/sdg/blocks/filterblock.py Show resolved Hide resolved
@aakankshaduggal aakankshaduggal requested a review from a team December 4, 2024 23:11
@bbrowning
Copy link
Contributor Author

I don't see any problems with any of these changes, but I am not really an expert (and don't have write access, so my approval doesn't count for much). I would note that the research code base also has a very nice README file with a lot of useful information. I would like that merged in too with edits as needed to reflect any differences between that code and the open source code. However, this PR is already plenty big, so I would recommend a separate PR for the README merge.

I agree we should pull in those changes - tracked as #428.

Copy link
Member

@aakankshaduggal aakankshaduggal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bbrowning for a great PR! LGTM 🚢

Copy link
Contributor

mergify bot commented Dec 6, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @bbrowning please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

While this was technically part of our public Python API, it appears
to be entirely unused. Let's pull it out now to make syncing with the
latest research advancements easier.

Signed-off-by: Ben Browning <[email protected]>
@bbrowning
Copy link
Contributor Author

Rebased on top of latest main since CI is now fixed there - will let CI chew on things and if it's good and no more comments, going to squash a lot of these into a handful of fewer commits, add the co-authorship metadata for the upstream researchers that did much of the initial code, and get this merged. I'm not planning any functional changes to this PR from this point forward unless more review notes come up.

bbrowning and others added 11 commits December 10, 2024 14:36
This stubs in support for Jinja templates in the LLMBlock prompt
templates, opening us up to more expressive prompts and handling things
like loops that take a variable number of input elements when
rendering templates.

NOTE: This is a backwards-incompatible change in prompt templates. Any
users that had custom pipelines specified will need to update their
template variables to look like `{{variable}}` instead of `{variable}`
as a result of this change.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
The Block and Prompt registries are how we keep track of what our
supported Block types are and which Prompts map to which teacher
models.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
This brings in changes to move our model prompt templates to Jinja
templates and the HuggingFace messages formats, used by their chat
templates.

Signed-off-by: Ben Browning <[email protected]>
These new blocks don't do anything yet, but stubbing them into the
codebase and will continue working on figuring out what they're
supposed to do and wiring things up with tests.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
In addition to updating the knowledge configs to use jinja templates,
this adds additional tests to validate that we are using jinja
templates instead of python string formats. That also required
tightening up our usage of jinja `Template` to always preferred
`StrictUndefined` behavior everywhere we use it.

Signed-off-by: Ben Browning <[email protected]>
This also makes the test running `Block._validate` on all our shipped
configs a bit more generic so that it can cover all skill and
knowledge yaml files without having to keep a separate list of config
files to test.

Signed-off-by: Ben Browning <[email protected]>
This gets rid of the hardcoded block types dict and drives everything
off the BlockRegistry. This means I also added a functional test
showing how users can create and register their own Block
implementations and use those in their pipeline config files - see
`tests/testdata/custom_block_pipeline.yaml` and
`tests/testdata/custom_block.py` for those examples.

Signed-off-by: Ben Browning <[email protected]>
This removes the mapping of model families in SDG itself between
granite, mixtral, mistral, merlinite, etc. Instead, it uses the
PromptRegistry to lookup chat templates based on the model family
given. And, if no model family is given, it still falls back to doing
a best-guess based on the file path of the selected teacher model.

A simple test was added to demonstrate how to register and use custom
chat templates for generating prompts via the PromptRegistry.

Signed-off-by: Ben Browning <[email protected]>
This adds a new Block type - `IterBlock` - that calls another block N
times for a set of given input samples. Every iteration through the
loop, the samples returned from the child block's `generate` call get
added to the list of samples produced from this block.

So, if you use an `IterBlock` to call an `LLMBlock` 5 times, you'll get
5 samples generated (and 5 calls to the LLM) for every sample in the
source dataset. The output dataset will contain all 5 generated samples
resulting from each 1 input sample.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
Asserts outside of tests should only be used for programming errors in
our own code and not to validate user-facing things.

Signed-off-by: Ben Browning <[email protected]>
@bbrowning bbrowning merged commit fd53dcd into instructlab:main Dec 10, 2024
24 checks passed
@bbrowning bbrowning deleted the research-sync branch December 10, 2024 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI/CD Affects CI/CD configuration dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation one-approval testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants