Reconcile core data generation features with latest research advances #409

bbrowning · 2024-11-26T01:43:03Z

This PR brings in some of the latest advancements prototyped by our research team into the broader codebase for everyone's use. It's a work-in-progress, but also something that others may wish to follow, comment on, and contribute to as the work gets done. There are still outstanding features not yet added to this - some new pipeline block types, an improved skills pipeline config, LLMLogProbBlock and LLMMessagesBlock are just stubs, etc.

And, this may cause the deprecation / removal of some existing functionality. That is not entirely clear yet, but will become apparent as work progresses.

See docs/upgrading_from_v0.6.x_to_v0.7.x.md for more details, although that too is still only stubbed out.

mergify · 2024-11-26T01:43:39Z

This pull request has merge conflicts that must be resolved before it can be
merged. @bbrowning please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

bbrowning · 2024-11-26T12:13:49Z

The fact that the e2e-small test passed even though I haven't actually converted any of our default config prompt templates to Jinja syntax yet is concerning, as that means the test is extremely loose in what it considers success. We wouldn't have actually included any of the user's question or answers in the skills or knowledge prompts we sent to the model, and instead it would have all had placeholder python String format tokens in it.

github-actions · 2024-11-26T21:35:10Z

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

github-actions · 2024-11-26T23:42:06Z

e2e workflow succeeded on this PR: View run, congrats!

bbrowning · 2024-11-27T18:28:03Z

Ok, I believe this has enough of the research code ported over, new tests created, and existing tests passing that it's ready for review. I know it's a big effort to review this, but it was also a big lift to get the core improvements from research working in our codebase.

BlockRegistry, PromptRegistry, IterBlock, and Jinja prompt templates should all be working. I created #413 and #414 separately to track finishing up LLMLogProbBlock and LLMMessagesBlock.

This does change some of our public API, although at this time there are no known users of the breaking changes there - basically removal of ImportBlock and reorganizing blocks under a instructlab.sdg.blocks package. The API used by the InstructLab CLI should be unchanged, and the e2e CI tests still passing without changes confirms this.

jwm4

I don't see any problems with any of these changes, but I am not really an expert (and don't have write access, so my approval doesn't count for much). I would note that the research code base also has a very nice README file with a lot of useful information. I would like that merged in too with edits as needed to reflect any differences between that code and the open source code. However, this PR is already plenty big, so I would recommend a separate PR for the README merge.

khaledsulayman

thanks for this massive lift!

left some minor comments, besides that everything else looks good!

src/instructlab/sdg/blocks/block.py

src/instructlab/sdg/blocks/filterblock.py

bbrowning · 2024-12-05T10:31:51Z

I don't see any problems with any of these changes, but I am not really an expert (and don't have write access, so my approval doesn't count for much). I would note that the research code base also has a very nice README file with a lot of useful information. I would like that merged in too with edits as needed to reflect any differences between that code and the open source code. However, this PR is already plenty big, so I would recommend a separate PR for the README merge.

I agree we should pull in those changes - tracked as #428.

src/instructlab/sdg/blocks/iterblock.py

aakankshaduggal

Thanks @bbrowning for a great PR! LGTM 🚢

mergify · 2024-12-06T18:21:02Z

This pull request has merge conflicts that must be resolved before it can be
merged. @bbrowning please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

src/instructlab/sdg/blocks/llmblock.py

While this was technically part of our public Python API, it appears to be entirely unused. Let's pull it out now to make syncing with the latest research advancements easier. Signed-off-by: Ben Browning <[email protected]>

bbrowning · 2024-12-10T18:51:25Z

Rebased on top of latest main since CI is now fixed there - will let CI chew on things and if it's good and no more comments, going to squash a lot of these into a handful of fewer commits, add the co-authorship metadata for the upstream researchers that did much of the initial code, and get this merged. I'm not planning any functional changes to this PR from this point forward unless more review notes come up.

This stubs in support for Jinja templates in the LLMBlock prompt templates, opening us up to more expressive prompts and handling things like loops that take a variable number of input elements when rendering templates. NOTE: This is a backwards-incompatible change in prompt templates. Any users that had custom pipelines specified will need to update their template variables to look like `{{variable}}` instead of `{variable}` as a result of this change. Co-authored-by: shivchander <[email protected]> Co-authored-by: abhi1092 <[email protected]> Signed-off-by: Ben Browning <[email protected]>

The Block and Prompt registries are how we keep track of what our supported Block types are and which Prompts map to which teacher models. Co-authored-by: shivchander <[email protected]> Co-authored-by: abhi1092 <[email protected]> Signed-off-by: Ben Browning <[email protected]>

This brings in changes to move our model prompt templates to Jinja templates and the HuggingFace messages formats, used by their chat templates. Signed-off-by: Ben Browning <[email protected]>

These new blocks don't do anything yet, but stubbing them into the codebase and will continue working on figuring out what they're supposed to do and wiring things up with tests. Co-authored-by: shivchander <[email protected]> Co-authored-by: abhi1092 <[email protected]> Signed-off-by: Ben Browning <[email protected]>

Signed-off-by: Ben Browning <[email protected]>

In addition to updating the knowledge configs to use jinja templates, this adds additional tests to validate that we are using jinja templates instead of python string formats. That also required tightening up our usage of jinja `Template` to always preferred `StrictUndefined` behavior everywhere we use it. Signed-off-by: Ben Browning <[email protected]>

This also makes the test running `Block._validate` on all our shipped configs a bit more generic so that it can cover all skill and knowledge yaml files without having to keep a separate list of config files to test. Signed-off-by: Ben Browning <[email protected]>

This gets rid of the hardcoded block types dict and drives everything off the BlockRegistry. This means I also added a functional test showing how users can create and register their own Block implementations and use those in their pipeline config files - see `tests/testdata/custom_block_pipeline.yaml` and `tests/testdata/custom_block.py` for those examples. Signed-off-by: Ben Browning <[email protected]>

This removes the mapping of model families in SDG itself between granite, mixtral, mistral, merlinite, etc. Instead, it uses the PromptRegistry to lookup chat templates based on the model family given. And, if no model family is given, it still falls back to doing a best-guess based on the file path of the selected teacher model. A simple test was added to demonstrate how to register and use custom chat templates for generating prompts via the PromptRegistry. Signed-off-by: Ben Browning <[email protected]>

This adds a new Block type - `IterBlock` - that calls another block N times for a set of given input samples. Every iteration through the loop, the samples returned from the child block's `generate` call get added to the list of samples produced from this block. So, if you use an `IterBlock` to call an `LLMBlock` 5 times, you'll get 5 samples generated (and 5 calls to the LLM) for every sample in the source dataset. The output dataset will contain all 5 generated samples resulting from each 1 input sample. Co-authored-by: shivchander <[email protected]> Co-authored-by: abhi1092 <[email protected]> Signed-off-by: Ben Browning <[email protected]>

Asserts outside of tests should only be used for programming errors in our own code and not to validate user-facing things. Signed-off-by: Ben Browning <[email protected]>

mergify bot added CI/CD Affects CI/CD configuration documentation Improvements or additions to documentation testing Relates to testing labels Nov 26, 2024

mergify bot added needs-rebase dependencies Pull requests that update a dependency file ci-failure labels Nov 26, 2024

bbrowning force-pushed the research-sync branch from b08ac2d to 80b4bbf Compare November 26, 2024 17:11

mergify bot added ci-failure and removed needs-rebase ci-failure labels Nov 26, 2024

This was referenced Nov 27, 2024

Finish syncing LLMLogProbBlock from research code #413

Open

Finish syncing LLMMessagesBlock from research code #414

Open

bbrowning marked this pull request as ready for review November 27, 2024 18:28

mergify bot added ci-failure and removed ci-failure labels Nov 27, 2024

bbrowning mentioned this pull request Nov 28, 2024

[Epic] Reconcile ilab SDG and Research SDG 2.0 #373

Open

5 tasks

aakankshaduggal requested a review from a team December 2, 2024 20:19

ktam3 linked an issue Dec 2, 2024 that may be closed by this pull request

[Epic] Reconcile ilab SDG and Research SDG 2.0 #373

Open

5 tasks

ktam3 removed a link to an issue Dec 2, 2024

[Epic] Reconcile ilab SDG and Research SDG 2.0 #373

Open

5 tasks

jwm4 approved these changes Dec 3, 2024

View reviewed changes

khaledsulayman reviewed Dec 4, 2024

View reviewed changes

src/instructlab/sdg/blocks/block.py Outdated Show resolved Hide resolved

src/instructlab/sdg/blocks/block.py Show resolved Hide resolved

src/instructlab/sdg/blocks/filterblock.py Show resolved Hide resolved

aakankshaduggal requested a review from a team December 4, 2024 23:11

aakankshaduggal reviewed Dec 5, 2024

View reviewed changes

src/instructlab/sdg/blocks/iterblock.py Show resolved Hide resolved

aakankshaduggal approved these changes Dec 6, 2024

View reviewed changes

mergify bot added the one-approval label Dec 6, 2024

aakankshaduggal requested a review from khaledsulayman December 6, 2024 18:18

mergify bot added the needs-rebase label Dec 6, 2024

bbrowning force-pushed the research-sync branch from f6b9ffe to 5d520d8 Compare December 10, 2024 12:40

mergify bot added ci-failure and removed needs-rebase labels Dec 10, 2024

anastasds reviewed Dec 10, 2024

View reviewed changes

src/instructlab/sdg/blocks/llmblock.py Outdated Show resolved Hide resolved

anastasds approved these changes Dec 10, 2024

View reviewed changes

Remove ImportBlock as a pipeline block

426a171

While this was technically part of our public Python API, it appears to be entirely unused. Let's pull it out now to make syncing with the latest research advancements easier. Signed-off-by: Ben Browning <[email protected]>

bbrowning force-pushed the research-sync branch from 5d520d8 to 1b85ebd Compare December 10, 2024 18:48

mergify bot removed the ci-failure label Dec 10, 2024

bbrowning and others added 11 commits December 10, 2024 14:36

Move model prompts to jinja templates and messages

3b2bc7d

This brings in changes to move our model prompt templates to Jinja templates and the HuggingFace messages formats, used by their chat templates. Signed-off-by: Ben Browning <[email protected]>

Add CHANGELOG.md entries for research reconciliation

79d68fb

Signed-off-by: Ben Browning <[email protected]>

Validate blocks by raising BlockConfigParserError instead of asserts

db3a1ad

Asserts outside of tests should only be used for programming errors in our own code and not to validate user-facing things. Signed-off-by: Ben Browning <[email protected]>

bbrowning force-pushed the research-sync branch from 1b85ebd to db3a1ad Compare December 10, 2024 19:38

bbrowning merged commit fd53dcd into instructlab:main Dec 10, 2024
24 checks passed

bbrowning deleted the research-sync branch December 10, 2024 22:18

This was referenced Dec 17, 2024

InstructLab Maintainer nomination instructlab/community#417

Open

InstructLab Maintainer nomination instructlab/community#418

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reconcile core data generation features with latest research advances #409

Reconcile core data generation features with latest research advances #409

bbrowning commented Nov 26, 2024

mergify bot commented Nov 26, 2024

bbrowning commented Nov 26, 2024

github-actions bot commented Nov 26, 2024

github-actions bot commented Nov 26, 2024

bbrowning commented Nov 27, 2024 •

edited

Loading

jwm4 left a comment

khaledsulayman left a comment

bbrowning commented Dec 5, 2024

aakankshaduggal left a comment

mergify bot commented Dec 6, 2024

bbrowning commented Dec 10, 2024

Reconcile core data generation features with latest research advances #409

Reconcile core data generation features with latest research advances #409

Conversation

bbrowning commented Nov 26, 2024

mergify bot commented Nov 26, 2024

bbrowning commented Nov 26, 2024

github-actions bot commented Nov 26, 2024

github-actions bot commented Nov 26, 2024

bbrowning commented Nov 27, 2024 • edited Loading

jwm4 left a comment

Choose a reason for hiding this comment

khaledsulayman left a comment

Choose a reason for hiding this comment

bbrowning commented Dec 5, 2024

aakankshaduggal left a comment

Choose a reason for hiding this comment

mergify bot commented Dec 6, 2024

bbrowning commented Dec 10, 2024

bbrowning commented Nov 27, 2024 •

edited

Loading