Reconciling OS and research datamixing code #407

jwm4 · 2024-11-24T21:14:41Z

This PR brings in the datamixing.py file from the research code and some supporting files. It includes updates to test_datamixing.py to reflect the research behavior. Those tests do pass. It also includes some preliminary updates to generate_data.py to make use of the research version of the datamixing code. However, a lot more work is needed to fully reconcile generate_data.py. In this PR, the tests for generate data do not pass and a lot more work is needed to sort out to what extent we want the new behavior to match what the existing tests are checking for and to what extent we want to update the tests to match the new behavior. Here are some highlights of what is included in this PR:

Put datamixing.py into utils which is where it is in the research code instead of the root SDG folder which is where it was in the preexisting repo (starting with the research datamixing.py)
Brought datautils.py over from the research code because it is used by datamixing.py.
Brought parse_and_convert.py over from the research code because it includes some capabilities that the preexisting code included in datamixing.py
test_datamixer_can_load_default_recipes becomes test_load_default_recipes because there is no longer a "DataMixer" class
The preexisting load and sample method was called _load_and_sample_datasets while the research method was called load_ds. I adoped the former because it seems like a more descriptive name that explains what it does better.
test_datamixer_can_load_default_recipes becomes test_load_default_recipes because there is no longer a "DataMixer" class
Removed test_recipe_init_with_empty_params_* because empty params is not supported in the research code
The test_init_with_empty_recipe_files used to check to make sure we were populating defaults, but the research code doesn't do that so now we just verify that we get an empty recipe from it.
Dropped test_load_ds_with_absolute_jsonl_path because the functionality it was testing (looking up a dataset file from a relative path as specified in the recipe file) doesn't exist in the research code. Instead, in the research code, you just send the path directly to the load method.
Updated test_load_ds_with_absolute_jsonl_path code accordingly.
_get_question_hack and _get_response_hack were in the preexisting datamixing.py but not the research datamixing.py. For now, I have copied them into generate_data.py since they seem to be part of some sort of legacy format support that is specific to that file and it is probably better to keep all the temporary/deprecated code localized in one place.
Dropped _convert_to_messages from generate_data.py because the research version of it is in parse_and_convert.py
Brought DataMixer._load_default_recipe from the preexisting code base into the new utils/datamixing.py as load_default_recipe. This was needed because the research code does not have a DataMixer class but we still need a way to load the default recipe. The research version of generate_data handles this by just hard-coding the location of the default recipe, but the open source code needs to take in a set of directories and search all of them for this.
- Commented out the code to set _precomputed_skills_length because this is a field of DataMixer which doesn't exist in the research code. If we can confirm that this functionality is no longer needed, we should delete the commented out code, but I am leaving it there for now while we sort this out.
- The preexisting version defaulted to generating an empty Recipe with a given system prompt. However, the research code does not have a constructor for an empty Recipe or a way to specify a system prompt for a Recipe because the system prompt is always loaded from the Recipe file. So for now load_default_recipe just returns None if no recipe file is found.

bbrowning · 2024-12-02T18:36:02Z

Is this something we still want to do? I recall we talked about perhaps not picking in the latest datamixing changes from the research codebase, and instead using our datamixing bits on top of their core generation pipeline enhancements.

Reconciling OS and research datamixing code

dbe62ca

mergify bot added testing Relates to testing ci-failure labels Nov 24, 2024

jwm4 marked this pull request as draft November 24, 2024 21:19

mairin mentioned this pull request Dec 17, 2024

InstructLab Maintainer nomination instructlab/community#417

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reconciling OS and research datamixing code #407

Reconciling OS and research datamixing code #407

jwm4 commented Nov 24, 2024

bbrowning commented Dec 2, 2024

Reconciling OS and research datamixing code #407

Are you sure you want to change the base?

Reconciling OS and research datamixing code #407

Conversation

jwm4 commented Nov 24, 2024

bbrowning commented Dec 2, 2024