Reconciling OS and research datamixing code #407
Draft
This PR brings in the `datamixing.py` file from the research code and some supporting files. It includes updates to `test_datamixing.py` to reflect the research behavior. Those tests do pass. It also includes some preliminary updates to `generate_data.py` to make use of the research version of the datamixing code. However, a lot more work is needed to fully reconcile `generate_data.py`. In this PR, the tests for generate data do not pass, and we still need to sort out to what extent we want the new behavior to match what the existing tests are checking for and to what extent we want to update the tests to match the new behavior.

Here are some highlights of what is included in this PR:

- Moved `datamixing.py` into `utils`, which is where it is in the research code, instead of the root SDG folder, which is where it was in the preexisting repo (starting with the research `datamixing.py`).
- Copied `datautils.py` over from the research code because it is used by `datamixing.py`.
- Copied `parse_and_convert.py` over from the research code because it includes some capabilities that the preexisting code included in `datamixing.py`.
- The preexisting method was called `_load_and_sample_datasets` while the research method was called `load_ds`. I adopted the former because it is a more descriptive name that better explains what the method does.
- Removed `test_recipe_init_with_empty_params_*` because empty params are not supported in the research code.
- `test_init_with_empty_recipe_files` used to check that we were populating defaults, but the research code doesn't do that, so now we just verify that we get an empty recipe from it.
- Changed `test_load_ds_with_absolute_jsonl_path` because the functionality it was testing (looking up a dataset file from a relative path as specified in the recipe file) doesn't exist in the research code. Instead, in the research code, you just send the path directly to the load method. Updated the `test_load_ds_with_absolute_jsonl_path` code accordingly.
- `_get_question_hack` and `_get_response_hack` were in the preexisting `datamixing.py` but not the research `datamixing.py`. For now, I have copied them into `generate_data.py`, since they seem to be part of some sort of legacy format support that is specific to that file, and it is probably better to keep all the temporary/deprecated code localized in one place.
- Removed `_convert_to_messages` from `generate_data.py` because the research version of it is in `parse_and_convert.py`.
- Moved `DataMixer._load_default_recipe` from the preexisting code base into the new `utils/datamixing.py` as `load_default_recipe`. This was needed because the research code does not have a `DataMixer` class but we still need a way to load the default recipe. The research version of generate_data handles this by just hard-coding the location of the default recipe, but the open source code needs to take in a set of directories and search all of them for it.
- Commented out `_precomputed_skills_length` because this is a field of `DataMixer`, which doesn't exist in the research code. If we can confirm that this functionality is no longer needed, we should delete the commented-out code, but I am leaving it there for now while we sort this out.
- `load_default_recipe` just returns `None` if no recipe file is found.
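For reviewers, here is a rough sketch of the directory-search behavior described for `load_default_recipe` above. It is not the code in this PR: the function name `find_default_recipe`, the file name `default_recipe.yaml`, and the choice to return a path rather than a loaded recipe are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical file name; the PR does not say what the default recipe file is called.
DEFAULT_RECIPE_FILENAME = "default_recipe.yaml"


def find_default_recipe(
    search_dirs: Iterable[str],
    filename: str = DEFAULT_RECIPE_FILENAME,
) -> Optional[Path]:
    """Return the first existing recipe file found in search_dirs, else None."""
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            # Directories are checked in order, so earlier entries take priority.
            return candidate
    # No directory contained the file: mirror the "returns None" behavior.
    return None
```

Because the sketch returns `None` instead of raising, a caller such as `generate_data.py` would need to check for `None` explicitly before using the result.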