[EPIC] Refactor Generate Functionality into a Standalone Python API #412
After doing a few failed and some not-so-failed prototypes of separating things out locally, here's a proposal I'd like to put forward for how we factor out Python APIs:

```python
from instructlab.sdg import (
    preprocess_taxonomy,
    generate_taxonomy,
    generate_taxonomy_eval,
    postprocess_taxonomy,
    mix_datasets,
    run_pipeline,
)

# Validate taxonomy, determine changed files, fetch knowledge docs,
# convert knowledge docs, chunk, turn into icl_question_* samples
preprocess_taxonomy(
    taxonomy_path="taxonomy",
    output_dir="preprocessed_samples",
)

# Run the actual data generation for preprocessed taxonomy samples
generate_taxonomy(
    pipeline="simple",
    input_dir="preprocessed_samples",
    output_dir="generated_samples",
    endpoint_url="http://localhost:8000/v1",
    api_key="EMPTY",
    model_family="merlinite",
    model_id="foo",
)

# Optionally generate eval (i.e. MMLU) samples from your taxonomy
generate_taxonomy_eval(
    input_dir="preprocessed_samples",
    output_dir="eval_samples",
    endpoint_url="http://localhost:8000/v1",
    api_key="EMPTY",
    model_family="merlinite",
    model_id="foo",
)

# Create phase07 and phase10 splits, RAFT contexts, add in auxiliary
# datasets, and write Recipes to be used with mixing
postprocess_taxonomy(
    input_dir="generated_samples",
    output_dir="postprocessed_samples",
)

# Create the final mixed dataset based on the Recipe
mix_datasets(
    recipe="postprocessed_samples/skills_recipe.yaml",
    output_file="mixed_samples/skills_train_msgs.jsonl",
)
mix_datasets(
    recipe="postprocessed_samples/knowledge_recipe.yaml",
    output_file="mixed_samples/knowledge_train_msgs.jsonl",
)

# Or, just run a single pipeline (mostly meant to be used via the CLI
# run_pipeline command below), as users can already construct
# PipelineContext and Pipeline objects directly to run a single
# pipeline if they want direct control of input/output samples and not
# file operations
run_pipeline(
    pipeline="my_pipeline.yaml",
    input_file="input.jsonl",
    output_file="output.jsonl",
    endpoint_url="http://localhost:8000/v1",
    api_key="EMPTY",
    model_family="mixtral",
    model_id="foo",
)
```

CLI Commands

```shell
python -m instructlab.sdg.cli.preprocess_taxonomy \
    --taxonomy-path taxonomy \
    --output-dir preprocessed_samples \
    --log-level DEBUG

python -m instructlab.sdg.cli.generate_taxonomy \
    --pipeline simple \
    --input-dir preprocessed_samples \
    --output-dir generated_samples \
    --endpoint-url http://localhost:8000/v1 \
    --api-key EMPTY \
    --model-family merlinite \
    --model-id foo \
    --log-level DEBUG

python -m instructlab.sdg.cli.generate_taxonomy_eval \
    --input-dir preprocessed_samples \
    --output-dir eval_samples \
    --endpoint-url http://localhost:8000/v1 \
    --api-key EMPTY \
    --model-family merlinite \
    --model-id foo \
    --log-level DEBUG

python -m instructlab.sdg.cli.postprocess_taxonomy \
    --input-dir generated_samples \
    --output-dir postprocessed_samples \
    --model-family merlinite \
    --model-id foo \
    --log-level DEBUG

python -m instructlab.sdg.cli.mix_datasets \
    --recipe postprocessed_samples/skills_recipe.yaml \
    --output-file mixed_samples/skills_train_msgs.jsonl

python -m instructlab.sdg.cli.mix_datasets \
    --recipe postprocessed_samples/knowledge_recipe.yaml \
    --output-file mixed_samples/knowledge_train_msgs.jsonl

python -m instructlab.sdg.cli.run_pipeline \
    --pipeline my_pipeline.yaml \
    --input-file input.jsonl \
    --output-file output.jsonl \
    --endpoint-url http://localhost:8000/v1 \
    --api-key EMPTY \
    --model-family mixtral \
    --model-id foo \
    --log-level DEBUG
```
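To make the mixing step's contract concrete, here is a hypothetical, self-contained sketch of what a recipe-driven `mix_datasets` might do. The recipe schema (`datasets`, `path`, `sampling_size`) and the function body are illustrative assumptions, not the actual instructlab.sdg implementation:

```python
# Hypothetical sketch of recipe-driven dataset mixing; the recipe
# schema and function body are assumptions for illustration only.
import json
import random
from pathlib import Path


def mix_datasets(recipe: dict, output_file: str, seed: int = 42) -> int:
    """Concatenate (and optionally subsample) the datasets listed in a
    recipe into one training-ready .jsonl file; returns the sample count."""
    rng = random.Random(seed)
    mixed = []
    for entry in recipe["datasets"]:
        samples = [
            json.loads(line)
            for line in Path(entry["path"]).read_text().splitlines()
            if line.strip()
        ]
        # An entry may cap how many samples it contributes to the mix.
        limit = entry.get("sampling_size")
        if limit is not None and limit < len(samples):
            samples = rng.sample(samples, limit)
        mixed.extend(samples)
    rng.shuffle(mixed)
    out = Path(output_file)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w") as f:
        for sample in mixed:
            f.write(json.dumps(sample) + "\n")
    return len(mixed)
```

A shape like this keeps mixing a pure file-to-file operation, which is what lets it stand alone as both a Python API and a CLI entry point.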
I find actually attempting to implement this greatly improves my reasoning about the API: what it needs to do, what it should not do, and how to tease things apart. In that spirit, I have a draft PR of some of these changes at #443 that's still a work-in-progress and nowhere near complete. As of this comment it only pulls out the preprocessing, and it needs to be adapted to match the naming I outlined above. I'll keep iterating on that draft PR to bring it closer to the API and CLIs outlined above. Concurrently, I'd love some additional input and eyes on the proposed API and breakdown of responsibilities. That input is welcome from anyone in the community, but I'm also specifically tagging in @aakankshaduggal and @khaledsulayman to provide input from an internal SDG perspective, @anastasds and @jwm4 to ensure these APIs/CLIs would work for their RAG work, and @abhi1092 and @shivchander for a thumbs-up that the …
Thanks for this @bbrowning! At the risk of throwing a grenade into the conversation, since we are talking about naming... One of my goals with the domain modeling workshop was to try to attach names to operations. Apart from the "verbs" (preprocess, postprocess), the nouns seem worth questioning too. For example, …
I'm not strongly attached to the current proposed names, and mostly put them out there as a way to demonstrate the overall shape and granularity of the API. Feel free to suggest alternative names for CLI and/or API commands, or even a different granularity of the exposed APIs.
How do we feel about these names for things? Had a chat with Andy and @alinaryan about what to name these things and we came up with some suggestions as a discussion starting point.
Thoughts? [Larger context: trying to standardize our vocabulary for use in our codebase, documentation, training, etc.]
I don't think the source dataset is in a format that could be used for training. But the synthetic dataset is in a format that could be used for training. Having two things named as a "dataset" but only one in a trainable format seems like it would be confusing. |
That's a good point. Something like "source corpus"? |
Corpus sounds complicated. The object in this diagram... it includes the docling output from the yaml-referenced source docs, correct? Plus the qna.yaml itself? |
Probably. "Knowledge base"? |
I love that term except, it includes skills so it could cause confusion |
Knowledge workspace? |
The `generate_data` function is currently integrated with various functionalities like taxonomy data ingestion, preprocessing, and mixing, leading to maintenance and testing challenges. We propose refactoring this into a clean, dedicated Python API that handles only data generation. This separation will increase modularity and ease further development.

Objectives
Acceptance Criteria
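To illustrate the separation this epic is after, here is a minimal hypothetical sketch (the stage bodies and the `run_stages` helper are stand-ins invented for this example, not instructlab.sdg code) of stages with a uniform directory-in/directory-out contract, so each stage can be tested and invoked independently or chained end to end:

```python
# Illustrative sketch of the proposed decomposition: every stage is a
# standalone callable with the same (input_dir, output_dir) contract.
# The stage bodies are trivial stand-ins, not real SDG logic.
from pathlib import Path
from typing import Callable

Stage = Callable[[Path, Path], None]


def preprocess(input_dir: Path, output_dir: Path) -> None:
    # Stand-in for taxonomy validation / doc conversion / chunking.
    output_dir.mkdir(parents=True, exist_ok=True)
    for f in input_dir.glob("*.txt"):
        (output_dir / f.name).write_text(f.read_text().strip())


def generate(input_dir: Path, output_dir: Path) -> None:
    # Stand-in for the LLM-backed data generation stage.
    output_dir.mkdir(parents=True, exist_ok=True)
    for f in input_dir.glob("*.txt"):
        (output_dir / f.name).write_text(f.read_text().upper())


def run_stages(stages: list[Stage], start_dir: Path, work_dir: Path) -> Path:
    """Chain stages, wiring each stage's output dir to the next's input;
    returns the final output directory."""
    current = start_dir
    for i, stage in enumerate(stages):
        out = work_dir / f"stage_{i}"
        stage(current, out)
        current = out
    return current
```

With a contract like this, `generate` can be exercised in isolation in tests, while the CLI entry points remain thin wrappers that chain the stages.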