[EPIC] Refactor Generate Functionality into a Standalone Python API #412

Open · 4 tasks

@aakankshaduggal (Member) opened this issue Nov 26, 2024 · 11 comments

The generate_data function is currently intertwined with various other functionalities, such as taxonomy data ingestion, preprocessing, and mixing, which creates maintenance and testing challenges. We propose refactoring this into a clean, dedicated Python API that handles only data generation. This separation will increase modularity and ease further development.

Objectives

  1. Extract the generate logic from the existing implementation and encapsulate it within a new Python API.
  2. Ensure this API is compatible with both standalone use and integration into the CLI.
  3. Maintain the integrity of the existing codebase while simplifying the generation process.

Acceptance Criteria

  • Define the New API
  1. Develop a Python API that focuses solely on the data generation process.
  2. Include the necessary parameters, such as the dataset path, output save path, and pipeline path.
  3. Utilize the API within a CLI context to ensure seamless integration.
  • Independent SDG CLI
  1. Use click for CLI development, providing options to configure the generation process directly from the command line (a rough sketch follows this list).
  2. Ensure that the existing ilab CLI uses this new API, passing all necessary parameters through command-line options.
  • Testing and Debugging
  1. Write comprehensive unit tests for the new API to ensure it works as expected under various configurations.
  • Documentation and Examples
  1. Since the new SDG CLI will require users to pass their own dataset and pipeline, update the project documentation with detailed instructions on how to use the new API and CLI.
  2. Provide example commands and configurations to help users get started with the new setup.
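
To make the acceptance criteria a bit more concrete, here is a minimal sketch of what a click-based, generation-only entry point might look like. The function name, parameters, and module layout are illustrative assumptions, not a committed design.

# Hypothetical sketch only: the function name, parameters, and layout below
# are illustrative assumptions, not the final API.
import click


def generate_data(dataset_path: str, output_dir: str, pipeline_path: str) -> None:
    """Stand-in for the refactored, generation-only API described above."""
    ...


@click.command()
@click.option("--dataset-path", required=True, help="Path to the seed dataset.")
@click.option("--output-dir", required=True, help="Directory for generated samples.")
@click.option("--pipeline-path", required=True, help="Pipeline YAML to run.")
def generate(dataset_path, output_dir, pipeline_path):
    """CLI wrapper that only configures and invokes data generation."""
    generate_data(dataset_path, output_dir, pipeline_path)


if __name__ == "__main__":
    generate()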
aakankshaduggal self-assigned this Nov 26, 2024
@bbrowning (Contributor)

After doing a few failed and some not-so-failed prototypes of separating things out locally, here's a proposal I'd like to put forward for how we factor out generate_data into separate public APIs (and CLIs) instead of only exposing the one large end-to-end generate_data. The goal would be for generate_data to just delegate to these APIs instead of implementing everything itself - specifically it would call preprocess_taxonomy, generate_taxonomy, generate_taxonomy_eval, postprocess_taxonomy, and mix_datasets to provide the same end-to-end experience it does today via the new APIs.

Python APIs

from instructlab.sdg import (
    preprocess_taxonomy,
    generate_taxonomy,
    generate_taxonomy_eval,
    postprocess_taxonomy,
    mix_datasets,
    run_pipeline,
)

# Validate taxonomy, determine changed files, fetch knowledge docs,
# convert knowledge docs, chunk, turn into icl_question_* samples
preprocess_taxonomy(
    taxonomy_path="taxonomy",
    output_dir="preprocessed_samples",
)

# Run the actual data generation for preprocessed taxonomy samples
generate_taxonomy(
    pipeline="simple",
    input_dir="preprocessed_samples",
    output_dir="generated_samples",
    endpoint_url="http://localhost:8000/v1",
    api_key="EMPTY",
    model_family="merlinite",
    model_id="foo",
)

# Optionally generate eval (i.e., MMLU) samples from your taxonomy
generate_taxonomy_eval(
    input_dir="preprocessed_samples",
    output_dir="eval_samples",
    endpoint_url="http://localhost:8000/v1",
    api_key="EMPTY",
    model_family="merlinite",
    model_id="foo",
)

# Create phase07 and phase10 splits, RaFT contexts, add in auxiliary
# datasets, and write Recipes to be used with mixing
postprocess_taxonomy(
    input_dir="generated_samples",
    output_dir="postprocessed_samples",
)

# Create the final mixed dataset based on the Recipe
mix_datasets(
    recipe="postprocessed_samples/skills_recipe.yaml",
    output_file="mixed_samples/skills_train_msgs.jsonl",
)
mix_datasets(
    recipe="postprocessed_samples/knowledge_recipe.yaml",
    output_file="mixed_samples/knowledge_train_msgs.jsonl",
)



# Or, just run a single pipeline (mostly meant to be used via the CLI
# run_pipeline command below), as users can already construct a
# PipelineContext and Pipeline objects directly to run a single
# pipeline if they want direct control of input/output samples and not
# file operations
run_pipeline(
    pipeline="my_pipeline.yaml",
    input_file="input.jsonl",
    output_file="output.jsonl",
    endpoint_url="http://localhost:8000/v1",
    api_key="EMPTY",
    model_family="mixtral",
    model_id="foo",
)
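
As the comment in the snippet above notes, users who want direct control of the input and output samples can already construct PipelineContext and Pipeline objects themselves. A rough sketch of that path follows; the exact constructor arguments and import paths are assumptions, so check the installed instructlab-sdg version for the authoritative signatures.

# Rough sketch of the "direct control" path: build the context and pipeline
# objects yourself and keep the datasets in memory. Constructor arguments and
# import paths are assumptions; verify against your instructlab-sdg version.
from datasets import Dataset
from openai import OpenAI

from instructlab.sdg.pipeline import Pipeline, PipelineContext

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
ctx = PipelineContext(client=client, model_family="mixtral", model_id="foo")

# Load input samples however you like; no file layout is imposed here.
input_samples = Dataset.from_json("input.jsonl")

pipeline = Pipeline.from_file(ctx, "my_pipeline.yaml")
output_samples = pipeline.generate(input_samples)
output_samples.to_json("output.jsonl")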

CLI Commands

python -m instructlab.sdg.cli.preprocess_taxonomy \
  --taxonomy-path taxonomy \
  --output-dir preprocessed_samples \
  --log-level DEBUG

python -m instructlab.sdg.cli.generate_taxonomy \
  --pipeline simple \
  --input-dir preprocessed_samples \
  --output-dir generated_samples \
  --endpoint-url http://localhost:8000/v1 \
  --api-key EMPTY \
  --model-family merlinite \
  --model-id foo \
  --log-level DEBUG

python -m instructlab.sdg.cli.generate_taxonomy_eval \
  --input-dir preprocessed_samples \
  --output-dir eval_samples \
  --endpoint-url http://localhost:8000/v1 \
  --api-key EMPTY \
  --model-family merlinite \
  --model-id foo \
  --log-level DEBUG

python -m instructlab.sdg.cli.postprocess_taxonomy \
  --input-dir generated_samples \
  --output-dir postprocessed_samples \
  --model-family merlinite \
  --model-id foo \
  --log-level DEBUG

python -m instructlab.sdg.cli.mix_datasets \
  --recipe postprocessed_samples/skills_recipe.yaml \
  --output-file mixed_samples/skills_train_msgs.jsonl

python -m instructlab.sdg.cli.mix_datasets \
  --recipe postprocessed_samples/knowledge_recipe.yaml \
  --output-file mixed_samples/knowledge_train_msgs.jsonl


python -m instructlab.sdg.cli.run_pipeline \
  --pipeline my_pipeline.yaml \
  --input-file input.jsonl \
  --output-file output.jsonl \
  --endpoint-url http://localhost:8000/v1 \
  --api-key EMPTY \
  --model-family mixtral \
  --model-id foo \
  --log-level DEBUG
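
Putting the pieces together, the existing end-to-end generate_data would, under this proposal, become a thin wrapper over these same calls. The sketch below is illustrative only: the wrapper signature and intermediate directory names are assumptions, and generate_taxonomy_eval is omitted for brevity.

# Illustrative sketch of how the existing end-to-end generate_data could
# delegate to the proposed APIs. Wrapper signature and intermediate directory
# names are assumptions, not part of the proposal itself.
import os

from instructlab.sdg import (
    preprocess_taxonomy,
    generate_taxonomy,
    postprocess_taxonomy,
    mix_datasets,
)


def generate_data(taxonomy_path, output_dir, endpoint_url, api_key,
                  model_family, model_id, pipeline="simple"):
    preprocessed = os.path.join(output_dir, "preprocessed_samples")
    generated = os.path.join(output_dir, "generated_samples")
    postprocessed = os.path.join(output_dir, "postprocessed_samples")

    preprocess_taxonomy(taxonomy_path=taxonomy_path, output_dir=preprocessed)
    generate_taxonomy(
        pipeline=pipeline,
        input_dir=preprocessed,
        output_dir=generated,
        endpoint_url=endpoint_url,
        api_key=api_key,
        model_family=model_family,
        model_id=model_id,
    )
    postprocess_taxonomy(input_dir=generated, output_dir=postprocessed)
    mix_datasets(
        recipe=os.path.join(postprocessed, "skills_recipe.yaml"),
        output_file=os.path.join(output_dir, "skills_train_msgs.jsonl"),
    )
    mix_datasets(
        recipe=os.path.join(postprocessed, "knowledge_recipe.yaml"),
        output_file=os.path.join(output_dir, "knowledge_train_msgs.jsonl"),
    )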

@bbrowning (Contributor)

I find that actually attempting to implement this greatly improves my reasoning about the API: what it needs to do, what it doesn't, and how to tease things apart. In that spirit, I have a draft PR of some of these changes at #443 that is still a work in progress and nowhere near complete. As of this comment it only pulls out the preprocessing and needs to be adapted to match the naming I outlined above.

I'll keep iterating on that draft PR to make it closer to the API and CLIs outlined above. Concurrently, I'd love some additional input and eyes on the proposed API and breakdown of responsibilities. That input is welcome from anyone in the community, but also specifically tagging in @aakankshaduggal and @khaledsulayman to provide input from an internal SDG perspective, @anastasds and @jwm4 to ensure these APIs/CLIs would work for their RAG work, @abhi1092 and @shivchander for a thumbs-up that the run_pipeline API will provide the basic dataset in / dataset out functionality they need, and @relyt0925 for perspective as a community user of SDG.

@anastasds commented Dec 11, 2024

Thanks for this @bbrowning! At the risk of throwing a grenade into the conversation, since we are talking about naming...

One of my goals with the domain modeling workshop was to try to attach names to operations. preprocess and postprocess could mean anything. Just based on the name, would you expect preprocess to include extracting from user-provided PDFs? There is no a priori correct answer.

Apart from the "verbs" (preprocess, postprocess), the nouns seem worth questioning too. For example, postprocess_taxonomy: postprocessing is not applied to the taxonomy in the sense of the taxonomy you have in git. Especially with the "dataset in, dataset out" concept for SDG, do we need a first-class concept of a dataset?

@bbrowning (Contributor)

I'm not strongly attached to the current proposed names, and mostly put them out there as a way to demonstrate the overall shape and granularity of the API. Feel free to suggest alternative names for CLI and/or API commands, or even a different granularity of the exposed APIs.

@anastasds commented Dec 13, 2024

How do we feel about these names? I had a chat with Andy and @alinaryan about what to call these things, and we came up with some suggestions as a discussion starting point.

[Diagram: proposed names for the pipeline stages, summarized in the list below]

  • Taxonomy: unchanged - this is the thing in git
  • Convert Taxonomy: "preprocessing" could really mean anything; "convert_taxonomy" names the operation and implies that there is a goal output. Converting a taxonomy creates a:
  • Source Dataset: the starting point of having something that can actually be used for training. A source dataset can be used to:
  • Generate synthetic data: this one seemed pretty accurately named; it just turns "SDG" into an imperative phrase. Doing so gives a:
  • Synthetic dataset: what it says on the tin. Finally, at the end, you can:
  • Mix datasets or blend datasets: this one also seemed aptly named, though I do like "blend" as well.

Thoughts?

[Larger context: trying to standardize our vocabulary for use in our codebase, documentation, training, etc]

@mairin (Member) commented Dec 14, 2024

I don't think the source dataset is in a format that could be used for training, but the synthetic dataset is. Having two things named "dataset" when only one is in a trainable format seems like it would be confusing.

@anastasds

That's a good point. Something like "source corpus"?

@mairin (Member) commented Dec 16, 2024

Corpus sounds complicated. The object in this diagram... it includes the docling output from the yaml-referenced source docs, correct? Plus the qna.yaml itself?

@anastasds

Probably. "Knowledge base"?

@mairin (Member) commented Dec 16, 2024

I love that term, except it includes skills, so it could cause confusion.

@anastasds

Knowledge workspace?
