[EPIC] Refactor Generate Functionality into a Standalone Python API #412

Open · 4 tasks

@aakankshaduggal (Member) opened this issue Nov 26, 2024 · 11 comments

The generate_data function is currently intertwined with various other functionalities, such as taxonomy data ingestion, preprocessing, and mixing, which creates maintenance and testing challenges. We propose refactoring this into a clean, dedicated Python API that handles only data generation. This separation will increase modularity and ease further development.

Objectives

  1. Extract the generate logic from the existing implementation and encapsulate it within a new Python API.
  2. Ensure this API is compatible with both standalone use and integration into the CLI.
  3. Maintain the integrity of the existing codebase while simplifying the generation process.

Acceptance Criteria

  • Define the New API
  1. Develop a Python API that focuses solely on the data generation process.
  2. Include the necessary parameters, such as the dataset path, output save path, and pipeline path.
  3. Utilize the API within a CLI context to ensure seamless integration.
  • Independent SDG CLI
  1. Use click for CLI development, providing options to configure the generation process directly from the command line (a rough sketch follows this list).
  2. Ensure that the existing ilab CLI uses this new API, passing all necessary parameters through command-line options.
  • Testing and Debugging
  1. Write comprehensive unit tests for the new API to ensure it works as expected under various configurations.
  • Documentation and Examples
  1. Since the new SDG CLI will require users to pass their own dataset and pipeline, update the project documentation with detailed instructions on how to use the new API and CLI.
  2. Provide example commands and configurations to help users get started with the new setup.
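
To make the acceptance criteria a bit more concrete, here is a minimal sketch of what a click-based, generation-only entry point might look like. The function name, parameters, and module layout are illustrative assumptions, not a committed design.

# Hypothetical sketch only: the function name, parameters, and layout below
# are illustrative assumptions, not the final API.
import click


def generate_data(dataset_path: str, output_dir: str, pipeline_path: str) -> None:
    """Stand-in for the refactored, generation-only API described above."""
    ...


@click.command()
@click.option("--dataset-path", required=True, help="Path to the seed dataset.")
@click.option("--output-dir", required=True, help="Directory for generated samples.")
@click.option("--pipeline-path", required=True, help="Pipeline YAML to run.")
def generate(dataset_path, output_dir, pipeline_path):
    """CLI wrapper that only configures and invokes data generation."""
    generate_data(dataset_path, output_dir, pipeline_path)


if __name__ == "__main__":
    generate()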
aakankshaduggal self-assigned this Nov 26, 2024
@bbrowning (Contributor)

After doing a few failed and some not-so-failed prototypes of separating things out locally, here's a proposal I'd like to put forward for how we factor out generate_data into separate public APIs (and CLIs) instead of only exposing the one large end-to-end generate_data. The goal would be for generate_data to just delegate to these APIs instead of implementing everything itself - specifically it would call preprocess_taxonomy, generate_taxonomy, generate_taxonomy_eval, postprocess_taxonomy, and mix_datasets to provide the same end-to-end experience it does today via the new APIs.

Python APIs

from instructlab.sdg import (
    preprocess_taxonomy,
    generate_taxonomy,
    generate_taxonomy_eval,
    postprocess_taxonomy,
    mix_datasets,
    run_pipeline,
)

# Validate taxonomy, determine changed files, fetch knowledge docs,
# convert knowledge docs, chunk, turn into icl_question_* samples
preprocess_taxonomy(
    taxonomy_path="taxonomy",
    output_dir="preprocessed_samples",
)

# Run the actual data generation for preprocessed taxonomy samples
generate_taxonomy(
    pipeline="simple",
    input_dir="preprocessed_samples",
    output_dir="generated_samples",
    endpoint_url="http://localhost:8000/v1",
    api_key="EMPTY",
    model_family="merlinite",
    model_id="foo",
)

# Optionally generate eval (i.e., MMLU) samples from your taxonomy
generate_taxonomy_eval(
    input_dir="preprocessed_samples",
    output_dir="eval_samples",
    endpoint_url="http://localhost:8000/v1",
    api_key="EMPTY",
    model_family="merlinite",
    model_id="foo",
)

# Create phase07 and phase10 splits, RaFT contexts, add in auxiliary
# datasets, and write Recipes to be used with mixing
postprocess_taxonomy(
    input_dir="generated_samples",
    output_dir="postprocessed_samples",
)

# Create the final mixed dataset based on the Recipe
mix_datasets(
    recipe="postprocessed_samples/skills_recipe.yaml",
    output_file="mixed_samples/skills_train_msgs.jsonl",
)
mix_datasets(
    recipe="postprocessed_samples/knowledge_recipe.yaml",
    output_file="mixed_samples/knowledge_train_msgs.jsonl",
)



# Or, just run a single pipeline (mostly meant to be used via the CLI
# run_pipeline command below), as users can already construct a
# PipelineContext and Pipeline objects directly to run a single
# pipeline if they want direct control of input/output samples and not
# file operations
run_pipeline(
    pipeline="my_pipeline.yaml",
    input_file="input.jsonl",
    output_file="output.jsonl",
    endpoint_url="http://localhost:8000/v1",
    api_key="EMPTY",
    model_family="mixtral",
    model_id="foo",
)
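
As the comment in the snippet above notes, users who want direct control of the input and output samples can already construct PipelineContext and Pipeline objects themselves. A rough sketch of that path follows; the exact constructor arguments and import paths are assumptions, so check the installed instructlab-sdg version for the authoritative signatures.

# Rough sketch of the "direct control" path: build the context and pipeline
# objects yourself and keep the datasets in memory. Constructor arguments and
# import paths are assumptions; verify against your instructlab-sdg version.
from datasets import Dataset
from openai import OpenAI

from instructlab.sdg.pipeline import Pipeline, PipelineContext

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
ctx = PipelineContext(client=client, model_family="mixtral", model_id="foo")

# Load input samples however you like; no file layout is imposed here.
input_samples = Dataset.from_json("input.jsonl")

pipeline = Pipeline.from_file(ctx, "my_pipeline.yaml")
output_samples = pipeline.generate(input_samples)
output_samples.to_json("output.jsonl")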

CLI Commands

python -m instructlab.sdg.cli.preprocess_taxonomy \
  --taxonomy-path taxonomy \
  --output-dir preprocessed_samples \
  --log-level DEBUG

python -m instructlab.sdg.cli.generate_taxonomy \
  --pipeline simple \
  --input-dir preprocessed_samples \
  --output-dir generated_samples \
  --endpoint-url http://localhost:8000/v1 \
  --api-key EMPTY \
  --model-family merlinite \
  --model-id foo \
  --log-level DEBUG

python -m instructlab.sdg.cli.generate_taxonomy_eval \
  --input-dir preprocessed_samples \
  --output-dir eval_samples \
  --endpoint-url http://localhost:8000/v1 \
  --api-key EMPTY \
  --model-family merlinite \
  --model-id foo \
  --log-level DEBUG

python -m instructlab.sdg.cli.postprocess_taxonomy \
  --input-dir generated_samples \
  --output-dir postprocessed_samples \
  --model-family merlinite \
  --model-id foo \
  --log-level DEBUG

python -m instructlab.sdg.cli.mix_datasets \
  --recipe postprocessed_samples/skills_recipe.yaml \
  --output-file mixed_samples/skills_train_msgs.jsonl

python -m instructlab.sdg.cli.mix_datasets \
  --recipe postprocessed_samples/knowledge_recipe.yaml \
  --output-file mixed_samples/knowledge_train_msgs.jsonl


python -m instructlab.sdg.cli.run_pipeline \
  --pipeline my_pipeline.yaml \
  --input-file input.jsonl \
  --output-file output.jsonl \
  --endpoint-url http://localhost:8000/v1 \
  --api-key EMPTY \
  --model-family mixtral \
  --model-id foo \
  --log-level DEBUG
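
Putting the pieces together, the existing end-to-end generate_data would, under this proposal, become a thin wrapper over these same calls. The sketch below is illustrative only: the wrapper signature and intermediate directory names are assumptions, and generate_taxonomy_eval is omitted for brevity.

# Illustrative sketch of how the existing end-to-end generate_data could
# delegate to the proposed APIs. Wrapper signature and intermediate directory
# names are assumptions, not part of the proposal itself.
import os

from instructlab.sdg import (
    preprocess_taxonomy,
    generate_taxonomy,
    postprocess_taxonomy,
    mix_datasets,
)


def generate_data(taxonomy_path, output_dir, endpoint_url, api_key,
                  model_family, model_id, pipeline="simple"):
    preprocessed = os.path.join(output_dir, "preprocessed_samples")
    generated = os.path.join(output_dir, "generated_samples")
    postprocessed = os.path.join(output_dir, "postprocessed_samples")

    preprocess_taxonomy(taxonomy_path=taxonomy_path, output_dir=preprocessed)
    generate_taxonomy(
        pipeline=pipeline,
        input_dir=preprocessed,
        output_dir=generated,
        endpoint_url=endpoint_url,
        api_key=api_key,
        model_family=model_family,
        model_id=model_id,
    )
    postprocess_taxonomy(input_dir=generated, output_dir=postprocessed)
    mix_datasets(
        recipe=os.path.join(postprocessed, "skills_recipe.yaml"),
        output_file=os.path.join(output_dir, "skills_train_msgs.jsonl"),
    )
    mix_datasets(
        recipe=os.path.join(postprocessed, "knowledge_recipe.yaml"),
        output_file=os.path.join(output_dir, "knowledge_train_msgs.jsonl"),
    )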

@bbrowning (Contributor)

I find that actually attempting to implement this greatly improves my reasoning about the API: what it needs to do, what it doesn't, and how to tease things apart. In that spirit, I have a draft PR of some of these changes at #443 that is still a work in progress and nowhere near complete. As of this comment it only pulls out the preprocessing and needs to be adapted to match the naming I outlined above.

I'll keep iterating on that draft PR to make it closer to the API and CLIs outlined above. Concurrently, I'd love some additional input and eyes on the proposed API and breakdown of responsibilities. That input is welcome from anyone in the community, but also specifically tagging in @aakankshaduggal and @khaledsulayman to provide input from an internal SDG perspective, @anastasds and @jwm4 to ensure these APIs/CLIs would work for their RAG work, @abhi1092 and @shivchander for a thumbs-up that the run_pipeline API will provide the basic dataset in / dataset out functionality they need, and @relyt0925 for perspective as a community user of SDG.

@anastasds commented Dec 11, 2024

Thanks for this @bbrowning! At the risk of throwing a grenade into the conversation, since we are talking about naming...

One of my goals with the domain modeling workshop was to try to attach names to operations. preprocess and postprocess could mean anything. Just based on the name, would you expect preprocess to include extracting from user-provided PDFs? There is no a priori correct answer.

Apart from the "verbs" (preprocess, postprocess), the nouns seem worth questioning too. For example, postprocess_taxonomy: postprocessing is not applied to the taxonomy in the sense of the taxonomy you have in git. Especially with the "dataset in, dataset out" concept for SDG, do we need a first-class concept of a dataset?

@bbrowning (Contributor)

I'm not strongly attached to the current proposed names, and mostly put them out there as a way to demonstrate the overall shape and granularity of the API. Feel free to suggest alternative names for CLI and/or API commands, or even a different granularity of the exposed APIs.

@anastasds commented Dec 13, 2024

How do we feel about these names? I had a chat with Andy and @alinaryan about what to call these things, and we came up with some suggestions as a discussion starting point.

[Diagram: proposed names for the pipeline stages, summarized in the list below]

  • Taxonomy: unchanged - this is the thing in git
  • Convert Taxonomy: "preprocessing" could really mean anything; "convert_taxonomy" names the operation and implies that there is a goal output. Converting a taxonomy creates a:
  • Source Dataset: the starting point of having something that can actually be used for training. A source dataset can be used to:
  • Generate synthetic data: this one seemed pretty accurately named; it just turns "SDG" into an imperative phrase. Doing so gives a:
  • Synthetic dataset: what it says on the tin. Finally, at the end, you can:
  • Mix datasets or blend datasets: this one also seemed aptly named, though I do like "blend" as well.

Thoughts?

[Larger context: trying to standardize our vocabulary for use in our codebase, documentation, training, etc]

@mairin (Member) commented Dec 14, 2024

I don't think the source dataset is in a format that could be used for training, but the synthetic dataset is. Having two things named "dataset" when only one is in a trainable format seems like it would be confusing.

@anastasds

That's a good point. Something like "source corpus"?

@mairin (Member) commented Dec 16, 2024

Corpus sounds complicated. The object in this diagram... it includes the docling output from the yaml-referenced source docs, correct? Plus the qna.yaml itself?

@anastasds

Probably. "Knowledge base"?

@mairin (Member) commented Dec 16, 2024

I love that term, except it includes skills, so it could cause confusion.

@anastasds

Knowledge workspace?
