Important
The original authors have moved on to other projects. While the code might still be functional for its original purpose, please be aware that the original team does not plan to develop new features, bug fixes, or updates. If you'd like to become a maintainer, please open an issue to discuss it.

Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
If you just want to get started, we recommend you check the documentation. Curious, and want to know more? Keep reading!
Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects including traditional predictive NLP (classification, extraction, etc.), or generative and large language model scenarios (instruction following, dialogue generation, judging etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.
Compute is expensive and output quality is important. We help you focus on data quality, which tackles the root cause of both of these problems at once. Distilabel helps you to synthesize and judge data to let you spend your valuable time achieving and keeping high-quality standards for your data.
Ownership of data for fine-tuning your own LLMs is not easy but Distilabel can help you to get started. We integrate AI feedback from any LLM provider out there using one unified API.
Synthesize and judge data with the latest research papers while ensuring flexibility, scalability and fault tolerance, so you can focus on improving your data and training your models.
We are an open-source community-driven project and we love to hear from you. Here are some ways to get involved:
- Community Meetup: listen in or present during one of our bi-weekly events.
- Discord: get direct support from the community in #argilla-general and #argilla-help.
- Roadmap: plans change but we love to discuss those with our community so feel encouraged to participate.
The Argilla community uses distilabel to create amazing datasets and models.
- The 1M OpenHermesPreference is a dataset of ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use Distilabel to synthesize data on an immense scale.
- Our distilabeled Intel Orca DPO dataset and the improved OpenHermes model show how we improve model performance by filtering out 50% of the original dataset through AI feedback.
- The haiku DPO data outlines how anyone can create a dataset for a specific task, using the latest research papers to improve the quality of the dataset.
pip install distilabel --upgrade
Requires Python 3.9+
In addition, the following extras are available:
- `anthropic`: for using models available in the Anthropic API via the `AnthropicLLM` integration.
- `cohere`: for using models available in Cohere via the `CohereLLM` integration.
- `argilla`: for exporting the generated datasets to Argilla.
- `groq`: for using models available in Groq using the `groq` Python client via the `GroqLLM` integration.
- `hf-inference-endpoints`: for using the Hugging Face Inference Endpoints via the `InferenceEndpointsLLM` integration.
- `hf-transformers`: for using models available in the `transformers` package via the `TransformersLLM` integration.
- `litellm`: for using `LiteLLM` to call any LLM using the OpenAI format via the `LiteLLM` integration.
- `llama-cpp`: for using the `llama-cpp-python` Python bindings for `llama.cpp` via the `LlamaCppLLM` integration.
- `mistralai`: for using models available in the Mistral AI API via the `MistralAILLM` integration.
- `ollama`: for using Ollama and their available models via the `OllamaLLM` integration.
- `openai`: for using OpenAI API models via the `OpenAILLM` integration, or the rest of the integrations based on OpenAI and relying on its client, such as `AnyscaleLLM`, `AzureOpenAILLM`, and `TogetherLLM`.
- `vertexai`: for using Google Vertex AI proprietary models via the `VertexAILLM` integration.
- `vllm`: for using the vLLM serving engine via the `vLLM` integration.
- `sentence-transformers`: for generating sentence embeddings using `sentence-transformers`.
- `mlx`: for using MLX models via the `MlxLLM` integration.
- `outlines`: for using structured generation of LLMs with `outlines`.
- `instructor`: for using structured generation of LLMs with Instructor.
- `ray`: for scaling and distributing a pipeline with Ray.
- `faiss-cpu` and `faiss-gpu`: for generating sentence embeddings using `faiss`.
- `text-clustering`: for using text clustering with UMAP and Scikit-learn.
- `minhash`: for using minhash for duplicate detection with `datasketch` and `nltk`.
To run the following example you must install `distilabel` with the `hf-inference-endpoints` extra:
pip install "distilabel[hf-inference-endpoints]" --upgrade
Then run:
```python
from datasets import load_dataset

from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

with Pipeline() as pipeline:
    TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            generation_kwargs={"temperature": 0.7, "max_new_tokens": 512},
        ),
    )

if __name__ == "__main__":
    dataset = load_dataset("distilabel-internal-testing/instructions", split="test")
    distiset = pipeline.run(dataset=dataset)
    distiset.push_to_hub(repo_id="distilabel-example")
```
If you build something cool with distilabel
consider adding one of these badges to your dataset or model card.
[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)
[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)
To directly contribute with distilabel
, check our good first issues or open a new one.
- Modular pipeline system that allows for any type of generation (text, images, any output format) and composable steps, whereas data_gen is sort of for questions and answers only (limited output format, no composability into more complex pipelines, not quite ready for multiple images).
- This also makes it way more extensible. You can't really build on top of data_gen, only modify its internals to do some simple generation. You can build on top of this with two files: a config and a pipeline.
- Better parallelism by handling it with just a config and allowing pretty arbitrary GPU usage via tensor parallelism, replicas and `available_gpus`. data_gen only has the data parallelism wrapper I made, which has no tensor parallelism support and requires manually sharding the chunks JSON before and after.
- Input and output as Hugging Face datasets rather than using the chunking library with its custom format and taking/outputting JSONs.
- Built-in, behind-the-scenes caching for easy resuming.
- According to the documentation, works with Ray for larger-scale distributed generation.
- Inherits some cool things from distilabel such as the premade `EvolInstructGenerator` task and others.
- Slightly improved prompt sampler by making it part of the config (easier to edit and have multiple of) and adding the ability to generate list fields in an API call (say, generate 4 questions instead of 1 and split these into separate rows).
- Run everything from outside the `distilabel` directory, e.g. `python distilabel/pipelines/single_page_qa.py`.
- In the modified distilabel package, here are some of the files I have added (you could also check the git commit history):
  - `pipelines/single_page_qa.py`: put new pipelines here. The single page one is a good reference for how to do everything; copy and modify it.
  - `src/distilabel/configs/single_pages.py`: the config for single page QA; check it out to understand how the pipeline runs and what you can modify.
  - `src/distilabel/pydantics.py`: put Pydantic models here (configs, output formats).
  - `src/distilabel/llms/openai_compatible.py`, `vllm_api.py`: the wrapper that handles structured generation with OpenAI-compatible endpoints for different providers and vLLM servers as well.
  - `src/distilabel/utils/misc.py`, `prompt_sampler.py`, `pipe_utils.py`, `image.py`: check out the prompt sampler and how it works in the config. `pipe_utils.py` has useful/reusable code for pipelines in general.
  - `src/distilabel/steps/columns/pydantic_to_cols.py`, `.../steps/filtering/filter_rows.py`, `.../steps/list_to_rows.py`, `.../tasks/lm_generation.py`: you can see each of them imported in the single page QA pipeline. `lm_generation.py` is important to know of because I use it for the structured generation step using a LM. Kind of obvious, but this is where your custom steps go (a rough sketch of one follows below).
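
  For orientation, here is a minimal sketch of what a custom step along the lines of `list_to_rows.py` could look like, using vanilla distilabel's `Step` API. The class and column names are made up and the repo's actual implementation may differ:

  ```python
  from typing import List

  from distilabel.steps import Step, StepInput


  class QuestionListToRows(Step):
      """Hypothetical step: explodes a list column (e.g. several generated
      questions) into one row per item. Illustrative only."""

      list_column: str = "questions"

      @property
      def inputs(self) -> List[str]:
          return [self.list_column]

      @property
      def outputs(self) -> List[str]:
          return ["question"]

      def load(self) -> None:
          # Heavy setup belongs here rather than in __init__
          # (see the note on load() further down this list).
          super().load()

      def process(self, inputs: StepInput):
          rows = []
          for row in inputs:
              for item in row[self.list_column]:
                  new_row = dict(row)
                  new_row["question"] = item
                  rows.append(new_row)
          yield rows
  ```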
- The only requirement for the dataset format is having a source column, which is expected to be a string (straight input to the LM) or a list of image paths (which can point straight to jpg/png files or a page in a PDF with format `path/to/pdf_page_x.pdf`). This is at the moment only an expectation in `VLM._format_input()` when it is passed to `LMGenerationTask.input_formatter`, so you can change the `input_formatter`/override this if you need, or just make `VLM._format_input()` more general.
- I handle scheduling GPUs by overriding the available GPUs seen by `CudaDevicePlacementMixin` and breaking the tasks into multiple load stages so that there are enough GPUs available during each.
- It will launch a vLLM server if the model name is not a recognized proprietary model.
- Short version: distilabel is very particular about how things are done, so there's a reason why every line is the way it is, and I recommend starting from one of the existing pipelines. Also, reading my code for e.g. the single page pipeline will tell you how to build on top of distilabel. Use the rest of this list as an issue tracker so people know how to solve issues in the future.
- It took me a while to figure out how to handle different providers; it turns out their OpenAI-compatible endpoints accept varying basic parameters, and it works best to ignore most of the parameters and send basic messages.
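
  For illustration, a bare-bones request to an OpenAI-compatible endpoint with only widely supported parameters might look like this (the base URL, API key and model name are placeholders):

  ```python
  from openai import OpenAI

  # Any OpenAI-compatible endpoint (a provider's compatibility API, a local
  # vLLM server, ...). Base URL, key and model name below are placeholders.
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

  response = client.chat.completions.create(
      model="my-model",  # whatever the server exposes
      messages=[{"role": "user", "content": "Write one trivia question."}],
      temperature=0.7,   # stick to the widely supported parameters
      max_tokens=256,
  )
  print(response.choices[0].message.content)
  ```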
- You can't output a pydantic object from a step since it isn't serializable with pyarrow.
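
  A minimal illustration of the workaround (presumably what `pydantic_to_cols.py` is about): dump the Pydantic object into plain columns before yielding. The `QA` model and helper below are made up:

  ```python
  from pydantic import BaseModel


  class QA(BaseModel):  # hypothetical structured output format
      question: str
      answer: str


  def qa_to_columns(row: dict, qa: QA) -> dict:
      # pyarrow can build a table from dicts of primitives, but not from the
      # pydantic object itself, so dump it into plain columns before yielding.
      row.update(qa.model_dump())  # adds "question" and "answer" columns
      return row
  ```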
- In as many places as possible, I think you want to use the `load()` method instead of `__init__`, since `__init__` is handled by Pydantic and you'll be able to see the inherited args if you don't override it. It also matches the logic of the library better (matching load groups better, for instance).
- I ran into some errors with the decorators that I tried to make for multiple generations and structured output, because distilabel inspects the signature of functions and somehow decided that `**kwargs` was a required runtime parameter that needed to be set at pipeline init. The solution I am using is to copy the function signature from the library, though this isn't ideal for maintenance.
- I ran into some errors with it not being able to make a pyarrow table after finishing the `LMGenerationTask`, which were due to the parameter `add_raw_input=True`. Since I overrode the `OpenAILLM` class to add support for more flexible vision (arbitrary number/order of images in chat format) and to allow grouping all model providers under a single class, the formatted input was a list of messages, some text, some visual, all in one column (so you can vary the number of images). Pyarrow can't make a table out of this because the structure of a text and an image message are different, so it can't make a type for the column. Thus, I have set `self.add_raw_input = False` in e.g. the `LMGenerationTask`.
  - This is no longer a current issue since I moved the prompt sampler into `format_input`, which is called before the LM and discarded after (no serialization).
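
  In vanilla distilabel, `add_raw_input`/`add_raw_output` are regular task attributes, so (assuming a recent enough version) they can also be switched off when instantiating a task, e.g.:

  ```python
  from distilabel.models import InferenceEndpointsLLM
  from distilabel.steps.tasks import TextGeneration

  # If a task's formatted input is not pyarrow-friendly (e.g. chat messages
  # mixing text and image parts), disabling the raw input column avoids the
  # table-building error. `add_raw_input` only exists in recent versions.
  task = TextGeneration(
      llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"),
      add_raw_input=False,
  )
  ```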
- `StepResources` seems like it might handle scheduling tasks across GPUs for you, but I understand this only happens when using Ray, which has some internal scheduling that will respect those resources (there's a section in the documentation about how to use Ray for larger scale distributed generation).
  - What it does actually do is respect `replicas`, which is basically just data parallelism for non-generator/global steps/tasks (it replicates models as well).
  - It will put LLMs on different GPUs (provided you use the mixin properly) until it runs out of GPUs (`cuda_device_placement.py`), but it won't reschedule them.
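
  For reference, a sketch of how `StepResources` is typically passed in vanilla distilabel (the `gpus`/`cpus` values are only enforced by the Ray backend):

  ```python
  from distilabel.models import InferenceEndpointsLLM
  from distilabel.steps import StepResources
  from distilabel.steps.tasks import TextGeneration

  # `replicas` duplicates the step for data parallelism; `gpus`/`cpus`/`memory`
  # are only actually enforced when the pipeline runs on the Ray backend.
  task = TextGeneration(
      llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"),
      resources=StepResources(replicas=2, gpus=1),
  )
  ```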
- To handle scheduling tasks (say your pipeline will use 10 different vLLM servers but you have 8 GPUs), you use load stages. See the docs; a sketch follows below.
  - `Task.unload()` calls `self.llm.unload()` so you don't have to handle it yourself. If you wanted to keep it alive (say, the vLLM server), you'd need to get around this.
  - Distilabel can handle a list of tasks in the `>>` syntax; for each task in the previous stage, it sends the task's completed batches to all of the next stage (or, in the case of using a router, it will select some set of the next stage per batch).
  - Don't include a routing function step in the load groups; it isn't quite a step and will throw an error, but it runs even when left out of the load groups.
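
  If your distilabel version supports the `load_groups` argument of `pipeline.run()`, a sketch of splitting two GPU-hungry tasks into stages could look like this. The step names are placeholders, and `pipeline`/`dataset` are assumed to be defined as in the quick-start example above:

  ```python
  # `load_groups` loads and unloads each group of steps sequentially instead of
  # loading every step's model at once.
  distiset = pipeline.run(
      dataset=dataset,
      load_groups=[
          ["generate_questions"],  # stage 1: only this task's model is loaded
          ["judge_answers"],       # stage 2: the previous LLM is unloaded first
      ],
  )
  ```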
- I would like each LM to be able to have its own system prompt, which means they each need their own prompt sampler. I see two ways to do this: either make a step for each LM that has the prompt sampler and connect them properly, or put the prompt sampler with the LM. Making a bunch of steps and connecting them seems annoying and not as clean for writing new pipelines. Putting it with the LM means you don't see the system prompt since it isn't a step input or output, so I have sort of hacked distilabel by in-place updating the input, which gets forwarded to `LMGenerationTask.format_output()`.
- Serialization
  - Initially, I ran into an error trying to hash my Config object (for the caching system), so I overrode the serialization to return an empty dict.
  - When I was trying to test the caching, I ran into another error where it couldn't resume from the YAML because the `LMGenerationTask` has an `input_formatter` callable attribute. It loads the YAML with `yaml.FullLoader`, which won't allow arbitrary Python execution (setting the `input_formatter`). I found `Field(exclude=True)` in Pydantic to solve this. Then it occurred to me that I should do the same for the configs I was using rather than erasing their signatures. After this, there was another error in resuming because it couldn't initialize e.g. the `LMGenerationTask` without providing the configs, so I gave these default initializations. This uncovered another error which was a bug in distilabel; I had no choice but to modify the actual package to fix it. In `DAG.from_dict()`, they don't set the `routing_batch_function._step` which is set during `Step.connect()`, so I just added the line to do that.
    - The way its resuming works is that when you call `pipeline.run()`, one of the early steps is `self._refresh_pipeline_from_cache()`, which essentially creates an entirely new DAG from the cached information. Then, for excluded or secret fields, it sets them using the values of the current DAG. Now that I know this, their design seems reasonable, but it is important that you understand the effect of `Field(exclude=True)` to get resuming working properly. The need for serialization and deserialization also justifies the extensive use of Pydantic in distilabel.
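
  A minimal sketch of the `Field(exclude=True)` pattern on a custom step, assuming distilabel's Pydantic-based `Step` API; the class and column names are made up:

  ```python
  from typing import Callable, List, Optional

  from pydantic import Field

  from distilabel.steps import Step, StepInput


  class FormatterStep(Step):  # hypothetical, mirroring the input_formatter case
      # Excluded from serialization, so the cached pipeline YAML can be loaded
      # without executing arbitrary Python; on resume the value is re-filled
      # from the live object, so a default is required for re-instantiation.
      input_formatter: Optional[Callable] = Field(default=None, exclude=True)

      @property
      def inputs(self) -> List[str]:
          return ["source"]

      @property
      def outputs(self) -> List[str]:
          return ["formatted"]

      def process(self, inputs: StepInput):
          for row in inputs:
              fmt = self.input_formatter or (lambda r: r["source"])
              row["formatted"] = fmt(row)
          yield inputs
  ```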
- Had to set the `vllm_api` field to private so that it didn't try to serialize it in multiprocessing.
- There might be errors when changing `load_groups` for a pipeline that you are trying to resume.
- I made step resources an excluded parameter (from the signature and caching) so that you can change these and the pipeline will resume as normal
- [IMPORTANT] I ran into a tough error with distilabel hanging when trying to resume. The root cause (or one of them) was probably that I had stopped execution in the VS Code debugger, which hard-stops the program, so distilabel didn't save the batch back to the pipeline's batch manager, making it so that my initial generator step didn't have its batch data and wasn't sending it up the pipeline. I am still not sure entirely how batches are routed, since this is a large and complex system, but anyway, be wary of the hanging issue. Keep in mind the functions `_manage_batch_flow()`, `_BatchManagerStep._get_data()`, `get_batch()`, `add_batch()` and `_initialize_pipeline_execution()`, which are related to batches in distilabel. I am not sure how exactly to solve this if it happens on something expensive to re-run; maybe try manually editing the cache if you can find the right information.
@misc{distilabel-argilla-2024,
author = {Álvaro Bartolomé Del Canto and Gabriel Martín Blázquez and Agustín Piqueres Lajarín and Daniel Vila Suero},
title = {Distilabel: An AI Feedback (AIF) framework for building datasets with and for LLMs},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/argilla-io/distilabel}}
}