From 3264563e044e36dc0f67e7dd3dd4d264245ef58e Mon Sep 17 00:00:00 2001 From: David Berenstein Date: Mon, 19 Aug 2024 09:40:23 +0200 Subject: [PATCH] Add tutorial - generate data for training embeddings and reranking models (#893) * Add initial outline tutorial * Add section on data quality evaluation * Add conslusion * Update pipeline_samples structure for adding tutorials in a similar way as Argilla docs * Update new structure tutorials * Update title * Update to use Free serverless Inference API * Process comments from code review * Remove sections from header * Updated formatting examples * Add grid arror on new line * update phrasing * update phrasing --- .../advanced/structured_generation.md | 4 +- .../examples/benchmarking_with_distilabel.md | 20 + .../pipeline_samples/examples/index.md | 108 +-- .../examples/llama_cpp_with_outlines.md | 20 + .../examples/mistralai_with_instructor.md | 38 + .../sections/pipeline_samples/papers/deita.md | 6 +- .../sections/pipeline_samples/papers/index.md | 3 - .../papers/instruction_backtranslation.md | 12 +- .../pipeline_samples/papers/prometheus.md | 6 +- .../pipeline_samples/papers/ultrafeedback.md | 8 +- .../tutorials/GenerateSentencePair.ipynb | 694 ++++++++++++++++++ mkdocs.yml | 10 +- src/distilabel/llms/vllm.py | 2 +- 13 files changed, 862 insertions(+), 69 deletions(-) create mode 100644 docs/sections/pipeline_samples/examples/benchmarking_with_distilabel.md create mode 100644 docs/sections/pipeline_samples/examples/llama_cpp_with_outlines.md create mode 100644 docs/sections/pipeline_samples/examples/mistralai_with_instructor.md delete mode 100644 docs/sections/pipeline_samples/papers/index.md create mode 100644 docs/sections/pipeline_samples/tutorials/GenerateSentencePair.ipynb diff --git a/docs/sections/how_to_guides/advanced/structured_generation.md b/docs/sections/how_to_guides/advanced/structured_generation.md index d3e750aa93..02b8cc8e4a 100644 --- a/docs/sections/how_to_guides/advanced/structured_generation.md +++ b/docs/sections/how_to_guides/advanced/structured_generation.md @@ -111,7 +111,7 @@ These were some simple examples, but one can see the options this opens. !!! Tip A full pipeline example can be seen in the following script: - [`examples/structured_generation_with_outlines.py`](../../pipeline_samples/examples/index.md#llamacpp-with-outlines) + [`examples/structured_generation_with_outlines.py`](../../pipeline_samples/examples/llama_cpp_with_outlines.md) [^1]: You can check the variable type by importing it from: @@ -189,7 +189,7 @@ We get back a Python dictionary (formatted as a string) that we can parse using !!! Tip A full pipeline example can be seen in the following script: - [`examples/structured_generation_with_instructor.py`](../../pipeline_samples/examples/index.md#mistralai-with-instructor) + [`examples/structured_generation_with_instructor.py`](../../pipeline_samples/examples/mistralai_with_instructor.md) ## OpenAI JSON diff --git a/docs/sections/pipeline_samples/examples/benchmarking_with_distilabel.md b/docs/sections/pipeline_samples/examples/benchmarking_with_distilabel.md new file mode 100644 index 0000000000..f1f18b415f --- /dev/null +++ b/docs/sections/pipeline_samples/examples/benchmarking_with_distilabel.md @@ -0,0 +1,20 @@ +--- +hide: toc +--- +# [Benchmarking with `distilabel`: Arena Hard](#benchmarking-with-distilabel-arena-hard) + +Benchmark LLMs with `distilabel`: reproducing the Arena Hard benchmark. 
+ +The script below first defines both the `ArenaHard` and the `ArenaHardResults` tasks, so as to generate responses for a given collection of prompts/questions with up to two LLMs, and then calculate the results as per the original implementation, respectively. Additionally, the second part of the example builds a `Pipeline` to run the generation on top of the prompts with `InferenceEndpointsLLM` while streaming the rest of the generations from a pre-computed set of GPT-4 generations, and then evaluate one against the other with `OpenAILLM` generating an alternate response, a comparison between the responses, and a result as A>>B, A>B, B>A, B>>A, or tie. + +To run this example you will first need to install the Arena Hard optional dependencies, being `pandas`, `scikit-learn`, and `numpy`. + +??? Run + + ```python + python examples/arena_hard.py + ``` + +```python title="arena_hard.py" +--8<-- "examples/arena_hard.py" +``` \ No newline at end of file diff --git a/docs/sections/pipeline_samples/examples/index.md b/docs/sections/pipeline_samples/examples/index.md index 68e25fc888..2638c36716 100644 --- a/docs/sections/pipeline_samples/examples/index.md +++ b/docs/sections/pipeline_samples/examples/index.md @@ -1,78 +1,96 @@ -# Examples +--- +hide: toc +--- +# Pipeline Samples -This section contains different example pipelines that showcase different tasks, maybe you can take inspiration from them. +- **Tutorials** provide detailed step-by-step explanations and the code used for end-to-end workflows. +- **Paper implementations** provide reproductions of fundamental papers in the synthetic data domain. +- **Examples** don't provide explenations but simply show code for different tasks. -### [llama.cpp with `outlines`](#llamacpp-with-outlines) +## Tutorials -Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `distilabel`. +
-??? Example "See example" +- __Retrieval and reranking models__ - This script makes use of [`LlamaCppLLM`][distilabel.llms.llamacpp.LlamaCppLLM] and the structured output capabilities thanks to [`outlines`](https://outlines-dev.github.io/outlines/welcome/) to generate RPG characters that adhere to a JSON schema. + --- - It makes use of a local model which can be downloaded using curl (explained in the script itself), and can be exchanged with other `LLMs` like [`vLLM`][distilabel.llms.vllm.vLLM]. + Learn about synthetic data generation for fine-tuning custom retrieval and reranking models. - ??? Run + [:octicons-arrow-right-24: Tutorial](../tutorials/GenerateSentencePair.ipynb) - ```python - python examples/structured_generation_with_outlines.py - ``` +
- ```python title="structured_generation_with_outlines.py" - --8<-- "examples/structured_generation_with_outlines.py" - ``` +## Paper Implementations +
-### [MistralAI with `instructor`](#mistralai-with-instructor) +- __DEITA__ -Answer instructions with knowledge graphs defined as `pydantic.BaseModel` objects using `instructor` in `distilabel`. + --- -??? Example "See example" + Learn about prompt, response tuning for complexity and quality and LLMs as judges for automatic data selection. - This script makes use of [`MistralLLM`][distilabel.llms.mistral.MistralLLM] and the structured output capabilities thanks to [`instructor`](https://python.useinstructor.com/) to generate knowledge graphs from complex topics. + [:octicons-arrow-right-24: Paper](../papers/deita.md) - This example is translated from this [awesome example](https://python.useinstructor.com/examples/knowledge_graph/) from `instructor` cookbook. +- __Instruction Backtranslation__ - ??? Run + --- - ```python - python examples/structured_generation_with_instructor.py - ``` + Learn about automatically labeling human-written text with corresponding instructions. - ```python title="structured_generation_with_instructor.py" - --8<-- "examples/structured_generation_with_instructor.py" - ``` + [:octicons-arrow-right-24: Paper](../papers/instruction_backtranslation.md) - ??? "Visualizing the graphs" +- __Prometheus 2__ - Want to see how to visualize the graphs? You can test it using the following script. Generate some samples on your own and take a look: + --- - !!! NOTE + Learn about using open-source models as judges for direct assessment and pair-wise ranking. - This example uses graphviz to render the graph, you can install with `pip` in the following way: + [:octicons-arrow-right-24: Paper](../papers/prometheus.md) - ```console - pip install graphviz - ``` +- __UltraFeedback__ - ```python - python examples/draw_kg.py 2 # You can pass 0,1,2 to visualize each of the samples. - ``` + --- - ![Knowledge graph figure](../../../assets/images/sections/examples/knowledge-graph-example.png) + Learn about a large-scale, fine-grained, diverse preference dataset, used for training powerful reward and critic models. + [:octicons-arrow-right-24: Paper](../papers/ultrafeedback.md) -### [Benchmarking with `distilabel`: Arena Hard](#benchmarking-with-distilabel-arena-hard) +
-Benchmark LLMs with `distilabel`: reproducing the Arena Hard benchmark. +## Examples + +
+- __Benchmarking with distilabel__
+
+    ---
+
+    Learn about reproducing the Arena Hard benchmark with distilabel.
+
+    [:octicons-arrow-right-24: Example](./benchmarking_with_distilabel.md)
+
+- __llama.cpp with outlines__
+
+    ---
+
+    Learn about generating RPG characters following a pydantic.BaseModel with outlines in distilabel.
+
+    [:octicons-arrow-right-24: Example](./llama_cpp_with_outlines.md)
+
+- __MistralAI with instructor__
+
+    ---
+
+    Learn about answering instructions with knowledge graphs defined as pydantic.BaseModel objects using instructor in distilabel.
+
+    [:octicons-arrow-right-24: Example](./mistralai_with_instructor.md)
+
+
+
-??? Example "See example" - The script below first defines both the `ArenaHard` and the `ArenaHardResults` tasks, so as to generate responses for a given collection of prompts/questions with up to two LLMs, and then calculate the results as per the original implementation, respectively. Additionally, the second part of the example builds a `Pipeline` to run the generation on top of the prompts with `InferenceEndpointsLLM` while streaming the rest of the generations from a pre-computed set of GPT-4 generations, and then evaluate one against the other with `OpenAILLM` generating an alternate response, a comparison between the responses, and a result as A>>B, A>B, B>A, B>>A, or tie. - To run this example you will first need to install the Arena Hard optional dependencies, being `pandas`, `scikit-learn`, and `numpy`. - ```python title="arena_hard.py" - --8<-- "examples/arena_hard.py" - ``` diff --git a/docs/sections/pipeline_samples/examples/llama_cpp_with_outlines.md b/docs/sections/pipeline_samples/examples/llama_cpp_with_outlines.md new file mode 100644 index 0000000000..38ac6bb6fe --- /dev/null +++ b/docs/sections/pipeline_samples/examples/llama_cpp_with_outlines.md @@ -0,0 +1,20 @@ +--- +hide: toc +--- +# [llama.cpp with `outlines`](#llamacpp-with-outlines) + +Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `distilabel`. + +This script makes use of [`LlamaCppLLM`][distilabel.llms.llamacpp.LlamaCppLLM] and the structured output capabilities thanks to [`outlines`](https://outlines-dev.github.io/outlines/welcome/) to generate RPG characters that adhere to a JSON schema. + +It makes use of a local model which can be downloaded using curl (explained in the script itself), and can be exchanged with other `LLMs` like [`vLLM`][distilabel.llms.vllm.vLLM]. + +??? Run + + ```python + python examples/structured_generation_with_outlines.py + ``` + +```python title="structured_generation_with_outlines.py" +--8<-- "examples/structured_generation_with_outlines.py" +``` \ No newline at end of file diff --git a/docs/sections/pipeline_samples/examples/mistralai_with_instructor.md b/docs/sections/pipeline_samples/examples/mistralai_with_instructor.md new file mode 100644 index 0000000000..3b39d51e31 --- /dev/null +++ b/docs/sections/pipeline_samples/examples/mistralai_with_instructor.md @@ -0,0 +1,38 @@ +--- +hide: toc +--- +# [MistralAI with `instructor`](#mistralai-with-instructor) + +Answer instructions with knowledge graphs defined as `pydantic.BaseModel` objects using `instructor` in `distilabel`. + +This script makes use of [`MistralLLM`][distilabel.llms.mistral.MistralLLM] and the structured output capabilities thanks to [`instructor`](https://python.useinstructor.com/) to generate knowledge graphs from complex topics. + +This example is translated from this [awesome example](https://python.useinstructor.com/examples/knowledge_graph/) from `instructor` cookbook. + +??? Run + + ```python + python examples/structured_generation_with_instructor.py + ``` + +```python title="structured_generation_with_instructor.py" +--8<-- "examples/structured_generation_with_instructor.py" +``` + +??? "Visualizing the graphs" + + Want to see how to visualize the graphs? You can test it using the following script. Generate some samples on your own and take a look: + + !!! 
NOTE
+
+        This example uses graphviz to render the graph. You can install it with `pip` in the following way:
+
+        ```console
+        pip install graphviz
+        ```
+
+    ```python
+    python examples/draw_kg.py 2 # You can pass 0,1,2 to visualize each of the samples.
+    ```
+
+    ![Knowledge graph figure](../../../assets/images/sections/examples/knowledge-graph-example.png)
\ No newline at end of file
diff --git a/docs/sections/pipeline_samples/papers/deita.md b/docs/sections/pipeline_samples/papers/deita.md
index 5c9036d756..d3ff1da59b 100644
--- a/docs/sections/pipeline_samples/papers/deita.md
+++ b/docs/sections/pipeline_samples/papers/deita.md
@@ -1,10 +1,10 @@
 # DEITA
 
-DEITA (Data-Efficient Instruction Tuning for Alignment) studies an automatic data selection process by first quantifying the data quality based on complexity, quality and diversity. And second, selecting across the best potential combination from an open-source dataset that would fit into the budget you allocate to tune your own LLM.
+[DEITA (Data-Efficient Instruction Tuning for Alignment)](https://arxiv.org/abs/2312.15685) studies an automatic data selection process by first quantifying the data quality based on complexity, quality and diversity. Second, it selects the best potential combination from an open-source dataset that would fit into the budget you allocate to tune your own LLM.
 
-In most setting we cannot allocate unlimited resources for instruction-tuning LLMs. Therefore, the DEITA authors investigated how to select qualitative data for instruction-tuning based on a principle of fewer high quality samples. Liu et al. tackle the issue of first defining good data and second identifying it to respect an initial budget to instruct-tune your LLM.
+In most settings we cannot allocate unlimited resources for instruction-tuning LLMs. Therefore, the DEITA authors investigated how to select qualitative data for instruction tuning based on the principle of fewer high-quality samples. Liu et al. tackle the issue of first defining good data and second identifying it to respect an initial budget to instruct-tune your LLM.
 
-The strategy utilizes **LLMs to replace human effort in time-intensive data quality tasks on instruction tuning datasets**. DEITA introduces a way to measure data quality across three critical dimensions: complexity, quality and diversity.
+The strategy utilizes **LLMs to replace human effort in time-intensive data quality tasks on instruction-tuning datasets**. DEITA introduces a way to measure data quality across three critical dimensions: complexity, quality and diversity.
 
 ![DEITA pipeline overview](../../../assets/tutorials-assets/deita/overview.png)
 
diff --git a/docs/sections/pipeline_samples/papers/index.md b/docs/sections/pipeline_samples/papers/index.md
deleted file mode 100644
index 7fed3da03a..0000000000
--- a/docs/sections/pipeline_samples/papers/index.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# Paper Implementations
-
-Contains some implementations for synthetic data generation papers, using `distilabel`, providing reproducible pipelines so that anyone can play around with those approaches and customize that to their needs. We strongly believe that better data leads to better models, and synthetic data has proven to be really effective towards improving LLMs, so we aim to bridge the gap between research and practice by providing these implementations. 
diff --git a/docs/sections/pipeline_samples/papers/instruction_backtranslation.md b/docs/sections/pipeline_samples/papers/instruction_backtranslation.md
index 8434742984..588fc50480 100644
--- a/docs/sections/pipeline_samples/papers/instruction_backtranslation.md
+++ b/docs/sections/pipeline_samples/papers/instruction_backtranslation.md
@@ -1,18 +1,18 @@
 # Instruction Backtranslation
 
-["Self Alignment with Instruction Backtranslation"](https://arxiv.org/abs/2308.06259) presents a scalable method to build a high quality instruction following language model by automatically labelling human-written text with corresponding instructions. Their approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model.
+["Self Alignment with Instruction Backtranslation"](https://arxiv.org/abs/2308.06259) presents a scalable method to build a high-quality instruction-following language model by automatically labeling human-written text with corresponding instructions. Their approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high-quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model.
 
-Their self-training approach assumes access to a base language model, a small amount of seed data, and a collection of unlabelled examples, e.g. a web corpus. The unlabelled data is a large, diverse set of human-written documents which includes writing about all manner of topics humans are interested in – but crucially is not paired with instructions.
+Their self-training approach assumes access to a base language model, a small amount of seed data, and a collection of unlabelled examples, e.g. a web corpus. The unlabelled data is a large, diverse set of human-written documents that includes writing about all manner of topics humans are interested in – but crucially is not paired with instructions.
 
-A first key assumption is that there exists some subset of this very large human-written text that would be suitable as gold generations for some user instructions. A second key assumption is that they can predict instructions for these candidate gold answers that can be used as high quality example pairs to train an instruction following model.
+A first key assumption is that there exists some subset of this very large human-written text that would be suitable as gold generations for some user instructions. A second key assumption is that they can predict instructions for these candidate gold answers that can be used as high-quality example pairs to train an instruction-following model.
 
-Their overall process, called instruction backtranslation performs two core steps:
+Their overall process, called instruction backtranslation, performs two core steps:
 
 1. Self-augment: Generate instructions for unlabelled data, i.e. the web corpus, to produce candidate training data of (instruction, output) pairs for instruction tuning.
 
-2. Self-curate: Self-select high quality demonstration examples as training data to finetune the base model to follow instructions. This approach is done iteratively where a better intermediate instruction-following model can improve on selecting data for finetuning in the next iteration.
+2. 
Self-curate: Self-select high quality demonstration examples as training data to finetune the base model to follow instructions. This approach is done iteratively where a better intermediate instruction-following model can improve on selecting data for finetuning in the next iteration. +2. Self-curate: Self-select high-quality demonstration examples as training data to finetune the base model to follow instructions. This approach is done iteratively where a better intermediate instruction-following model can improve on selecting data for finetuning in the next iteration. -This replication covers the self-curation step i.e. the second / latter step as mentioned above, so as to be able to use the proposed prompting approach to rate the quality of the generated text, which can either be synthetically generated or real human-written text. +This replication covers the self-curation step i.e. the second/latter step as mentioned above, so as to be able to use the proposed prompting approach to rate the quality of the generated text, which can either be synthetically generated or real human-written text. ### Replication diff --git a/docs/sections/pipeline_samples/papers/prometheus.md b/docs/sections/pipeline_samples/papers/prometheus.md index 7f7b1d19d5..c86ed39309 100644 --- a/docs/sections/pipeline_samples/papers/prometheus.md +++ b/docs/sections/pipeline_samples/papers/prometheus.md @@ -1,20 +1,20 @@ # Prometheus 2 -["Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models"](https://arxiv.org/pdf/2405.01535) presents Prometheus 2, a new and more powerful evaluator LLM compared to Prometheus (its predecessor) presented in ["Prometheus: Inducing Fine-grained Evaluation Capability in Language Models"](https://arxiv.org/abs/2310.08491); since GPT-4, as well as other proprietary LLMs, are commonly used to asses the quality of the responses for various LLMs, but there are concerns about transparency, controllability, and affordability, that motivate the need of open-source LLMs specialized in evaluations. +["Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models"](https://arxiv.org/pdf/2405.01535) presents Prometheus 2, a new and more powerful evaluator LLM compared to Prometheus (its predecessor) presented in ["Prometheus: Inducing Fine-grained Evaluation Capability in Language Models"](https://arxiv.org/abs/2310.08491); since GPT-4, as well as other proprietary LLMs, are commonly used to assess the quality of the responses for various LLMs, but there are concerns about transparency, controllability, and affordability, that motivate the need of open-source LLMs specialized in evaluations. Existing open evaluator LMs exhibit critical shortcomings: 1. They issue scores that significantly diverge from those assigned by humans. 2. They lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. -Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. Prometheus 2 is capable of processing both direct assessment and pair-wise ranking formats grouped with a user-defined evaluation criteria. +Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. 
Prometheus 2 is capable of processing both direct assessment and pair-wise ranking formats grouped with user-defined evaluation criteria. Prometheus 2 released two variants: - [`prometheus-eval/prometheus-7b-v2.0`](https://hf.co/prometheus-eval/prometheus-7b-v2.0): fine-tuned on top of [`mistralai/Mistral-7B-Instruct-v0.2`](https://hf.co/mistralai/Mistral-7B-Instruct-v0.2) - [`prometheus-eval/prometheus-8x7b-v2.0`](https://hf.co/prometheus-eval/prometheus-8x7b-v2.0): fine-tuned on top of [`mistralai/Mixtral-8x7B-Instruct-v0.1`](https://hf.co/mistralai/Mixtral-8x7B-Instruct-v0.1) -Both models have been fine-tuned for both direct assessment and pairwise ranking tasks i.e. assessing the quality of a single isolated response for a given instruction with or without a reference answer, and assessing the quality of one response against another one for a given instruction with or without a reference answer, respectively. +Both models have been fine-tuned for both direct assessment and pairwise ranking tasks i.e. assessing the quality of a single isolated response for a given instruction with or without a reference answer and assessing the quality of one response against another one for a given instruction with or without a reference answer, respectively. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Their models, code, and data are all publicly available at [`prometheus-eval/prometheus-eval`](https://github.com/prometheus-eval/prometheus-eval). diff --git a/docs/sections/pipeline_samples/papers/ultrafeedback.md b/docs/sections/pipeline_samples/papers/ultrafeedback.md index 704309e263..afa21a5717 100644 --- a/docs/sections/pipeline_samples/papers/ultrafeedback.md +++ b/docs/sections/pipeline_samples/papers/ultrafeedback.md @@ -6,13 +6,13 @@ UltraFeedback collects about 64k prompts from diverse resources (including Ultra To collect high-quality preference and textual feedback, they design a fine-grained annotation instruction, which contains four different aspects, namely instruction-following, truthfulness, honesty and helpfulness (even though within the paper they also mention a fifth one named verbalized calibration). Finally, GPT-4 is used to generate the ratings for the generated responses to the given prompt using the previously mentioned aspects. -### Replication +## Replication To replicate the paper we will be using `distilabel` and a smaller dataset created by the Hugging Face H4 team named [`HuggingFaceH4/instruction-dataset`](https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset) for testing purposes. Also for testing purposes we will just show how to evaluate the generated responses for a given prompt using a new global aspect named `overall-rating` defined by Argilla, that computes the average of the four aspects, so as to reduce number of requests to be sent to OpenAI, but note that all the aspects are implemented within `distilabel` and can be used instead for a more faithful reproduction. 
Besides that we will generate three responses for each instruction using three LLMs selected from a pool of six: [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), [`argilla/notus-7b-v1`](https://huggingface.co/argilla/notus-7b-v1), [`google/gemma-1.1-7b-it`](https://huggingface.co/google/gemma-1.1-7b-it), [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), [`HuggingFaceH4/zephyr-7b-gemma-v0.1`](https://huggingface.co/HuggingFaceH4/zephyr-7b-gemma-v0.1) and [`mlabonne/UltraMerge-7B`](https://huggingface.co/mlabonne/UltraMerge-7B). -#### Installation +### Installation To replicate UltraFeedback one will need to install `distilabel` as it follows: @@ -22,7 +22,7 @@ pip install "distilabel[argilla,openai,vllm]>=1.0.0" And since we will be using `vllm` we will need to use a VM with at least 6 NVIDIA GPUs with at least 16GB of memory each to run the text generation, and set the `OPENAI_API_KEY` environment variable value. -#### Building blocks +### Building blocks - [`LoadDataFromHub`][distilabel.steps.LoadDataFromHub]: Generator Step to load a dataset from the Hugging Face Hub. - [`sample_n_steps`][distilabel.pipeline.sample_n_steps]: Function to create a `routing_batch_function` that samples `n` downstream steps for each batch generated by the upstream step. This is the key to replicate the LLM pooling mechanism described in the paper. @@ -34,7 +34,7 @@ And since we will be using `vllm` we will need to use a VM with at least 6 NVIDI - [`KeepColumns`][distilabel.steps.KeepColumns]: Task to keep the desired columns while removing the not needed ones, as well as defining the order for those. - (optional) [`PreferenceToArgilla`][distilabel.steps.PreferenceToArgilla]: Task to optionally push the generated dataset to Argilla to do some further analysis and human annotation. -#### Code +### Code As mentioned before, we will put the previously mentioned building blocks together to replicate UltraFeedback. diff --git a/docs/sections/pipeline_samples/tutorials/GenerateSentencePair.ipynb b/docs/sections/pipeline_samples/tutorials/GenerateSentencePair.ipynb new file mode 100644 index 0000000000..7869cef9e1 --- /dev/null +++ b/docs/sections/pipeline_samples/tutorials/GenerateSentencePair.ipynb @@ -0,0 +1,694 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Synthetic data generation for fine-tuning custom retrieval and reranking models\n", + "\n", + "- **Goal**: Bootstrap, optimize and maintain your embedding models and rerankers through synthetic data generation and human feedback.\n", + "- **Libraries**: [argilla](https://github.com/argilla-io/argilla), [hf-inference-endpoints](https://github.com/huggingface/huggingface_hub), [sentence-transformers](https://github.com/UKPLab/sentence-transformers)\n", + "- **Components**: [LoadDataFromHub](https://distilabel.argilla.io/latest/components-gallery/steps/loaddatafromhub/), [GenerateSentencePair](https://distilabel.argilla.io/latest/components-gallery/tasks/generatesentencepair/), [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Getting started\n", + "\n", + "### Install the dependencies\n", + "\n", + "To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. 
We will be using **the free but rate-limited Hugging Face serverless Inference API** for this tutorial, so we need to install it as an extra distilabel dependency. You can install them by running the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"distilabel[hf-inference-endpoints]\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"sentence-transformers>=3,<4\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's make the needed imports:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from distilabel.llms.huggingface import InferenceEndpointsLLM\n",
    "from distilabel.pipeline import Pipeline\n",
    "from distilabel.steps.tasks import GenerateSentencePair\n",
    "from distilabel.steps import LoadDataFromHub\n",
    "\n",
    "from sentence_transformers import SentenceTransformer, CrossEncoder\n",
    "import torch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### (optional) Deploy Argilla\n",
    "\n",
    "You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from a lack of data quality, so we do recommend looking at your data. If you already have deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following [this guide](https://docs.argilla.io/latest/getting_started/quickstart/). \n",
    "\n",
    "Along with that, you will need to install Argilla as a distilabel extra."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"distilabel[argilla, hf-inference-endpoints]\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's make the extra needed imports:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "import argilla as rg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The dataset\n",
    "\n",
    "Before starting any project, it is always important to look at your data. Our data is publicly available [on the Hugging Face Hub](https://huggingface.co/datasets/plaguss/argilla_sdk_docs_raw_unstructured?row=0) so we can have a quick look through [their dataset viewer within an embedded iFrame](https://huggingface.co/docs/hub/datasets-viewer-embed). \n",
    "\n",
    "\n",
    "\n",
    "As we can see, our dataset contains a column called `chunks`, which was obtained from the Argilla docs. Normally, you would need to download and chunk the data but we will not cover that in this tutorial. To read a full explanation for how this dataset was generated, please refer to [How we leveraged distilabel to create an Argilla 2.0 Chatbot](https://huggingface.co/blog/argilla-chatbot#downloading-and-chunking-data).\n",
    "\n",
    "Alternatively, we can load the entire dataset to disk with `datasets.load_dataset`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Synthetic data generation\n",
    "\n",
    "The [`GenerateSentencePair`](https://distilabel.argilla.io/latest/components-gallery/tasks/generatesentencepair/) component from `distilabel` can be used to generate training datasets for embedding models. 
\n",
    "\n",
    "It is a pre-defined `Task` that, given an `anchor` sentence, generates data for a specific `action`. Supported actions are: `\"paraphrase\", \"semantically-similar\", \"query\", \"answer\"`. In our case the `chunks` column corresponds to the `anchor`. This means we will use `query` to generate potential queries for fine-tuning a retrieval model and that we will use `semantically-similar` to generate texts that are similar to the initial anchor for fine-tuning a reranking model.\n",
    "\n",
    "We will set `triplet=True` in order to generate both positive and negative examples, which should help the model generalize better during fine-tuning, and we will set `hard_negative=True` to generate more challenging examples that are closer to the anchor and discussed topics.\n",
    "\n",
    "Lastly, we can seed the LLM with `context` to generate more relevant examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "context = (\n",
    "\"\"\"\n",
    "The text is a chunk from technical Python SDK documentation of Argilla.\n",
    "Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets.\n",
    "Along with prose explanations, the text chunk may include code snippets and Python references.\n",
    "\"\"\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Retrieval\n",
    "\n",
    "For retrieval, we will thus generate queries that are similar to the `chunks` column. We will use the `query` action to generate potential queries for fine-tuning a retrieval model.\n",
    "\n",
    "```python\n",
    "generate_sentence_pair = GenerateSentencePair(\n",
    "    triplet=True, \n",
    "    hard_negative=True,\n",
    "    action=\"query\",\n",
    "    llm=llm,\n",
    "    input_batch_size=10,\n",
    "    context=context,\n",
    ")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reranking\n",
    "\n",
    "For reranking, we will generate texts that are similar to the initial anchor. We will use the `semantically-similar` action to generate texts that are similar to the initial anchor for fine-tuning a reranking model. In this case, we set `hard_negative=False` to generate more diverse and potentially wrong examples, which can be used as negative examples for similarity fine-tuning because [rerankers cannot be fine-tuned using triplets](https://github.com/UKPLab/sentence-transformers/issues/2366).\n",
    "\n",
    "```python\n",
    "generate_sentence_pair = GenerateSentencePair(\n",
    "    triplet=True,\n",
    "    hard_negative=False,\n",
    "    action=\"semantically-similar\",\n",
    "    llm=llm,\n",
    "    input_batch_size=10,\n",
    "    context=context,\n",
    ")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Combined pipeline\n",
    "\n",
    "We will now use the `GenerateSentencePair` task to generate synthetic data for both retrieval and reranking models in a single pipeline. Note that we map the `chunks` column to the `anchor` argument."
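,
    "\n",
    "\n",
    "The serverless Inference API used below typically requires a Hugging Face token. As a minimal sketch (assuming the token is exposed through the `HF_TOKEN` environment variable), you could authenticate like this before running the pipeline:\n",
    "\n",
    "```python\n",
    "import os\n",
    "\n",
    "from huggingface_hub import login\n",
    "\n",
    "# log in to the Hugging Face Hub so the serverless Inference API calls are authenticated\n",
    "login(token=os.environ[\"HF_TOKEN\"])\n",
    "```"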
+ ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "llm = InferenceEndpointsLLM(model_id=\"mistralai/Mistral-7B-Instruct-v0.2\")\n", + "\n", + "with Pipeline(name=\"generate\") as pipeline:\n", + " load_dataset = LoadDataFromHub(\n", + " num_examples=15,\n", + " output_mappings={\"chunks\": \"anchor\"},\n", + " )\n", + " generate_retrieval_pairs = GenerateSentencePair(\n", + " name=\"generate_retrieval_pairs\",\n", + " triplet=True,\n", + " hard_negative=True,\n", + " action=\"query\",\n", + " llm=llm,\n", + " input_batch_size=10,\n", + " context=context,\n", + " )\n", + " generate_reranking_pairs = GenerateSentencePair(\n", + " name=\"generate_reranking_pairs\",\n", + " triplet=True,\n", + " hard_negative=False, # to potentially generate non-relevant pairs\n", + " action=\"semantically-similar\",\n", + " llm=llm,\n", + " input_batch_size=10,\n", + " context=context,\n", + " )\n", + "\n", + " load_dataset >> [generate_retrieval_pairs, generate_reranking_pairs]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we can execute this using `pipeline.run`. We will provide some `parameters` to specific components within our pipeline." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "generation_kwargs = {\n", + " \"llm\": {\n", + " \"generation_kwargs\": {\n", + " \"temperature\": 0.7,\n", + " \"max_new_tokens\": 512,\n", + " }\n", + " }\n", + "}\n", + "\n", + "distiset = pipeline.run( #\n", + " parameters={\n", + " load_dataset.name: {\n", + " \"repo_id\": \"plaguss/argilla_sdk_docs_raw_unstructured\",\n", + " \"split\": \"train\",\n", + " },\n", + " generate_retrieval_pairs.name: generation_kwargs,\n", + " generate_reranking_pairs.name: generation_kwargs,\n", + " },\n", + " use_cache=False, # comment out for demo\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Distiset({\n", + " generate_reranking_pairs: DatasetDict({\n", + " train: Dataset({\n", + " features: ['filename', 'anchor', 'repo_name', 'positive', 'negative', 'distilabel_metadata', 'model_name'],\n", + " num_rows: 15\n", + " })\n", + " })\n", + " generate_retrieval_pairs: DatasetDict({\n", + " train: Dataset({\n", + " features: ['filename', 'anchor', 'repo_name', 'positive', 'negative', 'distilabel_metadata', 'model_name'],\n", + " num_rows: 15\n", + " })\n", + " })\n", + "})" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "distiset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Data generation can be a expensive, so it is recommended to store the data somewhere. For now, we will store it on the Hugging Face Hub, using our `push_to_hub` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "distiset.push_to_hub(\"my-org/my-dataset-name\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We have got 2 different leaf/end nodes, therefore we've got a distil configurations we can access, one for the retrieval data, and one for the reranking data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'filename': 'argilla-python/docs/index.md',\n", + " 'anchor': 'description: Argilla is a collaboration platform for AI engineers and domain experts that require high-quality outputs, full data ownership, and overall efficiency.\\nhide: navigation\\n\\nWelcome to Argilla\\n\\nArgilla is a collaboration platform for AI engineers and domain experts that require high-quality outputs, full data ownership, and overall efficiency.',\n", + " 'repo_name': 'argilla-io/argilla-python',\n", + " 'positive': 'description: Argilla is a collaboration tool designed for AI engineers and domain experts who need high-quality outputs, full data control, and maximum efficiency.\\nhide: navigation\\n\\nWelcome to Argilla\\n\\nArgilla is a collaboration tool designed for AI engineers and domain experts who need high-quality outputs, full data control, and maximum efficiency.',\n", + " 'negative': 'description: Argilla is a platform for marketing professionals and sales teams that prioritizes customer engagement, brand visibility, and revenue growth.\\nhide: navigation\\n\\nWelcome to Argilla\\n\\nArgilla is a platform for marketing professionals and sales teams that prioritizes customer engagement, brand visibility, and revenue growth.',\n", + " 'distilabel_metadata': {'raw_output_generate_reranking_pairs': '## Positive\\n\\ndescription: Argilla is a collaboration tool designed for AI engineers and domain experts who need high-quality outputs, full data control, and maximum efficiency.\\nhide: navigation\\n\\nWelcome to Argilla\\n\\nArgilla is a collaboration tool designed for AI engineers and domain experts who need high-quality outputs, full data control, and maximum efficiency.\\n\\n## Negative\\n\\ndescription: Argilla is a platform for marketing professionals and sales teams that prioritizes customer engagement, brand visibility, and revenue growth.\\nhide: navigation\\n\\nWelcome to Argilla\\n\\nArgilla is a platform for marketing professionals and sales teams that prioritizes customer engagement, brand visibility, and revenue growth.'},\n", + " 'model_name': 'gpt-4o'}" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "distiset[\"generate_reranking_pairs\"][\"train\"][0]" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'filename': 'argilla-python/docs/index.md',\n", + " 'anchor': 'description: Argilla is a collaboration platform for AI engineers and domain experts that require high-quality outputs, full data ownership, and overall efficiency.\\nhide: navigation\\n\\nWelcome to Argilla\\n\\nArgilla is a collaboration platform for AI engineers and domain experts that require high-quality outputs, full data ownership, and overall efficiency.',\n", + " 'repo_name': 'argilla-io/argilla-python',\n", + " 'positive': 'What is Argilla and how does it benefit AI engineers and domain experts?',\n", + " 'negative': \"How does Argilla's interface compare with other project management tools?\",\n", + " 'distilabel_metadata': {'raw_output_generate_retrieval_pairs': \"## Positive\\n\\nWhat is Argilla and how does it benefit AI engineers and domain experts?\\n\\n## Negative\\n\\nHow does Argilla's interface compare with other project management tools?\"},\n", + " 'model_name': 'gpt-4o'}" + ] + }, + "execution_count": 20, + "metadata": {}, + 
"output_type": "execute_result" + } + ], + "source": [ + "distiset[\"generate_retrieval_pairs\"][\"train\"][0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Looking at these initial examples, we can see they nicely capture the essence of the `chunks` column but we will need to evaluate the quality of the data a bit more before we can use it for fine-tuning." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data quality evaluation \n", + "\n", + "Data is never as clean as it can be and this also holds for synthetically generated data too, therefore, it is always good to spent some time and look at your data.\n", + "\n", + "### Feature engineering\n", + "\n", + "In order to evaluate the quality of our data we will use features of the models that we intent to fine-tune as proxy for data quality. We can then use these features to filter out the best examples.\n", + "\n", + "In order to choose a good default model, we will use the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard). We want to optimize for size and speed, so we will set model size `<100M` and then filter for `Retrieval` and `Reranking` based on the highest average score, resulting in [Snowflake/snowflake-arctic-embed-s](https://huggingface.co/Snowflake/snowflake-arctic-embed-s) and [sentence-transformers/all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) respectively.\n", + "\n", + "\n", + "\n", + "#### Retrieval\n", + "\n", + "For retrieval, we will compute similarities for the current embeddings of `anchor-positive`, `positive-negative` and `anchor-negative` pairs. We assume that an overlap of these similarities will cause the model to have difficulties generalizing and therefore we can use these features to evaluate the quality of our data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "model_id = \"Snowflake/snowflake-arctic-embed-m\" # Hugging Face model ID\n", + "\n", + "model_retrieval = SentenceTransformer(\n", + " model_id, device=\"cuda\" if torch.cuda.is_available() else \"cpu\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we will encode the generated text pairs and compute the similarities. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.metrics.pairwise import cosine_similarity\n", + "\n", + "def get_embeddings(texts):\n", + " vectors = model_retrieval.encode(texts)\n", + " return [vector.tolist() for vector in vectors]\n", + "\n", + "\n", + "def get_similarities(vector_batch_a, vector_batch_b):\n", + " similarities = []\n", + " for vector_a, vector_b in zip(vector_batch_a, vector_batch_b):\n", + " similarity = cosine_similarity([vector_a], [vector_b])[0][0]\n", + " similarities.append(similarity)\n", + " return similarities\n", + "\n", + "def format_data_retriever(batch):# -> Any:\n", + " batch[\"anchor-vector\"] = get_embeddings(batch[\"anchor\"])\n", + " batch[\"positive-vector\"] = get_embeddings(batch[\"positive\"])\n", + " batch[\"negative-vector\"] = get_embeddings(batch[\"negative\"]) \n", + " batch[\"similarity-positive-negative\"] = get_similarities(batch[\"positive-vector\"], batch[\"negative-vector\"])\n", + " batch[\"similarity-anchor-positive\"] = get_similarities(batch[\"anchor-vector\"], batch[\"positive-vector\"])\n", + " batch[\"similarity-anchor-negative\"] = get_similarities(batch[\"anchor-vector\"], batch[\"negative-vector\"])\n", + " return batch\n", + "\n", + "dataset_generate_retrieval_pairs = distiset[\"generate_retrieval_pairs\"][\"train\"].map(format_data_retriever, batched=True, batch_size=250)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "#### Reranking\n", + "\n", + "For reranking, we will compute the compute the relevance scores from an existing reranker model for `anchor-positive`, `positive-negative` and `anchor-negative` pais and make a similar assumption as for the retrieval model." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/all-MiniLM-L12-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']\n", + "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" + ] + } + ], + "source": [ + "model_id = \"sentence-transformers/all-MiniLM-L12-v2\"\n", + "\n", + "model = CrossEncoder(model_id)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we will compute the similarity for the generated text pairs using the reranker. On top of that, we will compute an `anchor-vector` to allow for doing semantic search." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def format_data_retriever(batch):# -> Any:\n", + " batch[\"anchor-vector\"] = get_embeddings(batch[\"anchor\"])\n", + " batch[\"similarity-positive-negative\"] = model.predict(zip(batch[\"positive-vector\"], batch[\"negative-vector\"]))\n", + " batch[\"similarity-anchor-positive\"] = model.predict(zip(batch[\"anchor-vector\"], batch[\"positive-vector\"]))\n", + " batch[\"similarity-anchor-negative\"] = model.predict(zip(batch[\"anchor-vector\"], batch[\"negative-vector\"]))\n", + " return batch\n", + "\n", + "dataset_generate_reranking_pairs = distiset[\"generate_reranking_pairs\"][\"train\"].map(format_data_retriever, batched=True, batch_size=250)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And voila, we have our proxies for quality evaluation which we can use to filter out the best and worst examples." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### (Optional) Argilla\n", + "\n", + "To get the most out of you data and actually look at our data, we will use Argilla. If you are not familiar with Argilla, we recommend taking a look at the [Argilla quickstart docs](https://docs.argilla.io/latest/getting_started/quickstart/). Alternatively, you can use your Hugging Face account to login to the [Argilla demo Space](https://argilla-argilla-template-space.hf.space).\n", + "\n", + "To start exploring data, we first need to define an `argilla.Dataset`. We will create a basic datset with some input `TextFields` for the `anchor` and output `TextQuestions` for the `positive` and `negative` pairs. Additionally, we will use the `file_name` as `MetaDataProperty`. Lastly, we will be re-using the vectors obtained from our previous step to allow for semantic search and we will add te similarity scores for some basic filtering and sorting." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, we need to define the setting for our Argilla dataset. We will create two different datasets, one for the retrieval data and one for the reranking data to ensure our annotators can focus on the task at hand." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import argilla as rg\n", + "from argilla._exceptions import ConflictError\n", + "\n", + "api_key = \"ohh so secret\"\n", + "api_url = \"https://davidberenstein1957-my-argilla.hf.space\"\n", + "\n", + "client = rg.Argilla(api_url=api_url, api_key=api_key)\n", + "\n", + "settings = rg.Settings(\n", + " fields=[\n", + " rg.TextField(\"anchor\")\n", + " ],\n", + " questions=[\n", + " rg.TextQuestion(\"positive\"),\n", + " rg.TextQuestion(\"negative\"),\n", + " rg.LabelQuestion(\n", + " name=\"is_positive_relevant\",\n", + " title=\"Is the positive query relevant?\",\n", + " labels=[\"yes\", \"no\"],\n", + " ),\n", + " rg.LabelQuestion(\n", + " name=\"is_negative_irrelevant\",\n", + " title=\"Is the negative query irrelevant?\",\n", + " labels=[\"yes\", \"no\"],\n", + " )\n", + " ],\n", + " metadata=[\n", + " rg.TermsMetadataProperty(\"filename\"),\n", + " rg.FloatMetadataProperty(\"similarity-positive-negative\"),\n", + " rg.FloatMetadataProperty(\"similarity-anchor-positive\"),\n", + " rg.FloatMetadataProperty(\"similarity-anchor-negative\"),\n", + " ],\n", + " vectors=[\n", + " rg.VectorField(\"anchor-vector\", dimensions=model.get_sentence_embedding_dimension())\n", + " ]\n", + ")\n", + "rg_datasets = []\n", + "for dataset_name in [\"generate_retrieval_pairs\", \"generate_reranking_pairs\"]:\n", + " ds = rg.Dataset(\n", + " name=dataset_name,\n", + " settings=settings\n", + " )\n", + " try:\n", + " ds.create()\n", + " except ConflictError:\n", + " ds = client.datasets(dataset_name)\n", + " rg_datasets.append(ds)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, we've got our dataset definitions setup in Argilla, we can upload our data to Argilla." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds_datasets = [dataset_generate_retrieval_pairs, dataset_generate_reranking_pairs]\n", + "\n", + "records = []\n", + "\n", + "for rg_dataset, ds_dataset in zip(rg_datasets, ds_datasets):\n", + " for idx, entry in enumerate(ds_dataset):\n", + " records.append(\n", + " rg.Record(\n", + " id=idx,\n", + " fields={\"anchor\": entry[\"anchor\"]},\n", + " suggestions=[\n", + " rg.Suggestion(\"positive\", value=entry[\"positive\"], agent=\"gpt-4o\", type=\"model\"),\n", + " rg.Suggestion(\"negative\", value=entry[\"negative\"], agent=\"gpt-4o\", type=\"model\"),\n", + " ],\n", + " metadata={\n", + " \"filename\": entry[\"filename\"],\n", + " \"similarity-positive-negative\": entry[\"similarity-positive-negative\"],\n", + " \"similarity-anchor-positive\": entry[\"similarity-anchor-positive\"],\n", + " \"similarity-anchor-negative\": entry[\"similarity-anchor-negative\"]\n", + " },\n", + " vectors={\"anchor-vector\": entry[\"anchor-vector\"]}\n", + " )\n", + " )\n", + " rg_dataset.records.log(records)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can explore the UI and add a final human touch to get he most out of our dataset. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Fine-tuning\n", + "\n", + "At last, we can fine-tune our models. 
We will use the `sentence-transformers` library to fine-tune our models.\n",
    "\n",
    "### Retrieval\n",
    "\n",
    "For retrieval, we have created a script that fine-tunes a model on our generated data, based on [https://github.com/argilla-io/argilla-sdk-chatbot/blob/main/train_embedding.ipynb](https://github.com/argilla-io/argilla-sdk-chatbot/blob/main/train_embedding.ipynb). You can also [open it in Google Colab directly](https://githubtocolab.com/argilla-io/argilla-sdk-chatbot/blob/main/train_embedding.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reranking\n",
    "\n",
    "For reranking, `sentence-transformers` provides a script that shows [how to fine-tune a CrossEncoder model](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/cross-encoder). As of now, there is [some uncertainty over fine-tuning CrossEncoder models with triplets](https://github.com/UKPLab/sentence-transformers/issues/2366), but you can still use the `positive` and `anchor` pairs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusions\n",
    "\n",
    "In this tutorial, we present an end-to-end example of fine-tuning retrievers and rerankers for RAG. This serves as a good starting point for optimizing and maintaining your data and model but needs to be adapted to your specific use case.\n",
    "\n",
    "We started with some seed data from the Argilla docs, generated synthetic data for retrieval and reranking models, evaluated the quality of the data, and showed how to fine-tune the models. We also used Argilla to get a human touch on the data."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
diff --git a/mkdocs.yml b/mkdocs.yml
index 5846a1020d..b98a9edbaf 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -150,6 +150,7 @@ plugins:
       members_order: source # order methods according to their order of definition in the source code, not alphabetical order
       heading_level: 4
   - social
+  - mknotebooks
   - distilabel/components-gallery:
       add_after_page: How-to guides
 
@@ -184,13 +185,18 @@ nav:
         - Serving an LLM for sharing it between several tasks: "sections/how_to_guides/advanced/serving_an_llm_for_reuse.md"
         - Scaling and distributing a pipeline with Ray: "sections/how_to_guides/advanced/scaling_with_ray.md"
     - Pipeline Samples:
-      - Examples: "sections/pipeline_samples/examples/index.md"
+      - "sections/pipeline_samples/examples/index.md"
+      - Tutorials:
+          - Synthetic data generation for fine-tuning custom retrieval and reranking models: "sections/pipeline_samples/tutorials/GenerateSentencePair.ipynb"
       - Papers:
-          - "sections/pipeline_samples/papers/index.md"
           - DEITA: "sections/pipeline_samples/papers/deita.md"
           - Instruction Backtranslation: "sections/pipeline_samples/papers/instruction_backtranslation.md"
           - Prometheus 2: "sections/pipeline_samples/papers/prometheus.md"
           - UltraFeedback: "sections/pipeline_samples/papers/ultrafeedback.md"
+      - Examples:
+          - Benchmarking with distilabel: "sections/pipeline_samples/examples/benchmarking_with_distilabel.md"
+          - Llama cpp with outlines: "sections/pipeline_samples/examples/llama_cpp_with_outlines.md"
+          - MistralAI with 
instructor: "sections/pipeline_samples/examples/mistralai_with_instructor.md" - DeepSeek Prover: "sections/pipeline_samples/papers/deepseek_prover.md" - API Reference: - Step: diff --git a/src/distilabel/llms/vllm.py b/src/distilabel/llms/vllm.py index 4ff30c07f4..6fcf614f0b 100644 --- a/src/distilabel/llms/vllm.py +++ b/src/distilabel/llms/vllm.py @@ -537,7 +537,7 @@ async def agenerate( # type: ignore """Generates `num_generations` responses for each input. Args: - inputs: a list of inputs in chat format to generate responses for. + input: a single input in chat format to generate responses for. num_generations: the number of generations to create per input. Defaults to `1`. max_new_tokens: the maximum number of new tokens that the model will generate.