Add tutorial - generate data for training embeddings and reranking models (#893)

* Add initial outline tutorial

* Add section on data quality evaluation

* Add conclusion

* Update pipeline_samples structure for adding tutorials in a similar way as Argilla docs

* Update new structure tutorials

* Update title

* Update to use Free serverless Inference API

* Process comments from code review

* Remove sections from header

* Updated formatting examples

* Add grid arrow on new line

* update phrasing

* update phrasing
davidberenstein1957 authored Aug 19, 2024
1 parent 974f0db commit 3264563
Showing 13 changed files with 862 additions and 69 deletions.
4 changes: 2 additions & 2 deletions docs/sections/how_to_guides/advanced/structured_generation.md
@@ -111,7 +111,7 @@ These were some simple examples, but one can see the options this opens.

!!! Tip
A full pipeline example can be seen in the following script:
[`examples/structured_generation_with_outlines.py`](../../pipeline_samples/examples/index.md#llamacpp-with-outlines)
[`examples/structured_generation_with_outlines.py`](../../pipeline_samples/examples/llama_cpp_with_outlines.md)

[^1]:
You can check the variable type by importing it from:
@@ -189,7 +189,7 @@ We get back a Python dictionary (formatted as a string) that we can parse using

!!! Tip
A full pipeline example can be seen in the following script:
[`examples/structured_generation_with_instructor.py`](../../pipeline_samples/examples/index.md#mistralai-with-instructor)
[`examples/structured_generation_with_instructor.py`](../../pipeline_samples/examples/mistralai_with_instructor.md)

## OpenAI JSON

20 changes: 20 additions & 0 deletions docs/sections/pipeline_samples/examples/benchmarking_with_distilabel.md
@@ -0,0 +1,20 @@
---
hide: toc
---
# [Benchmarking with `distilabel`: Arena Hard](#benchmarking-with-distilabel-arena-hard)

Benchmark LLMs with `distilabel`: reproducing the Arena Hard benchmark.

The script below first defines the `ArenaHard` and `ArenaHardResults` tasks: the former generates responses for a given collection of prompts/questions with up to two LLMs, and the latter calculates the results as per the original implementation. The second part of the example then builds a `Pipeline` that runs the generation on top of the prompts with `InferenceEndpointsLLM`, streams the rest of the generations from a pre-computed set of GPT-4 generations, and evaluates one against the other with `OpenAILLM`, generating an alternate response, a comparison between the responses, and a result as A>>B, A>B, B>A, B>>A, or tie.

To run this example you will first need to install the Arena Hard optional dependencies, namely `pandas`, `scikit-learn`, and `numpy`.
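
Before the full script (included below), here is a minimal, illustrative sketch of how such a pipeline could be wired together. The `ArenaHard` import, the model IDs, and the step wiring are assumptions for illustration only; the actual task and dataset handling live in `examples/arena_hard.py`.

```python
from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

# Hypothetical import: ArenaHard is the custom task defined in examples/arena_hard.py.
from arena_hard import ArenaHard

with Pipeline(name="arena-hard-sketch") as pipeline:
    # Toy question; the real example streams the official Arena Hard prompts.
    load_questions = LoadDataFromDicts(
        data=[{"instruction": "Explain the CAP theorem in two sentences."}],
    )
    # Generate a candidate answer with a model served behind the Inference API
    # (model ID is illustrative).
    generate = TextGeneration(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3-70B-Instruct"),
    )
    # Judge the candidate against a baseline answer with the ArenaHard task.
    judge = ArenaHard(llm=OpenAILLM(model="gpt-4"))

    load_questions >> generate >> judge

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
```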

??? Run

```python
python examples/arena_hard.py
```

```python title="arena_hard.py"
--8<-- "examples/arena_hard.py"
```
108 changes: 63 additions & 45 deletions docs/sections/pipeline_samples/examples/index.md
@@ -1,78 +1,96 @@
# Examples
---
hide: toc
---
# Pipeline Samples

This section contains different example pipelines that showcase different tasks, maybe you can take inspiration from them.
- **Tutorials** provide detailed step-by-step explanations and the code used for end-to-end workflows.
- **Paper implementations** provide reproductions of fundamental papers in the synthetic data domain.
- **Examples** don't provide explanations but simply show code for different tasks.

### [llama.cpp with `outlines`](#llamacpp-with-outlines)
## Tutorials

Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `distilabel`.
<div class="grid cards" markdown>

??? Example "See example"
- __Retrieval and reranking models__

This script makes use of [`LlamaCppLLM`][distilabel.llms.llamacpp.LlamaCppLLM] and the structured output capabilities thanks to [`outlines`](https://outlines-dev.github.io/outlines/welcome/) to generate RPG characters that adhere to a JSON schema.
---

It makes use of a local model which can be downloaded using curl (explained in the script itself), and can be exchanged with other `LLMs` like [`vLLM`][distilabel.llms.vllm.vLLM].
Learn about synthetic data generation for fine-tuning custom retrieval and reranking models.

??? Run
[:octicons-arrow-right-24: Tutorial](../tutorials/GenerateSentencePair.ipynb)

```python
python examples/structured_generation_with_outlines.py
```
</div>

```python title="structured_generation_with_outlines.py"
--8<-- "examples/structured_generation_with_outlines.py"
```
## Paper Implementations

<div class="grid cards" markdown>

### [MistralAI with `instructor`](#mistralai-with-instructor)
- __DEITA__

Answer instructions with knowledge graphs defined as `pydantic.BaseModel` objects using `instructor` in `distilabel`.
---

??? Example "See example"
Learn about prompt and response tuning for complexity and quality, and about using LLMs as judges for automatic data selection.

This script makes use of [`MistralLLM`][distilabel.llms.mistral.MistralLLM] and the structured output capabilities thanks to [`instructor`](https://python.useinstructor.com/) to generate knowledge graphs from complex topics.
[:octicons-arrow-right-24: Paper](../papers/deita.md)

This example is translated from this [awesome example](https://python.useinstructor.com/examples/knowledge_graph/) from `instructor` cookbook.
- __Instruction Backtranslation__

??? Run
---

```python
python examples/structured_generation_with_instructor.py
```
Learn about automatically labeling human-written text with corresponding instructions.

```python title="structured_generation_with_instructor.py"
--8<-- "examples/structured_generation_with_instructor.py"
```
[:octicons-arrow-right-24: Paper](../papers/instruction_backtranslation.md)

??? "Visualizing the graphs"
- __Prometheus 2__

Want to see how to visualize the graphs? You can test it using the following script. Generate some samples on your own and take a look:
---

!!! NOTE
Learn about using open-source models as judges for direct assessment and pair-wise ranking.

This example uses graphviz to render the graph, you can install with `pip` in the following way:
[:octicons-arrow-right-24: Paper](../papers/prometheus.md)

```console
pip install graphviz
```
- __UltraFeedback__

```python
python examples/draw_kg.py 2 # You can pass 0,1,2 to visualize each of the samples.
```
---

![Knowledge graph figure](../../../assets/images/sections/examples/knowledge-graph-example.png)
Learn about a large-scale, fine-grained, diverse preference dataset, used for training powerful reward and critic models.

[:octicons-arrow-right-24: Paper](../papers/ultrafeedback.md)

### [Benchmarking with `distilabel`: Arena Hard](#benchmarking-with-distilabel-arena-hard)
</div>

Benchmark LLMs with `distilabel`: reproducing the Arena Hard benchmark.
## Examples

<div class="grid cards" markdown>

- __Benchmarking with distilabel__

---

Learn about reproducing the Arena Hard benchmark with distilabel.

[:octicons-arrow-right-24: Example](./benchmarking_with_distilabel.md)

- __llama.cpp with outlines__

---

Learn about generating RPG characters following a pydantic.BaseModel with outlines in distilabel.

[:octicons-arrow-right-24: Example](./llama_cpp_with_outlines.md)

- __MistralAI with instructor__

---

Learn about answering instructions with knowledge graphs defined as pydantic.BaseModel objects using instructor in distilabel.

[:octicons-arrow-right-24: Example](./mistralai_with_instructor.md)


</div>

??? Example "See example"

The script below first defines both the `ArenaHard` and the `ArenaHardResults` tasks, so as to generate responses for a given collection of prompts/questions with up to two LLMs, and then calculate the results as per the original implementation, respectively. Additionally, the second part of the example builds a `Pipeline` to run the generation on top of the prompts with `InferenceEndpointsLLM` while streaming the rest of the generations from a pre-computed set of GPT-4 generations, and then evaluate one against the other with `OpenAILLM` generating an alternate response, a comparison between the responses, and a result as A>>B, A>B, B>A, B>>A, or tie.

To run this example you will first need to install the Arena Hard optional dependencies, being `pandas`, `scikit-learn`, and `numpy`.

```python title="arena_hard.py"
--8<-- "examples/arena_hard.py"
```

20 changes: 20 additions & 0 deletions docs/sections/pipeline_samples/examples/llama_cpp_with_outlines.md
@@ -0,0 +1,20 @@
---
hide: toc
---
# [llama.cpp with `outlines`](#llamacpp-with-outlines)

Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `distilabel`.

This script makes use of [`LlamaCppLLM`][distilabel.llms.llamacpp.LlamaCppLLM] and the structured output capabilities thanks to [`outlines`](https://outlines-dev.github.io/outlines/welcome/) to generate RPG characters that adhere to a JSON schema.

It makes use of a local model which can be downloaded using curl (explained in the script itself), and can be exchanged with other `LLMs` like [`vLLM`][distilabel.llms.vllm.vLLM].
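
As a rough sketch of the core pattern (not the full example script), the structured output is configured directly on the LLM via a `pydantic.BaseModel`; the schema fields and the model path below are illustrative placeholders:

```python
from pathlib import Path

from pydantic import BaseModel

from distilabel.llms import LlamaCppLLM

class RPGCharacter(BaseModel):
    name: str
    role: str
    weapon: str
    description: str

# Placeholder path: the example script explains how to download a GGUF model with curl.
model_path = Path.home() / "models" / "openhermes-2.5-mistral-7b.Q4_K_M.gguf"

llm = LlamaCppLLM(
    model_path=str(model_path),
    n_gpu_layers=-1,
    n_ctx=1024,
    # outlines-backed structured generation constrained to the JSON schema above.
    structured_output={"format": "json", "schema": RPGCharacter},
)
llm.load()

result = llm.generate(
    inputs=[[{"role": "user", "content": "Create a level-1 dwarf cleric."}]],
    max_new_tokens=256,
)
print(result)
```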

??? Run

```python
python examples/structured_generation_with_outlines.py
```

```python title="structured_generation_with_outlines.py"
--8<-- "examples/structured_generation_with_outlines.py"
```
38 changes: 38 additions & 0 deletions docs/sections/pipeline_samples/examples/mistralai_with_instructor.md
@@ -0,0 +1,38 @@
---
hide: toc
---
# [MistralAI with `instructor`](#mistralai-with-instructor)

Answer instructions with knowledge graphs defined as `pydantic.BaseModel` objects using `instructor` in `distilabel`.

This script makes use of [`MistralLLM`][distilabel.llms.mistral.MistralLLM] and the structured output capabilities thanks to [`instructor`](https://python.useinstructor.com/) to generate knowledge graphs from complex topics.

This example is translated from this [awesome example](https://python.useinstructor.com/examples/knowledge_graph/) from `instructor` cookbook.
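
A hedged sketch of the core pattern, with illustrative field names and model ID (the real pydantic models are defined in the example script): the knowledge graph is described as nested `pydantic.BaseModel`s and passed as the `structured_output` schema of `MistralLLM`.

```python
from typing import List

from pydantic import BaseModel

from distilabel.llms import MistralLLM

class Node(BaseModel):
    id: int
    label: str

class Edge(BaseModel):
    source: int
    target: int
    label: str

class KnowledgeGraph(BaseModel):
    nodes: List[Node]
    edges: List[Edge]

llm = MistralLLM(
    model="open-mixtral-8x22b",  # illustrative; requires MISTRAL_API_KEY in the environment
    # instructor-backed structured generation returning a KnowledgeGraph instance.
    structured_output={"schema": KnowledgeGraph},
)
llm.load()

result = llm.generate(
    inputs=[[{"role": "user", "content": "Teach me about quantum mechanics as a knowledge graph."}]],
    max_new_tokens=1024,
)
print(result)
```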

??? Run

```python
python examples/structured_generation_with_instructor.py
```

```python title="structured_generation_with_instructor.py"
--8<-- "examples/structured_generation_with_instructor.py"
```

??? "Visualizing the graphs"

Want to see how to visualize the graphs? You can test it using the following script. Generate some samples on your own and take a look:

!!! NOTE

This example uses graphviz to render the graph; you can install it with `pip` as follows:

```console
pip install graphviz
```

```python
python examples/draw_kg.py 2 # You can pass 0,1,2 to visualize each of the samples.
```

![Knowledge graph figure](../../../assets/images/sections/examples/knowledge-graph-example.png)
6 changes: 3 additions & 3 deletions docs/sections/pipeline_samples/papers/deita.md
@@ -1,10 +1,10 @@
# DEITA

DEITA (Data-Efficient Instruction Tuning for Alignment) studies an automatic data selection process by first quantifying the data quality based on complexity, quality and diversity. And second, selecting across the best potential combination from an open-source dataset that would fit into the budget you allocate to tune your own LLM.
[DEITA (Data-Efficient Instruction Tuning for Alignment)](https://arxiv.org/abs/2312.15685) studies an automatic data selection process: first, quantifying the data quality based on complexity, quality and diversity; second, selecting the best potential combination from an open-source dataset that would fit into the budget you allocate to tune your own LLM.

In most setting we cannot allocate unlimited resources for instruction-tuning LLMs. Therefore, the DEITA authors investigated how to select qualitative data for instruction-tuning based on a principle of fewer high quality samples. Liu et al. tackle the issue of first defining good data and second identifying it to respect an initial budget to instruct-tune your LLM.
In most settings we cannot allocate unlimited resources for instruction-tuning LLMs. Therefore, the DEITA authors investigated how to select high-quality data for instruction tuning based on the principle of fewer high-quality samples. Liu et al. tackle the issue of first defining good data and second identifying it to respect an initial budget to instruct-tune your LLM.

The strategy utilizes **LLMs to replace human effort in time-intensive data quality tasks on instruction tuning datasets**. DEITA introduces a way to measure data quality across three critical dimensions: complexity, quality and diversity.
The strategy utilizes **LLMs to replace human effort in time-intensive data quality tasks on instruction-tuning datasets**. DEITA introduces a way to measure data quality across three critical dimensions: complexity, quality and diversity.

![DEITA pipeline overview](../../../assets/tutorials-assets/deita/overview.png)

3 changes: 0 additions & 3 deletions docs/sections/pipeline_samples/papers/index.md

This file was deleted.

@@ -1,18 +1,18 @@
# Instruction Backtranslation

["Self Alignment with Instruction Backtranslation"](https://arxiv.org/abs/2308.06259) presents a scalable method to build a high quality instruction following language model by automatically labelling human-written text with corresponding instructions. Their approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model.
["Self Alignment with Instruction Backtranslation"](https://arxiv.org/abs/2308.06259) presents a scalable method to build high-quality instruction following a language model by automatically labeling human-written text with corresponding instructions. Their approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high-quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model.

Their self-training approach assumes access to a base language model, a small amount of seed data, and a collection of unlabelled examples, e.g. a web corpus. The unlabelled data is a large, diverse set of human-written documents which includes writing about all manner of topics humans are interested in – but crucially is not paired with instructions.
Their self-training approach assumes access to a base language model, a small amount of seed data, and a collection of unlabelled examples, e.g. a web corpus. The unlabelled data is a large, diverse set of human-written documents that includes writing about all manner of topics humans are interested in – but crucially is not paired with instructions.

A first key assumption is that there exists some subset of this very large human-written text that would be suitable as gold generations for some user instructions. A second key assumption is that they can predict instructions for these candidate gold answers that can be used as high quality example pairs to train an instruction following model.
A first key assumption is that there exists some subset of this very large human-written text that would be suitable as gold generations for some user instructions. A second key assumption is that they can predict instructions for these candidate gold answers that can be used as high-quality example pairs to train an instruction-following model.

Their overall process, called instruction backtranslation performs two core steps:
Their overall process, called instruction backtranslation, performs two core steps:

1. Self-augment: Generate instructions for unlabelled data, i.e. the web corpus, to produce candidate training data of (instruction, output) pairs for instruction tuning.

2. Self-curate: Self-select high quality demonstration examples as training data to finetune the base model to follow instructions. This approach is done iteratively where a better intermediate instruction-following model can improve on selecting data for finetuning in the next iteration.
2. Self-curate: Self-select high-quality demonstration examples as training data to finetune the base model to follow instructions. This approach is done iteratively where a better intermediate instruction-following model can improve on selecting data for finetuning in the next iteration.

This replication covers the self-curation step i.e. the second / latter step as mentioned above, so as to be able to use the proposed prompting approach to rate the quality of the generated text, which can either be synthetically generated or real human-written text.
This replication covers the self-curation step, i.e. the second/latter step mentioned above, so as to be able to use the proposed prompting approach to rate the quality of the generated text, which can be either synthetically generated or real human-written text.

### Replication

