Merge branch 'develop' into docs/890-docs-tutorial-generate-data-for-training-embeddings-and-reranking-models
davidberenstein1957 authored Aug 19, 2024
2 parents e0cf666 + 974f0db commit 39b1f79
Showing 72 changed files with 2,350 additions and 348 deletions.
Binary file added docs/assets/tutorials-assets/deepseek_prover.png
@@ -0,0 +1,123 @@
# Saving step generated artifacts

Some `Step`s might need to produce an auxiliary artifact that is not itself a result of the computation but is needed to carry it out. For example, the [`FaissNearestNeighbour`](/distilabel/components-gallery/steps/faissnearestneighbour/) step needs to create a Faiss index to compute its output, which is the top `k` nearest neighbours for each input. Generating the Faiss index takes time, and the index could potentially be reused outside of the `distilabel` pipeline, so it would be a shame not to save it.

For this reason, `Step`s have a method called `save_artifact` that saves artifacts to be included along with the outputs of the pipeline in the generated [`Distiset`][distilabel.distiset.Distiset]. The generated artifacts are uploaded when using `Distiset.push_to_hub` and saved when using `Distiset.save_to_disk`. Let's see how to use it with a simple example.

```python
from typing import List, TYPE_CHECKING
from distilabel.steps import GlobalStep, StepInput, StepOutput
import matplotlib.pyplot as plt

if TYPE_CHECKING:
    from distilabel.steps import StepOutput


class CountTextCharacters(GlobalStep):
    @property
    def inputs(self) -> List[str]:
        return ["text"]

    @property
    def outputs(self) -> List[str]:
        return ["text_character_count"]

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        character_counts = []

        for input in inputs:
            text_character_count = len(input["text"])
            input["text_character_count"] = text_character_count
            character_counts.append(text_character_count)

        # Generate plot with the distribution of text character counts
        plt.figure(figsize=(10, 6))
        plt.hist(character_counts, bins=30, edgecolor="black")
        plt.title("Distribution of Text Character Counts")
        plt.xlabel("Character Count")
        plt.ylabel("Frequency")

        # Save the plot as an artifact of the step
        self.save_artifact(
            name="text_character_count_distribution",
            write_function=lambda path: plt.savefig(path / "figure.png"),
            metadata={"type": "image", "library": "matplotlib"},
        )

        plt.close()

        yield inputs
```

As shown in the example above, we have created a simple step that counts the number of characters in each input text and generates a histogram of the character counts. We save the histogram as an artifact of the step using the `save_artifact` method, which takes three arguments:

- `name`: The name we want to give to the artifact.
- `write_function`: A function that writes the artifact to the desired path. The function will receive a `path` argument which is a `pathlib.Path` object pointing to the directory where the artifact should be saved.
- `metadata`: A dictionary with metadata about the artifact. This metadata will be saved along with the artifact.
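
The `write_function` does not have to be a lambda. As a minimal sketch, assuming only what is stated above (the callable receives a `pathlib.Path` pointing to a directory), the hypothetical helper `make_summary_writer` below builds a function that writes a JSON file into that directory. The helper name, the file name `summary.json`, and the sample counts are illustrative and not part of `distilabel`; a temporary directory stands in for the path that `save_artifact` would provide.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def make_summary_writer(character_counts: List[int]) -> Callable[[Path], None]:
    """Build a write function that stores summary statistics as JSON."""

    def write_summary(path: Path) -> None:
        # `path` is a directory, so the file name is chosen here.
        summary = {
            "num_texts": len(character_counts),
            "min": min(character_counts),
            "max": max(character_counts),
        }
        (path / "summary.json").write_text(json.dumps(summary))

    return write_summary


# Stand-in for the directory that `save_artifact` would pass to the function.
with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_dir = Path(tmp_dir)
    make_summary_writer([12, 40, 7])(artifact_dir)
    saved_summary = json.loads((artifact_dir / "summary.json").read_text())
```

Inside `process`, such a function could then be passed as `write_function=make_summary_writer(character_counts)`.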

Let's execute the step with a simple pipeline and push the resulting `Distiset` to the Hugging Face Hub:

??? "Example full code"

    ```python
    from typing import TYPE_CHECKING, List

    import matplotlib.pyplot as plt
    from datasets import load_dataset
    from distilabel.pipeline import Pipeline
    from distilabel.steps import GlobalStep, StepInput, StepOutput

    if TYPE_CHECKING:
        from distilabel.steps import StepOutput


    class CountTextCharacters(GlobalStep):
        @property
        def inputs(self) -> List[str]:
            return ["text"]

        @property
        def outputs(self) -> List[str]:
            return ["text_character_count"]

        def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
            character_counts = []

            for input in inputs:
                text_character_count = len(input["text"])
                input["text_character_count"] = text_character_count
                character_counts.append(text_character_count)

            # Generate plot with the distribution of text character counts
            plt.figure(figsize=(10, 6))
            plt.hist(character_counts, bins=30, edgecolor="black")
            plt.title("Distribution of Text Character Counts")
            plt.xlabel("Character Count")
            plt.ylabel("Frequency")

            # Save the plot as an artifact of the step
            self.save_artifact(
                name="text_character_count_distribution",
                write_function=lambda path: plt.savefig(path / "figure.png"),
                metadata={"type": "image", "library": "matplotlib"},
            )

            plt.close()

            yield inputs


    with Pipeline() as pipeline:
        count_text_characters = CountTextCharacters()

    if __name__ == "__main__":
        distiset = pipeline.run(
            dataset=load_dataset(
                "HuggingFaceH4/instruction-dataset", split="test"
            ).rename_column("prompt", "text"),
        )

        distiset.push_to_hub("distilabel-internal-testing/distilabel-artifacts-example")
    ```

The generated [distilabel-internal-testing/distilabel-artifacts-example](https://huggingface.co/datasets/distilabel-internal-testing/distilabel-artifacts-example) dataset repository has a section in its card [describing the artifacts generated by the pipeline](https://huggingface.co/datasets/distilabel-internal-testing/distilabel-artifacts-example#artifacts) and the generated plot can be seen [here](https://huggingface.co/datasets/distilabel-internal-testing/distilabel-artifacts-example/blob/main/artifacts/count_text_characters_0/text_character_count_distribution/figure.png).
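
Judging from the plot URL above, the artifact ends up under `artifacts/<step name>/<artifact name>/` in the repository. As a rough sketch, assuming a local copy of the repository follows the same layout (which the text above does not confirm), the hypothetical helper `list_artifact_files` below collects the files of every artifact. The directory tree created in the example is a stand-in, not real pipeline output.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def list_artifact_files(root: Path) -> Dict[str, List[str]]:
    """Map "<step name>/<artifact name>" to the sorted file names it contains."""
    artifacts: Dict[str, List[str]] = {}
    for artifact_dir in sorted((root / "artifacts").glob("*/*")):
        if artifact_dir.is_dir():
            key = f"{artifact_dir.parent.name}/{artifact_dir.name}"
            artifacts[key] = sorted(
                file.name for file in artifact_dir.iterdir() if file.is_file()
            )
    return artifacts


# Build a stand-in directory tree that mirrors the layout seen in the URL.
with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    plot_dir = (
        root / "artifacts" / "count_text_characters_0" / "text_character_count_distribution"
    )
    plot_dir.mkdir(parents=True)
    (plot_dir / "figure.png").write_bytes(b"")
    found = list_artifact_files(root)
```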
31 changes: 19 additions & 12 deletions docs/sections/how_to_guides/basic/step/generator_step.md
@@ -3,17 +3,19 @@
 The [`GeneratorStep`][distilabel.steps.GeneratorStep] is a subclass of [`Step`][distilabel.steps.Step] that is intended to be used as the first step within a [`Pipeline`][distilabel.pipeline.Pipeline], because it doesn't require input and generates data that can be used by other steps. Alternatively, it can also be used as a standalone.

 ```python
-from typing import List
+from typing import List, TYPE_CHECKING
 from typing_extensions import override

 from distilabel.steps import GeneratorStep
-from distilabel.steps.typing import GeneratorStepOutput

+if TYPE_CHECKING:
+    from distilabel.steps.typing import StepColumns, GeneratorStepOutput
+
 class MyGeneratorStep(GeneratorStep):
     instructions: List[str]

     @override
-    def process(self, offset: int = 0) -> GeneratorStepOutput:
+    def process(self, offset: int = 0) -> "GeneratorStepOutput":
         if offset:
             self.instructions = self.instructions[offset:]

@@ -30,7 +32,7 @@ class MyGeneratorStep(GeneratorStep):
             )

     @property
-    def outputs(self) -> List[str]:
+    def outputs(self) -> "StepColumns":
         return ["instruction"]
 ```

@@ -57,7 +59,7 @@ next(step.process(offset=1))

 We can define a custom generator step by creating a new subclass of the [`GeneratorStep`][distilabel.steps.GeneratorStep] and defining the following:

-- `outputs`: is a property that returns a list of strings with the names of the output fields.
+- `outputs`: is a property that returns a list of strings with the names of the output fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.

 - `process`: is a method that yields output data and a boolean flag indicating whether that's the last batch to be generated.

@@ -73,21 +75,23 @@ We can define a custom generator step by creating a new subclass of the [`Genera


 ```python
-from typing import List
+from typing import List, TYPE_CHECKING
 from typing_extensions import override

 from distilabel.steps import GeneratorStep
-from distilabel.steps.typing import GeneratorStepOutput

+if TYPE_CHECKING:
+    from distilabel.steps.typing import StepColumns, GeneratorStepOutput
+
 class MyGeneratorStep(GeneratorStep):
     instructions: List[str]

     @override
-    def process(self, offset: int = 0) -> GeneratorStepOutput:
+    def process(self, offset: int = 0) -> "GeneratorStepOutput":
         ...

     @property
-    def outputs(self) -> List[str]:
+    def outputs(self) -> "StepColumns":
         ...
 ```

@@ -96,15 +100,18 @@ We can define a custom generator step by creating a new subclass of the [`Genera
 The `@step` decorator will take care of the boilerplate code, and will allow to define the `outputs`, and `process` methods in a more straightforward way. One downside is that it won't let you access the `self` attributes if any, neither set those, so if you need to access or set any attribute, you should go with the first approach of defining the custom [`GeneratorStep`][distilabel.steps.GeneratorStep] subclass.

 ```python
+from typing import TYPE_CHECKING
 from distilabel.steps import step
-from distilabel.steps.typing import GeneratorStepOutput

+if TYPE_CHECKING:
+    from distilabel.steps.typing import GeneratorStepOutput
+
 @step(outputs=[...], step_type="generator")
-def CustomGeneratorStep(offset: int = 0) -> GeneratorStepOutput:
+def CustomGeneratorStep(offset: int = 0) -> "GeneratorStepOutput":
     yield (
         ...,
         True if offset == 10 else False,
     )

 step = CustomGeneratorStep(name="my-step")
-```
+```
24 changes: 15 additions & 9 deletions docs/sections/how_to_guides/basic/step/global_step.md
@@ -6,9 +6,9 @@ The [`GlobalStep`][distilabel.steps.GlobalStep] is a subclass of [`Step`][distil

 We can define a custom step by creating a new subclass of the [`GlobalStep`][distilabel.steps.GlobalStep] and defining the following:

-- `inputs`: is a property that returns a list of strings with the names of the required input fields.
+- `inputs`: is a property that returns a list of strings with the names of the required input fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.

-- `outputs`: is a property that returns a list of strings with the names of the output fields.
+- `outputs`: is a property that returns a list of strings with the names of the output fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.

 - `process`: is a method that receives the input data and returns the output data, and it should be a generator, meaning that it should `yield` the output data.

@@ -23,20 +23,23 @@ We can define a custom step by creating a new subclass of the [`GlobalStep`][dis
 We can inherit from the `GlobalStep` class and define the `inputs`, `outputs`, and `process` methods as follows:

 ```python
+from typing import TYPE_CHECKING
 from distilabel.steps import GlobalStep, StepInput
-from distilabel.steps.typing import StepOutput

+if TYPE_CHECKING:
+    from distilabel.steps.typing import StepColumns, StepOutput
+
 class CustomStep(Step):
     @property
-    def inputs(self) -> List[str]:
+    def inputs(self) -> "StepColumns":
         ...

     @property
-    def outputs(self) -> List[str]:
+    def outputs(self) -> "StepColumns":
         ...

     def process(self, *inputs: StepInput) -> StepOutput:
-        for input in inputs:
+        for upstream_step_inputs in inputs:
             for item in input:
                 ...
                 yield item
@@ -54,14 +57,17 @@ We can define a custom step by creating a new subclass of the [`GlobalStep`][dis
 The `@step` decorator will take care of the boilerplate code, and will allow to define the `inputs`, `outputs`, and `process` methods in a more straightforward way. One downside is that it won't let you access the `self` attributes if any, neither set those, so if you need to access or set any attribute, you should go with the first approach of defining the custom [`GlobalStep`][distilabel.steps.GlobalStep] subclass.

 ```python
+from typing import TYPE_CHECKING
 from distilabel.steps import StepInput, step
-from distilabel.steps.typing import StepOutput

+if TYPE_CHECKING:
+    from distilabel.steps.typing import StepOutput
+
 @step(inputs=[...], outputs=[...], step_type="global")
-def CustomStep(inputs: StepInput) -> StepOutput:
+def CustomStep(inputs: StepInput) -> "StepOutput":
     for input in inputs:
         ...
     yield inputs

 step = CustomStep(name="my-step")
-```
+```
36 changes: 24 additions & 12 deletions docs/sections/how_to_guides/basic/step/index.md
@@ -7,13 +7,19 @@ The [`Step`][distilabel.steps.Step] is intended to be used within the scope of a
 Assuming that we have a [`Step`][distilabel.steps.Step] already defined as it follows:

 ```python
+from typing import TYPE_CHECKING
+from distilabel.steps import Step, StepInput
+
+if TYPE_CHECKING:
+    from distilabel.steps.typing import StepColumns, StepOutput
+
 class MyStep(Step):
     @property
-    def inputs(self) -> List[str]:
+    def inputs(self) -> "StepColumns":
         return ["input_field"]

     @property
-    def outputs(self) -> List[str]:
+    def outputs(self) -> "StepColumns":
         return ["output_field"]

     def process(self, inputs: StepInput) -> "StepOutput":
@@ -71,9 +77,9 @@ There are two special types of [`Step`][distilabel.steps.Step] in `distilabel`:

 We can define a custom step by creating a new subclass of the [`Step`][distilabel.steps.Step] and defining the following:

-- `inputs`: is a property that returns a list of strings with the names of the required input fields.
+- `inputs`: is a property that returns a list of strings with the names of the required input fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.

-- `outputs`: is a property that returns a list of strings with the names of the output fields.
+- `outputs`: is a property that returns a list of strings with the names of the output fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.

 - `process`: is a method that receives the input data and returns the output data, and it should be a generator, meaning that it should `yield` the output data.

@@ -88,20 +94,23 @@ We can define a custom step by creating a new subclass of the [`Step`][distilabe
 We can inherit from the `Step` class and define the `inputs`, `outputs`, and `process` methods as follows:

 ```python
+from typing import TYPE_CHECKING
 from distilabel.steps import Step, StepInput
-from distilabel.steps.typing import StepOutput

+if TYPE_CHECKING:
+    from distilabel.steps.typing import StepColumns, StepOutput
+
 class CustomStep(Step):
     @property
-    def inputs(self) -> List[str]:
+    def inputs(self) -> "StepColumns":
         ...

     @property
-    def outputs(self) -> List[str]:
+    def outputs(self) -> "StepColumns":
         ...

-    def process(self, *inputs: StepInput) -> StepOutput:
-        for input in inputs:
+    def process(self, *inputs: StepInput) -> "StepOutput":
+        for upstream_step_inputs in inputs:
             ...
             yield item

@@ -119,14 +128,17 @@ We can define a custom step by creating a new subclass of the [`Step`][distilabe


 ```python
+from typing import TYPE_CHECKING
 from distilabel.steps import StepInput, step
-from distilabel.steps.typing import StepOutput

+if TYPE_CHECKING:
+    from distilabel.steps.typing import StepOutput
+
 @step(inputs=[...], outputs=[...])
-def CustomStep(inputs: StepInput) -> StepOutput:
+def CustomStep(inputs: StepInput) -> "StepOutput":
     for input in inputs:
         ...
     yield inputs

 step = CustomStep(name="my-step")
-```
+```