feat: add customizable criteria for query generation in SelfInstructTask (#269)

* criteria for query generation added, template updated

* str field updated

* spaces after ("")

* template updated

* criteria for queries in template left with newlines

* self_instruct docstring updated

* snippet updated

* Update src/distilabel/tasks/text_generation/self_instruct.py

Co-authored-by: Agus <[email protected]>

* criteria for query generation fixed

* Add tests for self instruct task

* documentation

* \n added

* snippet updated

* changing the print from the snippets

* output as comment

---------

Co-authored-by: Agus <[email protected]>
ignacioct and plaguss authored Jan 18, 2024
1 parent 50c1df5 commit cb55007
Showing 6 changed files with 185 additions and 20 deletions.
@@ -0,0 +1,43 @@
from distilabel.tasks import SelfInstructTask

system_prompt: str = "You are an expert Haiku writer, writing the best and most diverse Haikus given topics as inputs."

application_description = (
"An AI assistant adept at writing Haiku.\n"
"It expects complete suggestions from users providing details of the kind of haiku they want.\n"
"The AI assistant will help users write haiku about particular topics and is willing to accept requests related to a specific subject or object or a more abstract request"
"based on an emotion, theme or vibe.\n"
)


criteria_queries = (
"Incorporate a diverse range of verbs, avoiding repetition.\n"
"Ensure queries are compatible with AI model's text generation functions and are limited to 1-2 sentences.\n"
"Design queries to be self-contained and standalone."
)

instruction_task = SelfInstructTask(
num_instructions=15,
application_description=application_description,
criteria_for_query_generation=criteria_queries,
)

# Let's print the generated prompt to see the input of the LLM model
print(instruction_task.generate_prompt("mountain peaks").formatted_prompt)

"""
# Task Description
Develop 15 user queries that can be received by the given AI application and applicable to the provided context. Emphasize diversity in verbs and linguistic structures within the model’s textual capabilities.
# Criteria for Queries
Incorporate a diverse range of verbs, avoiding repetition.
Ensure queries are compatible with AI model’s text generation functions and are limited to 1-2 sentences.
Design queries to be self-contained and standalone.
Write each query on a separate line and avoid using numbered lists or bullet points.
# AI Application
An AI assistant adept at writing Haiku.
It expects complete suggestions from users providing details of the kind of haiku they want.
The AI assistant will help users write haiku about particular topics and is willing to accept requests related to a specific subject or object or a more abstract requestbased on an emotion, theme or vibe.
# Context
mountain peaks
# Output
"""
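
The rendered prompt above contains `requestbased`. This follows from how Python joins adjacent string literals: they are concatenated with no separator, and the third line of `application_description` ends without a trailing space or newline. A minimal sketch of the effect, using shortened placeholder strings (the variable names are illustrative and not part of the snippet):

```python
# Adjacent string literals inside parentheses are concatenated as-is,
# so a missing trailing space fuses the last and first words.
without_space = (
    "a more abstract request"
    "based on an emotion, theme or vibe.\n"
)

# Adding a trailing space to the first literal keeps the words apart.
with_space = (
    "a more abstract request "
    "based on an emotion, theme or vibe.\n"
)

print(repr(without_space))
print(repr(with_space))
```
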
@@ -8,6 +8,7 @@
system_prompt="You are a question-answering assistant for...",
application_description="AI assistant",
num_instructions=3,
criteria_for_query_generation="Design queries to be... ",
),
openai_api_key=os.getenv("OPENAI_API_KEY"),
)
43 changes: 27 additions & 16 deletions docs/technical-reference/tasks.md
@@ -1,19 +1,20 @@
---
description: Determine the behaviour of your LLM by selecting a suitable task.
---

# Tasks

In this section we will see what a `Task` is and the list of tasks available in `distilabel`.

## Task

The `Task` class takes charge of setting how the LLM behaves, deciding whether it acts as a *generator* or a *labeller*. To accomplish this, the `Task` class creates a prompt using a template that will be sent to the [`LLM`](../technical-reference/llms.md). It specifies the necessary input arguments for generating the prompt and identifies the output arguments to be extracted from the `LLM` response. The `Task` class yields a `Prompt` that can generate a string with the format needed, depending on the specific `LLM` used.
The `Task` class takes charge of setting how the LLM behaves, deciding whether it acts as a _generator_ or a _labeller_. To accomplish this, the `Task` class creates a prompt using a template that will be sent to the [`LLM`](../technical-reference/llms.md). It specifies the necessary input arguments for generating the prompt and identifies the output arguments to be extracted from the `LLM` response. The `Task` class yields a `Prompt` that can generate a string with the format needed, depending on the specific `LLM` used.

Every `Task` defines a `system_prompt`, which serves as the initial instruction given to the LLM, guiding it on what kind of information or output is expected, and the following methods:

- `generate_prompt`: This method will be used by the `LLM` to create the prompts that will be fed to the model.
- `parse_output`: After the `LLM` has generated the content, this method will be called on the raw outputs of the model to extract the relevant content (scores, rationales, etc).
- `input_args_names` and `output_args_names`: These methods are used in the [`Pipeline`](../technical-reference/pipeline.md) to process the datasets. The first one defines the columns that will be extracted from the dataset to build the prompt in the case of an `LLM` that acts as a generator or labeller alone, or the columns that should be placed in the dataset to be processed by the *labeller* `LLM`, in the case of a `Pipeline` that has both a *generator* and a *labeller*. The second one is in charge of inserting the defined fields as columns of the generated dataset.
- `input_args_names` and `output_args_names`: These methods are used in the [`Pipeline`](../technical-reference/pipeline.md) to process the datasets. The first one defines the columns that will be extracted from the dataset to build the prompt in the case of an `LLM` that acts as a generator or labeller alone, or the columns that should be placed in the dataset to be processed by the _labeller_ `LLM`, in the case of a `Pipeline` that has both a _generator_ and a _labeller_. The second one is in charge of inserting the defined fields as columns of the generated dataset.
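
To make the roles of these methods concrete, here is a minimal sketch using a toy class. `ToyTask` is a hypothetical stand-in written for illustration, not distilabel's actual `Task` implementation; modelling `input_args_names` and `output_args_names` as properties and returning a plain string from `generate_prompt` are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyTask:
    """Hypothetical stand-in for a Task; not distilabel's actual class."""

    system_prompt: str = "You are a helpful assistant."

    @property
    def input_args_names(self) -> List[str]:
        # Dataset columns that are read to build the prompt.
        return ["input"]

    @property
    def output_args_names(self) -> List[str]:
        # Columns written back to the dataset after parsing.
        return ["generations"]

    def generate_prompt(self, input: str) -> str:
        # Combine the system prompt with the value taken from the row.
        return f"{self.system_prompt}\n\n{input}"

    def parse_output(self, output: str) -> Dict[str, str]:
        # Extract the relevant content from the raw model output.
        return {"generations": output.strip()}


task = ToyTask()
row = {"input": "Write a haiku about mountain peaks."}

# Select only the columns the task declares as inputs.
prompt = task.generate_prompt(**{name: row[name] for name in task.input_args_names})
# Pretend this raw string came back from an LLM.
parsed = task.parse_output("  Snow on silent peaks  ")

print(prompt)
print(parsed)
```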

After defining a task, the only action required is to pass it to the corresponding `LLM`. All the intricate processes are then handled internally:

@@ -29,15 +30,13 @@ These set of classes are designed to steer a `LLM` in generating text with speci

### TextGenerationTask

This is the base class for *text generation*, and includes the following fields for guiding the generation process:
This is the base class for _text generation_, and includes the following fields for guiding the generation process:

- `system_prompt`, which serves as the initial instruction or query given to the LLM, guiding it on what kind of information or output is expected.
- A list of `principles` to inject on the `system_prompt`, which by default correspond to those defined in the UltraFeedback paper[^1],
- and lastly a distribution for these principles so the `LLM` can be directed towards the different principles with a more customized behaviour.
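
The idea of injecting a principle into the `system_prompt` according to a distribution can be sketched as weighted sampling. The principle texts, the weights, and the `build_system_prompt` helper below are hypothetical and only illustrate the mechanism; they are not taken from distilabel or from the UltraFeedback paper.

```python
import random
from typing import Dict, List

# Hypothetical principles and weights, for illustration only.
principles: Dict[str, List[str]] = {
    "helpfulness": ["Give accurate, relevant and useful answers."],
    "honesty": ["Express uncertainty when you are not sure."],
}
distribution: Dict[str, float] = {"helpfulness": 0.7, "honesty": 0.3}


def build_system_prompt(base: str, rng: random.Random) -> str:
    # Pick a principle category according to the configured weights.
    names = list(distribution)
    weights = [distribution[name] for name in names]
    category = rng.choices(names, weights=weights, k=1)[0]
    # Pick one concrete principle from that category and append it.
    principle = rng.choice(principles[category])
    return f"{base} {principle}"


rng = random.Random(0)
print(build_system_prompt("You are a helpful assistant.", rng))
```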

[^1]:
The principles can be found [here][distilabel.tasks.text_generation.principles] in the codebase. More information on the *Principle Sampling* can be found in the [UltraFeedback repository](https://github.com/OpenBMB/UltraFeedback#principle-sampling).

[^1]: The principles can be found [here][distilabel.tasks.text_generation.principles] in the codebase. More information on the _Principle Sampling_ can be found in the [UltraFeedback repository](https://github.com/OpenBMB/UltraFeedback#principle-sampling).

For the API reference visit [TextGenerationTask][distilabel.tasks.text_generation.base.TextGenerationTask].

@@ -48,13 +47,28 @@ with Self-Generated Instructions](https://arxiv.org/abs/2212.10560).

From the original [repository](https://github.com/yizhongw/self-instruct/tree/main#how-self-instruct-works):

*The Self-Instruct process is an iterative bootstrapping algorithm that starts with a seed set of manually-written instructions and uses them to prompt the language model to generate new instructions and corresponding input-output instances*, so this `Task` is especially interesting for generating new datasets from a set of predefined topics.
_The Self-Instruct process is an iterative bootstrapping algorithm that starts with a seed set of manually-written instructions and uses them to prompt the language model to generate new instructions and corresponding input-output instances_, so this `Task` is especially interesting for generating new datasets from a set of predefined topics.

```python
--8<-- "docs/snippets/technical-reference/tasks/generic_openai_self_instruct.py"
```

For the API reference visit [SelfInstructTask][distilabel.tasks.text_generation.self_instruct.SelfInstructTask].
For the API reference visit [SelfInstructTask][distilabel.tasks.text_generation.self_instruct.SelfInstructTask].

#### Customise your SelfInstructTask

You can customise how your `SelfInstructTask` behaves by overriding the default values of its parameters to suit your use case. Let's go through them:

- **System Prompt** (`system_prompt`): controls the overall behaviour and expectations of your model.
- **Application Description** (`application_description`): a description of the AI application. By default, we use "AI assistant".
- **Number of instructions** (`num_instructions`): the number of user queries the prompt asks the model to generate. Defaults to 5.
- **Criteria for Query Generation** (`criteria_for_query_generation`): the criteria the generated queries should meet. The default value reproduces the standard `SelfInstructTask` behaviour. This value is passed to the Jinja2 template, which appends an extra instruction to ensure the correct output format.

Let's see an example of how to customise a SelfInstructTask to create Haikus in the snippet below. You can take a look at this dataset as an example of a [Haiku DPO dataset](https://huggingface.co/datasets/davanstrien/haiku_dpo).

```python
--8<-- "docs/snippets/technical-reference/tasks/custom_task_selfinstruct_haikus.py"
```
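
As a rough illustration of how the criteria end up in the prompt, the sketch below substitutes them into a simplified template. It uses `string.Template` from the standard library as a stand-in for the Jinja2 template, the template text is shortened, and `render_prompt` and `DEFAULT_CRITERIA` are hypothetical names; the real task renders `self-instruct.jinja2` through `generate_prompt`.

```python
from string import Template

# Simplified stand-in for the Jinja2 template; the placeholder syntax differs.
TEMPLATE = Template(
    "# Task Description\n"
    "Develop $num_instructions user queries for the given AI application.\n"
    "# Criteria for Queries\n"
    "$criteria_for_query_generation\n"
    "Write each query on a separate line and avoid using numbered lists or bullet points.\n"
    "# AI Application\n"
    "$application_description\n"
    "# Context\n"
    "$input\n"
    "# Output\n"
)

# Shortened placeholder default, not the task's full default criteria.
DEFAULT_CRITERIA = "Design queries to be self-contained and standalone."


def render_prompt(
    input: str,
    application_description: str = "AI assistant",
    num_instructions: int = 5,
    criteria_for_query_generation: str = DEFAULT_CRITERIA,
) -> str:
    # The output-format line is part of the template, so it always follows
    # whatever criteria the caller supplies.
    return TEMPLATE.substitute(
        input=input,
        application_description=application_description,
        num_instructions=num_instructions,
        criteria_for_query_generation=criteria_for_query_generation,
    )


prompt = render_prompt(
    "mountain peaks",
    application_description="An AI assistant adept at writing Haiku.",
    num_instructions=15,
    criteria_for_query_generation="Ask for haiku in varied moods.",
)
print(prompt)
```

Keeping the output-format instruction in the template rather than in the default criteria means custom criteria cannot accidentally drop it.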

### EvolInstructTask

@@ -63,7 +77,7 @@ Follow Complex Instructions](https://arxiv.org/abs/2304.12244).

From the original [repository](https://github.com/nlpxucan/WizardLM/tree/main?tab=readme-ov-file#overview-of-evol-instruct):

*Evol-Instruct is a novel method using LLMs instead of humans to automatically mass-produce open-domain instructions of various difficulty levels and skill range, to improve the performance of LLMs*.
_Evol-Instruct is a novel method using LLMs instead of humans to automatically mass-produce open-domain instructions of various difficulty levels and skill range, to improve the performance of LLMs_.

Use this `Task` to build more complete and complex datasets starting from simple ones.

@@ -74,12 +88,10 @@ Use this `Task` to build more complete and complex datasets starting from simple
You can take a look at a [sample dataset](https://huggingface.co/datasets/argilla/distilabel-sample-evol-instruct?row=19) generated using the following script: [examples/pipeline-evol-instruct-alpaca.py](../../examples/pipeline-evol-instruct-alpaca.py).

!!! note
The original definition of `EvolInstruct` considers an elimination evolving step with different
situations to remove the responses considered failures. Section 3.2, *Elimination Evolving* in [WizardLM paper](https://arxiv.org/abs/2304.12244) shows these steps. We have implemented steps 2-4 as part of this task, but not step one. Step 1 of the elimination process can be implemented using labellers in `distilabel`, an example implementation can be found in the following script: [examples/pipeline-openai-wizardl-equal-prompts.py](../../examples/pipeline-openai-wizardl-equal-prompts.py).
The original definition of `EvolInstruct` considers an elimination evolving step with different
situations to remove the responses considered failures. Section 3.2, _Elimination Evolving_ in [WizardLM paper](https://arxiv.org/abs/2304.12244) shows these steps. We have implemented steps 2-4 as part of this task, but not step one. Step 1 of the elimination process can be implemented using labellers in `distilabel`, an example implementation can be found in the following script: [examples/pipeline-openai-wizardl-equal-prompts.py](../../examples/pipeline-openai-wizardl-equal-prompts.py).



For the API reference visit [EvolInstructTask][distilabel.tasks.text_generation.evol_instruct.EvolInstructTask].
For the API reference visit [EvolInstructTask][distilabel.tasks.text_generation.evol_instruct.EvolInstructTask].

## Labelling

@@ -95,7 +107,7 @@ Contrary to the `TextGenerationTask`, the `PreferenceTask` is not intended for d

This task is specifically designed to build the prompts following the format defined in the ["UltraFeedback: Boosting Language Models With High Quality Feedback"](https://arxiv.org/abs/2310.01377) paper.

From the original [repository](https://github.com/OpenBMB/UltraFeedback): *To collect high-quality preference and textual feedback, we design a fine-grained annotation instruction, which contains 4 different aspects, namely instruction-following, truthfulness, honesty and helpfulness*. This `Task` is designed to label datasets following the different aspects defined for the UltraFeedback dataset creation.
From the original [repository](https://github.com/OpenBMB/UltraFeedback): _To collect high-quality preference and textual feedback, we design a fine-grained annotation instruction, which contains 4 different aspects, namely instruction-following, truthfulness, honesty and helpfulness_. This `Task` is designed to label datasets following the different aspects defined for the UltraFeedback dataset creation.

The following snippet can be used as a simplified UltraFeedback Task, for which we define 3 different ratings, but take into account the predefined versions are intended to be used out of the box:

@@ -145,7 +157,6 @@ Additionally, we at Argilla created a custom subtask for UltraFeedback, that gen
--8<-- "docs/snippets/technical-reference/tasks/openai_for_overall_quality.py"
```


For the API reference visit [UltraFeedbackTask][distilabel.tasks.preference.ultrafeedback.UltraFeedbackTask].

#### JudgeLMTask
5 changes: 1 addition & 4 deletions src/distilabel/tasks/_templates/self-instruct.jinja2
@@ -2,10 +2,7 @@
Develop {{ num_instructions }} user queries that can be received by the given AI application and applicable to the provided context. Emphasize diversity in verbs and linguistic structures within the model's textual capabilities.

# Criteria for Queries
Incorporate a diverse range of verbs, avoiding repetition.
Ensure queries are compatible with AI model's text generation functions and are limited to 1-2 sentences.
Design queries to be self-contained and standalone.
Blend interrogative (e.g., "What is the significance of x?") and imperative (e.g., "Detail the process of x.") styles.
{{ criteria_for_query_generation }}
Write each query on a separate line and avoid using numbered lists or bullet points.

# AI Application
12 changes: 12 additions & 0 deletions src/distilabel/tasks/text_generation/self_instruct.py
@@ -50,6 +50,9 @@ class SelfInstructTask(TextGenerationTask):
"AI assistant".
num_instructions (int, optional): the number of instructions to be used for the prompt.
Defaults to 5.
criteria_for_query_generation (str, optional): the criteria for query generation that we want
our model to have. Default value covers default behaviour for SelfInstructTask. This value is
passed to the .jinja template, where extra instructions are added to ensure correct output format.
References:
- [`Self-Instruct: Aligning Language Models with Self-Generated Instructions`](https://arxiv.org/abs/2212.10560)
@@ -61,9 +64,17 @@ class SelfInstructTask(TextGenerationTask):
" You are given a task description and a set of instructions for how to write the prompts for an"
" specific AI application."
)

application_description: str = "AI assistant"
num_instructions: int = 5

criteria_for_query_generation: str = (
"Incorporate a diverse range of verbs, avoiding repetition.\n"
"Ensure queries are compatible with AI model's text generation functions and are limited to 1-2 sentences.\n"
"Design queries to be self-contained and standalone.\n"
'Blend interrogative (e.g., "What is the significance of x?") and imperative (e.g., "Detail the process of x.") styles.'
)

__jinja2_template__: str = _SELF_INSTRUCT_TEMPLATE

def generate_prompt(self, input: str, **_: Any) -> Prompt:
@@ -87,6 +98,7 @@ def generate_prompt(self, input: str, **_: Any) -> Prompt:
render_kwargs = {
"application_description": self.application_description,
"num_instructions": self.num_instructions,
"criteria_for_query_generation": self.criteria_for_query_generation,
"input": input,
}
return Prompt(