feat: add customizable criteria for query generation in SelfInstructTask (#269)

* criteria for query generation added, template updated
* str field updated
* spaces after ("")
* template updated
* criteria for queries in template left with newlines
* self_instruct docstring updated
* snippet updated
* Update src/distilabel/tasks/text_generation/self_instruct.py

Co-authored-by: Agus <[email protected]>

* criteria for query generation fixed
* Add tests for self instruct task
* documentation
* \n added
* snippet updated
* changing the print from the snippets
* output as comment

---------

Co-authored-by: Agus <[email protected]>
argilla-io · Jan 18, 2024 · cb55007
1 parent 50c1df5
commit cb55007
Showing 6 changed files with 185 additions and 20 deletions.
diff --git a/docs/snippets/technical-reference/tasks/custom_task_selfinstruct_haikus.py b/docs/snippets/technical-reference/tasks/custom_task_selfinstruct_haikus.py
@@ -0,0 +1,43 @@
+from distilabel.tasks import SelfInstructTask
+
+system_prompt: str = "You are an expert Haiku writer, writing the best and most diverse Haikus given topics as inputs."
+
+application_description = (
+    "An AI assistant adept at writing Haiku.\n"
+    "It expects complete suggestions from users providing details of the kind of haiku they want.\n"
+    "The AI assistant will help users write haiku about particular topics and is willing to accept requests related to a specific subject or object or a more abstract request"
+    "based on an emotion, theme or vibe.\n"
+)
+
+
+criteria_queries = (
+    "Incorporate a diverse range of verbs, avoiding repetition.\n"
+    "Ensure queries are compatible with AI model's text generation functions and are limited to 1-2 sentences.\n"
+    "Design queries to be self-contained and standalone."
+)
+
+instruction_task = SelfInstructTask(
+    num_instructions=15,
+    application_description=application_description,
+    criteria_for_query_generation=criteria_queries,
+)
+
+# Let's print the generated prompt to see the input of the LLM model
+print(instruction_task.generate_prompt("mountain peaks").formatted_prompt)
+
+"""
+# Task Description
+Develop 15 user queries that can be received by the given AI application and applicable to the provided context. Emphasize diversity in verbs and linguistic structures within the model’s textual capabilities.
+# Criteria for Queries
+Incorporate a diverse range of verbs, avoiding repetition.
+Ensure queries are compatible with AI model’s text generation functions and are limited to 1-2 sentences.
+Design queries to be self-contained and standalone.
+Write each query on a separate line and avoid using numbered lists or bullet points.
+# AI Application
+An AI assistant adept at writing Haiku.
+It expects complete suggestions from users providing details of the kind of haiku they want.
+The AI assistant will help users write haiku about particular topics and is willing to accept requests related to a specific subject or object or a more abstract requestbased on an emotion, theme or vibe.
+# Context
+mountain peaks
+# Output
+"""
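
Review note on the snippet above: the expected prompt contains "requestbased" because the `application_description` literal ending in "a more abstract request" has no trailing space before the next literal. Python joins adjacent string literals with no separator. The minimal sketch below reproduces the effect with the two fragments from the snippet; the variable names are illustrative only and are not part of the commit.

```python
# Adjacent string literals inside parentheses are concatenated as-is,
# so a missing trailing space glues the neighbouring words together.
without_space = (
    "a more abstract request"
    "based on an emotion, theme or vibe.\n"
)

# Adding a trailing space to the first fragment keeps the words apart.
with_space = (
    "a more abstract request "
    "based on an emotion, theme or vibe.\n"
)

print(repr(without_space))
print(repr(with_space))
```

Ending that fragment with a space (or starting the next one with a space) would presumably give "request based" in the rendered prompt.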
diff --git a/docs/snippets/technical-reference/tasks/generic_openai_self_instruct.py b/docs/snippets/technical-reference/tasks/generic_openai_self_instruct.py
@@ -8,6 +8,7 @@
         system_prompt="You are a question-answering assistant for...",
         application_description="AI assistant",
         num_instructions=3,
+        criteria_for_query_generation="Design queries to be... ",
     ),
     openai_api_key=os.getenv("OPENAI_API_KEY"),
 )
diff --git a/docs/technical-reference/tasks.md b/docs/technical-reference/tasks.md
@@ -1,19 +1,20 @@
 ---
 description: Determine the behaviour of your LLM by selecting a suitable task.
 ---
+
 # Tasks
 
 In this section we will see what's a `Task` and the list of tasks available in `distilabel`.
 
 ## Task
 
-The `Task` class takes charge of setting how the LLM behaves, deciding whether it acts as a *generator* or a *labeller*. To accomplish this, the `Task` class creates a prompt using a template that will be sent to the [`LLM`](../technical-reference/llms.md). It specifies the necessary input arguments for generating the prompt and identifies the output arguments to be extracted from the `LLM` response. The `Task` class yields a `Prompt` that can generate a string with the format needed, depending on the specific `LLM` used.
+The `Task` class takes charge of setting how the LLM behaves, deciding whether it acts as a _generator_ or a _labeller_. To accomplish this, the `Task` class creates a prompt using a template that will be sent to the [`LLM`](../technical-reference/llms.md). It specifies the necessary input arguments for generating the prompt and identifies the output arguments to be extracted from the `LLM` response. The `Task` class yields a `Prompt` that can generate a string with the format needed, depending on the specific `LLM` used.
 
 All `Task`s define a `system_prompt`, which serves as the initial instruction given to the LLM, guiding it on what kind of information or output is expected, and the following methods:
 
 - `generate_prompt`: This method will be used by the `LLM` to create the prompts that will be fed to the model.
 - `parse_output`: After the `LLM` has generated the content, this method will be called on the raw outputs of the model to extract the relevant content (scores, rationales, etc).
-- `input_args_names` and `output_args_names`: These methods are used in the [`Pipeline`](../technical-reference/pipeline.md) to process the datasets. The first one defines the columns that will be extracted from the dataset to build the prompt in case of a `LLM` that acts as a generator or labeller alone, or the columns that should be placed in the dataset to be processed by the *labeller* `LLM`, in the case of a `Pipeline` that has both a *generator* and a *labeller*. The second one is in charge of inserting the defined fields as columns of the dataset generated dataset.
+- `input_args_names` and `output_args_names`: These methods are used in the [`Pipeline`](../technical-reference/pipeline.md) to process the datasets. The first one defines the columns that will be extracted from the dataset to build the prompt in case of a `LLM` that acts as a generator or labeller alone, or the columns that should be placed in the dataset to be processed by the _labeller_ `LLM`, in the case of a `Pipeline` that has both a _generator_ and a _labeller_. The second one is in charge of inserting the defined fields as columns of the dataset generated dataset.
 
 After defining a task, the only action required is to pass it to the corresponding `LLM`. All the intricate processes are then handled internally:
 
@@ -29,15 +30,13 @@ These set of classes are designed to steer a `LLM` in generating text with speci
 
 ### TextGenerationTask
 
-This is the base class for *text generation*, and includes the following fields for guiding the generation process:
+This is the base class for _text generation_, and includes the following fields for guiding the generation process:
 
 - `system_prompt`, which serves as the initial instruction or query given to the LLM, guiding it on what kind of information or output is expected.
 - A list of `principles` to inject on the `system_prompt`, which by default correspond to those defined in the UltraFeedback paper[^1],
 - and lastly a distribution for these principles so the `LLM` can be directed towards the different principles with a more customized behaviour.
 
-[^1]:
-    The principles can be found [here][distilabel.tasks.text_generation.principles] in the codebase. More information on the *Principle Sampling* can be found in the [UltraFeedfack repository](https://github.com/OpenBMB/UltraFeedback#principle-sampling).
-
+[^1]: The principles can be found [here][distilabel.tasks.text_generation.principles] in the codebase. More information on the _Principle Sampling_ can be found in the [UltraFeedfack repository](https://github.com/OpenBMB/UltraFeedback#principle-sampling).
 
 For the API reference visit [TextGenerationTask][distilabel.tasks.text_generation.base.TextGenerationTask].
 
@@ -48,13 +47,28 @@ with Self-Generated Instructions](https://arxiv.org/abs/2212.10560).
 
 From the original [repository](https://github.com/yizhongw/self-instruct/tree/main#how-self-instruct-works):
 
-*The Self-Instruct process is an iterative bootstrapping algorithm that starts with a seed set of manually-written instructions and uses them to prompt the language model to generate new instructions and corresponding input-output instances*, so this `Task` is specially interesting for generating new datasets from a set of predefined topics.
+_The Self-Instruct process is an iterative bootstrapping algorithm that starts with a seed set of manually-written instructions and uses them to prompt the language model to generate new instructions and corresponding input-output instances_, so this `Task` is specially interesting for generating new datasets from a set of predefined topics.
 
 ```python
 --8<-- "docs/snippets/technical-reference/tasks/generic_openai_self_instruct.py"
 ```
 
-For the API reference visit  [SelfInstructTask][distilabel.tasks.text_generation.self_instruct.SelfInstructTask].
+For the API reference visit [SelfInstructTask][distilabel.tasks.text_generation.self_instruct.SelfInstructTask].
+
+#### Customise your SelfInstructTask
+
+You can personalize the way in which your SelfInstructTask behaves by changing the default values of the parameters to something that suits your use case. Let's go through them:
+
+- **System Prompt**: you can control the overall behaviour and expectations of your model.
+- **Application Description**: a description of the AI application. By default, we use "AI Assistant".
+- **Number of instructions**: number of instructions in the prompt.
+- **Criteria for Query Generation**: the criteria for query generation that we want our model to have. The default value covers default behaviour for SelfInstructTask. This value is passed to the .jinja template, where extra instructions are added to ensure correct output format.
+
+Let's see an example of how to customise a SelfInstructTask to create Haikus in the snippet below. You can take a look at this dataset as an example of a [Haiku DPO dataset](https://huggingface.co/datasets/davanstrien/haiku_dpo).
+
+```python
+--8<-- "docs/snippets/technical-reference/tasks/custom_task_selfinstruct_haikus.py"
+```
 
 ### EvolInstructTask
 
@@ -63,7 +77,7 @@ Follow Complex Instructions](https://arxiv.org/abs/2304.12244).
 
 From the original [repository](https://github.com/nlpxucan/WizardLM/tree/main?tab=readme-ov-file#overview-of-evol-instruct):
 
-*Evol-Instruct is a novel method using LLMs instead of humans to automatically mass-produce open-domain instructions of various difficulty levels and skill range, to improve the performance of LLMs*.
+_Evol-Instruct is a novel method using LLMs instead of humans to automatically mass-produce open-domain instructions of various difficulty levels and skill range, to improve the performance of LLMs_.
 
 Use this `Task` to build more complete and complex datasets starting from simple ones.
 
@@ -74,12 +88,10 @@ Use this `Task` to build more complete and complex datasets starting from simple
 You can take a look at a [sample dataset](https://huggingface.co/datasets/argilla/distilabel-sample-evol-instruct?row=19) generated using the following script: [examples/pipeline-evol-instruct-alpaca.py](../../examples/pipeline-evol-instruct-alpaca.py).
 
 !!! note
-    The original definition of `EvolInstruct` considers an elimination evolving step with different
-    situations to remove the responses considered failures. Section 3.2, *Elimination Evolving* in [WizardLM paper](https://arxiv.org/abs/2304.12244) shows these steps. We have implemented steps 2-4 as part of this task, but not step one. Step 1 of the elimination process can be implemented using labellers in `distilabel`, an example implementation can be found in the following script: [examples/pipeline-openai-wizardl-equal-prompts.py](../../examples/pipeline-openai-wizardl-equal-prompts.py).
+The original definition of `EvolInstruct` considers an elimination evolving step with different
+situations to remove the responses considered failures. Section 3.2, _Elimination Evolving_ in [WizardLM paper](https://arxiv.org/abs/2304.12244) shows these steps. We have implemented steps 2-4 as part of this task, but not step one. Step 1 of the elimination process can be implemented using labellers in `distilabel`, an example implementation can be found in the following script: [examples/pipeline-openai-wizardl-equal-prompts.py](../../examples/pipeline-openai-wizardl-equal-prompts.py).
 
-
-
-For the API reference visit  [EvolInstructTask][distilabel.tasks.text_generation.evol_instruct.EvolInstructTask].
+For the API reference visit [EvolInstructTask][distilabel.tasks.text_generation.evol_instruct.EvolInstructTask].
 
 ## Labelling
 
@@ -95,7 +107,7 @@ Contrary to the `TextGenerationTask`, the `PreferenceTask` is not intended for d
 
 This task is specifically designed to build the prompts following the format defined in the ["UltraFeedback: Boosting Language Models With High Quality Feedback"](https://arxiv.org/abs/2310.01377) paper.
 
-From the original [repository](https://github.com/OpenBMB/UltraFeedback): *To collect high-quality preference and textual feedback, we design a fine-grained annotation instruction, which contains 4 different aspects, namely instruction-following, truthfulness, honesty and helpfulness*. This `Task` is designed to label datasets following the different aspects defined for the UltraFeedback dataset creation.
+From the original [repository](https://github.com/OpenBMB/UltraFeedback): _To collect high-quality preference and textual feedback, we design a fine-grained annotation instruction, which contains 4 different aspects, namely instruction-following, truthfulness, honesty and helpfulness_. This `Task` is designed to label datasets following the different aspects defined for the UltraFeedback dataset creation.
 
 The following snippet can be used as a simplified UltraFeedback Task, for which we define 3 different ratings, but take into account the predefined versions are intended to be used out of the box:
 
@@ -145,7 +157,6 @@ Additionally, we at Argilla created a custom subtask for UltraFeedback, that gen
     --8<-- "docs/snippets/technical-reference/tasks/openai_for_overall_quality.py"
     ```
 
-
 For the API reference visit [UltraFeedbackTask][distilabel.tasks.preference.ultrafeedback.UltraFeedbackTask].
 
 #### JudgeLMTask

diff --git a/src/distilabel/tasks/_templates/self-instruct.jinja2 b/src/distilabel/tasks/_templates/self-instruct.jinja2
@@ -2,10 +2,7 @@
 Develop {{ num_instructions }} user queries that can be received by the given AI application and applicable to the provided context. Emphasize diversity in verbs and linguistic structures within the model's textual capabilities.
 
 # Criteria for Queries
-Incorporate a diverse range of verbs, avoiding repetition.
-Ensure queries are compatible with AI model's text generation functions and are limited to 1-2 sentences.
-Design queries to be self-contained and standalone.
-Blend interrogative (e.g., "What is the significance of x?") and imperative (e.g., "Detail the process of x.") styles.
+{{ criteria_for_query_generation }}
 Write each query on a separate line and avoid using numbered lists or bullet points.
 
 # AI Application
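
Review note on the template change above: the four hard-coded criteria become a single placeholder, while the output-format line stays fixed in the template. The sketch below is a stdlib-only stand-in for that behaviour, not distilabel's implementation. It uses `str.format` instead of Jinja2, covers only the "Criteria for Queries" section, and the names `CRITERIA_SECTION`, `DEFAULT_CRITERIA` and `CriteriaDemo` are hypothetical.

```python
from dataclasses import dataclass

# Stand-in for the "Criteria for Queries" section of self-instruct.jinja2,
# written with a str.format placeholder instead of Jinja2 syntax.
CRITERIA_SECTION = (
    "# Criteria for Queries\n"
    "{criteria_for_query_generation}\n"
    "Write each query on a separate line and avoid using numbered lists or bullet points.\n"
)

# Default criteria as added to SelfInstructTask in this commit.
DEFAULT_CRITERIA = (
    "Incorporate a diverse range of verbs, avoiding repetition.\n"
    "Ensure queries are compatible with AI model's text generation functions and are limited to 1-2 sentences.\n"
    "Design queries to be self-contained and standalone.\n"
    'Blend interrogative (e.g., "What is the significance of x?") and imperative (e.g., "Detail the process of x.") styles.'
)


@dataclass
class CriteriaDemo:
    """Hypothetical helper that mirrors only the new field, for illustration."""

    criteria_for_query_generation: str = DEFAULT_CRITERIA

    def render(self) -> str:
        # Substitute the criteria into the fixed section text.
        return CRITERIA_SECTION.format(
            criteria_for_query_generation=self.criteria_for_query_generation
        )


# With the default value the section matches the previously hard-coded text.
default_section = CriteriaDemo().render()

# With custom criteria only the middle of the section changes.
custom_section = CriteriaDemo(
    "Design queries to be self-contained and standalone."
).render()

print(custom_section)
```

Under these assumptions, custom criteria replace the defaults entirely rather than extending them, and the "Write each query on a separate line" instruction is always appended because it lives in the template rather than in the field.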

diff --git a/src/distilabel/tasks/text_generation/self_instruct.py b/src/distilabel/tasks/text_generation/self_instruct.py
@@ -50,6 +50,9 @@ class SelfInstructTask(TextGenerationTask):
             "AI assistant".
         num_instructions (int, optional): the number of instructions to be used for the prompt.
             Defaults to 5.
+        criteria_for_query_generation (str, optional): the criteria for query generation that we want
+            our model to have. Default value covers default behaviour for SelfInstructTask. This value is
+            passed to the .jinja template, where extra instructions are added to ensure correct output format.
 
     References:
         - [`Self-Instruct: Aligning Language Models with Self-Generated Instructions`](https://arxiv.org/abs/2212.10560)
@@ -61,9 +64,17 @@ class SelfInstructTask(TextGenerationTask):
         " You are given a task description and a set of instructions for how to write the prompts for an"
         " specific AI application."
     )
+
     application_description: str = "AI assistant"
     num_instructions: int = 5
 
+    criteria_for_query_generation: str = (
+        "Incorporate a diverse range of verbs, avoiding repetition.\n"
+        "Ensure queries are compatible with AI model's text generation functions and are limited to 1-2 sentences.\n"
+        "Design queries to be self-contained and standalone.\n"
+        'Blend interrogative (e.g., "What is the significance of x?") and imperative (e.g., "Detail the process of x.") styles.'
+    )
+
     __jinja2_template__: str = _SELF_INSTRUCT_TEMPLATE
 
     def generate_prompt(self, input: str, **_: Any) -> Prompt:
@@ -87,6 +98,7 @@ def generate_prompt(self, input: str, **_: Any) -> Prompt:
         render_kwargs = {
             "application_description": self.application_description,
             "num_instructions": self.num_instructions,
+            "criteria_for_query_generation": self.criteria_for_query_generation,
             "input": input,
         }
         return Prompt(