diff --git a/pr-1092/search/search_index.json b/pr-1092/search/search_index.json index e39e09860..468c9aa88 100644 --- a/pr-1092/search/search_index.json +++ b/pr-1092/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Distilabel","text":"Synthesize data for AI and add feedback on the fly!
Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
Get started in 5 minutes!
Install distilabel with pip
and run your first Pipeline
to generate and evaluate synthetic data.
Quickstart
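For instance, a first pipeline could look like the following minimal sketch (the model name and the seed instruction are placeholders, and an OpenAI API key is assumed to be configured in the environment):
from distilabel.llms import OpenAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(name=\"my-first-pipeline\") as pipeline:\n    # Load a seed instruction (placeholder data).\n    load = LoadDataFromDicts(data=[{\"instruction\": \"Explain synthetic data in one sentence.\"}])\n    # Generate a response for each instruction with an LLM (model name is a placeholder).\n    generate = TextGeneration(llm=OpenAILLM(model=\"gpt-4o-mini\"))\n    load >> generate\n\nif __name__ == \"__main__\":\n    distiset = pipeline.run(use_cache=False)\n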
How-to guides
Get familiar with the basics of distilabel. Learn how to define steps
, tasks
and llms
and run your Pipeline
.
Learn more
Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects, including traditional predictive NLP (classification, extraction, etc.) as well as generative and large language model scenarios (instruction following, dialogue generation, judging, etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.
Improve your AI output quality through data quality
Compute is expensive and output quality is important. We help you focus on data quality, which tackles the root cause of both of these problems at once. Distilabel helps you synthesize and judge data so you can spend your valuable time achieving and maintaining high-quality standards for your synthetic data.
Take control of your data and models
Owning the data for fine-tuning your own LLMs is not easy, but distilabel can help you get started. We integrate AI feedback from any LLM provider out there using one unified API.
Improve efficiency by quickly iterating on the right data and models
Synthesize and judge data with the latest research papers while ensuring flexibility, scalability and fault tolerance, so you can focus on improving your data and training your models.
"},{"location":"#what-do-people-build-with-distilabel","title":"What do people build with distilabel?","text":"The Argilla community uses distilabel to create amazing datasets and models.
This section contains the API reference for the CLI. For more information on how to use the CLI, see Tutorial - CLI.
"},{"location":"api/cli/#utility-functions-for-the-distilabel-pipeline-sub-commands","title":"Utility functions for thedistilabel pipeline
sub-commands","text":"Here are some utility functions to help working with the pipelines in the console.
"},{"location":"api/cli/#distilabel.cli.pipeline.utils","title":"utils
","text":""},{"location":"api/cli/#distilabel.cli.pipeline.utils.parse_runtime_parameters","title":"parse_runtime_parameters(params)
","text":"Parses the runtime parameters from the CLI format to the format expected by the Pipeline.run
method. The CLI format is a list of tuples, where the first element is a list of keys and the second element is the value.
Parameters:
Name Type Description Defaultparams
List[Tuple[List[str], str]]
A list of tuples, where the first element is a list of keys and the second element is the value.
requiredReturns:
Type DescriptionDict[str, Dict[str, Any]]
A dictionary with the runtime parameters in the format expected by the
Dict[str, Dict[str, Any]]
Pipeline.run
method.
src/distilabel/cli/pipeline/utils.py
def parse_runtime_parameters(\n params: List[Tuple[List[str], str]],\n) -> Dict[str, Dict[str, Any]]:\n \"\"\"Parses the runtime parameters from the CLI format to the format expected by the\n `Pipeline.run` method. The CLI format is a list of tuples, where the first element is\n a list of keys and the second element is the value.\n\n Args:\n params: A list of tuples, where the first element is a list of keys and the\n second element is the value.\n\n Returns:\n A dictionary with the runtime parameters in the format expected by the\n `Pipeline.run` method.\n \"\"\"\n runtime_params = {}\n for keys, value in params:\n current = runtime_params\n for i, key in enumerate(keys):\n if i == len(keys) - 1:\n current[key] = value\n else:\n current = current.setdefault(key, {})\n return runtime_params\n
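A quick illustration of that conversion (the step and parameter names are placeholders):
params = [([\"load_data\", \"batch_size\"], \"10\"), ([\"text_generation\", \"llm\", \"temperature\"], \"0.7\")]\nparse_runtime_parameters(params)\n# {\"load_data\": {\"batch_size\": \"10\"}, \"text_generation\": {\"llm\": {\"temperature\": \"0.7\"}}}\n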
"},{"location":"api/cli/#distilabel.cli.pipeline.utils.valid_http_url","title":"valid_http_url(url)
","text":"Check if the URL is a valid HTTP URL.
Parameters:
Name Type Description Defaulturl
str
The URL to check.
requiredReturns:
Type Descriptionbool
True
, if the URL is a valid HTTP URL. False
, otherwise.
src/distilabel/cli/pipeline/utils.py
def valid_http_url(url: str) -> bool:\n \"\"\"Check if the URL is a valid HTTP URL.\n\n Args:\n url: The URL to check.\n\n Returns:\n `True`, if the URL is a valid HTTP URL. `False`, otherwise.\n \"\"\"\n try:\n TypeAdapter(HttpUrl).validate_python(url) # type: ignore\n except ValidationError:\n return False\n\n return True\n
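For example (the URL is a placeholder):
valid_http_url(\"https://example.com/pipeline.yaml\")  # True\nvalid_http_url(\"not-a-url\")  # False\n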
"},{"location":"api/cli/#distilabel.cli.pipeline.utils.get_config_from_url","title":"get_config_from_url(url)
","text":"Loads the pipeline configuration from a URL pointing to a JSON or YAML file.
Parameters:
Name Type Description Defaulturl
str
The URL pointing to the pipeline configuration file.
requiredReturns:
Type DescriptionDict[str, Any]
The pipeline configuration as a dictionary.
Raises:
Type DescriptionValueError
If the file format is not supported.
Source code insrc/distilabel/cli/pipeline/utils.py
def get_config_from_url(url: str) -> Dict[str, Any]:\n \"\"\"Loads the pipeline configuration from a URL pointing to a JSON or YAML file.\n\n Args:\n url: The URL pointing to the pipeline configuration file.\n\n Returns:\n The pipeline configuration as a dictionary.\n\n Raises:\n ValueError: If the file format is not supported.\n \"\"\"\n if not url.endswith((\".json\", \".yaml\", \".yml\")):\n raise DistilabelUserError(\n f\"Unsupported file format for '{url}'. Only JSON and YAML are supported\",\n page=\"sections/how_to_guides/basic/pipeline/?h=seriali#serializing-the-pipeline\",\n )\n response = _download_remote_file(url)\n\n if url.endswith((\".yaml\", \".yml\")):\n content = response.content.decode(\"utf-8\")\n return yaml.safe_load(content)\n\n return response.json()\n
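A sketch of how this is typically combined with Pipeline.from_dict (the URL is hypothetical):
config = get_config_from_url(\"https://example.com/pipeline.yaml\")\npipeline = Pipeline.from_dict(config)\n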
"},{"location":"api/cli/#distilabel.cli.pipeline.utils.get_pipeline_from_url","title":"get_pipeline_from_url(url, pipeline_name='pipeline')
","text":"Downloads the file to the current working directory and loads the pipeline object from a python script.
Parameters:
Name Type Description Defaulturl
str
The URL pointing to the python script with the pipeline definition.
requiredpipeline_name
str
The name of the pipeline in the script. I.e: with Pipeline(...) as pipeline:...
.
'pipeline'
Returns:
Type DescriptionBasePipeline
The pipeline instantiated.
Raises:
Type DescriptionValueError
If the file format is not supported.
Source code insrc/distilabel/cli/pipeline/utils.py
def get_pipeline_from_url(url: str, pipeline_name: str = \"pipeline\") -> \"BasePipeline\":\n \"\"\"Downloads the file to the current working directory and loads the pipeline object\n from a python script.\n\n Args:\n url: The URL pointing to the python script with the pipeline definition.\n pipeline_name: The name of the pipeline in the script.\n I.e: `with Pipeline(...) as pipeline:...`.\n\n Returns:\n The pipeline instantiated.\n\n Raises:\n ValueError: If the file format is not supported.\n \"\"\"\n if not url.endswith(\".py\"):\n raise DistilabelUserError(\n f\"Unsupported file format for '{url}'. It must be a python file.\",\n page=\"sections/how_to_guides/advanced/cli/#distilabel-pipeline-run\",\n )\n response = _download_remote_file(url)\n\n content = response.content.decode(\"utf-8\")\n script_local = Path.cwd() / Path(url).name\n script_local.write_text(content)\n\n # Add the current working directory to sys.path\n sys.path.insert(0, os.getcwd())\n module = importlib.import_module(str(Path(url).stem))\n pipeline = getattr(module, pipeline_name, None)\n if not pipeline:\n raise ImportError(\n f\"The script must contain an object with the pipeline named: '{pipeline_name}' that can be imported\"\n )\n\n return pipeline\n
"},{"location":"api/cli/#distilabel.cli.pipeline.utils.get_pipeline","title":"get_pipeline(config_or_script, pipeline_name='pipeline')
","text":"Get a pipeline from a configuration file or a remote python script.
Parameters:
Name Type Description Defaultconfig_or_script
str
The path or URL to the pipeline configuration file or URL to a python script.
requiredpipeline_name
str
The name of the pipeline in the script. I.e: with Pipeline(...) as pipeline:...
.
'pipeline'
Returns:
Type DescriptionBasePipeline
The pipeline.
Raises:
Type DescriptionValueError
If the file format is not supported.
FileNotFoundError
If the configuration file does not exist.
Source code insrc/distilabel/cli/pipeline/utils.py
def get_pipeline(\n config_or_script: str, pipeline_name: str = \"pipeline\"\n) -> \"BasePipeline\":\n \"\"\"Get a pipeline from a configuration file or a remote python script.\n\n Args:\n config_or_script: The path or URL to the pipeline configuration file\n or URL to a python script.\n pipeline_name: The name of the pipeline in the script.\n I.e: `with Pipeline(...) as pipeline:...`.\n\n Returns:\n The pipeline.\n\n Raises:\n ValueError: If the file format is not supported.\n FileNotFoundError: If the configuration file does not exist.\n \"\"\"\n config = script = None\n if config_or_script.endswith((\".json\", \".yaml\", \".yml\")):\n config = config_or_script\n elif config_or_script.endswith(\".py\"):\n script = config_or_script\n else:\n raise DistilabelUserError(\n \"The file must be a valid config file or python script with a pipeline.\",\n page=\"sections/how_to_guides/advanced/cli/#distilabel-pipeline-run\",\n )\n\n if valid_http_url(config_or_script):\n if config:\n data = get_config_from_url(config)\n return Pipeline.from_dict(data)\n return get_pipeline_from_url(script, pipeline_name=pipeline_name)\n\n if not config:\n raise ValueError(\n f\"To run a pipeline from a python script, run it as `python {script}`\"\n )\n\n if Path(config).is_file():\n return Pipeline.from_file(config)\n\n raise FileNotFoundError(f\"File '{config_or_script}' does not exist.\")\n
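For example (the path and URL are placeholders):
# From a local serialized pipeline file:\npipeline = get_pipeline(\"pipeline.yaml\")\n# From a remote python script defining a pipeline object:\npipeline = get_pipeline(\"https://example.com/pipeline.py\", pipeline_name=\"pipeline\")\n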
"},{"location":"api/cli/#distilabel.cli.pipeline.utils.display_pipeline_information","title":"display_pipeline_information(pipeline)
","text":"Displays the pipeline information to the console.
Parameters:
Name Type Description Defaultpipeline
BasePipeline
The pipeline.
required Source code insrc/distilabel/cli/pipeline/utils.py
def display_pipeline_information(pipeline: \"BasePipeline\") -> None:\n \"\"\"Displays the pipeline information to the console.\n\n Args:\n pipeline: The pipeline.\n \"\"\"\n from rich.console import Console\n\n Console().print(_build_pipeline_panel(pipeline))\n
"},{"location":"api/distiset/","title":"Distiset","text":"This section contains the API reference for the Distiset. For more information on how to use the CLI, see Tutorial - CLI.
"},{"location":"api/distiset/#distilabel.distiset.Distiset","title":"Distiset
","text":" Bases: dict
Convenient wrapper around datasets.Dataset
to push to the Hugging Face Hub.
It's a dictionary where the keys correspond to the different leaf_steps from the internal DAG
and the values are datasets.Dataset
.
Attributes:
Name Type Description_pipeline_path
Optional[Path]
Optional path to the pipeline.yaml
file that generated the dataset. Defaults to None
.
_artifacts_path
Optional[Path]
Optional path to the directory containing the artifacts generated by the pipeline steps. Defaults to None
.
_log_filename_path
Optional[Path]
Optional path to the pipeline.log
file that was written by the pipeline. Defaults to None
.
_citations
Optional[List[str]]
Optional list containing citations that will be included in the dataset card. Defaults to None
.
src/distilabel/distiset.py
class Distiset(dict):\n \"\"\"Convenient wrapper around `datasets.Dataset` to push to the Hugging Face Hub.\n\n It's a dictionary where the keys correspond to the different leaf_steps from the internal\n `DAG` and the values are `datasets.Dataset`.\n\n Attributes:\n _pipeline_path: Optional path to the `pipeline.yaml` file that generated the dataset.\n Defaults to `None`.\n _artifacts_path: Optional path to the directory containing the generated artifacts\n by the pipeline steps. Defaults to `None`.\n _log_filename_path: Optional path to the `pipeline.log` file that generated was written\n by the pipeline. Defaults to `None`.\n _citations: Optional list containing citations that will be included in the dataset\n card. Defaults to `None`.\n \"\"\"\n\n _pipeline_path: Optional[Path] = None\n _artifacts_path: Optional[Path] = None\n _log_filename_path: Optional[Path] = None\n _citations: Optional[List[str]] = None\n\n def push_to_hub(\n self,\n repo_id: str,\n private: bool = False,\n token: Optional[str] = None,\n generate_card: bool = True,\n include_script: bool = False,\n **kwargs: Any,\n ) -> None:\n \"\"\"Pushes the `Distiset` to the Hugging Face Hub, each dataset will be pushed as a different configuration\n corresponding to the leaf step that generated it.\n\n Args:\n repo_id:\n The ID of the repository to push to in the following format: `<user>/<dataset_name>` or\n `<org>/<dataset_name>`. Also accepts `<dataset_name>`, which will default to the namespace\n of the logged-in user.\n private:\n Whether the dataset repository should be set to private or not. Only affects repository creation:\n a repository that already exists will not be affected by that parameter.\n token:\n An optional authentication token for the Hugging Face Hub. If no token is passed, will default\n to the token saved locally when logging in with `huggingface-cli login`. Will raise an error\n if no token is passed and the user is not logged-in.\n generate_card:\n Whether to generate a dataset card or not. Defaults to True.\n include_script:\n Whether you want to push the pipeline script to the hugging face hub to share it.\n If set to True, the name of the script that was run to create the distiset will be\n automatically determined, and that will be the name of the file uploaded to your\n repository. Take into account, this operation only makes sense for a distiset obtained\n from calling `Pipeline.run()` method. 
Defaults to False.\n **kwargs:\n Additional keyword arguments to pass to the `push_to_hub` method of the `datasets.Dataset` object.\n\n Raises:\n ValueError: If no token is provided and couldn't be retrieved automatically.\n \"\"\"\n script_filename = sys.argv[0]\n filename_py = (\n script_filename.split(\"/\")[-1]\n if \"/\" in script_filename\n else script_filename\n )\n script_path = Path.cwd() / script_filename\n\n if token is None:\n token = get_hf_token(self.__class__.__name__, \"token\")\n\n for name, dataset in self.items():\n dataset.push_to_hub(\n repo_id=repo_id,\n config_name=name,\n private=private,\n token=token,\n **kwargs,\n )\n\n if self.artifacts_path:\n upload_folder(\n repo_id=repo_id,\n folder_path=self.artifacts_path,\n path_in_repo=\"artifacts\",\n token=token,\n repo_type=\"dataset\",\n commit_message=\"Include pipeline artifacts\",\n )\n\n if include_script and script_path.exists():\n upload_file(\n path_or_fileobj=script_path,\n path_in_repo=filename_py,\n repo_id=repo_id,\n repo_type=\"dataset\",\n token=token,\n commit_message=\"Include pipeline script\",\n )\n\n if generate_card:\n self._generate_card(\n repo_id, token, include_script=include_script, filename_py=filename_py\n )\n\n def _get_card(\n self,\n repo_id: str,\n token: Optional[str] = None,\n include_script: bool = False,\n filename_py: Optional[str] = None,\n ) -> DistilabelDatasetCard:\n \"\"\"Generates the dataset card for the `Distiset`.\n\n Note:\n If `repo_id` and `token` are provided, it will extract the metadata from the README.md file\n on the hub.\n\n Args:\n repo_id: Name of the repository to push to, or the path for the distiset if saved to disk.\n token: The token to authenticate with the Hugging Face Hub.\n We assume that if it's provided, the dataset will be in the Hugging Face Hub,\n so the README metadata will be extracted from there.\n include_script: Whether to upload the script to the hugging face repository.\n filename_py: The name of the script. If `include_script` is True, the script will\n be uploaded to the repository using this name, otherwise it won't be used.\n\n Returns:\n The dataset card for the `Distiset`.\n \"\"\"\n sample_records = {}\n for name, dataset in self.items():\n record = (\n dataset[0] if not isinstance(dataset, dict) else dataset[\"train\"][0]\n )\n for key, value in record.items():\n # If list is too big, the `README.md` generated will be huge so we truncate it\n if isinstance(value, list):\n length = len(value)\n if length < 10:\n continue\n record[key] = value[:10]\n record[key].append(\n f\"... 
(truncated - showing 10 of {length} elements)\"\n )\n sample_records[name] = record\n\n readme_metadata = {}\n if repo_id and token:\n readme_metadata = self._extract_readme_metadata(repo_id, token)\n\n metadata = {\n **readme_metadata,\n \"size_categories\": size_categories_parser(\n max(len(dataset) for dataset in self.values())\n ),\n \"tags\": [\"synthetic\", \"distilabel\", \"rlaif\"],\n }\n\n card = DistilabelDatasetCard.from_template(\n card_data=DatasetCardData(**metadata),\n repo_id=repo_id,\n sample_records=sample_records,\n include_script=include_script,\n filename_py=filename_py,\n artifacts=self._get_artifacts_metadata(),\n references=self.citations,\n )\n\n return card\n\n def _get_artifacts_metadata(self) -> Dict[str, List[Dict[str, Any]]]:\n \"\"\"Gets a dictionary with the metadata of the artifacts generated by the pipeline steps.\n\n Returns:\n A dictionary in which the key is the name of the step and the value is a list\n of dictionaries, each of them containing the name and metadata of the step artifact.\n \"\"\"\n if not self.artifacts_path:\n return {}\n\n def iterdir_ignore_hidden(path: Path) -> Generator[Path, None, None]:\n return (f for f in Path(path).iterdir() if not f.name.startswith(\".\"))\n\n artifacts_metadata = defaultdict(list)\n for step_artifacts_dir in iterdir_ignore_hidden(self.artifacts_path):\n step_name = step_artifacts_dir.stem\n for artifact_dir in iterdir_ignore_hidden(step_artifacts_dir):\n artifact_name = artifact_dir.stem\n metadata_path = artifact_dir / \"metadata.json\"\n metadata = json.loads(metadata_path.read_text())\n artifacts_metadata[step_name].append(\n {\"name\": artifact_name, \"metadata\": metadata}\n )\n\n return dict(artifacts_metadata)\n\n def _extract_readme_metadata(\n self, repo_id: str, token: Optional[str]\n ) -> Dict[str, Any]:\n \"\"\"Extracts the metadata from the README.md file of the dataset repository.\n\n We have to download the previous README.md file in the repo, extract the metadata from it,\n and generate a dict again to be passed thorough the `DatasetCardData` object.\n\n Args:\n repo_id: The ID of the repository to push to, from the `push_to_hub` method.\n token: The token to authenticate with the Hugging Face Hub, from the `push_to_hub` method.\n\n Returns:\n The metadata extracted from the README.md file of the dataset repository as a dict.\n \"\"\"\n readme_path = Path(\n hf_hub_download(repo_id, \"README.md\", repo_type=\"dataset\", token=token)\n )\n # Remove the '---' from the metadata\n metadata = re.findall(r\"---\\n(.*?)\\n---\", readme_path.read_text(), re.DOTALL)[0]\n metadata = yaml.safe_load(metadata)\n return metadata\n\n def _generate_card(\n self,\n repo_id: str,\n token: str,\n include_script: bool = False,\n filename_py: Optional[str] = None,\n ) -> None:\n \"\"\"Generates a dataset card and pushes it to the Hugging Face Hub, and\n if the `pipeline.yaml` path is available in the `Distiset`, uploads that\n to the same repository.\n\n Args:\n repo_id: The ID of the repository to push to, from the `push_to_hub` method.\n token: The token to authenticate with the Hugging Face Hub, from the `push_to_hub` method.\n include_script: Whether to upload the script to the hugging face repository.\n filename_py: The name of the script. 
If `include_script` is True, the script will\n be uploaded to the repository using this name, otherwise it won't be used.\n \"\"\"\n card = self._get_card(\n repo_id=repo_id,\n token=token,\n include_script=include_script,\n filename_py=filename_py,\n )\n\n card.push_to_hub(\n repo_id,\n repo_type=\"dataset\",\n token=token,\n )\n\n if self.pipeline_path:\n # If the pipeline.yaml is available, upload it to the Hugging Face Hub as well.\n HfApi().upload_file(\n path_or_fileobj=self.pipeline_path,\n path_in_repo=PIPELINE_CONFIG_FILENAME,\n repo_id=repo_id,\n repo_type=\"dataset\",\n token=token,\n )\n\n if self.log_filename_path:\n # The same we had with \"pipeline.yaml\" but with the log file.\n HfApi().upload_file(\n path_or_fileobj=self.log_filename_path,\n path_in_repo=PIPELINE_LOG_FILENAME,\n repo_id=repo_id,\n repo_type=\"dataset\",\n token=token,\n )\n\n def train_test_split(\n self,\n train_size: float,\n shuffle: bool = True,\n seed: Optional[int] = None,\n ) -> Self:\n \"\"\"Return a `Distiset` whose values will be a `datasets.DatasetDict` with two random train and test subsets.\n Splits are created from the dataset according to `train_size` and `shuffle`.\n\n Args:\n train_size:\n Float between `0.0` and `1.0` representing the proportion of the dataset to include in the test split.\n It will be applied to all the datasets in the `Distiset`.\n shuffle: Whether or not to shuffle the data before splitting\n seed:\n A seed to initialize the default BitGenerator, passed to the underlying method.\n\n Returns:\n The `Distiset` with the train-test split applied to all the datasets.\n \"\"\"\n assert 0 < train_size < 1, \"train_size must be a float between 0 and 1\"\n for name, dataset in self.items():\n self[name] = dataset.train_test_split(\n train_size=train_size,\n shuffle=shuffle,\n seed=seed,\n )\n return self\n\n def save_to_disk(\n self,\n distiset_path: PathLike,\n max_shard_size: Optional[Union[str, int]] = None,\n num_shards: Optional[int] = None,\n num_proc: Optional[int] = None,\n storage_options: Optional[dict] = None,\n save_card: bool = True,\n save_pipeline_config: bool = True,\n save_pipeline_log: bool = True,\n ) -> None:\n r\"\"\"\n Saves a `Distiset` to a dataset directory, or in a filesystem using any implementation of `fsspec.spec.AbstractFileSystem`.\n\n In case you want to save the `Distiset` in a remote filesystem, you can pass the `storage_options` parameter\n as you would do with `datasets`'s `Dataset.save_to_disk` method: [see example](https://huggingface.co/docs/datasets/filesystems#saving-serialized-datasets)\n\n Args:\n distiset_path: Path where you want to save the `Distiset`. It can be a local path\n (e.g. `dataset/train`) or remote URI (e.g. `s3://my-bucket/dataset/train`)\n max_shard_size: The maximum size of the dataset shards to be uploaded to the hub.\n If expressed as a string, needs to be digits followed by a unit (like `\"50MB\"`).\n Defaults to `None`.\n num_shards: Number of shards to write. By default the number of shards depends on\n `max_shard_size` and `num_proc`. Defaults to `None`.\n num_proc: Number of processes when downloading and generating the dataset locally.\n Multiprocessing is disabled by default. Defaults to `None`.\n storage_options: Key/value pairs to be passed on to the file-system backend, if any.\n Defaults to `None`.\n save_card: Whether to save the dataset card. 
Defaults to `True`.\n save_pipeline_config: Whether to save the pipeline configuration file (aka the `pipeline.yaml` file).\n Defaults to `True`.\n save_pipeline_log: Whether to save the pipeline log file (aka the `pipeline.log` file).\n Defaults to `True`.\n\n Examples:\n ```python\n # Save your distiset in a local folder:\n distiset.save_to_disk(distiset_path=\"my-distiset\")\n # Save your distiset in a remote storage:\n storage_options = {\n \"key\": os.environ[\"S3_ACCESS_KEY\"],\n \"secret\": os.environ[\"S3_SECRET_KEY\"],\n \"client_kwargs\": {\n \"endpoint_url\": os.environ[\"S3_ENDPOINT_URL\"],\n \"region_name\": os.environ[\"S3_REGION\"],\n },\n }\n distiset.save_to_disk(distiset_path=\"my-distiset\", storage_options=storage_options)\n ```\n \"\"\"\n distiset_path = str(distiset_path)\n for name, dataset in self.items():\n dataset.save_to_disk(\n f\"{distiset_path}/{name}\",\n max_shard_size=max_shard_size,\n num_shards=num_shards,\n num_proc=num_proc,\n storage_options=storage_options,\n )\n\n distiset_config_folder = posixpath.join(distiset_path, DISTISET_CONFIG_FOLDER)\n\n fs: fsspec.AbstractFileSystem\n fs, _, _ = fsspec.get_fs_token_paths(\n distiset_config_folder, storage_options=storage_options\n )\n fs.makedirs(distiset_config_folder, exist_ok=True)\n\n if self.artifacts_path:\n distiset_artifacts_folder = posixpath.join(\n distiset_path, DISTISET_ARTIFACTS_FOLDER\n )\n fs.copy(str(self.artifacts_path), distiset_artifacts_folder, recursive=True)\n\n if save_card:\n # NOTE:\u00a0Currently the card is not the same if we write to disk or push to the HF hub,\n # as we aren't generating the README copying/updating the data from the dataset repo.\n card = self._get_card(repo_id=Path(distiset_path).stem, token=None)\n new_filename = posixpath.join(distiset_config_folder, \"README.md\")\n if storage_options:\n # Write the card the same way as DatasetCard.save does:\n with fs.open(new_filename, \"w\", newline=\"\", encoding=\"utf-8\") as f:\n f.write(str(card))\n else:\n card.save(new_filename)\n\n # Write our internal files to the distiset folder by copying them to the distiset folder.\n if save_pipeline_config and self.pipeline_path:\n new_filename = posixpath.join(\n distiset_config_folder, PIPELINE_CONFIG_FILENAME\n )\n if self.pipeline_path.exists() and (not fs.isfile(new_filename)):\n data = yaml.safe_load(self.pipeline_path.read_text())\n with fs.open(new_filename, \"w\", encoding=\"utf-8\") as f:\n yaml.dump(data, f, default_flow_style=False)\n\n if save_pipeline_log and self.log_filename_path:\n new_filename = posixpath.join(distiset_config_folder, PIPELINE_LOG_FILENAME)\n if self.log_filename_path.exists() and (not fs.isfile(new_filename)):\n data = self.log_filename_path.read_text()\n with fs.open(new_filename, \"w\", encoding=\"utf-8\") as f:\n f.write(data)\n\n @classmethod\n def load_from_disk(\n cls,\n distiset_path: PathLike,\n keep_in_memory: Optional[bool] = None,\n storage_options: Optional[Dict[str, Any]] = None,\n download_dir: Optional[PathLike] = None,\n ) -> Self:\n \"\"\"Loads a dataset that was previously saved using `Distiset.save_to_disk` from a dataset\n directory, or from a filesystem using any implementation of `fsspec.spec.AbstractFileSystem`.\n\n Args:\n distiset_path: Path (\"dataset/train\") or remote URI (\"s3://bucket/dataset/train\").\n keep_in_memory: Whether to copy the dataset in-memory, see `datasets.Dataset.load_from_disk``\n for more information. 
Defaults to `None`.\n storage_options: Key/value pairs to be passed on to the file-system backend, if any.\n Defaults to `None`.\n download_dir: Optional directory to download the dataset to. Defaults to None,\n in which case it will create a temporary directory.\n\n Returns:\n A `Distiset` loaded from disk, it should be a `Distiset` object created using `Distiset.save_to_disk`.\n \"\"\"\n original_distiset_path = str(distiset_path)\n\n fs: fsspec.AbstractFileSystem\n fs, _, [distiset_path] = fsspec.get_fs_token_paths( # type: ignore\n original_distiset_path, storage_options=storage_options\n )\n dest_distiset_path = distiset_path\n\n assert fs.isdir(\n original_distiset_path\n ), \"`distiset_path` must be a `PathLike` object pointing to a folder or a URI of a remote filesystem.\"\n\n has_config = False\n has_artifacts = False\n distiset = cls()\n\n if is_remote_filesystem(fs):\n src_dataset_path = distiset_path\n if download_dir:\n dest_distiset_path = download_dir\n else:\n dest_distiset_path = Dataset._build_local_temp_path(src_dataset_path) # type: ignore\n fs.download(src_dataset_path, dest_distiset_path.as_posix(), recursive=True) # type: ignore\n\n # Now we should have the distiset locally, so we can read those files\n for folder in Path(dest_distiset_path).iterdir():\n if folder.stem == DISTISET_CONFIG_FOLDER:\n has_config = True\n continue\n elif folder.stem == DISTISET_ARTIFACTS_FOLDER:\n has_artifacts = True\n continue\n distiset[folder.stem] = load_from_disk(\n str(folder),\n keep_in_memory=keep_in_memory,\n )\n\n # From the config folder we just need to point to the files. Once downloaded we set the path to point to point to the files. Once downloaded we set the path\n # to wherever they are.\n if has_config:\n distiset_config_folder = posixpath.join(\n dest_distiset_path, DISTISET_CONFIG_FOLDER\n )\n\n pipeline_path = posixpath.join(\n distiset_config_folder, PIPELINE_CONFIG_FILENAME\n )\n if Path(pipeline_path).exists():\n distiset.pipeline_path = Path(pipeline_path)\n\n log_filename_path = posixpath.join(\n distiset_config_folder, PIPELINE_LOG_FILENAME\n )\n if Path(log_filename_path).exists():\n distiset.log_filename_path = Path(log_filename_path)\n\n if has_artifacts:\n distiset.artifacts_path = Path(\n posixpath.join(dest_distiset_path, DISTISET_ARTIFACTS_FOLDER)\n )\n\n return distiset\n\n @property\n def pipeline_path(self) -> Union[Path, None]:\n \"\"\"Returns the path to the `pipeline.yaml` file that generated the `Pipeline`.\"\"\"\n return self._pipeline_path\n\n @pipeline_path.setter\n def pipeline_path(self, path: PathLike) -> None:\n self._pipeline_path = Path(path)\n\n @property\n def artifacts_path(self) -> Union[Path, None]:\n \"\"\"Returns the path to the directory containing the artifacts generated by the steps\n of the pipeline.\"\"\"\n return self._artifacts_path\n\n @artifacts_path.setter\n def artifacts_path(self, path: PathLike) -> None:\n self._artifacts_path = Path(path)\n\n @property\n def log_filename_path(self) -> Union[Path, None]:\n \"\"\"Returns the path to the `pipeline.log` file that generated the `Pipeline`.\"\"\"\n return self._log_filename_path\n\n @log_filename_path.setter\n def log_filename_path(self, path: PathLike) -> None:\n self._log_filename_path = Path(path)\n\n @property\n def citations(self) -> Union[List[str], None]:\n \"\"\"Bibtex references to be included in the README.\"\"\"\n return self._citations\n\n @citations.setter\n def citations(self, citations_: List[str]) -> None:\n self._citations = sorted(set(citations_))\n\n def 
__repr__(self):\n # Copy from `datasets.DatasetDict.__repr__`.\n repr = \"\\n\".join([f\"{k}: {v}\" for k, v in self.items()])\n repr = re.sub(r\"^\", \" \" * 4, repr, count=0, flags=re.M)\n return f\"Distiset({{\\n{repr}\\n}})\"\n
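Since a Distiset is a dictionary keyed by leaf step name, its datasets can be accessed directly (the step name below is a placeholder):
distiset = pipeline.run()\ndataset = distiset[\"text_generation\"]  # a datasets.Dataset (or datasets.DatasetDict after train_test_split)\nprint(dataset)\n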
"},{"location":"api/distiset/#distilabel.distiset.Distiset.pipeline_path","title":"pipeline_path
property
writable
","text":"Returns the path to the pipeline.yaml
file that generated the Pipeline
.
artifacts_path
property
writable
","text":"Returns the path to the directory containing the artifacts generated by the steps of the pipeline.
"},{"location":"api/distiset/#distilabel.distiset.Distiset.log_filename_path","title":"log_filename_path
property
writable
","text":"Returns the path to the pipeline.log
file that was generated by the Pipeline
.
citations
property
writable
","text":"Bibtex references to be included in the README.
"},{"location":"api/distiset/#distilabel.distiset.Distiset.push_to_hub","title":"push_to_hub(repo_id, private=False, token=None, generate_card=True, include_script=False, **kwargs)
","text":"Pushes the Distiset
to the Hugging Face Hub, each dataset will be pushed as a different configuration corresponding to the leaf step that generated it.
Parameters:
Name Type Description Defaultrepo_id
str
The ID of the repository to push to in the following format: <user>/<dataset_name>
or <org>/<dataset_name>
. Also accepts <dataset_name>
, which will default to the namespace of the logged-in user.
private
bool
Whether the dataset repository should be set to private or not. Only affects repository creation: a repository that already exists will not be affected by that parameter.
False
token
Optional[str]
An optional authentication token for the Hugging Face Hub. If no token is passed, will default to the token saved locally when logging in with huggingface-cli login
. Will raise an error if no token is passed and the user is not logged-in.
None
generate_card
bool
Whether to generate a dataset card or not. Defaults to True.
True
include_script
bool
Whether you want to push the pipeline script to the Hugging Face Hub to share it. If set to True, the name of the script that was run to create the distiset will be determined automatically, and that will be the name of the file uploaded to your repository. Note that this operation only makes sense for a distiset obtained from calling the Pipeline.run()
method. Defaults to False.
False
**kwargs
Any
Additional keyword arguments to pass to the push_to_hub
method of the datasets.Dataset
object.
{}
Raises:
Type DescriptionValueError
If no token is provided and couldn't be retrieved automatically.
Source code insrc/distilabel/distiset.py
def push_to_hub(\n self,\n repo_id: str,\n private: bool = False,\n token: Optional[str] = None,\n generate_card: bool = True,\n include_script: bool = False,\n **kwargs: Any,\n) -> None:\n \"\"\"Pushes the `Distiset` to the Hugging Face Hub, each dataset will be pushed as a different configuration\n corresponding to the leaf step that generated it.\n\n Args:\n repo_id:\n The ID of the repository to push to in the following format: `<user>/<dataset_name>` or\n `<org>/<dataset_name>`. Also accepts `<dataset_name>`, which will default to the namespace\n of the logged-in user.\n private:\n Whether the dataset repository should be set to private or not. Only affects repository creation:\n a repository that already exists will not be affected by that parameter.\n token:\n An optional authentication token for the Hugging Face Hub. If no token is passed, will default\n to the token saved locally when logging in with `huggingface-cli login`. Will raise an error\n if no token is passed and the user is not logged-in.\n generate_card:\n Whether to generate a dataset card or not. Defaults to True.\n include_script:\n Whether you want to push the pipeline script to the hugging face hub to share it.\n If set to True, the name of the script that was run to create the distiset will be\n automatically determined, and that will be the name of the file uploaded to your\n repository. Take into account, this operation only makes sense for a distiset obtained\n from calling `Pipeline.run()` method. Defaults to False.\n **kwargs:\n Additional keyword arguments to pass to the `push_to_hub` method of the `datasets.Dataset` object.\n\n Raises:\n ValueError: If no token is provided and couldn't be retrieved automatically.\n \"\"\"\n script_filename = sys.argv[0]\n filename_py = (\n script_filename.split(\"/\")[-1]\n if \"/\" in script_filename\n else script_filename\n )\n script_path = Path.cwd() / script_filename\n\n if token is None:\n token = get_hf_token(self.__class__.__name__, \"token\")\n\n for name, dataset in self.items():\n dataset.push_to_hub(\n repo_id=repo_id,\n config_name=name,\n private=private,\n token=token,\n **kwargs,\n )\n\n if self.artifacts_path:\n upload_folder(\n repo_id=repo_id,\n folder_path=self.artifacts_path,\n path_in_repo=\"artifacts\",\n token=token,\n repo_type=\"dataset\",\n commit_message=\"Include pipeline artifacts\",\n )\n\n if include_script and script_path.exists():\n upload_file(\n path_or_fileobj=script_path,\n path_in_repo=filename_py,\n repo_id=repo_id,\n repo_type=\"dataset\",\n token=token,\n commit_message=\"Include pipeline script\",\n )\n\n if generate_card:\n self._generate_card(\n repo_id, token, include_script=include_script, filename_py=filename_py\n )\n
"},{"location":"api/distiset/#distilabel.distiset.Distiset.train_test_split","title":"train_test_split(train_size, shuffle=True, seed=None)
","text":"Return a Distiset
whose values will be a datasets.DatasetDict
with two random train and test subsets. Splits are created from the dataset according to train_size
and shuffle
.
Parameters:
Name Type Description Defaulttrain_size
float
Float between 0.0
and 1.0
representing the proportion of the dataset to include in the train split. It will be applied to all the datasets in the Distiset
.
shuffle
bool
Whether or not to shuffle the data before splitting.
True
seed
Optional[int]
A seed to initialize the default BitGenerator, passed to the underlying method.
None
Returns:
Type DescriptionSelf
The Distiset
with the train-test split applied to all the datasets.
src/distilabel/distiset.py
def train_test_split(\n self,\n train_size: float,\n shuffle: bool = True,\n seed: Optional[int] = None,\n) -> Self:\n \"\"\"Return a `Distiset` whose values will be a `datasets.DatasetDict` with two random train and test subsets.\n Splits are created from the dataset according to `train_size` and `shuffle`.\n\n Args:\n train_size:\n Float between `0.0` and `1.0` representing the proportion of the dataset to include in the test split.\n It will be applied to all the datasets in the `Distiset`.\n shuffle: Whether or not to shuffle the data before splitting\n seed:\n A seed to initialize the default BitGenerator, passed to the underlying method.\n\n Returns:\n The `Distiset` with the train-test split applied to all the datasets.\n \"\"\"\n assert 0 < train_size < 1, \"train_size must be a float between 0 and 1\"\n for name, dataset in self.items():\n self[name] = dataset.train_test_split(\n train_size=train_size,\n shuffle=shuffle,\n seed=seed,\n )\n return self\n
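For example:
distiset = distiset.train_test_split(train_size=0.8, seed=42)\n# Each value is now a datasets.DatasetDict with \"train\" and \"test\" splits.\n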
"},{"location":"api/distiset/#distilabel.distiset.Distiset.save_to_disk","title":"save_to_disk(distiset_path, max_shard_size=None, num_shards=None, num_proc=None, storage_options=None, save_card=True, save_pipeline_config=True, save_pipeline_log=True)
","text":"Saves a Distiset
to a dataset directory, or in a filesystem using any implementation of fsspec.spec.AbstractFileSystem
.
In case you want to save the Distiset
in a remote filesystem, you can pass the storage_options
parameter as you would do with datasets
's Dataset.save_to_disk
method: see example
Parameters:
Name Type Description Defaultdistiset_path
PathLike
Path where you want to save the Distiset
. It can be a local path (e.g. dataset/train
) or remote URI (e.g. s3://my-bucket/dataset/train
)
max_shard_size
Optional[Union[str, int]]
The maximum size of the dataset shards to be uploaded to the hub. If expressed as a string, needs to be digits followed by a unit (like \"50MB\"
). Defaults to None
.
None
num_shards
Optional[int]
Number of shards to write. By default the number of shards depends on max_shard_size
and num_proc
. Defaults to None
.
None
num_proc
Optional[int]
Number of processes when downloading and generating the dataset locally. Multiprocessing is disabled by default. Defaults to None
.
None
storage_options
Optional[dict]
Key/value pairs to be passed on to the file-system backend, if any. Defaults to None
.
None
save_card
bool
Whether to save the dataset card. Defaults to True
.
True
save_pipeline_config
bool
Whether to save the pipeline configuration file (aka the pipeline.yaml
file). Defaults to True
.
True
save_pipeline_log
bool
Whether to save the pipeline log file (aka the pipeline.log
file). Defaults to True
.
True
Examples:
# Save your distiset in a local folder:\ndistiset.save_to_disk(distiset_path=\"my-distiset\")\n# Save your distiset in a remote storage:\nstorage_options = {\n \"key\": os.environ[\"S3_ACCESS_KEY\"],\n \"secret\": os.environ[\"S3_SECRET_KEY\"],\n \"client_kwargs\": {\n \"endpoint_url\": os.environ[\"S3_ENDPOINT_URL\"],\n \"region_name\": os.environ[\"S3_REGION\"],\n },\n}\ndistiset.save_to_disk(distiset_path=\"my-distiset\", storage_options=storage_options)\n
Source code in src/distilabel/distiset.py
def save_to_disk(\n self,\n distiset_path: PathLike,\n max_shard_size: Optional[Union[str, int]] = None,\n num_shards: Optional[int] = None,\n num_proc: Optional[int] = None,\n storage_options: Optional[dict] = None,\n save_card: bool = True,\n save_pipeline_config: bool = True,\n save_pipeline_log: bool = True,\n) -> None:\n r\"\"\"\n Saves a `Distiset` to a dataset directory, or in a filesystem using any implementation of `fsspec.spec.AbstractFileSystem`.\n\n In case you want to save the `Distiset` in a remote filesystem, you can pass the `storage_options` parameter\n as you would do with `datasets`'s `Dataset.save_to_disk` method: [see example](https://huggingface.co/docs/datasets/filesystems#saving-serialized-datasets)\n\n Args:\n distiset_path: Path where you want to save the `Distiset`. It can be a local path\n (e.g. `dataset/train`) or remote URI (e.g. `s3://my-bucket/dataset/train`)\n max_shard_size: The maximum size of the dataset shards to be uploaded to the hub.\n If expressed as a string, needs to be digits followed by a unit (like `\"50MB\"`).\n Defaults to `None`.\n num_shards: Number of shards to write. By default the number of shards depends on\n `max_shard_size` and `num_proc`. Defaults to `None`.\n num_proc: Number of processes when downloading and generating the dataset locally.\n Multiprocessing is disabled by default. Defaults to `None`.\n storage_options: Key/value pairs to be passed on to the file-system backend, if any.\n Defaults to `None`.\n save_card: Whether to save the dataset card. Defaults to `True`.\n save_pipeline_config: Whether to save the pipeline configuration file (aka the `pipeline.yaml` file).\n Defaults to `True`.\n save_pipeline_log: Whether to save the pipeline log file (aka the `pipeline.log` file).\n Defaults to `True`.\n\n Examples:\n ```python\n # Save your distiset in a local folder:\n distiset.save_to_disk(distiset_path=\"my-distiset\")\n # Save your distiset in a remote storage:\n storage_options = {\n \"key\": os.environ[\"S3_ACCESS_KEY\"],\n \"secret\": os.environ[\"S3_SECRET_KEY\"],\n \"client_kwargs\": {\n \"endpoint_url\": os.environ[\"S3_ENDPOINT_URL\"],\n \"region_name\": os.environ[\"S3_REGION\"],\n },\n }\n distiset.save_to_disk(distiset_path=\"my-distiset\", storage_options=storage_options)\n ```\n \"\"\"\n distiset_path = str(distiset_path)\n for name, dataset in self.items():\n dataset.save_to_disk(\n f\"{distiset_path}/{name}\",\n max_shard_size=max_shard_size,\n num_shards=num_shards,\n num_proc=num_proc,\n storage_options=storage_options,\n )\n\n distiset_config_folder = posixpath.join(distiset_path, DISTISET_CONFIG_FOLDER)\n\n fs: fsspec.AbstractFileSystem\n fs, _, _ = fsspec.get_fs_token_paths(\n distiset_config_folder, storage_options=storage_options\n )\n fs.makedirs(distiset_config_folder, exist_ok=True)\n\n if self.artifacts_path:\n distiset_artifacts_folder = posixpath.join(\n distiset_path, DISTISET_ARTIFACTS_FOLDER\n )\n fs.copy(str(self.artifacts_path), distiset_artifacts_folder, recursive=True)\n\n if save_card:\n # NOTE:\u00a0Currently the card is not the same if we write to disk or push to the HF hub,\n # as we aren't generating the README copying/updating the data from the dataset repo.\n card = self._get_card(repo_id=Path(distiset_path).stem, token=None)\n new_filename = posixpath.join(distiset_config_folder, \"README.md\")\n if storage_options:\n # Write the card the same way as DatasetCard.save does:\n with fs.open(new_filename, \"w\", newline=\"\", encoding=\"utf-8\") as f:\n f.write(str(card))\n else:\n 
card.save(new_filename)\n\n # Write our internal files to the distiset folder by copying them to the distiset folder.\n if save_pipeline_config and self.pipeline_path:\n new_filename = posixpath.join(\n distiset_config_folder, PIPELINE_CONFIG_FILENAME\n )\n if self.pipeline_path.exists() and (not fs.isfile(new_filename)):\n data = yaml.safe_load(self.pipeline_path.read_text())\n with fs.open(new_filename, \"w\", encoding=\"utf-8\") as f:\n yaml.dump(data, f, default_flow_style=False)\n\n if save_pipeline_log and self.log_filename_path:\n new_filename = posixpath.join(distiset_config_folder, PIPELINE_LOG_FILENAME)\n if self.log_filename_path.exists() and (not fs.isfile(new_filename)):\n data = self.log_filename_path.read_text()\n with fs.open(new_filename, \"w\", encoding=\"utf-8\") as f:\n f.write(data)\n
"},{"location":"api/distiset/#distilabel.distiset.Distiset.load_from_disk","title":"load_from_disk(distiset_path, keep_in_memory=None, storage_options=None, download_dir=None)
classmethod
","text":"Loads a dataset that was previously saved using Distiset.save_to_disk
from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem
.
Parameters:
Name Type Description Defaultdistiset_path
PathLike
Path (\"dataset/train\") or remote URI (\"s3://bucket/dataset/train\").
requiredkeep_in_memory
Optional[bool]
Whether to copy the dataset in-memory, see datasets.Dataset.load_from_disk for more information. Defaults to None
.
None
storage_options
Optional[Dict[str, Any]]
Key/value pairs to be passed on to the file-system backend, if any. Defaults to None
.
None
download_dir
Optional[PathLike]
Optional directory to download the dataset to. Defaults to None, in which case it will create a temporary directory.
None
Returns:
Type DescriptionSelf
A Distiset
loaded from disk, it should be a Distiset
object created using Distiset.save_to_disk
.
src/distilabel/distiset.py
@classmethod\ndef load_from_disk(\n cls,\n distiset_path: PathLike,\n keep_in_memory: Optional[bool] = None,\n storage_options: Optional[Dict[str, Any]] = None,\n download_dir: Optional[PathLike] = None,\n) -> Self:\n \"\"\"Loads a dataset that was previously saved using `Distiset.save_to_disk` from a dataset\n directory, or from a filesystem using any implementation of `fsspec.spec.AbstractFileSystem`.\n\n Args:\n distiset_path: Path (\"dataset/train\") or remote URI (\"s3://bucket/dataset/train\").\n keep_in_memory: Whether to copy the dataset in-memory, see `datasets.Dataset.load_from_disk``\n for more information. Defaults to `None`.\n storage_options: Key/value pairs to be passed on to the file-system backend, if any.\n Defaults to `None`.\n download_dir: Optional directory to download the dataset to. Defaults to None,\n in which case it will create a temporary directory.\n\n Returns:\n A `Distiset` loaded from disk, it should be a `Distiset` object created using `Distiset.save_to_disk`.\n \"\"\"\n original_distiset_path = str(distiset_path)\n\n fs: fsspec.AbstractFileSystem\n fs, _, [distiset_path] = fsspec.get_fs_token_paths( # type: ignore\n original_distiset_path, storage_options=storage_options\n )\n dest_distiset_path = distiset_path\n\n assert fs.isdir(\n original_distiset_path\n ), \"`distiset_path` must be a `PathLike` object pointing to a folder or a URI of a remote filesystem.\"\n\n has_config = False\n has_artifacts = False\n distiset = cls()\n\n if is_remote_filesystem(fs):\n src_dataset_path = distiset_path\n if download_dir:\n dest_distiset_path = download_dir\n else:\n dest_distiset_path = Dataset._build_local_temp_path(src_dataset_path) # type: ignore\n fs.download(src_dataset_path, dest_distiset_path.as_posix(), recursive=True) # type: ignore\n\n # Now we should have the distiset locally, so we can read those files\n for folder in Path(dest_distiset_path).iterdir():\n if folder.stem == DISTISET_CONFIG_FOLDER:\n has_config = True\n continue\n elif folder.stem == DISTISET_ARTIFACTS_FOLDER:\n has_artifacts = True\n continue\n distiset[folder.stem] = load_from_disk(\n str(folder),\n keep_in_memory=keep_in_memory,\n )\n\n # From the config folder we just need to point to the files. Once downloaded we set the path to point to point to the files. Once downloaded we set the path\n # to wherever they are.\n if has_config:\n distiset_config_folder = posixpath.join(\n dest_distiset_path, DISTISET_CONFIG_FOLDER\n )\n\n pipeline_path = posixpath.join(\n distiset_config_folder, PIPELINE_CONFIG_FILENAME\n )\n if Path(pipeline_path).exists():\n distiset.pipeline_path = Path(pipeline_path)\n\n log_filename_path = posixpath.join(\n distiset_config_folder, PIPELINE_LOG_FILENAME\n )\n if Path(log_filename_path).exists():\n distiset.log_filename_path = Path(log_filename_path)\n\n if has_artifacts:\n distiset.artifacts_path = Path(\n posixpath.join(dest_distiset_path, DISTISET_ARTIFACTS_FOLDER)\n )\n\n return distiset\n
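For example (the paths, bucket and credentials are placeholders):
import os\n\n# Local folder previously created with Distiset.save_to_disk:\ndistiset = Distiset.load_from_disk(\"my-distiset\")\n# Remote filesystem, e.g. S3:\ndistiset = Distiset.load_from_disk(\n    \"s3://my-bucket/my-distiset\",\n    storage_options={\"key\": os.environ[\"S3_ACCESS_KEY\"], \"secret\": os.environ[\"S3_SECRET_KEY\"]},\n)\n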
"},{"location":"api/distiset/#distilabel.distiset.create_distiset","title":"create_distiset(data_dir, pipeline_path=None, log_filename_path=None, enable_metadata=False, dag=None)
","text":"Creates a Distiset
from the buffer folder.
This function is intended to be used as a helper to create a Distiset
from the folder where the cached data was written by the _WriteBuffer
.
Parameters:
Name Type Description Defaultdata_dir
Path
Folder where the data buffers were written by the _WriteBuffer
. It should correspond to CacheLocation.data
.
pipeline_path
Optional[Path]
Optional path to the pipeline.yaml file that generated the dataset. Internally this will be passed to the Distiset
object on creation to allow uploading the pipeline.yaml
file to the repo upon Distiset.push_to_hub
.
None
log_filename_path
Optional[Path]
Optional path to the pipeline.log file that was generated during the pipeline run. Internally this will be passed to the Distiset
object on creation to allow uploading the pipeline.log
file to the repo upon Distiset.push_to_hub
.
None
enable_metadata
bool
Whether to include the distilabel metadata column in the dataset or not. Defaults to False
.
False
dag
Optional[DAG]
DAG contained in a Pipeline
. If informed, will be used to extract the references/ citations from it.
None
Returns:
Type DescriptionDistiset
The dataset created from the buffer folder, where the different leaf steps will
Distiset
correspond to different configurations of the dataset.
Examples:
from pathlib import Path\ndistiset = create_distiset(Path.home() / \".cache/distilabel/pipelines/path-to-pipe-hashname\")\n
Source code in src/distilabel/distiset.py
def create_distiset( # noqa: C901\n data_dir: Path,\n pipeline_path: Optional[Path] = None,\n log_filename_path: Optional[Path] = None,\n enable_metadata: bool = False,\n dag: Optional[\"DAG\"] = None,\n) -> Distiset:\n \"\"\"Creates a `Distiset` from the buffer folder.\n\n This function is intended to be used as a helper to create a `Distiset` from from the folder\n where the cached data was written by the `_WriteBuffer`.\n\n Args:\n data_dir: Folder where the data buffers were written by the `_WriteBuffer`.\n It should correspond to `CacheLocation.data`.\n pipeline_path: Optional path to the pipeline.yaml file that generated the dataset.\n Internally this will be passed to the `Distiset` object on creation to allow\n uploading the `pipeline.yaml` file to the repo upon `Distiset.push_to_hub`.\n log_filename_path: Optional path to the pipeline.log file that was generated during the pipeline run.\n Internally this will be passed to the `Distiset` object on creation to allow\n uploading the `pipeline.log` file to the repo upon `Distiset.push_to_hub`.\n enable_metadata: Whether to include the distilabel metadata column in the dataset or not.\n Defaults to `False`.\n dag: DAG contained in a `Pipeline`. If informed, will be used to extract the references/\n citations from it.\n\n Returns:\n The dataset created from the buffer folder, where the different leaf steps will\n correspond to different configurations of the dataset.\n\n Examples:\n ```python\n from pathlib import Path\n distiset = create_distiset(Path.home() / \".cache/distilabel/pipelines/path-to-pipe-hashname\")\n ```\n \"\"\"\n from distilabel.constants import DISTILABEL_METADATA_KEY\n\n logger = logging.getLogger(\"distilabel.distiset\")\n\n steps_outputs_dir = data_dir / STEPS_OUTPUTS_PATH\n\n distiset = Distiset()\n for file in steps_outputs_dir.iterdir():\n if file.is_file():\n continue\n\n files = [str(file) for file in list_files_in_dir(file)]\n if files:\n try:\n ds = load_dataset(\n \"parquet\", name=file.stem, data_files={\"train\": files}\n )\n if not enable_metadata and DISTILABEL_METADATA_KEY in ds.column_names:\n ds = ds.remove_columns(DISTILABEL_METADATA_KEY)\n distiset[file.stem] = ds\n except ArrowInvalid:\n logger.warning(f\"\u274c Failed to load the subset from '{file}' directory.\")\n continue\n else:\n logger.warning(\n f\"No output files for step '{file.stem}', can't create a dataset.\"\n \" Did the step produce any data?\"\n )\n\n # If there's only one dataset i.e. 
one config, then set the config name to `default`\n if len(distiset.keys()) == 1:\n distiset[\"default\"] = distiset.pop(list(distiset.keys())[0])\n\n # If there's any artifact set the `artifacts_path` so they can be uploaded\n steps_artifacts_dir = data_dir / STEPS_ARTIFACTS_PATH\n if any(steps_artifacts_dir.rglob(\"*\")):\n distiset.artifacts_path = steps_artifacts_dir\n\n # Include `pipeline.yaml` if exists\n if pipeline_path:\n distiset.pipeline_path = pipeline_path\n else:\n # If the pipeline path is not provided, try to find it in the parent directory\n # and assume that's the wanted file.\n pipeline_path = steps_outputs_dir.parent / \"pipeline.yaml\"\n if pipeline_path.exists():\n distiset.pipeline_path = pipeline_path\n\n # Include `pipeline.log` if exists\n if log_filename_path:\n distiset.log_filename_path = log_filename_path\n else:\n log_filename_path = steps_outputs_dir.parent / \"pipeline.log\"\n if log_filename_path.exists():\n distiset.log_filename_path = log_filename_path\n\n if dag:\n distiset._citations = _grab_citations(dag)\n\n return distiset\n
"},{"location":"api/errors/","title":"Errors","text":"This section contains the distilabel
custom errors. Unlike exceptions, errors in distilabel
are used to handle unexpected situations that can't be anticipated and that can't be handled in a controlled way.
DistilabelError
","text":"A mixin class for common functionality shared by all Distilabel-specific errors.
Attributes:
Name Type Descriptionmessage
A message describing the error.
page
An optional path to a documentation page that provides more context about the error.
Examples:
raise DistilabelUserError(\"This is an error message.\")\nThis is an error message.\n\nraise DistilabelUserError(\"This is an error message.\", page=\"sections/getting_started/faq/\")\nThis is an error message.\nFor further information visit 'https://distilabel.argilla.io/latest/sections/getting_started/faq/'\n
Source code in src/distilabel/errors.py
class DistilabelError:\n \"\"\"A mixin class for common functionality shared by all Distilabel-specific errors.\n\n Attributes:\n message: A message describing the error.\n page: An optional error code from PydanticErrorCodes enum.\n\n Examples:\n ```python\n raise DistilabelUserError(\"This is an error message.\")\n This is an error message.\n\n raise DistilabelUserError(\"This is an error message.\", page=\"sections/getting_started/faq/\")\n This is an error message.\n For further information visit 'https://distilabel.argilla.io/latest/sections/getting_started/faq/'\n ```\n \"\"\"\n\n def __init__(self, message: str, *, page: Optional[str] = None) -> None:\n self.message = message\n self.page = page\n\n def __str__(self) -> str:\n if self.page is None:\n return self.message\n else:\n return f\"{self.message}\\n\\nFor further information visit '{DISTILABEL_DOCS_URL}{self.page}'\"\n
"},{"location":"api/errors/#distilabel.errors.DistilabelUserError","title":"DistilabelUserError
","text":" Bases: DistilabelError
, ValueError
ValueError that we can redirect to a given page in the documentation.
Source code insrc/distilabel/errors.py
class DistilabelUserError(DistilabelError, ValueError):\n \"\"\"ValueError that we can redirect to a given page in the documentation.\"\"\"\n\n pass\n
"},{"location":"api/errors/#distilabel.errors.DistilabelTypeError","title":"DistilabelTypeError
","text":" Bases: DistilabelError
, TypeError
TypeError that we can redirect to a given page in the documentation.
Source code insrc/distilabel/errors.py
class DistilabelTypeError(DistilabelError, TypeError):\n \"\"\"TypeError that we can redirect to a given page in the documentation.\"\"\"\n\n pass\n
"},{"location":"api/errors/#distilabel.errors.DistilabelNotImplementedError","title":"DistilabelNotImplementedError
","text":" Bases: DistilabelError
, NotImplementedError
NotImplementedError that we can redirect to a given page in the documentation.
Source code insrc/distilabel/errors.py
class DistilabelNotImplementedError(DistilabelError, NotImplementedError):\n \"\"\"NotImplementedError that we can redirect to a given page in the documentation.\"\"\"\n\n pass\n
"},{"location":"api/exceptions/","title":"Exceptions","text":"This section contains the distilabel
custom exceptions. Unlike errors, exceptions in distilabel
are used to handle specific situations that can be anticipated and that can be handled in a controlled way internally by the library.
DistilabelException
","text":" Bases: Exception
Base exception (can be gracefully handled) for distilabel
framework.
src/distilabel/exceptions.py
class DistilabelException(Exception):\n \"\"\"Base exception (can be gracefully handled) for `distilabel` framework.\"\"\"\n\n pass\n
"},{"location":"api/exceptions/#distilabel.exceptions.DistilabelGenerationException","title":"DistilabelGenerationException
","text":" Bases: DistilabelException
Base exception for LLM
generation errors.
src/distilabel/exceptions.py
class DistilabelGenerationException(DistilabelException):\n \"\"\"Base exception for `LLM` generation errors.\"\"\"\n\n pass\n
"},{"location":"api/exceptions/#distilabel.exceptions.DistilabelOfflineBatchGenerationNotFinishedException","title":"DistilabelOfflineBatchGenerationNotFinishedException
","text":" Bases: DistilabelGenerationException
Exception raised when a batch generation is not finished.
Source code insrc/distilabel/exceptions.py
class DistilabelOfflineBatchGenerationNotFinishedException(\n DistilabelGenerationException\n):\n \"\"\"Exception raised when a batch generation is not finished.\"\"\"\n\n jobs_ids: Tuple[str, ...]\n\n def __init__(self, jobs_ids: Tuple[str, ...]) -> None:\n self.jobs_ids = jobs_ids\n super().__init__(f\"Batch generation with jobs_ids={jobs_ids} is not finished\")\n
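For illustration only (the job id is made up), the exception exposes the job ids so they can be recovered when it is caught:
from distilabel.exceptions import DistilabelOfflineBatchGenerationNotFinishedException\n\ntry:\n raise DistilabelOfflineBatchGenerationNotFinishedException(jobs_ids=(\"batch_123\",))\nexcept DistilabelOfflineBatchGenerationNotFinishedException as e:\n print(e.jobs_ids) # ('batch_123',)\n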
"},{"location":"api/mixins/requirements/","title":"RequirementsMixin","text":""},{"location":"api/mixins/requirements/#distilabel.mixins.requirements.RequirementsMixin","title":"RequirementsMixin
","text":"Mixin for classes that have requirements
attribute.
Used to add requirements to a Step
and a Pipeline
.
src/distilabel/mixins/requirements.py
class RequirementsMixin:\n \"\"\"Mixin for classes that have `requirements` attribute.\n\n Used to add requirements to a `Step` and a `Pipeline`.\n \"\"\"\n\n _requirements: Union[List[Requirement], None] = []\n\n def _gather_requirements(self) -> List[str]:\n \"\"\"This method will be overwritten in the `BasePipeline` class to gather the requirements\n from each step.\n \"\"\"\n return []\n\n @property\n def requirements(self) -> List[str]:\n \"\"\"Return a list of requirements that must be installed to run the `Pipeline`.\n\n The requirements in a Pipeline will include the requirements from all the steps (if any).\n\n Returns:\n List of requirements that must be installed to run the `Pipeline`, sorted alphabetically.\n \"\"\"\n self.requirements = self._gather_requirements()\n\n return [str(r) for r in self._requirements]\n\n @requirements.setter\n def requirements(self, _requirements: List[str]) -> None:\n requirements = []\n if not isinstance(_requirements, list):\n _requirements = [_requirements]\n\n for r in _requirements:\n try:\n requirements.append(Requirement(r))\n except InvalidRequirement:\n self._logger.warning(f\"Invalid requirement: `{r}`\")\n\n self._requirements = sorted(\n set(self._requirements).union(set(requirements)), key=lambda x: str(x)\n )\n\n def requirements_to_install(self) -> List[str]:\n \"\"\"Check if the requirements are installed in the current environment, and returns the ones that aren't.\n\n Returns:\n List of requirements required to run the pipeline that are not installed in the current environment.\n \"\"\"\n\n to_install = []\n for req in self.requirements:\n requirement = Requirement(req)\n if importlib.util.find_spec(requirement.name):\n if (str(requirement.specifier) != \"\") and (\n version(requirement.name) != str(requirement.specifier)\n ):\n to_install.append(req)\n else:\n to_install.append(req)\n return to_install\n
"},{"location":"api/mixins/requirements/#distilabel.mixins.requirements.RequirementsMixin.requirements","title":"requirements
property
writable
","text":"Return a list of requirements that must be installed to run the Pipeline
.
The requirements in a Pipeline will include the requirements from all the steps (if any).
Returns:
Type DescriptionList[str]
List of requirements that must be installed to run the Pipeline
, sorted alphabetically.
requirements_to_install()
","text":"Check if the requirements are installed in the current environment, and returns the ones that aren't.
Returns:
Type DescriptionList[str]
List of requirements required to run the pipeline that are not installed in the current environment.
Source code insrc/distilabel/mixins/requirements.py
def requirements_to_install(self) -> List[str]:\n \"\"\"Check if the requirements are installed in the current environment, and returns the ones that aren't.\n\n Returns:\n List of requirements required to run the pipeline that are not installed in the current environment.\n \"\"\"\n\n to_install = []\n for req in self.requirements:\n requirement = Requirement(req)\n if importlib.util.find_spec(requirement.name):\n if (str(requirement.specifier) != \"\") and (\n version(requirement.name) != str(requirement.specifier)\n ):\n to_install.append(req)\n else:\n to_install.append(req)\n return to_install\n
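A minimal sketch of the mixin's behaviour, assuming a hypothetical class that overrides _gather_requirements (as BasePipeline does):
from typing import List\n\nfrom distilabel.mixins.requirements import RequirementsMixin\n\n# Hypothetical class used only to illustrate the mixin\nclass MyComponent(RequirementsMixin):\n def _gather_requirements(self) -> List[str]:\n return [\"numpy\", \"nonexistent_package\"]\n\ncomponent = MyComponent()\nprint(component.requirements) # ['nonexistent_package', 'numpy'] (sorted alphabetically)\nprint(component.requirements_to_install()) # ['nonexistent_package'] if numpy is already installed\n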
"},{"location":"api/mixins/runtime_parameters/","title":"RuntimeParametersMixin","text":""},{"location":"api/mixins/runtime_parameters/#distilabel.mixins.runtime_parameters.RuntimeParametersMixin","title":"RuntimeParametersMixin
","text":" Bases: BaseModel
Mixin for classes that have RuntimeParameter
s attributes.
Attributes:
Name Type Description_runtime_parameters
Dict[str, Any]
A dictionary containing the values of the runtime parameters of the class. This attribute is meant to be used internally and should not be accessed directly.
Source code insrc/distilabel/mixins/runtime_parameters.py
class RuntimeParametersMixin(BaseModel):\n \"\"\"Mixin for classes that have `RuntimeParameter`s attributes.\n\n Attributes:\n _runtime_parameters: A dictionary containing the values of the runtime parameters\n of the class. This attribute is meant to be used internally and should not be\n accessed directly.\n \"\"\"\n\n _runtime_parameters: Dict[str, Any] = PrivateAttr(default_factory=dict)\n\n @property\n def runtime_parameters_names(self) -> \"RuntimeParametersNames\":\n \"\"\"Returns a dictionary containing the name of the runtime parameters of the class\n as keys and whether the parameter is required or not as values.\n\n Returns:\n A dictionary containing the name of the runtime parameters of the class as keys\n and whether the parameter is required or not as values.\n \"\"\"\n\n runtime_parameters = {}\n\n for name, field_info in self.model_fields.items(): # type: ignore\n # `field: RuntimeParameter[Any]` or `field: Optional[RuntimeParameter[Any]]`\n is_runtime_param, is_optional = _is_runtime_parameter(field_info)\n if is_runtime_param:\n runtime_parameters[name] = is_optional\n continue\n\n attr = getattr(self, name)\n\n # `field: RuntimeParametersMixin`\n if isinstance(attr, RuntimeParametersMixin):\n runtime_parameters[name] = attr.runtime_parameters_names\n\n # `field: List[RuntimeParametersMixin]`\n if (\n isinstance(attr, list)\n and attr\n and isinstance(attr[0], RuntimeParametersMixin)\n ):\n runtime_parameters[name] = {\n str(i): item.runtime_parameters_names for i, item in enumerate(attr)\n }\n\n return runtime_parameters\n\n def get_runtime_parameters_info(self) -> List[\"RuntimeParameterInfo\"]:\n \"\"\"Gets the information of the runtime parameters of the class such as the name and\n the description. This function is meant to include the information of the runtime\n parameters in the serialized data of the class.\n\n Returns:\n A list containing the information for each runtime parameter of the class.\n \"\"\"\n runtime_parameters_info = []\n for name, field_info in self.model_fields.items(): # type: ignore\n if name not in self.runtime_parameters_names:\n continue\n\n attr = getattr(self, name)\n\n # Get runtime parameters info for `RuntimeParametersMixin` field\n if isinstance(attr, RuntimeParametersMixin):\n runtime_parameters_info.append(\n {\n \"name\": name,\n \"runtime_parameters_info\": attr.get_runtime_parameters_info(),\n }\n )\n continue\n\n # Get runtime parameters info for `List[RuntimeParametersMixin]` field\n if isinstance(attr, list) and isinstance(attr[0], RuntimeParametersMixin):\n runtime_parameters_info.append(\n {\n \"name\": name,\n \"runtime_parameters_info\": {\n str(i): item.get_runtime_parameters_info()\n for i, item in enumerate(attr)\n },\n }\n )\n continue\n\n info = {\"name\": name, \"optional\": self.runtime_parameters_names[name]}\n if field_info.description is not None:\n info[\"description\"] = field_info.description\n runtime_parameters_info.append(info)\n return runtime_parameters_info\n\n def set_runtime_parameters(self, runtime_parameters: Dict[str, Any]) -> None:\n \"\"\"Sets the runtime parameters of the class using the provided values. 
If the attr\n to be set is a `RuntimeParametersMixin`, it will call `set_runtime_parameters` on\n the attr.\n\n Args:\n runtime_parameters: A dictionary containing the values of the runtime parameters\n to set.\n \"\"\"\n runtime_parameters_names = list(self.runtime_parameters_names.keys())\n for name, value in runtime_parameters.items():\n if name not in self.runtime_parameters_names:\n # Check done just to ensure the unit tests for the mixin run\n if getattr(self, \"pipeline\", None):\n closest = difflib.get_close_matches(\n name, runtime_parameters_names, cutoff=0.5\n )\n msg = (\n f\"\u26a0\ufe0f Runtime parameter '{name}' unknown in step '{self.name}'.\" # type: ignore\n )\n if closest:\n msg += f\" Did you mean any of: {closest}\"\n else:\n msg += f\" Available runtime parameters for the step: {runtime_parameters_names}.\"\n self.pipeline._logger.warning(msg) # type: ignore\n continue\n\n attr = getattr(self, name)\n\n # Set runtime parameters for `RuntimeParametersMixin` field\n if isinstance(attr, RuntimeParametersMixin):\n attr.set_runtime_parameters(value)\n self._runtime_parameters[name] = value\n continue\n\n # Set runtime parameters for `List[RuntimeParametersMixin]` field\n if isinstance(attr, list) and isinstance(attr[0], RuntimeParametersMixin):\n for i, item in enumerate(attr):\n item_value = value.get(str(i), {})\n item.set_runtime_parameters(item_value)\n self._runtime_parameters[name] = value\n continue\n\n # Handle settings values for `_SecretField`\n field_info = self.model_fields[name]\n inner_type = extract_annotation_inner_type(field_info.annotation)\n if is_type_pydantic_secret_field(inner_type):\n value = inner_type(value)\n\n # Set the value of the runtime parameter\n setattr(self, name, value)\n self._runtime_parameters[name] = value\n
"},{"location":"api/mixins/runtime_parameters/#distilabel.mixins.runtime_parameters.RuntimeParametersMixin.runtime_parameters_names","title":"runtime_parameters_names
property
","text":"Returns a dictionary containing the name of the runtime parameters of the class as keys and whether the parameter is required or not as values.
Returns:
Type DescriptionRuntimeParametersNames
A dictionary containing the name of the runtime parameters of the class as keys
RuntimeParametersNames
and whether the parameter is required or not as values.
"},{"location":"api/mixins/runtime_parameters/#distilabel.mixins.runtime_parameters.RuntimeParametersMixin.get_runtime_parameters_info","title":"get_runtime_parameters_info()
","text":"Gets the information of the runtime parameters of the class such as the name and the description. This function is meant to include the information of the runtime parameters in the serialized data of the class.
Returns:
Type DescriptionList[RuntimeParameterInfo]
A list containing the information for each runtime parameter of the class.
Source code insrc/distilabel/mixins/runtime_parameters.py
def get_runtime_parameters_info(self) -> List[\"RuntimeParameterInfo\"]:\n \"\"\"Gets the information of the runtime parameters of the class such as the name and\n the description. This function is meant to include the information of the runtime\n parameters in the serialized data of the class.\n\n Returns:\n A list containing the information for each runtime parameter of the class.\n \"\"\"\n runtime_parameters_info = []\n for name, field_info in self.model_fields.items(): # type: ignore\n if name not in self.runtime_parameters_names:\n continue\n\n attr = getattr(self, name)\n\n # Get runtime parameters info for `RuntimeParametersMixin` field\n if isinstance(attr, RuntimeParametersMixin):\n runtime_parameters_info.append(\n {\n \"name\": name,\n \"runtime_parameters_info\": attr.get_runtime_parameters_info(),\n }\n )\n continue\n\n # Get runtime parameters info for `List[RuntimeParametersMixin]` field\n if isinstance(attr, list) and isinstance(attr[0], RuntimeParametersMixin):\n runtime_parameters_info.append(\n {\n \"name\": name,\n \"runtime_parameters_info\": {\n str(i): item.get_runtime_parameters_info()\n for i, item in enumerate(attr)\n },\n }\n )\n continue\n\n info = {\"name\": name, \"optional\": self.runtime_parameters_names[name]}\n if field_info.description is not None:\n info[\"description\"] = field_info.description\n runtime_parameters_info.append(info)\n return runtime_parameters_info\n
"},{"location":"api/mixins/runtime_parameters/#distilabel.mixins.runtime_parameters.RuntimeParametersMixin.set_runtime_parameters","title":"set_runtime_parameters(runtime_parameters)
","text":"Sets the runtime parameters of the class using the provided values. If the attr to be set is a RuntimeParametersMixin
, it will call set_runtime_parameters
on the attr.
Parameters:
Name Type Description Defaultruntime_parameters
Dict[str, Any]
A dictionary containing the values of the runtime parameters to set.
required Source code insrc/distilabel/mixins/runtime_parameters.py
def set_runtime_parameters(self, runtime_parameters: Dict[str, Any]) -> None:\n \"\"\"Sets the runtime parameters of the class using the provided values. If the attr\n to be set is a `RuntimeParametersMixin`, it will call `set_runtime_parameters` on\n the attr.\n\n Args:\n runtime_parameters: A dictionary containing the values of the runtime parameters\n to set.\n \"\"\"\n runtime_parameters_names = list(self.runtime_parameters_names.keys())\n for name, value in runtime_parameters.items():\n if name not in self.runtime_parameters_names:\n # Check done just to ensure the unit tests for the mixin run\n if getattr(self, \"pipeline\", None):\n closest = difflib.get_close_matches(\n name, runtime_parameters_names, cutoff=0.5\n )\n msg = (\n f\"\u26a0\ufe0f Runtime parameter '{name}' unknown in step '{self.name}'.\" # type: ignore\n )\n if closest:\n msg += f\" Did you mean any of: {closest}\"\n else:\n msg += f\" Available runtime parameters for the step: {runtime_parameters_names}.\"\n self.pipeline._logger.warning(msg) # type: ignore\n continue\n\n attr = getattr(self, name)\n\n # Set runtime parameters for `RuntimeParametersMixin` field\n if isinstance(attr, RuntimeParametersMixin):\n attr.set_runtime_parameters(value)\n self._runtime_parameters[name] = value\n continue\n\n # Set runtime parameters for `List[RuntimeParametersMixin]` field\n if isinstance(attr, list) and isinstance(attr[0], RuntimeParametersMixin):\n for i, item in enumerate(attr):\n item_value = value.get(str(i), {})\n item.set_runtime_parameters(item_value)\n self._runtime_parameters[name] = value\n continue\n\n # Handle settings values for `_SecretField`\n field_info = self.model_fields[name]\n inner_type = extract_annotation_inner_type(field_info.annotation)\n if is_type_pydantic_secret_field(inner_type):\n value = inner_type(value)\n\n # Set the value of the runtime parameter\n setattr(self, name, value)\n self._runtime_parameters[name] = value\n
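A self-contained sketch (the classes and parameter names are hypothetical) showing how the nested dictionary passed to set_runtime_parameters mirrors the structure of the class:
from pydantic import Field\n\nfrom distilabel.mixins.runtime_parameters import RuntimeParameter, RuntimeParametersMixin\n\n# Hypothetical classes used only to illustrate the mixin\nclass FakeSampler(RuntimeParametersMixin):\n temperature: RuntimeParameter[float] = Field(default=0.7, description=\"the sampling temperature\")\n\nclass FakeStep(RuntimeParametersMixin):\n batch_size: RuntimeParameter[int] = Field(default=32, description=\"the number of rows per batch\")\n sampler: FakeSampler = Field(default_factory=FakeSampler)\n\nstep = FakeStep()\nstep.set_runtime_parameters({\"batch_size\": 64, \"sampler\": {\"temperature\": 0.1}})\nprint(step.batch_size, step.sampler.temperature) # 64 0.1\n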
"},{"location":"api/models/embedding/","title":"Embedding","text":"This section contains the API reference for the distilabel
embeddings.
For more information on how the Embeddings
class works and to see some examples, check the Embedding Gallery.
base
","text":""},{"location":"api/models/embedding/#distilabel.models.embeddings.base.Embeddings","title":"Embeddings
","text":" Bases: RuntimeParametersMixin
, BaseModel
, _Serializable
, ABC
Base class for Embeddings
models.
To implement an Embeddings
subclass, you need to subclass this class and implement: - load
method to load the Embeddings
model. Don't forget to call super().load()
, so the _logger
attribute is initialized. - model_name
property to return the model name used for the Embeddings
. - encode
method to generate the sentence embeddings.
Attributes:
Name Type Description_logger
Logger
the logger to be used for the Embeddings
model. It will be initialized when the load
method is called.
src/distilabel/models/embeddings/base.py
class Embeddings(RuntimeParametersMixin, BaseModel, _Serializable, ABC):\n \"\"\"Base class for `Embeddings` models.\n\n To implement an `Embeddings` subclass, you need to subclass this class and implement:\n - `load` method to load the `Embeddings` model. Don't forget to call `super().load()`,\n so the `_logger` attribute is initialized.\n - `model_name` property to return the model name used for the `Embeddings`.\n - `encode` method to generate the sentence embeddings.\n\n Attributes:\n _logger: the logger to be used for the `Embeddings` model. It will be initialized\n when the `load` method is called.\n \"\"\"\n\n model_config = ConfigDict(\n arbitrary_types_allowed=True,\n protected_namespaces=(),\n validate_default=True,\n validate_assignment=True,\n extra=\"forbid\",\n )\n _logger: \"Logger\" = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Method to be called to initialize the `Embeddings`\"\"\"\n self._logger = logging.getLogger(f\"distilabel.llm.{self.model_name}\")\n\n def unload(self) -> None:\n \"\"\"Method to be called to unload the `Embeddings` and release any resources.\"\"\"\n pass\n\n @property\n @abstractmethod\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the `Embeddings`.\"\"\"\n pass\n\n @abstractmethod\n def encode(self, inputs: List[str]) -> List[List[Union[int, float]]]:\n \"\"\"Generates embeddings for the provided inputs.\n\n Args:\n inputs: a list of texts for which an embedding has to be generated.\n\n Returns:\n The generated embeddings.\n \"\"\"\n pass\n
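A minimal, hypothetical subclass satisfying the interface described above (the returned values are meaningless and only illustrate the contract):
from typing import List, Union\n\nfrom distilabel.models.embeddings.base import Embeddings\n\nclass ToyEmbeddings(Embeddings):\n dimension: int = 8\n\n def load(self) -> None:\n super().load() # initializes the `_logger` attribute\n\n @property\n def model_name(self) -> str:\n return \"toy-embeddings\"\n\n def encode(self, inputs: List[str]) -> List[List[Union[int, float]]]:\n # Deterministic pseudo-embeddings derived from the text length\n return [[float(len(text) % (i + 2)) for i in range(self.dimension)] for text in inputs]\n\nembeddings = ToyEmbeddings()\nembeddings.load()\nembeddings.encode(inputs=[\"distilabel is awesome!\", \"and Argilla!\"])\n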
"},{"location":"api/models/embedding/#distilabel.models.embeddings.base.Embeddings.model_name","title":"model_name
abstractmethod
property
","text":"Returns the model name used for the Embeddings
.
load()
","text":"Method to be called to initialize the Embeddings
src/distilabel/models/embeddings/base.py
def load(self) -> None:\n \"\"\"Method to be called to initialize the `Embeddings`\"\"\"\n self._logger = logging.getLogger(f\"distilabel.llm.{self.model_name}\")\n
"},{"location":"api/models/embedding/#distilabel.models.embeddings.base.Embeddings.unload","title":"unload()
","text":"Method to be called to unload the Embeddings
and release any resources.
src/distilabel/models/embeddings/base.py
def unload(self) -> None:\n \"\"\"Method to be called to unload the `Embeddings` and release any resources.\"\"\"\n pass\n
"},{"location":"api/models/embedding/#distilabel.models.embeddings.base.Embeddings.encode","title":"encode(inputs)
abstractmethod
","text":"Generates embeddings for the provided inputs.
Parameters:
Name Type Description Defaultinputs
List[str]
a list of texts for which an embedding has to be generated.
requiredReturns:
Type DescriptionList[List[Union[int, float]]]
The generated embeddings.
Source code insrc/distilabel/models/embeddings/base.py
@abstractmethod\ndef encode(self, inputs: List[str]) -> List[List[Union[int, float]]]:\n \"\"\"Generates embeddings for the provided inputs.\n\n Args:\n inputs: a list of texts for which an embedding has to be generated.\n\n Returns:\n The generated embeddings.\n \"\"\"\n pass\n
"},{"location":"api/models/embedding/embedding_gallery/","title":"Embedding Gallery","text":"This section contains the existing Embeddings
subclasses implemented in distilabel
.
embeddings
","text":""},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.SentenceTransformerEmbeddings","title":"SentenceTransformerEmbeddings
","text":" Bases: Embeddings
, CudaDevicePlacementMixin
sentence-transformers
library implementation for embedding generation.
Attributes:
Name Type Descriptionmodel
str
the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
device
Optional[RuntimeParameter[str]]
the name of the device used to load the model e.g. \"cuda\", \"mps\", etc. Defaults to None
.
prompts
Optional[Dict[str, str]]
a dictionary containing prompts to be used with the model. Defaults to None
.
default_prompt_name
Optional[str]
the default prompt (in prompts
) that will be applied to the inputs. If not provided, then no prompt will be used. Defaults to None
.
trust_remote_code
bool
whether to allow fetching and executing remote code fetched from the repository in the Hub. Defaults to False
.
revision
Optional[str]
if model
refers to a Hugging Face Hub repository, then the revision (e.g. a branch name or a commit id) to use. Defaults to \"main\"
.
token
Optional[str]
the Hugging Face Hub token that will be used to authenticate to the Hugging Face Hub. If not provided, the HF_TOKEN
environment or huggingface_hub
package local configuration will be used. Defaults to None
.
truncate_dim
Optional[int]
the dimension to truncate the sentence embeddings. Defaults to None
.
model_kwargs
Optional[Dict[str, Any]]
extra kwargs that will be passed to the Hugging Face transformers
model class. Defaults to None
.
tokenizer_kwargs
Optional[Dict[str, Any]]
extra kwargs that will be passed to the Hugging Face transformers
tokenizer class. Defaults to None
.
config_kwargs
Optional[Dict[str, Any]]
extra kwargs that will be passed to the Hugging Face transformers
configuration class. Defaults to None
.
precision
Optional[Literal['float32', 'int8', 'uint8', 'binary', 'ubinary']]
the dtype that will have the resulting embeddings. Defaults to \"float32\"
.
normalize_embeddings
RuntimeParameter[bool]
whether to normalize the embeddings so they have a length of 1. Defaults to None
.
Examples:
Generating sentence embeddings:
from distilabel.models import SentenceTransformerEmbeddings\n\nembeddings = SentenceTransformerEmbeddings(model=\"mixedbread-ai/mxbai-embed-large-v1\")\n\nembeddings.load()\n\nresults = embeddings.encode(inputs=[\"distilabel is awesome!\", \"and Argilla!\"])\n# [\n# [-0.05447685346007347, -0.01623094454407692, ...],\n# [4.4889533455716446e-05, 0.044016145169734955, ...],\n# ]\n
Source code in src/distilabel/models/embeddings/sentence_transformers.py
class SentenceTransformerEmbeddings(Embeddings, CudaDevicePlacementMixin):\n \"\"\"`sentence-transformers` library implementation for embedding generation.\n\n Attributes:\n model: the model Hugging Face Hub repo id or a path to a directory containing the\n model weights and configuration files.\n device: the name of the device used to load the model e.g. \"cuda\", \"mps\", etc.\n Defaults to `None`.\n prompts: a dictionary containing prompts to be used with the model. Defaults to\n `None`.\n default_prompt_name: the default prompt (in `prompts`) that will be applied to the\n inputs. If not provided, then no prompt will be used. Defaults to `None`.\n trust_remote_code: whether to allow fetching and executing remote code fetched\n from the repository in the Hub. Defaults to `False`.\n revision: if `model` refers to a Hugging Face Hub repository, then the revision\n (e.g. a branch name or a commit id) to use. Defaults to `\"main\"`.\n token: the Hugging Face Hub token that will be used to authenticate to the Hugging\n Face Hub. If not provided, the `HF_TOKEN` environment or `huggingface_hub` package\n local configuration will be used. Defaults to `None`.\n truncate_dim: the dimension to truncate the sentence embeddings. Defaults to `None`.\n model_kwargs: extra kwargs that will be passed to the Hugging Face `transformers`\n model class. Defaults to `None`.\n tokenizer_kwargs: extra kwargs that will be passed to the Hugging Face `transformers`\n tokenizer class. Defaults to `None`.\n config_kwargs: extra kwargs that will be passed to the Hugging Face `transformers`\n configuration class. Defaults to `None`.\n precision: the dtype that will have the resulting embeddings. Defaults to `\"float32\"`.\n normalize_embeddings: whether to normalize the embeddings so they have a length\n of 1. Defaults to `None`.\n\n Examples:\n Generating sentence embeddings:\n\n ```python\n from distilabel.models import SentenceTransformerEmbeddings\n\n embeddings = SentenceTransformerEmbeddings(model=\"mixedbread-ai/mxbai-embed-large-v1\")\n\n embeddings.load()\n\n results = embeddings.encode(inputs=[\"distilabel is awesome!\", \"and Argilla!\"])\n # [\n # [-0.05447685346007347, -0.01623094454407692, ...],\n # [4.4889533455716446e-05, 0.044016145169734955, ...],\n # ]\n ```\n \"\"\"\n\n model: str\n device: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The device to be used to load the model. 
If `None`, then it\"\n \" will check if a GPU can be used.\",\n )\n prompts: Optional[Dict[str, str]] = None\n default_prompt_name: Optional[str] = None\n trust_remote_code: bool = False\n revision: Optional[str] = None\n token: Optional[str] = None\n truncate_dim: Optional[int] = None\n model_kwargs: Optional[Dict[str, Any]] = None\n tokenizer_kwargs: Optional[Dict[str, Any]] = None\n config_kwargs: Optional[Dict[str, Any]] = None\n precision: Optional[Literal[\"float32\", \"int8\", \"uint8\", \"binary\", \"ubinary\"]] = (\n \"float32\"\n )\n normalize_embeddings: RuntimeParameter[bool] = Field(\n default=True,\n description=\"Whether to normalize the embeddings so the generated vectors\"\n \" have a length of 1 or not.\",\n )\n\n _model: Union[\"SentenceTransformer\", None] = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Loads the Sentence Transformer model\"\"\"\n super().load()\n\n if self.device == \"cuda\":\n CudaDevicePlacementMixin.load(self)\n\n try:\n from sentence_transformers import SentenceTransformer\n except ImportError as e:\n raise ImportError(\n \"`sentence-transformers` package is not installed. Please install it using\"\n \" `pip install sentence-transformers`.\"\n ) from e\n\n self._model = SentenceTransformer(\n model_name_or_path=self.model,\n device=self.device,\n prompts=self.prompts,\n default_prompt_name=self.default_prompt_name,\n trust_remote_code=self.trust_remote_code,\n revision=self.revision,\n token=self.token,\n truncate_dim=self.truncate_dim,\n model_kwargs=self.model_kwargs,\n tokenizer_kwargs=self.tokenizer_kwargs,\n config_kwargs=self.config_kwargs,\n )\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the name of the model.\"\"\"\n return self.model\n\n def encode(self, inputs: List[str]) -> List[List[Union[int, float]]]:\n \"\"\"Generates embeddings for the provided inputs.\n\n Args:\n inputs: a list of texts for which an embedding has to be generated.\n\n Returns:\n The generated embeddings.\n \"\"\"\n return self._model.encode( # type: ignore\n sentences=inputs,\n batch_size=len(inputs),\n convert_to_numpy=True,\n precision=self.precision, # type: ignore\n normalize_embeddings=self.normalize_embeddings, # type: ignore\n ).tolist() # type: ignore\n\n def unload(self) -> None:\n del self._model\n if self.device == \"cuda\":\n CudaDevicePlacementMixin.unload(self)\n super().unload()\n
"},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.SentenceTransformerEmbeddings.model_name","title":"model_name
property
","text":"Returns the name of the model.
"},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.SentenceTransformerEmbeddings.load","title":"load()
","text":"Loads the Sentence Transformer model
Source code insrc/distilabel/models/embeddings/sentence_transformers.py
def load(self) -> None:\n \"\"\"Loads the Sentence Transformer model\"\"\"\n super().load()\n\n if self.device == \"cuda\":\n CudaDevicePlacementMixin.load(self)\n\n try:\n from sentence_transformers import SentenceTransformer\n except ImportError as e:\n raise ImportError(\n \"`sentence-transformers` package is not installed. Please install it using\"\n \" `pip install sentence-transformers`.\"\n ) from e\n\n self._model = SentenceTransformer(\n model_name_or_path=self.model,\n device=self.device,\n prompts=self.prompts,\n default_prompt_name=self.default_prompt_name,\n trust_remote_code=self.trust_remote_code,\n revision=self.revision,\n token=self.token,\n truncate_dim=self.truncate_dim,\n model_kwargs=self.model_kwargs,\n tokenizer_kwargs=self.tokenizer_kwargs,\n config_kwargs=self.config_kwargs,\n )\n
"},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.SentenceTransformerEmbeddings.encode","title":"encode(inputs)
","text":"Generates embeddings for the provided inputs.
Parameters:
Name Type Description Defaultinputs
List[str]
a list of texts for which an embedding has to be generated.
requiredReturns:
Type DescriptionList[List[Union[int, float]]]
The generated embeddings.
Source code insrc/distilabel/models/embeddings/sentence_transformers.py
def encode(self, inputs: List[str]) -> List[List[Union[int, float]]]:\n \"\"\"Generates embeddings for the provided inputs.\n\n Args:\n inputs: a list of texts for which an embedding has to be generated.\n\n Returns:\n The generated embeddings.\n \"\"\"\n return self._model.encode( # type: ignore\n sentences=inputs,\n batch_size=len(inputs),\n convert_to_numpy=True,\n precision=self.precision, # type: ignore\n normalize_embeddings=self.normalize_embeddings, # type: ignore\n ).tolist() # type: ignore\n
"},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.vLLMEmbeddings","title":"vLLMEmbeddings
","text":" Bases: Embeddings
, CudaDevicePlacementMixin
vllm
library implementation for embedding generation.
Attributes:
Name Type Descriptionmodel
str
the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
dtype
str
the data type to use for the model. Defaults to auto
.
trust_remote_code
bool
whether to trust the remote code when loading the model. Defaults to False
.
quantization
Optional[str]
the quantization mode to use for the model. Defaults to None
.
revision
Optional[str]
the revision of the model to load. Defaults to None
.
enforce_eager
bool
whether to enforce eager execution. Defaults to True
.
seed
int
the seed to use for the random number generator. Defaults to 0
.
extra_kwargs
Optional[RuntimeParameter[Dict[str, Any]]]
additional dictionary of keyword arguments that will be passed to the LLM
class of vllm
library. Defaults to {}
.
_model
LLM
the vLLM
model instance. This attribute is meant to be used internally and should not be accessed directly. It will be set in the load
method.
Examples:
Generating sentence embeddings:
from distilabel.models import vLLMEmbeddings\n\nembeddings = vLLMEmbeddings(model=\"intfloat/e5-mistral-7b-instruct\")\n\nembeddings.load()\n\nresults = embeddings.encode(inputs=[\"distilabel is awesome!\", \"and Argilla!\"])\n# [\n# [-0.05447685346007347, -0.01623094454407692, ...],\n# [4.4889533455716446e-05, 0.044016145169734955, ...],\n# ]\n
Source code in src/distilabel/models/embeddings/vllm.py
class vLLMEmbeddings(Embeddings, CudaDevicePlacementMixin):\n \"\"\"`vllm` library implementation for embedding generation.\n\n Attributes:\n model: the model Hugging Face Hub repo id or a path to a directory containing the\n model weights and configuration files.\n dtype: the data type to use for the model. Defaults to `auto`.\n trust_remote_code: whether to trust the remote code when loading the model. Defaults\n to `False`.\n quantization: the quantization mode to use for the model. Defaults to `None`.\n revision: the revision of the model to load. Defaults to `None`.\n enforce_eager: whether to enforce eager execution. Defaults to `True`.\n seed: the seed to use for the random number generator. Defaults to `0`.\n extra_kwargs: additional dictionary of keyword arguments that will be passed to the\n `LLM` class of `vllm` library. Defaults to `{}`.\n _model: the `vLLM` model instance. This attribute is meant to be used internally\n and should not be accessed directly. It will be set in the `load` method.\n\n References:\n - [Offline inference embeddings](https://docs.vllm.ai/en/latest/getting_started/examples/offline_inference_embedding.html)\n\n Examples:\n Generating sentence embeddings:\n\n ```python\n from distilabel.models import vLLMEmbeddings\n\n embeddings = vLLMEmbeddings(model=\"intfloat/e5-mistral-7b-instruct\")\n\n embeddings.load()\n\n results = embeddings.encode(inputs=[\"distilabel is awesome!\", \"and Argilla!\"])\n # [\n # [-0.05447685346007347, -0.01623094454407692, ...],\n # [4.4889533455716446e-05, 0.044016145169734955, ...],\n # ]\n ```\n \"\"\"\n\n model: str\n dtype: str = \"auto\"\n trust_remote_code: bool = False\n quantization: Optional[str] = None\n revision: Optional[str] = None\n\n enforce_eager: bool = True\n\n seed: int = 0\n\n extra_kwargs: Optional[RuntimeParameter[Dict[str, Any]]] = Field(\n default_factory=dict,\n description=\"Additional dictionary of keyword arguments that will be passed to the\"\n \" `vLLM` class of `vllm` library. See all the supported arguments at: \"\n \"https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py\",\n )\n\n _model: \"_vLLM\" = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Loads the `vLLM` model using either the path or the Hugging Face Hub repository id.\"\"\"\n super().load()\n\n CudaDevicePlacementMixin.load(self)\n\n try:\n from vllm import LLM as _vLLM\n\n except ImportError as ie:\n raise ImportError(\n \"vLLM is not installed. Please install it using `pip install vllm`.\"\n ) from ie\n\n self._model = _vLLM(\n self.model,\n dtype=self.dtype,\n trust_remote_code=self.trust_remote_code,\n quantization=self.quantization,\n revision=self.revision,\n enforce_eager=self.enforce_eager,\n seed=self.seed,\n **self.extra_kwargs, # type: ignore\n )\n\n def unload(self) -> None:\n \"\"\"Unloads the `vLLM` model.\"\"\"\n CudaDevicePlacementMixin.unload(self)\n super().unload()\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the name of the model.\"\"\"\n return self.model\n\n def encode(self, inputs: List[str]) -> List[List[Union[int, float]]]:\n \"\"\"Generates embeddings for the provided inputs.\n\n Args:\n inputs: a list of texts for which an embedding has to be generated.\n\n Returns:\n The generated embeddings.\n \"\"\"\n return [output.outputs.embedding for output in self._model.encode(inputs)]\n
"},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.vLLMEmbeddings.model_name","title":"model_name
property
","text":"Returns the name of the model.
"},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.vLLMEmbeddings.load","title":"load()
","text":"Loads the vLLM
model using either the path or the Hugging Face Hub repository id.
src/distilabel/models/embeddings/vllm.py
def load(self) -> None:\n \"\"\"Loads the `vLLM` model using either the path or the Hugging Face Hub repository id.\"\"\"\n super().load()\n\n CudaDevicePlacementMixin.load(self)\n\n try:\n from vllm import LLM as _vLLM\n\n except ImportError as ie:\n raise ImportError(\n \"vLLM is not installed. Please install it using `pip install vllm`.\"\n ) from ie\n\n self._model = _vLLM(\n self.model,\n dtype=self.dtype,\n trust_remote_code=self.trust_remote_code,\n quantization=self.quantization,\n revision=self.revision,\n enforce_eager=self.enforce_eager,\n seed=self.seed,\n **self.extra_kwargs, # type: ignore\n )\n
"},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.vLLMEmbeddings.unload","title":"unload()
","text":"Unloads the vLLM
model.
src/distilabel/models/embeddings/vllm.py
def unload(self) -> None:\n \"\"\"Unloads the `vLLM` model.\"\"\"\n CudaDevicePlacementMixin.unload(self)\n super().unload()\n
"},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.vLLMEmbeddings.encode","title":"encode(inputs)
","text":"Generates embeddings for the provided inputs.
Parameters:
Name Type Description Defaultinputs
List[str]
a list of texts for which an embedding has to be generated.
requiredReturns:
Type DescriptionList[List[Union[int, float]]]
The generated embeddings.
Source code insrc/distilabel/models/embeddings/vllm.py
def encode(self, inputs: List[str]) -> List[List[Union[int, float]]]:\n \"\"\"Generates embeddings for the provided inputs.\n\n Args:\n inputs: a list of texts for which an embedding has to be generated.\n\n Returns:\n The generated embeddings.\n \"\"\"\n return [output.outputs.embedding for output in self._model.encode(inputs)]\n
"},{"location":"api/models/llm/","title":"LLM","text":"This section contains the API reference for the distilabel
LLMs, both for the LLM
synchronous implementation, and for the AsyncLLM
asynchronous one.
For more information and examples on how to use existing LLMs or create custom ones, please refer to Tutorial - LLM.
"},{"location":"api/models/llm/#distilabel.models.llms.base","title":"base
","text":""},{"location":"api/models/llm/#distilabel.models.llms.base.LLM","title":"LLM
","text":" Bases: RuntimeParametersMixin
, BaseModel
, _Serializable
, ABC
Base class for LLM
s to be used in distilabel
framework.
To implement an LLM
subclass, you need to subclass this class and implement: - load
method to load the LLM
if needed. Don't forget to call super().load()
, so the _logger
attribute is initialized. - model_name
property to return the model name used for the LLM. - generate
method to generate num_generations
per input in inputs
.
Attributes:
Name Type Descriptiongeneration_kwargs
Optional[RuntimeParameter[Dict[str, Any]]]
the kwargs to be propagated to either generate
or agenerate
methods within each LLM
.
use_offline_batch_generation
Optional[RuntimeParameter[bool]]
whether to use the offline_batch_generate
method to generate the responses.
offline_batch_generation_block_until_done
Optional[RuntimeParameter[int]]
if provided, then polling will be done until the ofline_batch_generate
method is able to retrieve the results. The value indicate the time to wait between each polling.
jobs_ids
Union[Tuple[str, ...], None]
the job ids generated by the offline_batch_generate
method. This attribute is used to store the job ids generated by the offline_batch_generate
method so later they can be used to retrieve the results. It is not meant to be set by the user.
_logger
Logger
the logger to be used for the LLM
. It will be initialized when the load
method is called.
src/distilabel/models/llms/base.py
class LLM(RuntimeParametersMixin, BaseModel, _Serializable, ABC):\n \"\"\"Base class for `LLM`s to be used in `distilabel` framework.\n\n To implement an `LLM` subclass, you need to subclass this class and implement:\n - `load` method to load the `LLM` if needed. Don't forget to call `super().load()`,\n so the `_logger` attribute is initialized.\n - `model_name` property to return the model name used for the LLM.\n - `generate` method to generate `num_generations` per input in `inputs`.\n\n Attributes:\n generation_kwargs: the kwargs to be propagated to either `generate` or `agenerate`\n methods within each `LLM`.\n use_offline_batch_generation: whether to use the `offline_batch_generate` method to\n generate the responses.\n offline_batch_generation_block_until_done: if provided, then polling will be done until\n the `ofline_batch_generate` method is able to retrieve the results. The value indicate\n the time to wait between each polling.\n jobs_ids: the job ids generated by the `offline_batch_generate` method. This attribute\n is used to store the job ids generated by the `offline_batch_generate` method\n so later they can be used to retrieve the results. It is not meant to be set by\n the user.\n _logger: the logger to be used for the `LLM`. It will be initialized when the `load`\n method is called.\n \"\"\"\n\n model_config = ConfigDict(\n arbitrary_types_allowed=True,\n protected_namespaces=(),\n validate_default=True,\n validate_assignment=True,\n extra=\"forbid\",\n )\n\n generation_kwargs: Optional[RuntimeParameter[Dict[str, Any]]] = Field(\n default_factory=dict,\n description=\"The kwargs to be propagated to either `generate` or `agenerate`\"\n \" methods within each `LLM`.\",\n )\n use_offline_batch_generation: Optional[RuntimeParameter[bool]] = Field(\n default=False,\n description=\"Whether to use the `offline_batch_generate` method to generate\"\n \" the responses.\",\n )\n offline_batch_generation_block_until_done: Optional[RuntimeParameter[int]] = Field(\n default=None,\n description=\"If provided, then polling will be done until the `ofline_batch_generate`\"\n \" method is able to retrieve the results. The value indicate the time to wait between\"\n \" each polling.\",\n )\n\n jobs_ids: Union[Tuple[str, ...], None] = Field(default=None)\n _logger: \"Logger\" = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Method to be called to initialize the `LLM`, its logger and optionally the\n structured output generator.\"\"\"\n self._logger = logging.getLogger(f\"distilabel.llm.{self.model_name}\")\n\n def unload(self) -> None:\n \"\"\"Method to be called to unload the `LLM` and release any resources.\"\"\"\n pass\n\n @property\n @abstractmethod\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n pass\n\n def get_generation_kwargs(self) -> Dict[str, Any]:\n \"\"\"Returns the generation kwargs to be used for the generation. 
This method can\n be overridden to provide a more complex logic for the generation kwargs.\n\n Returns:\n The kwargs to be used for the generation.\n \"\"\"\n return self.generation_kwargs # type: ignore\n\n @abstractmethod\n def generate(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n ) -> List[\"GenerateOutput\"]:\n \"\"\"Abstract method to be implemented by each LLM to generate `num_generations`\n per input in `inputs`.\n\n Args:\n inputs: the list of inputs to generate responses for which follows OpenAI's\n API format:\n\n ```python\n [\n {\"role\": \"system\", \"content\": \"You're a helpful assistant...\"},\n {\"role\": \"user\", \"content\": \"Give a template email for B2B communications...\"},\n {\"role\": \"assistant\", \"content\": \"Sure, here's a template you can use...\"},\n {\"role\": \"user\", \"content\": \"Modify the second paragraph...\"}\n ]\n ```\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n \"\"\"\n pass\n\n def generate_outputs(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n ) -> List[\"GenerateOutput\"]:\n \"\"\"Generates outputs for the given inputs using either `generate` method or the\n `offine_batch_generate` method if `use_offline_\n \"\"\"\n if self.use_offline_batch_generation:\n if self.offline_batch_generation_block_until_done is not None:\n return self._offline_batch_generate_polling(\n inputs=inputs,\n num_generations=num_generations,\n **kwargs,\n )\n\n # This will raise `DistilabelOfflineBatchGenerationNotFinishedException` right away\n # if the batch generation is not finished.\n return self.offline_batch_generate(\n inputs=inputs,\n num_generations=num_generations,\n **kwargs,\n )\n\n return self.generate(inputs=inputs, num_generations=num_generations, **kwargs)\n\n def _offline_batch_generate_polling(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n ) -> List[\"GenerateOutput\"]:\n \"\"\"Method to poll the `offline_batch_generate` method until the batch generation\n is finished.\n\n Args:\n inputs: the list of inputs to generate responses for.\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n\n Returns:\n A list containing the generations for each input.\n \"\"\"\n while True:\n try:\n return self.offline_batch_generate(\n inputs=inputs,\n num_generations=num_generations,\n **kwargs,\n )\n except DistilabelOfflineBatchGenerationNotFinishedException as e:\n self._logger.info(\n f\"Waiting for the offline batch generation to finish: {e}. Sleeping\"\n f\" for {self.offline_batch_generation_block_until_done} seconds before\"\n \" trying to get the results again.\"\n )\n # When running a `Step` in a child process, SIGINT is overriden so the child\n # process doesn't stop when the parent process receives a SIGINT signal.\n # The new handler sets an environment variable that is checked here to stop\n # the polling.\n if os.getenv(SIGINT_HANDLER_CALLED_ENV_NAME) is not None:\n self._logger.info(\n \"Received a KeyboardInterrupt. Stopping polling for checking if the\"\n \" offline batch generation is finished...\"\n )\n raise e\n time.sleep(self.offline_batch_generation_block_until_done) # type: ignore\n except KeyboardInterrupt as e:\n # This is for the case the `LLM` is being executed outside a pipeline\n self._logger.info(\n \"Received a KeyboardInterrupt. 
Stopping polling for checking if the\"\n \" offline batch generation is finished...\"\n )\n raise DistilabelOfflineBatchGenerationNotFinishedException(\n jobs_ids=self.jobs_ids # type: ignore\n ) from e\n\n @property\n def generate_parameters(self) -> List[\"inspect.Parameter\"]:\n \"\"\"Returns the parameters of the `generate` method.\n\n Returns:\n A list containing the parameters of the `generate` method.\n \"\"\"\n return list(inspect.signature(self.generate).parameters.values())\n\n @property\n def runtime_parameters_names(self) -> \"RuntimeParametersNames\":\n \"\"\"Returns the runtime parameters of the `LLM`, which are combination of the\n attributes of the `LLM` type hinted with `RuntimeParameter` and the parameters\n of the `generate` method that are not `input` and `num_generations`.\n\n Returns:\n A dictionary with the name of the runtime parameters as keys and a boolean\n indicating if the parameter is optional or not.\n \"\"\"\n runtime_parameters = super().runtime_parameters_names\n runtime_parameters[\"generation_kwargs\"] = {}\n\n # runtime parameters from the `generate` method\n for param in self.generate_parameters:\n if param.name in [\"input\", \"inputs\", \"num_generations\"]:\n continue\n is_optional = param.default != inspect.Parameter.empty\n runtime_parameters[\"generation_kwargs\"][param.name] = is_optional\n\n return runtime_parameters\n\n def get_runtime_parameters_info(self) -> List[\"RuntimeParameterInfo\"]:\n \"\"\"Gets the information of the runtime parameters of the `LLM` such as the name\n and the description. This function is meant to include the information of the runtime\n parameters in the serialized data of the `LLM`.\n\n Returns:\n A list containing the information for each runtime parameter of the `LLM`.\n \"\"\"\n runtime_parameters_info = super().get_runtime_parameters_info()\n\n generation_kwargs_info = next(\n (\n runtime_parameter_info\n for runtime_parameter_info in runtime_parameters_info\n if runtime_parameter_info[\"name\"] == \"generation_kwargs\"\n ),\n None,\n )\n\n # If `generation_kwargs` attribute is present, we need to include the `generate`\n # method arguments as the information for this attribute.\n if generation_kwargs_info:\n generate_docstring_args = self.generate_parsed_docstring[\"args\"]\n\n generation_kwargs_info[\"keys\"] = []\n for key, value in generation_kwargs_info[\"optional\"].items():\n info = {\"name\": key, \"optional\": value}\n if description := generate_docstring_args.get(key):\n info[\"description\"] = description\n generation_kwargs_info[\"keys\"].append(info)\n\n generation_kwargs_info.pop(\"optional\")\n\n return runtime_parameters_info\n\n @cached_property\n def generate_parsed_docstring(self) -> \"Docstring\":\n \"\"\"Returns the parsed docstring of the `generate` method.\n\n Returns:\n The parsed docstring of the `generate` method.\n \"\"\"\n return parse_google_docstring(self.generate)\n\n def get_last_hidden_states(\n self, inputs: List[\"StandardInput\"]\n ) -> List[\"HiddenState\"]:\n \"\"\"Method to get the last hidden states of the model for a list of inputs.\n\n Args:\n inputs: the list of inputs to get the last hidden states from.\n\n Returns:\n A list containing the last hidden state for each sequence using a NumPy array\n with shape [num_tokens, hidden_size].\n \"\"\"\n # TODO: update to use `DistilabelNotImplementedError`\n raise NotImplementedError(\n f\"Method `get_last_hidden_states` is not implemented for `{self.__class__.__name__}`\"\n )\n\n def _prepare_structured_output(\n self, 
structured_output: \"StructuredOutputType\"\n ) -> Union[Any, None]:\n \"\"\"Method in charge of preparing the structured output generator.\n\n By default will raise a `NotImplementedError`, subclasses that allow it must override this\n method with the implementation.\n\n Args:\n structured_output: the config to prepare the guided generation.\n\n Returns:\n The structure to be used for the guided generation.\n \"\"\"\n # TODO: update to use `DistilabelNotImplementedError`\n raise NotImplementedError(\n f\"Guided generation is not implemented for `{type(self).__name__}`\"\n )\n\n def offline_batch_generate(\n self,\n inputs: Union[List[\"FormattedInput\"], None] = None,\n num_generations: int = 1,\n **kwargs: Any,\n ) -> List[\"GenerateOutput\"]:\n \"\"\"Method to generate a list of outputs for the given inputs using an offline batch\n generation method to be implemented by each `LLM`.\n\n This method should create jobs the first time is called and store the job ids, so\n the second and subsequent calls can retrieve the results of the batch generation.\n If subsequent calls are made before the batch generation is finished, then the method\n should raise a `DistilabelOfflineBatchGenerationNotFinishedException`. This exception\n will be handled automatically by the `Pipeline` which will store all the required\n information for recovering the pipeline execution when the batch generation is finished.\n\n Args:\n inputs: the list of inputs to generate responses for.\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n\n Returns:\n A list containing the generations for each input.\n \"\"\"\n raise DistilabelNotImplementedError(\n f\"`offline_batch_generate` is not implemented for `{self.__class__.__name__}`\",\n page=\"sections/how_to_guides/advanced/offline-batch-generation/\",\n )\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.model_name","title":"model_name
abstractmethod
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.generate_parameters","title":"generate_parameters
property
","text":"Returns the parameters of the generate
method.
Returns:
Type DescriptionList[Parameter]
A list containing the parameters of the generate
method.
runtime_parameters_names
property
","text":"Returns the runtime parameters of the LLM
, which are combination of the attributes of the LLM
type hinted with RuntimeParameter
and the parameters of the generate
method that are not input
and num_generations
.
Returns:
Type DescriptionRuntimeParametersNames
A dictionary with the name of the runtime parameters as keys and a boolean
RuntimeParametersNames
indicating if the parameter is optional or not.
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.generate_parsed_docstring","title":"generate_parsed_docstring
cached
property
","text":"Returns the parsed docstring of the generate
method.
Returns:
Type DescriptionDocstring
The parsed docstring of the generate
method.
load()
","text":"Method to be called to initialize the LLM
, its logger and optionally the structured output generator.
src/distilabel/models/llms/base.py
def load(self) -> None:\n \"\"\"Method to be called to initialize the `LLM`, its logger and optionally the\n structured output generator.\"\"\"\n self._logger = logging.getLogger(f\"distilabel.llm.{self.model_name}\")\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.unload","title":"unload()
","text":"Method to be called to unload the LLM
and release any resources.
src/distilabel/models/llms/base.py
def unload(self) -> None:\n \"\"\"Method to be called to unload the `LLM` and release any resources.\"\"\"\n pass\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.get_generation_kwargs","title":"get_generation_kwargs()
","text":"Returns the generation kwargs to be used for the generation. This method can be overridden to provide a more complex logic for the generation kwargs.
Returns:
Type DescriptionDict[str, Any]
The kwargs to be used for the generation.
Source code insrc/distilabel/models/llms/base.py
def get_generation_kwargs(self) -> Dict[str, Any]:\n \"\"\"Returns the generation kwargs to be used for the generation. This method can\n be overridden to provide a more complex logic for the generation kwargs.\n\n Returns:\n The kwargs to be used for the generation.\n \"\"\"\n return self.generation_kwargs # type: ignore\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.generate","title":"generate(inputs, num_generations=1, **kwargs)
abstractmethod
","text":"Abstract method to be implemented by each LLM to generate num_generations
per input in inputs
.
Parameters:
Name Type Description Defaultinputs
List[FormattedInput]
the list of inputs to generate responses for which follows OpenAI's API format:
[\n {\"role\": \"system\", \"content\": \"You're a helpful assistant...\"},\n {\"role\": \"user\", \"content\": \"Give a template email for B2B communications...\"},\n {\"role\": \"assistant\", \"content\": \"Sure, here's a template you can use...\"},\n {\"role\": \"user\", \"content\": \"Modify the second paragraph...\"}\n]\n
required num_generations
int
the number of generations to generate per input.
1
**kwargs
Any
the additional kwargs to be used for the generation.
{}
Source code in src/distilabel/models/llms/base.py
@abstractmethod\ndef generate(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n) -> List[\"GenerateOutput\"]:\n \"\"\"Abstract method to be implemented by each LLM to generate `num_generations`\n per input in `inputs`.\n\n Args:\n inputs: the list of inputs to generate responses for which follows OpenAI's\n API format:\n\n ```python\n [\n {\"role\": \"system\", \"content\": \"You're a helpful assistant...\"},\n {\"role\": \"user\", \"content\": \"Give a template email for B2B communications...\"},\n {\"role\": \"assistant\", \"content\": \"Sure, here's a template you can use...\"},\n {\"role\": \"user\", \"content\": \"Modify the second paragraph...\"}\n ]\n ```\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n \"\"\"\n pass\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.generate_outputs","title":"generate_outputs(inputs, num_generations=1, **kwargs)
","text":"Generates outputs for the given inputs using either generate
method or the offine_batch_generate
method if `use_offline_
src/distilabel/models/llms/base.py
def generate_outputs(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n) -> List[\"GenerateOutput\"]:\n \"\"\"Generates outputs for the given inputs using either `generate` method or the\n `offine_batch_generate` method if `use_offline_\n \"\"\"\n if self.use_offline_batch_generation:\n if self.offline_batch_generation_block_until_done is not None:\n return self._offline_batch_generate_polling(\n inputs=inputs,\n num_generations=num_generations,\n **kwargs,\n )\n\n # This will raise `DistilabelOfflineBatchGenerationNotFinishedException` right away\n # if the batch generation is not finished.\n return self.offline_batch_generate(\n inputs=inputs,\n num_generations=num_generations,\n **kwargs,\n )\n\n return self.generate(inputs=inputs, num_generations=num_generations, **kwargs)\n
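A sketch of the offline batch generation flow (again with an illustrative subclass and model name); the first call creates the jobs and raises until the results are available:
from distilabel.exceptions import DistilabelOfflineBatchGenerationNotFinishedException\nfrom distilabel.models import OpenAILLM\n\nllm = OpenAILLM(model=\"gpt-4o-mini\", use_offline_batch_generation=True)\nllm.load()\n\ntry:\n outputs = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello!\"}]])\nexcept DistilabelOfflineBatchGenerationNotFinishedException as e:\n # Store the job ids so the results can be retrieved later (e.g. by recreating the\n # `LLM` with `jobs_ids=e.jobs_ids` and calling `generate_outputs` again)\n print(e.jobs_ids)\n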
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.get_runtime_parameters_info","title":"get_runtime_parameters_info()
","text":"Gets the information of the runtime parameters of the LLM
such as the name and the description. This function is meant to include the information of the runtime parameters in the serialized data of the LLM
.
Returns:
Type DescriptionList[RuntimeParameterInfo]
A list containing the information for each runtime parameter of the LLM
.
src/distilabel/models/llms/base.py
def get_runtime_parameters_info(self) -> List[\"RuntimeParameterInfo\"]:\n \"\"\"Gets the information of the runtime parameters of the `LLM` such as the name\n and the description. This function is meant to include the information of the runtime\n parameters in the serialized data of the `LLM`.\n\n Returns:\n A list containing the information for each runtime parameter of the `LLM`.\n \"\"\"\n runtime_parameters_info = super().get_runtime_parameters_info()\n\n generation_kwargs_info = next(\n (\n runtime_parameter_info\n for runtime_parameter_info in runtime_parameters_info\n if runtime_parameter_info[\"name\"] == \"generation_kwargs\"\n ),\n None,\n )\n\n # If `generation_kwargs` attribute is present, we need to include the `generate`\n # method arguments as the information for this attribute.\n if generation_kwargs_info:\n generate_docstring_args = self.generate_parsed_docstring[\"args\"]\n\n generation_kwargs_info[\"keys\"] = []\n for key, value in generation_kwargs_info[\"optional\"].items():\n info = {\"name\": key, \"optional\": value}\n if description := generate_docstring_args.get(key):\n info[\"description\"] = description\n generation_kwargs_info[\"keys\"].append(info)\n\n generation_kwargs_info.pop(\"optional\")\n\n return runtime_parameters_info\n
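A hedged inspection sketch (the `OpenAILLM` model name is illustrative): each returned entry carries at least a `name`, and the `generation_kwargs` entry additionally exposes a `keys` list describing the generation arguments documented in the `generate`/`agenerate` docstring:
from distilabel.models.llms import OpenAILLM\n\nllm = OpenAILLM(model=\"gpt-4o-mini\")\n\nfor info in llm.get_runtime_parameters_info():\n    print(info[\"name\"])\n# Expected to include `generation_kwargs` among others (e.g. `base_url`, `api_key`).\n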
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.get_last_hidden_states","title":"get_last_hidden_states(inputs)
","text":"Method to get the last hidden states of the model for a list of inputs.
Parameters:
Name Type Description Defaultinputs
List[StandardInput]
the list of inputs to get the last hidden states from.
requiredReturns:
Type DescriptionList[HiddenState]
A list containing the last hidden state for each sequence using a NumPy array with shape [num_tokens, hidden_size].
Source code insrc/distilabel/models/llms/base.py
def get_last_hidden_states(\n self, inputs: List[\"StandardInput\"]\n) -> List[\"HiddenState\"]:\n \"\"\"Method to get the last hidden states of the model for a list of inputs.\n\n Args:\n inputs: the list of inputs to get the last hidden states from.\n\n Returns:\n A list containing the last hidden state for each sequence using a NumPy array\n with shape [num_tokens, hidden_size].\n \"\"\"\n # TODO: update to use `DistilabelNotImplementedError`\n raise NotImplementedError(\n f\"Method `get_last_hidden_states` is not implemented for `{self.__class__.__name__}`\"\n )\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.offline_batch_generate","title":"offline_batch_generate(inputs=None, num_generations=1, **kwargs)
","text":"Method to generate a list of outputs for the given inputs using an offline batch generation method to be implemented by each LLM
.
This method should create jobs the first time it is called and store the job ids, so the second and subsequent calls can retrieve the results of the batch generation. If subsequent calls are made before the batch generation is finished, then the method should raise a DistilabelOfflineBatchGenerationNotFinishedException
. This exception will be handled automatically by the Pipeline
which will store all the required information for recovering the pipeline execution when the batch generation is finished.
Parameters:
Name Type Description Defaultinputs
Union[List[FormattedInput], None]
the list of inputs to generate responses for.
None
num_generations
int
the number of generations to generate per input.
1
**kwargs
Any
the additional kwargs to be used for the generation.
{}
Returns:
Type DescriptionList[GenerateOutput]
A list containing the generations for each input.
Source code insrc/distilabel/models/llms/base.py
def offline_batch_generate(\n self,\n inputs: Union[List[\"FormattedInput\"], None] = None,\n num_generations: int = 1,\n **kwargs: Any,\n) -> List[\"GenerateOutput\"]:\n \"\"\"Method to generate a list of outputs for the given inputs using an offline batch\n generation method to be implemented by each `LLM`.\n\n This method should create jobs the first time is called and store the job ids, so\n the second and subsequent calls can retrieve the results of the batch generation.\n If subsequent calls are made before the batch generation is finished, then the method\n should raise a `DistilabelOfflineBatchGenerationNotFinishedException`. This exception\n will be handled automatically by the `Pipeline` which will store all the required\n information for recovering the pipeline execution when the batch generation is finished.\n\n Args:\n inputs: the list of inputs to generate responses for.\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n\n Returns:\n A list containing the generations for each input.\n \"\"\"\n raise DistilabelNotImplementedError(\n f\"`offline_batch_generate` is not implemented for `{self.__class__.__name__}`\",\n page=\"sections/how_to_guides/advanced/offline-batch-generation/\",\n )\n
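When calling this method directly rather than through a `Pipeline`, the not-finished exception can be handled manually. A hedged sketch: it assumes `OpenAILLM` supports offline batch generation and that the exception is importable from `distilabel.exceptions`:
from distilabel.exceptions import DistilabelOfflineBatchGenerationNotFinishedException  # assumed import path\nfrom distilabel.models.llms import OpenAILLM\n\nllm = OpenAILLM(model=\"gpt-4o-mini\", use_offline_batch_generation=True)\nllm.load()\n\ntry:\n    outputs = llm.offline_batch_generate(\n        inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]]\n    )\nexcept DistilabelOfflineBatchGenerationNotFinishedException:\n    # The first call only creates the batch job(s) and stores their ids;\n    # calling `offline_batch_generate` again later retrieves the results.\n    print(\"Batch generation not finished yet, retry later.\")\n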
"},{"location":"api/models/llm/#distilabel.models.llms.base.AsyncLLM","title":"AsyncLLM
","text":" Bases: LLM
Abstract class for asynchronous LLMs, so as to benefit from the async capabilities of each LLM implementation. This class is meant to be subclassed by each LLM, and the method agenerate
needs to be implemented to provide the asynchronous generation of responses.
Attributes:
Name Type Description_event_loop
AbstractEventLoop
the event loop to be used for the asynchronous generation of responses.
Source code insrc/distilabel/models/llms/base.py
class AsyncLLM(LLM):\n \"\"\"Abstract class for asynchronous LLMs, so as to benefit from the async capabilities\n of each LLM implementation. This class is meant to be subclassed by each LLM, and the\n method `agenerate` needs to be implemented to provide the asynchronous generation of\n responses.\n\n Attributes:\n _event_loop: the event loop to be used for the asynchronous generation of responses.\n \"\"\"\n\n _num_generations_param_supported = True\n _event_loop: \"asyncio.AbstractEventLoop\" = PrivateAttr(default=None)\n _new_event_loop: bool = PrivateAttr(default=False)\n\n @property\n def generate_parameters(self) -> List[inspect.Parameter]:\n \"\"\"Returns the parameters of the `agenerate` method.\n\n Returns:\n A list containing the parameters of the `agenerate` method.\n \"\"\"\n return list(inspect.signature(self.agenerate).parameters.values())\n\n @cached_property\n def generate_parsed_docstring(self) -> \"Docstring\":\n \"\"\"Returns the parsed docstring of the `agenerate` method.\n\n Returns:\n The parsed docstring of the `agenerate` method.\n \"\"\"\n return parse_google_docstring(self.agenerate)\n\n @property\n def event_loop(self) -> \"asyncio.AbstractEventLoop\":\n if self._event_loop is None:\n try:\n self._event_loop = asyncio.get_running_loop()\n if self._event_loop.is_closed():\n self._event_loop = asyncio.new_event_loop() # type: ignore\n self._new_event_loop = True\n except RuntimeError:\n self._event_loop = asyncio.new_event_loop()\n self._new_event_loop = True\n asyncio.set_event_loop(self._event_loop)\n return self._event_loop\n\n @abstractmethod\n async def agenerate(\n self, input: \"FormattedInput\", num_generations: int = 1, **kwargs: Any\n ) -> \"GenerateOutput\":\n \"\"\"Method to generate a `num_generations` responses for a given input asynchronously,\n and executed concurrently in `generate` method.\n \"\"\"\n pass\n\n async def _agenerate(\n self, inputs: List[\"FormattedInput\"], num_generations: int = 1, **kwargs: Any\n ) -> List[\"GenerateOutput\"]:\n \"\"\"Internal function to concurrently generate responses for a list of inputs.\n\n Args:\n inputs: the list of inputs to generate responses for.\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n\n Returns:\n A list containing the generations for each input.\n \"\"\"\n if self._num_generations_param_supported:\n tasks = [\n asyncio.create_task(\n self.agenerate(\n input=input, num_generations=num_generations, **kwargs\n )\n )\n for input in inputs\n ]\n result = await asyncio.gather(*tasks)\n return result\n\n tasks = [\n asyncio.create_task(self.agenerate(input=input, **kwargs))\n for input in inputs\n for _ in range(num_generations)\n ]\n outputs = await asyncio.gather(*tasks)\n return merge_responses(outputs, n=num_generations)\n\n def generate(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n ) -> List[\"GenerateOutput\"]:\n \"\"\"Method to generate a list of responses asynchronously, returning the output\n synchronously awaiting for the response of each input sent to `agenerate`.\n\n Args:\n inputs: the list of inputs to generate responses for.\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n\n Returns:\n A list containing the generations for each input.\n \"\"\"\n return self.event_loop.run_until_complete(\n self._agenerate(inputs=inputs, num_generations=num_generations, **kwargs)\n )\n\n 
def __del__(self) -> None:\n \"\"\"Closes the event loop when the object is deleted.\"\"\"\n if sys.meta_path is None:\n return\n\n if self._new_event_loop:\n if self._event_loop.is_running():\n self._event_loop.stop()\n self._event_loop.close()\n\n @staticmethod\n def _prepare_structured_output( # type: ignore\n structured_output: \"InstructorStructuredOutputType\",\n client: Any = None,\n framework: Optional[str] = None,\n ) -> Dict[str, Union[str, Any]]:\n \"\"\"Wraps the client and updates the schema to work store it internally as a json schema.\n\n Args:\n structured_output: The configuration dict to prepare the structured output.\n client: The client to wrap to generate structured output. Implemented to work\n with `instructor`.\n framework: The name of the framework.\n\n Returns:\n A dictionary containing the wrapped client and the schema to update the structured_output\n variable in case it is a pydantic model.\n \"\"\"\n from distilabel.steps.tasks.structured_outputs.instructor import (\n prepare_instructor,\n )\n\n result = {}\n client = prepare_instructor(\n client,\n mode=structured_output.get(\"mode\"),\n framework=framework, # type: ignore\n )\n result[\"client\"] = client\n\n schema = structured_output.get(\"schema\")\n if not schema:\n raise DistilabelUserError(\n f\"The `structured_output` argument must contain a schema: {structured_output}\",\n page=\"sections/how_to_guides/advanced/structured_generation/#instructor\",\n )\n if inspect.isclass(schema) and issubclass(schema, BaseModel):\n # We want a json schema for the serialization, but instructor wants a pydantic BaseModel.\n structured_output[\"schema\"] = schema.model_json_schema() # type: ignore\n result[\"structured_output\"] = structured_output\n\n return result\n\n @staticmethod\n def _prepare_kwargs(\n arguments: Dict[str, Any], structured_output: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"Helper method to update the kwargs with the structured output configuration,\n used in case they are defined.\n\n Args:\n arguments: The arguments that would be passed to the LLM as **kwargs.\n to update with the structured output configuration.\n structured_outputs: The structured output configuration to update the arguments.\n\n Returns:\n kwargs updated with the special arguments used by `instructor`.\n \"\"\"\n # We can deal with json schema or BaseModel, but we need to convert it to a BaseModel\n # for the Instructor client.\n schema = structured_output.get(\"schema\", {})\n\n # If there's already a pydantic model, we don't need to do anything,\n # otherwise, try to obtain one.\n if not (inspect.isclass(schema) and issubclass(schema, BaseModel)):\n from distilabel.steps.tasks.structured_outputs.utils import (\n json_schema_to_model,\n )\n\n if isinstance(schema, str):\n # In case it was saved in the dataset as a string.\n schema = json.loads(schema)\n\n try:\n schema = json_schema_to_model(schema)\n except Exception as e:\n raise ValueError(\n f\"Failed to convert the schema to a pydantic model, the model is too complex currently: {e}\"\n ) from e\n\n arguments.update(\n **{\n \"response_model\": schema,\n \"max_retries\": structured_output.get(\"max_retries\", 1),\n },\n )\n return arguments\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.AsyncLLM.generate_parameters","title":"generate_parameters
property
","text":"Returns the parameters of the agenerate
method.
Returns:
Type DescriptionList[Parameter]
A list containing the parameters of the agenerate
method.
generate_parsed_docstring
cached
property
","text":"Returns the parsed docstring of the agenerate
method.
Returns:
Type DescriptionDocstring
The parsed docstring of the agenerate
method.
agenerate(input, num_generations=1, **kwargs)
abstractmethod
async
","text":"Method to generate a num_generations
responses for a given input asynchronously, and executed concurrently in generate
method.
src/distilabel/models/llms/base.py
@abstractmethod\nasync def agenerate(\n self, input: \"FormattedInput\", num_generations: int = 1, **kwargs: Any\n) -> \"GenerateOutput\":\n \"\"\"Method to generate a `num_generations` responses for a given input asynchronously,\n and executed concurrently in `generate` method.\n \"\"\"\n pass\n
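Mirroring the synchronous sketch above, an `AsyncLLM` subclass only implements `agenerate` for a single input; `generate` then schedules one coroutine per input and gathers them concurrently. The class name, simulated latency and zeroed statistics are illustrative assumptions:
import asyncio\nfrom typing import Any\n\nfrom distilabel.models.llms.base import AsyncLLM\n\n\nclass SlowEchoAsyncLLM(AsyncLLM):\n    \"\"\"Toy async LLM that waits a bit and echoes the last message back (illustrative only).\"\"\"\n\n    @property\n    def model_name(self) -> str:\n        return \"slow-echo\"\n\n    async def agenerate(self, input: Any, num_generations: int = 1, **kwargs: Any) -> Any:\n        await asyncio.sleep(0.1)  # simulate network latency\n        return {\n            \"generations\": [input[-1][\"content\"]] * num_generations,\n            \"statistics\": {\"input_tokens\": [0] * num_generations, \"output_tokens\": [0] * num_generations},\n        }\n\n\nllm = SlowEchoAsyncLLM()\nllm.load()\n# The three inputs are generated concurrently, so this takes roughly 0.1s rather than 0.3s.\nprint(llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": f\"Hello {i}!\"}] for i in range(3)]))\n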
"},{"location":"api/models/llm/#distilabel.models.llms.base.AsyncLLM.generate","title":"generate(inputs, num_generations=1, **kwargs)
","text":"Method to generate a list of responses asynchronously, returning the output synchronously awaiting for the response of each input sent to agenerate
.
Parameters:
Name Type Description Defaultinputs
List[FormattedInput]
the list of inputs to generate responses for.
requirednum_generations
int
the number of generations to generate per input.
1
**kwargs
Any
the additional kwargs to be used for the generation.
{}
Returns:
Type DescriptionList[GenerateOutput]
A list containing the generations for each input.
Source code insrc/distilabel/models/llms/base.py
def generate(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n) -> List[\"GenerateOutput\"]:\n \"\"\"Method to generate a list of responses asynchronously, returning the output\n synchronously awaiting for the response of each input sent to `agenerate`.\n\n Args:\n inputs: the list of inputs to generate responses for.\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n\n Returns:\n A list containing the generations for each input.\n \"\"\"\n return self.event_loop.run_until_complete(\n self._agenerate(inputs=inputs, num_generations=num_generations, **kwargs)\n )\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.AsyncLLM.__del__","title":"__del__()
","text":"Closes the event loop when the object is deleted.
Source code insrc/distilabel/models/llms/base.py
def __del__(self) -> None:\n \"\"\"Closes the event loop when the object is deleted.\"\"\"\n if sys.meta_path is None:\n return\n\n if self._new_event_loop:\n if self._event_loop.is_running():\n self._event_loop.stop()\n self._event_loop.close()\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.merge_responses","title":"merge_responses(responses, n=1)
","text":"Helper function to group the responses from LLM.agenerate
method according to the number of generations requested.
Parameters:
Name Type Description Defaultresponses
List[GenerateOutput]
the responses from the LLM.agenerate
method.
n
int
number of responses to group together. Defaults to 1.
1
Returns:
Type DescriptionList[GenerateOutput]
List of merged responses, where each merged response contains n generations
List[GenerateOutput]
and their corresponding statistics.
Source code insrc/distilabel/models/llms/base.py
def merge_responses(\n responses: List[\"GenerateOutput\"], n: int = 1\n) -> List[\"GenerateOutput\"]:\n \"\"\"Helper function to group the responses from `LLM.agenerate` method according\n to the number of generations requested.\n\n Args:\n responses: the responses from the `LLM.agenerate` method.\n n: number of responses to group together. Defaults to 1.\n\n Returns:\n List of merged responses, where each merged response contains n generations\n and their corresponding statistics.\n \"\"\"\n if not responses:\n return []\n\n def chunks(lst, n):\n \"\"\"Yield successive n-sized chunks from lst.\"\"\"\n for i in range(0, len(lst), n):\n yield list(islice(lst, i, i + n))\n\n extra_keys = [\n key for key in responses[0].keys() if key not in (\"generations\", \"statistics\")\n ]\n\n result = []\n for group in chunks(responses, n):\n merged = {\n \"generations\": [],\n \"statistics\": {\"input_tokens\": [], \"output_tokens\": []},\n }\n for response in group:\n merged[\"generations\"].append(response[\"generations\"][0])\n # Merge statistics\n for key in response[\"statistics\"]:\n if key not in merged[\"statistics\"]:\n merged[\"statistics\"][key] = []\n merged[\"statistics\"][key].append(response[\"statistics\"][key][0])\n # Merge extra keys returned by the `LLM`\n for extra_key in extra_keys:\n if extra_key not in merged:\n merged[extra_key] = []\n merged[extra_key].append(response[extra_key][0])\n result.append(merged)\n return result\n
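A small worked sketch of the grouping behaviour (toy responses: two single-generation outputs merged into one output with `n=2`):
from distilabel.models.llms.base import merge_responses\n\nresponses = [\n    {\"generations\": [\"foo\"], \"statistics\": {\"input_tokens\": [10], \"output_tokens\": [3]}},\n    {\"generations\": [\"bar\"], \"statistics\": {\"input_tokens\": [10], \"output_tokens\": [4]}},\n]\n\nmerge_responses(responses, n=2)\n# [{'generations': ['foo', 'bar'], 'statistics': {'input_tokens': [10, 10], 'output_tokens': [3, 4]}}]\n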
"},{"location":"api/models/llm/llm_gallery/","title":"LLM Gallery","text":"This section contains the existing LLM
subclasses implemented in distilabel
.
llms
","text":""},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.AnthropicLLM","title":"AnthropicLLM
","text":" Bases: AsyncLLM
Anthropic LLM implementation running the Async API client.
Attributes:
Name Type Descriptionmodel
str
the name of the model to use for the LLM e.g. \"claude-3-opus-20240229\", \"claude-3-sonnet-20240229\", etc. Available models can be checked here: Anthropic: Models overview.
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the Anthropic API. If not provided, it will be read from ANTHROPIC_API_KEY
environment variable.
base_url
Optional[RuntimeParameter[str]]
the base URL to use for the Anthropic API. Defaults to None
which means that https://api.anthropic.com
will be used internally.
timeout
RuntimeParameter[float]
the maximum time in seconds to wait for a response. Defaults to 600.0
.
max_retries
RuntimeParameter[int]
The maximum number of times to retry the request before failing. Defaults to 6
.
http_client
Optional[AsyncClient]
if provided, an alternative HTTP client to use for calling Anthropic API. Defaults to None
.
structured_output
Optional[RuntimeParameter[InstructorStructuredOutputType]]
a dictionary containing the structured output configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
_api_key_env_var
str
the name of the environment variable to use for the API key. It is meant to be used internally.
_aclient
Optional[AsyncAnthropic]
the AsyncAnthropic
client to use for the Anthropic API. It is meant to be used internally. Set in the load
method.
api_key
: the API key to authenticate the requests to the Anthropic API. If not provided, it will be read from ANTHROPIC_API_KEY
environment variable.base_url
: the base URL to use for the Anthropic API. Defaults to \"https://api.anthropic.com\"
.timeout
: the maximum time in seconds to wait for a response. Defaults to 600.0
.max_retries
: the maximum number of times to retry the request before failing. Defaults to 6
.Examples:
Generate text:
from distilabel.models.llms import AnthropicLLM\n\nllm = AnthropicLLM(model=\"claude-3-opus-20240229\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pydantic import BaseModel\nfrom distilabel.models.llms import AnthropicLLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = AnthropicLLM(\n model=\"claude-3-opus-20240229\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
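Set generation kwargs at runtime through a Pipeline (a hedged sketch: the pipeline layout, step wiring and chosen kwargs are illustrative):
from distilabel.models.llms import AnthropicLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(name=\"anthropic-demo\") as pipeline:\n    loader = LoadDataFromDicts(data=[{\"instruction\": \"Hello world!\"}])\n    text_generation = TextGeneration(llm=AnthropicLLM(model=\"claude-3-opus-20240229\"))\n    loader >> text_generation\n\ndistiset = pipeline.run(\n    parameters={\n        text_generation.name: {\n            \"llm\": {\"generation_kwargs\": {\"max_tokens\": 512, \"temperature\": 0.7}}\n        }\n    }\n)\n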
Source code in src/distilabel/models/llms/anthropic.py
class AnthropicLLM(AsyncLLM):\n \"\"\"Anthropic LLM implementation running the Async API client.\n\n Attributes:\n model: the name of the model to use for the LLM e.g. \"claude-3-opus-20240229\",\n \"claude-3-sonnet-20240229\", etc. Available models can be checked here:\n [Anthropic: Models overview](https://docs.anthropic.com/claude/docs/models-overview).\n api_key: the API key to authenticate the requests to the Anthropic API. If not provided,\n it will be read from `ANTHROPIC_API_KEY` environment variable.\n base_url: the base URL to use for the Anthropic API. Defaults to `None` which means\n that `https://api.anthropic.com` will be used internally.\n timeout: the maximum time in seconds to wait for a response. Defaults to `600.0`.\n max_retries: The maximum number of times to retry the request before failing. Defaults\n to `6`.\n http_client: if provided, an alternative HTTP client to use for calling Anthropic\n API. Defaults to `None`.\n structured_output: a dictionary containing the structured output configuration configuration\n using `instructor`. You can take a look at the dictionary structure in\n `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.\n _api_key_env_var: the name of the environment variable to use for the API key. It\n is meant to be used internally.\n _aclient: the `AsyncAnthropic` client to use for the Anthropic API. It is meant\n to be used internally. Set in the `load` method.\n\n Runtime parameters:\n - `api_key`: the API key to authenticate the requests to the Anthropic API. If not\n provided, it will be read from `ANTHROPIC_API_KEY` environment variable.\n - `base_url`: the base URL to use for the Anthropic API. Defaults to `\"https://api.anthropic.com\"`.\n - `timeout`: the maximum time in seconds to wait for a response. 
Defaults to `600.0`.\n - `max_retries`: the maximum number of times to retry the request before failing.\n Defaults to `6`.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import AnthropicLLM\n\n llm = AnthropicLLM(model=\"claude-3-opus-20240229\", api_key=\"api.key\")\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Generate structured data:\n\n ```python\n from pydantic import BaseModel\n from distilabel.models.llms import AnthropicLLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = AnthropicLLM(\n model=\"claude-3-opus-20240229\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n \"\"\"\n\n model: str\n base_url: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(\n \"ANTHROPIC_BASE_URL\", \"https://api.anthropic.com\"\n ),\n description=\"The base URL to use for the Anthropic API.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_ANTHROPIC_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the Anthropic API.\",\n )\n timeout: RuntimeParameter[float] = Field(\n default=600.0,\n description=\"The maximum time in seconds to wait for a response from the API.\",\n )\n max_retries: RuntimeParameter[int] = Field(\n default=6,\n description=\"The maximum number of times to retry the request to the API before\"\n \" failing.\",\n )\n http_client: Optional[AsyncClient] = Field(default=None, exclude=True)\n structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (\n Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n )\n\n _num_generations_param_supported = False\n\n _api_key_env_var: str = PrivateAttr(default=_ANTHROPIC_API_KEY_ENV_VAR_NAME)\n _aclient: Optional[\"AsyncAnthropic\"] = PrivateAttr(...)\n\n def _check_model_exists(self) -> None:\n \"\"\"Checks if the specified model exists in the available models.\"\"\"\n from anthropic import AsyncAnthropic\n\n annotation = get_type_hints(AsyncAnthropic().messages.create).get(\"model\", None)\n models = [\n value\n for type_ in get_args(annotation)\n if get_origin(type_) is Literal\n for value in get_args(type_)\n ]\n\n if self.model not in models:\n raise ValueError(\n f\"Model {self.model} does not exist among available models. \"\n f\"The available models are {', '.join(models)}\"\n )\n\n def load(self) -> None:\n \"\"\"Loads the `AsyncAnthropic` client to use the Anthropic async API.\"\"\"\n super().load()\n\n try:\n from anthropic import AsyncAnthropic\n except ImportError as ie:\n raise ImportError(\n \"Anthropic Python client is not installed. 
Please install it using\"\n \" `pip install anthropic`.\"\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n self._check_model_exists()\n\n self._aclient = AsyncAnthropic(\n api_key=self.api_key.get_secret_value(),\n base_url=self.base_url,\n timeout=self.timeout,\n http_client=self.http_client,\n max_retries=self.max_retries,\n )\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"anthropic\",\n )\n self._aclient = result.get(\"client\")\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n @validate_call\n async def agenerate( # type: ignore\n self,\n input: FormattedInput,\n max_tokens: int = 128,\n stop_sequences: Union[List[str], None] = None,\n temperature: float = 1.0,\n top_p: Union[float, None] = None,\n top_k: Union[int, None] = None,\n ) -> GenerateOutput:\n \"\"\"Generates a response asynchronously, using the [Anthropic Async API definition](https://github.com/anthropics/anthropic-sdk-python).\n\n Args:\n input: a single input in chat format to generate responses for.\n max_tokens: the maximum number of new tokens that the model will generate. Defaults to `128`.\n stop_sequences: custom text sequences that will cause the model to stop generating. Defaults to `NOT_GIVEN`.\n temperature: the temperature to use for the generation. Set only if top_p is None. Defaults to `1.0`.\n top_p: the top-p value to use for the generation. Defaults to `NOT_GIVEN`.\n top_k: the top-k value to use for the generation. 
Defaults to `NOT_GIVEN`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n from anthropic._types import NOT_GIVEN\n\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output,\n client=self._aclient,\n framework=\"anthropic\",\n )\n self._aclient = result.get(\"client\")\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"messages\": input, # type: ignore\n \"model\": self.model,\n \"system\": (\n input.pop(0)[\"content\"]\n if input and input[0][\"role\"] == \"system\"\n else NOT_GIVEN\n ),\n \"max_tokens\": max_tokens,\n \"stream\": False,\n \"stop_sequences\": NOT_GIVEN if stop_sequences is None else stop_sequences,\n \"temperature\": temperature,\n \"top_p\": NOT_GIVEN if top_p is None else top_p,\n \"top_k\": NOT_GIVEN if top_k is None else top_k,\n }\n\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output)\n\n completion: Union[\"Message\", \"BaseModel\"] = await self._aclient.messages.create(\n **kwargs\n ) # type: ignore\n if structured_output:\n # raw_response = completion._raw_response\n return prepare_output(\n [completion.model_dump_json()],\n **self._get_llm_statistics(completion._raw_response),\n )\n\n if (content := completion.content[0].text) is None:\n self._logger.warning(\n f\"Received no response using Anthropic client (model: '{self.model}').\"\n f\" Finish reason was: {completion.stop_reason}\"\n )\n return prepare_output([content], **self._get_llm_statistics(completion))\n\n @staticmethod\n def _get_llm_statistics(completion: \"Message\") -> \"LLMStatistics\":\n return {\n \"input_tokens\": [completion.usage.input_tokens],\n \"output_tokens\": [completion.usage.output_tokens],\n }\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.AnthropicLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.AnthropicLLM._check_model_exists","title":"_check_model_exists()
","text":"Checks if the specified model exists in the available models.
Source code insrc/distilabel/models/llms/anthropic.py
def _check_model_exists(self) -> None:\n \"\"\"Checks if the specified model exists in the available models.\"\"\"\n from anthropic import AsyncAnthropic\n\n annotation = get_type_hints(AsyncAnthropic().messages.create).get(\"model\", None)\n models = [\n value\n for type_ in get_args(annotation)\n if get_origin(type_) is Literal\n for value in get_args(type_)\n ]\n\n if self.model not in models:\n raise ValueError(\n f\"Model {self.model} does not exist among available models. \"\n f\"The available models are {', '.join(models)}\"\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.AnthropicLLM.load","title":"load()
","text":"Loads the AsyncAnthropic
client to use the Anthropic async API.
src/distilabel/models/llms/anthropic.py
def load(self) -> None:\n \"\"\"Loads the `AsyncAnthropic` client to use the Anthropic async API.\"\"\"\n super().load()\n\n try:\n from anthropic import AsyncAnthropic\n except ImportError as ie:\n raise ImportError(\n \"Anthropic Python client is not installed. Please install it using\"\n \" `pip install anthropic`.\"\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n self._check_model_exists()\n\n self._aclient = AsyncAnthropic(\n api_key=self.api_key.get_secret_value(),\n base_url=self.base_url,\n timeout=self.timeout,\n http_client=self.http_client,\n max_retries=self.max_retries,\n )\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"anthropic\",\n )\n self._aclient = result.get(\"client\")\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.AnthropicLLM.agenerate","title":"agenerate(input, max_tokens=128, stop_sequences=None, temperature=1.0, top_p=None, top_k=None)
async
","text":"Generates a response asynchronously, using the Anthropic Async API definition.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requiredmax_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
stop_sequences
Union[List[str], None]
custom text sequences that will cause the model to stop generating. Defaults to NOT_GIVEN
.
None
temperature
float
the temperature to use for the generation. Set only if top_p is None. Defaults to 1.0
.
1.0
top_p
Union[float, None]
the top-p value to use for the generation. Defaults to NOT_GIVEN
.
None
top_k
Union[int, None]
the top-k value to use for the generation. Defaults to NOT_GIVEN
.
None
Returns:
Type DescriptionGenerateOutput
A list of lists of strings containing the generated responses for each input.
Source code insrc/distilabel/models/llms/anthropic.py
@validate_call\nasync def agenerate( # type: ignore\n self,\n input: FormattedInput,\n max_tokens: int = 128,\n stop_sequences: Union[List[str], None] = None,\n temperature: float = 1.0,\n top_p: Union[float, None] = None,\n top_k: Union[int, None] = None,\n) -> GenerateOutput:\n \"\"\"Generates a response asynchronously, using the [Anthropic Async API definition](https://github.com/anthropics/anthropic-sdk-python).\n\n Args:\n input: a single input in chat format to generate responses for.\n max_tokens: the maximum number of new tokens that the model will generate. Defaults to `128`.\n stop_sequences: custom text sequences that will cause the model to stop generating. Defaults to `NOT_GIVEN`.\n temperature: the temperature to use for the generation. Set only if top_p is None. Defaults to `1.0`.\n top_p: the top-p value to use for the generation. Defaults to `NOT_GIVEN`.\n top_k: the top-k value to use for the generation. Defaults to `NOT_GIVEN`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n from anthropic._types import NOT_GIVEN\n\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output,\n client=self._aclient,\n framework=\"anthropic\",\n )\n self._aclient = result.get(\"client\")\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"messages\": input, # type: ignore\n \"model\": self.model,\n \"system\": (\n input.pop(0)[\"content\"]\n if input and input[0][\"role\"] == \"system\"\n else NOT_GIVEN\n ),\n \"max_tokens\": max_tokens,\n \"stream\": False,\n \"stop_sequences\": NOT_GIVEN if stop_sequences is None else stop_sequences,\n \"temperature\": temperature,\n \"top_p\": NOT_GIVEN if top_p is None else top_p,\n \"top_k\": NOT_GIVEN if top_k is None else top_k,\n }\n\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output)\n\n completion: Union[\"Message\", \"BaseModel\"] = await self._aclient.messages.create(\n **kwargs\n ) # type: ignore\n if structured_output:\n # raw_response = completion._raw_response\n return prepare_output(\n [completion.model_dump_json()],\n **self._get_llm_statistics(completion._raw_response),\n )\n\n if (content := completion.content[0].text) is None:\n self._logger.warning(\n f\"Received no response using Anthropic client (model: '{self.model}').\"\n f\" Finish reason was: {completion.stop_reason}\"\n )\n return prepare_output([content], **self._get_llm_statistics(completion))\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.AnyscaleLLM","title":"AnyscaleLLM
","text":" Bases: OpenAILLM
Anyscale LLM implementation running the async API client of OpenAI.
Attributes:
Name Type Descriptionmodel
str
the model name to use for the LLM, e.g., google/gemma-7b-it
. See the supported models under the \"Text Generation -> Supported Models\" section here.
base_url
Optional[RuntimeParameter[str]]
the base URL to use for the Anyscale API requests. Defaults to None
, which means that the value set for the environment variable ANYSCALE_BASE_URL
will be used, or \"https://api.endpoints.anyscale.com/v1\" if not set.
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the Anyscale API. Defaults to None
which means that the value set for the environment variable ANYSCALE_API_KEY
will be used, or None
if not set.
_api_key_env_var
str
the name of the environment variable to use for the API key. It is meant to be used internally.
Examples:
Generate text:
from distilabel.models.llms import AnyscaleLLM\n\nllm = AnyscaleLLM(model=\"google/gemma-7b-it\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Source code in src/distilabel/models/llms/anyscale.py
class AnyscaleLLM(OpenAILLM):\n \"\"\"Anyscale LLM implementation running the async API client of OpenAI.\n\n Attributes:\n model: the model name to use for the LLM, e.g., `google/gemma-7b-it`. See the\n supported models under the \"Text Generation -> Supported Models\" section\n [here](https://docs.endpoints.anyscale.com/).\n base_url: the base URL to use for the Anyscale API requests. Defaults to `None`, which\n means that the value set for the environment variable `ANYSCALE_BASE_URL` will be used, or\n \"https://api.endpoints.anyscale.com/v1\" if not set.\n api_key: the API key to authenticate the requests to the Anyscale API. Defaults to `None` which\n means that the value set for the environment variable `ANYSCALE_API_KEY` will be used, or\n `None` if not set.\n _api_key_env_var: the name of the environment variable to use for the API key.\n It is meant to be used internally.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import AnyscaleLLM\n\n llm = AnyscaleLLM(model=\"google/gemma-7b-it\", api_key=\"api.key\")\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n \"\"\"\n\n base_url: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(\n \"ANYSCALE_BASE_URL\", \"https://api.endpoints.anyscale.com/v1\"\n ),\n description=\"The base URL to use for the Anyscale API requests.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_ANYSCALE_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the Anyscale API.\",\n )\n\n _api_key_env_var: str = PrivateAttr(_ANYSCALE_API_KEY_ENV_VAR_NAME)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.AzureOpenAILLM","title":"AzureOpenAILLM
","text":" Bases: OpenAILLM
Azure OpenAI LLM implementation running the async API client.
Attributes:
Name Type Descriptionmodel
str
the model name to use for the LLM i.e. the name of the Azure deployment.
base_url
Optional[RuntimeParameter[str]]
the base URL to use for the Azure OpenAI API can be set with AZURE_OPENAI_ENDPOINT
. Defaults to None
which means that the value set for the environment variable AZURE_OPENAI_ENDPOINT
will be used, or None
if not set.
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the Azure OpenAI API. Defaults to None
which means that the value set for the environment variable AZURE_OPENAI_API_KEY
will be used, or None
if not set.
api_version
Optional[RuntimeParameter[str]]
the API version to use for the Azure OpenAI API. Defaults to None
which means that the value set for the environment variable OPENAI_API_VERSION
will be used, or None
if not set.
:material-microsoft-azure:
Examples:
Generate text:
from distilabel.models.llms import AzureOpenAILLM\n\nllm = AzureOpenAILLM(model=\"gpt-4-turbo\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate text from a custom endpoint following the OpenAI API:
from distilabel.models.llms import AzureOpenAILLM\n\nllm = AzureOpenAILLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n base_url=r\"http://localhost:8080/v1\"\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pydantic import BaseModel\nfrom distilabel.models.llms import AzureOpenAILLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = AzureOpenAILLM(\n model=\"gpt-4-turbo\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
Source code in src/distilabel/models/llms/azure.py
class AzureOpenAILLM(OpenAILLM):\n \"\"\"Azure OpenAI LLM implementation running the async API client.\n\n Attributes:\n model: the model name to use for the LLM i.e. the name of the Azure deployment.\n base_url: the base URL to use for the Azure OpenAI API can be set with `AZURE_OPENAI_ENDPOINT`.\n Defaults to `None` which means that the value set for the environment variable\n `AZURE_OPENAI_ENDPOINT` will be used, or `None` if not set.\n api_key: the API key to authenticate the requests to the Azure OpenAI API. Defaults to `None`\n which means that the value set for the environment variable `AZURE_OPENAI_API_KEY` will be\n used, or `None` if not set.\n api_version: the API version to use for the Azure OpenAI API. Defaults to `None` which means\n that the value set for the environment variable `OPENAI_API_VERSION` will be used, or\n `None` if not set.\n\n Icon:\n `:material-microsoft-azure:`\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import AzureOpenAILLM\n\n llm = AzureOpenAILLM(model=\"gpt-4-turbo\", api_key=\"api.key\")\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Generate text from a custom endpoint following the OpenAI API:\n\n ```python\n from distilabel.models.llms import AzureOpenAILLM\n\n llm = AzureOpenAILLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n base_url=r\"http://localhost:8080/v1\"\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Generate structured data:\n\n ```python\n from pydantic import BaseModel\n from distilabel.models.llms import AzureOpenAILLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = AzureOpenAILLM(\n model=\"gpt-4-turbo\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n \"\"\"\n\n base_url: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(_AZURE_OPENAI_ENDPOINT_ENV_VAR_NAME),\n description=\"The base URL to use for the Azure OpenAI API requests i.e. the Azure OpenAI endpoint.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_AZURE_OPENAI_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the Azure OpenAI API.\",\n )\n\n api_version: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(\"OPENAI_API_VERSION\"),\n description=\"The API version to use for the Azure OpenAI API.\",\n )\n\n _base_url_env_var: str = PrivateAttr(_AZURE_OPENAI_ENDPOINT_ENV_VAR_NAME)\n _api_key_env_var: str = PrivateAttr(_AZURE_OPENAI_API_KEY_ENV_VAR_NAME)\n _aclient: Optional[\"AsyncAzureOpenAI\"] = PrivateAttr(...) # type: ignore\n\n @override\n def load(self) -> None:\n \"\"\"Loads the `AsyncAzureOpenAI` client to benefit from async requests.\"\"\"\n # This is a workaround to avoid the `OpenAILLM` calling the _prepare_structured_output\n # in the load method before we have the proper client.\n with patch(\n \"distilabel.models.openai.OpenAILLM._prepare_structured_output\", lambda x: x\n ):\n super().load()\n\n try:\n from openai import AsyncAzureOpenAI\n except ImportError as ie:\n raise ImportError(\n \"OpenAI Python client is not installed. 
Please install it using\"\n \" `pip install openai`.\"\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n # TODO: May be worth adding the AD auth too? Also the `organization`?\n self._aclient = AsyncAzureOpenAI( # type: ignore\n azure_endpoint=self.base_url, # type: ignore\n azure_deployment=self.model,\n api_version=self.api_version,\n api_key=self.api_key.get_secret_value(),\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n if self.structured_output:\n self._prepare_structured_output(self.structured_output)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.AzureOpenAILLM.load","title":"load()
","text":"Loads the AsyncAzureOpenAI
client to benefit from async requests.
src/distilabel/models/llms/azure.py
@override\ndef load(self) -> None:\n \"\"\"Loads the `AsyncAzureOpenAI` client to benefit from async requests.\"\"\"\n # This is a workaround to avoid the `OpenAILLM` calling the _prepare_structured_output\n # in the load method before we have the proper client.\n with patch(\n \"distilabel.models.openai.OpenAILLM._prepare_structured_output\", lambda x: x\n ):\n super().load()\n\n try:\n from openai import AsyncAzureOpenAI\n except ImportError as ie:\n raise ImportError(\n \"OpenAI Python client is not installed. Please install it using\"\n \" `pip install openai`.\"\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n # TODO: May be worth adding the AD auth too? Also the `organization`?\n self._aclient = AsyncAzureOpenAI( # type: ignore\n azure_endpoint=self.base_url, # type: ignore\n azure_deployment=self.model,\n api_version=self.api_version,\n api_key=self.api_key.get_secret_value(),\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n if self.structured_output:\n self._prepare_structured_output(self.structured_output)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CohereLLM","title":"CohereLLM
","text":" Bases: AsyncLLM
Cohere API implementation using the async client for concurrent text generation.
Attributes:
Name Type Descriptionmodel
str
the name of the model from the Cohere API to use for the generation.
base_url
Optional[RuntimeParameter[str]]
the base URL to use for the Cohere API requests. Defaults to \"https://api.cohere.ai/v1\"
.
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the Cohere API. Defaults to the value of the COHERE_API_KEY
environment variable.
timeout
RuntimeParameter[int]
the maximum time in seconds to wait for a response from the API. Defaults to 120
.
client_name
RuntimeParameter[str]
the name of the client to use for the API requests. Defaults to \"distilabel\"
.
structured_output
Optional[RuntimeParameter[InstructorStructuredOutputType]]
a dictionary containing the structured output configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
_ChatMessage
Type[ChatMessage]
the ChatMessage
class from the cohere
package.
_aclient
AsyncClient
the AsyncClient
client from the cohere
package.
base_url
: the base URL to use for the Cohere API requests. Defaults to \"https://api.cohere.ai/v1\"
.api_key
: the API key to authenticate the requests to the Cohere API. Defaults to the value of the COHERE_API_KEY
environment variable.timeout
: the maximum time in seconds to wait for a response from the API. Defaults to 120
.client_name
: the name of the client to use for the API requests. Defaults to \"distilabel\"
.Examples:
Generate text:
from distilabel.models.llms import CohereLLM\n\nllm = CohereLLM(model=\"CohereForAI/c4ai-command-r-plus\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pydantic import BaseModel\nfrom distilabel.models.llms import CohereLLM\n\nclass User(BaseModel):\n    name: str\n    last_name: str\n    id: int\n\nllm = CohereLLM(\n    model=\"CohereForAI/c4ai-command-r-plus\",\n    api_key=\"api.key\",\n    structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
Source code in src/distilabel/models/llms/cohere.py
class CohereLLM(AsyncLLM):\n \"\"\"Cohere API implementation using the async client for concurrent text generation.\n\n Attributes:\n model: the name of the model from the Cohere API to use for the generation.\n base_url: the base URL to use for the Cohere API requests. Defaults to\n `\"https://api.cohere.ai/v1\"`.\n api_key: the API key to authenticate the requests to the Cohere API. Defaults to\n the value of the `COHERE_API_KEY` environment variable.\n timeout: the maximum time in seconds to wait for a response from the API. Defaults\n to `120`.\n client_name: the name of the client to use for the API requests. Defaults to\n `\"distilabel\"`.\n structured_output: a dictionary containing the structured output configuration configuration\n using `instructor`. You can take a look at the dictionary structure in\n `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.\n _ChatMessage: the `ChatMessage` class from the `cohere` package.\n _aclient: the `AsyncClient` client from the `cohere` package.\n\n Runtime parameters:\n - `base_url`: the base URL to use for the Cohere API requests. Defaults to\n `\"https://api.cohere.ai/v1\"`.\n - `api_key`: the API key to authenticate the requests to the Cohere API. Defaults\n to the value of the `COHERE_API_KEY` environment variable.\n - `timeout`: the maximum time in seconds to wait for a response from the API. Defaults\n to `120`.\n - `client_name`: the name of the client to use for the API requests. Defaults to\n `\"distilabel\"`.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import CohereLLM\n\n llm = CohereLLM(model=\"CohereForAI/c4ai-command-r-plus\")\n\n llm.load()\n\n # Call the model\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n\n Generate structured data:\n\n ```python\n from pydantic import BaseModel\n from distilabel.models.llms import CohereLLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = CohereLLM(\n model=\"CohereForAI/c4ai-command-r-plus\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n \"\"\"\n\n model: str\n base_url: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(\n \"COHERE_BASE_URL\", \"https://api.cohere.ai/v1\"\n ),\n description=\"The base URL to use for the Cohere API requests.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_COHERE_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the Cohere API.\",\n )\n timeout: RuntimeParameter[int] = Field(\n default=120,\n description=\"The maximum time in seconds to wait for a response from the API.\",\n )\n client_name: RuntimeParameter[str] = Field(\n default=\"distilabel\",\n description=\"The name of the client to use for the API requests.\",\n )\n structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (\n Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n )\n\n _num_generations_param_supported = False\n\n _ChatMessage: Type[\"ChatMessage\"] = PrivateAttr(...)\n _aclient: \"AsyncClient\" = PrivateAttr(...)\n _tokenizer: \"Tokenizer\" = PrivateAttr(...)\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return 
self.model\n\n def load(self) -> None:\n \"\"\"Loads the `AsyncClient` client from the `cohere` package.\"\"\"\n\n super().load()\n\n try:\n from cohere import AsyncClient, ChatMessage\n except ImportError as ie:\n raise ImportError(\n \"The `cohere` package is required to use the `CohereLLM` class.\"\n ) from ie\n\n self._ChatMessage = ChatMessage\n\n self._aclient = AsyncClient(\n api_key=self.api_key.get_secret_value(), # type: ignore\n client_name=self.client_name,\n base_url=self.base_url,\n timeout=self.timeout,\n )\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"cohere\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n\n from cohere.manually_maintained.tokenizers import get_hf_tokenizer\n\n self._tokenizer: \"Tokenizer\" = get_hf_tokenizer(self._aclient, self.model)\n\n def _format_chat_to_cohere(\n self, input: \"FormattedInput\"\n ) -> Tuple[Union[str, None], List[\"ChatMessage\"], str]:\n \"\"\"Formats the chat input to the Cohere Chat API conversational format.\n\n Args:\n input: The chat input to format.\n\n Returns:\n A tuple containing the system, chat history, and message.\n \"\"\"\n system = None\n message = None\n chat_history = []\n for item in input:\n role = item[\"role\"]\n content = item[\"content\"]\n if role == \"system\":\n system = content\n elif role == \"user\":\n message = content\n elif role == \"assistant\":\n if message is None:\n raise ValueError(\n \"An assistant message but be preceded by a user message.\"\n )\n chat_history.append(self._ChatMessage(role=\"USER\", message=message)) # type: ignore\n chat_history.append(self._ChatMessage(role=\"CHATBOT\", message=content)) # type: ignore\n message = None\n\n if message is None:\n raise ValueError(\"The chat input must end with a user message.\")\n\n return system, chat_history, message\n\n @validate_call\n async def agenerate( # type: ignore\n self,\n input: FormattedInput,\n temperature: Optional[float] = None,\n max_tokens: Optional[int] = None,\n k: Optional[int] = None,\n p: Optional[float] = None,\n seed: Optional[float] = None,\n stop_sequences: Optional[Sequence[str]] = None,\n frequency_penalty: Optional[float] = None,\n presence_penalty: Optional[float] = None,\n raw_prompting: Optional[bool] = None,\n ) -> GenerateOutput:\n \"\"\"Generates a response from the LLM given an input.\n\n Args:\n input: a single input in chat format to generate responses for.\n temperature: the temperature to use for the generation. Defaults to `None`.\n max_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `None`.\n k: the number of highest probability vocabulary tokens to keep for the generation.\n Defaults to `None`.\n p: the nucleus sampling probability to use for the generation. Defaults to\n `None`.\n seed: the seed to use for the generation. Defaults to `None`.\n stop_sequences: a list of sequences to use as stopping criteria for the generation.\n Defaults to `None`.\n frequency_penalty: the frequency penalty to use for the generation. Defaults\n to `None`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `None`.\n raw_prompting: a flag to use raw prompting for the generation. 
Defaults to\n `None`.\n\n Returns:\n The generated response from the Cohere API model.\n \"\"\"\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output, # type: ignore\n client=self._aclient,\n framework=\"cohere\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n system, chat_history, message = self._format_chat_to_cohere(input)\n\n kwargs = {\n \"message\": message,\n \"model\": self.model,\n \"preamble\": system,\n \"chat_history\": chat_history,\n \"temperature\": temperature,\n \"max_tokens\": max_tokens,\n \"k\": k,\n \"p\": p,\n \"seed\": seed,\n \"stop_sequences\": stop_sequences,\n \"frequency_penalty\": frequency_penalty,\n \"presence_penalty\": presence_penalty,\n \"raw_prompting\": raw_prompting,\n }\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output) # type: ignore\n\n response: Union[\"Message\", \"BaseModel\"] = await self._aclient.chat(**kwargs) # type: ignore\n\n if structured_output:\n return prepare_output(\n [response.model_dump_json()],\n **self._get_llm_statistics(\n input, orjson.dumps(response.model_dump_json()).decode(\"utf-8\")\n ), # type: ignore\n )\n\n if (text := response.text) == \"\":\n self._logger.warning( # type: ignore\n f\"Received no response using Cohere client (model: '{self.model}').\"\n f\" Finish reason was: {response.finish_reason}\"\n )\n return prepare_output(\n [None],\n **self._get_llm_statistics(input, \"\"),\n )\n\n return prepare_output(\n [text],\n **self._get_llm_statistics(input, text),\n )\n\n def _get_llm_statistics(\n self, input: FormattedInput, output: str\n ) -> \"LLMStatistics\":\n return {\n \"input_tokens\": [compute_tokens(input, self._tokenizer.encode)],\n \"output_tokens\": [compute_tokens(output, self._tokenizer.encode)],\n }\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CohereLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CohereLLM.load","title":"load()
","text":"Loads the AsyncClient
client from the cohere
package.
src/distilabel/models/llms/cohere.py
def load(self) -> None:\n \"\"\"Loads the `AsyncClient` client from the `cohere` package.\"\"\"\n\n super().load()\n\n try:\n from cohere import AsyncClient, ChatMessage\n except ImportError as ie:\n raise ImportError(\n \"The `cohere` package is required to use the `CohereLLM` class.\"\n ) from ie\n\n self._ChatMessage = ChatMessage\n\n self._aclient = AsyncClient(\n api_key=self.api_key.get_secret_value(), # type: ignore\n client_name=self.client_name,\n base_url=self.base_url,\n timeout=self.timeout,\n )\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"cohere\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n\n from cohere.manually_maintained.tokenizers import get_hf_tokenizer\n\n self._tokenizer: \"Tokenizer\" = get_hf_tokenizer(self._aclient, self.model)\n
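A minimal usage sketch of the `load()` step above, not taken from the docs: it assumes the model name "command-r" and that your Cohere key is available in a `COHERE_API_KEY` environment variable.
```python
import os

from distilabel.models.llms import CohereLLM

llm = CohereLLM(
    model="command-r",                      # assumed model name; any Cohere chat model works
    api_key=os.environ["COHERE_API_KEY"],   # assumed env var holding your Cohere key
)

llm.load()  # creates the async Cohere client and the HF tokenizer used for token counting
```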
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CohereLLM._format_chat_to_cohere","title":"_format_chat_to_cohere(input)
","text":"Formats the chat input to the Cohere Chat API conversational format.
Parameters:
Name Type Description Defaultinput
FormattedInput
The chat input to format.
requiredReturns:
Type DescriptionTuple[Union[str, None], List[ChatMessage], str]
A tuple containing the system, chat history, and message.
Source code insrc/distilabel/models/llms/cohere.py
def _format_chat_to_cohere(\n self, input: \"FormattedInput\"\n) -> Tuple[Union[str, None], List[\"ChatMessage\"], str]:\n \"\"\"Formats the chat input to the Cohere Chat API conversational format.\n\n Args:\n input: The chat input to format.\n\n Returns:\n A tuple containing the system, chat history, and message.\n \"\"\"\n system = None\n message = None\n chat_history = []\n for item in input:\n role = item[\"role\"]\n content = item[\"content\"]\n if role == \"system\":\n system = content\n elif role == \"user\":\n message = content\n elif role == \"assistant\":\n if message is None:\n raise ValueError(\n \"An assistant message but be preceded by a user message.\"\n )\n chat_history.append(self._ChatMessage(role=\"USER\", message=message)) # type: ignore\n chat_history.append(self._ChatMessage(role=\"CHATBOT\", message=content)) # type: ignore\n message = None\n\n if message is None:\n raise ValueError(\"The chat input must end with a user message.\")\n\n return system, chat_history, message\n
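An illustrative sketch of the mapping this helper performs; the values are shown as plain comments for readability, while the real method returns `cohere.ChatMessage` objects.
```python
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello, how can I help?"},
    {"role": "user", "content": "Tell me a joke."},
]

# system       -> "You are a helpful assistant."
# chat_history -> [USER: "Hi!", CHATBOT: "Hello, how can I help?"]
# message      -> "Tell me a joke."   (the trailing user turn is sent as `message`)
```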
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CohereLLM.agenerate","title":"agenerate(input, temperature=None, max_tokens=None, k=None, p=None, seed=None, stop_sequences=None, frequency_penalty=None, presence_penalty=None, raw_prompting=None)
async
","text":"Generates a response from the LLM given an input.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requiredtemperature
Optional[float]
the temperature to use for the generation. Defaults to None
.
None
max_tokens
Optional[int]
the maximum number of new tokens that the model will generate. Defaults to None
.
None
k
Optional[int]
the number of highest probability vocabulary tokens to keep for the generation. Defaults to None
.
None
p
Optional[float]
the nucleus sampling probability to use for the generation. Defaults to None
.
None
seed
Optional[float]
the seed to use for the generation. Defaults to None
.
None
stop_sequences
Optional[Sequence[str]]
a list of sequences to use as stopping criteria for the generation. Defaults to None
.
None
frequency_penalty
Optional[float]
the frequency penalty to use for the generation. Defaults to None
.
None
presence_penalty
Optional[float]
the presence penalty to use for the generation. Defaults to None
.
None
raw_prompting
Optional[bool]
a flag to use raw prompting for the generation. Defaults to None
.
None
Returns:
Type DescriptionGenerateOutput
The generated response from the Cohere API model.
Source code insrc/distilabel/models/llms/cohere.py
@validate_call\nasync def agenerate( # type: ignore\n self,\n input: FormattedInput,\n temperature: Optional[float] = None,\n max_tokens: Optional[int] = None,\n k: Optional[int] = None,\n p: Optional[float] = None,\n seed: Optional[float] = None,\n stop_sequences: Optional[Sequence[str]] = None,\n frequency_penalty: Optional[float] = None,\n presence_penalty: Optional[float] = None,\n raw_prompting: Optional[bool] = None,\n) -> GenerateOutput:\n \"\"\"Generates a response from the LLM given an input.\n\n Args:\n input: a single input in chat format to generate responses for.\n temperature: the temperature to use for the generation. Defaults to `None`.\n max_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `None`.\n k: the number of highest probability vocabulary tokens to keep for the generation.\n Defaults to `None`.\n p: the nucleus sampling probability to use for the generation. Defaults to\n `None`.\n seed: the seed to use for the generation. Defaults to `None`.\n stop_sequences: a list of sequences to use as stopping criteria for the generation.\n Defaults to `None`.\n frequency_penalty: the frequency penalty to use for the generation. Defaults\n to `None`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `None`.\n raw_prompting: a flag to use raw prompting for the generation. Defaults to\n `None`.\n\n Returns:\n The generated response from the Cohere API model.\n \"\"\"\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output, # type: ignore\n client=self._aclient,\n framework=\"cohere\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n system, chat_history, message = self._format_chat_to_cohere(input)\n\n kwargs = {\n \"message\": message,\n \"model\": self.model,\n \"preamble\": system,\n \"chat_history\": chat_history,\n \"temperature\": temperature,\n \"max_tokens\": max_tokens,\n \"k\": k,\n \"p\": p,\n \"seed\": seed,\n \"stop_sequences\": stop_sequences,\n \"frequency_penalty\": frequency_penalty,\n \"presence_penalty\": presence_penalty,\n \"raw_prompting\": raw_prompting,\n }\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output) # type: ignore\n\n response: Union[\"Message\", \"BaseModel\"] = await self._aclient.chat(**kwargs) # type: ignore\n\n if structured_output:\n return prepare_output(\n [response.model_dump_json()],\n **self._get_llm_statistics(\n input, orjson.dumps(response.model_dump_json()).decode(\"utf-8\")\n ), # type: ignore\n )\n\n if (text := response.text) == \"\":\n self._logger.warning( # type: ignore\n f\"Received no response using Cohere client (model: '{self.model}').\"\n f\" Finish reason was: {response.finish_reason}\"\n )\n return prepare_output(\n [None],\n **self._get_llm_statistics(input, \"\"),\n )\n\n return prepare_output(\n [text],\n **self._get_llm_statistics(input, text),\n )\n
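A rough sketch of calling `agenerate` directly outside a pipeline; it assumes `llm` is a loaded `CohereLLM` instance and that the returned `GenerateOutput` is dict-like with a "generations" key.
```python
import asyncio

# `agenerate` is a coroutine, so it has to be awaited or driven with `asyncio.run`.
output = asyncio.run(
    llm.agenerate(
        input=[{"role": "user", "content": "Hello world!"}],
        temperature=0.7,
        max_tokens=128,
    )
)
print(output["generations"])  # assumption: GenerateOutput exposes the generated texts here
```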
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.GroqLLM","title":"GroqLLM
","text":" Bases: AsyncLLM
Groq API implementation using the async client for concurrent text generation.
Attributes:
Name Type Descriptionmodel
str
the name of the model from the Groq API to use for the generation.
base_url
Optional[RuntimeParameter[str]]
the base URL to use for the Groq API requests. Defaults to \"https://api.groq.com\"
.
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the Groq API. Defaults to the value of the GROQ_API_KEY
environment variable.
max_retries
RuntimeParameter[int]
the maximum number of times to retry the request to the API before failing. Defaults to 2
.
timeout
RuntimeParameter[int]
the maximum time in seconds to wait for a response from the API. Defaults to 120
.
structured_output
Optional[RuntimeParameter[InstructorStructuredOutputType]]
a dictionary containing the structured output configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
_api_key_env_var
str
the name of the environment variable to use for the API key.
_aclient
Optional[AsyncGroq]
the AsyncGroq
client from the groq
package.
base_url
: the base URL to use for the Groq API requests. Defaults to \"https://api.groq.com\"
.api_key
: the API key to authenticate the requests to the Groq API. Defaults to the value of the GROQ_API_KEY
environment variable.max_retries
: the maximum number of times to retry the request to the API before failing. Defaults to 2
.timeout
: the maximum time in seconds to wait for a response from the API. Defaults to 120
.Examples:
Generate text:
from distilabel.models.llms import GroqLLM\n\nllm = GroqLLM(model=\"llama3-70b-8192\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n\nGenerate structured data:\n\n```python\nfrom pydantic import BaseModel\nfrom distilabel.models.llms import GroqLLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = GroqLLM(\n model=\"llama3-70b-8192\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
Source code in src/distilabel/models/llms/groq.py
class GroqLLM(AsyncLLM):\n \"\"\"Groq API implementation using the async client for concurrent text generation.\n\n Attributes:\n model: the name of the model from the Groq API to use for the generation.\n base_url: the base URL to use for the Groq API requests. Defaults to\n `\"https://api.groq.com\"`.\n api_key: the API key to authenticate the requests to the Groq API. Defaults to\n the value of the `GROQ_API_KEY` environment variable.\n max_retries: the maximum number of times to retry the request to the API before\n failing. Defaults to `2`.\n timeout: the maximum time in seconds to wait for a response from the API. Defaults\n to `120`.\n structured_output: a dictionary containing the structured output configuration configuration\n using `instructor`. You can take a look at the dictionary structure in\n `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.\n _api_key_env_var: the name of the environment variable to use for the API key.\n _aclient: the `AsyncGroq` client from the `groq` package.\n\n Runtime parameters:\n - `base_url`: the base URL to use for the Groq API requests. Defaults to\n `\"https://api.groq.com\"`.\n - `api_key`: the API key to authenticate the requests to the Groq API. Defaults to\n the value of the `GROQ_API_KEY` environment variable.\n - `max_retries`: the maximum number of times to retry the request to the API before\n failing. Defaults to `2`.\n - `timeout`: the maximum time in seconds to wait for a response from the API. Defaults\n to `120`.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import GroqLLM\n\n llm = GroqLLM(model=\"llama3-70b-8192\")\n\n llm.load()\n\n # Call the model\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n\n Generate structured data:\n\n ```python\n from pydantic import BaseModel\n from distilabel.models.llms import GroqLLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = GroqLLM(\n model=\"llama3-70b-8192\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n \"\"\"\n\n model: str\n\n base_url: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(\n _GROQ_API_BASE_URL_ENV_VAR_NAME, \"https://api.groq.com\"\n ),\n description=\"The base URL to use for the Groq API requests.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_GROQ_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the Groq API.\",\n )\n max_retries: RuntimeParameter[int] = Field(\n default=2,\n description=\"The maximum number of times to retry the request to the API before\"\n \" failing.\",\n )\n timeout: RuntimeParameter[int] = Field(\n default=120,\n description=\"The maximum time in seconds to wait for a response from the API.\",\n )\n structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (\n Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n )\n\n _num_generations_param_supported = False\n\n _api_key_env_var: str = PrivateAttr(_GROQ_API_KEY_ENV_VAR_NAME)\n _aclient: Optional[\"AsyncGroq\"] = PrivateAttr(...)\n\n def load(self) -> None:\n \"\"\"Loads the `AsyncGroq` client to benefit from async requests.\"\"\"\n super().load()\n\n try:\n from groq import 
AsyncGroq\n except ImportError as ie:\n raise ImportError(\n \"Groq Python client is not installed. Please install it using\"\n ' `pip install groq` or from the extras as `pip install \"distilabel[groq]\"`.'\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n self._aclient = AsyncGroq(\n base_url=self.base_url,\n api_key=self.api_key.get_secret_value(),\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"groq\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n @validate_call\n async def agenerate( # type: ignore\n self,\n input: FormattedInput,\n seed: Optional[int] = None,\n max_new_tokens: int = 128,\n temperature: float = 1.0,\n top_p: float = 1.0,\n stop: Optional[str] = None,\n ) -> \"GenerateOutput\":\n \"\"\"Generates `num_generations` responses for the given input using the Groq async\n client.\n\n Args:\n input: a single input in chat format to generate responses for.\n seed: the seed to use for the generation. Defaults to `None`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n stop: the stop sequence to use for the generation. 
Defaults to `None`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n\n References:\n - https://console.groq.com/docs/text-chat\n \"\"\"\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output,\n client=self._aclient,\n framework=\"groq\",\n )\n self._aclient = result.get(\"client\")\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"messages\": input, # type: ignore\n \"model\": self.model,\n \"seed\": seed,\n \"temperature\": temperature,\n \"max_tokens\": max_new_tokens,\n \"top_p\": top_p,\n \"stream\": False,\n \"stop\": stop,\n }\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output)\n\n completion = await self._aclient.chat.completions.create(**kwargs) # type: ignore\n if structured_output:\n return prepare_output(\n [completion.model_dump_json()],\n **self._get_llm_statistics(completion._raw_response),\n )\n\n generations = []\n for choice in completion.choices:\n if (content := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"Received no response using the Groq client (model: '{self.model}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(content)\n return prepare_output(generations, **self._get_llm_statistics(completion))\n\n @staticmethod\n def _get_llm_statistics(completion: \"ChatCompletion\") -> \"LLMStatistics\":\n return {\n \"input_tokens\": [completion.usage.prompt_tokens if completion else 0],\n \"output_tokens\": [completion.usage.completion_tokens if completion else 0],\n }\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.GroqLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.GroqLLM.load","title":"load()
","text":"Loads the AsyncGroq
client to benefit from async requests.
src/distilabel/models/llms/groq.py
def load(self) -> None:\n \"\"\"Loads the `AsyncGroq` client to benefit from async requests.\"\"\"\n super().load()\n\n try:\n from groq import AsyncGroq\n except ImportError as ie:\n raise ImportError(\n \"Groq Python client is not installed. Please install it using\"\n ' `pip install groq` or from the extras as `pip install \"distilabel[groq]\"`.'\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n self._aclient = AsyncGroq(\n base_url=self.base_url,\n api_key=self.api_key.get_secret_value(),\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"groq\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.GroqLLM.agenerate","title":"agenerate(input, seed=None, max_new_tokens=128, temperature=1.0, top_p=1.0, stop=None)
async
","text":"Generates num_generations
responses for the given input using the Groq async client.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requiredseed
Optional[int]
the seed to use for the generation. Defaults to None
.
None
max_new_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
temperature
float
the temperature to use for the generation. Defaults to 1.0
.
1.0
top_p
float
the top-p value to use for the generation. Defaults to 1.0
.
1.0
stop
Optional[str]
the stop sequence to use for the generation. Defaults to None
.
None
Returns:
Type DescriptionGenerateOutput
A list of lists of strings containing the generated responses for each input.
Referencessrc/distilabel/models/llms/groq.py
@validate_call\nasync def agenerate( # type: ignore\n self,\n input: FormattedInput,\n seed: Optional[int] = None,\n max_new_tokens: int = 128,\n temperature: float = 1.0,\n top_p: float = 1.0,\n stop: Optional[str] = None,\n) -> \"GenerateOutput\":\n \"\"\"Generates `num_generations` responses for the given input using the Groq async\n client.\n\n Args:\n input: a single input in chat format to generate responses for.\n seed: the seed to use for the generation. Defaults to `None`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n stop: the stop sequence to use for the generation. Defaults to `None`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n\n References:\n - https://console.groq.com/docs/text-chat\n \"\"\"\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output,\n client=self._aclient,\n framework=\"groq\",\n )\n self._aclient = result.get(\"client\")\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"messages\": input, # type: ignore\n \"model\": self.model,\n \"seed\": seed,\n \"temperature\": temperature,\n \"max_tokens\": max_new_tokens,\n \"top_p\": top_p,\n \"stream\": False,\n \"stop\": stop,\n }\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output)\n\n completion = await self._aclient.chat.completions.create(**kwargs) # type: ignore\n if structured_output:\n return prepare_output(\n [completion.model_dump_json()],\n **self._get_llm_statistics(completion._raw_response),\n )\n\n generations = []\n for choice in completion.choices:\n if (content := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"Received no response using the Groq client (model: '{self.model}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(content)\n return prepare_output(generations, **self._get_llm_statistics(completion))\n
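A usage sketch for the arguments documented above (seed, max_new_tokens, temperature, top_p, stop); it assumes `GROQ_API_KEY` is set in the environment and that keyword arguments passed to `generate_outputs` are forwarded to `agenerate`.
```python
from distilabel.models.llms import GroqLLM

llm = GroqLLM(model="llama3-70b-8192")  # assumes GROQ_API_KEY is set
llm.load()

output = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Hello world!"}]],
    max_new_tokens=64,     # assumed to be forwarded to `agenerate`
    temperature=0.2,
)
```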
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.InferenceEndpointsLLM","title":"InferenceEndpointsLLM
","text":" Bases: AsyncLLM
, MagpieChatTemplateMixin
InferenceEndpoints LLM implementation running the async API client.
This LLM will internally use huggingface_hub.AsyncInferenceClient
.
Attributes:
Name Type Descriptionmodel_id
Optional[str]
the model ID to use for the LLM as available in the Hugging Face Hub, which will be used to resolve the base URL for the serverless Inference Endpoints API requests. Defaults to None
.
endpoint_name
Optional[RuntimeParameter[str]]
the name of the Inference Endpoint to use for the LLM. Defaults to None
.
endpoint_namespace
Optional[RuntimeParameter[str]]
the namespace of the Inference Endpoint to use for the LLM. Defaults to None
.
base_url
Optional[RuntimeParameter[str]]
the base URL to use for the Inference Endpoints API requests.
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the Inference Endpoints API.
tokenizer_id
Optional[str]
the tokenizer ID to use for the LLM as available in the Hugging Face Hub. Defaults to None
, but defining one is recommended to properly format the prompt.
model_display_name
Optional[str]
the model display name to use for the LLM. Defaults to None
.
use_magpie_template
bool
a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template
Union[MagpieAvailablePreQueryTemplates, str, None]
the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
structured_output
Optional[RuntimeParameter[StructuredOutputType]]
a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
:hugging:
Examples:
Free serverless Inference API (set the input_batch_size of the Task that uses this LLM to avoid \"Model is overloaded\" errors):
from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Dedicated Inference Endpoints:
from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n endpoint_name=\"<ENDPOINT_NAME>\",\n api_key=\"<HF_API_KEY>\",\n endpoint_namespace=\"<USER|ORG>\",\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Dedicated Inference Endpoints or TGI:
from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n api_key=\"<HF_API_KEY>\",\n base_url=\"<BASE_URL>\",\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pydantic import BaseModel\nfrom distilabel.models.llms import InferenceEndpointsLLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n api_key=\"api.key\",\n structured_output={\"format\": \"json\", \"schema\": User.model_json_schema()}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the Tour De France\"}]])\n
Source code in src/distilabel/models/llms/huggingface/inference_endpoints.py
class InferenceEndpointsLLM(AsyncLLM, MagpieChatTemplateMixin):\n \"\"\"InferenceEndpoints LLM implementation running the async API client.\n\n This LLM will internally use `huggingface_hub.AsyncInferenceClient`.\n\n Attributes:\n model_id: the model ID to use for the LLM as available in the Hugging Face Hub, which\n will be used to resolve the base URL for the serverless Inference Endpoints API requests.\n Defaults to `None`.\n endpoint_name: the name of the Inference Endpoint to use for the LLM. Defaults to `None`.\n endpoint_namespace: the namespace of the Inference Endpoint to use for the LLM. Defaults to `None`.\n base_url: the base URL to use for the Inference Endpoints API requests.\n api_key: the API key to authenticate the requests to the Inference Endpoints API.\n tokenizer_id: the tokenizer ID to use for the LLM as available in the Hugging Face Hub.\n Defaults to `None`, but defining one is recommended to properly format the prompt.\n model_display_name: the model display name to use for the LLM. Defaults to `None`.\n use_magpie_template: a flag used to enable/disable applying the Magpie pre-query\n template. Defaults to `False`.\n magpie_pre_query_template: the pre-query template to be applied to the prompt or\n sent to the LLM to generate an instruction or a follow up user message. Valid\n values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults\n to `None`.\n structured_output: a dictionary containing the structured output configuration or\n if more fine-grained control is needed, an instance of `OutlinesStructuredOutput`.\n Defaults to None.\n\n Icon:\n `:hugging:`\n\n Examples:\n Free serverless Inference API, set the input_batch_size of the Task that uses this to avoid Model is overloaded:\n\n ```python\n from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\n llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Dedicated Inference Endpoints:\n\n ```python\n from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\n llm = InferenceEndpointsLLM(\n endpoint_name=\"<ENDPOINT_NAME>\",\n api_key=\"<HF_API_KEY>\",\n endpoint_namespace=\"<USER|ORG>\",\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Dedicated Inference Endpoints or TGI:\n\n ```python\n from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\n llm = InferenceEndpointsLLM(\n api_key=\"<HF_API_KEY>\",\n base_url=\"<BASE_URL>\",\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Generate structured data:\n\n ```python\n from pydantic import BaseModel\n from distilabel.models.llms import InferenceEndpointsLLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n api_key=\"api.key\",\n structured_output={\"format\": \"json\", \"schema\": User.model_json_schema()}\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the Tour De France\"}]])\n ```\n \"\"\"\n\n model_id: Optional[str] = None\n\n endpoint_name: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The name of the Inference Endpoint to use 
for the LLM.\",\n )\n endpoint_namespace: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The namespace of the Inference Endpoint to use for the LLM.\",\n )\n base_url: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The base URL to use for the Inference Endpoints API requests.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(HF_TOKEN_ENV_VAR),\n description=\"The API key to authenticate the requests to the Inference Endpoints API.\",\n )\n\n tokenizer_id: Optional[str] = None\n model_display_name: Optional[str] = None\n\n structured_output: Optional[RuntimeParameter[StructuredOutputType]] = Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n\n _num_generations_param_supported = False\n\n _model_name: Optional[str] = PrivateAttr(default=None)\n _tokenizer: Optional[\"PreTrainedTokenizer\"] = PrivateAttr(default=None)\n _api_key_env_var: str = PrivateAttr(HF_TOKEN_ENV_VAR)\n _aclient: Optional[\"AsyncInferenceClient\"] = PrivateAttr(...)\n\n @model_validator(mode=\"after\") # type: ignore\n def only_one_of_model_id_endpoint_name_or_base_url_provided(\n self,\n ) -> \"InferenceEndpointsLLM\":\n \"\"\"Validates that only one of `model_id` or `endpoint_name` is provided; and if `base_url` is also\n provided, a warning will be shown informing the user that the provided `base_url` will be ignored in\n favour of the dynamically calculated one..\"\"\"\n\n if self.base_url and (self.model_id or self.endpoint_name):\n self._logger.warning( # type: ignore\n f\"Since the `base_url={self.base_url}` is available and either one of `model_id`\"\n \" or `endpoint_name` is also provided, the `base_url` will either be ignored\"\n \" or overwritten with the one generated from either of those args, for serverless\"\n \" or dedicated inference endpoints, respectively.\"\n )\n\n if self.use_magpie_template and self.tokenizer_id is None:\n raise ValueError(\n \"`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,\"\n \" set a `tokenizer_id` and try again.\"\n )\n\n if (\n self.model_id\n and self.tokenizer_id is None\n and self.structured_output is not None\n ):\n self.tokenizer_id = self.model_id\n\n if self.base_url and not (self.model_id or self.endpoint_name):\n return self\n\n if self.model_id and not self.endpoint_name:\n return self\n\n if self.endpoint_name and not self.model_id:\n return self\n\n raise ValidationError(\n f\"Only one of `model_id` or `endpoint_name` must be provided. If `base_url` is\"\n f\" provided too, it will be overwritten instead. Found `model_id`={self.model_id},\"\n f\" `endpoint_name`={self.endpoint_name}, and `base_url`={self.base_url}.\"\n )\n\n def load(self) -> None: # noqa: C901\n \"\"\"Loads the `AsyncInferenceClient` client to connect to the Hugging Face Inference\n Endpoint.\n\n Raises:\n ImportError: if the `huggingface-hub` Python client is not installed.\n ValueError: if the model is not currently deployed or is not running the TGI framework.\n ImportError: if the `transformers` Python client is not installed.\n \"\"\"\n super().load()\n\n try:\n from huggingface_hub import (\n AsyncInferenceClient,\n InferenceClient,\n get_inference_endpoint,\n )\n except ImportError as ie:\n raise ImportError(\n \"Hugging Face Hub Python client is not installed. 
Please install it using\"\n \" `pip install huggingface-hub`.\"\n ) from ie\n\n if self.api_key is None:\n self.api_key = SecretStr(get_hf_token(self.__class__.__name__, \"api_key\"))\n\n if self.model_id is not None:\n client = InferenceClient(\n model=self.model_id, token=self.api_key.get_secret_value()\n )\n status = client.get_model_status()\n\n if (\n status.state not in {\"Loadable\", \"Loaded\"}\n and status.framework != \"text-generation-inference\"\n ):\n raise ValueError(\n f\"Model {self.model_id} is not currently deployed or is not running the TGI framework\"\n )\n\n self.base_url = client._resolve_url(\n model=self.model_id, task=\"text-generation\"\n )\n\n if self.endpoint_name is not None:\n client = get_inference_endpoint(\n name=self.endpoint_name,\n namespace=self.endpoint_namespace,\n token=self.api_key.get_secret_value(),\n )\n if client.status in [\"paused\", \"scaledToZero\"]:\n client.resume().wait(timeout=300)\n elif client.status == \"initializing\":\n client.wait(timeout=300)\n\n self.base_url = client.url\n self._model_name = client.repository\n\n self._aclient = AsyncInferenceClient(\n base_url=self.base_url,\n token=self.api_key.get_secret_value(),\n )\n\n if self.tokenizer_id:\n try:\n from transformers import AutoTokenizer\n except ImportError as ie:\n raise ImportError(\n \"Transformers Python client is not installed. Please install it using\"\n \" `pip install transformers`.\"\n ) from ie\n\n self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_id)\n\n @property\n @override\n def model_name(self) -> Union[str, None]: # type: ignore\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return (\n self.model_display_name\n or self._model_name\n or self.model_id\n or self.endpoint_name\n or self.base_url\n )\n\n def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n prompt: str = (\n self._tokenizer.apply_chat_template( # type: ignore\n conversation=input, # type: ignore\n tokenize=False,\n add_generation_prompt=True,\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n\n def _get_structured_output(\n self, input: FormattedInput\n ) -> Tuple[\"StandardInput\", Union[Dict[str, Any], None]]:\n \"\"\"Gets the structured output (if any) for the given input.\n\n Args:\n input: a single input in chat format to generate responses for.\n\n Returns:\n The input and the structured output that will be passed as `grammar` to the\n inference endpoint or `None` if not required.\n \"\"\"\n structured_output = None\n\n # Specific structured output per input\n if isinstance(input, tuple):\n input, structured_output = input\n structured_output = {\n \"type\": structured_output[\"format\"], # type: ignore\n \"value\": structured_output[\"schema\"], # type: ignore\n }\n\n # Same structured output for all the inputs\n if structured_output is None and self.structured_output is not None:\n try:\n structured_output = {\n \"type\": self.structured_output[\"format\"], # type: ignore\n \"value\": self.structured_output[\"schema\"], # type: ignore\n }\n except KeyError as e:\n raise ValueError(\n \"To use the structured output you have to inform the `format` and `schema` in \"\n \"the `structured_output` attribute.\"\n ) from e\n\n if structured_output:\n if isinstance(structured_output[\"value\"], ModelMetaclass):\n 
structured_output[\"value\"] = structured_output[\n \"value\"\n ].model_json_schema()\n\n return input, structured_output\n\n async def _generate_with_text_generation(\n self,\n input: FormattedInput,\n max_new_tokens: int = 128,\n repetition_penalty: Optional[float] = None,\n frequency_penalty: Optional[float] = None,\n temperature: float = 1.0,\n do_sample: bool = False,\n top_n_tokens: Optional[int] = None,\n top_p: Optional[float] = None,\n top_k: Optional[int] = None,\n typical_p: Optional[float] = None,\n stop_sequences: Union[List[str], None] = None,\n return_full_text: bool = False,\n seed: Optional[int] = None,\n watermark: bool = False,\n ) -> GenerateOutput:\n input, structured_output = self._get_structured_output(input)\n prompt = self.prepare_input(input)\n generation: Union[\"TextGenerationOutput\", None] = None\n try:\n generation = await self._aclient.text_generation( # type: ignore\n prompt=prompt,\n max_new_tokens=max_new_tokens,\n do_sample=do_sample,\n typical_p=typical_p,\n repetition_penalty=repetition_penalty,\n frequency_penalty=frequency_penalty,\n temperature=temperature,\n top_n_tokens=top_n_tokens,\n top_p=top_p,\n top_k=top_k,\n stop_sequences=stop_sequences,\n return_full_text=return_full_text,\n # NOTE: here to ensure that the cache is not used and a different response is\n # generated every time\n seed=seed or random.randint(0, sys.maxsize),\n watermark=watermark,\n grammar=structured_output, # type: ignore\n details=True,\n )\n except Exception as e:\n self._logger.warning( # type: ignore\n f\"\u26a0\ufe0f Received no response using Inference Client (model: '{self.model_name}').\"\n f\" Finish reason was: {e}\"\n )\n return prepare_output(\n generations=[generation.generated_text] if generation else [None],\n input_tokens=[compute_tokens(prompt, self._tokenizer.encode)], # type: ignore\n output_tokens=[\n generation.details.generated_tokens\n if generation and generation.details\n else 0\n ],\n logprobs=self._get_logprobs_from_text_generation(generation)\n if generation\n else None, # type: ignore\n )\n\n def _get_logprobs_from_text_generation(\n self, generation: \"TextGenerationOutput\"\n ) -> Union[List[List[List[\"Logprob\"]]], None]:\n if generation.details is None or generation.details.top_tokens is None:\n return None\n\n return [\n [\n [\n {\"token\": top_logprob[\"text\"], \"logprob\": top_logprob[\"logprob\"]}\n for top_logprob in token_logprobs\n ]\n for token_logprobs in generation.details.top_tokens\n ]\n ]\n\n async def _generate_with_chat_completion(\n self,\n input: \"StandardInput\",\n max_new_tokens: int = 128,\n frequency_penalty: Optional[float] = None,\n logit_bias: Optional[List[float]] = None,\n logprobs: bool = False,\n presence_penalty: Optional[float] = None,\n seed: Optional[int] = None,\n stop_sequences: Optional[List[str]] = None,\n temperature: float = 1.0,\n tool_choice: Optional[Union[Dict[str, str], Literal[\"auto\"]]] = None,\n tool_prompt: Optional[str] = None,\n tools: Optional[List[Dict[str, Any]]] = None,\n top_logprobs: Optional[PositiveInt] = None,\n top_p: Optional[float] = None,\n ) -> GenerateOutput:\n message = None\n completion: Union[\"ChatCompletionOutput\", None] = None\n output_logprobs = None\n try:\n completion = await self._aclient.chat_completion( # type: ignore\n messages=input, # type: ignore\n max_tokens=max_new_tokens,\n frequency_penalty=frequency_penalty,\n logit_bias=logit_bias,\n logprobs=logprobs,\n presence_penalty=presence_penalty,\n # NOTE: here to ensure that the cache is not used and a 
different response is\n # generated every time\n seed=seed or random.randint(0, sys.maxsize),\n stop=stop_sequences,\n temperature=temperature,\n tool_choice=tool_choice, # type: ignore\n tool_prompt=tool_prompt,\n tools=tools, # type: ignore\n top_logprobs=top_logprobs,\n top_p=top_p,\n )\n choice = completion.choices[0] # type: ignore\n if (message := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"\u26a0\ufe0f Received no response using Inference Client (model: '{self.model_name}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n if choice_logprobs := self._get_logprobs_from_choice(choice):\n output_logprobs = [choice_logprobs]\n except Exception as e:\n self._logger.warning( # type: ignore\n f\"\u26a0\ufe0f Received no response using Inference Client (model: '{self.model_name}').\"\n f\" Finish reason was: {e}\"\n )\n return prepare_output(\n generations=[message],\n input_tokens=[completion.usage.prompt_tokens] if completion else None,\n output_tokens=[completion.usage.completion_tokens] if completion else None,\n logprobs=output_logprobs,\n )\n\n def _get_logprobs_from_choice(\n self, choice: \"ChatCompletionOutputComplete\"\n ) -> Union[List[List[\"Logprob\"]], None]:\n if choice.logprobs is None:\n return None\n\n return [\n [\n {\"token\": top_logprob.token, \"logprob\": top_logprob.logprob}\n for top_logprob in token_logprobs.top_logprobs\n ]\n for token_logprobs in choice.logprobs.content\n ]\n\n def _check_stop_sequences(\n self,\n stop_sequences: Optional[Union[str, List[str]]] = None,\n ) -> Union[List[str], None]:\n \"\"\"Checks that no more than 4 stop sequences are provided.\n\n Args:\n stop_sequences: the stop sequences to be checked.\n\n Returns:\n The stop sequences.\n \"\"\"\n if stop_sequences is not None:\n if isinstance(stop_sequences, str):\n stop_sequences = [stop_sequences]\n if len(stop_sequences) > 4:\n warnings.warn(\n \"Only up to 4 stop sequences are allowed, so keeping the first 4 items only.\",\n UserWarning,\n stacklevel=2,\n )\n stop_sequences = stop_sequences[:4]\n return stop_sequences\n\n @validate_call\n async def agenerate( # type: ignore\n self,\n input: FormattedInput,\n max_new_tokens: int = 128,\n frequency_penalty: Optional[Annotated[float, Field(ge=-2.0, le=2.0)]] = None,\n logit_bias: Optional[List[float]] = None,\n logprobs: bool = False,\n presence_penalty: Optional[Annotated[float, Field(ge=-2.0, le=2.0)]] = None,\n seed: Optional[int] = None,\n stop_sequences: Optional[List[str]] = None,\n temperature: float = 1.0,\n tool_choice: Optional[Union[Dict[str, str], Literal[\"auto\"]]] = None,\n tool_prompt: Optional[str] = None,\n tools: Optional[List[Dict[str, Any]]] = None,\n top_logprobs: Optional[PositiveInt] = None,\n top_n_tokens: Optional[PositiveInt] = None,\n top_p: Optional[float] = None,\n do_sample: bool = False,\n repetition_penalty: Optional[float] = None,\n return_full_text: bool = False,\n top_k: Optional[int] = None,\n typical_p: Optional[float] = None,\n watermark: bool = False,\n ) -> GenerateOutput:\n \"\"\"Generates completions for the given input using the async client. 
This method\n uses two methods of the `huggingface_hub.AsyncClient`: `chat_completion` and `text_generation`.\n `chat_completion` method will be used only if no `tokenizer_id` has been specified.\n Some arguments of this function are specific to the `text_generation` method, while\n some others are specific to the `chat_completion` method.\n\n Args:\n input: a single input in chat format to generate responses for.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n frequency_penalty: a value between `-2.0` and `2.0`. Positive values penalize\n new tokens based on their existing frequency in the text so far, decreasing\n model's likelihood to repeat the same line verbatim. Defauls to `None`.\n logit_bias: modify the likelihood of specified tokens appearing in the completion.\n This argument is exclusive to the `chat_completion` method and will be used\n only if `tokenizer_id` is `None`.\n Defaults to `None`.\n logprobs: whether to return the log probabilities or not. This argument is exclusive\n to the `chat_completion` method and will be used only if `tokenizer_id`\n is `None`. Defaults to `False`.\n presence_penalty: a value between `-2.0` and `2.0`. Positive values penalize\n new tokens based on whether they appear in the text so far, increasing the\n model likelihood to talk about new topics. This argument is exclusive to\n the `chat_completion` method and will be used only if `tokenizer_id` is\n `None`. Defauls to `None`.\n seed: the seed to use for the generation. Defaults to `None`.\n stop_sequences: either a single string or a list of strings containing the sequences\n to stop the generation at. Defaults to `None`, but will be set to the\n `tokenizer.eos_token` if available.\n temperature: the temperature to use for the generation. Defaults to `1.0`.\n tool_choice: the name of the tool the model should call. It can be a dictionary\n like `{\"function_name\": \"my_tool\"}` or \"auto\". If not provided, then the\n model won't use any tool. This argument is exclusive to the `chat_completion`\n method and will be used only if `tokenizer_id` is `None`. Defaults to `None`.\n tool_prompt: A prompt to be appended before the tools. This argument is exclusive\n to the `chat_completion` method and will be used only if `tokenizer_id`\n is `None`. Defauls to `None`.\n tools: a list of tools definitions that the LLM can use.\n This argument is exclusive to the `chat_completion` method and will be used\n only if `tokenizer_id` is `None`. Defaults to `None`.\n top_logprobs: the number of top log probabilities to return per output token\n generated. This argument is exclusive to the `chat_completion` method and\n will be used only if `tokenizer_id` is `None`. Defaults to `None`.\n top_n_tokens: the number of top log probabilities to return per output token\n generated. This argument is exclusive of the `text_generation` method and\n will be only used if `tokenizer_id` is not `None`. Defaults to `None`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n do_sample: whether to use sampling for the generation. This argument is exclusive\n of the `text_generation` method and will be only used if `tokenizer_id` is not\n `None`. Defaults to `False`.\n repetition_penalty: the repetition penalty to use for the generation. This argument\n is exclusive of the `text_generation` method and will be only used if `tokenizer_id`\n is not `None`. 
Defaults to `None`.\n return_full_text: whether to return the full text of the completion or just\n the generated text. Defaults to `False`, meaning that only the generated\n text will be returned. This argument is exclusive of the `text_generation`\n method and will be only used if `tokenizer_id` is not `None`.\n top_k: the top-k value to use for the generation. This argument is exclusive\n of the `text_generation` method and will be only used if `tokenizer_id`\n is not `None`. Defaults to `0.8`, since neither `0.0` nor `1.0` are valid\n values in TGI.\n typical_p: the typical-p value to use for the generation. This argument is exclusive\n of the `text_generation` method and will be only used if `tokenizer_id`\n is not `None`. Defaults to `None`.\n watermark: whether to add the watermark to the generated text. This argument\n is exclusive of the `text_generation` method and will be only used if `tokenizer_id`\n is not `None`. Defaults to `None`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n stop_sequences = self._check_stop_sequences(stop_sequences)\n\n if self.tokenizer_id is None:\n return await self._generate_with_chat_completion(\n input=input, # type: ignore\n max_new_tokens=max_new_tokens,\n frequency_penalty=frequency_penalty,\n logit_bias=logit_bias,\n logprobs=logprobs,\n presence_penalty=presence_penalty,\n seed=seed,\n stop_sequences=stop_sequences,\n temperature=temperature,\n tool_choice=tool_choice,\n tool_prompt=tool_prompt,\n tools=tools,\n top_logprobs=top_logprobs,\n top_p=top_p,\n )\n\n return await self._generate_with_text_generation(\n input=input,\n max_new_tokens=max_new_tokens,\n do_sample=do_sample,\n typical_p=typical_p,\n repetition_penalty=repetition_penalty,\n frequency_penalty=frequency_penalty,\n temperature=temperature,\n top_n_tokens=top_n_tokens,\n top_p=top_p,\n top_k=top_k,\n stop_sequences=stop_sequences,\n return_full_text=return_full_text,\n seed=seed,\n watermark=watermark,\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.InferenceEndpointsLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.InferenceEndpointsLLM.only_one_of_model_id_endpoint_name_or_base_url_provided","title":"only_one_of_model_id_endpoint_name_or_base_url_provided()
","text":"Validates that only one of model_id
or endpoint_name
is provided; and if base_url
is also provided, a warning will be shown informing the user that the provided base_url
will be ignored in favour of the dynamically calculated one.
src/distilabel/models/llms/huggingface/inference_endpoints.py
@model_validator(mode=\"after\") # type: ignore\ndef only_one_of_model_id_endpoint_name_or_base_url_provided(\n self,\n) -> \"InferenceEndpointsLLM\":\n \"\"\"Validates that only one of `model_id` or `endpoint_name` is provided; and if `base_url` is also\n provided, a warning will be shown informing the user that the provided `base_url` will be ignored in\n favour of the dynamically calculated one..\"\"\"\n\n if self.base_url and (self.model_id or self.endpoint_name):\n self._logger.warning( # type: ignore\n f\"Since the `base_url={self.base_url}` is available and either one of `model_id`\"\n \" or `endpoint_name` is also provided, the `base_url` will either be ignored\"\n \" or overwritten with the one generated from either of those args, for serverless\"\n \" or dedicated inference endpoints, respectively.\"\n )\n\n if self.use_magpie_template and self.tokenizer_id is None:\n raise ValueError(\n \"`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,\"\n \" set a `tokenizer_id` and try again.\"\n )\n\n if (\n self.model_id\n and self.tokenizer_id is None\n and self.structured_output is not None\n ):\n self.tokenizer_id = self.model_id\n\n if self.base_url and not (self.model_id or self.endpoint_name):\n return self\n\n if self.model_id and not self.endpoint_name:\n return self\n\n if self.endpoint_name and not self.model_id:\n return self\n\n raise ValidationError(\n f\"Only one of `model_id` or `endpoint_name` must be provided. If `base_url` is\"\n f\" provided too, it will be overwritten instead. Found `model_id`={self.model_id},\"\n f\" `endpoint_name`={self.endpoint_name}, and `base_url`={self.base_url}.\"\n )\n
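A sketch of the configurations the validator above accepts and rejects; the endpoint name used is hypothetical.
```python
from distilabel.models.llms.huggingface import InferenceEndpointsLLM

# Accepted: exactly one of `model_id` or `endpoint_name`.
llm = InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-70B-Instruct")

# Rejected: providing both triggers the validation error raised above.
try:
    InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        endpoint_name="my-endpoint",  # hypothetical endpoint name
    )
except Exception as exc:  # pydantic surfaces the error raised by the validator
    print(exc)
```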
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.InferenceEndpointsLLM.load","title":"load()
","text":"Loads the AsyncInferenceClient
client to connect to the Hugging Face Inference Endpoint.
Raises:
Type DescriptionImportError
if the huggingface-hub
Python client is not installed.
ValueError
if the model is not currently deployed or is not running the TGI framework.
ImportError
if the transformers
Python client is not installed.
src/distilabel/models/llms/huggingface/inference_endpoints.py
def load(self) -> None: # noqa: C901\n \"\"\"Loads the `AsyncInferenceClient` client to connect to the Hugging Face Inference\n Endpoint.\n\n Raises:\n ImportError: if the `huggingface-hub` Python client is not installed.\n ValueError: if the model is not currently deployed or is not running the TGI framework.\n ImportError: if the `transformers` Python client is not installed.\n \"\"\"\n super().load()\n\n try:\n from huggingface_hub import (\n AsyncInferenceClient,\n InferenceClient,\n get_inference_endpoint,\n )\n except ImportError as ie:\n raise ImportError(\n \"Hugging Face Hub Python client is not installed. Please install it using\"\n \" `pip install huggingface-hub`.\"\n ) from ie\n\n if self.api_key is None:\n self.api_key = SecretStr(get_hf_token(self.__class__.__name__, \"api_key\"))\n\n if self.model_id is not None:\n client = InferenceClient(\n model=self.model_id, token=self.api_key.get_secret_value()\n )\n status = client.get_model_status()\n\n if (\n status.state not in {\"Loadable\", \"Loaded\"}\n and status.framework != \"text-generation-inference\"\n ):\n raise ValueError(\n f\"Model {self.model_id} is not currently deployed or is not running the TGI framework\"\n )\n\n self.base_url = client._resolve_url(\n model=self.model_id, task=\"text-generation\"\n )\n\n if self.endpoint_name is not None:\n client = get_inference_endpoint(\n name=self.endpoint_name,\n namespace=self.endpoint_namespace,\n token=self.api_key.get_secret_value(),\n )\n if client.status in [\"paused\", \"scaledToZero\"]:\n client.resume().wait(timeout=300)\n elif client.status == \"initializing\":\n client.wait(timeout=300)\n\n self.base_url = client.url\n self._model_name = client.repository\n\n self._aclient = AsyncInferenceClient(\n base_url=self.base_url,\n token=self.api_key.get_secret_value(),\n )\n\n if self.tokenizer_id:\n try:\n from transformers import AutoTokenizer\n except ImportError as ie:\n raise ImportError(\n \"Transformers Python client is not installed. Please install it using\"\n \" `pip install transformers`.\"\n ) from ie\n\n self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_id)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.InferenceEndpointsLLM.prepare_input","title":"prepare_input(input)
","text":"Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name Type Description Defaultinput
StandardInput
the input list containing chat items.
requiredReturns:
Type Descriptionstr
The prompt to send to the LLM.
Source code insrc/distilabel/models/llms/huggingface/inference_endpoints.py
def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n prompt: str = (\n self._tokenizer.apply_chat_template( # type: ignore\n conversation=input, # type: ignore\n tokenize=False,\n add_generation_prompt=True,\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n
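A sketch of how `prepare_input` behaves once a `tokenizer_id` has been set; it assumes a valid Hugging Face token and network access so that `load()` can resolve the model and download the tokenizer.
```python
from distilabel.models.llms.huggingface import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3-70B-Instruct",
    tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
)
llm.load()

prompt = llm.prepare_input([{"role": "user", "content": "Hello world!"}])
# `prompt` is a single string rendered with the tokenizer's chat template,
# with the generation prompt appended (and the Magpie pre-query template, if enabled).
```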
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.InferenceEndpointsLLM._get_structured_output","title":"_get_structured_output(input)
","text":"Gets the structured output (if any) for the given input.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requiredReturns:
Type DescriptionStandardInput
The input and the structured output that will be passed as grammar
to the
Union[Dict[str, Any], None]
inference endpoint or None
if not required.
src/distilabel/models/llms/huggingface/inference_endpoints.py
def _get_structured_output(\n self, input: FormattedInput\n) -> Tuple[\"StandardInput\", Union[Dict[str, Any], None]]:\n \"\"\"Gets the structured output (if any) for the given input.\n\n Args:\n input: a single input in chat format to generate responses for.\n\n Returns:\n The input and the structured output that will be passed as `grammar` to the\n inference endpoint or `None` if not required.\n \"\"\"\n structured_output = None\n\n # Specific structured output per input\n if isinstance(input, tuple):\n input, structured_output = input\n structured_output = {\n \"type\": structured_output[\"format\"], # type: ignore\n \"value\": structured_output[\"schema\"], # type: ignore\n }\n\n # Same structured output for all the inputs\n if structured_output is None and self.structured_output is not None:\n try:\n structured_output = {\n \"type\": self.structured_output[\"format\"], # type: ignore\n \"value\": self.structured_output[\"schema\"], # type: ignore\n }\n except KeyError as e:\n raise ValueError(\n \"To use the structured output you have to inform the `format` and `schema` in \"\n \"the `structured_output` attribute.\"\n ) from e\n\n if structured_output:\n if isinstance(structured_output[\"value\"], ModelMetaclass):\n structured_output[\"value\"] = structured_output[\n \"value\"\n ].model_json_schema()\n\n return input, structured_output\n
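An illustrative sketch of the per-input structured output handling shown above: a `(chat, structured_output)` tuple is split and normalized into the `{"type": ..., "value": ...}` dict that is later sent as `grammar`.
```python
from pydantic import BaseModel

class User(BaseModel):
    name: str

formatted_input = (
    [{"role": "user", "content": "Create a user"}],
    {"format": "json", "schema": User.model_json_schema()},
)

# _get_structured_output(formatted_input) is expected to return the bare chat
# plus {"type": "json", "value": <User JSON schema>}.
```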
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.InferenceEndpointsLLM._check_stop_sequences","title":"_check_stop_sequences(stop_sequences=None)
","text":"Checks that no more than 4 stop sequences are provided.
Parameters:
Name Type Description Defaultstop_sequences
Optional[Union[str, List[str]]]
the stop sequences to be checked.
None
Returns:
Type DescriptionUnion[List[str], None]
The stop sequences.
Source code insrc/distilabel/models/llms/huggingface/inference_endpoints.py
def _check_stop_sequences(\n self,\n stop_sequences: Optional[Union[str, List[str]]] = None,\n) -> Union[List[str], None]:\n \"\"\"Checks that no more than 4 stop sequences are provided.\n\n Args:\n stop_sequences: the stop sequences to be checked.\n\n Returns:\n The stop sequences.\n \"\"\"\n if stop_sequences is not None:\n if isinstance(stop_sequences, str):\n stop_sequences = [stop_sequences]\n if len(stop_sequences) > 4:\n warnings.warn(\n \"Only up to 4 stop sequences are allowed, so keeping the first 4 items only.\",\n UserWarning,\n stacklevel=2,\n )\n stop_sequences = stop_sequences[:4]\n return stop_sequences\n
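A short sketch of the helper above, assuming `llm` is an `InferenceEndpointsLLM` instance: a single string is wrapped in a list, and anything beyond four sequences is dropped with a warning.
```python
llm._check_stop_sequences("###")                      # -> ["###"]
llm._check_stop_sequences(["a", "b", "c", "d", "e"])  # -> ["a", "b", "c", "d"] (warns)
```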
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.InferenceEndpointsLLM.agenerate","title":"agenerate(input, max_new_tokens=128, frequency_penalty=None, logit_bias=None, logprobs=False, presence_penalty=None, seed=None, stop_sequences=None, temperature=1.0, tool_choice=None, tool_prompt=None, tools=None, top_logprobs=None, top_n_tokens=None, top_p=None, do_sample=False, repetition_penalty=None, return_full_text=False, top_k=None, typical_p=None, watermark=False)
async
","text":"Generates completions for the given input using the async client. This method uses two methods of the huggingface_hub.AsyncClient
: chat_completion
and text_generation
. chat_completion
method will be used only if no tokenizer_id
has been specified. Some arguments of this function are specific to the text_generation
method, while some others are specific to the chat_completion
method.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requiredmax_new_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
frequency_penalty
Optional[Annotated[float, Field(ge=-2.0, le=2.0)]]
a value between -2.0
and 2.0
. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. Defaults to None
.
None
logit_bias
Optional[List[float]]
modify the likelihood of specified tokens appearing in the completion. This argument is exclusive to the chat_completion
method and will be used only if tokenizer_id
is None
. Defaults to None
.
None
logprobs
bool
whether to return the log probabilities or not. This argument is exclusive to the chat_completion
method and will be used only if tokenizer_id
is None
. Defaults to False
.
False
presence_penalty
Optional[Annotated[float, Field(ge=-2.0, le=2.0)]]
a value between -2.0
and 2.0
. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. This argument is exclusive to the chat_completion
method and will be used only if tokenizer_id
is None
. Defaults to None
.
None
seed
Optional[int]
the seed to use for the generation. Defaults to None
.
None
stop_sequences
Optional[List[str]]
either a single string or a list of strings containing the sequences to stop the generation at. Defaults to None
, but will be set to the tokenizer.eos_token
if available.
None
temperature
float
the temperature to use for the generation. Defaults to 1.0
.
1.0
tool_choice
Optional[Union[Dict[str, str], Literal['auto']]]
the name of the tool the model should call. It can be a dictionary like {\"function_name\": \"my_tool\"}
or \"auto\". If not provided, then the model won't use any tool. This argument is exclusive to the chat_completion
method and will be used only if tokenizer_id
is None
. Defaults to None
.
None
tool_prompt
Optional[str]
A prompt to be appended before the tools. This argument is exclusive to the chat_completion
method and will be used only if tokenizer_id
is None
. Defaults to None
.
None
tools
Optional[List[Dict[str, Any]]]
a list of tools definitions that the LLM can use. This argument is exclusive to the chat_completion
method and will be used only if tokenizer_id
is None
. Defaults to None
.
None
top_logprobs
Optional[PositiveInt]
the number of top log probabilities to return per output token generated. This argument is exclusive to the chat_completion
method and will be used only if tokenizer_id
is None
. Defaults to None
.
None
top_n_tokens
Optional[PositiveInt]
the number of top log probabilities to return per output token generated. This argument is exclusive of the text_generation
method and will be only used if tokenizer_id
is not None
. Defaults to None
.
None
top_p
Optional[float]
the top-p value to use for the generation. Defaults to None
.
None
do_sample
bool
whether to use sampling for the generation. This argument is exclusive of the text_generation
method and will be only used if tokenizer_id
is not None
. Defaults to False
.
False
repetition_penalty
Optional[float]
the repetition penalty to use for the generation. This argument is exclusive of the text_generation
method and will be only used if tokenizer_id
is not None
. Defaults to None
.
None
return_full_text
bool
whether to return the full text of the completion or just the generated text. Defaults to False
, meaning that only the generated text will be returned. This argument is exclusive of the text_generation
method and will be only used if tokenizer_id
is not None
.
False
top_k
Optional[int]
the top-k value to use for the generation. This argument is exclusive of the text_generation
method and will be only used if tokenizer_id
is not None
. Defaults to None
.
None
typical_p
Optional[float]
the typical-p value to use for the generation. This argument is exclusive of the text_generation
method and will be only used if tokenizer_id
is not None
. Defaults to None
.
None
watermark
bool
whether to add the watermark to the generated text. This argument is exclusive of the text_generation
method and will be only used if tokenizer_id
is not None
. Defaults to False
.
False
Returns:
Type DescriptionGenerateOutput
A list of lists of strings containing the generated responses for each input.
Source code insrc/distilabel/models/llms/huggingface/inference_endpoints.py
@validate_call\nasync def agenerate( # type: ignore\n self,\n input: FormattedInput,\n max_new_tokens: int = 128,\n frequency_penalty: Optional[Annotated[float, Field(ge=-2.0, le=2.0)]] = None,\n logit_bias: Optional[List[float]] = None,\n logprobs: bool = False,\n presence_penalty: Optional[Annotated[float, Field(ge=-2.0, le=2.0)]] = None,\n seed: Optional[int] = None,\n stop_sequences: Optional[List[str]] = None,\n temperature: float = 1.0,\n tool_choice: Optional[Union[Dict[str, str], Literal[\"auto\"]]] = None,\n tool_prompt: Optional[str] = None,\n tools: Optional[List[Dict[str, Any]]] = None,\n top_logprobs: Optional[PositiveInt] = None,\n top_n_tokens: Optional[PositiveInt] = None,\n top_p: Optional[float] = None,\n do_sample: bool = False,\n repetition_penalty: Optional[float] = None,\n return_full_text: bool = False,\n top_k: Optional[int] = None,\n typical_p: Optional[float] = None,\n watermark: bool = False,\n) -> GenerateOutput:\n \"\"\"Generates completions for the given input using the async client. This method\n uses two methods of the `huggingface_hub.AsyncClient`: `chat_completion` and `text_generation`.\n `chat_completion` method will be used only if no `tokenizer_id` has been specified.\n Some arguments of this function are specific to the `text_generation` method, while\n some others are specific to the `chat_completion` method.\n\n Args:\n input: a single input in chat format to generate responses for.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n frequency_penalty: a value between `-2.0` and `2.0`. Positive values penalize\n new tokens based on their existing frequency in the text so far, decreasing\n model's likelihood to repeat the same line verbatim. Defauls to `None`.\n logit_bias: modify the likelihood of specified tokens appearing in the completion.\n This argument is exclusive to the `chat_completion` method and will be used\n only if `tokenizer_id` is `None`.\n Defaults to `None`.\n logprobs: whether to return the log probabilities or not. This argument is exclusive\n to the `chat_completion` method and will be used only if `tokenizer_id`\n is `None`. Defaults to `False`.\n presence_penalty: a value between `-2.0` and `2.0`. Positive values penalize\n new tokens based on whether they appear in the text so far, increasing the\n model likelihood to talk about new topics. This argument is exclusive to\n the `chat_completion` method and will be used only if `tokenizer_id` is\n `None`. Defauls to `None`.\n seed: the seed to use for the generation. Defaults to `None`.\n stop_sequences: either a single string or a list of strings containing the sequences\n to stop the generation at. Defaults to `None`, but will be set to the\n `tokenizer.eos_token` if available.\n temperature: the temperature to use for the generation. Defaults to `1.0`.\n tool_choice: the name of the tool the model should call. It can be a dictionary\n like `{\"function_name\": \"my_tool\"}` or \"auto\". If not provided, then the\n model won't use any tool. This argument is exclusive to the `chat_completion`\n method and will be used only if `tokenizer_id` is `None`. Defaults to `None`.\n tool_prompt: A prompt to be appended before the tools. This argument is exclusive\n to the `chat_completion` method and will be used only if `tokenizer_id`\n is `None`. Defauls to `None`.\n tools: a list of tools definitions that the LLM can use.\n This argument is exclusive to the `chat_completion` method and will be used\n only if `tokenizer_id` is `None`. 
Defaults to `None`.\n top_logprobs: the number of top log probabilities to return per output token\n generated. This argument is exclusive to the `chat_completion` method and\n will be used only if `tokenizer_id` is `None`. Defaults to `None`.\n top_n_tokens: the number of top log probabilities to return per output token\n generated. This argument is exclusive of the `text_generation` method and\n will be only used if `tokenizer_id` is not `None`. Defaults to `None`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n do_sample: whether to use sampling for the generation. This argument is exclusive\n of the `text_generation` method and will be only used if `tokenizer_id` is not\n `None`. Defaults to `False`.\n repetition_penalty: the repetition penalty to use for the generation. This argument\n is exclusive of the `text_generation` method and will be only used if `tokenizer_id`\n is not `None`. Defaults to `None`.\n return_full_text: whether to return the full text of the completion or just\n the generated text. Defaults to `False`, meaning that only the generated\n text will be returned. This argument is exclusive of the `text_generation`\n method and will be only used if `tokenizer_id` is not `None`.\n top_k: the top-k value to use for the generation. This argument is exclusive\n of the `text_generation` method and will be only used if `tokenizer_id`\n is not `None`. Defaults to `0.8`, since neither `0.0` nor `1.0` are valid\n values in TGI.\n typical_p: the typical-p value to use for the generation. This argument is exclusive\n of the `text_generation` method and will be only used if `tokenizer_id`\n is not `None`. Defaults to `None`.\n watermark: whether to add the watermark to the generated text. This argument\n is exclusive of the `text_generation` method and will be only used if `tokenizer_id`\n is not `None`. Defaults to `None`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n stop_sequences = self._check_stop_sequences(stop_sequences)\n\n if self.tokenizer_id is None:\n return await self._generate_with_chat_completion(\n input=input, # type: ignore\n max_new_tokens=max_new_tokens,\n frequency_penalty=frequency_penalty,\n logit_bias=logit_bias,\n logprobs=logprobs,\n presence_penalty=presence_penalty,\n seed=seed,\n stop_sequences=stop_sequences,\n temperature=temperature,\n tool_choice=tool_choice,\n tool_prompt=tool_prompt,\n tools=tools,\n top_logprobs=top_logprobs,\n top_p=top_p,\n )\n\n return await self._generate_with_text_generation(\n input=input,\n max_new_tokens=max_new_tokens,\n do_sample=do_sample,\n typical_p=typical_p,\n repetition_penalty=repetition_penalty,\n frequency_penalty=frequency_penalty,\n temperature=temperature,\n top_n_tokens=top_n_tokens,\n top_p=top_p,\n top_k=top_k,\n stop_sequences=stop_sequences,\n return_full_text=return_full_text,\n seed=seed,\n watermark=watermark,\n )\n
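As a quick orientation, a minimal sketch (assuming a reachable serverless model or Inference Endpoint and a valid Hugging Face token) of awaiting agenerate directly; in day-to-day pipelines the synchronous generate_outputs wrapper is normally used instead:
import asyncio\n\nfrom distilabel.models.llms import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\nllm.load()\n\n# With no `tokenizer_id`, the call is routed to `chat_completion`\noutput = asyncio.run(\n    llm.agenerate(\n        input=[{\"role\": \"user\", \"content\": \"Hello world!\"}],\n        max_new_tokens=64,\n        temperature=0.7,\n        stop_sequences=[\"<|eot_id|>\"],\n    )\n)\n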
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TransformersLLM","title":"TransformersLLM
","text":" Bases: LLM
, MagpieChatTemplateMixin
, CudaDevicePlacementMixin
Hugging Face transformers
library LLM implementation using the text generation pipeline.
Attributes:
Name Type Descriptionmodel
str
the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
revision
str
if model
refers to a Hugging Face Hub repository, then the revision (e.g. a branch name or a commit id) to use. Defaults to \"main\"
.
torch_dtype
str
the torch dtype to use for the model e.g. \"float16\", \"float32\", etc. Defaults to \"auto\"
.
trust_remote_code
bool
whether to allow fetching and executing remote code fetched from the repository in the Hub. Defaults to False
.
model_kwargs
Optional[Dict[str, Any]]
additional dictionary of keyword arguments that will be passed to the from_pretrained
method of the model.
tokenizer
Optional[str]
the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer config files. If not provided, the one associated to the model
will be used. Defaults to None
.
use_fast
bool
whether to use a fast tokenizer or not. Defaults to True
.
chat_template
Optional[str]
a chat template that will be used to build the prompts before sending them to the model. If not provided, the chat template defined in the tokenizer config will be used. If not provided and the tokenizer doesn't have a chat template, then ChatML template will be used. Defaults to None
.
device
Optional[Union[str, int]]
the name or index of the device where the model will be loaded. Defaults to None
.
device_map
Optional[Union[str, Dict[str, Any]]]
a dictionary mapping each layer of the model to a device, or a mode like \"sequential\"
or \"auto\"
. Defaults to None
.
token
Optional[SecretStr]
the Hugging Face Hub token that will be used to authenticate to the Hugging Face Hub. If not provided, the HF_TOKEN
environment or huggingface_hub
package local configuration will be used. Defaults to None
.
structured_output
Optional[RuntimeParameter[OutlinesStructuredOutputType]]
a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
use_magpie_template
bool
a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template
Union[MagpieAvailablePreQueryTemplates, str, None]
the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
:hugging:
Examples:
Generate text:
from distilabel.models.llms import TransformersLLM\n\nllm = TransformersLLM(model=\"microsoft/Phi-3-mini-4k-instruct\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
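A further sketch (assuming the optional outlines dependency is installed) of configuring the structured_output attribute described above:
from pydantic import BaseModel\n\nfrom distilabel.models.llms import TransformersLLM\n\nclass User(BaseModel):\n    name: str\n    last_name: str\n    id: int\n\nllm = TransformersLLM(\n    model=\"microsoft/Phi-3-mini-4k-instruct\",\n    structured_output={\"format\": \"json\", \"schema\": User},\n)\n\nllm.load()\n\n# The generations are constrained to JSON objects matching the `User` schema\noutput = llm.generate_outputs(\n    inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]]\n)\n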
Source code in src/distilabel/models/llms/huggingface/transformers.py
class TransformersLLM(LLM, MagpieChatTemplateMixin, CudaDevicePlacementMixin):\n \"\"\"Hugging Face `transformers` library LLM implementation using the text generation\n pipeline.\n\n Attributes:\n model: the model Hugging Face Hub repo id or a path to a directory containing the\n model weights and configuration files.\n revision: if `model` refers to a Hugging Face Hub repository, then the revision\n (e.g. a branch name or a commit id) to use. Defaults to `\"main\"`.\n torch_dtype: the torch dtype to use for the model e.g. \"float16\", \"float32\", etc.\n Defaults to `\"auto\"`.\n trust_remote_code: whether to allow fetching and executing remote code fetched\n from the repository in the Hub. Defaults to `False`.\n model_kwargs: additional dictionary of keyword arguments that will be passed to\n the `from_pretrained` method of the model.\n tokenizer: the tokenizer Hugging Face Hub repo id or a path to a directory containing\n the tokenizer config files. If not provided, the one associated to the `model`\n will be used. Defaults to `None`.\n use_fast: whether to use a fast tokenizer or not. Defaults to `True`.\n chat_template: a chat template that will be used to build the prompts before\n sending them to the model. If not provided, the chat template defined in the\n tokenizer config will be used. If not provided and the tokenizer doesn't have\n a chat template, then ChatML template will be used. Defaults to `None`.\n device: the name or index of the device where the model will be loaded. Defaults\n to `None`.\n device_map: a dictionary mapping each layer of the model to a device, or a mode\n like `\"sequential\"` or `\"auto\"`. Defaults to `None`.\n token: the Hugging Face Hub token that will be used to authenticate to the Hugging\n Face Hub. If not provided, the `HF_TOKEN` environment or `huggingface_hub` package\n local configuration will be used. Defaults to `None`.\n structured_output: a dictionary containing the structured output configuration or if more\n fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.\n use_magpie_template: a flag used to enable/disable applying the Magpie pre-query\n template. Defaults to `False`.\n magpie_pre_query_template: the pre-query template to be applied to the prompt or\n sent to the LLM to generate an instruction or a follow up user message. Valid\n values are \"llama3\", \"qwen2\" or another pre-query template provided. 
Defaults\n to `None`.\n\n Icon:\n `:hugging:`\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import TransformersLLM\n\n llm = TransformersLLM(model=\"microsoft/Phi-3-mini-4k-instruct\")\n\n llm.load()\n\n # Call the model\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n \"\"\"\n\n model: str\n revision: str = \"main\"\n torch_dtype: str = \"auto\"\n trust_remote_code: bool = False\n model_kwargs: Optional[Dict[str, Any]] = None\n tokenizer: Optional[str] = None\n use_fast: bool = True\n chat_template: Optional[str] = None\n device: Optional[Union[str, int]] = None\n device_map: Optional[Union[str, Dict[str, Any]]] = None\n token: Optional[SecretStr] = Field(\n default_factory=lambda: os.getenv(HF_TOKEN_ENV_VAR)\n )\n structured_output: Optional[RuntimeParameter[OutlinesStructuredOutputType]] = Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n\n _pipeline: Optional[\"Pipeline\"] = PrivateAttr(...)\n _prefix_allowed_tokens_fn: Union[Callable, None] = PrivateAttr(default=None)\n _logits_processor: Optional[\"LogitsProcessorList\"] = PrivateAttr(default=None)\n\n def load(self) -> None:\n \"\"\"Loads the model and tokenizer and creates the text generation pipeline. In addition,\n it will configure the tokenizer chat template.\"\"\"\n if self.device == \"cuda\":\n CudaDevicePlacementMixin.load(self)\n\n try:\n from transformers import LogitsProcessorList, pipeline\n except ImportError as ie:\n raise ImportError(\n \"Transformers is not installed. Please install it using `pip install transformers`.\"\n ) from ie\n\n token = self.token.get_secret_value() if self.token is not None else self.token\n\n self._pipeline = pipeline(\n \"text-generation\",\n model=self.model,\n revision=self.revision,\n torch_dtype=self.torch_dtype,\n trust_remote_code=self.trust_remote_code,\n model_kwargs=self.model_kwargs or {},\n tokenizer=self.tokenizer or self.model,\n use_fast=self.use_fast,\n device=self.device,\n device_map=self.device_map,\n token=token,\n return_full_text=False,\n )\n\n if self.chat_template is not None:\n self._pipeline.tokenizer.chat_template = self.chat_template # type: ignore\n\n if self._pipeline.tokenizer.pad_token is None: # type: ignore\n self._pipeline.tokenizer.pad_token = self._pipeline.tokenizer.eos_token # type: ignore\n\n if self.structured_output:\n from distilabel.steps.tasks.structured_outputs.outlines import (\n outlines_below_0_1_0,\n )\n\n if outlines_below_0_1_0:\n self._prefix_allowed_tokens_fn = self._prepare_structured_output(\n self.structured_output\n )\n else:\n logits_processor = self._prepare_structured_output(\n self.structured_output\n )\n self._logits_processor = LogitsProcessorList([logits_processor])\n\n super().load()\n\n def unload(self) -> None:\n \"\"\"Unloads the `vLLM` model.\"\"\"\n CudaDevicePlacementMixin.unload(self)\n super().unload()\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n if self._pipeline.tokenizer.chat_template is None: # type: ignore\n return input[0][\"content\"]\n\n prompt: str = (\n self._pipeline.tokenizer.apply_chat_template( # type: ignore\n input, 
# type: ignore\n tokenize=False,\n add_generation_prompt=True,\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n\n @validate_call\n def generate( # type: ignore\n self,\n inputs: List[StandardInput],\n num_generations: int = 1,\n max_new_tokens: int = 128,\n temperature: float = 0.1,\n repetition_penalty: float = 1.1,\n top_p: float = 1.0,\n top_k: int = 0,\n do_sample: bool = True,\n ) -> List[GenerateOutput]:\n \"\"\"Generates `num_generations` responses for each input using the text generation\n pipeline.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n repetition_penalty: the repetition penalty to use for the generation. Defaults\n to `1.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n top_k: the top-k value to use for the generation. Defaults to `0`.\n do_sample: whether to use sampling or not. Defaults to `True`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n prepared_inputs = [self.prepare_input(input=input) for input in inputs]\n\n outputs: List[List[Dict[str, str]]] = self._pipeline(\n prepared_inputs,\n max_new_tokens=max_new_tokens,\n temperature=temperature,\n repetition_penalty=repetition_penalty,\n top_p=top_p,\n top_k=top_k,\n do_sample=do_sample,\n num_return_sequences=num_generations,\n prefix_allowed_tokens_fn=self._prefix_allowed_tokens_fn,\n logits_processor=self._logits_processor,\n pad_token_id=self._pipeline.tokenizer.eos_token_id,\n )\n llm_output = [\n [generation[\"generated_text\"] for generation in output]\n for output in outputs\n ]\n\n result = []\n for input, output in zip(inputs, llm_output):\n result.append(\n prepare_output(\n output,\n input_tokens=[\n compute_tokens(input, self._pipeline.tokenizer.encode)\n ],\n output_tokens=[\n compute_tokens(row, self._pipeline.tokenizer.encode)\n for row in output\n ],\n )\n )\n\n return result\n\n def get_last_hidden_states(\n self, inputs: List[\"StandardInput\"]\n ) -> List[\"HiddenState\"]:\n \"\"\"Gets the last `hidden_states` of the model for the given inputs. 
It doesn't\n execute the task head.\n\n Args:\n inputs: a list of inputs in chat format to generate the embeddings for.\n\n Returns:\n A list containing the last hidden state for each sequence using a NumPy array\n with shape [num_tokens, hidden_size].\n \"\"\"\n model: \"PreTrainedModel\" = (\n self._pipeline.model.model # type: ignore\n if hasattr(self._pipeline.model, \"model\") # type: ignore\n else next(self._pipeline.model.children()) # type: ignore\n )\n tokenizer: \"PreTrainedTokenizer\" = self._pipeline.tokenizer # type: ignore\n input_ids = tokenizer(\n [self.prepare_input(input) for input in inputs], # type: ignore\n return_tensors=\"pt\",\n padding=True,\n ).to(model.device)\n last_hidden_states = model(**input_ids)[\"last_hidden_state\"]\n\n return [\n seq_last_hidden_state[attention_mask.bool(), :].detach().cpu().numpy()\n for seq_last_hidden_state, attention_mask in zip(\n last_hidden_states,\n input_ids[\"attention_mask\"], # type: ignore\n )\n ]\n\n def _prepare_structured_output(\n self, structured_output: Optional[OutlinesStructuredOutputType] = None\n ) -> Union[Callable, None]:\n \"\"\"Creates the appropriate function to filter tokens to generate structured outputs.\n\n Args:\n structured_output: the configuration dict to prepare the structured output.\n\n Returns:\n The callable that will be used to guide the generation of the model.\n \"\"\"\n from distilabel.steps.tasks.structured_outputs.outlines import (\n prepare_guided_output,\n )\n\n result = prepare_guided_output(\n structured_output, \"transformers\", self._pipeline\n )\n if schema := result.get(\"schema\"):\n self.structured_output[\"schema\"] = schema\n return result[\"processor\"]\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TransformersLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TransformersLLM.load","title":"load()
","text":"Loads the model and tokenizer and creates the text generation pipeline. In addition, it will configure the tokenizer chat template.
Source code insrc/distilabel/models/llms/huggingface/transformers.py
def load(self) -> None:\n \"\"\"Loads the model and tokenizer and creates the text generation pipeline. In addition,\n it will configure the tokenizer chat template.\"\"\"\n if self.device == \"cuda\":\n CudaDevicePlacementMixin.load(self)\n\n try:\n from transformers import LogitsProcessorList, pipeline\n except ImportError as ie:\n raise ImportError(\n \"Transformers is not installed. Please install it using `pip install transformers`.\"\n ) from ie\n\n token = self.token.get_secret_value() if self.token is not None else self.token\n\n self._pipeline = pipeline(\n \"text-generation\",\n model=self.model,\n revision=self.revision,\n torch_dtype=self.torch_dtype,\n trust_remote_code=self.trust_remote_code,\n model_kwargs=self.model_kwargs or {},\n tokenizer=self.tokenizer or self.model,\n use_fast=self.use_fast,\n device=self.device,\n device_map=self.device_map,\n token=token,\n return_full_text=False,\n )\n\n if self.chat_template is not None:\n self._pipeline.tokenizer.chat_template = self.chat_template # type: ignore\n\n if self._pipeline.tokenizer.pad_token is None: # type: ignore\n self._pipeline.tokenizer.pad_token = self._pipeline.tokenizer.eos_token # type: ignore\n\n if self.structured_output:\n from distilabel.steps.tasks.structured_outputs.outlines import (\n outlines_below_0_1_0,\n )\n\n if outlines_below_0_1_0:\n self._prefix_allowed_tokens_fn = self._prepare_structured_output(\n self.structured_output\n )\n else:\n logits_processor = self._prepare_structured_output(\n self.structured_output\n )\n self._logits_processor = LogitsProcessorList([logits_processor])\n\n super().load()\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TransformersLLM.unload","title":"unload()
","text":"Unloads the vLLM
model.
src/distilabel/models/llms/huggingface/transformers.py
def unload(self) -> None:\n    \"\"\"Unloads the `transformers` model.\"\"\"\n    CudaDevicePlacementMixin.unload(self)\n    super().unload()\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TransformersLLM.prepare_input","title":"prepare_input(input)
","text":"Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name Type Description Defaultinput
StandardInput
the input list containing chat items.
requiredReturns:
Type Descriptionstr
The prompt to send to the LLM.
Source code insrc/distilabel/models/llms/huggingface/transformers.py
def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n if self._pipeline.tokenizer.chat_template is None: # type: ignore\n return input[0][\"content\"]\n\n prompt: str = (\n self._pipeline.tokenizer.apply_chat_template( # type: ignore\n input, # type: ignore\n tokenize=False,\n add_generation_prompt=True,\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TransformersLLM.generate","title":"generate(inputs, num_generations=1, max_new_tokens=128, temperature=0.1, repetition_penalty=1.1, top_p=1.0, top_k=0, do_sample=True)
","text":"Generates num_generations
responses for each input using the text generation pipeline.
Parameters:
Name Type Description Defaultinputs
List[StandardInput]
a list of inputs in chat format to generate responses for.
requirednum_generations
int
the number of generations to create per input. Defaults to 1
.
1
max_new_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
temperature
float
the temperature to use for the generation. Defaults to 0.1
.
0.1
repetition_penalty
float
the repetition penalty to use for the generation. Defaults to 1.1
.
1.1
top_p
float
the top-p value to use for the generation. Defaults to 1.0
.
1.0
top_k
int
the top-k value to use for the generation. Defaults to 0
.
0
do_sample
bool
whether to use sampling or not. Defaults to True
.
True
Returns:
Type DescriptionList[GenerateOutput]
A list of lists of strings containing the generated responses for each input.
Source code insrc/distilabel/models/llms/huggingface/transformers.py
@validate_call\ndef generate( # type: ignore\n self,\n inputs: List[StandardInput],\n num_generations: int = 1,\n max_new_tokens: int = 128,\n temperature: float = 0.1,\n repetition_penalty: float = 1.1,\n top_p: float = 1.0,\n top_k: int = 0,\n do_sample: bool = True,\n) -> List[GenerateOutput]:\n \"\"\"Generates `num_generations` responses for each input using the text generation\n pipeline.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n repetition_penalty: the repetition penalty to use for the generation. Defaults\n to `1.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n top_k: the top-k value to use for the generation. Defaults to `0`.\n do_sample: whether to use sampling or not. Defaults to `True`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n prepared_inputs = [self.prepare_input(input=input) for input in inputs]\n\n outputs: List[List[Dict[str, str]]] = self._pipeline(\n prepared_inputs,\n max_new_tokens=max_new_tokens,\n temperature=temperature,\n repetition_penalty=repetition_penalty,\n top_p=top_p,\n top_k=top_k,\n do_sample=do_sample,\n num_return_sequences=num_generations,\n prefix_allowed_tokens_fn=self._prefix_allowed_tokens_fn,\n logits_processor=self._logits_processor,\n pad_token_id=self._pipeline.tokenizer.eos_token_id,\n )\n llm_output = [\n [generation[\"generated_text\"] for generation in output]\n for output in outputs\n ]\n\n result = []\n for input, output in zip(inputs, llm_output):\n result.append(\n prepare_output(\n output,\n input_tokens=[\n compute_tokens(input, self._pipeline.tokenizer.encode)\n ],\n output_tokens=[\n compute_tokens(row, self._pipeline.tokenizer.encode)\n for row in output\n ],\n )\n )\n\n return result\n
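A short usage sketch (hypothetical sampling values, reusing the model from the example above) of the parameters documented for generate:
from distilabel.models.llms import TransformersLLM\n\nllm = TransformersLLM(model=\"microsoft/Phi-3-mini-4k-instruct\")\nllm.load()\n\n# Two sampled generations per input\noutputs = llm.generate(\n    inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]],\n    num_generations=2,\n    max_new_tokens=64,\n    temperature=0.7,\n    top_p=0.9,\n    do_sample=True,\n)\n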
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TransformersLLM.get_last_hidden_states","title":"get_last_hidden_states(inputs)
","text":"Gets the last hidden_states
of the model for the given inputs. It doesn't execute the task head.
Parameters:
Name Type Description Defaultinputs
List[StandardInput]
a list of inputs in chat format to generate the embeddings for.
requiredReturns:
Type DescriptionList[HiddenState]
A list containing the last hidden state for each sequence using a NumPy array
List[HiddenState]
with shape [num_tokens, hidden_size].
Source code insrc/distilabel/models/llms/huggingface/transformers.py
def get_last_hidden_states(\n self, inputs: List[\"StandardInput\"]\n) -> List[\"HiddenState\"]:\n \"\"\"Gets the last `hidden_states` of the model for the given inputs. It doesn't\n execute the task head.\n\n Args:\n inputs: a list of inputs in chat format to generate the embeddings for.\n\n Returns:\n A list containing the last hidden state for each sequence using a NumPy array\n with shape [num_tokens, hidden_size].\n \"\"\"\n model: \"PreTrainedModel\" = (\n self._pipeline.model.model # type: ignore\n if hasattr(self._pipeline.model, \"model\") # type: ignore\n else next(self._pipeline.model.children()) # type: ignore\n )\n tokenizer: \"PreTrainedTokenizer\" = self._pipeline.tokenizer # type: ignore\n input_ids = tokenizer(\n [self.prepare_input(input) for input in inputs], # type: ignore\n return_tensors=\"pt\",\n padding=True,\n ).to(model.device)\n last_hidden_states = model(**input_ids)[\"last_hidden_state\"]\n\n return [\n seq_last_hidden_state[attention_mask.bool(), :].detach().cpu().numpy()\n for seq_last_hidden_state, attention_mask in zip(\n last_hidden_states,\n input_ids[\"attention_mask\"], # type: ignore\n )\n ]\n
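A minimal sketch (assuming the TransformersLLM from the examples above is already loaded) of retrieving the hidden states described here:
from distilabel.models.llms import TransformersLLM\n\nllm = TransformersLLM(model=\"microsoft/Phi-3-mini-4k-instruct\")\nllm.load()\n\nhidden_states = llm.get_last_hidden_states(\n    inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]]\n)\n# One NumPy array per input with shape [num_tokens, hidden_size]\nprint(hidden_states[0].shape)\n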
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TransformersLLM._prepare_structured_output","title":"_prepare_structured_output(structured_output=None)
","text":"Creates the appropriate function to filter tokens to generate structured outputs.
Parameters:
Name Type Description Defaultstructured_output
Optional[OutlinesStructuredOutputType]
the configuration dict to prepare the structured output.
None
Returns:
Type DescriptionUnion[Callable, None]
The callable that will be used to guide the generation of the model.
Source code insrc/distilabel/models/llms/huggingface/transformers.py
def _prepare_structured_output(\n self, structured_output: Optional[OutlinesStructuredOutputType] = None\n) -> Union[Callable, None]:\n \"\"\"Creates the appropriate function to filter tokens to generate structured outputs.\n\n Args:\n structured_output: the configuration dict to prepare the structured output.\n\n Returns:\n The callable that will be used to guide the generation of the model.\n \"\"\"\n from distilabel.steps.tasks.structured_outputs.outlines import (\n prepare_guided_output,\n )\n\n result = prepare_guided_output(\n structured_output, \"transformers\", self._pipeline\n )\n if schema := result.get(\"schema\"):\n self.structured_output[\"schema\"] = schema\n return result[\"processor\"]\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LiteLLM","title":"LiteLLM
","text":" Bases: AsyncLLM
LiteLLM implementation running the async API client.
Attributes:
Name Type Descriptionmodel
str
the model name to use for the LLM e.g. \"gpt-3.5-turbo\" or \"mistral/mistral-large\", etc.
verbose
RuntimeParameter[bool]
whether to log the LiteLLM client's logs. Defaults to False
.
structured_output
Optional[RuntimeParameter[InstructorStructuredOutputType]]
a dictionary containing the structured output configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
verbose
: whether to log the LiteLLM client's logs. Defaults to False
.Examples:
Generate text:
from distilabel.models.llms import LiteLLM\n\nllm = LiteLLM(model=\"gpt-3.5-turbo\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pydantic import BaseModel\nfrom distilabel.models.llms import LiteLLM\n\nclass User(BaseModel):\n    name: str\n    last_name: str\n    id: int\n\nllm = LiteLLM(\n    model=\"gpt-3.5-turbo\",\n    api_key=\"api.key\",\n    structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
Source code in src/distilabel/models/llms/litellm.py
class LiteLLM(AsyncLLM):\n \"\"\"LiteLLM implementation running the async API client.\n\n Attributes:\n model: the model name to use for the LLM e.g. \"gpt-3.5-turbo\" or \"mistral/mistral-large\",\n etc.\n verbose: whether to log the LiteLLM client's logs. Defaults to `False`.\n structured_output: a dictionary containing the structured output configuration configuration\n using `instructor`. You can take a look at the dictionary structure in\n `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.\n\n Runtime parameters:\n - `verbose`: whether to log the LiteLLM client's logs. Defaults to `False`.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import LiteLLM\n\n llm = LiteLLM(model=\"gpt-3.5-turbo\")\n\n llm.load()\n\n # Call the model\n output = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n\n Generate structured data:\n\n ```python\n from pydantic import BaseModel\n from distilabel.models.llms import LiteLLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = LiteLLM(\n model=\"gpt-3.5-turbo\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n \"\"\"\n\n model: str\n verbose: RuntimeParameter[bool] = Field(\n default=False, description=\"Whether to log the LiteLLM client's logs.\"\n )\n structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (\n Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n )\n\n _aclient: Optional[Callable] = PrivateAttr(...)\n\n def load(self) -> None:\n \"\"\"\n Loads the `acompletion` LiteLLM client to benefit from async requests.\n \"\"\"\n super().load()\n\n try:\n import litellm\n\n litellm.telemetry = False\n except ImportError as e:\n raise ImportError(\n \"LiteLLM Python client is not installed. 
Please install it using\"\n \" `pip install litellm`.\"\n ) from e\n self._aclient = litellm.acompletion\n\n if not self.verbose:\n litellm.suppress_debug_info = True\n for key in logging.Logger.manager.loggerDict.keys():\n if \"litellm\" not in key.lower():\n continue\n logging.getLogger(key).setLevel(logging.CRITICAL)\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"litellm\",\n )\n self._aclient = result.get(\"client\")\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n @validate_call\n async def agenerate( # type: ignore # noqa: C901\n self,\n input: FormattedInput,\n num_generations: int = 1,\n functions: Optional[List] = None,\n function_call: Optional[str] = None,\n temperature: Optional[float] = 1.0,\n top_p: Optional[float] = 1.0,\n stop: Optional[Union[str, list]] = None,\n max_tokens: Optional[int] = None,\n presence_penalty: Optional[float] = None,\n frequency_penalty: Optional[float] = None,\n logit_bias: Optional[dict] = None,\n user: Optional[str] = None,\n metadata: Optional[dict] = None,\n api_base: Optional[str] = None,\n api_version: Optional[str] = None,\n api_key: Optional[str] = None,\n model_list: Optional[list] = None,\n mock_response: Optional[str] = None,\n force_timeout: Optional[int] = 600,\n custom_llm_provider: Optional[str] = None,\n ) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for the given input using the [LiteLLM async client](https://github.com/BerriAI/litellm).\n\n Args:\n input: a single input in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n functions: a list of functions to apply to the conversation messages. Defaults to\n `None`.\n function_call: the name of the function to call within the conversation. Defaults\n to `None`.\n temperature: the temperature to use for the generation. Defaults to `1.0`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n stop: Up to 4 sequences where the LLM API will stop generating further tokens.\n Defaults to `None`.\n max_tokens: The maximum number of tokens in the generated completion. Defaults to\n `None`.\n presence_penalty: It is used to penalize new tokens based on their existence in the\n text so far. Defaults to `None`.\n frequency_penalty: It is used to penalize new tokens based on their frequency in the\n text so far. Defaults to `None`.\n logit_bias: Used to modify the probability of specific tokens appearing in the\n completion. Defaults to `None`.\n user: A unique identifier representing your end-user. This can help the LLM provider\n to monitor and detect abuse. Defaults to `None`.\n metadata: Pass in additional metadata to tag your completion calls - eg. prompt\n version, details, etc. Defaults to `None`.\n api_base: Base URL for the API. Defaults to `None`.\n api_version: API version. Defaults to `None`.\n api_key: API key. Defaults to `None`.\n model_list: List of api base, version, keys. Defaults to `None`.\n mock_response: If provided, return a mock completion response for testing or debugging\n purposes. 
Defaults to `None`.\n force_timeout: The maximum execution time in seconds for the completion request.\n Defaults to `600`.\n custom_llm_provider: Used for Non-OpenAI LLMs, Example usage for bedrock, set(iterable)\n model=\"amazon.titan-tg1-large\" and custom_llm_provider=\"bedrock\". Defaults to\n `None`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n import litellm\n from litellm import token_counter\n\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output,\n client=self._aclient,\n framework=\"litellm\",\n )\n self._aclient = result.get(\"client\")\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"model\": self.model,\n \"messages\": input,\n \"n\": num_generations,\n \"functions\": functions,\n \"function_call\": function_call,\n \"temperature\": temperature,\n \"top_p\": top_p,\n \"stream\": False,\n \"stop\": stop,\n \"max_tokens\": max_tokens,\n \"presence_penalty\": presence_penalty,\n \"frequency_penalty\": frequency_penalty,\n \"logit_bias\": logit_bias,\n \"user\": user,\n \"metadata\": metadata,\n \"api_base\": api_base,\n \"api_version\": api_version,\n \"api_key\": api_key,\n \"model_list\": model_list,\n \"mock_response\": mock_response,\n \"force_timeout\": force_timeout,\n \"custom_llm_provider\": custom_llm_provider,\n }\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output)\n\n async def _call_aclient_until_n_choices() -> List[\"Choices\"]:\n choices = []\n while len(choices) < num_generations:\n completion = await self._aclient(**kwargs) # type: ignore\n if not self.structured_output:\n completion = completion.choices\n choices.extend(completion)\n return choices\n\n # litellm.drop_params is used to en/disable sending **kwargs parameters to the API if they cannot be used\n try:\n litellm.drop_params = False\n choices = await _call_aclient_until_n_choices()\n except litellm.exceptions.APIError as e:\n if \"does not support parameters\" in str(e):\n litellm.drop_params = True\n choices = await _call_aclient_until_n_choices()\n else:\n raise e\n\n generations = []\n input_tokens = [\n token_counter(model=self.model, messages=input)\n ] * num_generations\n output_tokens = []\n\n if self.structured_output:\n for choice in choices:\n generations.append(choice.model_dump_json())\n output_tokens.append(\n token_counter(\n model=self.model,\n text=orjson.dumps(choice.model_dump_json()).decode(\"utf-8\"),\n )\n )\n return prepare_output(\n generations,\n input_tokens=input_tokens,\n output_tokens=output_tokens,\n )\n\n for choice in choices:\n if (content := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"Received no response using LiteLLM client (model: '{self.model}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(content)\n output_tokens.append(token_counter(model=self.model, text=content))\n\n return prepare_output(\n generations, input_tokens=input_tokens, output_tokens=output_tokens\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LiteLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LiteLLM.load","title":"load()
","text":"Loads the acompletion
LiteLLM client to benefit from async requests.
src/distilabel/models/llms/litellm.py
def load(self) -> None:\n \"\"\"\n Loads the `acompletion` LiteLLM client to benefit from async requests.\n \"\"\"\n super().load()\n\n try:\n import litellm\n\n litellm.telemetry = False\n except ImportError as e:\n raise ImportError(\n \"LiteLLM Python client is not installed. Please install it using\"\n \" `pip install litellm`.\"\n ) from e\n self._aclient = litellm.acompletion\n\n if not self.verbose:\n litellm.suppress_debug_info = True\n for key in logging.Logger.manager.loggerDict.keys():\n if \"litellm\" not in key.lower():\n continue\n logging.getLogger(key).setLevel(logging.CRITICAL)\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"litellm\",\n )\n self._aclient = result.get(\"client\")\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LiteLLM.agenerate","title":"agenerate(input, num_generations=1, functions=None, function_call=None, temperature=1.0, top_p=1.0, stop=None, max_tokens=None, presence_penalty=None, frequency_penalty=None, logit_bias=None, user=None, metadata=None, api_base=None, api_version=None, api_key=None, model_list=None, mock_response=None, force_timeout=600, custom_llm_provider=None)
async
","text":"Generates num_generations
responses for the given input using the LiteLLM async client.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requirednum_generations
int
the number of generations to create per input. Defaults to 1
.
1
functions
Optional[List]
a list of functions to apply to the conversation messages. Defaults to None
.
None
function_call
Optional[str]
the name of the function to call within the conversation. Defaults to None
.
None
temperature
Optional[float]
the temperature to use for the generation. Defaults to 1.0
.
1.0
top_p
Optional[float]
the top-p value to use for the generation. Defaults to 1.0
.
1.0
stop
Optional[Union[str, list]]
Up to 4 sequences where the LLM API will stop generating further tokens. Defaults to None
.
None
max_tokens
Optional[int]
The maximum number of tokens in the generated completion. Defaults to None
.
None
presence_penalty
Optional[float]
It is used to penalize new tokens based on their existence in the text so far. Defaults to None
.
None
frequency_penalty
Optional[float]
It is used to penalize new tokens based on their frequency in the text so far. Defaults to None
.
None
logit_bias
Optional[dict]
Used to modify the probability of specific tokens appearing in the completion. Defaults to None
.
None
user
Optional[str]
A unique identifier representing your end-user. This can help the LLM provider to monitor and detect abuse. Defaults to None
.
None
metadata
Optional[dict]
Pass in additional metadata to tag your completion calls - e.g. prompt version, details, etc. Defaults to None
.
None
api_base
Optional[str]
Base URL for the API. Defaults to None
.
None
api_version
Optional[str]
API version. Defaults to None
.
None
api_key
Optional[str]
API key. Defaults to None
.
None
model_list
Optional[list]
List of api base, version, keys. Defaults to None
.
None
mock_response
Optional[str]
If provided, return a mock completion response for testing or debugging purposes. Defaults to None
.
None
force_timeout
Optional[int]
The maximum execution time in seconds for the completion request. Defaults to 600
.
600
custom_llm_provider
Optional[str]
Used for non-OpenAI LLMs. Example usage for Bedrock: set model=\"amazon.titan-tg1-large\" and custom_llm_provider=\"bedrock\". Defaults to None
.
None
Returns:
Type DescriptionGenerateOutput
A list of lists of strings containing the generated responses for each input.
Source code insrc/distilabel/models/llms/litellm.py
@validate_call\nasync def agenerate( # type: ignore # noqa: C901\n self,\n input: FormattedInput,\n num_generations: int = 1,\n functions: Optional[List] = None,\n function_call: Optional[str] = None,\n temperature: Optional[float] = 1.0,\n top_p: Optional[float] = 1.0,\n stop: Optional[Union[str, list]] = None,\n max_tokens: Optional[int] = None,\n presence_penalty: Optional[float] = None,\n frequency_penalty: Optional[float] = None,\n logit_bias: Optional[dict] = None,\n user: Optional[str] = None,\n metadata: Optional[dict] = None,\n api_base: Optional[str] = None,\n api_version: Optional[str] = None,\n api_key: Optional[str] = None,\n model_list: Optional[list] = None,\n mock_response: Optional[str] = None,\n force_timeout: Optional[int] = 600,\n custom_llm_provider: Optional[str] = None,\n) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for the given input using the [LiteLLM async client](https://github.com/BerriAI/litellm).\n\n Args:\n input: a single input in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n functions: a list of functions to apply to the conversation messages. Defaults to\n `None`.\n function_call: the name of the function to call within the conversation. Defaults\n to `None`.\n temperature: the temperature to use for the generation. Defaults to `1.0`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n stop: Up to 4 sequences where the LLM API will stop generating further tokens.\n Defaults to `None`.\n max_tokens: The maximum number of tokens in the generated completion. Defaults to\n `None`.\n presence_penalty: It is used to penalize new tokens based on their existence in the\n text so far. Defaults to `None`.\n frequency_penalty: It is used to penalize new tokens based on their frequency in the\n text so far. Defaults to `None`.\n logit_bias: Used to modify the probability of specific tokens appearing in the\n completion. Defaults to `None`.\n user: A unique identifier representing your end-user. This can help the LLM provider\n to monitor and detect abuse. Defaults to `None`.\n metadata: Pass in additional metadata to tag your completion calls - eg. prompt\n version, details, etc. Defaults to `None`.\n api_base: Base URL for the API. Defaults to `None`.\n api_version: API version. Defaults to `None`.\n api_key: API key. Defaults to `None`.\n model_list: List of api base, version, keys. Defaults to `None`.\n mock_response: If provided, return a mock completion response for testing or debugging\n purposes. Defaults to `None`.\n force_timeout: The maximum execution time in seconds for the completion request.\n Defaults to `600`.\n custom_llm_provider: Used for Non-OpenAI LLMs, Example usage for bedrock, set(iterable)\n model=\"amazon.titan-tg1-large\" and custom_llm_provider=\"bedrock\". 
Defaults to\n `None`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n import litellm\n from litellm import token_counter\n\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output,\n client=self._aclient,\n framework=\"litellm\",\n )\n self._aclient = result.get(\"client\")\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"model\": self.model,\n \"messages\": input,\n \"n\": num_generations,\n \"functions\": functions,\n \"function_call\": function_call,\n \"temperature\": temperature,\n \"top_p\": top_p,\n \"stream\": False,\n \"stop\": stop,\n \"max_tokens\": max_tokens,\n \"presence_penalty\": presence_penalty,\n \"frequency_penalty\": frequency_penalty,\n \"logit_bias\": logit_bias,\n \"user\": user,\n \"metadata\": metadata,\n \"api_base\": api_base,\n \"api_version\": api_version,\n \"api_key\": api_key,\n \"model_list\": model_list,\n \"mock_response\": mock_response,\n \"force_timeout\": force_timeout,\n \"custom_llm_provider\": custom_llm_provider,\n }\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output)\n\n async def _call_aclient_until_n_choices() -> List[\"Choices\"]:\n choices = []\n while len(choices) < num_generations:\n completion = await self._aclient(**kwargs) # type: ignore\n if not self.structured_output:\n completion = completion.choices\n choices.extend(completion)\n return choices\n\n # litellm.drop_params is used to en/disable sending **kwargs parameters to the API if they cannot be used\n try:\n litellm.drop_params = False\n choices = await _call_aclient_until_n_choices()\n except litellm.exceptions.APIError as e:\n if \"does not support parameters\" in str(e):\n litellm.drop_params = True\n choices = await _call_aclient_until_n_choices()\n else:\n raise e\n\n generations = []\n input_tokens = [\n token_counter(model=self.model, messages=input)\n ] * num_generations\n output_tokens = []\n\n if self.structured_output:\n for choice in choices:\n generations.append(choice.model_dump_json())\n output_tokens.append(\n token_counter(\n model=self.model,\n text=orjson.dumps(choice.model_dump_json()).decode(\"utf-8\"),\n )\n )\n return prepare_output(\n generations,\n input_tokens=input_tokens,\n output_tokens=output_tokens,\n )\n\n for choice in choices:\n if (content := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"Received no response using LiteLLM client (model: '{self.model}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(content)\n output_tokens.append(token_counter(model=self.model, text=content))\n\n return prepare_output(\n generations, input_tokens=input_tokens, output_tokens=output_tokens\n )\n
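For orientation, a brief sketch (assuming an OpenAI API key is available to LiteLLM, and that generate_outputs forwards its keyword arguments to agenerate) combining a few of the arguments documented above:
from distilabel.models.llms import LiteLLM\n\nllm = LiteLLM(model=\"gpt-3.5-turbo\")\nllm.load()\n\noutput = llm.generate_outputs(\n    inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]],\n    temperature=0.2,\n    max_tokens=256,\n    stop=[\"###\"],\n)\n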
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LlamaCppLLM","title":"LlamaCppLLM
","text":" Bases: LLM
, MagpieChatTemplateMixin
llama.cpp LLM implementation running the Python bindings for the C++ code.
Attributes:
Name Type Descriptionmodel_path
RuntimeParameter[FilePath]
contains the path to the GGUF quantized model, compatible with the installed version of the llama.cpp
Python bindings.
n_gpu_layers
RuntimeParameter[int]
the number of layers to use for the GPU. Defaults to -1
, meaning that the available GPU device will be used.
chat_format
Optional[RuntimeParameter[str]]
the chat format to use for the model. Defaults to None
, which means the Llama format will be used.
n_ctx
int
the context size to use for the model. Defaults to 512
.
n_batch
int
the prompt processing maximum batch size to use for the model. Defaults to 512
.
seed
int
random seed to use for the generation. Defaults to 4294967295
.
verbose
RuntimeParameter[bool]
whether to print verbose output. Defaults to False
.
structured_output
Optional[RuntimeParameter[OutlinesStructuredOutputType]]
a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
extra_kwargs
Optional[RuntimeParameter[Dict[str, Any]]]
additional dictionary of keyword arguments that will be passed to the Llama
class of llama_cpp
library. Defaults to {}
.
tokenizer_id
Optional[RuntimeParameter[str]]
the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer config files. If not provided, the one associated to the model
will be used. Defaults to None
.
use_magpie_template
bool
a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template
Union[MagpieAvailablePreQueryTemplates, str, None]
the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow-up user message. Valid values are \"llama3\", \"qwen2\", or a custom pre-query template provided as a string. Defaults to None
.
_model
Optional[Llama]
the Llama model instance. This attribute is meant to be used internally and should not be accessed directly. It will be set in the load
method.
model_path
: the path to the GGUF quantized model.n_gpu_layers
: the number of layers to use for the GPU. Defaults to -1
.chat_format
: the chat format to use for the model. Defaults to None
.verbose
: whether to print verbose output. Defaults to False
.extra_kwargs
: additional dictionary of keyword arguments that will be passed to the Llama
class of llama_cpp
library. Defaults to {}
.llama.cpp
llama-cpp-python
Examples:
Generate text:
from pathlib import Path\nfrom distilabel.models.llms import LlamaCppLLM\n\n# You can follow along with this example by downloading the model with the following\n# command in the terminal, which will save it to the `Downloads` folder:\n# curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf\n\nmodel_path = \"Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf\"\n\nllm = LlamaCppLLM(\n    model_path=str(Path.home() / model_path),\n    n_gpu_layers=-1,  # To use the GPU if available\n    n_ctx=1024,  # Set the context size\n)\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pathlib import Path\nfrom pydantic import BaseModel\n\nfrom distilabel.models.llms import LlamaCppLLM\n\nmodel_path = \"Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf\"\n\nclass User(BaseModel):\n    name: str\n    last_name: str\n    id: int\n\nllm = LlamaCppLLM(\n    model_path=str(Path.home() / model_path),  # type: ignore\n    n_gpu_layers=-1,\n    n_ctx=1024,\n    structured_output={\"format\": \"json\", \"schema\": User},\n)\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
Source code in src/distilabel/models/llms/llamacpp.py
class LlamaCppLLM(LLM, MagpieChatTemplateMixin):\n \"\"\"llama.cpp LLM implementation running the Python bindings for the C++ code.\n\n Attributes:\n model_path: contains the path to the GGUF quantized model, compatible with the\n installed version of the `llama.cpp` Python bindings.\n n_gpu_layers: the number of layers to use for the GPU. Defaults to `-1`, meaning that\n the available GPU device will be used.\n chat_format: the chat format to use for the model. Defaults to `None`, which means the\n Llama format will be used.\n n_ctx: the context size to use for the model. Defaults to `512`.\n n_batch: the prompt processing maximum batch size to use for the model. Defaults to `512`.\n seed: random seed to use for the generation. Defaults to `4294967295`.\n verbose: whether to print verbose output. Defaults to `False`.\n structured_output: a dictionary containing the structured output configuration or if more\n fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.\n extra_kwargs: additional dictionary of keyword arguments that will be passed to the\n `Llama` class of `llama_cpp` library. Defaults to `{}`.\n tokenizer_id: the tokenizer Hugging Face Hub repo id or a path to a directory containing\n the tokenizer config files. If not provided, the one associated to the `model`\n will be used. Defaults to `None`.\n use_magpie_template: a flag used to enable/disable applying the Magpie pre-query\n template. Defaults to `False`.\n magpie_pre_query_template: the pre-query template to be applied to the prompt or\n sent to the LLM to generate an instruction or a follow up user message. Valid\n values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults\n to `None`.\n _model: the Llama model instance. This attribute is meant to be used internally and\n should not be accessed directly. It will be set in the `load` method.\n\n Runtime parameters:\n - `model_path`: the path to the GGUF quantized model.\n - `n_gpu_layers`: the number of layers to use for the GPU. Defaults to `-1`.\n - `chat_format`: the chat format to use for the model. Defaults to `None`.\n - `verbose`: whether to print verbose output. Defaults to `False`.\n - `extra_kwargs`: additional dictionary of keyword arguments that will be passed to the\n `Llama` class of `llama_cpp` library. 
Defaults to `{}`.\n\n References:\n - [`llama.cpp`](https://github.com/ggerganov/llama.cpp)\n - [`llama-cpp-python`](https://github.com/abetlen/llama-cpp-python)\n\n Examples:\n Generate text:\n\n ```python\n from pathlib import Path\n from distilabel.models.llms import LlamaCppLLM\n\n # You can follow along this example downloading the following model running the following\n # command in the terminal, that will download the model to the `Downloads` folder:\n # curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf\n\n model_path = \"Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf\"\n\n llm = LlamaCppLLM(\n model_path=str(Path.home() / model_path),\n n_gpu_layers=-1, # To use the GPU if available\n n_ctx=1024, # Set the context size\n )\n\n llm.load()\n\n # Call the model\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Generate structured data:\n\n ```python\n from pathlib import Path\n from distilabel.models.llms import LlamaCppLLM\n\n model_path = \"Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf\"\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = LlamaCppLLM(\n model_path=str(Path.home() / model_path), # type: ignore\n n_gpu_layers=-1,\n n_ctx=1024,\n structured_output={\"format\": \"json\", \"schema\": Character},\n )\n\n llm.load()\n\n # Call the model\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n \"\"\"\n\n model_path: RuntimeParameter[FilePath] = Field(\n default=None, description=\"The path to the GGUF quantized model.\", exclude=True\n )\n n_gpu_layers: RuntimeParameter[int] = Field(\n default=-1,\n description=\"The number of layers that will be loaded in the GPU.\",\n )\n chat_format: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The chat format to use for the model. Defaults to `None`, which means the Llama format will be used.\",\n )\n\n n_ctx: int = 512\n n_batch: int = 512\n seed: int = 4294967295\n verbose: RuntimeParameter[bool] = Field(\n default=False,\n description=\"Whether to print verbose output from llama.cpp library.\",\n )\n extra_kwargs: Optional[RuntimeParameter[Dict[str, Any]]] = Field(\n default_factory=dict,\n description=\"Additional dictionary of keyword arguments that will be passed to the\"\n \" `Llama` class of `llama_cpp` library. See all the supported arguments at: \"\n \"https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__\",\n )\n structured_output: Optional[RuntimeParameter[OutlinesStructuredOutputType]] = Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n tokenizer_id: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The Hugging Face Hub repo id or a path to a directory containing\"\n \" the tokenizer config files. 
If not provided, the one associated to the `model`\"\n \" will be used.\",\n )\n _logits_processor: Optional[\"LogitsProcessorList\"] = PrivateAttr(default=None)\n _model: Optional[\"Llama\"] = PrivateAttr(...)\n\n @model_validator(mode=\"after\")\n def validate_magpie_usage(\n self,\n ) -> \"LlamaCppLLM\":\n \"\"\"Validates that magpie usage is valid.\"\"\"\n\n if self.use_magpie_template and self.tokenizer_id is None:\n raise ValueError(\n \"`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,\"\n \" set a `tokenizer_id` and try again.\"\n )\n\n def load(self) -> None:\n \"\"\"Loads the `Llama` model from the `model_path`.\"\"\"\n try:\n from llama_cpp import Llama\n except ImportError as ie:\n raise ImportError(\n \"The `llama_cpp` package is required to use the `LlamaCppLLM` class.\"\n ) from ie\n\n self._model = Llama(\n model_path=self.model_path.as_posix(),\n seed=self.seed,\n n_ctx=self.n_ctx,\n n_batch=self.n_batch,\n chat_format=self.chat_format,\n n_gpu_layers=self.n_gpu_layers,\n verbose=self.verbose,\n **self.extra_kwargs,\n )\n\n if self.structured_output:\n self._logits_processor = self._prepare_structured_output(\n self.structured_output\n )\n\n if self.use_magpie_template or self.magpie_pre_query_template:\n if not self.tokenizer_id:\n raise ValueError(\n \"The Hugging Face Hub repo id or a path to a directory containing\"\n \" the tokenizer config files is required when using the `use_magpie_template`\"\n \" or `magpie_pre_query_template` runtime parameters.\"\n )\n\n if self.tokenizer_id:\n try:\n from transformers import AutoTokenizer\n except ImportError as ie:\n raise ImportError(\n \"Transformers is not installed. Please install it using `pip install 'distilabel[hf-transformers]'`.\"\n ) from ie\n self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_id)\n if self._tokenizer.chat_template is None:\n raise ValueError(\n \"The tokenizer does not have a chat template. 
Please use a tokenizer with a chat template.\"\n )\n\n # NOTE: Here because of the custom `logging` interface used, since it will create the logging name\n # out of the model name, which won't be available until the `Llama` instance is created.\n super().load()\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self._model.model_path # type: ignore\n\n def _generate_chat_completion(\n self,\n input: FormattedInput,\n max_new_tokens: int = 128,\n frequency_penalty: float = 0.0,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n extra_generation_kwargs: Optional[Dict[str, Any]] = None,\n ) -> \"CreateChatCompletionResponse\":\n return self._model.create_chat_completion( # type: ignore\n messages=input, # type: ignore\n max_tokens=max_new_tokens,\n frequency_penalty=frequency_penalty,\n presence_penalty=presence_penalty,\n temperature=temperature,\n top_p=top_p,\n logits_processor=self._logits_processor,\n **(extra_generation_kwargs or {}),\n )\n\n def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n prompt: str = (\n self._tokenizer.apply_chat_template( # type: ignore\n conversation=input, # type: ignore\n tokenize=False,\n add_generation_prompt=True,\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n\n def _generate_with_text_generation(\n self,\n input: FormattedInput,\n max_new_tokens: int = 128,\n frequency_penalty: float = 0.0,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n extra_generation_kwargs: Optional[Dict[str, Any]] = None,\n ) -> \"CreateChatCompletionResponse\":\n prompt = self.prepare_input(input)\n return self._model.create_completion(\n prompt=prompt,\n max_tokens=max_new_tokens,\n frequency_penalty=frequency_penalty,\n presence_penalty=presence_penalty,\n temperature=temperature,\n top_p=top_p,\n logits_processor=self._logits_processor,\n **(extra_generation_kwargs or {}),\n )\n\n @validate_call\n def generate( # type: ignore\n self,\n inputs: List[FormattedInput],\n num_generations: int = 1,\n max_new_tokens: int = 128,\n frequency_penalty: float = 0.0,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n extra_generation_kwargs: Optional[Dict[str, Any]] = None,\n ) -> List[GenerateOutput]:\n \"\"\"Generates `num_generations` responses for the given input using the Llama model.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n extra_generation_kwargs: dictionary with additional arguments to be passed to\n the `create_chat_completion` method. 
Reference at\n https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n structured_output = None\n batch_outputs = []\n for input in inputs:\n if isinstance(input, tuple):\n input, structured_output = input\n elif self.structured_output:\n structured_output = self.structured_output\n\n outputs = []\n output_tokens = []\n for _ in range(num_generations):\n # NOTE(plaguss): There seems to be a bug in how the logits processor\n # is used. Basically it consumes the FSM internally, and it isn't reinitialized\n # after each generation, so subsequent calls yield nothing. This is a workaround\n # until is fixed in the `llama_cpp` or `outlines` libraries.\n if structured_output:\n self._logits_processor = self._prepare_structured_output(\n structured_output\n )\n if self.tokenizer_id is None:\n completion = self._generate_chat_completion(\n input,\n max_new_tokens,\n frequency_penalty,\n presence_penalty,\n temperature,\n top_p,\n extra_generation_kwargs,\n )\n outputs.append(completion[\"choices\"][0][\"message\"][\"content\"])\n output_tokens.append(completion[\"usage\"][\"completion_tokens\"])\n else:\n completion: \"CreateChatCompletionResponse\" = (\n self._generate_with_text_generation( # type: ignore\n input,\n max_new_tokens,\n frequency_penalty,\n presence_penalty,\n temperature,\n top_p,\n extra_generation_kwargs,\n )\n )\n outputs.append(completion[\"choices\"][0][\"text\"])\n output_tokens.append(completion[\"usage\"][\"completion_tokens\"])\n batch_outputs.append(\n prepare_output(\n outputs,\n input_tokens=[completion[\"usage\"][\"prompt_tokens\"]]\n * num_generations,\n output_tokens=output_tokens,\n )\n )\n\n return batch_outputs\n\n def _prepare_structured_output(\n self, structured_output: Optional[OutlinesStructuredOutputType] = None\n ) -> Union[\"LogitsProcessorList\", None]:\n \"\"\"Creates the appropriate function to filter tokens to generate structured outputs.\n\n Args:\n structured_output: the configuration dict to prepare the structured output.\n\n Returns:\n The callable that will be used to guide the generation of the model.\n \"\"\"\n from distilabel.steps.tasks.structured_outputs.outlines import (\n prepare_guided_output,\n )\n\n result = prepare_guided_output(structured_output, \"llamacpp\", self._model)\n if (schema := result.get(\"schema\")) and self.structured_output:\n self.structured_output[\"schema\"] = schema\n return result[\"processor\"]\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LlamaCppLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LlamaCppLLM.validate_magpie_usage","title":"validate_magpie_usage()
","text":"Validates that magpie usage is valid.
Source code insrc/distilabel/models/llms/llamacpp.py
@model_validator(mode=\"after\")\ndef validate_magpie_usage(\n self,\n) -> \"LlamaCppLLM\":\n \"\"\"Validates that magpie usage is valid.\"\"\"\n\n if self.use_magpie_template and self.tokenizer_id is None:\n raise ValueError(\n \"`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,\"\n \" set a `tokenizer_id` and try again.\"\n )\n
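As a configuration sketch of what this check enforces (the GGUF path and tokenizer repo id below are illustrative), enabling the Magpie template requires a tokenizer_id to be set as well:
from distilabel.models.llms import LlamaCppLLM\n\nllm = LlamaCppLLM(\n    model_path=\"openhermes-2.5-mistral-7b.Q4_K_M.gguf\",  # illustrative local path\n    tokenizer_id=\"teknium/OpenHermes-2.5-Mistral-7B\",  # required when using the Magpie template\n    use_magpie_template=True,\n    magpie_pre_query_template=\"llama3\",\n)\n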
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LlamaCppLLM.load","title":"load()
","text":"Loads the Llama
model from the model_path
.
src/distilabel/models/llms/llamacpp.py
def load(self) -> None:\n \"\"\"Loads the `Llama` model from the `model_path`.\"\"\"\n try:\n from llama_cpp import Llama\n except ImportError as ie:\n raise ImportError(\n \"The `llama_cpp` package is required to use the `LlamaCppLLM` class.\"\n ) from ie\n\n self._model = Llama(\n model_path=self.model_path.as_posix(),\n seed=self.seed,\n n_ctx=self.n_ctx,\n n_batch=self.n_batch,\n chat_format=self.chat_format,\n n_gpu_layers=self.n_gpu_layers,\n verbose=self.verbose,\n **self.extra_kwargs,\n )\n\n if self.structured_output:\n self._logits_processor = self._prepare_structured_output(\n self.structured_output\n )\n\n if self.use_magpie_template or self.magpie_pre_query_template:\n if not self.tokenizer_id:\n raise ValueError(\n \"The Hugging Face Hub repo id or a path to a directory containing\"\n \" the tokenizer config files is required when using the `use_magpie_template`\"\n \" or `magpie_pre_query_template` runtime parameters.\"\n )\n\n if self.tokenizer_id:\n try:\n from transformers import AutoTokenizer\n except ImportError as ie:\n raise ImportError(\n \"Transformers is not installed. Please install it using `pip install 'distilabel[hf-transformers]'`.\"\n ) from ie\n self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_id)\n if self._tokenizer.chat_template is None:\n raise ValueError(\n \"The tokenizer does not have a chat template. Please use a tokenizer with a chat template.\"\n )\n\n # NOTE: Here because of the custom `logging` interface used, since it will create the logging name\n # out of the model name, which won't be available until the `Llama` instance is created.\n super().load()\n
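A small configuration sketch of how load wires things together (the path and extra kwargs are illustrative): any extra_kwargs are forwarded untouched to the llama_cpp.Llama constructor:
from distilabel.models.llms import LlamaCppLLM\n\nllm = LlamaCppLLM(\n    model_path=\"openhermes-2.5-mistral-7b.Q4_K_M.gguf\",  # illustrative local path\n    n_ctx=2048,\n    extra_kwargs={\"n_threads\": 8},  # forwarded as-is to `llama_cpp.Llama`\n)\nllm.load()  # instantiates `llama_cpp.Llama` with the arguments above\n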
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LlamaCppLLM.prepare_input","title":"prepare_input(input)
","text":"Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name Type Description Defaultinput
StandardInput
the input list containing chat items.
requiredReturns:
Type Descriptionstr
The prompt to send to the LLM.
Source code insrc/distilabel/models/llms/llamacpp.py
def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n prompt: str = (\n self._tokenizer.apply_chat_template( # type: ignore\n conversation=input, # type: ignore\n tokenize=False,\n add_generation_prompt=True,\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n
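A usage sketch (the GGUF path and tokenizer repo id are illustrative): prepare_input only applies a chat template when a tokenizer_id has been set:
from distilabel.models.llms import LlamaCppLLM\n\nllm = LlamaCppLLM(\n    model_path=\"openhermes-2.5-mistral-7b.Q4_K_M.gguf\",  # illustrative local path\n    tokenizer_id=\"teknium/OpenHermes-2.5-Mistral-7B\",  # illustrative tokenizer repo id\n)\nllm.load()\n\nprompt = llm.prepare_input([{\"role\": \"user\", \"content\": \"Hello world!\"}])\nprint(prompt)  # the chat-template formatted string sent to llama.cpp\n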
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LlamaCppLLM.generate","title":"generate(inputs, num_generations=1, max_new_tokens=128, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, extra_generation_kwargs=None)
","text":"Generates num_generations
responses for the given input using the Llama model.
Parameters:
Name Type Description Defaultinputs
List[FormattedInput]
a list of inputs in chat format to generate responses for.
requirednum_generations
int
the number of generations to create per input. Defaults to 1
.
1
max_new_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
frequency_penalty
float
the repetition penalty to use for the generation. Defaults to 0.0
.
0.0
presence_penalty
float
the presence penalty to use for the generation. Defaults to 0.0
.
0.0
temperature
float
the temperature to use for the generation. Defaults to 1.0
.
1.0
top_p
float
the top-p value to use for the generation. Defaults to 1.0
.
1.0
extra_generation_kwargs
Optional[Dict[str, Any]]
dictionary with additional arguments to be passed to the create_chat_completion
method. Reference at https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion
None
Returns:
Type DescriptionList[GenerateOutput]
A list of lists of strings containing the generated responses for each input.
Source code insrc/distilabel/models/llms/llamacpp.py
@validate_call\ndef generate( # type: ignore\n self,\n inputs: List[FormattedInput],\n num_generations: int = 1,\n max_new_tokens: int = 128,\n frequency_penalty: float = 0.0,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n extra_generation_kwargs: Optional[Dict[str, Any]] = None,\n) -> List[GenerateOutput]:\n \"\"\"Generates `num_generations` responses for the given input using the Llama model.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n extra_generation_kwargs: dictionary with additional arguments to be passed to\n the `create_chat_completion` method. Reference at\n https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n structured_output = None\n batch_outputs = []\n for input in inputs:\n if isinstance(input, tuple):\n input, structured_output = input\n elif self.structured_output:\n structured_output = self.structured_output\n\n outputs = []\n output_tokens = []\n for _ in range(num_generations):\n # NOTE(plaguss): There seems to be a bug in how the logits processor\n # is used. Basically it consumes the FSM internally, and it isn't reinitialized\n # after each generation, so subsequent calls yield nothing. This is a workaround\n # until is fixed in the `llama_cpp` or `outlines` libraries.\n if structured_output:\n self._logits_processor = self._prepare_structured_output(\n structured_output\n )\n if self.tokenizer_id is None:\n completion = self._generate_chat_completion(\n input,\n max_new_tokens,\n frequency_penalty,\n presence_penalty,\n temperature,\n top_p,\n extra_generation_kwargs,\n )\n outputs.append(completion[\"choices\"][0][\"message\"][\"content\"])\n output_tokens.append(completion[\"usage\"][\"completion_tokens\"])\n else:\n completion: \"CreateChatCompletionResponse\" = (\n self._generate_with_text_generation( # type: ignore\n input,\n max_new_tokens,\n frequency_penalty,\n presence_penalty,\n temperature,\n top_p,\n extra_generation_kwargs,\n )\n )\n outputs.append(completion[\"choices\"][0][\"text\"])\n output_tokens.append(completion[\"usage\"][\"completion_tokens\"])\n batch_outputs.append(\n prepare_output(\n outputs,\n input_tokens=[completion[\"usage\"][\"prompt_tokens\"]]\n * num_generations,\n output_tokens=output_tokens,\n )\n )\n\n return batch_outputs\n
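For reference, a short sketch of calling generate directly with sampling parameters (the GGUF path and values are illustrative):
from distilabel.models.llms import LlamaCppLLM\n\nllm = LlamaCppLLM(\n    model_path=\"openhermes-2.5-mistral-7b.Q4_K_M.gguf\",  # illustrative local path\n    n_ctx=1024,\n)\nllm.load()\n\noutputs = llm.generate(\n    inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]],\n    num_generations=2,\n    max_new_tokens=64,\n    temperature=0.7,\n)\n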
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LlamaCppLLM._prepare_structured_output","title":"_prepare_structured_output(structured_output=None)
","text":"Creates the appropriate function to filter tokens to generate structured outputs.
Parameters:
Name Type Description Defaultstructured_output
Optional[OutlinesStructuredOutputType]
the configuration dict to prepare the structured output.
None
Returns:
Type DescriptionUnion[LogitsProcessorList, None]
The callable that will be used to guide the generation of the model.
Source code insrc/distilabel/models/llms/llamacpp.py
def _prepare_structured_output(\n self, structured_output: Optional[OutlinesStructuredOutputType] = None\n) -> Union[\"LogitsProcessorList\", None]:\n \"\"\"Creates the appropriate function to filter tokens to generate structured outputs.\n\n Args:\n structured_output: the configuration dict to prepare the structured output.\n\n Returns:\n The callable that will be used to guide the generation of the model.\n \"\"\"\n from distilabel.steps.tasks.structured_outputs.outlines import (\n prepare_guided_output,\n )\n\n result = prepare_guided_output(structured_output, \"llamacpp\", self._model)\n if (schema := result.get(\"schema\")) and self.structured_output:\n self.structured_output[\"schema\"] = schema\n return result[\"processor\"]\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MistralLLM","title":"MistralLLM
","text":" Bases: AsyncLLM
Mistral LLM implementation running the async API client.
Attributes:
Name Type Descriptionmodel
str
the model name to use for the LLM e.g. \"mistral-tiny\", \"mistral-large\", etc.
endpoint
str
the endpoint to use for the Mistral API. Defaults to \"https://api.mistral.ai\".
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the Mistral API. Defaults to None
which means that the value set for the environment variable MISTRAL_API_KEY
will be used, or None
if not set.
max_retries
RuntimeParameter[int]
the maximum number of retries to attempt when a request fails. Defaults to 6
.
timeout
RuntimeParameter[int]
the maximum time in seconds to wait for a response. Defaults to 120
.
max_concurrent_requests
RuntimeParameter[int]
the maximum number of concurrent requests to send. Defaults to 64
.
structured_output
Optional[RuntimeParameter[InstructorStructuredOutputType]]
a dictionary containing the structured output configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
_api_key_env_var
str
the name of the environment variable to use for the API key. It is meant to be used internally.
_aclient
Optional[Mistral]
the Mistral
client to use for the Mistral API. It is meant to be used internally. Set in the load
method.
api_key
: the API key to authenticate the requests to the Mistral API.max_retries
: the maximum number of retries to attempt when a request fails. Defaults to 6
.timeout
: the maximum time in seconds to wait for a response. Defaults to 120
.max_concurrent_requests
: the maximum number of concurrent requests to send. Defaults to 64
.Examples:
Generate text:
from distilabel.models.llms import MistralLLM\n\nllm = MistralLLM(model=\"open-mixtral-8x22b\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pydantic import BaseModel\nfrom distilabel.models.llms import MistralLLM\n\nclass User(BaseModel):\n    name: str\n    last_name: str\n    id: int\n\nllm = MistralLLM(\n    model=\"open-mixtral-8x22b\",\n    api_key=\"api.key\",\n    structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
Source code in src/distilabel/models/llms/mistral.py
class MistralLLM(AsyncLLM):\n \"\"\"Mistral LLM implementation running the async API client.\n\n Attributes:\n model: the model name to use for the LLM e.g. \"mistral-tiny\", \"mistral-large\", etc.\n endpoint: the endpoint to use for the Mistral API. Defaults to \"https://api.mistral.ai\".\n api_key: the API key to authenticate the requests to the Mistral API. Defaults to `None` which\n means that the value set for the environment variable `OPENAI_API_KEY` will be used, or\n `None` if not set.\n max_retries: the maximum number of retries to attempt when a request fails. Defaults to `5`.\n timeout: the maximum time in seconds to wait for a response. Defaults to `120`.\n max_concurrent_requests: the maximum number of concurrent requests to send. Defaults\n to `64`.\n structured_output: a dictionary containing the structured output configuration configuration\n using `instructor`. You can take a look at the dictionary structure in\n `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.\n _api_key_env_var: the name of the environment variable to use for the API key. It is meant to\n be used internally.\n _aclient: the `Mistral` to use for the Mistral API. It is meant to be used internally.\n Set in the `load` method.\n\n Runtime parameters:\n - `api_key`: the API key to authenticate the requests to the Mistral API.\n - `max_retries`: the maximum number of retries to attempt when a request fails.\n Defaults to `5`.\n - `timeout`: the maximum time in seconds to wait for a response. Defaults to `120`.\n - `max_concurrent_requests`: the maximum number of concurrent requests to send.\n Defaults to `64`.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import MistralLLM\n\n llm = MistralLLM(model=\"open-mixtral-8x22b\")\n\n llm.load()\n\n # Call the model\n output = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n\n Generate structured data:\n\n ```python\n from pydantic import BaseModel\n from distilabel.models.llms import MistralLLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = MistralLLM(\n model=\"open-mixtral-8x22b\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n \"\"\"\n\n model: str\n endpoint: str = \"https://api.mistral.ai\"\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_MISTRALAI_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the Mistral API.\",\n )\n max_retries: RuntimeParameter[int] = Field(\n default=6,\n description=\"The maximum number of times to retry the request to the API before\"\n \" failing.\",\n )\n timeout: RuntimeParameter[int] = Field(\n default=120,\n description=\"The maximum time in seconds to wait for a response from the API.\",\n )\n max_concurrent_requests: RuntimeParameter[int] = Field(\n default=64, description=\"The maximum number of concurrent requests to send.\"\n )\n structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (\n Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n )\n\n _num_generations_param_supported = False\n\n _api_key_env_var: str = PrivateAttr(_MISTRALAI_API_KEY_ENV_VAR_NAME)\n _aclient: Optional[\"Mistral\"] = PrivateAttr(...)\n\n def load(self) -> None:\n \"\"\"Loads the 
`Mistral` client to benefit from async requests.\"\"\"\n super().load()\n\n try:\n from mistralai import Mistral\n except ImportError as ie:\n raise ImportError(\n \"MistralAI Python client is not installed. Please install it using\"\n \" `pip install mistralai`.\"\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n self._aclient = Mistral(\n api_key=self.api_key.get_secret_value(),\n endpoint=self.endpoint,\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout, # type: ignore\n max_concurrent_requests=self.max_concurrent_requests, # type: ignore\n )\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"mistral\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n # TODO: add `num_generations` parameter once Mistral client allows `n` parameter\n @validate_call\n async def agenerate( # type: ignore\n self,\n input: FormattedInput,\n max_new_tokens: Optional[int] = None,\n temperature: Optional[float] = None,\n top_p: Optional[float] = None,\n ) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for the given input using the MistralAI async\n client.\n\n Args:\n input: a single input in chat format to generate responses for.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. 
Defaults to `1.0`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output,\n client=self._aclient,\n framework=\"mistral\",\n )\n self._aclient = result.get(\"client\")\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"messages\": input, # type: ignore\n \"model\": self.model,\n \"max_tokens\": max_new_tokens,\n \"temperature\": temperature,\n \"top_p\": top_p,\n }\n generations = []\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output)\n # TODO:\u00a0This should work just with the _aclient.chat method, but it's not working.\n # We need to check instructor and see if we can create a PR.\n completion = await self._aclient.chat.completions.create(**kwargs) # type: ignore\n else:\n # completion = await self._aclient.chat(**kwargs) # type: ignore\n completion = await self._aclient.chat.complete_async(**kwargs) # type: ignore\n\n if structured_output:\n return prepare_output(\n [completion.model_dump_json()],\n **self._get_llm_statistics(completion._raw_response),\n )\n\n for choice in completion.choices:\n if (content := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"Received no response using MistralAI client (model: '{self.model}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(content)\n\n return prepare_output(generations, **self._get_llm_statistics(completion))\n\n @staticmethod\n def _get_llm_statistics(completion: \"ChatCompletionResponse\") -> \"LLMStatistics\":\n return {\n \"input_tokens\": [completion.usage.prompt_tokens],\n \"output_tokens\": [completion.usage.completion_tokens],\n }\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MistralLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MistralLLM.load","title":"load()
","text":"Loads the Mistral
client to benefit from async requests.
src/distilabel/models/llms/mistral.py
def load(self) -> None:\n \"\"\"Loads the `Mistral` client to benefit from async requests.\"\"\"\n super().load()\n\n try:\n from mistralai import Mistral\n except ImportError as ie:\n raise ImportError(\n \"MistralAI Python client is not installed. Please install it using\"\n \" `pip install mistralai`.\"\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n self._aclient = Mistral(\n api_key=self.api_key.get_secret_value(),\n endpoint=self.endpoint,\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout, # type: ignore\n max_concurrent_requests=self.max_concurrent_requests, # type: ignore\n )\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"mistral\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MistralLLM.agenerate","title":"agenerate(input, max_new_tokens=None, temperature=None, top_p=None)
async
","text":"Generates num_generations
responses for the given input using the MistralAI async client.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requiredmax_new_tokens
Optional[int]
the maximum number of new tokens that the model will generate. Defaults to 128
.
None
temperature
Optional[float]
the temperature to use for the generation. Defaults to 0.1
.
None
top_p
Optional[float]
the top-p value to use for the generation. Defaults to 1.0
.
None
Returns:
Type DescriptionGenerateOutput
A list of lists of strings containing the generated responses for each input.
Source code insrc/distilabel/models/llms/mistral.py
@validate_call\nasync def agenerate( # type: ignore\n self,\n input: FormattedInput,\n max_new_tokens: Optional[int] = None,\n temperature: Optional[float] = None,\n top_p: Optional[float] = None,\n) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for the given input using the MistralAI async\n client.\n\n Args:\n input: a single input in chat format to generate responses for.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output,\n client=self._aclient,\n framework=\"mistral\",\n )\n self._aclient = result.get(\"client\")\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"messages\": input, # type: ignore\n \"model\": self.model,\n \"max_tokens\": max_new_tokens,\n \"temperature\": temperature,\n \"top_p\": top_p,\n }\n generations = []\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output)\n # TODO:\u00a0This should work just with the _aclient.chat method, but it's not working.\n # We need to check instructor and see if we can create a PR.\n completion = await self._aclient.chat.completions.create(**kwargs) # type: ignore\n else:\n # completion = await self._aclient.chat(**kwargs) # type: ignore\n completion = await self._aclient.chat.complete_async(**kwargs) # type: ignore\n\n if structured_output:\n return prepare_output(\n [completion.model_dump_json()],\n **self._get_llm_statistics(completion._raw_response),\n )\n\n for choice in completion.choices:\n if (content := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"Received no response using MistralAI client (model: '{self.model}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(content)\n\n return prepare_output(generations, **self._get_llm_statistics(completion))\n
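For quick reference, a minimal sketch of awaiting this coroutine outside of a pipeline (the model name and API key are illustrative):
import asyncio\n\nfrom distilabel.models.llms import MistralLLM\n\nllm = MistralLLM(model=\"open-mixtral-8x22b\", api_key=\"api.key\")  # illustrative values\nllm.load()\n\noutput = asyncio.run(\n    llm.agenerate(\n        input=[{\"role\": \"user\", \"content\": \"Hello world!\"}],\n        max_new_tokens=64,\n    )\n)\n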
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MixtureOfAgentsLLM","title":"MixtureOfAgentsLLM
","text":" Bases: AsyncLLM
Mixture-of-Agents
implementation.
An LLM
class that leverages LLM
s collective strenghts to generate a response, as described in the \"Mixture-of-Agents Enhances Large Language model Capabilities\" paper. There is a list of LLM
s proposing/generating outputs that LLM
s from the next round/layer can use as auxiliary information. Finally, there is an LLM
that aggregates the outputs to generate the final response.
Attributes:
Name Type Descriptionaggregator_llm
LLM
The LLM
that aggregates the outputs of the proposer LLM
s.
proposers_llms
List[AsyncLLM]
The list of LLM
s that propose outputs to be aggregated.
rounds
int
The number of layers or rounds in which the proposers_llms
will generate outputs. Defaults to 1
.
Examples:
Generate text:
from distilabel.models.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM\n\nllm = MixtureOfAgentsLLM(\n aggregator_llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n proposers_llms=[\n InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n InferenceEndpointsLLM(\n model_id=\"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO\",\n tokenizer_id=\"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO\",\n ),\n InferenceEndpointsLLM(\n model_id=\"HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1\",\n tokenizer_id=\"HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1\",\n ),\n ],\n rounds=2,\n)\n\nllm.load()\n\noutput = llm.generate_outputs(\n inputs=[\n [\n {\n \"role\": \"user\",\n \"content\": \"My favorite witty review of The Rings of Power series is this: Input:\",\n }\n ]\n ]\n)\n
Source code in src/distilabel/models/llms/moa.py
class MixtureOfAgentsLLM(AsyncLLM):\n \"\"\"`Mixture-of-Agents` implementation.\n\n An `LLM` class that leverages `LLM`s collective strenghts to generate a response,\n as described in the \"Mixture-of-Agents Enhances Large Language model Capabilities\"\n paper. There is a list of `LLM`s proposing/generating outputs that `LLM`s from the next\n round/layer can use as auxiliary information. Finally, there is an `LLM` that aggregates\n the outputs to generate the final response.\n\n Attributes:\n aggregator_llm: The `LLM` that aggregates the outputs of the proposer `LLM`s.\n proposers_llms: The list of `LLM`s that propose outputs to be aggregated.\n rounds: The number of layers or rounds that the `proposers_llms` will generate\n outputs. Defaults to `1`.\n\n References:\n - [Mixture-of-Agents Enhances Large Language Model Capabilities](https://arxiv.org/abs/2406.04692)\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM\n\n llm = MixtureOfAgentsLLM(\n aggregator_llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n proposers_llms=[\n InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n InferenceEndpointsLLM(\n model_id=\"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO\",\n tokenizer_id=\"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO\",\n ),\n InferenceEndpointsLLM(\n model_id=\"HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1\",\n tokenizer_id=\"HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1\",\n ),\n ],\n rounds=2,\n )\n\n llm.load()\n\n output = llm.generate_outputs(\n inputs=[\n [\n {\n \"role\": \"user\",\n \"content\": \"My favorite witty review of The Rings of Power series is this: Input:\",\n }\n ]\n ]\n )\n ```\n \"\"\"\n\n aggregator_llm: LLM\n proposers_llms: List[AsyncLLM] = Field(default_factory=list)\n rounds: int = 1\n\n @property\n def runtime_parameters_names(self) -> \"RuntimeParametersNames\":\n \"\"\"Returns the runtime parameters of the `LLM`, which are a combination of the\n `RuntimeParameter`s of the `LLM`, the `aggregator_llm` and the `proposers_llms`.\n\n Returns:\n The runtime parameters of the `LLM`.\n \"\"\"\n runtime_parameters_names = super().runtime_parameters_names\n del runtime_parameters_names[\"generation_kwargs\"]\n return runtime_parameters_names\n\n def load(self) -> None:\n \"\"\"Loads all the `LLM`s in the `MixtureOfAgents`.\"\"\"\n super().load()\n\n for llm in self.proposers_llms:\n self._logger.debug(f\"Loading proposer LLM in MoA: {llm}\") # type: ignore\n llm.load()\n\n self._logger.debug(f\"Loading aggregator LLM in MoA: {self.aggregator_llm}\") # type: ignore\n self.aggregator_llm.load()\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the aggregated model name.\"\"\"\n return f\"moa-{self.aggregator_llm.model_name}-{'-'.join([llm.model_name for llm in self.proposers_llms])}\"\n\n def get_generation_kwargs(self) -> Dict[str, Any]:\n \"\"\"Returns the generation kwargs of the `MixtureOfAgents` as a dictionary.\n\n Returns:\n The generation kwargs of the `MixtureOfAgents`.\n \"\"\"\n return {\n \"aggregator_llm\": self.aggregator_llm.get_generation_kwargs(),\n \"proposers_llms\": [\n llm.get_generation_kwargs() for llm in self.proposers_llms\n ],\n }\n\n # `abstractmethod`, had to be implemented but not used\n async def agenerate(\n self, input: \"FormattedInput\", num_generations: int = 1, **kwargs: Any\n ) -> 
List[Union[str, None]]:\n raise NotImplementedError(\n \"`agenerate` method is not implemented for `MixtureOfAgents`\"\n )\n\n def _build_moa_system_prompt(self, prev_outputs: List[str]) -> str:\n \"\"\"Builds the Mixture-of-Agents system prompt.\n\n Args:\n prev_outputs: The list of previous outputs to use as references.\n\n Returns:\n The Mixture-of-Agents system prompt.\n \"\"\"\n moa_system_prompt = MOA_SYSTEM_PROMPT\n for i, prev_output in enumerate(prev_outputs):\n if prev_output is not None:\n moa_system_prompt += f\"\\n{i + 1}. {prev_output}\"\n return moa_system_prompt\n\n def _inject_moa_system_prompt(\n self, input: \"StandardInput\", prev_outputs: List[str]\n ) -> \"StandardInput\":\n \"\"\"Injects the Mixture-of-Agents system prompt into the input.\n\n Args:\n input: The input to inject the system prompt into.\n prev_outputs: The list of previous outputs to use as references.\n\n Returns:\n The input with the Mixture-of-Agents system prompt injected.\n \"\"\"\n if len(prev_outputs) == 0:\n return input\n\n moa_system_prompt = self._build_moa_system_prompt(prev_outputs)\n\n system = next((item for item in input if item[\"role\"] == \"system\"), None)\n if system:\n original_system_prompt = system[\"content\"]\n system[\"content\"] = f\"{moa_system_prompt}\\n\\n{original_system_prompt}\"\n else:\n input.insert(0, {\"role\": \"system\", \"content\": moa_system_prompt})\n\n return input\n\n async def _agenerate(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n ) -> List[\"GenerateOutput\"]:\n \"\"\"Internal function to concurrently generate responses for a list of inputs.\n\n Args:\n inputs: the list of inputs to generate responses for.\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n\n Returns:\n A list containing the generations for each input.\n \"\"\"\n aggregator_llm_kwargs: Dict[str, Any] = kwargs.get(\"aggregator_llm\", {})\n proposers_llms_kwargs: List[Dict[str, Any]] = kwargs.get(\n \"proposers_llms\", [{}] * len(self.proposers_llms)\n )\n\n prev_outputs = []\n for round in range(self.rounds):\n self._logger.debug(f\"Generating round {round + 1}/{self.rounds} in MoA\") # type: ignore\n\n # Generate `num_generations` with each proposer LLM for each input\n tasks = [\n asyncio.create_task(\n llm._agenerate(\n inputs=[\n self._inject_moa_system_prompt(\n cast(\"StandardInput\", input), prev_input_outputs\n )\n for input, prev_input_outputs in itertools.zip_longest(\n inputs, prev_outputs, fillvalue=[]\n )\n ],\n num_generations=1,\n **generation_kwargs,\n )\n )\n for llm, generation_kwargs in zip(\n self.proposers_llms, proposers_llms_kwargs\n )\n ]\n\n # Group generations per input\n outputs: List[List[\"GenerateOutput\"]] = await asyncio.gather(*tasks)\n prev_outputs = [\n list(itertools.chain(*input_outputs)) for input_outputs in zip(*outputs)\n ]\n\n self._logger.debug(\"Aggregating outputs in MoA\") # type: ignore\n if isinstance(self.aggregator_llm, AsyncLLM):\n return await self.aggregator_llm._agenerate(\n inputs=[\n self._inject_moa_system_prompt(\n cast(\"StandardInput\", input), prev_input_outputs\n )\n for input, prev_input_outputs in zip(inputs, prev_outputs)\n ],\n num_generations=num_generations,\n **aggregator_llm_kwargs,\n )\n\n return self.aggregator_llm.generate(\n inputs=[\n self._inject_moa_system_prompt(\n cast(\"StandardInput\", input), prev_input_outputs\n )\n for input, prev_input_outputs in zip(inputs, prev_outputs)\n ],\n 
num_generations=num_generations,\n **aggregator_llm_kwargs,\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MixtureOfAgentsLLM.runtime_parameters_names","title":"runtime_parameters_names
property
","text":"Returns the runtime parameters of the LLM
, which are a combination of the RuntimeParameter
s of the LLM
, the aggregator_llm
and the proposers_llms
.
Returns:
Type DescriptionRuntimeParametersNames
The runtime parameters of the LLM
.
model_name
property
","text":"Returns the aggregated model name.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MixtureOfAgentsLLM.load","title":"load()
","text":"Loads all the LLM
s in the MixtureOfAgents
.
src/distilabel/models/llms/moa.py
def load(self) -> None:\n \"\"\"Loads all the `LLM`s in the `MixtureOfAgents`.\"\"\"\n super().load()\n\n for llm in self.proposers_llms:\n self._logger.debug(f\"Loading proposer LLM in MoA: {llm}\") # type: ignore\n llm.load()\n\n self._logger.debug(f\"Loading aggregator LLM in MoA: {self.aggregator_llm}\") # type: ignore\n self.aggregator_llm.load()\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MixtureOfAgentsLLM.get_generation_kwargs","title":"get_generation_kwargs()
","text":"Returns the generation kwargs of the MixtureOfAgents
as a dictionary.
Returns:
Type DescriptionDict[str, Any]
The generation kwargs of the MixtureOfAgents
.
src/distilabel/models/llms/moa.py
def get_generation_kwargs(self) -> Dict[str, Any]:\n \"\"\"Returns the generation kwargs of the `MixtureOfAgents` as a dictionary.\n\n Returns:\n The generation kwargs of the `MixtureOfAgents`.\n \"\"\"\n return {\n \"aggregator_llm\": self.aggregator_llm.get_generation_kwargs(),\n \"proposers_llms\": [\n llm.get_generation_kwargs() for llm in self.proposers_llms\n ],\n }\n
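A sketch of the nested structure returned here, which mirrors the generation kwargs that _agenerate consumes per LLM (the values are illustrative):
generation_kwargs = {\n    \"aggregator_llm\": {\"max_new_tokens\": 512, \"temperature\": 0.7},\n    \"proposers_llms\": [\n        {\"max_new_tokens\": 256},\n        {\"max_new_tokens\": 256},\n        {\"max_new_tokens\": 256},\n    ],\n}\n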
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MixtureOfAgentsLLM._build_moa_system_prompt","title":"_build_moa_system_prompt(prev_outputs)
","text":"Builds the Mixture-of-Agents system prompt.
Parameters:
Name Type Description Defaultprev_outputs
List[str]
The list of previous outputs to use as references.
requiredReturns:
Type Descriptionstr
The Mixture-of-Agents system prompt.
Source code insrc/distilabel/models/llms/moa.py
def _build_moa_system_prompt(self, prev_outputs: List[str]) -> str:\n \"\"\"Builds the Mixture-of-Agents system prompt.\n\n Args:\n prev_outputs: The list of previous outputs to use as references.\n\n Returns:\n The Mixture-of-Agents system prompt.\n \"\"\"\n moa_system_prompt = MOA_SYSTEM_PROMPT\n for i, prev_output in enumerate(prev_outputs):\n if prev_output is not None:\n moa_system_prompt += f\"\\n{i + 1}. {prev_output}\"\n return moa_system_prompt\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MixtureOfAgentsLLM._inject_moa_system_prompt","title":"_inject_moa_system_prompt(input, prev_outputs)
","text":"Injects the Mixture-of-Agents system prompt into the input.
Parameters:
Name Type Description Defaultinput
StandardInput
The input to inject the system prompt into.
requiredprev_outputs
List[str]
The list of previous outputs to use as references.
requiredReturns:
Type DescriptionStandardInput
The input with the Mixture-of-Agents system prompt injected.
Source code insrc/distilabel/models/llms/moa.py
def _inject_moa_system_prompt(\n self, input: \"StandardInput\", prev_outputs: List[str]\n) -> \"StandardInput\":\n \"\"\"Injects the Mixture-of-Agents system prompt into the input.\n\n Args:\n input: The input to inject the system prompt into.\n prev_outputs: The list of previous outputs to use as references.\n\n Returns:\n The input with the Mixture-of-Agents system prompt injected.\n \"\"\"\n if len(prev_outputs) == 0:\n return input\n\n moa_system_prompt = self._build_moa_system_prompt(prev_outputs)\n\n system = next((item for item in input if item[\"role\"] == \"system\"), None)\n if system:\n original_system_prompt = system[\"content\"]\n system[\"content\"] = f\"{moa_system_prompt}\\n\\n{original_system_prompt}\"\n else:\n input.insert(0, {\"role\": \"system\", \"content\": moa_system_prompt})\n\n return input\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MixtureOfAgentsLLM._agenerate","title":"_agenerate(inputs, num_generations=1, **kwargs)
async
","text":"Internal function to concurrently generate responses for a list of inputs.
Parameters:
Name Type Description Defaultinputs
List[FormattedInput]
the list of inputs to generate responses for.
requirednum_generations
int
the number of generations to generate per input.
1
**kwargs
Any
the additional kwargs to be used for the generation.
{}
Returns:
Type DescriptionList[GenerateOutput]
A list containing the generations for each input.
Source code insrc/distilabel/models/llms/moa.py
async def _agenerate(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n) -> List[\"GenerateOutput\"]:\n \"\"\"Internal function to concurrently generate responses for a list of inputs.\n\n Args:\n inputs: the list of inputs to generate responses for.\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n\n Returns:\n A list containing the generations for each input.\n \"\"\"\n aggregator_llm_kwargs: Dict[str, Any] = kwargs.get(\"aggregator_llm\", {})\n proposers_llms_kwargs: List[Dict[str, Any]] = kwargs.get(\n \"proposers_llms\", [{}] * len(self.proposers_llms)\n )\n\n prev_outputs = []\n for round in range(self.rounds):\n self._logger.debug(f\"Generating round {round + 1}/{self.rounds} in MoA\") # type: ignore\n\n # Generate `num_generations` with each proposer LLM for each input\n tasks = [\n asyncio.create_task(\n llm._agenerate(\n inputs=[\n self._inject_moa_system_prompt(\n cast(\"StandardInput\", input), prev_input_outputs\n )\n for input, prev_input_outputs in itertools.zip_longest(\n inputs, prev_outputs, fillvalue=[]\n )\n ],\n num_generations=1,\n **generation_kwargs,\n )\n )\n for llm, generation_kwargs in zip(\n self.proposers_llms, proposers_llms_kwargs\n )\n ]\n\n # Group generations per input\n outputs: List[List[\"GenerateOutput\"]] = await asyncio.gather(*tasks)\n prev_outputs = [\n list(itertools.chain(*input_outputs)) for input_outputs in zip(*outputs)\n ]\n\n self._logger.debug(\"Aggregating outputs in MoA\") # type: ignore\n if isinstance(self.aggregator_llm, AsyncLLM):\n return await self.aggregator_llm._agenerate(\n inputs=[\n self._inject_moa_system_prompt(\n cast(\"StandardInput\", input), prev_input_outputs\n )\n for input, prev_input_outputs in zip(inputs, prev_outputs)\n ],\n num_generations=num_generations,\n **aggregator_llm_kwargs,\n )\n\n return self.aggregator_llm.generate(\n inputs=[\n self._inject_moa_system_prompt(\n cast(\"StandardInput\", input), prev_input_outputs\n )\n for input, prev_input_outputs in zip(inputs, prev_outputs)\n ],\n num_generations=num_generations,\n **aggregator_llm_kwargs,\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OllamaLLM","title":"OllamaLLM
","text":" Bases: AsyncLLM
, MagpieChatTemplateMixin
Ollama LLM implementation running the Async API client.
Attributes:
Name Type Descriptionmodel
str
the model name to use for the LLM e.g. \"notus\".
host
Optional[RuntimeParameter[str]]
the Ollama server host.
timeout
RuntimeParameter[int]
the timeout for the LLM. Defaults to 120
.
follow_redirects
bool
whether to follow redirects. Defaults to True
.
structured_output
Optional[RuntimeParameter[InstructorStructuredOutputType]]
a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
tokenizer_id
Optional[RuntimeParameter[str]]
the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer config files. If not provided, the one associated with the model
will be used. Defaults to None
.
use_magpie_template
bool
a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template
Union[MagpieAvailablePreQueryTemplates, str, None]
the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
_aclient
Optional[AsyncClient]
the AsyncClient
to use for the Ollama API. It is meant to be used internally. Set in the load
method.
host
: the Ollama server host.timeout
: the client timeout for the Ollama API. Defaults to 120
.Examples:
Generate text:
from distilabel.models.llms import OllamaLLM\n\nllm = OllamaLLM(model=\"llama3\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
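The `host` and `timeout` runtime parameters can also be set when instantiating the class; a minimal sketch, assuming an Ollama server reachable at the (placeholder) address below:
```python
from distilabel.models.llms import OllamaLLM

# point the async client at a non-default Ollama server and raise the request timeout
llm = OllamaLLM(model="llama3", host="http://localhost:11434", timeout=300)

llm.load()

output = llm.generate(inputs=[[{"role": "user", "content": "Hello world!"}]])
```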
Source code in src/distilabel/models/llms/ollama.py
class OllamaLLM(AsyncLLM, MagpieChatTemplateMixin):\n \"\"\"Ollama LLM implementation running the Async API client.\n\n Attributes:\n model: the model name to use for the LLM e.g. \"notus\".\n host: the Ollama server host.\n timeout: the timeout for the LLM. Defaults to `120`.\n follow_redirects: whether to follow redirects. Defaults to `True`.\n structured_output: a dictionary containing the structured output configuration or if more\n fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.\n tokenizer_id: the tokenizer Hugging Face Hub repo id or a path to a directory containing\n the tokenizer config files. If not provided, the one associated to the `model`\n will be used. Defaults to `None`.\n use_magpie_template: a flag used to enable/disable applying the Magpie pre-query\n template. Defaults to `False`.\n magpie_pre_query_template: the pre-query template to be applied to the prompt or\n sent to the LLM to generate an instruction or a follow up user message. Valid\n values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults\n to `None`.\n _aclient: the `AsyncClient` to use for the Ollama API. It is meant to be used internally.\n Set in the `load` method.\n\n Runtime parameters:\n - `host`: the Ollama server host.\n - `timeout`: the client timeout for the Ollama API. Defaults to `120`.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import OllamaLLM\n\n llm = OllamaLLM(model=\"llama3\")\n\n llm.load()\n\n # Call the model\n output = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n \"\"\"\n\n model: str\n host: Optional[RuntimeParameter[str]] = Field(\n default=None, description=\"The host of the Ollama API.\"\n )\n timeout: RuntimeParameter[int] = Field(\n default=120, description=\"The timeout for the Ollama API.\"\n )\n follow_redirects: bool = True\n structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (\n Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n )\n tokenizer_id: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The Hugging Face Hub repo id or a path to a directory containing\"\n \" the tokenizer config files. If not provided, the one associated to the `model`\"\n \" will be used.\",\n )\n _num_generations_param_supported = False\n _aclient: Optional[\"AsyncClient\"] = PrivateAttr(...) # type: ignore\n\n @model_validator(mode=\"after\") # type: ignore\n def validate_magpie_usage(\n self,\n ) -> \"OllamaLLM\":\n \"\"\"Validates that magpie usage is valid.\"\"\"\n\n if self.use_magpie_template and self.tokenizer_id is None:\n raise ValueError(\n \"`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,\"\n \" set a `tokenizer_id` and try again.\"\n )\n\n def load(self) -> None:\n \"\"\"Loads the `AsyncClient` to use Ollama async API.\"\"\"\n super().load()\n\n try:\n from ollama import AsyncClient\n\n self._aclient = AsyncClient(\n host=self.host,\n timeout=self.timeout,\n follow_redirects=self.follow_redirects,\n )\n except ImportError as e:\n raise ImportError(\n \"Ollama Python client is not installed. Please install it using\"\n \" `pip install ollama`.\"\n ) from e\n\n if self.tokenizer_id:\n try:\n from transformers import AutoTokenizer\n except ImportError as ie:\n raise ImportError(\n \"Transformers is not installed. 
Please install it using `pip install 'distilabel[hf-transformers]'`.\"\n ) from ie\n self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_id)\n if self._tokenizer.chat_template is None:\n raise ValueError(\n \"The tokenizer does not have a chat template. Please use a tokenizer with a chat template.\"\n )\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n async def _generate_chat_completion(\n self,\n input: \"StandardInput\",\n format: Literal[\"\", \"json\"] = \"\",\n options: Union[Options, None] = None,\n keep_alive: Union[bool, None] = None,\n ) -> \"ChatResponse\":\n return await self._aclient.chat(\n model=self.model,\n messages=input,\n stream=False,\n format=format,\n options=options,\n keep_alive=keep_alive,\n )\n\n def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n prompt: str = (\n self._tokenizer.apply_chat_template(\n conversation=input,\n tokenize=False,\n add_generation_prompt=True,\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n\n async def _generate_with_text_generation(\n self,\n input: \"StandardInput\",\n format: Literal[\"\", \"json\"] = None,\n options: Union[Options, None] = None,\n keep_alive: Union[bool, None] = None,\n ) -> \"GenerateResponse\":\n input = self.prepare_input(input)\n return await self._aclient.generate(\n model=self.model,\n prompt=input,\n format=format,\n options=options,\n keep_alive=keep_alive,\n raw=True,\n )\n\n @validate_call\n async def agenerate(\n self,\n input: StandardInput,\n format: Literal[\"\", \"json\"] = \"\",\n # TODO: include relevant options from `Options` in `agenerate` method.\n options: Union[Options, None] = None,\n keep_alive: Union[bool, None] = None,\n ) -> GenerateOutput:\n \"\"\"\n Generates a response asynchronously, using the [Ollama Async API definition](https://github.com/ollama/ollama-python).\n\n Args:\n input: the input to use for the generation.\n format: the format to use for the generation. Defaults to `\"\"`.\n options: the options to use for the generation. Defaults to `None`.\n keep_alive: whether to keep the connection alive. Defaults to `None`.\n\n Returns:\n A list of strings as completion for the given input.\n \"\"\"\n text = None\n try:\n if not format:\n format = None\n if self.tokenizer_id is None:\n completion = await self._generate_chat_completion(\n input, format, options, keep_alive\n )\n text = completion[\"message\"][\"content\"]\n else:\n completion = await self._generate_with_text_generation(\n input, format, options, keep_alive\n )\n text = completion.response\n except Exception as e:\n self._logger.warning( # type: ignore\n f\"\u26a0\ufe0f Received no response using Ollama client (model: '{self.model_name}').\"\n f\" Finish reason was: {e}\"\n )\n\n return prepare_output([text], **self._get_llm_statistics(completion))\n\n @staticmethod\n def _get_llm_statistics(completion: Dict[str, Any]) -> \"LLMStatistics\":\n return {\n \"input_tokens\": [completion[\"prompt_eval_count\"]],\n \"output_tokens\": [completion[\"eval_count\"]],\n }\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OllamaLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OllamaLLM.validate_magpie_usage","title":"validate_magpie_usage()
","text":"Validates that magpie usage is valid.
Source code insrc/distilabel/models/llms/ollama.py
@model_validator(mode=\"after\") # type: ignore\ndef validate_magpie_usage(\n self,\n) -> \"OllamaLLM\":\n \"\"\"Validates that magpie usage is valid.\"\"\"\n\n if self.use_magpie_template and self.tokenizer_id is None:\n raise ValueError(\n \"`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,\"\n \" set a `tokenizer_id` and try again.\"\n )\n
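In practice this validator means Magpie can only be enabled together with a tokenizer; a minimal sketch, where the Hugging Face repo id is a placeholder for any tokenizer that ships a chat template:
```python
from distilabel.models.llms import OllamaLLM

# valid: `use_magpie_template=True` together with a `tokenizer_id`
llm = OllamaLLM(
    model="llama3",
    tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
    use_magpie_template=True,
    magpie_pre_query_template="llama3",
)

# invalid: this would raise a `ValueError` during model validation
# OllamaLLM(model="llama3", use_magpie_template=True)
```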
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OllamaLLM.load","title":"load()
","text":"Loads the AsyncClient
to use Ollama async API.
src/distilabel/models/llms/ollama.py
def load(self) -> None:\n \"\"\"Loads the `AsyncClient` to use Ollama async API.\"\"\"\n super().load()\n\n try:\n from ollama import AsyncClient\n\n self._aclient = AsyncClient(\n host=self.host,\n timeout=self.timeout,\n follow_redirects=self.follow_redirects,\n )\n except ImportError as e:\n raise ImportError(\n \"Ollama Python client is not installed. Please install it using\"\n \" `pip install ollama`.\"\n ) from e\n\n if self.tokenizer_id:\n try:\n from transformers import AutoTokenizer\n except ImportError as ie:\n raise ImportError(\n \"Transformers is not installed. Please install it using `pip install 'distilabel[hf-transformers]'`.\"\n ) from ie\n self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_id)\n if self._tokenizer.chat_template is None:\n raise ValueError(\n \"The tokenizer does not have a chat template. Please use a tokenizer with a chat template.\"\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OllamaLLM.prepare_input","title":"prepare_input(input)
","text":"Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name Type Description Defaultinput
StandardInput
the input list containing chat items.
requiredReturns:
Type Descriptionstr
The prompt to send to the LLM.
Source code insrc/distilabel/models/llms/ollama.py
def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n prompt: str = (\n self._tokenizer.apply_chat_template(\n conversation=input,\n tokenize=False,\n add_generation_prompt=True,\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n
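As a standalone illustration of what `prepare_input` does internally, a minimal sketch using `transformers` directly, assuming any tokenizer that ships a chat template (the repo id below is a placeholder):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# same call `prepare_input` makes: render the chat into a raw prompt string
prompt = tokenizer.apply_chat_template(
    conversation=[{"role": "user", "content": "Hello world!"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # the raw prompt that would later be sent to Ollama with `raw=True`
```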
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OllamaLLM.agenerate","title":"agenerate(input, format='', options=None, keep_alive=None)
async
","text":"Generates a response asynchronously, using the Ollama Async API definition.
Parameters:
Name Type Description Defaultinput
StandardInput
the input to use for the generation.
requiredformat
Literal['', 'json']
the format to use for the generation. Defaults to \"\"
.
''
options
Union[Options, None]
the options to use for the generation. Defaults to None
.
None
keep_alive
Union[bool, None]
whether to keep the connection alive. Defaults to None
.
None
Returns:
Type DescriptionGenerateOutput
A list of strings as completion for the given input.
Source code insrc/distilabel/models/llms/ollama.py
@validate_call\nasync def agenerate(\n self,\n input: StandardInput,\n format: Literal[\"\", \"json\"] = \"\",\n # TODO: include relevant options from `Options` in `agenerate` method.\n options: Union[Options, None] = None,\n keep_alive: Union[bool, None] = None,\n) -> GenerateOutput:\n \"\"\"\n Generates a response asynchronously, using the [Ollama Async API definition](https://github.com/ollama/ollama-python).\n\n Args:\n input: the input to use for the generation.\n format: the format to use for the generation. Defaults to `\"\"`.\n options: the options to use for the generation. Defaults to `None`.\n keep_alive: whether to keep the connection alive. Defaults to `None`.\n\n Returns:\n A list of strings as completion for the given input.\n \"\"\"\n text = None\n try:\n if not format:\n format = None\n if self.tokenizer_id is None:\n completion = await self._generate_chat_completion(\n input, format, options, keep_alive\n )\n text = completion[\"message\"][\"content\"]\n else:\n completion = await self._generate_with_text_generation(\n input, format, options, keep_alive\n )\n text = completion.response\n except Exception as e:\n self._logger.warning( # type: ignore\n f\"\u26a0\ufe0f Received no response using Ollama client (model: '{self.model_name}').\"\n f\" Finish reason was: {e}\"\n )\n\n return prepare_output([text], **self._get_llm_statistics(completion))\n
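For a one-off asynchronous call outside a pipeline, the coroutine can be run directly; a minimal sketch, assuming a local Ollama server with the model already pulled:
```python
import asyncio

from distilabel.models.llms import OllamaLLM

llm = OllamaLLM(model="llama3")
llm.load()

# request JSON output through Ollama's `format` parameter
output = asyncio.run(
    llm.agenerate(
        input=[{"role": "user", "content": "Return a JSON object with a `greeting` field."}],
        format="json",
    )
)
```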
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM","title":"OpenAILLM
","text":" Bases: AsyncLLM
OpenAI LLM implementation running the async API client.
Attributes:
Name Type Descriptionmodel
str
the model name to use for the LLM e.g. \"gpt-3.5-turbo\", \"gpt-4\", etc. Supported models can be found here.
base_url
Optional[RuntimeParameter[str]]
the base URL to use for the OpenAI API requests. Defaults to None
, which means that the value set for the environment variable OPENAI_BASE_URL
will be used, or \"https://api.openai.com/v1\" if not set.
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the OpenAI API. Defaults to None
which means that the value set for the environment variable OPENAI_API_KEY
will be used, or None
if not set.
max_retries
RuntimeParameter[int]
the maximum number of times to retry the request to the API before failing. Defaults to 6
.
timeout
RuntimeParameter[int]
the maximum time in seconds to wait for a response from the API. Defaults to 120
.
structured_output
Optional[RuntimeParameter[InstructorStructuredOutputType]]
a dictionary containing the structured output configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
base_url
: the base URL to use for the OpenAI API requests. Defaults to None
.api_key
: the API key to authenticate the requests to the OpenAI API. Defaults to None
.max_retries
: the maximum number of times to retry the request to the API before failing. Defaults to 6
.timeout
: the maximum time in seconds to wait for a response from the API. Defaults to 120
.:simple-openai:
Examples:
Generate text:
from distilabel.models.llms import OpenAILLM\n\nllm = OpenAILLM(model=\"gpt-4-turbo\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate text from a custom endpoint following the OpenAI API:
from distilabel.models.llms import OpenAILLM\n\nllm = OpenAILLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n base_url=r\"http://localhost:8080/v1\"\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pydantic import BaseModel\nfrom distilabel.models.llms import OpenAILLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = OpenAILLM(\n model=\"gpt-4-turbo\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
Generate with Batch API (offline batch generation):
from distilabel.models.llms import OpenAILLM\n\nllm = OpenAILLM(\n    model=\"gpt-3.5-turbo\",\n    use_offline_batch_generation=True,\n    offline_batch_generation_block_until_done=5,  # poll for results every 5 seconds\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n# [['Hello! How can I assist you today?']]\n
Source code in src/distilabel/models/llms/openai.py
class OpenAILLM(AsyncLLM):\n \"\"\"OpenAI LLM implementation running the async API client.\n\n Attributes:\n model: the model name to use for the LLM e.g. \"gpt-3.5-turbo\", \"gpt-4\", etc.\n Supported models can be found [here](https://platform.openai.com/docs/guides/text-generation).\n base_url: the base URL to use for the OpenAI API requests. Defaults to `None`, which\n means that the value set for the environment variable `OPENAI_BASE_URL` will\n be used, or \"https://api.openai.com/v1\" if not set.\n api_key: the API key to authenticate the requests to the OpenAI API. Defaults to\n `None` which means that the value set for the environment variable `OPENAI_API_KEY`\n will be used, or `None` if not set.\n max_retries: the maximum number of times to retry the request to the API before\n failing. Defaults to `6`.\n timeout: the maximum time in seconds to wait for a response from the API. Defaults\n to `120`.\n structured_output: a dictionary containing the structured output configuration configuration\n using `instructor`. You can take a look at the dictionary structure in\n `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.\n\n Runtime parameters:\n - `base_url`: the base URL to use for the OpenAI API requests. Defaults to `None`.\n - `api_key`: the API key to authenticate the requests to the OpenAI API. Defaults\n to `None`.\n - `max_retries`: the maximum number of times to retry the request to the API before\n failing. Defaults to `6`.\n - `timeout`: the maximum time in seconds to wait for a response from the API. Defaults\n to `120`.\n\n Icon:\n `:simple-openai:`\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import OpenAILLM\n\n llm = OpenAILLM(model=\"gpt-4-turbo\", api_key=\"api.key\")\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Generate text from a custom endpoint following the OpenAI API:\n\n ```python\n from distilabel.models.llms import OpenAILLM\n\n llm = OpenAILLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n base_url=r\"http://localhost:8080/v1\"\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Generate structured data:\n\n ```python\n from pydantic import BaseModel\n from distilabel.models.llms import OpenAILLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = OpenAILLM(\n model=\"gpt-4-turbo\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n\n Generate with Batch API (offline batch generation):\n\n ```python\n from distilabel.models.llms import OpenAILLM\n\n load = llm = OpenAILLM(\n model=\"gpt-3.5-turbo\",\n use_offline_batch_generation=True,\n offline_batch_generation_block_until_done=5, # poll for results every 5 seconds\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n # [['Hello! 
How can I assist you today?']]\n ```\n \"\"\"\n\n model: str\n base_url: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(\n \"OPENAI_BASE_URL\", \"https://api.openai.com/v1\"\n ),\n description=\"The base URL to use for the OpenAI API requests.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_OPENAI_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the OpenAI API.\",\n )\n max_retries: RuntimeParameter[int] = Field(\n default=6,\n description=\"The maximum number of times to retry the request to the API before\"\n \" failing.\",\n )\n timeout: RuntimeParameter[int] = Field(\n default=120,\n description=\"The maximum time in seconds to wait for a response from the API.\",\n )\n structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (\n Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n )\n\n _api_key_env_var: str = PrivateAttr(_OPENAI_API_KEY_ENV_VAR_NAME)\n _client: \"OpenAI\" = PrivateAttr(None)\n _aclient: \"AsyncOpenAI\" = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Loads the `AsyncOpenAI` client to benefit from async requests.\"\"\"\n super().load()\n\n try:\n from openai import AsyncOpenAI, OpenAI\n except ImportError as ie:\n raise ImportError(\n \"OpenAI Python client is not installed. Please install it using\"\n \" `pip install openai`.\"\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n self._client = OpenAI(\n base_url=self.base_url,\n api_key=self.api_key.get_secret_value(),\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n self._aclient = AsyncOpenAI(\n base_url=self.base_url,\n api_key=self.api_key.get_secret_value(),\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"openai\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n\n def unload(self) -> None:\n \"\"\"Set clients to `None` as they both contain `thread._RLock` which cannot be pickled\n in case an exception is raised and has to be handled in the main process\"\"\"\n\n self._client = None # type: ignore\n self._aclient = None # type: ignore\n self.structured_output = None\n super().unload()\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n @validate_call\n async def agenerate( # type: ignore\n self,\n input: FormattedInput,\n num_generations: int = 1,\n max_new_tokens: int = 128,\n logprobs: bool = False,\n top_logprobs: Optional[PositiveInt] = None,\n frequency_penalty: float = 0.0,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n stop: Optional[Union[str, List[str]]] = None,\n response_format: Optional[Dict[str, str]] = None,\n ) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for the given input using the OpenAI async\n client.\n\n Args:\n input: a single input in chat format to generate responses for.\n num_generations: the number of generations to create per 
input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n logprobs: whether to return the log probabilities or not. Defaults to `False`.\n top_logprobs: the number of top log probabilities to return per output token\n generated. Defaults to `None`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n stop: a string or a list of strings to use as a stop sequence for the generation.\n Defaults to `None`.\n response_format: the format of the response to return. Must be one of\n \"text\" or \"json\". Read the documentation [here](https://platform.openai.com/docs/guides/text-generation/json-mode)\n for more information on how to use the JSON model from OpenAI. Defaults to None\n which returns text. To return JSON, use {\"type\": \"json_object\"}.\n\n Note:\n If response_format\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output, # type: ignore\n client=self._aclient,\n framework=\"openai\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"messages\": input, # type: ignore\n \"model\": self.model,\n \"logprobs\": logprobs,\n \"top_logprobs\": top_logprobs,\n \"max_tokens\": max_new_tokens,\n \"n\": num_generations,\n \"frequency_penalty\": frequency_penalty,\n \"presence_penalty\": presence_penalty,\n \"temperature\": temperature,\n \"top_p\": top_p,\n \"stop\": stop,\n }\n # Check if it's a vision generation task, in that case \"stop\" cannot be used or raises\n # an error in the API.\n if isinstance(\n [row for row in input if row[\"role\"] == \"user\"][0][\"content\"], list\n ):\n kwargs.pop(\"stop\")\n\n if response_format is not None:\n kwargs[\"response_format\"] = response_format\n\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output) # type: ignore\n\n completion = await self._aclient.chat.completions.create(**kwargs) # type: ignore\n\n if structured_output:\n # NOTE: `instructor` doesn't work with `n` parameter, so it will always return\n # only 1 choice.\n statistics = self._get_llm_statistics(completion._raw_response)\n if choice_logprobs := self._get_logprobs_from_choice(\n completion._raw_response.choices[0]\n ):\n output_logprobs = [choice_logprobs]\n else:\n output_logprobs = None\n return prepare_output(\n generations=[completion.model_dump_json()],\n input_tokens=statistics[\"input_tokens\"],\n output_tokens=statistics[\"output_tokens\"],\n logprobs=output_logprobs,\n )\n\n return self._generations_from_openai_completion(completion)\n\n def _generations_from_openai_completion(\n self, completion: \"OpenAIChatCompletion\"\n ) -> \"GenerateOutput\":\n \"\"\"Get the generations from the OpenAI Chat Completion object.\n\n Args:\n completion: the completion object to get the generations from.\n\n Returns:\n A list of strings containing the generated responses for the input.\n \"\"\"\n generations = []\n logprobs = []\n for choice in 
completion.choices:\n if (content := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"Received no response using OpenAI client (model: '{self.model}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(content)\n if choice_logprobs := self._get_logprobs_from_choice(choice):\n logprobs.append(choice_logprobs)\n\n statistics = self._get_llm_statistics(completion)\n return prepare_output(\n generations=generations,\n input_tokens=statistics[\"input_tokens\"],\n output_tokens=statistics[\"output_tokens\"],\n logprobs=logprobs,\n )\n\n def _get_logprobs_from_choice(\n self, choice: \"OpenAIChoice\"\n ) -> Union[List[List[\"Logprob\"]], None]:\n if choice.logprobs is None or choice.logprobs.content is None:\n return None\n\n return [\n [\n {\"token\": top_logprob.token, \"logprob\": top_logprob.logprob}\n for top_logprob in token_logprobs.top_logprobs\n ]\n for token_logprobs in choice.logprobs.content\n ]\n\n def offline_batch_generate(\n self,\n inputs: Union[List[\"FormattedInput\"], None] = None,\n num_generations: int = 1,\n max_new_tokens: int = 128,\n logprobs: bool = False,\n top_logprobs: Optional[PositiveInt] = None,\n frequency_penalty: float = 0.0,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n stop: Optional[Union[str, List[str]]] = None,\n response_format: Optional[str] = None,\n **kwargs: Any,\n ) -> List[\"GenerateOutput\"]:\n \"\"\"Uses the OpenAI batch API to generate `num_generations` responses for the given\n inputs.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n logprobs: whether to return the log probabilities or not. Defaults to `False`.\n top_logprobs: the number of top log probabilities to return per output token\n generated. Defaults to `None`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n stop: a string or a list of strings to use as a stop sequence for the generation.\n Defaults to `None`.\n response_format: the format of the response to return. Must be one of\n \"text\" or \"json\". Read the documentation [here](https://platform.openai.com/docs/guides/text-generation/json-mode)\n for more information on how to use the JSON model from OpenAI. 
Defaults to `text`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input\n in `inputs`.\n\n Raises:\n DistilabelOfflineBatchGenerationNotFinishedException: if the batch generation\n is not finished yet.\n ValueError: if no job IDs were found to retrieve the results from.\n \"\"\"\n if self.jobs_ids:\n return self._check_and_get_batch_results()\n\n if inputs:\n self.jobs_ids = self._create_jobs(\n inputs=inputs,\n **{\n \"model\": self.model,\n \"logprobs\": logprobs,\n \"top_logprobs\": top_logprobs,\n \"max_tokens\": max_new_tokens,\n \"n\": num_generations,\n \"frequency_penalty\": frequency_penalty,\n \"presence_penalty\": presence_penalty,\n \"temperature\": temperature,\n \"top_p\": top_p,\n \"stop\": stop,\n \"response_format\": response_format,\n },\n )\n raise DistilabelOfflineBatchGenerationNotFinishedException(\n jobs_ids=self.jobs_ids\n )\n\n raise ValueError(\"No `inputs` were provided and no `jobs_ids` were found.\")\n\n def _check_and_get_batch_results(self) -> List[\"GenerateOutput\"]:\n \"\"\"Checks the status of the batch jobs and retrieves the results from the OpenAI\n Batch API.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n\n Raises:\n ValueError: if no job IDs were found to retrieve the results from.\n DistilabelOfflineBatchGenerationNotFinishedException: if the batch generation\n is not finished yet.\n RuntimeError: if the only batch job found failed.\n \"\"\"\n if not self.jobs_ids:\n raise ValueError(\"No job IDs were found to retrieve the results from.\")\n\n outputs = []\n for batch_id in self.jobs_ids:\n batch = self._get_openai_batch(batch_id)\n\n if batch.status in (\"validating\", \"in_progress\", \"finalizing\"):\n raise DistilabelOfflineBatchGenerationNotFinishedException(\n jobs_ids=self.jobs_ids\n )\n\n if batch.status in (\"failed\", \"expired\", \"cancelled\", \"cancelling\"):\n self._logger.error( # type: ignore\n f\"OpenAI API batch with ID '{batch_id}' failed with status '{batch.status}'.\"\n )\n if len(self.jobs_ids) == 1:\n self.jobs_ids = None\n raise RuntimeError(\n f\"The only OpenAI API Batch that was created with ID '{batch_id}'\"\n f\" failed with status '{batch.status}'.\"\n )\n\n continue\n\n outputs.extend(self._retrieve_batch_results(batch))\n\n # sort by `custom_id` to return the results in the same order as the inputs\n outputs = sorted(outputs, key=lambda x: int(x[\"custom_id\"]))\n return [self._parse_output(output) for output in outputs]\n\n def _parse_output(self, output: Dict[str, Any]) -> \"GenerateOutput\":\n \"\"\"Parses the output from the OpenAI Batch API into a list of strings.\n\n Args:\n output: the output to parse.\n\n Returns:\n A list of strings containing the generated responses for the input.\n \"\"\"\n from openai.types.chat import ChatCompletion as OpenAIChatCompletion\n\n if \"response\" not in output:\n return []\n\n if output[\"response\"][\"status_code\"] != 200:\n return []\n\n return self._generations_from_openai_completion(\n OpenAIChatCompletion(**output[\"response\"][\"body\"])\n )\n\n def _get_openai_batch(self, batch_id: str) -> \"OpenAIBatch\":\n \"\"\"Gets a batch from the OpenAI Batch API.\n\n Args:\n batch_id: the ID of the batch to retrieve.\n\n Returns:\n The batch retrieved from the OpenAI Batch API.\n\n Raises:\n openai.OpenAIError: if there was an error while retrieving the batch from the\n OpenAI Batch API.\n \"\"\"\n import openai\n\n try:\n return self._client.batches.retrieve(batch_id)\n except 
openai.OpenAIError as e:\n self._logger.error( # type: ignore\n f\"Error while retrieving batch '{batch_id}' from OpenAI: {e}\"\n )\n raise e\n\n def _retrieve_batch_results(self, batch: \"OpenAIBatch\") -> List[Dict[str, Any]]:\n \"\"\"Retrieves the results of a batch from its output file, parsing the JSONL content\n into a list of dictionaries.\n\n Args:\n batch: the batch to retrieve the results from.\n\n Returns:\n A list of dictionaries containing the results of the batch.\n\n Raises:\n AssertionError: if no output file ID was found in the batch.\n \"\"\"\n import openai\n\n assert batch.output_file_id, \"No output file ID was found in the batch.\"\n\n try:\n file_response = self._client.files.content(batch.output_file_id)\n return [orjson.loads(line) for line in file_response.text.splitlines()]\n except openai.OpenAIError as e:\n self._logger.error( # type: ignore\n f\"Error while retrieving batch results from file '{batch.output_file_id}': {e}\"\n )\n return []\n\n def _create_jobs(\n self, inputs: List[\"FormattedInput\"], **kwargs: Any\n ) -> Tuple[str, ...]:\n \"\"\"Creates jobs in the OpenAI Batch API to generate responses for the given inputs.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n kwargs: the keyword arguments to use for the generation.\n\n Returns:\n A list of job IDs created in the OpenAI Batch API.\n \"\"\"\n batch_input_files = self._create_batch_files(inputs=inputs, **kwargs)\n jobs = []\n for batch_input_file in batch_input_files:\n if batch := self._create_batch_api_job(batch_input_file):\n jobs.append(batch.id)\n return tuple(jobs)\n\n def _create_batch_api_job(\n self, batch_input_file: \"OpenAIFileObject\"\n ) -> Union[\"OpenAIBatch\", None]:\n \"\"\"Creates a job in the OpenAI Batch API to generate responses for the given input\n file.\n\n Args:\n batch_input_file: the input file to generate responses for.\n\n Returns:\n The batch job created in the OpenAI Batch API.\n \"\"\"\n import openai\n\n metadata = {\"description\": \"distilabel\"}\n\n if distilabel_pipeline_name := envs.DISTILABEL_PIPELINE_NAME:\n metadata[\"distilabel_pipeline_name\"] = distilabel_pipeline_name\n\n if distilabel_pipeline_cache_id := envs.DISTILABEL_PIPELINE_CACHE_ID:\n metadata[\"distilabel_pipeline_cache_id\"] = distilabel_pipeline_cache_id\n\n batch = None\n try:\n batch = self._client.batches.create(\n completion_window=\"24h\",\n endpoint=\"/v1/chat/completions\",\n input_file_id=batch_input_file.id,\n metadata=metadata,\n )\n except openai.OpenAIError as e:\n self._logger.error( # type: ignore\n f\"Error while creating OpenAI Batch API job for file with ID\"\n f\" '{batch_input_file.id}': {e}.\"\n )\n raise e\n return batch\n\n def _create_batch_files(\n self, inputs: List[\"FormattedInput\"], **kwargs: Any\n ) -> List[\"OpenAIFileObject\"]:\n \"\"\"Creates the necessary input files for the batch API to generate responses. 
The\n maximum size of each file so the OpenAI Batch API can process it is 100MB, so we\n need to split the inputs into multiple files if necessary.\n\n More information: https://platform.openai.com/docs/api-reference/files/create\n\n Args:\n inputs: a list of inputs in chat format to generate responses for, optionally\n including structured output.\n kwargs: the keyword arguments to use for the generation.\n\n Returns:\n The list of file objects created for the OpenAI Batch API.\n\n Raises:\n openai.OpenAIError: if there was an error while creating the batch input file\n in the OpenAI Batch API.\n \"\"\"\n import openai\n\n files = []\n for file_no, buffer in enumerate(\n self._create_jsonl_buffers(inputs=inputs, **kwargs)\n ):\n try:\n # TODO: add distilabel pipeline name and id\n batch_input_file = self._client.files.create(\n file=(self._name_for_openai_files(file_no), buffer),\n purpose=\"batch\",\n )\n files.append(batch_input_file)\n except openai.OpenAIError as e:\n self._logger.error( # type: ignore\n f\"Error while creating OpenAI batch input file: {e}\"\n )\n raise e\n return files\n\n def _create_jsonl_buffers(\n self, inputs: List[\"FormattedInput\"], **kwargs: Any\n ) -> Generator[io.BytesIO, None, None]:\n \"\"\"Creates a generator of buffers containing the JSONL formatted inputs to be\n used by the OpenAI Batch API. The buffers created are of size 100MB or less.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for, optionally\n including structured output.\n kwargs: the keyword arguments to use for the generation.\n\n Yields:\n A buffer containing the JSONL formatted inputs to be used by the OpenAI Batch\n API.\n \"\"\"\n buffer = io.BytesIO()\n buffer_current_size = 0\n for i, input in enumerate(inputs):\n # We create the smallest `custom_id` so we don't increase the size of the file\n # to much, but we can still sort the results with the order of the inputs.\n row = self._create_jsonl_row(input=input, custom_id=str(i), **kwargs)\n row_size = len(row)\n if row_size + buffer_current_size > _OPENAI_BATCH_API_MAX_FILE_SIZE:\n buffer.seek(0)\n yield buffer\n buffer = io.BytesIO()\n buffer_current_size = 0\n buffer.write(row)\n buffer_current_size += row_size\n\n if buffer_current_size > 0:\n buffer.seek(0)\n yield buffer\n\n def _create_jsonl_row(\n self, input: \"FormattedInput\", custom_id: str, **kwargs: Any\n ) -> bytes:\n \"\"\"Creates a JSONL formatted row to be used by the OpenAI Batch API.\n\n Args:\n input: a list of inputs in chat format to generate responses for, optionally\n including structured output.\n custom_id: a custom ID to use for the row.\n kwargs: the keyword arguments to use for the generation.\n\n Returns:\n A JSONL formatted row to be used by the OpenAI Batch API.\n \"\"\"\n # TODO: depending on the format of the input, add `response_format` to the kwargs\n row = {\n \"custom_id\": custom_id,\n \"method\": \"POST\",\n \"url\": \"/v1/chat/completions\",\n \"body\": {\"messages\": input, **kwargs},\n }\n json_row = orjson.dumps(row)\n return json_row + b\"\\n\"\n\n def _name_for_openai_files(self, file_no: int) -> str:\n if (\n envs.DISTILABEL_PIPELINE_NAME is None\n or envs.DISTILABEL_PIPELINE_CACHE_ID is None\n ):\n return f\"distilabel-pipeline-fileno-{file_no}.jsonl\"\n\n return f\"distilabel-pipeline-{envs.DISTILABEL_PIPELINE_NAME}-{envs.DISTILABEL_PIPELINE_CACHE_ID}-fileno-{file_no}.jsonl\"\n\n @staticmethod\n def _get_llm_statistics(\n completion: Union[\"OpenAIChatCompletion\", \"OpenAICompletion\"],\n ) -> 
\"LLMStatistics\":\n return {\n \"output_tokens\": [\n completion.usage.completion_tokens if completion.usage else 0\n ],\n \"input_tokens\": [completion.usage.prompt_tokens if completion.usage else 0],\n }\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM.load","title":"load()
","text":"Loads the AsyncOpenAI
client to benefit from async requests.
src/distilabel/models/llms/openai.py
def load(self) -> None:\n \"\"\"Loads the `AsyncOpenAI` client to benefit from async requests.\"\"\"\n super().load()\n\n try:\n from openai import AsyncOpenAI, OpenAI\n except ImportError as ie:\n raise ImportError(\n \"OpenAI Python client is not installed. Please install it using\"\n \" `pip install openai`.\"\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n self._client = OpenAI(\n base_url=self.base_url,\n api_key=self.api_key.get_secret_value(),\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n self._aclient = AsyncOpenAI(\n base_url=self.base_url,\n api_key=self.api_key.get_secret_value(),\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"openai\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n
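Note that the API key can also be taken from the environment: it has to be available before the class is instantiated, since the default value is resolved through a `default_factory`. A minimal sketch with a placeholder key:
```python
import os

from distilabel.models.llms import OpenAILLM

# set the variable before instantiating the LLM so the default factory picks it up
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder value

llm = OpenAILLM(model="gpt-4-turbo")
llm.load()  # raises `ValueError` if no API key could be resolved
```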
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM.unload","title":"unload()
","text":"Set clients to None
as they both contain thread._RLock
which cannot be pickled in case an exception is raised and has to be handled in the main process
src/distilabel/models/llms/openai.py
def unload(self) -> None:\n \"\"\"Set clients to `None` as they both contain `thread._RLock` which cannot be pickled\n in case an exception is raised and has to be handled in the main process\"\"\"\n\n self._client = None # type: ignore\n self._aclient = None # type: ignore\n self.structured_output = None\n super().unload()\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM.agenerate","title":"agenerate(input, num_generations=1, max_new_tokens=128, logprobs=False, top_logprobs=None, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, stop=None, response_format=None)
async
","text":"Generates num_generations
responses for the given input using the OpenAI async client.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requirednum_generations
int
the number of generations to create per input. Defaults to 1
.
1
max_new_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
logprobs
bool
whether to return the log probabilities or not. Defaults to False
.
False
top_logprobs
Optional[PositiveInt]
the number of top log probabilities to return per output token generated. Defaults to None
.
None
frequency_penalty
float
the repetition penalty to use for the generation. Defaults to 0.0
.
0.0
presence_penalty
float
the presence penalty to use for the generation. Defaults to 0.0
.
0.0
temperature
float
the temperature to use for the generation. Defaults to 1.0
.
1.0
top_p
float
the top-p value to use for the generation. Defaults to 1.0
.
1.0
stop
Optional[Union[str, List[str]]]
a string or a list of strings to use as a stop sequence for the generation. Defaults to None
.
None
response_format
Optional[Dict[str, str]]
the format of the response to return. Must be one of \"text\" or \"json\". Read the documentation here for more information on how to use the JSON model from OpenAI. Defaults to None which returns text. To return JSON, use {\"type\": \"json_object\"}.
None
Note If response_format
Returns:
Type DescriptionGenerateOutput
A list of lists of strings containing the generated responses for each input.
Source code insrc/distilabel/models/llms/openai.py
@validate_call\nasync def agenerate( # type: ignore\n self,\n input: FormattedInput,\n num_generations: int = 1,\n max_new_tokens: int = 128,\n logprobs: bool = False,\n top_logprobs: Optional[PositiveInt] = None,\n frequency_penalty: float = 0.0,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n stop: Optional[Union[str, List[str]]] = None,\n response_format: Optional[Dict[str, str]] = None,\n) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for the given input using the OpenAI async\n client.\n\n Args:\n input: a single input in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n logprobs: whether to return the log probabilities or not. Defaults to `False`.\n top_logprobs: the number of top log probabilities to return per output token\n generated. Defaults to `None`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n stop: a string or a list of strings to use as a stop sequence for the generation.\n Defaults to `None`.\n response_format: the format of the response to return. Must be one of\n \"text\" or \"json\". Read the documentation [here](https://platform.openai.com/docs/guides/text-generation/json-mode)\n for more information on how to use the JSON model from OpenAI. Defaults to None\n which returns text. To return JSON, use {\"type\": \"json_object\"}.\n\n Note:\n If response_format\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output, # type: ignore\n client=self._aclient,\n framework=\"openai\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"messages\": input, # type: ignore\n \"model\": self.model,\n \"logprobs\": logprobs,\n \"top_logprobs\": top_logprobs,\n \"max_tokens\": max_new_tokens,\n \"n\": num_generations,\n \"frequency_penalty\": frequency_penalty,\n \"presence_penalty\": presence_penalty,\n \"temperature\": temperature,\n \"top_p\": top_p,\n \"stop\": stop,\n }\n # Check if it's a vision generation task, in that case \"stop\" cannot be used or raises\n # an error in the API.\n if isinstance(\n [row for row in input if row[\"role\"] == \"user\"][0][\"content\"], list\n ):\n kwargs.pop(\"stop\")\n\n if response_format is not None:\n kwargs[\"response_format\"] = response_format\n\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output) # type: ignore\n\n completion = await self._aclient.chat.completions.create(**kwargs) # type: ignore\n\n if structured_output:\n # NOTE: `instructor` doesn't work with `n` parameter, so it will always return\n # only 1 choice.\n statistics = self._get_llm_statistics(completion._raw_response)\n if choice_logprobs := self._get_logprobs_from_choice(\n completion._raw_response.choices[0]\n ):\n output_logprobs = [choice_logprobs]\n else:\n 
output_logprobs = None\n return prepare_output(\n generations=[completion.model_dump_json()],\n input_tokens=statistics[\"input_tokens\"],\n output_tokens=statistics[\"output_tokens\"],\n logprobs=output_logprobs,\n )\n\n return self._generations_from_openai_completion(completion)\n
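For a single asynchronous call outside a pipeline, the method can be awaited directly; a minimal sketch, assuming a valid API key (the one below is a placeholder) and a model that accepts `top_logprobs` together with `logprobs=True`:
```python
import asyncio

from distilabel.models.llms import OpenAILLM

llm = OpenAILLM(model="gpt-4-turbo", api_key="api.key")  # placeholder key
llm.load()

output = asyncio.run(
    llm.agenerate(
        input=[{"role": "user", "content": "Hello world!"}],
        max_new_tokens=64,
        logprobs=True,
        top_logprobs=2,
    )
)
# `output` carries the generations plus token statistics and, here, per-token logprobs
```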
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._generations_from_openai_completion","title":"_generations_from_openai_completion(completion)
","text":"Get the generations from the OpenAI Chat Completion object.
Parameters:
Name Type Description Defaultcompletion
ChatCompletion
the completion object to get the generations from.
requiredReturns:
Type DescriptionGenerateOutput
A list of strings containing the generated responses for the input.
Source code insrc/distilabel/models/llms/openai.py
def _generations_from_openai_completion(\n self, completion: \"OpenAIChatCompletion\"\n) -> \"GenerateOutput\":\n \"\"\"Get the generations from the OpenAI Chat Completion object.\n\n Args:\n completion: the completion object to get the generations from.\n\n Returns:\n A list of strings containing the generated responses for the input.\n \"\"\"\n generations = []\n logprobs = []\n for choice in completion.choices:\n if (content := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"Received no response using OpenAI client (model: '{self.model}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(content)\n if choice_logprobs := self._get_logprobs_from_choice(choice):\n logprobs.append(choice_logprobs)\n\n statistics = self._get_llm_statistics(completion)\n return prepare_output(\n generations=generations,\n input_tokens=statistics[\"input_tokens\"],\n output_tokens=statistics[\"output_tokens\"],\n logprobs=logprobs,\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM.offline_batch_generate","title":"offline_batch_generate(inputs=None, num_generations=1, max_new_tokens=128, logprobs=False, top_logprobs=None, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, stop=None, response_format=None, **kwargs)
","text":"Uses the OpenAI batch API to generate num_generations
responses for the given inputs.
Parameters:
Name Type Description Defaultinputs
Union[List[FormattedInput], None]
a list of inputs in chat format to generate responses for.
None
num_generations
int
the number of generations to create per input. Defaults to 1
.
1
max_new_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
logprobs
bool
whether to return the log probabilities or not. Defaults to False
.
False
top_logprobs
Optional[PositiveInt]
the number of top log probabilities to return per output token generated. Defaults to None
.
None
frequency_penalty
float
the repetition penalty to use for the generation. Defaults to 0.0
.
0.0
presence_penalty
float
the presence penalty to use for the generation. Defaults to 0.0
.
0.0
temperature
float
the temperature to use for the generation. Defaults to 1.0
.
1.0
top_p
float
the top-p value to use for the generation. Defaults to 1.0
.
1.0
stop
Optional[Union[str, List[str]]]
a string or a list of strings to use as a stop sequence for the generation. Defaults to None
.
None
response_format
Optional[str]
the format of the response to return. Must be one of \"text\" or \"json\". Read the documentation here for more information on how to use the JSON mode from OpenAI. Defaults to None, which returns text.
None
Returns:
Type DescriptionList[GenerateOutput]
A list of lists of strings containing the generated responses for each input
List[GenerateOutput]
in inputs
.
Raises:
Type DescriptionDistilabelOfflineBatchGenerationNotFinishedException
if the batch generation is not finished yet.
ValueError
if no job IDs were found to retrieve the results from.
Source code insrc/distilabel/models/llms/openai.py
def offline_batch_generate(\n self,\n inputs: Union[List[\"FormattedInput\"], None] = None,\n num_generations: int = 1,\n max_new_tokens: int = 128,\n logprobs: bool = False,\n top_logprobs: Optional[PositiveInt] = None,\n frequency_penalty: float = 0.0,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n stop: Optional[Union[str, List[str]]] = None,\n response_format: Optional[str] = None,\n **kwargs: Any,\n) -> List[\"GenerateOutput\"]:\n \"\"\"Uses the OpenAI batch API to generate `num_generations` responses for the given\n inputs.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n logprobs: whether to return the log probabilities or not. Defaults to `False`.\n top_logprobs: the number of top log probabilities to return per output token\n generated. Defaults to `None`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n stop: a string or a list of strings to use as a stop sequence for the generation.\n Defaults to `None`.\n response_format: the format of the response to return. Must be one of\n \"text\" or \"json\". Read the documentation [here](https://platform.openai.com/docs/guides/text-generation/json-mode)\n for more information on how to use the JSON model from OpenAI. Defaults to `text`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input\n in `inputs`.\n\n Raises:\n DistilabelOfflineBatchGenerationNotFinishedException: if the batch generation\n is not finished yet.\n ValueError: if no job IDs were found to retrieve the results from.\n \"\"\"\n if self.jobs_ids:\n return self._check_and_get_batch_results()\n\n if inputs:\n self.jobs_ids = self._create_jobs(\n inputs=inputs,\n **{\n \"model\": self.model,\n \"logprobs\": logprobs,\n \"top_logprobs\": top_logprobs,\n \"max_tokens\": max_new_tokens,\n \"n\": num_generations,\n \"frequency_penalty\": frequency_penalty,\n \"presence_penalty\": presence_penalty,\n \"temperature\": temperature,\n \"top_p\": top_p,\n \"stop\": stop,\n \"response_format\": response_format,\n },\n )\n raise DistilabelOfflineBatchGenerationNotFinishedException(\n jobs_ids=self.jobs_ids\n )\n\n raise ValueError(\"No `inputs` were provided and no `jobs_ids` were found.\")\n
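Because batch results are rarely ready on the first call, callers typically catch the not-finished exception, keep the job ids and retry later; a minimal sketch, assuming the exception is importable from `distilabel.exceptions` and exposes the `jobs_ids` it is raised with:
```python
from distilabel.exceptions import (
    DistilabelOfflineBatchGenerationNotFinishedException,  # import path is an assumption
)
from distilabel.models.llms import OpenAILLM

llm = OpenAILLM(model="gpt-3.5-turbo", use_offline_batch_generation=True)
llm.load()

try:
    outputs = llm.offline_batch_generate(
        inputs=[[{"role": "user", "content": "Hello world!"}]],
        max_new_tokens=64,
    )
except DistilabelOfflineBatchGenerationNotFinishedException as e:
    # the jobs were submitted; store the ids and call `offline_batch_generate()` again later
    print(f"Batch not finished yet, job ids: {e.jobs_ids}")
```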
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._check_and_get_batch_results","title":"_check_and_get_batch_results()
","text":"Checks the status of the batch jobs and retrieves the results from the OpenAI Batch API.
Returns:
Type DescriptionList[GenerateOutput]
A list of lists of strings containing the generated responses for each input.
Raises:
Type DescriptionValueError
if no job IDs were found to retrieve the results from.
DistilabelOfflineBatchGenerationNotFinishedException
if the batch generation is not finished yet.
RuntimeError
if the only batch job found failed.
Source code insrc/distilabel/models/llms/openai.py
def _check_and_get_batch_results(self) -> List[\"GenerateOutput\"]:\n \"\"\"Checks the status of the batch jobs and retrieves the results from the OpenAI\n Batch API.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n\n Raises:\n ValueError: if no job IDs were found to retrieve the results from.\n DistilabelOfflineBatchGenerationNotFinishedException: if the batch generation\n is not finished yet.\n RuntimeError: if the only batch job found failed.\n \"\"\"\n if not self.jobs_ids:\n raise ValueError(\"No job IDs were found to retrieve the results from.\")\n\n outputs = []\n for batch_id in self.jobs_ids:\n batch = self._get_openai_batch(batch_id)\n\n if batch.status in (\"validating\", \"in_progress\", \"finalizing\"):\n raise DistilabelOfflineBatchGenerationNotFinishedException(\n jobs_ids=self.jobs_ids\n )\n\n if batch.status in (\"failed\", \"expired\", \"cancelled\", \"cancelling\"):\n self._logger.error( # type: ignore\n f\"OpenAI API batch with ID '{batch_id}' failed with status '{batch.status}'.\"\n )\n if len(self.jobs_ids) == 1:\n self.jobs_ids = None\n raise RuntimeError(\n f\"The only OpenAI API Batch that was created with ID '{batch_id}'\"\n f\" failed with status '{batch.status}'.\"\n )\n\n continue\n\n outputs.extend(self._retrieve_batch_results(batch))\n\n # sort by `custom_id` to return the results in the same order as the inputs\n outputs = sorted(outputs, key=lambda x: int(x[\"custom_id\"]))\n return [self._parse_output(output) for output in outputs]\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._parse_output","title":"_parse_output(output)
","text":"Parses the output from the OpenAI Batch API into a list of strings.
Parameters:
Name Type Description Defaultoutput
Dict[str, Any]
the output to parse.
requiredReturns:
Type DescriptionGenerateOutput
A list of strings containing the generated responses for the input.
Source code insrc/distilabel/models/llms/openai.py
def _parse_output(self, output: Dict[str, Any]) -> \"GenerateOutput\":\n \"\"\"Parses the output from the OpenAI Batch API into a list of strings.\n\n Args:\n output: the output to parse.\n\n Returns:\n A list of strings containing the generated responses for the input.\n \"\"\"\n from openai.types.chat import ChatCompletion as OpenAIChatCompletion\n\n if \"response\" not in output:\n return []\n\n if output[\"response\"][\"status_code\"] != 200:\n return []\n\n return self._generations_from_openai_completion(\n OpenAIChatCompletion(**output[\"response\"][\"body\"])\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._get_openai_batch","title":"_get_openai_batch(batch_id)
","text":"Gets a batch from the OpenAI Batch API.
Parameters:
Name Type Description Defaultbatch_id
str
the ID of the batch to retrieve.
requiredReturns:
Type DescriptionBatch
The batch retrieved from the OpenAI Batch API.
Raises:
Type DescriptionOpenAIError
if there was an error while retrieving the batch from the OpenAI Batch API.
Source code insrc/distilabel/models/llms/openai.py
def _get_openai_batch(self, batch_id: str) -> \"OpenAIBatch\":\n \"\"\"Gets a batch from the OpenAI Batch API.\n\n Args:\n batch_id: the ID of the batch to retrieve.\n\n Returns:\n The batch retrieved from the OpenAI Batch API.\n\n Raises:\n openai.OpenAIError: if there was an error while retrieving the batch from the\n OpenAI Batch API.\n \"\"\"\n import openai\n\n try:\n return self._client.batches.retrieve(batch_id)\n except openai.OpenAIError as e:\n self._logger.error( # type: ignore\n f\"Error while retrieving batch '{batch_id}' from OpenAI: {e}\"\n )\n raise e\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._retrieve_batch_results","title":"_retrieve_batch_results(batch)
","text":"Retrieves the results of a batch from its output file, parsing the JSONL content into a list of dictionaries.
Parameters:
Name Type Description Defaultbatch
Batch
the batch to retrieve the results from.
requiredReturns:
Type DescriptionList[Dict[str, Any]]
A list of dictionaries containing the results of the batch.
Raises:
Type DescriptionAssertionError
if no output file ID was found in the batch.
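As a rough illustration (assuming the `openai` package; distilabel uses `orjson` internally, plain `json` is used here for brevity), downloading and parsing a batch output file amounts to:
```python
import json
from openai import OpenAI

client = OpenAI()

def read_batch_output(output_file_id: str) -> list[dict]:
    """Downloads the JSONL output file of a batch and parses one dict per line."""
    file_response = client.files.content(output_file_id)
    return [json.loads(line) for line in file_response.text.splitlines()]
```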
Source code insrc/distilabel/models/llms/openai.py
def _retrieve_batch_results(self, batch: \"OpenAIBatch\") -> List[Dict[str, Any]]:\n \"\"\"Retrieves the results of a batch from its output file, parsing the JSONL content\n into a list of dictionaries.\n\n Args:\n batch: the batch to retrieve the results from.\n\n Returns:\n A list of dictionaries containing the results of the batch.\n\n Raises:\n AssertionError: if no output file ID was found in the batch.\n \"\"\"\n import openai\n\n assert batch.output_file_id, \"No output file ID was found in the batch.\"\n\n try:\n file_response = self._client.files.content(batch.output_file_id)\n return [orjson.loads(line) for line in file_response.text.splitlines()]\n except openai.OpenAIError as e:\n self._logger.error( # type: ignore\n f\"Error while retrieving batch results from file '{batch.output_file_id}': {e}\"\n )\n return []\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._create_jobs","title":"_create_jobs(inputs, **kwargs)
","text":"Creates jobs in the OpenAI Batch API to generate responses for the given inputs.
Parameters:
Name Type Description Defaultinputs
List[FormattedInput]
a list of inputs in chat format to generate responses for.
requiredkwargs
Any
the keyword arguments to use for the generation.
{}
Returns:
Type DescriptionTuple[str, ...]
A list of job IDs created in the OpenAI Batch API.
Source code insrc/distilabel/models/llms/openai.py
def _create_jobs(\n self, inputs: List[\"FormattedInput\"], **kwargs: Any\n) -> Tuple[str, ...]:\n \"\"\"Creates jobs in the OpenAI Batch API to generate responses for the given inputs.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n kwargs: the keyword arguments to use for the generation.\n\n Returns:\n A list of job IDs created in the OpenAI Batch API.\n \"\"\"\n batch_input_files = self._create_batch_files(inputs=inputs, **kwargs)\n jobs = []\n for batch_input_file in batch_input_files:\n if batch := self._create_batch_api_job(batch_input_file):\n jobs.append(batch.id)\n return tuple(jobs)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._create_batch_api_job","title":"_create_batch_api_job(batch_input_file)
","text":"Creates a job in the OpenAI Batch API to generate responses for the given input file.
Parameters:
Name Type Description Defaultbatch_input_file
FileObject
the input file to generate responses for.
requiredReturns:
Type DescriptionUnion[Batch, None]
The batch job created in the OpenAI Batch API.
Source code insrc/distilabel/models/llms/openai.py
def _create_batch_api_job(\n self, batch_input_file: \"OpenAIFileObject\"\n) -> Union[\"OpenAIBatch\", None]:\n \"\"\"Creates a job in the OpenAI Batch API to generate responses for the given input\n file.\n\n Args:\n batch_input_file: the input file to generate responses for.\n\n Returns:\n The batch job created in the OpenAI Batch API.\n \"\"\"\n import openai\n\n metadata = {\"description\": \"distilabel\"}\n\n if distilabel_pipeline_name := envs.DISTILABEL_PIPELINE_NAME:\n metadata[\"distilabel_pipeline_name\"] = distilabel_pipeline_name\n\n if distilabel_pipeline_cache_id := envs.DISTILABEL_PIPELINE_CACHE_ID:\n metadata[\"distilabel_pipeline_cache_id\"] = distilabel_pipeline_cache_id\n\n batch = None\n try:\n batch = self._client.batches.create(\n completion_window=\"24h\",\n endpoint=\"/v1/chat/completions\",\n input_file_id=batch_input_file.id,\n metadata=metadata,\n )\n except openai.OpenAIError as e:\n self._logger.error( # type: ignore\n f\"Error while creating OpenAI Batch API job for file with ID\"\n f\" '{batch_input_file.id}': {e}.\"\n )\n raise e\n return batch\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._create_batch_files","title":"_create_batch_files(inputs, **kwargs)
","text":"Creates the necessary input files for the batch API to generate responses. The maximum size of each file so the OpenAI Batch API can process it is 100MB, so we need to split the inputs into multiple files if necessary.
More information: https://platform.openai.com/docs/api-reference/files/create
Parameters:
Name Type Description Defaultinputs
List[FormattedInput]
a list of inputs in chat format to generate responses for, optionally including structured output.
requiredkwargs
Any
the keyword arguments to use for the generation.
{}
Returns:
Type DescriptionList[FileObject]
The list of file objects created for the OpenAI Batch API.
Raises:
Type DescriptionOpenAIError
if there was an error while creating the batch input file in the OpenAI Batch API.
Source code insrc/distilabel/models/llms/openai.py
def _create_batch_files(\n self, inputs: List[\"FormattedInput\"], **kwargs: Any\n) -> List[\"OpenAIFileObject\"]:\n \"\"\"Creates the necessary input files for the batch API to generate responses. The\n maximum size of each file so the OpenAI Batch API can process it is 100MB, so we\n need to split the inputs into multiple files if necessary.\n\n More information: https://platform.openai.com/docs/api-reference/files/create\n\n Args:\n inputs: a list of inputs in chat format to generate responses for, optionally\n including structured output.\n kwargs: the keyword arguments to use for the generation.\n\n Returns:\n The list of file objects created for the OpenAI Batch API.\n\n Raises:\n openai.OpenAIError: if there was an error while creating the batch input file\n in the OpenAI Batch API.\n \"\"\"\n import openai\n\n files = []\n for file_no, buffer in enumerate(\n self._create_jsonl_buffers(inputs=inputs, **kwargs)\n ):\n try:\n # TODO: add distilabel pipeline name and id\n batch_input_file = self._client.files.create(\n file=(self._name_for_openai_files(file_no), buffer),\n purpose=\"batch\",\n )\n files.append(batch_input_file)\n except openai.OpenAIError as e:\n self._logger.error( # type: ignore\n f\"Error while creating OpenAI batch input file: {e}\"\n )\n raise e\n return files\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._create_jsonl_buffers","title":"_create_jsonl_buffers(inputs, **kwargs)
","text":"Creates a generator of buffers containing the JSONL formatted inputs to be used by the OpenAI Batch API. The buffers created are of size 100MB or less.
Parameters:
Name Type Description Defaultinputs
List[FormattedInput]
a list of inputs in chat format to generate responses for, optionally including structured output.
requiredkwargs
Any
the keyword arguments to use for the generation.
{}
Yields:
Type DescriptionBytesIO
A buffer containing the JSONL formatted inputs to be used by the OpenAI Batch API.
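The splitting logic can be pictured with this standalone sketch (the function name and inputs are illustrative, not the distilabel API): rows are appended to an in-memory buffer until the next row would push it past the size cap, at which point the buffer is yielded and a new one is started.
```python
import io
from typing import Generator, Iterable

MAX_FILE_SIZE = 100 * 1024 * 1024  # 100MB limit accepted by the OpenAI Batch API

def split_rows_into_buffers(rows: Iterable[bytes]) -> Generator[io.BytesIO, None, None]:
    """Yields BytesIO buffers whose total size never exceeds MAX_FILE_SIZE."""
    buffer, size = io.BytesIO(), 0
    for row in rows:
        if size and size + len(row) > MAX_FILE_SIZE:
            buffer.seek(0)
            yield buffer
            buffer, size = io.BytesIO(), 0
        buffer.write(row)
        size += len(row)
    if size:
        buffer.seek(0)
        yield buffer
```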
Source code insrc/distilabel/models/llms/openai.py
def _create_jsonl_buffers(\n self, inputs: List[\"FormattedInput\"], **kwargs: Any\n) -> Generator[io.BytesIO, None, None]:\n \"\"\"Creates a generator of buffers containing the JSONL formatted inputs to be\n used by the OpenAI Batch API. The buffers created are of size 100MB or less.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for, optionally\n including structured output.\n kwargs: the keyword arguments to use for the generation.\n\n Yields:\n A buffer containing the JSONL formatted inputs to be used by the OpenAI Batch\n API.\n \"\"\"\n buffer = io.BytesIO()\n buffer_current_size = 0\n for i, input in enumerate(inputs):\n # We create the smallest `custom_id` so we don't increase the size of the file\n # to much, but we can still sort the results with the order of the inputs.\n row = self._create_jsonl_row(input=input, custom_id=str(i), **kwargs)\n row_size = len(row)\n if row_size + buffer_current_size > _OPENAI_BATCH_API_MAX_FILE_SIZE:\n buffer.seek(0)\n yield buffer\n buffer = io.BytesIO()\n buffer_current_size = 0\n buffer.write(row)\n buffer_current_size += row_size\n\n if buffer_current_size > 0:\n buffer.seek(0)\n yield buffer\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._create_jsonl_row","title":"_create_jsonl_row(input, custom_id, **kwargs)
","text":"Creates a JSONL formatted row to be used by the OpenAI Batch API.
Parameters:
Name Type Description Defaultinput
FormattedInput
a list of inputs in chat format to generate responses for, optionally including structured output.
requiredcustom_id
str
a custom ID to use for the row.
requiredkwargs
Any
the keyword arguments to use for the generation.
{}
Returns:
Type Descriptionbytes
A JSONL formatted row to be used by the OpenAI Batch API.
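For reference, the shape of a single row in the batch input file is one `/v1/chat/completions` request per line; the `model` and `messages` values below are placeholders:
```python
import json

row = {
    "custom_id": "0",            # smallest possible id, used later to re-sort the results
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "gpt-4o-mini",  # placeholder model name
        "messages": [{"role": "user", "content": "Hello world!"}],
    },
}
jsonl_line = json.dumps(row).encode() + b"\n"  # distilabel serializes with orjson
```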
Source code insrc/distilabel/models/llms/openai.py
def _create_jsonl_row(\n self, input: \"FormattedInput\", custom_id: str, **kwargs: Any\n) -> bytes:\n \"\"\"Creates a JSONL formatted row to be used by the OpenAI Batch API.\n\n Args:\n input: a list of inputs in chat format to generate responses for, optionally\n including structured output.\n custom_id: a custom ID to use for the row.\n kwargs: the keyword arguments to use for the generation.\n\n Returns:\n A JSONL formatted row to be used by the OpenAI Batch API.\n \"\"\"\n # TODO: depending on the format of the input, add `response_format` to the kwargs\n row = {\n \"custom_id\": custom_id,\n \"method\": \"POST\",\n \"url\": \"/v1/chat/completions\",\n \"body\": {\"messages\": input, **kwargs},\n }\n json_row = orjson.dumps(row)\n return json_row + b\"\\n\"\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TogetherLLM","title":"TogetherLLM
","text":" Bases: OpenAILLM
TogetherLLM LLM implementation running the async API client of OpenAI.
Attributes:
Name Type Descriptionmodel
str
the model name to use for the LLM e.g. \"mistralai/Mixtral-8x7B-Instruct-v0.1\". Supported models can be found here.
base_url
Optional[RuntimeParameter[str]]
the base URL to use for the Together API can be set with TOGETHER_BASE_URL
. Defaults to None
which means that the value set for the environment variable TOGETHER_BASE_URL
will be used, or \"https://api.together.xyz/v1\" if not set.
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the Together API. Defaults to None
which means that the value set for the environment variable TOGETHER_API_KEY
will be used, or None
if not set.
_api_key_env_var
str
the name of the environment variable to use for the API key. It is meant to be used internally.
Examples:
Generate text:
from distilabel.models.llms import TogetherLLM\n\nllm = TogetherLLM(model=\"mistralai/Mixtral-8x7B-Instruct-v0.1\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Source code in src/distilabel/models/llms/together.py
class TogetherLLM(OpenAILLM):\n \"\"\"TogetherLLM LLM implementation running the async API client of OpenAI.\n\n Attributes:\n model: the model name to use for the LLM e.g. \"mistralai/Mixtral-8x7B-Instruct-v0.1\".\n Supported models can be found [here](https://api.together.xyz/models).\n base_url: the base URL to use for the Together API can be set with `TOGETHER_BASE_URL`.\n Defaults to `None` which means that the value set for the environment variable\n `TOGETHER_BASE_URL` will be used, or \"https://api.together.xyz/v1\" if not set.\n api_key: the API key to authenticate the requests to the Together API. Defaults to `None`\n which means that the value set for the environment variable `TOGETHER_API_KEY` will be\n used, or `None` if not set.\n _api_key_env_var: the name of the environment variable to use for the API key. It\n is meant to be used internally.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import AnyscaleLLM\n\n llm = TogetherLLM(model=\"mistralai/Mixtral-8x7B-Instruct-v0.1\", api_key=\"api.key\")\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n \"\"\"\n\n base_url: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(\n \"TOGETHER_BASE_URL\", \"https://api.together.xyz/v1\"\n ),\n description=\"The base URL to use for the Together API requests.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_TOGETHER_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the Together API.\",\n )\n\n _api_key_env_var: str = PrivateAttr(_TOGETHER_API_KEY_ENV_VAR_NAME)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.VertexAILLM","title":"VertexAILLM
","text":" Bases: AsyncLLM
VertexAI LLM implementation running the async API clients for Gemini.
To use the VertexAILLM
it is necessary to have the Google Cloud authentication configured using one of these methods:
GOOGLE_CLOUD_CREDENTIALS
environment variablegcloud auth application-default login
commandvertexai.init
function from the google-cloud-aiplatform
libraryAttributes:
Name Type Descriptionmodel
str
the model name to use for the LLM e.g. \"gemini-1.0-pro\". Supported models.
_aclient
Optional[GenerativeModel]
the GenerativeModel
to use for the Vertex AI Gemini API. It is meant to be used internally. Set in the load
method.
Examples:
Generate text:
from distilabel.models.llms import VertexAILLM\n\nllm = VertexAILLM(model=\"gemini-1.5-pro\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
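The example above assumes Google Cloud authentication is already configured. If the `vertexai.init` route listed earlier is used, a minimal sketch could look like this (the project id and region are placeholders):
```python
import vertexai

# Initialize the Vertex AI SDK before loading the LLM
vertexai.init(project="my-gcp-project", location="us-central1")
```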
Source code in src/distilabel/models/llms/vertexai.py
class VertexAILLM(AsyncLLM):\n \"\"\"VertexAI LLM implementation running the async API clients for Gemini.\n\n - Gemini API: https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini\n\n To use the `VertexAILLM` is necessary to have configured the Google Cloud authentication\n using one of these methods:\n\n - Setting `GOOGLE_CLOUD_CREDENTIALS` environment variable\n - Using `gcloud auth application-default login` command\n - Using `vertexai.init` function from the `google-cloud-aiplatform` library\n\n Attributes:\n model: the model name to use for the LLM e.g. \"gemini-1.0-pro\". [Supported models](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models).\n _aclient: the `GenerativeModel` to use for the Vertex AI Gemini API. It is meant\n to be used internally. Set in the `load` method.\n\n Icon:\n `:simple-googlecloud:`\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import VertexAILLM\n\n llm = VertexAILLM(model=\"gemini-1.5-pro\")\n\n llm.load()\n\n # Call the model\n output = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n \"\"\"\n\n model: str\n\n _num_generations_param_supported = False\n\n _aclient: Optional[\"GenerativeModel\"] = PrivateAttr(...)\n\n def load(self) -> None:\n \"\"\"Loads the `GenerativeModel` class which has access to `generate_content_async` to benefit from async requests.\"\"\"\n super().load()\n\n try:\n from vertexai.generative_models import GenerationConfig, GenerativeModel\n\n self._generation_config_class = GenerationConfig\n except ImportError as e:\n raise ImportError(\n \"vertexai is not installed. Please install it using\"\n \" `pip install google-cloud-aiplatform`.\"\n ) from e\n\n if _is_gemini_model(self.model):\n self._aclient = GenerativeModel(model_name=self.model)\n else:\n raise NotImplementedError(\n \"`VertexAILLM` is only implemented for `gemini` models that allow for `ChatType` data.\"\n )\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n def _chattype_to_content(self, input: \"StandardInput\") -> List[\"Content\"]:\n \"\"\"Converts a chat type to a list of content items expected by the API.\n\n Args:\n input: the chat type to be converted.\n\n Returns:\n List[str]: a list of content items expected by the API.\n \"\"\"\n from vertexai.generative_models import Content, Part\n\n contents = []\n for message in input:\n if message[\"role\"] not in [\"user\", \"model\"]:\n raise ValueError(\n \"`VertexAILLM only supports the roles 'user' or 'model'.\"\n )\n contents.append(\n Content(\n role=message[\"role\"], parts=[Part.from_text(message[\"content\"])]\n )\n )\n return contents\n\n @validate_call\n async def agenerate( # type: ignore\n self,\n input: VertexChatType,\n temperature: Optional[float] = None,\n top_p: Optional[float] = None,\n top_k: Optional[int] = None,\n max_output_tokens: Optional[int] = None,\n stop_sequences: Optional[List[str]] = None,\n safety_settings: Optional[Dict[str, Any]] = None,\n tools: Optional[List[Dict[str, Any]]] = None,\n ) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for the given input using the [VertexAI async client definition](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini).\n\n Args:\n input: a single input in chat format to generate responses for.\n temperature: Controls the randomness of predictions. Range: [0.0, 1.0]. 
Defaults to `None`.\n top_p: If specified, nucleus sampling will be used. Range: (0.0, 1.0]. Defaults to `None`.\n top_k: If specified, top-k sampling will be used. Defaults to `None`.\n max_output_tokens: The maximum number of output tokens to generate per message. Defaults to `None`.\n stop_sequences: A list of stop sequences. Defaults to `None`.\n safety_settings: Safety configuration for returned content from the API. Defaults to `None`.\n tools: A potential list of tools that can be used by the API. Defaults to `None`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n from vertexai.generative_models import GenerationConfig\n\n content: \"GenerationResponse\" = await self._aclient.generate_content_async( # type: ignore\n contents=self._chattype_to_content(input),\n generation_config=GenerationConfig(\n candidate_count=1, # only one candidate allowed per call\n temperature=temperature,\n top_k=top_k,\n top_p=top_p,\n max_output_tokens=max_output_tokens,\n stop_sequences=stop_sequences,\n ),\n safety_settings=safety_settings, # type: ignore\n tools=tools, # type: ignore\n stream=False,\n )\n\n text = None\n try:\n text = content.candidates[0].text\n except ValueError:\n self._logger.warning( # type: ignore\n f\"Received no response using VertexAI client (model: '{self.model}').\"\n f\" Finish reason was: '{content.candidates[0].finish_reason}'.\"\n )\n return prepare_output([text], **self._get_llm_statistics(content))\n\n @staticmethod\n def _get_llm_statistics(content: \"GenerationResponse\") -> \"LLMStatistics\":\n return {\n \"input_tokens\": [content.usage_metadata.prompt_token_count],\n \"output_tokens\": [content.usage_metadata.candidates_token_count],\n }\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.VertexAILLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.VertexAILLM.load","title":"load()
","text":"Loads the GenerativeModel
class which has access to generate_content_async
to benefit from async requests.
src/distilabel/models/llms/vertexai.py
def load(self) -> None:\n \"\"\"Loads the `GenerativeModel` class which has access to `generate_content_async` to benefit from async requests.\"\"\"\n super().load()\n\n try:\n from vertexai.generative_models import GenerationConfig, GenerativeModel\n\n self._generation_config_class = GenerationConfig\n except ImportError as e:\n raise ImportError(\n \"vertexai is not installed. Please install it using\"\n \" `pip install google-cloud-aiplatform`.\"\n ) from e\n\n if _is_gemini_model(self.model):\n self._aclient = GenerativeModel(model_name=self.model)\n else:\n raise NotImplementedError(\n \"`VertexAILLM` is only implemented for `gemini` models that allow for `ChatType` data.\"\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.VertexAILLM._chattype_to_content","title":"_chattype_to_content(input)
","text":"Converts a chat type to a list of content items expected by the API.
Parameters:
Name Type Description Defaultinput
StandardInput
the chat type to be converted.
requiredReturns:
Type DescriptionList[Content]
a list of Content items expected by the API.
Source code insrc/distilabel/models/llms/vertexai.py
def _chattype_to_content(self, input: \"StandardInput\") -> List[\"Content\"]:\n \"\"\"Converts a chat type to a list of content items expected by the API.\n\n Args:\n input: the chat type to be converted.\n\n Returns:\n List[str]: a list of content items expected by the API.\n \"\"\"\n from vertexai.generative_models import Content, Part\n\n contents = []\n for message in input:\n if message[\"role\"] not in [\"user\", \"model\"]:\n raise ValueError(\n \"`VertexAILLM only supports the roles 'user' or 'model'.\"\n )\n contents.append(\n Content(\n role=message[\"role\"], parts=[Part.from_text(message[\"content\"])]\n )\n )\n return contents\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.VertexAILLM.agenerate","title":"agenerate(input, temperature=None, top_p=None, top_k=None, max_output_tokens=None, stop_sequences=None, safety_settings=None, tools=None)
async
","text":"Generates num_generations
responses for the given input using the VertexAI async client definition.
Parameters:
Name Type Description Defaultinput
VertexChatType
a single input in chat format to generate responses for.
requiredtemperature
Optional[float]
Controls the randomness of predictions. Range: [0.0, 1.0]. Defaults to None
.
None
top_p
Optional[float]
If specified, nucleus sampling will be used. Range: (0.0, 1.0]. Defaults to None
.
None
top_k
Optional[int]
If specified, top-k sampling will be used. Defaults to None
.
None
max_output_tokens
Optional[int]
The maximum number of output tokens to generate per message. Defaults to None
.
None
stop_sequences
Optional[List[str]]
A list of stop sequences. Defaults to None
.
None
safety_settings
Optional[Dict[str, Any]]
Safety configuration for returned content from the API. Defaults to None
.
None
tools
Optional[List[Dict[str, Any]]]
A potential list of tools that can be used by the API. Defaults to None
.
None
Returns:
Type DescriptionGenerateOutput
A list of lists of strings containing the generated responses for each input.
Source code insrc/distilabel/models/llms/vertexai.py
@validate_call\nasync def agenerate( # type: ignore\n self,\n input: VertexChatType,\n temperature: Optional[float] = None,\n top_p: Optional[float] = None,\n top_k: Optional[int] = None,\n max_output_tokens: Optional[int] = None,\n stop_sequences: Optional[List[str]] = None,\n safety_settings: Optional[Dict[str, Any]] = None,\n tools: Optional[List[Dict[str, Any]]] = None,\n) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for the given input using the [VertexAI async client definition](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini).\n\n Args:\n input: a single input in chat format to generate responses for.\n temperature: Controls the randomness of predictions. Range: [0.0, 1.0]. Defaults to `None`.\n top_p: If specified, nucleus sampling will be used. Range: (0.0, 1.0]. Defaults to `None`.\n top_k: If specified, top-k sampling will be used. Defaults to `None`.\n max_output_tokens: The maximum number of output tokens to generate per message. Defaults to `None`.\n stop_sequences: A list of stop sequences. Defaults to `None`.\n safety_settings: Safety configuration for returned content from the API. Defaults to `None`.\n tools: A potential list of tools that can be used by the API. Defaults to `None`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n from vertexai.generative_models import GenerationConfig\n\n content: \"GenerationResponse\" = await self._aclient.generate_content_async( # type: ignore\n contents=self._chattype_to_content(input),\n generation_config=GenerationConfig(\n candidate_count=1, # only one candidate allowed per call\n temperature=temperature,\n top_k=top_k,\n top_p=top_p,\n max_output_tokens=max_output_tokens,\n stop_sequences=stop_sequences,\n ),\n safety_settings=safety_settings, # type: ignore\n tools=tools, # type: ignore\n stream=False,\n )\n\n text = None\n try:\n text = content.candidates[0].text\n except ValueError:\n self._logger.warning( # type: ignore\n f\"Received no response using VertexAI client (model: '{self.model}').\"\n f\" Finish reason was: '{content.candidates[0].finish_reason}'.\"\n )\n return prepare_output([text], **self._get_llm_statistics(content))\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.ClientvLLM","title":"ClientvLLM
","text":" Bases: OpenAILLM
, MagpieChatTemplateMixin
A client for the vLLM
server implementing the OpenAI API specification.
Attributes:
Name Type Descriptionbase_url
Optional[RuntimeParameter[str]]
the base URL of the vLLM
server. Defaults to \"http://localhost:8000\"
.
max_retries
RuntimeParameter[int]
the maximum number of times to retry the request to the API before failing. Defaults to 6
.
timeout
RuntimeParameter[int]
the maximum time in seconds to wait for a response from the API. Defaults to 120
.
httpx_client_kwargs
RuntimeParameter[int]
extra kwargs that will be passed to the httpx.AsyncClient
created to communicate with the vLLM
server. Defaults to None
.
tokenizer
Optional[str]
the Hugging Face Hub repo id or path of the tokenizer that will be used to apply the chat template and tokenize the inputs before sending it to the server. Defaults to None
.
tokenizer_revision
Optional[str]
the revision of the tokenizer to load. Defaults to None
.
_aclient
AsyncOpenAI
the httpx.AsyncClient
used to communicate with the vLLM
server. Defaults to None
.
base_url
: the base url of the vLLM
server. Defaults to \"http://localhost:8000\"
.max_retries
: the maximum number of times to retry the request to the API before failing. Defaults to 6
.timeout
: the maximum time in seconds to wait for a response from the API. Defaults to 120
.httpx_client_kwargs
: extra kwargs that will be passed to the httpx.AsyncClient
created to communicate with the vLLM
server. Defaults to None
.Examples:
Generate text:
from distilabel.models.llms import ClientvLLM\n\nllm = ClientvLLM(\n base_url=\"http://localhost:8000/v1\",\n tokenizer=\"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n)\n\nllm.load()\n\nresults = llm.generate_outputs(\n inputs=[[{\"role\": \"user\", \"content\": \"Hello, how are you?\"}]],\n temperature=0.7,\n top_p=1.0,\n max_new_tokens=256,\n)\n# [\n# [\n# \"I'm functioning properly, thank you for asking. How can I assist you today?\",\n# \"I'm doing well, thank you for asking. I'm a large language model, so I don't have feelings or emotions like humans do, but I'm here to help answer any questions or provide information you might need. How can I assist you today?\",\n# \"I'm just a computer program, so I don't have feelings like humans do, but I'm functioning properly and ready to help you with any questions or tasks you have. What's on your mind?\"\n# ]\n# ]\n
Source code in src/distilabel/models/llms/vllm.py
class ClientvLLM(OpenAILLM, MagpieChatTemplateMixin):\n \"\"\"A client for the `vLLM` server implementing the OpenAI API specification.\n\n Attributes:\n base_url: the base URL of the `vLLM` server. Defaults to `\"http://localhost:8000\"`.\n max_retries: the maximum number of times to retry the request to the API before\n failing. Defaults to `6`.\n timeout: the maximum time in seconds to wait for a response from the API. Defaults\n to `120`.\n httpx_client_kwargs: extra kwargs that will be passed to the `httpx.AsyncClient`\n created to comunicate with the `vLLM` server. Defaults to `None`.\n tokenizer: the Hugging Face Hub repo id or path of the tokenizer that will be used\n to apply the chat template and tokenize the inputs before sending it to the\n server. Defaults to `None`.\n tokenizer_revision: the revision of the tokenizer to load. Defaults to `None`.\n _aclient: the `httpx.AsyncClient` used to comunicate with the `vLLM` server. Defaults\n to `None`.\n\n Runtime parameters:\n - `base_url`: the base url of the `vLLM` server. Defaults to `\"http://localhost:8000\"`.\n - `max_retries`: the maximum number of times to retry the request to the API before\n failing. Defaults to `6`.\n - `timeout`: the maximum time in seconds to wait for a response from the API. Defaults\n to `120`.\n - `httpx_client_kwargs`: extra kwargs that will be passed to the `httpx.AsyncClient`\n created to comunicate with the `vLLM` server. Defaults to `None`.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import ClientvLLM\n\n llm = ClientvLLM(\n base_url=\"http://localhost:8000/v1\",\n tokenizer=\"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n )\n\n llm.load()\n\n results = llm.generate_outputs(\n inputs=[[{\"role\": \"user\", \"content\": \"Hello, how are you?\"}]],\n temperature=0.7,\n top_p=1.0,\n max_new_tokens=256,\n )\n # [\n # [\n # \"I'm functioning properly, thank you for asking. How can I assist you today?\",\n # \"I'm doing well, thank you for asking. I'm a large language model, so I don't have feelings or emotions like humans do, but I'm here to help answer any questions or provide information you might need. How can I assist you today?\",\n # \"I'm just a computer program, so I don't have feelings like humans do, but I'm functioning properly and ready to help you with any questions or tasks you have. What's on your mind?\"\n # ]\n # ]\n ```\n \"\"\"\n\n model: str = \"\" # Default value so it's not needed to `ClientvLLM(model=\"...\")`\n tokenizer: Optional[str] = None\n tokenizer_revision: Optional[str] = None\n\n # We need the sync client to get the list of models\n _client: \"OpenAI\" = PrivateAttr(None)\n _tokenizer: \"PreTrainedTokenizer\" = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Creates an `httpx.AsyncClient` to connect to the vLLM server and a tokenizer\n optionally.\"\"\"\n\n self.api_key = SecretStr(\"EMPTY\")\n\n # We need to first create the sync client to get the model name that will be used\n # in the `super().load()` when creating the logger.\n try:\n from openai import OpenAI\n except ImportError as ie:\n raise ImportError(\n \"OpenAI Python client is not installed. 
Please install it using\"\n \" `pip install openai`.\"\n ) from ie\n\n self._client = OpenAI(\n base_url=self.base_url,\n api_key=self.api_key.get_secret_value(), # type: ignore\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n super().load()\n\n try:\n from transformers import AutoTokenizer\n except ImportError as ie:\n raise ImportError(\n \"To use `ClientvLLM` you need to install `transformers`.\"\n \"Please install it using `pip install transformers`.\"\n ) from ie\n\n self._tokenizer = AutoTokenizer.from_pretrained(\n self.tokenizer, revision=self.tokenizer_revision\n )\n\n @cached_property\n def model_name(self) -> str: # type: ignore\n \"\"\"Returns the name of the model served with vLLM server.\"\"\"\n models = self._client.models.list()\n return models.data[0].id\n\n def _prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n prompt: str = (\n self._tokenizer.apply_chat_template( # type: ignore\n input, # type: ignore\n tokenize=False,\n add_generation_prompt=True, # type: ignore\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n\n @validate_call\n async def agenerate( # type: ignore\n self,\n input: FormattedInput,\n num_generations: int = 1,\n max_new_tokens: int = 128,\n frequency_penalty: float = 0.0,\n logit_bias: Optional[Dict[str, int]] = None,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n ) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for each input.\n\n Args:\n input: a single input in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n logit_bias: modify the likelihood of specified tokens appearing in the completion.\n Defaults to ``\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: nucleus sampling. The value refers to the top-p tokens that should be\n considered for sampling. Defaults to `1.0`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n\n completion = await self._aclient.completions.create(\n model=self.model_name,\n prompt=self._prepare_input(input), # type: ignore\n n=num_generations,\n max_tokens=max_new_tokens,\n frequency_penalty=frequency_penalty,\n logit_bias=logit_bias,\n presence_penalty=presence_penalty,\n temperature=temperature,\n top_p=top_p,\n )\n\n generations = []\n for choice in completion.choices:\n text = choice.text\n if text == \"\":\n self._logger.warning( # type: ignore\n f\"Received no response from vLLM server (model: '{self.model_name}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(text)\n\n return prepare_output(generations, **self._get_llm_statistics(completion))\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.ClientvLLM.model_name","title":"model_name
cached
property
","text":"Returns the name of the model served with vLLM server.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.ClientvLLM.load","title":"load()
","text":"Creates an httpx.AsyncClient
to connect to the vLLM server and, optionally, a tokenizer.
src/distilabel/models/llms/vllm.py
def load(self) -> None:\n \"\"\"Creates an `httpx.AsyncClient` to connect to the vLLM server and a tokenizer\n optionally.\"\"\"\n\n self.api_key = SecretStr(\"EMPTY\")\n\n # We need to first create the sync client to get the model name that will be used\n # in the `super().load()` when creating the logger.\n try:\n from openai import OpenAI\n except ImportError as ie:\n raise ImportError(\n \"OpenAI Python client is not installed. Please install it using\"\n \" `pip install openai`.\"\n ) from ie\n\n self._client = OpenAI(\n base_url=self.base_url,\n api_key=self.api_key.get_secret_value(), # type: ignore\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n super().load()\n\n try:\n from transformers import AutoTokenizer\n except ImportError as ie:\n raise ImportError(\n \"To use `ClientvLLM` you need to install `transformers`.\"\n \"Please install it using `pip install transformers`.\"\n ) from ie\n\n self._tokenizer = AutoTokenizer.from_pretrained(\n self.tokenizer, revision=self.tokenizer_revision\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.ClientvLLM._prepare_input","title":"_prepare_input(input)
","text":"Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name Type Description Defaultinput
StandardInput
the input list containing chat items.
requiredReturns:
Type Descriptionstr
The prompt to send to the LLM.
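A standalone sketch of what the chat-template step amounts to with `transformers` (the tokenizer id is the one used in the class example above; this is an illustration, not the method itself):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
messages = [{"role": "user", "content": "Hello, how are you?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return the formatted prompt string, not token ids
    add_generation_prompt=True,  # append the assistant header so the model continues
)
```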
Source code insrc/distilabel/models/llms/vllm.py
def _prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n prompt: str = (\n self._tokenizer.apply_chat_template( # type: ignore\n input, # type: ignore\n tokenize=False,\n add_generation_prompt=True, # type: ignore\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.ClientvLLM.agenerate","title":"agenerate(input, num_generations=1, max_new_tokens=128, frequency_penalty=0.0, logit_bias=None, presence_penalty=0.0, temperature=1.0, top_p=1.0)
async
","text":"Generates num_generations
responses for each input.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requirednum_generations
int
the number of generations to create per input. Defaults to 1
.
1
max_new_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
frequency_penalty
float
the repetition penalty to use for the generation. Defaults to 0.0
.
0.0
logit_bias
Optional[Dict[str, int]]
modify the likelihood of specified tokens appearing in the completion. Defaults to None.
None
presence_penalty
float
the presence penalty to use for the generation. Defaults to 0.0
.
0.0
temperature
float
the temperature to use for the generation. Defaults to 1.0
.
1.0
top_p
float
nucleus sampling. The value refers to the top-p tokens that should be considered for sampling. Defaults to 1.0
.
1.0
Returns:
Type DescriptionGenerateOutput
A list of lists of strings containing the generated responses for each input.
Source code insrc/distilabel/models/llms/vllm.py
@validate_call\nasync def agenerate( # type: ignore\n self,\n input: FormattedInput,\n num_generations: int = 1,\n max_new_tokens: int = 128,\n frequency_penalty: float = 0.0,\n logit_bias: Optional[Dict[str, int]] = None,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for each input.\n\n Args:\n input: a single input in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n logit_bias: modify the likelihood of specified tokens appearing in the completion.\n Defaults to ``\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: nucleus sampling. The value refers to the top-p tokens that should be\n considered for sampling. Defaults to `1.0`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n\n completion = await self._aclient.completions.create(\n model=self.model_name,\n prompt=self._prepare_input(input), # type: ignore\n n=num_generations,\n max_tokens=max_new_tokens,\n frequency_penalty=frequency_penalty,\n logit_bias=logit_bias,\n presence_penalty=presence_penalty,\n temperature=temperature,\n top_p=top_p,\n )\n\n generations = []\n for choice in completion.choices:\n text = choice.text\n if text == \"\":\n self._logger.warning( # type: ignore\n f\"Received no response from vLLM server (model: '{self.model_name}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(text)\n\n return prepare_output(generations, **self._get_llm_statistics(completion))\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.vLLM","title":"vLLM
","text":" Bases: LLM
, MagpieChatTemplateMixin
, CudaDevicePlacementMixin
vLLM
library LLM implementation.
Attributes:
Name Type Descriptionmodel
str
the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
dtype
str
the data type to use for the model. Defaults to auto
.
trust_remote_code
bool
whether to trust the remote code when loading the model. Defaults to False
.
quantization
Optional[str]
the quantization mode to use for the model. Defaults to None
.
revision
Optional[str]
the revision of the model to load. Defaults to None
.
tokenizer
Optional[str]
the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer files. If not provided, the tokenizer will be loaded from the model directory. Defaults to None
.
tokenizer_mode
Literal['auto', 'slow']
the mode to use for the tokenizer. Defaults to auto
.
tokenizer_revision
Optional[str]
the revision of the tokenizer to load. Defaults to None
.
skip_tokenizer_init
bool
whether to skip the initialization of the tokenizer. Defaults to False
.
chat_template
Optional[str]
a chat template that will be used to build the prompts before sending them to the model. If not provided, the chat template defined in the tokenizer config will be used. If not provided and the tokenizer doesn't have a chat template, then ChatML template will be used. Defaults to None
.
structured_output
Optional[RuntimeParameter[OutlinesStructuredOutputType]]
a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
seed
int
the seed to use for the random number generator. Defaults to 0
.
extra_kwargs
Optional[RuntimeParameter[Dict[str, Any]]]
additional dictionary of keyword arguments that will be passed to the LLM
class of vllm
library. Defaults to {}
.
_model
LLM
the vLLM
model instance. This attribute is meant to be used internally and should not be accessed directly. It will be set in the load
method.
_tokenizer
PreTrainedTokenizer
the tokenizer instance used to format the prompt before passing it to the LLM
. This attribute is meant to be used internally and should not be accessed directly. It will be set in the load
method.
use_magpie_template
bool
a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template
Union[MagpieAvailablePreQueryTemplates, str, None]
the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
extra_kwargs
: additional dictionary of keyword arguments that will be passed to the LLM
class of vllm
library.Examples:
Generate text:
from distilabel.models.llms import vLLM\n\n# You can pass a custom chat_template to the model\nllm = vLLM(\n    model=\"prometheus-eval/prometheus-7b-v2.0\",\n    chat_template=\"[INST] {{ messages[0]['content'] }}\\n{{ messages[1]['content'] }}[/INST]\",\n)\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pydantic import BaseModel\nfrom distilabel.models.llms import vLLM\n\nclass User(BaseModel):\n    name: str\n    last_name: str\n    id: int\n\nllm = vLLM(\n    model=\"prometheus-eval/prometheus-7b-v2.0\",\n    structured_output={\"format\": \"json\", \"schema\": User},\n)\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
Source code in src/distilabel/models/llms/vllm.py
class vLLM(LLM, MagpieChatTemplateMixin, CudaDevicePlacementMixin):\n \"\"\"`vLLM` library LLM implementation.\n\n Attributes:\n model: the model Hugging Face Hub repo id or a path to a directory containing the\n model weights and configuration files.\n dtype: the data type to use for the model. Defaults to `auto`.\n trust_remote_code: whether to trust the remote code when loading the model. Defaults\n to `False`.\n quantization: the quantization mode to use for the model. Defaults to `None`.\n revision: the revision of the model to load. Defaults to `None`.\n tokenizer: the tokenizer Hugging Face Hub repo id or a path to a directory containing\n the tokenizer files. If not provided, the tokenizer will be loaded from the\n model directory. Defaults to `None`.\n tokenizer_mode: the mode to use for the tokenizer. Defaults to `auto`.\n tokenizer_revision: the revision of the tokenizer to load. Defaults to `None`.\n skip_tokenizer_init: whether to skip the initialization of the tokenizer. Defaults\n to `False`.\n chat_template: a chat template that will be used to build the prompts before\n sending them to the model. If not provided, the chat template defined in the\n tokenizer config will be used. If not provided and the tokenizer doesn't have\n a chat template, then ChatML template will be used. Defaults to `None`.\n structured_output: a dictionary containing the structured output configuration or if more\n fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.\n seed: the seed to use for the random number generator. Defaults to `0`.\n extra_kwargs: additional dictionary of keyword arguments that will be passed to the\n `LLM` class of `vllm` library. Defaults to `{}`.\n _model: the `vLLM` model instance. This attribute is meant to be used internally\n and should not be accessed directly. It will be set in the `load` method.\n _tokenizer: the tokenizer instance used to format the prompt before passing it to\n the `LLM`. This attribute is meant to be used internally and should not be\n accessed directly. It will be set in the `load` method.\n use_magpie_template: a flag used to enable/disable applying the Magpie pre-query\n template. Defaults to `False`.\n magpie_pre_query_template: the pre-query template to be applied to the prompt or\n sent to the LLM to generate an instruction or a follow up user message. Valid\n values are \"llama3\", \"qwen2\" or another pre-query template provided. 
Defaults\n to `None`.\n\n References:\n - https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py\n\n Runtime parameters:\n - `extra_kwargs`: additional dictionary of keyword arguments that will be passed to\n the `LLM` class of `vllm` library.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import vLLM\n\n # You can pass a custom chat_template to the model\n llm = vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\\\"content\\\" }}\\\\n{{ messages[1]\\\"content\\\" }}[/INST]\",\n )\n\n llm.load()\n\n # Call the model\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Generate structured data:\n\n ```python\n from pathlib import Path\n from distilabel.models.llms import vLLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\"\n structured_output={\"format\": \"json\", \"schema\": Character},\n )\n\n llm.load()\n\n # Call the model\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n \"\"\"\n\n model: str\n dtype: str = \"auto\"\n trust_remote_code: bool = False\n quantization: Optional[str] = None\n revision: Optional[str] = None\n\n tokenizer: Optional[str] = None\n tokenizer_mode: Literal[\"auto\", \"slow\"] = \"auto\"\n tokenizer_revision: Optional[str] = None\n skip_tokenizer_init: bool = False\n chat_template: Optional[str] = None\n\n seed: int = 0\n\n extra_kwargs: Optional[RuntimeParameter[Dict[str, Any]]] = Field(\n default_factory=dict,\n description=\"Additional dictionary of keyword arguments that will be passed to the\"\n \" `vLLM` class of `vllm` library. See all the supported arguments at: \"\n \"https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py\",\n )\n structured_output: Optional[RuntimeParameter[OutlinesStructuredOutputType]] = Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n\n _model: \"_vLLM\" = PrivateAttr(None)\n _tokenizer: \"PreTrainedTokenizer\" = PrivateAttr(None)\n _structured_output_logits_processor: Optional[Callable] = PrivateAttr(default=None)\n\n def load(self) -> None:\n \"\"\"Loads the `vLLM` model using either the path or the Hugging Face Hub repository id.\n Additionally, this method also sets the `chat_template` for the tokenizer, so as to properly\n parse the list of OpenAI formatted inputs using the expected format by the model, otherwise, the\n default value is ChatML format, unless explicitly provided.\n \"\"\"\n super().load()\n\n CudaDevicePlacementMixin.load(self)\n\n try:\n from vllm import LLM as _vLLM\n except ImportError as ie:\n raise ImportError(\n \"vLLM is not installed. 
Please install it using `pip install vllm`.\"\n ) from ie\n\n self._model = _vLLM(\n self.model,\n dtype=self.dtype,\n trust_remote_code=self.trust_remote_code,\n quantization=self.quantization,\n revision=self.revision,\n tokenizer=self.tokenizer,\n tokenizer_mode=self.tokenizer_mode,\n tokenizer_revision=self.tokenizer_revision,\n skip_tokenizer_init=self.skip_tokenizer_init,\n seed=self.seed,\n **self.extra_kwargs, # type: ignore\n )\n\n self._tokenizer = self._model.get_tokenizer() # type: ignore\n if self.chat_template is not None:\n self._tokenizer.chat_template = self.chat_template # type: ignore\n\n if self.structured_output:\n self._structured_output_logits_processor = self._prepare_structured_output(\n self.structured_output\n )\n\n def unload(self) -> None:\n \"\"\"Unloads the `vLLM` model.\"\"\"\n self._cleanup_vllm_model()\n self._model = None # type: ignore\n self._tokenizer = None # type: ignore\n CudaDevicePlacementMixin.unload(self)\n super().unload()\n\n def _cleanup_vllm_model(self) -> None:\n if self._model is None:\n return\n\n import torch # noqa\n from vllm.distributed.parallel_state import (\n destroy_distributed_environment,\n destroy_model_parallel,\n )\n\n destroy_model_parallel()\n destroy_distributed_environment()\n del self._model.llm_engine.model_executor\n del self._model\n with contextlib.suppress(AssertionError):\n torch.distributed.destroy_process_group()\n gc.collect()\n if torch.cuda.is_available():\n torch.cuda.empty_cache()\n torch.cuda.synchronize()\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n if self._tokenizer.chat_template is None:\n return [item[\"content\"] for item in input if item[\"role\"] == \"user\"][0]\n\n prompt: str = (\n self._tokenizer.apply_chat_template(\n input, # type: ignore\n tokenize=False,\n add_generation_prompt=True, # type: ignore\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n\n def _prepare_batches(\n self, inputs: List[\"StructuredInput\"]\n ) -> Tuple[List[Tuple[List[str], \"OutlinesStructuredOutputType\"]], List[int]]:\n \"\"\"Prepares the inputs by grouping them by the structured output.\n\n When we generate structured outputs with schemas obtained from a dataset, we need to\n prepare the data to try to send batches of inputs instead of single inputs to the model\n to take advante of the engine. So we group the inputs by the structured output to be\n passed in the `generate` method.\n\n Args:\n inputs: The batch of inputs passed to the generate method. 
As we expect to be generating\n structured outputs, each element will be a tuple containing the instruction and the\n structured output.\n\n Returns:\n The prepared batches (sub-batches let's say) to be passed to the `generate` method.\n Each new tuple will contain instead of the single instruction, a list of instructions\n \"\"\"\n instruction_order = {}\n batches: Dict[str, List[str]] = {}\n for i, (instruction, structured_output) in enumerate(inputs):\n instruction = self.prepare_input(instruction)\n instruction_order[instruction] = i\n\n structured_output = json.dumps(structured_output)\n if structured_output not in batches:\n batches[structured_output] = [instruction]\n else:\n batches[structured_output].append(instruction)\n\n # Built a list with instructions sorted by structured output\n flat_instructions = [\n instruction for _, group in batches.items() for instruction in group\n ]\n\n # Generate the list of indices based on the original order\n sorted_indices = [\n instruction_order[instruction] for instruction in flat_instructions\n ]\n\n return [\n (batch, json.loads(schema)) for schema, batch in batches.items()\n ], sorted_indices\n\n @validate_call\n def generate( # noqa: C901 # type: ignore\n self,\n inputs: List[FormattedInput],\n num_generations: int = 1,\n max_new_tokens: int = 128,\n presence_penalty: float = 0.0,\n frequency_penalty: float = 0.0,\n repetition_penalty: float = 1.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n top_k: int = -1,\n min_p: float = 0.0,\n logprobs: Optional[PositiveInt] = None,\n stop: Optional[List[str]] = None,\n stop_token_ids: Optional[List[int]] = None,\n include_stop_str_in_output: bool = False,\n logits_processors: Optional[LogitsProcessors] = None,\n extra_sampling_params: Optional[Dict[str, Any]] = None,\n ) -> List[GenerateOutput]:\n \"\"\"Generates `num_generations` responses for each input.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n repetition_penalty: the repetition penalty to use for the generation Defaults to\n `1.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n top_k: the top-k value to use for the generation. Defaults to `0`.\n min_p: the minimum probability to use for the generation. Defaults to `0.0`.\n logprobs: number of log probabilities to return per output token. If `None`,\n then no log probability won't be returned. Defaults to `None`.\n stop: a list of strings that will be used to stop the generation when found.\n Defaults to `None`.\n stop_token_ids: a list of token ids that will be used to stop the generation\n when found. 
Defaults to `None`.\n include_stop_str_in_output: whether to include the stop string in the output.\n Defaults to `False`.\n logits_processors: a list of functions to process the logits before sampling.\n Defaults to `None`.\n extra_sampling_params: dictionary with additional arguments to be passed to\n the `SamplingParams` class from `vllm`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n from vllm import SamplingParams\n\n if not logits_processors:\n logits_processors = []\n\n if extra_sampling_params is None:\n extra_sampling_params = {}\n\n structured_output = None\n\n if isinstance(inputs[0], tuple):\n # Prepare the batches for structured generation\n prepared_batches, sorted_indices = self._prepare_batches(inputs) # type: ignore\n else:\n # Simulate a batch without the structured output content\n prepared_batches = [([self.prepare_input(input) for input in inputs], None)] # type: ignore\n sorted_indices = None\n\n # Case in which we have a single structured output for the dataset\n if self._structured_output_logits_processor:\n logits_processors.append(self._structured_output_logits_processor)\n\n batched_outputs: List[\"LLMOutput\"] = []\n generations = []\n\n for prepared_inputs, structured_output in prepared_batches:\n if self.structured_output is not None and structured_output is not None:\n # TODO: warning\n pass\n\n if structured_output is not None:\n logits_processors.append(\n self._prepare_structured_output(structured_output) # type: ignore\n )\n\n sampling_params = SamplingParams( # type: ignore\n n=num_generations,\n presence_penalty=presence_penalty,\n frequency_penalty=frequency_penalty,\n repetition_penalty=repetition_penalty,\n temperature=temperature,\n top_p=top_p,\n top_k=top_k,\n min_p=min_p,\n max_tokens=max_new_tokens,\n logprobs=logprobs,\n stop=stop,\n stop_token_ids=stop_token_ids,\n include_stop_str_in_output=include_stop_str_in_output,\n logits_processors=logits_processors,\n **extra_sampling_params,\n )\n\n batch_outputs: List[\"RequestOutput\"] = self._model.generate(\n prompts=prepared_inputs,\n sampling_params=sampling_params,\n use_tqdm=False,\n )\n\n # Remove structured output logit processor to avoid stacking structured output\n # logits processors that leads to non-sense generations\n if structured_output is not None:\n logits_processors.pop(-1)\n\n for input, outputs in zip(prepared_inputs, batch_outputs):\n texts, statistics, outputs_logprobs = self._process_outputs(\n input, outputs\n )\n batched_outputs.append(texts)\n generations.append(\n prepare_output(\n generations=texts,\n input_tokens=statistics[\"input_tokens\"],\n output_tokens=statistics[\"output_tokens\"],\n logprobs=outputs_logprobs,\n )\n )\n\n if sorted_indices is not None:\n pairs = list(enumerate(sorted_indices))\n pairs.sort(key=lambda x: x[1])\n generations = [generations[original_idx] for original_idx, _ in pairs]\n\n return generations\n\n def _process_outputs(\n self, input: str, outputs: \"RequestOutput\"\n ) -> Tuple[\"LLMOutput\", \"LLMStatistics\", \"LLMLogprobs\"]:\n texts = []\n outputs_logprobs = []\n statistics = {\n \"input_tokens\": [compute_tokens(input, self._tokenizer.encode)]\n * len(outputs.outputs),\n \"output_tokens\": [],\n }\n for output in outputs.outputs:\n texts.append(output.text)\n statistics[\"output_tokens\"].append(len(output.token_ids))\n if output.logprobs is not None:\n outputs_logprobs.append(self._get_llm_logprobs(output))\n return texts, statistics, outputs_logprobs\n\n def 
_prepare_structured_output(\n self, structured_output: \"OutlinesStructuredOutputType\"\n ) -> Union[Callable, None]:\n \"\"\"Creates the appropriate function to filter tokens to generate structured outputs.\n\n Args:\n structured_output: the configuration dict to prepare the structured output.\n\n Returns:\n The callable that will be used to guide the generation of the model.\n \"\"\"\n from distilabel.steps.tasks.structured_outputs.outlines import (\n prepare_guided_output,\n )\n\n assert structured_output is not None, \"`structured_output` cannot be `None`\"\n\n result = prepare_guided_output(structured_output, \"vllm\", self._model)\n if (schema := result.get(\"schema\")) and self.structured_output:\n self.structured_output[\"schema\"] = schema\n return result[\"processor\"]\n\n def _get_llm_logprobs(self, output: \"CompletionOutput\") -> List[List[\"Logprob\"]]:\n logprobs = []\n for token_logprob in output.logprobs: # type: ignore\n token_logprobs = []\n for logprob in token_logprob.values():\n token_logprobs.append(\n {\"token\": logprob.decoded_token, \"logprob\": logprob.logprob}\n )\n logprobs.append(token_logprobs)\n return logprobs\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.vLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.vLLM.load","title":"load()
","text":"Loads the vLLM
model using either the path or the Hugging Face Hub repository id. Additionally, this method also sets the chat_template
for the tokenizer, so that the list of OpenAI-formatted inputs is parsed using the format expected by the model; if no chat template is explicitly provided, the ChatML format is used as the default.
src/distilabel/models/llms/vllm.py
def load(self) -> None:\n \"\"\"Loads the `vLLM` model using either the path or the Hugging Face Hub repository id.\n Additionally, this method also sets the `chat_template` for the tokenizer, so as to properly\n parse the list of OpenAI formatted inputs using the expected format by the model, otherwise, the\n default value is ChatML format, unless explicitly provided.\n \"\"\"\n super().load()\n\n CudaDevicePlacementMixin.load(self)\n\n try:\n from vllm import LLM as _vLLM\n except ImportError as ie:\n raise ImportError(\n \"vLLM is not installed. Please install it using `pip install vllm`.\"\n ) from ie\n\n self._model = _vLLM(\n self.model,\n dtype=self.dtype,\n trust_remote_code=self.trust_remote_code,\n quantization=self.quantization,\n revision=self.revision,\n tokenizer=self.tokenizer,\n tokenizer_mode=self.tokenizer_mode,\n tokenizer_revision=self.tokenizer_revision,\n skip_tokenizer_init=self.skip_tokenizer_init,\n seed=self.seed,\n **self.extra_kwargs, # type: ignore\n )\n\n self._tokenizer = self._model.get_tokenizer() # type: ignore\n if self.chat_template is not None:\n self._tokenizer.chat_template = self.chat_template # type: ignore\n\n if self.structured_output:\n self._structured_output_logits_processor = self._prepare_structured_output(\n self.structured_output\n )\n
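A minimal usage sketch (not part of the reference above): instantiating the class and calling load() before generating. The model id is only an illustrative placeholder.
from distilabel.models.llms import vLLM

# Illustrative model id; any Hugging Face Hub repository supported by vLLM works here.
llm = vLLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
llm.load()  # creates the underlying `vllm.LLM` engine and retrieves its tokenizer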
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.vLLM.unload","title":"unload()
","text":"Unloads the vLLM
model.
src/distilabel/models/llms/vllm.py
def unload(self) -> None:\n \"\"\"Unloads the `vLLM` model.\"\"\"\n self._cleanup_vllm_model()\n self._model = None # type: ignore\n self._tokenizer = None # type: ignore\n CudaDevicePlacementMixin.unload(self)\n super().unload()\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.vLLM.prepare_input","title":"prepare_input(input)
","text":"Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name Type Description Defaultinput
StandardInput
the input list containing chat items.
requiredReturns:
Type Descriptionstr
The prompt to send to the LLM.
Source code insrc/distilabel/models/llms/vllm.py
def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n if self._tokenizer.chat_template is None:\n return [item[\"content\"] for item in input if item[\"role\"] == \"user\"][0]\n\n prompt: str = (\n self._tokenizer.apply_chat_template(\n input, # type: ignore\n tokenize=False,\n add_generation_prompt=True, # type: ignore\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n
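As a hedged illustration of the behaviour described above, assuming llm is an already loaded vLLM instance, a chat-formatted input is turned into a single prompt string:
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is synthetic data?"},
]
# The chat template is applied and the generation prompt is appended.
prompt = llm.prepare_input(conversation)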
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.vLLM._prepare_batches","title":"_prepare_batches(inputs)
","text":"Prepares the inputs by grouping them by the structured output.
When we generate structured outputs with schemas obtained from a dataset, we need to prepare the data so that batches of inputs, instead of single inputs, are sent to the model in order to take advantage of the engine. So we group the inputs by the structured output to be passed to the generate
method.
Parameters:
Name Type Description Defaultinputs
List[StructuredInput]
The batch of inputs passed to the generate method. As we expect to be generating structured outputs, each element will be a tuple containing the instruction and the structured output.
requiredReturns:
Type DescriptionList[Tuple[List[str], OutlinesStructuredOutputType]]
The prepared batches (sub-batches) to be passed to the generate
method.
List[int]
Each new tuple will contain, instead of a single instruction, a list of instructions.
Source code insrc/distilabel/models/llms/vllm.py
def _prepare_batches(\n self, inputs: List[\"StructuredInput\"]\n) -> Tuple[List[Tuple[List[str], \"OutlinesStructuredOutputType\"]], List[int]]:\n \"\"\"Prepares the inputs by grouping them by the structured output.\n\n When we generate structured outputs with schemas obtained from a dataset, we need to\n prepare the data to try to send batches of inputs instead of single inputs to the model\n to take advante of the engine. So we group the inputs by the structured output to be\n passed in the `generate` method.\n\n Args:\n inputs: The batch of inputs passed to the generate method. As we expect to be generating\n structured outputs, each element will be a tuple containing the instruction and the\n structured output.\n\n Returns:\n The prepared batches (sub-batches let's say) to be passed to the `generate` method.\n Each new tuple will contain instead of the single instruction, a list of instructions\n \"\"\"\n instruction_order = {}\n batches: Dict[str, List[str]] = {}\n for i, (instruction, structured_output) in enumerate(inputs):\n instruction = self.prepare_input(instruction)\n instruction_order[instruction] = i\n\n structured_output = json.dumps(structured_output)\n if structured_output not in batches:\n batches[structured_output] = [instruction]\n else:\n batches[structured_output].append(instruction)\n\n # Built a list with instructions sorted by structured output\n flat_instructions = [\n instruction for _, group in batches.items() for instruction in group\n ]\n\n # Generate the list of indices based on the original order\n sorted_indices = [\n instruction_order[instruction] for instruction in flat_instructions\n ]\n\n return [\n (batch, json.loads(schema)) for schema, batch in batches.items()\n ], sorted_indices\n
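The grouping and re-ordering logic can be illustrated with a self-contained sketch (hypothetical data, plain strings instead of chat inputs): inputs sharing the same schema end up in the same sub-batch, and sorted_indices records how to restore the original order after generation.
import json

inputs = [
    ("instruction-0", {"format": "json", "schema": {"type": "object"}}),
    ("instruction-1", {"format": "regex", "schema": r"\d+"}),
    ("instruction-2", {"format": "json", "schema": {"type": "object"}}),
]

batches = {}
order = {}
for i, (instruction, structured_output) in enumerate(inputs):
    order[instruction] = i
    # Group instructions by their serialized structured output.
    batches.setdefault(json.dumps(structured_output), []).append(instruction)

flat = [instruction for group in batches.values() for instruction in group]
sorted_indices = [order[instruction] for instruction in flat]

prepared = [(group, json.loads(schema)) for schema, group in batches.items()]
# prepared       -> [(["instruction-0", "instruction-2"], <json schema>), (["instruction-1"], <regex schema>)]
# sorted_indices -> [0, 2, 1], used later to put the generations back in the original order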
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.vLLM.generate","title":"generate(inputs, num_generations=1, max_new_tokens=128, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, logprobs=None, stop=None, stop_token_ids=None, include_stop_str_in_output=False, logits_processors=None, extra_sampling_params=None)
","text":"Generates num_generations
responses for each input.
Parameters:
Name Type Description Defaultinputs
List[FormattedInput]
a list of inputs in chat format to generate responses for.
requirednum_generations
int
the number of generations to create per input. Defaults to 1
.
1
max_new_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
presence_penalty
float
the presence penalty to use for the generation. Defaults to 0.0
.
0.0
frequency_penalty
float
the frequency penalty to use for the generation. Defaults to 0.0
.
0.0
repetition_penalty
float
the repetition penalty to use for the generation. Defaults to 1.0
.
1.0
temperature
float
the temperature to use for the generation. Defaults to 1.0
.
1.0
top_p
float
the top-p value to use for the generation. Defaults to 1.0
.
1.0
top_k
int
the top-k value to use for the generation. Defaults to -1
.
-1
min_p
float
the minimum probability to use for the generation. Defaults to 0.0
.
0.0
logprobs
Optional[PositiveInt]
number of log probabilities to return per output token. If None
, then no log probabilities will be returned. Defaults to None
.
None
stop
Optional[List[str]]
a list of strings that will be used to stop the generation when found. Defaults to None
.
None
stop_token_ids
Optional[List[int]]
a list of token ids that will be used to stop the generation when found. Defaults to None
.
None
include_stop_str_in_output
bool
whether to include the stop string in the output. Defaults to False
.
False
logits_processors
Optional[LogitsProcessors]
a list of functions to process the logits before sampling. Defaults to None
.
None
extra_sampling_params
Optional[Dict[str, Any]]
dictionary with additional arguments to be passed to the SamplingParams
class from vllm
.
None
Returns:
Type DescriptionList[GenerateOutput]
A list of lists of strings containing the generated responses for each input.
Source code insrc/distilabel/models/llms/vllm.py
@validate_call\ndef generate( # noqa: C901 # type: ignore\n self,\n inputs: List[FormattedInput],\n num_generations: int = 1,\n max_new_tokens: int = 128,\n presence_penalty: float = 0.0,\n frequency_penalty: float = 0.0,\n repetition_penalty: float = 1.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n top_k: int = -1,\n min_p: float = 0.0,\n logprobs: Optional[PositiveInt] = None,\n stop: Optional[List[str]] = None,\n stop_token_ids: Optional[List[int]] = None,\n include_stop_str_in_output: bool = False,\n logits_processors: Optional[LogitsProcessors] = None,\n extra_sampling_params: Optional[Dict[str, Any]] = None,\n) -> List[GenerateOutput]:\n \"\"\"Generates `num_generations` responses for each input.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n repetition_penalty: the repetition penalty to use for the generation Defaults to\n `1.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n top_k: the top-k value to use for the generation. Defaults to `0`.\n min_p: the minimum probability to use for the generation. Defaults to `0.0`.\n logprobs: number of log probabilities to return per output token. If `None`,\n then no log probability won't be returned. Defaults to `None`.\n stop: a list of strings that will be used to stop the generation when found.\n Defaults to `None`.\n stop_token_ids: a list of token ids that will be used to stop the generation\n when found. 
Defaults to `None`.\n include_stop_str_in_output: whether to include the stop string in the output.\n Defaults to `False`.\n logits_processors: a list of functions to process the logits before sampling.\n Defaults to `None`.\n extra_sampling_params: dictionary with additional arguments to be passed to\n the `SamplingParams` class from `vllm`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n from vllm import SamplingParams\n\n if not logits_processors:\n logits_processors = []\n\n if extra_sampling_params is None:\n extra_sampling_params = {}\n\n structured_output = None\n\n if isinstance(inputs[0], tuple):\n # Prepare the batches for structured generation\n prepared_batches, sorted_indices = self._prepare_batches(inputs) # type: ignore\n else:\n # Simulate a batch without the structured output content\n prepared_batches = [([self.prepare_input(input) for input in inputs], None)] # type: ignore\n sorted_indices = None\n\n # Case in which we have a single structured output for the dataset\n if self._structured_output_logits_processor:\n logits_processors.append(self._structured_output_logits_processor)\n\n batched_outputs: List[\"LLMOutput\"] = []\n generations = []\n\n for prepared_inputs, structured_output in prepared_batches:\n if self.structured_output is not None and structured_output is not None:\n # TODO: warning\n pass\n\n if structured_output is not None:\n logits_processors.append(\n self._prepare_structured_output(structured_output) # type: ignore\n )\n\n sampling_params = SamplingParams( # type: ignore\n n=num_generations,\n presence_penalty=presence_penalty,\n frequency_penalty=frequency_penalty,\n repetition_penalty=repetition_penalty,\n temperature=temperature,\n top_p=top_p,\n top_k=top_k,\n min_p=min_p,\n max_tokens=max_new_tokens,\n logprobs=logprobs,\n stop=stop,\n stop_token_ids=stop_token_ids,\n include_stop_str_in_output=include_stop_str_in_output,\n logits_processors=logits_processors,\n **extra_sampling_params,\n )\n\n batch_outputs: List[\"RequestOutput\"] = self._model.generate(\n prompts=prepared_inputs,\n sampling_params=sampling_params,\n use_tqdm=False,\n )\n\n # Remove structured output logit processor to avoid stacking structured output\n # logits processors that leads to non-sense generations\n if structured_output is not None:\n logits_processors.pop(-1)\n\n for input, outputs in zip(prepared_inputs, batch_outputs):\n texts, statistics, outputs_logprobs = self._process_outputs(\n input, outputs\n )\n batched_outputs.append(texts)\n generations.append(\n prepare_output(\n generations=texts,\n input_tokens=statistics[\"input_tokens\"],\n output_tokens=statistics[\"output_tokens\"],\n logprobs=outputs_logprobs,\n )\n )\n\n if sorted_indices is not None:\n pairs = list(enumerate(sorted_indices))\n pairs.sort(key=lambda x: x[1])\n generations = [generations[original_idx] for original_idx, _ in pairs]\n\n return generations\n
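A hedged usage sketch of the method above, assuming an already loaded llm instance; the parameter values are illustrative.
outputs = llm.generate(
    inputs=[[{"role": "user", "content": "Write a haiku about data."}]],
    num_generations=2,
    max_new_tokens=64,
    temperature=0.7,
)
# One entry per input, each containing the generated texts together with the
# token statistics (and logprobs, when `logprobs` is requested).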
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.vLLM._prepare_structured_output","title":"_prepare_structured_output(structured_output)
","text":"Creates the appropriate function to filter tokens to generate structured outputs.
Parameters:
Name Type Description Defaultstructured_output
OutlinesStructuredOutputType
the configuration dict to prepare the structured output.
requiredReturns:
Type DescriptionUnion[Callable, None]
The callable that will be used to guide the generation of the model.
Source code insrc/distilabel/models/llms/vllm.py
def _prepare_structured_output(\n self, structured_output: \"OutlinesStructuredOutputType\"\n) -> Union[Callable, None]:\n \"\"\"Creates the appropriate function to filter tokens to generate structured outputs.\n\n Args:\n structured_output: the configuration dict to prepare the structured output.\n\n Returns:\n The callable that will be used to guide the generation of the model.\n \"\"\"\n from distilabel.steps.tasks.structured_outputs.outlines import (\n prepare_guided_output,\n )\n\n assert structured_output is not None, \"`structured_output` cannot be `None`\"\n\n result = prepare_guided_output(structured_output, \"vllm\", self._model)\n if (schema := result.get(\"schema\")) and self.structured_output:\n self.structured_output[\"schema\"] = schema\n return result[\"processor\"]\n
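As a hedged sketch of how a structured output configuration typically reaches this method, an LLM can be instantiated with a structured_output dictionary; the Pydantic model and the model id below are illustrative.
from pydantic import BaseModel

from distilabel.models.llms import vLLM


class Character(BaseModel):
    name: str
    weapon: str


# "format" and "schema" follow the outlines-based structured output configuration.
llm = vLLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model id
    structured_output={"format": "json", "schema": Character},
)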
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CudaDevicePlacementMixin","title":"CudaDevicePlacementMixin
","text":" Bases: BaseModel
Mixin class to assign CUDA devices to the LLM
based on the cuda_devices
attribute and the device placement information provided in _device_llm_placement_map
. Providing the device placement information is optional, but if it is provided, it will be used to assign CUDA devices to the LLM
s, trying to avoid using the same device for different LLM
s.
Attributes:
Name Type Descriptioncuda_devices
RuntimeParameter[Union[List[int], Literal['auto']]]
a list with the ID of the CUDA devices to be used by the LLM
. If set to \"auto\", the devices will be automatically assigned based on the device placement information provided in _device_llm_placement_map
. If set to a list of devices, it will be checked if the devices are available to be used by the LLM
. If not, a warning will be logged.
disable_cuda_device_placement
RuntimeParameter[bool]
Whether to disable the CUDA device placement logic or not. Defaults to False
.
_llm_identifier
Union[str, None]
the identifier of the LLM
to be used as key in _device_llm_placement_map
.
_device_llm_placement_map
Generator[Dict[str, List[int]], None, None]
a dictionary with the device placement information for each LLM
.
src/distilabel/models/mixins/cuda_device_placement.py
class CudaDevicePlacementMixin(BaseModel):\n \"\"\"Mixin class to assign CUDA devices to the `LLM` based on the `cuda_devices` attribute\n and the device placement information provided in `_device_llm_placement_map`. Providing\n the device placement information is optional, but if it is provided, it will be used to\n assign CUDA devices to the `LLM`s, trying to avoid using the same device for different\n `LLM`s.\n\n Attributes:\n cuda_devices: a list with the ID of the CUDA devices to be used by the `LLM`. If set\n to \"auto\", the devices will be automatically assigned based on the device\n placement information provided in `_device_llm_placement_map`. If set to a list\n of devices, it will be checked if the devices are available to be used by the\n `LLM`. If not, a warning will be logged.\n disable_cuda_device_placement: Whether to disable the CUDA device placement logic\n or not. Defaults to `False`.\n _llm_identifier: the identifier of the `LLM` to be used as key in `_device_llm_placement_map`.\n _device_llm_placement_map: a dictionary with the device placement information for each\n `LLM`.\n \"\"\"\n\n cuda_devices: RuntimeParameter[Union[List[int], Literal[\"auto\"]]] = Field(\n default=\"auto\", description=\"A list with the ID of the CUDA devices to be used.\"\n )\n disable_cuda_device_placement: RuntimeParameter[bool] = Field(\n default=False,\n description=\"Whether to disable the CUDA device placement logic or not.\",\n )\n\n _llm_identifier: Union[str, None] = PrivateAttr(default=None)\n _desired_num_gpus: PositiveInt = PrivateAttr(default=1)\n _available_cuda_devices: List[int] = PrivateAttr(default_factory=list)\n _can_check_cuda_devices: bool = PrivateAttr(default=False)\n\n _logger: \"Logger\" = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Assign CUDA devices to the LLM based on the device placement information provided\n in `_device_llm_placement_map`.\"\"\"\n\n if self.disable_cuda_device_placement:\n return\n\n try:\n import pynvml\n\n pynvml.nvmlInit()\n device_count = pynvml.nvmlDeviceGetCount()\n self._available_cuda_devices = list(range(device_count))\n self._can_check_cuda_devices = True\n except ImportError as ie:\n if self.cuda_devices == \"auto\":\n raise ImportError(\n \"The 'pynvml' library is not installed. It is required to automatically\"\n \" assign CUDA devices to the `LLM`s. Please, install it and try again.\"\n ) from ie\n\n if self.cuda_devices:\n self._logger.warning( # type: ignore\n \"The 'pynvml' library is not installed. It is recommended to install it\"\n \" to check if the CUDA devices assigned to the LLM are available.\"\n )\n\n self._assign_cuda_devices()\n\n def unload(self) -> None:\n \"\"\"Unloads the LLM and removes the CUDA devices assigned to it from the device\n placement information provided in `_device_llm_placement_map`.\"\"\"\n if self.disable_cuda_device_placement:\n return\n\n with self._device_llm_placement_map() as device_map:\n if self._llm_identifier in device_map:\n self._logger.debug( # type: ignore\n f\"Removing '{self._llm_identifier}' from the CUDA device map file\"\n f\" '{_CUDA_DEVICE_PLACEMENT_MIXIN_FILE}'.\"\n )\n del device_map[self._llm_identifier]\n\n @contextmanager\n def _device_llm_placement_map(self) -> Generator[Dict[str, List[int]], None, None]:\n \"\"\"Reads the content of the device placement file of the node with a lock, yields\n the content, and writes the content back to the file after the context manager is\n closed. 
If the file doesn't exist, an empty dictionary will be yielded.\n\n Yields:\n The content of the device placement file.\n \"\"\"\n _CUDA_DEVICE_PLACEMENT_MIXIN_FILE.parent.mkdir(parents=True, exist_ok=True)\n _CUDA_DEVICE_PLACEMENT_MIXIN_FILE.touch()\n with portalocker.Lock(\n _CUDA_DEVICE_PLACEMENT_MIXIN_FILE,\n \"r+\",\n flags=portalocker.LockFlags.EXCLUSIVE,\n ) as f:\n try:\n content = json.load(f)\n except json.JSONDecodeError:\n content = {}\n yield content\n f.seek(0)\n f.truncate()\n f.write(json.dumps(content))\n\n def _assign_cuda_devices(self) -> None:\n \"\"\"Assigns CUDA devices to the LLM based on the device placement information provided\n in `_device_llm_placement_map`. If the `cuda_devices` attribute is set to \"auto\", it\n will be set to the first available CUDA device that is not going to be used by any\n other LLM. If the `cuda_devices` attribute is set to a list of devices, it will be\n checked if the devices are available to be used by the LLM. If not, a warning will be\n logged.\"\"\"\n\n # Take the lock and read the device placement information for each LLM.\n with self._device_llm_placement_map() as device_map:\n if self.cuda_devices == \"auto\":\n self.cuda_devices = []\n for _ in range(self._desired_num_gpus):\n if (device_id := self._get_cuda_device(device_map)) is not None:\n self.cuda_devices.append(device_id)\n device_map[self._llm_identifier] = self.cuda_devices # type: ignore\n if len(self.cuda_devices) != self._desired_num_gpus:\n self._logger.warning( # type: ignore\n f\"Could not assign the desired number of GPUs {self._desired_num_gpus}\"\n f\" for LLM with identifier '{self._llm_identifier}'.\"\n )\n else:\n self._check_cuda_devices(device_map)\n\n device_map[self._llm_identifier] = self.cuda_devices # type: ignore\n\n # `_device_llm_placement_map` was not provided and user didn't set the `cuda_devices`\n # attribute. In this case, the `cuda_devices` attribute will be set to an empty list.\n if self.cuda_devices == \"auto\":\n self.cuda_devices = []\n\n self._set_cuda_visible_devices()\n\n def _check_cuda_devices(self, device_map: Dict[str, List[int]]) -> None:\n \"\"\"Checks if the CUDA devices assigned to the LLM are also assigned to other LLMs.\n\n Args:\n device_map: a dictionary with the device placement information for each LLM.\n \"\"\"\n for device in self.cuda_devices: # type: ignore\n for llm, devices in device_map.items():\n if device in devices:\n self._logger.warning( # type: ignore\n f\"LLM with identifier '{llm}' is also going to use CUDA device \"\n f\"'{device}'. 
This may lead to performance issues or running out\"\n \" of memory depending on the device capabilities and the loaded\"\n \" models.\"\n )\n\n def _get_cuda_device(self, device_map: Dict[str, List[int]]) -> Union[int, None]:\n \"\"\"Returns the first available CUDA device to be used by the LLM that is not going\n to be used by any other LLM.\n\n Args:\n device_map: a dictionary with the device placement information for each LLM.\n\n Returns:\n The first available CUDA device to be used by the LLM.\n\n Raises:\n RuntimeError: if there is no available CUDA device to be used by the LLM.\n \"\"\"\n for device in self._available_cuda_devices:\n if all(device not in devices for devices in device_map.values()):\n return device\n\n return None\n\n def _set_cuda_visible_devices(self) -> None:\n \"\"\"Sets the `CUDA_VISIBLE_DEVICES` environment variable to the list of CUDA devices\n to be used by the LLM.\n \"\"\"\n if not self.cuda_devices:\n return\n\n if self._can_check_cuda_devices and not all(\n device in self._available_cuda_devices for device in self.cuda_devices\n ):\n raise RuntimeError(\n f\"Invalid CUDA devices for LLM '{self._llm_identifier}': {self.cuda_devices}.\"\n f\" The available devices are: {self._available_cuda_devices}. Please, review\"\n \" the 'cuda_devices' attribute and try again.\"\n )\n\n cuda_devices = \",\".join([str(device) for device in self.cuda_devices])\n self._logger.info( # type: ignore\n f\"\ud83c\udfae LLM '{self._llm_identifier}' is going to use the following CUDA devices:\"\n f\" {self.cuda_devices}.\"\n )\n os.environ[\"CUDA_VISIBLE_DEVICES\"] = cuda_devices\n
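Since LLMs such as vLLM inherit this mixin, the placement behaviour can be controlled directly when instantiating them. A minimal sketch with an illustrative model id:
from distilabel.models.llms import vLLM

llm = vLLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    cuda_devices=[0, 1],  # pin explicit devices instead of the default "auto" assignment
)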
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CudaDevicePlacementMixin.load","title":"load()
","text":"Assign CUDA devices to the LLM based on the device placement information provided in _device_llm_placement_map
.
src/distilabel/models/mixins/cuda_device_placement.py
def load(self) -> None:\n \"\"\"Assign CUDA devices to the LLM based on the device placement information provided\n in `_device_llm_placement_map`.\"\"\"\n\n if self.disable_cuda_device_placement:\n return\n\n try:\n import pynvml\n\n pynvml.nvmlInit()\n device_count = pynvml.nvmlDeviceGetCount()\n self._available_cuda_devices = list(range(device_count))\n self._can_check_cuda_devices = True\n except ImportError as ie:\n if self.cuda_devices == \"auto\":\n raise ImportError(\n \"The 'pynvml' library is not installed. It is required to automatically\"\n \" assign CUDA devices to the `LLM`s. Please, install it and try again.\"\n ) from ie\n\n if self.cuda_devices:\n self._logger.warning( # type: ignore\n \"The 'pynvml' library is not installed. It is recommended to install it\"\n \" to check if the CUDA devices assigned to the LLM are available.\"\n )\n\n self._assign_cuda_devices()\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CudaDevicePlacementMixin.unload","title":"unload()
","text":"Unloads the LLM and removes the CUDA devices assigned to it from the device placement information provided in _device_llm_placement_map
.
src/distilabel/models/mixins/cuda_device_placement.py
def unload(self) -> None:\n \"\"\"Unloads the LLM and removes the CUDA devices assigned to it from the device\n placement information provided in `_device_llm_placement_map`.\"\"\"\n if self.disable_cuda_device_placement:\n return\n\n with self._device_llm_placement_map() as device_map:\n if self._llm_identifier in device_map:\n self._logger.debug( # type: ignore\n f\"Removing '{self._llm_identifier}' from the CUDA device map file\"\n f\" '{_CUDA_DEVICE_PLACEMENT_MIXIN_FILE}'.\"\n )\n del device_map[self._llm_identifier]\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CudaDevicePlacementMixin._device_llm_placement_map","title":"_device_llm_placement_map()
","text":"Reads the content of the device placement file of the node with a lock, yields the content, and writes the content back to the file after the context manager is closed. If the file doesn't exist, an empty dictionary will be yielded.
Yields:
Type DescriptionDict[str, List[int]]
The content of the device placement file.
Source code insrc/distilabel/models/mixins/cuda_device_placement.py
@contextmanager\ndef _device_llm_placement_map(self) -> Generator[Dict[str, List[int]], None, None]:\n \"\"\"Reads the content of the device placement file of the node with a lock, yields\n the content, and writes the content back to the file after the context manager is\n closed. If the file doesn't exist, an empty dictionary will be yielded.\n\n Yields:\n The content of the device placement file.\n \"\"\"\n _CUDA_DEVICE_PLACEMENT_MIXIN_FILE.parent.mkdir(parents=True, exist_ok=True)\n _CUDA_DEVICE_PLACEMENT_MIXIN_FILE.touch()\n with portalocker.Lock(\n _CUDA_DEVICE_PLACEMENT_MIXIN_FILE,\n \"r+\",\n flags=portalocker.LockFlags.EXCLUSIVE,\n ) as f:\n try:\n content = json.load(f)\n except json.JSONDecodeError:\n content = {}\n yield content\n f.seek(0)\n f.truncate()\n f.write(json.dumps(content))\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CudaDevicePlacementMixin._assign_cuda_devices","title":"_assign_cuda_devices()
","text":"Assigns CUDA devices to the LLM based on the device placement information provided in _device_llm_placement_map
. If the cuda_devices
attribute is set to \"auto\", it will be set to the first available CUDA device that is not going to be used by any other LLM. If the cuda_devices
attribute is set to a list of devices, it will be checked if the devices are available to be used by the LLM. If not, a warning will be logged.
src/distilabel/models/mixins/cuda_device_placement.py
def _assign_cuda_devices(self) -> None:\n \"\"\"Assigns CUDA devices to the LLM based on the device placement information provided\n in `_device_llm_placement_map`. If the `cuda_devices` attribute is set to \"auto\", it\n will be set to the first available CUDA device that is not going to be used by any\n other LLM. If the `cuda_devices` attribute is set to a list of devices, it will be\n checked if the devices are available to be used by the LLM. If not, a warning will be\n logged.\"\"\"\n\n # Take the lock and read the device placement information for each LLM.\n with self._device_llm_placement_map() as device_map:\n if self.cuda_devices == \"auto\":\n self.cuda_devices = []\n for _ in range(self._desired_num_gpus):\n if (device_id := self._get_cuda_device(device_map)) is not None:\n self.cuda_devices.append(device_id)\n device_map[self._llm_identifier] = self.cuda_devices # type: ignore\n if len(self.cuda_devices) != self._desired_num_gpus:\n self._logger.warning( # type: ignore\n f\"Could not assign the desired number of GPUs {self._desired_num_gpus}\"\n f\" for LLM with identifier '{self._llm_identifier}'.\"\n )\n else:\n self._check_cuda_devices(device_map)\n\n device_map[self._llm_identifier] = self.cuda_devices # type: ignore\n\n # `_device_llm_placement_map` was not provided and user didn't set the `cuda_devices`\n # attribute. In this case, the `cuda_devices` attribute will be set to an empty list.\n if self.cuda_devices == \"auto\":\n self.cuda_devices = []\n\n self._set_cuda_visible_devices()\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CudaDevicePlacementMixin._check_cuda_devices","title":"_check_cuda_devices(device_map)
","text":"Checks if the CUDA devices assigned to the LLM are also assigned to other LLMs.
Parameters:
Name Type Description Defaultdevice_map
Dict[str, List[int]]
a dictionary with the device placement information for each LLM.
required Source code insrc/distilabel/models/mixins/cuda_device_placement.py
def _check_cuda_devices(self, device_map: Dict[str, List[int]]) -> None:\n \"\"\"Checks if the CUDA devices assigned to the LLM are also assigned to other LLMs.\n\n Args:\n device_map: a dictionary with the device placement information for each LLM.\n \"\"\"\n for device in self.cuda_devices: # type: ignore\n for llm, devices in device_map.items():\n if device in devices:\n self._logger.warning( # type: ignore\n f\"LLM with identifier '{llm}' is also going to use CUDA device \"\n f\"'{device}'. This may lead to performance issues or running out\"\n \" of memory depending on the device capabilities and the loaded\"\n \" models.\"\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CudaDevicePlacementMixin._get_cuda_device","title":"_get_cuda_device(device_map)
","text":"Returns the first available CUDA device to be used by the LLM that is not going to be used by any other LLM.
Parameters:
Name Type Description Defaultdevice_map
Dict[str, List[int]]
a dictionary with the device placement information for each LLM.
requiredReturns:
Type DescriptionUnion[int, None]
The first available CUDA device to be used by the LLM.
Raises:
Type DescriptionRuntimeError
if there is no available CUDA device to be used by the LLM.
Source code insrc/distilabel/models/mixins/cuda_device_placement.py
def _get_cuda_device(self, device_map: Dict[str, List[int]]) -> Union[int, None]:\n \"\"\"Returns the first available CUDA device to be used by the LLM that is not going\n to be used by any other LLM.\n\n Args:\n device_map: a dictionary with the device placement information for each LLM.\n\n Returns:\n The first available CUDA device to be used by the LLM.\n\n Raises:\n RuntimeError: if there is no available CUDA device to be used by the LLM.\n \"\"\"\n for device in self._available_cuda_devices:\n if all(device not in devices for devices in device_map.values()):\n return device\n\n return None\n
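The selection rule can be illustrated with a small, self-contained sketch (hypothetical identifiers): given three visible devices and one already claimed, the first free one is returned.
device_map = {"text_generation_0-replica-0": [0]}  # hypothetical LLM identifier
available_cuda_devices = [0, 1, 2]

first_free = next(
    (
        device
        for device in available_cuda_devices
        if all(device not in used for used in device_map.values())
    ),
    None,
)
assert first_free == 1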
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CudaDevicePlacementMixin._set_cuda_visible_devices","title":"_set_cuda_visible_devices()
","text":"Sets the CUDA_VISIBLE_DEVICES
environment variable to the list of CUDA devices to be used by the LLM.
src/distilabel/models/mixins/cuda_device_placement.py
def _set_cuda_visible_devices(self) -> None:\n \"\"\"Sets the `CUDA_VISIBLE_DEVICES` environment variable to the list of CUDA devices\n to be used by the LLM.\n \"\"\"\n if not self.cuda_devices:\n return\n\n if self._can_check_cuda_devices and not all(\n device in self._available_cuda_devices for device in self.cuda_devices\n ):\n raise RuntimeError(\n f\"Invalid CUDA devices for LLM '{self._llm_identifier}': {self.cuda_devices}.\"\n f\" The available devices are: {self._available_cuda_devices}. Please, review\"\n \" the 'cuda_devices' attribute and try again.\"\n )\n\n cuda_devices = \",\".join([str(device) for device in self.cuda_devices])\n self._logger.info( # type: ignore\n f\"\ud83c\udfae LLM '{self._llm_identifier}' is going to use the following CUDA devices:\"\n f\" {self.cuda_devices}.\"\n )\n os.environ[\"CUDA_VISIBLE_DEVICES\"] = cuda_devices\n
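The end effect is equivalent to exporting the variable before the engine starts; illustrative only, for devices [0, 1]:
import os

# Only GPUs 0 and 1 will be visible to the process (and therefore to vLLM).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"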
"},{"location":"api/pipeline/","title":"Pipeline","text":"This section contains the API reference for the distilabel
pipelines. For an example of how to use the pipelines, see the Tutorial - Pipeline.
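A minimal, hedged sketch of a pipeline definition and run; the dataset contents and the OpenAI model name are illustrative, not taken from this reference.
from distilabel.models.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="demo-pipeline") as pipeline:
    loader = LoadDataFromDicts(data=[{"instruction": "Say hello."}])
    text_generation = TextGeneration(llm=OpenAILLM(model="gpt-4o-mini"))
    loader >> text_generation  # connect the steps of the DAG

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)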
base
","text":""},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline","title":"BasePipeline
","text":" Bases: ABC
, RequirementsMixin
, _Serializable
Base class for a distilabel
pipeline.
Attributes:
Name Type Descriptionname
The name of the pipeline.
description
A description of the pipeline.
dag
The DAG
instance that represents the pipeline.
_cache_dir
The directory where the pipeline will be cached.
_logger
The logger instance that will be used by the pipeline.
_batch_manager
Optional[_BatchManager]
The batch manager that will manage the batches received from the steps while running the pipeline. It will be created when the pipeline is run, from scratch or from cache. Defaults to None
.
_write_buffer
Optional[_WriteBuffer]
The buffer that will store the data of the leaf steps of the pipeline while running, so the Distiset
can be created at the end. It will be created when the pipeline is run. Defaults to None
.
_fs
Optional[AbstractFileSystem]
The fsspec
filesystem to be used to store the data of the _Batch
es passed between the steps. It will be set when the pipeline is run. Defaults to None
.
_storage_base_path
Optional[str]
The base path where the data of the _Batch
es passed between the steps will be stored. It will be set when the pipeline is run. Defaults to None
.
_use_fs_to_pass_data
bool
Whether to use the file system to pass the data of the _Batch
es between the steps. Even if this parameter is False
, the Batch
es received by GlobalStep
s will always use the file system to pass the data. Defaults to False
.
_dry_run
A flag to indicate if the pipeline is running in dry run mode. Defaults to False
.
output_queue
A queue to store the output of the steps while running the pipeline.
load_queue
A queue used by each Step
to notify the main process that it has finished loading or that the step has been unloaded.
src/distilabel/pipeline/base.py
class BasePipeline(ABC, RequirementsMixin, _Serializable):\n \"\"\"Base class for a `distilabel` pipeline.\n\n Attributes:\n name: The name of the pipeline.\n description: A description of the pipeline.\n dag: The `DAG` instance that represents the pipeline.\n _cache_dir: The directory where the pipeline will be cached.\n _logger: The logger instance that will be used by the pipeline.\n _batch_manager: The batch manager that will manage the batches received from the\n steps while running the pipeline. It will be created when the pipeline is run,\n from scratch or from cache. Defaults to `None`.\n _write_buffer: The buffer that will store the data of the leaf steps of the pipeline\n while running, so the `Distiset` can be created at the end. It will be created\n when the pipeline is run. Defaults to `None`.\n _fs: The `fsspec` filesystem to be used to store the data of the `_Batch`es passed\n between the steps. It will be set when the pipeline is run. Defaults to `None`.\n _storage_base_path: The base path where the data of the `_Batch`es passed between\n the steps will be stored. It will be set then the pipeline is run. Defaults\n to `None`.\n _use_fs_to_pass_data: Whether to use the file system to pass the data of the\n `_Batch`es between the steps. Even if this parameter is `False`, the `Batch`es\n received by `GlobalStep`s will always use the file system to pass the data.\n Defaults to `False`.\n _dry_run: A flag to indicate if the pipeline is running in dry run mode. Defaults\n to `False`.\n output_queue: A queue to store the output of the steps while running the pipeline.\n load_queue: A queue used by each `Step` to notify the main process it has finished\n loading or it the step has been unloaded.\n \"\"\"\n\n _output_queue: \"Queue[Any]\"\n _load_queue: \"Queue[Union[StepLoadStatus, None]]\"\n\n def __init__(\n self,\n name: Optional[str] = None,\n description: Optional[str] = None,\n cache_dir: Optional[Union[str, \"PathLike\"]] = None,\n enable_metadata: bool = False,\n requirements: Optional[List[str]] = None,\n ) -> None:\n \"\"\"Initialize the `BasePipeline` instance.\n\n Args:\n name: The name of the pipeline. If not generated, a random one will be generated by default.\n description: A description of the pipeline. Defaults to `None`.\n cache_dir: A directory where the pipeline will be cached. Defaults to `None`.\n enable_metadata: Whether to include the distilabel metadata column for the pipeline\n in the final `Distiset`. It contains metadata used by distilabel, for example\n the raw outputs of the `LLM` without processing would be here, inside `raw_output_...`\n field. 
Defaults to `False`.\n requirements: List of requirements that must be installed to run the pipeline.\n Defaults to `None`, but can be helpful to inform in a pipeline to be shared\n that this requirements must be installed.\n \"\"\"\n self.name = name or _PIPELINE_DEFAULT_NAME\n self.description = description\n self._enable_metadata = enable_metadata\n self.dag = DAG()\n\n if cache_dir:\n self._cache_dir = Path(cache_dir)\n elif env_cache_dir := envs.DISTILABEL_CACHE_DIR:\n self._cache_dir = Path(env_cache_dir)\n else:\n self._cache_dir = constants.PIPELINES_CACHE_DIR\n\n self._logger = logging.getLogger(\"distilabel.pipeline\")\n\n self._batch_manager: Optional[\"_BatchManager\"] = None\n self._write_buffer: Optional[\"_WriteBuffer\"] = None\n self._steps_input_queues: Dict[str, \"Queue\"] = {}\n\n self._steps_load_status: Dict[str, int] = {}\n self._steps_load_status_lock = threading.Lock()\n\n self._stop_called = False\n self._stop_called_lock = threading.Lock()\n self._stop_calls = 0\n\n self._recover_offline_batch_generate_for_step: Union[\n Tuple[str, List[List[Dict[str, Any]]]], None\n ] = None\n\n self._fs: Optional[fsspec.AbstractFileSystem] = None\n self._storage_base_path: Optional[str] = None\n self._use_fs_to_pass_data: bool = False\n self._dry_run = False\n\n self._current_stage = 0\n self._stages_last_batch: List[List[str]] = []\n self._load_groups = []\n\n self.requirements = requirements or []\n\n self._exception: Union[Exception, None] = None\n\n self._log_queue: Union[\"Queue[Any]\", None] = None\n\n def __enter__(self) -> Self:\n \"\"\"Set the global pipeline instance when entering a pipeline context.\"\"\"\n _GlobalPipelineManager.set_pipeline(self)\n return self\n\n def __exit__(self, exc_type, exc_value, traceback) -> None:\n \"\"\"Unset the global pipeline instance when exiting a pipeline context.\"\"\"\n _GlobalPipelineManager.set_pipeline(None)\n self._set_pipeline_name()\n\n def _set_pipeline_name(self) -> None:\n \"\"\"Creates a name for the pipeline if it's the default one (if hasn't been set).\"\"\"\n if self.name == _PIPELINE_DEFAULT_NAME:\n self.name = f\"pipeline_{'_'.join(self.dag)}\"\n\n @property\n def signature(self) -> str:\n \"\"\"Makes a signature (hash) of a pipeline, using the step ids and the adjacency between them.\n\n The main use is to find the pipeline in the cache folder.\n\n Returns:\n Signature of the pipeline.\n \"\"\"\n\n pipeline_dump = self.dump()[\"pipeline\"]\n steps_names = list(self.dag)\n connections_info = [\n f\"{c['from']}-{'-'.join(c['to'])}\" for c in pipeline_dump[\"connections\"]\n ]\n\n routing_batch_functions_info = []\n for function in pipeline_dump[\"routing_batch_functions\"]:\n step = function[\"step\"]\n routing_batch_function: \"RoutingBatchFunction\" = self.dag.get_step(step)[\n constants.ROUTING_BATCH_FUNCTION_ATTR_NAME\n ]\n if type_info := routing_batch_function._get_type_info():\n step += f\"-{type_info}\"\n routing_batch_functions_info.append(step)\n\n return hashlib.sha1(\n \",\".join(\n steps_names + connections_info + routing_batch_functions_info\n ).encode()\n ).hexdigest()\n\n def run(\n self,\n parameters: Optional[Dict[str, Dict[str, Any]]] = None,\n load_groups: Optional[\"LoadGroups\"] = None,\n use_cache: bool = True,\n storage_parameters: Optional[Dict[str, Any]] = None,\n use_fs_to_pass_data: bool = False,\n dataset: Optional[\"InputDataset\"] = None,\n dataset_batch_size: int = 50,\n logging_handlers: Optional[List[logging.Handler]] = None,\n ) -> \"Distiset\": # type: ignore\n \"\"\"Run the 
pipeline. It will set the runtime parameters for the steps and validate\n the pipeline.\n\n This method should be extended by the specific pipeline implementation,\n adding the logic to run the pipeline.\n\n Args:\n parameters: A dictionary with the step name as the key and a dictionary with\n the runtime parameters for the step as the value. Defaults to `None`.\n load_groups: A list containing lists of steps that have to be loaded together\n and in isolation with respect to the rest of the steps of the pipeline.\n This argument also allows passing the following modes:\n\n - \"sequential_step_execution\": each step will be executed in a stage i.e.\n the execution of the steps will be sequential.\n\n Defaults to `None`.\n use_cache: Whether to use the cache from previous pipeline runs. Defaults to\n `True`.\n storage_parameters: A dictionary with the storage parameters (`fsspec` and path)\n that will be used to store the data of the `_Batch`es passed between the\n steps if `use_fs_to_pass_data` is `True` (for the batches received by a\n `GlobalStep` it will be always used). It must have at least the \"path\" key,\n and it can contain additional keys depending on the protocol. By default,\n it will use the local file system and a directory in the cache directory.\n Defaults to `None`.\n use_fs_to_pass_data: Whether to use the file system to pass the data of\n the `_Batch`es between the steps. Even if this parameter is `False`, the\n `Batch`es received by `GlobalStep`s will always use the file system to\n pass the data. Defaults to `False`.\n dataset: If given, it will be used to create a `GeneratorStep` and put it as the\n root step. Convenient method when you have already processed the dataset in\n your script and just want to pass it already processed. Defaults to `None`.\n dataset_batch_size: if `dataset` is given, this will be the size of the batches\n yield by the `GeneratorStep` created using the `dataset`. Defaults to `50`.\n logging_handlers: A list of logging handlers that will be used to log the\n output of the pipeline. This argument can be useful so the logging messages\n can be extracted and used in a different context. Defaults to `None`.\n\n Returns:\n The `Distiset` created by the pipeline.\n \"\"\"\n\n self._exception: Union[Exception, None] = None\n\n # Set the runtime parameters that will be used during the pipeline execution.\n # They are used to generate the signature of the pipeline that is used to hit the\n # cache when the pipeline is run, so it's important to do it first.\n self._set_runtime_parameters(parameters or {})\n\n self._refresh_pipeline_from_cache()\n\n if dataset is not None:\n self._add_dataset_generator_step(dataset, dataset_batch_size)\n\n setup_logging(\n log_queue=self._log_queue,\n filename=str(self._cache_location[\"log_file\"]),\n logging_handlers=logging_handlers,\n )\n\n # Set the name of the pipeline if it's the default one. This should be called\n # if the pipeline is defined within the context manager, and the run is called\n # outside of it. 
Is here in the following case:\n # with Pipeline() as pipeline:\n # pipeline.run()\n self._set_pipeline_name()\n\n # Validate the pipeline DAG to check that all the steps are chainable, there are\n # no missing runtime parameters, batch sizes are correct, load groups are valid,\n # etc.\n self._load_groups = self._built_load_groups(load_groups)\n self._validate()\n\n self._set_pipeline_artifacts_path_in_steps()\n\n # Set the initial load status for all the steps\n self._init_steps_load_status()\n\n # Load the stages status or initialize it\n self._load_stages_status(use_cache)\n\n # Load the `_BatchManager` from cache or create one from scratch\n self._load_batch_manager(use_cache)\n\n # Check pipeline requirements are installed\n self._check_requirements()\n\n # Setup the filesystem that will be used to pass the data of the `_Batch`es\n self._setup_fsspec(storage_parameters)\n self._use_fs_to_pass_data = use_fs_to_pass_data\n\n if self._dry_run:\n self._logger.info(\"\ud83c\udf35 Dry run mode\")\n\n # If the batch manager is not able to generate batches, that means that the loaded\n # `_BatchManager` from cache didn't have any remaining batches to process i.e.\n # the previous pipeline execution was completed successfully.\n if not self._batch_manager.can_generate(): # type: ignore\n self._logger.info(\n \"\ud83d\udcbe Loaded batch manager from cache doesn't contain any remaining data.\"\n \" Returning `Distiset` from cache data...\"\n )\n distiset = create_distiset(\n data_dir=self._cache_location[\"data\"],\n pipeline_path=self._cache_location[\"pipeline\"],\n log_filename_path=self._cache_location[\"log_file\"],\n enable_metadata=self._enable_metadata,\n dag=self.dag,\n )\n stop_logging()\n return distiset\n\n self._setup_write_buffer(use_cache)\n\n self._print_load_stages_info()\n\n def dry_run(\n self,\n parameters: Optional[Dict[str, Dict[str, Any]]] = None,\n batch_size: int = 1,\n dataset: Optional[\"InputDataset\"] = None,\n ) -> \"Distiset\":\n \"\"\"Do a dry run to test the pipeline runs as expected.\n\n Running a `Pipeline` in dry run mode will set all the `batch_size` of generator steps\n to the specified `batch_size`, and run just with a single batch, effectively\n running the whole pipeline with a single example. The cache will be set to `False`.\n\n Args:\n parameters: A dictionary with the step name as the key and a dictionary with\n the runtime parameters for the step as the value. Defaults to `None`.\n batch_size: The batch size of the unique batch generated by the generators\n steps of the pipeline. Defaults to `1`.\n dataset: If given, it will be used to create a `GeneratorStep` and put it as the\n root step. Convenient method when you have already processed the dataset in\n your script and just want to pass it already processed. 
Defaults to `None`.\n\n Returns:\n Will return the `Distiset` as the main run method would do.\n \"\"\"\n self._dry_run = True\n\n for step_name in self.dag:\n step = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n\n if step.is_generator:\n if not parameters:\n parameters = {}\n parameters[step_name] = {\"batch_size\": batch_size}\n\n distiset = self.run(parameters=parameters, use_cache=False, dataset=dataset)\n\n self._dry_run = False\n return distiset\n\n def get_load_stages(self, load_groups: Optional[\"LoadGroups\"] = None) -> LoadStages:\n \"\"\"Convenient method to get the load stages of a pipeline.\n\n Args:\n load_groups: A list containing list of steps that has to be loaded together\n and in isolation with respect to the rest of the steps of the pipeline.\n Defaults to `None`.\n\n Returns:\n A tuple with the first element containing asorted list by stage containing\n lists with the names of the steps of the stage, and the second element a list\n sorted by stage containing lists with the names of the last steps of the stage.\n \"\"\"\n load_groups = self._built_load_groups(load_groups)\n return self.dag.get_steps_load_stages(load_groups)\n\n def _add_dataset_generator_step(\n self, dataset: \"InputDataset\", batch_size: int = 50\n ) -> None:\n \"\"\"Create a root step to work as the `GeneratorStep` for the pipeline using a\n dataset.\n\n Args:\n dataset: A dataset that will be used to create a `GeneratorStep` and\n placed in the DAG as the root step.\n batch_size: The size of the batches generated by the `GeneratorStep`.\n\n Raises:\n ValueError: If there's already a `GeneratorStep` in the pipeline.\n \"\"\"\n for step_name in self.dag:\n step = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n if isinstance(step_name, GeneratorStep):\n raise DistilabelUserError(\n \"There is already a `GeneratorStep` in the pipeline, you can either\"\n \" pass a `dataset` to the run method, or create a `GeneratorStep` explictly.\"\n f\" `GeneratorStep`: {step}\",\n page=\"sections/how_to_guides/basic/step/#types-of-steps\",\n )\n loader = make_generator_step(\n dataset=dataset,\n pipeline=self,\n batch_size=batch_size,\n )\n self.dag.add_root_step(loader)\n\n def get_runtime_parameters_info(self) -> \"PipelineRuntimeParametersInfo\":\n \"\"\"Get the runtime parameters for the steps in the pipeline.\n\n Returns:\n A dictionary with the step name as the key and a list of dictionaries with\n the parameter name and the parameter info as the value.\n \"\"\"\n runtime_parameters = {}\n for step_name in self.dag:\n step: \"_Step\" = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n runtime_parameters[step_name] = step.get_runtime_parameters_info()\n return runtime_parameters\n\n def _built_load_groups(\n self, load_groups: Optional[\"LoadGroups\"] = None\n ) -> List[List[str]]:\n if load_groups is None:\n return []\n\n if load_groups == \"sequential_step_execution\":\n return [[step_name] for step_name in self.dag]\n\n return [\n [\n step.name if isinstance(step, _Step) else step\n for step in steps_load_group\n ] # type: ignore\n for steps_load_group in load_groups\n ]\n\n def _validate(self) -> None:\n \"\"\"Validates the pipeline DAG to check that all the steps are chainable, there are\n no missing runtime parameters, batch sizes are correct and that load groups are\n valid (if any).\"\"\"\n self.dag.validate()\n self._validate_load_groups(self._load_groups)\n\n def _validate_load_groups(self, load_groups: List[List[Any]]) -> None: # noqa: C901\n \"\"\"Checks that the provided 
load groups are valid and that the steps can be scheduled\n to be loaded in different stages without any issue.\n\n Args:\n load_groups: the load groups to be checked.\n\n Raises:\n DistilabelUserError: if something is not OK when checking the load groups.\n \"\"\"\n\n def check_predecessor_in_load_group(\n step_name: str, load_group: List[str], first: bool\n ) -> Union[str, None]:\n if not first and step_name in load_group:\n return step_name\n\n for predecessor_step_name in self.dag.get_step_predecessors(step_name):\n # Immediate predecessor is in the same load group. This is OK.\n if first and predecessor_step_name in load_group:\n continue\n\n # Case: A -> B -> C, load_group=[A, C]\n # If a non-immediate predecessor is in the same load group and an immediate\n # predecessor is not , then it's not OK because we cannot load `step_name`\n # before one immediate predecessor.\n if step_name_in_load_group := check_predecessor_in_load_group(\n predecessor_step_name, load_group, False\n ):\n return step_name_in_load_group\n\n return None\n\n steps_included_in_load_group = []\n for load_group_num, steps_load_group in enumerate(load_groups):\n for step_name in steps_load_group:\n if step_name not in self.dag.G:\n raise DistilabelUserError(\n f\"Step with name '{step_name}' included in group {load_group_num} of\"\n \" the `load_groups` is not an step included in the pipeline. Please,\"\n \" check that you're passing the correct step name and run again.\",\n page=\"sections/how_to_guides/advanced/load_groups_and_execution_stages\",\n )\n\n node = self.dag.get_step(step_name)\n step: \"_Step\" = node[constants.STEP_ATTR_NAME]\n\n if step_name_in_load_group := check_predecessor_in_load_group(\n step_name, steps_load_group, True\n ):\n # Improve this user error message\n raise DistilabelUserError(\n f\"Step with name '{step_name}' cannot be in the same load group\"\n f\" as the step with name '{step_name_in_load_group}'. '{step_name_in_load_group}'\"\n f\" is not an immediate predecessor of '{step_name}' and there are\"\n \" immediate predecessors that have not been included.\",\n page=\"sections/how_to_guides/advanced/load_groups_and_execution_stages\",\n )\n\n if step.is_global and len(steps_load_group) > 1:\n raise DistilabelUserError(\n f\"Global step '{step_name}' has been included in a load group along\"\n \" more steps. Global steps cannot be included in a load group with\"\n \" more steps as they will be loaded in a different stage to the\"\n \" rest of the steps in the pipeline by default.\",\n page=\"sections/how_to_guides/advanced/load_groups_and_execution_stages\",\n )\n\n if step_name in steps_included_in_load_group:\n raise DistilabelUserError(\n f\"Step with name '{step_name}' in load group {load_group_num} has\"\n \" already been included in a previous load group. 
A step cannot be in more\"\n \" than one load group.\",\n page=\"sections/how_to_guides/advanced/load_groups_and_execution_stages\",\n )\n\n steps_included_in_load_group.append(step_name)\n\n def _init_steps_load_status(self) -> None:\n \"\"\"Initialize the `_steps_load_status` dictionary assigning 0 to every step of\n the pipeline.\"\"\"\n for step_name in self.dag:\n self._steps_load_status[step_name] = _STEP_NOT_LOADED_CODE\n\n def _set_pipeline_artifacts_path_in_steps(self) -> None:\n \"\"\"Sets the attribute `_pipeline_artifacts_path` in all the `Step`s of the pipeline,\n so steps can use it to get the path to save the generated artifacts.\"\"\"\n artifacts_path = self._cache_location[\"data\"] / constants.STEPS_ARTIFACTS_PATH\n for name in self.dag:\n step: \"_Step\" = self.dag.get_step(name)[constants.STEP_ATTR_NAME]\n step.set_pipeline_artifacts_path(path=artifacts_path)\n\n def _check_requirements(self) -> None:\n \"\"\"Checks if the dependencies required to run the pipeline are installed.\n\n Raises:\n ModuleNotFoundError: if one or more requirements are missing.\n \"\"\"\n if to_install := self.requirements_to_install():\n # Print the list of requirements like they would appear in a requirements.txt\n to_install_list = \"\\n\" + \"\\n\".join(to_install)\n msg = f\"Please install the following requirements to run the pipeline: {to_install_list}\"\n self._logger.error(msg)\n raise ModuleNotFoundError(msg)\n\n def _setup_fsspec(\n self, storage_parameters: Optional[Dict[str, Any]] = None\n ) -> None:\n \"\"\"Setups the `fsspec` filesystem to be used to store the data of the `_Batch`es\n passed between the steps.\n\n Args:\n storage_parameters: A dictionary with the storage parameters (`fsspec` and path)\n that will be used to store the data of the `_Batch`es passed between the\n steps if `use_fs_to_pass_data` is `True` (for the batches received by a\n `GlobalStep` it will be always used). It must have at least the \"path\" key,\n and it can contain additional keys depending on the protocol. By default,\n it will use the local file system and a directory in the cache directory.\n Defaults to `None`.\n \"\"\"\n if not storage_parameters:\n self._fs = fsspec.filesystem(\"file\")\n self._storage_base_path = (\n f\"file://{self._cache_location['batch_input_data']}\"\n )\n return\n\n if \"path\" not in storage_parameters:\n raise DistilabelUserError(\n \"The 'path' key must be present in the `storage_parameters` dictionary\"\n \" if it's not `None`.\",\n page=\"sections/how_to_guides/advanced/fs_to_pass_data/\",\n )\n\n path = storage_parameters.pop(\"path\")\n protocol = UPath(path).protocol\n\n self._fs = fsspec.filesystem(protocol, **storage_parameters)\n self._storage_base_path = path\n\n def _add_step(self, step: \"_Step\") -> None:\n \"\"\"Add a step to the pipeline.\n\n Args:\n step: The step to be added to the pipeline.\n \"\"\"\n self.dag.add_step(step)\n\n def _add_edge(self, from_step: str, to_step: str) -> None:\n \"\"\"Add an edge between two steps in the pipeline.\n\n Args:\n from_step: The name of the step that will generate the input for `to_step`.\n to_step: The name of the step that will receive the input from `from_step`.\n \"\"\"\n self.dag.add_edge(from_step, to_step)\n\n # Check if `from_step` has a `routing_batch_function`. 
If it does, then mark\n # `to_step` as a step that will receive a routed batch.\n node = self.dag.get_step(from_step) # type: ignore\n routing_batch_function = node.get(\n constants.ROUTING_BATCH_FUNCTION_ATTR_NAME, None\n )\n self.dag.set_step_attr(\n name=to_step,\n attr=constants.RECEIVES_ROUTED_BATCHES_ATTR_NAME,\n value=routing_batch_function is not None,\n )\n\n def _is_convergence_step(self, step_name: str) -> None:\n \"\"\"Checks if a step is a convergence step.\n\n Args:\n step_name: The name of the step.\n \"\"\"\n return self.dag.get_step(step_name).get(constants.CONVERGENCE_STEP_ATTR_NAME)\n\n def _add_routing_batch_function(\n self, step_name: str, routing_batch_function: \"RoutingBatchFunction\"\n ) -> None:\n \"\"\"Add a routing batch function to a step.\n\n Args:\n step_name: The name of the step that will receive the routed batch.\n routing_batch_function: The function that will route the batch to the step.\n \"\"\"\n self.dag.set_step_attr(\n name=step_name,\n attr=constants.ROUTING_BATCH_FUNCTION_ATTR_NAME,\n value=routing_batch_function,\n )\n\n def _set_runtime_parameters(self, parameters: Dict[str, Dict[str, Any]]) -> None:\n \"\"\"Set the runtime parameters for the steps in the pipeline.\n\n Args:\n parameters: A dictionary with the step name as the key and a dictionary with\n the parameter name as the key and the parameter value as the value.\n \"\"\"\n step_names = set(self.dag.G)\n for step_name, step_parameters in parameters.items():\n if step_name not in step_names:\n self._logger.warning(\n f\"\u2753 Step '{step_name}' provided in `Pipeline.run(parameters={{...}})` not found in the pipeline.\"\n f\" Available steps are: {step_names}.\"\n )\n else:\n step: \"_Step\" = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n step.set_runtime_parameters(step_parameters)\n\n def _model_dump(self, obj: Any, **kwargs: Any) -> Dict[str, Any]:\n \"\"\"Dumps the DAG content to a dict.\n\n Args:\n obj (Any): Unused, just kept to match the signature of the parent method.\n kwargs (Any): Unused, just kept to match the signature of the parent method.\n\n Returns:\n Dict[str, Any]: Internal representation of the DAG from networkx in a serializable format.\n \"\"\"\n return self.dag.dump()\n\n def draw(\n self,\n path: Optional[Union[str, Path]] = \"pipeline.png\",\n top_to_bottom: bool = False,\n show_edge_labels: bool = True,\n ) -> str:\n \"\"\"\n Draws the pipeline.\n\n Parameters:\n path: The path to save the image to.\n top_to_bottom: Whether to draw the DAG top to bottom. Defaults to `False`.\n show_edge_labels: Whether to show the edge labels. 
Defaults to `True`.\n\n Returns:\n The path to the saved image.\n \"\"\"\n png = self.dag.draw(\n top_to_bottom=top_to_bottom, show_edge_labels=show_edge_labels\n )\n with open(path, \"wb\") as f:\n f.write(png)\n return path\n\n def __repr__(self) -> str:\n \"\"\"\n If running in a Jupyter notebook, display an image representing this `Pipeline`.\n \"\"\"\n if in_notebook():\n try:\n from IPython.display import Image, display\n\n image_data = self.dag.draw()\n\n display(Image(image_data))\n except Exception:\n pass\n return super().__repr__()\n\n def dump(self, **kwargs: Any) -> Dict[str, Any]:\n return {\n \"distilabel\": {\"version\": __version__},\n \"pipeline\": {\n \"name\": self.name,\n \"description\": self.description,\n **super().dump(),\n },\n \"requirements\": self.requirements,\n }\n\n @classmethod\n def from_dict(cls, data: Dict[str, Any]) -> Self:\n \"\"\"Create a Pipeline from a dict containing the serialized data.\n\n Note:\n It's intended for internal use.\n\n Args:\n data (Dict[str, Any]): Dictionary containing the serialized data from a Pipeline.\n\n Returns:\n BasePipeline: Pipeline recreated from the dictionary info.\n \"\"\"\n name = data[\"pipeline\"][\"name\"]\n description = data[\"pipeline\"].get(\"description\")\n requirements = data.get(\"requirements\", [])\n with cls(name=name, description=description, requirements=requirements) as pipe:\n pipe.dag = DAG.from_dict(data[\"pipeline\"])\n return pipe\n\n @property\n def _cache_location(self) -> \"_CacheLocation\":\n \"\"\"Dictionary containing the object that will stored and the location,\n whether it is a filename or a folder.\n\n Returns:\n Path: Filenames where the pipeline content will be serialized.\n \"\"\"\n folder = self._cache_dir / self.name / self.signature\n pipeline_execution_dir = folder / \"executions\" / self.aggregated_steps_signature\n return {\n \"pipeline\": pipeline_execution_dir / \"pipeline.yaml\",\n \"batch_manager\": pipeline_execution_dir / \"batch_manager.json\",\n \"steps_data\": self._cache_dir / self.name / \"steps_data\",\n \"data\": pipeline_execution_dir / \"data\",\n \"batch_input_data\": pipeline_execution_dir / \"batch_input_data\",\n \"log_file\": pipeline_execution_dir / \"pipeline.log\",\n \"stages_file\": pipeline_execution_dir / \"stages.json\",\n }\n\n @property\n def aggregated_steps_signature(self) -> str:\n \"\"\"Creates an aggregated signature using `Step`s signature that will be used for\n the `_BatchManager`.\n\n Returns:\n The aggregated signature.\n \"\"\"\n signatures = []\n for step_name in self.dag:\n step: \"_Step\" = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n signatures.append(step.signature)\n return hashlib.sha1(\"\".join(signatures).encode()).hexdigest()\n\n def _cache(self) -> None:\n \"\"\"Saves the `BasePipeline` using the `_cache_filename`.\"\"\"\n if self._dry_run:\n return\n\n self.save(\n path=self._cache_location[\"pipeline\"],\n format=self._cache_location[\"pipeline\"].suffix.replace(\".\", \"\"), # type: ignore\n )\n\n if self._batch_manager is not None:\n self._batch_manager.cache(\n path=self._cache_location[\"batch_manager\"],\n steps_data_path=self._cache_location[\"steps_data\"],\n )\n\n self._save_stages_status()\n\n self._logger.debug(\"Pipeline and batch manager saved to cache.\")\n\n def _save_stages_status(self) -> None:\n \"\"\"Saves the stages status to cache.\"\"\"\n self.save(\n path=self._cache_location[\"stages_file\"],\n format=\"json\",\n dump={\n \"current_stage\": self._current_stage,\n \"stages_last_batch\": 
self._stages_last_batch,\n },\n )\n\n def _get_steps_load_stages(self) -> Tuple[List[List[str]], List[List[str]]]:\n return self.dag.get_steps_load_stages(self._load_groups)\n\n def _load_stages_status(self, use_cache: bool = True) -> None:\n \"\"\"Try to load the stages status from cache, or initialize it if cache file doesn't\n exist or cache is not going to be used.\"\"\"\n if use_cache and self._cache_location[\"stages_file\"].exists():\n stages_status = read_json(self._cache_location[\"stages_file\"])\n self._current_stage = stages_status[\"current_stage\"]\n self._stages_last_batch = stages_status[\"stages_last_batch\"]\n else:\n self._current_stage = 0\n self._stages_last_batch = [\n [] for _ in range(len(self._get_steps_load_stages()[0]))\n ]\n\n def _refresh_pipeline_from_cache(self) -> None:\n \"\"\"Refresh the DAG (and its steps) from the cache file. This is useful as some\n `Step`s can update and change their state during the pipeline execution, and this\n method will make sure the pipeline is up-to-date with the latest changes when\n the pipeline is reloaded from cache.\n \"\"\"\n\n def recursively_handle_secrets_and_excluded_attributes(\n cached_model: \"BaseModel\", model: \"BaseModel\"\n ) -> None:\n \"\"\"Recursively handle the secrets and excluded attributes of a `BaseModel`,\n setting the values of the cached model to the values of the model.\n\n Args:\n cached_model: The cached model that will be updated as it doesn't contain\n the secrets and excluded attributes (not serialized).\n model: The model that contains the secrets and excluded attributes because\n it comes from pipeline instantiation.\n \"\"\"\n for field_name, field_info in cached_model.model_fields.items():\n if field_name in (\"pipeline\"):\n continue\n\n inner_type = extract_annotation_inner_type(field_info.annotation)\n if is_type_pydantic_secret_field(inner_type) or field_info.exclude:\n setattr(cached_model, field_name, getattr(model, field_name))\n elif isclass(inner_type) and issubclass(inner_type, BaseModel):\n recursively_handle_secrets_and_excluded_attributes(\n getattr(cached_model, field_name),\n getattr(model, field_name),\n )\n\n if self._cache_location[\"pipeline\"].exists():\n cached_dag = self.from_yaml(self._cache_location[\"pipeline\"]).dag\n\n for step_name in cached_dag:\n step_cached: \"_Step\" = cached_dag.get_step(step_name)[\n constants.STEP_ATTR_NAME\n ]\n step: \"_Step\" = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n recursively_handle_secrets_and_excluded_attributes(step_cached, step)\n\n self.dag = cached_dag\n\n def _load_batch_manager(self, use_cache: bool = True) -> None:\n \"\"\"Will try to load the `_BatchManager` from the cache dir if found. 
Otherwise,\n it will create one from scratch.\n\n If the `_BatchManager` is loaded from cache, we check for invalid steps (those that\n may have a different signature than the original in the pipeline folder), and\n restart them, as well as their successors.\n\n Args:\n use_cache: whether the cache should be used or not.\n \"\"\"\n batch_manager_cache_loc = self._cache_location[\"batch_manager\"]\n\n # This first condition handles the case in which the pipeline is exactly the same\n # no steps have been added, removed or changed.\n if use_cache and batch_manager_cache_loc.exists():\n self._logger.info(\n f\"\ud83d\udcbe Loading `_BatchManager` from cache: '{batch_manager_cache_loc}'\"\n )\n self._batch_manager = _BatchManager.load_from_cache(\n dag=self.dag,\n batch_manager_path=batch_manager_cache_loc,\n steps_data_path=self._cache_location[\"steps_data\"],\n )\n self._invalidate_steps_cache_if_required()\n # In this other case, the pipeline has been changed. We need to create a new batch\n # manager and if `use_cache==True` then check which outputs have we computed and\n # cached for steps that haven't changed but that were executed in another pipeline,\n # and therefore we can reuse\n else:\n self._batch_manager = _BatchManager.from_dag(\n dag=self.dag,\n use_cache=use_cache,\n steps_data_path=self._cache_location[\"steps_data\"],\n )\n\n def _invalidate_steps_cache_if_required(self) -> None:\n \"\"\"Iterates over the steps of the pipeline and invalidates their cache if required.\"\"\"\n for step_name in self.dag:\n # `GeneratorStep`s doesn't receive input data so no need to check their\n # `_BatchManagerStep`\n if self.dag.get_step(step_name)[constants.STEP_ATTR_NAME].is_generator:\n continue\n\n step: \"_Step\" = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n if not step.use_cache:\n self._batch_manager.invalidate_cache_for(\n step_name=step.name,\n dag=self.dag,\n steps_data_path=self._cache_location[\"steps_data\"],\n ) # type: ignore\n self._logger.info(\n f\"\u267b\ufe0f Step '{step.name}' won't use cache (`use_cache=False`). 
The cache of this step and their successors won't be\"\n \" reused and the results will have to be recomputed.\"\n )\n break\n\n def _setup_write_buffer(self, use_cache: bool = True) -> None:\n \"\"\"Setups the `_WriteBuffer` that will store the data of the leaf steps of the\n pipeline while running, so the `Distiset` can be created at the end.\n \"\"\"\n if not use_cache and self._cache_location[\"data\"].exists():\n shutil.rmtree(self._cache_location[\"data\"])\n buffer_data_path = self._cache_location[\"data\"] / constants.STEPS_OUTPUTS_PATH\n self._logger.info(f\"\ud83d\udcdd Pipeline data will be written to '{buffer_data_path}'\")\n self._write_buffer = _WriteBuffer(\n buffer_data_path,\n self.dag.leaf_steps,\n steps_cached={\n step_name: self.dag.get_step(step_name)[\n constants.STEP_ATTR_NAME\n ].use_cache\n for step_name in self.dag\n },\n )\n\n def _print_load_stages_info(self) -> None:\n \"\"\"Prints the information about the load stages.\"\"\"\n stages, _ = self._get_steps_load_stages()\n msg = \"\"\n for stage, steps in enumerate(stages):\n steps_to_be_loaded = self._steps_to_be_loaded_in_stage(stage)\n msg += f\"\\n * Stage {stage}:\"\n for step_name in steps:\n step: \"Step\" = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n if step.is_generator:\n emoji = \"\ud83d\udeb0\"\n elif step.is_global:\n emoji = \"\ud83c\udf10\"\n else:\n emoji = \"\ud83d\udd04\"\n msg += f\"\\n - {emoji} '{step_name}'\"\n if step_name not in steps_to_be_loaded:\n msg += \" (results cached, won't be loaded and executed)\"\n legend = \"\\n * Legend: \ud83d\udeb0 GeneratorStep \ud83c\udf10 GlobalStep \ud83d\udd04 Step\"\n self._logger.info(\n f\"\u231b The steps of the pipeline will be loaded in stages:{legend}{msg}\"\n )\n\n def _run_output_queue_loop_in_thread(self) -> threading.Thread:\n \"\"\"Runs the output queue loop in a separate thread to receive the output batches\n from the steps. This is done to avoid the signal handler to block the loop, which\n would prevent the pipeline from stopping correctly.\"\"\"\n thread = threading.Thread(target=self._output_queue_loop)\n thread.start()\n return thread\n\n def _output_queue_loop(self) -> None:\n \"\"\"Loop to receive the output batches from the steps and manage the flow of the\n batches through the pipeline.\"\"\"\n self._create_steps_input_queues()\n\n if not self._initialize_pipeline_execution():\n return\n\n while self._should_continue_processing(): # type: ignore\n self._logger.debug(\"Waiting for output batch from step...\")\n if (batch := self._output_queue.get()) is None:\n self._logger.debug(\"Received `None` from output queue. 
Breaking loop.\")\n break\n\n self._logger.debug(\n f\"Received batch with seq_no {batch.seq_no} from step '{batch.step_name}'\"\n f\" from output queue: {batch}\"\n )\n\n self._process_batch(batch)\n\n # If `_stop_called` was set to `True` while waiting for the output queue, then\n # we need to handle the stop of the pipeline and break the loop to avoid\n # propagating the batches through the pipeline and making the stop process\n # slower.\n with self._stop_called_lock:\n if self._stop_called:\n self._handle_batch_on_stop(batch)\n break\n\n # If there is another load stage and all the `last_batch`es from the stage\n # have been received, then load the next stage.\n if self._should_load_next_stage():\n self._wait_current_stage_to_finish()\n if not self._update_stage():\n break\n\n self._manage_batch_flow(batch)\n\n self._finalize_pipeline_execution()\n\n def _create_steps_input_queues(self) -> None:\n \"\"\"Creates the input queue for all the steps in the pipeline.\"\"\"\n for step_name in self.dag:\n self._logger.debug(f\"Creating input queue for '{step_name}' step...\")\n input_queue = self._create_step_input_queue(step_name)\n self._steps_input_queues[step_name] = input_queue\n\n def _initialize_pipeline_execution(self) -> bool:\n \"\"\"Load the steps of the required stage to initialize the pipeline execution,\n and requests the initial batches to trigger the batch flowing in the pipeline.\n\n Returns:\n `True` if initialization went OK, `False` otherwise.\n \"\"\"\n # Wait for all the steps to be loaded correctly\n if not self._run_stage_steps_and_wait(stage=self._current_stage):\n self._set_steps_not_loaded_exception()\n return False\n\n # Send the \"first\" batches to the steps so the batches starts flowing through\n # the input queues and output queue\n self._request_initial_batches()\n\n return True\n\n def _should_continue_processing(self) -> bool:\n \"\"\"Condition for the consume batches from the `output_queue` loop.\n\n Returns:\n `True` if should continue consuming batches, `False` otherwise and the pipeline\n should stop.\n \"\"\"\n with self._stop_called_lock:\n return self._batch_manager.can_generate() and not self._stop_called # type: ignore\n\n def _process_batch(\n self, batch: \"_Batch\", send_last_batch_flag: bool = True\n ) -> None:\n \"\"\"Process a batch consumed from the `output_queue`.\n\n Args:\n batch: the batch to be processed.\n \"\"\"\n if batch.data_path:\n self._logger.debug(\n f\"Reading {batch.seq_no} batch data from '{batch.step_name}': '{batch.data_path}'\"\n )\n batch.read_batch_data_from_fs()\n\n if batch.step_name in self.dag.leaf_steps:\n self._write_buffer.add_batch(batch) # type: ignore\n\n if batch.last_batch:\n self._register_stages_last_batch(batch)\n\n # Make sure to send the `LAST_BATCH_SENT_FLAG` to the predecessors of the step\n # if the batch is the last one, so they stop their processing loop even if they\n # haven't received the last batch because of the routing function.\n if send_last_batch_flag:\n for step_name in self.dag.get_step_predecessors(batch.step_name):\n if self._is_step_running(step_name):\n self._send_last_batch_flag_to_step(step_name)\n\n def _set_step_for_recovering_offline_batch_generation(\n self, step: \"_Step\", data: List[List[Dict[str, Any]]]\n ) -> None:\n \"\"\"Sets the required information to recover a pipeline execution from a `_Step`\n that used an `LLM` with offline batch generation.\n\n Args:\n step: The `_Step` that used an `LLM` with offline batch generation.\n data: The data that was used to generate the 
batches for the step.\n \"\"\"\n # Replace step so the attribute `jobs_ids` of the `LLM` is not lost, as it was\n # updated in the child process but not in the main process.\n step_name: str = step.name # type: ignore\n self.dag.set_step_attr(\n name=step_name, attr=constants.STEP_ATTR_NAME, value=step\n )\n self._recover_offline_batch_generate_for_step = (step_name, data)\n\n def _add_batch_for_recovering_offline_batch_generation(self) -> None:\n \"\"\"Adds a dummy `_Batch` to the specified step name (it's a `Task` that used an\n `LLM` with offline batch generation) to recover the pipeline state for offline\n batch generation in next pipeline executions.\"\"\"\n assert self._batch_manager, \"Batch manager is not set\"\n\n if self._recover_offline_batch_generate_for_step is None:\n return\n\n step_name, data = self._recover_offline_batch_generate_for_step\n self._logger.debug(\n f\"Adding batch to '{step_name}' step to recover pipeline execution for offline\"\n \" batch generation...\"\n )\n self._batch_manager.add_batch_to_recover_offline_batch_generation(\n to_step=step_name,\n data=data,\n )\n\n def _register_stages_last_batch(self, batch: \"_Batch\") -> None:\n \"\"\"Registers the last batch received from a step in the `_stages_last_batch`\n dictionary.\n\n Args:\n batch: The last batch received from a step.\n \"\"\"\n _, stages_last_steps = self._get_steps_load_stages()\n stage_last_steps = stages_last_steps[self._current_stage]\n if batch.step_name in stage_last_steps:\n self._stages_last_batch[self._current_stage].append(batch.step_name)\n self._stages_last_batch[self._current_stage].sort()\n\n def _update_stage(self) -> bool:\n \"\"\"Checks if the steps of next stage should be loaded and updates `_current_stage`\n attribute.\n\n Returns:\n `True` if updating the stage went OK, `False` otherwise.\n \"\"\"\n self._current_stage += 1\n if not self._run_stage_steps_and_wait(stage=self._current_stage):\n self._set_steps_not_loaded_exception()\n return False\n\n return True\n\n def _should_load_next_stage(self) -> bool:\n \"\"\"Returns if the next stage should be loaded.\n\n Returns:\n `True` if the next stage should be loaded, `False` otherwise.\n \"\"\"\n _, stage_last_steps = self._get_steps_load_stages()\n there_is_next_stage = self._current_stage + 1 < len(stage_last_steps)\n stage_last_batches_received = (\n self._stages_last_batch[self._current_stage]\n == stage_last_steps[self._current_stage]\n )\n return there_is_next_stage and stage_last_batches_received\n\n def _finalize_pipeline_execution(self) -> None:\n \"\"\"Finalizes the pipeline execution handling the prematurely stop of the pipeline\n if required, caching the data and ensuring that all the steps finish its execution.\"\"\"\n\n # Send `None` to steps `input_queue`s just in case some step is still waiting\n self._notify_steps_to_stop()\n\n for step_name in self.dag:\n while self._is_step_running(step_name):\n self._logger.debug(f\"Waiting for step '{step_name}' to finish...\")\n time.sleep(0.5)\n\n with self._stop_called_lock:\n if self._stop_called:\n self._handle_stop()\n\n # Reset flag state\n self._stop_called = False\n\n self._add_batch_for_recovering_offline_batch_generation()\n\n self._cache()\n\n def _run_load_queue_loop_in_thread(self) -> threading.Thread:\n \"\"\"Runs a background thread that reads from the `load_queue` to update the status\n of the number of replicas loaded for each step.\n\n Returns:\n The thread that was started.\n \"\"\"\n thread = threading.Thread(target=self._run_load_queue_loop)\n 
thread.start()\n return thread\n\n def _run_load_queue_loop(self) -> None:\n \"\"\"Runs a loop that reads from the `load_queue` to update the status of the number\n of replicas loaded for each step.\"\"\"\n\n while True:\n if (load_info := self._load_queue.get()) is None:\n self._logger.debug(\"Received `None` from load queue. Breaking loop.\")\n break\n\n with self._steps_load_status_lock:\n step_name, status = load_info[\"name\"], load_info[\"status\"]\n if status == \"loaded\":\n if self._steps_load_status[step_name] == _STEP_NOT_LOADED_CODE:\n self._steps_load_status[step_name] = 1\n else:\n self._steps_load_status[step_name] += 1\n elif status == \"unloaded\":\n self._steps_load_status[step_name] -= 1\n if self._steps_load_status[step_name] == 0:\n self._steps_load_status[step_name] = _STEP_UNLOADED_CODE\n else:\n # load failed\n self._steps_load_status[step_name] = _STEP_LOAD_FAILED_CODE\n\n self._logger.debug(\n f\"Step '{step_name}' loaded replicas: {self._steps_load_status[step_name]}\"\n )\n\n def _is_step_running(self, step_name: str) -> bool:\n \"\"\"Checks if the step is running (at least one replica is running).\n\n Args:\n step_name: The step to be check if running.\n\n Returns:\n `True` if the step is running, `False` otherwise.\n \"\"\"\n with self._steps_load_status_lock:\n return self._steps_load_status[step_name] >= 1\n\n def _steps_to_be_loaded_in_stage(self, stage: int) -> List[str]:\n \"\"\"Returns the list of steps of the provided stage that should be loaded taking\n into account if they have finished.\n\n Args:\n stage: the stage number\n\n Returns:\n A list containing the name of the steps that should be loaded in this stage.\n \"\"\"\n assert self._batch_manager, \"Batch manager is not set\"\n\n steps_stages, _ = self._get_steps_load_stages()\n\n return [\n step\n for step in steps_stages[stage]\n if not self._batch_manager.step_has_finished(step)\n ]\n\n def _get_steps_load_status(self, steps: List[str]) -> Dict[str, int]:\n \"\"\"Gets the a dictionary containing the load status of the provided steps.\n\n Args:\n steps: a list containing the names of the steps to get their load status.\n\n Returns:\n A dictionary containing the load status of the provided steps.\n \"\"\"\n return {\n step_name: replicas\n for step_name, replicas in self._steps_load_status.items()\n if step_name in steps\n }\n\n def _wait_current_stage_to_finish(self) -> None:\n \"\"\"Waits for the current stage to finish.\"\"\"\n stage = self._current_stage\n steps = self._steps_to_be_loaded_in_stage(stage)\n self._logger.info(f\"\u23f3 Waiting for stage {stage} to finish...\")\n with self._stop_called_lock:\n while not self._stop_called:\n filtered_steps_load_status = self._get_steps_load_status(steps)\n if all(\n replicas == _STEP_UNLOADED_CODE\n for replicas in filtered_steps_load_status.values()\n ):\n self._logger.info(f\"\u2705 Stage {stage} has finished!\")\n break\n\n def _run_stage_steps_and_wait(self, stage: int) -> bool:\n \"\"\"Runs the steps of the specified stage and waits for them to be ready.\n\n Args:\n stage: the stage from which the steps have to be loaded.\n\n Returns:\n `True` if all the steps have been loaded correctly, `False` otherwise.\n \"\"\"\n assert self._batch_manager, \"Batch manager is not set\"\n\n steps = self._steps_to_be_loaded_in_stage(stage)\n self._logger.debug(f\"Steps to be loaded in stage {stage}: {steps}\")\n\n # Run the steps of the stage\n self._run_steps(steps=steps)\n\n # Wait for them to be ready\n self._logger.info(f\"\u23f3 Waiting for all the 
steps of stage {stage} to load...\")\n previous_message = None\n with self._stop_called_lock:\n while not self._stop_called:\n with self._steps_load_status_lock:\n filtered_steps_load_status = self._get_steps_load_status(steps)\n self._logger.debug(\n f\"Steps from stage {stage} loaded: {filtered_steps_load_status}\"\n )\n\n if any(\n replicas_loaded == _STEP_LOAD_FAILED_CODE\n for replicas_loaded in filtered_steps_load_status.values()\n ):\n self._logger.error(\n f\"\u274c Failed to load all the steps of stage {stage}\"\n )\n return False\n\n num_steps_loaded = 0\n replicas_message = \"\"\n for step_name, replicas in filtered_steps_load_status.items():\n step_replica_count = self.dag.get_step_replica_count(step_name)\n # It can happen that the step is very fast and it has done all the\n # work and have finished its execution before checking if it has\n # been loaded, that's why we also considered the step to be loaded\n # if `_STEP_UNLOADED_CODE`.\n if (\n replicas == step_replica_count\n or replicas == _STEP_UNLOADED_CODE\n ):\n num_steps_loaded += 1\n replicas_message += f\"\\n * '{step_name}' replicas: {max(0, replicas)}/{step_replica_count}\"\n\n message = f\"\u23f3 Steps from stage {stage} loaded: {num_steps_loaded}/{len(filtered_steps_load_status)}{replicas_message}\"\n if num_steps_loaded > 0 and message != previous_message:\n self._logger.info(message)\n previous_message = message\n\n if num_steps_loaded == len(filtered_steps_load_status):\n self._logger.info(\n f\"\u2705 All the steps from stage {stage} have been loaded!\"\n )\n return True\n\n time.sleep(2.5)\n\n return not self._stop_called\n\n def _handle_stop(self) -> None:\n \"\"\"Handles the stop of the pipeline execution, which will stop the steps from\n processing more batches and wait for the output queue to be empty, to not lose\n any data that was already processed by the steps before the stop was called.\"\"\"\n self._logger.debug(\"Handling stop of the pipeline execution...\")\n\n self._add_batches_back_to_batch_manager()\n\n # Wait for the input queue to be empty, which means that all the steps finished\n # processing the batches that were sent before the stop flag.\n self._wait_steps_input_queues_empty()\n\n self._consume_output_queue()\n\n if self._should_load_next_stage():\n self._current_stage += 1\n\n def _wait_steps_input_queues_empty(self) -> None:\n self._logger.debug(\"Waiting for steps input queues to be empty...\")\n for step_name in self.dag:\n self._wait_step_input_queue_empty(step_name)\n self._logger.debug(\"Steps input queues are empty!\")\n\n def _wait_step_input_queue_empty(self, step_name: str) -> Union[\"Queue[Any]\", None]:\n \"\"\"Waits for the input queue of a step to be empty.\n\n Args:\n step_name: The name of the step.\n\n Returns:\n The input queue of the step if it's not loaded or finished, `None` otherwise.\n \"\"\"\n if self._check_step_not_loaded_or_finished(step_name):\n return None\n\n if input_queue := self.dag.get_step(step_name).get(\n constants.INPUT_QUEUE_ATTR_NAME\n ):\n while input_queue.qsize() != 0:\n pass\n return input_queue\n\n def _check_step_not_loaded_or_finished(self, step_name: str) -> bool:\n \"\"\"Checks if a step is not loaded or already finished.\n\n Args:\n step_name: The name of the step.\n\n Returns:\n `True` if the step is not loaded or already finished, `False` otherwise.\n \"\"\"\n with self._steps_load_status_lock:\n num_replicas = self._steps_load_status[step_name]\n\n # The step has finished (replicas = 0) or it has failed to load\n if num_replicas in 
[0, _STEP_LOAD_FAILED_CODE, _STEP_UNLOADED_CODE]:\n return True\n\n return False\n\n @property\n @abstractmethod\n def QueueClass(self) -> Callable:\n \"\"\"The class of the queue to use in the pipeline.\"\"\"\n pass\n\n def _create_step_input_queue(self, step_name: str) -> \"Queue[Any]\":\n \"\"\"Creates an input queue for a step.\n\n Args:\n step_name: The name of the step.\n\n Returns:\n The input queue created.\n \"\"\"\n input_queue = self.QueueClass()\n self.dag.set_step_attr(step_name, constants.INPUT_QUEUE_ATTR_NAME, input_queue)\n return input_queue\n\n @abstractmethod\n def _run_step(self, step: \"_Step\", input_queue: \"Queue[Any]\", replica: int) -> None:\n \"\"\"Runs the `Step` instance.\n\n Args:\n step: The `Step` instance to run.\n input_queue: The input queue where the step will receive the batches.\n replica: The replica ID assigned.\n \"\"\"\n pass\n\n def _run_steps(self, steps: Iterable[str]) -> None:\n \"\"\"Runs the `Step`s of the pipeline, creating first an input queue for each step\n that will be used to send the batches.\n\n Args:\n steps:\n \"\"\"\n for step_name in steps:\n step: \"Step\" = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n input_queue = self._steps_input_queues[step.name] # type: ignore\n\n # Set `pipeline` to `None` as in some Python environments the pipeline is not\n # picklable and it will raise an error when trying to send the step to the process.\n # `TypeError: cannot pickle 'code' object`\n step.pipeline = None\n\n if not step.is_normal and step.resources.replicas > 1: # type: ignore\n self._logger.warning(\n f\"Step '{step_name}' is a `GeneratorStep` or `GlobalStep` and has more\"\n \" than 1 replica. Only `Step` instances can have more than 1 replica.\"\n \" The number of replicas for the step will be set to 1.\"\n )\n\n step_num_replicas: int = step.resources.replicas if step.is_normal else 1 # type: ignore\n for replica in range(step_num_replicas):\n self._logger.debug(\n f\"Running 1 replica of step '{step.name}' with ID {replica}...\"\n )\n self._run_step(\n step=step.model_copy(deep=True),\n input_queue=input_queue,\n replica=replica,\n )\n\n def _add_batches_back_to_batch_manager(self) -> None:\n \"\"\"Add the `Batch`es that were sent to a `Step` back to the `_BatchManager`. This\n method should be used when the pipeline has been stopped prematurely.\"\"\"\n self._logger.debug(\n \"Adding batches from step input queues back to the batch manager...\"\n )\n for step_name in self.dag:\n node = self.dag.get_step(step_name)\n step: \"_Step\" = node[constants.STEP_ATTR_NAME]\n if step.is_generator:\n continue\n if input_queue := node.get(constants.INPUT_QUEUE_ATTR_NAME):\n while not input_queue.empty():\n batch = input_queue.get()\n if not isinstance(batch, _Batch):\n continue\n self._batch_manager.add_batch( # type: ignore\n to_step=step_name,\n batch=batch,\n prepend=True,\n )\n self._logger.debug(\n f\"Adding batch back to the batch manager: {batch}\"\n )\n if self._check_step_not_loaded_or_finished(step_name):\n # Notify the step to stop\n input_queue.put(None)\n self._logger.debug(\"Finished adding batches back to the batch manager.\")\n\n def _consume_output_queue(self) -> None:\n \"\"\"Consumes the `Batch`es from the output queue until it's empty. 
This method should\n be used when the pipeline has been stopped prematurely to consume and to not lose\n the `Batch`es that were processed by the leaf `Step`s before stopping the pipeline.\"\"\"\n while not self._output_queue.empty():\n batch = self._output_queue.get()\n if batch is None:\n continue\n self._process_batch(batch, send_last_batch_flag=False)\n self._handle_batch_on_stop(batch)\n\n def _manage_batch_flow(self, batch: \"_Batch\") -> None:\n \"\"\"Checks if the step that generated the batch has more data in its buffer to\n generate a new batch. If there's data, then a new batch is sent to the step. If\n the step has no data in its buffer, then the predecessors generator steps are\n requested to send a new batch.\n\n Args:\n batch: The batch that was processed.\n \"\"\"\n assert self._batch_manager, \"Batch manager is not set\"\n\n route_to, do_not_route_to, routed = self._get_successors(batch)\n\n self._register_batch(batch)\n\n # Keep track of the steps that the batch was routed to\n if routed:\n batch.batch_routed_to = route_to\n\n self._set_next_expected_seq_no(\n steps=do_not_route_to,\n from_step=batch.step_name,\n next_expected_seq_no=batch.seq_no + 1,\n )\n\n step = self._get_step_from_batch(batch)\n\n # Add the batch to the successors input buffers\n for successor in route_to:\n # Copy batch to avoid modifying the same reference in the batch manager\n batch_to_add = batch.copy() if len(route_to) > 1 else batch\n\n self._batch_manager.add_batch(successor, batch_to_add)\n\n # Check if the step is a generator and if there are successors that need data\n # from this step. This usually happens when the generator `batch_size` is smaller\n # than the `input_batch_size` of the successor steps.\n if (\n step.is_generator\n and step.name in self._batch_manager.step_empty_buffers(successor)\n ):\n last_batch_sent = self._batch_manager.get_last_batch_sent(step.name)\n self._send_batch_to_step(last_batch_sent.next_batch()) # type: ignore\n\n # If successor step has enough data in its buffer to create a new batch, then\n # send the batch to the step.\n while new_batch := self._batch_manager.get_batch(successor):\n self._send_batch_to_step(new_batch)\n\n if not step.is_generator:\n # Step (\"this\", the one from which the batch was received) has enough data on its\n # buffers to create a new batch\n while new_batch := self._batch_manager.get_batch(step.name): # type: ignore\n self._send_batch_to_step(new_batch)\n else:\n self._request_more_batches_if_needed(step)\n else:\n # Case in which the pipeline only contains a `GeneratorStep` so we constanly keep\n # requesting batch after batch as there is no downstream step to consume it\n if len(self.dag) == 1:\n self._request_batch_from_generator(step.name) # type: ignore\n\n self._cache()\n\n def _send_to_step(self, step_name: str, to_send: Any) -> None:\n \"\"\"Sends something to the input queue of a step.\n\n Args:\n step_name: The name of the step.\n to_send: The object to send.\n \"\"\"\n input_queue = self.dag.get_step(step_name)[constants.INPUT_QUEUE_ATTR_NAME]\n input_queue.put(to_send)\n\n def _send_batch_to_step(self, batch: \"_Batch\") -> None:\n \"\"\"Sends a batch to the input queue of a step, writing the data of the batch\n to the filesystem and setting `batch.data_path` with the path where the data\n was written (if requiered i.e. 
the step is a global step or `use_fs_to_pass_data`)\n\n This method should be extended by the specific pipeline implementation, adding\n the logic to send the batch to the step.\n\n Args:\n batch: The batch to send.\n \"\"\"\n self._logger.debug(\n f\"Setting batch {batch.seq_no} as last batch sent to '{batch.step_name}': {batch}\"\n )\n self._batch_manager.set_last_batch_sent(batch) # type: ignore\n\n step: \"_Step\" = self.dag.get_step(batch.step_name)[constants.STEP_ATTR_NAME]\n if not step.is_generator and (step.is_global or self._use_fs_to_pass_data):\n base_path = UPath(self._storage_base_path) / step.name # type: ignore\n self._logger.debug(\n f\"Writing {batch.seq_no} batch for '{batch.step_name}' step to filesystem: {base_path}\"\n )\n batch.write_batch_data_to_fs(self._fs, base_path) # type: ignore\n\n self._logger.debug(\n f\"Sending batch {batch.seq_no} to step '{batch.step_name}': {batch}\"\n )\n self._send_to_step(batch.step_name, batch)\n\n def _gather_requirements(self) -> List[str]:\n \"\"\"Extracts the requirements from the steps to be used in the pipeline.\n\n Returns:\n List of requirements gathered from the steps.\n \"\"\"\n steps_requirements = []\n for step in self.dag:\n step_req = self.dag.get_step(step)[constants.STEP_ATTR_NAME].requirements\n steps_requirements.extend(step_req)\n\n return steps_requirements\n\n def _register_batch(self, batch: \"_Batch\") -> None:\n \"\"\"Registers a batch in the batch manager.\n\n Args:\n batch: The batch to register.\n \"\"\"\n assert self._batch_manager, \"Batch manager is not set\"\n self._batch_manager.register_batch(\n batch, steps_data_path=self._cache_location[\"steps_data\"]\n ) # type: ignore\n self._logger.debug(\n f\"Batch {batch.seq_no} from step '{batch.step_name}' registered in batch\"\n \" manager\"\n )\n\n def _send_last_batch_flag_to_step(self, step_name: str) -> None:\n \"\"\"Sends the `LAST_BATCH_SENT_FLAG` to a step to stop processing batches.\n\n Args:\n step_name: The name of the step.\n \"\"\"\n self._logger.debug(\n f\"Sending `LAST_BATCH_SENT_FLAG` to '{step_name}' step to stop processing\"\n \" batches...\"\n )\n\n for _ in range(self.dag.get_step_replica_count(step_name)):\n self._send_to_step(step_name, constants.LAST_BATCH_SENT_FLAG)\n self._batch_manager.set_last_batch_flag_sent_to(step_name) # type: ignore\n\n def _request_initial_batches(self) -> None:\n \"\"\"Requests the initial batches to the generator steps.\"\"\"\n assert self._batch_manager, \"Batch manager is not set\"\n for step in self._batch_manager._steps.values():\n if not self._is_step_running(step.step_name):\n continue\n if batch := step.get_batch():\n self._logger.debug(\n f\"Sending initial batch to '{step.step_name}' step: {batch}\"\n )\n self._send_batch_to_step(batch)\n\n for step_name in self.dag.root_steps:\n if not self._is_step_running(step_name):\n continue\n seq_no = 0\n if last_batch := self._batch_manager.get_last_batch(step_name):\n seq_no = last_batch.seq_no + 1\n batch = _Batch(seq_no=seq_no, step_name=step_name, last_batch=self._dry_run)\n self._logger.debug(\n f\"Requesting initial batch to '{step_name}' generator step: {batch}\"\n )\n self._send_batch_to_step(batch)\n\n def _request_batch_from_generator(self, step_name: str) -> None:\n \"\"\"Request a new batch to a `GeneratorStep`.\n\n Args:\n step_name: the name of the `GeneratorStep` to which a batch has to be requested.\n \"\"\"\n # Get the last batch that the previous step sent to generate the next batch\n # (next `seq_no`).\n last_batch = 
self._batch_manager.get_last_batch_sent(step_name) # type: ignore\n if last_batch is None:\n return\n self._send_batch_to_step(last_batch.next_batch())\n\n def _request_more_batches_if_needed(self, step: \"Step\") -> None:\n \"\"\"Request more batches to the predecessors steps of `step` if needed.\n\n Args:\n step: The step of which it has to be checked if more batches are needed from\n its predecessors.\n \"\"\"\n empty_buffers = self._batch_manager.step_empty_buffers(step.name) # type: ignore\n for previous_step_name in empty_buffers:\n # Only more batches can be requested to the `GeneratorStep`s as they are the\n # only kind of steps that lazily generate batches.\n if previous_step_name not in self.dag.root_steps:\n continue\n\n self._request_batch_from_generator(previous_step_name)\n\n def _handle_batch_on_stop(self, batch: \"_Batch\") -> None:\n \"\"\"Handles a batch that was received from the output queue when the pipeline was\n stopped. It will add and register the batch in the batch manager.\n\n Args:\n batch: The batch to handle.\n \"\"\"\n assert self._batch_manager, \"Batch manager is not set\"\n\n self._batch_manager.register_batch(\n batch, steps_data_path=self._cache_location[\"steps_data\"]\n )\n step: \"Step\" = self.dag.get_step(batch.step_name)[constants.STEP_ATTR_NAME]\n for successor in self.dag.get_step_successors(step.name): # type: ignore\n self._batch_manager.add_batch(successor, batch)\n\n def _get_step_from_batch(self, batch: \"_Batch\") -> \"Step\":\n \"\"\"Gets the `Step` instance from a batch.\n\n Args:\n batch: The batch to get the step from.\n\n Returns:\n The `Step` instance.\n \"\"\"\n return self.dag.get_step(batch.step_name)[constants.STEP_ATTR_NAME]\n\n def _notify_steps_to_stop(self) -> None:\n \"\"\"Notifies the steps to stop their infinite running loop by sending `None` to\n their input queues.\"\"\"\n with self._steps_load_status_lock:\n for step_name, replicas in self._steps_load_status.items():\n if replicas > 0:\n for _ in range(replicas):\n self._send_to_step(step_name, None)\n\n def _get_successors(self, batch: \"_Batch\") -> Tuple[List[str], List[str], bool]:\n \"\"\"Gets the successors and the successors to which the batch has to be routed.\n\n Args:\n batch: The batch to which the successors will be determined.\n\n Returns:\n The successors to route the batch to and whether the batch was routed using\n a routing function.\n \"\"\"\n node = self.dag.get_step(batch.step_name)\n step: \"Step\" = node[constants.STEP_ATTR_NAME]\n successors = list(self.dag.get_step_successors(step.name)) # type: ignore\n route_to = successors\n\n # Check if the step has a routing function to send the batch to specific steps\n if routing_batch_function := node.get(\n constants.ROUTING_BATCH_FUNCTION_ATTR_NAME\n ):\n route_to = routing_batch_function(batch, successors)\n successors_str = \", \".join(f\"'{successor}'\" for successor in route_to)\n self._logger.info(\n f\"\ud83d\ude8f Using '{step.name}' routing function to send batch {batch.seq_no} to steps: {successors_str}\"\n )\n\n return route_to, list(set(successors) - set(route_to)), route_to != successors\n\n def _set_next_expected_seq_no(\n self, steps: List[str], from_step: str, next_expected_seq_no: int\n ) -> None:\n \"\"\"Sets the next expected sequence number of a `_Batch` received by `step` from\n `from_step`. 
This is necessary as some `Step`s might not receive all the batches\n comming from the previous steps because there is a routing batch function.\n\n Args:\n steps: list of steps to which the next expected sequence number of a `_Batch`\n from `from_step` has to be updated in the `_BatchManager`.\n from_step: the name of the step from which the next expected sequence number\n of a `_Batch` has to be updated in `steps`.\n next_expected_seq_no: the number of the next expected sequence number of a `Batch`\n from `from_step`.\n \"\"\"\n assert self._batch_manager, \"Batch manager is not set\"\n\n for step in steps:\n self._batch_manager.set_next_expected_seq_no(\n step_name=step,\n from_step=from_step,\n next_expected_seq_no=next_expected_seq_no,\n )\n\n @abstractmethod\n def _teardown(self) -> None:\n \"\"\"Clean/release/stop resources reserved to run the pipeline.\"\"\"\n pass\n\n @abstractmethod\n def _set_steps_not_loaded_exception(self) -> None:\n \"\"\"Used to raise `RuntimeError` when the load of the steps failed.\n\n Raises:\n RuntimeError: containing the information and why a step failed to be loaded.\n \"\"\"\n pass\n\n @abstractmethod\n def _stop(self) -> None:\n \"\"\"Stops the pipeline in a controlled way.\"\"\"\n pass\n\n def _stop_load_queue_loop(self) -> None:\n \"\"\"Stops the `_load_queue` loop sending a `None`.\"\"\"\n self._logger.debug(\"Sending `None` to the load queue to notify stop...\")\n self._load_queue.put(None)\n\n def _stop_output_queue_loop(self) -> None:\n \"\"\"Stops the `_output_queue` loop sending a `None`.\"\"\"\n self._logger.debug(\"Sending `None` to the output queue to notify stop...\")\n self._output_queue.put(None)\n\n def _handle_keyboard_interrupt(self) -> Any:\n \"\"\"Handles KeyboardInterrupt signal sent during the Pipeline.run method.\n\n It will try to call self._stop (if the pipeline didn't started yet, it won't\n have any effect), and if the pool is already started, will close it before exiting\n the program.\n\n Returns:\n The original `signal.SIGINT` handler.\n \"\"\"\n\n def signal_handler(signumber: int, frame: Any) -> None:\n self._stop()\n\n return signal.signal(signal.SIGINT, signal_handler)\n
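As a usage-level illustration of the `_setup_fsspec` logic above, the following standalone sketch shows how a `storage_parameters` dictionary is resolved into an `fsspec` filesystem: the mandatory \"path\" key is popped, its protocol is inferred with `UPath`, and any remaining keys are forwarded to `fsspec.filesystem`. The path used here is a placeholder, not a value required by distilabel.
import fsspec
from upath import UPath

# Hypothetical storage parameters: only the \"path\" key is mandatory; any other
# keys depend on the protocol (none are needed for the local file system).
storage_parameters = {\"path\": \"file:///tmp/distilabel-batches\"}

path = storage_parameters.pop(\"path\")
protocol = UPath(path).protocol  # -> \"file\"
fs = fsspec.filesystem(protocol, **storage_parameters)
fs.makedirs(\"/tmp/distilabel-batches\", exist_ok=True)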
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.signature","title":"signature
property
","text":"Makes a signature (hash) of a pipeline, using the step ids and the adjacency between them.
The main use is to find the pipeline in the cache folder.
Returns:
Type Descriptionstr
Signature of the pipeline.
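A rough, illustrative sketch of how the signature relates to caching; it leans on the internal `_cache_dir` attribute and the `_cache_location` property shown in the source above, so it is not a public API:
from distilabel.pipeline import Pipeline

pipeline = Pipeline(name=\"my-pipeline\")

# The signature hashes the step ids and their adjacency; combined with the cache
# dir and the pipeline name it determines where the pipeline is cached.
print(pipeline.signature)
print(pipeline._cache_dir / pipeline.name / pipeline.signature)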
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.aggregated_steps_signature","title":"aggregated_steps_signature
property
","text":"Creates an aggregated signature using Step
s signature that will be used for the _BatchManager
.
Returns:
Type Descriptionstr
The aggregated signature.
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.QueueClass","title":"QueueClass
abstractmethod
property
","text":"The class of the queue to use in the pipeline.
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.__init__","title":"__init__(name=None, description=None, cache_dir=None, enable_metadata=False, requirements=None)
","text":"Initialize the BasePipeline
instance.
Parameters:
Name Type Description Defaultname
Optional[str]
The name of the pipeline. If not provided, a random one will be generated by default.
None
description
Optional[str]
A description of the pipeline. Defaults to None
.
None
cache_dir
Optional[Union[str, PathLike]]
A directory where the pipeline will be cached. Defaults to None
.
None
enable_metadata
bool
Whether to include the distilabel metadata column for the pipeline in the final Distiset
. It contains metadata used by distilabel; for example, the unprocessed raw outputs of the LLM
would be stored there, inside the raw_output_...
field. Defaults to False
.
False
requirements
Optional[List[str]]
List of requirements that must be installed to run the pipeline. Defaults to None
, but it can be helpful, when sharing a pipeline, to state that these requirements must be installed.
None
Source code in src/distilabel/pipeline/base.py
def __init__(\n self,\n name: Optional[str] = None,\n description: Optional[str] = None,\n cache_dir: Optional[Union[str, \"PathLike\"]] = None,\n enable_metadata: bool = False,\n requirements: Optional[List[str]] = None,\n) -> None:\n \"\"\"Initialize the `BasePipeline` instance.\n\n Args:\n name: The name of the pipeline. If not generated, a random one will be generated by default.\n description: A description of the pipeline. Defaults to `None`.\n cache_dir: A directory where the pipeline will be cached. Defaults to `None`.\n enable_metadata: Whether to include the distilabel metadata column for the pipeline\n in the final `Distiset`. It contains metadata used by distilabel, for example\n the raw outputs of the `LLM` without processing would be here, inside `raw_output_...`\n field. Defaults to `False`.\n requirements: List of requirements that must be installed to run the pipeline.\n Defaults to `None`, but can be helpful to inform in a pipeline to be shared\n that this requirements must be installed.\n \"\"\"\n self.name = name or _PIPELINE_DEFAULT_NAME\n self.description = description\n self._enable_metadata = enable_metadata\n self.dag = DAG()\n\n if cache_dir:\n self._cache_dir = Path(cache_dir)\n elif env_cache_dir := envs.DISTILABEL_CACHE_DIR:\n self._cache_dir = Path(env_cache_dir)\n else:\n self._cache_dir = constants.PIPELINES_CACHE_DIR\n\n self._logger = logging.getLogger(\"distilabel.pipeline\")\n\n self._batch_manager: Optional[\"_BatchManager\"] = None\n self._write_buffer: Optional[\"_WriteBuffer\"] = None\n self._steps_input_queues: Dict[str, \"Queue\"] = {}\n\n self._steps_load_status: Dict[str, int] = {}\n self._steps_load_status_lock = threading.Lock()\n\n self._stop_called = False\n self._stop_called_lock = threading.Lock()\n self._stop_calls = 0\n\n self._recover_offline_batch_generate_for_step: Union[\n Tuple[str, List[List[Dict[str, Any]]]], None\n ] = None\n\n self._fs: Optional[fsspec.AbstractFileSystem] = None\n self._storage_base_path: Optional[str] = None\n self._use_fs_to_pass_data: bool = False\n self._dry_run = False\n\n self._current_stage = 0\n self._stages_last_batch: List[List[str]] = []\n self._load_groups = []\n\n self.requirements = requirements or []\n\n self._exception: Union[Exception, None] = None\n\n self._log_queue: Union[\"Queue[Any]\", None] = None\n
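A minimal instantiation sketch using the concrete Pipeline subclass; the argument values below are placeholders, not defaults or recommendations:
from distilabel.pipeline import Pipeline

pipeline = Pipeline(
    name=\"synthetic-data-pipeline\",
    description=\"Generate and judge synthetic instructions.\",
    cache_dir=\"/tmp/distilabel-cache\",  # otherwise DISTILABEL_CACHE_DIR or the default cache dir is used
    enable_metadata=True,  # keep the distilabel metadata column in the final Distiset
    requirements=[\"openai>=1.0\"],  # informative list for pipelines meant to be shared
)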
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.__enter__","title":"__enter__()
","text":"Set the global pipeline instance when entering a pipeline context.
Source code in src/distilabel/pipeline/base.py
def __enter__(self) -> Self:\n \"\"\"Set the global pipeline instance when entering a pipeline context.\"\"\"\n _GlobalPipelineManager.set_pipeline(self)\n return self\n
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.__exit__","title":"__exit__(exc_type, exc_value, traceback)
","text":"Unset the global pipeline instance when exiting a pipeline context.
Source code in src/distilabel/pipeline/base.py
def __exit__(self, exc_type, exc_value, traceback) -> None:\n \"\"\"Unset the global pipeline instance when exiting a pipeline context.\"\"\"\n _GlobalPipelineManager.set_pipeline(None)\n self._set_pipeline_name()\n
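Together, __enter__ and __exit__ are what make the usual with block work: steps instantiated inside the block are registered against the global pipeline instance and can be connected with >>. A small sketch, assuming the LoadDataFromDicts and KeepColumns steps from distilabel.steps:
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadDataFromDicts

with Pipeline(name=\"context-manager-example\") as pipeline:
    # Steps created inside the block are attached to `pipeline` automatically.
    loader = LoadDataFromDicts(data=[{\"instruction\": \"Write a haiku.\"}], batch_size=1)
    keep = KeepColumns(columns=[\"instruction\"])
    loader >> keep
# Outside the block the global pipeline instance is unset again.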
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.run","title":"run(parameters=None, load_groups=None, use_cache=True, storage_parameters=None, use_fs_to_pass_data=False, dataset=None, dataset_batch_size=50, logging_handlers=None)
","text":"Run the pipeline. It will set the runtime parameters for the steps and validate the pipeline.
This method should be extended by the specific pipeline implementation, adding the logic to run the pipeline.
Parameters:
Name Type Description Defaultparameters
Optional[Dict[str, Dict[str, Any]]]
A dictionary with the step name as the key and a dictionary with the runtime parameters for the step as the value. Defaults to None
.
None
load_groups
Optional[LoadGroups]
A list containing lists of steps that have to be loaded together and in isolation with respect to the rest of the steps of the pipeline. This argument also allows passing the following modes: \"sequential_step_execution\": each step will be executed in its own stage, i.e. the execution of the steps will be sequential.
Defaults to None
.
None
use_cache
bool
Whether to use the cache from previous pipeline runs. Defaults to True
.
True
storage_parameters
Optional[Dict[str, Any]]
A dictionary with the storage parameters (fsspec
and path) that will be used to store the data of the _Batch
es passed between the steps if use_fs_to_pass_data
is True
(for the batches received by a GlobalStep
it will always be used). It must have at least the \"path\" key, and it can contain additional keys depending on the protocol. By default, it will use the local file system and a directory in the cache directory. Defaults to None
.
None
use_fs_to_pass_data
bool
Whether to use the file system to pass the data of the _Batch
es between the steps. Even if this parameter is False
, the Batch
es received by GlobalStep
s will always use the file system to pass the data. Defaults to False
.
False
dataset
Optional[InputDataset]
If given, it will be used to create a GeneratorStep
and put it as the root step. Convenient when you have already processed the dataset in your script and just want to pass it along. Defaults to None
.
None
dataset_batch_size
int
if dataset
is given, this will be the size of the batches yielded by the GeneratorStep
created using the dataset
. Defaults to 50
.
50
logging_handlers
Optional[List[Handler]]
A list of logging handlers that will be used to log the output of the pipeline. This argument can be useful so the logging messages can be extracted and used in a different context. Defaults to None
.
None
Returns:
Type DescriptionDistiset
The Distiset
created by the pipeline.
src/distilabel/pipeline/base.py
def run(\n self,\n parameters: Optional[Dict[str, Dict[str, Any]]] = None,\n load_groups: Optional[\"LoadGroups\"] = None,\n use_cache: bool = True,\n storage_parameters: Optional[Dict[str, Any]] = None,\n use_fs_to_pass_data: bool = False,\n dataset: Optional[\"InputDataset\"] = None,\n dataset_batch_size: int = 50,\n logging_handlers: Optional[List[logging.Handler]] = None,\n) -> \"Distiset\": # type: ignore\n \"\"\"Run the pipeline. It will set the runtime parameters for the steps and validate\n the pipeline.\n\n This method should be extended by the specific pipeline implementation,\n adding the logic to run the pipeline.\n\n Args:\n parameters: A dictionary with the step name as the key and a dictionary with\n the runtime parameters for the step as the value. Defaults to `None`.\n load_groups: A list containing lists of steps that have to be loaded together\n and in isolation with respect to the rest of the steps of the pipeline.\n This argument also allows passing the following modes:\n\n - \"sequential_step_execution\": each step will be executed in a stage i.e.\n the execution of the steps will be sequential.\n\n Defaults to `None`.\n use_cache: Whether to use the cache from previous pipeline runs. Defaults to\n `True`.\n storage_parameters: A dictionary with the storage parameters (`fsspec` and path)\n that will be used to store the data of the `_Batch`es passed between the\n steps if `use_fs_to_pass_data` is `True` (for the batches received by a\n `GlobalStep` it will be always used). It must have at least the \"path\" key,\n and it can contain additional keys depending on the protocol. By default,\n it will use the local file system and a directory in the cache directory.\n Defaults to `None`.\n use_fs_to_pass_data: Whether to use the file system to pass the data of\n the `_Batch`es between the steps. Even if this parameter is `False`, the\n `Batch`es received by `GlobalStep`s will always use the file system to\n pass the data. Defaults to `False`.\n dataset: If given, it will be used to create a `GeneratorStep` and put it as the\n root step. Convenient method when you have already processed the dataset in\n your script and just want to pass it already processed. Defaults to `None`.\n dataset_batch_size: if `dataset` is given, this will be the size of the batches\n yield by the `GeneratorStep` created using the `dataset`. Defaults to `50`.\n logging_handlers: A list of logging handlers that will be used to log the\n output of the pipeline. This argument can be useful so the logging messages\n can be extracted and used in a different context. Defaults to `None`.\n\n Returns:\n The `Distiset` created by the pipeline.\n \"\"\"\n\n self._exception: Union[Exception, None] = None\n\n # Set the runtime parameters that will be used during the pipeline execution.\n # They are used to generate the signature of the pipeline that is used to hit the\n # cache when the pipeline is run, so it's important to do it first.\n self._set_runtime_parameters(parameters or {})\n\n self._refresh_pipeline_from_cache()\n\n if dataset is not None:\n self._add_dataset_generator_step(dataset, dataset_batch_size)\n\n setup_logging(\n log_queue=self._log_queue,\n filename=str(self._cache_location[\"log_file\"]),\n logging_handlers=logging_handlers,\n )\n\n # Set the name of the pipeline if it's the default one. This should be called\n # if the pipeline is defined within the context manager, and the run is called\n # outside of it. 
Is here in the following case:\n # with Pipeline() as pipeline:\n # pipeline.run()\n self._set_pipeline_name()\n\n # Validate the pipeline DAG to check that all the steps are chainable, there are\n # no missing runtime parameters, batch sizes are correct, load groups are valid,\n # etc.\n self._load_groups = self._built_load_groups(load_groups)\n self._validate()\n\n self._set_pipeline_artifacts_path_in_steps()\n\n # Set the initial load status for all the steps\n self._init_steps_load_status()\n\n # Load the stages status or initialize it\n self._load_stages_status(use_cache)\n\n # Load the `_BatchManager` from cache or create one from scratch\n self._load_batch_manager(use_cache)\n\n # Check pipeline requirements are installed\n self._check_requirements()\n\n # Setup the filesystem that will be used to pass the data of the `_Batch`es\n self._setup_fsspec(storage_parameters)\n self._use_fs_to_pass_data = use_fs_to_pass_data\n\n if self._dry_run:\n self._logger.info(\"\ud83c\udf35 Dry run mode\")\n\n # If the batch manager is not able to generate batches, that means that the loaded\n # `_BatchManager` from cache didn't have any remaining batches to process i.e.\n # the previous pipeline execution was completed successfully.\n if not self._batch_manager.can_generate(): # type: ignore\n self._logger.info(\n \"\ud83d\udcbe Loaded batch manager from cache doesn't contain any remaining data.\"\n \" Returning `Distiset` from cache data...\"\n )\n distiset = create_distiset(\n data_dir=self._cache_location[\"data\"],\n pipeline_path=self._cache_location[\"pipeline\"],\n log_filename_path=self._cache_location[\"log_file\"],\n enable_metadata=self._enable_metadata,\n dag=self.dag,\n )\n stop_logging()\n return distiset\n\n self._setup_write_buffer(use_cache)\n\n self._print_load_stages_info()\n
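A hedged usage sketch of run; it assumes a pipeline whose step named \"text_generation\" exposes runtime parameters for its llm, which is a common but not universal layout:
distiset = pipeline.run(
    parameters={
        \"text_generation\": {\"llm\": {\"generation_kwargs\": {\"temperature\": 0.7, \"max_new_tokens\": 512}}},
    },
    use_cache=True,  # reuse batches cached by previous executions of the same pipeline
    use_fs_to_pass_data=False,  # global steps still receive their batches through the file system
)
distiset.save_to_disk(\"my-distiset\")  # the resulting Distiset can also be pushed to the Hub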
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.dry_run","title":"dry_run(parameters=None, batch_size=1, dataset=None)
","text":"Do a dry run to test the pipeline runs as expected.
Running a Pipeline
in dry run mode will set all the batch_size
of generator steps to the specified batch_size
, and run just with a single batch, effectively running the whole pipeline with a single example. The cache will be set to False
.
Parameters:
Name Type Description Defaultparameters
Optional[Dict[str, Dict[str, Any]]]
A dictionary with the step name as the key and a dictionary with the runtime parameters for the step as the value. Defaults to None
.
None
batch_size
int
The batch size of the single batch generated by the generator steps of the pipeline. Defaults to 1
.
1
dataset
Optional[InputDataset]
If given, it will be used to create a GeneratorStep
and set it as the root step. Convenient when you have already processed the dataset in your script and just want to pass it in directly. Defaults to None
.
None
Returns:
Type DescriptionDistiset
Will return the Distiset
as the main run method would do.
src/distilabel/pipeline/base.py
def dry_run(\n self,\n parameters: Optional[Dict[str, Dict[str, Any]]] = None,\n batch_size: int = 1,\n dataset: Optional[\"InputDataset\"] = None,\n) -> \"Distiset\":\n \"\"\"Do a dry run to test the pipeline runs as expected.\n\n Running a `Pipeline` in dry run mode will set all the `batch_size` of generator steps\n to the specified `batch_size`, and run just with a single batch, effectively\n running the whole pipeline with a single example. The cache will be set to `False`.\n\n Args:\n parameters: A dictionary with the step name as the key and a dictionary with\n the runtime parameters for the step as the value. Defaults to `None`.\n batch_size: The batch size of the unique batch generated by the generators\n steps of the pipeline. Defaults to `1`.\n dataset: If given, it will be used to create a `GeneratorStep` and put it as the\n root step. Convenient method when you have already processed the dataset in\n your script and just want to pass it already processed. Defaults to `None`.\n\n Returns:\n Will return the `Distiset` as the main run method would do.\n \"\"\"\n self._dry_run = True\n\n for step_name in self.dag:\n step = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n\n if step.is_generator:\n if not parameters:\n parameters = {}\n parameters[step_name] = {\"batch_size\": batch_size}\n\n distiset = self.run(parameters=parameters, use_cache=False, dataset=dataset)\n\n self._dry_run = False\n return distiset\n
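A short, hedged sketch of performing a dry run on an already defined pipeline (the pipeline variable is assumed to exist).
# Assumes `pipeline` (including its own generator/loader step) was defined earlier.
# Generator steps yield a single batch of `batch_size` rows and caching is disabled.
distiset = pipeline.dry_run(batch_size=1)
print(distiset)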
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.get_load_stages","title":"get_load_stages(load_groups=None)
","text":"Convenient method to get the load stages of a pipeline.
Parameters:
Name Type Description Defaultload_groups
Optional[LoadGroups]
A list containing lists of steps that have to be loaded together and in isolation with respect to the rest of the steps of the pipeline. Defaults to None
.
None
Returns:
Type DescriptionLoadStages
A tuple whose first element is a list, sorted by stage, containing
LoadStages
lists with the names of the steps of each stage, and whose second element is a list
LoadStages
sorted by stage containing lists with the names of the last steps of each stage.
Source code insrc/distilabel/pipeline/base.py
def get_load_stages(self, load_groups: Optional[\"LoadGroups\"] = None) -> LoadStages:\n \"\"\"Convenient method to get the load stages of a pipeline.\n\n Args:\n load_groups: A list containing list of steps that has to be loaded together\n and in isolation with respect to the rest of the steps of the pipeline.\n Defaults to `None`.\n\n Returns:\n A tuple with the first element containing asorted list by stage containing\n lists with the names of the steps of the stage, and the second element a list\n sorted by stage containing lists with the names of the last steps of the stage.\n \"\"\"\n load_groups = self._built_load_groups(load_groups)\n return self.dag.get_steps_load_stages(load_groups)\n
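A small sketch of inspecting the load stages of an already defined pipeline; the step name used in the load group is a hypothetical placeholder.
# Assumes `pipeline` was defined earlier and contains a step named
# "text_generation" (a hypothetical name used only for illustration).
stages, last_steps_per_stage = pipeline.get_load_stages(
    load_groups=[["text_generation"]],
)
for index, (steps, last_steps) in enumerate(zip(stages, last_steps_per_stage)):
    print(f"Stage {index}: steps={steps}, last steps={last_steps}")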
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.get_runtime_parameters_info","title":"get_runtime_parameters_info()
","text":"Get the runtime parameters for the steps in the pipeline.
Returns:
Type DescriptionPipelineRuntimeParametersInfo
A dictionary with the step name as the key and a list of dictionaries with
PipelineRuntimeParametersInfo
the parameter name and the parameter info as the value.
Source code insrc/distilabel/pipeline/base.py
def get_runtime_parameters_info(self) -> \"PipelineRuntimeParametersInfo\":\n \"\"\"Get the runtime parameters for the steps in the pipeline.\n\n Returns:\n A dictionary with the step name as the key and a list of dictionaries with\n the parameter name and the parameter info as the value.\n \"\"\"\n runtime_parameters = {}\n for step_name in self.dag:\n step: \"_Step\" = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n runtime_parameters[step_name] = step.get_runtime_parameters_info()\n return runtime_parameters\n
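A sketch of listing the runtime parameters of every step, assuming a previously defined pipeline and that each parameter entry exposes a name key.
# Assumes `pipeline` was defined earlier; each entry is assumed to expose a
# "name" key describing the runtime parameter.
runtime_info = pipeline.get_runtime_parameters_info()
for step_name, params in runtime_info.items():
    print(step_name, [param["name"] for param in params])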
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.draw","title":"draw(path='pipeline.png', top_to_bottom=False, show_edge_labels=True)
","text":"Draws the pipeline.
Parameters:
Name Type Description Defaultpath
Optional[Union[str, Path]]
The path to save the image to.
'pipeline.png'
top_to_bottom
bool
Whether to draw the DAG top to bottom. Defaults to False
.
False
show_edge_labels
bool
Whether to show the edge labels. Defaults to True
.
True
Returns:
Type Descriptionstr
The path to the saved image.
Source code insrc/distilabel/pipeline/base.py
def draw(\n self,\n path: Optional[Union[str, Path]] = \"pipeline.png\",\n top_to_bottom: bool = False,\n show_edge_labels: bool = True,\n) -> str:\n \"\"\"\n Draws the pipeline.\n\n Parameters:\n path: The path to save the image to.\n top_to_bottom: Whether to draw the DAG top to bottom. Defaults to `False`.\n show_edge_labels: Whether to show the edge labels. Defaults to `True`.\n\n Returns:\n The path to the saved image.\n \"\"\"\n png = self.dag.draw(\n top_to_bottom=top_to_bottom, show_edge_labels=show_edge_labels\n )\n with open(path, \"wb\") as f:\n f.write(png)\n return path\n
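A quick sketch of rendering the pipeline DAG to an image file; the output path is arbitrary and rendering may need network access or optional dependencies depending on the drawing backend.
# Assumes `pipeline` was defined earlier; writes a PNG of the DAG to disk.
image_path = pipeline.draw(path="my_pipeline.png", top_to_bottom=True)
print(f"Pipeline drawn to {image_path}")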
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.__repr__","title":"__repr__()
","text":"If running in a Jupyter notebook, display an image representing this Pipeline
.
src/distilabel/pipeline/base.py
def __repr__(self) -> str:\n \"\"\"\n If running in a Jupyter notebook, display an image representing this `Pipeline`.\n \"\"\"\n if in_notebook():\n try:\n from IPython.display import Image, display\n\n image_data = self.dag.draw()\n\n display(Image(image_data))\n except Exception:\n pass\n return super().__repr__()\n
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.from_dict","title":"from_dict(data)
classmethod
","text":"Create a Pipeline from a dict containing the serialized data.
Note: It's intended for internal use.
Parameters:
Name Type Description Defaultdata
Dict[str, Any]
Dictionary containing the serialized data from a Pipeline.
requiredReturns:
Name Type DescriptionBasePipeline
Self
Pipeline recreated from the dictionary info.
Source code insrc/distilabel/pipeline/base.py
@classmethod\ndef from_dict(cls, data: Dict[str, Any]) -> Self:\n \"\"\"Create a Pipeline from a dict containing the serialized data.\n\n Note:\n It's intended for internal use.\n\n Args:\n data (Dict[str, Any]): Dictionary containing the serialized data from a Pipeline.\n\n Returns:\n BasePipeline: Pipeline recreated from the dictionary info.\n \"\"\"\n name = data[\"pipeline\"][\"name\"]\n description = data[\"pipeline\"].get(\"description\")\n requirements = data.get(\"requirements\", [])\n with cls(name=name, description=description, requirements=requirements) as pipe:\n pipe.dag = DAG.from_dict(data[\"pipeline\"])\n return pipe\n
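Although intended for internal use, a hedged round-trip sketch illustrates the expected input, assuming the dictionary comes from a previous dump() of the same pipeline.
from distilabel.pipeline import Pipeline

# Assumes `pipeline` was defined earlier and that `dump()` (provided by the
# serialization mixin) produces the dictionary layout `from_dict` expects.
serialized = pipeline.dump()
restored = Pipeline.from_dict(serialized)
print(restored.name)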
"},{"location":"api/pipeline/#distilabel.pipeline.local","title":"local
","text":""},{"location":"api/pipeline/#distilabel.pipeline.local.Pipeline","title":"Pipeline
","text":" Bases: BasePipeline
Local pipeline implementation using multiprocessing
.
src/distilabel/pipeline/local.py
class Pipeline(BasePipeline):\n \"\"\"Local pipeline implementation using `multiprocessing`.\"\"\"\n\n def ray(\n self,\n ray_head_node_url: Optional[str] = None,\n ray_init_kwargs: Optional[Dict[str, Any]] = None,\n ) -> RayPipeline:\n \"\"\"Creates a `RayPipeline` using the init parameters of this pipeline. This is a\n convenient method that can be used to \"transform\" one common `Pipeline` to a `RayPipeline`\n and it's mainly used by the CLI.\n\n Args:\n ray_head_node_url: The URL that can be used to connect to the head node of\n the Ray cluster. Normally, you won't want to use this argument as the\n recommended way to submit a job to a Ray cluster is using the [Ray Jobs\n CLI](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html#ray-jobs-overview).\n Defaults to `None`.\n ray_init_kwargs: kwargs that will be passed to the `ray.init` method. Defaults\n to `None`.\n\n Returns:\n A `RayPipeline` instance.\n \"\"\"\n pipeline = RayPipeline(\n name=self.name,\n description=self.description,\n cache_dir=self._cache_dir,\n enable_metadata=self._enable_metadata,\n requirements=self.requirements,\n ray_head_node_url=ray_head_node_url,\n ray_init_kwargs=ray_init_kwargs,\n )\n pipeline.dag = self.dag\n return pipeline\n\n def run(\n self,\n parameters: Optional[Dict[Any, Dict[str, Any]]] = None,\n load_groups: Optional[\"LoadGroups\"] = None,\n use_cache: bool = True,\n storage_parameters: Optional[Dict[str, Any]] = None,\n use_fs_to_pass_data: bool = False,\n dataset: Optional[\"InputDataset\"] = None,\n dataset_batch_size: int = 50,\n logging_handlers: Optional[List[\"logging.Handler\"]] = None,\n ) -> \"Distiset\":\n \"\"\"Runs the pipeline.\n\n Args:\n parameters: A dictionary with the step name as the key and a dictionary with\n the runtime parameters for the step as the value. Defaults to `None`.\n load_groups: A list containing lists of steps that have to be loaded together\n and in isolation with respect to the rest of the steps of the pipeline.\n This argument also allows passing the following modes:\n\n - \"sequential_step_execution\": each step will be executed in a stage i.e.\n the execution of the steps will be sequential.\n\n Defaults to `None`.\n use_cache: Whether to use the cache from previous pipeline runs. Defaults to\n `True`.\n storage_parameters: A dictionary with the storage parameters (`fsspec` and path)\n that will be used to store the data of the `_Batch`es passed between the\n steps if `use_fs_to_pass_data` is `True` (for the batches received by a\n `GlobalStep` it will be always used). It must have at least the \"path\" key,\n and it can contain additional keys depending on the protocol. By default,\n it will use the local file system and a directory in the cache directory.\n Defaults to `None`.\n use_fs_to_pass_data: Whether to use the file system to pass the data of\n the `_Batch`es between the steps. Even if this parameter is `False`, the\n `Batch`es received by `GlobalStep`s will always use the file system to\n pass the data. Defaults to `False`.\n dataset: If given, it will be used to create a `GeneratorStep` and put it as the\n root step. Convenient method when you have already processed the dataset in\n your script and just want to pass it already processed. Defaults to `None`.\n dataset_batch_size: if `dataset` is given, this will be the size of the batches\n yield by the `GeneratorStep` created using the `dataset`. 
Defaults to `50`.\n logging_handlers: A list of logging handlers that will be used to log the\n output of the pipeline. This argument can be useful so the logging messages\n can be extracted and used in a different context. Defaults to `None`.\n\n Returns:\n The `Distiset` created by the pipeline.\n\n Raises:\n RuntimeError: If the pipeline fails to load all the steps.\n \"\"\"\n if script_executed_in_ray_cluster():\n print(\"Script running in Ray cluster... Using `RayPipeline`...\")\n return self.ray().run(\n parameters=parameters,\n use_cache=use_cache,\n storage_parameters=storage_parameters,\n use_fs_to_pass_data=use_fs_to_pass_data,\n dataset=dataset,\n dataset_batch_size=dataset_batch_size,\n )\n\n self._log_queue = cast(\"Queue[Any]\", mp.Queue())\n\n if distiset := super().run(\n parameters=parameters,\n load_groups=load_groups,\n use_cache=use_cache,\n storage_parameters=storage_parameters,\n use_fs_to_pass_data=use_fs_to_pass_data,\n dataset=dataset,\n dataset_batch_size=dataset_batch_size,\n logging_handlers=logging_handlers,\n ):\n return distiset\n\n num_processes = self.dag.get_total_replica_count()\n with (\n mp.Manager() as manager,\n _NoDaemonPool(\n num_processes,\n initializer=_init_worker,\n initargs=(\n self._log_queue,\n self.name,\n self.signature,\n ),\n ) as pool,\n ):\n self._manager = manager\n self._pool = pool\n self._output_queue = self.QueueClass()\n self._load_queue = self.QueueClass()\n self._handle_keyboard_interrupt()\n\n # Run the loop for receiving the load status of each step\n self._load_steps_thread = self._run_load_queue_loop_in_thread()\n\n # Start a loop to receive the output batches from the steps\n self._output_queue_thread = self._run_output_queue_loop_in_thread()\n self._output_queue_thread.join()\n\n self._teardown()\n\n if self._exception:\n raise self._exception\n\n distiset = create_distiset(\n self._cache_location[\"data\"],\n pipeline_path=self._cache_location[\"pipeline\"],\n log_filename_path=self._cache_location[\"log_file\"],\n enable_metadata=self._enable_metadata,\n dag=self.dag,\n )\n\n stop_logging()\n\n return distiset\n\n @property\n def QueueClass(self) -> Callable:\n \"\"\"The callable used to create the input and output queues.\n\n Returns:\n The callable to create a `Queue`.\n \"\"\"\n assert self._manager, \"Manager is not initialized\"\n return self._manager.Queue\n\n def _run_step(self, step: \"_Step\", input_queue: \"Queue[Any]\", replica: int) -> None:\n \"\"\"Runs the `Step` wrapped in a `_ProcessWrapper` in a separate process of the\n `Pool`.\n\n Args:\n step: The step to run.\n input_queue: The input queue to send the data to the step.\n replica: The replica ID assigned.\n \"\"\"\n assert self._pool, \"Pool is not initialized\"\n\n step_wrapper = _StepWrapper(\n step=step, # type: ignore\n replica=replica,\n input_queue=input_queue,\n output_queue=self._output_queue,\n load_queue=self._load_queue,\n dry_run=self._dry_run,\n ray_pipeline=False,\n )\n\n self._pool.apply_async(step_wrapper.run, error_callback=self._error_callback)\n\n def _error_callback(self, e: BaseException) -> None:\n \"\"\"Error callback that will be called when an error occurs in a `Step` process.\n\n Args:\n e: The exception raised by the process.\n \"\"\"\n global _SUBPROCESS_EXCEPTION\n\n # First we check that the exception is a `_StepWrapperException`, otherwise, we\n # print it out and stop the pipeline, since some errors may be unhandled\n if not isinstance(e, _StepWrapperException):\n self._logger.error(f\"\u274c Failed with an unhandled 
exception: {e}\")\n self._stop()\n return\n\n if e.is_load_error:\n self._logger.error(f\"\u274c Failed to load step '{e.step.name}': {e.message}\")\n _SUBPROCESS_EXCEPTION = e.subprocess_exception\n _SUBPROCESS_EXCEPTION.__traceback__ = tblib.Traceback.from_string( # type: ignore\n e.formatted_traceback\n ).as_traceback()\n return\n\n # If the step is global, is not in the last trophic level and has no successors,\n # then we can ignore the error and continue executing the pipeline\n step_name: str = e.step.name # type: ignore\n if (\n e.step.is_global\n and not self.dag.step_in_last_trophic_level(step_name)\n and list(self.dag.get_step_successors(step_name)) == []\n ):\n self._logger.error(\n f\"\u270b An error occurred when running global step '{step_name}' with no\"\n \" successors and not in the last trophic level. Pipeline execution can\"\n f\" continue. Error will be ignored.\"\n )\n self._logger.error(f\"Subprocess traceback:\\n\\n{e.formatted_traceback}\")\n return\n\n # Handle tasks using an `LLM` using offline batch generation\n if isinstance(\n e.subprocess_exception, DistilabelOfflineBatchGenerationNotFinishedException\n ):\n self._logger.info(\n f\"\u23f9\ufe0f '{e.step.name}' task stopped pipeline execution: LLM offline batch\"\n \" generation in progress. Rerun pipeline with cache to check results and\"\n \" continue execution.\"\n )\n self._set_step_for_recovering_offline_batch_generation(e.step, e.data) # type: ignore\n with self._stop_called_lock:\n if not self._stop_called:\n self._stop(acquire_lock=False)\n return\n\n # Global step with successors failed\n self._logger.error(f\"An error occurred in global step '{step_name}'\")\n self._logger.error(f\"Subprocess traceback:\\n\\n{e.formatted_traceback}\")\n\n self._stop()\n\n def _teardown(self) -> None:\n \"\"\"Clean/release/stop resources reserved to run the pipeline.\"\"\"\n if self._write_buffer:\n self._write_buffer.close()\n\n if self._batch_manager:\n self._batch_manager = None\n\n self._stop_load_queue_loop()\n self._load_steps_thread.join()\n\n if self._pool:\n self._pool.terminate()\n self._pool.join()\n\n if self._manager:\n self._manager.shutdown()\n self._manager.join()\n\n def _set_steps_not_loaded_exception(self) -> None:\n \"\"\"Raises a `RuntimeError` notifying that the steps load has failed.\n\n Raises:\n RuntimeError: containing the information and why a step failed to be loaded.\n \"\"\"\n self._exception = RuntimeError(\n \"Failed to load all the steps. Could not run pipeline.\"\n )\n self._exception.__cause__ = _SUBPROCESS_EXCEPTION\n\n def _stop(self, acquire_lock: bool = True) -> None:\n \"\"\"Stops the pipeline execution. It will first send `None` to the input queues\n of all the steps and then wait until the output queue is empty i.e. all the steps\n finished processing the batches that were sent before the stop flag. 
Then it will\n send `None` to the output queue to notify the pipeline to stop.\n\n Args:\n acquire_lock: Whether to acquire the lock to access the `_stop_called` attribute.\n \"\"\"\n\n if acquire_lock:\n self._stop_called_lock.acquire()\n\n if self._stop_called:\n self._stop_calls += 1\n if self._stop_calls == 1:\n self._logger.warning(\"\ud83d\uded1 Press again to force the pipeline to stop.\")\n elif self._stop_calls > 1:\n self._logger.warning(\"\ud83d\uded1 Forcing pipeline interruption.\")\n\n if self._pool:\n self._pool.terminate()\n self._pool.join()\n self._pool = None\n\n if self._manager:\n self._manager.shutdown()\n self._manager.join()\n self._manager = None\n\n stop_logging()\n\n sys.exit(1)\n\n return\n self._stop_called = True\n\n if acquire_lock:\n self._stop_called_lock.release()\n\n self._logger.debug(\n f\"Steps loaded before calling `stop`: {self._steps_load_status}\"\n )\n self._logger.info(\n \"\ud83d\uded1 Stopping pipeline. Waiting for steps to finish processing batches...\"\n )\n\n self._stop_output_queue_loop()\n
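A minimal end-to-end sketch of the local, multiprocessing-based Pipeline; the steps and their names are illustrative assumptions.
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadDataFromDicts

with Pipeline(name="local-example") as pipeline:
    loader = LoadDataFromDicts(
        name="load_data",
        data=[{"instruction": "Summarize what distilabel does."}],
    )
    keep = KeepColumns(name="keep_instruction", columns=["instruction"])

    # Steps are connected with the `>>` operator to build the DAG.
    loader >> keep

if __name__ == "__main__":
    # The guard matters because the local pipeline spawns worker processes.
    distiset = pipeline.run(use_cache=False)
    print(distiset)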
"},{"location":"api/pipeline/#distilabel.pipeline.local.Pipeline.QueueClass","title":"QueueClass
property
","text":"The callable used to create the input and output queues.
Returns:
Type DescriptionCallable
The callable to create a Queue
.
ray(ray_head_node_url=None, ray_init_kwargs=None)
","text":"Creates a RayPipeline
using the init parameters of this pipeline. This is a convenient method that can be used to \"transform\" a regular Pipeline
to a RayPipeline
and it's mainly used by the CLI.
Parameters:
Name Type Description Defaultray_head_node_url
Optional[str]
The URL that can be used to connect to the head node of the Ray cluster. Normally, you won't want to use this argument as the recommended way to submit a job to a Ray cluster is using the Ray Jobs CLI. Defaults to None
.
None
ray_init_kwargs
Optional[Dict[str, Any]]
kwargs that will be passed to the ray.init
method. Defaults to None
.
None
Returns:
Type DescriptionRayPipeline
A RayPipeline
instance.
src/distilabel/pipeline/local.py
def ray(\n self,\n ray_head_node_url: Optional[str] = None,\n ray_init_kwargs: Optional[Dict[str, Any]] = None,\n) -> RayPipeline:\n \"\"\"Creates a `RayPipeline` using the init parameters of this pipeline. This is a\n convenient method that can be used to \"transform\" one common `Pipeline` to a `RayPipeline`\n and it's mainly used by the CLI.\n\n Args:\n ray_head_node_url: The URL that can be used to connect to the head node of\n the Ray cluster. Normally, you won't want to use this argument as the\n recommended way to submit a job to a Ray cluster is using the [Ray Jobs\n CLI](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html#ray-jobs-overview).\n Defaults to `None`.\n ray_init_kwargs: kwargs that will be passed to the `ray.init` method. Defaults\n to `None`.\n\n Returns:\n A `RayPipeline` instance.\n \"\"\"\n pipeline = RayPipeline(\n name=self.name,\n description=self.description,\n cache_dir=self._cache_dir,\n enable_metadata=self._enable_metadata,\n requirements=self.requirements,\n ray_head_node_url=ray_head_node_url,\n ray_init_kwargs=ray_init_kwargs,\n )\n pipeline.dag = self.dag\n return pipeline\n
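A hedged sketch of turning a local pipeline into a RayPipeline; it assumes Ray is installed and that the script is normally submitted through the Ray Jobs CLI.
# Assumes `pipeline` was defined earlier as a regular local `Pipeline` and
# that Ray is available in the environment.
ray_pipeline = pipeline.ray()  # e.g. pipeline.ray(ray_init_kwargs={"address": "auto"})
distiset = ray_pipeline.run(use_cache=False)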
"},{"location":"api/pipeline/#distilabel.pipeline.local.Pipeline.run","title":"run(parameters=None, load_groups=None, use_cache=True, storage_parameters=None, use_fs_to_pass_data=False, dataset=None, dataset_batch_size=50, logging_handlers=None)
","text":"Runs the pipeline.
Parameters:
Name Type Description Defaultparameters
Optional[Dict[Any, Dict[str, Any]]]
A dictionary with the step name as the key and a dictionary with the runtime parameters for the step as the value. Defaults to None
.
None
load_groups
Optional[LoadGroups]
A list containing lists of steps that have to be loaded together and in isolation with respect to the rest of the steps of the pipeline. This argument also allows passing the following modes:
- \"sequential_step_execution\": each step will be executed in its own stage, i.e. the execution of the steps will be sequential.
Defaults to None
.
None
use_cache
bool
Whether to use the cache from previous pipeline runs. Defaults to True
.
True
storage_parameters
Optional[Dict[str, Any]]
A dictionary with the storage parameters (fsspec
and path) that will be used to store the data of the _Batch
es passed between the steps if use_fs_to_pass_data
is True
(for the batches received by a GlobalStep
it will be always used). It must have at least the \"path\" key, and it can contain additional keys depending on the protocol. By default, it will use the local file system and a directory in the cache directory. Defaults to None
.
None
use_fs_to_pass_data
bool
Whether to use the file system to pass the data of the _Batch
es between the steps. Even if this parameter is False
, the Batch
es received by GlobalStep
s will always use the file system to pass the data. Defaults to False
.
False
dataset
Optional[InputDataset]
If given, it will be used to create a GeneratorStep
and set it as the root step. Convenient when you have already processed the dataset in your script and just want to pass it in directly. Defaults to None
.
None
dataset_batch_size
int
If dataset
is given, this will be the size of the batches yielded by the GeneratorStep
created using the dataset
. Defaults to 50
.
50
logging_handlers
Optional[List[Handler]]
A list of logging handlers that will be used to log the output of the pipeline. This argument can be useful so the logging messages can be extracted and used in a different context. Defaults to None
.
None
Returns:
Type DescriptionDistiset
The Distiset
created by the pipeline.
Raises:
Type DescriptionRuntimeError
If the pipeline fails to load all the steps.
Source code insrc/distilabel/pipeline/local.py
def run(\n self,\n parameters: Optional[Dict[Any, Dict[str, Any]]] = None,\n load_groups: Optional[\"LoadGroups\"] = None,\n use_cache: bool = True,\n storage_parameters: Optional[Dict[str, Any]] = None,\n use_fs_to_pass_data: bool = False,\n dataset: Optional[\"InputDataset\"] = None,\n dataset_batch_size: int = 50,\n logging_handlers: Optional[List[\"logging.Handler\"]] = None,\n) -> \"Distiset\":\n \"\"\"Runs the pipeline.\n\n Args:\n parameters: A dictionary with the step name as the key and a dictionary with\n the runtime parameters for the step as the value. Defaults to `None`.\n load_groups: A list containing lists of steps that have to be loaded together\n and in isolation with respect to the rest of the steps of the pipeline.\n This argument also allows passing the following modes:\n\n - \"sequential_step_execution\": each step will be executed in a stage i.e.\n the execution of the steps will be sequential.\n\n Defaults to `None`.\n use_cache: Whether to use the cache from previous pipeline runs. Defaults to\n `True`.\n storage_parameters: A dictionary with the storage parameters (`fsspec` and path)\n that will be used to store the data of the `_Batch`es passed between the\n steps if `use_fs_to_pass_data` is `True` (for the batches received by a\n `GlobalStep` it will be always used). It must have at least the \"path\" key,\n and it can contain additional keys depending on the protocol. By default,\n it will use the local file system and a directory in the cache directory.\n Defaults to `None`.\n use_fs_to_pass_data: Whether to use the file system to pass the data of\n the `_Batch`es between the steps. Even if this parameter is `False`, the\n `Batch`es received by `GlobalStep`s will always use the file system to\n pass the data. Defaults to `False`.\n dataset: If given, it will be used to create a `GeneratorStep` and put it as the\n root step. Convenient method when you have already processed the dataset in\n your script and just want to pass it already processed. Defaults to `None`.\n dataset_batch_size: if `dataset` is given, this will be the size of the batches\n yield by the `GeneratorStep` created using the `dataset`. Defaults to `50`.\n logging_handlers: A list of logging handlers that will be used to log the\n output of the pipeline. This argument can be useful so the logging messages\n can be extracted and used in a different context. Defaults to `None`.\n\n Returns:\n The `Distiset` created by the pipeline.\n\n Raises:\n RuntimeError: If the pipeline fails to load all the steps.\n \"\"\"\n if script_executed_in_ray_cluster():\n print(\"Script running in Ray cluster... 
Using `RayPipeline`...\")\n return self.ray().run(\n parameters=parameters,\n use_cache=use_cache,\n storage_parameters=storage_parameters,\n use_fs_to_pass_data=use_fs_to_pass_data,\n dataset=dataset,\n dataset_batch_size=dataset_batch_size,\n )\n\n self._log_queue = cast(\"Queue[Any]\", mp.Queue())\n\n if distiset := super().run(\n parameters=parameters,\n load_groups=load_groups,\n use_cache=use_cache,\n storage_parameters=storage_parameters,\n use_fs_to_pass_data=use_fs_to_pass_data,\n dataset=dataset,\n dataset_batch_size=dataset_batch_size,\n logging_handlers=logging_handlers,\n ):\n return distiset\n\n num_processes = self.dag.get_total_replica_count()\n with (\n mp.Manager() as manager,\n _NoDaemonPool(\n num_processes,\n initializer=_init_worker,\n initargs=(\n self._log_queue,\n self.name,\n self.signature,\n ),\n ) as pool,\n ):\n self._manager = manager\n self._pool = pool\n self._output_queue = self.QueueClass()\n self._load_queue = self.QueueClass()\n self._handle_keyboard_interrupt()\n\n # Run the loop for receiving the load status of each step\n self._load_steps_thread = self._run_load_queue_loop_in_thread()\n\n # Start a loop to receive the output batches from the steps\n self._output_queue_thread = self._run_output_queue_loop_in_thread()\n self._output_queue_thread.join()\n\n self._teardown()\n\n if self._exception:\n raise self._exception\n\n distiset = create_distiset(\n self._cache_location[\"data\"],\n pipeline_path=self._cache_location[\"pipeline\"],\n log_filename_path=self._cache_location[\"log_file\"],\n enable_metadata=self._enable_metadata,\n dag=self.dag,\n )\n\n stop_logging()\n\n return distiset\n
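A sketch of running the local pipeline while passing batches through the file system; the storage path is an arbitrary example.
# Assumes `pipeline` was defined earlier. Batches exchanged between steps are
# written to the given path instead of being passed only through the queues.
distiset = pipeline.run(
    use_fs_to_pass_data=True,
    storage_parameters={"path": "/tmp/distilabel-batches"},
)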
"},{"location":"api/pipeline/routing_batch_function/","title":"Routing batch function","text":""},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function","title":"routing_batch_function
","text":""},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.RoutingBatchFunc","title":"RoutingBatchFunc = Callable[[List[str]], List[str]]
module-attribute
","text":"Type alias for a routing batch function. It takes a list of all the downstream steps and returns a list with the names of the steps that should receive the batch.
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.RoutingBatchFunction","title":"RoutingBatchFunction
","text":" Bases: BaseModel
, _Serializable
A thin wrapper around a routing batch function that can be used to route batches from one upstream step to specific downstream steps.
Attributes:
Name Type Descriptionrouting_function
RoutingBatchFunc
The routing function that takes a list of all the downstream steps and returns a list with the names of the steps that should receive the batch.
_step
Union[_Step, None]
The upstream step that is connected to the routing batch function.
_routed_batch_registry
Dict[str, Dict[int, List[str]]]
A dictionary that keeps track of the batches that have been routed to specific downstream steps.
Source code insrc/distilabel/pipeline/routing_batch_function.py
class RoutingBatchFunction(BaseModel, _Serializable):\n \"\"\"A thin wrapper around a routing batch function that can be used to route batches\n from one upstream step to specific downstream steps.\n\n Attributes:\n routing_function: The routing function that takes a list of all the downstream steps\n and returns a list with the names of the steps that should receive the batch.\n _step: The upstream step that is connected to the routing batch function.\n _routed_batch_registry: A dictionary that keeps track of the batches that have been\n routed to specific downstream steps.\n \"\"\"\n\n routing_function: RoutingBatchFunc\n description: Optional[str] = None\n\n _step: Union[\"_Step\", None] = PrivateAttr(default=None)\n _routed_batch_registry: Dict[str, Dict[int, List[str]]] = PrivateAttr(\n default_factory=dict\n )\n _factory_function_module: Union[str, None] = PrivateAttr(default=None)\n _factory_function_name: Union[str, None] = PrivateAttr(default=None)\n _factory_function_kwargs: Union[Dict[str, Any], None] = PrivateAttr(default=None)\n\n def route_batch(self, batch: \"_Batch\", steps: List[str]) -> List[str]:\n \"\"\"Returns a list of selected downstream steps from `steps` to which the `batch`\n should be routed.\n\n Args:\n batch: The batch that should be routed.\n steps: A list of all the downstream steps that can receive the batch.\n\n Returns:\n A list with the names of the steps that should receive the batch.\n \"\"\"\n routed_steps = self.routing_function(steps)\n self._register_routed_batch(batch, routed_steps)\n return routed_steps\n\n def set_factory_function(\n self,\n factory_function_module: str,\n factory_function_name: str,\n factory_function_kwargs: Dict[str, Any],\n ) -> None:\n \"\"\"Sets the factory function that was used to create the `routing_batch_function`.\n\n Args:\n factory_function_module: The module name where the factory function is defined.\n factory_function_name: The name of the factory function that was used to create\n the `routing_batch_function`.\n factory_function_kwargs: The keyword arguments that were used when calling the\n factory function.\n \"\"\"\n self._factory_function_module = factory_function_module\n self._factory_function_name = factory_function_name\n self._factory_function_kwargs = factory_function_kwargs\n\n def __call__(self, batch: \"_Batch\", steps: List[str]) -> List[str]:\n \"\"\"Returns a list of selected downstream steps from `steps` to which the `batch`\n should be routed.\n\n Args:\n batch: The batch that should be routed.\n steps: A list of all the downstream steps that can receive the batch.\n\n Returns:\n A list with the names of the steps that should receive the batch.\n \"\"\"\n return self.route_batch(batch, steps)\n\n def _register_routed_batch(self, batch: \"_Batch\", routed_steps: List[str]) -> None:\n \"\"\"Registers a batch that has been routed to specific downstream steps.\n\n Args:\n batch: The batch that has been routed.\n routed_steps: The list of downstream steps that have been selected to receive\n the batch.\n \"\"\"\n upstream_step = batch.step_name\n batch_seq_no = batch.seq_no\n self._routed_batch_registry.setdefault(upstream_step, {}).setdefault(\n batch_seq_no, routed_steps\n )\n\n def __rshift__(\n self, other: List[\"DownstreamConnectableSteps\"]\n ) -> List[\"DownstreamConnectableSteps\"]:\n \"\"\"Connects a list of dowstream steps to the upstream step of the routing batch\n function.\n\n Args:\n other: A list of downstream steps that should be connected to the upstream step\n of the routing batch 
function.\n\n Returns:\n The list of downstream steps that have been connected to the upstream step of the\n routing batch function.\n \"\"\"\n if not isinstance(other, list):\n raise DistilabelUserError(\n f\"Can only set a `routing_batch_function` for a list of steps. Got: {other}.\"\n \" Please, review the right-hand side of the `routing_batch_function >> other`\"\n \" expression. It should be\"\n \" `upstream_step >> routing_batch_function >> [downstream_step_1, dowstream_step_2, ...]`.\",\n page=\"sections/how_to_guides/basic/pipeline/?h=routing#routing-batches-to-specific-downstream-steps\",\n )\n\n if not self._step:\n raise DistilabelUserError(\n \"Routing batch function doesn't have an upstream step. Cannot connect downstream\"\n \" steps before connecting the upstream step. Connect this routing batch\"\n \" function to an upstream step using the `>>` operator. For example:\"\n \" `upstream_step >> routing_batch_function >> [downstream_step_1, downstream_step_2, ...]`.\",\n page=\"sections/how_to_guides/basic/pipeline/?h=routing#routing-batches-to-specific-downstream-steps\",\n )\n\n for step in other:\n self._step.connect(step)\n return other\n\n def dump(self, **kwargs: Any) -> Dict[str, Any]:\n \"\"\"Dumps the routing batch function to a dictionary, and the information of the\n factory function used to create this routing batch function.\n\n Args:\n **kwargs: Additional keyword arguments that should be included in the dump.\n\n Returns:\n A dictionary with the routing batch function information and the factory function\n information.\n \"\"\"\n dump_info: Dict[str, Any] = {\"step\": self._step.name} # type: ignore\n\n if self.description:\n dump_info[\"description\"] = self.description\n\n if type_info := self._get_type_info():\n dump_info[TYPE_INFO_KEY] = type_info\n\n return dump_info\n\n def _get_type_info(self) -> Dict[str, Any]:\n \"\"\"Returns the information of the factory function used to create the routing batch\n function.\n\n Returns:\n A dictionary with the factory function information.\n \"\"\"\n\n type_info = {}\n\n if self._factory_function_module:\n type_info[\"module\"] = self._factory_function_module\n\n if self._factory_function_name:\n type_info[\"name\"] = self._factory_function_name\n\n if self._factory_function_kwargs:\n type_info[\"kwargs\"] = self._factory_function_kwargs\n\n return type_info\n\n @classmethod\n def from_dict(cls, data: Dict[str, Any]) -> Self:\n \"\"\"Loads a routing batch function from a dictionary. It must contain the information\n of the factory function used to create the routing batch function.\n\n Args:\n data: A dictionary with the routing batch function information and the factory\n function information.\n \"\"\"\n type_info = data.get(TYPE_INFO_KEY)\n if not type_info:\n step = data.get(\"step\")\n raise ValueError(\n f\"The routing batch function for step '{step}' was created without a factory\"\n \" function, and it cannot be reconstructed.\"\n )\n\n module = type_info.get(\"module\")\n name = type_info.get(\"name\")\n kwargs = type_info.get(\"kwargs\")\n\n if not module or not name or not kwargs:\n raise ValueError(\n \"The routing batch function was created with a factory function, but the\"\n \" information is incomplete. 
Cannot reconstruct the routing batch function.\"\n )\n\n routing_batch_function = _get_module_attr(module=module, name=name)(**kwargs)\n routing_batch_function.description = data.get(\"description\")\n routing_batch_function.set_factory_function(\n factory_function_module=module,\n factory_function_name=name,\n factory_function_kwargs=kwargs,\n )\n\n return routing_batch_function\n
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.RoutingBatchFunction.route_batch","title":"route_batch(batch, steps)
","text":"Returns a list of selected downstream steps from steps
to which the batch
should be routed.
Parameters:
Name Type Description Defaultbatch
_Batch
The batch that should be routed.
requiredsteps
List[str]
A list of all the downstream steps that can receive the batch.
requiredReturns:
Type DescriptionList[str]
A list with the names of the steps that should receive the batch.
Source code insrc/distilabel/pipeline/routing_batch_function.py
def route_batch(self, batch: \"_Batch\", steps: List[str]) -> List[str]:\n \"\"\"Returns a list of selected downstream steps from `steps` to which the `batch`\n should be routed.\n\n Args:\n batch: The batch that should be routed.\n steps: A list of all the downstream steps that can receive the batch.\n\n Returns:\n A list with the names of the steps that should receive the batch.\n \"\"\"\n routed_steps = self.routing_function(steps)\n self._register_routed_batch(batch, routed_steps)\n return routed_steps\n
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.RoutingBatchFunction.set_factory_function","title":"set_factory_function(factory_function_module, factory_function_name, factory_function_kwargs)
","text":"Sets the factory function that was used to create the routing_batch_function
.
Parameters:
Name Type Description Defaultfactory_function_module
str
The module name where the factory function is defined.
requiredfactory_function_name
str
The name of the factory function that was used to create the routing_batch_function
.
factory_function_kwargs
Dict[str, Any]
The keyword arguments that were used when calling the factory function.
required Source code insrc/distilabel/pipeline/routing_batch_function.py
def set_factory_function(\n self,\n factory_function_module: str,\n factory_function_name: str,\n factory_function_kwargs: Dict[str, Any],\n) -> None:\n \"\"\"Sets the factory function that was used to create the `routing_batch_function`.\n\n Args:\n factory_function_module: The module name where the factory function is defined.\n factory_function_name: The name of the factory function that was used to create\n the `routing_batch_function`.\n factory_function_kwargs: The keyword arguments that were used when calling the\n factory function.\n \"\"\"\n self._factory_function_module = factory_function_module\n self._factory_function_name = factory_function_name\n self._factory_function_kwargs = factory_function_kwargs\n
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.RoutingBatchFunction.__call__","title":"__call__(batch, steps)
","text":"Returns a list of selected downstream steps from steps
to which the batch
should be routed.
Parameters:
Name Type Description Defaultbatch
_Batch
The batch that should be routed.
requiredsteps
List[str]
A list of all the downstream steps that can receive the batch.
requiredReturns:
Type DescriptionList[str]
A list with the names of the steps that should receive the batch.
Source code insrc/distilabel/pipeline/routing_batch_function.py
def __call__(self, batch: \"_Batch\", steps: List[str]) -> List[str]:\n \"\"\"Returns a list of selected downstream steps from `steps` to which the `batch`\n should be routed.\n\n Args:\n batch: The batch that should be routed.\n steps: A list of all the downstream steps that can receive the batch.\n\n Returns:\n A list with the names of the steps that should receive the batch.\n \"\"\"\n return self.route_batch(batch, steps)\n
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.RoutingBatchFunction.__rshift__","title":"__rshift__(other)
","text":"Connects a list of dowstream steps to the upstream step of the routing batch function.
Parameters:
Name Type Description Defaultother
List[DownstreamConnectableSteps]
A list of downstream steps that should be connected to the upstream step of the routing batch function.
requiredReturns:
Type DescriptionList[DownstreamConnectableSteps]
The list of downstream steps that have been connected to the upstream step of the
List[DownstreamConnectableSteps]
routing batch function.
Source code insrc/distilabel/pipeline/routing_batch_function.py
def __rshift__(\n self, other: List[\"DownstreamConnectableSteps\"]\n) -> List[\"DownstreamConnectableSteps\"]:\n \"\"\"Connects a list of dowstream steps to the upstream step of the routing batch\n function.\n\n Args:\n other: A list of downstream steps that should be connected to the upstream step\n of the routing batch function.\n\n Returns:\n The list of downstream steps that have been connected to the upstream step of the\n routing batch function.\n \"\"\"\n if not isinstance(other, list):\n raise DistilabelUserError(\n f\"Can only set a `routing_batch_function` for a list of steps. Got: {other}.\"\n \" Please, review the right-hand side of the `routing_batch_function >> other`\"\n \" expression. It should be\"\n \" `upstream_step >> routing_batch_function >> [downstream_step_1, dowstream_step_2, ...]`.\",\n page=\"sections/how_to_guides/basic/pipeline/?h=routing#routing-batches-to-specific-downstream-steps\",\n )\n\n if not self._step:\n raise DistilabelUserError(\n \"Routing batch function doesn't have an upstream step. Cannot connect downstream\"\n \" steps before connecting the upstream step. Connect this routing batch\"\n \" function to an upstream step using the `>>` operator. For example:\"\n \" `upstream_step >> routing_batch_function >> [downstream_step_1, downstream_step_2, ...]`.\",\n page=\"sections/how_to_guides/basic/pipeline/?h=routing#routing-batches-to-specific-downstream-steps\",\n )\n\n for step in other:\n self._step.connect(step)\n return other\n
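In practice the operator is used while wiring the DAG; a compact, hypothetical sketch with two plain steps (all names are illustrative) could look like this.
from typing import List

from distilabel.pipeline import Pipeline, routing_batch_function
from distilabel.steps import KeepColumns, LoadDataFromDicts


@routing_batch_function()
def route_to_first(steps: List[str]) -> List[str]:
    # Route every batch to just the first downstream step in the list.
    return steps[:1]


with Pipeline(name="rshift-example") as pipeline:
    loader = LoadDataFromDicts(name="load_data", data=[{"instruction": "Hello"}])
    keep_a = KeepColumns(name="keep_a", columns=["instruction"])
    keep_b = KeepColumns(name="keep_b", columns=["instruction"])

    # The first `>>` attaches the routing function to its upstream step; the
    # second `>>` is the `__rshift__` documented above, connecting the list
    # of downstream steps.
    loader >> route_to_first >> [keep_a, keep_b]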
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.RoutingBatchFunction.dump","title":"dump(**kwargs)
","text":"Dumps the routing batch function to a dictionary, and the information of the factory function used to create this routing batch function.
Parameters:
Name Type Description Default**kwargs
Any
Additional keyword arguments that should be included in the dump.
{}
Returns:
Type DescriptionDict[str, Any]
A dictionary with the routing batch function information and the factory function
Dict[str, Any]
information.
Source code insrc/distilabel/pipeline/routing_batch_function.py
def dump(self, **kwargs: Any) -> Dict[str, Any]:\n \"\"\"Dumps the routing batch function to a dictionary, and the information of the\n factory function used to create this routing batch function.\n\n Args:\n **kwargs: Additional keyword arguments that should be included in the dump.\n\n Returns:\n A dictionary with the routing batch function information and the factory function\n information.\n \"\"\"\n dump_info: Dict[str, Any] = {\"step\": self._step.name} # type: ignore\n\n if self.description:\n dump_info[\"description\"] = self.description\n\n if type_info := self._get_type_info():\n dump_info[TYPE_INFO_KEY] = type_info\n\n return dump_info\n
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.RoutingBatchFunction.from_dict","title":"from_dict(data)
classmethod
","text":"Loads a routing batch function from a dictionary. It must contain the information of the factory function used to create the routing batch function.
Parameters:
Name Type Description Defaultdata
Dict[str, Any]
A dictionary with the routing batch function information and the factory function information.
required Source code insrc/distilabel/pipeline/routing_batch_function.py
@classmethod\ndef from_dict(cls, data: Dict[str, Any]) -> Self:\n \"\"\"Loads a routing batch function from a dictionary. It must contain the information\n of the factory function used to create the routing batch function.\n\n Args:\n data: A dictionary with the routing batch function information and the factory\n function information.\n \"\"\"\n type_info = data.get(TYPE_INFO_KEY)\n if not type_info:\n step = data.get(\"step\")\n raise ValueError(\n f\"The routing batch function for step '{step}' was created without a factory\"\n \" function, and it cannot be reconstructed.\"\n )\n\n module = type_info.get(\"module\")\n name = type_info.get(\"name\")\n kwargs = type_info.get(\"kwargs\")\n\n if not module or not name or not kwargs:\n raise ValueError(\n \"The routing batch function was created with a factory function, but the\"\n \" information is incomplete. Cannot reconstruct the routing batch function.\"\n )\n\n routing_batch_function = _get_module_attr(module=module, name=name)(**kwargs)\n routing_batch_function.description = data.get(\"description\")\n routing_batch_function.set_factory_function(\n factory_function_module=module,\n factory_function_name=name,\n factory_function_kwargs=kwargs,\n )\n\n return routing_batch_function\n
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.routing_batch_function","title":"routing_batch_function(description=None)
","text":"Creates a routing batch function that can be used to route batches from one upstream step to specific downstream steps.
Parameters:
Name Type Description Defaultdescription
Optional[str]
An optional description for the routing batch function.
None
Returns:
Type DescriptionCallable[[RoutingBatchFunc], RoutingBatchFunction]
A RoutingBatchFunction
instance that can be used with the >>
operators and with
Callable[[RoutingBatchFunc], RoutingBatchFunction]
the Pipeline.connect
method when defining the pipeline.
Example:
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline, routing_batch_function\nfrom distilabel.steps import LoadDataFromHub, GroupColumns\n\n\n@routing_batch_function\ndef random_routing_batch(steps: List[str]) -> List[str]:\n return random.sample(steps, 2)\n\n\nwith Pipeline(name=\"routing-batch-function\") as pipeline:\n load_data = LoadDataFromHub()\n\n generations = []\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n task = TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n generations.append(task)\n\n combine_columns = GroupColumns(columns=[\"generation\", \"model_name\"])\n\n load_data >> random_routing_batch >> generations >> combine_columns\n
Source code in src/distilabel/pipeline/routing_batch_function.py
def routing_batch_function(\n description: Optional[str] = None,\n) -> Callable[[RoutingBatchFunc], RoutingBatchFunction]:\n \"\"\"Creates a routing batch function that can be used to route batches from one upstream\n step to specific downstream steps.\n\n Args:\n description: An optional description for the routing batch function.\n\n Returns:\n A `RoutingBatchFunction` instance that can be used with the `>>` operators and with\n the `Pipeline.connect` method when defining the pipeline.\n\n Example:\n\n ```python\n from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\n from distilabel.pipeline import Pipeline, routing_batch_function\n from distilabel.steps import LoadDataFromHub, GroupColumns\n\n\n @routing_batch_function\n def random_routing_batch(steps: List[str]) -> List[str]:\n return random.sample(steps, 2)\n\n\n with Pipeline(name=\"routing-batch-function\") as pipeline:\n load_data = LoadDataFromHub()\n\n generations = []\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n task = TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n generations.append(task)\n\n combine_columns = GroupColumns(columns=[\"generation\", \"model_name\"])\n\n load_data >> random_routing_batch >> generations >> combine_columns\n ```\n \"\"\"\n\n def decorator(func: RoutingBatchFunc) -> RoutingBatchFunction:\n factory_function_name, factory_function_module, factory_function_kwargs = (\n None,\n None,\n None,\n )\n\n # Check if `routing_batch_function` was created using a factory function from an installed package\n stack = inspect.stack()\n if len(stack) > 2:\n factory_function_frame_info = stack[1]\n\n # Function factory path\n if factory_function_frame_info.function != \"<module>\":\n factory_function_name = factory_function_frame_info.function\n factory_function_module = inspect.getmodule(\n factory_function_frame_info.frame\n ).__name__ # type: ignore\n\n # Function factory kwargs\n factory_function_kwargs = factory_function_frame_info.frame.f_locals\n\n routing_batch_function = RoutingBatchFunction(\n routing_function=func,\n description=description,\n )\n\n if (\n factory_function_module\n and factory_function_name\n and factory_function_kwargs\n ):\n routing_batch_function.set_factory_function(\n factory_function_module=factory_function_module,\n factory_function_name=factory_function_name,\n factory_function_kwargs=factory_function_kwargs,\n )\n\n return routing_batch_function\n\n return decorator\n
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.sample_n_steps","title":"sample_n_steps(n)
","text":"A simple function that creates a routing batch function that samples n
steps from the list of all the downstream steps.
Parameters:
Name Type Description Defaultn
int
The number of steps to sample from the list of all the downstream steps.
requiredReturns:
Type DescriptionRoutingBatchFunction
A RoutingBatchFunction
instance that can be used with the >>
operators and with
RoutingBatchFunction
the Pipeline.connect
method when defining the pipeline.
Example:
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline, sample_n_steps\nfrom distilabel.steps import LoadDataFromHub, GroupColumns\n\nrandom_routing_batch = sample_n_steps(2)\n\n\nwith Pipeline(name=\"routing-batch-function\") as pipeline:\n load_data = LoadDataFromHub()\n\n generations = []\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n task = TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n generations.append(task)\n\n combine_columns = GroupColumns(columns=[\"generation\", \"model_name\"])\n\n load_data >> random_routing_batch >> generations >> combine_columns\n
Source code in src/distilabel/pipeline/routing_batch_function.py
def sample_n_steps(n: int) -> RoutingBatchFunction:\n \"\"\"A simple function that creates a routing batch function that samples `n` steps from\n the list of all the downstream steps.\n\n Args:\n n: The number of steps to sample from the list of all the downstream steps.\n\n Returns:\n A `RoutingBatchFunction` instance that can be used with the `>>` operators and with\n the `Pipeline.connect` method when defining the pipeline.\n\n Example:\n\n ```python\n from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\n from distilabel.pipeline import Pipeline, sample_n_steps\n from distilabel.steps import LoadDataFromHub, GroupColumns\n\n random_routing_batch = sample_n_steps(2)\n\n\n with Pipeline(name=\"routing-batch-function\") as pipeline:\n load_data = LoadDataFromHub()\n\n generations = []\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n task = TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n generations.append(task)\n\n combine_columns = GroupColumns(columns=[\"generation\", \"model_name\"])\n\n load_data >> random_routing_batch >> generations >> combine_columns\n ```\n \"\"\"\n\n @routing_batch_function(\n description=f\"Sample {n} steps from the list of downstream steps.\"\n )\n def sample_n(steps: List[str]) -> List[str]:\n return random.sample(steps, n)\n\n return sample_n\n
"},{"location":"api/pipeline/step_wrapper/","title":"Step Wrapper","text":""},{"location":"api/pipeline/step_wrapper/#distilabel.pipeline.step_wrapper._StepWrapper","title":"_StepWrapper
","text":"Wrapper to run the Step
.
Attributes:
Name Type Descriptionstep
The step to run.
replica
The replica ID assigned.
input_queue
The queue to receive the input data.
output_queue
The queue to send the output data.
load_queue
The queue used to notify the main process that the step has been loaded, has been unloaded or has failed to load.
Source code insrc/distilabel/pipeline/step_wrapper.py
class _StepWrapper:\n \"\"\"Wrapper to run the `Step`.\n\n Attributes:\n step: The step to run.\n replica: The replica ID assigned.\n input_queue: The queue to receive the input data.\n output_queue: The queue to send the output data.\n load_queue: The queue used to notify the main process that the step has been loaded,\n has been unloaded or has failed to load.\n \"\"\"\n\n def __init__(\n self,\n step: Union[\"Step\", \"GeneratorStep\"],\n replica: int,\n input_queue: \"Queue[_Batch]\",\n output_queue: \"Queue[_Batch]\",\n load_queue: \"Queue[Union[StepLoadStatus, None]]\",\n dry_run: bool = False,\n ray_pipeline: bool = False,\n ) -> None:\n \"\"\"Initializes the `_ProcessWrapper`.\n\n Args:\n step: The step to run.\n input_queue: The queue to receive the input data.\n output_queue: The queue to send the output data.\n load_queue: The queue used to notify the main process that the step has been\n loaded, has been unloaded or has failed to load.\n dry_run: Flag to ensure we are forcing to run the last batch.\n ray_pipeline: Whether the step is running a `RayPipeline` or not.\n \"\"\"\n self.step = step\n self.replica = replica\n self.input_queue = input_queue\n self.output_queue = output_queue\n self.load_queue = load_queue\n self.dry_run = dry_run\n self.ray_pipeline = ray_pipeline\n\n self._init_cuda_device_placement()\n\n def _init_cuda_device_placement(self) -> None:\n \"\"\"Sets the LLM identifier and the number of desired GPUs of the `CudaDevicePlacementMixin`\"\"\"\n\n def _init_cuda_device_placement_mixin(attr: CudaDevicePlacementMixin) -> None:\n if self.ray_pipeline:\n attr.disable_cuda_device_placement = True\n else:\n desired_num_gpus = self.step.resources.gpus or 1\n attr._llm_identifier = f\"{self.step.name}-replica-{self.replica}\"\n attr._desired_num_gpus = desired_num_gpus\n\n for field_name in self.step.model_fields_set:\n attr = getattr(self.step, field_name)\n if isinstance(attr, CudaDevicePlacementMixin):\n _init_cuda_device_placement_mixin(attr)\n\n if isinstance(self.step, CudaDevicePlacementMixin):\n _init_cuda_device_placement_mixin(self.step)\n\n def run(self) -> str:\n \"\"\"The target function executed by the process. 
This function will also handle\n the step lifecycle, executing first the `load` function of the `Step` and then\n waiting to receive a batch from the `input_queue` that will be handled by the\n `process` method of the `Step`.\n\n Returns:\n The name of the step that was executed.\n \"\"\"\n\n try:\n self.step.load()\n self.step._logger.debug(f\"Step '{self.step.name}' loaded!\")\n except Exception as e:\n self.step.unload()\n self._notify_load_failed()\n raise _StepWrapperException.create_load_error(\n message=f\"Step load failed: {e}\",\n step=self.step,\n subprocess_exception=e,\n ) from e\n\n self._notify_load()\n\n if self.step.is_generator:\n self._generator_step_process_loop()\n else:\n self._non_generator_process_loop()\n\n # Just in case `None` sentinel was sent\n try:\n self.input_queue.get(block=False)\n except Exception:\n pass\n\n self.step.unload()\n\n self._notify_unload()\n\n self.step._logger.info(\n f\"\ud83c\udfc1 Finished running step '{self.step.name}' (replica ID: {self.replica})\"\n )\n\n return self.step.name # type: ignore\n\n def _notify_load(self) -> None:\n \"\"\"Notifies that the step has finished executing its `load` function successfully.\"\"\"\n self.step._logger.debug(\n f\"Notifying load of step '{self.step.name}' (replica ID {self.replica})...\"\n )\n self.load_queue.put({\"name\": self.step.name, \"status\": \"loaded\"}) # type: ignore\n\n def _notify_unload(self) -> None:\n \"\"\"Notifies that the step has been unloaded.\"\"\"\n self.step._logger.debug(\n f\"Notifying unload of step '{self.step.name}' (replica ID {self.replica})...\"\n )\n self.load_queue.put({\"name\": self.step.name, \"status\": \"unloaded\"}) # type: ignore\n\n def _notify_load_failed(self) -> None:\n \"\"\"Notifies that the step failed to load.\"\"\"\n self.step._logger.debug(\n f\"Notifying load failed of step '{self.step.name}' (replica ID {self.replica})...\"\n )\n self.load_queue.put({\"name\": self.step.name, \"status\": \"load_failed\"}) # type: ignore\n\n def _generator_step_process_loop(self) -> None:\n \"\"\"Runs the process loop for a generator step. It will call the `process` method\n of the step and send the output data to the `output_queue` and block until the next\n batch request is received (i.e. 
receiving an empty batch from the `input_queue`).\n\n If the `last_batch` attribute of the batch is `True`, the loop will stop and the\n process will finish.\n\n Raises:\n _StepWrapperException: If an error occurs during the execution of the\n `process` method.\n \"\"\"\n step = cast(\"GeneratorStep\", self.step)\n\n try:\n if (batch := self.input_queue.get()) is None:\n self.step._logger.info(\n f\"\ud83d\uded1 Stopping yielding batches from step '{self.step.name}'\"\n )\n return\n\n offset = batch.seq_no * step.batch_size # type: ignore\n\n self.step._logger.info(\n f\"\ud83d\udeb0 Starting yielding batches from generator step '{self.step.name}'.\"\n f\" Offset: {offset}\"\n )\n\n for data, last_batch in step.process_applying_mappings(offset=offset):\n batch.set_data([data])\n batch.last_batch = self.dry_run or last_batch\n self._send_batch(batch)\n\n if batch.last_batch:\n return\n\n self.step._logger.debug(\n f\"Step '{self.step.name}' waiting for next batch request...\"\n )\n if (batch := self.input_queue.get()) is None:\n self.step._logger.info(\n f\"\ud83d\uded1 Stopping yielding batches from step '{self.step.name}'\"\n )\n return\n except Exception as e:\n raise _StepWrapperException(str(e), self.step, 2, e) from e\n\n def _non_generator_process_loop(self) -> None:\n \"\"\"Runs the process loop for a non-generator step. It will call the `process`\n method of the step and send the output data to the `output_queue` and block until\n the next batch is received from the `input_queue`. If the `last_batch` attribute\n of the batch is `True`, the loop will stop and the process will finish.\n\n If an error occurs during the execution of the `process` method and the step is\n global, the process will raise a `_StepWrapperException`. If the step is not\n global, the process will log the error and send an empty batch to the `output_queue`.\n\n Raises:\n _StepWrapperException: If an error occurs during the execution of the\n `process` method and the step is global.\n \"\"\"\n step = cast(\"Step\", self.step)\n while True:\n if (batch := self.input_queue.get()) is None:\n self.step._logger.info(\n f\"\ud83d\uded1 Stopping processing batches from step '{self.step.name}'\"\n )\n break\n\n if batch == LAST_BATCH_SENT_FLAG:\n self.step._logger.debug(\"Received `LAST_BATCH_SENT_FLAG`. 
Stopping...\")\n break\n\n self.step._logger.info(\n f\"\ud83d\udce6 Processing batch {batch.seq_no} in '{batch.step_name}' (replica ID: {self.replica})\"\n )\n\n if batch.data_path is not None:\n self.step._logger.debug(f\"Reading batch data from '{batch.data_path}'\")\n batch.read_batch_data_from_fs()\n\n result = []\n try:\n if self.step.has_multiple_inputs:\n result = next(step.process_applying_mappings(*batch.data))\n else:\n result = next(step.process_applying_mappings(batch.data[0]))\n except Exception as e:\n if self.step.is_global:\n self.step.unload()\n self._notify_unload()\n data = (\n batch.data\n if isinstance(\n e, DistilabelOfflineBatchGenerationNotFinishedException\n )\n else None\n )\n raise _StepWrapperException(str(e), self.step, 2, e, data) from e\n\n # Impute step outputs columns with `None`\n result = self._impute_step_outputs(batch)\n\n # if the step is not global then we can skip the batch which means sending\n # an empty batch to the output queue\n self.step._logger.warning(\n f\"\u26a0\ufe0f Processing batch {batch.seq_no} with step '{self.step.name}' failed.\"\n \" Sending empty batch filled with `None`s...\"\n )\n self.step._logger.warning(\n f\"Subprocess traceback:\\n\\n{traceback.format_exc()}\"\n )\n finally:\n batch.set_data([result])\n self._send_batch(batch)\n\n if batch.last_batch:\n break\n\n def _impute_step_outputs(self, batch: \"_Batch\") -> List[Dict[str, Any]]:\n \"\"\"Imputes the step outputs columns with `None` in the batch data.\n\n Args:\n batch: The batch to impute.\n \"\"\"\n return self.step.impute_step_outputs(batch.data[0])\n\n def _send_batch(self, batch: _Batch) -> None:\n \"\"\"Sends a batch to the `output_queue`.\"\"\"\n if batch.data_path is not None:\n self.step._logger.debug(f\"Writing batch data to '{batch.data_path}'\")\n batch.write_batch_data_to_fs()\n\n self.step._logger.info(\n f\"\ud83d\udce8 Step '{batch.step_name}' sending batch {batch.seq_no} to output queue\"\n )\n self.output_queue.put(batch)\n
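To make the lifecycle above concrete, here is a minimal, hypothetical sketch of how a `_StepWrapper` could be wired up with standard `multiprocessing` queues. `my_step` is a placeholder for an already configured `Step`; in practice this wiring is done by the pipeline itself rather than by user code.

```python
from multiprocessing import Queue

# Hypothetical wiring: the pipeline normally creates the queues, assigns the
# replica ID and spawns the process that calls `run()`.
input_queue = Queue()   # batches to process arrive here
output_queue = Queue()  # processed batches are sent here
load_queue = Queue()    # load/unload/load_failed notifications for the main process

wrapper = _StepWrapper(
    step=my_step,        # an already configured `Step` instance (placeholder)
    replica=0,
    input_queue=input_queue,
    output_queue=output_queue,
    load_queue=load_queue,
    dry_run=False,
)

# Loads the step, processes batches until the last one, then unloads it.
finished_step_name = wrapper.run()
```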
"},{"location":"api/pipeline/step_wrapper/#distilabel.pipeline.step_wrapper._StepWrapper.__init__","title":"__init__(step, replica, input_queue, output_queue, load_queue, dry_run=False, ray_pipeline=False)
","text":"Initializes the _ProcessWrapper
.
Parameters:
Name Type Description Defaultstep
Union[Step, GeneratorStep]
The step to run.
requiredinput_queue
Queue[_Batch]
The queue to receive the input data.
requiredoutput_queue
Queue[_Batch]
The queue to send the output data.
requiredload_queue
Queue[Union[StepLoadStatus, None]]
The queue used to notify the main process that the step has been loaded, has been unloaded or has failed to load.
requireddry_run
bool
Flag used during dry runs to force the batch to be marked as the last one.
False
ray_pipeline
bool
Whether the step is running a RayPipeline
or not.
False
Source code in src/distilabel/pipeline/step_wrapper.py
def __init__(\n self,\n step: Union[\"Step\", \"GeneratorStep\"],\n replica: int,\n input_queue: \"Queue[_Batch]\",\n output_queue: \"Queue[_Batch]\",\n load_queue: \"Queue[Union[StepLoadStatus, None]]\",\n dry_run: bool = False,\n ray_pipeline: bool = False,\n) -> None:\n \"\"\"Initializes the `_ProcessWrapper`.\n\n Args:\n step: The step to run.\n input_queue: The queue to receive the input data.\n output_queue: The queue to send the output data.\n load_queue: The queue used to notify the main process that the step has been\n loaded, has been unloaded or has failed to load.\n dry_run: Flag to ensure we are forcing to run the last batch.\n ray_pipeline: Whether the step is running a `RayPipeline` or not.\n \"\"\"\n self.step = step\n self.replica = replica\n self.input_queue = input_queue\n self.output_queue = output_queue\n self.load_queue = load_queue\n self.dry_run = dry_run\n self.ray_pipeline = ray_pipeline\n\n self._init_cuda_device_placement()\n
"},{"location":"api/pipeline/step_wrapper/#distilabel.pipeline.step_wrapper._StepWrapper.run","title":"run()
","text":"The target function executed by the process. This function will also handle the step lifecycle, executing first the load
function of the Step
and then waiting to receive a batch from the input_queue
that will be handled by the process
method of the Step
.
Returns:
Type Descriptionstr
The name of the step that was executed.
Source code insrc/distilabel/pipeline/step_wrapper.py
def run(self) -> str:\n \"\"\"The target function executed by the process. This function will also handle\n the step lifecycle, executing first the `load` function of the `Step` and then\n waiting to receive a batch from the `input_queue` that will be handled by the\n `process` method of the `Step`.\n\n Returns:\n The name of the step that was executed.\n \"\"\"\n\n try:\n self.step.load()\n self.step._logger.debug(f\"Step '{self.step.name}' loaded!\")\n except Exception as e:\n self.step.unload()\n self._notify_load_failed()\n raise _StepWrapperException.create_load_error(\n message=f\"Step load failed: {e}\",\n step=self.step,\n subprocess_exception=e,\n ) from e\n\n self._notify_load()\n\n if self.step.is_generator:\n self._generator_step_process_loop()\n else:\n self._non_generator_process_loop()\n\n # Just in case `None` sentinel was sent\n try:\n self.input_queue.get(block=False)\n except Exception:\n pass\n\n self.step.unload()\n\n self._notify_unload()\n\n self.step._logger.info(\n f\"\ud83c\udfc1 Finished running step '{self.step.name}' (replica ID: {self.replica})\"\n )\n\n return self.step.name # type: ignore\n
"},{"location":"api/pipeline/step_wrapper/#distilabel.pipeline.step_wrapper._StepWrapperException","title":"_StepWrapperException
","text":" Bases: Exception
Exception to be raised when an error occurs in the _StepWrapper
class.
Attributes:
Name Type Descriptionmessage
The error message.
step
The Step
that raised the error.
code
The error code.
subprocess_exception
The exception raised by the subprocess.
data
The data that caused the error. Defaults to None
.
src/distilabel/pipeline/step_wrapper.py
class _StepWrapperException(Exception):\n \"\"\"Exception to be raised when an error occurs in the `_StepWrapper` class.\n\n Attributes:\n message: The error message.\n step: The `Step` that raised the error.\n code: The error code.\n subprocess_exception: The exception raised by the subprocess.\n data: The data that caused the error. Defaults to `None`.\n \"\"\"\n\n def __init__(\n self,\n message: str,\n step: \"_Step\",\n code: int,\n subprocess_exception: Exception,\n data: Optional[List[List[Dict[str, Any]]]] = None,\n ) -> None:\n self.message = f\"{message}\\n\\nFor further information visit '{DISTILABEL_DOCS_URL}api/pipeline/step_wrapper'\"\n self.step = step\n self.code = code\n self.subprocess_exception = subprocess_exception\n self.formatted_traceback = \"\".join(\n traceback.format_exception(\n type(subprocess_exception),\n subprocess_exception,\n subprocess_exception.__traceback__,\n )\n )\n self.data = data\n\n @classmethod\n def create_load_error(\n cls,\n message: str,\n step: \"_Step\",\n subprocess_exception: Optional[Exception] = None,\n ) -> \"_StepWrapperException\":\n \"\"\"Creates a `_StepWrapperException` for a load error.\n\n Args:\n message: The error message.\n step: The `Step` that raised the error.\n subprocess_exception: The exception raised by the subprocess. Defaults to `None`.\n\n Returns:\n The `_StepWrapperException` instance.\n \"\"\"\n return cls(message, step, 1, subprocess_exception, None)\n\n @property\n def is_load_error(self) -> bool:\n \"\"\"Whether the error is a load error.\n\n Returns:\n `True` if the error is a load error, `False` otherwise.\n \"\"\"\n return self.code == 1\n
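A rough sketch of how calling code could inspect this exception; the `wrapper` object is the hypothetical one from the earlier sketch.

```python
try:
    wrapper.run()
except _StepWrapperException as e:
    # `code == 1` is used for load errors (see `create_load_error`),
    # `2` for errors raised while processing a batch.
    if e.is_load_error:
        print(f"Step '{e.step.name}' failed to load: {e.message}")
    else:
        print(f"Step '{e.step.name}' failed while processing: {e.message}")
    # Pre-formatted traceback of the exception raised in the subprocess.
    print(e.formatted_traceback)
```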
"},{"location":"api/pipeline/step_wrapper/#distilabel.pipeline.step_wrapper._StepWrapperException.is_load_error","title":"is_load_error
property
","text":"Whether the error is a load error.
Returns:
Type Descriptionbool
True
if the error is a load error, False
otherwise.
create_load_error(message, step, subprocess_exception=None)
classmethod
","text":"Creates a _StepWrapperException
for a load error.
Parameters:
Name Type Description Defaultmessage
str
The error message.
requiredstep
_Step
The Step
that raised the error.
subprocess_exception
Optional[Exception]
The exception raised by the subprocess. Defaults to None
.
None
Returns:
Type Description_StepWrapperException
The _StepWrapperException
instance.
src/distilabel/pipeline/step_wrapper.py
@classmethod\ndef create_load_error(\n cls,\n message: str,\n step: \"_Step\",\n subprocess_exception: Optional[Exception] = None,\n) -> \"_StepWrapperException\":\n \"\"\"Creates a `_StepWrapperException` for a load error.\n\n Args:\n message: The error message.\n step: The `Step` that raised the error.\n subprocess_exception: The exception raised by the subprocess. Defaults to `None`.\n\n Returns:\n The `_StepWrapperException` instance.\n \"\"\"\n return cls(message, step, 1, subprocess_exception, None)\n
"},{"location":"api/pipeline/typing/","title":"Pipeline Typing","text":""},{"location":"api/pipeline/typing/#distilabel.pipeline.typing","title":"typing
","text":""},{"location":"api/pipeline/typing/#distilabel.pipeline.typing.DownstreamConnectable","title":"DownstreamConnectable = Union['Step', 'GlobalStep']
module-attribute
","text":"Alias for the Step
types that can be connected as downstream steps.
UpstreamConnectableSteps = TypeVar('UpstreamConnectableSteps', bound=Union['Step', 'GlobalStep', 'GeneratorStep'])
module-attribute
","text":"Type for the Step
types that can be connected as upstream steps.
DownstreamConnectableSteps = TypeVar('DownstreamConnectableSteps', bound=DownstreamConnectable, covariant=True)
module-attribute
","text":"Type for the Step
types that can be connected as downstream steps.
PipelineRuntimeParametersInfo = Dict[str, Union[List['RuntimeParameterInfo'], Dict[str, 'RuntimeParameterInfo']]]
module-attribute
","text":"Alias for the information of the runtime parameters of a Pipeline
.
InputDataset = Union['Dataset', 'pd.DataFrame', List[Dict[str, str]]]
module-attribute
","text":"Alias for the types we can process as input dataset.
"},{"location":"api/pipeline/typing/#distilabel.pipeline.typing.LoadGroups","title":"LoadGroups = Union[List[List[Any]], Literal['sequential_step_execution']]
module-attribute
","text":"Alias for the types that can be used as load groups.
List[List[Any]]
, it's a list containing lists of steps that have to be loaded in isolation.StepLoadStatus
","text":" Bases: TypedDict
Dict containing information about whether a step was loaded or unloaded, or whether its load failed
Source code insrc/distilabel/pipeline/typing.py
class StepLoadStatus(TypedDict):\n \"\"\"Dict containing information about if one step was loaded/unloaded or if it's load\n failed\"\"\"\n\n name: str\n status: Literal[\"loaded\", \"unloaded\", \"load_failed\"]\n
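For example, the payload that a step wrapper puts on the load queue (see `_StepWrapper._notify_load` earlier in this reference) follows this shape; the step name below is hypothetical:

```python
# "text_generation_0" is a placeholder step name.
status: "StepLoadStatus" = {"name": "text_generation_0", "status": "loaded"}
```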
"},{"location":"api/step/","title":"Step","text":"This section contains the API reference for the distilabel
step, both for the _Step
base class and the Step
class.
For more information and examples on how to use existing steps or create custom ones, please refer to Tutorial - Step.
"},{"location":"api/step/#distilabel.steps.base","title":"base
","text":""},{"location":"api/step/#distilabel.steps.base.StepInput","title":"StepInput = Annotated[List[Dict[str, Any]], _STEP_INPUT_ANNOTATION]
module-attribute
","text":"StepInput is just an Annotated
alias of the typing List[Dict[str, Any]]
with extra metadata that allows distilabel
to perform validations over the process
step method defined in each Step
_Step
","text":" Bases: RuntimeParametersMixin
, RequirementsMixin
, SignatureMixin
, BaseModel
, _Serializable
, ABC
Base class for the steps that can be included in a Pipeline
.
A Step
is a class defining some processing logic. The inputs and outputs for this processing logic are lists of dictionaries with the same keys:
```python\n[\n {\"column1\": \"value1\", \"column2\": \"value2\", ...},\n {\"column1\": \"value1\", \"column2\": \"value2\", ...},\n {\"column1\": \"value1\", \"column2\": \"value2\", ...},\n]\n```\n
The processing logic is defined in the process
method, which depending on the number of previous steps, can receive more than one list of dictionaries, each with the output of the previous steps. In order to make distilabel
know where the outputs from the previous steps are, the process
function from each Step
must have an argument or positional argument annotated with StepInput
.
```python\nclass StepWithOnePreviousStep(Step):\n def process(self, inputs: StepInput) -> StepOutput:\n yield [...]\n\nclass StepWithSeveralPreviousStep(Step):\n # mind the * to indicate that the argument is a list of StepInput\n def process(self, *inputs: StepInput) -> StepOutput:\n yield [...]\n```\n
In order to perform static validations and to check that the chaining of the steps in the pipeline is valid, a Step
must also define the inputs
and outputs
properties:
inputs
: a list of strings with the names of the columns that the step needs as input. It can be an empty list if the step is a generator step.outputs
: a list of strings with the names of the columns that the step will produce as output.Optionally, a Step
can override the load
method to perform any initialization logic before the process
method is called. For example, to load an LLM, establish a connection to a database, etc.
Finally, the Step
class inherits from pydantic.BaseModel
, so attributes can be easily defined, validated, serialized and included in the __init__
method of the step.
src/distilabel/steps/base.py
class _Step(\n RuntimeParametersMixin,\n RequirementsMixin,\n SignatureMixin,\n BaseModel,\n _Serializable,\n ABC,\n):\n \"\"\"Base class for the steps that can be included in a `Pipeline`.\n\n A `Step` is a class defining some processing logic. The input and outputs for this\n processing logic are lists of dictionaries with the same keys:\n\n ```python\n [\n {\"column1\": \"value1\", \"column2\": \"value2\", ...},\n {\"column1\": \"value1\", \"column2\": \"value2\", ...},\n {\"column1\": \"value1\", \"column2\": \"value2\", ...},\n ]\n ```\n\n The processing logic is defined in the `process` method, which depending on the\n number of previous steps, can receive more than one list of dictionaries, each with\n the output of the previous steps. In order to make `distilabel` know where the outputs\n from the previous steps are, the `process` function from each `Step` must have an argument\n or positional argument annotated with `StepInput`.\n\n ```python\n class StepWithOnePreviousStep(Step):\n def process(self, inputs: StepInput) -> StepOutput:\n yield [...]\n\n class StepWithSeveralPreviousStep(Step):\n # mind the * to indicate that the argument is a list of StepInput\n def process(self, *inputs: StepInput) -> StepOutput:\n yield [...]\n ```\n\n In order to perform static validations and to check that the chaining of the steps\n in the pipeline is valid, a `Step` must also define the `inputs` and `outputs`\n properties:\n\n - `inputs`: a list of strings with the names of the columns that the step needs as\n input. It can be an empty list if the step is a generator step.\n - `outputs`: a list of strings with the names of the columns that the step will\n produce as output.\n\n Optionally, a `Step` can override the `load` method to perform any initialization\n logic before the `process` method is called. For example, to load an LLM, stablish a\n connection to a database, etc.\n\n Finally, the `Step` class inherits from `pydantic.BaseModel`, so attributes can be easily\n defined, validated, serialized and included in the `__init__` method of the step.\n \"\"\"\n\n model_config = ConfigDict(\n arbitrary_types_allowed=True,\n validate_default=True,\n validate_assignment=True,\n extra=\"forbid\",\n )\n\n name: Optional[str] = Field(default=None, pattern=r\"^[a-zA-Z0-9_-]+$\")\n resources: StepResources = StepResources()\n pipeline: Any = Field(default=None, exclude=True, repr=False)\n input_mappings: Dict[str, str] = {}\n output_mappings: Dict[str, str] = {}\n use_cache: bool = True\n\n _pipeline_artifacts_path: Path = PrivateAttr(None)\n _built_from_decorator: bool = PrivateAttr(default=False)\n _logger: \"Logger\" = PrivateAttr(None)\n\n def model_post_init(self, __context: Any) -> None:\n from distilabel.pipeline.base import _GlobalPipelineManager\n\n super().model_post_init(__context)\n\n if self.pipeline is None:\n self.pipeline = _GlobalPipelineManager.get_pipeline()\n\n if self.pipeline is None:\n _logger = logging.getLogger(f\"distilabel.step.{self.name}\")\n _logger.warning(\n f\"Step '{self.name}' hasn't received a pipeline, and it hasn't been\"\n \" created within a `Pipeline` context. 
Please, use\"\n \" `with Pipeline() as pipeline:` and create the step within the context.\"\n )\n\n if not self.name:\n # This must be done before the check for repeated names, but assuming\n # we are passing the pipeline from the _GlobalPipelineManager, should\n # be done after that.\n self.name = _infer_step_name(type(self).__name__, self.pipeline)\n\n if self.pipeline is not None:\n # If not set an error will be raised in `Pipeline.run` parent\n self.pipeline._add_step(self)\n\n def connect(\n self,\n *steps: \"_Step\",\n routing_batch_function: Optional[\"RoutingBatchFunction\"] = None,\n ) -> None:\n \"\"\"Connects the current step to another step in the pipeline, which means that\n the output of this step will be the input of the other step.\n\n Args:\n steps: The steps to connect to the current step.\n routing_batch_function: A function that receives a list of steps and returns\n a list of steps to which the output batch generated by this step should be\n routed. It should be used to define the routing logic of the pipeline. If\n not provided, the output batch will be routed to all the connected steps.\n Defaults to `None`.\n \"\"\"\n assert self.pipeline is not None\n\n if routing_batch_function:\n self._set_routing_batch_function(routing_batch_function)\n\n for step in steps:\n self.pipeline._add_edge(from_step=self.name, to_step=step.name) # type: ignore\n\n def _set_routing_batch_function(\n self, routing_batch_function: \"RoutingBatchFunction\"\n ) -> None:\n \"\"\"Sets a routing batch function for the batches generated by this step, so they\n get routed to specific downstream steps.\n\n Args:\n routing_batch_function: The routing batch function that will be used to route\n the batches generated by this step.\n \"\"\"\n self.pipeline._add_routing_batch_function(\n step_name=self.name, # type: ignore\n routing_batch_function=routing_batch_function,\n )\n routing_batch_function._step = self\n\n @overload\n def __rshift__(self, other: \"RoutingBatchFunction\") -> \"RoutingBatchFunction\": ...\n\n @overload\n def __rshift__(\n self, other: List[\"DownstreamConnectableSteps\"]\n ) -> List[\"DownstreamConnectableSteps\"]: ...\n\n @overload\n def __rshift__(self, other: \"DownstreamConnectable\") -> \"DownstreamConnectable\": ...\n\n def __rshift__(\n self,\n other: Union[\n \"DownstreamConnectable\",\n \"RoutingBatchFunction\",\n List[\"DownstreamConnectableSteps\"],\n ],\n ) -> Union[\n \"DownstreamConnectable\",\n \"RoutingBatchFunction\",\n List[\"DownstreamConnectableSteps\"],\n ]:\n \"\"\"Allows using the `>>` operator to connect steps in the pipeline.\n\n Args:\n other: The step to connect, a list of steps to connect to or a routing batch\n function to be set for the step.\n\n Returns:\n The connected step, the list of connected steps or the routing batch function.\n\n Example:\n ```python\n step1 >> step2\n # Would be equivalent to:\n step1.connect(step2)\n\n # It also allows to connect a list of steps\n step1 >> [step2, step3]\n ```\n \"\"\"\n # Here to avoid circular imports\n from distilabel.pipeline.routing_batch_function import RoutingBatchFunction\n\n if isinstance(other, list):\n self.connect(*other)\n return other\n\n if isinstance(other, RoutingBatchFunction):\n self._set_routing_batch_function(other)\n return other\n\n self.connect(other)\n return other\n\n def __rrshift__(self, other: List[\"UpstreamConnectableSteps\"]) -> Self:\n \"\"\"Allows using the [step1, step2] >> step3 operator to connect a list of steps in the pipeline\n to a single step, as the list 
doesn't have the __rshift__ operator.\n\n Args:\n other: The step to connect to.\n\n Returns:\n The connected step\n\n Example:\n ```python\n [step2, step3] >> step1\n # Would be equivalent to:\n step2.connect(step1)\n step3.connect(step1)\n ```\n \"\"\"\n for o in other:\n o.connect(self)\n return self\n\n def load(self) -> None:\n \"\"\"Method to perform any initialization logic before the `process` method is\n called. For example, to load an LLM, stablish a connection to a database, etc.\n \"\"\"\n self._logger = logging.getLogger(f\"distilabel.step.{self.name}\")\n\n def unload(self) -> None:\n \"\"\"Method to perform any cleanup logic after the `process` method is called. For\n example, to close a connection to a database, etc.\n \"\"\"\n self._logger.debug(\"Executing step unload logic.\")\n\n @property\n def is_generator(self) -> bool:\n \"\"\"Whether the step is a generator step or not.\n\n Returns:\n `True` if the step is a generator step, `False` otherwise.\n \"\"\"\n return isinstance(self, GeneratorStep)\n\n @property\n def is_global(self) -> bool:\n \"\"\"Whether the step is a global step or not.\n\n Returns:\n `True` if the step is a global step, `False` otherwise.\n \"\"\"\n return isinstance(self, GlobalStep)\n\n @property\n def is_normal(self) -> bool:\n \"\"\"Whether the step is a normal step or not.\n\n Returns:\n `True` if the step is a normal step, `False` otherwise.\n \"\"\"\n return not self.is_generator and not self.is_global\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"List of strings with the names of the mandatory columns that the step needs as\n input or dictionary in which the keys are the input columns of the step and the\n values are booleans indicating whether the column is optional or not.\n\n Returns:\n List of strings with the names of the columns that the step needs as input.\n \"\"\"\n return []\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"List of strings with the names of the columns that the step will produce as\n output or dictionary in which the keys are the output columns of the step and the\n values are booleans indicating whether the column is optional or not.\n\n Returns:\n List of strings with the names of the columns that the step will produce as\n output.\n \"\"\"\n return []\n\n @cached_property\n def process_parameters(self) -> List[inspect.Parameter]:\n \"\"\"Returns the parameters of the `process` method of the step.\n\n Returns:\n The parameters of the `process` method of the step.\n \"\"\"\n return list(inspect.signature(self.process).parameters.values()) # type: ignore\n\n def has_multiple_inputs(self) -> bool:\n \"\"\"Whether the `process` method of the step receives more than one input or not\n i.e. 
has a `*` argument annotated with `StepInput`.\n\n Returns:\n `True` if the `process` method of the step receives more than one input,\n `False` otherwise.\n \"\"\"\n return any(\n param.kind == param.VAR_POSITIONAL for param in self.process_parameters\n )\n\n def get_process_step_input(self) -> Union[inspect.Parameter, None]:\n \"\"\"Returns the parameter of the `process` method of the step annotated with\n `StepInput`.\n\n Returns:\n The parameter of the `process` method of the step annotated with `StepInput`,\n or `None` if there is no parameter annotated with `StepInput`.\n\n Raises:\n TypeError: If the step has more than one parameter annotated with `StepInput`.\n \"\"\"\n step_input_parameter = None\n for parameter in self.process_parameters:\n if is_parameter_annotated_with(parameter, _STEP_INPUT_ANNOTATION):\n if step_input_parameter is not None:\n raise DistilabelTypeError(\n f\"Step '{self.name}' should have only one parameter with type\"\n \" hint `StepInput`.\",\n page=\"sections/how_to_guides/basic/step/#defining-custom-steps\",\n )\n step_input_parameter = parameter\n return step_input_parameter\n\n def verify_inputs_mappings(self) -> None:\n \"\"\"Verifies that the `inputs_mappings` of the step are valid i.e. the input\n columns exist in the inputs of the step.\n\n Raises:\n ValueError: If the `inputs_mappings` of the step are not valid.\n \"\"\"\n if not self.input_mappings:\n return\n\n for input in self.input_mappings:\n if input not in self.inputs:\n raise DistilabelUserError(\n f\"The input column '{input}' doesn't exist in the inputs of the\"\n f\" step '{self.name}'. Inputs of the step are: {self.inputs}.\"\n \" Please, review the `inputs_mappings` argument of the step.\",\n page=\"sections/how_to_guides/basic/step/#arguments\",\n )\n\n def verify_outputs_mappings(self) -> None:\n \"\"\"Verifies that the `outputs_mappings` of the step are valid i.e. the output\n columns exist in the outputs of the step.\n\n Raises:\n ValueError: If the `outputs_mappings` of the step are not valid.\n \"\"\"\n if not self.output_mappings:\n return\n\n for output in self.output_mappings:\n if output not in self.outputs:\n raise DistilabelUserError(\n f\"The output column '{output}' doesn't exist in the outputs of the\"\n f\" step '{self.name}'. Outputs of the step are: {self.outputs}.\"\n \" Please, review the `outputs_mappings` argument of the step.\",\n page=\"sections/how_to_guides/basic/step/#arguments\",\n )\n\n def get_inputs(self) -> Dict[str, bool]:\n \"\"\"Gets the inputs of the step after the `input_mappings`. This method is meant\n to be used to run validations on the inputs of the step.\n\n Returns:\n The inputs of the step after the `input_mappings` and if they are required or\n not.\n \"\"\"\n if isinstance(self.inputs, list):\n return {\n self.input_mappings.get(input, input): True for input in self.inputs\n }\n\n return {\n self.input_mappings.get(input, input): required\n for input, required in self.inputs.items()\n }\n\n def get_outputs(self) -> Dict[str, bool]:\n \"\"\"Gets the outputs of the step after the `outputs_mappings`. 
This method is\n meant to be used to run validations on the outputs of the step.\n\n Returns:\n The outputs of the step after the `outputs_mappings` and if they are required\n or not.\n \"\"\"\n if isinstance(self.outputs, list):\n return {\n self.output_mappings.get(output, output): True\n for output in self.outputs\n }\n\n return {\n self.output_mappings.get(output, output): required\n for output, required in self.outputs.items()\n }\n\n def set_pipeline_artifacts_path(self, path: Path) -> None:\n \"\"\"Sets the `_pipeline_artifacts_path` attribute. This method is meant to be used\n by the `Pipeline` once the cache location is known.\n\n Args:\n path: the path where the artifacts generated by the pipeline steps should be\n saved.\n \"\"\"\n self._pipeline_artifacts_path = path\n\n @property\n def artifacts_directory(self) -> Union[Path, None]:\n \"\"\"Gets the path of the directory where the step should save its generated artifacts.\n\n Returns:\n The path of the directory where the step should save the generated artifacts,\n or `None` if `_pipeline_artifacts_path` is not set.\n \"\"\"\n if self._pipeline_artifacts_path is None:\n return None\n return self._pipeline_artifacts_path / self.name # type: ignore\n\n def save_artifact(\n self,\n name: str,\n write_function: Callable[[Path], None],\n metadata: Optional[Dict[str, Any]] = None,\n ) -> None:\n \"\"\"Saves an artifact generated by the `Step`.\n\n Args:\n name: the name of the artifact.\n write_function: a function that will receive the path where the artifact should\n be saved.\n metadata: the artifact metadata. Defaults to `None`.\n \"\"\"\n if self.artifacts_directory is None:\n self._logger.warning(\n f\"Cannot save artifact with '{name}' as `_pipeline_artifacts_path` is not\"\n \" set. This is normal if the `Step` is being executed as a standalone component.\"\n )\n return\n\n artifact_directory_path = self.artifacts_directory / name\n artifact_directory_path.mkdir(parents=True, exist_ok=True)\n\n self._logger.info(f\"\ud83c\udffa Storing '{name}' generated artifact...\")\n\n self._logger.debug(\n f\"Calling `write_function` to write artifact in '{artifact_directory_path}'...\"\n )\n write_function(artifact_directory_path)\n\n metadata_path = artifact_directory_path / \"metadata.json\"\n self._logger.debug(\n f\"Calling `write_json` to write artifact metadata in '{metadata_path}'...\"\n )\n write_json(filename=metadata_path, data=metadata or {})\n\n def impute_step_outputs(\n self, step_output: List[Dict[str, Any]]\n ) -> List[Dict[str, Any]]:\n \"\"\"\n Imputes the output columns of the step that are not present in the step output.\n \"\"\"\n result = []\n for row in step_output:\n data = row.copy()\n for output in self.get_outputs().keys():\n data[output] = None\n result.append(data)\n return result\n\n def _model_dump(self, obj: Any, **kwargs: Any) -> Dict[str, Any]:\n dump = super()._model_dump(obj, **kwargs)\n dump[\"runtime_parameters_info\"] = self.get_runtime_parameters_info()\n return dump\n
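Putting this contract together, a minimal custom step could look like the following sketch. The column names and logic are illustrative only, `"StepOutput"` refers to the output typing alias used throughout this reference, and import locations may vary slightly across distilabel versions.

```python
from typing import List

from distilabel.steps import Step, StepInput


class AddPromptColumn(Step):
    """Toy step that copies the `instruction` column into a new `prompt` column."""

    @property
    def inputs(self) -> List[str]:
        return ["instruction"]

    @property
    def outputs(self) -> List[str]:
        return ["prompt"]

    def process(self, inputs: StepInput) -> "StepOutput":
        for row in inputs:
            row["prompt"] = row["instruction"]
        yield inputs
```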
"},{"location":"api/step/#distilabel.steps.base._Step.is_generator","title":"is_generator
property
","text":"Whether the step is a generator step or not.
Returns:
Type Descriptionbool
True
if the step is a generator step, False
otherwise.
is_global
property
","text":"Whether the step is a global step or not.
Returns:
Type Descriptionbool
True
if the step is a global step, False
otherwise.
is_normal
property
","text":"Whether the step is a normal step or not.
Returns:
Type Descriptionbool
True
if the step is a normal step, False
otherwise.
inputs
property
","text":"List of strings with the names of the mandatory columns that the step needs as input or dictionary in which the keys are the input columns of the step and the values are booleans indicating whether the column is optional or not.
Returns:
Type DescriptionStepColumns
List of strings with the names of the columns that the step needs as input.
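As a sketch, a step that always needs an `instruction` column but treats `system_prompt` as optional could declare its inputs as a dictionary, assuming (as in `get_inputs` further down) that `True` marks a required column:

```python
from typing import Dict

from distilabel.steps import Step, StepInput


class OptionalSystemPrompt(Step):
    """Sketch: `instruction` is required, `system_prompt` is optional."""

    @property
    def inputs(self) -> Dict[str, bool]:
        # True marks a required column, False an optional one.
        return {"instruction": True, "system_prompt": False}

    def process(self, inputs: StepInput):  # -> StepOutput
        yield inputs
```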
"},{"location":"api/step/#distilabel.steps.base._Step.outputs","title":"outputs
property
","text":"List of strings with the names of the columns that the step will produce as output or dictionary in which the keys are the output columns of the step and the values are booleans indicating whether the column is optional or not.
Returns:
Type DescriptionStepColumns
List of strings with the names of the columns that the step will produce as
StepColumns
output.
"},{"location":"api/step/#distilabel.steps.base._Step.process_parameters","title":"process_parameters
cached
property
","text":"Returns the parameters of the process
method of the step.
Returns:
Type DescriptionList[Parameter]
The parameters of the process
method of the step.
artifacts_directory
property
","text":"Gets the path of the directory where the step should save its generated artifacts.
Returns:
Type DescriptionUnion[Path, None]
The path of the directory where the step should save the generated artifacts, or None
if _pipeline_artifacts_path
is not set.
connect(*steps, routing_batch_function=None)
","text":"Connects the current step to another step in the pipeline, which means that the output of this step will be the input of the other step.
Parameters:
Name Type Description Defaultsteps
_Step
The steps to connect to the current step.
()
routing_batch_function
Optional[RoutingBatchFunction]
A function that receives a list of steps and returns a list of steps to which the output batch generated by this step should be routed. It should be used to define the routing logic of the pipeline. If not provided, the output batch will be routed to all the connected steps. Defaults to None
.
None
Source code in src/distilabel/steps/base.py
def connect(\n self,\n *steps: \"_Step\",\n routing_batch_function: Optional[\"RoutingBatchFunction\"] = None,\n) -> None:\n \"\"\"Connects the current step to another step in the pipeline, which means that\n the output of this step will be the input of the other step.\n\n Args:\n steps: The steps to connect to the current step.\n routing_batch_function: A function that receives a list of steps and returns\n a list of steps to which the output batch generated by this step should be\n routed. It should be used to define the routing logic of the pipeline. If\n not provided, the output batch will be routed to all the connected steps.\n Defaults to `None`.\n \"\"\"\n assert self.pipeline is not None\n\n if routing_batch_function:\n self._set_routing_batch_function(routing_batch_function)\n\n for step in steps:\n self.pipeline._add_edge(from_step=self.name, to_step=step.name) # type: ignore\n
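A short usage sketch with placeholder step objects; the same wiring could also be expressed with the `>>` operator documented below.

```python
# `loader`, `generation` and `scoring` are placeholder steps created inside a
# `Pipeline` context; batches from `loader` are routed to both downstream steps.
loader.connect(generation, scoring)
```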
"},{"location":"api/step/#distilabel.steps.base._Step.__rshift__","title":"__rshift__(other)
","text":"__rshift__(other: RoutingBatchFunction) -> RoutingBatchFunction\n
__rshift__(other: List[DownstreamConnectableSteps]) -> List[DownstreamConnectableSteps]\n
__rshift__(other: DownstreamConnectable) -> DownstreamConnectable\n
Allows using the >>
operator to connect steps in the pipeline.
Parameters:
Name Type Description Defaultother
Union[DownstreamConnectable, RoutingBatchFunction, List[DownstreamConnectableSteps]]
The step to connect, a list of steps to connect to or a routing batch function to be set for the step.
requiredReturns:
Type DescriptionUnion[DownstreamConnectable, RoutingBatchFunction, List[DownstreamConnectableSteps]]
The connected step, the list of connected steps or the routing batch function.
Examplestep1 >> step2\n# Would be equivalent to:\nstep1.connect(step2)\n\n# It also allows to connect a list of steps\nstep1 >> [step2, step3]\n
Source code in src/distilabel/steps/base.py
def __rshift__(\n self,\n other: Union[\n \"DownstreamConnectable\",\n \"RoutingBatchFunction\",\n List[\"DownstreamConnectableSteps\"],\n ],\n) -> Union[\n \"DownstreamConnectable\",\n \"RoutingBatchFunction\",\n List[\"DownstreamConnectableSteps\"],\n]:\n \"\"\"Allows using the `>>` operator to connect steps in the pipeline.\n\n Args:\n other: The step to connect, a list of steps to connect to or a routing batch\n function to be set for the step.\n\n Returns:\n The connected step, the list of connected steps or the routing batch function.\n\n Example:\n ```python\n step1 >> step2\n # Would be equivalent to:\n step1.connect(step2)\n\n # It also allows to connect a list of steps\n step1 >> [step2, step3]\n ```\n \"\"\"\n # Here to avoid circular imports\n from distilabel.pipeline.routing_batch_function import RoutingBatchFunction\n\n if isinstance(other, list):\n self.connect(*other)\n return other\n\n if isinstance(other, RoutingBatchFunction):\n self._set_routing_batch_function(other)\n return other\n\n self.connect(other)\n return other\n
"},{"location":"api/step/#distilabel.steps.base._Step.__rrshift__","title":"__rrshift__(other)
","text":"Allows using the [step1, step2] >> step3 operator to connect a list of steps in the pipeline to a single step, as the list doesn't have the rshift operator.
Parameters:
Name Type Description Defaultother
List[UpstreamConnectableSteps]
The step to connect to.
requiredReturns:
Type DescriptionSelf
The connected step
Example[step2, step3] >> step1\n# Would be equivalent to:\nstep2.connect(step1)\nstep3.connect(step1)\n
Source code in src/distilabel/steps/base.py
def __rrshift__(self, other: List[\"UpstreamConnectableSteps\"]) -> Self:\n \"\"\"Allows using the [step1, step2] >> step3 operator to connect a list of steps in the pipeline\n to a single step, as the list doesn't have the __rshift__ operator.\n\n Args:\n other: The step to connect to.\n\n Returns:\n The connected step\n\n Example:\n ```python\n [step2, step3] >> step1\n # Would be equivalent to:\n step2.connect(step1)\n step3.connect(step1)\n ```\n \"\"\"\n for o in other:\n o.connect(self)\n return self\n
"},{"location":"api/step/#distilabel.steps.base._Step.load","title":"load()
","text":"Method to perform any initialization logic before the process
method is called. For example, to load an LLM, establish a connection to a database, etc.
src/distilabel/steps/base.py
def load(self) -> None:\n \"\"\"Method to perform any initialization logic before the `process` method is\n called. For example, to load an LLM, stablish a connection to a database, etc.\n \"\"\"\n self._logger = logging.getLogger(f\"distilabel.step.{self.name}\")\n
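For example, a custom step could override `load` and `unload` to manage an expensive resource. The sketch below uses a hypothetical `open_connection` helper and a pydantic private attribute, which is how distilabel steps usually hold non-serializable state.

```python
from typing import Any, Optional

from pydantic import PrivateAttr

from distilabel.steps import Step, StepInput


class DatabaseLookupStep(Step):
    """Sketch of a step that opens a connection in `load` and closes it in `unload`."""

    _client: Optional[Any] = PrivateAttr(default=None)

    def load(self) -> None:
        super().load()  # keeps the logger initialization shown above
        self._client = open_connection()  # hypothetical helper

    def unload(self) -> None:
        if self._client is not None:
            self._client.close()
        super().unload()

    def process(self, inputs: StepInput):  # -> StepOutput
        yield inputs
```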
"},{"location":"api/step/#distilabel.steps.base._Step.unload","title":"unload()
","text":"Method to perform any cleanup logic after the process
method is called. For example, to close a connection to a database, etc.
src/distilabel/steps/base.py
def unload(self) -> None:\n \"\"\"Method to perform any cleanup logic after the `process` method is called. For\n example, to close a connection to a database, etc.\n \"\"\"\n self._logger.debug(\"Executing step unload logic.\")\n
"},{"location":"api/step/#distilabel.steps.base._Step.has_multiple_inputs","title":"has_multiple_inputs()
","text":"Whether the process
method of the step receives more than one input or not i.e. has a *
argument annotated with StepInput
.
Returns:
Type Descriptionbool
True
if the process
method of the step receives more than one input,
bool
False
otherwise.
src/distilabel/steps/base.py
def has_multiple_inputs(self) -> bool:\n \"\"\"Whether the `process` method of the step receives more than one input or not\n i.e. has a `*` argument annotated with `StepInput`.\n\n Returns:\n `True` if the `process` method of the step receives more than one input,\n `False` otherwise.\n \"\"\"\n return any(\n param.kind == param.VAR_POSITIONAL for param in self.process_parameters\n )\n
"},{"location":"api/step/#distilabel.steps.base._Step.get_process_step_input","title":"get_process_step_input()
","text":"Returns the parameter of the process
method of the step annotated with StepInput
.
Returns:
Type DescriptionUnion[Parameter, None]
The parameter of the process
method of the step annotated with StepInput
,
Union[Parameter, None]
or None
if there is no parameter annotated with StepInput
.
Raises:
Type DescriptionTypeError
If the step has more than one parameter annotated with StepInput
.
src/distilabel/steps/base.py
def get_process_step_input(self) -> Union[inspect.Parameter, None]:\n \"\"\"Returns the parameter of the `process` method of the step annotated with\n `StepInput`.\n\n Returns:\n The parameter of the `process` method of the step annotated with `StepInput`,\n or `None` if there is no parameter annotated with `StepInput`.\n\n Raises:\n TypeError: If the step has more than one parameter annotated with `StepInput`.\n \"\"\"\n step_input_parameter = None\n for parameter in self.process_parameters:\n if is_parameter_annotated_with(parameter, _STEP_INPUT_ANNOTATION):\n if step_input_parameter is not None:\n raise DistilabelTypeError(\n f\"Step '{self.name}' should have only one parameter with type\"\n \" hint `StepInput`.\",\n page=\"sections/how_to_guides/basic/step/#defining-custom-steps\",\n )\n step_input_parameter = parameter\n return step_input_parameter\n
"},{"location":"api/step/#distilabel.steps.base._Step.verify_inputs_mappings","title":"verify_inputs_mappings()
","text":"Verifies that the inputs_mappings
of the step are valid i.e. the input columns exist in the inputs of the step.
Raises:
Type DescriptionValueError
If the inputs_mappings
of the step are not valid.
src/distilabel/steps/base.py
def verify_inputs_mappings(self) -> None:\n \"\"\"Verifies that the `inputs_mappings` of the step are valid i.e. the input\n columns exist in the inputs of the step.\n\n Raises:\n ValueError: If the `inputs_mappings` of the step are not valid.\n \"\"\"\n if not self.input_mappings:\n return\n\n for input in self.input_mappings:\n if input not in self.inputs:\n raise DistilabelUserError(\n f\"The input column '{input}' doesn't exist in the inputs of the\"\n f\" step '{self.name}'. Inputs of the step are: {self.inputs}.\"\n \" Please, review the `inputs_mappings` argument of the step.\",\n page=\"sections/how_to_guides/basic/step/#arguments\",\n )\n
"},{"location":"api/step/#distilabel.steps.base._Step.verify_outputs_mappings","title":"verify_outputs_mappings()
","text":"Verifies that the outputs_mappings
of the step are valid i.e. the output columns exist in the outputs of the step.
Raises:
Type DescriptionValueError
If the outputs_mappings
of the step are not valid.
src/distilabel/steps/base.py
def verify_outputs_mappings(self) -> None:\n \"\"\"Verifies that the `outputs_mappings` of the step are valid i.e. the output\n columns exist in the outputs of the step.\n\n Raises:\n ValueError: If the `outputs_mappings` of the step are not valid.\n \"\"\"\n if not self.output_mappings:\n return\n\n for output in self.output_mappings:\n if output not in self.outputs:\n raise DistilabelUserError(\n f\"The output column '{output}' doesn't exist in the outputs of the\"\n f\" step '{self.name}'. Outputs of the step are: {self.outputs}.\"\n \" Please, review the `outputs_mappings` argument of the step.\",\n page=\"sections/how_to_guides/basic/step/#arguments\",\n )\n
"},{"location":"api/step/#distilabel.steps.base._Step.get_inputs","title":"get_inputs()
","text":"Gets the inputs of the step after the input_mappings
. This method is meant to be used to run validations on the inputs of the step.
Returns:
Type DescriptionDict[str, bool]
The inputs of the step after the input_mappings
and if they are required or
Dict[str, bool]
not.
Source code insrc/distilabel/steps/base.py
def get_inputs(self) -> Dict[str, bool]:\n \"\"\"Gets the inputs of the step after the `input_mappings`. This method is meant\n to be used to run validations on the inputs of the step.\n\n Returns:\n The inputs of the step after the `input_mappings` and if they are required or\n not.\n \"\"\"\n if isinstance(self.inputs, list):\n return {\n self.input_mappings.get(input, input): True for input in self.inputs\n }\n\n return {\n self.input_mappings.get(input, input): required\n for input, required in self.inputs.items()\n }\n
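As a concrete illustration derived from the code above, consider a hypothetical step whose `inputs` are `["instruction"]` and which was created with `input_mappings={"instruction": "question"}`; validations are then run against the mapped column name:

```python
# Sketch with hypothetical values:
#   step.inputs         == ["instruction"]
#   step.input_mappings == {"instruction": "question"}
assert step.get_inputs() == {"question": True}  # mapped name, marked as required
```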
"},{"location":"api/step/#distilabel.steps.base._Step.get_outputs","title":"get_outputs()
","text":"Gets the outputs of the step after the outputs_mappings
. This method is meant to be used to run validations on the outputs of the step.
Returns:
Type DescriptionDict[str, bool]
The outputs of the step after the outputs_mappings
and if they are required
Dict[str, bool]
or not.
Source code insrc/distilabel/steps/base.py
def get_outputs(self) -> Dict[str, bool]:\n \"\"\"Gets the outputs of the step after the `outputs_mappings`. This method is\n meant to be used to run validations on the outputs of the step.\n\n Returns:\n The outputs of the step after the `outputs_mappings` and if they are required\n or not.\n \"\"\"\n if isinstance(self.outputs, list):\n return {\n self.output_mappings.get(output, output): True\n for output in self.outputs\n }\n\n return {\n self.output_mappings.get(output, output): required\n for output, required in self.outputs.items()\n }\n
"},{"location":"api/step/#distilabel.steps.base._Step.set_pipeline_artifacts_path","title":"set_pipeline_artifacts_path(path)
","text":"Sets the _pipeline_artifacts_path
attribute. This method is meant to be used by the Pipeline
once the cache location is known.
Parameters:
Name Type Description Defaultpath
Path
the path where the artifacts generated by the pipeline steps should be saved.
required Source code insrc/distilabel/steps/base.py
def set_pipeline_artifacts_path(self, path: Path) -> None:\n \"\"\"Sets the `_pipeline_artifacts_path` attribute. This method is meant to be used\n by the `Pipeline` once the cache location is known.\n\n Args:\n path: the path where the artifacts generated by the pipeline steps should be\n saved.\n \"\"\"\n self._pipeline_artifacts_path = path\n
"},{"location":"api/step/#distilabel.steps.base._Step.save_artifact","title":"save_artifact(name, write_function, metadata=None)
","text":"Saves an artifact generated by the Step
.
Parameters:
Name Type Description Defaultname
str
the name of the artifact.
requiredwrite_function
Callable[[Path], None]
a function that will receive the path where the artifact should be saved.
requiredmetadata
Optional[Dict[str, Any]]
the artifact metadata. Defaults to None
.
None
Source code in src/distilabel/steps/base.py
def save_artifact(\n self,\n name: str,\n write_function: Callable[[Path], None],\n metadata: Optional[Dict[str, Any]] = None,\n) -> None:\n \"\"\"Saves an artifact generated by the `Step`.\n\n Args:\n name: the name of the artifact.\n write_function: a function that will receive the path where the artifact should\n be saved.\n metadata: the artifact metadata. Defaults to `None`.\n \"\"\"\n if self.artifacts_directory is None:\n self._logger.warning(\n f\"Cannot save artifact with '{name}' as `_pipeline_artifacts_path` is not\"\n \" set. This is normal if the `Step` is being executed as a standalone component.\"\n )\n return\n\n artifact_directory_path = self.artifacts_directory / name\n artifact_directory_path.mkdir(parents=True, exist_ok=True)\n\n self._logger.info(f\"\ud83c\udffa Storing '{name}' generated artifact...\")\n\n self._logger.debug(\n f\"Calling `write_function` to write artifact in '{artifact_directory_path}'...\"\n )\n write_function(artifact_directory_path)\n\n metadata_path = artifact_directory_path / \"metadata.json\"\n self._logger.debug(\n f\"Calling `write_json` to write artifact metadata in '{metadata_path}'...\"\n )\n write_json(filename=metadata_path, data=metadata or {})\n
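A usage sketch from within a custom step; the artifact name, metadata and summary content are illustrative.

```python
from pathlib import Path

from distilabel.steps import Step, StepInput


class SummaryStep(Step):
    """Toy step that stores a small artifact alongside its outputs."""

    def process(self, inputs: StepInput):  # -> StepOutput
        def write_summary(directory: Path) -> None:
            # `directory` is the artifact folder created by `save_artifact`.
            (directory / "summary.txt").write_text(f"rows processed: {len(inputs)}")

        self.save_artifact(
            name="summary",
            write_function=write_summary,
            metadata={"num_rows": len(inputs)},
        )
        yield inputs
```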
"},{"location":"api/step/#distilabel.steps.base._Step.impute_step_outputs","title":"impute_step_outputs(step_output)
","text":"Imputes the output columns of the step that are not present in the step output.
Source code insrc/distilabel/steps/base.py
def impute_step_outputs(\n self, step_output: List[Dict[str, Any]]\n) -> List[Dict[str, Any]]:\n \"\"\"\n Imputes the output columns of the step that are not present in the step output.\n \"\"\"\n result = []\n for row in step_output:\n data = row.copy()\n for output in self.get_outputs().keys():\n data[output] = None\n result.append(data)\n return result\n
"},{"location":"api/step/#distilabel.steps.base.Step","title":"Step
","text":" Bases: _Step
, ABC
Base class for the steps that can be included in a Pipeline
.
Attributes:
Name Type Descriptioninput_batch_size
RuntimeParameter[PositiveInt]
The number of rows that will contain the batches processed by the step. Defaults to 50
.
input_batch_size
: The number of rows that will contain the batches processed by the step. Defaults to 50
.src/distilabel/steps/base.py
class Step(_Step, ABC):\n \"\"\"Base class for the steps that can be included in a `Pipeline`.\n\n Attributes:\n input_batch_size: The number of rows that will contain the batches processed by\n the step. Defaults to `50`.\n\n Runtime parameters:\n - `input_batch_size`: The number of rows that will contain the batches processed\n by the step. Defaults to `50`.\n \"\"\"\n\n input_batch_size: RuntimeParameter[PositiveInt] = Field(\n default=DEFAULT_INPUT_BATCH_SIZE,\n description=\"The number of rows that will contain the batches processed by the\"\n \" step.\",\n )\n\n @abstractmethod\n def process(self, *inputs: StepInput) -> \"StepOutput\":\n \"\"\"Method that defines the processing logic of the step. It should yield the\n output rows.\n\n Args:\n *inputs: An argument used to receive the outputs of the previous steps. The\n number of arguments depends on the number of previous steps. It doesn't\n need to be an `*args` argument, it can be a regular argument annotated\n with `StepInput` if the step has only one previous step.\n \"\"\"\n pass\n\n def process_applying_mappings(self, *args: List[Dict[str, Any]]) -> \"StepOutput\":\n \"\"\"Runs the `process` method of the step applying the `input_mappings` to the input\n rows and the `outputs_mappings` to the output rows. This is the function that\n should be used to run the processing logic of the step.\n\n Yields:\n The output rows.\n \"\"\"\n\n inputs, overriden_inputs = (\n self._apply_input_mappings(args)\n if self.input_mappings\n else (args, [{} for _ in range(len(args[0]))])\n )\n\n # If the `Step` was built using the `@step` decorator, then we need to pass\n # the runtime parameters as kwargs, so they can be used within the processing\n # function\n generator = (\n self.process(*inputs)\n if not self._built_from_decorator\n else self.process(*inputs, **self._runtime_parameters)\n )\n\n for output_rows in generator:\n restored = []\n for i, row in enumerate(output_rows):\n # Correct the index here because we don't know the num_generations from the llm\n # ahead of time. 
For example, if we have `len(overriden_inputs)==5` and `len(row)==10`,\n # from `num_generations==2` and `group_generations=False` in the LLM:\n # The loop will use indices 0, 1, 2, 3, 4, 0, 1, 2, 3, 4\n ntimes_i = i % len(overriden_inputs)\n restored.append(\n self._apply_mappings_and_restore_overriden(\n row, overriden_inputs[ntimes_i]\n )\n )\n yield restored\n\n def _apply_input_mappings(\n self, inputs: Tuple[List[Dict[str, Any]], ...]\n ) -> Tuple[Tuple[List[Dict[str, Any]], ...], List[Dict[str, Any]]]:\n \"\"\"Applies the `input_mappings` to the input rows.\n\n Args:\n inputs: The input rows.\n\n Returns:\n The input rows with the `input_mappings` applied and the overriden values\n that were replaced by the `input_mappings`.\n \"\"\"\n reverted_input_mappings = {v: k for k, v in self.input_mappings.items()}\n\n renamed_inputs = []\n overriden_inputs = []\n for i, row_inputs in enumerate(inputs):\n renamed_row_inputs = []\n for row in row_inputs:\n overriden_keys = {}\n renamed_row = {}\n for k, v in row.items():\n renamed_key = reverted_input_mappings.get(k, k)\n\n if renamed_key not in renamed_row or k != renamed_key:\n renamed_row[renamed_key] = v\n\n if k != renamed_key and renamed_key in row and len(inputs) == 1:\n overriden_keys[renamed_key] = row[renamed_key]\n\n if i == 0:\n overriden_inputs.append(overriden_keys)\n renamed_row_inputs.append(renamed_row)\n renamed_inputs.append(renamed_row_inputs)\n return tuple(renamed_inputs), overriden_inputs\n\n def _apply_mappings_and_restore_overriden(\n self, row: Dict[str, Any], overriden: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"Reverts the `input_mappings` applied to the input rows and applies the `output_mappings`\n to the output rows. In addition, it restores the overriden values that were replaced\n by the `input_mappings`.\n\n Args:\n row: The output row.\n overriden: The overriden values that were replaced by the `input_mappings`.\n\n Returns:\n The output row with the `output_mappings` applied and the overriden values\n restored.\n \"\"\"\n result = {}\n for k, v in row.items():\n mapped_key = (\n self.output_mappings.get(k, None)\n or self.input_mappings.get(k, None)\n or k\n )\n result[mapped_key] = v\n\n # Restore overriden values\n for k, v in overriden.items():\n if k not in result:\n result[k] = v\n\n return result\n
"},{"location":"api/step/#distilabel.steps.base.Step.process","title":"process(*inputs)
abstractmethod
","text":"Method that defines the processing logic of the step. It should yield the output rows.
Parameters:
Name Type Description Default*inputs
StepInput
An argument used to receive the outputs of the previous steps. The number of arguments depends on the number of previous steps. It doesn't need to be an *args
argument, it can be a regular argument annotated with StepInput
if the step has only one previous step.
()
Source code in src/distilabel/steps/base.py
@abstractmethod\ndef process(self, *inputs: StepInput) -> \"StepOutput\":\n \"\"\"Method that defines the processing logic of the step. It should yield the\n output rows.\n\n Args:\n *inputs: An argument used to receive the outputs of the previous steps. The\n number of arguments depends on the number of previous steps. It doesn't\n need to be an `*args` argument, it can be a regular argument annotated\n with `StepInput` if the step has only one previous step.\n \"\"\"\n pass\n
"},{"location":"api/step/#distilabel.steps.base.Step.process_applying_mappings","title":"process_applying_mappings(*args)
","text":"Runs the process
method of the step applying the input_mappings
to the input rows and the outputs_mappings
to the output rows. This is the function that should be used to run the processing logic of the step.
Yields:
Type DescriptionStepOutput
The output rows.
Source code insrc/distilabel/steps/base.py
def process_applying_mappings(self, *args: List[Dict[str, Any]]) -> \"StepOutput\":\n \"\"\"Runs the `process` method of the step applying the `input_mappings` to the input\n rows and the `outputs_mappings` to the output rows. This is the function that\n should be used to run the processing logic of the step.\n\n Yields:\n The output rows.\n \"\"\"\n\n inputs, overriden_inputs = (\n self._apply_input_mappings(args)\n if self.input_mappings\n else (args, [{} for _ in range(len(args[0]))])\n )\n\n # If the `Step` was built using the `@step` decorator, then we need to pass\n # the runtime parameters as kwargs, so they can be used within the processing\n # function\n generator = (\n self.process(*inputs)\n if not self._built_from_decorator\n else self.process(*inputs, **self._runtime_parameters)\n )\n\n for output_rows in generator:\n restored = []\n for i, row in enumerate(output_rows):\n # Correct the index here because we don't know the num_generations from the llm\n # ahead of time. For example, if we have `len(overriden_inputs)==5` and `len(row)==10`,\n # from `num_generations==2` and `group_generations=False` in the LLM:\n # The loop will use indices 0, 1, 2, 3, 4, 0, 1, 2, 3, 4\n ntimes_i = i % len(overriden_inputs)\n restored.append(\n self._apply_mappings_and_restore_overriden(\n row, overriden_inputs[ntimes_i]\n )\n )\n yield restored\n
"},{"location":"api/step/decorator/","title":"@step","text":"This section contains the reference for the @step
decorator, used to create new Step
subclasses without having to manually define the class.
For more information check the Tutorial - Step page.
"},{"location":"api/step/decorator/#distilabel.steps.decorator","title":"decorator
","text":""},{"location":"api/step/decorator/#distilabel.steps.decorator.step","title":"step(inputs=None, outputs=None, step_type='normal')
","text":"step(inputs: Union[StepColumns, None] = None, outputs: Union[StepColumns, None] = None, step_type: Literal['normal'] = 'normal') -> Callable[..., Type[Step]]\n
step(inputs: Union[StepColumns, None] = None, outputs: Union[StepColumns, None] = None, step_type: Literal['global'] = 'global') -> Callable[..., Type[GlobalStep]]\n
step(inputs: None = None, outputs: Union[StepColumns, None] = None, step_type: Literal['generator'] = 'generator') -> Callable[..., Type[GeneratorStep]]\n
Creates a Step
from a processing function.
Parameters:
Name Type Description Defaultinputs
Union[StepColumns, None]
a list containing the names of the input columns/keys required by the step, or a dictionary where the keys are the columns and the values are booleans indicating whether the column is required or not. If not provided, the default will be an empty list []
and it will be assumed that the step doesn't need any specific columns. Defaults to None
.
None
outputs
Union[StepColumns, None]
a list containing the names of the output columns/keys, or a dictionary where the keys are the columns and the values are booleans indicating whether the column will be generated or not. If not provided, the default will be an empty list []
and it will be assumed that the step doesn't generate any specific columns. Defaults to None
.
None
step_type
Literal['normal', 'global', 'generator']
the kind of step to create. Valid choices are: \"normal\" (Step
), \"global\" (GlobalStep
) or \"generator\" (GeneratorStep
). Defaults to \"normal\"
.
'normal'
Returns:
Type DescriptionCallable[..., Type[_Step]]
A callable that will generate the type given the processing function.
Example:
# Normal step\n@step(inputs=[\"instruction\"], outputs=[\"generation\"])\ndef GenerationStep(inputs: StepInput, dummy_generation: RuntimeParameter[str]) -> StepOutput:\n for input in inputs:\n input[\"generation\"] = dummy_generation\n yield inputs\n\n# Global step\n@step(inputs=[\"instruction\"], step_type=\"global\")\ndef FilteringStep(inputs: StepInput, max_length: RuntimeParameter[int] = 256) -> StepOutput:\n yield [\n input\n for input in inputs\n if len(input[\"instruction\"]) <= max_length\n ]\n\n# Generator step\n@step(outputs=[\"num\"], step_type=\"generator\")\ndef RowGenerator(num_rows: RuntimeParameter[int] = 500) -> GeneratorStepOutput:\n data = list(range(num_rows))\n for i in range(0, len(data), 100):\n last_batch = i + 100 >= len(data)\n yield [{\"num\": num} for num in data[i : i + 100]], last_batch\n
Source code in src/distilabel/steps/decorator.py
def step(\n inputs: Union[\"StepColumns\", None] = None,\n outputs: Union[\"StepColumns\", None] = None,\n step_type: Literal[\"normal\", \"global\", \"generator\"] = \"normal\",\n) -> Callable[..., Type[\"_Step\"]]:\n \"\"\"Creates an `Step` from a processing function.\n\n Args:\n inputs: a list containing the name of the inputs columns/keys or a dictionary\n where the keys are the columns and the values are booleans indicating whether\n the column is required or not, that are required by the step. If not provided\n the default will be an empty list `[]` and it will be assumed that the step\n doesn't need any specific columns. Defaults to `None`.\n outputs: a list containing the name of the outputs columns/keys or a dictionary\n where the keys are the columns and the values are booleans indicating whether\n the column will be generated or not. If not provided the default will be an\n empty list `[]` and it will be assumed that the step doesn't need any specific\n columns. Defaults to `None`.\n step_type: the kind of step to create. Valid choices are: \"normal\" (`Step`),\n \"global\" (`GlobalStep`) or \"generator\" (`GeneratorStep`). Defaults to\n `\"normal\"`.\n\n Returns:\n A callable that will generate the type given the processing function.\n\n Example:\n\n ```python\n # Normal step\n @step(inputs=[\"instruction\"], outputs=[\"generation\"])\n def GenerationStep(inputs: StepInput, dummy_generation: RuntimeParameter[str]) -> StepOutput:\n for input in inputs:\n input[\"generation\"] = dummy_generation\n yield inputs\n\n # Global step\n @step(inputs=[\"instruction\"], step_type=\"global\")\n def FilteringStep(inputs: StepInput, max_length: RuntimeParameter[int] = 256) -> StepOutput:\n yield [\n input\n for input in inputs\n if len(input[\"instruction\"]) <= max_length\n ]\n\n # Generator step\n @step(outputs=[\"num\"], step_type=\"generator\")\n def RowGenerator(num_rows: RuntimeParameter[int] = 500) -> GeneratorStepOutput:\n data = list(range(num_rows))\n for i in range(0, len(data), 100):\n last_batch = i + 100 >= len(data)\n yield [{\"num\": num} for num in data[i : i + 100]], last_batch\n ```\n \"\"\"\n\n inputs = inputs or []\n outputs = outputs or []\n\n def decorator(func: ProcessingFunc) -> Type[\"_Step\"]:\n if step_type not in _STEP_MAPPING:\n raise ValueError(\n f\"Invalid step type '{step_type}'. Please, review the '{func.__name__}'\"\n \" function decorated with the `@step` decorator and provide a valid\"\n \" `step_type`. Valid choices are: 'normal', 'global' or 'generator'.\"\n )\n\n BaseClass = _STEP_MAPPING[step_type]\n\n signature = inspect.signature(func)\n\n runtime_parameters = {\n name: (\n param.annotation,\n param.default if param.default != param.empty else None,\n )\n for name, param in signature.parameters.items()\n }\n\n runtime_parameters = {}\n step_input_parameter = None\n for name, param in signature.parameters.items():\n if is_parameter_annotated_with(param, _RUNTIME_PARAMETER_ANNOTATION):\n runtime_parameters[name] = (\n param.annotation,\n param.default if param.default != param.empty else None,\n )\n\n if not step_type == \"generator\" and is_parameter_annotated_with(\n param, _STEP_INPUT_ANNOTATION\n ):\n if step_input_parameter is not None:\n raise ValueError(\n f\"Function '{func.__name__}' has more than one parameter annotated\"\n f\" with `StepInput`. 
Please, review the '{func.__name__}' function\"\n \" decorated with the `@step` decorator and provide only one\"\n \" argument annotated with `StepInput`.\"\n )\n step_input_parameter = param\n\n RuntimeParametersModel = create_model( # type: ignore\n \"RuntimeParametersModel\",\n **runtime_parameters, # type: ignore\n )\n\n def inputs_property(self) -> \"StepColumns\":\n return inputs\n\n def outputs_property(self) -> \"StepColumns\":\n return outputs\n\n def process(\n self, *args: Any, **kwargs: Any\n ) -> Union[\"StepOutput\", \"GeneratorStepOutput\"]:\n return func(*args, **kwargs)\n\n return type( # type: ignore\n func.__name__,\n (\n BaseClass,\n RuntimeParametersModel,\n ),\n {\n \"process\": process,\n \"inputs\": property(inputs_property),\n \"outputs\": property(outputs_property),\n \"__module__\": func.__module__,\n \"__doc__\": func.__doc__,\n \"_built_from_decorator\": True,\n # Override the `get_process_step_input` method to return the parameter\n # of the original function annotated with `StepInput`.\n \"get_process_step_input\": lambda self: step_input_parameter,\n },\n )\n\n return decorator\n
"},{"location":"api/step/generator_step/","title":"GeneratorStep","text":"This section contains the API reference for the GeneratorStep
class.
For more information and examples on how to use existing generator steps or create custom ones, please refer to Tutorial - Step - GeneratorStep.
"},{"location":"api/step/generator_step/#distilabel.steps.base.GeneratorStep","title":"GeneratorStep
","text":" Bases: _Step
, ABC
A special kind of Step
that is able to generate data, i.e. it doesn't receive any input from the previous steps.
Attributes:
Name Type Descriptionbatch_size
RuntimeParameter[int]
The number of rows that each batch generated by the step will contain. Defaults to 50
.
batch_size
: The number of rows that each batch generated by the step will contain. Defaults to 50
.src/distilabel/steps/base.py
class GeneratorStep(_Step, ABC):\n \"\"\"A special kind of `Step` that is able to generate data i.e. it doesn't receive\n any input from the previous steps.\n\n Attributes:\n batch_size: The number of rows that will contain the batches generated by the\n step. Defaults to `50`.\n\n Runtime parameters:\n - `batch_size`: The number of rows that will contain the batches generated by\n the step. Defaults to `50`.\n \"\"\"\n\n batch_size: RuntimeParameter[int] = Field(\n default=50,\n description=\"The number of rows that will contain the batches generated by the\"\n \" step.\",\n )\n\n @abstractmethod\n def process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n \"\"\"Method that defines the generation logic of the step. It should yield the\n output rows and a boolean indicating if it's the last batch or not.\n\n Args:\n offset: The offset to start the generation from. Defaults to 0.\n\n Yields:\n The output rows and a boolean indicating if it's the last batch or not.\n \"\"\"\n pass\n\n def process_applying_mappings(self, offset: int = 0) -> \"GeneratorStepOutput\":\n \"\"\"Runs the `process` method of the step applying the `outputs_mappings` to the\n output rows. This is the function that should be used to run the generation logic\n of the step.\n\n Args:\n offset: The offset to start the generation from. Defaults to 0.\n\n Yields:\n The output rows and a boolean indicating if it's the last batch or not.\n \"\"\"\n\n # If the `Step` was built using the `@step` decorator, then we need to pass\n # the runtime parameters as `kwargs`, so they can be used within the processing\n # function\n generator = (\n self.process(offset=offset)\n if not self._built_from_decorator\n else self.process(offset=offset, **self._runtime_parameters)\n )\n\n for output_rows, last_batch in generator:\n yield (\n [\n {self.output_mappings.get(k, k): v for k, v in row.items()}\n for row in output_rows\n ],\n last_batch,\n )\n
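For illustration only, a minimal sketch of a custom GeneratorStep subclass could look as follows (the IntegerGenerator name, the num_rows attribute and the num column are hypothetical, not part of distilabel):
from typing import TYPE_CHECKING\n\nfrom distilabel.steps import GeneratorStep\n\nif TYPE_CHECKING:\n    from distilabel.steps.typing import GeneratorStepOutput, StepColumns\n\n\nclass IntegerGenerator(GeneratorStep):\n    \"\"\"Illustrative generator step that yields rows with consecutive integers.\"\"\"\n\n    num_rows: int = 100\n\n    @property\n    def outputs(self) -> \"StepColumns\":\n        return [\"num\"]\n\n    def process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n        data = [{\"num\": i} for i in range(offset, self.num_rows)]\n        # Yield batches of `batch_size` rows together with a flag marking the last batch\n        for i in range(0, len(data), self.batch_size):\n            yield data[i : i + self.batch_size], i + self.batch_size >= len(data)\n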
"},{"location":"api/step/generator_step/#distilabel.steps.base.GeneratorStep.process","title":"process(offset=0)
abstractmethod
","text":"Method that defines the generation logic of the step. It should yield the output rows and a boolean indicating if it's the last batch or not.
Parameters:
Name Type Description Defaultoffset
int
The offset to start the generation from. Defaults to 0.
0
Yields:
Type DescriptionGeneratorStepOutput
The output rows and a boolean indicating if it's the last batch or not.
Source code insrc/distilabel/steps/base.py
@abstractmethod\ndef process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n \"\"\"Method that defines the generation logic of the step. It should yield the\n output rows and a boolean indicating if it's the last batch or not.\n\n Args:\n offset: The offset to start the generation from. Defaults to 0.\n\n Yields:\n The output rows and a boolean indicating if it's the last batch or not.\n \"\"\"\n pass\n
"},{"location":"api/step/generator_step/#distilabel.steps.base.GeneratorStep.process_applying_mappings","title":"process_applying_mappings(offset=0)
","text":"Runs the process
method of the step applying the outputs_mappings
to the output rows. This is the function that should be used to run the generation logic of the step.
Parameters:
Name Type Description Defaultoffset
int
The offset to start the generation from. Defaults to 0.
0
Yields:
Type DescriptionGeneratorStepOutput
The output rows and a boolean indicating if it's the last batch or not.
Source code insrc/distilabel/steps/base.py
def process_applying_mappings(self, offset: int = 0) -> \"GeneratorStepOutput\":\n \"\"\"Runs the `process` method of the step applying the `outputs_mappings` to the\n output rows. This is the function that should be used to run the generation logic\n of the step.\n\n Args:\n offset: The offset to start the generation from. Defaults to 0.\n\n Yields:\n The output rows and a boolean indicating if it's the last batch or not.\n \"\"\"\n\n # If the `Step` was built using the `@step` decorator, then we need to pass\n # the runtime parameters as `kwargs`, so they can be used within the processing\n # function\n generator = (\n self.process(offset=offset)\n if not self._built_from_decorator\n else self.process(offset=offset, **self._runtime_parameters)\n )\n\n for output_rows, last_batch in generator:\n yield (\n [\n {self.output_mappings.get(k, k): v for k, v in row.items()}\n for row in output_rows\n ],\n last_batch,\n )\n
"},{"location":"api/step/generator_step/#distilabel.steps.generators.utils.make_generator_step","title":"make_generator_step(dataset, pipeline=None, batch_size=50, input_mappings=None, output_mappings=None, resources=StepResources(), repo_id='default_name')
","text":"Helper method to create a GeneratorStep
from a dataset, to simplify loading data into a Pipeline.
Parameters:
Name Type Description Defaultdataset
Union[Dataset, DataFrame, List[Dict[str, str]]]
The dataset to use in the Pipeline
.
batch_size
int
The batch_size, which will default to the same value used by the GeneratorStep
s. Defaults to 50
.
50
input_mappings
Optional[Dict[str, str]]
Applies the same as any other step. Defaults to None
.
None
output_mappings
Optional[Dict[str, str]]
Applies the same as any other step. Defaults to None
.
None
resources
StepResources
Applies the same as any other step. Defaults to StepResources()
.
StepResources()
repo_id
Optional[str]
The repository ID to use in the LoadDataFromHub
step. This shouldn't be necessary, but in case of error, the dataset will try to be loaded using load_dataset
internally. If that case happens, the repo_id
will be used.
'default_name'
Raises:
Type DescriptionValueError
If the format is different from the ones supported.
Returns:
Type DescriptionGeneratorStep
A LoadDataFromDicts
if the input is a list of dicts, or LoadDataFromHub
instance
GeneratorStep
if the input is a pd.DataFrame
or a Dataset
.
src/distilabel/steps/generators/utils.py
def make_generator_step(\n dataset: Union[Dataset, pd.DataFrame, List[Dict[str, str]]],\n pipeline: Union[\"BasePipeline\", None] = None,\n batch_size: int = 50,\n input_mappings: Optional[Dict[str, str]] = None,\n output_mappings: Optional[Dict[str, str]] = None,\n resources: StepResources = StepResources(),\n repo_id: Optional[str] = \"default_name\",\n) -> \"GeneratorStep\":\n \"\"\"Helper method to create a `GeneratorStep` from a dataset, to simplify\n\n Args:\n dataset: The dataset to use in the `Pipeline`.\n batch_size: The batch_size, will default to the same used by the `GeneratorStep`s.\n Defaults to `50`.\n input_mappings: Applies the same as any other step. Defaults to `None`.\n output_mappings: Applies the same as any other step. Defaults to `None`.\n resources: Applies the same as any other step. Defaults to `StepResources()`.\n repo_id: The repository ID to use in the `LoadDataFromHub` step.\n This shouldn't be necessary, but in case of error, the dataset will try to be loaded\n using `load_dataset` internally. If that case happens, the `repo_id` will be used.\n\n Raises:\n ValueError: If the format is different from the ones supported.\n\n Returns:\n A `LoadDataFromDicts` if the input is a list of dicts, or `LoadDataFromHub` instance\n if the input is a `pd.DataFrame` or a `Dataset`.\n \"\"\"\n from distilabel.steps import LoadDataFromDicts, LoadDataFromHub\n\n if isinstance(dataset, list):\n return LoadDataFromDicts(\n pipeline=pipeline,\n data=dataset,\n batch_size=batch_size,\n input_mappings=input_mappings or {},\n output_mappings=output_mappings or {},\n resources=resources,\n )\n\n if isinstance(dataset, pd.DataFrame):\n dataset = Dataset.from_pandas(dataset, preserve_index=False)\n\n if not isinstance(dataset, Dataset):\n raise DistilabelUserError(\n f\"Dataset type not allowed: {type(dataset)}, must be one of: \"\n \"`datasets.Dataset`, `pd.DataFrame`, `List[Dict[str, str]]`\",\n page=\"sections/how_to_guides/basic/pipeline/?h=make_#__tabbed_1_2\",\n )\n\n loader = LoadDataFromHub(\n pipeline=pipeline,\n repo_id=repo_id,\n batch_size=batch_size,\n input_mappings=input_mappings or {},\n output_mappings=output_mappings or {},\n resources=resources,\n )\n super(loader.__class__, loader).load() # Ensure the logger is loaded\n loader._dataset = dataset\n loader.num_examples = len(dataset)\n loader._dataset_info = {\"default\": dataset.info}\n return loader\n
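As a usage sketch (the data is illustrative): passing a list of dicts returns a LoadDataFromDicts instance, while a pd.DataFrame or a datasets.Dataset would be wrapped in a LoadDataFromHub instance instead.
from distilabel.steps.generators.utils import make_generator_step\n\n# A list of dicts is wrapped into a `LoadDataFromDicts` generator step\nloader = make_generator_step(\n    [{\"instruction\": \"Tell me a joke.\"}, {\"instruction\": \"Write a haiku.\"}],\n    batch_size=2,\n)\n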
"},{"location":"api/step/global_step/","title":"GlobalStep","text":"This section contains the API reference for the GlobalStep
class.
For more information and examples on how to use existing global steps or create custom ones, please refer to Tutorial - Step - GlobalStep.
"},{"location":"api/step/global_step/#distilabel.steps.base.GlobalStep","title":"GlobalStep
","text":" Bases: Step
, ABC
A special kind of Step
whose process
method receives all the data processed by its previous steps at once, instead of receiving it in batches. This kind of step is useful when the processing logic requires having all the data at once, for example to train a model, to perform a global aggregation, etc.
src/distilabel/steps/base.py
class GlobalStep(Step, ABC):\n \"\"\"A special kind of `Step` which it's `process` method receives all the data processed\n by their previous steps at once, instead of receiving it in batches. This kind of steps\n are useful when the processing logic requires to have all the data at once, for example\n to train a model, to perform a global aggregation, etc.\n \"\"\"\n\n @property\n def inputs(self) -> \"StepColumns\":\n return []\n\n @property\n def outputs(self) -> \"StepColumns\":\n return []\n
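As an illustrative sketch (the class and column names are hypothetical), a GlobalStep that deduplicates the whole dataset could be defined as follows:
from typing import TYPE_CHECKING, List\n\nfrom distilabel.steps import GlobalStep, StepInput\n\nif TYPE_CHECKING:\n    from distilabel.steps.typing import StepOutput\n\n\nclass DeduplicateInstructions(GlobalStep):\n    \"\"\"Keeps only the first occurrence of each `instruction` across the whole dataset.\"\"\"\n\n    @property\n    def inputs(self) -> List[str]:\n        return [\"instruction\"]\n\n    def process(self, inputs: StepInput) -> \"StepOutput\":\n        seen = set()\n        unique_rows = []\n        for row in inputs:\n            if row[\"instruction\"] not in seen:\n                seen.add(row[\"instruction\"])\n                unique_rows.append(row)\n        yield unique_rows\n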
"},{"location":"api/step/resources/","title":"StepResources","text":""},{"location":"api/step/resources/#distilabel.steps.base.StepResources","title":"StepResources
","text":" Bases: RuntimeParametersMixin
, BaseModel
A class to define the resources assigned to a _Step
.
Attributes:
Name Type Descriptionreplicas
RuntimeParameter[PositiveInt]
The number of replicas for the step.
cpus
Optional[RuntimeParameter[PositiveInt]]
The number of CPUs assigned to each step replica.
gpus
Optional[RuntimeParameter[PositiveInt]]
The number of GPUs assigned to each step replica.
memory
Optional[RuntimeParameter[PositiveInt]]
The memory in bytes required for each step replica.
resources
Optional[RuntimeParameter[Dict[str, int]]]
A dictionary containing the number of custom resources required for each step replica.
Source code insrc/distilabel/steps/base.py
class StepResources(RuntimeParametersMixin, BaseModel):\n \"\"\"A class to define the resources assigned to a `_Step`.\n\n Attributes:\n replicas: The number of replicas for the step.\n cpus: The number of CPUs assigned to each step replica.\n gpus: The number of GPUs assigned to each step replica.\n memory: The memory in bytes required for each step replica.\n resources: A dictionary containing the number of custom resources required for\n each step replica.\n \"\"\"\n\n replicas: RuntimeParameter[PositiveInt] = Field(\n default=1, description=\"The number of replicas for the step.\"\n )\n cpus: Optional[RuntimeParameter[PositiveInt]] = Field(\n default=None, description=\"The number of CPUs assigned to each step replica.\"\n )\n gpus: Optional[RuntimeParameter[PositiveInt]] = Field(\n default=None, description=\"The number of GPUs assigned to each step replica.\"\n )\n memory: Optional[RuntimeParameter[PositiveInt]] = Field(\n default=None, description=\"The memory in bytes required for each step replica.\"\n )\n resources: Optional[RuntimeParameter[Dict[str, int]]] = Field(\n default=None,\n description=\"A dictionary containing names of custom resources and the\"\n \" number of those resources required for each step replica.\",\n )\n
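For instance, resources can be assigned when instantiating a step; the values below are purely illustrative:
from distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.base import StepResources\n\nloader = LoadDataFromDicts(\n    data=[{\"instruction\": \"Hello\"}],\n    # Two replicas of this step, each one assigned a single CPU\n    resources=StepResources(replicas=2, cpus=1),\n)\n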
"},{"location":"api/step/typing/","title":"Step Typing","text":""},{"location":"api/step/typing/#distilabel.steps.typing","title":"typing
","text":""},{"location":"api/step/typing/#distilabel.steps.typing.StepOutput","title":"StepOutput = Iterator[List[Dict[str, Any]]]
module-attribute
","text":"StepOutput
is an alias of the typing Iterator[List[Dict[str, Any]]]
GeneratorStepOutput = Iterator[Tuple[List[Dict[str, Any]], bool]]
module-attribute
","text":"GeneratorStepOutput
is an alias of the typing Iterator[Tuple[List[Dict[str, Any]], bool]]
StepColumns = Union[List[str], Dict[str, bool]]
module-attribute
","text":"StepColumns
is an alias of the typing Union[List[str], Dict[str, bool]]
used by the inputs
and outputs
properties of a Step
. In the case of a List[str]
, it is a list with the required columns. In the case of a Dict[str, bool]
, it is a dictionary where the keys are the columns and the values are booleans indicating whether the column is required or not.
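A brief sketch of how these aliases are typically used to annotate a custom Step (the UppercaseInstruction step is hypothetical):
from distilabel.steps import Step, StepInput\nfrom distilabel.steps.typing import StepColumns, StepOutput\n\n\nclass UppercaseInstruction(Step):\n    \"\"\"Illustrative step that upper-cases the `instruction` column.\"\"\"\n\n    @property\n    def inputs(self) -> StepColumns:\n        return [\"instruction\"]\n\n    @property\n    def outputs(self) -> StepColumns:\n        return [\"instruction\"]\n\n    def process(self, inputs: StepInput) -> StepOutput:\n        for row in inputs:\n            row[\"instruction\"] = row[\"instruction\"].upper()\n        yield inputs\n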
This section contains the existing steps integrated with Argilla
so as to easily push the generated datasets to Argilla.
base
","text":""},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.base.ArgillaBase","title":"ArgillaBase
","text":" Bases: Step
, ABC
Abstract step that provides a class to subclass from, which contains the boilerplate code required to interact with Argilla, as well as some extra validations on top of it. It also defines the abstract methods that need to be implemented in order to add a new dataset type as a step.
NoteThis class is not intended to be instantiated directly, but via a subclass.
Attributes:
Name Type Descriptiondataset_name
RuntimeParameter[str]
The name of the dataset in Argilla where the records will be added.
dataset_workspace
Optional[RuntimeParameter[str]]
The workspace where the dataset will be created in Argilla. Defaults to None
, which means it will be created in the default workspace.
api_url
Optional[RuntimeParameter[str]]
The URL of the Argilla API. Defaults to None
, which means it will be read from the ARGILLA_API_URL
environment variable.
api_key
Optional[RuntimeParameter[SecretStr]]
The API key to authenticate with Argilla. Defaults to None
, which means it will be read from the ARGILLA_API_KEY
environment variable.
dataset_name
: The name of the dataset in Argilla where the records will be added.dataset_workspace
: The workspace where the dataset will be created in Argilla. Defaults to None
, which means it will be created in the default workspace.api_url
: The base URL to use for the Argilla API requests.api_key
: The API key to authenticate the requests to the Argilla API.inputs
value providedsrc/distilabel/steps/argilla/base.py
class ArgillaBase(Step, ABC):\n \"\"\"Abstract step that provides a class to subclass from, that contains the boilerplate code\n required to interact with Argilla, as well as some extra validations on top of it. It also defines\n the abstract methods that need to be implemented in order to add a new dataset type as a step.\n\n Note:\n This class is not intended to be instanced directly, but via subclass.\n\n Attributes:\n dataset_name: The name of the dataset in Argilla where the records will be added.\n dataset_workspace: The workspace where the dataset will be created in Argilla. Defaults to\n `None`, which means it will be created in the default workspace.\n api_url: The URL of the Argilla API. Defaults to `None`, which means it will be read from\n the `ARGILLA_API_URL` environment variable.\n api_key: The API key to authenticate with Argilla. Defaults to `None`, which means it will\n be read from the `ARGILLA_API_KEY` environment variable.\n\n Runtime parameters:\n - `dataset_name`: The name of the dataset in Argilla where the records will be\n added.\n - `dataset_workspace`: The workspace where the dataset will be created in Argilla.\n Defaults to `None`, which means it will be created in the default workspace.\n - `api_url`: The base URL to use for the Argilla API requests.\n - `api_key`: The API key to authenticate the requests to the Argilla API.\n\n Input columns:\n - dynamic, based on the `inputs` value provided\n \"\"\"\n\n dataset_name: RuntimeParameter[str] = Field(\n default=None, description=\"The name of the dataset in Argilla.\"\n )\n dataset_workspace: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The workspace where the dataset will be created in Argilla. Defaults \"\n \"to `None` which means it will be created in the default workspace.\",\n )\n\n api_url: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(_ARGILLA_API_URL_ENV_VAR_NAME),\n description=\"The base URL to use for the Argilla API requests.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_ARGILLA_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the Argilla API.\",\n )\n\n _client: Optional[\"Argilla\"] = PrivateAttr(...)\n _dataset: Optional[\"Dataset\"] = PrivateAttr(...)\n\n def model_post_init(self, __context: Any) -> None:\n \"\"\"Checks that the Argilla Python SDK is installed, and then filters the Argilla warnings.\"\"\"\n super().model_post_init(__context)\n\n if importlib.util.find_spec(\"argilla\") is None:\n raise ImportError(\n \"Argilla is not installed. 
Please install it using `pip install argilla\"\n \" --upgrade`.\"\n )\n\n def _client_init(self) -> None:\n \"\"\"Initializes the Argilla API client with the provided `api_url` and `api_key`.\"\"\"\n try:\n self._client = rg.Argilla( # type: ignore\n api_url=self.api_url,\n api_key=self.api_key.get_secret_value(), # type: ignore\n headers={\"Authorization\": f\"Bearer {os.environ['HF_TOKEN']}\"}\n if isinstance(self.api_url, str)\n and \"hf.space\" in self.api_url\n and \"HF_TOKEN\" in os.environ\n else {},\n )\n except Exception as e:\n raise DistilabelUserError(\n f\"Failed to initialize the Argilla API: {e}\",\n page=\"sections/how_to_guides/advanced/argilla/\",\n ) from e\n\n @property\n def _dataset_exists_in_workspace(self) -> bool:\n \"\"\"Checks if the dataset already exists in Argilla in the provided workspace if any.\n\n Returns:\n `True` if the dataset exists, `False` otherwise.\n \"\"\"\n return (\n self._client.datasets( # type: ignore\n name=self.dataset_name, # type: ignore\n workspace=self.dataset_workspace,\n )\n is not None\n )\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The outputs of the step is an empty list, since the steps subclassing from this one, will\n always be leaf nodes and won't propagate the inputs neither generate any outputs.\n \"\"\"\n return []\n\n def load(self) -> None:\n \"\"\"Method to perform any initialization logic before the `process` method is\n called. For example, to load an LLM, stablish a connection to a database, etc.\n \"\"\"\n super().load()\n\n if self.api_url is None or self.api_key is None:\n raise DistilabelUserError(\n \"`Argilla` step requires the `api_url` and `api_key` to be provided. Please,\"\n \" provide those at step instantiation, via environment variables `ARGILLA_API_URL`\"\n \" and `ARGILLA_API_KEY`, or as `Step` runtime parameters via `pipeline.run(parameters={...})`.\",\n page=\"sections/how_to_guides/advanced/argilla/\",\n )\n\n self._client_init()\n\n @property\n @abstractmethod\n def inputs(self) -> \"StepColumns\": ...\n\n @abstractmethod\n def process(self, *inputs: StepInput) -> \"StepOutput\": ...\n
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.base.ArgillaBase.outputs","title":"outputs
property
","text":"The outputs of the step is an empty list, since the steps subclassing from this one, will always be leaf nodes and won't propagate the inputs neither generate any outputs.
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.base.ArgillaBase.model_post_init","title":"model_post_init(__context)
","text":"Checks that the Argilla Python SDK is installed, and then filters the Argilla warnings.
Source code insrc/distilabel/steps/argilla/base.py
def model_post_init(self, __context: Any) -> None:\n \"\"\"Checks that the Argilla Python SDK is installed, and then filters the Argilla warnings.\"\"\"\n super().model_post_init(__context)\n\n if importlib.util.find_spec(\"argilla\") is None:\n raise ImportError(\n \"Argilla is not installed. Please install it using `pip install argilla\"\n \" --upgrade`.\"\n )\n
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.base.ArgillaBase.load","title":"load()
","text":"Method to perform any initialization logic before the process
method is called. For example, to load an LLM, establish a connection to a database, etc.
src/distilabel/steps/argilla/base.py
def load(self) -> None:\n \"\"\"Method to perform any initialization logic before the `process` method is\n called. For example, to load an LLM, stablish a connection to a database, etc.\n \"\"\"\n super().load()\n\n if self.api_url is None or self.api_key is None:\n raise DistilabelUserError(\n \"`Argilla` step requires the `api_url` and `api_key` to be provided. Please,\"\n \" provide those at step instantiation, via environment variables `ARGILLA_API_URL`\"\n \" and `ARGILLA_API_KEY`, or as `Step` runtime parameters via `pipeline.run(parameters={...})`.\",\n page=\"sections/how_to_guides/advanced/argilla/\",\n )\n\n self._client_init()\n
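To give an idea of the boilerplate a subclass must provide, here is a minimal, illustrative sketch (not an existing step; a real subclass would also create self._dataset during load()):
from typing import TYPE_CHECKING, List\n\nimport argilla as rg\n\nfrom distilabel.steps import StepInput\nfrom distilabel.steps.argilla.base import ArgillaBase\n\nif TYPE_CHECKING:\n    from distilabel.steps.typing import StepOutput\n\n\nclass SimpleTextToArgilla(ArgillaBase):\n    \"\"\"Illustrative subclass that pushes a single `text` field per record.\"\"\"\n\n    @property\n    def inputs(self) -> List[str]:\n        return [\"text\"]\n\n    def process(self, inputs: StepInput) -> \"StepOutput\":\n        # Assumes `self._dataset` was created in `load()`, e.g. via `rg.Settings` and `rg.Dataset(...).create()`\n        records = [rg.Record(fields={\"text\": row[\"text\"]}) for row in inputs]\n        self._dataset.records.log(records)\n        yield inputs\n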
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.preference","title":"preference
","text":""},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.preference.PreferenceToArgilla","title":"PreferenceToArgilla
","text":" Bases: ArgillaBase
Creates a preference dataset in Argilla.
Step that creates a dataset in Argilla during the load phase, and then pushes the input batches into it as records. This dataset is a preference dataset, where there's one field for the instruction and one extra field per each generation within the same record, and then a rating question per each of the generation fields. The rating question asks the annotator to set a rating from 1 to 5 for each of the provided generations.
NoteThis step is meant to be used in conjunction with the UltraFeedback
step, or any other step generating both ratings and responses for a given set of instruction and generations for the given instruction. But alternatively, it can also be used with any other task or step generating only the instruction
and generations
, as the ratings
and rationales
are optional.
Attributes:
Name Type Descriptionnum_generations
int
The number of generations to include in the dataset.
dataset_name
RuntimeParameter[str]
The name of the dataset in Argilla.
dataset_workspace
Optional[RuntimeParameter[str]]
The workspace where the dataset will be created in Argilla. Defaults to None
, which means it will be created in the default workspace.
api_url
Optional[RuntimeParameter[str]]
The URL of the Argilla API. Defaults to None
, which means it will be read from the ARGILLA_API_URL
environment variable.
api_key
Optional[RuntimeParameter[SecretStr]]
The API key to authenticate with Argilla. Defaults to None
, which means it will be read from the ARGILLA_API_KEY
environment variable.
api_url
: The base URL to use for the Argilla API requests.api_key
: The API key to authenticate the requests to the Argilla API.str
): The instruction that was used to generate the completion.List[str]
): The completion that was generated based on the input instruction.List[str]
, optional): The ratings for the generations. If not provided, the generated ratings won't be pushed to Argilla.List[str]
, optional): The rationales for the ratings. If not provided, the generated rationales won't be pushed to Argilla.Examples:
Push a preference dataset to an Argilla instance:
from distilabel.steps import PreferenceToArgilla\n\nto_argilla = PreferenceToArgilla(\n num_generations=2,\n api_url=\"https://dibt-demo-argilla-space.hf.space/\",\n api_key=\"api.key\",\n dataset_name=\"argilla_dataset\",\n dataset_workspace=\"my_workspace\",\n)\nto_argilla.load()\n\nresult = next(\n to_argilla.process(\n [\n {\n \"instruction\": \"instruction\",\n \"generations\": [\"first_generation\", \"second_generation\"],\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction', 'generations': ['first_generation', 'second_generation']}]\n
It can also include ratings and rationales:
result = next(\n to_argilla.process(\n [\n {\n \"instruction\": \"instruction\",\n \"generations\": [\"first_generation\", \"second_generation\"],\n \"ratings\": [\"4\", \"5\"],\n \"rationales\": [\"rationale for 4\", \"rationale for 5\"],\n }\n ],\n )\n)\n# >>> result\n# [\n# {\n# 'instruction': 'instruction',\n# 'generations': ['first_generation', 'second_generation'],\n# 'ratings': ['4', '5'],\n# 'rationales': ['rationale for 4', 'rationale for 5']\n# }\n# ]\n
Source code in src/distilabel/steps/argilla/preference.py
class PreferenceToArgilla(ArgillaBase):\n \"\"\"Creates a preference dataset in Argilla.\n\n Step that creates a dataset in Argilla during the load phase, and then pushes the input\n batches into it as records. This dataset is a preference dataset, where there's one field\n for the instruction and one extra field per each generation within the same record, and then\n a rating question per each of the generation fields. The rating question asks the annotator to\n set a rating from 1 to 5 for each of the provided generations.\n\n Note:\n This step is meant to be used in conjunction with the `UltraFeedback` step, or any other step\n generating both ratings and responses for a given set of instruction and generations for the\n given instruction. But alternatively, it can also be used with any other task or step generating\n only the `instruction` and `generations`, as the `ratings` and `rationales` are optional.\n\n Attributes:\n num_generations: The number of generations to include in the dataset.\n dataset_name: The name of the dataset in Argilla.\n dataset_workspace: The workspace where the dataset will be created in Argilla. Defaults to\n `None`, which means it will be created in the default workspace.\n api_url: The URL of the Argilla API. Defaults to `None`, which means it will be read from\n the `ARGILLA_API_URL` environment variable.\n api_key: The API key to authenticate with Argilla. Defaults to `None`, which means it will\n be read from the `ARGILLA_API_KEY` environment variable.\n\n Runtime parameters:\n - `api_url`: The base URL to use for the Argilla API requests.\n - `api_key`: The API key to authenticate the requests to the Argilla API.\n\n Input columns:\n - instruction (`str`): The instruction that was used to generate the completion.\n - generations (`List[str]`): The completion that was generated based on the input instruction.\n - ratings (`List[str]`, optional): The ratings for the generations. If not provided, the\n generated ratings won't be pushed to Argilla.\n - rationales (`List[str]`, optional): The rationales for the ratings. 
If not provided, the\n generated rationales won't be pushed to Argilla.\n\n Examples:\n Push a preference dataset to an Argilla instance:\n\n ```python\n from distilabel.steps import PreferenceToArgilla\n\n to_argilla = PreferenceToArgilla(\n num_generations=2,\n api_url=\"https://dibt-demo-argilla-space.hf.space/\",\n api_key=\"api.key\",\n dataset_name=\"argilla_dataset\",\n dataset_workspace=\"my_workspace\",\n )\n to_argilla.load()\n\n result = next(\n to_argilla.process(\n [\n {\n \"instruction\": \"instruction\",\n \"generations\": [\"first_generation\", \"second_generation\"],\n }\n ],\n )\n )\n # >>> result\n # [{'instruction': 'instruction', 'generations': ['first_generation', 'second_generation']}]\n ```\n\n It can also include ratings and rationales:\n\n ```python\n result = next(\n to_argilla.process(\n [\n {\n \"instruction\": \"instruction\",\n \"generations\": [\"first_generation\", \"second_generation\"],\n \"ratings\": [\"4\", \"5\"],\n \"rationales\": [\"rationale for 4\", \"rationale for 5\"],\n }\n ],\n )\n )\n # >>> result\n # [\n # {\n # 'instruction': 'instruction',\n # 'generations': ['first_generation', 'second_generation'],\n # 'ratings': ['4', '5'],\n # 'rationales': ['rationale for 4', 'rationale for 5']\n # }\n # ]\n ```\n \"\"\"\n\n num_generations: int\n\n _id: str = PrivateAttr(default=\"id\")\n _instruction: str = PrivateAttr(...)\n _generations: str = PrivateAttr(...)\n _ratings: str = PrivateAttr(...)\n _rationales: str = PrivateAttr(...)\n\n def load(self) -> None:\n \"\"\"Sets the `_instruction` and `_generations` attributes based on the `inputs_mapping`, otherwise\n uses the default values; and then uses those values to create a `FeedbackDataset` suited for\n the text-generation scenario. And then it pushes it to Argilla.\n \"\"\"\n super().load()\n\n # Both `instruction` and `generations` will be used as the fields of the dataset\n self._instruction = self.input_mappings.get(\"instruction\", \"instruction\")\n self._generations = self.input_mappings.get(\"generations\", \"generations\")\n # Both `ratings` and `rationales` will be used as suggestions to the default questions of the dataset\n self._ratings = self.input_mappings.get(\"ratings\", \"ratings\")\n self._rationales = self.input_mappings.get(\"rationales\", \"rationales\")\n\n if self._dataset_exists_in_workspace:\n _dataset = self._client.datasets( # type: ignore\n name=self.dataset_name, # type: ignore\n workspace=self.dataset_workspace, # type: ignore\n )\n\n for field in _dataset.fields:\n if not isinstance(field, rg.TextField):\n continue\n if (\n field.name\n not in [self._id, self._instruction] # type: ignore\n + [\n f\"{self._generations}-{idx}\"\n for idx in range(self.num_generations)\n ]\n and field.required\n ):\n raise DistilabelUserError(\n f\"The dataset '{self.dataset_name}' in the workspace '{self.dataset_workspace}'\"\n f\" already exists, but contains at least a required field that is\"\n f\" neither `{self._id}`, `{self._instruction}`, nor `{self._generations}`\"\n f\" (one per generation starting from 0 up to {self.num_generations - 1}).\",\n page=\"components-gallery/steps/preferencetoargilla/\",\n )\n\n self._dataset = _dataset\n else:\n _settings = rg.Settings( # type: ignore\n fields=[\n rg.TextField(name=self._id, title=self._id), # type: ignore\n rg.TextField(name=self._instruction, title=self._instruction), # type: ignore\n *self._generation_fields(), # type: ignore\n ],\n questions=self._rating_rationale_pairs(), # type: ignore\n )\n _dataset = rg.Dataset( # type: 
ignore\n name=self.dataset_name,\n workspace=self.dataset_workspace,\n settings=_settings,\n client=self._client,\n )\n self._dataset = _dataset.create()\n\n def _generation_fields(self) -> List[\"TextField\"]:\n \"\"\"Method to generate the fields for each of the generations.\n\n Returns:\n A list containing `TextField`s for each text generation.\n \"\"\"\n return [\n rg.TextField( # type: ignore\n name=f\"{self._generations}-{idx}\",\n title=f\"{self._generations}-{idx}\",\n required=True if idx == 0 else False,\n )\n for idx in range(self.num_generations)\n ]\n\n def _rating_rationale_pairs(\n self,\n ) -> List[Union[\"RatingQuestion\", \"TextQuestion\"]]:\n \"\"\"Method to generate the rating and rationale questions for each of the generations.\n\n Returns:\n A list of questions containing a `RatingQuestion` and `TextQuestion` pair for\n each text generation.\n \"\"\"\n questions = []\n for idx in range(self.num_generations):\n questions.extend(\n [\n rg.RatingQuestion( # type: ignore\n name=f\"{self._generations}-{idx}-rating\",\n title=f\"Rate {self._generations}-{idx} given {self._instruction}.\",\n description=f\"Ignore this question if the corresponding `{self._generations}-{idx}` field is not available.\"\n if idx != 0\n else None,\n values=[1, 2, 3, 4, 5],\n required=True if idx == 0 else False,\n ),\n rg.TextQuestion( # type: ignore\n name=f\"{self._generations}-{idx}-rationale\",\n title=f\"Specify the rationale for {self._generations}-{idx}'s rating.\",\n description=f\"Ignore this question if the corresponding `{self._generations}-{idx}` field is not available.\"\n if idx != 0\n else None,\n required=False,\n ),\n ]\n )\n return questions\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs for the step are the `instruction` and the `generations`. Optionally, one could also\n provide the `ratings` and the `rationales` for the generations.\"\"\"\n return [\"instruction\", \"generations\"]\n\n @property\n def optional_inputs(self) -> List[str]:\n \"\"\"The optional inputs for the step are the `ratings` and the `rationales` for the generations.\"\"\"\n return [\"ratings\", \"rationales\"]\n\n def _add_suggestions_if_any(self, input: Dict[str, Any]) -> List[\"Suggestion\"]:\n \"\"\"Method to generate the suggestions for the `rg.Record` based on the input.\n\n Returns:\n A list of `Suggestion`s for the rating and rationales questions.\n \"\"\"\n # Since the `suggestions` i.e. 
answers to the `questions` are optional, will default to {}\n suggestions = []\n # If `ratings` is in `input`, then add those as suggestions\n if self._ratings in input:\n suggestions.extend(\n [\n rg.Suggestion( # type: ignore\n value=rating,\n question_name=f\"{self._generations}-{idx}-rating\",\n )\n for idx, rating in enumerate(input[self._ratings])\n if rating is not None\n and isinstance(rating, int)\n and rating in [1, 2, 3, 4, 5]\n ],\n )\n # If `rationales` is in `input`, then add those as suggestions\n if self._rationales in input:\n suggestions.extend(\n [\n rg.Suggestion( # type: ignore\n value=rationale,\n question_name=f\"{self._generations}-{idx}-rationale\",\n )\n for idx, rationale in enumerate(input[self._rationales])\n if rationale is not None and isinstance(rationale, str)\n ],\n )\n return suggestions\n\n @override\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Creates and pushes the records as `rg.Record`s to the Argilla dataset.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n records = []\n for input in inputs:\n # Generate the SHA-256 hash of the instruction to use it as the metadata\n instruction_id = hashlib.sha256(\n input[\"instruction\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n generations = {\n f\"{self._generations}-{idx}\": generation\n for idx, generation in enumerate(input[\"generations\"]) # type: ignore\n }\n\n records.append( # type: ignore\n rg.Record( # type: ignore\n fields={\n \"id\": instruction_id,\n \"instruction\": input[\"instruction\"], # type: ignore\n **generations,\n },\n suggestions=self._add_suggestions_if_any(input), # type: ignore\n )\n )\n self._dataset.records.log(records) # type: ignore\n yield inputs\n
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.preference.PreferenceToArgilla.inputs","title":"inputs
property
","text":"The inputs for the step are the instruction
and the generations
. Optionally, one could also provide the ratings
and the rationales
for the generations.
optional_inputs
property
","text":"The optional inputs for the step are the ratings
and the rationales
for the generations.
load()
","text":"Sets the _instruction
and _generations
attributes based on the inputs_mapping
, otherwise uses the default values; and then uses those values to create a FeedbackDataset
suited for the text-generation scenario. And then it pushes it to Argilla.
src/distilabel/steps/argilla/preference.py
def load(self) -> None:\n \"\"\"Sets the `_instruction` and `_generations` attributes based on the `inputs_mapping`, otherwise\n uses the default values; and then uses those values to create a `FeedbackDataset` suited for\n the text-generation scenario. And then it pushes it to Argilla.\n \"\"\"\n super().load()\n\n # Both `instruction` and `generations` will be used as the fields of the dataset\n self._instruction = self.input_mappings.get(\"instruction\", \"instruction\")\n self._generations = self.input_mappings.get(\"generations\", \"generations\")\n # Both `ratings` and `rationales` will be used as suggestions to the default questions of the dataset\n self._ratings = self.input_mappings.get(\"ratings\", \"ratings\")\n self._rationales = self.input_mappings.get(\"rationales\", \"rationales\")\n\n if self._dataset_exists_in_workspace:\n _dataset = self._client.datasets( # type: ignore\n name=self.dataset_name, # type: ignore\n workspace=self.dataset_workspace, # type: ignore\n )\n\n for field in _dataset.fields:\n if not isinstance(field, rg.TextField):\n continue\n if (\n field.name\n not in [self._id, self._instruction] # type: ignore\n + [\n f\"{self._generations}-{idx}\"\n for idx in range(self.num_generations)\n ]\n and field.required\n ):\n raise DistilabelUserError(\n f\"The dataset '{self.dataset_name}' in the workspace '{self.dataset_workspace}'\"\n f\" already exists, but contains at least a required field that is\"\n f\" neither `{self._id}`, `{self._instruction}`, nor `{self._generations}`\"\n f\" (one per generation starting from 0 up to {self.num_generations - 1}).\",\n page=\"components-gallery/steps/preferencetoargilla/\",\n )\n\n self._dataset = _dataset\n else:\n _settings = rg.Settings( # type: ignore\n fields=[\n rg.TextField(name=self._id, title=self._id), # type: ignore\n rg.TextField(name=self._instruction, title=self._instruction), # type: ignore\n *self._generation_fields(), # type: ignore\n ],\n questions=self._rating_rationale_pairs(), # type: ignore\n )\n _dataset = rg.Dataset( # type: ignore\n name=self.dataset_name,\n workspace=self.dataset_workspace,\n settings=_settings,\n client=self._client,\n )\n self._dataset = _dataset.create()\n
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.preference.PreferenceToArgilla.process","title":"process(inputs)
","text":"Creates and pushes the records as rg.Record
s to the Argilla dataset.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of Python dictionaries with the inputs of the task.
requiredReturns:
Type DescriptionStepOutput
A list of Python dictionaries with the outputs of the task.
Source code insrc/distilabel/steps/argilla/preference.py
@override\ndef process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Creates and pushes the records as `rg.Record`s to the Argilla dataset.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n records = []\n for input in inputs:\n # Generate the SHA-256 hash of the instruction to use it as the metadata\n instruction_id = hashlib.sha256(\n input[\"instruction\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n generations = {\n f\"{self._generations}-{idx}\": generation\n for idx, generation in enumerate(input[\"generations\"]) # type: ignore\n }\n\n records.append( # type: ignore\n rg.Record( # type: ignore\n fields={\n \"id\": instruction_id,\n \"instruction\": input[\"instruction\"], # type: ignore\n **generations,\n },\n suggestions=self._add_suggestions_if_any(input), # type: ignore\n )\n )\n self._dataset.records.log(records) # type: ignore\n yield inputs\n
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.text_generation","title":"text_generation
","text":""},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.text_generation.TextGenerationToArgilla","title":"TextGenerationToArgilla
","text":" Bases: ArgillaBase
Creates a text generation dataset in Argilla.
Step
that creates a dataset in Argilla during the load phase, and then pushes the input batches into it as records. This dataset is a text-generation dataset, where there's one field per each input, and then a label question to rate the quality of the completion in either bad (represented with \ud83d\udc4e) or good (represented with \ud83d\udc4d).
This step is meant to be used in conjunction with a TextGeneration
step and no column mapping is needed, as it will use the default values for the instruction
and generation
columns.
Attributes:
Name Type Descriptiondataset_name
RuntimeParameter[str]
The name of the dataset in Argilla.
dataset_workspace
Optional[RuntimeParameter[str]]
The workspace where the dataset will be created in Argilla. Defaults to None
, which means it will be created in the default workspace.
api_url
Optional[RuntimeParameter[str]]
The URL of the Argilla API. Defaults to None
, which means it will be read from the ARGILLA_API_URL
environment variable.
api_key
Optional[RuntimeParameter[SecretStr]]
The API key to authenticate with Argilla. Defaults to None
, which means it will be read from the ARGILLA_API_KEY
environment variable.
api_url
: The base URL to use for the Argilla API requests.api_key
: The API key to authenticate the requests to the Argilla API.str
): The instruction that was used to generate the completion.str
or List[str]
): The completions that were generated based on the input instruction.Examples:
Push a text generation dataset to an Argilla instance:
from distilabel.steps import TextGenerationToArgilla\n\nto_argilla = TextGenerationToArgilla(\n    api_url=\"https://dibt-demo-argilla-space.hf.space/\",\n    api_key=\"api.key\",\n    dataset_name=\"argilla_dataset\",\n    dataset_workspace=\"my_workspace\",\n)\nto_argilla.load()\n\nresult = next(\n    to_argilla.process(\n        [\n            {\n                \"instruction\": \"instruction\",\n                \"generation\": \"generation\",\n            }\n        ],\n    )\n)\n# >>> result\n# [{'instruction': 'instruction', 'generation': 'generation'}]\n
Source code in src/distilabel/steps/argilla/text_generation.py
class TextGenerationToArgilla(ArgillaBase):\n \"\"\"Creates a text generation dataset in Argilla.\n\n `Step` that creates a dataset in Argilla during the load phase, and then pushes the input\n batches into it as records. This dataset is a text-generation dataset, where there's one field\n per each input, and then a label question to rate the quality of the completion in either bad\n (represented with \ud83d\udc4e) or good (represented with \ud83d\udc4d).\n\n Note:\n This step is meant to be used in conjunction with a `TextGeneration` step and no column mapping\n is needed, as it will use the default values for the `instruction` and `generation` columns.\n\n Attributes:\n dataset_name: The name of the dataset in Argilla.\n dataset_workspace: The workspace where the dataset will be created in Argilla. Defaults to\n `None`, which means it will be created in the default workspace.\n api_url: The URL of the Argilla API. Defaults to `None`, which means it will be read from\n the `ARGILLA_API_URL` environment variable.\n api_key: The API key to authenticate with Argilla. Defaults to `None`, which means it will\n be read from the `ARGILLA_API_KEY` environment variable.\n\n Runtime parameters:\n - `api_url`: The base URL to use for the Argilla API requests.\n - `api_key`: The API key to authenticate the requests to the Argilla API.\n\n Input columns:\n - instruction (`str`): The instruction that was used to generate the completion.\n - generation (`str` or `List[str]`): The completions that were generated based on the input instruction.\n\n Examples:\n Push a text generation dataset to an Argilla instance:\n\n ```python\n from distilabel.steps import PreferenceToArgilla\n\n to_argilla = TextGenerationToArgilla(\n num_generations=2,\n api_url=\"https://dibt-demo-argilla-space.hf.space/\",\n api_key=\"api.key\",\n dataset_name=\"argilla_dataset\",\n dataset_workspace=\"my_workspace\",\n )\n to_argilla.load()\n\n result = next(\n to_argilla.process(\n [\n {\n \"instruction\": \"instruction\",\n \"generation\": \"generation\",\n }\n ],\n )\n )\n # >>> result\n # [{'instruction': 'instruction', 'generation': 'generation'}]\n ```\n \"\"\"\n\n _id: str = PrivateAttr(default=\"id\")\n _instruction: str = PrivateAttr(...)\n _generation: str = PrivateAttr(...)\n\n def load(self) -> None:\n \"\"\"Sets the `_instruction` and `_generation` attributes based on the `inputs_mapping`, otherwise\n uses the default values; and then uses those values to create a `FeedbackDataset` suited for\n the text-generation scenario. 
And then it pushes it to Argilla.\n \"\"\"\n super().load()\n\n self._instruction = self.input_mappings.get(\"instruction\", \"instruction\")\n self._generation = self.input_mappings.get(\"generation\", \"generation\")\n\n if self._dataset_exists_in_workspace:\n _dataset = self._client.datasets( # type: ignore\n name=self.dataset_name, # type: ignore\n workspace=self.dataset_workspace, # type: ignore\n )\n\n for field in _dataset.fields:\n if not isinstance(field, rg.TextField): # type: ignore\n continue\n if (\n field.name not in [self._id, self._instruction, self._generation]\n and field.required\n ):\n raise DistilabelUserError(\n f\"The dataset '{self.dataset_name}' in the workspace '{self.dataset_workspace}'\"\n f\" already exists, but contains at least a required field that is\"\n f\" neither `{self._id}`, `{self._instruction}`, nor `{self._generation}`,\"\n \" so it cannot be reused for this dataset.\",\n page=\"components-gallery/steps/textgenerationtoargilla/\",\n )\n\n self._dataset = _dataset\n else:\n _settings = rg.Settings( # type: ignore\n fields=[\n rg.TextField(name=self._id, title=self._id), # type: ignore\n rg.TextField(name=self._instruction, title=self._instruction), # type: ignore\n rg.TextField(name=self._generation, title=self._generation), # type: ignore\n ],\n questions=[\n rg.LabelQuestion( # type: ignore\n name=\"quality\",\n title=f\"What's the quality of the {self._generation} for the given {self._instruction}?\",\n labels={\"bad\": \"\ud83d\udc4e\", \"good\": \"\ud83d\udc4d\"}, # type: ignore\n )\n ],\n )\n _dataset = rg.Dataset( # type: ignore\n name=self.dataset_name,\n workspace=self.dataset_workspace,\n settings=_settings,\n client=self._client,\n )\n self._dataset = _dataset.create()\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs for the step are the `instruction` and the `generation`.\"\"\"\n return [\"instruction\", \"generation\"]\n\n @override\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Creates and pushes the records as FeedbackRecords to the Argilla dataset.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n records = []\n for input in inputs:\n # Generate the SHA-256 hash of the instruction to use it as the metadata\n instruction_id = hashlib.sha256(\n input[\"instruction\"].encode(\"utf-8\")\n ).hexdigest()\n\n generations = input[\"generation\"]\n\n # If the `generation` is not a list, then convert it into a list\n if not isinstance(generations, list):\n generations = [generations]\n\n # Create a `generations_set` to avoid adding duplicates\n generations_set = set()\n\n for generation in generations:\n # If the generation is already in the set, then skip it\n if generation in generations_set:\n continue\n # Otherwise, add it to the set\n generations_set.add(generation)\n\n records.append(\n rg.Record( # type: ignore\n fields={\n self._id: instruction_id,\n self._instruction: input[\"instruction\"],\n self._generation: generation,\n },\n ),\n )\n self._dataset.records.log(records) # type: ignore\n yield inputs\n
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.text_generation.TextGenerationToArgilla.inputs","title":"inputs
property
","text":"The inputs for the step are the instruction
and the generation
.
load()
","text":"Sets the _instruction
and _generation
attributes based on the inputs_mapping
, otherwise uses the default values; and then uses those values to create a FeedbackDataset
suited for the text-generation scenario. And then it pushes it to Argilla.
src/distilabel/steps/argilla/text_generation.py
def load(self) -> None:\n \"\"\"Sets the `_instruction` and `_generation` attributes based on the `inputs_mapping`, otherwise\n uses the default values; and then uses those values to create a `FeedbackDataset` suited for\n the text-generation scenario. And then it pushes it to Argilla.\n \"\"\"\n super().load()\n\n self._instruction = self.input_mappings.get(\"instruction\", \"instruction\")\n self._generation = self.input_mappings.get(\"generation\", \"generation\")\n\n if self._dataset_exists_in_workspace:\n _dataset = self._client.datasets( # type: ignore\n name=self.dataset_name, # type: ignore\n workspace=self.dataset_workspace, # type: ignore\n )\n\n for field in _dataset.fields:\n if not isinstance(field, rg.TextField): # type: ignore\n continue\n if (\n field.name not in [self._id, self._instruction, self._generation]\n and field.required\n ):\n raise DistilabelUserError(\n f\"The dataset '{self.dataset_name}' in the workspace '{self.dataset_workspace}'\"\n f\" already exists, but contains at least a required field that is\"\n f\" neither `{self._id}`, `{self._instruction}`, nor `{self._generation}`,\"\n \" so it cannot be reused for this dataset.\",\n page=\"components-gallery/steps/textgenerationtoargilla/\",\n )\n\n self._dataset = _dataset\n else:\n _settings = rg.Settings( # type: ignore\n fields=[\n rg.TextField(name=self._id, title=self._id), # type: ignore\n rg.TextField(name=self._instruction, title=self._instruction), # type: ignore\n rg.TextField(name=self._generation, title=self._generation), # type: ignore\n ],\n questions=[\n rg.LabelQuestion( # type: ignore\n name=\"quality\",\n title=f\"What's the quality of the {self._generation} for the given {self._instruction}?\",\n labels={\"bad\": \"\ud83d\udc4e\", \"good\": \"\ud83d\udc4d\"}, # type: ignore\n )\n ],\n )\n _dataset = rg.Dataset( # type: ignore\n name=self.dataset_name,\n workspace=self.dataset_workspace,\n settings=_settings,\n client=self._client,\n )\n self._dataset = _dataset.create()\n
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.text_generation.TextGenerationToArgilla.process","title":"process(inputs)
","text":"Creates and pushes the records as FeedbackRecords to the Argilla dataset.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of Python dictionaries with the inputs of the task.
requiredReturns:
Type DescriptionStepOutput
A list of Python dictionaries with the outputs of the task.
Source code insrc/distilabel/steps/argilla/text_generation.py
@override\ndef process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Creates and pushes the records as FeedbackRecords to the Argilla dataset.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n records = []\n for input in inputs:\n # Generate the SHA-256 hash of the instruction to use it as the metadata\n instruction_id = hashlib.sha256(\n input[\"instruction\"].encode(\"utf-8\")\n ).hexdigest()\n\n generations = input[\"generation\"]\n\n # If the `generation` is not a list, then convert it into a list\n if not isinstance(generations, list):\n generations = [generations]\n\n # Create a `generations_set` to avoid adding duplicates\n generations_set = set()\n\n for generation in generations:\n # If the generation is already in the set, then skip it\n if generation in generations_set:\n continue\n # Otherwise, add it to the set\n generations_set.add(generation)\n\n records.append(\n rg.Record( # type: ignore\n fields={\n self._id: instruction_id,\n self._instruction: input[\"instruction\"],\n self._generation: generation,\n },\n ),\n )\n self._dataset.records.log(records) # type: ignore\n yield inputs\n
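Putting the pieces above together, a minimal usage sketch (not taken from the upstream docs; it assumes the step is re-exported as distilabel.steps.TextGenerationToArgilla and that the Argilla connection is already configured, e.g. through the ARGILLA_API_URL and ARGILLA_API_KEY environment variables):
from distilabel.steps import TextGenerationToArgilla\n\nto_argilla = TextGenerationToArgilla(\n dataset_name=\"text-generations\", # hypothetical dataset name\n dataset_workspace=\"argilla\", # hypothetical workspace\n)\nto_argilla.load() # creates (or reuses) the Argilla dataset\n\nresult = next(\n to_argilla.process(\n [{\"instruction\": \"What is synthetic data?\", \"generation\": \"Data generated by a model.\"}]\n )\n)\n# >>> result\n# [{'instruction': 'What is synthetic data?', 'generation': 'Data generated by a model.'}]\n# The records themselves are logged to the Argilla dataset as a side effect.\n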
"},{"location":"api/step_gallery/columns/","title":"Columns","text":"This section contains the existing steps intended to be used for common column operations to apply to the batches.
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.expand","title":"expand
","text":""},{"location":"api/step_gallery/columns/#distilabel.steps.columns.expand.ExpandColumns","title":"ExpandColumns
","text":" Bases: Step
Expand columns that contain lists into multiple rows.
ExpandColumns
is a Step
that takes a list of columns and expands them into multiple rows. The new rows will have the same data as the original row, except for the expanded column, which will contain a single item from the original list.
Attributes:
Name Type Descriptioncolumns
Union[Dict[str, str], List[str]]
A dictionary that maps the column to be expanded to the new column name or a list of columns to be expanded. If a list is provided, the new column name will be the same as the column name.
encoded
Union[bool, List[str]]
A bool indicating whether the columns are JSON-encoded lists. If set to True, the columns will be decoded before expanding. Alternatively, a list can be provided to specify which columns are encoded; in that case, the column names provided must be a subset of the columns selected for expansion.
split_statistics
bool
A bool indicating whether the statistics in the distilabel_metadata
column should be split into multiple rows. If we want to expand some columns containing a list of strings that come from having parsed the output of an LLM, the tokens in the statistics_{step_name}
of the distilabel_metadata
column should be split to avoid multiplying them if we aggregate the data afterwards. For example, with a task that is supposed to generate a list of N instructions, and we want each of those N instructions in a different row, we should split the statistics by N. In such a case, set this value to True.
columns
attribute): The columns to be expanded into multiple rows.columns
attribute): The expanded columns.Examples:
Expand the selected columns into multiple rows:
from distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(\n columns=[\"generation\"],\n)\nexpand_columns.load()\n\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": [\"generation 1\", \"generation 2\"]}\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction 1', 'generation': 'generation 1'}, {'instruction': 'instruction 1', 'generation': 'generation 2'}]\n
Expand the selected columns which are JSON encoded into multiple rows:
from distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(\n columns=[\"generation\"],\n encoded=True, # It can also be a list of columns that are encoded, i.e. [\"generation\"]\n)\nexpand_columns.load()\n\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": '[\"generation 1\", \"generation 2\"]'}\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction 1', 'generation': 'generation 1'}, {'instruction': 'instruction 1', 'generation': 'generation 2'}]\n
Expand the selected columns and split the statistics in the distilabel_metadata
column:
from distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(\n columns=[\"generation\"],\n split_statistics=True,\n)\nexpand_columns.load()\n\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": [\"generation 1\", \"generation 2\"],\n \"distilabel_metadata\": {\n \"statistics_generation\": {\n \"input_tokens\": [12],\n \"output_tokens\": [12],\n },\n },\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction 1', 'generation': 'generation 1', 'distilabel_metadata': {'statistics_generation': {'input_tokens': [6], 'output_tokens': [6]}}}, {'instruction': 'instruction 1', 'generation': 'generation 2', 'distilabel_metadata': {'statistics_generation': {'input_tokens': [6], 'output_tokens': [6]}}}]\n
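A further sketch (not in the upstream docstring) of the dictionary form of columns, which writes each expanded item into a renamed column:
from distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(\n columns={\"generation\": \"expanded_generation\"}, # expand \"generation\", store items in \"expanded_generation\"\n)\nexpand_columns.load()\n\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": [\"generation 1\", \"generation 2\"]}\n ],\n )\n)\n# Two rows are yielded; each keeps the original columns and adds a single\n# item under 'expanded_generation'.\n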
Source code in src/distilabel/steps/columns/expand.py
class ExpandColumns(Step):\n \"\"\"Expand columns that contain lists into multiple rows.\n\n `ExpandColumns` is a `Step` that takes a list of columns and expands them into multiple\n rows. The new rows will have the same data as the original row, except for the expanded\n column, which will contain a single item from the original list.\n\n Attributes:\n columns: A dictionary that maps the column to be expanded to the new column name\n or a list of columns to be expanded. If a list is provided, the new column name\n will be the same as the column name.\n encoded: A bool to inform Whether the columns are JSON encoded lists. If this value is\n set to True, the columns will be decoded before expanding. Alternatively, to specify\n columns that can be encoded, a list can be provided. In this case, the column names\n informed must be a subset of the columns selected for expansion.\n split_statistics: A bool to inform whether the statistics in the `distilabel_metadata`\n column should be split into multiple rows.\n If we want to expand some columns containing a list of strings that come from\n having parsed the output of an LLM, the tokens in the `statistics_{step_name}`\n of the `distilabel_metadata` column should be splitted to avoid multiplying\n them if we aggregate the data afterwards. For example, with a task that is supposed\n to generate a list of N instructions, and we want each of those N instructions in\n different rows, we should split the statistics by N.\n In such a case, set this value to True.\n\n Input columns:\n - dynamic (determined by `columns` attribute): The columns to be expanded into\n multiple rows.\n\n Output columns:\n - dynamic (determined by `columns` attribute): The expanded columns.\n\n Categories:\n - columns\n\n Examples:\n Expand the selected columns into multiple rows:\n\n ```python\n from distilabel.steps import ExpandColumns\n\n expand_columns = ExpandColumns(\n columns=[\"generation\"],\n )\n expand_columns.load()\n\n result = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": [\"generation 1\", \"generation 2\"]}\n ],\n )\n )\n # >>> result\n # [{'instruction': 'instruction 1', 'generation': 'generation 1'}, {'instruction': 'instruction 1', 'generation': 'generation 2'}]\n ```\n\n Expand the selected columns which are JSON encoded into multiple rows:\n\n ```python\n from distilabel.steps import ExpandColumns\n\n expand_columns = ExpandColumns(\n columns=[\"generation\"],\n encoded=True, # It can also be a list of columns that are encoded, i.e. 
[\"generation\"]\n )\n expand_columns.load()\n\n result = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": '[\"generation 1\", \"generation 2\"]'}\n ],\n )\n )\n # >>> result\n # [{'instruction': 'instruction 1', 'generation': 'generation 1'}, {'instruction': 'instruction 1', 'generation': 'generation 2'}]\n ```\n\n Expand the selected columns and split the statistics in the `distilabel_metadata` column:\n\n ```python\n from distilabel.steps import ExpandColumns\n\n expand_columns = ExpandColumns(\n columns=[\"generation\"],\n split_statistics=True,\n )\n expand_columns.load()\n\n result = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": [\"generation 1\", \"generation 2\"],\n \"distilabel_metadata\": {\n \"statistics_generation\": {\n \"input_tokens\": [12],\n \"output_tokens\": [12],\n },\n },\n }\n ],\n )\n )\n # >>> result\n # [{'instruction': 'instruction 1', 'generation': 'generation 1', 'distilabel_metadata': {'statistics_generation': {'input_tokens': [6], 'output_tokens': [6]}}}, {'instruction': 'instruction 1', 'generation': 'generation 2', 'distilabel_metadata': {'statistics_generation': {'input_tokens': [6], 'output_tokens': [6]}}}]\n ```\n \"\"\"\n\n columns: Union[Dict[str, str], List[str]]\n encoded: Union[bool, List[str]] = False\n split_statistics: bool = False\n\n @field_validator(\"columns\")\n @classmethod\n def always_dict(cls, value: Union[Dict[str, str], List[str]]) -> Dict[str, str]:\n \"\"\"Ensure that the columns are always a dictionary.\n\n Args:\n value: The columns to be expanded.\n\n Returns:\n The columns to be expanded as a dictionary.\n \"\"\"\n if isinstance(value, list):\n return {col: col for col in value}\n\n return value\n\n @model_validator(mode=\"after\")\n def is_subset(self) -> Self:\n \"\"\"Ensure the \"encoded\" column names are a subset of the \"columns\" selected.\n\n Returns:\n The \"encoded\" attribute updated to work internally.\n \"\"\"\n if isinstance(self.encoded, list):\n if not set(self.encoded).issubset(set(self.columns.keys())):\n raise ValueError(\n \"The 'encoded' columns must be a subset of the 'columns' selected for expansion.\"\n )\n if isinstance(self.encoded, bool):\n self.encoded = list(self.columns.keys()) if self.encoded else []\n return self\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The columns to be expanded.\"\"\"\n return list(self.columns.keys())\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The expanded columns.\"\"\"\n return [\n new_column if new_column else expand_column\n for expand_column, new_column in self.columns.items()\n ]\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Expand the columns in the input data.\n\n Args:\n inputs: The input data.\n\n Yields:\n The expanded rows.\n \"\"\"\n if self.encoded:\n for input in inputs:\n for column in self.encoded:\n input[column] = json.loads(input[column])\n\n yield [row for input in inputs for row in self._expand_columns(input)]\n\n def _expand_columns(self, input: Dict[str, Any]) -> List[Dict[str, Any]]:\n \"\"\"Expand the columns in the input data.\n\n Args:\n input: The input data.\n\n Returns:\n The expanded rows.\n \"\"\"\n metadata_visited = False\n expanded_rows = []\n # Update the columns here to avoid doing the validation on the `inputs`, as the\n # `distilabel_metadata` is not defined on Pipeline creation on the DAG.\n columns = self.columns\n if self.split_statistics:\n 
columns[\"distilabel_metadata\"] = \"distilabel_metadata\"\n\n for expand_column, new_column in columns.items(): # type: ignore\n data = input.get(expand_column)\n input, metadata_visited = self._split_metadata(\n input, len(data), metadata_visited\n )\n\n rows = []\n for item, expanded in zip_longest(*[data, expanded_rows], fillvalue=input):\n rows.append({**expanded, new_column: item})\n expanded_rows = rows\n return expanded_rows\n\n def _split_metadata(\n self, input: Dict[str, Any], n: int, metadata_visited: bool = False\n ) -> None:\n \"\"\"Help method to split the statistics in `distilabel_metadata` column.\n\n Args:\n input: The input data.\n n: Number of splits to apply to the tokens (if we have 12 tokens and want to split\n them 3 times, n==3).\n metadata_visited: Bool to prevent from updating the data more than once.\n\n Returns:\n Updated input with the `distilabel_metadata` updated.\n \"\"\"\n # - If we want to split the statistics, we need to ensure that the metadata is present.\n # - Metadata can only be visited once per row to avoid successive splitting.\n # TODO: For an odd number of tokens, this will miss 1, we have to fix it.\n if (\n self.split_statistics\n and (metadata := input.get(\"distilabel_metadata\", {}))\n and not metadata_visited\n ):\n for k, v in metadata.items():\n if (\n not v\n ): # In case it wasn't found in the metadata for some error, skip it\n continue\n if k.startswith(\"statistics_\") and (\n \"input_tokens\" in v and \"output_tokens\" in v\n ):\n # For num_generations>1 we assume all the tokens should be divided by n\n # TODO: The tokens should always come as a list, but there can\n # be differences\n if isinstance(v[\"input_tokens\"], list):\n input_tokens = [value // n for value in v[\"input_tokens\"]]\n else:\n input_tokens = [v[\"input_tokens\"] // n]\n if isinstance(v[\"input_tokens\"], list):\n output_tokens = [value // n for value in v[\"output_tokens\"]]\n else:\n output_tokens = [v[\"output_tokens\"] // n]\n\n input[\"distilabel_metadata\"][k] = {\n \"input_tokens\": input_tokens,\n \"output_tokens\": output_tokens,\n }\n metadata_visited = True\n # Once we have updated the metadata, Create a list out of it to let the\n # following section to expand it as any other column.\n if isinstance(input[\"distilabel_metadata\"], dict):\n input[\"distilabel_metadata\"] = [input[\"distilabel_metadata\"]] * n\n return input, metadata_visited\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.expand.ExpandColumns.inputs","title":"inputs
property
","text":"The columns to be expanded.
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.expand.ExpandColumns.outputs","title":"outputs
property
","text":"The expanded columns.
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.expand.ExpandColumns.always_dict","title":"always_dict(value)
classmethod
","text":"Ensure that the columns are always a dictionary.
Parameters:
Name Type Description Defaultvalue
Union[Dict[str, str], List[str]]
The columns to be expanded.
requiredReturns:
Type DescriptionDict[str, str]
The columns to be expanded as a dictionary.
Source code insrc/distilabel/steps/columns/expand.py
@field_validator(\"columns\")\n@classmethod\ndef always_dict(cls, value: Union[Dict[str, str], List[str]]) -> Dict[str, str]:\n \"\"\"Ensure that the columns are always a dictionary.\n\n Args:\n value: The columns to be expanded.\n\n Returns:\n The columns to be expanded as a dictionary.\n \"\"\"\n if isinstance(value, list):\n return {col: col for col in value}\n\n return value\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.expand.ExpandColumns.is_subset","title":"is_subset()
","text":"Ensure the \"encoded\" column names are a subset of the \"columns\" selected.
Returns:
Type DescriptionSelf
The \"encoded\" attribute updated to work internally.
Source code insrc/distilabel/steps/columns/expand.py
@model_validator(mode=\"after\")\ndef is_subset(self) -> Self:\n \"\"\"Ensure the \"encoded\" column names are a subset of the \"columns\" selected.\n\n Returns:\n The \"encoded\" attribute updated to work internally.\n \"\"\"\n if isinstance(self.encoded, list):\n if not set(self.encoded).issubset(set(self.columns.keys())):\n raise ValueError(\n \"The 'encoded' columns must be a subset of the 'columns' selected for expansion.\"\n )\n if isinstance(self.encoded, bool):\n self.encoded = list(self.columns.keys()) if self.encoded else []\n return self\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.expand.ExpandColumns.process","title":"process(inputs)
","text":"Expand the columns in the input data.
Parameters:
Name Type Description Defaultinputs
StepInput
The input data.
requiredYields:
Type DescriptionStepOutput
The expanded rows.
Source code insrc/distilabel/steps/columns/expand.py
def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Expand the columns in the input data.\n\n Args:\n inputs: The input data.\n\n Yields:\n The expanded rows.\n \"\"\"\n if self.encoded:\n for input in inputs:\n for column in self.encoded:\n input[column] = json.loads(input[column])\n\n yield [row for input in inputs for row in self._expand_columns(input)]\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.keep","title":"keep
","text":""},{"location":"api/step_gallery/columns/#distilabel.steps.columns.keep.KeepColumns","title":"KeepColumns
","text":" Bases: Step
Keeps selected columns in the dataset.
KeepColumns
is a Step
that implements the process
method that keeps only the columns specified in the columns
attribute. Also KeepColumns
provides an attribute columns
to specify the columns to keep which will override the default value for the properties inputs
and outputs
.
The order in which the columns are provided is important, as the output will be sorted using the provided order, which is useful before pushing either a datasets.Dataset
via the PushToHub
step or a distilabel.Distiset
via the Pipeline.run
output variable.
Attributes:
Name Type Descriptioncolumns
List[str]
List of strings with the names of the columns to keep.
Input columnscolumns
attribute): The columns to keep.columns
attribute): The columns that were kept.Examples:
Select the columns to keep:
from distilabel.steps import KeepColumns\n\nkeep_columns = KeepColumns(\n columns=[\"instruction\", \"generation\"],\n)\nkeep_columns.load()\n\nresult = next(\n keep_columns.process(\n [{\"instruction\": \"What's the brightest color?\", \"generation\": \"white\", \"model_name\": \"my_model\"}],\n )\n)\n# >>> result\n# [{'instruction': \"What's the brightest color?\", 'generation': 'white'}]\n
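As noted above about ordering, a second sketch (hypothetical values) showing that listing generation first also places it first in the output rows:
from distilabel.steps import KeepColumns\n\nkeep_columns = KeepColumns(\n columns=[\"generation\", \"instruction\"],\n)\nkeep_columns.load()\n\nresult = next(\n keep_columns.process(\n [{\"instruction\": \"What's the brightest color?\", \"generation\": \"white\", \"model_name\": \"my_model\"}],\n )\n)\n# >>> result\n# [{'generation': 'white', 'instruction': \"What's the brightest color?\"}]\n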
Source code in src/distilabel/steps/columns/keep.py
class KeepColumns(Step):\n \"\"\"Keeps selected columns in the dataset.\n\n `KeepColumns` is a `Step` that implements the `process` method that keeps only the columns\n specified in the `columns` attribute. Also `KeepColumns` provides an attribute `columns` to\n specify the columns to keep which will override the default value for the properties `inputs`\n and `outputs`.\n\n Note:\n The order in which the columns are provided is important, as the output will be sorted\n using the provided order, which is useful before pushing either a `dataset.Dataset` via\n the `PushToHub` step or a `distilabel.Distiset` via the `Pipeline.run` output variable.\n\n Attributes:\n columns: List of strings with the names of the columns to keep.\n\n Input columns:\n - dynamic (determined by `columns` attribute): The columns to keep.\n\n Output columns:\n - dynamic (determined by `columns` attribute): The columns that were kept.\n\n Categories:\n - columns\n\n Examples:\n Select the columns to keep:\n\n ```python\n from distilabel.steps import KeepColumns\n\n keep_columns = KeepColumns(\n columns=[\"instruction\", \"generation\"],\n )\n keep_columns.load()\n\n result = next(\n keep_columns.process(\n [{\"instruction\": \"What's the brightest color?\", \"generation\": \"white\", \"model_name\": \"my_model\"}],\n )\n )\n # >>> result\n # [{'instruction': \"What's the brightest color?\", 'generation': 'white'}]\n ```\n \"\"\"\n\n columns: List[str]\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The inputs for the task are the column names in `columns`.\"\"\"\n return self.columns\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The outputs for the task are the column names in `columns`.\"\"\"\n return self.columns\n\n @override\n def process(self, *inputs: StepInput) -> \"StepOutput\":\n \"\"\"The `process` method keeps only the columns specified in the `columns` attribute.\n\n Args:\n *inputs: A list of dictionaries with the input data.\n\n Yields:\n A list of dictionaries with the output data.\n \"\"\"\n for input in inputs:\n outputs = []\n for item in input:\n outputs.append({col: item[col] for col in self.columns})\n yield outputs\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.keep.KeepColumns.inputs","title":"inputs
property
","text":"The inputs for the task are the column names in columns
.
outputs
property
","text":"The outputs for the task are the column names in columns
.
process(*inputs)
","text":"The process
method keeps only the columns specified in the columns
attribute.
Parameters:
Name Type Description Default*inputs
StepInput
A list of dictionaries with the input data.
()
Yields:
Type DescriptionStepOutput
A list of dictionaries with the output data.
Source code insrc/distilabel/steps/columns/keep.py
@override\ndef process(self, *inputs: StepInput) -> \"StepOutput\":\n \"\"\"The `process` method keeps only the columns specified in the `columns` attribute.\n\n Args:\n *inputs: A list of dictionaries with the input data.\n\n Yields:\n A list of dictionaries with the output data.\n \"\"\"\n for input in inputs:\n outputs = []\n for item in input:\n outputs.append({col: item[col] for col in self.columns})\n yield outputs\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.merge","title":"merge
","text":""},{"location":"api/step_gallery/columns/#distilabel.steps.columns.merge.MergeColumns","title":"MergeColumns
","text":" Bases: Step
Merge columns from a row.
MergeColumns
is a Step
that implements the process
method that calls the merge_columns
function to handle and combine columns in a StepInput
. MergeColumns
provides two attributes columns
and output_column
to specify the columns to merge and the resulting output column.
This step can be useful if you have a Task
that generates instructions, for example, and you want to have more examples of those. In such a case, you could use another Task
to multiply your instructions synthetically, which would yield two separate columns with the instructions split between them. Using MergeColumns
you can merge them and use them as a single column in your dataset for further processing.
Attributes:
Name Type Descriptioncolumns
List[str]
List of strings with the names of the columns to merge.
output_column
Optional[str]
Name of the output column. If not provided, merged_column is used.
Input columnscolumns
attribute): The columns to merge.columns
and output_column
attributes): The columns that were merged.Examples:
Combine columns in rows of a dataset:
from distilabel.steps import MergeColumns\n\ncombiner = MergeColumns(\n columns=[\"queries\", \"multiple_queries\"],\n output_column=\"queries\",\n)\ncombiner.load()\n\nresult = next(\n combiner.process(\n [\n {\n \"queries\": \"How are you?\",\n \"multiple_queries\": [\"What's up?\", \"Everything ok?\"]\n }\n ],\n )\n)\n# >>> result\n# [{'queries': ['How are you?', \"What's up?\", 'Everything ok?']}]\n
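A second sketch (not in the upstream docstring) of the default behaviour when output_column is not set; per the outputs property in the source below, the merged values land in merged_column:
from distilabel.steps import MergeColumns\n\ncombiner = MergeColumns(\n columns=[\"queries\", \"multiple_queries\"],\n)\ncombiner.load()\n\nresult = next(\n combiner.process(\n [\n {\n \"queries\": \"How are you?\",\n \"multiple_queries\": [\"What's up?\", \"Everything ok?\"]\n }\n ],\n )\n)\n# >>> result\n# [{'merged_column': ['How are you?', \"What's up?\", 'Everything ok?']}]\n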
Source code in src/distilabel/steps/columns/merge.py
class MergeColumns(Step):\n \"\"\"Merge columns from a row.\n\n `MergeColumns` is a `Step` that implements the `process` method that calls the `merge_columns`\n function to handle and combine columns in a `StepInput`. `MergeColumns` provides two attributes\n `columns` and `output_column` to specify the columns to merge and the resulting output column.\n\n This step can be useful if you have a `Task` that generates instructions for example, and you\n want to have more examples of those. In such a case, you could for example use another `Task`\n to multiply your instructions synthetically, what would yield two different columns splitted.\n Using `MergeColumns` you can merge them and use them as a single column in your dataset for\n further processing.\n\n Attributes:\n columns: List of strings with the names of the columns to merge.\n output_column: str name of the output column\n\n Input columns:\n - dynamic (determined by `columns` attribute): The columns to merge.\n\n Output columns:\n - dynamic (determined by `columns` and `output_column` attributes): The columns\n that were merged.\n\n Categories:\n - columns\n\n Examples:\n Combine columns in rows of a dataset:\n\n ```python\n from distilabel.steps import MergeColumns\n\n combiner = MergeColumns(\n columns=[\"queries\", \"multiple_queries\"],\n output_column=\"queries\",\n )\n combiner.load()\n\n result = next(\n combiner.process(\n [\n {\n \"queries\": \"How are you?\",\n \"multiple_queries\": [\"What's up?\", \"Everything ok?\"]\n }\n ],\n )\n )\n # >>> result\n # [{'queries': ['How are you?', \"What's up?\", 'Everything ok?']}]\n ```\n \"\"\"\n\n columns: List[str]\n output_column: Optional[str] = None\n\n @property\n def inputs(self) -> \"StepColumns\":\n return self.columns\n\n @property\n def outputs(self) -> \"StepColumns\":\n return [self.output_column] if self.output_column else [\"merged_column\"]\n\n @override\n def process(self, inputs: StepInput) -> \"StepOutput\":\n combined = []\n for input in inputs:\n combined.append(\n merge_columns(\n input,\n columns=self.columns,\n new_column=self.outputs[0],\n )\n )\n yield combined\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.group","title":"group
","text":""},{"location":"api/step_gallery/columns/#distilabel.steps.columns.group.GroupColumns","title":"GroupColumns
","text":" Bases: Step
Combines columns from a list of StepInput
.
GroupColumns
is a Step
that implements the process
method that calls the group_dicts
function to handle and combine a list of StepInput
. Also GroupColumns
provides two attributes columns
and output_columns
to specify the columns to group and the output columns which will override the default value for the properties inputs
and outputs
, respectively.
Attributes:
Name Type Descriptioncolumns
List[str]
List of strings with the names of the columns to group.
output_columns
Optional[List[str]]
Optional list of strings with the names of the output columns.
Input columnscolumns
attribute): The columns to group.columns
and output_columns
attributes): The columns that were grouped.Examples:
Group columns of a dataset:
from distilabel.steps import GroupColumns\n\ngroup_columns = GroupColumns(\n name=\"group_columns\",\n columns=[\"generation\", \"model_name\"],\n)\ngroup_columns.load()\n\nresult = next(\n group_columns.process(\n [{\"generation\": \"AI generated text\"}, {\"model_name\": \"my_model\"}],\n [{\"generation\": \"Other generated text\", \"model_name\": \"my_model\"}]\n )\n)\n# >>> result\n# [{'grouped_generation': ['AI generated text', 'Other generated text'], 'grouped_model_name': ['my_model']}]\n
Specify the name of the output columns:
from distilabel.steps import GroupColumns\n\ngroup_columns = GroupColumns(\n name=\"group_columns\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"generation_models\"]\n)\ngroup_columns.load()\n\nresult = next(\n group_columns.process(\n [{\"generation\": \"AI generated text\"}, {\"model_name\": \"my_model\"}],\n [{\"generation\": \"Other generated text\", \"model_name\": \"my_model\"}]\n )\n)\n# >>> result\n# [{'generations': ['AI generated text', 'Other generated text'], 'generation_models': ['my_model']}]\n
Source code in src/distilabel/steps/columns/group.py
class GroupColumns(Step):\n \"\"\"Combines columns from a list of `StepInput`.\n\n `GroupColumns` is a `Step` that implements the `process` method that calls the `group_dicts`\n function to handle and combine a list of `StepInput`. Also `GroupColumns` provides two attributes\n `columns` and `output_columns` to specify the columns to group and the output columns\n which will override the default value for the properties `inputs` and `outputs`, respectively.\n\n Attributes:\n columns: List of strings with the names of the columns to group.\n output_columns: Optional list of strings with the names of the output columns.\n\n Input columns:\n - dynamic (determined by `columns` attribute): The columns to group.\n\n Output columns:\n - dynamic (determined by `columns` and `output_columns` attributes): The columns\n that were grouped.\n\n Categories:\n - columns\n\n Examples:\n\n Group columns of a dataset:\n\n ```python\n from distilabel.steps import GroupColumns\n\n group_columns = GroupColumns(\n name=\"group_columns\",\n columns=[\"generation\", \"model_name\"],\n )\n group_columns.load()\n\n result = next(\n group_columns.process(\n [{\"generation\": \"AI generated text\"}, {\"model_name\": \"my_model\"}],\n [{\"generation\": \"Other generated text\", \"model_name\": \"my_model\"}]\n )\n )\n # >>> result\n # [{'merged_generation': ['AI generated text', 'Other generated text'], 'merged_model_name': ['my_model']}]\n ```\n\n Specify the name of the output columns:\n\n ```python\n from distilabel.steps import GroupColumns\n\n group_columns = GroupColumns(\n name=\"group_columns\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"generation_models\"]\n )\n group_columns.load()\n\n result = next(\n group_columns.process(\n [{\"generation\": \"AI generated text\"}, {\"model_name\": \"my_model\"}],\n [{\"generation\": \"Other generated text\", \"model_name\": \"my_model\"}]\n )\n )\n # >>> result\n #[{'generations': ['AI generated text', 'Other generated text'], 'generation_models': ['my_model']}]\n ```\n \"\"\"\n\n columns: List[str]\n output_columns: Optional[List[str]] = None\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs for the task are the column names in `columns`.\"\"\"\n return self.columns\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The outputs for the task are the column names in `output_columns` or\n `grouped_{column}` for each column in `columns`.\"\"\"\n return (\n self.output_columns\n if self.output_columns is not None\n else [f\"grouped_{column}\" for column in self.columns]\n )\n\n @override\n def process(self, *inputs: StepInput) -> \"StepOutput\":\n \"\"\"The `process` method calls the `group_dicts` function to handle and combine a list of `StepInput`.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with the combined `StepInput` using the `group_dicts` function.\n \"\"\"\n yield group_columns(\n *inputs,\n group_columns=self.inputs,\n output_group_columns=self.outputs,\n )\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.group.GroupColumns.inputs","title":"inputs
property
","text":"The inputs for the task are the column names in columns
.
outputs
property
","text":"The outputs for the task are the column names in output_columns
or grouped_{column}
for each column in columns
.
process(*inputs)
","text":"The process
method calls the group_dicts
function to handle and combine a list of StepInput
.
Parameters:
Name Type Description Default*inputs
StepInput
A list of StepInput
to be combined.
()
Yields:
Type DescriptionStepOutput
A StepOutput
with the combined StepInput
using the group_dicts
function.
src/distilabel/steps/columns/group.py
@override\ndef process(self, *inputs: StepInput) -> \"StepOutput\":\n \"\"\"The `process` method calls the `group_dicts` function to handle and combine a list of `StepInput`.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with the combined `StepInput` using the `group_dicts` function.\n \"\"\"\n yield group_columns(\n *inputs,\n group_columns=self.inputs,\n output_group_columns=self.outputs,\n )\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.group.CombineColumns","title":"CombineColumns
","text":" Bases: GroupColumns
CombineColumns
is deprecated and will be removed in version 1.5.0, use GroupColumns
instead.
src/distilabel/steps/columns/group.py
class CombineColumns(GroupColumns):\n \"\"\"`CombineColumns` is deprecated and will be removed in version 1.5.0, use `GroupColumns` instead.\"\"\"\n\n def __init__(self, **data: Any) -> None:\n warnings.warn(\n \"`CombineColumns` is deprecated and will be removed in version 1.5.0, use `GroupColumns` instead.\",\n DeprecationWarning,\n stacklevel=2,\n )\n return super().__init__(**data)\n
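A hedged migration sketch (assuming CombineColumns is still importable from distilabel.steps while deprecated):
# Deprecated: emits a DeprecationWarning\nfrom distilabel.steps import CombineColumns\ncombine = CombineColumns(columns=[\"generation\", \"model_name\"])\n\n# Preferred replacement with the same arguments\nfrom distilabel.steps import GroupColumns\ngroup = GroupColumns(columns=[\"generation\", \"model_name\"])\n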
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.utils","title":"utils
","text":""},{"location":"api/step_gallery/columns/#distilabel.steps.columns.utils.merge_distilabel_metadata","title":"merge_distilabel_metadata(*output_dicts)
","text":"Merge the DISTILABEL_METADATA_KEY
from multiple output dictionaries. DISTILABEL_METADATA_KEY
can be either a dictionary containing metadata keys or a list containing dictionaries of metadata keys.
Parameters:
Name Type Description Default*output_dicts
Dict[str, Any]
Variable number of dictionaries or lists containing distilabel metadata.
()
Returns:
Type DescriptionUnion[Dict[str, Any], List[Dict[str, Any]]]
A merged dictionary or list containing all the distilabel metadata.
Source code insrc/distilabel/steps/columns/utils.py
def merge_distilabel_metadata(\n *output_dicts: Dict[str, Any],\n) -> Union[Dict[str, Any], List[Dict[str, Any]]]:\n \"\"\"\n Merge the `DISTILABEL_METADATA_KEY` from multiple output dictionaries. `DISTILABEL_METADATA_KEY`\n can be either a dictionary containing metadata keys or a list containing dictionaries\n of metadata keys.\n\n Args:\n *output_dicts: Variable number of dictionaries or lists containing distilabel metadata.\n\n Returns:\n A merged dictionary or list containing all the distilabel metadata.\n \"\"\"\n merged_metadata = defaultdict(list)\n\n for output_dict in output_dicts:\n metadata = output_dict.get(DISTILABEL_METADATA_KEY, {})\n # If `distilabel_metadata_key` is a `list` then it contains dictionaries with\n # the metadata per `num_generations` created when `group_generations==True`\n if isinstance(metadata, list):\n if not isinstance(merged_metadata, list):\n merged_metadata = []\n merged_metadata.extend(metadata)\n else:\n for key, value in metadata.items():\n merged_metadata[key].append(value)\n\n if isinstance(merged_metadata, list):\n return merged_metadata\n\n final_metadata = {}\n for key, value_list in merged_metadata.items():\n if len(value_list) == 1:\n final_metadata[key] = value_list[0]\n else:\n final_metadata[key] = value_list\n\n return final_metadata\n
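A small usage sketch (assuming DISTILABEL_METADATA_KEY is the distilabel_metadata column and that the helper is importable from distilabel.steps.columns.utils):
from distilabel.steps.columns.utils import merge_distilabel_metadata\n\n# Two step outputs carrying metadata under the assumed \"distilabel_metadata\" key\noutput_a = {\"distilabel_metadata\": {\"statistics_text_generation\": {\"input_tokens\": 10}}}\noutput_b = {\"distilabel_metadata\": {\"statistics_text_generation\": {\"input_tokens\": 12}}}\n\nmerge_distilabel_metadata(output_a, output_b)\n# {'statistics_text_generation': [{'input_tokens': 10}, {'input_tokens': 12}]}\n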
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.utils.group_columns","title":"group_columns(*inputs, group_columns, output_group_columns=None)
","text":"Groups multiple list of dictionaries into a single list of dictionaries on the specified group_columns
. If group_columns
are provided, then it will also rename group_columns
.
Parameters:
Name Type Description Defaultinputs
StepInput
list of dictionaries to combine.
()
group_columns
List[str]
list of keys to merge on.
requiredoutput_group_columns
Optional[List[str]]
list of keys to rename the merge keys to. Defaults to None
.
None
Returns:
Type DescriptionStepInput
A list of dictionaries where the values of the group_columns
are combined into a
StepInput
list and renamed to output_group_columns
.
src/distilabel/steps/columns/utils.py
def group_columns(\n *inputs: \"StepInput\",\n group_columns: List[str],\n output_group_columns: Optional[List[str]] = None,\n) -> \"StepInput\":\n \"\"\"Groups multiple list of dictionaries into a single list of dictionaries on the\n specified `group_columns`. If `group_columns` are provided, then it will also rename\n `group_columns`.\n\n Args:\n inputs: list of dictionaries to combine.\n group_columns: list of keys to merge on.\n output_group_columns: list of keys to rename the merge keys to. Defaults to `None`.\n\n Returns:\n A list of dictionaries where the values of the `group_columns` are combined into a\n list and renamed to `output_group_columns`.\n \"\"\"\n if output_group_columns is not None and len(output_group_columns) != len(\n group_columns\n ):\n raise ValueError(\n \"The length of `output_group_columns` must be the same as the length of `group_columns`.\"\n )\n if output_group_columns is None:\n output_group_columns = [f\"grouped_{key}\" for key in group_columns]\n group_columns_dict = dict(zip(group_columns, output_group_columns))\n\n result = []\n # Use zip to iterate over lists based on their index\n for dicts_at_index in zip(*inputs):\n combined_dict = {}\n metadata_dicts = []\n # Iterate over dicts at the same index\n for d in dicts_at_index:\n # Extract metadata for merging\n if DISTILABEL_METADATA_KEY in d:\n metadata_dicts.append(\n {DISTILABEL_METADATA_KEY: d[DISTILABEL_METADATA_KEY]}\n )\n # Iterate over key-value pairs in each dict\n for key, value in d.items():\n if key == DISTILABEL_METADATA_KEY:\n continue\n # If the key is in the merge_keys, append the value to the existing list\n if key in group_columns_dict.keys():\n combined_dict.setdefault(group_columns_dict[key], []).append(value)\n # If the key is not in the merge_keys, create a new key-value pair\n else:\n combined_dict[key] = value\n\n if metadata_dicts:\n combined_dict[DISTILABEL_METADATA_KEY] = merge_distilabel_metadata(\n *metadata_dicts\n )\n\n result.append(combined_dict)\n return result\n
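A small usage sketch with the default output names (grouped_*):
from distilabel.steps.columns.utils import group_columns\n\nbatch_1 = [{\"generation\": \"text A\", \"model_name\": \"model_1\"}]\nbatch_2 = [{\"generation\": \"text B\", \"model_name\": \"model_2\"}]\n\ngroup_columns(batch_1, batch_2, group_columns=[\"generation\", \"model_name\"])\n# [{'grouped_generation': ['text A', 'text B'], 'grouped_model_name': ['model_1', 'model_2']}]\n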
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.utils.merge_columns","title":"merge_columns(row, columns, new_column='combined_key')
","text":"Merge columns in a dictionary into a single column on the specified new_column
.
Parameters:
Name Type Description Defaultrow
Dict[str, Any]
Dictionary corresponding to a row in a dataset.
requiredcolumns
List[str]
List of keys to merge.
requirednew_column
str
Name of the new key created.
'combined_key'
Returns:
Type DescriptionDict[str, Any]
Dictionary with the new merged key.
Source code insrc/distilabel/steps/columns/utils.py
def merge_columns(\n row: Dict[str, Any], columns: List[str], new_column: str = \"combined_key\"\n) -> Dict[str, Any]:\n \"\"\"Merge columns in a dictionary into a single column on the specified `new_column`.\n\n Args:\n row: Dictionary corresponding to a row in a dataset.\n columns: List of keys to merge.\n new_column: Name of the new key created.\n\n Returns:\n Dictionary with the new merged key.\n \"\"\"\n result = row.copy() # preserve the original dictionary\n combined = []\n for key in columns:\n to_combine = result.pop(key)\n if not isinstance(to_combine, list):\n to_combine = [to_combine]\n combined += to_combine\n result[new_column] = combined\n return result\n
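A small usage sketch, merging two keys of a row into a single one:
from distilabel.steps.columns.utils import merge_columns\n\nrow = {\"queries\": \"How are you?\", \"multiple_queries\": [\"What's up?\", \"Everything ok?\"]}\n\nmerge_columns(row, columns=[\"queries\", \"multiple_queries\"], new_column=\"queries\")\n# {'queries': ['How are you?', \"What's up?\", 'Everything ok?']}\n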
"},{"location":"api/step_gallery/extra/","title":"Extra","text":""},{"location":"api/step_gallery/extra/#distilabel.steps","title":"steps
","text":""},{"location":"api/step_gallery/extra/#distilabel.steps.DBSCAN","title":"DBSCAN
","text":" Bases: GlobalStep
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core samples in regions of high density and expands clusters from them. This algorithm is good for data which contains clusters of similar density.
This is a GlobalStep
that clusters the embeddings using the DBSCAN algorithm from sklearn
. Visit TextClustering
step for an example of use. The trained model is saved as an artifact when creating a distiset and pushing it to the Hugging Face Hub.
List[float]
): Vector representation of the text to cluster, normally the output from the UMAP
step.int
): Integer representing the label of a given cluster. -1 means it wasn't clustered.DBSCAN demo of sklearn
sklearn dbscan
Attributes:
Name Type Description-
eps
The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
-
min_samples
The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. If min_samples
is set to a higher value, DBSCAN will find denser clusters, whereas if it is set to a lower value, the found clusters will be more sparse.
-
metric
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances
for its metric parameter.
-
n_jobs
The number of parallel jobs to run.
Runtime parameterseps
: The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.min_samples
: The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. If min_samples
is set to a higher value, DBSCAN will find denser clusters, whereas if it is set to a lower value, the found clusters will be more sparse.metric
: The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances
for its metric parameter.n_jobs
: The number of parallel jobs to run.src/distilabel/steps/clustering/dbscan.py
class DBSCAN(GlobalStep):\n r\"\"\"DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core\n samples in regions of high density and expands clusters from them. This algorithm\n is good for data which contains clusters of similar density.\n\n This is a `GlobalStep` that clusters the embeddings using the DBSCAN algorithm\n from `sklearn`. Visit `TextClustering` step for an example of use.\n The trained model is saved as an artifact when creating a distiset\n and pushing it to the Hugging Face Hub.\n\n Input columns:\n - projection (`List[float]`): Vector representation of the text to cluster,\n normally the output from the `UMAP` step.\n\n Output columns:\n - cluster_label (`int`): Integer representing the label of a given cluster. -1\n means it wasn't clustered.\n\n Categories:\n - clustering\n - text-classification\n\n References:\n - [`DBSCAN demo of sklearn`](https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#demo-of-dbscan-clustering-algorithm)\n - [`sklearn dbscan`](https://scikit-learn.org/stable/modules/clustering.html#dbscan)\n\n Attributes:\n - eps: The maximum distance between two samples for one to be considered as in the\n neighborhood of the other. This is not a maximum bound on the distances of\n points within a cluster. This is the most important DBSCAN parameter to\n choose appropriately for your data set and distance function.\n - min_samples: The number of samples (or total weight) in a neighborhood for a point\n to be considered as a core point. This includes the point itself. If `min_samples`\n is set to a higher value, DBSCAN will find denser clusters, whereas if it is set\n to a lower value, the found clusters will be more sparse.\n - metric: The metric to use when calculating distance between instances in a feature\n array. If metric is a string or callable, it must be one of the options allowed\n by `sklearn.metrics.pairwise_distances` for its metric parameter.\n - n_jobs: The number of parallel jobs to run.\n\n Runtime parameters:\n - `eps`: The maximum distance between two samples for one to be considered as in the\n neighborhood of the other. This is not a maximum bound on the distances of\n points within a cluster. This is the most important DBSCAN parameter to\n choose appropriately for your data set and distance function.\n - `min_samples`: The number of samples (or total weight) in a neighborhood for a point\n to be considered as a core point. This includes the point itself. If `min_samples`\n is set to a higher value, DBSCAN will find denser clusters, whereas if it is set\n to a lower value, the found clusters will be more sparse.\n - `metric`: The metric to use when calculating distance between instances in a feature\n array. If metric is a string or callable, it must be one of the options allowed\n by `sklearn.metrics.pairwise_distances` for its metric parameter.\n - `n_jobs`: The number of parallel jobs to run.\n \"\"\"\n\n eps: Optional[RuntimeParameter[float]] = Field(\n default=0.3,\n description=(\n \"The maximum distance between two samples for one to be considered \"\n \"as in the neighborhood of the other. This is not a maximum bound \"\n \"on the distances of points within a cluster. 
This is the most \"\n \"important DBSCAN parameter to choose appropriately for your data set \"\n \"and distance function.\"\n ),\n )\n min_samples: Optional[RuntimeParameter[int]] = Field(\n default=30,\n description=(\n \"The number of samples (or total weight) in a neighborhood for a point to \"\n \"be considered as a core point. This includes the point itself. If \"\n \"`min_samples` is set to a higher value, DBSCAN will find denser clusters, \"\n \"whereas if it is set to a lower value, the found clusters will be more \"\n \"sparse.\"\n ),\n )\n metric: Optional[RuntimeParameter[str]] = Field(\n default=\"euclidean\",\n description=(\n \"The metric to use when calculating distance between instances in a \"\n \"feature array. If metric is a string or callable, it must be one of \"\n \"the options allowed by `sklearn.metrics.pairwise_distances` for \"\n \"its metric parameter.\"\n ),\n )\n n_jobs: Optional[RuntimeParameter[int]] = Field(\n default=8, description=\"The number of parallel jobs to run.\"\n )\n\n _clusterer: Optional[\"_DBSCAN\"] = PrivateAttr(None)\n\n def load(self) -> None:\n super().load()\n if importlib.util.find_spec(\"sklearn\") is None:\n raise ImportError(\n \"`sklearn` package is not installed. Please install it using `pip install scikit-learn`.\"\n )\n from sklearn.cluster import DBSCAN as _DBSCAN\n\n self._clusterer = _DBSCAN(\n eps=self.eps,\n min_samples=self.min_samples,\n metric=self.metric,\n n_jobs=self.n_jobs,\n )\n\n def unload(self) -> None:\n self._clusterer = None\n\n @property\n def inputs(self) -> List[str]:\n return [\"projection\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"cluster_label\"]\n\n def _save_model(self, model: Any) -> None:\n import joblib\n\n def save_model(path):\n with open(str(path / \"DBSCAN.joblib\"), \"wb\") as f:\n joblib.dump(model, f)\n\n self.save_artifact(\n name=\"DBSCAN_model\",\n write_function=lambda path: save_model(path),\n metadata={\n \"eps\": self.eps,\n \"min_samples\": self.min_samples,\n \"metric\": self.metric,\n },\n )\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n projections = np.array([input[\"projection\"] for input in inputs])\n\n self._logger.info(\"\ud83c\udfcb\ufe0f\u200d\u2640\ufe0f Start training DBSCAN...\")\n fitted_clusterer = self._clusterer.fit(projections)\n cluster_labels = fitted_clusterer.labels_\n # Sets the cluster labels for each input, -1 means it wasn't clustered\n for input, cluster_label in zip(inputs, cluster_labels):\n input[\"cluster_label\"] = cluster_label\n self._logger.info(f\"DBSCAN labels assigned: {len(set(cluster_labels))}\")\n self._save_model(fitted_clusterer)\n yield inputs\n
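As a rough illustration of what the step does internally (see the source above), this is the equivalent plain scikit-learn call on a few toy projections, not the distilabel API itself:
import numpy as np\nfrom sklearn.cluster import DBSCAN as SKDBSCAN\n\n# Toy 2D projections, e.g. as produced by a UMAP step\nprojections = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])\nclusterer = SKDBSCAN(eps=0.3, min_samples=2, metric=\"euclidean\", n_jobs=8)\nlabels = clusterer.fit(projections).labels_\n# array([ 0, 0, -1]) -> -1 means the last point was not assigned to any cluster\n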
"},{"location":"api/step_gallery/extra/#distilabel.steps.TextClustering","title":"TextClustering
","text":" Bases: TextClassification
, GlobalTask
Task that clusters a set of texts and generates summary labels for each cluster.
This is a GlobalTask
that inherits from TextClassification
, which means that all the attributes from that class are available here. Also, in this case we deal with all the inputs at once, instead of using batches. The input_batch_size
is used here to send the examples to the LLM in batches (a subtle difference with the more common Task
definitions). The task looks in each cluster for a given number of representative examples (the number is set by the samples_per_cluster
attribute), and sends them to the LLM to get one or more labels that represent the cluster. The labels are then assigned to each text in the cluster. The clusters and projections used in the step are assumed to be obtained from the UMAP
+ DBSCAN
steps, but could be generated by similar steps, as long as they represent the same concepts. This step runs a pipeline like the one in this repository: https://github.com/huggingface/text-clustering
str
): The reference text we want to obtain labels for.List[float]
): Vector representation of the text to cluster, normally the output from the UMAP
step.int
): Integer representing the label of a given cluster. -1 means it wasn't clustered.str
): The label or list of labels for the text.str
): The name of the model used to generate the label/s.text-clustering repository
Attributes:
Name Type Description-
savefig
Whether to generate and save a figure with the clustering of the texts.
-
samples_per_cluster
The number of examples to use in the LLM as a sample of the cluster.
Examples:
Generate labels for a set of texts using clustering:
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps import UMAP, DBSCAN, TextClustering\nfrom distilabel.pipeline import Pipeline\n\nds_name = \"argilla-warehouse/personahub-fineweb-edu-4-clustering-100k\"\n\nwith Pipeline(name=\"Text clustering dataset\") as pipeline:\n batch_size = 500\n\n ds = load_dataset(ds_name, split=\"train\").select(range(10000))\n loader = make_generator_step(ds, batch_size=batch_size, repo_id=ds_name)\n\n umap = UMAP(n_components=2, metric=\"cosine\")\n dbscan = DBSCAN(eps=0.3, min_samples=30)\n\n text_clustering = TextClustering(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n n=3, # 3 labels per example\n query_title=\"Examples of Personas\",\n samples_per_cluster=10,\n context=(\n \"Describe the main themes, topics, or categories that could describe the \"\n \"following types of personas. All the examples of personas must share \"\n \"the same set of labels.\"\n ),\n default_label=\"None\",\n savefig=True,\n input_batch_size=8,\n input_mappings={\"text\": \"persona\"},\n use_default_structured_output=True,\n )\n\n loader >> umap >> dbscan >> text_clustering\n
Source code in src/distilabel/steps/clustering/text_clustering.py
class TextClustering(TextClassification, GlobalTask):\n \"\"\"Task that clusters a set of texts and generates summary labels for each cluster.\n\n This is a `GlobalTask` that inherits from `TextClassification`, this means that all\n the attributes from that class are available here. Also, in this case we deal\n with all the inputs at once, instead of using batches. The `input_batch_size` is\n used here to send the examples to the LLM in batches (a subtle difference with the\n more common `Task` definitions).\n The task looks in each cluster for a given number of representative examples (the number\n is set by the `samples_per_cluster` attribute), and sends them to the LLM to get a label/s\n that represent the cluster. The labels are then assigned to each text in the cluster.\n The clusters and projections used in the step, are assumed to be obtained from the `UMAP`\n + `DBSCAN` steps, but could be generated for similar steps, as long as they represent the\n same concepts.\n This step runs a pipeline like the one in this repository:\n https://github.com/huggingface/text-clustering\n\n Input columns:\n - text (`str`): The reference text we want to obtain labels for.\n - projection (`List[float]`): Vector representation of the text to cluster,\n normally the output from the `UMAP` step.\n - cluster_label (`int`): Integer representing the label of a given cluster. -1\n means it wasn't clustered.\n\n Output columns:\n - summary_label (`str`): The label or list of labels for the text.\n - model_name (`str`): The name of the model used to generate the label/s.\n\n Categories:\n - clustering\n - text-classification\n\n References:\n - [`text-clustering repository`](https://github.com/huggingface/text-clustering)\n\n Attributes:\n - savefig: Whether to generate and save a figure with the clustering of the texts.\n - samples_per_cluster: The number of examples to use in the LLM as a sample of the cluster.\n\n Examples:\n Generate labels for a set of texts using clustering:\n\n ```python\n from distilabel.models import InferenceEndpointsLLM\n from distilabel.steps import UMAP, DBSCAN, TextClustering\n from distilabel.pipeline import Pipeline\n\n ds_name = \"argilla-warehouse/personahub-fineweb-edu-4-clustering-100k\"\n\n with Pipeline(name=\"Text clustering dataset\") as pipeline:\n batch_size = 500\n\n ds = load_dataset(ds_name, split=\"train\").select(range(10000))\n loader = make_generator_step(ds, batch_size=batch_size, repo_id=ds_name)\n\n umap = UMAP(n_components=2, metric=\"cosine\")\n dbscan = DBSCAN(eps=0.3, min_samples=30)\n\n text_clustering = TextClustering(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n n=3, # 3 labels per example\n query_title=\"Examples of Personas\",\n samples_per_cluster=10,\n context=(\n \"Describe the main themes, topics, or categories that could describe the \"\n \"following types of personas. 
All the examples of personas must share \"\n \"the same set of labels.\"\n ),\n default_label=\"None\",\n savefig=True,\n input_batch_size=8,\n input_mappings={\"text\": \"persona\"},\n use_default_structured_output=True,\n )\n\n loader >> umap >> dbscan >> text_clustering\n ```\n \"\"\"\n\n savefig: Optional[RuntimeParameter[bool]] = Field(\n default=True,\n description=\"Whether to generate and save a figure with the clustering of the texts.\",\n )\n samples_per_cluster: int = Field(\n default=10,\n description=\"The number of examples to use in the LLM as a sample of the cluster.\",\n )\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task are the same as those for `TextClassification` plus\n the `projection` and `cluster_label` columns (which can be obtained from\n UMAP + DBSCAN steps).\n \"\"\"\n return super().inputs + [\"projection\", \"cluster_label\"]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task is the `summary_label` and the `model_name`.\"\"\"\n return [\"summary_label\", \"model_name\"]\n\n def load(self) -> None:\n super().load()\n if self.savefig and (importlib.util.find_spec(\"matplotlib\") is None):\n raise ImportError(\n \"`matplotlib` package is not installed. Please install it using `pip install matplotlib`.\"\n )\n\n def _save_figure(\n self,\n data: pd.DataFrame,\n cluster_centers: Dict[str, Tuple[float, float]],\n cluster_summaries: Dict[int, str],\n ) -> None:\n \"\"\"Saves the figure starting from the dataframe, using matplotlib.\n\n Args:\n data: pd.DataFrame with the columns 'X', 'Y' and 'labels' representing\n the projections and the label of each text respectively.\n cluster_centers: Dictionary mapping from each label the center of a cluster,\n to help with the placement of the annotations.\n cluster_summaries: The summaries of the clusters, obtained from the LLM.\n \"\"\"\n import matplotlib.pyplot as plt\n\n fig, ax = plt.subplots(figsize=(12, 8), dpi=300)\n unique_labels = data[\"labels\"].unique()\n # Map of colors for each label (-1 is black)\n colormap = dict(\n zip(unique_labels, plt.cm.Spectral(np.linspace(0, 1, len(unique_labels))))\n )\n colormap[-1] = np.array([0, 0, 0, 0])\n data[\"color\"] = data[\"labels\"].map(colormap)\n\n data.plot(\n kind=\"scatter\",\n x=\"X\",\n y=\"Y\",\n c=\"color\",\n s=0.75,\n alpha=0.8,\n linewidth=0.4,\n ax=ax,\n colorbar=False,\n )\n\n for label in cluster_summaries.keys():\n if label == -1:\n continue\n summary = str(cluster_summaries[label]) # These are obtained from the LLM\n position = cluster_centers[label]\n t = ax.text(\n position[0],\n position[1],\n summary,\n horizontalalignment=\"center\",\n verticalalignment=\"center\",\n fontsize=4,\n )\n t.set_bbox(\n {\n \"facecolor\": \"white\",\n \"alpha\": 0.9,\n \"linewidth\": 0,\n \"boxstyle\": \"square,pad=0.1\",\n }\n )\n\n ax.set_axis_off()\n # Save the plot as an artifact of the step\n self.save_artifact(\n name=\"Text clusters\",\n write_function=lambda path: fig.savefig(path / \"figure_clustering.png\"),\n metadata={\"type\": \"image\", \"library\": \"matplotlib\"},\n )\n plt.close()\n\n def _create_figure(\n self,\n inputs: StepInput,\n label2docs: Dict[int, List[str]],\n cluster_summaries: Dict[int, str],\n ) -> None:\n \"\"\"Creates a figure of the clustered texts and save it as an artifact.\n\n Args:\n inputs: The inputs of the step, as we will extract information from them again.\n label2docs: Map from each label to the list of documents (texts) that belong to that cluster.\n cluster_summaries: 
The summaries of the clusters, obtained from the LLM.\n \"\"\"\n self._logger.info(\"\ud83d\uddbc\ufe0f Creating figure for the clusters...\")\n\n labels = []\n projections = []\n id2cluster = {}\n for i, input in enumerate(inputs):\n label = input[\"cluster_label\"]\n id2cluster[i] = label\n labels.append(label)\n projections.append(input[\"projection\"])\n\n projections = np.array(projections)\n\n # Contains the placement of the cluster centers in the figure\n cluster_centers: Dict[str, Tuple[float, float]] = {}\n for label in label2docs.keys():\n x = np.mean([projections[doc, 0] for doc in label2docs[label]])\n y = np.mean([projections[doc, 1] for doc in label2docs[label]])\n cluster_centers[label] = (x, y)\n\n df = pd.DataFrame(\n data={\n \"X\": projections[:, 0],\n \"Y\": projections[:, 1],\n \"labels\": labels,\n }\n )\n\n self._save_figure(\n df, cluster_centers=cluster_centers, cluster_summaries=cluster_summaries\n )\n\n def _prepare_input_texts(\n self,\n inputs: StepInput,\n label2docs: Dict[int, List[int]],\n unique_labels: List[int],\n ) -> List[Dict[str, Union[str, int]]]:\n \"\"\"Prepares a batch of inputs to send to the LLM, with the examples of each cluster.\n\n Args:\n inputs: Inputs from the step.\n label2docs: Map from each label to the list of documents (texts) that\n belong to that cluster.\n unique_labels: The unique labels of the clusters.\n\n Returns:\n The input texts to send to the LLM, with the examples of each cluster\n prepared to be used in the prompt, and an additional key to store the\n labels (that will be needed to find the data after the batches are\n returned from the LLM).\n \"\"\"\n input_texts = []\n for label in range(unique_labels): # The label -1 is implicitly excluded\n # Get the ids but remove possible duplicates, which could happen with bigger probability\n # the bigger the number of examples requested, and the smaller the subset of examples\n ids = set(\n np.random.choice(label2docs[label], size=self.samples_per_cluster)\n ) # Grab the number of examples\n examples = [inputs[i][\"text\"] for i in ids]\n input_text = {\n \"text\": \"\\n\\n\".join(\n [f\"Example {i}:\\n{t}\" for i, t in enumerate(examples, start=1)]\n ),\n \"__LABEL\": label,\n }\n input_texts.append(input_text)\n return input_texts\n\n def process(self, inputs: StepInput) -> \"StepOutput\":\n labels = [input[\"cluster_label\"] for input in inputs]\n # -1 because -1 is the label for the unclassified\n unique_labels = len(set(labels)) - 1\n # This will be the output of the LLM, the set of labels for each cluster\n cluster_summaries: Dict[int, str] = {-1: self.default_label}\n\n # Map from label to list of documents, will use them to select examples from each cluster\n label2docs = defaultdict(list)\n for i, label in enumerate(labels):\n label2docs[label].append(i)\n\n input_texts = self._prepare_input_texts(inputs, label2docs, unique_labels)\n\n # Send the texts in batches to the LLM, and get the labels for each cluster\n for i, batched_inputs in enumerate(batched(input_texts, self.input_batch_size)):\n self._logger.info(f\"\ud83d\udce6 Processing internal batch of inputs {i}...\")\n results = super().process(batched_inputs)\n for result in next(results): # Extract the elements from the generator\n cluster_summaries[result[\"__LABEL\"]] = result[\"labels\"]\n\n # Assign the labels to each text\n for input in inputs:\n input[\"summary_label\"] = json.dumps(\n cluster_summaries[input[\"cluster_label\"]]\n )\n\n if self.savefig:\n self._create_figure(inputs, label2docs, 
cluster_summaries)\n\n yield inputs\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.TextClustering.inputs","title":"inputs
property
","text":"The input for the task are the same as those for TextClassification
plus the projection
and cluster_label
columns (which can be obtained from UMAP + DBSCAN steps).
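For reference, a single input row for this task could look like the following (a hypothetical sketch; the projection and cluster_label values would normally come from the UMAP and DBSCAN steps):
```python
# Hypothetical input row for TextClustering (values are made up):
row = {
    "text": "A chemistry student interested in acid-base interactions.",
    "projection": [0.12, -0.45],  # typically the 2D output of the UMAP step
    "cluster_label": 3,           # -1 would mean the text was not assigned to any cluster
}
```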
outputs
property
","text":"The output for the task is the summary_label
and the model_name
.
_save_figure(data, cluster_centers, cluster_summaries)
","text":"Saves the figure starting from the dataframe, using matplotlib.
Parameters:
Name Type Description Defaultdata
DataFrame
pd.DataFrame with the columns 'X', 'Y' and 'labels' representing the projections and the label of each text respectively.
requiredcluster_centers
Dict[str, Tuple[float, float]]
Dictionary mapping each label to the center of its cluster, to help with the placement of the annotations.
requiredcluster_summaries
Dict[int, str]
The summaries of the clusters, obtained from the LLM.
required Source code insrc/distilabel/steps/clustering/text_clustering.py
def _save_figure(\n self,\n data: pd.DataFrame,\n cluster_centers: Dict[str, Tuple[float, float]],\n cluster_summaries: Dict[int, str],\n) -> None:\n \"\"\"Saves the figure starting from the dataframe, using matplotlib.\n\n Args:\n data: pd.DataFrame with the columns 'X', 'Y' and 'labels' representing\n the projections and the label of each text respectively.\n cluster_centers: Dictionary mapping from each label the center of a cluster,\n to help with the placement of the annotations.\n cluster_summaries: The summaries of the clusters, obtained from the LLM.\n \"\"\"\n import matplotlib.pyplot as plt\n\n fig, ax = plt.subplots(figsize=(12, 8), dpi=300)\n unique_labels = data[\"labels\"].unique()\n # Map of colors for each label (-1 is black)\n colormap = dict(\n zip(unique_labels, plt.cm.Spectral(np.linspace(0, 1, len(unique_labels))))\n )\n colormap[-1] = np.array([0, 0, 0, 0])\n data[\"color\"] = data[\"labels\"].map(colormap)\n\n data.plot(\n kind=\"scatter\",\n x=\"X\",\n y=\"Y\",\n c=\"color\",\n s=0.75,\n alpha=0.8,\n linewidth=0.4,\n ax=ax,\n colorbar=False,\n )\n\n for label in cluster_summaries.keys():\n if label == -1:\n continue\n summary = str(cluster_summaries[label]) # These are obtained from the LLM\n position = cluster_centers[label]\n t = ax.text(\n position[0],\n position[1],\n summary,\n horizontalalignment=\"center\",\n verticalalignment=\"center\",\n fontsize=4,\n )\n t.set_bbox(\n {\n \"facecolor\": \"white\",\n \"alpha\": 0.9,\n \"linewidth\": 0,\n \"boxstyle\": \"square,pad=0.1\",\n }\n )\n\n ax.set_axis_off()\n # Save the plot as an artifact of the step\n self.save_artifact(\n name=\"Text clusters\",\n write_function=lambda path: fig.savefig(path / \"figure_clustering.png\"),\n metadata={\"type\": \"image\", \"library\": \"matplotlib\"},\n )\n plt.close()\n
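For orientation, the two mappings passed to this method have roughly the following shape (hypothetical values, not taken from a real run):
```python
# Hypothetical shapes of the arguments (values are made up):
cluster_centers = {0: (1.2, -0.4), 1: (-2.1, 0.9)}  # cluster label -> (x, y) centre in the projection
cluster_summaries = {-1: "None", 0: "chemistry students", 1: "music teachers"}  # label -> LLM summary label
```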
"},{"location":"api/step_gallery/extra/#distilabel.steps.TextClustering._create_figure","title":"_create_figure(inputs, label2docs, cluster_summaries)
","text":"Creates a figure of the clustered texts and save it as an artifact.
Parameters:
Name Type Description Defaultinputs
StepInput
The inputs of the step, as we will extract information from them again.
requiredlabel2docs
Dict[int, List[str]]
Map from each label to the list of documents (texts) that belong to that cluster.
requiredcluster_summaries
Dict[int, str]
The summaries of the clusters, obtained from the LLM.
required Source code insrc/distilabel/steps/clustering/text_clustering.py
def _create_figure(\n self,\n inputs: StepInput,\n label2docs: Dict[int, List[str]],\n cluster_summaries: Dict[int, str],\n) -> None:\n \"\"\"Creates a figure of the clustered texts and save it as an artifact.\n\n Args:\n inputs: The inputs of the step, as we will extract information from them again.\n label2docs: Map from each label to the list of documents (texts) that belong to that cluster.\n cluster_summaries: The summaries of the clusters, obtained from the LLM.\n \"\"\"\n self._logger.info(\"\ud83d\uddbc\ufe0f Creating figure for the clusters...\")\n\n labels = []\n projections = []\n id2cluster = {}\n for i, input in enumerate(inputs):\n label = input[\"cluster_label\"]\n id2cluster[i] = label\n labels.append(label)\n projections.append(input[\"projection\"])\n\n projections = np.array(projections)\n\n # Contains the placement of the cluster centers in the figure\n cluster_centers: Dict[str, Tuple[float, float]] = {}\n for label in label2docs.keys():\n x = np.mean([projections[doc, 0] for doc in label2docs[label]])\n y = np.mean([projections[doc, 1] for doc in label2docs[label]])\n cluster_centers[label] = (x, y)\n\n df = pd.DataFrame(\n data={\n \"X\": projections[:, 0],\n \"Y\": projections[:, 1],\n \"labels\": labels,\n }\n )\n\n self._save_figure(\n df, cluster_centers=cluster_centers, cluster_summaries=cluster_summaries\n )\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.TextClustering._prepare_input_texts","title":"_prepare_input_texts(inputs, label2docs, unique_labels)
","text":"Prepares a batch of inputs to send to the LLM, with the examples of each cluster.
Parameters:
Name Type Description Defaultinputs
StepInput
Inputs from the step.
requiredlabel2docs
Dict[int, List[int]]
Map from each label to the list of documents (texts) that belong to that cluster.
requiredunique_labels
List[int]
The unique labels of the clusters.
requiredReturns:
Type DescriptionList[Dict[str, Union[str, int]]]
The input texts to send to the LLM, with the examples of each cluster
List[Dict[str, Union[str, int]]]
prepared to be used in the prompt, and an additional key to store the
List[Dict[str, Union[str, int]]]
labels (that will be needed to find the data after the batches are
List[Dict[str, Union[str, int]]]
returned from the LLM).
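As a rough illustration, each element of the returned list has the following shape (hypothetical texts; the __LABEL key is the internal marker mentioned above):
```python
# Hypothetical element of the list returned by `_prepare_input_texts`:
{
    "text": "Example 1:\nA chemistry student persona...\n\nExample 2:\nA physics teacher persona...",
    "__LABEL": 0,  # cluster label, used to map the LLM output back to its cluster
}
```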
Source code insrc/distilabel/steps/clustering/text_clustering.py
def _prepare_input_texts(\n self,\n inputs: StepInput,\n label2docs: Dict[int, List[int]],\n unique_labels: List[int],\n) -> List[Dict[str, Union[str, int]]]:\n \"\"\"Prepares a batch of inputs to send to the LLM, with the examples of each cluster.\n\n Args:\n inputs: Inputs from the step.\n label2docs: Map from each label to the list of documents (texts) that\n belong to that cluster.\n unique_labels: The unique labels of the clusters.\n\n Returns:\n The input texts to send to the LLM, with the examples of each cluster\n prepared to be used in the prompt, and an additional key to store the\n labels (that will be needed to find the data after the batches are\n returned from the LLM).\n \"\"\"\n input_texts = []\n for label in range(unique_labels): # The label -1 is implicitly excluded\n # Get the ids but remove possible duplicates, which could happen with bigger probability\n # the bigger the number of examples requested, and the smaller the subset of examples\n ids = set(\n np.random.choice(label2docs[label], size=self.samples_per_cluster)\n ) # Grab the number of examples\n examples = [inputs[i][\"text\"] for i in ids]\n input_text = {\n \"text\": \"\\n\\n\".join(\n [f\"Example {i}:\\n{t}\" for i, t in enumerate(examples, start=1)]\n ),\n \"__LABEL\": label,\n }\n input_texts.append(input_text)\n return input_texts\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.UMAP","title":"UMAP
","text":" Bases: GlobalStep
UMAP is a general purpose manifold learning and dimension reduction algorithm.
This is a GlobalStep
that reduces the dimensionality of the embeddings using the UMAP algorithm. Visit the TextClustering
step for an example of use. The trained model is saved as an artifact when creating a distiset and pushing it to the Hugging Face Hub.
List[float]
): The original embeddings we want to reduce the dimension.List[float]
): Embedding reduced to the number of components specified; the size of the new embeddings will be determined by the n_components
.UMAP repository
UMAP documentation
Attributes:
Name Type Description-
n_components
The dimension of the space to embed into. This defaults to 2 to provide easy visualization (that's probably what you want), but can reasonably be set to any integer value in the range 2 to 100.
-
metric
The metric to use to compute distances in high dimensional space. Visit UMAP's documentation for more information. Defaults to euclidean
.
-
n_jobs
The number of parallel jobs to run. Defaults to 8
.
-
random_state
The random state to use for the UMAP algorithm.
Runtime parametersn_components
: The dimension of the space to embed into. This defaults to 2 to provide easy visualization (that's probably what you want), but can reasonably be set to any integer value in the range 2 to 100.metric
: The metric to use to compute distances in high dimensional space. Visit UMAP's documentation for more information. Defaults to euclidean
.n_jobs
: The number of parallel jobs to run. Defaults to 8
.random_state
: The random state to use for the UMAP algorithm.@misc{mcinnes2020umapuniformmanifoldapproximation,\n title={UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction},\n author={Leland McInnes and John Healy and James Melville},\n year={2020},\n eprint={1802.03426},\n archivePrefix={arXiv},\n primaryClass={stat.ML},\n url={https://arxiv.org/abs/1802.03426},\n}\n
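A minimal, illustrative pipeline sketch showing where this step usually sits (toy-sized data for brevity; a real run should use a larger dataset, and the embedding model name is just the one used elsewhere in these docs):
```python
from distilabel.models import SentenceTransformerEmbeddings
from distilabel.pipeline import Pipeline
from distilabel.steps import EmbeddingGeneration, LoadDataFromDicts, UMAP

with Pipeline(name="umap-projection") as pipeline:
    loader = LoadDataFromDicts(
        data=[{"text": "a chemistry student"}, {"text": "a music teacher"}, {"text": "a guitar teacher"}]
    )
    embeddings = EmbeddingGeneration(
        embeddings=SentenceTransformerEmbeddings(model="mixedbread-ai/mxbai-embed-large-v1")
    )
    umap = UMAP(n_components=2, metric="cosine")

    loader >> embeddings >> umap

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
```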
Source code in src/distilabel/steps/clustering/umap.py
class UMAP(GlobalStep):\n r\"\"\"UMAP is a general purpose manifold learning and dimension reduction algorithm.\n\n This is a `GlobalStep` that reduces the dimensionality of the embeddings using. Visit\n the `TextClustering` step for an example of use. The trained model is saved as an artifact\n when creating a distiset and pushing it to the Hugging Face Hub.\n\n Input columns:\n - embedding (`List[float]`): The original embeddings we want to reduce the dimension.\n\n Output columns:\n - projection (`List[float]`): Embedding reduced to the number of components specified,\n the size of the new embeddings will be determined by the `n_components`.\n\n Categories:\n - clustering\n - text-classification\n\n References:\n - [`UMAP repository`](https://github.com/lmcinnes/umap/tree/master)\n - [`UMAP documentation`](https://umap-learn.readthedocs.io/en/latest/)\n\n Attributes:\n - n_components: The dimension of the space to embed into. This defaults to 2 to\n provide easy visualization (that's probably what you want), but can\n reasonably be set to any integer value in the range 2 to 100.\n - metric: The metric to use to compute distances in high dimensional space.\n Visit UMAP's documentation for more information. Defaults to `euclidean`.\n - n_jobs: The number of parallel jobs to run. Defaults to `8`.\n - random_state: The random state to use for the UMAP algorithm.\n\n Runtime parameters:\n - `n_components`: The dimension of the space to embed into. This defaults to 2 to\n provide easy visualization (that's probably what you want), but can\n reasonably be set to any integer value in the range 2 to 100.\n - `metric`: The metric to use to compute distances in high dimensional space.\n Visit UMAP's documentation for more information. Defaults to `euclidean`.\n - `n_jobs`: The number of parallel jobs to run. Defaults to `8`.\n - `random_state`: The random state to use for the UMAP algorithm.\n\n Citations:\n ```\n @misc{mcinnes2020umapuniformmanifoldapproximation,\n title={UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction},\n author={Leland McInnes and John Healy and James Melville},\n year={2020},\n eprint={1802.03426},\n archivePrefix={arXiv},\n primaryClass={stat.ML},\n url={https://arxiv.org/abs/1802.03426},\n }\n ```\n \"\"\"\n\n n_components: Optional[RuntimeParameter[int]] = Field(\n default=2,\n description=(\n \"The dimension of the space to embed into. This defaults to 2 to \"\n \"provide easy visualization, but can reasonably be set to any \"\n \"integer value in the range 2 to 100.\"\n ),\n )\n metric: Optional[RuntimeParameter[str]] = Field(\n default=\"euclidean\",\n description=(\n \"The metric to use to compute distances in high dimensional space. \"\n \"Visit UMAP's documentation for more information.\"\n ),\n )\n n_jobs: Optional[RuntimeParameter[int]] = Field(\n default=8, description=\"The number of parallel jobs to run.\"\n )\n random_state: Optional[RuntimeParameter[int]] = Field(\n default=None, description=\"The random state to use for the UMAP algorithm.\"\n )\n\n _umap: Optional[\"_UMAP\"] = PrivateAttr(None)\n\n def load(self) -> None:\n super().load()\n if importlib.util.find_spec(\"umap\") is None:\n raise ImportError(\n \"`umap` package is not installed. 
Please install it using `pip install umap-learn`.\"\n )\n from umap import UMAP as _UMAP\n\n self._umap = _UMAP(\n n_components=self.n_components,\n metric=self.metric,\n n_jobs=self.n_jobs,\n random_state=self.random_state,\n )\n\n def unload(self) -> None:\n self._umap = None\n\n @property\n def inputs(self) -> List[str]:\n return [\"embedding\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"projection\"]\n\n def _save_model(self, model: Any) -> None:\n import joblib\n\n def save_model(path):\n with open(str(path / \"UMAP.joblib\"), \"wb\") as f:\n joblib.dump(model, f)\n\n self.save_artifact(\n name=\"UMAP_model\",\n write_function=lambda path: save_model(path),\n metadata={\n \"n_components\": self.n_components,\n \"metric\": self.metric,\n },\n )\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n # Shape of the embeddings is (n_samples, n_features)\n embeddings = np.array([input[\"embedding\"] for input in inputs])\n\n self._logger.info(\"\ud83c\udfcb\ufe0f\u200d\u2640\ufe0f Start UMAP training...\")\n mapper = self._umap.fit(embeddings)\n # Shape of the projection will be (n_samples, n_components)\n for input, projection in zip(inputs, mapper.embedding_):\n input[\"projection\"] = projection\n\n self._save_model(mapper)\n yield inputs\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.CombineOutputs","title":"CombineOutputs
","text":" Bases: Step
Combine the outputs of several upstream steps.
CombineOutputs
is a Step
that takes the outputs of several upstream steps and combines them to generate a new dictionary with all the keys/columns of the upstream steps' outputs.
Step
s): All the columns of the upstream steps outputs.Step
s): All the columns of the upstream steps outputs.Examples:
Combine dictionaries of a dataset:\n\n```python\nfrom distilabel.steps import CombineOutputs\n\ncombine_outputs = CombineOutputs()\ncombine_outputs.load()\n\nresult = next(\n combine_outputs.process(\n [{\"a\": 1, \"b\": 2}, {\"a\": 3, \"b\": 4}],\n [{\"c\": 5, \"d\": 6}, {\"c\": 7, \"d\": 8}],\n )\n)\n# [\n# {\"a\": 1, \"b\": 2, \"c\": 5, \"d\": 6},\n# {\"a\": 3, \"b\": 4, \"c\": 7, \"d\": 8},\n# ]\n```\n\nCombine upstream steps outputs in a pipeline:\n\n```python\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import CombineOutputs\n\nwith Pipeline() as pipeline:\n step_1 = ...\n step_2 = ...\n step_3 = ...\n combine = CombineOutputs()\n\n [step_1, step_2, step_3] >> combine\n```\n
Source code in src/distilabel/steps/columns/combine.py
class CombineOutputs(Step):\n \"\"\"Combine the outputs of several upstream steps.\n\n `CombineOutputs` is a `Step` that takes the outputs of several upstream steps and combines\n them to generate a new dictionary with all keys/columns of the upstream steps outputs.\n\n Input columns:\n - dynamic (based on the upstream `Step`s): All the columns of the upstream steps outputs.\n\n Output columns:\n - dynamic (based on the upstream `Step`s): All the columns of the upstream steps outputs.\n\n Categories:\n - columns\n\n Examples:\n\n Combine dictionaries of a dataset:\n\n ```python\n from distilabel.steps import CombineOutputs\n\n combine_outputs = CombineOutputs()\n combine_outputs.load()\n\n result = next(\n combine_outputs.process(\n [{\"a\": 1, \"b\": 2}, {\"a\": 3, \"b\": 4}],\n [{\"c\": 5, \"d\": 6}, {\"c\": 7, \"d\": 8}],\n )\n )\n # [\n # {\"a\": 1, \"b\": 2, \"c\": 5, \"d\": 6},\n # {\"a\": 3, \"b\": 4, \"c\": 7, \"d\": 8},\n # ]\n ```\n\n Combine upstream steps outputs in a pipeline:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps import CombineOutputs\n\n with Pipeline() as pipeline:\n step_1 = ...\n step_2 = ...\n step_3 = ...\n combine = CombineOutputs()\n\n [step_1, step_2, step_3] >> combine\n ```\n \"\"\"\n\n def process(self, *inputs: StepInput) -> \"StepOutput\":\n combined_outputs = []\n for output_dicts in zip(*inputs):\n combined_dict = {}\n for output_dict in output_dicts:\n combined_dict.update(\n {\n k: v\n for k, v in output_dict.items()\n if k != DISTILABEL_METADATA_KEY\n }\n )\n\n if any(\n DISTILABEL_METADATA_KEY in output_dict for output_dict in output_dicts\n ):\n combined_dict[DISTILABEL_METADATA_KEY] = merge_distilabel_metadata(\n *output_dicts\n )\n combined_outputs.append(combined_dict)\n\n yield combined_outputs\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.DeitaFiltering","title":"DeitaFiltering
","text":" Bases: GlobalStep
Filter dataset rows using DEITA filtering strategy.
Filter the dataset based on the DEITA score and the cosine distance between the embeddings. It's an implementation of the filtering step from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
Attributes:
Name Type Descriptiondata_budget
RuntimeParameter[int]
The desired size of the dataset after filtering.
diversity_threshold
RuntimeParameter[float]
If a row has a cosine distance with respect to it's nearest neighbor greater than this value, it will be included in the filtered dataset. Defaults to 0.9
.
normalize_embeddings
RuntimeParameter[bool]
Whether to normalize the embeddings before computing the cosine distance. Defaults to True
.
data_budget
: The desired size of the dataset after filtering.diversity_threshold
: If a row has a cosine distance with respect to it's nearest neighbor greater than this value, it will be included in the filtered dataset.float
): The score of the instruction generated by ComplexityScorer
step.float
): The score of the response generated by QualityScorer
step.List[float]
): The embedding generated for the conversation of the instruction-response pair using GenerateEmbeddings
step.float
): The DEITA score for the instruction-response pair.List[str]
): The scores used to compute the DEITA score.float
): The cosine distance between the embeddings of the instruction-response pair.What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
Examples:
Filter the dataset based on the DEITA score and the cosine distance between the embeddings:
from distilabel.steps import DeitaFiltering\n\ndeita_filtering = DeitaFiltering(data_budget=1)\n\ndeita_filtering.load()\n\nresult = next(\n deita_filtering.process(\n [\n {\n \"evol_instruction_score\": 0.5,\n \"evol_response_score\": 0.5,\n \"embedding\": [-8.12729941, -5.24642847, -6.34003029],\n },\n {\n \"evol_instruction_score\": 0.6,\n \"evol_response_score\": 0.6,\n \"embedding\": [2.99329242, 0.7800932, 0.7799726],\n },\n {\n \"evol_instruction_score\": 0.7,\n \"evol_response_score\": 0.7,\n \"embedding\": [10.29041806, 14.33088073, 13.00557506],\n },\n ],\n )\n)\n# >>> result\n# [{'evol_instruction_score': 0.5, 'evol_response_score': 0.5, 'embedding': [-8.12729941, -5.24642847, -6.34003029], 'deita_score': 0.25, 'deita_score_computed_with': ['evol_instruction_score', 'evol_response_score'], 'nearest_neighbor_distance': 1.9042812683723933}]\n
Citations @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n}\n
Source code in src/distilabel/steps/deita.py
class DeitaFiltering(GlobalStep):\n \"\"\"Filter dataset rows using DEITA filtering strategy.\n\n Filter the dataset based on the DEITA score and the cosine distance between the embeddings.\n It's an implementation of the filtering step from the paper 'What Makes Good Data\n for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.\n\n Attributes:\n data_budget: The desired size of the dataset after filtering.\n diversity_threshold: If a row has a cosine distance with respect to it's nearest\n neighbor greater than this value, it will be included in the filtered dataset.\n Defaults to `0.9`.\n normalize_embeddings: Whether to normalize the embeddings before computing the cosine\n distance. Defaults to `True`.\n\n Runtime parameters:\n - `data_budget`: The desired size of the dataset after filtering.\n - `diversity_threshold`: If a row has a cosine distance with respect to it's nearest\n neighbor greater than this value, it will be included in the filtered dataset.\n\n Input columns:\n - evol_instruction_score (`float`): The score of the instruction generated by\n `ComplexityScorer` step.\n - evol_response_score (`float`): The score of the response generated by\n `QualityScorer` step.\n - embedding (`List[float]`): The embedding generated for the conversation of the\n instruction-response pair using `GenerateEmbeddings` step.\n\n Output columns:\n - deita_score (`float`): The DEITA score for the instruction-response pair.\n - deita_score_computed_with (`List[str]`): The scores used to compute the DEITA\n score.\n - nearest_neighbor_distance (`float`): The cosine distance between the embeddings\n of the instruction-response pair.\n\n Categories:\n - filtering\n\n References:\n - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)\n\n Examples:\n Filter the dataset based on the DEITA score and the cosine distance between the embeddings:\n\n ```python\n from distilabel.steps import DeitaFiltering\n\n deita_filtering = DeitaFiltering(data_budget=1)\n\n deita_filtering.load()\n\n result = next(\n deita_filtering.process(\n [\n {\n \"evol_instruction_score\": 0.5,\n \"evol_response_score\": 0.5,\n \"embedding\": [-8.12729941, -5.24642847, -6.34003029],\n },\n {\n \"evol_instruction_score\": 0.6,\n \"evol_response_score\": 0.6,\n \"embedding\": [2.99329242, 0.7800932, 0.7799726],\n },\n {\n \"evol_instruction_score\": 0.7,\n \"evol_response_score\": 0.7,\n \"embedding\": [10.29041806, 14.33088073, 13.00557506],\n },\n ],\n )\n )\n # >>> result\n # [{'evol_instruction_score': 0.5, 'evol_response_score': 0.5, 'embedding': [-8.12729941, -5.24642847, -6.34003029], 'deita_score': 0.25, 'deita_score_computed_with': ['evol_instruction_score', 'evol_response_score'], 'nearest_neighbor_distance': 1.9042812683723933}]\n ```\n\n Citations:\n ```\n @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? 
A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n }\n ```\n \"\"\"\n\n data_budget: RuntimeParameter[int] = Field(\n default=None, description=\"The desired size of the dataset after filtering.\"\n )\n diversity_threshold: RuntimeParameter[float] = Field(\n default=0.9,\n description=\"If a row has a cosine distance with respect to it's nearest neighbor\"\n \" greater than this value, it will be included in the filtered dataset.\",\n )\n normalize_embeddings: RuntimeParameter[bool] = Field(\n default=True,\n description=\"Whether to normalize the embeddings before computing the cosine distance.\",\n )\n distance_metric: RuntimeParameter[Literal[\"cosine\", \"manhattan\"]] = Field(\n default=\"cosine\",\n description=\"The distance metric to use. Currently only 'cosine' is supported.\",\n )\n\n @property\n def inputs(self) -> List[str]:\n return [\"evol_instruction_score\", \"evol_response_score\", \"embedding\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"deita_score\", \"nearest_neighbor_distance\", \"deita_score_computed_with\"]\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Filter the dataset based on the DEITA score and the cosine distance between the\n embeddings.\n\n Args:\n inputs: The input data.\n\n Returns:\n The filtered dataset.\n \"\"\"\n inputs = self._compute_deita_score(inputs)\n inputs = self._compute_nearest_neighbor(inputs)\n inputs.sort(key=lambda x: x[\"deita_score\"], reverse=True)\n\n selected_rows = []\n for input in inputs:\n if len(selected_rows) >= self.data_budget: # type: ignore\n break\n if input[\"nearest_neighbor_distance\"] >= self.diversity_threshold:\n selected_rows.append(input)\n yield selected_rows\n\n def _compute_deita_score(self, inputs: StepInput) -> StepInput:\n \"\"\"Computes the DEITA score for each instruction-response pair. The DEITA score is\n the product of the instruction score and the response score.\n\n Args:\n inputs: The input data.\n\n Returns:\n The input data with the DEITA score computed.\n \"\"\"\n for input_ in inputs:\n evol_instruction_score = input_.get(\"evol_instruction_score\")\n evol_response_score = input_.get(\"evol_response_score\")\n\n if evol_instruction_score and evol_response_score:\n deita_score = evol_instruction_score * evol_response_score\n score_computed_with = [\"evol_instruction_score\", \"evol_response_score\"]\n elif evol_instruction_score:\n self._logger.warning(\n \"Response score is missing for the instruction-response pair. Using\"\n \" instruction score as DEITA score.\"\n )\n deita_score = evol_instruction_score\n score_computed_with = [\"evol_instruction_score\"]\n elif evol_response_score:\n self._logger.warning(\n \"Instruction score is missing for the instruction-response pair. Using\"\n \" response score as DEITA score.\"\n )\n deita_score = evol_response_score\n score_computed_with = [\"evol_response_score\"]\n else:\n self._logger.warning(\n \"Instruction and response scores are missing for the instruction-response\"\n \" pair. 
Setting DEITA score to 0.\"\n )\n deita_score = 0\n score_computed_with = []\n\n input_.update(\n {\n \"deita_score\": deita_score,\n \"deita_score_computed_with\": score_computed_with,\n }\n )\n return inputs\n\n def _compute_nearest_neighbor(self, inputs: StepInput) -> StepInput:\n \"\"\"Computes the cosine distance between the embeddings of the instruction-response\n pairs and the nearest neighbor.\n\n Args:\n inputs: The input data.\n\n Returns:\n The input data with the cosine distance computed.\n \"\"\"\n embeddings = np.array([input[\"embedding\"] for input in inputs])\n if self.normalize_embeddings:\n embeddings = self._normalize_embeddings(embeddings)\n self._logger.info(\"\ud83d\udccf Computing nearest neighbor distance...\")\n\n if self.distance_metric == \"cosine\":\n self._logger.info(\"\ud83d\udccf Using cosine distance.\")\n distances = self._cosine_distance(embeddings)\n else:\n self._logger.info(\"\ud83d\udccf Using manhattan distance.\")\n distances = self._manhattan_distance(embeddings)\n\n for distance, input in zip(distances, inputs):\n input[\"nearest_neighbor_distance\"] = distance\n return inputs\n\n def _normalize_embeddings(self, embeddings: np.ndarray) -> np.ndarray:\n \"\"\"Normalize the embeddings.\n\n Args:\n embeddings: The embeddings to normalize.\n\n Returns:\n The normalized embeddings.\n \"\"\"\n self._logger.info(\"\u2696\ufe0f Normalizing embeddings...\")\n norms = np.linalg.norm(embeddings, axis=1, keepdims=True)\n return embeddings / norms\n\n def _cosine_distance(self, embeddings: np.array) -> np.array: # type: ignore\n \"\"\"Computes the cosine distance between the embeddings.\n\n Args:\n embeddings: The embeddings.\n\n Returns:\n The cosine distance between the embeddings.\n \"\"\"\n cosine_similarity = np.dot(embeddings, embeddings.T)\n cosine_distance = 1 - cosine_similarity\n # Ignore self-distance\n np.fill_diagonal(cosine_distance, np.inf)\n return np.min(cosine_distance, axis=1)\n\n def _manhattan_distance(self, embeddings: np.array) -> np.array: # type: ignore\n \"\"\"Computes the manhattan distance between the embeddings.\n\n Args:\n embeddings: The embeddings.\n\n Returns:\n The manhattan distance between the embeddings.\n \"\"\"\n manhattan_distance = np.abs(embeddings[:, None] - embeddings).sum(-1)\n # Ignore self-distance\n np.fill_diagonal(manhattan_distance, np.inf)\n return np.min(manhattan_distance, axis=1)\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.DeitaFiltering.process","title":"process(inputs)
","text":"Filter the dataset based on the DEITA score and the cosine distance between the embeddings.
Parameters:
Name Type Description Defaultinputs
StepInput
The input data.
requiredReturns:
Type DescriptionStepOutput
The filtered dataset.
Source code insrc/distilabel/steps/deita.py
def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Filter the dataset based on the DEITA score and the cosine distance between the\n embeddings.\n\n Args:\n inputs: The input data.\n\n Returns:\n The filtered dataset.\n \"\"\"\n inputs = self._compute_deita_score(inputs)\n inputs = self._compute_nearest_neighbor(inputs)\n inputs.sort(key=lambda x: x[\"deita_score\"], reverse=True)\n\n selected_rows = []\n for input in inputs:\n if len(selected_rows) >= self.data_budget: # type: ignore\n break\n if input[\"nearest_neighbor_distance\"] >= self.diversity_threshold:\n selected_rows.append(input)\n yield selected_rows\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.DeitaFiltering._compute_deita_score","title":"_compute_deita_score(inputs)
","text":"Computes the DEITA score for each instruction-response pair. The DEITA score is the product of the instruction score and the response score.
Parameters:
Name Type Description Defaultinputs
StepInput
The input data.
requiredReturns:
Type DescriptionStepInput
The input data with the DEITA score computed.
Source code insrc/distilabel/steps/deita.py
def _compute_deita_score(self, inputs: StepInput) -> StepInput:\n \"\"\"Computes the DEITA score for each instruction-response pair. The DEITA score is\n the product of the instruction score and the response score.\n\n Args:\n inputs: The input data.\n\n Returns:\n The input data with the DEITA score computed.\n \"\"\"\n for input_ in inputs:\n evol_instruction_score = input_.get(\"evol_instruction_score\")\n evol_response_score = input_.get(\"evol_response_score\")\n\n if evol_instruction_score and evol_response_score:\n deita_score = evol_instruction_score * evol_response_score\n score_computed_with = [\"evol_instruction_score\", \"evol_response_score\"]\n elif evol_instruction_score:\n self._logger.warning(\n \"Response score is missing for the instruction-response pair. Using\"\n \" instruction score as DEITA score.\"\n )\n deita_score = evol_instruction_score\n score_computed_with = [\"evol_instruction_score\"]\n elif evol_response_score:\n self._logger.warning(\n \"Instruction score is missing for the instruction-response pair. Using\"\n \" response score as DEITA score.\"\n )\n deita_score = evol_response_score\n score_computed_with = [\"evol_response_score\"]\n else:\n self._logger.warning(\n \"Instruction and response scores are missing for the instruction-response\"\n \" pair. Setting DEITA score to 0.\"\n )\n deita_score = 0\n score_computed_with = []\n\n input_.update(\n {\n \"deita_score\": deita_score,\n \"deita_score_computed_with\": score_computed_with,\n }\n )\n return inputs\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.DeitaFiltering._compute_nearest_neighbor","title":"_compute_nearest_neighbor(inputs)
","text":"Computes the cosine distance between the embeddings of the instruction-response pairs and the nearest neighbor.
Parameters:
Name Type Description Defaultinputs
StepInput
The input data.
requiredReturns:
Type DescriptionStepInput
The input data with the cosine distance computed.
Source code insrc/distilabel/steps/deita.py
def _compute_nearest_neighbor(self, inputs: StepInput) -> StepInput:\n \"\"\"Computes the cosine distance between the embeddings of the instruction-response\n pairs and the nearest neighbor.\n\n Args:\n inputs: The input data.\n\n Returns:\n The input data with the cosine distance computed.\n \"\"\"\n embeddings = np.array([input[\"embedding\"] for input in inputs])\n if self.normalize_embeddings:\n embeddings = self._normalize_embeddings(embeddings)\n self._logger.info(\"\ud83d\udccf Computing nearest neighbor distance...\")\n\n if self.distance_metric == \"cosine\":\n self._logger.info(\"\ud83d\udccf Using cosine distance.\")\n distances = self._cosine_distance(embeddings)\n else:\n self._logger.info(\"\ud83d\udccf Using manhattan distance.\")\n distances = self._manhattan_distance(embeddings)\n\n for distance, input in zip(distances, inputs):\n input[\"nearest_neighbor_distance\"] = distance\n return inputs\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.DeitaFiltering._normalize_embeddings","title":"_normalize_embeddings(embeddings)
","text":"Normalize the embeddings.
Parameters:
Name Type Description Defaultembeddings
ndarray
The embeddings to normalize.
requiredReturns:
Type Descriptionndarray
The normalized embeddings.
Source code insrc/distilabel/steps/deita.py
def _normalize_embeddings(self, embeddings: np.ndarray) -> np.ndarray:\n \"\"\"Normalize the embeddings.\n\n Args:\n embeddings: The embeddings to normalize.\n\n Returns:\n The normalized embeddings.\n \"\"\"\n self._logger.info(\"\u2696\ufe0f Normalizing embeddings...\")\n norms = np.linalg.norm(embeddings, axis=1, keepdims=True)\n return embeddings / norms\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.DeitaFiltering._cosine_distance","title":"_cosine_distance(embeddings)
","text":"Computes the cosine distance between the embeddings.
Parameters:
Name Type Description Defaultembeddings
array
The embeddings.
requiredReturns:
Type Descriptionarray
The cosine distance between the embeddings.
Source code insrc/distilabel/steps/deita.py
def _cosine_distance(self, embeddings: np.array) -> np.array: # type: ignore\n \"\"\"Computes the cosine distance between the embeddings.\n\n Args:\n embeddings: The embeddings.\n\n Returns:\n The cosine distance between the embeddings.\n \"\"\"\n cosine_similarity = np.dot(embeddings, embeddings.T)\n cosine_distance = 1 - cosine_similarity\n # Ignore self-distance\n np.fill_diagonal(cosine_distance, np.inf)\n return np.min(cosine_distance, axis=1)\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.DeitaFiltering._manhattan_distance","title":"_manhattan_distance(embeddings)
","text":"Computes the manhattan distance between the embeddings.
Parameters:
Name Type Description Defaultembeddings
array
The embeddings.
requiredReturns:
Type Descriptionarray
The manhattan distance between the embeddings.
Source code insrc/distilabel/steps/deita.py
def _manhattan_distance(self, embeddings: np.array) -> np.array: # type: ignore\n \"\"\"Computes the manhattan distance between the embeddings.\n\n Args:\n embeddings: The embeddings.\n\n Returns:\n The manhattan distance between the embeddings.\n \"\"\"\n manhattan_distance = np.abs(embeddings[:, None] - embeddings).sum(-1)\n # Ignore self-distance\n np.fill_diagonal(manhattan_distance, np.inf)\n return np.min(manhattan_distance, axis=1)\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.EmbeddingGeneration","title":"EmbeddingGeneration
","text":" Bases: Step
Generate embeddings using an Embeddings
model.
EmbeddingGeneration
is a Step
that using an Embeddings
model generates sentence embeddings for the provided input texts.
Attributes:
Name Type Descriptionembeddings
Embeddings
the Embeddings
model used to generate the sentence embeddings.
str
): The text for which the sentence embedding has to be generated.List[Union[float, int]]
): the generated sentence embedding.Examples:
Generate sentence embeddings with Sentence Transformers:
from distilabel.models import SentenceTransformerEmbeddings\nfrom distilabel.steps import EmbeddingGeneration\n\nembedding_generation = EmbeddingGeneration(\n embeddings=SentenceTransformerEmbeddings(\n model=\"mixedbread-ai/mxbai-embed-large-v1\",\n )\n)\n\nembedding_generation.load()\n\nresult = next(embedding_generation.process([{\"text\": \"Hello, how are you?\"}]))\n# [{'text': 'Hello, how are you?', 'embedding': [0.06209656596183777, -0.015797119587659836, ...]}]\n
Source code in src/distilabel/steps/embeddings/embedding_generation.py
class EmbeddingGeneration(Step):\n \"\"\"Generate embeddings using an `Embeddings` model.\n\n `EmbeddingGeneration` is a `Step` that using an `Embeddings` model generates sentence\n embeddings for the provided input texts.\n\n Attributes:\n embeddings: the `Embeddings` model used to generate the sentence embeddings.\n\n Input columns:\n - text (`str`): The text for which the sentence embedding has to be generated.\n\n Output columns:\n - embedding (`List[Union[float, int]]`): the generated sentence embedding.\n\n Categories:\n - embedding\n\n Examples:\n Generate sentence embeddings with Sentence Transformers:\n\n ```python\n from distilabel.models import SentenceTransformerEmbeddings\n from distilabel.steps import EmbeddingGeneration\n\n embedding_generation = EmbeddingGeneration(\n embeddings=SentenceTransformerEmbeddings(\n model=\"mixedbread-ai/mxbai-embed-large-v1\",\n )\n )\n\n embedding_generation.load()\n\n result = next(embedding_generation.process([{\"text\": \"Hello, how are you?\"}]))\n # [{'text': 'Hello, how are you?', 'embedding': [0.06209656596183777, -0.015797119587659836, ...]}]\n ```\n\n \"\"\"\n\n embeddings: Embeddings\n\n @property\n def inputs(self) -> \"StepColumns\":\n return [\"text\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n return [\"embedding\", \"model_name\"]\n\n def load(self) -> None:\n \"\"\"Loads the `Embeddings` model.\"\"\"\n super().load()\n\n self.embeddings.load()\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n embeddings = self.embeddings.encode(inputs=[input[\"text\"] for input in inputs])\n for input, embedding in zip(inputs, embeddings):\n input[\"embedding\"] = embedding\n input[\"model_name\"] = self.embeddings.model_name\n yield inputs\n\n def unload(self) -> None:\n super().unload()\n self.embeddings.unload()\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.EmbeddingGeneration.load","title":"load()
","text":"Loads the Embeddings
model.
src/distilabel/steps/embeddings/embedding_generation.py
def load(self) -> None:\n \"\"\"Loads the `Embeddings` model.\"\"\"\n super().load()\n\n self.embeddings.load()\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FaissNearestNeighbour","title":"FaissNearestNeighbour
","text":" Bases: GlobalStep
Create a faiss
index to get the nearest neighbours.
FaissNearestNeighbour
is a GlobalStep
that creates a faiss
index using the Hugging Face datasets
library integration, and then gets the nearest neighbours and the scores or distance of the nearest neighbours for each input row.
Attributes:
Name Type Descriptiondevice
Optional[RuntimeParameter[Union[int, List[int]]]]
the CUDA device ID or a list of IDs to be used. If negative integer, it will use all the available GPUs. Defaults to None
.
string_factory
Optional[RuntimeParameter[str]]
the name of the factory to be used to build the faiss
index. Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes. Defaults to None
.
metric_type
Optional[RuntimeParameter[int]]
the metric to be used to measure the distance between the points. It's an integer and the recommended way to pass it is importing faiss
and then passing one of faiss.METRIC_x
variables. Defaults to None
.
k
Optional[RuntimeParameter[int]]
the number of nearest neighbours to search for each input row. Defaults to 1
.
search_batch_size
Optional[RuntimeParameter[int]]
the number of rows to include in a search batch. The value can be adjusted to maximize the resources usage or to avoid OOM issues. Defaults to 50
.
train_size
Optional[RuntimeParameter[int]]
If the index needs a training step, specifies how many vectors will be used to train the index.
Runtime parametersdevice
: the CUDA device ID or a list of IDs to be used. If negative integer, it will use all the available GPUs. Defaults to None
.string_factory
: the name of the factory to be used to build the faiss
index. Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes. Defaults to None
.metric_type
: the metric to be used to measure the distance between the points. It's an integer and the recommend way to pass it is importing faiss
and then passing one of faiss.METRIC_x
variables. Defaults to None
.k
: the number of nearest neighbours to search for each input row. Defaults to 1
.search_batch_size
: the number of rows to include in a search batch. The value can be adjusted to maximize the resources usage or to avoid OOM issues. Defaults to 50
.train_size
: If the index needs a training step, specifies how many vectors will be used to train the index.List[Union[float, int]]
): a sentence embedding.List[int]
): a list containing the indices of the k
nearest neighbours in the inputs for the row.List[float]
): a list containing the score or distance to each k
nearest neighbour in the inputs.The Faiss library
Examples:
Generating embeddings and getting the nearest neighbours:
from distilabel.models import SentenceTransformerEmbeddings\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import EmbeddingGeneration, FaissNearestNeighbour, LoadDataFromHub\n\nwith Pipeline(name=\"hello\") as pipeline:\n load_data = LoadDataFromHub(output_mappings={\"prompt\": \"text\"})\n\n embeddings = EmbeddingGeneration(\n embeddings=SentenceTransformerEmbeddings(\n model=\"mixedbread-ai/mxbai-embed-large-v1\"\n )\n )\n\n nearest_neighbours = FaissNearestNeighbour()\n\n load_data >> embeddings >> nearest_neighbours\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n load_data.name: {\n \"repo_id\": \"distilabel-internal-testing/instruction-dataset-mini\",\n \"split\": \"test\",\n },\n },\n use_cache=False,\n )\n
Citations @misc{douze2024faisslibrary,\n title={The Faiss library},\n author={Matthijs Douze and Alexandr Guzhva and Chengqi Deng and Jeff Johnson and Gergely Szilvasy and Pierre-Emmanuel Mazar\u00e9 and Maria Lomeli and Lucas Hosseini and Herv\u00e9 J\u00e9gou},\n year={2024},\n eprint={2401.08281},\n archivePrefix={arXiv},\n primaryClass={cs.LG},\n url={https://arxiv.org/abs/2401.08281},\n}\n
Source code in src/distilabel/steps/embeddings/nearest_neighbour.py
class FaissNearestNeighbour(GlobalStep):\n \"\"\"Create a `faiss` index to get the nearest neighbours.\n\n `FaissNearestNeighbour` is a `GlobalStep` that creates a `faiss` index using the Hugging\n Face `datasets` library integration, and then gets the nearest neighbours and the scores\n or distance of the nearest neighbours for each input row.\n\n Attributes:\n device: the CUDA device ID or a list of IDs to be used. If negative integer, it\n will use all the available GPUs. Defaults to `None`.\n string_factory: the name of the factory to be used to build the `faiss` index.\n Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes.\n Defaults to `None`.\n metric_type: the metric to be used to measure the distance between the points. It's\n an integer and the recommend way to pass it is importing `faiss` and then passing\n one of `faiss.METRIC_x` variables. Defaults to `None`.\n k: the number of nearest neighbours to search for each input row. Defaults to `1`.\n search_batch_size: the number of rows to include in a search batch. The value can\n be adjusted to maximize the resources usage or to avoid OOM issues. Defaults\n to `50`.\n train_size: If the index needs a training step, specifies how many vectors will be\n used to train the index.\n\n Runtime parameters:\n - `device`: the CUDA device ID or a list of IDs to be used. If negative integer,\n it will use all the available GPUs. Defaults to `None`.\n - `string_factory`: the name of the factory to be used to build the `faiss` index.\n Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes.\n Defaults to `None`.\n - `metric_type`: the metric to be used to measure the distance between the points.\n It's an integer and the recommend way to pass it is importing `faiss` and then\n passing one of `faiss.METRIC_x` variables. Defaults to `None`.\n - `k`: the number of nearest neighbours to search for each input row. Defaults to `1`.\n - `search_batch_size`: the number of rows to include in a search batch. The value\n can be adjusted to maximize the resources usage or to avoid OOM issues. 
Defaults\n to `50`.\n - `train_size`: If the index needs a training step, specifies how many vectors will\n be used to train the index.\n\n Input columns:\n - embedding (`List[Union[float, int]]`): a sentence embedding.\n\n Output columns:\n - nn_indices (`List[int]`): a list containing the indices of the `k` nearest neighbours\n in the inputs for the row.\n - nn_scores (`List[float]`): a list containing the score or distance to each `k`\n nearest neighbour in the inputs.\n\n Categories:\n - embedding\n\n References:\n - [`The Faiss library`](https://arxiv.org/abs/2401.08281)\n\n Examples:\n Generating embeddings and getting the nearest neighbours:\n\n ```python\n from distilabel.models import SentenceTransformerEmbeddings\n from distilabel.pipeline import Pipeline\n from distilabel.steps import EmbeddingGeneration, FaissNearestNeighbour, LoadDataFromHub\n\n with Pipeline(name=\"hello\") as pipeline:\n load_data = LoadDataFromHub(output_mappings={\"prompt\": \"text\"})\n\n embeddings = EmbeddingGeneration(\n embeddings=SentenceTransformerEmbeddings(\n model=\"mixedbread-ai/mxbai-embed-large-v1\"\n )\n )\n\n nearest_neighbours = FaissNearestNeighbour()\n\n load_data >> embeddings >> nearest_neighbours\n\n if __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n load_data.name: {\n \"repo_id\": \"distilabel-internal-testing/instruction-dataset-mini\",\n \"split\": \"test\",\n },\n },\n use_cache=False,\n )\n ```\n\n Citations:\n ```\n @misc{douze2024faisslibrary,\n title={The Faiss library},\n author={Matthijs Douze and Alexandr Guzhva and Chengqi Deng and Jeff Johnson and Gergely Szilvasy and Pierre-Emmanuel Mazar\u00e9 and Maria Lomeli and Lucas Hosseini and Herv\u00e9 J\u00e9gou},\n year={2024},\n eprint={2401.08281},\n archivePrefix={arXiv},\n primaryClass={cs.LG},\n url={https://arxiv.org/abs/2401.08281},\n }\n ```\n \"\"\"\n\n device: Optional[RuntimeParameter[Union[int, List[int]]]] = Field(\n default=None,\n description=\"The CUDA device ID or a list of IDs to be used. If negative integer,\"\n \" it will use all the available GPUs.\",\n )\n string_factory: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The name of the factory to be used to build the `faiss` index.\"\n \"Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes.\",\n )\n metric_type: Optional[RuntimeParameter[int]] = Field(\n default=None,\n description=\"The metric to be used to measure the distance between the points. It's\"\n \" an integer and the recommend way to pass it is importing `faiss` and thenpassing\"\n \" one of `faiss.METRIC_x` variables.\",\n )\n k: Optional[RuntimeParameter[int]] = Field(\n default=1,\n description=\"The number of nearest neighbours to search for each input row.\",\n )\n search_batch_size: Optional[RuntimeParameter[int]] = Field(\n default=50,\n description=\"The number of rows to include in a search batch. The value can be adjusted\"\n \" to maximize the resources usage or to avoid OOM issues.\",\n )\n train_size: Optional[RuntimeParameter[int]] = Field(\n default=None,\n description=\"If the index needs a training step, specifies how many vectors will be used to train the index.\",\n )\n\n def load(self) -> None:\n super().load()\n\n if importlib.util.find_spec(\"faiss\") is None:\n raise ImportError(\n \"`faiss` package is not installed. 
Please install it using `pip install\"\n \" faiss-cpu` or `pip install faiss-gpu`.\"\n )\n\n @property\n def inputs(self) -> List[str]:\n return [\"embedding\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"nn_indices\", \"nn_scores\"]\n\n def _build_index(self, inputs: List[Dict[str, Any]]) -> Dataset:\n \"\"\"Builds a `faiss` index using `datasets` integration.\n\n Args:\n inputs: a list of dictionaries.\n\n Returns:\n The build `datasets.Dataset` with its `faiss` index.\n \"\"\"\n dataset = Dataset.from_list(inputs)\n if self.train_size is not None and self.string_factory:\n self._logger.info(\"\ud83c\udfcb\ufe0f\u200d\u2640\ufe0f Starting Faiss index training...\")\n dataset.add_faiss_index(\n column=\"embedding\",\n device=self.device, # type: ignore\n string_factory=self.string_factory,\n metric_type=self.metric_type,\n train_size=self.train_size,\n )\n return dataset\n\n def _save_index(self, dataset: Dataset) -> None:\n \"\"\"Save the generated Faiss index as an artifact of the step.\n\n Args:\n dataset: the dataset with the `faiss` index built.\n \"\"\"\n self.save_artifact(\n name=\"faiss_index\",\n write_function=lambda path: dataset.save_faiss_index(\n index_name=\"embedding\", file=path / \"index.faiss\"\n ),\n metadata={\n \"num_rows\": len(dataset),\n \"embedding_dim\": len(dataset[0][\"embedding\"]),\n },\n )\n\n def _search(self, dataset: Dataset) -> Dataset:\n \"\"\"Search the top `k` nearest neighbours for each row in the dataset.\n\n Args:\n dataset: the dataset with the `faiss` index built.\n\n Returns:\n The updated dataset containing the top `k` nearest neighbours for each row,\n as well as the score or distance.\n \"\"\"\n\n def add_search_results(examples: Dict[str, List[Any]]) -> Dict[str, List[Any]]:\n queries = np.array(examples[\"embedding\"])\n results = dataset.search_batch(\n index_name=\"embedding\",\n queries=queries,\n k=self.k + 1, # type: ignore\n )\n examples[\"nn_indices\"] = [indices[1:] for indices in results.total_indices]\n examples[\"nn_scores\"] = [scores[1:] for scores in results.total_scores]\n return examples\n\n return dataset.map(\n add_search_results, batched=True, batch_size=self.search_batch_size\n )\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n dataset = self._build_index(inputs)\n dataset_with_search_results = self._search(dataset)\n self._save_index(dataset)\n yield dataset_with_search_results.to_list()\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FaissNearestNeighbour._build_index","title":"_build_index(inputs)
","text":"Builds a faiss
index using datasets
integration.
Parameters:
Name Type Description Defaultinputs
List[Dict[str, Any]]
a list of dictionaries.
requiredReturns:
Type DescriptionDataset
The build datasets.Dataset
with its faiss
index.
src/distilabel/steps/embeddings/nearest_neighbour.py
def _build_index(self, inputs: List[Dict[str, Any]]) -> Dataset:\n \"\"\"Builds a `faiss` index using `datasets` integration.\n\n Args:\n inputs: a list of dictionaries.\n\n Returns:\n The build `datasets.Dataset` with its `faiss` index.\n \"\"\"\n dataset = Dataset.from_list(inputs)\n if self.train_size is not None and self.string_factory:\n self._logger.info(\"\ud83c\udfcb\ufe0f\u200d\u2640\ufe0f Starting Faiss index training...\")\n dataset.add_faiss_index(\n column=\"embedding\",\n device=self.device, # type: ignore\n string_factory=self.string_factory,\n metric_type=self.metric_type,\n train_size=self.train_size,\n )\n return dataset\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FaissNearestNeighbour._save_index","title":"_save_index(dataset)
","text":"Save the generated Faiss index as an artifact of the step.
Parameters:
Name Type Description Defaultdataset
Dataset
the dataset with the faiss
index built.
src/distilabel/steps/embeddings/nearest_neighbour.py
def _save_index(self, dataset: Dataset) -> None:\n \"\"\"Save the generated Faiss index as an artifact of the step.\n\n Args:\n dataset: the dataset with the `faiss` index built.\n \"\"\"\n self.save_artifact(\n name=\"faiss_index\",\n write_function=lambda path: dataset.save_faiss_index(\n index_name=\"embedding\", file=path / \"index.faiss\"\n ),\n metadata={\n \"num_rows\": len(dataset),\n \"embedding_dim\": len(dataset[0][\"embedding\"]),\n },\n )\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FaissNearestNeighbour._search","title":"_search(dataset)
","text":"Search the top k
nearest neighbours for each row in the dataset.
Parameters:
Name Type Description Defaultdataset
Dataset
the dataset with the faiss
index built.
Returns:
Type DescriptionDataset
The updated dataset containing the top k
nearest neighbours for each row,
Dataset
as well as the score or distance.
Source code insrc/distilabel/steps/embeddings/nearest_neighbour.py
def _search(self, dataset: Dataset) -> Dataset:\n \"\"\"Search the top `k` nearest neighbours for each row in the dataset.\n\n Args:\n dataset: the dataset with the `faiss` index built.\n\n Returns:\n The updated dataset containing the top `k` nearest neighbours for each row,\n as well as the score or distance.\n \"\"\"\n\n def add_search_results(examples: Dict[str, List[Any]]) -> Dict[str, List[Any]]:\n queries = np.array(examples[\"embedding\"])\n results = dataset.search_batch(\n index_name=\"embedding\",\n queries=queries,\n k=self.k + 1, # type: ignore\n )\n examples[\"nn_indices\"] = [indices[1:] for indices in results.total_indices]\n examples[\"nn_scores\"] = [scores[1:] for scores in results.total_scores]\n return examples\n\n return dataset.map(\n add_search_results, batched=True, batch_size=self.search_batch_size\n )\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.EmbeddingDedup","title":"EmbeddingDedup
","text":" Bases: GlobalStep
Deduplicates text using embeddings.
EmbeddingDedup
is a Step that detects near-duplicates in datasets, using embeddings to compare the similarity between the texts. The typical workflow with this step is to have a dataset with precomputed embeddings and then (possibly using the FaissNearestNeighbour
) use the nn_indices
and nn_scores
to determine which texts are duplicates.
Attributes:
Name Type Descriptionthreshold
Optional[RuntimeParameter[float]]
the threshold to consider 2 examples as duplicates. It's dependent on the type of index that was used to generate the embeddings. For example, if the embeddings were generated using cosine similarity, a threshold of 0.9
would make all the texts with a cosine similarity above that value duplicates. Higher values detect fewer duplicates in such an index, so this should be taken into account when building it. Defaults to 0.9
.
threshold
: the threshold to consider 2 examples as duplicates.List[int]
): a list containing the indices of the k
nearest neighbours in the inputs for the row.List[float]
): a list containing the score or distance to each k
nearest neighbour in the inputs.bool
): boolean indicating whether the piece text
is not a duplicate, i.e. this text should be kept.Examples:
Deduplicate a list of texts using embedding information:\n\n```python\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import EmbeddingDedup\nfrom distilabel.steps import LoadDataFromDicts\n\nwith Pipeline() as pipeline:\n data = LoadDataFromDicts(\n data=[\n {\n \"persona\": \"A chemistry student or academic researcher interested in inorganic or physical chemistry, likely at an advanced undergraduate or graduate level, studying acid-base interactions and chemical bonding.\",\n \"embedding\": [\n 0.018477669046149742,\n -0.03748236608841726,\n 0.001919870620352492,\n 0.024918478063770535,\n 0.02348063521315178,\n 0.0038251285566308375,\n -0.01723884983037716,\n 0.02881971942372201,\n ],\n \"nn_indices\": [0, 1],\n \"nn_scores\": [\n 0.9164746999740601,\n 0.782106876373291,\n ],\n },\n {\n \"persona\": \"A music teacher or instructor focused on theoretical and practical piano lessons.\",\n \"embedding\": [\n -0.0023464179614082125,\n -0.07325472251663565,\n -0.06058678419516501,\n -0.02100326928586996,\n -0.013462744792362657,\n 0.027368447064244242,\n -0.003916070100455717,\n 0.01243614518480423,\n ],\n \"nn_indices\": [0, 2],\n \"nn_scores\": [\n 0.7552462220191956,\n 0.7261884808540344,\n ],\n },\n {\n \"persona\": \"A classical guitar teacher or instructor, likely with experience teaching beginners, who focuses on breaking down complex music notation into understandable steps for their students.\",\n \"embedding\": [\n -0.01630817942328242,\n -0.023760151552345232,\n -0.014249650090627883,\n -0.005713686451446624,\n -0.016033059279131567,\n 0.0071440908501058786,\n -0.05691099643425161,\n 0.01597412704817784,\n ],\n \"nn_indices\": [1, 2],\n \"nn_scores\": [\n 0.8107735514640808,\n 0.7172299027442932,\n ],\n },\n ],\n batch_size=batch_size,\n )\n # In general you should do something like this before the deduplication step, to obtain the\n # `nn_indices` and `nn_scores`. In this case the embeddings are already normalized, so there's\n # no need for it.\n # nn = FaissNearestNeighbour(\n # k=30,\n # metric_type=faiss.METRIC_INNER_PRODUCT,\n # search_batch_size=50,\n # train_size=len(dataset), # The number of embeddings to use for training\n # string_factory=\"IVF300_HNSW32,Flat\" # To use an index (optional, maybe required for big datasets)\n # )\n # Read more about the `string_factory` here:\n # https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index\n\n embedding_dedup = EmbeddingDedup(\n threshold=0.8,\n input_batch_size=batch_size,\n )\n\n data >> embedding_dedup\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False)\n ds = distiset[\"default\"][\"train\"]\n # Filter out the duplicates\n ds_dedup = ds.filter(lambda x: x[\"keep_row_after_embedding_filtering\"])\n```\n
Source code in src/distilabel/steps/filtering/embedding.py
class EmbeddingDedup(GlobalStep):\n \"\"\"Deduplicates text using embeddings.\n\n `EmbeddingDedup` is a Step that detects near-duplicates in datasets, using\n embeddings to compare the similarity between the texts. The typical workflow with this step\n would include having a dataset with embeddings precomputed, and then (possibly using the\n `FaissNearestNeighbour`) using the `nn_indices` and `nn_scores`, determine the texts that\n are duplicate.\n\n Attributes:\n threshold: the threshold to consider 2 examples as duplicates.\n It's dependent on the type of index that was used to generate the embeddings.\n For example, if the embeddings were generated using cosine similarity, a threshold\n of `0.9` would make all the texts with a cosine similarity above the value\n duplicates. Higher values detect less duplicates in such an index, but that should\n be taken into account when building it. Defaults to `0.9`.\n\n Runtime Parameters:\n - `threshold`: the threshold to consider 2 examples as duplicates.\n\n Input columns:\n - nn_indices (`List[int]`): a list containing the indices of the `k` nearest neighbours\n in the inputs for the row.\n - nn_scores (`List[float]`): a list containing the score or distance to each `k`\n nearest neighbour in the inputs.\n\n Output columns:\n - keep_row_after_embedding_filtering (`bool`): boolean indicating if the piece `text` is\n not a duplicate i.e. this text should be kept.\n\n Categories:\n - filtering\n\n Examples:\n\n Deduplicate a list of texts using embedding information:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps import EmbeddingDedup\n from distilabel.steps import LoadDataFromDicts\n\n with Pipeline() as pipeline:\n data = LoadDataFromDicts(\n data=[\n {\n \"persona\": \"A chemistry student or academic researcher interested in inorganic or physical chemistry, likely at an advanced undergraduate or graduate level, studying acid-base interactions and chemical bonding.\",\n \"embedding\": [\n 0.018477669046149742,\n -0.03748236608841726,\n 0.001919870620352492,\n 0.024918478063770535,\n 0.02348063521315178,\n 0.0038251285566308375,\n -0.01723884983037716,\n 0.02881971942372201,\n ],\n \"nn_indices\": [0, 1],\n \"nn_scores\": [\n 0.9164746999740601,\n 0.782106876373291,\n ],\n },\n {\n \"persona\": \"A music teacher or instructor focused on theoretical and practical piano lessons.\",\n \"embedding\": [\n -0.0023464179614082125,\n -0.07325472251663565,\n -0.06058678419516501,\n -0.02100326928586996,\n -0.013462744792362657,\n 0.027368447064244242,\n -0.003916070100455717,\n 0.01243614518480423,\n ],\n \"nn_indices\": [0, 2],\n \"nn_scores\": [\n 0.7552462220191956,\n 0.7261884808540344,\n ],\n },\n {\n \"persona\": \"A classical guitar teacher or instructor, likely with experience teaching beginners, who focuses on breaking down complex music notation into understandable steps for their students.\",\n \"embedding\": [\n -0.01630817942328242,\n -0.023760151552345232,\n -0.014249650090627883,\n -0.005713686451446624,\n -0.016033059279131567,\n 0.0071440908501058786,\n -0.05691099643425161,\n 0.01597412704817784,\n ],\n \"nn_indices\": [1, 2],\n \"nn_scores\": [\n 0.8107735514640808,\n 0.7172299027442932,\n ],\n },\n ],\n batch_size=batch_size,\n )\n # In general you should do something like this before the deduplication step, to obtain the\n # `nn_indices` and `nn_scores`. 
In this case the embeddings are already normalized, so there's\n # no need for it.\n # nn = FaissNearestNeighbour(\n # k=30,\n # metric_type=faiss.METRIC_INNER_PRODUCT,\n # search_batch_size=50,\n # train_size=len(dataset), # The number of embeddings to use for training\n # string_factory=\"IVF300_HNSW32,Flat\" # To use an index (optional, maybe required for big datasets)\n # )\n # Read more about the `string_factory` here:\n # https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index\n\n embedding_dedup = EmbeddingDedup(\n threshold=0.8,\n input_batch_size=batch_size,\n )\n\n data >> embedding_dedup\n\n if __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False)\n ds = distiset[\"default\"][\"train\"]\n # Filter out the duplicates\n ds_dedup = ds.filter(lambda x: x[\"keep_row_after_embedding_filtering\"])\n ```\n \"\"\"\n\n threshold: Optional[RuntimeParameter[float]] = Field(\n default=0.9,\n description=\"The threshold to consider 2 examples as duplicates. It's dependent \"\n \"on the type of index that was used to generate the embeddings. For example, if \"\n \"the embeddings were generated using cosine similarity, a threshold of `0.9` \"\n \"would make all the texts with a cosine similarity above the value duplicates. \"\n \"Higher values detect less duplicates in such an index, but that should be \"\n \"taken into account when building it.\",\n )\n\n @property\n def inputs(self) -> List[str]:\n return [\"nn_scores\", \"nn_indices\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"keep_row_after_embedding_filtering\"]\n\n @override\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n rows_to_remove = set()\n\n for input in track(inputs, description=\"Running Embedding deduplication...\"):\n input[\"keep_row_after_embedding_filtering\"] = True\n indices_scores = np.array(input[\"nn_scores\"]) > self.threshold\n indices = np.array(input[\"nn_indices\"])[indices_scores]\n if len(indices) > 0: # If there are any rows found over the threshold\n rows_to_remove.update(list(indices))\n\n # Remove duplicates and get the list of rows to remove\n for idx in rows_to_remove:\n inputs[idx][\"keep_row_after_embedding_filtering\"] = False\n\n yield inputs\n
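As a standalone illustration of the thresholding logic in `process`, the sketch below (toy `nn_indices`/`nn_scores` values, not from a real run) marks as a duplicate every row that appears as a neighbour of another row with a score above the threshold.

```python
import numpy as np

# Toy rows mimicking the step's expected input columns.
rows = [
    {"nn_indices": [1, 2], "nn_scores": [0.95, 0.40]},
    {"nn_indices": [0, 2], "nn_scores": [0.95, 0.35]},
    {"nn_indices": [0, 1], "nn_scores": [0.40, 0.35]},
]

threshold = 0.9
rows_to_remove = set()
for row in rows:
    row["keep_row_after_embedding_filtering"] = True
    over_threshold = np.array(row["nn_scores"]) > threshold
    rows_to_remove.update(np.array(row["nn_indices"])[over_threshold].tolist())

# Any row flagged as a too-similar neighbour of another row is dropped.
for idx in rows_to_remove:
    rows[idx]["keep_row_after_embedding_filtering"] = False
```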
"},{"location":"api/step_gallery/extra/#distilabel.steps.MinHashDedup","title":"MinHashDedup
","text":" Bases: Step
Deduplicates text using MinHash
and MinHashLSH
.
MinHashDedup
is a Step that detects near-duplicates in datasets. The idea roughly translates to the following steps: 1. Tokenize the text into words or ngrams. 2. Create a MinHash
for each text. 3. Store the MinHashes
in a MinHashLSH
. 4. Check if the MinHash
is already in the LSH
; if so, it is a duplicate.
Attributes:
Name Type Descriptionnum_perm
int
the number of permutations to use. Defaults to 128
.
seed
int
the seed to use for the MinHash. Defaults to 1
.
tokenizer
Literal['words', 'ngrams']
the tokenizer to use. Available ones are words
or ngrams
. If words
is selected, it tokenizes the text into words using nltk's word tokenizer. ngrams
computes the ngrams (together with the size n
). Defaults to words
.
n
Optional[int]
the size of the ngrams to use. Only relevant if tokenizer=\"ngrams\"
. Defaults to 5
.
threshold
float
the threshold to consider two MinHashes as duplicates. Values closer to 0 detect more duplicates. Defaults to 0.9
.
storage
Literal['dict', 'disk']
the storage to use for the LSH. Can be dict
to store the index in memory, or disk
. Keep in mind, disk
is an experimental feature not defined in datasketch
, that is based on DiskCache's Index
class. It should work as a dict
, but backed by disk; depending on the system it can be slower. Defaults to dict
.
str
): the texts to be filtered.bool
): boolean indicating if the piece text
is not a duplicate i.e. this text should be kept.datasketch documentation
Examples:
Deduplicate a list of texts using MinHash and MinHashLSH:\n\n```python\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import MinHashDedup\nfrom distilabel.steps import LoadDataFromDicts\n\nwith Pipeline() as pipeline:\n ds_size = 1000\n batch_size = 500 # Bigger batch sizes work better for this step\n data = LoadDataFromDicts(\n data=[\n {\"text\": \"This is a test document.\"},\n {\"text\": \"This document is a test.\"},\n {\"text\": \"Test document for duplication.\"},\n {\"text\": \"Document for duplication test.\"},\n {\"text\": \"This is another unique document.\"},\n ]\n * (ds_size // 5),\n batch_size=batch_size,\n )\n minhash_dedup = MinHashDedup(\n tokenizer=\"words\",\n threshold=0.9, # lower values will increase the number of duplicates\n storage=\"dict\", # or \"disk\" for bigger datasets\n )\n\n data >> minhash_dedup\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False)\n ds = distiset[\"default\"][\"train\"]\n # Filter out the duplicates\n ds_dedup = ds.filter(lambda x: x[\"keep_row_after_minhash_filtering\"])\n```\n
Source code in src/distilabel/steps/filtering/minhash.py
class MinHashDedup(Step):\n \"\"\"Deduplicates text using `MinHash` and `MinHashLSH`.\n\n `MinHashDedup` is a Step that detects near-duplicates in datasets. The idea roughly translates\n to the following steps:\n 1. Tokenize the text into words or ngrams.\n 2. Create a `MinHash` for each text.\n 3. Store the `MinHashes` in a `MinHashLSH`.\n 4. Check if the `MinHash` is already in the `LSH`, if so, it is a duplicate.\n\n Attributes:\n num_perm: the number of permutations to use. Defaults to `128`.\n seed: the seed to use for the MinHash. Defaults to `1`.\n tokenizer: the tokenizer to use. Available ones are `words` or `ngrams`.\n If `words` is selected, it tokenizes the text into words using nltk's\n word tokenizer. `ngram` estimates the ngrams (together with the size\n `n`). Defaults to `words`.\n n: the size of the ngrams to use. Only relevant if `tokenizer=\"ngrams\"`. Defaults to `5`.\n threshold: the threshold to consider two MinHashes as duplicates.\n Values closer to 0 detect more duplicates. Defaults to `0.9`.\n storage: the storage to use for the LSH. Can be `dict` to store the index\n in memory, or `disk`. Keep in mind, `disk` is an experimental feature\n not defined in `datasketch`, that is based on DiskCache's `Index` class.\n It should work as a `dict`, but backed by disk, but depending on the system\n it can be slower. Defaults to `dict`.\n\n Input columns:\n - text (`str`): the texts to be filtered.\n\n Output columns:\n - keep_row_after_minhash_filtering (`bool`): boolean indicating if the piece `text` is\n not a duplicate i.e. this text should be kept.\n\n Categories:\n - filtering\n\n References:\n - [`datasketch documentation`](https://ekzhu.github.io/datasketch/lsh.html)\n - [Identifying and Filtering Near-Duplicate Documents](https://cs.brown.edu/courses/cs253/papers/nearduplicate.pdf)\n - [Diskcache's Index](https://grantjenks.com/docs/diskcache/api.html#diskcache.Index)\n\n Examples:\n\n Deduplicate a list of texts using MinHash and MinHashLSH:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps import MinHashDedup\n from distilabel.steps import LoadDataFromDicts\n\n with Pipeline() as pipeline:\n ds_size = 1000\n batch_size = 500 # Bigger batch sizes work better for this step\n data = LoadDataFromDicts(\n data=[\n {\"text\": \"This is a test document.\"},\n {\"text\": \"This document is a test.\"},\n {\"text\": \"Test document for duplication.\"},\n {\"text\": \"Document for duplication test.\"},\n {\"text\": \"This is another unique document.\"},\n ]\n * (ds_size // 5),\n batch_size=batch_size,\n )\n minhash_dedup = MinHashDedup(\n tokenizer=\"words\",\n threshold=0.9, # lower values will increase the number of duplicates\n storage=\"dict\", # or \"disk\" for bigger datasets\n )\n\n data >> minhash_dedup\n\n if __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False)\n ds = distiset[\"default\"][\"train\"]\n # Filter out the duplicates\n ds_dedup = ds.filter(lambda x: x[\"keep_row_after_minhash_filtering\"])\n ```\n \"\"\"\n\n num_perm: int = 128\n seed: int = 1\n tokenizer: Literal[\"words\", \"ngrams\"] = \"words\"\n n: Optional[int] = 5\n threshold: float = 0.9\n storage: Literal[\"dict\", \"disk\"] = \"dict\"\n\n _hasher: Union[\"MinHash\", None] = PrivateAttr(None)\n _tokenizer: Union[Callable, None] = PrivateAttr(None)\n _lhs: Union[\"MinHashLSH\", None] = PrivateAttr(None)\n\n def load(self) -> None:\n super().load()\n if not importlib.import_module(\"datasketch\"):\n raise ImportError(\n \"`datasketch` is needed to 
deduplicate with MinHash, but is not installed. \"\n \"Please install it using `pip install datasketch`.\"\n )\n from datasketch import MinHash\n\n from distilabel.steps.filtering._datasketch import MinHashLSH\n\n self._hasher = MinHash.bulk\n self._lsh = MinHashLSH(\n num_perm=self.num_perm,\n threshold=self.threshold,\n storage_config={\"type\": self.storage},\n )\n\n if self.tokenizer == \"words\":\n if not importlib.import_module(\"nltk\"):\n raise ImportError(\n \"`nltk` is needed to tokenize based on words, but is not installed. \"\n \"Please install it using `pip install nltk`. Then run `nltk.download('punkt_tab')`.\"\n )\n self._tokenizer = tokenized_on_words\n else:\n self._tokenizer = partial(tokenize_on_ngrams, n=self.n)\n\n def unload(self) -> None:\n super().unload()\n # In case of LSH being stored in disk, we need to close the file.\n if self.storage == \"disk\":\n self._lsh.close()\n\n @property\n def inputs(self) -> List[str]:\n return [\"text\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"keep_row_after_minhash_filtering\"]\n\n def process(self, inputs: StepInput) -> \"StepOutput\":\n tokenized_texts = []\n for input in inputs:\n tokenized_texts.append(self._tokenizer([input[self.inputs[0]]])[0])\n\n minhashes = self._hasher(\n tokenized_texts, num_perm=self.num_perm, seed=self.seed\n )\n\n for input, minhash in zip(inputs, minhashes):\n # Check if the text is already in the LSH index\n if self._lsh.query(minhash):\n input[\"keep_row_after_minhash_filtering\"] = False\n else:\n self._lsh.insert(str(uuid.uuid4()), minhash)\n input[\"keep_row_after_minhash_filtering\"] = True\n\n yield inputs\n
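For context, here is a minimal sketch of the `datasketch` primitives the step wraps (using plain `str.split` tokenization and toy texts for illustration, instead of the nltk-based tokenizer the step uses):

```python
import uuid

from datasketch import MinHash, MinHashLSH

texts = [
    "This is a test document.",
    "This document is a test.",
    "This is another unique document.",
]

lsh = MinHashLSH(threshold=0.9, num_perm=128)
keep = []
for text in texts:
    minhash = MinHash(num_perm=128, seed=1)
    for token in text.split():
        minhash.update(token.encode("utf-8"))
    if lsh.query(minhash):
        # A near-duplicate is already indexed, so this text is flagged as a duplicate.
        keep.append(False)
    else:
        lsh.insert(str(uuid.uuid4()), minhash)
        keep.append(True)
```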
"},{"location":"api/step_gallery/extra/#distilabel.steps.ConversationTemplate","title":"ConversationTemplate
","text":" Bases: Step
Generate a conversation template from an instruction and a response.
Input columnsstr
): The instruction to be used in the conversation.str
): The response to be used in the conversation.ChatType
): The conversation template.Examples:
Create a conversation from an instruction and a response:
from distilabel.steps import ConversationTemplate\n\nconv_template = ConversationTemplate()\nconv_template.load()\n\nresult = next(\n conv_template.process(\n [\n {\n \"instruction\": \"Hello\",\n \"response\": \"Hi\",\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'Hello', 'response': 'Hi', 'conversation': [{'role': 'user', 'content': 'Hello'}, {'role': 'assistant', 'content': 'Hi'}]}]\n
Source code in src/distilabel/steps/formatting/conversation.py
class ConversationTemplate(Step):\n \"\"\"Generate a conversation template from an instruction and a response.\n\n Input columns:\n - instruction (`str`): The instruction to be used in the conversation.\n - response (`str`): The response to be used in the conversation.\n\n Output columns:\n - conversation (`ChatType`): The conversation template.\n\n Categories:\n - format\n - chat\n - template\n\n Examples:\n Create a conversation from an instruction and a response:\n\n ```python\n from distilabel.steps import ConversationTemplate\n\n conv_template = ConversationTemplate()\n conv_template.load()\n\n result = next(\n conv_template.process(\n [\n {\n \"instruction\": \"Hello\",\n \"response\": \"Hi\",\n }\n ],\n )\n )\n # >>> result\n # [{'instruction': 'Hello', 'response': 'Hi', 'conversation': [{'role': 'user', 'content': 'Hello'}, {'role': 'assistant', 'content': 'Hi'}]}]\n ```\n \"\"\"\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The instruction and response.\"\"\"\n return [\"instruction\", \"response\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The conversation template.\"\"\"\n return [\"conversation\"]\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Generate a conversation template from an instruction and a response.\n\n Args:\n inputs: The input data.\n\n Yields:\n The input data with the conversation template.\n \"\"\"\n for input in inputs:\n input[\"conversation\"] = [\n {\"role\": \"user\", \"content\": input[\"instruction\"]},\n {\"role\": \"assistant\", \"content\": input[\"response\"]},\n ]\n yield inputs\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.ConversationTemplate.inputs","title":"inputs
property
","text":"The instruction and response.
"},{"location":"api/step_gallery/extra/#distilabel.steps.ConversationTemplate.outputs","title":"outputs
property
","text":"The conversation template.
"},{"location":"api/step_gallery/extra/#distilabel.steps.ConversationTemplate.process","title":"process(inputs)
","text":"Generate a conversation template from an instruction and a response.
Parameters:
Name Type Description Defaultinputs
StepInput
The input data.
requiredYields:
Type DescriptionStepOutput
The input data with the conversation template.
Source code in src/distilabel/steps/formatting/conversation.py
def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Generate a conversation template from an instruction and a response.\n\n Args:\n inputs: The input data.\n\n Yields:\n The input data with the conversation template.\n \"\"\"\n for input in inputs:\n input[\"conversation\"] = [\n {\"role\": \"user\", \"content\": input[\"instruction\"]},\n {\"role\": \"assistant\", \"content\": input[\"response\"]},\n ]\n yield inputs\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FormatChatGenerationDPO","title":"FormatChatGenerationDPO
","text":" Bases: Step
Format the output of a combination of a ChatGeneration
+ a preference task for Direct Preference Optimization (DPO).
FormatChatGenerationDPO
is a Step
that formats the output of the combination of a ChatGeneration
task with a preference Task
i.e. a task generating ratings
such as UltraFeedback
following the standard formatting from frameworks such as axolotl
or alignment-handbook
, so that those are used to rank the existing generations and provide the chosen
and rejected
generations based on the ratings
.
The messages
column should contain at least one message from the user, the generations
column should contain at least two generations, the ratings
column should contain the same number of ratings as generations.
List[Dict[str, str]]
): The conversation messages.List[str]
): The generations produced by the LLM
.List[str]
, optional): The model names used to generate the generations
, only available if the model_name
from the ChatGeneration
task/s is combined into a single column named this way, otherwise, it will be ignored.List[float]
): The ratings for each of the generations
, produced by a preference task such as UltraFeedback
.str
): The user message used to generate the generations
with the LLM
.str
): The SHA256
hash of the prompt
.List[Dict[str, str]]
): The chosen
generation based on the ratings
.str
, optional): The model name used to generate the chosen
generation, if the generation_models
are available.float
): The rating of the chosen
generation.List[Dict[str, str]]
): The rejected
generation based on the ratings
.str
, optional): The model name used to generate the rejected
generation, if the generation_models
are available.float
): The rating of the rejected
generation.Examples:
Format your dataset for DPO fine tuning:
from distilabel.steps import FormatChatGenerationDPO\n\nformat_dpo = FormatChatGenerationDPO()\nformat_dpo.load()\n\n# NOTE: \"generation_models\" can be added optionally.\nresult = next(\n format_dpo.process(\n [\n {\n \"messages\": [{\"role\": \"user\", \"content\": \"What's 2+2?\"}],\n \"generations\": [\"4\", \"5\", \"6\"],\n \"ratings\": [1, 0, -1],\n }\n ]\n )\n)\n# >>> result\n# [\n# {\n# 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}],\n# 'generations': ['4', '5', '6'],\n# 'ratings': [1, 0, -1],\n# 'prompt': \"What's 2+2?\",\n# 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n# 'chosen': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n# 'chosen_rating': 1,\n# 'rejected': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '6'}],\n# 'rejected_rating': -1\n# }\n# ]\n
Source code in src/distilabel/steps/formatting/dpo.py
class FormatChatGenerationDPO(Step):\n \"\"\"Format the output of a combination of a `ChatGeneration` + a preference task for Direct Preference Optimization (DPO).\n\n `FormatChatGenerationDPO` is a `Step` that formats the output of the combination of a `ChatGeneration`\n task with a preference `Task` i.e. a task generating `ratings` such as `UltraFeedback` following the standard\n formatting from frameworks such as `axolotl` or `alignment-handbook`., so that those are used to rank the\n existing generations and provide the `chosen` and `rejected` generations based on the `ratings`.\n\n Note:\n The `messages` column should contain at least one message from the user, the `generations`\n column should contain at least two generations, the `ratings` column should contain the same\n number of ratings as generations.\n\n Input columns:\n - messages (`List[Dict[str, str]]`): The conversation messages.\n - generations (`List[str]`): The generations produced by the `LLM`.\n - generation_models (`List[str]`, optional): The model names used to generate the `generations`,\n only available if the `model_name` from the `ChatGeneration` task/s is combined into a single\n column named this way, otherwise, it will be ignored.\n - ratings (`List[float]`): The ratings for each of the `generations`, produced by a preference\n task such as `UltraFeedback`.\n\n Output columns:\n - prompt (`str`): The user message used to generate the `generations` with the `LLM`.\n - prompt_id (`str`): The `SHA256` hash of the `prompt`.\n - chosen (`List[Dict[str, str]]`): The `chosen` generation based on the `ratings`.\n - chosen_model (`str`, optional): The model name used to generate the `chosen` generation,\n if the `generation_models` are available.\n - chosen_rating (`float`): The rating of the `chosen` generation.\n - rejected (`List[Dict[str, str]]`): The `rejected` generation based on the `ratings`.\n - rejected_model (`str`, optional): The model name used to generate the `rejected` generation,\n if the `generation_models` are available.\n - rejected_rating (`float`): The rating of the `rejected` generation.\n\n Categories:\n - format\n - chat-generation\n - preference\n - messages\n - generations\n\n Examples:\n Format your dataset for DPO fine tuning:\n\n ```python\n from distilabel.steps import FormatChatGenerationDPO\n\n format_dpo = FormatChatGenerationDPO()\n format_dpo.load()\n\n # NOTE: \"generation_models\" can be added optionally.\n result = next(\n format_dpo.process(\n [\n {\n \"messages\": [{\"role\": \"user\", \"content\": \"What's 2+2?\"}],\n \"generations\": [\"4\", \"5\", \"6\"],\n \"ratings\": [1, 0, -1],\n }\n ]\n )\n )\n # >>> result\n # [\n # {\n # 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}],\n # 'generations': ['4', '5', '6'],\n # 'ratings': [1, 0, -1],\n # 'prompt': \"What's 2+2?\",\n # 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n # 'chosen': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n # 'chosen_rating': 1,\n # 'rejected': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '6'}],\n # 'rejected_rating': -1\n # }\n # ]\n ```\n \"\"\"\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"List of inputs required by the `Step`, which in this case are: `messages`, `generations`,\n and `ratings`.\"\"\"\n return [\"messages\", \"generations\", \"ratings\"]\n\n @property\n def optional_inputs(self) -> List[str]:\n \"\"\"List of optional inputs, which are not required by 
the `Step` but used if available,\n which in this case is: `generation_models`.\"\"\"\n return [\"generation_models\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"List of outputs generated by the `Step`, which are: `prompt`, `prompt_id`, `chosen`,\n `chosen_model`, `chosen_rating`, `rejected`, `rejected_model`, `rejected_rating`. Both\n the `chosen_model` and `rejected_model` being optional and only used if `generation_models`\n is available.\n\n Reference:\n - Format inspired in https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k\n \"\"\"\n return [\n \"prompt\",\n \"prompt_id\",\n \"chosen\",\n \"chosen_model\",\n \"chosen_rating\",\n \"rejected\",\n \"rejected_model\",\n \"rejected_rating\",\n ]\n\n def process(self, *inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The `process` method formats the received `StepInput` or list of `StepInput`\n according to the DPO formatting standard.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with batches of formatted `StepInput` following the DPO standard.\n \"\"\"\n for input in inputs:\n for item in input:\n item[\"prompt\"] = next(\n (\n turn[\"content\"]\n for turn in item[\"messages\"]\n if turn[\"role\"] == \"user\"\n ),\n None,\n )\n item[\"prompt_id\"] = hashlib.sha256(\n item[\"prompt\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n chosen_idx = max(enumerate(item[\"ratings\"]), key=lambda x: x[1])[0]\n item[\"chosen\"] = item[\"messages\"] + [\n {\n \"role\": \"assistant\",\n \"content\": item[\"generations\"][chosen_idx],\n }\n ]\n if \"generation_models\" in item:\n item[\"chosen_model\"] = item[\"generation_models\"][chosen_idx]\n item[\"chosen_rating\"] = item[\"ratings\"][chosen_idx]\n\n rejected_idx = min(enumerate(item[\"ratings\"]), key=lambda x: x[1])[0]\n item[\"rejected\"] = item[\"messages\"] + [\n {\n \"role\": \"assistant\",\n \"content\": item[\"generations\"][rejected_idx],\n }\n ]\n if \"generation_models\" in item:\n item[\"rejected_model\"] = item[\"generation_models\"][rejected_idx]\n item[\"rejected_rating\"] = item[\"ratings\"][rejected_idx]\n\n yield input\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FormatChatGenerationDPO.inputs","title":"inputs
property
","text":"List of inputs required by the Step
, which in this case are: messages
, generations
, and ratings
.
optional_inputs
property
","text":"List of optional inputs, which are not required by the Step
but used if available, which in this case is: generation_models
.
outputs
property
","text":"List of outputs generated by the Step
, which are: prompt
, prompt_id
, chosen
, chosen_model
, chosen_rating
, rejected
, rejected_model
, rejected_rating
. Both the chosen_model
and rejected_model
being optional and only used if generation_models
is available.
process(*inputs)
","text":"The process
method formats the received StepInput
or list of StepInput
according to the DPO formatting standard.
Parameters:
Name Type Description Default*inputs
StepInput
A list of StepInput
to be combined.
()
Yields:
Type DescriptionStepOutput
A StepOutput
with batches of formatted StepInput
following the DPO standard.
src/distilabel/steps/formatting/dpo.py
def process(self, *inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The `process` method formats the received `StepInput` or list of `StepInput`\n according to the DPO formatting standard.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with batches of formatted `StepInput` following the DPO standard.\n \"\"\"\n for input in inputs:\n for item in input:\n item[\"prompt\"] = next(\n (\n turn[\"content\"]\n for turn in item[\"messages\"]\n if turn[\"role\"] == \"user\"\n ),\n None,\n )\n item[\"prompt_id\"] = hashlib.sha256(\n item[\"prompt\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n chosen_idx = max(enumerate(item[\"ratings\"]), key=lambda x: x[1])[0]\n item[\"chosen\"] = item[\"messages\"] + [\n {\n \"role\": \"assistant\",\n \"content\": item[\"generations\"][chosen_idx],\n }\n ]\n if \"generation_models\" in item:\n item[\"chosen_model\"] = item[\"generation_models\"][chosen_idx]\n item[\"chosen_rating\"] = item[\"ratings\"][chosen_idx]\n\n rejected_idx = min(enumerate(item[\"ratings\"]), key=lambda x: x[1])[0]\n item[\"rejected\"] = item[\"messages\"] + [\n {\n \"role\": \"assistant\",\n \"content\": item[\"generations\"][rejected_idx],\n }\n ]\n if \"generation_models\" in item:\n item[\"rejected_model\"] = item[\"generation_models\"][rejected_idx]\n item[\"rejected_rating\"] = item[\"ratings\"][rejected_idx]\n\n yield input\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FormatTextGenerationDPO","title":"FormatTextGenerationDPO
","text":" Bases: Step
Format the output of your LLMs for Direct Preference Optimization (DPO).
FormatTextGenerationDPO
is a Step
that formats the output of the combination of a TextGeneration
task with a preference Task
i.e. a task generating ratings
, so that those are used to rank the existing generations and provide the chosen
and rejected
generations based on the ratings
. Use this step to transform the output of a combination of a TextGeneration
+ a preference task such as UltraFeedback
following the standard formatting from frameworks such as axolotl
or alignment-handbook
.
The generations
column should contain at least two generations, the ratings
column should contain the same number of ratings as generations.
str
, optional): The system prompt used within the LLM
to generate the generations
, if available.str
): The instruction used to generate the generations
with the LLM
.List[str]
): The generations produced by the LLM
.List[str]
, optional): The model names used to generate the generations
, only available if the model_name
from the TextGeneration
task/s is combined into a single column named this way, otherwise, it will be ignored.List[float]
): The ratings for each of the generations
, produced by a preference task such as UltraFeedback
.str
): The instruction used to generate the generations
with the LLM
.str
): The SHA256
hash of the prompt
.List[Dict[str, str]]
): The chosen
generation based on the ratings
.str
, optional): The model name used to generate the chosen
generation, if the generation_models
are available.float
): The rating of the chosen
generation.List[Dict[str, str]]
): The rejected
generation based on the ratings
.str
, optional): The model name used to generate the rejected
generation, if the generation_models
are available.float
): The rating of the rejected
generation.Examples:
Format your dataset for DPO fine tuning:
from distilabel.steps import FormatTextGenerationDPO\n\nformat_dpo = FormatTextGenerationDPO()\nformat_dpo.load()\n\n# NOTE: Both \"system_prompt\" and \"generation_models\" can be added optionally.\nresult = next(\n format_dpo.process(\n [\n {\n \"instruction\": \"What's 2+2?\",\n \"generations\": [\"4\", \"5\", \"6\"],\n \"ratings\": [1, 0, -1],\n }\n ]\n )\n)\n# >>> result\n# [\n# { 'instruction': \"What's 2+2?\",\n# 'generations': ['4', '5', '6'],\n# 'ratings': [1, 0, -1],\n# 'prompt': \"What's 2+2?\",\n# 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n# 'chosen': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n# 'chosen_rating': 1,\n# 'rejected': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '6'}],\n# 'rejected_rating': -1\n# }\n# ]\n
Source code in src/distilabel/steps/formatting/dpo.py
class FormatTextGenerationDPO(Step):\n \"\"\"Format the output of your LLMs for Direct Preference Optimization (DPO).\n\n `FormatTextGenerationDPO` is a `Step` that formats the output of the combination of a `TextGeneration`\n task with a preference `Task` i.e. a task generating `ratings`, so that those are used to rank the\n existing generations and provide the `chosen` and `rejected` generations based on the `ratings`.\n Use this step to transform the output of a combination of a `TextGeneration` + a preference task such as\n `UltraFeedback` following the standard formatting from frameworks such as `axolotl` or `alignment-handbook`.\n\n Note:\n The `generations` column should contain at least two generations, the `ratings` column should\n contain the same number of ratings as generations.\n\n Input columns:\n - system_prompt (`str`, optional): The system prompt used within the `LLM` to generate the\n `generations`, if available.\n - instruction (`str`): The instruction used to generate the `generations` with the `LLM`.\n - generations (`List[str]`): The generations produced by the `LLM`.\n - generation_models (`List[str]`, optional): The model names used to generate the `generations`,\n only available if the `model_name` from the `TextGeneration` task/s is combined into a single\n column named this way, otherwise, it will be ignored.\n - ratings (`List[float]`): The ratings for each of the `generations`, produced by a preference\n task such as `UltraFeedback`.\n\n Output columns:\n - prompt (`str`): The instruction used to generate the `generations` with the `LLM`.\n - prompt_id (`str`): The `SHA256` hash of the `prompt`.\n - chosen (`List[Dict[str, str]]`): The `chosen` generation based on the `ratings`.\n - chosen_model (`str`, optional): The model name used to generate the `chosen` generation,\n if the `generation_models` are available.\n - chosen_rating (`float`): The rating of the `chosen` generation.\n - rejected (`List[Dict[str, str]]`): The `rejected` generation based on the `ratings`.\n - rejected_model (`str`, optional): The model name used to generate the `rejected` generation,\n if the `generation_models` are available.\n - rejected_rating (`float`): The rating of the `rejected` generation.\n\n Categories:\n - format\n - text-generation\n - preference\n - instruction\n - generations\n\n Examples:\n Format your dataset for DPO fine tuning:\n\n ```python\n from distilabel.steps import FormatTextGenerationDPO\n\n format_dpo = FormatTextGenerationDPO()\n format_dpo.load()\n\n # NOTE: Both \"system_prompt\" and \"generation_models\" can be added optionally.\n result = next(\n format_dpo.process(\n [\n {\n \"instruction\": \"What's 2+2?\",\n \"generations\": [\"4\", \"5\", \"6\"],\n \"ratings\": [1, 0, -1],\n }\n ]\n )\n )\n # >>> result\n # [\n # { 'instruction': \"What's 2+2?\",\n # 'generations': ['4', '5', '6'],\n # 'ratings': [1, 0, -1],\n # 'prompt': \"What's 2+2?\",\n # 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n # 'chosen': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n # 'chosen_rating': 1,\n # 'rejected': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '6'}],\n # 'rejected_rating': -1\n # }\n # ]\n ```\n \"\"\"\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"List of inputs required by the `Step`, which in this case are: `instruction`, `generations`,\n and `ratings`.\"\"\"\n return {\n \"system_prompt\": False,\n \"instruction\": True,\n 
\"generations\": True,\n \"generation_models\": False,\n \"ratings\": True,\n }\n\n @property\n def optional_inputs(self) -> List[str]:\n \"\"\"List of optional inputs, which are not required by the `Step` but used if available,\n which in this case are: `system_prompt`, and `generation_models`.\"\"\"\n return [\"system_prompt\", \"generation_models\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"List of outputs generated by the `Step`, which are: `prompt`, `prompt_id`, `chosen`,\n `chosen_model`, `chosen_rating`, `rejected`, `rejected_model`, `rejected_rating`. Both\n the `chosen_model` and `rejected_model` being optional and only used if `generation_models`\n is available.\n\n Reference:\n - Format inspired in https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k\n \"\"\"\n return [\n \"prompt\",\n \"prompt_id\",\n \"chosen\",\n \"chosen_model\",\n \"chosen_rating\",\n \"rejected\",\n \"rejected_model\",\n \"rejected_rating\",\n ]\n\n def process(self, *inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The `process` method formats the received `StepInput` or list of `StepInput`\n according to the DPO formatting standard.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with batches of formatted `StepInput` following the DPO standard.\n \"\"\"\n for input in inputs:\n for item in input:\n messages = [\n {\"role\": \"user\", \"content\": item[\"instruction\"]}, # type: ignore\n ]\n if (\n \"system_prompt\" in item\n and isinstance(item[\"system_prompt\"], str) # type: ignore\n and len(item[\"system_prompt\"]) > 0 # type: ignore\n ):\n messages.insert(\n 0,\n {\"role\": \"system\", \"content\": item[\"system_prompt\"]}, # type: ignore\n )\n\n item[\"prompt\"] = item[\"instruction\"]\n item[\"prompt_id\"] = hashlib.sha256(\n item[\"prompt\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n chosen_idx = max(enumerate(item[\"ratings\"]), key=lambda x: x[1])[0]\n item[\"chosen\"] = messages + [\n {\n \"role\": \"assistant\",\n \"content\": item[\"generations\"][chosen_idx],\n }\n ]\n if \"generation_models\" in item:\n item[\"chosen_model\"] = item[\"generation_models\"][chosen_idx]\n item[\"chosen_rating\"] = item[\"ratings\"][chosen_idx]\n\n rejected_idx = min(enumerate(item[\"ratings\"]), key=lambda x: x[1])[0]\n item[\"rejected\"] = messages + [\n {\n \"role\": \"assistant\",\n \"content\": item[\"generations\"][rejected_idx],\n }\n ]\n if \"generation_models\" in item:\n item[\"rejected_model\"] = item[\"generation_models\"][rejected_idx]\n item[\"rejected_rating\"] = item[\"ratings\"][rejected_idx]\n\n yield input\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FormatTextGenerationDPO.inputs","title":"inputs
property
","text":"List of inputs required by the Step
, which in this case are: instruction
, generations
, and ratings
.
optional_inputs
property
","text":"List of optional inputs, which are not required by the Step
but used if available, which in this case are: system_prompt
, and generation_models
.
outputs
property
","text":"List of outputs generated by the Step
, which are: prompt
, prompt_id
, chosen
, chosen_model
, chosen_rating
, rejected
, rejected_model
, rejected_rating
. Both the chosen_model
and rejected_model
being optional and only used if generation_models
is available.
process(*inputs)
","text":"The process
method formats the received StepInput
or list of StepInput
according to the DPO formatting standard.
Parameters:
Name Type Description Default*inputs
StepInput
A list of StepInput
to be combined.
()
Yields:
Type DescriptionStepOutput
A StepOutput
with batches of formatted StepInput
following the DPO standard.
src/distilabel/steps/formatting/dpo.py
def process(self, *inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The `process` method formats the received `StepInput` or list of `StepInput`\n according to the DPO formatting standard.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with batches of formatted `StepInput` following the DPO standard.\n \"\"\"\n for input in inputs:\n for item in input:\n messages = [\n {\"role\": \"user\", \"content\": item[\"instruction\"]}, # type: ignore\n ]\n if (\n \"system_prompt\" in item\n and isinstance(item[\"system_prompt\"], str) # type: ignore\n and len(item[\"system_prompt\"]) > 0 # type: ignore\n ):\n messages.insert(\n 0,\n {\"role\": \"system\", \"content\": item[\"system_prompt\"]}, # type: ignore\n )\n\n item[\"prompt\"] = item[\"instruction\"]\n item[\"prompt_id\"] = hashlib.sha256(\n item[\"prompt\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n chosen_idx = max(enumerate(item[\"ratings\"]), key=lambda x: x[1])[0]\n item[\"chosen\"] = messages + [\n {\n \"role\": \"assistant\",\n \"content\": item[\"generations\"][chosen_idx],\n }\n ]\n if \"generation_models\" in item:\n item[\"chosen_model\"] = item[\"generation_models\"][chosen_idx]\n item[\"chosen_rating\"] = item[\"ratings\"][chosen_idx]\n\n rejected_idx = min(enumerate(item[\"ratings\"]), key=lambda x: x[1])[0]\n item[\"rejected\"] = messages + [\n {\n \"role\": \"assistant\",\n \"content\": item[\"generations\"][rejected_idx],\n }\n ]\n if \"generation_models\" in item:\n item[\"rejected_model\"] = item[\"generation_models\"][rejected_idx]\n item[\"rejected_rating\"] = item[\"ratings\"][rejected_idx]\n\n yield input\n
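As a standalone illustration (reusing the toy values from the example above), this sketch derives the `prompt_id`, `chosen` and `rejected` fields from the ratings in the same way as the `process` method shown here: the highest-rated generation becomes `chosen` and the lowest-rated one becomes `rejected`.

```python
import hashlib

item = {"instruction": "What's 2+2?", "generations": ["4", "5", "6"], "ratings": [1, 0, -1]}

prompt = item["instruction"]
prompt_id = hashlib.sha256(prompt.encode("utf-8")).hexdigest()

chosen_idx = max(enumerate(item["ratings"]), key=lambda x: x[1])[0]
rejected_idx = min(enumerate(item["ratings"]), key=lambda x: x[1])[0]

messages = [{"role": "user", "content": prompt}]
chosen = messages + [{"role": "assistant", "content": item["generations"][chosen_idx]}]
rejected = messages + [{"role": "assistant", "content": item["generations"][rejected_idx]}]
```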
"},{"location":"api/step_gallery/extra/#distilabel.steps.FormatChatGenerationSFT","title":"FormatChatGenerationSFT
","text":" Bases: Step
Format the output of a ChatGeneration
task for Supervised Fine-Tuning (SFT).
FormatChatGenerationSFT
is a Step
that formats the output of a ChatGeneration
task for Supervised Fine-Tuning (SFT) following the standard formatting from frameworks such as axolotl
or alignment-handbook
. The output of the ChatGeneration
task is formatted into a chat-like conversation with the instruction
as the user message and the generation
as the assistant message. Optionally, if the system_prompt
is available, it is included as the first message in the conversation.
str
, optional): The system prompt used within the LLM
to generate the generation
, if available.str
): The instruction used to generate the generation
with the LLM
.str
): The generation produced by the LLM
.str
): The instruction used to generate the generation
with the LLM
.str
): The SHA256
hash of the prompt
.List[Dict[str, str]]
): The chat-like conversation with the instruction
as the user message and the generation
as the assistant message.Examples:
Format your dataset for SFT:
from distilabel.steps import FormatChatGenerationSFT\n\nformat_sft = FormatChatGenerationSFT()\nformat_sft.load()\n\n# NOTE: \"system_prompt\" can be added optionally.\nresult = next(\n format_sft.process(\n [\n {\n \"messages\": [{\"role\": \"user\", \"content\": \"What's 2+2?\"}],\n \"generation\": \"4\"\n }\n ]\n )\n)\n# >>> result\n# [\n# {\n# 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n# 'generation': '4',\n# 'prompt': 'What's 2+2?',\n# 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n# }\n# ]\n
Source code in src/distilabel/steps/formatting/sft.py
class FormatChatGenerationSFT(Step):\n \"\"\"Format the output of a `ChatGeneration` task for Supervised Fine-Tuning (SFT).\n\n `FormatChatGenerationSFT` is a `Step` that formats the output of a `ChatGeneration` task for\n Supervised Fine-Tuning (SFT) following the standard formatting from frameworks such as `axolotl`\n or `alignment-handbook`. The output of the `ChatGeneration` task is formatted into a chat-like\n conversation with the `instruction` as the user message and the `generation` as the assistant\n message. Optionally, if the `system_prompt` is available, it is included as the first message\n in the conversation.\n\n Input columns:\n - system_prompt (`str`, optional): The system prompt used within the `LLM` to generate the\n `generation`, if available.\n - instruction (`str`): The instruction used to generate the `generation` with the `LLM`.\n - generation (`str`): The generation produced by the `LLM`.\n\n Output columns:\n - prompt (`str`): The instruction used to generate the `generation` with the `LLM`.\n - prompt_id (`str`): The `SHA256` hash of the `prompt`.\n - messages (`List[Dict[str, str]]`): The chat-like conversation with the `instruction` as\n the user message and the `generation` as the assistant message.\n\n Categories:\n - format\n - chat-generation\n - instruction\n - generation\n\n Examples:\n Format your dataset for SFT:\n\n ```python\n from distilabel.steps import FormatChatGenerationSFT\n\n format_sft = FormatChatGenerationSFT()\n format_sft.load()\n\n # NOTE: \"system_prompt\" can be added optionally.\n result = next(\n format_sft.process(\n [\n {\n \"messages\": [{\"role\": \"user\", \"content\": \"What's 2+2?\"}],\n \"generation\": \"4\"\n }\n ]\n )\n )\n # >>> result\n # [\n # {\n # 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n # 'generation': '4',\n # 'prompt': 'What's 2+2?',\n # 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n # }\n # ]\n ```\n \"\"\"\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"List of inputs required by the `Step`, which in this case are: `instruction`, and `generation`.\"\"\"\n return [\"messages\", \"generation\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"List of outputs generated by the `Step`, which are: `prompt`, `prompt_id`, `messages`.\n\n Reference:\n - Format inspired in https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k\n \"\"\"\n return [\"prompt\", \"prompt_id\", \"messages\"]\n\n def process(self, *inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The `process` method formats the received `StepInput` or list of `StepInput`\n according to the SFT formatting standard.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with batches of formatted `StepInput` following the SFT standard.\n \"\"\"\n for input in inputs:\n for item in input:\n item[\"prompt\"] = next(\n (\n turn[\"content\"]\n for turn in item[\"messages\"]\n if turn[\"role\"] == \"user\"\n ),\n None,\n )\n\n item[\"prompt_id\"] = hashlib.sha256(\n item[\"prompt\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n item[\"messages\"] = item[\"messages\"] + [\n {\"role\": \"assistant\", \"content\": item[\"generation\"]}, # type: ignore\n ]\n yield input\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FormatChatGenerationSFT.inputs","title":"inputs
property
","text":"List of inputs required by the Step
, which in this case are: instruction
, and generation
.
outputs
property
","text":"List of outputs generated by the Step
, which are: prompt
, prompt_id
, messages
.
process(*inputs)
","text":"The process
method formats the received StepInput
or list of StepInput
according to the SFT formatting standard.
Parameters:
Name Type Description Default*inputs
StepInput
A list of StepInput
to be combined.
()
Yields:
Type DescriptionStepOutput
A StepOutput
with batches of formatted StepInput
following the SFT standard.
src/distilabel/steps/formatting/sft.py
def process(self, *inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The `process` method formats the received `StepInput` or list of `StepInput`\n according to the SFT formatting standard.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with batches of formatted `StepInput` following the SFT standard.\n \"\"\"\n for input in inputs:\n for item in input:\n item[\"prompt\"] = next(\n (\n turn[\"content\"]\n for turn in item[\"messages\"]\n if turn[\"role\"] == \"user\"\n ),\n None,\n )\n\n item[\"prompt_id\"] = hashlib.sha256(\n item[\"prompt\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n item[\"messages\"] = item[\"messages\"] + [\n {\"role\": \"assistant\", \"content\": item[\"generation\"]}, # type: ignore\n ]\n yield input\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FormatTextGenerationSFT","title":"FormatTextGenerationSFT
","text":" Bases: Step
Format the output of a TextGeneration
task for Supervised Fine-Tuning (SFT).
FormatTextGenerationSFT
is a Step
that formats the output of a TextGeneration
task for Supervised Fine-Tuning (SFT) following the standard formatting from frameworks such as axolotl
or alignment-handbook
. The output of the TextGeneration
task is formatted into a chat-like conversation with the instruction
as the user message and the generation
as the assistant message. Optionally, if the system_prompt
is available, it is included as the first message in the conversation.
str
, optional): The system prompt used within the LLM
to generate the generation
, if available.str
): The instruction used to generate the generation
with the LLM
.str
): The generation produced by the LLM
.str
): The instruction used to generate the generation
with the LLM
.str
): The SHA256
hash of the prompt
.List[Dict[str, str]]
): The chat-like conversation with the instruction
as the user message and the generation
as the assistant message.Examples:
Format your dataset for SFT fine tuning:
from distilabel.steps import FormatTextGenerationSFT\n\nformat_sft = FormatTextGenerationSFT()\nformat_sft.load()\n\n# NOTE: \"system_prompt\" can be added optionally.\nresult = next(\n format_sft.process(\n [\n {\n \"instruction\": \"What's 2+2?\",\n \"generation\": \"4\"\n }\n ]\n )\n)\n# >>> result\n# [\n# {\n# 'instruction': 'What's 2+2?',\n# 'generation': '4',\n# 'prompt': 'What's 2+2?',\n# 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n# 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}]\n# }\n# ]\n
Source code in src/distilabel/steps/formatting/sft.py
class FormatTextGenerationSFT(Step):\n \"\"\"Format the output of a `TextGeneration` task for Supervised Fine-Tuning (SFT).\n\n `FormatTextGenerationSFT` is a `Step` that formats the output of a `TextGeneration` task for\n Supervised Fine-Tuning (SFT) following the standard formatting from frameworks such as `axolotl`\n or `alignment-handbook`. The output of the `TextGeneration` task is formatted into a chat-like\n conversation with the `instruction` as the user message and the `generation` as the assistant\n message. Optionally, if the `system_prompt` is available, it is included as the first message\n in the conversation.\n\n Input columns:\n - system_prompt (`str`, optional): The system prompt used within the `LLM` to generate the\n `generation`, if available.\n - instruction (`str`): The instruction used to generate the `generation` with the `LLM`.\n - generation (`str`): The generation produced by the `LLM`.\n\n Output columns:\n - prompt (`str`): The instruction used to generate the `generation` with the `LLM`.\n - prompt_id (`str`): The `SHA256` hash of the `prompt`.\n - messages (`List[Dict[str, str]]`): The chat-like conversation with the `instruction` as\n the user message and the `generation` as the assistant message.\n\n Categories:\n - format\n - text-generation\n - instruction\n - generation\n\n Examples:\n Format your dataset for SFT fine tuning:\n\n ```python\n from distilabel.steps import FormatTextGenerationSFT\n\n format_sft = FormatTextGenerationSFT()\n format_sft.load()\n\n # NOTE: \"system_prompt\" can be added optionally.\n result = next(\n format_sft.process(\n [\n {\n \"instruction\": \"What's 2+2?\",\n \"generation\": \"4\"\n }\n ]\n )\n )\n # >>> result\n # [\n # {\n # 'instruction': 'What's 2+2?',\n # 'generation': '4',\n # 'prompt': 'What's 2+2?',\n # 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n # 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}]\n # }\n # ]\n ```\n \"\"\"\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"List of inputs required by the `Step`, which in this case are: `instruction`, and `generation`.\"\"\"\n return {\n \"system_prompt\": False,\n \"instruction\": True,\n \"generation\": True,\n }\n\n @property\n def optional_inputs(self) -> List[str]:\n \"\"\"List of optional inputs, which are not required by the `Step` but used if available,\n which in this case is: `system_prompt`.\"\"\"\n return [\"system_prompt\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"List of outputs generated by the `Step`, which are: `prompt`, `prompt_id`, `messages`.\n\n Reference:\n - Format inspired in https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k\n \"\"\"\n return [\"prompt\", \"prompt_id\", \"messages\"]\n\n def process(self, *inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The `process` method formats the received `StepInput` or list of `StepInput`\n according to the SFT formatting standard.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with batches of formatted `StepInput` following the SFT standard.\n \"\"\"\n for input in inputs:\n for item in input:\n item[\"prompt\"] = item[\"instruction\"]\n\n item[\"prompt_id\"] = hashlib.sha256(\n item[\"prompt\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n item[\"messages\"] = [\n {\"role\": \"user\", \"content\": item[\"instruction\"]}, # type: ignore\n {\"role\": \"assistant\", \"content\": item[\"generation\"]}, # type: 
ignore\n ]\n if (\n \"system_prompt\" in item\n and isinstance(item[\"system_prompt\"], str) # type: ignore\n and len(item[\"system_prompt\"]) > 0 # type: ignore\n ):\n item[\"messages\"].insert(\n 0,\n {\"role\": \"system\", \"content\": item[\"system_prompt\"]}, # type: ignore\n )\n\n yield input\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FormatTextGenerationSFT.inputs","title":"inputs
property
","text":"List of inputs required by the Step
, which in this case are: instruction
, and generation
.
optional_inputs
property
","text":"List of optional inputs, which are not required by the Step
but are used if available; in this case: system_prompt
.
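A minimal sketch (following the class example above, with a made-up system prompt) of how the optional system_prompt column is injected as the first message when it is present and non-empty:
from distilabel.steps import FormatTextGenerationSFT\n\nformat_sft = FormatTextGenerationSFT()\nformat_sft.load()\n\nresult = next(\n    format_sft.process(\n        [\n            {\n                \"system_prompt\": \"You are a helpful assistant.\",  # made-up system prompt\n                \"instruction\": \"What's 2+2?\",\n                \"generation\": \"4\"\n            }\n        ]\n    )\n)\n# >>> result[0][\"messages\"]\n# [{'role': 'system', 'content': 'You are a helpful assistant.'},\n#  {'role': 'user', 'content': \"What's 2+2?\"},\n#  {'role': 'assistant', 'content': '4'}]\n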
outputs
property
","text":"List of outputs generated by the Step
, which are: prompt
, prompt_id
, messages
.
process(*inputs)
","text":"The process
method formats the received StepInput
or list of StepInput
according to the SFT formatting standard.
Parameters:
Name Type Description Default*inputs
StepInput
A list of StepInput
to be combined.
()
Yields:
Type DescriptionStepOutput
A StepOutput
with batches of formatted StepInput
following the SFT standard.
src/distilabel/steps/formatting/sft.py
def process(self, *inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The `process` method formats the received `StepInput` or list of `StepInput`\n according to the SFT formatting standard.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with batches of formatted `StepInput` following the SFT standard.\n \"\"\"\n for input in inputs:\n for item in input:\n item[\"prompt\"] = item[\"instruction\"]\n\n item[\"prompt_id\"] = hashlib.sha256(\n item[\"prompt\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n item[\"messages\"] = [\n {\"role\": \"user\", \"content\": item[\"instruction\"]}, # type: ignore\n {\"role\": \"assistant\", \"content\": item[\"generation\"]}, # type: ignore\n ]\n if (\n \"system_prompt\" in item\n and isinstance(item[\"system_prompt\"], str) # type: ignore\n and len(item[\"system_prompt\"]) > 0 # type: ignore\n ):\n item[\"messages\"].insert(\n 0,\n {\"role\": \"system\", \"content\": item[\"system_prompt\"]}, # type: ignore\n )\n\n yield input\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.LoadDataFromDicts","title":"LoadDataFromDicts
","text":" Bases: GeneratorStep
Loads a dataset from a list of dictionaries.
GeneratorStep
that loads a dataset from a list of dictionaries and yields it in batches.
Attributes:
Name Type Descriptiondata
List[Dict[str, Any]]
The list of dictionaries to load the data from.
Runtime parametersbatch_size
: The batch size to use when processing the data.Examples:
Load data from a list of dictionaries:
from distilabel.steps import LoadDataFromDicts\n\nloader = LoadDataFromDicts(\n data=[{\"instruction\": \"What are 2+2?\"}] * 5,\n batch_size=2\n)\nloader.load()\n\nresult = next(loader.process())\n# >>> result\n# ([{'instruction': 'What are 2+2?'}, {'instruction': 'What are 2+2?'}], False)\n
Source code in src/distilabel/steps/generators/data.py
class LoadDataFromDicts(GeneratorStep):\n \"\"\"Loads a dataset from a list of dictionaries.\n\n `GeneratorStep` that loads a dataset from a list of dictionaries and yields it in\n batches.\n\n Attributes:\n data: The list of dictionaries to load the data from.\n\n Runtime parameters:\n - `batch_size`: The batch size to use when processing the data.\n\n Output columns:\n - dynamic (based on the keys found on the first dictionary of the list): The columns\n of the dataset.\n\n Categories:\n - load\n\n Examples:\n Load data from a list of dictionaries:\n\n ```python\n from distilabel.steps import LoadDataFromDicts\n\n loader = LoadDataFromDicts(\n data=[{\"instruction\": \"What are 2+2?\"}] * 5,\n batch_size=2\n )\n loader.load()\n\n result = next(loader.process())\n # >>> result\n # ([{'instruction': 'What are 2+2?'}, {'instruction': 'What are 2+2?'}], False)\n ```\n \"\"\"\n\n data: List[Dict[str, Any]] = Field(default_factory=list, exclude=True)\n\n @override\n def process(self, offset: int = 0) -> \"GeneratorStepOutput\": # type: ignore\n \"\"\"Yields batches from a list of dictionaries.\n\n Args:\n offset: The offset to start the generation from. Defaults to `0`.\n\n Yields:\n A list of Python dictionaries as read from the inputs (propagated in batches)\n and a flag indicating whether the yield batch is the last one.\n \"\"\"\n if offset:\n self.data = self.data[offset:]\n\n while self.data:\n batch = self.data[: self.batch_size]\n self.data = self.data[self.batch_size :]\n yield (\n batch,\n True if len(self.data) == 0 else False,\n )\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"Returns a list of strings with the names of the columns that the step will generate.\"\"\"\n return list(self.data[0].keys())\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.LoadDataFromDicts.outputs","title":"outputs
property
","text":"Returns a list of strings with the names of the columns that the step will generate.
"},{"location":"api/step_gallery/extra/#distilabel.steps.LoadDataFromDicts.process","title":"process(offset=0)
","text":"Yields batches from a list of dictionaries.
Parameters:
Name Type Description Defaultoffset
int
The offset to start the generation from. Defaults to 0
.
0
Yields:
Type DescriptionGeneratorStepOutput
A list of Python dictionaries as read from the inputs (propagated in batches)
GeneratorStepOutput
and a flag indicating whether the yielded batch is the last one.
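A minimal sketch (with made-up data) of how the offset parameter skips the first rows, for example when resuming from a cached run:
from distilabel.steps import LoadDataFromDicts\n\nloader = LoadDataFromDicts(\n    data=[{\"instruction\": f\"question {i}\"} for i in range(5)],\n    batch_size=2\n)\nloader.load()\n\n# Skip the first three dictionaries and resume from the fourth one.\nbatch, is_last = next(loader.process(offset=3))\n# >>> batch\n# [{'instruction': 'question 3'}, {'instruction': 'question 4'}]\n# >>> is_last\n# True\n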
Source code insrc/distilabel/steps/generators/data.py
@override\ndef process(self, offset: int = 0) -> \"GeneratorStepOutput\": # type: ignore\n \"\"\"Yields batches from a list of dictionaries.\n\n Args:\n offset: The offset to start the generation from. Defaults to `0`.\n\n Yields:\n A list of Python dictionaries as read from the inputs (propagated in batches)\n and a flag indicating whether the yield batch is the last one.\n \"\"\"\n if offset:\n self.data = self.data[offset:]\n\n while self.data:\n batch = self.data[: self.batch_size]\n self.data = self.data[self.batch_size :]\n yield (\n batch,\n True if len(self.data) == 0 else False,\n )\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.DataSampler","title":"DataSampler
","text":" Bases: GeneratorStep
Step to sample from a dataset.
GeneratorStep
that samples from a dataset and yields it in batches. This step is useful when a pipeline can benefit from including examples in its prompts, for instance as few-shot examples that change on every row. For example, you can pass a list of dictionaries with N examples and generate M samples from it (if combined with another step that loads data, M should match the number of rows loaded by that step). The size S argument is the number of samples drawn per row, so each generated row contains S examples to be used as few-shot examples.
Attributes:
Name Type Descriptiondata
List[Dict[str, Any]]
The list of dictionaries to sample from.
size
int
Number of samples per example. For example, in a few-shot learning scenario, the number of few-shot examples that will be generated per example. Defaults to 2.
samples
int
Number of examples that will be generated by the step in total. If used with another loader step, this should be the same as the number of samples in the loader step. Defaults to 100.
Output columnsExamples:
Sample data from a list of dictionaries:
from distilabel.steps import DataSampler\n\nsampler = DataSampler(\n data=[{\"sample\": f\"sample {i}\"} for i in range(30)],\n samples=10,\n size=2,\n batch_size=4\n)\nsampler.load()\n\nresult = next(sampler.process())\n# >>> result\n# ([{'sample': ['sample 7', 'sample 0']}, {'sample': ['sample 2', 'sample 21']}, {'sample': ['sample 17', 'sample 12']}, {'sample': ['sample 2', 'sample 14']}], False)\n
Pipeline with a loader and a sampler combined in a single stream:
from datasets import load_dataset\n\nfrom distilabel.steps import LoadDataFromDicts, DataSampler\nfrom distilabel.steps.tasks.apigen.utils import PrepareExamples\nfrom distilabel.pipeline import Pipeline\n\nds = (\n load_dataset(\"Salesforce/xlam-function-calling-60k\", split=\"train\")\n .shuffle(seed=42)\n .select(range(500))\n .to_list()\n)\ndata = [\n {\n \"func_name\": \"final_velocity\",\n \"func_desc\": \"Calculates the final velocity of an object given its initial velocity, acceleration, and time.\",\n },\n {\n \"func_name\": \"permutation_count\",\n \"func_desc\": \"Calculates the number of permutations of k elements from a set of n elements.\",\n },\n {\n \"func_name\": \"getdivision\",\n \"func_desc\": \"Divides two numbers by making an API call to a division service.\",\n },\n]\nwith Pipeline(name=\"APIGenPipeline\") as pipeline:\n loader_seeds = LoadDataFromDicts(data=data)\n sampler = DataSampler(\n data=ds,\n size=2,\n samples=len(data),\n batch_size=8,\n )\n prep_examples = PrepareExamples()\n\n sampler >> prep_examples\n (\n [loader_seeds, prep_examples]\n >> combine_steps\n )\n# Now we have a single stream of data with the loader and the sampler data\n
Source code in src/distilabel/steps/generators/data_sampler.py
class DataSampler(GeneratorStep):\n \"\"\"Step to sample from a dataset.\n\n `GeneratorStep` that samples from a dataset and yields it in batches.\n This step is useful when you have a pipeline that can benefit from using examples\n in the prompts for example as few-shot learning, that can be changing on each row.\n For example, you can pass a list of dictionaries with N examples and generate M samples\n from it (assuming you have another step loading data, this M should have the same size\n as the data being loaded in that step). The size S argument is the number of samples per\n row generated, so each example would contain S examples to be used as examples.\n\n Attributes:\n data: The list of dictionaries to sample from.\n size: Number of samples per example. For example in a few-shot learning scenario,\n the number of few-shot examples that will be generated per example. Defaults to 2.\n samples: Number of examples that will be generated by the step in total.\n If used with another loader step, this should be the same as the number\n of samples in the loader step. Defaults to 100.\n\n Output columns:\n - dynamic (based on the keys found on the first dictionary of the list): The columns\n of the dataset.\n\n Categories:\n - load\n\n Examples:\n Sample data from a list of dictionaries:\n\n ```python\n from distilabel.steps import DataSampler\n\n sampler = DataSampler(\n data=[{\"sample\": f\"sample {i}\"} for i in range(30)],\n samples=10,\n size=2,\n batch_size=4\n )\n sampler.load()\n\n result = next(sampler.process())\n # >>> result\n # ([{'sample': ['sample 7', 'sample 0']}, {'sample': ['sample 2', 'sample 21']}, {'sample': ['sample 17', 'sample 12']}, {'sample': ['sample 2', 'sample 14']}], False)\n ```\n\n Pipeline with a loader and a sampler combined in a single stream:\n\n ```python\n from datasets import load_dataset\n\n from distilabel.steps import LoadDataFromDicts, DataSampler\n from distilabel.steps.tasks.apigen.utils import PrepareExamples\n from distilabel.pipeline import Pipeline\n\n ds = (\n load_dataset(\"Salesforce/xlam-function-calling-60k\", split=\"train\")\n .shuffle(seed=42)\n .select(range(500))\n .to_list()\n )\n data = [\n {\n \"func_name\": \"final_velocity\",\n \"func_desc\": \"Calculates the final velocity of an object given its initial velocity, acceleration, and time.\",\n },\n {\n \"func_name\": \"permutation_count\",\n \"func_desc\": \"Calculates the number of permutations of k elements from a set of n elements.\",\n },\n {\n \"func_name\": \"getdivision\",\n \"func_desc\": \"Divides two numbers by making an API call to a division service.\",\n },\n ]\n with Pipeline(name=\"APIGenPipeline\") as pipeline:\n loader_seeds = LoadDataFromDicts(data=data)\n sampler = DataSampler(\n data=ds,\n size=2,\n samples=len(data),\n batch_size=8,\n )\n prep_examples = PrepareExamples()\n\n sampler >> prep_examples\n (\n [loader_seeds, prep_examples]\n >> combine_steps\n )\n # Now we have a single stream of data with the loader and the sampler data\n ```\n \"\"\"\n\n data: List[Dict[str, Any]] = Field(default_factory=list, exclude=True)\n size: int = Field(\n default=2,\n description=(\n \"Number of samples per example. For example in a few-shot learning scenario, the number \"\n \"of few-shot examples that will be generated per example.\"\n ),\n )\n samples: int = Field(\n default=100,\n description=(\n \"Number of examples that will be generated by the step in total. 
\"\n \"If used with another loader step, this should be the same as the number of \"\n \"samples in the loader step.\"\n ),\n )\n\n @override\n def process(self, offset: int = 0) -> \"GeneratorStepOutput\": # type: ignore\n \"\"\"Yields batches from a list of dictionaries.\n\n Args:\n offset: The offset to start the generation from. Defaults to `0`.\n\n Yields:\n A list of Python dictionaries as read from the inputs (propagated in batches)\n and a flag indicating whether the yield batch is the last one.\n \"\"\"\n\n total_samples = 0\n\n while total_samples < self.samples:\n batch = []\n bs = min(self.batch_size, self.samples - total_samples)\n for _ in range(self.batch_size):\n choices = random.choices(self.data, k=self.size)\n choices = self._transform_data(choices)\n batch.extend(choices)\n total_samples += bs\n batch = list(islice(batch, bs))\n yield (batch, True if total_samples >= self.samples else False)\n batch = []\n\n @staticmethod\n def _transform_data(data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:\n if not data:\n return []\n\n result = {key: [] for key in data[0].keys()}\n\n for item in data:\n for key, value in item.items():\n result[key].append(value)\n\n return [result]\n\n @property\n def outputs(self) -> List[str]:\n return list(self.data[0].keys())\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.DataSampler.process","title":"process(offset=0)
","text":"Yields batches from a list of dictionaries.
Parameters:
Name Type Description Defaultoffset
int
The offset to start the generation from. Defaults to 0
.
0
Yields:
Type DescriptionGeneratorStepOutput
A list of Python dictionaries as read from the inputs (propagated in batches)
GeneratorStepOutput
and a flag indicating whether the yielded batch is the last one.
Source code insrc/distilabel/steps/generators/data_sampler.py
@override\ndef process(self, offset: int = 0) -> \"GeneratorStepOutput\": # type: ignore\n \"\"\"Yields batches from a list of dictionaries.\n\n Args:\n offset: The offset to start the generation from. Defaults to `0`.\n\n Yields:\n A list of Python dictionaries as read from the inputs (propagated in batches)\n and a flag indicating whether the yield batch is the last one.\n \"\"\"\n\n total_samples = 0\n\n while total_samples < self.samples:\n batch = []\n bs = min(self.batch_size, self.samples - total_samples)\n for _ in range(self.batch_size):\n choices = random.choices(self.data, k=self.size)\n choices = self._transform_data(choices)\n batch.extend(choices)\n total_samples += bs\n batch = list(islice(batch, bs))\n yield (batch, True if total_samples >= self.samples else False)\n batch = []\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.RewardModelScore","title":"RewardModelScore
","text":" Bases: Step
, CudaDevicePlacementMixin
Assign a score to a response using a Reward Model.
RewardModelScore
is a Step
that, using a Reward Model (RM) loaded with transformers
, assigns a score to a response generated for an instruction, or a score to a multi-turn conversation.
Attributes:
Name Type Descriptionmodel
str
the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
revision
str
if model
refers to a Hugging Face Hub repository, then the revision (e.g. a branch name or a commit id) to use. Defaults to \"main\"
.
torch_dtype
str
the torch dtype to use for the model e.g. \"float16\", \"float32\", etc. Defaults to \"auto\"
.
trust_remote_code
bool
whether to allow fetching and executing remote code from the repository in the Hub. Defaults to False
.
device_map
Union[str, Dict[str, Any], None]
a dictionary mapping each layer of the model to a device, or a mode like \"sequential\"
or \"auto\"
. Defaults to None
.
token
Union[SecretStr, None]
the Hugging Face Hub token that will be used to authenticate to the Hugging Face Hub. If not provided, the HF_TOKEN
environment or huggingface_hub
package local configuration will be used. Defaults to None
.
truncation
bool
whether to truncate sequences at the maximum length. Defaults to False
.
max_length
Union[int, None]
maximum length to use for padding or truncation. Defaults to None
.
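A minimal sketch of configuring these attributes (the dtype and maximum length values below are illustrative, not defaults):
from distilabel.steps import RewardModelScore\n\nstep = RewardModelScore(\n    model=\"RLHFlow/ArmoRM-Llama3-8B-v0.1\",\n    device_map=\"auto\",\n    trust_remote_code=True,\n    torch_dtype=\"float16\",  # illustrative value\n    truncation=True,\n    max_length=2048,  # illustrative value\n)\n\nstep.load()\n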
str
, optional): the instruction used to generate a response
. If provided, then response
must be provided too.str
, optional): the response generated for instruction
. If provided, then instruction
must be provided too.ChatType
, optional): a multi-turn conversation. If not provided, then instruction
and response
columns must be provided.float
): the score given by the reward model for the instruction-response pair or the conversation.Examples:
Assigning a score to an instruction-response pair:
from distilabel.steps import RewardModelScore\n\nstep = RewardModelScore(\n model=\"RLHFlow/ArmoRM-Llama3-8B-v0.1\", device_map=\"auto\", trust_remote_code=True\n)\n\nstep.load()\n\nresult = next(\n step.process(\n inputs=[\n {\n \"instruction\": \"How much is 2+2?\",\n \"response\": \"The output of 2+2 is 4\",\n },\n {\"instruction\": \"How much is 2+2?\", \"response\": \"4\"},\n ]\n )\n)\n# [\n# {'instruction': 'How much is 2+2?', 'response': 'The output of 2+2 is 4', 'score': 0.11690367758274078},\n# {'instruction': 'How much is 2+2?', 'response': '4', 'score': 0.10300665348768234}\n# ]\n
Assigning a score to a multi-turn conversation:
from distilabel.steps import RewardModelScore\n\nstep = RewardModelScore(\n model=\"RLHFlow/ArmoRM-Llama3-8B-v0.1\", device_map=\"auto\", trust_remote_code=True\n)\n\nstep.load()\n\nresult = next(\n step.process(\n inputs=[\n {\n \"conversation\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n {\"role\": \"assistant\", \"content\": \"The output of 2+2 is 4\"},\n ],\n },\n {\n \"conversation\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n {\"role\": \"assistant\", \"content\": \"4\"},\n ],\n },\n ]\n )\n)\n# [\n# {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 'assistant', 'content': 'The output of 2+2 is 4'}], 'score': 0.11690367758274078},\n# {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 'assistant', 'content': '4'}], 'score': 0.10300665348768234}\n# ]\n
Source code in src/distilabel/steps/reward_model.py
class RewardModelScore(Step, CudaDevicePlacementMixin):\n \"\"\"Assign a score to a response using a Reward Model.\n\n `RewardModelScore` is a `Step` that using a Reward Model (RM) loaded using `transformers`,\n assigns an score to a response generated for an instruction, or a score to a multi-turn\n conversation.\n\n Attributes:\n model: the model Hugging Face Hub repo id or a path to a directory containing the\n model weights and configuration files.\n revision: if `model` refers to a Hugging Face Hub repository, then the revision\n (e.g. a branch name or a commit id) to use. Defaults to `\"main\"`.\n torch_dtype: the torch dtype to use for the model e.g. \"float16\", \"float32\", etc.\n Defaults to `\"auto\"`.\n trust_remote_code: whether to allow fetching and executing remote code fetched\n from the repository in the Hub. Defaults to `False`.\n device_map: a dictionary mapping each layer of the model to a device, or a mode like `\"sequential\"` or `\"auto\"`. Defaults to `None`.\n token: the Hugging Face Hub token that will be used to authenticate to the Hugging\n Face Hub. If not provided, the `HF_TOKEN` environment or `huggingface_hub` package\n local configuration will be used. Defaults to `None`.\n truncation: whether to truncate sequences at the maximum length. Defaults to `False`.\n max_length: maximun length to use for padding or truncation. Defaults to `None`.\n\n Input columns:\n - instruction (`str`, optional): the instruction used to generate a `response`.\n If provided, then `response` must be provided too.\n - response (`str`, optional): the response generated for `instruction`. If provided,\n then `instruction` must be provide too.\n - conversation (`ChatType`, optional): a multi-turn conversation. If not provided,\n then `instruction` and `response` columns must be provided.\n\n Output columns:\n - score (`float`): the score given by the reward model for the instruction-response\n pair or the conversation.\n\n Categories:\n - scorer\n\n Examples:\n Assigning an score for an instruction-response pair:\n\n ```python\n from distilabel.steps import RewardModelScore\n\n step = RewardModelScore(\n model=\"RLHFlow/ArmoRM-Llama3-8B-v0.1\", device_map=\"auto\", trust_remote_code=True\n )\n\n step.load()\n\n result = next(\n step.process(\n inputs=[\n {\n \"instruction\": \"How much is 2+2?\",\n \"response\": \"The output of 2+2 is 4\",\n },\n {\"instruction\": \"How much is 2+2?\", \"response\": \"4\"},\n ]\n )\n )\n # [\n # {'instruction': 'How much is 2+2?', 'response': 'The output of 2+2 is 4', 'score': 0.11690367758274078},\n # {'instruction': 'How much is 2+2?', 'response': '4', 'score': 0.10300665348768234}\n # ]\n ```\n\n Assigning an score for a multi-turn conversation:\n\n ```python\n from distilabel.steps import RewardModelScore\n\n step = RewardModelScore(\n model=\"RLHFlow/ArmoRM-Llama3-8B-v0.1\", device_map=\"auto\", trust_remote_code=True\n )\n\n step.load()\n\n result = next(\n step.process(\n inputs=[\n {\n \"conversation\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n {\"role\": \"assistant\", \"content\": \"The output of 2+2 is 4\"},\n ],\n },\n {\n \"conversation\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n {\"role\": \"assistant\", \"content\": \"4\"},\n ],\n },\n ]\n )\n )\n # [\n # {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 'assistant', 'content': 'The output of 2+2 is 4'}], 'score': 0.11690367758274078},\n # {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 
'assistant', 'content': '4'}], 'score': 0.10300665348768234}\n # ]\n ```\n \"\"\"\n\n model: str\n revision: str = \"main\"\n torch_dtype: str = \"auto\"\n trust_remote_code: bool = False\n device_map: Union[str, Dict[str, Any], None] = None\n token: Union[SecretStr, None] = Field(\n default_factory=lambda: os.getenv(HF_TOKEN_ENV_VAR), description=\"\"\n )\n truncation: bool = False\n max_length: Union[int, None] = None\n\n _model: Union[\"PreTrainedModel\", None] = PrivateAttr(None)\n _tokenizer: Union[\"PreTrainedTokenizer\", None] = PrivateAttr(None)\n\n def load(self) -> None:\n super().load()\n\n if self.device_map in [\"cuda\", \"auto\"]:\n CudaDevicePlacementMixin.load(self)\n\n try:\n from transformers import AutoModelForSequenceClassification, AutoTokenizer\n except ImportError as e:\n raise ImportError(\n \"`transformers` is not installed. Please install it using `pip install transformers`.\"\n ) from e\n\n token = self.token.get_secret_value() if self.token is not None else self.token\n\n self._model = AutoModelForSequenceClassification.from_pretrained(\n self.model,\n revision=self.revision,\n torch_dtype=self.torch_dtype,\n trust_remote_code=self.trust_remote_code,\n device_map=self.device_map,\n token=token,\n )\n self._tokenizer = AutoTokenizer.from_pretrained(\n self.model,\n revision=self.revision,\n torch_dtype=self.torch_dtype,\n trust_remote_code=self.trust_remote_code,\n token=token,\n )\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"Either `response` and `instruction`, or a `conversation` columns.\"\"\"\n return {\n \"response\": False,\n \"instruction\": False,\n \"conversation\": False,\n }\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The `score` given by the reward model.\"\"\"\n return [\"score\"]\n\n def _prepare_conversation(self, input: Dict[str, Any]) -> \"ChatType\":\n if \"instruction\" in input and \"response\" in input:\n return [\n {\"role\": \"user\", \"content\": input[\"instruction\"]},\n {\"role\": \"assistant\", \"content\": input[\"response\"]},\n ]\n\n return input[\"conversation\"]\n\n def _prepare_inputs(self, inputs: List[Dict[str, Any]]) -> \"torch.Tensor\":\n return self._tokenizer.apply_chat_template( # type: ignore\n [self._prepare_conversation(input) for input in inputs], # type: ignore\n return_tensors=\"pt\",\n padding=True,\n truncation=self.truncation,\n max_length=self.max_length,\n ).to(self._model.device) # type: ignore\n\n def _inference(self, inputs: List[Dict[str, Any]]) -> List[float]:\n import torch\n\n input_ids = self._prepare_inputs(inputs)\n with torch.no_grad():\n output = self._model(input_ids) # type: ignore\n logits = output.logits\n if logits.shape == (2, 1):\n logits = logits.squeeze(-1)\n return logits.tolist()\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n scores = self._inference(inputs)\n for input, score in zip(inputs, scores):\n input[\"score\"] = score\n yield inputs\n\n def unload(self) -> None:\n if self.device_map in [\"cuda\", \"auto\"]:\n CudaDevicePlacementMixin.unload(self)\n super().unload()\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.RewardModelScore.inputs","title":"inputs
property
","text":"Either response
and instruction
, or a conversation
columns.
outputs
property
","text":"The score
given by the reward model.
TruncateTextColumn
","text":" Bases: Step
Truncate a row using a tokenizer or the number of characters.
TruncateTextColumn
is a Step
that truncates a row according to the max length. If the tokenizer
is provided, then the row will be truncated using the tokenizer, and the max_length
will be used as the maximum number of tokens, otherwise it will be used as the maximum number of characters. The TruncateTextColumn
step is useful when one wants to truncate a row to a certain length, to avoid later errors in the model caused by excessive length.
Attributes:
Name Type Descriptioncolumn
str
the column to truncate. Defaults to \"text\"
.
max_length
int
the maximum length to use for truncation. If a tokenizer
is given, corresponds to the number of tokens, otherwise corresponds to the number of characters. Defaults to 8192
.
tokenizer
Optional[str]
the name of the tokenizer to use. If provided, the row will be truncated using the tokenizer. Defaults to None
.
column
attribute): The columns to be truncated, defaults to \"text\".column
attribute): The truncated column.Examples:
Truncating a row to a given number of tokens:
from distilabel.steps import TruncateTextColumn\n\ntrunc = TruncateTextColumn(\n tokenizer=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n max_length=4,\n column=\"text\"\n)\n\ntrunc.load()\n\nresult = next(\n trunc.process(\n [\n {\"text\": \"This is a sample text that is longer than 10 characters\"}\n ]\n )\n)\n# result\n# [{'text': 'This is a sample'}]\n
Truncating a row to a given number of characters:
from distilabel.steps import TruncateTextColumn\n\ntrunc = TruncateTextColumn(max_length=10)\n\ntrunc.load()\n\nresult = next(\n trunc.process(\n [\n {\"text\": \"This is a sample text that is longer than 10 characters\"}\n ]\n )\n)\n# result\n# [{'text': 'This is a '}]\n
Source code in src/distilabel/steps/truncate.py
class TruncateTextColumn(Step):\n \"\"\"Truncate a row using a tokenizer or the number of characters.\n\n `TruncateTextColumn` is a `Step` that truncates a row according to the max length. If\n the `tokenizer` is provided, then the row will be truncated using the tokenizer,\n and the `max_length` will be used as the maximum number of tokens, otherwise it will\n be used as the maximum number of characters. The `TruncateTextColumn` step is useful when one\n wants to truncate a row to a certain length, to avoid posterior errors in the model due\n to the length.\n\n Attributes:\n column: the column to truncate. Defaults to `\"text\"`.\n max_length: the maximum length to use for truncation.\n If a `tokenizer` is given, corresponds to the number of tokens,\n otherwise corresponds to the number of characters. Defaults to `8192`.\n tokenizer: the name of the tokenizer to use. If provided, the row will be\n truncated using the tokenizer. Defaults to `None`.\n\n Input columns:\n - dynamic (determined by `column` attribute): The columns to be truncated, defaults to \"text\".\n\n Output columns:\n - dynamic (determined by `column` attribute): The truncated column.\n\n Categories:\n - text-manipulation\n\n Examples:\n Truncating a row to a given number of tokens:\n\n ```python\n from distilabel.steps import TruncateTextColumn\n\n trunc = TruncateTextColumn(\n tokenizer=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n max_length=4,\n column=\"text\"\n )\n\n trunc.load()\n\n result = next(\n trunc.process(\n [\n {\"text\": \"This is a sample text that is longer than 10 characters\"}\n ]\n )\n )\n # result\n # [{'text': 'This is a sample'}]\n ```\n\n Truncating a row to a given number of characters:\n\n ```python\n from distilabel.steps import TruncateTextColumn\n\n trunc = TruncateTextColumn(max_length=10)\n\n trunc.load()\n\n result = next(\n trunc.process(\n [\n {\"text\": \"This is a sample text that is longer than 10 characters\"}\n ]\n )\n )\n # result\n # [{'text': 'This is a '}]\n ```\n \"\"\"\n\n column: str = \"text\"\n max_length: int = 8192\n tokenizer: Optional[str] = None\n _truncator: Optional[Callable[[str], str]] = None\n _tokenizer: Optional[Any] = None\n\n def load(self):\n super().load()\n if self.tokenizer:\n if not importlib.util.find_spec(\"transformers\"):\n raise ImportError(\n \"`transformers` is needed to tokenize, but is not installed. \"\n \"Please install it using `pip install transformers`.\"\n )\n\n from transformers import AutoTokenizer\n\n self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer)\n self._truncator = self._truncate_with_tokenizer\n else:\n self._truncator = self._truncate_with_length\n\n @property\n def inputs(self) -> List[str]:\n return [self.column]\n\n @property\n def outputs(self) -> List[str]:\n return self.inputs\n\n def _truncate_with_length(self, text: str) -> str:\n \"\"\"Truncates the text according to the number of characters.\"\"\"\n return text[: self.max_length]\n\n def _truncate_with_tokenizer(self, text: str) -> str:\n \"\"\"Truncates the text according to the number of characters using the tokenizer.\"\"\"\n return self._tokenizer.decode(\n self._tokenizer.encode(\n text,\n add_special_tokens=False,\n max_length=self.max_length,\n truncation=True,\n )\n )\n\n @override\n def process(self, inputs: StepInput) -> \"StepOutput\":\n for input in inputs:\n input[self.column] = self._truncator(input[self.column])\n yield inputs\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.TruncateTextColumn._truncate_with_length","title":"_truncate_with_length(text)
","text":"Truncates the text according to the number of characters.
Source code insrc/distilabel/steps/truncate.py
def _truncate_with_length(self, text: str) -> str:\n \"\"\"Truncates the text according to the number of characters.\"\"\"\n return text[: self.max_length]\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.TruncateTextColumn._truncate_with_tokenizer","title":"_truncate_with_tokenizer(text)
","text":"Truncates the text according to the number of characters using the tokenizer.
Source code insrc/distilabel/steps/truncate.py
def _truncate_with_tokenizer(self, text: str) -> str:\n \"\"\"Truncates the text according to the number of characters using the tokenizer.\"\"\"\n return self._tokenizer.decode(\n self._tokenizer.encode(\n text,\n add_special_tokens=False,\n max_length=self.max_length,\n truncation=True,\n )\n )\n
"},{"location":"api/step_gallery/hugging_face/","title":"Hugging Face","text":"This section contains the existing steps integrated with Hugging Face
so that generated datasets can be easily pushed to, and loaded from, the Hugging Face Hub.
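A minimal sketch (the destination repository ID is hypothetical) combining a loader and PushToHub in a single pipeline:
from distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromHub, PushToHub\n\nwith Pipeline(name=\"hub-roundtrip\") as pipeline:\n    loader = LoadDataFromHub(\n        repo_id=\"distilabel-internal-testing/instruction-dataset-mini\",\n        split=\"test\",\n        batch_size=2\n    )\n    push = PushToHub(\n        repo_id=\"my-org/my-synthetic-dataset\",  # hypothetical repository ID\n        private=True\n    )\n\n    loader >> push\n\ndistiset = pipeline.run()\n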
LoadDataFromDisk
","text":" Bases: LoadDataFromHub
Load a dataset that was previously saved to disk.
If you previously saved your dataset using the save_to_disk
method, or Distiset.save_to_disk
you can load it again to build a new pipeline using this class.
Attributes:
Name Type Descriptiondataset_path
RuntimeParameter[Union[str, Path]]
The path to the dataset or distiset.
split
Optional[RuntimeParameter[str]]
The split of the dataset to load (typically will be train
, test
or validation
).
config
Optional[RuntimeParameter[str]]
The configuration of the dataset to load. Defaults to default
. If there are multiple configurations in the dataset, this must be supplied or an error is raised.
batch_size
: The batch size to use when processing the data.dataset_path
: The path to the dataset or distiset.is_distiset
: Whether the dataset to load is a Distiset
or not. Defaults to False.split
: The split of the dataset to load. Defaults to 'train'.config
: The configuration of the dataset to load. Defaults to default
. If there are multiple configurations in the dataset, this must be supplied or an error is raised.num_examples
: The number of examples to load from the dataset. By default will load all examples.storage_options
: Key/value pairs to be passed on to the file-system backend, if any. Defaults to None
.all
): The columns that will be generated by this step, based on the datasets loaded from the Hugging Face Hub.Examples:
Load data from a Hugging Face Dataset:
from distilabel.steps import LoadDataFromDisk\n\nloader = LoadDataFromDisk(dataset_path=\"path/to/dataset\")\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
Load data from a distilabel Distiset:
from distilabel.steps import LoadDataFromDisk\n\n# Specify the configuration to load.\nloader = LoadDataFromDisk(\n dataset_path=\"path/to/dataset\",\n is_distiset=True,\n config=\"leaf_step_1\"\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'a': 1}, {'a': 2}, {'a': 3}], True)\n
Load data from a Hugging Face Dataset or Distiset in your cloud provider:
from distilabel.steps import LoadDataFromDisk\n\nloader = LoadDataFromDisk(\n dataset_path=\"gcs://path/to/dataset\",\n storage_options={\"project\": \"experiments-0001\"}\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
Source code in src/distilabel/steps/generators/huggingface.py
class LoadDataFromDisk(LoadDataFromHub):\n \"\"\"Load a dataset that was previously saved to disk.\n\n If you previously saved your dataset using the `save_to_disk` method, or\n `Distiset.save_to_disk` you can load it again to build a new pipeline using this class.\n\n Attributes:\n dataset_path: The path to the dataset or distiset.\n split: The split of the dataset to load (typically will be `train`, `test` or `validation`).\n config: The configuration of the dataset to load. Defaults to `default`, if there are\n multiple configurations in the dataset this must be suplied or an error is raised.\n\n Runtime parameters:\n - `batch_size`: The batch size to use when processing the data.\n - `dataset_path`: The path to the dataset or distiset.\n - `is_distiset`: Whether the dataset to load is a `Distiset` or not. Defaults to False.\n - `split`: The split of the dataset to load. Defaults to 'train'.\n - `config`: The configuration of the dataset to load. Defaults to `default`, if there are\n multiple configurations in the dataset this must be suplied or an error is raised.\n - `num_examples`: The number of examples to load from the dataset.\n By default will load all examples.\n - `storage_options`: Key/value pairs to be passed on to the file-system backend, if any.\n Defaults to `None`.\n\n Output columns:\n - dynamic (`all`): The columns that will be generated by this step, based on the\n datasets loaded from the Hugging Face Hub.\n\n Categories:\n - load\n\n Examples:\n Load data from a Hugging Face Dataset:\n\n ```python\n from distilabel.steps import LoadDataFromDisk\n\n loader = LoadDataFromDisk(dataset_path=\"path/to/dataset\")\n loader.load()\n\n # Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\n result = next(loader.process())\n # >>> result\n # ([{'type': 'function', 'function':...', False)\n ```\n\n Load data from a distilabel Distiset:\n\n ```python\n from distilabel.steps import LoadDataFromDisk\n\n # Specify the configuration to load.\n loader = LoadDataFromDisk(\n dataset_path=\"path/to/dataset\",\n is_distiset=True,\n config=\"leaf_step_1\"\n )\n loader.load()\n\n # Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\n result = next(loader.process())\n # >>> result\n # ([{'a': 1}, {'a': 2}, {'a': 3}], True)\n ```\n\n Load data from a Hugging Face Dataset or Distiset in your cloud provider:\n\n ```python\n from distilabel.steps import LoadDataFromDisk\n\n loader = LoadDataFromDisk(\n dataset_path=\"gcs://path/to/dataset\",\n storage_options={\"project\": \"experiments-0001\"}\n )\n loader.load()\n\n # Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\n result = next(loader.process())\n # >>> result\n # ([{'type': 'function', 'function':...', False)\n ```\n \"\"\"\n\n dataset_path: RuntimeParameter[Union[str, Path]] = Field(\n default=None,\n description=\"Path to the dataset or distiset.\",\n )\n config: Optional[RuntimeParameter[str]] = Field(\n default=\"default\",\n description=(\n \"The configuration of the dataset to load. Will default to 'default'\",\n \" which corresponds to a distiset with a single configuration.\",\n ),\n )\n is_distiset: Optional[RuntimeParameter[bool]] = Field(\n default=False,\n description=\"Whether the dataset to load is a `Distiset` or not. 
Defaults to False.\",\n )\n keep_in_memory: Optional[RuntimeParameter[bool]] = Field(\n default=None,\n description=\"Whether to copy the dataset in-memory, see `datasets.Dataset.load_from_disk` \"\n \" for more information. Defaults to `None`.\",\n )\n split: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The split of the dataset to load. By default will load the whole Dataset/Distiset.\",\n )\n repo_id: ExcludedField[Union[str, None]] = None\n\n def load(self) -> None:\n \"\"\"Load the dataset from the file/s in disk.\"\"\"\n super(GeneratorStep, self).load()\n if self.is_distiset:\n ds = Distiset.load_from_disk(\n self.dataset_path,\n keep_in_memory=self.keep_in_memory,\n storage_options=self.storage_options,\n )\n if self.config not in ds.keys():\n raise DistilabelUserError(\n f\"Configuration '{self.config}' not found in the Distiset, available ones\"\n f\" are: {list(ds.keys())}. Please try changing the `config` parameter to one \"\n \"of the available configurations.\\n\\n\",\n page=\"sections/how_to_guides/advanced/distiset/#using-the-distiset-dataset-object\",\n )\n ds = ds[self.config]\n\n else:\n ds = load_from_disk(\n self.dataset_path,\n keep_in_memory=self.keep_in_memory,\n storage_options=self.storage_options,\n )\n\n if self.split:\n ds = ds[self.split]\n\n self._dataset = ds\n\n if self.num_examples:\n self._dataset = self._dataset.select(range(self.num_examples))\n else:\n self.num_examples = len(self._dataset)\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The columns that will be generated by this step, based on the datasets from a file\n in disk.\n\n Returns:\n The columns that will be generated by this step.\n \"\"\"\n # We assume there are Dataset/IterableDataset, not it's ...Dict counterparts\n if self._dataset is None:\n self.load()\n\n return self._dataset.column_names\n
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromDisk.outputs","title":"outputs
property
","text":"The columns that will be generated by this step, based on the datasets from a file in disk.
Returns:
Type DescriptionList[str]
The columns that will be generated by this step.
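A minimal sketch (the path and the resulting column names are made up) of inspecting the columns after loading:
from distilabel.steps import LoadDataFromDisk\n\nloader = LoadDataFromDisk(dataset_path=\"path/to/dataset\")\nloader.load()\n\nloader.outputs\n# >>> ['instruction', 'generation']  # the columns of the dataset saved on disk\n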
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromDisk.load","title":"load()
","text":"Load the dataset from the file/s in disk.
Source code insrc/distilabel/steps/generators/huggingface.py
def load(self) -> None:\n \"\"\"Load the dataset from the file/s in disk.\"\"\"\n super(GeneratorStep, self).load()\n if self.is_distiset:\n ds = Distiset.load_from_disk(\n self.dataset_path,\n keep_in_memory=self.keep_in_memory,\n storage_options=self.storage_options,\n )\n if self.config not in ds.keys():\n raise DistilabelUserError(\n f\"Configuration '{self.config}' not found in the Distiset, available ones\"\n f\" are: {list(ds.keys())}. Please try changing the `config` parameter to one \"\n \"of the available configurations.\\n\\n\",\n page=\"sections/how_to_guides/advanced/distiset/#using-the-distiset-dataset-object\",\n )\n ds = ds[self.config]\n\n else:\n ds = load_from_disk(\n self.dataset_path,\n keep_in_memory=self.keep_in_memory,\n storage_options=self.storage_options,\n )\n\n if self.split:\n ds = ds[self.split]\n\n self._dataset = ds\n\n if self.num_examples:\n self._dataset = self._dataset.select(range(self.num_examples))\n else:\n self.num_examples = len(self._dataset)\n
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromFileSystem","title":"LoadDataFromFileSystem
","text":" Bases: LoadDataFromHub
Loads a dataset from a file in your filesystem.
GeneratorStep
that creates a dataset from a file in the filesystem using the Hugging Face datasets
library. Take a look at Hugging Face Datasets for more information about the supported file types.
Attributes:
Name Type Descriptiondata_files
RuntimeParameter[Union[str, Path]]
The path to the file, or the directory containing the files that make up the dataset.
split
RuntimeParameter[str]
The split of the dataset to load (typically will be train
, test
or validation
).
batch_size
: The batch size to use when processing the data.data_files
: The path to the file, or the directory containing the files that make up the dataset.split
: The split of the dataset to load. Defaults to 'train'.streaming
: Whether to load the dataset in streaming mode or not. Defaults to False
.num_examples
: The number of examples to load from the dataset. By default will load all examples.storage_options
: Key/value pairs to be passed on to the file-system backend, if any. Defaults to None
.filetype
: The expected filetype. If not provided, it will be inferred from the file extension. For more than one file, it will be inferred from the first file.all
): The columns that will be generated by this step, based on the datasets loaded from the Hugging Face Hub.Examples:
Load data from a Hugging Face dataset in your file system:
from distilabel.steps import LoadDataFromFileSystem\n\nloader = LoadDataFromFileSystem(data_files=\"path/to/dataset.jsonl\")\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
Specify a filetype if the file extension is not expected:
from distilabel.steps import LoadDataFromFileSystem\n\nloader = LoadDataFromFileSystem(filetype=\"csv\", data_files=\"path/to/dataset.txtr\")\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
Load data from a file in your cloud provider:
from distilabel.steps import LoadDataFromFileSystem\n\nloader = LoadDataFromFileSystem(\n data_files=\"gcs://path/to/dataset\",\n storage_options={\"project\": \"experiments-0001\"}\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
Load data passing a glob pattern:
from distilabel.steps import LoadDataFromFileSystem\n\nloader = LoadDataFromFileSystem(\n data_files=\"path/to/dataset/*.jsonl\",\n streaming=True\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
Source code in src/distilabel/steps/generators/huggingface.py
class LoadDataFromFileSystem(LoadDataFromHub):\n \"\"\"Loads a dataset from a file in your filesystem.\n\n `GeneratorStep` that creates a dataset from a file in the filesystem, uses Hugging Face `datasets`\n library. Take a look at [Hugging Face Datasets](https://huggingface.co/docs/datasets/loading)\n for more information of the supported file types.\n\n Attributes:\n data_files: The path to the file, or directory containing the files that conform\n the dataset.\n split: The split of the dataset to load (typically will be `train`, `test` or `validation`).\n\n Runtime parameters:\n - `batch_size`: The batch size to use when processing the data.\n - `data_files`: The path to the file, or directory containing the files that conform\n the dataset.\n - `split`: The split of the dataset to load. Defaults to 'train'.\n - `streaming`: Whether to load the dataset in streaming mode or not. Defaults to\n `False`.\n - `num_examples`: The number of examples to load from the dataset.\n By default will load all examples.\n - `storage_options`: Key/value pairs to be passed on to the file-system backend, if any.\n Defaults to `None`.\n - `filetype`: The expected filetype. If not provided, it will be inferred from the file extension.\n For more than one file, it will be inferred from the first file.\n\n Output columns:\n - dynamic (`all`): The columns that will be generated by this step, based on the\n datasets loaded from the Hugging Face Hub.\n\n Categories:\n - load\n\n Examples:\n Load data from a Hugging Face dataset in your file system:\n\n ```python\n from distilabel.steps import LoadDataFromFileSystem\n\n loader = LoadDataFromFileSystem(data_files=\"path/to/dataset.jsonl\")\n loader.load()\n\n # Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\n result = next(loader.process())\n # >>> result\n # ([{'type': 'function', 'function':...', False)\n ```\n\n Specify a filetype if the file extension is not expected:\n\n ```python\n from distilabel.steps import LoadDataFromFileSystem\n\n loader = LoadDataFromFileSystem(filetype=\"csv\", data_files=\"path/to/dataset.txtr\")\n loader.load()\n\n # Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\n result = next(loader.process())\n # >>> result\n # ([{'type': 'function', 'function':...', False)\n ```\n\n Load data from a file in your cloud provider:\n\n ```python\n from distilabel.steps import LoadDataFromFileSystem\n\n loader = LoadDataFromFileSystem(\n data_files=\"gcs://path/to/dataset\",\n storage_options={\"project\": \"experiments-0001\"}\n )\n loader.load()\n\n # Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\n result = next(loader.process())\n # >>> result\n # ([{'type': 'function', 'function':...', False)\n ```\n\n Load data passing a glob pattern:\n\n ```python\n from distilabel.steps import LoadDataFromFileSystem\n\n loader = LoadDataFromFileSystem(\n data_files=\"path/to/dataset/*.jsonl\",\n streaming=True\n )\n loader.load()\n\n # Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\n result = next(loader.process())\n # >>> result\n # ([{'type': 'function', 'function':...', False)\n ```\n \"\"\"\n\n data_files: RuntimeParameter[Union[str, Path]] = Field(\n default=None,\n description=\"The data files, or directory containing the data files, to generate the dataset from.\",\n )\n filetype: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The expected filetype. 
If not provided, it will be inferred from the file extension.\",\n )\n repo_id: ExcludedField[Union[str, None]] = None\n\n def load(self) -> None:\n \"\"\"Load the dataset from the file/s in disk.\"\"\"\n GeneratorStep.load(self)\n\n data_path = UPath(self.data_files, storage_options=self.storage_options)\n\n (data_files, self.filetype) = self._prepare_data_files(data_path)\n\n self._dataset = load_dataset(\n self.filetype,\n data_files=data_files,\n split=self.split,\n streaming=self.streaming,\n storage_options=self.storage_options,\n )\n\n if not self.streaming and self.num_examples:\n self._dataset = self._dataset.select(range(self.num_examples))\n if not self.num_examples:\n if self.streaming:\n # There's no better way to get the number of examples in a streaming dataset,\n # load it again for the moment.\n self.num_examples = len(\n load_dataset(\n self.filetype, data_files=self.data_files, split=self.split\n )\n )\n else:\n self.num_examples = len(self._dataset)\n\n @staticmethod\n def _prepare_data_files( # noqa: C901\n data_path: UPath,\n ) -> Tuple[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]], str]:\n \"\"\"Prepare the loading process by setting the `data_files` attribute.\n\n Args:\n data_path: The path to the data files, or directory containing the data files.\n\n Returns:\n Tuple with the data files and the filetype.\n \"\"\"\n\n def get_filetype(data_path: UPath) -> str:\n filetype = data_path.suffix.lstrip(\".\")\n if filetype == \"jsonl\":\n filetype = \"json\"\n return filetype\n\n if data_path.is_file() or (\n len(str(data_path.parent.glob(data_path.name))) >= 1\n ):\n filetype = get_filetype(data_path)\n data_files = str(data_path)\n\n elif data_path.is_dir():\n file_sequence = []\n file_map = defaultdict(list)\n for file_or_folder in data_path.iterdir():\n if file_or_folder.is_file():\n file_sequence.append(str(file_or_folder))\n elif file_or_folder.is_dir():\n for file in file_or_folder.iterdir():\n file_sequence.append(str(file))\n file_map[str(file_or_folder)].append(str(file))\n\n data_files = file_sequence or file_map\n # Try to obtain the filetype from any of the files, assuming all files have the same type.\n if file_sequence:\n filetype = get_filetype(UPath(file_sequence[0]))\n else:\n filetype = get_filetype(UPath(file_map[list(file_map.keys())[0]][0]))\n return data_files, filetype\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The columns that will be generated by this step, based on the datasets from a file\n in disk.\n\n Returns:\n The columns that will be generated by this step.\n \"\"\"\n # We assume there are Dataset/IterableDataset, not it's ...Dict counterparts\n if self._dataset is None:\n self.load()\n\n return self._dataset.column_names\n
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromFileSystem.outputs","title":"outputs
property
","text":"The columns that will be generated by this step, based on the datasets from a file in disk.
Returns:
Type DescriptionList[str]
The columns that will be generated by this step.
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromFileSystem.load","title":"load()
","text":"Load the dataset from the file/s in disk.
Source code insrc/distilabel/steps/generators/huggingface.py
def load(self) -> None:\n \"\"\"Load the dataset from the file/s in disk.\"\"\"\n GeneratorStep.load(self)\n\n data_path = UPath(self.data_files, storage_options=self.storage_options)\n\n (data_files, self.filetype) = self._prepare_data_files(data_path)\n\n self._dataset = load_dataset(\n self.filetype,\n data_files=data_files,\n split=self.split,\n streaming=self.streaming,\n storage_options=self.storage_options,\n )\n\n if not self.streaming and self.num_examples:\n self._dataset = self._dataset.select(range(self.num_examples))\n if not self.num_examples:\n if self.streaming:\n # There's no better way to get the number of examples in a streaming dataset,\n # load it again for the moment.\n self.num_examples = len(\n load_dataset(\n self.filetype, data_files=self.data_files, split=self.split\n )\n )\n else:\n self.num_examples = len(self._dataset)\n
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromHub","title":"LoadDataFromHub
","text":" Bases: GeneratorStep
Loads a dataset from the Hugging Face Hub.
GeneratorStep
that loads a dataset from the Hugging Face Hub using the datasets
library.
Attributes:
Name Type Descriptionrepo_id
RuntimeParameter[str]
The Hugging Face Hub repository ID of the dataset to load.
split
RuntimeParameter[str]
The split of the dataset to load.
config
Optional[RuntimeParameter[str]]
The configuration of the dataset to load. This is optional and only needed if the dataset has multiple configurations.
Runtime parametersbatch_size
: The batch size to use when processing the data.repo_id
: The Hugging Face Hub repository ID of the dataset to load.split
: The split of the dataset to load. Defaults to 'train'.config
: The configuration of the dataset to load. This is optional and only needed if the dataset has multiple configurations.revision
: The revision of the dataset to load. Defaults to the latest revision.streaming
: Whether to load the dataset in streaming mode or not. Defaults to False
.num_examples
: The number of examples to load from the dataset. By default will load all examples.storage_options
: Key/value pairs to be passed on to the file-system backend, if any. Defaults to None
.all
): The columns that will be generated by this step, based on the datasets loaded from the Hugging Face Hub.Examples:
Load data from a dataset in Hugging Face Hub:
from distilabel.steps import LoadDataFromHub\n\nloader = LoadDataFromHub(\n repo_id=\"distilabel-internal-testing/instruction-dataset-mini\",\n split=\"test\",\n batch_size=2\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'prompt': 'Arianna has 12...', False)\n
Source code in src/distilabel/steps/generators/huggingface.py
class LoadDataFromHub(GeneratorStep):\n \"\"\"Loads a dataset from the Hugging Face Hub.\n\n `GeneratorStep` that loads a dataset from the Hugging Face Hub using the `datasets`\n library.\n\n Attributes:\n repo_id: The Hugging Face Hub repository ID of the dataset to load.\n split: The split of the dataset to load.\n config: The configuration of the dataset to load. This is optional and only needed\n if the dataset has multiple configurations.\n\n Runtime parameters:\n - `batch_size`: The batch size to use when processing the data.\n - `repo_id`: The Hugging Face Hub repository ID of the dataset to load.\n - `split`: The split of the dataset to load. Defaults to 'train'.\n - `config`: The configuration of the dataset to load. This is optional and only\n needed if the dataset has multiple configurations.\n - `revision`: The revision of the dataset to load. Defaults to the latest revision.\n - `streaming`: Whether to load the dataset in streaming mode or not. Defaults to\n `False`.\n - `num_examples`: The number of examples to load from the dataset.\n By default will load all examples.\n - `storage_options`: Key/value pairs to be passed on to the file-system backend, if any.\n Defaults to `None`.\n\n Output columns:\n - dynamic (`all`): The columns that will be generated by this step, based on the\n datasets loaded from the Hugging Face Hub.\n\n Categories:\n - load\n\n Examples:\n Load data from a dataset in Hugging Face Hub:\n\n ```python\n from distilabel.steps import LoadDataFromHub\n\n loader = LoadDataFromHub(\n repo_id=\"distilabel-internal-testing/instruction-dataset-mini\",\n split=\"test\",\n batch_size=2\n )\n loader.load()\n\n # Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\n result = next(loader.process())\n # >>> result\n # ([{'prompt': 'Arianna has 12...', False)\n ```\n \"\"\"\n\n repo_id: RuntimeParameter[str] = Field(\n default=None,\n description=\"The Hugging Face Hub repository ID of the dataset to load.\",\n )\n split: RuntimeParameter[str] = Field(\n default=\"train\",\n description=\"The split of the dataset to load. Defaults to 'train'.\",\n )\n config: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The configuration of the dataset to load. This is optional and only\"\n \" needed if the dataset has multiple configurations.\",\n )\n revision: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The revision of the dataset to load. Defaults to the latest revision.\",\n )\n streaming: RuntimeParameter[bool] = Field(\n default=False,\n description=\"Whether to load the dataset in streaming mode or not. Defaults to False.\",\n )\n num_examples: Optional[RuntimeParameter[int]] = Field(\n default=None,\n description=\"The number of examples to load from the dataset. 
By default will load all examples.\",\n )\n storage_options: Optional[Dict[str, Any]] = Field(\n default=None,\n description=\"The storage options to use when loading the dataset.\",\n )\n\n _dataset: Union[IterableDataset, Dataset, None] = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Load the dataset from the Hugging Face Hub\"\"\"\n super().load()\n\n if self._dataset is not None:\n # Here to simplify the functionality of\u00a0distilabel.steps.generators.util.make_generator_step\n return\n\n self._dataset = load_dataset(\n self.repo_id, # type: ignore\n self.config,\n split=self.split,\n revision=self.revision,\n streaming=self.streaming,\n )\n num_examples = self._get_dataset_num_examples()\n self.num_examples = (\n min(self.num_examples, num_examples) if self.num_examples else num_examples\n )\n\n if not self.streaming:\n self._dataset = self._dataset.select(range(self.num_examples))\n\n def process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n \"\"\"Yields batches from the loaded dataset from the Hugging Face Hub.\n\n Args:\n offset: The offset to start yielding the data from. Will be used during the caching\n process to help skipping already processed data.\n\n Yields:\n A tuple containing a batch of rows and a boolean indicating if the batch is\n the last one.\n \"\"\"\n num_returned_rows = 0\n for batch_num, batch in enumerate(\n self._dataset.iter(batch_size=self.batch_size) # type: ignore\n ):\n if batch_num * self.batch_size < offset:\n continue\n transformed_batch = self._transform_batch(batch)\n batch_size = len(transformed_batch)\n num_returned_rows += batch_size\n yield transformed_batch, num_returned_rows >= self.num_examples\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The columns that will be generated by this step, based on the datasets loaded\n from the Hugging Face Hub.\n\n Returns:\n The columns that will be generated by this step.\n \"\"\"\n return self._get_dataset_columns()\n\n def _transform_batch(self, batch: Dict[str, Any]) -> List[Dict[str, Any]]:\n \"\"\"Transform a batch of data from the Hugging Face Hub into a list of rows.\n\n Args:\n batch: The batch of data from the Hugging Face Hub.\n\n Returns:\n A list of rows, where each row is a dictionary of column names and values.\n \"\"\"\n length = len(next(iter(batch.values())))\n rows = []\n for i in range(length):\n rows.append({col: values[i] for col, values in batch.items()})\n return rows\n\n def _get_dataset_num_examples(self) -> int:\n \"\"\"Get the number of examples in the dataset, based on the `split` and `config`\n runtime parameters provided.\n\n Returns:\n The number of examples in the dataset.\n \"\"\"\n default_config = self.config\n if not default_config:\n default_config = list(self._dataset_info.keys())[0]\n\n return self._dataset_info[default_config].splits[self.split].num_examples\n\n def _get_dataset_columns(self) -> List[str]:\n \"\"\"Get the columns of the dataset, based on the `config` runtime parameter provided.\n\n Returns:\n The columns of the dataset.\n \"\"\"\n return list(\n self._dataset_info[\n self.config if self.config else \"default\"\n ].features.keys()\n )\n\n @cached_property\n def _dataset_info(self) -> Dict[str, DatasetInfo]:\n \"\"\"Calls the Datasets Server API from Hugging Face to obtain the dataset information.\n\n Returns:\n The dataset information.\n \"\"\"\n\n try:\n return get_dataset_infos(self.repo_id)\n except Exception as e:\n warnings.warn(\n f\"Failed to get dataset info from Hugging Face Hub, trying to get it loading the dataset. 
Error: {e}\",\n UserWarning,\n stacklevel=2,\n )\n ds = load_dataset(self.repo_id, config=self.config, split=self.split)\n if self.config:\n return ds[self.config].info\n return ds.info\n
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromHub.outputs","title":"outputs
property
","text":"The columns that will be generated by this step, based on the datasets loaded from the Hugging Face Hub.
Returns:
Type DescriptionList[str]
The columns that will be generated by this step.
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromHub.load","title":"load()
","text":"Load the dataset from the Hugging Face Hub
Source code insrc/distilabel/steps/generators/huggingface.py
def load(self) -> None:\n \"\"\"Load the dataset from the Hugging Face Hub\"\"\"\n super().load()\n\n if self._dataset is not None:\n # Here to simplify the functionality of\u00a0distilabel.steps.generators.util.make_generator_step\n return\n\n self._dataset = load_dataset(\n self.repo_id, # type: ignore\n self.config,\n split=self.split,\n revision=self.revision,\n streaming=self.streaming,\n )\n num_examples = self._get_dataset_num_examples()\n self.num_examples = (\n min(self.num_examples, num_examples) if self.num_examples else num_examples\n )\n\n if not self.streaming:\n self._dataset = self._dataset.select(range(self.num_examples))\n
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromHub.process","title":"process(offset=0)
","text":"Yields batches from the loaded dataset from the Hugging Face Hub.
Parameters:
Name Type Description Defaultoffset
int
The offset to start yielding the data from. Will be used during the caching process to help skip already processed data.
0
Yields:
Type DescriptionGeneratorStepOutput
A tuple containing a batch of rows and a boolean indicating if the batch is
GeneratorStepOutput
the last one.
Source code insrc/distilabel/steps/generators/huggingface.py
def process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n \"\"\"Yields batches from the loaded dataset from the Hugging Face Hub.\n\n Args:\n offset: The offset to start yielding the data from. Will be used during the caching\n process to help skipping already processed data.\n\n Yields:\n A tuple containing a batch of rows and a boolean indicating if the batch is\n the last one.\n \"\"\"\n num_returned_rows = 0\n for batch_num, batch in enumerate(\n self._dataset.iter(batch_size=self.batch_size) # type: ignore\n ):\n if batch_num * self.batch_size < offset:\n continue\n transformed_batch = self._transform_batch(batch)\n batch_size = len(transformed_batch)\n num_returned_rows += batch_size\n yield transformed_batch, num_returned_rows >= self.num_examples\n
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.PushToHub","title":"PushToHub
","text":" Bases: GlobalStep
Push data to a Hugging Face Hub dataset.
A GlobalStep
which creates a datasets.Dataset
with the input data and pushes it to the Hugging Face Hub.
Attributes:
Name Type Descriptionrepo_id
RuntimeParameter[str]
The Hugging Face Hub repository ID where the dataset will be uploaded.
split
RuntimeParameter[str]
The split of the dataset that will be pushed. Defaults to \"train\"
.
private
RuntimeParameter[bool]
Whether the dataset to be pushed should be private or not. Defaults to False
.
token
Optional[RuntimeParameter[str]]
The token that will be used to authenticate in the Hub. If not provided, an attempt will be made to obtain the token from the environment variable HF_TOKEN
. If not provided using one of the previous methods, then huggingface_hub
library will try to use the token from the local Hugging Face CLI configuration. Defaults to None
.
repo_id
: The Hugging Face Hub repository ID where the dataset will be uploaded.split
: The split of the dataset that will be pushed.private
: Whether the dataset to be pushed should be private or not.token
: The token that will be used to authenticate in the Hub.all
): all columns from the input will be used to create the dataset.Examples:
Push batches of your dataset to the Hugging Face Hub repository:
from distilabel.steps import PushToHub\n\npush = PushToHub(repo_id=\"path_to/repo\")\npush.load()\n\nresult = next(\n push.process(\n [\n {\n \"instruction\": \"instruction \",\n \"generation\": \"generation\"\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction ', 'generation': 'generation'}]\n
Source code in src/distilabel/steps/globals/huggingface.py
class PushToHub(GlobalStep):\n \"\"\"Push data to a Hugging Face Hub dataset.\n\n A `GlobalStep` which creates a `datasets.Dataset` with the input data and pushes\n it to the Hugging Face Hub.\n\n Attributes:\n repo_id: The Hugging Face Hub repository ID where the dataset will be uploaded.\n split: The split of the dataset that will be pushed. Defaults to `\"train\"`.\n private: Whether the dataset to be pushed should be private or not. Defaults to\n `False`.\n token: The token that will be used to authenticate in the Hub. If not provided, the\n token will be tried to be obtained from the environment variable `HF_TOKEN`.\n If not provided using one of the previous methods, then `huggingface_hub` library\n will try to use the token from the local Hugging Face CLI configuration. Defaults\n to `None`.\n\n Runtime parameters:\n - `repo_id`: The Hugging Face Hub repository ID where the dataset will be uploaded.\n - `split`: The split of the dataset that will be pushed.\n - `private`: Whether the dataset to be pushed should be private or not.\n - `token`: The token that will be used to authenticate in the Hub.\n\n Input columns:\n - dynamic (`all`): all columns from the input will be used to create the dataset.\n\n Categories:\n - save\n - dataset\n - huggingface\n\n Examples:\n Push batches of your dataset to the Hugging Face Hub repository:\n\n ```python\n from distilabel.steps import PushToHub\n\n push = PushToHub(repo_id=\"path_to/repo\")\n push.load()\n\n result = next(\n push.process(\n [\n {\n \"instruction\": \"instruction \",\n \"generation\": \"generation\"\n }\n ],\n )\n )\n # >>> result\n # [{'instruction': 'instruction ', 'generation': 'generation'}]\n ```\n \"\"\"\n\n repo_id: RuntimeParameter[str] = Field(\n default=None,\n description=\"The Hugging Face Hub repository ID where the dataset will be uploaded.\",\n )\n split: RuntimeParameter[str] = Field(\n default=\"train\",\n description=\"The split of the dataset that will be pushed. Defaults to 'train'.\",\n )\n private: RuntimeParameter[bool] = Field(\n default=False,\n description=\"Whether the dataset to be pushed should be private or not. Defaults\"\n \" to `False`.\",\n )\n token: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The token that will be used to authenticate in the Hub. If not provided,\"\n \" the token will be tried to be obtained from the environment variable `HF_TOKEN`.\"\n \" If not provided using one of the previous methods, then `huggingface_hub` library\"\n \" will try to use the token from the local Hugging Face CLI configuration. 
Defaults\"\n \" to `None`\",\n )\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Method that processes the input data, respecting the `datasets.Dataset` formatting,\n and pushes it to the Hugging Face Hub based on the `RuntimeParameter`s attributes.\n\n Args:\n inputs: that input data within a single object (as it's a GlobalStep) that\n will be transformed into a `datasets.Dataset`.\n\n Yields:\n Propagates the received inputs so that the `Distiset` can be generated if this is\n the last step of the `Pipeline`, or if this is not a leaf step and has follow up\n steps.\n \"\"\"\n dataset_dict = defaultdict(list)\n for input in inputs:\n for key, value in input.items():\n dataset_dict[key].append(value)\n dataset_dict = dict(dataset_dict)\n dataset = Dataset.from_dict(dataset_dict)\n dataset.push_to_hub(\n self.repo_id, # type: ignore\n split=self.split,\n private=self.private,\n token=self.token or os.getenv(\"HF_TOKEN\"),\n )\n yield inputs\n
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.PushToHub.process","title":"process(inputs)
","text":"Method that processes the input data, respecting the datasets.Dataset
formatting, and pushes it to the Hugging Face Hub based on the RuntimeParameter
s attributes.
Parameters:
Name Type Description Defaultinputs
StepInput
the input data received within a single object (as it's a GlobalStep) that will be transformed into a datasets.Dataset
.
Yields:
Type DescriptionStepOutput
Propagates the received inputs so that the Distiset
can be generated if this is
StepOutput
the last step of the Pipeline
, or if this is not a leaf step and has follow-up
StepOutput
steps.
Source code insrc/distilabel/steps/globals/huggingface.py
def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Method that processes the input data, respecting the `datasets.Dataset` formatting,\n and pushes it to the Hugging Face Hub based on the `RuntimeParameter`s attributes.\n\n Args:\n inputs: that input data within a single object (as it's a GlobalStep) that\n will be transformed into a `datasets.Dataset`.\n\n Yields:\n Propagates the received inputs so that the `Distiset` can be generated if this is\n the last step of the `Pipeline`, or if this is not a leaf step and has follow up\n steps.\n \"\"\"\n dataset_dict = defaultdict(list)\n for input in inputs:\n for key, value in input.items():\n dataset_dict[key].append(value)\n dataset_dict = dict(dataset_dict)\n dataset = Dataset.from_dict(dataset_dict)\n dataset.push_to_hub(\n self.repo_id, # type: ignore\n split=self.split,\n private=self.private,\n token=self.token or os.getenv(\"HF_TOKEN\"),\n )\n yield inputs\n
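The loop in `process` simply pivots the row-oriented inputs into the column-oriented dictionary expected by `datasets.Dataset.from_dict` before pushing. A standalone sketch of that pivot (the sample rows are invented):
```python
from collections import defaultdict

from datasets import Dataset

rows = [
    {"instruction": "Write a haiku about rain.", "generation": "..."},
    {"instruction": "Summarize this paragraph.", "generation": "..."},
]

# Row-oriented -> column-oriented, mirroring what `PushToHub.process` does
# before calling `push_to_hub`.
columns = defaultdict(list)
for row in rows:
    for key, value in row.items():
        columns[key].append(value)

dataset = Dataset.from_dict(dict(columns))
print(dataset.column_names)  # ['instruction', 'generation']
```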
"},{"location":"api/task/","title":"Task","text":"This section contains the API reference for the distilabel
tasks.
For more information on how the Task
works and see some examples, check the Tutorial - Task page.
base
","text":""},{"location":"api/task/#distilabel.steps.tasks.base._Task","title":"_Task
","text":" Bases: _Step
, ABC
_Task is an abstract class that implements the _Step
interface and adds the format_input
and format_output
methods to format the inputs and outputs of the task. It also adds an llm
attribute to be used as the LLM to generate the outputs.
Attributes:
Name Type Descriptionllm
LLM
the LLM
to be used to generate the outputs of the task.
group_generations
bool
whether to group the num_generations
generated per input in a list or create a row per generation. Defaults to False
.
add_raw_output
RuntimeParameter[bool]
whether to include a field with the raw output of the LLM in the distilabel_metadata
field of the output. Can be helpful to avoid losing data with Tasks
that need to format the output of the LLM
. Defaults to True
.
num_generations
RuntimeParameter[int]
The number of generations to be produced per input.
Source code insrc/distilabel/steps/tasks/base.py
class _Task(_Step, ABC):\n \"\"\"_Task is an abstract class that implements the `_Step` interface and adds the\n `format_input` and `format_output` methods to format the inputs and outputs of the\n task. It also adds a `llm` attribute to be used as the LLM to generate the outputs.\n\n Attributes:\n llm: the `LLM` to be used to generate the outputs of the task.\n group_generations: whether to group the `num_generations` generated per input in\n a list or create a row per generation. Defaults to `False`.\n add_raw_output: whether to include a field with the raw output of the LLM in the\n `distilabel_metadata` field of the output. Can be helpful to not loose data\n with `Tasks` that need to format the output of the `LLM`. Defaults to `False`.\n num_generations: The number of generations to be produced per input.\n \"\"\"\n\n llm: LLM\n\n group_generations: bool = False\n add_raw_output: RuntimeParameter[bool] = Field(\n default=True,\n description=(\n \"Whether to include the raw output of the LLM in the key `raw_output_<TASK_NAME>`\"\n \" of the `distilabel_metadata` dictionary output column\"\n ),\n )\n add_raw_input: RuntimeParameter[bool] = Field(\n default=True,\n description=(\n \"Whether to include the raw input of the LLM in the key `raw_input_<TASK_NAME>`\"\n \" of the `distilabel_metadata` dictionary column\"\n ),\n )\n num_generations: RuntimeParameter[int] = Field(\n default=1, description=\"The number of generations to be produced per input.\"\n )\n use_default_structured_output: bool = False\n\n _can_be_used_with_offline_batch_generation: bool = PrivateAttr(False)\n\n def model_post_init(self, __context: Any) -> None:\n if (\n self.llm.use_offline_batch_generation\n and not self._can_be_used_with_offline_batch_generation\n ):\n raise DistilabelUserError(\n f\"`{self.__class__.__name__}` task cannot be used with offline batch generation\"\n \" feature.\",\n page=\"sections/how_to_guides/advanced/offline-batch-generation\",\n )\n\n super().model_post_init(__context)\n\n @property\n def is_global(self) -> bool:\n \"\"\"Extends the `is_global` property to return `True` if the task is using the\n offline batch generation feature, otherwise it returns the value of the parent\n class property. `offline_batch_generation` requires to receive all the inputs\n at once, so for the `_BatchManager` this is a global step.\n\n Returns:\n Whether the task is a global step or not.\n \"\"\"\n if self.llm.use_offline_batch_generation:\n return True\n\n return super().is_global\n\n def load(self) -> None:\n \"\"\"Loads the LLM via the `LLM.load()` method.\"\"\"\n super().load()\n self._set_default_structured_output()\n self.llm.load()\n\n @override\n def unload(self) -> None:\n \"\"\"Unloads the LLM.\"\"\"\n self._logger.debug(\"Executing task unload logic.\")\n self.llm.unload()\n\n @override\n def impute_step_outputs(\n self, step_output: List[Dict[str, Any]]\n ) -> List[Dict[str, Any]]:\n \"\"\"\n Imputes the outputs of the task in case the LLM failed to generate a response.\n \"\"\"\n result = []\n for row in step_output:\n data = row.copy()\n for output in self.get_outputs().keys():\n data[output] = None\n data = self._create_metadata(\n data,\n None,\n None,\n add_raw_output=self.add_raw_output,\n add_raw_input=self.add_raw_input,\n )\n result.append(data)\n return result\n\n @abstractmethod\n def format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n ) -> Dict[str, Any]:\n \"\"\"Abstract method to format the outputs of the task. 
It needs to receive an output\n as a string, and generates a Python dictionary with the outputs of the task. In\n addition the `input` used to generate the output is also received just in case it's\n needed to be able to parse the output correctly.\n \"\"\"\n pass\n\n def _format_outputs(\n self,\n outputs: \"GenerateOutput\",\n input: Union[Dict[str, Any], None] = None,\n ) -> List[Dict[str, Any]]:\n \"\"\"Formats the outputs of the task using the `format_output` method. If the output\n is `None` (i.e. the LLM failed to generate a response), then the outputs will be\n set to `None` as well.\n\n Args:\n outputs: The outputs (`n` generations) for the provided `input`.\n input: The input used to generate the output.\n\n Returns:\n A list containing a dictionary with the outputs of the task for each input.\n \"\"\"\n inputs = [None] if input is None else [input]\n formatted_outputs = []\n repeate_inputs = len(outputs.get(\"generations\"))\n outputs = normalize_statistics(outputs)\n\n for (output, stats, extra), input in zip(\n iterate_generations_with_stats(outputs), inputs * repeate_inputs\n ): # type: ignore\n try:\n # Extract the generations, and move the statistics to the distilabel_metadata,\n # to keep everything clean\n formatted_output = self.format_output(output, input)\n formatted_output = self._create_metadata(\n output=formatted_output,\n raw_output=output,\n input=input,\n add_raw_output=self.add_raw_output, # type: ignore\n add_raw_input=self.add_raw_input, # type: ignore\n statistics=stats,\n )\n formatted_output = self._create_extra(\n output=formatted_output, extra=extra\n )\n formatted_outputs.append(formatted_output)\n except Exception as e:\n self._logger.warning( # type: ignore\n f\"Task '{self.name}' failed to format output: {e}. 
Saving raw response.\" # type: ignore\n )\n formatted_outputs.append(self._output_on_failure(output, input))\n return formatted_outputs\n\n def _output_on_failure(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n ) -> Dict[str, Any]:\n \"\"\"In case of failure to format the output, this method will return a dictionary including\n a new field `distilabel_meta` with the raw output of the LLM.\n \"\"\"\n # Create a dictionary with the outputs of the task (every output set to None)\n outputs = {output: None for output in self.outputs}\n outputs[\"model_name\"] = self.llm.model_name # type: ignore\n outputs = self._create_metadata(\n outputs,\n output,\n input,\n add_raw_output=self.add_raw_output, # type: ignore\n add_raw_input=self.add_raw_input, # type: ignore\n )\n return outputs\n\n def _create_metadata(\n self,\n output: Dict[str, Any],\n raw_output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n add_raw_output: bool = True,\n add_raw_input: bool = True,\n statistics: Optional[\"LLMStatistics\"] = None,\n ) -> Dict[str, Any]:\n \"\"\"Adds the raw output and or the formatted input of the LLM to the output dictionary\n if `add_raw_output` is True or `add_raw_input` is True.\n\n Args:\n output:\n The output dictionary after formatting the output from the LLM,\n to add the raw output and or raw input.\n raw_output: The raw output of the `LLM`.\n input: The input used to generate the output.\n add_raw_output: Whether to add the raw output to the output dictionary.\n add_raw_input: Whether to add the raw input to the output dictionary.\n statistics: The statistics generated by the LLM, which should contain at least\n the number of input and output tokens.\n \"\"\"\n meta = output.get(DISTILABEL_METADATA_KEY, {})\n\n if add_raw_output:\n meta[f\"raw_output_{self.name}\"] = raw_output\n\n if add_raw_input:\n meta[f\"raw_input_{self.name}\"] = self.format_input(input) if input else None\n\n if statistics:\n meta[f\"statistics_{self.name}\"] = statistics\n\n if meta:\n output[DISTILABEL_METADATA_KEY] = meta\n\n return output\n\n def _create_extra(\n self, output: Dict[str, Any], extra: Dict[str, Any]\n ) -> Dict[str, Any]:\n column_name_prefix = f\"llm_{self.name}_\"\n for key, value in extra.items():\n column_name = column_name_prefix + key\n output[column_name] = value\n return output\n\n def _set_default_structured_output(self) -> None:\n \"\"\"Prepares the structured output to be set in the selected `LLM`.\n\n If the method `get_structured_output` returns None (the default), there's no need\n to set anything, as it doesn't apply.\n If the `use_default_structured_output` and there's no previous structured output\n set by hand, then decide the type of structured output to select depending on the\n `LLM` provider.\n \"\"\"\n schema = self.get_structured_output()\n if not schema:\n return\n\n if self.use_default_structured_output and not self.llm.structured_output:\n # In case the default structured output is required, we have to set it before\n # the LLM is loaded\n from distilabel.models.llms import InferenceEndpointsLLM\n from distilabel.models.llms.base import AsyncLLM\n\n def check_dependency(module_name: str) -> None:\n if not importlib.util.find_spec(module_name):\n raise ImportError(\n f\"`{module_name}` is not installed and is needed for the structured generation with this LLM.\"\n f\" Please install it using `pip install {module_name}`.\"\n )\n\n dependency = \"outlines\"\n structured_output = {\"schema\": schema}\n if isinstance(self.llm, 
InferenceEndpointsLLM):\n structured_output.update({\"format\": \"json\"})\n # To determine instructor or outlines format\n elif isinstance(self.llm, AsyncLLM) and not isinstance(\n self.llm, InferenceEndpointsLLM\n ):\n dependency = \"instructor\"\n structured_output.update({\"format\": \"json\"})\n\n check_dependency(dependency)\n self.llm.structured_output = structured_output\n\n def get_structured_output(self) -> Union[Dict[str, Any], None]:\n \"\"\"Returns the structured output for a task that implements one by default,\n must be overriden by subclasses of `Task`. When implemented, should be a json\n schema that enforces the response from the LLM so that it's easier to parse.\n \"\"\"\n return None\n\n def _sample_input(self) -> \"ChatType\":\n \"\"\"Returns a sample input to be used in the `print` method.\n Tasks that don't adhere to a format input that returns a map of the type\n str -> str should override this method to return a sample input.\n \"\"\"\n return self.format_input(\n {input: f\"<PLACEHOLDER_{input.upper()}>\" for input in self.inputs}\n )\n\n def print(self, sample_input: Optional[\"ChatType\"] = None) -> None:\n \"\"\"Prints a sample input to the console using the `rich` library.\n Helper method to visualize the prompt of the task.\n\n Args:\n sample_input: A sample input to be printed. If not provided, a default will be\n generated using the `_sample_input` method, which can be overriden by\n subclasses. This should correspond to the same example you could pass to\n the `format_input` method.\n The variables be named <PLACEHOLDER_VARIABLE_NAME> by default.\n\n Examples:\n Print the URIAL prompt:\n\n ```python\n from distilabel.steps.tasks import URIAL\n from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n urial = URIAL(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n )\n urial.load()\n urial.print()\n \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt: URIAL \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n \u2502 \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 User Message \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e \u2502\n \u2502 \u2502 # Instruction \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 Below is a list of conversations between a human and an AI assistant (you). \u2502 \u2502\n \u2502 \u2502 Users place their queries under \"# User:\", and your responses are under \"# Assistant:\". \u2502 \u2502\n \u2502 \u2502 You are a helpful, respectful, and honest assistant. \u2502 \u2502\n \u2502 \u2502 You should always answer as helpfully as possible while ensuring safety. 
\u2502 \u2502\n \u2502 \u2502 Your answers should be well-structured and provide detailed information. They should also \u2502 \u2502\n \u2502 \u2502 have an engaging tone. \u2502 \u2502\n \u2502 \u2502 Your responses must not contain any fake, harmful, unethical, racist, sexist, toxic, \u2502 \u2502\n \u2502 \u2502 dangerous, or illegal content, even if it may be helpful. \u2502 \u2502\n \u2502 \u2502 Your response must be socially responsible, and thus you can refuse to answer some \u2502 \u2502\n \u2502 \u2502 controversial topics. \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 # User: \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 <PLACEHOLDER_INSTRUCTION> \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 # Assistant: \u2502 \u2502\n \u2502 \u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f \u2502\n \u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n ```\n \"\"\"\n from rich.console import Console, Group\n from rich.panel import Panel\n from rich.text import Text\n\n console = Console()\n sample_input = sample_input or self._sample_input()\n\n panels = []\n for item in sample_input:\n content = Text.assemble((item.get(\"content\", \"\"),))\n panel = Panel(\n content,\n title=f\"[bold][magenta]{item.get('role', '').capitalize()} Message[/magenta][/bold]\",\n border_style=\"light_cyan3\",\n )\n panels.append(panel)\n\n # Create a group of panels\n # Wrap the group in an outer panel\n outer_panel = Panel(\n Group(*panels),\n title=f\"[bold][magenta]Prompt: {type(self).__name__} [/magenta][/bold]\",\n border_style=\"light_cyan3\",\n expand=False,\n )\n console.print(outer_panel)\n
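Given the `_create_metadata` logic above, a task with both raw flags enabled ends up carrying its raw output, formatted input and token statistics under the `distilabel_metadata` column. A hedged illustration of the resulting row shape for a hypothetical task named `my_task` (all values are invented placeholders, and the exact statistics keys depend on the LLM used):
```python
# Illustrative only: the exact contents depend on the task, the LLM and the
# `add_raw_output` / `add_raw_input` runtime parameters.
row = {
    "instruction": "...",
    "generation": "...",
    "model_name": "my-model",
    "distilabel_metadata": {
        "raw_output_my_task": "<raw LLM response>",
        "raw_input_my_task": [{"role": "user", "content": "<formatted prompt>"}],
        "statistics_my_task": {"input_tokens": [12], "output_tokens": [34]},
    },
}
```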
"},{"location":"api/task/#distilabel.steps.tasks.base._Task.is_global","title":"is_global
property
","text":"Extends the is_global
property to return True
if the task is using the offline batch generation feature; otherwise it returns the value of the parent class property. offline_batch_generation
requires receiving all the inputs at once, so for the _BatchManager
this is a global step.
Returns:
Type Descriptionbool
Whether the task is a global step or not.
"},{"location":"api/task/#distilabel.steps.tasks.base._Task.load","title":"load()
","text":"Loads the LLM via the LLM.load()
method.
src/distilabel/steps/tasks/base.py
def load(self) -> None:\n \"\"\"Loads the LLM via the `LLM.load()` method.\"\"\"\n super().load()\n self._set_default_structured_output()\n self.llm.load()\n
"},{"location":"api/task/#distilabel.steps.tasks.base._Task.unload","title":"unload()
","text":"Unloads the LLM.
Source code insrc/distilabel/steps/tasks/base.py
@override\ndef unload(self) -> None:\n \"\"\"Unloads the LLM.\"\"\"\n self._logger.debug(\"Executing task unload logic.\")\n self.llm.unload()\n
"},{"location":"api/task/#distilabel.steps.tasks.base._Task.impute_step_outputs","title":"impute_step_outputs(step_output)
","text":"Imputes the outputs of the task in case the LLM failed to generate a response.
Source code insrc/distilabel/steps/tasks/base.py
@override\ndef impute_step_outputs(\n self, step_output: List[Dict[str, Any]]\n) -> List[Dict[str, Any]]:\n \"\"\"\n Imputes the outputs of the task in case the LLM failed to generate a response.\n \"\"\"\n result = []\n for row in step_output:\n data = row.copy()\n for output in self.get_outputs().keys():\n data[output] = None\n data = self._create_metadata(\n data,\n None,\n None,\n add_raw_output=self.add_raw_output,\n add_raw_input=self.add_raw_input,\n )\n result.append(data)\n return result\n
"},{"location":"api/task/#distilabel.steps.tasks.base._Task.format_output","title":"format_output(output, input=None)
abstractmethod
","text":"Abstract method to format the outputs of the task. It needs to receive an output as a string, and generates a Python dictionary with the outputs of the task. In addition the input
used to generate the output is also received just in case it's needed to be able to parse the output correctly.
src/distilabel/steps/tasks/base.py
@abstractmethod\ndef format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n) -> Dict[str, Any]:\n \"\"\"Abstract method to format the outputs of the task. It needs to receive an output\n as a string, and generates a Python dictionary with the outputs of the task. In\n addition the `input` used to generate the output is also received just in case it's\n needed to be able to parse the output correctly.\n \"\"\"\n pass\n
"},{"location":"api/task/#distilabel.steps.tasks.base._Task.get_structured_output","title":"get_structured_output()
","text":"Returns the structured output for a task that implements one by default, must be overriden by subclasses of Task
. When implemented, should be a json schema that enforces the response from the LLM so that it's easier to parse.
src/distilabel/steps/tasks/base.py
def get_structured_output(self) -> Union[Dict[str, Any], None]:\n \"\"\"Returns the structured output for a task that implements one by default,\n must be overriden by subclasses of `Task`. When implemented, should be a json\n schema that enforces the response from the LLM so that it's easier to parse.\n \"\"\"\n return None\n
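A short sketch of overriding this hook in a hypothetical subclass, returning a JSON schema that constrains the LLM response (the class name and schema are invented examples, not ones shipped with distilabel):
```python
from typing import Any, Dict, Union

from distilabel.steps.tasks import Task


class MyStructuredTask(Task):
    # Only the structured-output hook is sketched; a real subclass would also
    # implement `inputs`, `outputs`, `format_input` and `format_output`.
    def get_structured_output(self) -> Union[Dict[str, Any], None]:
        return {
            "type": "object",
            "properties": {
                "answer": {"type": "string"},
                "score": {"type": "integer"},
            },
            "required": ["answer", "score"],
        }
```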
"},{"location":"api/task/#distilabel.steps.tasks.base._Task.print","title":"print(sample_input=None)
","text":"Prints a sample input to the console using the rich
library. Helper method to visualize the prompt of the task.
Parameters:
Name Type Description Defaultsample_input
Optional[ChatType]
A sample input to be printed. If not provided, a default will be generated using the _sample_input
method, which can be overridden by subclasses. This should correspond to the same example you could pass to the format_input
method. The variables will be named <PLACEHOLDER_VARIABLE_NAME> by default. None
Examples:
Print the URIAL prompt:
from distilabel.steps.tasks import URIAL\nfrom distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nurial = URIAL(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n)\nurial.load()\nurial.print()\n\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt: URIAL \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 User Message \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e \u2502\n\u2502 \u2502 # Instruction \u2502 \u2502\n\u2502 \u2502 \u2502 \u2502\n\u2502 \u2502 Below is a list of conversations between a human and an AI assistant (you). \u2502 \u2502\n\u2502 \u2502 Users place their queries under \"# User:\", and your responses are under \"# Assistant:\". \u2502 \u2502\n\u2502 \u2502 You are a helpful, respectful, and honest assistant. \u2502 \u2502\n\u2502 \u2502 You should always answer as helpfully as possible while ensuring safety. \u2502 \u2502\n\u2502 \u2502 Your answers should be well-structured and provide detailed information. They should also \u2502 \u2502\n\u2502 \u2502 have an engaging tone. \u2502 \u2502\n\u2502 \u2502 Your responses must not contain any fake, harmful, unethical, racist, sexist, toxic, \u2502 \u2502\n\u2502 \u2502 dangerous, or illegal content, even if it may be helpful. \u2502 \u2502\n\u2502 \u2502 Your response must be socially responsible, and thus you can refuse to answer some \u2502 \u2502\n\u2502 \u2502 controversial topics. 
\u2502 \u2502\n\u2502 \u2502 \u2502 \u2502\n\u2502 \u2502 \u2502 \u2502\n\u2502 \u2502 # User: \u2502 \u2502\n\u2502 \u2502 \u2502 \u2502\n\u2502 \u2502 <PLACEHOLDER_INSTRUCTION> \u2502 \u2502\n\u2502 \u2502 \u2502 \u2502\n\u2502 \u2502 # Assistant: \u2502 \u2502\n\u2502 \u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
Source code in src/distilabel/steps/tasks/base.py
def print(self, sample_input: Optional[\"ChatType\"] = None) -> None:\n \"\"\"Prints a sample input to the console using the `rich` library.\n Helper method to visualize the prompt of the task.\n\n Args:\n sample_input: A sample input to be printed. If not provided, a default will be\n generated using the `_sample_input` method, which can be overriden by\n subclasses. This should correspond to the same example you could pass to\n the `format_input` method.\n The variables be named <PLACEHOLDER_VARIABLE_NAME> by default.\n\n Examples:\n Print the URIAL prompt:\n\n ```python\n from distilabel.steps.tasks import URIAL\n from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n urial = URIAL(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n )\n urial.load()\n urial.print()\n \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt: URIAL \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n \u2502 \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 User Message \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e \u2502\n \u2502 \u2502 # Instruction \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 Below is a list of conversations between a human and an AI assistant (you). \u2502 \u2502\n \u2502 \u2502 Users place their queries under \"# User:\", and your responses are under \"# Assistant:\". \u2502 \u2502\n \u2502 \u2502 You are a helpful, respectful, and honest assistant. \u2502 \u2502\n \u2502 \u2502 You should always answer as helpfully as possible while ensuring safety. \u2502 \u2502\n \u2502 \u2502 Your answers should be well-structured and provide detailed information. They should also \u2502 \u2502\n \u2502 \u2502 have an engaging tone. \u2502 \u2502\n \u2502 \u2502 Your responses must not contain any fake, harmful, unethical, racist, sexist, toxic, \u2502 \u2502\n \u2502 \u2502 dangerous, or illegal content, even if it may be helpful. \u2502 \u2502\n \u2502 \u2502 Your response must be socially responsible, and thus you can refuse to answer some \u2502 \u2502\n \u2502 \u2502 controversial topics. 
\u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 # User: \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 <PLACEHOLDER_INSTRUCTION> \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 # Assistant: \u2502 \u2502\n \u2502 \u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f \u2502\n \u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n ```\n \"\"\"\n from rich.console import Console, Group\n from rich.panel import Panel\n from rich.text import Text\n\n console = Console()\n sample_input = sample_input or self._sample_input()\n\n panels = []\n for item in sample_input:\n content = Text.assemble((item.get(\"content\", \"\"),))\n panel = Panel(\n content,\n title=f\"[bold][magenta]{item.get('role', '').capitalize()} Message[/magenta][/bold]\",\n border_style=\"light_cyan3\",\n )\n panels.append(panel)\n\n # Create a group of panels\n # Wrap the group in an outer panel\n outer_panel = Panel(\n Group(*panels),\n title=f\"[bold][magenta]Prompt: {type(self).__name__} [/magenta][/bold]\",\n border_style=\"light_cyan3\",\n expand=False,\n )\n console.print(outer_panel)\n
"},{"location":"api/task/#distilabel.steps.tasks.base.Task","title":"Task
","text":" Bases: _Task
, Step
Task is a class that implements the _Task
abstract class and adds the Step
interface to be used as a step in the pipeline.
Attributes:
Name Type Descriptionllm
LLM
the LLM
to be used to generate the outputs of the task.
group_generations
bool
whether to group the num_generations
generated per input in a list or create a row per generation. Defaults to False
.
num_generations
RuntimeParameter[int]
The number of generations to be produced per input.
Source code insrc/distilabel/steps/tasks/base.py
class Task(_Task, Step):\n \"\"\"Task is a class that implements the `_Task` abstract class and adds the `Step`\n interface to be used as a step in the pipeline.\n\n Attributes:\n llm: the `LLM` to be used to generate the outputs of the task.\n group_generations: whether to group the `num_generations` generated per input in\n a list or create a row per generation. Defaults to `False`.\n num_generations: The number of generations to be produced per input.\n \"\"\"\n\n @abstractmethod\n def format_input(self, input: Dict[str, Any]) -> \"FormattedInput\":\n \"\"\"Abstract method to format the inputs of the task. It needs to receive an input\n as a Python dictionary, and generates an OpenAI chat-like list of dicts.\"\"\"\n pass\n\n def _format_inputs(self, inputs: List[Dict[str, Any]]) -> List[\"FormattedInput\"]:\n \"\"\"Formats the inputs of the task using the `format_input` method.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list containing the formatted inputs, which are `ChatType`-like following\n the OpenAI formatting.\n \"\"\"\n return [self.format_input(input) for input in inputs]\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Processes the inputs of the task and generates the outputs using the LLM.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n\n formatted_inputs = self._format_inputs(inputs)\n\n # `outputs` is a dict containing the LLM outputs in the `generations`\n # key and the statistics in the `statistics` key\n outputs = self.llm.generate_outputs(\n inputs=formatted_inputs,\n num_generations=self.num_generations, # type: ignore\n **self.llm.get_generation_kwargs(), # type: ignore\n )\n task_outputs = []\n for input, input_outputs in zip(inputs, outputs):\n formatted_outputs = self._format_outputs(input_outputs, input)\n\n if self.group_generations:\n combined = group_dicts(*formatted_outputs)\n task_outputs.append(\n {**input, **combined, \"model_name\": self.llm.model_name}\n )\n continue\n\n # Create a row per generation\n for formatted_output in formatted_outputs:\n task_outputs.append(\n {**input, **formatted_output, \"model_name\": self.llm.model_name}\n )\n\n yield task_outputs\n
"},{"location":"api/task/#distilabel.steps.tasks.base.Task.format_input","title":"format_input(input)
abstractmethod
","text":"Abstract method to format the inputs of the task. It needs to receive an input as a Python dictionary, and generates an OpenAI chat-like list of dicts.
Source code insrc/distilabel/steps/tasks/base.py
@abstractmethod\ndef format_input(self, input: Dict[str, Any]) -> \"FormattedInput\":\n \"\"\"Abstract method to format the inputs of the task. It needs to receive an input\n as a Python dictionary, and generates an OpenAI chat-like list of dicts.\"\"\"\n pass\n
"},{"location":"api/task/#distilabel.steps.tasks.base.Task.process","title":"process(inputs)
","text":"Processes the inputs of the task and generates the outputs using the LLM.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of Python dictionaries with the inputs of the task.
requiredYields:
Type DescriptionStepOutput
A list of Python dictionaries with the outputs of the task.
Source code insrc/distilabel/steps/tasks/base.py
def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Processes the inputs of the task and generates the outputs using the LLM.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n\n formatted_inputs = self._format_inputs(inputs)\n\n # `outputs` is a dict containing the LLM outputs in the `generations`\n # key and the statistics in the `statistics` key\n outputs = self.llm.generate_outputs(\n inputs=formatted_inputs,\n num_generations=self.num_generations, # type: ignore\n **self.llm.get_generation_kwargs(), # type: ignore\n )\n task_outputs = []\n for input, input_outputs in zip(inputs, outputs):\n formatted_outputs = self._format_outputs(input_outputs, input)\n\n if self.group_generations:\n combined = group_dicts(*formatted_outputs)\n task_outputs.append(\n {**input, **combined, \"model_name\": self.llm.model_name}\n )\n continue\n\n # Create a row per generation\n for formatted_output in formatted_outputs:\n task_outputs.append(\n {**input, **formatted_output, \"model_name\": self.llm.model_name}\n )\n\n yield task_outputs\n
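Combined with the abstract `format_input`/`format_output` pair documented above, a minimal hypothetical `Task` subclass looks roughly like this (the task name and columns are invented for illustration):
```python
from typing import Any, Dict, List, Union

from distilabel.steps.tasks import Task


class SimpleQuestionAnswering(Task):
    """Hypothetical task that asks the LLM to answer an `instruction` column."""

    @property
    def inputs(self) -> List[str]:
        return ["instruction"]

    @property
    def outputs(self) -> List[str]:
        return ["answer", "model_name"]

    def format_input(self, input: Dict[str, Any]) -> List[Dict[str, str]]:
        # OpenAI chat-like list of dicts, as `format_input` requires.
        return [{"role": "user", "content": input["instruction"]}]

    def format_output(
        self, output: Union[str, None], input: Union[Dict[str, Any], None] = None
    ) -> Dict[str, Any]:
        # Map the raw string response to the task's output column.
        return {"answer": output}
```
At runtime, `Task.process` takes care of calling the LLM, attaching `model_name` and the `distilabel_metadata` bookkeeping shown earlier.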
"},{"location":"api/task/generator_task/","title":"GeneratorTask","text":"This section contains the API reference for the distilabel
generator tasks.
For more information on how the GeneratorTask
works and see some examples, check the Tutorial - Task - GeneratorTask page.
GeneratorTask
","text":" Bases: _Task
, GeneratorStep
GeneratorTask
is a class that implements the _Task
abstract class and adds the GeneratorStep
interface to be used as a step in the pipeline.
Attributes:
Name Type Descriptionllm
LLM
the LLM
to be used to generate the outputs of the task.
group_generations
bool
whether to group the num_generations
generated per input in a list or create a row per generation. Defaults to False
.
num_generations
RuntimeParameter[int]
The number of generations to be produced per input.
Source code insrc/distilabel/steps/tasks/base.py
class GeneratorTask(_Task, GeneratorStep):\n \"\"\"`GeneratorTask` is a class that implements the `_Task` abstract class and adds the\n `GeneratorStep` interface to be used as a step in the pipeline.\n\n Attributes:\n llm: the `LLM` to be used to generate the outputs of the task.\n group_generations: whether to group the `num_generations` generated per input in\n a list or create a row per generation. Defaults to `False`.\n num_generations: The number of generations to be produced per input.\n \"\"\"\n\n pass\n
"},{"location":"api/task/task_gallery/","title":"Task Gallery","text":"This section contains the existing Task
subclasses implemented in distilabel
.
tasks
","text":""},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenExecutionChecker","title":"APIGenExecutionChecker
","text":" Bases: Step
Executes the generated function calls.
This step checks if a given answer from a model as generated by APIGenGenerator
can be executed against the given library (given by libpath
, which is a string pointing to a python .py file with functions).
Attributes:
Name Type Descriptionlibpath
str
The path to the library where we will retrieve the functions. It can also point to a folder with the functions. In this case, the folder layout should be a folder with .py files, each containing a single function, the name of the function being the same as the filename.
check_is_dangerous
bool
Bool to exclude some potentially dangerous functions; it contains some heuristics found while testing. These functions can run subprocesses, deal with the OS, or perform other potentially dangerous operations. Defaults to True.
Input columnsstr
): List with arguments to be passed to the function, dumped as a string from a list of dictionaries. Should be loaded using json.loads
.bool
): Whether the function should be kept or not.str
): The result from executing the function.Examples:
Execute a function from a given library with the answer from an LLM:
from distilabel.steps.tasks import APIGenExecutionChecker\n\n# For the libpath you can use as an example the file at the tests folder:\n# ../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py\ntask = APIGenExecutionChecker(\n libpath=\"../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py\",\n)\ntask.load()\n\nres = next(\n task.process(\n [\n {\n \"answers\": [\n {\n \"arguments\": {\n \"initial_velocity\": 0.2,\n \"acceleration\": 0.1,\n \"time\": 0.5,\n },\n \"name\": \"final_velocity\",\n }\n ],\n }\n ]\n )\n)\nres\n#[{'answers': [{'arguments': {'initial_velocity': 0.2, 'acceleration': 0.1, 'time': 0.5}, 'name': 'final_velocity'}], 'keep_row_after_execution_check': True, 'execution_result': ['0.25']}]\n
Source code in src/distilabel/steps/tasks/apigen/execution_checker.py
class APIGenExecutionChecker(Step):\n \"\"\"Executes the generated function calls.\n\n This step checks if a given answer from a model as generated by `APIGenGenerator`\n can be executed against the given library (given by `libpath`, which is a string\n pointing to a python .py file with functions).\n\n Attributes:\n libpath: The path to the library where we will retrieve the functions.\n It can also point to a folder with the functions. In this case, the folder\n layout should be a folder with .py files, each containing a single function,\n the name of the function being the same as the filename.\n check_is_dangerous: Bool to exclude some potentially dangerous functions, it contains\n some heuristics found while testing. This functions can run subprocesses, deal with\n the OS, or have other potentially dangerous operations. Defaults to True.\n\n Input columns:\n - answers (`str`): List with arguments to be passed to the function,\n dumped as a string from a list of dictionaries. Should be loaded using\n `json.loads`.\n\n Output columns:\n - keep_row_after_execution_check (`bool`): Whether the function should be kept or not.\n - execution_result (`str`): The result from executing the function.\n\n Categories:\n - filtering\n - execution\n\n References:\n - [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518)\n - [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)\n\n Examples:\n Execute a function from a given library with the answer from an LLM:\n\n ```python\n from distilabel.steps.tasks import APIGenExecutionChecker\n\n # For the libpath you can use as an example the file at the tests folder:\n # ../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py\n task = APIGenExecutionChecker(\n libpath=\"../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py\",\n )\n task.load()\n\n res = next(\n task.process(\n [\n {\n \"answers\": [\n {\n \"arguments\": {\n \"initial_velocity\": 0.2,\n \"acceleration\": 0.1,\n \"time\": 0.5,\n },\n \"name\": \"final_velocity\",\n }\n ],\n }\n ]\n )\n )\n res\n #[{'answers': [{'arguments': {'initial_velocity': 0.2, 'acceleration': 0.1, 'time': 0.5}, 'name': 'final_velocity'}], 'keep_row_after_execution_check': True, 'execution_result': ['0.25']}]\n ```\n \"\"\"\n\n libpath: str = Field(\n default=...,\n description=(\n \"The path to the library where we will retrieve the functions, \"\n \"or a folder with python files named the same as the functions they contain.\",\n ),\n )\n check_is_dangerous: bool = Field(\n default=True,\n description=(\n \"Bool to exclude some potentially dangerous functions, it contains \"\n \"some heuristics found while testing. 
This functions can run subprocesses, \"\n \"deal with the OS, or have other potentially dangerous operations.\",\n ),\n )\n\n _toolbox: Union[\"ModuleType\", None] = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Loads the library where the functions will be extracted from.\"\"\"\n super().load()\n if Path(self.libpath).suffix == \".py\":\n self._toolbox = load_module_from_path(self.libpath)\n\n def unload(self) -> None:\n self._toolbox = None\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The inputs for the task are those found in the original dataset.\"\"\"\n return [\"answers\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The outputs are the columns required by `APIGenGenerator` task.\"\"\"\n return [\"keep_row_after_execution_check\", \"execution_result\"]\n\n def _get_function(self, function_name: str) -> Callable:\n \"\"\"Retrieves the function from the toolbox.\n\n Args:\n function_name: The name of the function to retrieve.\n\n Returns:\n Callable: The function to be executed.\n \"\"\"\n if self._toolbox:\n return getattr(self._toolbox, function_name, None)\n try:\n toolbox = load_module_from_path(\n str(Path(self.libpath) / f\"{function_name}.py\")\n )\n return getattr(toolbox, function_name, None)\n except FileNotFoundError:\n return None\n except Exception as e:\n self._logger.warning(f\"Error loading function '{function_name}': {e}\")\n return None\n\n def _is_dangerous(self, function: Callable) -> bool:\n \"\"\"Checks if a function is dangerous to remove it.\n Contains a list of heuristics to avoid executing possibly dangerous functions.\n \"\"\"\n source_code = inspect.getsource(function)\n # We don't want to execute functions that use subprocess\n if (\n (\"subprocess.\" in source_code)\n or (\"os.system(\" in source_code)\n or (\"input(\" in source_code)\n # Avoiding threading\n or (\"threading.Thread(\" in source_code)\n or (\"exec(\" in source_code)\n # Avoiding argparse (not sure why)\n or (\"argparse.ArgumentParser(\" in source_code)\n # Avoiding logging changing the levels to not mess with the logs\n or (\".setLevel(\" in source_code)\n # Don't run a test battery\n or (\"unittest.main(\" in source_code)\n # Avoid exiting the program\n or (\"sys.exit(\" in source_code)\n or (\"exit(\" in source_code)\n or (\"raise SystemExit(\" in source_code)\n or (\"multiprocessing.Pool(\" in source_code)\n ):\n return True\n return False\n\n @override\n def process(self, inputs: StepInput) -> \"StepOutput\":\n \"\"\"Checks the answer to see if it can be executed.\n Captures the possible errors and returns them.\n\n If a single example is provided, it is copied to avoid raising an error.\n\n Args:\n inputs: A list of dictionaries with the input data.\n\n Yields:\n A list of dictionaries with the output data.\n \"\"\"\n for input in inputs:\n output = []\n if input[\"answers\"]:\n answers = json.loads(input[\"answers\"])\n else:\n input.update(\n **{\n \"keep_row_after_execution_check\": False,\n \"execution_result\": [\"No answers were provided.\"],\n }\n )\n continue\n for answer in answers:\n if answer is None:\n output.append(\n {\n \"keep\": False,\n \"execution_result\": \"Nothing was generated for this answer.\",\n }\n )\n continue\n\n function_name = answer.get(\"name\", None)\n arguments = answer.get(\"arguments\", None)\n\n self._logger.debug(\n f\"Executing function '{function_name}' with arguments: {arguments}\"\n )\n function = self._get_function(function_name)\n\n if self.check_is_dangerous:\n if function and 
self._is_dangerous(function):\n function = None\n\n if function is None:\n output.append(\n {\n \"keep\": False,\n \"execution_result\": f\"Function '{function_name}' not found.\",\n }\n )\n else:\n execution = execute_from_response(function, arguments)\n output.append(\n {\n \"keep\": execution[\"keep\"],\n \"execution_result\": execution[\"execution_result\"],\n }\n )\n # We only consider a good response if all the answers were executed successfully,\n # but keep the reasons for further review if needed.\n input.update(\n **{\n \"keep_row_after_execution_check\": all(\n o[\"keep\"] is True for o in output\n ),\n \"execution_result\": [o[\"execution_result\"] for o in output],\n }\n )\n\n yield inputs\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenExecutionChecker.inputs","title":"inputs
property
","text":"The inputs for the task are those found in the original dataset.
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenExecutionChecker.outputs","title":"outputs
property
","text":"The outputs are the columns required by APIGenGenerator
task.
load()
","text":"Loads the library where the functions will be extracted from.
Source code insrc/distilabel/steps/tasks/apigen/execution_checker.py
def load(self) -> None:\n \"\"\"Loads the library where the functions will be extracted from.\"\"\"\n super().load()\n if Path(self.libpath).suffix == \".py\":\n self._toolbox = load_module_from_path(self.libpath)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenExecutionChecker._get_function","title":"_get_function(function_name)
","text":"Retrieves the function from the toolbox.
Parameters:
Name Type Description Defaultfunction_name
str
The name of the function to retrieve.
requiredReturns:
Name Type DescriptionCallable
Callable
The function to be executed.
Source code insrc/distilabel/steps/tasks/apigen/execution_checker.py
def _get_function(self, function_name: str) -> Callable:\n \"\"\"Retrieves the function from the toolbox.\n\n Args:\n function_name: The name of the function to retrieve.\n\n Returns:\n Callable: The function to be executed.\n \"\"\"\n if self._toolbox:\n return getattr(self._toolbox, function_name, None)\n try:\n toolbox = load_module_from_path(\n str(Path(self.libpath) / f\"{function_name}.py\")\n )\n return getattr(toolbox, function_name, None)\n except FileNotFoundError:\n return None\n except Exception as e:\n self._logger.warning(f\"Error loading function '{function_name}': {e}\")\n return None\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenExecutionChecker._is_dangerous","title":"_is_dangerous(function)
","text":"Checks if a function is dangerous to remove it. Contains a list of heuristics to avoid executing possibly dangerous functions.
Source code insrc/distilabel/steps/tasks/apigen/execution_checker.py
def _is_dangerous(self, function: Callable) -> bool:\n \"\"\"Checks if a function is dangerous to remove it.\n Contains a list of heuristics to avoid executing possibly dangerous functions.\n \"\"\"\n source_code = inspect.getsource(function)\n # We don't want to execute functions that use subprocess\n if (\n (\"subprocess.\" in source_code)\n or (\"os.system(\" in source_code)\n or (\"input(\" in source_code)\n # Avoiding threading\n or (\"threading.Thread(\" in source_code)\n or (\"exec(\" in source_code)\n # Avoiding argparse (not sure why)\n or (\"argparse.ArgumentParser(\" in source_code)\n # Avoiding logging changing the levels to not mess with the logs\n or (\".setLevel(\" in source_code)\n # Don't run a test battery\n or (\"unittest.main(\" in source_code)\n # Avoid exiting the program\n or (\"sys.exit(\" in source_code)\n or (\"exit(\" in source_code)\n or (\"raise SystemExit(\" in source_code)\n or (\"multiprocessing.Pool(\" in source_code)\n ):\n return True\n return False\n
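A minimal sketch of the same heuristic, using only the standard library (the function names below are illustrative, not part of distilabel): the check is a plain substring scan over the function's source, so a function that shells out is flagged while a pure function is not.
import inspect\n\nDANGEROUS_MARKERS = (\"subprocess.\", \"os.system(\", \"exec(\", \"sys.exit(\")\n\ndef looks_dangerous(func) -> bool:\n    # Flag the function if its source contains any blocked pattern.\n    source = inspect.getsource(func)\n    return any(marker in source for marker in DANGEROUS_MARKERS)\n\ndef safe_add(a: int, b: int) -> int:\n    return a + b\n\ndef run_shell(cmd: str) -> int:\n    import os\n    return os.system(cmd)\n\nassert looks_dangerous(safe_add) is False\nassert looks_dangerous(run_shell) is True\n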
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenExecutionChecker.process","title":"process(inputs)
","text":"Checks the answer to see if it can be executed. Captures the possible errors and returns them.
If a single example is provided, it is copied to avoid raising an error.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of dictionaries with the input data.
requiredYields:
Type DescriptionStepOutput
A list of dictionaries with the output data.
Source code insrc/distilabel/steps/tasks/apigen/execution_checker.py
@override\ndef process(self, inputs: StepInput) -> \"StepOutput\":\n \"\"\"Checks the answer to see if it can be executed.\n Captures the possible errors and returns them.\n\n If a single example is provided, it is copied to avoid raising an error.\n\n Args:\n inputs: A list of dictionaries with the input data.\n\n Yields:\n A list of dictionaries with the output data.\n \"\"\"\n for input in inputs:\n output = []\n if input[\"answers\"]:\n answers = json.loads(input[\"answers\"])\n else:\n input.update(\n **{\n \"keep_row_after_execution_check\": False,\n \"execution_result\": [\"No answers were provided.\"],\n }\n )\n continue\n for answer in answers:\n if answer is None:\n output.append(\n {\n \"keep\": False,\n \"execution_result\": \"Nothing was generated for this answer.\",\n }\n )\n continue\n\n function_name = answer.get(\"name\", None)\n arguments = answer.get(\"arguments\", None)\n\n self._logger.debug(\n f\"Executing function '{function_name}' with arguments: {arguments}\"\n )\n function = self._get_function(function_name)\n\n if self.check_is_dangerous:\n if function and self._is_dangerous(function):\n function = None\n\n if function is None:\n output.append(\n {\n \"keep\": False,\n \"execution_result\": f\"Function '{function_name}' not found.\",\n }\n )\n else:\n execution = execute_from_response(function, arguments)\n output.append(\n {\n \"keep\": execution[\"keep\"],\n \"execution_result\": execution[\"execution_result\"],\n }\n )\n # We only consider a good response if all the answers were executed successfully,\n # but keep the reasons for further review if needed.\n input.update(\n **{\n \"keep_row_after_execution_check\": all(\n o[\"keep\"] is True for o in output\n ),\n \"execution_result\": [o[\"execution_result\"] for o in output],\n }\n )\n\n yield inputs\n
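A hedged usage sketch in the style of the other examples on this page (the lib.py path and the add function are assumptions): the step loads the functions from libpath, executes each generated call, and appends keep_row_after_execution_check plus the execution results to the row.
# lib.py (assumed to exist next to the script) could contain, e.g.:\n# def add(a: int, b: int) -> int:\n#     return a + b\n\nimport json\nfrom distilabel.steps.tasks import APIGenExecutionChecker\n\nchecker = APIGenExecutionChecker(libpath=\"lib.py\")\nchecker.load()\n\nres = next(\n    checker.process(\n        [{\"answers\": json.dumps([{\"name\": \"add\", \"arguments\": {\"a\": 1, \"b\": 2}}])}]\n    )\n)\n# Expected shape (exact values depend on the execution):\n# [{'answers': '...', 'keep_row_after_execution_check': True, 'execution_result': ['3']}]\n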
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator","title":"APIGenGenerator
","text":" Bases: Task
Generate queries and answers for the given functions in JSON format.
The `APIGenGenerator` is inspired by the APIGen pipeline, which was designed to generate\nverifiable and diverse function-calling datasets. The task generates a set of diverse queries\nand corresponding answers for the given functions in JSON format.\n\nAttributes:\n system_prompt: The system prompt to guide the user in the generation of queries and answers.\n use_tools: Whether to use the tools available in the prompt to generate the queries and answers.\n In case the tools are given in the input, they will be added to the prompt.\n number: The number of queries to generate. It can be a list, where each number will be\n chosen randomly, or a dictionary with the number of queries and the probability of each.\n I.e: `number=1`, `number=[1, 2, 3]`, `number={1: 0.5, 2: 0.3, 3: 0.2}` are all valid inputs.\n It corresponds to the number of parallel queries to generate.\n use_default_structured_output: Whether to use the default structured output or not.\n\nInput columns:\n - examples (`str`): Examples used as few shots to guide the model.\n - func_name (`str`): Name for the function to generate.\n - func_desc (`str`): Description of what the function should do.\n - tools (`str`): JSON formatted string containing the tool representation of the function.\n\nOutput columns:\n - query (`str`): The list of queries.\n - answers (`str`): JSON formatted string with the list of answers, containing the info as\n a dictionary to be passed to the functions.\n\nCategories:\n - text-generation\n\nReferences:\n - [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518)\n - [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)\n\nExamples:\n Generate without structured output (original implementation):\n\n ```python\n from distilabel.steps.tasks import ApiGenGenerator\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n )\n apigen = ApiGenGenerator(\n use_default_structured_output=False,\n llm=llm\n )\n apigen.load()\n\n res = next(\n apigen.process(\n [\n {\n \"examples\": 'QUERY:\n
What is the binary sum of 10010 and 11101? ANSWER: [{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]', \"func_name\": \"getrandommovie\", \"func_desc\": \"Returns a list of random movies from a database by calling an external API.\" } ] ) ) res # [{'examples': 'QUERY: What is the binary sum of 10010 and 11101? ANSWER: [{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]', # 'number': 1, # 'func_name': 'getrandommovie', # 'func_desc': 'Returns a list of random movies from a database by calling an external API.', # 'queries': ['I want to watch a movie tonight, can you recommend a random one from your database?', # 'Give me 5 random movie suggestions from your database to plan my weekend.'], # 'answers': [[{'name': 'getrandommovie', 'arguments': {}}], # [{'name': 'getrandommovie', 'arguments': {}}, # {'name': 'getrandommovie', 'arguments': {}}, # {'name': 'getrandommovie', 'arguments': {}}, # {'name': 'getrandommovie', 'arguments': {}}, # {'name': 'getrandommovie', 'arguments': {}}]], # 'raw_input_api_gen_generator_0': [{'role': 'system', # 'content': \"You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.
Construct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.
Ensure the query: - Is clear and concise - Demonstrates typical use cases - Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words - Across a variety level of difficulties, ranging from beginner and advanced use cases - The corresponding result's parameter types and ranges match with the function's descriptions
Ensure the answer: - Is a list of function calls in JSON format - The length of the answer list should be equal to the number of requests in the query - Can solve all the requests in the query effectively\"}, # {'role': 'user', # 'content': 'Here are examples of queries and the corresponding answers for similar functions: QUERY: What is the binary sum of 10010 and 11101? ANSWER: [{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]
Note that the query could be interpreted as a combination of several independent requests. Based on these examples, generate 2 diverse query and answer pairs for the function getrandommovie
The detailed function description is the following: Returns a list of random movies from a database by calling an external API.
The output MUST strictly adhere to the following JSON format, and NO other text MUST be included:
[\n {\n \"query\": \"The generated query.\",\n \"answers\": [\n {\n \"name\": \"api_name\",\n \"arguments\": {\n \"arg_name\": \"value\"\n ... (more arguments as required)\n }\n },\n ... (more API calls as required)\n ]\n }\n]\n
Now please generate 2 diverse query and answer pairs following the above format.'}]}, # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}] ```
Generate with structured output:\n\n ```python\n from distilabel.steps.tasks import ApiGenGenerator\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n )\n apigen = ApiGenGenerator(\n use_default_structured_output=True,\n llm=llm\n )\n apigen.load()\n\n res_struct = next(\n apigen.process(\n [\n {\n \"examples\": 'QUERY:\n
What is the binary sum of 10010 and 11101? ANSWER: [{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]', \"func_name\": \"getrandommovie\", \"func_desc\": \"Returns a list of random movies from a database by calling an external API.\" } ] ) ) res_struct # [{'examples': 'QUERY: What is the binary sum of 10010 and 11101? ANSWER: [{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]', # 'number': 1, # 'func_name': 'getrandommovie', # 'func_desc': 'Returns a list of random movies from a database by calling an external API.', # 'queries': [\"I'm bored and want to watch a movie. Can you suggest some movies?\", # \"My family and I are planning a movie night. We can't decide on what to watch. Can you suggest some random movie titles?\"], # 'answers': [[{'arguments': {}, 'name': 'getrandommovie'}], # [{'arguments': {}, 'name': 'getrandommovie'}]], # 'raw_input_api_gen_generator_0': [{'role': 'system', # 'content': \"You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.
Construct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.
Ensure the query: - Is clear and concise - Demonstrates typical use cases - Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words - Across a variety level of difficulties, ranging from beginner and advanced use cases - The corresponding result's parameter types and ranges match with the function's descriptions
Ensure the answer: - Is a list of function calls in JSON format - The length of the answer list should be equal to the number of requests in the query - Can solve all the requests in the query effectively\"}, # {'role': 'user', # 'content': 'Here are examples of queries and the corresponding answers for similar functions: QUERY: What is the binary sum of 10010 and 11101? ANSWER: [{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]
Note that the query could be interpreted as a combination of several independent requests. Based on these examples, generate 2 diverse query and answer pairs for the function getrandommovie
The detailed function description is the following: Returns a list of random movies from a database by calling an external API.
Now please generate 2 diverse query and answer pairs following the above format.'}]}, # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}] ```
Source code insrc/distilabel/steps/tasks/apigen/generator.py
class APIGenGenerator(Task):\n \"\"\"Generate queries and answers for the given functions in JSON format.\n\n The `APIGenGenerator` is inspired by the APIGen pipeline, which was designed to generate\n verifiable and diverse function-calling datasets. The task generates a set of diverse queries\n and corresponding answers for the given functions in JSON format.\n\n Attributes:\n system_prompt: The system prompt to guide the user in the generation of queries and answers.\n use_tools: Whether to use the tools available in the prompt to generate the queries and answers.\n In case the tools are given in the input, they will be added to the prompt.\n number: The number of queries to generate. It can be a list, where each number will be\n chosen randomly, or a dictionary with the number of queries and the probability of each.\n I.e: `number=1`, `number=[1, 2, 3]`, `number={1: 0.5, 2: 0.3, 3: 0.2}` are all valid inputs.\n It corresponds to the number of parallel queries to generate.\n use_default_structured_output: Whether to use the default structured output or not.\n\n Input columns:\n - examples (`str`): Examples used as few shots to guide the model.\n - func_name (`str`): Name for the function to generate.\n - func_desc (`str`): Description of what the function should do.\n - tools (`str`): JSON formatted string containing the tool representation of the function.\n\n Output columns:\n - query (`str`): The list of queries.\n - answers (`str`): JSON formatted string with the list of answers, containing the info as\n a dictionary to be passed to the functions.\n\n Categories:\n - text-generation\n\n References:\n - [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518)\n - [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)\n\n Examples:\n Generate without structured output (original implementation):\n\n ```python\n from distilabel.steps.tasks import ApiGenGenerator\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n )\n apigen = ApiGenGenerator(\n use_default_structured_output=False,\n llm=llm\n )\n apigen.load()\n\n res = next(\n apigen.process(\n [\n {\n \"examples\": 'QUERY:\\nWhat is the binary sum of 10010 and 11101?\\nANSWER:\\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]',\n \"func_name\": \"getrandommovie\",\n \"func_desc\": \"Returns a list of random movies from a database by calling an external API.\"\n }\n ]\n )\n )\n res\n # [{'examples': 'QUERY:\\nWhat is the binary sum of 10010 and 11101?\\nANSWER:\\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]',\n # 'number': 1,\n # 'func_name': 'getrandommovie',\n # 'func_desc': 'Returns a list of random movies from a database by calling an external API.',\n # 'queries': ['I want to watch a movie tonight, can you recommend a random one from your database?',\n # 'Give me 5 random movie suggestions from your database to plan my weekend.'],\n # 'answers': [[{'name': 'getrandommovie', 'arguments': {}}],\n # [{'name': 'getrandommovie', 'arguments': {}},\n # {'name': 'getrandommovie', 'arguments': {}},\n # {'name': 'getrandommovie', 'arguments': {}},\n # {'name': 'getrandommovie', 'arguments': {}},\n # {'name': 'getrandommovie', 'arguments': {}}]],\n # 
'raw_input_api_gen_generator_0': [{'role': 'system',\n # 'content': \"You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\\n\\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\\n\\nEnsure the query:\\n- Is clear and concise\\n- Demonstrates typical use cases\\n- Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words\\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\\n- The corresponding result's parameter types and ranges match with the function's descriptions\\n\\nEnsure the answer:\\n- Is a list of function calls in JSON format\\n- The length of the answer list should be equal to the number of requests in the query\\n- Can solve all the requests in the query effectively\"},\n # {'role': 'user',\n # 'content': 'Here are examples of queries and the corresponding answers for similar functions:\\nQUERY:\\nWhat is the binary sum of 10010 and 11101?\\nANSWER:\\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]\\n\\nNote that the query could be interpreted as a combination of several independent requests.\\nBased on these examples, generate 2 diverse query and answer pairs for the function `getrandommovie`\\nThe detailed function description is the following:\\nReturns a list of random movies from a database by calling an external API.\\n\\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:\\n```json\\n[\\n {\\n \"query\": \"The generated query.\",\\n \"answers\": [\\n {\\n \"name\": \"api_name\",\\n \"arguments\": {\\n \"arg_name\": \"value\"\\n ... (more arguments as required)\\n }\\n },\\n ... (more API calls as required)\\n ]\\n }\\n]\\n```\\n\\nNow please generate 2 diverse query and answer pairs following the above format.'}]},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n\n Generate with structured output:\n\n ```python\n from distilabel.steps.tasks import ApiGenGenerator\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n )\n apigen = ApiGenGenerator(\n use_default_structured_output=True,\n llm=llm\n )\n apigen.load()\n\n res_struct = next(\n apigen.process(\n [\n {\n \"examples\": 'QUERY:\\nWhat is the binary sum of 10010 and 11101?\\nANSWER:\\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]',\n \"func_name\": \"getrandommovie\",\n \"func_desc\": \"Returns a list of random movies from a database by calling an external API.\"\n }\n ]\n )\n )\n res_struct\n # [{'examples': 'QUERY:\\nWhat is the binary sum of 10010 and 11101?\\nANSWER:\\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]',\n # 'number': 1,\n # 'func_name': 'getrandommovie',\n # 'func_desc': 'Returns a list of random movies from a database by calling an external API.',\n # 'queries': [\"I'm bored and want to watch a movie. Can you suggest some movies?\",\n # \"My family and I are planning a movie night. 
We can't decide on what to watch. Can you suggest some random movie titles?\"],\n # 'answers': [[{'arguments': {}, 'name': 'getrandommovie'}],\n # [{'arguments': {}, 'name': 'getrandommovie'}]],\n # 'raw_input_api_gen_generator_0': [{'role': 'system',\n # 'content': \"You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\\n\\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\\n\\nEnsure the query:\\n- Is clear and concise\\n- Demonstrates typical use cases\\n- Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words\\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\\n- The corresponding result's parameter types and ranges match with the function's descriptions\\n\\nEnsure the answer:\\n- Is a list of function calls in JSON format\\n- The length of the answer list should be equal to the number of requests in the query\\n- Can solve all the requests in the query effectively\"},\n # {'role': 'user',\n # 'content': 'Here are examples of queries and the corresponding answers for similar functions:\\nQUERY:\\nWhat is the binary sum of 10010 and 11101?\\nANSWER:\\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]\\n\\nNote that the query could be interpreted as a combination of several independent requests.\\nBased on these examples, generate 2 diverse query and answer pairs for the function `getrandommovie`\\nThe detailed function description is the following:\\nReturns a list of random movies from a database by calling an external API.\\n\\nNow please generate 2 diverse query and answer pairs following the above format.'}]},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n \"\"\"\n\n system_prompt: str = SYSTEM_PROMPT_API_GEN\n use_default_structured_output: bool = False\n number: Union[int, List[int], Dict[int, float]] = 1\n use_tools: bool = True\n\n _number: Union[int, None] = PrivateAttr(None)\n _fn_parallel_queries: Union[Callable[[], str], None] = PrivateAttr(None)\n _format_inst: Union[str, None] = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Loads the template for the generator prompt.\"\"\"\n super().load()\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"apigen\"\n / \"generator.jinja2\"\n )\n self._template = Template(open(_path).read())\n self._format_inst = self._set_format_inst()\n\n def _parallel_queries(self, number: int) -> Callable[[int], str]:\n \"\"\"Prepares the function to update the parallel queries guide in the prompt.\n\n Raises:\n ValueError: if `is_parallel` is not a boolean or a list of floats.\n\n Returns:\n The function to generate the parallel queries guide.\n \"\"\"\n if number > 1:\n return (\n \"It can contain multiple parallel queries in natural language for the given functions. 
\"\n \"They could use either the same function with different arguments or different functions.\\n\"\n )\n return \"\"\n\n def _get_number(self) -> int:\n \"\"\"Generates the number of queries to generate in a single call.\n The number must be set to `_number` to avoid changing the original value\n when calling `_default_error`.\n \"\"\"\n if isinstance(self.number, list):\n self._number = random.choice(self.number)\n elif isinstance(self.number, dict):\n self._number = random.choices(\n list(self.number.keys()), list(self.number.values())\n )[0]\n else:\n self._number = self.number\n return self._number\n\n def _set_format_inst(self) -> str:\n \"\"\"Prepares the function to generate the formatted instructions for the prompt.\n\n If the default structured output is used, returns an empty string because nothing\n else is needed, otherwise, returns the original addition to the prompt to guide the model\n to generate a formatted JSON.\n \"\"\"\n return (\n \"\\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:\\n\"\n \"```\\n\"\n \"[\\n\"\n \" {\\n\"\n ' \"query\": \"The generated query.\",\\n'\n ' \"answers\": [\\n'\n \" {\\n\"\n ' \"name\": \"api_name\",\\n'\n ' \"arguments\": {\\n'\n ' \"arg_name\": \"value\"\\n'\n \" ... (more arguments as required)\\n\"\n \" }\\n\"\n \" },\\n\"\n \" ... (more API calls as required)\\n\"\n \" ]\\n\"\n \" }\\n\"\n \"]\\n\"\n \"```\\n\"\n )\n\n def _get_func_desc(self, input: Dict[str, Any]) -> str:\n \"\"\"If available and required, will use the info from the tools in the\n prompt for extra information. Otherwise will use jut the function description.\n \"\"\"\n if not self.use_tools:\n return input[\"func_desc\"]\n extra = \"\" # Extra information from the tools (if available will be added)\n if \"tools\" in input:\n extra = f\"\\n\\nThis is the available tool to guide you (respect the order of the parameters):\\n{input['tools']}\"\n return input[\"func_desc\"] + extra\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The inputs for the task.\"\"\"\n return {\n \"examples\": True,\n \"func_name\": True,\n \"func_desc\": True,\n \"tools\": False,\n }\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType`.\"\"\"\n number = self._get_number()\n parallel_queries = self._parallel_queries(number)\n return [\n {\"role\": \"system\", \"content\": self.system_prompt},\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n examples=input[\"examples\"],\n parallel_queries=parallel_queries,\n number=number,\n func_name=input[\"func_name\"],\n func_desc=self._get_func_desc(input),\n format_inst=self._format_inst,\n ),\n },\n ]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The output for the task are the queries and corresponding answers.\"\"\"\n return [\"query\", \"answers\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. 
Used for obtaining the number of responses.\n\n Returns:\n A dict with the queries and answers pairs.\n The answers are an array of answers corresponding to the query.\n Each answer is represented as an object with the following properties:\n - name (string): The name of the tool used to generate the answer.\n - arguments (object): An object representing the arguments passed to the tool to generate the answer.\n Each argument is represented as a key-value pair, where the key is the parameter name and the\n value is the corresponding value.\n \"\"\"\n if output is None:\n return self._default_error(input)\n\n if not self.use_default_structured_output:\n output = remove_fences(output)\n\n try:\n pairs = orjson.loads(output)\n except orjson.JSONDecodeError:\n return self._default_error(input)\n\n pairs = pairs[\"pairs\"] if self.use_default_structured_output else pairs\n\n return self._format_output(pairs, input)\n\n def _format_output(\n self, pairs: Dict[str, Any], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"Parses the response, returning a dictionary with queries and answers.\n\n Args:\n pairs: The parsed dictionary from the LLM's output.\n input: The input from the `LLM`.\n\n Returns:\n Formatted output, where the `queries` are a list of strings, and the `answers`\n are a list of objects.\n \"\"\"\n try:\n input.update(\n **{\n \"query\": pairs[0][\"query\"],\n \"answers\": json.dumps(pairs[0][\"answers\"]),\n }\n )\n return input\n except Exception as e:\n self._logger.error(f\"Error formatting output: {e}, pairs: '{pairs}'\")\n return self._default_error(input)\n\n def _default_error(self, input: Dict[str, Any]) -> Dict[str, Any]:\n \"\"\"Returns a default error output, to fill the responses in case of failure.\"\"\"\n input.update(\n **{\n \"query\": None,\n \"answers\": json.dumps([None] * self._number),\n }\n )\n return input\n\n @override\n def get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from typing import Dict, List\n from pydantic import BaseModel\n\n\n class Answer(BaseModel):\n name: str\n arguments: Dict[str, str]\n\n class QueryAnswer(BaseModel):\n query: str\n answers: List[Answer]\n\n class QueryAnswerPairs(BaseModel):\n pairs: List[QueryAnswer]\n\n json.dumps(QueryAnswerPairs.model_json_schema(), indent=4)\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"$defs\": {\n \"Answer\": {\n \"properties\": {\n \"name\": {\"title\": \"Name\", \"type\": \"string\"},\n \"arguments\": {\n \"additionalProperties\": {\"type\": \"string\"},\n \"title\": \"Arguments\",\n \"type\": \"object\",\n },\n },\n \"required\": [\"name\", \"arguments\"],\n \"title\": \"Answer\",\n \"type\": \"object\",\n },\n \"QueryAnswer\": {\n \"properties\": {\n \"query\": {\"title\": \"Query\", \"type\": \"string\"},\n \"answers\": {\n \"items\": {\"$ref\": \"#/$defs/Answer\"},\n \"title\": \"Answers\",\n \"type\": \"array\",\n },\n },\n \"required\": [\"query\", \"answers\"],\n \"title\": \"QueryAnswer\",\n \"type\": \"object\",\n },\n },\n \"properties\": {\n \"pairs\": {\n \"items\": {\"$ref\": \"#/$defs/QueryAnswer\"},\n \"title\": \"Pairs\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"pairs\"],\n \"title\": \"QueryAnswerPairs\",\n \"type\": \"object\",\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator.inputs","title":"inputs
property
","text":"The inputs for the task.
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator.outputs","title":"outputs
property
","text":"The output for the task are the queries and corresponding answers.
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator.load","title":"load()
","text":"Loads the template for the generator prompt.
Source code insrc/distilabel/steps/tasks/apigen/generator.py
def load(self) -> None:\n \"\"\"Loads the template for the generator prompt.\"\"\"\n super().load()\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"apigen\"\n / \"generator.jinja2\"\n )\n self._template = Template(open(_path).read())\n self._format_inst = self._set_format_inst()\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator._parallel_queries","title":"_parallel_queries(number)
","text":"Prepares the function to update the parallel queries guide in the prompt.
Raises:
Type DescriptionValueError
if is_parallel
is not a boolean or a list of floats.
Returns:
Type DescriptionCallable[[int], str]
The function to generate the parallel queries guide.
Source code insrc/distilabel/steps/tasks/apigen/generator.py
def _parallel_queries(self, number: int) -> Callable[[int], str]:\n \"\"\"Prepares the function to update the parallel queries guide in the prompt.\n\n Raises:\n ValueError: if `is_parallel` is not a boolean or a list of floats.\n\n Returns:\n The function to generate the parallel queries guide.\n \"\"\"\n if number > 1:\n return (\n \"It can contain multiple parallel queries in natural language for the given functions. \"\n \"They could use either the same function with different arguments or different functions.\\n\"\n )\n return \"\"\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator._get_number","title":"_get_number()
","text":"Generates the number of queries to generate in a single call. The number must be set to _number
to avoid changing the original value when calling _default_error
.
src/distilabel/steps/tasks/apigen/generator.py
def _get_number(self) -> int:\n \"\"\"Generates the number of queries to generate in a single call.\n The number must be set to `_number` to avoid changing the original value\n when calling `_default_error`.\n \"\"\"\n if isinstance(self.number, list):\n self._number = random.choice(self.number)\n elif isinstance(self.number, dict):\n self._number = random.choices(\n list(self.number.keys()), list(self.number.values())\n )[0]\n else:\n self._number = self.number\n return self._number\n
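A small standalone sketch of the sampling behaviour described above (names are illustrative): an int is used as-is, a list is sampled uniformly, and a dict is treated as value-to-probability weights.
import random\nfrom typing import Dict, List, Union\n\ndef sample_number(number: Union[int, List[int], Dict[int, float]]) -> int:\n    # Mirrors the branching above: list -> uniform choice, dict -> weighted choice.\n    if isinstance(number, list):\n        return random.choice(number)\n    if isinstance(number, dict):\n        return random.choices(list(number.keys()), weights=list(number.values()))[0]\n    return number\n\nprint(sample_number(1))                         # always 1\nprint(sample_number([1, 2, 3]))                 # uniform over 1, 2, 3\nprint(sample_number({1: 0.5, 2: 0.3, 3: 0.2}))  # weighted draw\n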
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator._set_format_inst","title":"_set_format_inst()
","text":"Prepares the function to generate the formatted instructions for the prompt.
If the default structured output is used, returns an empty string because nothing else is needed, otherwise, returns the original addition to the prompt to guide the model to generate a formatted JSON.
Source code insrc/distilabel/steps/tasks/apigen/generator.py
def _set_format_inst(self) -> str:\n \"\"\"Prepares the function to generate the formatted instructions for the prompt.\n\n If the default structured output is used, returns an empty string because nothing\n else is needed, otherwise, returns the original addition to the prompt to guide the model\n to generate a formatted JSON.\n \"\"\"\n return (\n \"\\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:\\n\"\n \"```\\n\"\n \"[\\n\"\n \" {\\n\"\n ' \"query\": \"The generated query.\",\\n'\n ' \"answers\": [\\n'\n \" {\\n\"\n ' \"name\": \"api_name\",\\n'\n ' \"arguments\": {\\n'\n ' \"arg_name\": \"value\"\\n'\n \" ... (more arguments as required)\\n\"\n \" }\\n\"\n \" },\\n\"\n \" ... (more API calls as required)\\n\"\n \" ]\\n\"\n \" }\\n\"\n \"]\\n\"\n \"```\\n\"\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator._get_func_desc","title":"_get_func_desc(input)
","text":"If available and required, will use the info from the tools in the prompt for extra information. Otherwise will use jut the function description.
Source code insrc/distilabel/steps/tasks/apigen/generator.py
def _get_func_desc(self, input: Dict[str, Any]) -> str:\n \"\"\"If available and required, will use the info from the tools in the\n prompt for extra information. Otherwise will use jut the function description.\n \"\"\"\n if not self.use_tools:\n return input[\"func_desc\"]\n extra = \"\" # Extra information from the tools (if available will be added)\n if \"tools\" in input:\n extra = f\"\\n\\nThis is the available tool to guide you (respect the order of the parameters):\\n{input['tools']}\"\n return input[\"func_desc\"] + extra\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
.
src/distilabel/steps/tasks/apigen/generator.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType`.\"\"\"\n number = self._get_number()\n parallel_queries = self._parallel_queries(number)\n return [\n {\"role\": \"system\", \"content\": self.system_prompt},\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n examples=input[\"examples\"],\n parallel_queries=parallel_queries,\n number=number,\n func_name=input[\"func_name\"],\n func_desc=self._get_func_desc(input),\n format_inst=self._format_inst,\n ),\n },\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator.format_output","title":"format_output(output, input)
","text":"The output is formatted as a list with the score of each instruction.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
the raw output of the LLM.
requiredinput
Dict[str, Any]
the input to the task. Used for obtaining the number of responses.
requiredReturns:
Type DescriptionDict[str, Any]
A dict with the queries and answers pairs.
Dict[str, Any]
The answers are an array of answers corresponding to the query.
Dict[str, Any]
Each answer is represented as an object with the following properties: - name (string): The name of the tool used to generate the answer. - arguments (object): An object representing the arguments passed to the tool to generate the answer.
Dict[str, Any]
Each argument is represented as a key-value pair, where the key is the parameter name and the
Dict[str, Any]
value is the corresponding value.
Source code insrc/distilabel/steps/tasks/apigen/generator.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with the queries and answers pairs.\n The answers are an array of answers corresponding to the query.\n Each answer is represented as an object with the following properties:\n - name (string): The name of the tool used to generate the answer.\n - arguments (object): An object representing the arguments passed to the tool to generate the answer.\n Each argument is represented as a key-value pair, where the key is the parameter name and the\n value is the corresponding value.\n \"\"\"\n if output is None:\n return self._default_error(input)\n\n if not self.use_default_structured_output:\n output = remove_fences(output)\n\n try:\n pairs = orjson.loads(output)\n except orjson.JSONDecodeError:\n return self._default_error(input)\n\n pairs = pairs[\"pairs\"] if self.use_default_structured_output else pairs\n\n return self._format_output(pairs, input)\n
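As a rough sketch of this parsing step (the fence-stripping helper below is an assumption, not distilabel's remove_fences implementation): the raw completion is stripped of Markdown code fences, JSON-decoded, and the first query/answers pair is kept.
import json\n\ndef strip_fences(text: str) -> str:\n    # Illustrative helper: drop any ``` fence lines around the JSON payload.\n    lines = [line for line in text.splitlines() if not line.startswith(\"```\")]\n    return \"\\n\".join(lines)\n\nraw = '```json\\n[{\"query\": \"Give me a random movie\", \"answers\": [{\"name\": \"getrandommovie\", \"arguments\": {}}]}]\\n```'\npairs = json.loads(strip_fences(raw))\nrow = {\"query\": pairs[0][\"query\"], \"answers\": json.dumps(pairs[0][\"answers\"])}\nprint(row)\n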
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator._format_output","title":"_format_output(pairs, input)
","text":"Parses the response, returning a dictionary with queries and answers.
Parameters:
Name Type Description Defaultpairs
Dict[str, Any]
The parsed dictionary from the LLM's output.
requiredinput
Dict[str, Any]
The input from the LLM
.
Returns:
Type DescriptionDict[str, Any]
Formatted output, where the queries
are a list of strings, and the answers
Dict[str, Any]
are a list of objects.
Source code insrc/distilabel/steps/tasks/apigen/generator.py
def _format_output(\n self, pairs: Dict[str, Any], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"Parses the response, returning a dictionary with queries and answers.\n\n Args:\n pairs: The parsed dictionary from the LLM's output.\n input: The input from the `LLM`.\n\n Returns:\n Formatted output, where the `queries` are a list of strings, and the `answers`\n are a list of objects.\n \"\"\"\n try:\n input.update(\n **{\n \"query\": pairs[0][\"query\"],\n \"answers\": json.dumps(pairs[0][\"answers\"]),\n }\n )\n return input\n except Exception as e:\n self._logger.error(f\"Error formatting output: {e}, pairs: '{pairs}'\")\n return self._default_error(input)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator._default_error","title":"_default_error(input)
","text":"Returns a default error output, to fill the responses in case of failure.
Source code insrc/distilabel/steps/tasks/apigen/generator.py
def _default_error(self, input: Dict[str, Any]) -> Dict[str, Any]:\n \"\"\"Returns a default error output, to fill the responses in case of failure.\"\"\"\n input.update(\n **{\n \"query\": None,\n \"answers\": json.dumps([None] * self._number),\n }\n )\n return input\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
The schema corresponds to the following:
from typing import Dict, List\nfrom pydantic import BaseModel\n\n\nclass Answer(BaseModel):\n name: str\n arguments: Dict[str, str]\n\nclass QueryAnswer(BaseModel):\n query: str\n answers: List[Answer]\n\nclass QueryAnswerPairs(BaseModel):\n pairs: List[QueryAnswer]\n\njson.dumps(QueryAnswerPairs.model_json_schema(), indent=4)\n
Returns:
Type DescriptionDict[str, Any]
JSON Schema of the response to enforce.
Source code insrc/distilabel/steps/tasks/apigen/generator.py
@override\ndef get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from typing import Dict, List\n from pydantic import BaseModel\n\n\n class Answer(BaseModel):\n name: str\n arguments: Dict[str, str]\n\n class QueryAnswer(BaseModel):\n query: str\n answers: List[Answer]\n\n class QueryAnswerPairs(BaseModel):\n pairs: List[QueryAnswer]\n\n json.dumps(QueryAnswerPairs.model_json_schema(), indent=4)\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"$defs\": {\n \"Answer\": {\n \"properties\": {\n \"name\": {\"title\": \"Name\", \"type\": \"string\"},\n \"arguments\": {\n \"additionalProperties\": {\"type\": \"string\"},\n \"title\": \"Arguments\",\n \"type\": \"object\",\n },\n },\n \"required\": [\"name\", \"arguments\"],\n \"title\": \"Answer\",\n \"type\": \"object\",\n },\n \"QueryAnswer\": {\n \"properties\": {\n \"query\": {\"title\": \"Query\", \"type\": \"string\"},\n \"answers\": {\n \"items\": {\"$ref\": \"#/$defs/Answer\"},\n \"title\": \"Answers\",\n \"type\": \"array\",\n },\n },\n \"required\": [\"query\", \"answers\"],\n \"title\": \"QueryAnswer\",\n \"type\": \"object\",\n },\n },\n \"properties\": {\n \"pairs\": {\n \"items\": {\"$ref\": \"#/$defs/QueryAnswer\"},\n \"title\": \"Pairs\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"pairs\"],\n \"title\": \"QueryAnswerPairs\",\n \"type\": \"object\",\n }\n
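As a usage note, the pydantic models shown in the docstring can also validate a structured response directly; a minimal sketch assuming pydantic v2 (the sample payload is illustrative):
from typing import Dict, List\nfrom pydantic import BaseModel\n\n\nclass Answer(BaseModel):\n    name: str\n    arguments: Dict[str, str]\n\nclass QueryAnswer(BaseModel):\n    query: str\n    answers: List[Answer]\n\nclass QueryAnswerPairs(BaseModel):\n    pairs: List[QueryAnswer]\n\n# Validate a structured LLM response against the same schema.\npayload = '{\"pairs\": [{\"query\": \"Suggest a movie\", \"answers\": [{\"name\": \"getrandommovie\", \"arguments\": {}}]}]}'\nparsed = QueryAnswerPairs.model_validate_json(payload)\nprint(parsed.pairs[0].query)\n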
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker","title":"APIGenSemanticChecker
","text":" Bases: Task
Checks the semantic correctness of the function calls generated for a query, given their execution results.
The APIGenSemanticChecker
is inspired by the APIGen pipeline, which was designed to generate verifiable and diverse function-calling datasets. The task assesses whether the generated function calls and their execution results accurately reflect the intentions of the user query, producing a reasoning thought and a keep/discard flag.
Attributes:
Name Type Descriptionsystem_prompt
str
System prompt for the task. Has a default one.
exclude_failed_execution
str
Whether to exclude failed executions (won't run on those rows that have a False in keep_row_after_execution_check
column, which comes from running APIGenExecutionChecker
). Defaults to True.
Input columns:
- func_desc (str): Description of what the function should do.
- query (str): Instruction from the user.
- answers (str): JSON encoded list with arguments to be passed to the function/API. Should be loaded using json.loads.
- execution_result (str): Result of the function/API executed.
Output columns:
- thought (str): Reasoning for the output on whether to keep this output or not.
- keep_row_after_semantic_check (bool): True or False, can be used to filter afterwards.
Examples:
Semantic checker for generated function calls (original implementation):\n\n```python\nfrom distilabel.steps.tasks import APIGenSemanticChecker\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n)\nsemantic_checker = APIGenSemanticChecker(\n use_default_structured_output=False,\n llm=llm\n)\nsemantic_checker.load()\n\nres = next(\n semantic_checker.process(\n [\n {\n \"func_desc\": \"Fetch information about a specific cat breed from the Cat Breeds API.\",\n \"query\": \"What information can be obtained about the Maine Coon cat breed?\",\n \"answers\": json.dumps([{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]),\n \"execution_result\": \"The Maine Coon is a big and hairy breed of cat\",\n }\n ]\n )\n)\nres\n# [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',\n# 'query': 'What information can be obtained about the Maine Coon cat breed?',\n# 'answers': [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}],\n# 'execution_result': 'The Maine Coon is a big and hairy breed of cat',\n# 'thought': '',\n# 'keep_row_after_semantic_check': True,\n# 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',\n# 'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user\u2019s intentions.\\n\\nDo not pass if:\\n1. The function call does not align with the query\u2019s objective, or the input arguments appear incorrect.\\n2. The function call and arguments are not properly chosen from the available functions.\\n3. The number of function calls does not correspond to the user\u2019s intentions.\\n4. The execution results are irrelevant and do not match the function\u2019s purpose.\\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\\n'},\n# {'role': 'user',\n# 'content': 'Given Information:\\n- All Available Functions:\\nFetch information about a specific cat breed from the Cat Breeds API.\\n- User Query: What information can be obtained about the Maine Coon cat breed?\\n- Generated Function Calls: [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]\\n- Execution Results: The Maine Coon is a big and hairy breed of cat\\n\\nNote: The query may have multiple intentions. 
Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\\n\\nThe main decision factor is wheather the function calls accurately reflect the query\\'s intentions and the function descriptions.\\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\\n\\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\\n```\\n{\\n \"thought\": \"Concisely describe your reasoning here\",\\n \"pass\": \"yes\" or \"no\"\\n}\\n```\\n'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n```\n\nSemantic checker for generated function calls (structured output):\n\n```python\nfrom distilabel.steps.tasks import APIGenSemanticChecker\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n)\nsemantic_checker = APIGenSemanticChecker(\n use_default_structured_output=True,\n llm=llm\n)\nsemantic_checker.load()\n\nres = next(\n semantic_checker.process(\n [\n {\n \"func_desc\": \"Fetch information about a specific cat breed from the Cat Breeds API.\",\n \"query\": \"What information can be obtained about the Maine Coon cat breed?\",\n \"answers\": json.dumps([{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]),\n \"execution_result\": \"The Maine Coon is a big and hairy breed of cat\",\n }\n ]\n )\n)\nres\n# [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',\n# 'query': 'What information can be obtained about the Maine Coon cat breed?',\n# 'answers': [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}],\n# 'execution_result': 'The Maine Coon is a big and hairy breed of cat',\n# 'keep_row_after_semantic_check': True,\n# 'thought': '',\n# 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',\n# 'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user\u2019s intentions.\\n\\nDo not pass if:\\n1. The function call does not align with the query\u2019s objective, or the input arguments appear incorrect.\\n2. The function call and arguments are not properly chosen from the available functions.\\n3. The number of function calls does not correspond to the user\u2019s intentions.\\n4. The execution results are irrelevant and do not match the function\u2019s purpose.\\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\\n'},\n# {'role': 'user',\n# 'content': 'Given Information:\\n- All Available Functions:\\nFetch information about a specific cat breed from the Cat Breeds API.\\n- User Query: What information can be obtained about the Maine Coon cat breed?\\n- Generated Function Calls: [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]\\n- Execution Results: The Maine Coon is a big and hairy breed of cat\\n\\nNote: The query may have multiple intentions. 
Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\\n\\nThe main decision factor is wheather the function calls accurately reflect the query\\'s intentions and the function descriptions.\\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\\n'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n```\n
Source code in src/distilabel/steps/tasks/apigen/semantic_checker.py
class APIGenSemanticChecker(Task):\n r\"\"\"Generate queries and answers for the given functions in JSON format.\n\n The `APIGenGenerator` is inspired by the APIGen pipeline, which was designed to generate\n verifiable and diverse function-calling datasets. The task generates a set of diverse queries\n and corresponding answers for the given functions in JSON format.\n\n Attributes:\n system_prompt: System prompt for the task. Has a default one.\n exclude_failed_execution: Whether to exclude failed executions (won't run on those\n rows that have a False in `keep_row_after_execution_check` column, which\n comes from running `APIGenExecutionChecker`). Defaults to True.\n\n Input columns:\n - func_desc (`str`): Description of what the function should do.\n - query (`str`): Instruction from the user.\n - answers (`str`): JSON encoded list with arguments to be passed to the function/API.\n Should be loaded using `json.loads`.\n - execution_result (`str`): Result of the function/API executed.\n\n Output columns:\n - thought (`str`): Reasoning for the output on whether to keep this output or not.\n - keep_row_after_semantic_check (`bool`): True or False, can be used to filter\n afterwards.\n\n Categories:\n - filtering\n - text-generation\n\n References:\n - [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518)\n - [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)\n\n Examples:\n\n Semantic checker for generated function calls (original implementation):\n\n ```python\n from distilabel.steps.tasks import APIGenSemanticChecker\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n )\n semantic_checker = APIGenSemanticChecker(\n use_default_structured_output=False,\n llm=llm\n )\n semantic_checker.load()\n\n res = next(\n semantic_checker.process(\n [\n {\n \"func_desc\": \"Fetch information about a specific cat breed from the Cat Breeds API.\",\n \"query\": \"What information can be obtained about the Maine Coon cat breed?\",\n \"answers\": json.dumps([{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]),\n \"execution_result\": \"The Maine Coon is a big and hairy breed of cat\",\n }\n ]\n )\n )\n res\n # [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',\n # 'query': 'What information can be obtained about the Maine Coon cat breed?',\n # 'answers': [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}],\n # 'execution_result': 'The Maine Coon is a big and hairy breed of cat',\n # 'thought': '',\n # 'keep_row_after_semantic_check': True,\n # 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',\n # 'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user\u2019s intentions.\\n\\nDo not pass if:\\n1. The function call does not align with the query\u2019s objective, or the input arguments appear incorrect.\\n2. The function call and arguments are not properly chosen from the available functions.\\n3. 
The number of function calls does not correspond to the user\u2019s intentions.\\n4. The execution results are irrelevant and do not match the function\u2019s purpose.\\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\\n'},\n # {'role': 'user',\n # 'content': 'Given Information:\\n- All Available Functions:\\nFetch information about a specific cat breed from the Cat Breeds API.\\n- User Query: What information can be obtained about the Maine Coon cat breed?\\n- Generated Function Calls: [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]\\n- Execution Results: The Maine Coon is a big and hairy breed of cat\\n\\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\\n\\nThe main decision factor is wheather the function calls accurately reflect the query\\'s intentions and the function descriptions.\\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\\n\\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\\n```\\n{\\n \"thought\": \"Concisely describe your reasoning here\",\\n \"pass\": \"yes\" or \"no\"\\n}\\n```\\n'}]},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n\n Semantic checker for generated function calls (structured output):\n\n ```python\n from distilabel.steps.tasks import APIGenSemanticChecker\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n )\n semantic_checker = APIGenSemanticChecker(\n use_default_structured_output=True,\n llm=llm\n )\n semantic_checker.load()\n\n res = next(\n semantic_checker.process(\n [\n {\n \"func_desc\": \"Fetch information about a specific cat breed from the Cat Breeds API.\",\n \"query\": \"What information can be obtained about the Maine Coon cat breed?\",\n \"answers\": json.dumps([{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]),\n \"execution_result\": \"The Maine Coon is a big and hairy breed of cat\",\n }\n ]\n )\n )\n res\n # [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',\n # 'query': 'What information can be obtained about the Maine Coon cat breed?',\n # 'answers': [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}],\n # 'execution_result': 'The Maine Coon is a big and hairy breed of cat',\n # 'keep_row_after_semantic_check': True,\n # 'thought': '',\n # 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',\n # 'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user\u2019s intentions.\\n\\nDo not pass if:\\n1. The function call does not align with the query\u2019s objective, or the input arguments appear incorrect.\\n2. The function call and arguments are not properly chosen from the available functions.\\n3. The number of function calls does not correspond to the user\u2019s intentions.\\n4. 
The execution results are irrelevant and do not match the function\u2019s purpose.\\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\\n'},\n # {'role': 'user',\n # 'content': 'Given Information:\\n- All Available Functions:\\nFetch information about a specific cat breed from the Cat Breeds API.\\n- User Query: What information can be obtained about the Maine Coon cat breed?\\n- Generated Function Calls: [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]\\n- Execution Results: The Maine Coon is a big and hairy breed of cat\\n\\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\\n\\nThe main decision factor is wheather the function calls accurately reflect the query\\'s intentions and the function descriptions.\\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\\n'}]},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n \"\"\"\n\n system_prompt: str = SYSTEM_PROMPT_SEMANTIC_CHECKER\n use_default_structured_output: bool = False\n\n _format_inst: Union[str, None] = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Loads the template for the generator prompt.\"\"\"\n super().load()\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"apigen\"\n / \"semantic_checker.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n self._format_inst = self._set_format_inst()\n\n def _set_format_inst(self) -> str:\n \"\"\"Prepares the function to generate the formatted instructions for the prompt.\n\n If the default structured output is used, returns an empty string because nothing\n else is needed, otherwise, returns the original addition to the prompt to guide the model\n to generate a formatted JSON.\n \"\"\"\n return (\n \"\\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\\n\"\n \"```\\n\"\n \"{\\n\"\n ' \"thought\": \"Concisely describe your reasoning here\",\\n'\n ' \"passes\": \"yes\" or \"no\"\\n'\n \"}\\n\"\n \"```\\n\"\n )\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The inputs for the task.\"\"\"\n return {\n \"func_desc\": True,\n \"query\": True,\n \"answers\": True,\n \"execution_result\": True,\n \"keep_row_after_execution_check\": True,\n }\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType`.\"\"\"\n return [\n {\"role\": \"system\", \"content\": self.system_prompt},\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n func_desc=input[\"func_desc\"],\n query=input[\"query\"] or \"\",\n func_call=input[\"answers\"] or \"\",\n execution_result=input[\"execution_result\"],\n format_inst=self._format_inst,\n ),\n },\n ]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The output for the task are the queries and corresponding answers.\"\"\"\n return [\"keep_row_after_semantic_check\", \"thought\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. 
Used for obtaining the number of responses.\n\n Returns:\n A dict with the queries and answers pairs.\n The answers are an array of answers corresponding to the query.\n Each answer is represented as an object with the following properties:\n - name (string): The name of the tool used to generate the answer.\n - arguments (object): An object representing the arguments passed to the tool to generate the answer.\n Each argument is represented as a key-value pair, where the key is the parameter name and the\n value is the corresponding value.\n \"\"\"\n if output is None:\n return self._default_error(input)\n\n output = remove_fences(output)\n\n try:\n result = orjson.loads(output)\n # Update the column name and change to bool\n result[\"keep_row_after_semantic_check\"] = (\n result.pop(\"passes\").lower() == \"yes\"\n )\n input.update(**result)\n return input\n except orjson.JSONDecodeError:\n return self._default_error(input)\n\n def _default_error(self, input: Dict[str, Any]) -> Dict[str, Any]:\n \"\"\"Default error message for the task.\"\"\"\n input.update({\"thought\": None, \"keep_row_after_semantic_check\": None})\n return input\n\n @override\n def get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from typing import Literal\n from pydantic import BaseModel\n import json\n\n class Checker(BaseModel):\n thought: str\n passes: Literal[\"yes\", \"no\"]\n\n json.dumps(Checker.model_json_schema(), indent=4)\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"properties\": {\n \"thought\": {\"title\": \"Thought\", \"type\": \"string\"},\n \"passes\": {\"enum\": [\"yes\", \"no\"], \"title\": \"Passes\", \"type\": \"string\"},\n },\n \"required\": [\"thought\", \"passes\"],\n \"title\": \"Checker\",\n \"type\": \"object\",\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker.inputs","title":"inputs
property
","text":"The inputs for the task.
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker.outputs","title":"outputs
property
","text":"The output for the task are the queries and corresponding answers.
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker.load","title":"load()
","text":"Loads the template for the generator prompt.
Source code insrc/distilabel/steps/tasks/apigen/semantic_checker.py
def load(self) -> None:\n \"\"\"Loads the template for the generator prompt.\"\"\"\n super().load()\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"apigen\"\n / \"semantic_checker.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n self._format_inst = self._set_format_inst()\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker._set_format_inst","title":"_set_format_inst()
","text":"Prepares the function to generate the formatted instructions for the prompt.
If the default structured output is used, returns an empty string because nothing else is needed, otherwise, returns the original addition to the prompt to guide the model to generate a formatted JSON.
Source code insrc/distilabel/steps/tasks/apigen/semantic_checker.py
def _set_format_inst(self) -> str:\n \"\"\"Prepares the function to generate the formatted instructions for the prompt.\n\n If the default structured output is used, returns an empty string because nothing\n else is needed, otherwise, returns the original addition to the prompt to guide the model\n to generate a formatted JSON.\n \"\"\"\n return (\n \"\\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\\n\"\n \"```\\n\"\n \"{\\n\"\n ' \"thought\": \"Concisely describe your reasoning here\",\\n'\n ' \"passes\": \"yes\" or \"no\"\\n'\n \"}\\n\"\n \"```\\n\"\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
.
src/distilabel/steps/tasks/apigen/semantic_checker.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType`.\"\"\"\n return [\n {\"role\": \"system\", \"content\": self.system_prompt},\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n func_desc=input[\"func_desc\"],\n query=input[\"query\"] or \"\",\n func_call=input[\"answers\"] or \"\",\n execution_result=input[\"execution_result\"],\n format_inst=self._format_inst,\n ),\n },\n ]\n
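For reference, each row passed to this method is a plain dictionary containing the columns declared in inputs. Below is a minimal, illustrative row sketch; the variable name input_row is arbitrary and the values are simply taken from the example earlier in this section:

```python
import json

# Illustrative input row for APIGenSemanticChecker.format_input; the keys match
# the task's `inputs` property, the values are example data only.
input_row = {
    "func_desc": "Fetch information about a specific cat breed from the Cat Breeds API.",
    "query": "What information can be obtained about the Maine Coon cat breed?",
    "answers": json.dumps(
        [{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}]
    ),
    "execution_result": "The Maine Coon is a big and hairy breed of cat",
    "keep_row_after_execution_check": True,
}
```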
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker.format_output","title":"format_output(output, input)
","text":"The output is formatted as a list with the score of each instruction.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
the raw output of the LLM.
requiredinput
Dict[str, Any]
the input to the task, which is updated with the parsed result and returned.
requiredReturns:
Type DescriptionDict[str, Any]
The input dictionary updated with the parsed thought and with keep_row_after_semantic_check, a boolean that is True when the LLM answered "yes".
Source code insrc/distilabel/steps/tasks/apigen/semantic_checker.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with the queries and answers pairs.\n The answers are an array of answers corresponding to the query.\n Each answer is represented as an object with the following properties:\n - name (string): The name of the tool used to generate the answer.\n - arguments (object): An object representing the arguments passed to the tool to generate the answer.\n Each argument is represented as a key-value pair, where the key is the parameter name and the\n value is the corresponding value.\n \"\"\"\n if output is None:\n return self._default_error(input)\n\n output = remove_fences(output)\n\n try:\n result = orjson.loads(output)\n # Update the column name and change to bool\n result[\"keep_row_after_semantic_check\"] = (\n result.pop(\"passes\").lower() == \"yes\"\n )\n input.update(**result)\n return input\n except orjson.JSONDecodeError:\n return self._default_error(input)\n
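As a rough, standalone sketch of the parsing performed above (using the standard library json module instead of orjson, omitting the code-fence stripping, and with a helper name, parse_semantic_check, introduced only for this sketch), the conversion from the raw LLM answer to the boolean flag looks like this:

```python
import json
from typing import Any, Dict


def parse_semantic_check(raw_output: str, row: Dict[str, Any]) -> Dict[str, Any]:
    """Mirror of the parsing above: read the JSON answer and turn the
    'passes' field into the boolean `keep_row_after_semantic_check`."""
    try:
        result = json.loads(raw_output)
        result["keep_row_after_semantic_check"] = result.pop("passes").lower() == "yes"
        row.update(**result)
        return row
    except (json.JSONDecodeError, KeyError, AttributeError):
        # Fallback mirroring `_default_error`.
        row.update({"thought": None, "keep_row_after_semantic_check": None})
        return row


row = {"query": "What information can be obtained about the Maine Coon cat breed?"}
parse_semantic_check('{"thought": "", "passes": "yes"}', row)
# row["keep_row_after_semantic_check"] is now True and row["thought"] is ""
```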
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker._default_error","title":"_default_error(input)
","text":"Default error message for the task.
Source code insrc/distilabel/steps/tasks/apigen/semantic_checker.py
def _default_error(self, input: Dict[str, Any]) -> Dict[str, Any]:\n \"\"\"Default error message for the task.\"\"\"\n input.update({\"thought\": None, \"keep_row_after_semantic_check\": None})\n return input\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
The schema corresponds to the following:
from typing import Literal\nfrom pydantic import BaseModel\nimport json\n\nclass Checker(BaseModel):\n thought: str\n passes: Literal[\"yes\", \"no\"]\n\njson.dumps(Checker.model_json_schema(), indent=4)\n
Returns:
Type DescriptionDict[str, Any]
JSON Schema of the response to enforce.
Source code insrc/distilabel/steps/tasks/apigen/semantic_checker.py
@override\ndef get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from typing import Literal\n from pydantic import BaseModel\n import json\n\n class Checker(BaseModel):\n thought: str\n passes: Literal[\"yes\", \"no\"]\n\n json.dumps(Checker.model_json_schema(), indent=4)\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"properties\": {\n \"thought\": {\"title\": \"Thought\", \"type\": \"string\"},\n \"passes\": {\"enum\": [\"yes\", \"no\"], \"title\": \"Passes\", \"type\": \"string\"},\n },\n \"required\": [\"thought\", \"passes\"],\n \"title\": \"Checker\",\n \"type\": \"object\",\n }\n
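To see what responses the schema enforces, one can rebuild the equivalent Pydantic model from the snippet above and validate a sample answer against it; a minimal sketch (the sample string is made up):

```python
from typing import Literal

from pydantic import BaseModel


class Checker(BaseModel):
    thought: str
    passes: Literal["yes", "no"]


# A response that satisfies the enforced schema.
checked = Checker.model_validate_json(
    '{"thought": "Arguments match the query.", "passes": "yes"}'
)
print(checked.passes)  # "yes"
```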
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller","title":"ArgillaLabeller
","text":" Bases: Task
Annotate Argilla records based on input fields, example records and question settings.
This task is designed to facilitate the annotation of Argilla records by leveraging a pre-trained LLM. It uses a system prompt that guides the LLM to understand the input fields, the question type, and the question settings. The task then formats the input data and generates a response based on the question. The response is validated against the question's value model, and the final suggestion is prepared for annotation.
Attributes:
Name Type Description_template
Union[Template, None]
a Jinja2 template used to format the input for the LLM.
Input columnsargilla.Record
): The record to be annotated.Optional[List[Dict[str, Any]]]
): The list of field settings for the input fields.Optional[Dict[str, Any]]
): The question settings for the question to be answered.Optional[List[Dict[str, Any]]]
): The few shot example records with responses to be used to answer the question.Optional[str]
): The guidelines for the annotation task.Dict[str, Any]
): The final suggestion for annotation.Argilla: Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
Examples:
Annotate a record with the same dataset and question:
import argilla as rg\nfrom argilla import Suggestion\nfrom distilabel.steps.tasks import ArgillaLabeller\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Get information from Argilla dataset definition\ndataset = rg.Dataset(\"my_dataset\")\npending_records_filter = rg.Filter((\"status\", \"==\", \"pending\"))\ncompleted_records_filter = rg.Filter((\"status\", \"==\", \"completed\"))\npending_records = list(\n dataset.records(\n query=rg.Query(filter=pending_records_filter),\n limit=5,\n )\n)\nexample_records = list(\n dataset.records(\n query=rg.Query(filter=completed_records_filter),\n limit=5,\n )\n)\nfield = dataset.settings.fields[\"text\"]\nquestion = dataset.settings.questions[\"label\"]\n\n# Initialize the labeller with the model and fields\nlabeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n fields=[field],\n question=question,\n example_records=example_records,\n guidelines=dataset.guidelines\n)\nlabeller.load()\n\n# Process the pending records\nresult = next(\n labeller.process(\n [\n {\n \"record\": record\n } for record in pending_records\n ]\n )\n)\n\n# Add the suggestions to the records\nfor record, suggestion in zip(pending_records, result):\n record.suggestions.add(Suggestion(**suggestion[\"suggestion\"]))\n\n# Log the updated records\ndataset.records.log(pending_records)\n
Annotate a record with alternating datasets and questions:
import argilla as rg\nfrom distilabel.steps.tasks import ArgillaLabeller\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Get information from Argilla dataset definition\ndataset = rg.Dataset(\"my_dataset\")\nfield = dataset.settings.fields[\"text\"]\nquestion = dataset.settings.questions[\"label\"]\nquestion2 = dataset.settings.questions[\"label2\"]\n\n# Initialize the labeller with the model and fields\nlabeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n)\nlabeller.load()\n\n# Process the record\nrecord = next(dataset.records())\nresult = next(\n labeller.process(\n [\n {\n \"record\": record,\n \"fields\": [field],\n \"question\": question,\n },\n {\n \"record\": record,\n \"fields\": [field],\n \"question\": question2,\n }\n ]\n )\n)\n\n# Add the suggestions to the record\nfor suggestion in result:\n record.suggestions.add(rg.Suggestion(**suggestion[\"suggestion\"]))\n\n# Log the updated record\ndataset.records.log([record])\n
Overwrite default prompts and instructions:
import argilla as rg\nfrom distilabel.steps.tasks import ArgillaLabeller\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Overwrite default prompts and instructions\nlabeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n system_prompt=\"You are an expert annotator and labelling assistant that understands complex domains and natural language processing.\",\n question_to_label_instruction={\n \"label_selection\": \"Select the appropriate label from the list of provided labels.\",\n \"multi_label_selection\": \"Select none, one or multiple labels from the list of provided labels.\",\n \"text\": \"Provide a text response to the question.\",\n \"rating\": \"Provide a rating for the question.\",\n },\n)\nlabeller.load()\n
Source code in src/distilabel/steps/tasks/argilla_labeller.py
class ArgillaLabeller(Task):\n \"\"\"\n Annotate Argilla records based on input fields, example records and question settings.\n\n This task is designed to facilitate the annotation of Argilla records by leveraging a pre-trained LLM.\n It uses a system prompt that guides the LLM to understand the input fields, the question type,\n and the question settings. The task then formats the input data and generates a response based on the question.\n The response is validated against the question's value model, and the final suggestion is prepared for annotation.\n\n Attributes:\n _template: a Jinja2 template used to format the input for the LLM.\n\n Input columns:\n - record (`argilla.Record`): The record to be annotated.\n - fields (`Optional[List[Dict[str, Any]]]`): The list of field settings for the input fields.\n - question (`Optional[Dict[str, Any]]`): The question settings for the question to be answered.\n - example_records (`Optional[List[Dict[str, Any]]]`): The few shot example records with responses to be used to answer the question.\n - guidelines (`Optional[str]`): The guidelines for the annotation task.\n\n Output columns:\n - suggestion (`Dict[str, Any]`): The final suggestion for annotation.\n\n Categories:\n - text-classification\n - scorer\n - text-generation\n\n References:\n - [`Argilla: Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets`](https://github.com/argilla-io/argilla/)\n\n Examples:\n Annotate a record with the same dataset and question:\n\n ```python\n import argilla as rg\n from argilla import Suggestion\n from distilabel.steps.tasks import ArgillaLabeller\n from distilabel.models import InferenceEndpointsLLM\n\n # Get information from Argilla dataset definition\n dataset = rg.Dataset(\"my_dataset\")\n pending_records_filter = rg.Filter((\"status\", \"==\", \"pending\"))\n completed_records_filter = rg.Filter((\"status\", \"==\", \"completed\"))\n pending_records = list(\n dataset.records(\n query=rg.Query(filter=pending_records_filter),\n limit=5,\n )\n )\n example_records = list(\n dataset.records(\n query=rg.Query(filter=completed_records_filter),\n limit=5,\n )\n )\n field = dataset.settings.fields[\"text\"]\n question = dataset.settings.questions[\"label\"]\n\n # Initialize the labeller with the model and fields\n labeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n fields=[field],\n question=question,\n example_records=example_records,\n guidelines=dataset.guidelines\n )\n labeller.load()\n\n # Process the pending records\n result = next(\n labeller.process(\n [\n {\n \"record\": record\n } for record in pending_records\n ]\n )\n )\n\n # Add the suggestions to the records\n for record, suggestion in zip(pending_records, result):\n record.suggestions.add(Suggestion(**suggestion[\"suggestion\"]))\n\n # Log the updated records\n dataset.records.log(pending_records)\n ```\n\n Annotate a record with alternating datasets and questions:\n\n ```python\n import argilla as rg\n from distilabel.steps.tasks import ArgillaLabeller\n from distilabel.models import InferenceEndpointsLLM\n\n # Get information from Argilla dataset definition\n dataset = rg.Dataset(\"my_dataset\")\n field = dataset.settings.fields[\"text\"]\n question = dataset.settings.questions[\"label\"]\n question2 = dataset.settings.questions[\"label2\"]\n\n # Initialize the labeller with the model and fields\n labeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n 
model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n )\n labeller.load()\n\n # Process the record\n record = next(dataset.records())\n result = next(\n labeller.process(\n [\n {\n \"record\": record,\n \"fields\": [field],\n \"question\": question,\n },\n {\n \"record\": record,\n \"fields\": [field],\n \"question\": question2,\n }\n ]\n )\n )\n\n # Add the suggestions to the record\n for suggestion in result:\n record.suggestions.add(rg.Suggestion(**suggestion[\"suggestion\"]))\n\n # Log the updated record\n dataset.records.log([record])\n ```\n\n Overwrite default prompts and instructions:\n\n ```python\n import argilla as rg\n from distilabel.steps.tasks import ArgillaLabeller\n from distilabel.models import InferenceEndpointsLLM\n\n # Overwrite default prompts and instructions\n labeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n system_prompt=\"You are an expert annotator and labelling assistant that understands complex domains and natural language processing.\",\n question_to_label_instruction={\n \"label_selection\": \"Select the appropriate label from the list of provided labels.\",\n \"multi_label_selection\": \"Select none, one or multiple labels from the list of provided labels.\",\n \"text\": \"Provide a text response to the question.\",\n \"rating\": \"Provide a rating for the question.\",\n },\n )\n labeller.load()\n ```\n \"\"\"\n\n system_prompt: str = (\n \"You are an expert annotator and labelling assistant that understands complex domains and natural language processing. \"\n \"You are given input fields and a question. \"\n \"You should create a valid JSON object as an response to the question based on the input fields. \"\n )\n question_to_label_instruction: Dict[str, str] = {\n \"label_selection\": \"Select the appropriate label for the fields from the list of optional labels.\",\n \"multi_label_selection\": \"Select none, one or multiple labels for the fields from the list of optional labels.\",\n \"text\": \"Provide a response to the question based on the fields.\",\n \"rating\": \"Provide a rating for the question based on the fields.\",\n }\n example_records: Optional[\n RuntimeParameter[Union[List[Union[Dict[str, Any], BaseModel]], None]]\n ] = Field(\n default=None,\n description=\"The few shot serialized example records or `BaseModel`s with responses to be used to answer the question.\",\n )\n fields: Optional[\n RuntimeParameter[Union[List[Union[BaseModel, Dict[str, Any]]], None]]\n ] = Field(\n default=None,\n description=\"The field serialized field settings or `BaseModel` for the fields to be used to answer the question.\",\n )\n question: Optional[\n RuntimeParameter[\n Union[\n Dict[str, Any],\n BaseModel,\n None,\n ]\n ]\n ] = Field(\n default=None,\n description=\"The question serialized question settings or `BaseModel` for the question to be answered.\",\n )\n guidelines: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The guidelines for the annotation task.\",\n )\n\n _template: Union[Template, None] = PrivateAttr(...)\n _client: Optional[Any] = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"argillalabeller.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> Dict[str, bool]:\n return {\n \"record\": True,\n \"fields\": False,\n \"question\": False,\n 
\"example_records\": False,\n \"guidelines\": False,\n }\n\n def _format_record(\n self, record: Dict[str, Any], fields: List[Dict[str, Any]]\n ) -> str:\n \"\"\"Format the record fields into a string.\n\n Args:\n record (Dict[str, Any]): The record to format.\n fields (List[Dict[str, Any]]): The fields to format.\n\n Returns:\n str: The formatted record fields.\n \"\"\"\n output = []\n for field in fields:\n output.append(record.get(\"fields\", {}).get(field.get(\"name\", \"\")))\n return \"fields: \" + \"\\n\".join(output)\n\n def _get_label_instruction(self, question: Dict[str, Any]) -> str:\n \"\"\"Get the label instruction for the question.\n\n Args:\n question (Dict[str, Any]): The question to get the label instruction for.\n\n Returns:\n str: The label instruction for the question.\n \"\"\"\n question_type = question[\"settings\"][\"type\"]\n return self.question_to_label_instruction[question_type]\n\n def _format_question(self, question: Dict[str, Any]) -> str:\n \"\"\"Format the question settings into a string.\n\n Args:\n question (Dict[str, Any]): The question to format.\n\n Returns:\n str: The formatted question.\n \"\"\"\n output = []\n output.append(f\"question: {self._get_label_instruction(question)}\")\n if \"options\" in question.get(\"settings\", {}):\n output.append(\n f\"optional labels: {[option['value'] for option in question.get('settings', {}).get('options', [])]}\"\n )\n return \"\\n\".join(output)\n\n def _format_example_records(\n self,\n records: List[Dict[str, Any]],\n fields: List[Dict[str, Any]],\n question: Dict[str, Any],\n ) -> str:\n \"\"\"Format the example records into a string.\n\n Args:\n records (List[Dict[str, Any]]): The records to format.\n fields (List[Dict[str, Any]]): The fields to format.\n question (Dict[str, Any]): The question to format.\n\n Returns:\n str: The formatted example records.\n \"\"\"\n base = []\n for record in records:\n responses = record.get(\"responses\", {})\n if responses.get(question[\"name\"]):\n base.append(self._format_record(record, fields))\n value = responses[question[\"name\"]][0][\"value\"]\n formatted_value = self._assign_value_to_question_value_model(\n value, question\n )\n base.append(f\"response: {formatted_value}\")\n base.append(\"\")\n else:\n warnings.warn(\n f\"Record {record} has no response for question {question['name']}. 
Skipping example record.\",\n stacklevel=2,\n )\n return \"\\n\".join(base)\n\n def format_input(\n self,\n input: Dict[\n str,\n Union[\n Dict[str, Any],\n \"Record\",\n \"TextField\",\n \"MultiLabelQuestion\",\n \"LabelQuestion\",\n \"RatingQuestion\",\n \"TextQuestion\",\n ],\n ],\n ) -> \"ChatType\":\n \"\"\"Format the input into a chat message.\n\n Args:\n input: The input to format.\n\n Returns:\n The formatted chat message.\n\n Raises:\n ValueError: If question or fields are not provided.\n \"\"\"\n input_keys = list(self.inputs.keys())\n record = input[input_keys[0]]\n fields = input.get(input_keys[1], self.fields)\n question = input.get(input_keys[2], self.question)\n examples = input.get(input_keys[3], self.example_records)\n guidelines = input.get(input_keys[4], self.guidelines)\n\n if question is None:\n raise ValueError(\"Question must be provided.\")\n if fields is None or any(field is None for field in fields):\n raise ValueError(\"Fields must be provided.\")\n\n record = record.to_dict() if not isinstance(record, dict) else record\n question = question.serialize() if not isinstance(question, dict) else question\n fields = [\n field.serialize() if not isinstance(field, dict) else field\n for field in fields\n ]\n examples = (\n [\n example.to_dict() if not isinstance(example, dict) else example\n for example in examples\n ]\n if examples\n else None\n )\n\n formatted_fields = self._format_record(record, fields)\n formatted_question = self._format_question(question)\n formatted_examples = (\n self._format_example_records(examples, fields, question)\n if examples\n else False\n )\n\n prompt = self._template.render(\n fields=formatted_fields,\n question=formatted_question,\n examples=formatted_examples,\n guidelines=guidelines,\n )\n\n messages = []\n if self.system_prompt:\n messages.append({\"role\": \"system\", \"content\": self.system_prompt})\n messages.append({\"role\": \"user\", \"content\": prompt})\n return messages\n\n @property\n def outputs(self) -> List[str]:\n return [\"suggestion\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"Format the output into a dictionary.\n\n Args:\n output (Union[str, None]): The output to format.\n input (Dict[str, Any]): The input to format.\n\n Returns:\n Dict[str, Any]: The formatted output.\n \"\"\"\n from argilla import Suggestion\n\n question: Union[\n Any,\n Dict[str, Any],\n LabelQuestion,\n MultiLabelQuestion,\n RatingQuestion,\n TextQuestion,\n None,\n ] = input.get(list(self.inputs.keys())[2], self.question) or self.question\n question = question.serialize() if not isinstance(question, dict) else question\n model = self._get_pydantic_model_of_structured_output(question)\n validated_output = model(**json.loads(output))\n value = self._get_value_from_question_value_model(validated_output)\n suggestion = Suggestion(\n value=value,\n question_name=question[\"name\"],\n type=\"model\",\n agent=self.llm.model_name,\n ).serialize()\n return {\n self.outputs[0]: {\n k: v\n for k, v in suggestion.items()\n if k in [\"value\", \"question_name\", \"type\", \"agent\"]\n }\n }\n\n def _set_llm_structured_output_for_question(self, question: Dict[str, Any]) -> None:\n runtime_parameters = self.llm._runtime_parameters\n runtime_parameters.update(\n {\n \"structured_output\": {\n \"format\": \"json\",\n \"schema\": self._get_pydantic_model_of_structured_output(question),\n },\n }\n )\n self.llm.set_runtime_parameters(runtime_parameters)\n\n @override\n def process(self, inputs: 
StepInput) -> \"StepOutput\":\n \"\"\"Process the input through the task.\n\n Args:\n inputs (StepInput): The input to process.\n\n Returns:\n StepOutput: The output of the task.\n \"\"\"\n\n question_list = [input.get(\"question\", self.question) for input in inputs]\n fields_list = [input.get(\"fields\", self.fields) for input in inputs]\n # check if any field for the field in fields is None\n for fields in fields_list:\n if any(field is None for field in fields):\n raise ValueError(\n \"Fields must be provided during init or through `process` method.\"\n )\n # check if any question is None\n if any(question is None for question in question_list):\n raise ValueError(\n \"Question must be provided during init or through `process` method.\"\n )\n question_list = [\n question.serialize() if not isinstance(question, dict) else question\n for question in question_list\n ]\n if not all(question == question_list[0] for question in question_list):\n warnings.warn(\n \"Not all questions are the same. Processing each question separately by setting the structured output for each question. This may impact performance.\",\n stacklevel=2,\n )\n for input, question in zip(inputs, question_list):\n self._set_llm_structured_output_for_question(question)\n yield from super().process([input])\n else:\n question = question_list[0]\n self._set_llm_structured_output_for_question(question)\n yield from super().process(inputs)\n\n def _get_value_from_question_value_model(\n self, question_value_model: BaseModel\n ) -> Any:\n \"\"\"Get the value from the question value model.\n\n Args:\n question_value_model (BaseModel): The question value model to get the value from.\n\n Returns:\n Any: The value from the question value model.\n \"\"\"\n for attr in [\"label\", \"labels\", \"rating\", \"text\"]:\n if hasattr(question_value_model, attr):\n return getattr(question_value_model, attr)\n raise ValueError(f\"Unsupported question type: {question_value_model}\")\n\n def _assign_value_to_question_value_model(\n self, value: Any, question: Dict[str, Any]\n ) -> BaseModel:\n \"\"\"Assign the value to the question value model.\n\n Args:\n value (Any): The value to assign.\n question (Dict[str, Any]): The question to assign the value to.\n\n Returns:\n BaseModel: The question value model with the assigned value.\n \"\"\"\n question_value_model = self._get_pydantic_model_of_structured_output(question)\n for attr in [\"label\", \"labels\", \"rating\", \"text\"]:\n try:\n model_dict = {attr: value}\n question_value_model = question_value_model(**model_dict)\n return question_value_model.model_dump_json()\n except AttributeError:\n pass\n return value\n\n def _get_pydantic_model_of_structured_output(\n self,\n question: Dict[str, Any],\n ) -> BaseModel:\n \"\"\"Get the Pydantic model of the structured output.\n\n Args:\n question (Dict[str, Any]): The question to get the Pydantic model of the structured output for.\n\n Returns:\n BaseModel: The Pydantic model of the structured output.\n \"\"\"\n\n question_type = question[\"settings\"][\"type\"]\n\n if question_type == \"multi_label_selection\":\n\n class QuestionValueModel(BaseModel):\n labels: Optional[List[str]] = Field(default_factory=list)\n\n elif question_type == \"label_selection\":\n\n class QuestionValueModel(BaseModel):\n label: str\n\n elif question_type == \"text\":\n\n class QuestionValueModel(BaseModel):\n text: str\n\n elif question_type == \"rating\":\n\n class QuestionValueModel(BaseModel):\n rating: int\n else:\n raise ValueError(f\"Unsupported question type: 
{question}\")\n\n return QuestionValueModel\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller.load","title":"load()
","text":"Loads the Jinja2 template.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"argillalabeller.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller._format_record","title":"_format_record(record, fields)
","text":"Format the record fields into a string.
Parameters:
Name Type Description Defaultrecord
Dict[str, Any]
The record to format.
requiredfields
List[Dict[str, Any]]
The fields to format.
requiredReturns:
Name Type Descriptionstr
str
The formatted record fields.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def _format_record(\n self, record: Dict[str, Any], fields: List[Dict[str, Any]]\n) -> str:\n \"\"\"Format the record fields into a string.\n\n Args:\n record (Dict[str, Any]): The record to format.\n fields (List[Dict[str, Any]]): The fields to format.\n\n Returns:\n str: The formatted record fields.\n \"\"\"\n output = []\n for field in fields:\n output.append(record.get(\"fields\", {}).get(field.get(\"name\", \"\")))\n return \"fields: \" + \"\\n\".join(output)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller._get_label_instruction","title":"_get_label_instruction(question)
","text":"Get the label instruction for the question.
Parameters:
Name Type Description Defaultquestion
Dict[str, Any]
The question to get the label instruction for.
requiredReturns:
Name Type Descriptionstr
str
The label instruction for the question.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def _get_label_instruction(self, question: Dict[str, Any]) -> str:\n \"\"\"Get the label instruction for the question.\n\n Args:\n question (Dict[str, Any]): The question to get the label instruction for.\n\n Returns:\n str: The label instruction for the question.\n \"\"\"\n question_type = question[\"settings\"][\"type\"]\n return self.question_to_label_instruction[question_type]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller._format_question","title":"_format_question(question)
","text":"Format the question settings into a string.
Parameters:
Name Type Description Defaultquestion
Dict[str, Any]
The question to format.
requiredReturns:
Name Type Descriptionstr
str
The formatted question.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def _format_question(self, question: Dict[str, Any]) -> str:\n \"\"\"Format the question settings into a string.\n\n Args:\n question (Dict[str, Any]): The question to format.\n\n Returns:\n str: The formatted question.\n \"\"\"\n output = []\n output.append(f\"question: {self._get_label_instruction(question)}\")\n if \"options\" in question.get(\"settings\", {}):\n output.append(\n f\"optional labels: {[option['value'] for option in question.get('settings', {}).get('options', [])]}\"\n )\n return \"\\n\".join(output)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller._format_example_records","title":"_format_example_records(records, fields, question)
","text":"Format the example records into a string.
Parameters:
Name Type Description Defaultrecords
List[Dict[str, Any]]
The records to format.
requiredfields
List[Dict[str, Any]]
The fields to format.
requiredquestion
Dict[str, Any]
The question to format.
requiredReturns:
Name Type Descriptionstr
str
The formatted example records.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def _format_example_records(\n self,\n records: List[Dict[str, Any]],\n fields: List[Dict[str, Any]],\n question: Dict[str, Any],\n) -> str:\n \"\"\"Format the example records into a string.\n\n Args:\n records (List[Dict[str, Any]]): The records to format.\n fields (List[Dict[str, Any]]): The fields to format.\n question (Dict[str, Any]): The question to format.\n\n Returns:\n str: The formatted example records.\n \"\"\"\n base = []\n for record in records:\n responses = record.get(\"responses\", {})\n if responses.get(question[\"name\"]):\n base.append(self._format_record(record, fields))\n value = responses[question[\"name\"]][0][\"value\"]\n formatted_value = self._assign_value_to_question_value_model(\n value, question\n )\n base.append(f\"response: {formatted_value}\")\n base.append(\"\")\n else:\n warnings.warn(\n f\"Record {record} has no response for question {question['name']}. Skipping example record.\",\n stacklevel=2,\n )\n return \"\\n\".join(base)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller.format_input","title":"format_input(input)
","text":"Format the input into a chat message.
Parameters:
Name Type Description Defaultinput
Dict[str, Union[Dict[str, Any], Record, TextField, MultiLabelQuestion, LabelQuestion, RatingQuestion, TextQuestion]]
The input to format.
requiredReturns:
Type DescriptionChatType
The formatted chat message.
Raises:
Type DescriptionValueError
If question or fields are not provided.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def format_input(\n self,\n input: Dict[\n str,\n Union[\n Dict[str, Any],\n \"Record\",\n \"TextField\",\n \"MultiLabelQuestion\",\n \"LabelQuestion\",\n \"RatingQuestion\",\n \"TextQuestion\",\n ],\n ],\n) -> \"ChatType\":\n \"\"\"Format the input into a chat message.\n\n Args:\n input: The input to format.\n\n Returns:\n The formatted chat message.\n\n Raises:\n ValueError: If question or fields are not provided.\n \"\"\"\n input_keys = list(self.inputs.keys())\n record = input[input_keys[0]]\n fields = input.get(input_keys[1], self.fields)\n question = input.get(input_keys[2], self.question)\n examples = input.get(input_keys[3], self.example_records)\n guidelines = input.get(input_keys[4], self.guidelines)\n\n if question is None:\n raise ValueError(\"Question must be provided.\")\n if fields is None or any(field is None for field in fields):\n raise ValueError(\"Fields must be provided.\")\n\n record = record.to_dict() if not isinstance(record, dict) else record\n question = question.serialize() if not isinstance(question, dict) else question\n fields = [\n field.serialize() if not isinstance(field, dict) else field\n for field in fields\n ]\n examples = (\n [\n example.to_dict() if not isinstance(example, dict) else example\n for example in examples\n ]\n if examples\n else None\n )\n\n formatted_fields = self._format_record(record, fields)\n formatted_question = self._format_question(question)\n formatted_examples = (\n self._format_example_records(examples, fields, question)\n if examples\n else False\n )\n\n prompt = self._template.render(\n fields=formatted_fields,\n question=formatted_question,\n examples=formatted_examples,\n guidelines=guidelines,\n )\n\n messages = []\n if self.system_prompt:\n messages.append({\"role\": \"system\", \"content\": self.system_prompt})\n messages.append({\"role\": \"user\", \"content\": prompt})\n return messages\n
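When the record, fields and question are already serialized, a single input row for this method is just a nested dictionary. The example below is purely illustrative (field names and labels are made up); in practice argilla objects are usually passed and serialized by the method itself:

```python
# Hypothetical, already-serialized input row for ArgillaLabeller.format_input.
input_row = {
    "record": {"fields": {"text": "I love this product!"}},
    "fields": [{"name": "text"}],
    "question": {
        "name": "label",
        "settings": {
            "type": "label_selection",
            "options": [{"value": "positive"}, {"value": "negative"}],
        },
    },
}
```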
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller.format_output","title":"format_output(output, input)
","text":"Format the output into a dictionary.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
The output to format.
requiredinput
Dict[str, Any]
The input to format.
requiredReturns:
Type DescriptionDict[str, Any]
Dict[str, Any]: The formatted output.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"Format the output into a dictionary.\n\n Args:\n output (Union[str, None]): The output to format.\n input (Dict[str, Any]): The input to format.\n\n Returns:\n Dict[str, Any]: The formatted output.\n \"\"\"\n from argilla import Suggestion\n\n question: Union[\n Any,\n Dict[str, Any],\n LabelQuestion,\n MultiLabelQuestion,\n RatingQuestion,\n TextQuestion,\n None,\n ] = input.get(list(self.inputs.keys())[2], self.question) or self.question\n question = question.serialize() if not isinstance(question, dict) else question\n model = self._get_pydantic_model_of_structured_output(question)\n validated_output = model(**json.loads(output))\n value = self._get_value_from_question_value_model(validated_output)\n suggestion = Suggestion(\n value=value,\n question_name=question[\"name\"],\n type=\"model\",\n agent=self.llm.model_name,\n ).serialize()\n return {\n self.outputs[0]: {\n k: v\n for k, v in suggestion.items()\n if k in [\"value\", \"question_name\", \"type\", \"agent\"]\n }\n }\n
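The suggestion returned by this method keeps only the keys Argilla needs (value, question_name, type and agent). For a hypothetical label_selection question answered with "positive" it would look roughly like the dictionary below; the agent is whatever self.llm.model_name reports, and the concrete values here are illustrative:

```python
# Illustrative shape of the `suggestion` output column; concrete values depend
# on the question and the LLM that produced the answer.
suggestion = {
    "value": "positive",
    "question_name": "label",
    "type": "model",
    "agent": "mistralai/Mistral-7B-Instruct-v0.2",
}
```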
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller.process","title":"process(inputs)
","text":"Process the input through the task.
Parameters:
Name Type Description Defaultinputs
StepInput
The input to process.
requiredReturns:
Name Type DescriptionStepOutput
StepOutput
The output of the task.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
@override\ndef process(self, inputs: StepInput) -> \"StepOutput\":\n \"\"\"Process the input through the task.\n\n Args:\n inputs (StepInput): The input to process.\n\n Returns:\n StepOutput: The output of the task.\n \"\"\"\n\n question_list = [input.get(\"question\", self.question) for input in inputs]\n fields_list = [input.get(\"fields\", self.fields) for input in inputs]\n # check if any field for the field in fields is None\n for fields in fields_list:\n if any(field is None for field in fields):\n raise ValueError(\n \"Fields must be provided during init or through `process` method.\"\n )\n # check if any question is None\n if any(question is None for question in question_list):\n raise ValueError(\n \"Question must be provided during init or through `process` method.\"\n )\n question_list = [\n question.serialize() if not isinstance(question, dict) else question\n for question in question_list\n ]\n if not all(question == question_list[0] for question in question_list):\n warnings.warn(\n \"Not all questions are the same. Processing each question separately by setting the structured output for each question. This may impact performance.\",\n stacklevel=2,\n )\n for input, question in zip(inputs, question_list):\n self._set_llm_structured_output_for_question(question)\n yield from super().process([input])\n else:\n question = question_list[0]\n self._set_llm_structured_output_for_question(question)\n yield from super().process(inputs)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller._get_value_from_question_value_model","title":"_get_value_from_question_value_model(question_value_model)
","text":"Get the value from the question value model.
Parameters:
Name Type Description Defaultquestion_value_model
BaseModel
The question value model to get the value from.
requiredReturns:
Name Type DescriptionAny
Any
The value from the question value model.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def _get_value_from_question_value_model(\n self, question_value_model: BaseModel\n) -> Any:\n \"\"\"Get the value from the question value model.\n\n Args:\n question_value_model (BaseModel): The question value model to get the value from.\n\n Returns:\n Any: The value from the question value model.\n \"\"\"\n for attr in [\"label\", \"labels\", \"rating\", \"text\"]:\n if hasattr(question_value_model, attr):\n return getattr(question_value_model, attr)\n raise ValueError(f\"Unsupported question type: {question_value_model}\")\n
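The attribute lookup above can be exercised standalone; a sketch for a hypothetical label_selection value model (QuestionValueModel and get_value are names introduced only for this sketch):

```python
from pydantic import BaseModel


class QuestionValueModel(BaseModel):
    label: str


def get_value(question_value_model: BaseModel):
    # Same attribute lookup order as the method above.
    for attr in ["label", "labels", "rating", "text"]:
        if hasattr(question_value_model, attr):
            return getattr(question_value_model, attr)
    raise ValueError(f"Unsupported question type: {question_value_model}")


print(get_value(QuestionValueModel(label="positive")))  # "positive"
```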
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller._assign_value_to_question_value_model","title":"_assign_value_to_question_value_model(value, question)
","text":"Assign the value to the question value model.
Parameters:
Name Type Description Defaultvalue
Any
The value to assign.
requiredquestion
Dict[str, Any]
The question to assign the value to.
requiredReturns:
Name Type DescriptionBaseModel
BaseModel
The question value model with the assigned value.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def _assign_value_to_question_value_model(\n self, value: Any, question: Dict[str, Any]\n) -> BaseModel:\n \"\"\"Assign the value to the question value model.\n\n Args:\n value (Any): The value to assign.\n question (Dict[str, Any]): The question to assign the value to.\n\n Returns:\n BaseModel: The question value model with the assigned value.\n \"\"\"\n question_value_model = self._get_pydantic_model_of_structured_output(question)\n for attr in [\"label\", \"labels\", \"rating\", \"text\"]:\n try:\n model_dict = {attr: value}\n question_value_model = question_value_model(**model_dict)\n return question_value_model.model_dump_json()\n except AttributeError:\n pass\n return value\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller._get_pydantic_model_of_structured_output","title":"_get_pydantic_model_of_structured_output(question)
","text":"Get the Pydantic model of the structured output.
Parameters:
Name Type Description Defaultquestion
Dict[str, Any]
The question to get the Pydantic model of the structured output for.
requiredReturns:
Name Type DescriptionBaseModel
BaseModel
The Pydantic model of the structured output.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def _get_pydantic_model_of_structured_output(\n self,\n question: Dict[str, Any],\n) -> BaseModel:\n \"\"\"Get the Pydantic model of the structured output.\n\n Args:\n question (Dict[str, Any]): The question to get the Pydantic model of the structured output for.\n\n Returns:\n BaseModel: The Pydantic model of the structured output.\n \"\"\"\n\n question_type = question[\"settings\"][\"type\"]\n\n if question_type == \"multi_label_selection\":\n\n class QuestionValueModel(BaseModel):\n labels: Optional[List[str]] = Field(default_factory=list)\n\n elif question_type == \"label_selection\":\n\n class QuestionValueModel(BaseModel):\n label: str\n\n elif question_type == \"text\":\n\n class QuestionValueModel(BaseModel):\n text: str\n\n elif question_type == \"rating\":\n\n class QuestionValueModel(BaseModel):\n rating: int\n else:\n raise ValueError(f\"Unsupported question type: {question}\")\n\n return QuestionValueModel\n
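The dynamically created model can be used directly to validate a raw JSON answer, which is what format_output does. A sketch for the label_selection case, with the model written out explicitly and a made-up raw output:

```python
import json

from pydantic import BaseModel


# Equivalent of the model built above for a `label_selection` question.
class QuestionValueModel(BaseModel):
    label: str


raw_output = '{"label": "positive"}'
validated = QuestionValueModel(**json.loads(raw_output))
print(validated.label)  # "positive"
```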
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.CLAIR","title":"CLAIR
","text":" Bases: Task
Contrastive Learning from AI Revisions (CLAIR).
CLAIR uses an AI system to minimally revise a solution A\u2192A\u00b4 such that the resulting preference A preferred
A\u2019 is much more contrastive and precise.
str
): The task or instruction.str
): An answer to the task that is to be revised.str
): The revised text.str
): The rational for the provided revision.str
): The name of the model used to generate the revision and rational.Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
APO and CLAIR - GitHub Repository
Examples:
Create contrastive preference pairs:
from distilabel.steps.tasks import CLAIR\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 4096,\n },\n)\nclair_task = CLAIR(llm=llm)\n\nclair_task.load()\n\nresult = next(\n clair_task.process(\n [\n {\n \"task\": \"How many gaps are there between the earth and the moon?\",\n \"student_solution\": 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon's orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.'\n }\n ]\n )\n)\n# result\n# [{'task': 'How many gaps are there between the earth and the moon?',\n# 'student_solution': 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.',\n# 'revision': 'There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\\'s orbital path, not the presence of any gaps.\\n\\nIn summary, the Moon\\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',\n# 'rational': 'The student\\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term \"gaps\" can be misleading in this context. 
The student should clarify what they mean by \"gaps.\" Secondly, the student provides some additional information about the Moon\\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\\'s conclusion could be more concise.',\n# 'distilabel_metadata': {'raw_output_c_l_a_i_r_0': '{teacher_reasoning}: The student\\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term \"gaps\" can be misleading in this context. The student should clarify what they mean by \"gaps.\" Secondly, the student provides some additional information about the Moon\\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\\'s conclusion could be more concise.\\n\\n{corrected_student_solution}: There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\\'s orbital path, not the presence of any gaps.\\n\\nIn summary, the Moon\\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',\n# 'raw_input_c_l_a_i_r_0': [{'role': 'system',\n# 'content': \"You are a teacher and your task is to minimally improve a student's answer. I will give you a {task} and a {student_solution}. Your job is to revise the {student_solution} such that it is clearer, more correct, and more engaging. Copy all non-corrected parts of the student's answer. Do not allude to the {corrected_student_solution} being a revision or a correction in your final solution.\"},\n# {'role': 'user',\n# 'content': '{task}: How many gaps are there between the earth and the moon?\\n\\n{student_solution}: There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.\\n\\n-----------------\\n\\nLet\\'s first think step by step with a {teacher_reasoning} to decide how to improve the {student_solution}, then give the {corrected_student_solution}. 
Mention the {teacher_reasoning} and {corrected_student_solution} identifiers to structure your answer.'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
Citations:
```\n@misc{doosterlinck2024anchoredpreferenceoptimizationcontrastive,\n title={Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment},\n author={Karel D'Oosterlinck and Winnie Xu and Chris Develder and Thomas Demeester and Amanpreet Singh and Christopher Potts and Douwe Kiela and Shikib Mehri},\n year={2024},\n eprint={2408.06266},\n archivePrefix={arXiv},\n primaryClass={cs.LG},\n url={https://arxiv.org/abs/2408.06266},\n}\n```\n
Source code in src/distilabel/steps/tasks/clair.py
class CLAIR(Task):\n r\"\"\"Contrastive Learning from AI Revisions (CLAIR).\n\n CLAIR uses an AI system to minimally revise a solution A\u2192A\u00b4 such that the resulting\n preference A `preferred` A\u2019 is much more contrastive and precise.\n\n Input columns:\n - task (`str`): The task or instruction.\n - student_solution (`str`): An answer to the task that is to be revised.\n\n Output columns:\n - revision (`str`): The revised text.\n - rational (`str`): The rational for the provided revision.\n - model_name (`str`): The name of the model used to generate the revision and rational.\n\n Categories:\n - preference\n - text-generation\n\n References:\n - [`Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment`](https://arxiv.org/abs/2408.06266v1)\n - [`APO and CLAIR - GitHub Repository`](https://github.com/ContextualAI/CLAIR_and_APO)\n\n Examples:\n Create contrastive preference pairs:\n\n ```python\n from distilabel.steps.tasks import CLAIR\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 4096,\n },\n )\n clair_task = CLAIR(llm=llm)\n\n clair_task.load()\n\n result = next(\n clair_task.process(\n [\n {\n \"task\": \"How many gaps are there between the earth and the moon?\",\n \"student_solution\": 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon's orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.'\n }\n ]\n )\n )\n # result\n # [{'task': 'How many gaps are there between the earth and the moon?',\n # 'student_solution': 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.',\n # 'revision': 'There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. 
The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\\'s orbital path, not the presence of any gaps.\\n\\nIn summary, the Moon\\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',\n # 'rational': 'The student\\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term \"gaps\" can be misleading in this context. The student should clarify what they mean by \"gaps.\" Secondly, the student provides some additional information about the Moon\\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\\'s conclusion could be more concise.',\n # 'distilabel_metadata': {'raw_output_c_l_a_i_r_0': '{teacher_reasoning}: The student\\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term \"gaps\" can be misleading in this context. The student should clarify what they mean by \"gaps.\" Secondly, the student provides some additional information about the Moon\\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\\'s conclusion could be more concise.\\n\\n{corrected_student_solution}: There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\\'s orbital path, not the presence of any gaps.\\n\\nIn summary, the Moon\\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',\n # 'raw_input_c_l_a_i_r_0': [{'role': 'system',\n # 'content': \"You are a teacher and your task is to minimally improve a student's answer. I will give you a {task} and a {student_solution}. Your job is to revise the {student_solution} such that it is clearer, more correct, and more engaging. Copy all non-corrected parts of the student's answer. Do not allude to the {corrected_student_solution} being a revision or a correction in your final solution.\"},\n # {'role': 'user',\n # 'content': '{task}: How many gaps are there between the earth and the moon?\\n\\n{student_solution}: There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. 
The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.\\n\\n-----------------\\n\\nLet\\'s first think step by step with a {teacher_reasoning} to decide how to improve the {student_solution}, then give the {corrected_student_solution}. Mention the {teacher_reasoning} and {corrected_student_solution} identifiers to structure your answer.'}]},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n\n Citations:\n\n ```\n @misc{doosterlinck2024anchoredpreferenceoptimizationcontrastive,\n title={Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment},\n author={Karel D'Oosterlinck and Winnie Xu and Chris Develder and Thomas Demeester and Amanpreet Singh and Christopher Potts and Douwe Kiela and Shikib Mehri},\n year={2024},\n eprint={2408.06266},\n archivePrefix={arXiv},\n primaryClass={cs.LG},\n url={https://arxiv.org/abs/2408.06266},\n }\n ```\n \"\"\"\n\n system_prompt: str = SYSTEM_PROMPT\n _template: Union[Template, None] = PrivateAttr(...)\n\n def load(self) -> None:\n super().load()\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"clair.jinja2\"\n )\n with open(_path, \"r\") as f:\n self._template = Template(f.read())\n\n @property\n def inputs(self) -> \"StepColumns\":\n return [\"task\", \"student_solution\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n return [\"revision\", \"rational\", \"model_name\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\"role\": \"system\", \"content\": self.system_prompt},\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n task=input[\"task\"], student_solution=input[\"student_solution\"]\n ),\n },\n ]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction-response pair.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. 
Used for obtaining the number of responses.\n\n Returns:\n A dict with the key `scores` containing the scores for each instruction-response pair.\n \"\"\"\n if output is None:\n return self._default_error()\n\n return self._format_output(output)\n\n def _format_output(self, output: Union[str, None]) -> Dict[str, Any]:\n if \"**Corrected Student Solution:**\" in output:\n splits = output.split(\"**Corrected Student Solution:**\")\n elif \"{corrected_student_solution}:\" in output:\n splits = output.split(\"{corrected_student_solution}:\")\n elif \"{corrected_student_solution}\" in output:\n splits = output.split(\"{corrected_student_solution}\")\n elif \"**Worsened Student Solution:**\" in output:\n splits = output.split(\"**Worsened Student Solution:**\")\n elif \"{worsened_student_solution}:\" in output:\n splits = output.split(\"{worsened_student_solution}:\")\n elif \"{worsened_student_solution}\" in output:\n splits = output.split(\"{worsened_student_solution}\")\n else:\n splits = None\n\n # Safety check when the output doesn't follow the expected format\n if not splits:\n return self._default_error()\n\n if len(splits) >= 2:\n revision = splits[1]\n revision = revision.strip(\"\\n\\n\").strip() # noqa: B005\n\n rational = splits[0]\n if \"{teacher_reasoning}\" in rational:\n rational = rational.split(\"{teacher_reasoning}\")[1].strip(\":\").strip()\n rational = rational.strip(\"\\n\\n\").strip() # noqa: B005\n else:\n return self._default_error()\n return {\"revision\": revision, \"rational\": rational}\n\n def _default_error(self) -> Dict[str, None]:\n return {\"revision\": None, \"rational\": None}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.CLAIR.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/clair.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\"role\": \"system\", \"content\": self.system_prompt},\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n task=input[\"task\"], student_solution=input[\"student_solution\"]\n ),\n },\n ]\n
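As an orientation-only sketch (not part of the upstream docs; the placeholder LLM and the exact rendered content are assumptions), this is the shape of the conversation that format_input produces for a toy input:
from distilabel.steps.tasks import CLAIR\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Placeholder LLM; any supported LLM integration works the same way.\nclair_task = CLAIR(\n    llm=InferenceEndpointsLLM(model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\")\n)\nclair_task.load()  # loads the `clair.jinja2` template used by `format_input`\n\nmessages = clair_task.format_input(\n    {\"task\": \"What is 2+2?\", \"student_solution\": \"The answer is 4.\"}\n)\n# messages is a ChatType:\n# [{'role': 'system', 'content': '<CLAIR system prompt>'},\n#  {'role': 'user', 'content': '<clair.jinja2 rendered with the task and student_solution>'}]\n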
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.CLAIR.format_output","title":"format_output(output, input)
","text":"The output is formatted as a list with the score of each instruction-response pair.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
the raw output of the LLM.
requiredinput
Dict[str, Any]
the input to the task. Not used when parsing this task's output.
requiredReturns:
Type DescriptionDict[str, Any]
A dict with the keys revision
and rational extracted from the raw output of the LLM.
src/distilabel/steps/tasks/clair.py
def format_output(\n    self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n    \"\"\"The output is formatted as a dict with the `revision` and the `rational`\n    extracted from the raw output of the LLM.\n\n    Args:\n        output: the raw output of the LLM.\n        input: the input to the task. Not used when parsing this task's output.\n\n    Returns:\n        A dict with the keys `revision` and `rational`.\n    \"\"\"\n    if output is None:\n        return self._default_error()\n\n    return self._format_output(output)\n
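As a hedged illustration (an assumption, not from the upstream docs), reusing the clair_task instance from the class example above, a raw completion that follows the expected {teacher_reasoning} / {corrected_student_solution} markers is split into the two output fields:
raw = (\n    \"{teacher_reasoning}: The answer is correct but could be clearer.\\n\\n\"\n    \"{corrected_student_solution}: The answer is 4 because 2 + 2 = 4.\"\n)\nparsed = clair_task.format_output(output=raw, input={})\n# parsed\n# {'revision': 'The answer is 4 because 2 + 2 = 4.',\n#  'rational': 'The answer is correct but could be clearer.'}\n# A `None` output (e.g. a failed generation) yields {'revision': None, 'rational': None}.\n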
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ComplexityScorer","title":"ComplexityScorer
","text":" Bases: Task
Score instructions based on their complexity using an LLM
.
ComplexityScorer
is a pre-defined task used to rank a list of instructions based on their complexity. It's an implementation of the complexity score task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
Attributes:
Name Type Description_template
Union[Template, None]
a Jinja2 template used to format the input for the LLM.
Input columnsList[str]
): The list of instructions to be scored.List[float]
): The score for each instruction.str
): The model name used to generate the scores.What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
Examples:
Evaluate the complexity of your instructions:
from distilabel.steps.tasks import ComplexityScorer\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nscorer = ComplexityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n)\n\nscorer.load()\n\nresult = next(\n scorer.process(\n [{\"instructions\": [\"plain instruction\", \"highly complex instruction\"]}]\n )\n)\n# result\n# [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 5], 'distilabel_metadata': {'raw_output_complexity_scorer_0': 'output'}}]\n
Generate structured output with default schema:
from distilabel.steps.tasks import ComplexityScorer\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nscorer = ComplexityScorer(\n    llm=InferenceEndpointsLLM(\n        model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n    ),\n    use_default_structured_output=True\n)\n\nscorer.load()\n\nresult = next(\n    scorer.process(\n        [{\"instructions\": [\"plain instruction\", \"highly complex instruction\"]}]\n    )\n)\n# result\n# [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 2], 'distilabel_metadata': {'raw_output_complexity_scorer_0': '{ \\n  \"scores\": [\\n    1, \\n    2\\n  ]\\n}'}}]\n
Citations @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n}\n
Source code in src/distilabel/steps/tasks/complexity_scorer.py
class ComplexityScorer(Task):\n \"\"\"Score instructions based on their complexity using an `LLM`.\n\n `ComplexityScorer` is a pre-defined task used to rank a list of instructions based in\n their complexity. It's an implementation of the complexity score task from the paper\n 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection\n in Instruction Tuning'.\n\n Attributes:\n _template: a Jinja2 template used to format the input for the LLM.\n\n Input columns:\n - instructions (`List[str]`): The list of instructions to be scored.\n\n Output columns:\n - scores (`List[float]`): The score for each instruction.\n - model_name (`str`): The model name used to generate the scores.\n\n Categories:\n - scorer\n - complexity\n - instruction\n\n References:\n - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)\n\n Examples:\n Evaluate the complexity of your instructions:\n\n ```python\n from distilabel.steps.tasks import ComplexityScorer\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n scorer = ComplexityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n )\n\n scorer.load()\n\n result = next(\n scorer.process(\n [{\"instructions\": [\"plain instruction\", \"highly complex instruction\"]}]\n )\n )\n # result\n # [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 5], 'distilabel_metadata': {'raw_output_complexity_scorer_0': 'output'}}]\n ```\n\n Generate structured output with default schema:\n\n ```python\n from distilabel.steps.tasks import ComplexityScorer\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n scorer = ComplexityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n use_default_structured_output=use_default_structured_output\n )\n\n scorer.load()\n\n result = next(\n scorer.process(\n [{\"instructions\": [\"plain instruction\", \"highly complex instruction\"]}]\n )\n )\n # result\n # [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 2], 'distilabel_metadata': {'raw_output_complexity_scorer_0': '{ \\\\n \"scores\": [\\\\n 1, \\\\n 2\\\\n ]\\\\n}'}}]\n ```\n\n Citations:\n ```\n @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? 
A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n }\n ```\n \"\"\"\n\n _template: Union[Template, None] = PrivateAttr(...)\n _can_be_used_with_offline_batch_generation = True\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"complexity-scorer.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs for the task are the `instructions`.\"\"\"\n return [\"instructions\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render(instructions=input[\"instructions\"]), # type: ignore\n }\n ]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task are: a list of `scores` containing the complexity score for each\n instruction in `instructions`, and the `model_name`.\"\"\"\n return [\"scores\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with the key `scores` containing the scores for each instruction.\n \"\"\"\n if output is None:\n return {\"scores\": [None] * len(input[\"instructions\"])}\n\n if self.use_default_structured_output:\n return self._format_structured_output(output, input)\n\n scores = []\n score_lines = output.split(\"\\n\")\n for i, line in enumerate(score_lines):\n match = _PARSE_SCORE_LINE_REGEX.match(line)\n score = float(match.group(1)) if match else None\n scores.append(score)\n if i == len(input[\"instructions\"]) - 1:\n break\n return {\"scores\": scores}\n\n @override\n def get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel\n from typing import List\n\n class SchemaComplexityScorer(BaseModel):\n scores: List[int]\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"properties\": {\n \"scores\": {\n \"items\": {\"type\": \"integer\"},\n \"title\": \"Scores\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"scores\"],\n \"title\": \"SchemaComplexityScorer\",\n \"type\": \"object\",\n }\n\n def _format_structured_output(\n self, output: str, input: Dict[str, Any]\n ) -> Dict[str, str]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with either `positive`, or `positive` and `negative` keys.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n return {\"scores\": [None] * len(input[\"instructions\"])}\n\n @override\n def _sample_input(self) -> \"ChatType\":\n 
\"\"\"Returns a sample input to be used in the `print` method.\n Tasks that don't adhere to a format input that returns a map of the type\n str -> str should override this method to return a sample input.\n \"\"\"\n return self.format_input(\n {\n \"instructions\": [\n f\"<PLACEHOLDER_{f'GENERATION_{i}'.upper()}>\" for i in range(2)\n ],\n }\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ComplexityScorer.inputs","title":"inputs
property
","text":"The inputs for the task are the instructions
.
outputs
property
","text":"The output for the task are: a list of scores
containing the complexity score for each instruction in instructions
, and the model_name
.
load()
","text":"Loads the Jinja2 template.
Source code in src/distilabel/steps/tasks/complexity_scorer.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"complexity-scorer.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ComplexityScorer.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/complexity_scorer.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render(instructions=input[\"instructions\"]), # type: ignore\n }\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ComplexityScorer.format_output","title":"format_output(output, input)
","text":"The output is formatted as a list with the score of each instruction.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
the raw output of the LLM.
requiredinput
Dict[str, Any]
the input to the task. Used for obtaining the number of responses.
requiredReturns:
Type DescriptionDict[str, Any]
A dict with the key scores
containing the scores for each instruction.
src/distilabel/steps/tasks/complexity_scorer.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with the key `scores` containing the scores for each instruction.\n \"\"\"\n if output is None:\n return {\"scores\": [None] * len(input[\"instructions\"])}\n\n if self.use_default_structured_output:\n return self._format_structured_output(output, input)\n\n scores = []\n score_lines = output.split(\"\\n\")\n for i, line in enumerate(score_lines):\n match = _PARSE_SCORE_LINE_REGEX.match(line)\n score = float(match.group(1)) if match else None\n scores.append(score)\n if i == len(input[\"instructions\"]) - 1:\n break\n return {\"scores\": scores}\n
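As a standalone sketch (the exact pattern of _PARSE_SCORE_LINE_REGEX is an assumption here), the default non-structured path boils down to matching one score per line, in the order of the instructions:
import re\n\n# Assumed line format, e.g. \"[1] Score: 3\"; the real pattern lives in complexity_scorer.py.\n_SCORE_LINE = re.compile(r\"\\[\\d+\\] score: (\\d+)\", re.IGNORECASE)\n\ndef parse_scores(raw: str, n_instructions: int) -> list:\n    scores = []\n    for i, line in enumerate(raw.split(\"\\n\")):\n        match = _SCORE_LINE.match(line)\n        scores.append(float(match.group(1)) if match else None)\n        if i == n_instructions - 1:\n            break\n    return scores\n\nprint(parse_scores(\"[1] Score: 1\\n[2] Score: 5\", 2))  # [1.0, 5.0]\n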
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ComplexityScorer.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
The schema corresponds to the following:
from pydantic import BaseModel\nfrom typing import List\n\nclass SchemaComplexityScorer(BaseModel):\n scores: List[int]\n
Returns:
Type DescriptionDict[str, Any]
JSON Schema of the response to enforce.
Source code in src/distilabel/steps/tasks/complexity_scorer.py
@override\ndef get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel\n from typing import List\n\n class SchemaComplexityScorer(BaseModel):\n scores: List[int]\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"properties\": {\n \"scores\": {\n \"items\": {\"type\": \"integer\"},\n \"title\": \"Scores\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"scores\"],\n \"title\": \"SchemaComplexityScorer\",\n \"type\": \"object\",\n }\n
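For reference (a small sketch, not from the upstream docs), the schema above is exactly what pydantic produces for the model shown in the docstring, which is handy when customising the structured output:
from typing import List\n\nfrom pydantic import BaseModel\n\nclass SchemaComplexityScorer(BaseModel):\n    scores: List[int]\n\n# Produces the same JSON schema returned by `get_structured_output()`.\nprint(SchemaComplexityScorer.model_json_schema())\n# {'properties': {'scores': {'items': {'type': 'integer'}, 'title': 'Scores', 'type': 'array'}},\n#  'required': ['scores'], 'title': 'SchemaComplexityScorer', 'type': 'object'}\n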
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ComplexityScorer._format_structured_output","title":"_format_structured_output(output, input)
","text":"Parses the structured response, which should correspond to a dictionary with either positive
, or positive
and negative
keys.
Parameters:
Name Type Description Defaultoutput
str
The output from the LLM
.
Returns:
Type DescriptionDict[str, str]
Formatted output.
Source code in src/distilabel/steps/tasks/complexity_scorer.py
def _format_structured_output(\n    self, output: str, input: Dict[str, Any]\n) -> Dict[str, str]:\n    \"\"\"Parses the structured response, which should correspond to a dictionary\n    with a `scores` key containing the list of scores.\n\n    Args:\n        output: The output from the `LLM`.\n\n    Returns:\n        Formatted output.\n    \"\"\"\n    try:\n        return orjson.loads(output)\n    except orjson.JSONDecodeError:\n        return {\"scores\": [None] * len(input[\"instructions\"])}\n
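A minimal sketch (assumption) of the same behaviour in isolation: valid JSON is loaded as-is, anything else falls back to one None score per instruction:
import orjson\n\ndef parse_structured(output: str, n_instructions: int) -> dict:\n    try:\n        return orjson.loads(output)\n    except orjson.JSONDecodeError:\n        # Same fallback as the task: one `None` score per instruction.\n        return {\"scores\": [None] * n_instructions}\n\nprint(parse_structured('{\"scores\": [1, 4]}', 2))  # {'scores': [1, 4]}\nprint(parse_structured(\"not json\", 2))            # {'scores': [None, None]}\n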
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ComplexityScorer._sample_input","title":"_sample_input()
","text":"Returns a sample input to be used in the print
method. Tasks that don't adhere to a format input that returns a map of the type str -> str should override this method to return a sample input.
src/distilabel/steps/tasks/complexity_scorer.py
@override\ndef _sample_input(self) -> \"ChatType\":\n \"\"\"Returns a sample input to be used in the `print` method.\n Tasks that don't adhere to a format input that returns a map of the type\n str -> str should override this method to return a sample input.\n \"\"\"\n return self.format_input(\n {\n \"instructions\": [\n f\"<PLACEHOLDER_{f'GENERATION_{i}'.upper()}>\" for i in range(2)\n ],\n }\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstruct","title":"EvolInstruct
","text":" Bases: Task
Evolve instructions using an LLM
.
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Attributes:
Name Type Descriptionnum_evolutions
int
The number of evolutions to be performed.
store_evolutions
bool
Whether to store all the evolutions or just the last one. Defaults to False
.
generate_answers
bool
Whether to generate answers for the evolved instructions. Defaults to False
.
include_original_instruction
bool
Whether to include the original instruction in the evolved_instructions
output column. Defaults to False
.
mutation_templates
Dict[str, str]
The mutation templates to be used for evolving the instructions. Defaults to the ones provided in the utils.py
file.
seed
RuntimeParameter[int]
The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
seed
: The seed to be set for numpy
in order to randomly pick a mutation method.str
): The instruction to evolve.str
): The evolved instruction if store_evolutions=False
.List[str]
): The evolved instructions if store_evolutions=True
.str
): The name of the LLM used to evolve the instructions.str
): The answer to the evolved instruction if generate_answers=True
and store_evolutions=False
.List[str]
): The answers to the evolved instructions if generate_answers=True
and store_evolutions=True
.Examples:
Evolve an instruction using an LLM:
from distilabel.steps.tasks import EvolInstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n)\n\nevol_instruct.load()\n\nresult = next(evol_instruct.process([{\"instruction\": \"common instruction\"}]))\n# result\n# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]\n
Keep the iterations of the evolutions:
from distilabel.steps.tasks import EvolInstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n store_evolutions=True,\n)\n\nevol_instruct.load()\n\nresult = next(evol_instruct.process([{\"instruction\": \"common instruction\"}]))\n# result\n# [\n# {\n# 'instruction': 'common instruction',\n# 'evolved_instructions': ['initial evolution', 'final evolution'],\n# 'model_name': 'model_name'\n# }\n# ]\n
Generate answers for the instructions in a single step:
from distilabel.steps.tasks import EvolInstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n generate_answers=True,\n)\n\nevol_instruct.load()\n\nresult = next(evol_instruct.process([{\"instruction\": \"common instruction\"}]))\n# result\n# [\n# {\n# 'instruction': 'common instruction',\n# 'evolved_instruction': 'evolved instruction',\n# 'answer': 'answer to the instruction',\n# 'model_name': 'model_name'\n# }\n# ]\n
Citations @misc{xu2023wizardlmempoweringlargelanguage,\n title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},\n author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},\n year={2023},\n eprint={2304.12244},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2304.12244},\n}\n
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
class EvolInstruct(Task):\n \"\"\"Evolve instructions using an `LLM`.\n\n WizardLM: Empowering Large Language Models to Follow Complex Instructions\n\n Attributes:\n num_evolutions: The number of evolutions to be performed.\n store_evolutions: Whether to store all the evolutions or just the last one. Defaults\n to `False`.\n generate_answers: Whether to generate answers for the evolved instructions. Defaults\n to `False`.\n include_original_instruction: Whether to include the original instruction in the\n `evolved_instructions` output column. Defaults to `False`.\n mutation_templates: The mutation templates to be used for evolving the instructions.\n Defaults to the ones provided in the `utils.py` file.\n seed: The seed to be set for `numpy` in order to randomly pick a mutation method.\n Defaults to `42`.\n\n Runtime parameters:\n - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.\n\n Input columns:\n - instruction (`str`): The instruction to evolve.\n\n Output columns:\n - evolved_instruction (`str`): The evolved instruction if `store_evolutions=False`.\n - evolved_instructions (`List[str]`): The evolved instructions if `store_evolutions=True`.\n - model_name (`str`): The name of the LLM used to evolve the instructions.\n - answer (`str`): The answer to the evolved instruction if `generate_answers=True`\n and `store_evolutions=False`.\n - answers (`List[str]`): The answers to the evolved instructions if `generate_answers=True`\n and `store_evolutions=True`.\n\n Categories:\n - evol\n - instruction\n\n References:\n - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)\n - [GitHub: h2oai/h2o-wizardlm](https://github.com/h2oai/h2o-wizardlm)\n\n Examples:\n Evolve an instruction using an LLM:\n\n ```python\n from distilabel.steps.tasks import EvolInstruct\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n evol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n )\n\n evol_instruct.load()\n\n result = next(evol_instruct.process([{\"instruction\": \"common instruction\"}]))\n # result\n # [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]\n ```\n\n Keep the iterations of the evolutions:\n\n ```python\n from distilabel.steps.tasks import EvolInstruct\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n evol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n store_evolutions=True,\n )\n\n evol_instruct.load()\n\n result = next(evol_instruct.process([{\"instruction\": \"common instruction\"}]))\n # result\n # [\n # {\n # 'instruction': 'common instruction',\n # 'evolved_instructions': ['initial evolution', 'final evolution'],\n # 'model_name': 'model_name'\n # }\n # ]\n ```\n\n Generate answers for the instructions in a single step:\n\n ```python\n from distilabel.steps.tasks import EvolInstruct\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n evol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n generate_answers=True,\n )\n\n evol_instruct.load()\n\n result = next(evol_instruct.process([{\"instruction\": \"common 
instruction\"}]))\n # result\n # [\n # {\n # 'instruction': 'common instruction',\n # 'evolved_instruction': 'evolved instruction',\n # 'answer': 'answer to the instruction',\n # 'model_name': 'model_name'\n # }\n # ]\n ```\n\n Citations:\n ```\n @misc{xu2023wizardlmempoweringlargelanguage,\n title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},\n author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},\n year={2023},\n eprint={2304.12244},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2304.12244},\n }\n ```\n \"\"\"\n\n num_evolutions: int\n store_evolutions: bool = False\n generate_answers: bool = False\n include_original_instruction: bool = False\n mutation_templates: Dict[str, str] = MUTATION_TEMPLATES\n\n seed: RuntimeParameter[int] = Field(\n default=42,\n description=\"As `numpy` is being used in order to randomly pick a mutation method, then is nice to seed a random seed.\",\n )\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task is the `instruction`.\"\"\"\n return [\"instruction\"]\n\n def format_input(self, input: str) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation. And the\n `system_prompt` is added as the first message if it exists.\"\"\"\n return [{\"role\": \"user\", \"content\": input}]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task are the `evolved_instruction/s`, the `answer` if `generate_answers=True`\n and the `model_name`.\"\"\"\n # TODO: having to define a `model_name` column every time as the `Task.outputs` is not ideal,\n # this could be handled always and the value could be included within the DAG validation when\n # a `Task` is used, since all the `Task` subclasses will have an `llm` with a `model_name` attr.\n _outputs = [\n (\n \"evolved_instruction\"\n if not self.store_evolutions\n else \"evolved_instructions\"\n ),\n \"model_name\",\n ]\n if self.generate_answers:\n _outputs.append(\"answer\" if not self.store_evolutions else \"answers\")\n return _outputs\n\n @override\n def format_output( # type: ignore\n self, instructions: Union[str, List[str]], answers: Optional[List[str]] = None\n ) -> Dict[str, Any]: # type: ignore\n \"\"\"The output for the task is a dict with: `evolved_instruction` or `evolved_instructions`,\n depending whether the value is either `False` or `True` for `store_evolutions`, respectively;\n `answer` if `generate_answers=True`; and, finally, the `model_name`.\n\n Args:\n instructions: The instructions to be included within the output.\n answers: The answers to be included within the output if `generate_answers=True`.\n\n Returns:\n If `store_evolutions=False` and `generate_answers=True` return {\"evolved_instruction\": ..., \"model_name\": ..., \"answer\": ...};\n if `store_evolutions=True` and `generate_answers=True` return {\"evolved_instructions\": ..., \"model_name\": ..., \"answer\": ...};\n if `store_evolutions=False` and `generate_answers=False` return {\"evolved_instruction\": ..., \"model_name\": ...};\n if `store_evolutions=True` and `generate_answers=False` return {\"evolved_instructions\": ..., \"model_name\": ...}.\n \"\"\"\n _output = {}\n if not self.store_evolutions:\n _output[\"evolved_instruction\"] = instructions[-1]\n else:\n _output[\"evolved_instructions\"] = instructions\n\n if self.generate_answers and 
answers:\n if not self.store_evolutions:\n _output[\"answer\"] = answers[-1]\n else:\n _output[\"answers\"] = answers\n\n _output[\"model_name\"] = self.llm.model_name\n return _output\n\n @property\n def mutation_templates_names(self) -> List[str]:\n \"\"\"Returns the names i.e. keys of the provided `mutation_templates`.\"\"\"\n return list(self.mutation_templates.keys())\n\n def _apply_random_mutation(self, instruction: str) -> str:\n \"\"\"Applies a random mutation from the ones provided as part of the `mutation_templates`\n enum, and returns the provided instruction within the mutation prompt.\n\n Args:\n instruction: The instruction to be included within the mutation prompt.\n\n Returns:\n A random mutation prompt with the provided instruction.\n \"\"\"\n mutation = np.random.choice(self.mutation_templates_names)\n return self.mutation_templates[mutation].replace(\"<PROMPT>\", instruction) # type: ignore\n\n def _evolve_instructions(self, inputs: \"StepInput\") -> List[List[str]]:\n \"\"\"Evolves the instructions provided as part of the inputs of the task.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list where each item is a list with either the last evolved instruction if\n `store_evolutions=False` or all the evolved instructions if `store_evolutions=True`.\n \"\"\"\n\n instructions: List[List[str]] = [[input[\"instruction\"]] for input in inputs]\n statistics: \"LLMStatistics\" = defaultdict(list)\n\n for iter_no in range(self.num_evolutions):\n formatted_prompts = []\n for instruction in instructions:\n formatted_prompts.append(self._apply_random_mutation(instruction[-1]))\n\n formatted_prompts = [\n self.format_input(prompt) for prompt in formatted_prompts\n ]\n responses = self.llm.generate(\n formatted_prompts,\n **self.llm.generation_kwargs, # type: ignore\n )\n generated_prompts = flatten_responses(\n [response[\"generations\"] for response in responses]\n )\n for response in responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n evolved_instructions = []\n for generated_prompt in generated_prompts:\n generated_prompt = generated_prompt.split(\"Prompt#:\")[-1].strip()\n evolved_instructions.append(generated_prompt)\n\n if self.store_evolutions:\n instructions = [\n instruction + [evolved_instruction]\n for instruction, evolved_instruction in zip(\n instructions, evolved_instructions\n )\n ]\n else:\n instructions = [\n [evolved_instruction]\n for evolved_instruction in evolved_instructions\n ]\n\n self._logger.info(\n f\"\ud83d\udd04 Ran iteration {iter_no} evolving {len(instructions)} instructions!\"\n )\n return instructions, dict(statistics)\n\n def _generate_answers(\n self, evolved_instructions: List[List[str]]\n ) -> Tuple[List[List[str]], \"LLMStatistics\"]:\n \"\"\"Generates the answer for the instructions in `instructions`.\n\n Args:\n evolved_instructions: A list of lists where each item is a list with either the last\n evolved instruction if `store_evolutions=False` or all the evolved instructions\n if `store_evolutions=True`.\n\n Returns:\n A list of answers for each instruction.\n \"\"\"\n formatted_instructions = [\n self.format_input(instruction)\n for instructions in evolved_instructions\n for instruction in instructions\n ]\n\n responses = self.llm.generate(\n formatted_instructions,\n num_generations=1,\n **self.llm.generation_kwargs, # type: ignore\n )\n generations = [response[\"generations\"] for response in responses]\n\n statistics: Dict[str, Any] = 
defaultdict(list)\n for response in responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n step = (\n self.num_evolutions\n if not self.include_original_instruction\n else self.num_evolutions + 1\n )\n\n return [\n flatten_responses(generations[i : i + step])\n for i in range(0, len(responses), step)\n ], dict(statistics)\n\n @override\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Processes the inputs of the task and generates the outputs using the LLM.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n\n evolved_instructions, statistics = self._evolve_instructions(inputs)\n\n if self.store_evolutions:\n # Remove the input instruction from the `evolved_instructions` list\n from_ = 1 if not self.include_original_instruction else 0\n evolved_instructions = [\n instruction[from_:] for instruction in evolved_instructions\n ]\n\n if not self.generate_answers:\n for input, instruction in zip(inputs, evolved_instructions):\n input.update(self.format_output(instruction))\n input.update(\n {\n \"distilabel_metadata\": {\n f\"statistics_instruction_{self.name}\": statistics\n }\n }\n )\n yield inputs\n\n self._logger.info(\n f\"\ud83c\udf89 Finished evolving {len(evolved_instructions)} instructions!\"\n )\n\n if self.generate_answers:\n self._logger.info(\n f\"\ud83e\udde0 Generating answers for the {len(evolved_instructions)} evolved instructions!\"\n )\n\n answers, statistics = self._generate_answers(evolved_instructions)\n\n self._logger.info(\n f\"\ud83c\udf89 Finished generating answers for the {len(evolved_instructions)} evolved\"\n \" instructions!\"\n )\n\n for idx, (input, instruction) in enumerate(\n zip(inputs, evolved_instructions)\n ):\n input.update(self.format_output(instruction, answers[idx]))\n input.update(\n {\n \"distilabel_metadata\": {\n f\"statistics_answer_{self.name}\": statistics\n }\n }\n )\n yield inputs\n\n @override\n def _sample_input(self) -> ChatType:\n return self.format_input(\n self._apply_random_mutation(\"<PLACEHOLDER_INSTRUCTION>\")\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstruct.inputs","title":"inputs
property
","text":"The input for the task is the instruction
.
outputs
property
","text":"The output for the task are the evolved_instruction/s
, the answer
if generate_answers=True
and the model_name
.
mutation_templates_names
property
","text":"Returns the names i.e. keys of the provided mutation_templates
.
format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation. And the system_prompt
is added as the first message if it exists.
src/distilabel/steps/tasks/evol_instruct/base.py
def format_input(self, input: str) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation. And the\n `system_prompt` is added as the first message if it exists.\"\"\"\n return [{\"role\": \"user\", \"content\": input}]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstruct.format_output","title":"format_output(instructions, answers=None)
","text":"The output for the task is a dict with: evolved_instruction
or evolved_instructions
, depending on whether the value is either False
or True
for store_evolutions
, respectively; answer
if generate_answers=True
; and, finally, the model_name
.
Parameters:
Name Type Description Defaultinstructions
Union[str, List[str]]
The instructions to be included within the output.
requiredanswers
Optional[List[str]]
The answers to be included within the output if generate_answers=True
.
None
Returns:
Type DescriptionDict[str, Any]
If store_evolutions=False
and generate_answers=True
return {\"evolved_instruction\": ..., \"model_name\": ..., \"answer\": ...};
Dict[str, Any]
if store_evolutions=True
and generate_answers=True
return {\"evolved_instructions\": ..., \"model_name\": ..., \"answer\": ...};
Dict[str, Any]
if store_evolutions=False
and generate_answers=False
return {\"evolved_instruction\": ..., \"model_name\": ...};
Dict[str, Any]
if store_evolutions=True
and generate_answers=False
return {\"evolved_instructions\": ..., \"model_name\": ...}.
src/distilabel/steps/tasks/evol_instruct/base.py
@override\ndef format_output( # type: ignore\n self, instructions: Union[str, List[str]], answers: Optional[List[str]] = None\n) -> Dict[str, Any]: # type: ignore\n \"\"\"The output for the task is a dict with: `evolved_instruction` or `evolved_instructions`,\n depending whether the value is either `False` or `True` for `store_evolutions`, respectively;\n `answer` if `generate_answers=True`; and, finally, the `model_name`.\n\n Args:\n instructions: The instructions to be included within the output.\n answers: The answers to be included within the output if `generate_answers=True`.\n\n Returns:\n If `store_evolutions=False` and `generate_answers=True` return {\"evolved_instruction\": ..., \"model_name\": ..., \"answer\": ...};\n if `store_evolutions=True` and `generate_answers=True` return {\"evolved_instructions\": ..., \"model_name\": ..., \"answer\": ...};\n if `store_evolutions=False` and `generate_answers=False` return {\"evolved_instruction\": ..., \"model_name\": ...};\n if `store_evolutions=True` and `generate_answers=False` return {\"evolved_instructions\": ..., \"model_name\": ...}.\n \"\"\"\n _output = {}\n if not self.store_evolutions:\n _output[\"evolved_instruction\"] = instructions[-1]\n else:\n _output[\"evolved_instructions\"] = instructions\n\n if self.generate_answers and answers:\n if not self.store_evolutions:\n _output[\"answer\"] = answers[-1]\n else:\n _output[\"answers\"] = answers\n\n _output[\"model_name\"] = self.llm.model_name\n return _output\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstruct._apply_random_mutation","title":"_apply_random_mutation(instruction)
","text":"Applies a random mutation from the ones provided as part of the mutation_templates
enum, and returns the provided instruction within the mutation prompt.
Parameters:
Name Type Description Defaultinstruction
str
The instruction to be included within the mutation prompt.
requiredReturns:
Type Descriptionstr
A random mutation prompt with the provided instruction.
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
def _apply_random_mutation(self, instruction: str) -> str:\n \"\"\"Applies a random mutation from the ones provided as part of the `mutation_templates`\n enum, and returns the provided instruction within the mutation prompt.\n\n Args:\n instruction: The instruction to be included within the mutation prompt.\n\n Returns:\n A random mutation prompt with the provided instruction.\n \"\"\"\n mutation = np.random.choice(self.mutation_templates_names)\n return self.mutation_templates[mutation].replace(\"<PROMPT>\", instruction) # type: ignore\n
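A self-contained sketch of the same idea (the template keys and texts below are illustrative assumptions; the real ones are the MUTATION_TEMPLATES defined in utils.py):
import numpy as np\n\nmutation_templates = {\n    \"CONSTRAINTS\": \"Add one more constraint into #The Given Prompt#:\\n<PROMPT>\",\n    \"DEEPENING\": \"Increase the depth and breadth of #The Given Prompt#:\\n<PROMPT>\",\n}\n\ndef apply_random_mutation(instruction: str) -> str:\n    # Pick a mutation method at random and inject the instruction into its prompt.\n    mutation = np.random.choice(list(mutation_templates.keys()))\n    return mutation_templates[mutation].replace(\"<PROMPT>\", instruction)\n\nprint(apply_random_mutation(\"Write a haiku about the sea.\"))\n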
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstruct._evolve_instructions","title":"_evolve_instructions(inputs)
","text":"Evolves the instructions provided as part of the inputs of the task.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of Python dictionaries with the inputs of the task.
requiredReturns:
Type DescriptionList[List[str]]
A list where each item is a list with either the last evolved instruction if
List[List[str]]
store_evolutions=False
or all the evolved instructions if store_evolutions=True
.
src/distilabel/steps/tasks/evol_instruct/base.py
def _evolve_instructions(self, inputs: \"StepInput\") -> List[List[str]]:\n \"\"\"Evolves the instructions provided as part of the inputs of the task.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list where each item is a list with either the last evolved instruction if\n `store_evolutions=False` or all the evolved instructions if `store_evolutions=True`.\n \"\"\"\n\n instructions: List[List[str]] = [[input[\"instruction\"]] for input in inputs]\n statistics: \"LLMStatistics\" = defaultdict(list)\n\n for iter_no in range(self.num_evolutions):\n formatted_prompts = []\n for instruction in instructions:\n formatted_prompts.append(self._apply_random_mutation(instruction[-1]))\n\n formatted_prompts = [\n self.format_input(prompt) for prompt in formatted_prompts\n ]\n responses = self.llm.generate(\n formatted_prompts,\n **self.llm.generation_kwargs, # type: ignore\n )\n generated_prompts = flatten_responses(\n [response[\"generations\"] for response in responses]\n )\n for response in responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n evolved_instructions = []\n for generated_prompt in generated_prompts:\n generated_prompt = generated_prompt.split(\"Prompt#:\")[-1].strip()\n evolved_instructions.append(generated_prompt)\n\n if self.store_evolutions:\n instructions = [\n instruction + [evolved_instruction]\n for instruction, evolved_instruction in zip(\n instructions, evolved_instructions\n )\n ]\n else:\n instructions = [\n [evolved_instruction]\n for evolved_instruction in evolved_instructions\n ]\n\n self._logger.info(\n f\"\ud83d\udd04 Ran iteration {iter_no} evolving {len(instructions)} instructions!\"\n )\n return instructions, dict(statistics)\n
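Conceptually (a sketch that swaps the real model call for a fake in-place function, so no LLM is needed), the loop keeps either the full chain of evolutions or only the latest one:
def fake_llm(prompt: str) -> str:\n    # Stand-in for the real LLM call that answers the mutation prompt.\n    return prompt + \" [evolved]\"\n\ndef evolve(instructions, num_evolutions, store_evolutions=False):\n    chains = [[instruction] for instruction in instructions]\n    for _ in range(num_evolutions):\n        evolved = [fake_llm(chain[-1]) for chain in chains]\n        chains = (\n            [chain + [e] for chain, e in zip(chains, evolved)]\n            if store_evolutions\n            else [[e] for e in evolved]\n        )\n    return chains\n\nprint(evolve([\"common instruction\"], num_evolutions=2, store_evolutions=True))\n# [['common instruction', 'common instruction [evolved]', 'common instruction [evolved] [evolved]']]\n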
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstruct._generate_answers","title":"_generate_answers(evolved_instructions)
","text":"Generates the answer for the instructions in instructions
.
Parameters:
Name Type Description Defaultevolved_instructions
List[List[str]]
A list of lists where each item is a list with either the last evolved instruction if store_evolutions=False
or all the evolved instructions if store_evolutions=True
.
Returns:
Type DescriptionTuple[List[List[str]], LLMStatistics]
A list of answers for each instruction.
Source code insrc/distilabel/steps/tasks/evol_instruct/base.py
def _generate_answers(\n self, evolved_instructions: List[List[str]]\n) -> Tuple[List[List[str]], \"LLMStatistics\"]:\n \"\"\"Generates the answer for the instructions in `instructions`.\n\n Args:\n evolved_instructions: A list of lists where each item is a list with either the last\n evolved instruction if `store_evolutions=False` or all the evolved instructions\n if `store_evolutions=True`.\n\n Returns:\n A list of answers for each instruction.\n \"\"\"\n formatted_instructions = [\n self.format_input(instruction)\n for instructions in evolved_instructions\n for instruction in instructions\n ]\n\n responses = self.llm.generate(\n formatted_instructions,\n num_generations=1,\n **self.llm.generation_kwargs, # type: ignore\n )\n generations = [response[\"generations\"] for response in responses]\n\n statistics: Dict[str, Any] = defaultdict(list)\n for response in responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n step = (\n self.num_evolutions\n if not self.include_original_instruction\n else self.num_evolutions + 1\n )\n\n return [\n flatten_responses(generations[i : i + step])\n for i in range(0, len(responses), step)\n ], dict(statistics)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstruct.process","title":"process(inputs)
","text":"Processes the inputs of the task and generates the outputs using the LLM.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of Python dictionaries with the inputs of the task.
requiredYields:
Type DescriptionStepOutput
A list of Python dictionaries with the outputs of the task.
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
@override\ndef process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Processes the inputs of the task and generates the outputs using the LLM.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n\n evolved_instructions, statistics = self._evolve_instructions(inputs)\n\n if self.store_evolutions:\n # Remove the input instruction from the `evolved_instructions` list\n from_ = 1 if not self.include_original_instruction else 0\n evolved_instructions = [\n instruction[from_:] for instruction in evolved_instructions\n ]\n\n if not self.generate_answers:\n for input, instruction in zip(inputs, evolved_instructions):\n input.update(self.format_output(instruction))\n input.update(\n {\n \"distilabel_metadata\": {\n f\"statistics_instruction_{self.name}\": statistics\n }\n }\n )\n yield inputs\n\n self._logger.info(\n f\"\ud83c\udf89 Finished evolving {len(evolved_instructions)} instructions!\"\n )\n\n if self.generate_answers:\n self._logger.info(\n f\"\ud83e\udde0 Generating answers for the {len(evolved_instructions)} evolved instructions!\"\n )\n\n answers, statistics = self._generate_answers(evolved_instructions)\n\n self._logger.info(\n f\"\ud83c\udf89 Finished generating answers for the {len(evolved_instructions)} evolved\"\n \" instructions!\"\n )\n\n for idx, (input, instruction) in enumerate(\n zip(inputs, evolved_instructions)\n ):\n input.update(self.format_output(instruction, answers[idx]))\n input.update(\n {\n \"distilabel_metadata\": {\n f\"statistics_answer_{self.name}\": statistics\n }\n }\n )\n yield inputs\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolComplexity","title":"EvolComplexity
","text":" Bases: EvolInstruct
Evolve instructions to make them more complex using an LLM
.
EvolComplexity
is a task that evolves instructions to make them more complex. It is based on the EvolInstruct task, using slightly different prompts but the exact same evolutionary approach.
Attributes:
Name Type Descriptionnum_instructions
The number of instructions to be generated.
generate_answers
bool
Whether to generate answers for the instructions or not. Defaults to False
.
mutation_templates
Dict[str, str]
The mutation templates to be used for the generation of the instructions.
min_length
Dict[str, str]
Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512
.
max_length
Dict[str, str]
Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024
.
seed
RuntimeParameter[int]
The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
min_length
: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.max_length
: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.seed
: The seed to be set for numpy in order to randomly pick a mutation method.str
): The instruction to evolve.str
): The evolved instruction.str
, optional): The answer to the instruction if generate_answers=True
.str
): The name of the LLM used to evolve the instructions.Examples:
Evolve an instruction using an LLM:
from distilabel.steps.tasks import EvolComplexity\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_complexity = EvolComplexity(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n)\n\nevol_complexity.load()\n\nresult = next(evol_complexity.process([{\"instruction\": \"common instruction\"}]))\n# result\n# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]\n
Citations @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n}\n
@misc{xu2023wizardlmempoweringlargelanguage,\n title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},\n author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},\n year={2023},\n eprint={2304.12244},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2304.12244},\n}\n
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/base.py
class EvolComplexity(EvolInstruct):\n \"\"\"Evolve instructions to make them more complex using an `LLM`.\n\n `EvolComplexity` is a task that evolves instructions to make them more complex,\n and it is based in the EvolInstruct task, using slight different prompts, but the\n exact same evolutionary approach.\n\n Attributes:\n num_instructions: The number of instructions to be generated.\n generate_answers: Whether to generate answers for the instructions or not. Defaults\n to `False`.\n mutation_templates: The mutation templates to be used for the generation of the\n instructions.\n min_length: Defines the length (in bytes) that the generated instruction needs to\n be higher than, to be considered valid. Defaults to `512`.\n max_length: Defines the length (in bytes) that the generated instruction needs to\n be lower than, to be considered valid. Defaults to `1024`.\n seed: The seed to be set for `numpy` in order to randomly pick a mutation method.\n Defaults to `42`.\n\n Runtime parameters:\n - `min_length`: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.\n - `max_length`: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.\n - `seed`: The number of evolutions to be run.\n\n Input columns:\n - instruction (`str`): The instruction to evolve.\n\n Output columns:\n - evolved_instruction (`str`): The evolved instruction.\n - answer (`str`, optional): The answer to the instruction if `generate_answers=True`.\n - model_name (`str`): The name of the LLM used to evolve the instructions.\n\n Categories:\n - evol\n - instruction\n - deita\n\n References:\n - [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)\n - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)\n\n Examples:\n Evolve an instruction using an LLM:\n\n ```python\n from distilabel.steps.tasks import EvolComplexity\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n evol_complexity = EvolComplexity(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n )\n\n evol_complexity.load()\n\n result = next(evol_complexity.process([{\"instruction\": \"common instruction\"}]))\n # result\n # [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]\n ```\n\n Citations:\n ```\n @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n }\n ```\n\n ```\n @misc{xu2023wizardlmempoweringlargelanguage,\n title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},\n author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},\n year={2023},\n eprint={2304.12244},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2304.12244},\n }\n ```\n \"\"\"\n\n mutation_templates: Dict[str, str] = MUTATION_TEMPLATES\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolComplexityGenerator","title":"EvolComplexityGenerator
","text":" Bases: EvolInstructGenerator
Generate evolved instructions with increased complexity using an LLM
.
EvolComplexityGenerator
is a generation task that evolves instructions to make them more complex. It is based on the EvolInstruct task, using slightly different prompts but the exact same evolutionary approach.
Attributes:
Name Type Descriptionnum_instructions
int
The number of instructions to be generated.
generate_answers
bool
Whether to generate answers for the instructions or not. Defaults to False
.
mutation_templates
Dict[str, str]
The mutation templates to be used for the generation of the instructions.
min_length
RuntimeParameter[int]
Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512
.
max_length
RuntimeParameter[int]
Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024
.
seed
RuntimeParameter[int]
The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
min_length
: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.max_length
: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.seed
: The seed to be set for numpy in order to randomly pick a mutation method.str
): The evolved instruction.str
, optional): The answer to the instruction if generate_answers=True
.str
): The name of the LLM used to evolve the instructions.Examples:
Generate evolved instructions without initial instructions:
from distilabel.steps.tasks import EvolComplexityGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_complexity_generator = EvolComplexityGenerator(\n    llm=InferenceEndpointsLLM(\n        model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n    ),\n    num_instructions=2,\n)\n\nevol_complexity_generator.load()\n\nresult = next(evol_complexity_generator.process())\n# result\n# [{'instruction': 'generated instruction', 'model_name': 'model_name'}]\n
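Since min_length, max_length and seed are runtime parameters, they can also be overridden when the pipeline is run instead of being hard-coded in the step. A minimal sketch of how that could look (the pipeline and step names here are placeholders, not part of the example above):
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EvolComplexityGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\nwith Pipeline(name=\"evol-complexity-sketch\") as pipeline:\n    # Consider this as a placeholder for your actual LLM.\n    evol_complexity_generator = EvolComplexityGenerator(\n        name=\"evol_complexity_generator\",\n        llm=InferenceEndpointsLLM(\n            model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n        ),\n        num_instructions=2,\n    )\n\n# Runtime parameters are provided per step name when calling `Pipeline.run`.\ndistiset = pipeline.run(\n    parameters={\n        \"evol_complexity_generator\": {\"min_length\": 256, \"max_length\": 2048, \"seed\": 1234},\n    },\n)\n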
Citations @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n}\n
@misc{xu2023wizardlmempoweringlargelanguage,\n title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},\n author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},\n year={2023},\n eprint={2304.12244},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2304.12244},\n}\n
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/generator.py
class EvolComplexityGenerator(EvolInstructGenerator):\n \"\"\"Generate evolved instructions with increased complexity using an `LLM`.\n\n `EvolComplexityGenerator` is a generation task that evolves instructions to make\n them more complex, and it is based in the EvolInstruct task, but using slight different\n prompts, but the exact same evolutionary approach.\n\n Attributes:\n num_instructions: The number of instructions to be generated.\n generate_answers: Whether to generate answers for the instructions or not. Defaults\n to `False`.\n mutation_templates: The mutation templates to be used for the generation of the\n instructions.\n min_length: Defines the length (in bytes) that the generated instruction needs to\n be higher than, to be considered valid. Defaults to `512`.\n max_length: Defines the length (in bytes) that the generated instruction needs to\n be lower than, to be considered valid. Defaults to `1024`.\n seed: The seed to be set for `numpy` in order to randomly pick a mutation method.\n Defaults to `42`.\n\n Runtime parameters:\n - `min_length`: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.\n - `max_length`: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.\n - `seed`: The number of evolutions to be run.\n\n Output columns:\n - instruction (`str`): The evolved instruction.\n - answer (`str`, optional): The answer to the instruction if `generate_answers=True`.\n - model_name (`str`): The name of the LLM used to evolve the instructions.\n\n Categories:\n - evol\n - instruction\n - generation\n - deita\n\n References:\n - [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)\n - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)\n\n Examples:\n Generate evolved instructions without initial instructions:\n\n ```python\n from distilabel.steps.tasks import EvolComplexityGenerator\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n evol_complexity_generator = EvolComplexityGenerator(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_instructions=2,\n )\n\n evol_complexity_generator.load()\n\n result = next(scorer.process())\n # result\n # [{'instruction': 'generated instruction', 'model_name': 'test'}]\n ```\n\n Citations:\n ```\n @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n }\n ```\n\n ```\n @misc{xu2023wizardlmempoweringlargelanguage,\n title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},\n author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},\n year={2023},\n eprint={2304.12244},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2304.12244},\n }\n ```\n \"\"\"\n\n mutation_templates: Dict[str, str] = GENERATION_MUTATION_TEMPLATES\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstructGenerator","title":"EvolInstructGenerator
","text":" Bases: GeneratorTask
Generate evolved instructions using an LLM
.
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Attributes:
Name Type Descriptionnum_instructions
int
The number of instructions to be generated.
generate_answers
bool
Whether to generate answers for the instructions or not. Defaults to False
.
mutation_templates
Dict[str, str]
The mutation templates to be used for the generation of the instructions.
min_length
RuntimeParameter[int]
Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512
.
max_length
RuntimeParameter[int]
Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024
.
seed
RuntimeParameter[int]
The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
min_length
: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.max_length
: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.seed
: The seed to be set for numpy
in order to randomly pick a mutation method.str
): The generated instruction if generate_answers=False
.str
): The generated answer if generate_answers=True
.List[str]
): The generated instructions if generate_answers=True
.str
): The name of the LLM used to generate and evolve the instructions.Examples:
Generate evolved instructions without initial instructions:
from distilabel.steps.tasks import EvolInstructGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_instruct_generator = EvolInstructGenerator(\n    llm=InferenceEndpointsLLM(\n        model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n    ),\n    num_instructions=2,\n)\n\nevol_instruct_generator.load()\n\nresult = next(evol_instruct_generator.process())\n# result\n# [{'instruction': 'generated instruction', 'model_name': 'model_name'}]\n
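Setting generate_answers=True makes the task also answer each evolved instruction, adding an answer column to the output. A minimal sketch under that assumption (the model and the shown outputs are placeholders):
from distilabel.steps.tasks import EvolInstructGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_instruct_generator = EvolInstructGenerator(\n    llm=InferenceEndpointsLLM(\n        model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n    ),\n    num_instructions=2,\n    generate_answers=True,\n)\n\nevol_instruct_generator.load()\n\nresult = next(evol_instruct_generator.process())\n# result\n# [{'instruction': 'generated instruction', 'answer': 'generated answer', 'model_name': 'model_name'}]\n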
Citations @misc{xu2023wizardlmempoweringlargelanguage,\n title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},\n author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},\n year={2023},\n eprint={2304.12244},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2304.12244},\n}\n
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
class EvolInstructGenerator(GeneratorTask):\n \"\"\"Generate evolved instructions using an `LLM`.\n\n WizardLM: Empowering Large Language Models to Follow Complex Instructions\n\n Attributes:\n num_instructions: The number of instructions to be generated.\n generate_answers: Whether to generate answers for the instructions or not. Defaults\n to `False`.\n mutation_templates: The mutation templates to be used for the generation of the\n instructions.\n min_length: Defines the length (in bytes) that the generated instruction needs to\n be higher than, to be considered valid. Defaults to `512`.\n max_length: Defines the length (in bytes) that the generated instruction needs to\n be lower than, to be considered valid. Defaults to `1024`.\n seed: The seed to be set for `numpy` in order to randomly pick a mutation method.\n Defaults to `42`.\n\n Runtime parameters:\n - `min_length`: Defines the length (in bytes) that the generated instruction needs\n to be higher than, to be considered valid.\n - `max_length`: Defines the length (in bytes) that the generated instruction needs\n to be lower than, to be considered valid.\n - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.\n\n Output columns:\n - instruction (`str`): The generated instruction if `generate_answers=False`.\n - answer (`str`): The generated answer if `generate_answers=True`.\n - instructions (`List[str]`): The generated instructions if `generate_answers=True`.\n - model_name (`str`): The name of the LLM used to generate and evolve the instructions.\n\n Categories:\n - evol\n - instruction\n - generation\n\n References:\n - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)\n - [GitHub: h2oai/h2o-wizardlm](https://github.com/h2oai/h2o-wizardlm)\n\n Examples:\n Generate evolved instructions without initial instructions:\n\n ```python\n from distilabel.steps.tasks import EvolInstructGenerator\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n evol_instruct_generator = EvolInstructGenerator(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_instructions=2,\n )\n\n evol_instruct_generator.load()\n\n result = next(scorer.process())\n # result\n # [{'instruction': 'generated instruction', 'model_name': 'test'}]\n ```\n\n Citations:\n ```\n @misc{xu2023wizardlmempoweringlargelanguage,\n title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},\n author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},\n year={2023},\n eprint={2304.12244},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2304.12244},\n }\n ```\n \"\"\"\n\n num_instructions: int\n generate_answers: bool = False\n mutation_templates: Dict[str, str] = GENERATION_MUTATION_TEMPLATES\n\n min_length: RuntimeParameter[int] = Field(\n default=512,\n description=\"Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.\",\n )\n max_length: RuntimeParameter[int] = Field(\n default=1024,\n description=\"Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.\",\n )\n\n seed: RuntimeParameter[int] = Field(\n default=42,\n description=\"As `numpy` is being used in order to randomly pick a mutation method, then is nice to seed a random seed.\",\n )\n _seed_texts: 
Optional[List[str]] = PrivateAttr(default_factory=list)\n _prompts: Optional[List[str]] = PrivateAttr(default_factory=list)\n\n def _generate_seed_texts(self) -> List[str]:\n \"\"\"Generates a list of seed texts to be used as part of the starting prompts for the task.\n\n It will use the `FRESH_START` mutation template, as it needs to generate text from scratch; and\n a list of English words will be used to generate the seed texts that will be provided to the\n mutation method and included within the prompt.\n\n Returns:\n A list of seed texts to be used as part of the starting prompts for the task.\n \"\"\"\n seed_texts = []\n for _ in range(self.num_instructions * 10):\n num_words = np.random.choice([1, 2, 3, 4])\n seed_texts.append(\n self.mutation_templates[\"FRESH_START\"].replace( # type: ignore\n \"<PROMPT>\",\n \", \".join(\n [\n np.random.choice(self._english_nouns).strip()\n for _ in range(num_words)\n ]\n ),\n )\n )\n return seed_texts\n\n @override\n def model_post_init(self, __context: Any) -> None:\n \"\"\"Override this method to perform additional initialization after `__init__` and `model_construct`.\n This is useful if you want to do some validation that requires the entire model to be initialized.\n \"\"\"\n super().model_post_init(__context)\n\n np.random.seed(self.seed)\n\n self._seed_texts = self._generate_seed_texts()\n self._prompts = [\n np.random.choice(self._seed_texts) for _ in range(self.num_instructions)\n ]\n\n @cached_property\n def _english_nouns(self) -> List[str]:\n \"\"\"A list of English nouns to be used as part of the starting prompts for the task.\n\n References:\n - https://github.com/h2oai/h2o-wizardlm\n \"\"\"\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps/tasks/evol_instruct/english_nouns.txt\"\n )\n with open(_path, mode=\"r\") as f:\n return [line.strip() for line in f.readlines()]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task are the `instruction`, the `answer` if `generate_answers=True`\n and the `model_name`.\"\"\"\n _outputs = [\"instruction\", \"model_name\"]\n if self.generate_answers:\n _outputs.append(\"answer\")\n return _outputs\n\n def format_output( # type: ignore\n self, instruction: str, answer: Optional[str] = None\n ) -> Dict[str, Any]:\n \"\"\"The output for the task is a dict with: `instruction`; `answer` if `generate_answers=True`;\n and, finally, the `model_name`.\n\n Args:\n instruction: The instruction to be included within the output.\n answer: The answer to be included within the output if `generate_answers=True`.\n\n Returns:\n If `generate_answers=True` return {\"instruction\": ..., \"answer\": ..., \"model_name\": ...};\n if `generate_answers=False` return {\"instruction\": ..., \"model_name\": ...};\n \"\"\"\n _output = {\n \"instruction\": instruction,\n \"model_name\": self.llm.model_name,\n }\n if self.generate_answers and answer is not None:\n _output[\"answer\"] = answer\n return _output\n\n @property\n def mutation_templates_names(self) -> List[str]:\n \"\"\"Returns the names i.e. keys of the provided `mutation_templates`.\"\"\"\n return list(self.mutation_templates.keys())\n\n def _apply_random_mutation(self, iter_no: int) -> List[\"ChatType\"]:\n \"\"\"Applies a random mutation from the ones provided as part of the `mutation_templates`\n enum, and returns the provided instruction within the mutation prompt.\n\n Args:\n iter_no: The iteration number to be used to check whether the iteration is the\n first one i.e. 
FRESH_START, or not.\n\n Returns:\n A random mutation prompt with the provided instruction formatted as an OpenAI conversation.\n \"\"\"\n prompts = []\n for idx in range(self.num_instructions):\n if (\n iter_no == 0\n or \"Write one question or request containing\" in self._prompts[idx] # type: ignore\n ):\n mutation = \"FRESH_START\"\n else:\n mutation = np.random.choice(self.mutation_templates_names)\n if mutation == \"FRESH_START\":\n self._prompts[idx] = np.random.choice(self._seed_texts) # type: ignore\n\n prompt_with_template = (\n self.mutation_templates[mutation].replace( # type: ignore\n \"<PROMPT>\",\n self._prompts[idx], # type: ignore\n ) # type: ignore\n if iter_no != 0\n else self._prompts[idx] # type: ignore\n )\n prompts.append([{\"role\": \"user\", \"content\": prompt_with_template}])\n return prompts\n\n def _generate_answers(\n self, instructions: List[List[str]]\n ) -> Tuple[List[str], \"LLMStatistics\"]:\n \"\"\"Generates the answer for the last instruction in `instructions`.\n\n Args:\n instructions: A list of lists where each item is a list with either the last\n evolved instruction if `store_evolutions=False` or all the evolved instructions\n if `store_evolutions=True`.\n\n Returns:\n A list of answers for the last instruction in `instructions`.\n \"\"\"\n # TODO: update to generate answers for all the instructions\n _formatted_instructions = [\n [{\"role\": \"user\", \"content\": instruction[-1]}]\n for instruction in instructions\n ]\n responses = self.llm.generate(\n _formatted_instructions,\n **self.llm.generation_kwargs, # type: ignore\n )\n statistics: Dict[str, Any] = defaultdict(list)\n for response in responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n return flatten_responses(\n [response[\"generations\"] for response in responses]\n ), dict(statistics)\n\n @override\n def process(self, offset: int = 0) -> \"GeneratorStepOutput\": # NOQA: C901, type: ignore\n \"\"\"Processes the inputs of the task and generates the outputs using the LLM.\n\n Args:\n offset: The offset to start the generation from. Defaults to 0.\n\n Yields:\n A list of Python dictionaries with the outputs of the task, and a boolean\n flag indicating whether the task has finished or not i.e. 
is the last batch.\n \"\"\"\n instructions = []\n mutation_no = 0\n\n # TODO: update to take into account `offset`\n iter_no = 0\n while len(instructions) < self.num_instructions:\n prompts = self._apply_random_mutation(iter_no=iter_no)\n\n # TODO: Update the function to extract from the dict\n responses = self.llm.generate(prompts, **self.llm.generation_kwargs) # type: ignore\n\n generated_prompts = flatten_responses(\n [response[\"generations\"] for response in responses]\n )\n statistics: \"LLMStatistics\" = defaultdict(list)\n for response in responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n for idx, generated_prompt in enumerate(generated_prompts):\n generated_prompt = generated_prompt.split(\"Prompt#:\")[-1].strip()\n if self.max_length >= len(generated_prompt) >= self.min_length: # type: ignore\n instructions.append(generated_prompt)\n self._prompts[idx] = np.random.choice(self._seed_texts) # type: ignore\n else:\n self._prompts[idx] = generated_prompt # type: ignore\n\n self._logger.info(\n f\"\ud83d\udd04 Ran iteration {iter_no} with {len(instructions)} instructions already evolved!\"\n )\n iter_no += 1\n\n if len(instructions) > self.num_instructions:\n instructions = instructions[: self.num_instructions]\n if len(instructions) > mutation_no:\n mutation_no = len(instructions) - mutation_no\n\n if not self.generate_answers and len(instructions[-mutation_no:]) > 0:\n formatted_generations = []\n for mutated_instruction in instructions[-mutation_no:]:\n mutated_instruction = self.format_output(mutated_instruction)\n mutated_instruction[\"distilabel_metadata\"] = {\n f\"statistics_instruction_{self.name}\": dict(statistics)\n }\n formatted_generations.append(mutated_instruction)\n yield (\n formatted_generations,\n len(instructions) >= self.num_instructions,\n )\n\n self._logger.info(f\"\ud83c\udf89 Finished evolving {len(instructions)} instructions!\")\n\n if self.generate_answers:\n self._logger.info(\n f\"\ud83e\udde0 Generating answers for the {len(instructions)} evolved instructions!\"\n )\n\n answers, statistics = self._generate_answers(instructions)\n\n self._logger.info(\n f\"\ud83c\udf89 Finished generating answers for the {len(instructions)} evolved instructions!\"\n )\n\n formatted_outputs = []\n for instruction, answer in zip(instructions, answers):\n formatted_output = self.format_output(instruction, answer)\n formatted_output[\"distilabel_metadata\"] = {\n f\"statistics_answer_{self.name}\": dict(statistics)\n }\n formatted_outputs.append(formatted_output)\n\n yield (\n formatted_outputs,\n True,\n )\n\n @override\n def _sample_input(self) -> \"ChatType\":\n return self._apply_random_mutation(iter_no=0)[0]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstructGenerator._english_nouns","title":"_english_nouns
cached
property
","text":"A list of English nouns to be used as part of the starting prompts for the task.
Referencesoutputs
property
","text":"The output for the task are the instruction
, the answer
if generate_answers=True
and the model_name
.
mutation_templates_names
property
","text":"Returns the names i.e. keys of the provided mutation_templates
.
_generate_seed_texts()
","text":"Generates a list of seed texts to be used as part of the starting prompts for the task.
It will use the FRESH_START
mutation template, as it needs to generate text from scratch; and a list of English words will be used to generate the seed texts that will be provided to the mutation method and included within the prompt.
Returns:
Type DescriptionList[str]
A list of seed texts to be used as part of the starting prompts for the task.
Source code insrc/distilabel/steps/tasks/evol_instruct/generator.py
def _generate_seed_texts(self) -> List[str]:\n \"\"\"Generates a list of seed texts to be used as part of the starting prompts for the task.\n\n It will use the `FRESH_START` mutation template, as it needs to generate text from scratch; and\n a list of English words will be used to generate the seed texts that will be provided to the\n mutation method and included within the prompt.\n\n Returns:\n A list of seed texts to be used as part of the starting prompts for the task.\n \"\"\"\n seed_texts = []\n for _ in range(self.num_instructions * 10):\n num_words = np.random.choice([1, 2, 3, 4])\n seed_texts.append(\n self.mutation_templates[\"FRESH_START\"].replace( # type: ignore\n \"<PROMPT>\",\n \", \".join(\n [\n np.random.choice(self._english_nouns).strip()\n for _ in range(num_words)\n ]\n ),\n )\n )\n return seed_texts\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstructGenerator.model_post_init","title":"model_post_init(__context)
","text":"Override this method to perform additional initialization after __init__
and model_construct
. This is useful if you want to do some validation that requires the entire model to be initialized.
src/distilabel/steps/tasks/evol_instruct/generator.py
@override\ndef model_post_init(self, __context: Any) -> None:\n \"\"\"Override this method to perform additional initialization after `__init__` and `model_construct`.\n This is useful if you want to do some validation that requires the entire model to be initialized.\n \"\"\"\n super().model_post_init(__context)\n\n np.random.seed(self.seed)\n\n self._seed_texts = self._generate_seed_texts()\n self._prompts = [\n np.random.choice(self._seed_texts) for _ in range(self.num_instructions)\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstructGenerator.format_output","title":"format_output(instruction, answer=None)
","text":"The output for the task is a dict with: instruction
; answer
if generate_answers=True
; and, finally, the model_name
.
Parameters:
Name Type Description Defaultinstruction
str
The instruction to be included within the output.
requiredanswer
Optional[str]
The answer to be included within the output if generate_answers=True
.
None
Returns:
Type DescriptionDict[str, Any]
If generate_answers=True
return {\"instruction\": ..., \"answer\": ..., \"model_name\": ...};
Dict[str, Any]
if generate_answers=False
return {\"instruction\": ..., \"model_name\": ...};
src/distilabel/steps/tasks/evol_instruct/generator.py
def format_output( # type: ignore\n self, instruction: str, answer: Optional[str] = None\n) -> Dict[str, Any]:\n \"\"\"The output for the task is a dict with: `instruction`; `answer` if `generate_answers=True`;\n and, finally, the `model_name`.\n\n Args:\n instruction: The instruction to be included within the output.\n answer: The answer to be included within the output if `generate_answers=True`.\n\n Returns:\n If `generate_answers=True` return {\"instruction\": ..., \"answer\": ..., \"model_name\": ...};\n if `generate_answers=False` return {\"instruction\": ..., \"model_name\": ...};\n \"\"\"\n _output = {\n \"instruction\": instruction,\n \"model_name\": self.llm.model_name,\n }\n if self.generate_answers and answer is not None:\n _output[\"answer\"] = answer\n return _output\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstructGenerator._apply_random_mutation","title":"_apply_random_mutation(iter_no)
","text":"Applies a random mutation from the ones provided as part of the mutation_templates
enum, and returns the provided instruction within the mutation prompt.
Parameters:
Name Type Description Defaultiter_no
int
The iteration number to be used to check whether the iteration is the first one i.e. FRESH_START, or not.
requiredReturns:
Type DescriptionList[ChatType]
A random mutation prompt with the provided instruction formatted as an OpenAI conversation.
Source code insrc/distilabel/steps/tasks/evol_instruct/generator.py
def _apply_random_mutation(self, iter_no: int) -> List[\"ChatType\"]:\n \"\"\"Applies a random mutation from the ones provided as part of the `mutation_templates`\n enum, and returns the provided instruction within the mutation prompt.\n\n Args:\n iter_no: The iteration number to be used to check whether the iteration is the\n first one i.e. FRESH_START, or not.\n\n Returns:\n A random mutation prompt with the provided instruction formatted as an OpenAI conversation.\n \"\"\"\n prompts = []\n for idx in range(self.num_instructions):\n if (\n iter_no == 0\n or \"Write one question or request containing\" in self._prompts[idx] # type: ignore\n ):\n mutation = \"FRESH_START\"\n else:\n mutation = np.random.choice(self.mutation_templates_names)\n if mutation == \"FRESH_START\":\n self._prompts[idx] = np.random.choice(self._seed_texts) # type: ignore\n\n prompt_with_template = (\n self.mutation_templates[mutation].replace( # type: ignore\n \"<PROMPT>\",\n self._prompts[idx], # type: ignore\n ) # type: ignore\n if iter_no != 0\n else self._prompts[idx] # type: ignore\n )\n prompts.append([{\"role\": \"user\", \"content\": prompt_with_template}])\n return prompts\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstructGenerator._generate_answers","title":"_generate_answers(instructions)
","text":"Generates the answer for the last instruction in instructions
.
Parameters:
Name Type Description Defaultinstructions
List[List[str]]
A list of lists where each item is a list with either the last evolved instruction if store_evolutions=False
or all the evolved instructions if store_evolutions=True
.
Returns:
Type DescriptionTuple[List[str], LLMStatistics]
A list of answers for the last instruction in instructions
.
src/distilabel/steps/tasks/evol_instruct/generator.py
def _generate_answers(\n self, instructions: List[List[str]]\n) -> Tuple[List[str], \"LLMStatistics\"]:\n \"\"\"Generates the answer for the last instruction in `instructions`.\n\n Args:\n instructions: A list of lists where each item is a list with either the last\n evolved instruction if `store_evolutions=False` or all the evolved instructions\n if `store_evolutions=True`.\n\n Returns:\n A list of answers for the last instruction in `instructions`.\n \"\"\"\n # TODO: update to generate answers for all the instructions\n _formatted_instructions = [\n [{\"role\": \"user\", \"content\": instruction[-1]}]\n for instruction in instructions\n ]\n responses = self.llm.generate(\n _formatted_instructions,\n **self.llm.generation_kwargs, # type: ignore\n )\n statistics: Dict[str, Any] = defaultdict(list)\n for response in responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n return flatten_responses(\n [response[\"generations\"] for response in responses]\n ), dict(statistics)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstructGenerator.process","title":"process(offset=0)
","text":"Processes the inputs of the task and generates the outputs using the LLM.
Parameters:
Name Type Description Defaultoffset
int
The offset to start the generation from. Defaults to 0.
0
Yields:
Type DescriptionGeneratorStepOutput
A list of Python dictionaries with the outputs of the task, and a boolean
GeneratorStepOutput
flag indicating whether the task has finished or not i.e. is the last batch.
Source code insrc/distilabel/steps/tasks/evol_instruct/generator.py
@override\ndef process(self, offset: int = 0) -> \"GeneratorStepOutput\": # NOQA: C901, type: ignore\n \"\"\"Processes the inputs of the task and generates the outputs using the LLM.\n\n Args:\n offset: The offset to start the generation from. Defaults to 0.\n\n Yields:\n A list of Python dictionaries with the outputs of the task, and a boolean\n flag indicating whether the task has finished or not i.e. is the last batch.\n \"\"\"\n instructions = []\n mutation_no = 0\n\n # TODO: update to take into account `offset`\n iter_no = 0\n while len(instructions) < self.num_instructions:\n prompts = self._apply_random_mutation(iter_no=iter_no)\n\n # TODO: Update the function to extract from the dict\n responses = self.llm.generate(prompts, **self.llm.generation_kwargs) # type: ignore\n\n generated_prompts = flatten_responses(\n [response[\"generations\"] for response in responses]\n )\n statistics: \"LLMStatistics\" = defaultdict(list)\n for response in responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n for idx, generated_prompt in enumerate(generated_prompts):\n generated_prompt = generated_prompt.split(\"Prompt#:\")[-1].strip()\n if self.max_length >= len(generated_prompt) >= self.min_length: # type: ignore\n instructions.append(generated_prompt)\n self._prompts[idx] = np.random.choice(self._seed_texts) # type: ignore\n else:\n self._prompts[idx] = generated_prompt # type: ignore\n\n self._logger.info(\n f\"\ud83d\udd04 Ran iteration {iter_no} with {len(instructions)} instructions already evolved!\"\n )\n iter_no += 1\n\n if len(instructions) > self.num_instructions:\n instructions = instructions[: self.num_instructions]\n if len(instructions) > mutation_no:\n mutation_no = len(instructions) - mutation_no\n\n if not self.generate_answers and len(instructions[-mutation_no:]) > 0:\n formatted_generations = []\n for mutated_instruction in instructions[-mutation_no:]:\n mutated_instruction = self.format_output(mutated_instruction)\n mutated_instruction[\"distilabel_metadata\"] = {\n f\"statistics_instruction_{self.name}\": dict(statistics)\n }\n formatted_generations.append(mutated_instruction)\n yield (\n formatted_generations,\n len(instructions) >= self.num_instructions,\n )\n\n self._logger.info(f\"\ud83c\udf89 Finished evolving {len(instructions)} instructions!\")\n\n if self.generate_answers:\n self._logger.info(\n f\"\ud83e\udde0 Generating answers for the {len(instructions)} evolved instructions!\"\n )\n\n answers, statistics = self._generate_answers(instructions)\n\n self._logger.info(\n f\"\ud83c\udf89 Finished generating answers for the {len(instructions)} evolved instructions!\"\n )\n\n formatted_outputs = []\n for instruction, answer in zip(instructions, answers):\n formatted_output = self.format_output(instruction, answer)\n formatted_output[\"distilabel_metadata\"] = {\n f\"statistics_answer_{self.name}\": dict(statistics)\n }\n formatted_outputs.append(formatted_output)\n\n yield (\n formatted_outputs,\n True,\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolQuality","title":"EvolQuality
","text":" Bases: Task
Evolve the quality of the responses using an LLM
.
EvolQuality
task is used to evolve the quality of the responses given a prompt, by generating a new response with a language model. This step implements the evolution quality task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
Attributes:
Name Type Descriptionnum_evolutions
int
The number of evolutions to be performed on the responses.
store_evolutions
bool
Whether to store all the evolved responses or just the last one. Defaults to False
.
include_original_response
bool
Whether to include the original response within the evolved responses. Defaults to False
.
mutation_templates
Dict[str, str]
The mutation templates to be used to evolve the responses.
seed
RuntimeParameter[int]
The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
seed
: The seed to be set for numpy
in order to randomly pick a mutation method.str
): The instruction that was used to generate the responses
.str
): The response to be rewritten.str
): The evolved response if store_evolutions=False
.List[str]
): The evolved responses if store_evolutions=True
.str
): The name of the LLM used to evolve the responses.What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
Examples:
Evolve the quality of the responses given a prompt:
from distilabel.steps.tasks import EvolQuality\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_quality = EvolQuality(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n)\n\nevol_quality.load()\n\nresult = next(\n evol_quality.process(\n [\n {\"instruction\": \"common instruction\", \"response\": \"a response\"},\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'common instruction',\n# 'response': 'a response',\n# 'evolved_response': 'evolved response',\n# 'model_name': '\"mistralai/Mistral-7B-Instruct-v0.2\"'\n# }\n# ]\n
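When store_evolutions=True, every intermediate rewrite is kept and the output column becomes evolved_responses instead of evolved_response. A minimal sketch under that assumption:
from distilabel.steps.tasks import EvolQuality\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_quality = EvolQuality(\n    llm=InferenceEndpointsLLM(\n        model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n    ),\n    num_evolutions=2,\n    store_evolutions=True,\n)\n\nevol_quality.load()\n\nresult = next(\n    evol_quality.process(\n        [\n            {\"instruction\": \"common instruction\", \"response\": \"a response\"},\n        ]\n    )\n)\n# Each row now carries an 'evolved_responses' list with one entry per evolution.\n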
Citations @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n}\n
Source code in src/distilabel/steps/tasks/evol_quality/base.py
class EvolQuality(Task):\n \"\"\"Evolve the quality of the responses using an `LLM`.\n\n `EvolQuality` task is used to evolve the quality of the responses given a prompt,\n by generating a new response with a language model. This step implements the evolution\n quality task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of\n Automatic Data Selection in Instruction Tuning'.\n\n Attributes:\n num_evolutions: The number of evolutions to be performed on the responses.\n store_evolutions: Whether to store all the evolved responses or just the last one.\n Defaults to `False`.\n include_original_response: Whether to include the original response within the evolved\n responses. Defaults to `False`.\n mutation_templates: The mutation templates to be used to evolve the responses.\n seed: The seed to be set for `numpy` in order to randomly pick a mutation method.\n Defaults to `42`.\n\n Runtime parameters:\n - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.\n\n Input columns:\n - instruction (`str`): The instruction that was used to generate the `responses`.\n - response (`str`): The responses to be rewritten.\n\n Output columns:\n - evolved_response (`str`): The evolved response if `store_evolutions=False`.\n - evolved_responses (`List[str]`): The evolved responses if `store_evolutions=True`.\n - model_name (`str`): The name of the LLM used to evolve the responses.\n\n Categories:\n - evol\n - response\n - deita\n\n References:\n - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)\n\n Examples:\n Evolve the quality of the responses given a prompt:\n\n ```python\n from distilabel.steps.tasks import EvolQuality\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n evol_quality = EvolQuality(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n )\n\n evol_quality.load()\n\n result = next(\n evol_quality.process(\n [\n {\"instruction\": \"common instruction\", \"response\": \"a response\"},\n ]\n )\n )\n # result\n # [\n # {\n # 'instruction': 'common instruction',\n # 'response': 'a response',\n # 'evolved_response': 'evolved response',\n # 'model_name': '\"mistralai/Mistral-7B-Instruct-v0.2\"'\n # }\n # ]\n ```\n\n Citations:\n ```\n @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? 
A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n }\n ```\n \"\"\"\n\n num_evolutions: int\n store_evolutions: bool = False\n include_original_response: bool = False\n mutation_templates: Dict[str, str] = MUTATION_TEMPLATES\n\n seed: RuntimeParameter[int] = Field(\n default=42,\n description=\"As `numpy` is being used in order to randomly pick a mutation method, then is nice to set a random seed.\",\n )\n\n @override\n def model_post_init(self, __context: Any) -> None:\n \"\"\"Override this method to perform additional initialization after `__init__` and `model_construct`.\n This is useful if you want to do some validation that requires the entire model to be initialized.\n \"\"\"\n super().model_post_init(__context)\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task are the `instruction` and `response`.\"\"\"\n return [\"instruction\", \"response\"]\n\n def format_input(self, input: str) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation. And the\n `system_prompt` is added as the first message if it exists.\"\"\"\n return [{\"role\": \"user\", \"content\": input}]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task are the `evolved_response/s` and the `model_name`.\"\"\"\n # TODO: having to define a `model_name` column every time as the `Task.outputs` is not ideal,\n # this could be handled always and the value could be included within the DAG validation when\n # a `Task` is used, since all the `Task` subclasses will have an `llm` with a `model_name` attr.\n _outputs = [\n (\"evolved_response\" if not self.store_evolutions else \"evolved_responses\"),\n \"model_name\",\n ]\n\n return _outputs\n\n def format_output(self, responses: Union[str, List[str]]) -> Dict[str, Any]: # type: ignore\n \"\"\"The output for the task is a dict with: `evolved_response` or `evolved_responses`,\n depending whether the value is either `False` or `True` for `store_evolutions`, respectively;\n and, finally, the `model_name`.\n\n Args:\n responses: The responses to be included within the output.\n\n Returns:\n if `store_evolutions=False` return {\"evolved_response\": ..., \"model_name\": ...};\n if `store_evolutions=True` return {\"evolved_responses\": ..., \"model_name\": ...}.\n \"\"\"\n _output = {}\n\n if not self.store_evolutions:\n _output[\"evolved_response\"] = responses[-1]\n else:\n _output[\"evolved_responses\"] = responses\n\n _output[\"model_name\"] = self.llm.model_name\n return _output\n\n @property\n def mutation_templates_names(self) -> List[str]:\n \"\"\"Returns the names i.e. 
keys of the provided `mutation_templates` enum.\"\"\"\n return list(self.mutation_templates.keys())\n\n def _apply_random_mutation(self, instruction: str, response: str) -> str:\n \"\"\"Applies a random mutation from the ones provided as part of the `mutation_templates`\n enum, and returns the provided instruction within the mutation prompt.\n\n Args:\n instruction: The instruction to be included within the mutation prompt.\n\n Returns:\n A random mutation prompt with the provided instruction.\n \"\"\"\n mutation = np.random.choice(self.mutation_templates_names)\n return (\n self.mutation_templates[mutation]\n .replace(\"<PROMPT>\", instruction)\n .replace(\"<RESPONSE>\", response)\n )\n\n def _evolve_reponses(\n self, inputs: \"StepInput\"\n ) -> Tuple[List[List[str]], Dict[str, Any]]:\n \"\"\"Evolves the instructions provided as part of the inputs of the task.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list where each item is a list with either the last evolved instruction if\n `store_evolutions=False` or all the evolved instructions if `store_evolutions=True`.\n \"\"\"\n np.random.seed(self.seed)\n instructions: List[List[str]] = [[input[\"instruction\"]] for input in inputs]\n responses: List[List[str]] = [[input[\"response\"]] for input in inputs]\n statistics: Dict[str, Any] = defaultdict(list)\n\n for iter_no in range(self.num_evolutions):\n formatted_prompts = []\n for instruction, response in zip(instructions, responses):\n formatted_prompts.append(\n self._apply_random_mutation(instruction[-1], response[-1])\n )\n\n formatted_prompts = [\n self.format_input(prompt) for prompt in formatted_prompts\n ]\n\n generated_responses = self.llm.generate(\n formatted_prompts,\n **self.llm.generation_kwargs, # type: ignore\n )\n for response in generated_responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n if self.store_evolutions:\n responses = [\n response + [evolved_response[\"generations\"][0]]\n for response, evolved_response in zip(\n responses, generated_responses\n )\n ]\n else:\n responses = [\n [evolved_response[\"generations\"][0]]\n for evolved_response in generated_responses\n ]\n\n self._logger.info(\n f\"\ud83d\udd04 Ran iteration {iter_no} evolving {len(responses)} responses!\"\n )\n\n return responses, dict(statistics)\n\n @override\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Processes the inputs of the task and generates the outputs using the LLM.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n\n responses, statistics = self._evolve_reponses(inputs)\n\n if self.store_evolutions:\n # Remove the input instruction from the `evolved_responses` list\n from_ = 1 if not self.include_original_response else 0\n responses = [response[from_:] for response in responses]\n\n for input, response in zip(inputs, responses):\n input.update(self.format_output(response))\n input.update(\n {\"distilabel_metadata\": {f\"statistics_{self.name}\": statistics}}\n )\n yield inputs\n\n self._logger.info(f\"\ud83c\udf89 Finished evolving {len(responses)} instructions!\")\n\n @override\n def _sample_input(self) -> ChatType:\n return self.format_input(\"<PLACEHOLDER_INSTRUCTION>\")\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolQuality.inputs","title":"inputs
property
","text":"The input for the task are the instruction
and response
.
outputs
property
","text":"The output for the task are the evolved_response/s
and the model_name
.
mutation_templates_names
property
","text":"Returns the names i.e. keys of the provided mutation_templates
enum.
model_post_init(__context)
","text":"Override this method to perform additional initialization after __init__
and model_construct
. This is useful if you want to do some validation that requires the entire model to be initialized.
src/distilabel/steps/tasks/evol_quality/base.py
@override\ndef model_post_init(self, __context: Any) -> None:\n \"\"\"Override this method to perform additional initialization after `__init__` and `model_construct`.\n This is useful if you want to do some validation that requires the entire model to be initialized.\n \"\"\"\n super().model_post_init(__context)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolQuality.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation. And the system_prompt
is added as the first message if it exists.
src/distilabel/steps/tasks/evol_quality/base.py
def format_input(self, input: str) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation. And the\n `system_prompt` is added as the first message if it exists.\"\"\"\n return [{\"role\": \"user\", \"content\": input}]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolQuality.format_output","title":"format_output(responses)
","text":"The output for the task is a dict with: evolved_response
or evolved_responses
, depending whether the value is either False
or True
for store_evolutions
, respectively; and, finally, the model_name
.
Parameters:
Name Type Description Defaultresponses
Union[str, List[str]]
The responses to be included within the output.
requiredReturns:
Type DescriptionDict[str, Any]
if store_evolutions=False
return {\"evolved_response\": ..., \"model_name\": ...};
Dict[str, Any]
if store_evolutions=True
return {\"evolved_responses\": ..., \"model_name\": ...}.
src/distilabel/steps/tasks/evol_quality/base.py
def format_output(self, responses: Union[str, List[str]]) -> Dict[str, Any]: # type: ignore\n \"\"\"The output for the task is a dict with: `evolved_response` or `evolved_responses`,\n depending whether the value is either `False` or `True` for `store_evolutions`, respectively;\n and, finally, the `model_name`.\n\n Args:\n responses: The responses to be included within the output.\n\n Returns:\n if `store_evolutions=False` return {\"evolved_response\": ..., \"model_name\": ...};\n if `store_evolutions=True` return {\"evolved_responses\": ..., \"model_name\": ...}.\n \"\"\"\n _output = {}\n\n if not self.store_evolutions:\n _output[\"evolved_response\"] = responses[-1]\n else:\n _output[\"evolved_responses\"] = responses\n\n _output[\"model_name\"] = self.llm.model_name\n return _output\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolQuality._apply_random_mutation","title":"_apply_random_mutation(instruction, response)
","text":"Applies a random mutation from the ones provided as part of the mutation_templates
enum, and returns the provided instruction within the mutation prompt.
Parameters:
Name Type Description Defaultinstruction
str
The instruction to be included within the mutation prompt.
requiredReturns:
Type Descriptionstr
A random mutation prompt with the provided instruction.
Source code insrc/distilabel/steps/tasks/evol_quality/base.py
def _apply_random_mutation(self, instruction: str, response: str) -> str:\n \"\"\"Applies a random mutation from the ones provided as part of the `mutation_templates`\n enum, and returns the provided instruction within the mutation prompt.\n\n Args:\n instruction: The instruction to be included within the mutation prompt.\n\n Returns:\n A random mutation prompt with the provided instruction.\n \"\"\"\n mutation = np.random.choice(self.mutation_templates_names)\n return (\n self.mutation_templates[mutation]\n .replace(\"<PROMPT>\", instruction)\n .replace(\"<RESPONSE>\", response)\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolQuality._evolve_reponses","title":"_evolve_reponses(inputs)
","text":"Evolves the instructions provided as part of the inputs of the task.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of Python dictionaries with the inputs of the task.
requiredReturns:
Type DescriptionList[List[str]]
A list where each item is a list with either the last evolved instruction if
Dict[str, Any]
store_evolutions=False
or all the evolved instructions if store_evolutions=True
.
src/distilabel/steps/tasks/evol_quality/base.py
def _evolve_reponses(\n self, inputs: \"StepInput\"\n) -> Tuple[List[List[str]], Dict[str, Any]]:\n \"\"\"Evolves the instructions provided as part of the inputs of the task.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list where each item is a list with either the last evolved instruction if\n `store_evolutions=False` or all the evolved instructions if `store_evolutions=True`.\n \"\"\"\n np.random.seed(self.seed)\n instructions: List[List[str]] = [[input[\"instruction\"]] for input in inputs]\n responses: List[List[str]] = [[input[\"response\"]] for input in inputs]\n statistics: Dict[str, Any] = defaultdict(list)\n\n for iter_no in range(self.num_evolutions):\n formatted_prompts = []\n for instruction, response in zip(instructions, responses):\n formatted_prompts.append(\n self._apply_random_mutation(instruction[-1], response[-1])\n )\n\n formatted_prompts = [\n self.format_input(prompt) for prompt in formatted_prompts\n ]\n\n generated_responses = self.llm.generate(\n formatted_prompts,\n **self.llm.generation_kwargs, # type: ignore\n )\n for response in generated_responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n if self.store_evolutions:\n responses = [\n response + [evolved_response[\"generations\"][0]]\n for response, evolved_response in zip(\n responses, generated_responses\n )\n ]\n else:\n responses = [\n [evolved_response[\"generations\"][0]]\n for evolved_response in generated_responses\n ]\n\n self._logger.info(\n f\"\ud83d\udd04 Ran iteration {iter_no} evolving {len(responses)} responses!\"\n )\n\n return responses, dict(statistics)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolQuality.process","title":"process(inputs)
","text":"Processes the inputs of the task and generates the outputs using the LLM.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of Python dictionaries with the inputs of the task.
requiredReturns:
Type DescriptionStepOutput
A list of Python dictionaries with the outputs of the task.
Source code insrc/distilabel/steps/tasks/evol_quality/base.py
@override\ndef process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Processes the inputs of the task and generates the outputs using the LLM.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n\n responses, statistics = self._evolve_reponses(inputs)\n\n if self.store_evolutions:\n # Remove the input instruction from the `evolved_responses` list\n from_ = 1 if not self.include_original_response else 0\n responses = [response[from_:] for response in responses]\n\n for input, response in zip(inputs, responses):\n input.update(self.format_output(response))\n input.update(\n {\"distilabel_metadata\": {f\"statistics_{self.name}\": statistics}}\n )\n yield inputs\n\n self._logger.info(f\"\ud83c\udf89 Finished evolving {len(responses)} instructions!\")\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateEmbeddings","title":"GenerateEmbeddings
","text":" Bases: Step
Generate embeddings using the last hidden state of an LLM
.
Generate embeddings for a text input using the last hidden state of an LLM
, as described in the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
Attributes:
Name Type Descriptionllm
LLM
The LLM
to use to generate the embeddings.
str
, List[Dict[str, str]]
): The input text or conversation to generate embeddings for.List[float]
): The embedding of the input text or conversation.str
): The model name used to generate the embeddings.Examples:
Generate embeddings for a text input:
from distilabel.steps.tasks import GenerateEmbeddings\nfrom distilabel.models.llms.huggingface import TransformersLLM\n\n# Consider this as a placeholder for your actual LLM.\nembedder = GenerateEmbeddings(\n llm=TransformersLLM(\n model=\"TaylorAI/bge-micro-v2\",\n model_kwargs={\"is_decoder\": True},\n cuda_devices=[],\n )\n)\nembedder.load()\n\nresult = next(\n embedder.process(\n [\n {\"text\": \"Hello, how are you?\"},\n ]\n )\n)\n
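The embedding column is a plain list of floats, so it can be consumed directly downstream, for example to compare two texts. A minimal sketch of such a follow-up using numpy (the second text and the comparison are assumptions about downstream usage, not part of the step itself):
import numpy as np\n\n# Reusing the `embedder` defined above on two texts.\nresult = next(\n    embedder.process(\n        [\n            {\"text\": \"Hello, how are you?\"},\n            {\"text\": \"Hi, how is it going?\"},\n        ]\n    )\n)\n\nemb_a = np.array(result[0][\"embedding\"])\nemb_b = np.array(result[1][\"embedding\"])\n\n# Cosine similarity between the embeddings of the two texts.\nsimilarity = float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))\nprint(similarity)\n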
Citations @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n}\n
Source code in src/distilabel/steps/tasks/generate_embeddings.py
class GenerateEmbeddings(Step):\n \"\"\"Generate embeddings using the last hidden state of an `LLM`.\n\n Generate embeddings for a text input using the last hidden state of an `LLM`, as\n described in the paper 'What Makes Good Data for Alignment? A Comprehensive Study of\n Automatic Data Selection in Instruction Tuning'.\n\n Attributes:\n llm: The `LLM` to use to generate the embeddings.\n\n Input columns:\n - text (`str`, `List[Dict[str, str]]`): The input text or conversation to generate\n embeddings for.\n\n Output columns:\n - embedding (`List[float]`): The embedding of the input text or conversation.\n - model_name (`str`): The model name used to generate the embeddings.\n\n Categories:\n - embedding\n - llm\n\n References:\n - [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)\n\n Examples:\n Rank LLM candidates:\n\n ```python\n from distilabel.steps.tasks import GenerateEmbeddings\n from distilabel.models.llms.huggingface import TransformersLLM\n\n # Consider this as a placeholder for your actual LLM.\n embedder = GenerateEmbeddings(\n llm=TransformersLLM(\n model=\"TaylorAI/bge-micro-v2\",\n model_kwargs={\"is_decoder\": True},\n cuda_devices=[],\n )\n )\n embedder.load()\n\n result = next(\n embedder.process(\n [\n {\"text\": \"Hello, how are you?\"},\n ]\n )\n )\n ```\n\n Citations:\n ```\n @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n }\n ```\n \"\"\"\n\n llm: LLM\n\n def load(self) -> None:\n \"\"\"Loads the `LLM` used to generate the embeddings.\"\"\"\n super().load()\n\n self.llm.load()\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The inputs for the task is a `text` column containing either a string or a\n list of dictionaries in OpenAI chat-like format.\"\"\"\n return [\"text\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The outputs for the task is an `embedding` column containing the embedding of\n the `text` input.\"\"\"\n return [\"embedding\", \"model_name\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"Formats the input to be used by the LLM to generate the embeddings. The input\n can be in `ChatType` format or a string. If a string, it will be converted to a\n list of dictionaries in OpenAI chat-like format.\n\n Args:\n input: The input to format.\n\n Returns:\n The OpenAI chat-like format of the input.\n \"\"\"\n text = input[\"text\"] = input[\"text\"]\n\n # input is in `ChatType` format\n if isinstance(text, str):\n return [{\"role\": \"user\", \"content\": text}]\n\n if is_openai_format(text):\n return text\n\n raise DistilabelUserError(\n f\"Couldn't format input for step {self.name}. 
The `text` input column has to\"\n \" be a string or a list of dictionaries in OpenAI chat-like format.\",\n page=\"components-gallery/tasks/generateembeddings/\",\n )\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Generates an embedding for each input using the last hidden state of the `LLM`.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n formatted_inputs = [self.format_input(input) for input in inputs]\n last_hidden_states = self.llm.get_last_hidden_states(formatted_inputs)\n for input, hidden_state in zip(inputs, last_hidden_states):\n input[\"embedding\"] = hidden_state[-1].tolist()\n input[\"model_name\"] = self.llm.model_name\n yield inputs\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateEmbeddings.inputs","title":"inputs
property
","text":"The inputs for the task is a text
column containing either a string or a list of dictionaries in OpenAI chat-like format.
outputs
property
","text":"The outputs for the task is an embedding
column containing the embedding of the text
input.
load()
","text":"Loads the LLM
used to generate the embeddings.
src/distilabel/steps/tasks/generate_embeddings.py
def load(self) -> None:\n \"\"\"Loads the `LLM` used to generate the embeddings.\"\"\"\n super().load()\n\n self.llm.load()\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateEmbeddings.format_input","title":"format_input(input)
","text":"Formats the input to be used by the LLM to generate the embeddings. The input can be in ChatType
format or a string. If a string, it will be converted to a list of dictionaries in OpenAI chat-like format.
Parameters:
Name Type Description Defaultinput
Dict[str, Any]
The input to format.
requiredReturns:
Type DescriptionChatType
The OpenAI chat-like format of the input.
Source code in src/distilabel/steps/tasks/generate_embeddings.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"Formats the input to be used by the LLM to generate the embeddings. The input\n can be in `ChatType` format or a string. If a string, it will be converted to a\n list of dictionaries in OpenAI chat-like format.\n\n Args:\n input: The input to format.\n\n Returns:\n The OpenAI chat-like format of the input.\n \"\"\"\n text = input[\"text\"] = input[\"text\"]\n\n # input is in `ChatType` format\n if isinstance(text, str):\n return [{\"role\": \"user\", \"content\": text}]\n\n if is_openai_format(text):\n return text\n\n raise DistilabelUserError(\n f\"Couldn't format input for step {self.name}. The `text` input column has to\"\n \" be a string or a list of dictionaries in OpenAI chat-like format.\",\n page=\"components-gallery/tasks/generateembeddings/\",\n )\n
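As a quick illustration of the conversion described above, here is a minimal standalone sketch (plain Python, not part of distilabel; `to_chat` and `Message` are hypothetical names) of how a string input becomes a single-turn chat while an OpenAI-style conversation passes through unchanged:

```python
from typing import Dict, List, Union

Message = Dict[str, str]

def to_chat(text: Union[str, List[Message]]) -> List[Message]:
    # A plain string becomes a single user turn, mirroring what `format_input`
    # documents above; a list already in OpenAI chat format is returned as-is.
    if isinstance(text, str):
        return [{"role": "user", "content": text}]
    return text

print(to_chat("Hello, how are you?"))
# [{'role': 'user', 'content': 'Hello, how are you?'}]
print(to_chat([{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi there!"}]))
```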
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateEmbeddings.process","title":"process(inputs)
","text":"Generates an embedding for each input using the last hidden state of the LLM
.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of Python dictionaries with the inputs of the task.
requiredYields:
Type DescriptionStepOutput
A list of Python dictionaries with the outputs of the task.
Source code in src/distilabel/steps/tasks/generate_embeddings.py
def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Generates an embedding for each input using the last hidden state of the `LLM`.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n formatted_inputs = [self.format_input(input) for input in inputs]\n last_hidden_states = self.llm.get_last_hidden_states(formatted_inputs)\n for input, hidden_state in zip(inputs, last_hidden_states):\n input[\"embedding\"] = hidden_state[-1].tolist()\n input[\"model_name\"] = self.llm.model_name\n yield inputs\n
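Each yielded row carries the embedding as a plain `List[float]`, so downstream selection (e.g. the diversity or quality filtering explored in the referenced paper) can work directly on those lists; a small illustrative helper, not part of distilabel:

```python
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    # Compare two embedding vectors, e.g. to deduplicate near-identical samples.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Illustrative vectors; real embeddings have the hidden size of the chosen LLM.
print(cosine_similarity([0.1, 0.2, 0.3], [0.1, 0.25, 0.28]))
```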
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Genstruct","title":"Genstruct
","text":" Bases: Task
Generate a pair of instruction-response from a document using an LLM
.
Genstruct
is a pre-defined task designed to generate valid instructions from a given raw document, with the title and the content, enabling the creation of new, partially synthetic instruction finetuning datasets from any raw-text corpus. The task is based on the Genstruct 7B model by Nous Research, which is inspired by the Ada-Instruct paper.
The Genstruct prompt, i.e. the task, can be used with any model, but the safest and recommended option is to use NousResearch/Genstruct-7B
as the LLM provided to the task, since it was trained for this specific task.
Attributes:
Name Type Description_template
Union[Template, None]
a Jinja2 template used to format the input for the LLM.
Input columnsstr
): The title of the document.str
): The content of the document.str
): The user's instruction based on the document.str
): The assistant's response based on the user's instruction.str
): The model name used to generate the user
and assistant
.Examples:
Generate instructions from raw documents using the title and content:
from distilabel.steps.tasks import Genstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\ngenstruct = Genstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"NousResearch/Genstruct-7B\",\n ),\n)\n\ngenstruct.load()\n\nresult = next(\n genstruct.process(\n [\n {\"title\": \"common instruction\", \"content\": \"content of the document\"},\n ]\n )\n)\n# result\n# [\n# {\n# 'title': 'An instruction',\n# 'content': 'content of the document',\n# 'model_name': 'test',\n# 'user': 'An instruction',\n# 'assistant': 'content of the document',\n# }\n# ]\n
Citations @misc{cui2023adainstructadaptinginstructiongenerators,\n title={Ada-Instruct: Adapting Instruction Generators for Complex Reasoning},\n author={Wanyun Cui and Qianle Wang},\n year={2023},\n eprint={2310.04484},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2310.04484},\n}\n
Source code in src/distilabel/steps/tasks/genstruct.py
class Genstruct(Task):\n \"\"\"Generate a pair of instruction-response from a document using an `LLM`.\n\n `Genstruct` is a pre-defined task designed to generate valid instructions from a given raw document,\n with the title and the content, enabling the creation of new, partially synthetic instruction finetuning\n datasets from any raw-text corpus. The task is based on the Genstruct 7B model by Nous Research, which is\n inspired in the Ada-Instruct paper.\n\n Note:\n The Genstruct prompt i.e. the task, can be used with any model really, but the safest / recommended\n option is to use `NousResearch/Genstruct-7B` as the LLM provided to the task, since it was trained\n for this specific task.\n\n Attributes:\n _template: a Jinja2 template used to format the input for the LLM.\n\n Input columns:\n - title (`str`): The title of the document.\n - content (`str`): The content of the document.\n\n Output columns:\n - user (`str`): The user's instruction based on the document.\n - assistant (`str`): The assistant's response based on the user's instruction.\n - model_name (`str`): The model name used to generate the `feedback` and `result`.\n\n Categories:\n - text-generation\n - instruction\n - response\n\n References:\n - [Genstruct 7B by Nous Research](https://huggingface.co/NousResearch/Genstruct-7B)\n - [Ada-Instruct: Adapting Instruction Generators for Complex Reasoning](https://arxiv.org/abs/2310.04484)\n\n Examples:\n Generate instructions from raw documents using the title and content:\n\n ```python\n from distilabel.steps.tasks import Genstruct\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n genstruct = Genstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"NousResearch/Genstruct-7B\",\n ),\n )\n\n genstruct.load()\n\n result = next(\n genstruct.process(\n [\n {\"title\": \"common instruction\", \"content\": \"content of the document\"},\n ]\n )\n )\n # result\n # [\n # {\n # 'title': 'An instruction',\n # 'content': 'content of the document',\n # 'model_name': 'test',\n # 'user': 'An instruction',\n # 'assistant': 'content of the document',\n # }\n # ]\n ```\n\n Citations:\n ```\n @misc{cui2023adainstructadaptinginstructiongenerators,\n title={Ada-Instruct: Adapting Instruction Generators for Complex Reasoning},\n author={Wanyun Cui and Qianle Wang},\n year={2023},\n eprint={2310.04484},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2310.04484},\n }\n ```\n \"\"\"\n\n _template: Union[Template, None] = PrivateAttr(...)\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"genstruct.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs for the task are the `title` and the `content`.\"\"\"\n return [\"title\", \"content\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n title=input[\"title\"], content=input[\"content\"]\n ),\n }\n ]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task are the `user` instruction based on the provided document\n and the `assistant` response based on the user's 
instruction.\"\"\"\n return [\"user\", \"assistant\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted so that both the user and the assistant messages are\n captured.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with the keys `user` and `assistant` containing the content for each role.\n \"\"\"\n if output is None:\n return {\"user\": None, \"assistant\": None}\n\n matches = re.search(_PARSE_GENSTRUCT_OUTPUT_REGEX, output, re.DOTALL)\n if not matches:\n return {\"user\": None, \"assistant\": None}\n\n return {\n \"user\": matches.group(1).strip(),\n \"assistant\": matches.group(2).strip(),\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Genstruct.inputs","title":"inputs
property
","text":"The inputs for the task are the title
and the content
.
outputs
property
","text":"The output for the task are the user
instruction based on the provided document and the assistant
response based on the user's instruction.
load()
","text":"Loads the Jinja2 template.
Source code in src/distilabel/steps/tasks/genstruct.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"genstruct.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Genstruct.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/genstruct.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n title=input[\"title\"], content=input[\"content\"]\n ),\n }\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Genstruct.format_output","title":"format_output(output, input)
","text":"The output is formatted so that both the user and the assistant messages are captured.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
the raw output of the LLM.
requiredinput
Dict[str, Any]
the input to the task. Used for obtaining the number of responses.
requiredReturns:
Type DescriptionDict[str, Any]
A dict with the keys user
and assistant
containing the content for each role.
src/distilabel/steps/tasks/genstruct.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"The output is formatted so that both the user and the assistant messages are\n captured.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with the keys `user` and `assistant` containing the content for each role.\n \"\"\"\n if output is None:\n return {\"user\": None, \"assistant\": None}\n\n matches = re.search(_PARSE_GENSTRUCT_OUTPUT_REGEX, output, re.DOTALL)\n if not matches:\n return {\"user\": None, \"assistant\": None}\n\n return {\n \"user\": matches.group(1).strip(),\n \"assistant\": matches.group(2).strip(),\n }\n
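To make the parsing step concrete, here is a hedged sketch of what `format_output` does: search the raw completion with a two-group regex and map the groups to `user` and `assistant`. The marker pattern below is invented for illustration; the real `_PARSE_GENSTRUCT_OUTPUT_REGEX` is defined in distilabel's `genstruct.py`:

```python
import re
from typing import Any, Dict, Optional

# Invented markers for illustration only; the actual Genstruct template and regex may differ.
_ILLUSTRATIVE_REGEX = r"\[\[\[ USER \]\]\](.*?)\[\[\[ ASSISTANT \]\]\](.*)"

def parse_pair(output: Optional[str]) -> Dict[str, Any]:
    if output is None:
        return {"user": None, "assistant": None}
    matches = re.search(_ILLUSTRATIVE_REGEX, output, re.DOTALL)
    if not matches:
        return {"user": None, "assistant": None}
    return {"user": matches.group(1).strip(), "assistant": matches.group(2).strip()}

print(parse_pair("[[[ USER ]]] How do glaciers form? [[[ ASSISTANT ]]] They form when snow compacts into ice over many years."))
```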
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.BitextRetrievalGenerator","title":"BitextRetrievalGenerator
","text":" Bases: _EmbeddingDataGenerator
Generate bitext retrieval data with an LLM
to later on train an embedding model.
BitextRetrievalGenerator
is a GeneratorTask
that generates bitext retrieval data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Attributes:
Name Type Descriptionsource_language
str
The source language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
target_language
str
The target language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
unit
Optional[Literal['sentence', 'phrase', 'passage']]
The unit of the data to be generated, which can be sentence
, phrase
, or passage
. Defaults to None
, meaning that it will be randomly sampled.
difficulty
Optional[Literal['elementary school', 'high school', 'college']]
The difficulty of the query to be generated, which can be elementary school
, high school
, or college
. Defaults to None
, meaning that it will be randomly sampled.
high_score
Optional[Literal['4', '4.5', '5']]
The high score of the query to be generated, which can be 4
, 4.5
, or 5
. Defaults to None
, meaning that it will be randomly sampled.
low_score
Optional[Literal['2.5', '3', '3.5']]
The low score of the query to be generated, which can be 2.5
, 3
, or 3.5
. Defaults to None
, meaning that it will be randomly sampled.
seed
int
The random seed to be set in case there's any sampling within the format_input
method.
str
): the first sentence generated by the LLM
.str
): the second sentence generated by the LLM
.str
): the third sentence generated by the LLM
.str
): the name of the model used to generate the bitext retrieval data.Examples:
Generate bitext retrieval data for training embedding models:
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import BitextRetrievalGenerator\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = BitextRetrievalGenerator(\n source_language=\"English\",\n target_language=\"Spanish\",\n unit=\"sentence\",\n difficulty=\"elementary school\",\n high_score=\"4\",\n low_score=\"2.5\",\n llm=...,\n )\n\n ...\n\n task >> ...\n
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class BitextRetrievalGenerator(_EmbeddingDataGenerator):\n \"\"\"Generate bitext retrieval data with an `LLM` to later on train an embedding model.\n\n `BitextRetrievalGenerator` is a `GeneratorTask` that generates bitext retrieval data with an\n `LLM` to later on train an embedding model. The task is based on the paper \"Improving\n Text Embeddings with Large Language Models\" and the data is generated based on the\n provided attributes, or randomly sampled if not provided.\n\n Attributes:\n source_language: The source language of the data to be generated, which can be any of the languages\n retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.\n target_language: The target language of the data to be generated, which can be any of the languages\n retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.\n unit: The unit of the data to be generated, which can be `sentence`, `phrase`, or `passage`.\n Defaults to `None`, meaning that it will be randomly sampled.\n difficulty: The difficulty of the query to be generated, which can be `elementary school`, `high school`, or `college`.\n Defaults to `None`, meaning that it will be randomly sampled.\n high_score: The high score of the query to be generated, which can be `4`, `4.5`, or `5`.\n Defaults to `None`, meaning that it will be randomly sampled.\n low_score: The low score of the query to be generated, which can be `2.5`, `3`, or `3.5`.\n Defaults to `None`, meaning that it will be randomly sampled.\n seed: The random seed to be set in case there's any sampling within the `format_input` method.\n\n Output columns:\n - S1 (`str`): the first sentence generated by the `LLM`.\n - S2 (`str`): the second sentence generated by the `LLM`.\n - S3 (`str`): the third sentence generated by the `LLM`.\n - model_name (`str`): the name of the model used to generate the bitext retrieval\n data.\n\n Examples:\n Generate bitext retrieval data for training embedding models:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps.tasks import BitextRetrievalGenerator\n\n with Pipeline(\"my-pipeline\") as pipeline:\n task = BitextRetrievalGenerator(\n source_language=\"English\",\n target_language=\"Spanish\",\n unit=\"sentence\",\n difficulty=\"elementary school\",\n high_score=\"4\",\n low_score=\"2.5\",\n llm=...,\n )\n\n ...\n\n task >> ...\n ```\n \"\"\"\n\n source_language: str = Field(\n default=\"English\",\n description=\"The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf\",\n )\n target_language: str = Field(\n default=...,\n description=\"The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf\",\n )\n\n unit: Optional[Literal[\"sentence\", \"phrase\", \"passage\"]] = None\n difficulty: Optional[Literal[\"elementary school\", \"high school\", \"college\"]] = None\n high_score: Optional[Literal[\"4\", \"4.5\", \"5\"]] = None\n low_score: Optional[Literal[\"2.5\", \"3\", \"3.5\"]] = None\n\n _template_name: str = PrivateAttr(default=\"bitext-retrieval\")\n _can_be_used_with_offline_batch_generation = True\n\n @property\n def prompt(self) -> ChatType:\n \"\"\"Contains the `prompt` to be used in the `process` method, rendering the `_template`; and\n formatted as an OpenAI formatted chat i.e. 
a `ChatType`, assuming that there's only one turn,\n being from the user with the content being the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n source_language=self.source_language,\n target_language=self.target_language,\n unit=self.unit or random.choice([\"sentence\", \"phrase\", \"passage\"]),\n difficulty=self.difficulty\n or random.choice([\"elementary school\", \"high school\", \"college\"]),\n high_score=self.high_score or random.choice([\"4\", \"4.5\", \"5\"]),\n low_score=self.low_score or random.choice([\"2.5\", \"3\", \"3.5\"]),\n ).strip(),\n }\n ] # type: ignore\n\n @property\n def keys(self) -> List[str]:\n \"\"\"Contains the `keys` that will be parsed from the `LLM` output into a Python dict.\"\"\"\n return [\"S1\", \"S2\", \"S3\"]\n
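The `prompt` property above resolves every unset attribute with the `value or random.choice([...])` pattern; a tiny standalone illustration of that fallback (the option lists are copied from the attribute docs, the rest is invented):

```python
import random

random.seed(42)  # the task's `seed` attribute plays this role in the pipeline

unit = None                 # e.g. BitextRetrievalGenerator(..., unit=None)
difficulty = "college"      # explicitly set values are kept as-is

resolved_unit = unit or random.choice(["sentence", "phrase", "passage"])
resolved_difficulty = difficulty or random.choice(["elementary school", "high school", "college"])
print(resolved_unit, resolved_difficulty)
```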
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.BitextRetrievalGenerator.prompt","title":"prompt
property
","text":"Contains the prompt
to be used in the process
method, rendering the _template
; and formatted as an OpenAI formatted chat i.e. a ChatType
, assuming that there's only one turn, being from the user with the content being the rendered _template
.
keys
property
","text":"Contains the keys
that will be parsed from the LLM
output into a Python dict.
GenerateLongTextMatchingData
","text":" Bases: _EmbeddingDataGeneration
Generate long text matching data with an LLM
to later on train an embedding model.
GenerateLongTextMatchingData
is a Task
that generates long text matching data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category=\"text-matching-long\"
; so that the LLM
generates a list of tasks that are flattened so that each row contains a single task for the text-matching-long category.
Attributes:
Name Type Descriptionlanguage
str
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
seed
int
The random seed to be set in case there's any sampling within the format_input
method. Note that in this task the seed
has no effect since there are no sampling params.
str
): The task description to be used in the generation.str
): the input generated by the LLM
.str
): the positive document generated by the LLM
.str
): the name of the model used to generate the long text matching data.Examples:
Generate synthetic long text matching data for training embedding models:
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateLongTextMatchingData\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-matching-long\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateLongTextMatchingData(\n language=\"English\",\n llm=..., # LLM instance\n )\n\n task >> generate\n
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class GenerateLongTextMatchingData(_EmbeddingDataGeneration):\n \"\"\"Generate long text matching data with an `LLM` to later on train an embedding model.\n\n `GenerateLongTextMatchingData` is a `Task` that generates long text matching data with an\n `LLM` to later on train an embedding model. The task is based on the paper \"Improving\n Text Embeddings with Large Language Models\" and the data is generated based on the\n provided attributes, or randomly sampled if not provided.\n\n Note:\n Ideally this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True`\n with the `category=\"text-matching-long\"`; so that the `LLM` generates a list of tasks that\n are flattened so that each row contains a single task for the text-matching-long category.\n\n Attributes:\n language: The language of the data to be generated, which can be any of the languages\n retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.\n seed: The random seed to be set in case there's any sampling within the `format_input` method.\n Note that in this task the `seed` has no effect since there are no sampling params.\n\n Input columns:\n - task (`str`): The task description to be used in the generation.\n\n Output columns:\n - input (`str`): the input generated by the `LLM`.\n - positive_document (`str`): the positive document generated by the `LLM`.\n - model_name (`str`): the name of the model used to generate the long text matching\n data.\n\n References:\n - [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)\n\n Examples:\n Generate synthetic long text matching data for training embedding models:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateLongTextMatchingData\n\n with Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-matching-long\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateLongTextMatchingData(\n language=\"English\",\n llm=..., # LLM instance\n )\n\n task >> generate\n ```\n \"\"\"\n\n language: str = Field(\n default=\"English\",\n description=\"The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf\",\n )\n\n _template_name: str = PrivateAttr(default=\"long-text-matching\")\n _can_be_used_with_offline_batch_generation = True\n\n def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"Method to format the input based on the `task` and the provided attributes, or just\n randomly sampling those if not provided. This method will render the `_template` with\n the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that\n there's only one turn, being from the user with the content being the rendered `_template`.\n\n Args:\n input: The input dictionary containing the `task` to be used in the `_template`.\n\n Returns:\n A list with a single chat containing the user's message with the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n task=input[\"task\"],\n language=self.language,\n ).strip(),\n }\n ]\n\n @property\n def keys(self) -> List[str]:\n \"\"\"Contains the `keys` that will be parsed from the `LLM` output into a Python dict.\"\"\"\n return [\"input\", \"positive_document\"]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateLongTextMatchingData.keys","title":"keys
property
","text":"Contains the keys
that will be parsed from the LLM
output into a Python dict.
format_input(input)
","text":"Method to format the input based on the task
and the provided attributes, or just randomly sampling those if not provided. This method will render the _template
with the provided arguments and return an OpenAI formatted chat i.e. a ChatType
, assuming that there's only one turn, being from the user with the content being the rendered _template
.
Parameters:
Name Type Description Defaultinput
Dict[str, Any]
The input dictionary containing the task
to be used in the _template
.
Returns:
Type DescriptionChatType
A list with a single chat containing the user's message with the rendered _template
.
src/distilabel/steps/tasks/improving_text_embeddings.py
def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"Method to format the input based on the `task` and the provided attributes, or just\n randomly sampling those if not provided. This method will render the `_template` with\n the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that\n there's only one turn, being from the user with the content being the rendered `_template`.\n\n Args:\n input: The input dictionary containing the `task` to be used in the `_template`.\n\n Returns:\n A list with a single chat containing the user's message with the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n task=input[\"task\"],\n language=self.language,\n ).strip(),\n }\n ]\n
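The `keys` property above means the raw completion is expected to parse into a dict with `input` and `positive_document` fields. Assuming (as in the referenced paper) that the LLM is asked to answer in JSON, a parsed row could look like the invented example below:

```python
import json

# Invented completion for illustration only.
raw_completion = (
    '{"input": "A long report describing the migration patterns of Arctic terns.", '
    '"positive_document": "Arctic terns migrate between the Arctic and the Antarctic each year..."}'
)
parsed = json.loads(raw_completion)
row = {key: parsed.get(key) for key in ("input", "positive_document")}
print(row)
```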
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateShortTextMatchingData","title":"GenerateShortTextMatchingData
","text":" Bases: _EmbeddingDataGeneration
Generate short text matching data with an LLM
to later on train an embedding model.
GenerateShortTextMatchingData
is a Task
that generates short text matching data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category=\"text-matching-short\"
; so that the LLM
generates a list of tasks that are flattened so that each row contains a single task for the text-matching-short category.
Attributes:
Name Type Descriptionlanguage
str
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
seed
int
The random seed to be set in case there's any sampling within the format_input
method. Note that in this task the seed
has no effect since there are no sampling params.
str
): The task description to be used in the generation.str
): the input generated by the LLM
.str
): the positive document generated by the LLM
.str
): the name of the model used to generate the short text matching data.Examples:
Generate synthetic short text matching data for training embedding models:
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateShortTextMatchingData\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-matching-short\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateShortTextMatchingData(\n language=\"English\",\n llm=..., # LLM instance\n )\n\n task >> generate\n
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class GenerateShortTextMatchingData(_EmbeddingDataGeneration):\n \"\"\"Generate short text matching data with an `LLM` to later on train an embedding model.\n\n `GenerateShortTextMatchingData` is a `Task` that generates short text matching data with an\n `LLM` to later on train an embedding model. The task is based on the paper \"Improving\n Text Embeddings with Large Language Models\" and the data is generated based on the\n provided attributes, or randomly sampled if not provided.\n\n Note:\n Ideally this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True`\n with the `category=\"text-matching-short\"`; so that the `LLM` generates a list of tasks that\n are flattened so that each row contains a single task for the text-matching-short category.\n\n Attributes:\n language: The language of the data to be generated, which can be any of the languages\n retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.\n seed: The random seed to be set in case there's any sampling within the `format_input` method.\n Note that in this task the `seed` has no effect since there are no sampling params.\n\n Input columns:\n - task (`str`): The task description to be used in the generation.\n\n Output columns:\n - input (`str`): the input generated by the `LLM`.\n - positive_document (`str`): the positive document generated by the `LLM`.\n - model_name (`str`): the name of the model used to generate the short text matching\n data.\n\n References:\n - [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)\n\n Examples:\n Generate synthetic short text matching data for training embedding models:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateShortTextMatchingData\n\n with Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-matching-short\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateShortTextMatchingData(\n language=\"English\",\n llm=..., # LLM instance\n )\n\n task >> generate\n ```\n \"\"\"\n\n language: str = Field(\n default=\"English\",\n description=\"The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf\",\n )\n\n _template_name: str = PrivateAttr(default=\"short-text-matching\")\n _can_be_used_with_offline_batch_generation = True\n\n def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"Method to format the input based on the `task` and the provided attributes, or just\n randomly sampling those if not provided. This method will render the `_template` with\n the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that\n there's only one turn, being from the user with the content being the rendered `_template`.\n\n Args:\n input: The input dictionary containing the `task` to be used in the `_template`.\n\n Returns:\n A list with a single chat containing the user's message with the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n task=input[\"task\"],\n language=self.language,\n ).strip(),\n }\n ]\n\n @property\n def keys(self) -> List[str]:\n \"\"\"Contains the `keys` that will be parsed from the `LLM` output into a Python dict.\"\"\"\n return [\"input\", \"positive_document\"]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateShortTextMatchingData.keys","title":"keys
property
","text":"Contains the keys
that will be parsed from the LLM
output into a Python dict.
format_input(input)
","text":"Method to format the input based on the task
and the provided attributes, or just randomly sampling those if not provided. This method will render the _template
with the provided arguments and return an OpenAI formatted chat i.e. a ChatType
, assuming that there's only one turn, being from the user with the content being the rendered _template
.
Parameters:
Name Type Description Defaultinput
Dict[str, Any]
The input dictionary containing the task
to be used in the _template
.
requiredReturns:
Type DescriptionChatType
A list with a single chat containing the user's message with the rendered _template
.
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"Method to format the input based on the `task` and the provided attributes, or just\n randomly sampling those if not provided. This method will render the `_template` with\n the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that\n there's only one turn, being from the user with the content being the rendered `_template`.\n\n Args:\n input: The input dictionary containing the `task` to be used in the `_template`.\n\n Returns:\n A list with a single chat containing the user's message with the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n task=input[\"task\"],\n language=self.language,\n ).strip(),\n }\n ]\n
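As noted above, this step consumes a `task` column; with `EmbeddingTaskGenerator(..., flatten_tasks=True)` each row carries exactly one short task description. The rows below are invented to illustrate the expected shape:

```python
# Invented task descriptions of the kind the upstream generator would emit,
# one per row, for the text-matching-short category.
rows = [
    {"task": "Given a news headline, retrieve a one-sentence restatement of the same fact."},
    {"task": "Match a product name with a short description of that product."},
]
for row in rows:
    print(row["task"])
```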
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateTextClassificationData","title":"GenerateTextClassificationData
","text":" Bases: _EmbeddingDataGeneration
Generate text classification data with an LLM
to later on train an embedding model.
GenerateTextClassificationData
is a Task
that generates text classification data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category=\"text-classification\"
; so that the LLM
generates a list of tasks that are flattened so that each row contains a single task for the text-classification category.
Attributes:
Name Type Descriptionlanguage
str
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
difficulty
Optional[Literal['high school', 'college', 'PhD']]
The difficulty of the query to be generated, which can be high school
, college
, or PhD
. Defaults to None
, meaning that it will be randomly sampled.
clarity
Optional[Literal['clear', 'understandable with some effort', 'ambiguous']]
The clarity of the query to be generated, which can be clear
, understandable with some effort
, or ambiguous
. Defaults to None
, meaning that it will be randomly sampled.
seed
int
The random seed to be set in case there's any sampling within the format_input
method.
str
): The task description to be used in the generation.str
): the input text generated by the LLM
.str
): the label generated by the LLM
.str
): the misleading label generated by the LLM
.str
): the name of the model used to generate the text classification data.Examples:
Generate synthetic text classification data for training embedding models:
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextClassificationData\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-classification\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateTextClassificationData(\n language=\"English\",\n difficulty=\"high school\",\n clarity=\"clear\",\n llm=..., # LLM instance\n )\n\n task >> generate\n
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class GenerateTextClassificationData(_EmbeddingDataGeneration):\n \"\"\"Generate text classification data with an `LLM` to later on train an embedding model.\n\n `GenerateTextClassificationData` is a `Task` that generates text classification data with an\n `LLM` to later on train an embedding model. The task is based on the paper \"Improving\n Text Embeddings with Large Language Models\" and the data is generated based on the\n provided attributes, or randomly sampled if not provided.\n\n Note:\n Ideally this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True`\n with the `category=\"text-classification\"`; so that the `LLM` generates a list of tasks that\n are flattened so that each row contains a single task for the text-classification category.\n\n Attributes:\n language: The language of the data to be generated, which can be any of the languages\n retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.\n difficulty: The difficulty of the query to be generated, which can be `high school`, `college`, or `PhD`.\n Defaults to `None`, meaning that it will be randomly sampled.\n clarity: The clarity of the query to be generated, which can be `clear`, `understandable with some effort`,\n or `ambiguous`. Defaults to `None`, meaning that it will be randomly sampled.\n seed: The random seed to be set in case there's any sampling within the `format_input` method.\n\n Input columns:\n - task (`str`): The task description to be used in the generation.\n\n Output columns:\n - input_text (`str`): the input text generated by the `LLM`.\n - label (`str`): the label generated by the `LLM`.\n - misleading_label (`str`): the misleading label generated by the `LLM`.\n - model_name (`str`): the name of the model used to generate the text classification\n data.\n\n References:\n - [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)\n\n Examples:\n Generate synthetic text classification data for training embedding models:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextClassificationData\n\n with Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-classification\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateTextClassificationData(\n language=\"English\",\n difficulty=\"high school\",\n clarity=\"clear\",\n llm=..., # LLM instance\n )\n\n task >> generate\n ```\n \"\"\"\n\n language: str = Field(\n default=\"English\",\n description=\"The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf\",\n )\n\n difficulty: Optional[Literal[\"high school\", \"college\", \"PhD\"]] = None\n clarity: Optional[\n Literal[\"clear\", \"understandable with some effort\", \"ambiguous\"]\n ] = None\n\n _template_name: str = PrivateAttr(default=\"text-classification\")\n _can_be_used_with_offline_batch_generation = True\n\n def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"Method to format the input based on the `task` and the provided attributes, or just\n randomly sampling those if not provided. This method will render the `_template` with\n the provided arguments and return an OpenAI formatted chat i.e. 
a `ChatType`, assuming that\n there's only one turn, being from the user with the content being the rendered `_template`.\n\n Args:\n input: The input dictionary containing the `task` to be used in the `_template`.\n\n Returns:\n A list with a single chat containing the user's message with the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n task=input[\"task\"],\n language=self.language,\n difficulty=self.difficulty\n or random.choice([\"high school\", \"college\", \"PhD\"]),\n clarity=self.clarity\n or random.choice(\n [\"clear\", \"understandable with some effort\", \"ambiguous\"]\n ),\n ).strip(),\n }\n ]\n\n @property\n def keys(self) -> List[str]:\n \"\"\"Contains the `keys` that will be parsed from the `LLM` output into a Python dict.\"\"\"\n return [\"input_text\", \"label\", \"misleading_label\"]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateTextClassificationData.keys","title":"keys
property
","text":"Contains the keys
that will be parsed from the LLM
output into a Python dict.
format_input(input)
","text":"Method to format the input based on the task
and the provided attributes, or just randomly sampling those if not provided. This method will render the _template
with the provided arguments and return an OpenAI formatted chat i.e. a ChatType
, assuming that there's only one turn, being from the user with the content being the rendered _template
.
Parameters:
Name Type Description Defaultinput
Dict[str, Any]
The input dictionary containing the task
to be used in the _template
.
Returns:
Type DescriptionChatType
A list with a single chat containing the user's message with the rendered _template
.
src/distilabel/steps/tasks/improving_text_embeddings.py
def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"Method to format the input based on the `task` and the provided attributes, or just\n randomly sampling those if not provided. This method will render the `_template` with\n the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that\n there's only one turn, being from the user with the content being the rendered `_template`.\n\n Args:\n input: The input dictionary containing the `task` to be used in the `_template`.\n\n Returns:\n A list with a single chat containing the user's message with the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n task=input[\"task\"],\n language=self.language,\n difficulty=self.difficulty\n or random.choice([\"high school\", \"college\", \"PhD\"]),\n clarity=self.clarity\n or random.choice(\n [\"clear\", \"understandable with some effort\", \"ambiguous\"]\n ),\n ).strip(),\n }\n ]\n
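Each generated row exposes `input_text`, `label` and `misleading_label` (plus `model_name`), which maps naturally onto a classification training pair with a built-in distractor; the values below are invented for illustration:

```python
# Invented output row for illustration only.
row = {
    "input_text": "The referee blew the final whistle after a tense second half.",
    "label": "sports",
    "misleading_label": "politics",
    "model_name": "my-llm",
}
training_pair = (row["input_text"], row["label"])
distractor_pair = (row["input_text"], row["misleading_label"])  # hard negative label
print(training_pair)
print(distractor_pair)
```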
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateTextRetrievalData","title":"GenerateTextRetrievalData
","text":" Bases: _EmbeddingDataGeneration
Generate text retrieval data with an LLM
to later on train an embedding model.
GenerateTextRetrievalData
is a Task
that generates text retrieval data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category=\"text-retrieval\"
; so that the LLM
generates a list of tasks that are flattened so that each row contains a single task for the text-retrieval category.
Attributes:
Name Type Descriptionlanguage
str
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
query_type
Optional[Literal['extremely long-tail', 'long-tail', 'common']]
The type of query to be generated, which can be extremely long-tail
, long-tail
, or common
. Defaults to None
, meaning that it will be randomly sampled.
query_length
Optional[Literal['less than 5 words', '5 to 15 words', 'at least 10 words']]
The length of the query to be generated, which can be less than 5 words
, 5 to 15 words
, or at least 10 words
. Defaults to None
, meaning that it will be randomly sampled.
difficulty
Optional[Literal['high school', 'college', 'PhD']]
The difficulty of the query to be generated, which can be high school
, college
, or PhD
. Defaults to None
, meaning that it will be randomly sampled.
clarity
Optional[Literal['clear', 'understandable with some effort', 'ambiguous']]
The clarity of the query to be generated, which can be clear
, understandable with some effort
, or ambiguous
. Defaults to None
, meaning that it will be randomly sampled.
num_words
Optional[Literal[50, 100, 200, 300, 400, 500]]
The number of words in the query to be generated, which can be 50
, 100
, 200
, 300
, 400
, or 500
. Defaults to None
, meaning that it will be randomly sampled.
seed
int
The random seed to be set in case there's any sampling within the format_input
method.
str
): The task description to be used in the generation.str
): the user query generated by the LLM
.str
): the positive document generated by the LLM
.str
): the hard negative document generated by the LLM
.str
): the name of the model used to generate the text retrieval data.Examples:
Generate synthetic text retrieval data for training embedding models:
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextRetrievalData\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-retrieval\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateTextRetrievalData(\n language=\"English\",\n query_type=\"common\",\n query_length=\"5 to 15 words\",\n difficulty=\"high school\",\n clarity=\"clear\",\n num_words=100,\n llm=..., # LLM instance\n )\n\n task >> generate\n
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class GenerateTextRetrievalData(_EmbeddingDataGeneration):\n \"\"\"Generate text retrieval data with an `LLM` to later on train an embedding model.\n\n `GenerateTextRetrievalData` is a `Task` that generates text retrieval data with an\n `LLM` to later on train an embedding model. The task is based on the paper \"Improving\n Text Embeddings with Large Language Models\" and the data is generated based on the\n provided attributes, or randomly sampled if not provided.\n\n Note:\n Ideally this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True`\n with the `category=\"text-retrieval\"`; so that the `LLM` generates a list of tasks that\n are flattened so that each row contains a single task for the text-retrieval category.\n\n Attributes:\n language: The language of the data to be generated, which can be any of the languages\n retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.\n query_type: The type of query to be generated, which can be `extremely long-tail`, `long-tail`,\n or `common`. Defaults to `None`, meaning that it will be randomly sampled.\n query_length: The length of the query to be generated, which can be `less than 5 words`, `5 to 15 words`,\n or `at least 10 words`. Defaults to `None`, meaning that it will be randomly sampled.\n difficulty: The difficulty of the query to be generated, which can be `high school`, `college`, or `PhD`.\n Defaults to `None`, meaning that it will be randomly sampled.\n clarity: The clarity of the query to be generated, which can be `clear`, `understandable with some effort`,\n or `ambiguous`. Defaults to `None`, meaning that it will be randomly sampled.\n num_words: The number of words in the query to be generated, which can be `50`, `100`, `200`, `300`, `400`, or `500`.\n Defaults to `None`, meaning that it will be randomly sampled.\n seed: The random seed to be set in case there's any sampling within the `format_input` method.\n\n Input columns:\n - task (`str`): The task description to be used in the generation.\n\n Output columns:\n - user_query (`str`): the user query generated by the `LLM`.\n - positive_document (`str`): the positive document generated by the `LLM`.\n - hard_negative_document (`str`): the hard negative document generated by the `LLM`.\n - model_name (`str`): the name of the model used to generate the text retrieval data.\n\n References:\n - [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)\n\n Examples:\n Generate synthetic text retrieval data for training embedding models:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextRetrievalData\n\n with Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-retrieval\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateTextRetrievalData(\n language=\"English\",\n query_type=\"common\",\n query_length=\"5 to 15 words\",\n difficulty=\"high school\",\n clarity=\"clear\",\n num_words=100,\n llm=..., # LLM instance\n )\n\n task >> generate\n ```\n \"\"\"\n\n language: str = Field(\n default=\"English\",\n description=\"The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf\",\n )\n\n query_type: Optional[Literal[\"extremely long-tail\", \"long-tail\", \"common\"]] = None\n query_length: Optional[\n Literal[\"less than 5 words\", \"5 to 15 words\", \"at least 10 words\"]\n ] = 
None\n difficulty: Optional[Literal[\"high school\", \"college\", \"PhD\"]] = None\n clarity: Optional[\n Literal[\"clear\", \"understandable with some effort\", \"ambiguous\"]\n ] = None\n num_words: Optional[Literal[50, 100, 200, 300, 400, 500]] = None\n\n _template_name: str = PrivateAttr(default=\"text-retrieval\")\n _can_be_used_with_offline_batch_generation = True\n\n def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"Method to format the input based on the `task` and the provided attributes, or just\n randomly sampling those if not provided. This method will render the `_template` with\n the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that\n there's only one turn, being from the user with the content being the rendered `_template`.\n\n Args:\n input: The input dictionary containing the `task` to be used in the `_template`.\n\n Returns:\n A list with a single chat containing the user's message with the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n task=input[\"task\"],\n language=self.language,\n query_type=self.query_type\n or random.choice([\"extremely long-tail\", \"long-tail\", \"common\"]),\n query_length=self.query_length\n or random.choice(\n [\"less than 5 words\", \"5 to 15 words\", \"at least 10 words\"]\n ),\n difficulty=self.difficulty\n or random.choice([\"high school\", \"college\", \"PhD\"]),\n clarity=self.clarity\n or random.choice(\n [\"clear\", \"understandable with some effort\", \"ambiguous\"]\n ),\n num_words=self.num_words\n or random.choice([50, 100, 200, 300, 400, 500]),\n ).strip(),\n }\n ]\n\n @property\n def keys(self) -> List[str]:\n \"\"\"Contains the `keys` that will be parsed from the `LLM` output into a Python dict.\"\"\"\n return [\n \"user_query\",\n \"positive_document\",\n \"hard_negative_document\",\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateTextRetrievalData.keys","title":"keys
property
","text":"Contains the keys
that will be parsed from the LLM
output into a Python dict.
format_input(input)
","text":"Method to format the input based on the task
and the provided attributes, or just randomly sampling those if not provided. This method will render the _template
with the provided arguments and return an OpenAI formatted chat i.e. a ChatType
, assuming that there's only one turn, being from the user with the content being the rendered _template
.
Parameters:
Name Type Description Defaultinput
Dict[str, Any]
The input dictionary containing the task
to be used in the _template
.
Returns:
Type DescriptionChatType
A list with a single chat containing the user's message with the rendered _template
.
src/distilabel/steps/tasks/improving_text_embeddings.py
def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"Method to format the input based on the `task` and the provided attributes, or just\n randomly sampling those if not provided. This method will render the `_template` with\n the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that\n there's only one turn, being from the user with the content being the rendered `_template`.\n\n Args:\n input: The input dictionary containing the `task` to be used in the `_template`.\n\n Returns:\n A list with a single chat containing the user's message with the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n task=input[\"task\"],\n language=self.language,\n query_type=self.query_type\n or random.choice([\"extremely long-tail\", \"long-tail\", \"common\"]),\n query_length=self.query_length\n or random.choice(\n [\"less than 5 words\", \"5 to 15 words\", \"at least 10 words\"]\n ),\n difficulty=self.difficulty\n or random.choice([\"high school\", \"college\", \"PhD\"]),\n clarity=self.clarity\n or random.choice(\n [\"clear\", \"understandable with some effort\", \"ambiguous\"]\n ),\n num_words=self.num_words\n or random.choice([50, 100, 200, 300, 400, 500]),\n ).strip(),\n }\n ]\n
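The output columns form a classic retrieval triplet (query, positive, hard negative) that can be fed straight into contrastive training; the row below is invented for illustration:

```python
# Invented output row for illustration only.
row = {
    "user_query": "how do tides form",
    "positive_document": "Tides are caused mainly by the gravitational pull of the Moon and the Sun...",
    "hard_negative_document": "Ocean currents are driven primarily by wind and differences in water density...",
    "model_name": "my-llm",
}
triplet = (row["user_query"], row["positive_document"], row["hard_negative_document"])
print(triplet)
```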
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MonolingualTripletGenerator","title":"MonolingualTripletGenerator
","text":" Bases: _EmbeddingDataGenerator
Generate monolingual triplets with an LLM
to later on train an embedding model.
MonolingualTripletGenerator
is a GeneratorTask
that generates monolingual triplets with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Attributes:
Name Type Descriptionlanguage
str
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
unit
Optional[Literal['sentence', 'phrase', 'passage']]
The unit of the data to be generated, which can be sentence
, phrase
, or passage
. Defaults to None
, meaning that it will be randomly sampled.
difficulty
Optional[Literal['elementary school', 'high school', 'college']]
The difficulty of the query to be generated, which can be elementary school
, high school
, or college
. Defaults to None
, meaning that it will be randomly sampled.
high_score
Optional[Literal['4', '4.5', '5']]
The high score of the query to be generated, which can be 4
, 4.5
, or 5
. Defaults to None
, meaning that it will be randomly sampled.
low_score
Optional[Literal['2.5', '3', '3.5']]
The low score of the query to be generated, which can be 2.5
, 3
, or 3.5
. Defaults to None
, meaning that it will be randomly sampled.
seed
int
The random seed to be set in case there's any sampling within the format_input
method.
str
): the first sentence generated by the LLM
.str
): the second sentence generated by the LLM
.str
): the third sentence generated by the LLM
.str
): the name of the model used to generate the monolingual triplets.Examples:
Generate monolingual triplets for training embedding models:
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import MonolingualTripletGenerator\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = MonolingualTripletGenerator(\n language=\"English\",\n unit=\"sentence\",\n difficulty=\"elementary school\",\n high_score=\"4\",\n low_score=\"2.5\",\n llm=...,\n )\n\n ...\n\n task >> ...\n
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class MonolingualTripletGenerator(_EmbeddingDataGenerator):\n \"\"\"Generate monolingual triplets with an `LLM` to later on train an embedding model.\n\n `MonolingualTripletGenerator` is a `GeneratorTask` that generates monolingual triplets with an\n `LLM` to later on train an embedding model. The task is based on the paper \"Improving\n Text Embeddings with Large Language Models\" and the data is generated based on the\n provided attributes, or randomly sampled if not provided.\n\n Attributes:\n language: The language of the data to be generated, which can be any of the languages\n retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.\n unit: The unit of the data to be generated, which can be `sentence`, `phrase`, or `passage`.\n Defaults to `None`, meaning that it will be randomly sampled.\n difficulty: The difficulty of the query to be generated, which can be `elementary school`, `high school`, or `college`.\n Defaults to `None`, meaning that it will be randomly sampled.\n high_score: The high score of the query to be generated, which can be `4`, `4.5`, or `5`.\n Defaults to `None`, meaning that it will be randomly sampled.\n low_score: The low score of the query to be generated, which can be `2.5`, `3`, or `3.5`.\n Defaults to `None`, meaning that it will be randomly sampled.\n seed: The random seed to be set in case there's any sampling within the `format_input` method.\n\n Output columns:\n - S1 (`str`): the first sentence generated by the `LLM`.\n - S2 (`str`): the second sentence generated by the `LLM`.\n - S3 (`str`): the third sentence generated by the `LLM`.\n - model_name (`str`): the name of the model used to generate the monolingual triplets.\n\n Examples:\n Generate monolingual triplets for training embedding models:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps.tasks import MonolingualTripletGenerator\n\n with Pipeline(\"my-pipeline\") as pipeline:\n task = MonolingualTripletGenerator(\n language=\"English\",\n unit=\"sentence\",\n difficulty=\"elementary school\",\n high_score=\"4\",\n low_score=\"2.5\",\n llm=...,\n )\n\n ...\n\n task >> ...\n ```\n \"\"\"\n\n language: str = Field(\n default=\"English\",\n description=\"The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf\",\n )\n\n unit: Optional[Literal[\"sentence\", \"phrase\", \"passage\"]] = None\n difficulty: Optional[Literal[\"elementary school\", \"high school\", \"college\"]] = None\n high_score: Optional[Literal[\"4\", \"4.5\", \"5\"]] = None\n low_score: Optional[Literal[\"2.5\", \"3\", \"3.5\"]] = None\n\n _template_name: str = PrivateAttr(default=\"monolingual-triplet\")\n _can_be_used_with_offline_batch_generation = True\n\n @property\n def prompt(self) -> ChatType:\n \"\"\"Contains the `prompt` to be used in the `process` method, rendering the `_template`; and\n formatted as an OpenAI formatted chat i.e. 
a `ChatType`, assuming that there's only one turn,\n being from the user with the content being the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n language=self.language,\n unit=self.unit or random.choice([\"sentence\", \"phrase\", \"passage\"]),\n difficulty=self.difficulty\n or random.choice([\"elementary school\", \"high school\", \"college\"]),\n high_score=self.high_score or random.choice([\"4\", \"4.5\", \"5\"]),\n low_score=self.low_score or random.choice([\"2.5\", \"3\", \"3.5\"]),\n ).strip(),\n }\n ] # type: ignore\n\n @property\n def keys(self) -> List[str]:\n \"\"\"Contains the `keys` that will be parsed from the `LLM` output into a Python dict.\"\"\"\n return [\"S1\", \"S2\", \"S3\"]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MonolingualTripletGenerator.prompt","title":"prompt
property
","text":"Contains the prompt
to be used in the process
method, rendering the _template
; formatted as an OpenAI-style chat, i.e. a ChatType
with a single user turn whose content is the rendered _template
.
keys
property
","text":"Contains the keys
that will be parsed from the LLM
output into a Python dict.
InstructionBacktranslation
","text":" Bases: Task
Self-Alignment with Instruction Backtranslation.
Attributes:
Name Type Description_template
Optional[Template]
the Jinja2 template to use for the Instruction Backtranslation task.
Input columnsstr
): The reference instruction to evaluate the text output.str
): The text output to evaluate for the given instruction.str
): The score for the generation based on the given instruction.str
): The reason for the provided score.str
): The model name used to score the generation.Self-Alignment with Instruction Backtranslation
Examples:
Generate a score and reason for a given instruction and generation:
from distilabel.steps.tasks import InstructionBacktranslation\n\ninstruction_backtranslation = InstructionBacktranslation(\n name=\"instruction_backtranslation\",\n llm=llm,\n input_batch_size=10,\n output_mappings={\"model_name\": \"scoring_model\"},\n )\ninstruction_backtranslation.load()\n\nresult = next(\n instruction_backtranslation.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generation\": \"4\",\n }\n ]\n )\n)\n# result\n# [\n# {\n# \"instruction\": \"How much is 2+2?\",\n# \"generation\": \"4\",\n# \"score\": 3,\n# \"reason\": \"Reason for the generation.\",\n# \"model_name\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n# }\n# ]\n
Citations @misc{li2024selfalignmentinstructionbacktranslation,\n title={Self-Alignment with Instruction Backtranslation},\n author={Xian Li and Ping Yu and Chunting Zhou and Timo Schick and Omer Levy and Luke Zettlemoyer and Jason Weston and Mike Lewis},\n year={2024},\n eprint={2308.06259},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2308.06259},\n}\n
Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
class InstructionBacktranslation(Task):\n \"\"\"Self-Alignment with Instruction Backtranslation.\n\n Attributes:\n _template: the Jinja2 template to use for the Instruction Backtranslation task.\n\n Input columns:\n - instruction (`str`): The reference instruction to evaluate the text output.\n - generation (`str`): The text output to evaluate for the given instruction.\n\n Output columns:\n - score (`str`): The score for the generation based on the given instruction.\n - reason (`str`): The reason for the provided score.\n - model_name (`str`): The model name used to score the generation.\n\n Categories:\n - critique\n\n References:\n - [`Self-Alignment with Instruction Backtranslation`](https://arxiv.org/abs/2308.06259)\n\n Examples:\n Generate a score and reason for a given instruction and generation:\n\n ```python\n from distilabel.steps.tasks import InstructionBacktranslation\n\n instruction_backtranslation = InstructionBacktranslation(\n name=\"instruction_backtranslation\",\n llm=llm,\n input_batch_size=10,\n output_mappings={\"model_name\": \"scoring_model\"},\n )\n instruction_backtranslation.load()\n\n result = next(\n instruction_backtranslation.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generation\": \"4\",\n }\n ]\n )\n )\n # result\n # [\n # {\n # \"instruction\": \"How much is 2+2?\",\n # \"generation\": \"4\",\n # \"score\": 3,\n # \"reason\": \"Reason for the generation.\",\n # \"model_name\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n # }\n # ]\n ```\n\n Citations:\n ```\n @misc{li2024selfalignmentinstructionbacktranslation,\n title={Self-Alignment with Instruction Backtranslation},\n author={Xian Li and Ping Yu and Chunting Zhou and Timo Schick and Omer Levy and Luke Zettlemoyer and Jason Weston and Mike Lewis},\n year={2024},\n eprint={2308.06259},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2308.06259},\n }\n ```\n \"\"\"\n\n _template: Optional[\"Template\"] = PrivateAttr(default=...)\n _can_be_used_with_offline_batch_generation = True\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"instruction-backtranslation.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task is the `instruction`, and the `generation` for it.\"\"\"\n return [\"instruction\", \"generation\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n instruction=input[\"instruction\"], generation=input[\"generation\"]\n ),\n },\n ]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task is the `score`, `reason` and the `model_name`.\"\"\"\n return [\"score\", \"reason\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `score` and `reason`. 
The\n `model_name` will be automatically included within the `process` method of `Task`.\n\n Args:\n output: a string representing the output of the LLM via the `process` method.\n input: the input to the task, as required by some tasks to format the output.\n\n Returns:\n A dictionary containing the `score` and the `reason` for the provided `score`.\n \"\"\"\n pattern = r\"(.+?)Score: (\\d)\"\n\n matches = None\n if output is not None:\n matches = re.findall(pattern, output, re.DOTALL)\n if matches is None:\n return {\"score\": None, \"reason\": None}\n\n return {\n \"score\": int(matches[0][1]),\n \"reason\": matches[0][0].strip(),\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.InstructionBacktranslation.inputs","title":"inputs
property
","text":"The input for the task is the instruction
, and the generation
for it.
outputs
property
","text":"The output for the task is the score
, reason
and the model_name
.
load()
","text":"Loads the Jinja2 template.
Source code insrc/distilabel/steps/tasks/instruction_backtranslation.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"instruction-backtranslation.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.InstructionBacktranslation.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/instruction_backtranslation.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n instruction=input[\"instruction\"], generation=input[\"generation\"]\n ),\n },\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.InstructionBacktranslation.format_output","title":"format_output(output, input)
","text":"The output is formatted as a dictionary with the score
and reason
. The model_name
will be automatically included within the process
method of Task
.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
a string representing the output of the LLM via the process
method.
input
Dict[str, Any]
the input to the task, as required by some tasks to format the output.
requiredReturns:
Type DescriptionDict[str, Any]
A dictionary containing the score
and the reason
for the provided score
.
src/distilabel/steps/tasks/instruction_backtranslation.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `score` and `reason`. The\n `model_name` will be automatically included within the `process` method of `Task`.\n\n Args:\n output: a string representing the output of the LLM via the `process` method.\n input: the input to the task, as required by some tasks to format the output.\n\n Returns:\n A dictionary containing the `score` and the `reason` for the provided `score`.\n \"\"\"\n pattern = r\"(.+?)Score: (\\d)\"\n\n matches = None\n if output is not None:\n matches = re.findall(pattern, output, re.DOTALL)\n if matches is None:\n return {\"score\": None, \"reason\": None}\n\n return {\n \"score\": int(matches[0][1]),\n \"reason\": matches[0][0].strip(),\n }\n
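As a minimal sketch (not part of distilabel), this is how the pattern above splits a hypothetical LLM critique that ends with a score:
import re\n\n# Minimal sketch, not distilabel code: how the pattern above splits a hypothetical\n# LLM output into the reason and the score.\npattern = r\"(.+?)Score: (\\d)\"\noutput = \"The generation answers the question directly and correctly. Score: 5\"\nmatches = re.findall(pattern, output, re.DOTALL)\nprint({\"score\": int(matches[0][1]), \"reason\": matches[0][0].strip()})\n# {'score': 5, 'reason': 'The generation answers the question directly and correctly.'}\n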
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Magpie","title":"Magpie
","text":" Bases: Task
, MagpieBase
Generates conversations using an instruct fine-tuned LLM.
Magpie is a neat method that allows generating user instructions with no seed data or a specific system prompt, thanks to the autoregressive capabilities of instruct fine-tuned LLMs. Because these models were fine-tuned with a chat template composed of a user message and a desired assistant output, they learn that an instruction comes right after the pre-query or pre-instruct tokens. If these pre-query tokens are sent to the LLM without any user message, the LLM will continue generating tokens as if it were the user. This trick allows \"extracting\" instructions from the instruct fine-tuned LLM. Once an instruction is generated, it can be sent back to the LLM to generate the corresponding assistant response. Repeating this process N times makes it possible to build a multi-turn conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing'.
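As a rough sketch of the trick described above (the template string below is an assumption based on the Llama 3 chat format, not something defined on this page), the prompt sent to the LLM contains only the pre-query tokens, so whatever the model completes is kept as a synthetic user instruction:
# Hedged sketch, not distilabel code: the exact pre-query string is model dependent.\n# For Llama 3 Instruct the user-turn prefix is assumed to look roughly like this:\npre_query = \"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\\n\\n\"\n# Sending only this prefix (optionally preceded by a system turn) makes the model\n# keep generating \"as the user\", so its completion can be kept as an instruction.\n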
Attributes:
Name Type Descriptionn_turns
RuntimeParameter[PositiveInt]
the number of turns that the generated conversation will have. Defaults to 1
.
end_with_user
RuntimeParameter[bool]
whether the conversation should end with a user message. Defaults to False
.
include_system_prompt
RuntimeParameter[bool]
whether to include the system prompt used in the generated conversation. Defaults to False
.
only_instruction
RuntimeParameter[bool]
whether to generate only the instruction. If this argument is True
, then n_turns
will be ignored. Defaults to False
.
system_prompt
Optional[RuntimeParameter[Union[List[str], Dict[str, str], Dict[str, Tuple[str, float]], str]]]
an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be chosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic. Defaults to None
.
n_turns
: the number of turns that the generated conversation will have. Defaults to 1
.end_with_user
: whether the conversation should end with a user message. Defaults to False
.include_system_prompt
: whether to include the system prompt used in the generated conversation. Defaults to False
.only_instruction
: whether to generate only the instruction. If this argument is True
, then n_turns
will be ignored. Defaults to False
.system_prompt
: an optional system prompt or list of system prompts that can be used to steer the LLM to generate content on a certain topic, guide the style, etc. If it's a list of system prompts, then a random system prompt will be chosen per input/output batch. If the provided inputs contain a system_prompt
column, then this runtime parameter will be ignored and the one from the column will be used. Defaults to None
.system_prompt
: an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be chosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic.str
, optional): an optional system prompt that can be provided to guide the generation of the instruct LLM and steer it to generate instructions on a certain topic.ChatType
): the generated conversation which is a list of chat items with a role and a message. Only if only_instruction=False
.str
): the generated instructions if only_instruction=True
or n_turns==1
.str
): the generated response if n_turns==1
.str
, optional): the key of the system prompt used to generate the conversation or instruction. Only if system_prompt
is a dictionary.str
): The model name used to generate the conversation
or instruction
.Examples:
Generating instructions with Llama 3 8B Instruct and TransformersLLM:
from distilabel.models import TransformersLLM\nfrom distilabel.steps.tasks import Magpie\n\nmagpie = Magpie(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 64,\n },\n device=\"mps\",\n ),\n only_instruction=True,\n)\n\nmagpie.load()\n\nresult = next(\n magpie.process(\n inputs=[\n {\n \"system_prompt\": \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"\n },\n {\n \"system_prompt\": \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"\n },\n ]\n )\n)\n# [\n# {'instruction': \"That's me! I'd love some help with solving calculus problems! What kind of calculation are you most effective at? Linear Algebra, derivatives, integrals, optimization?\"},\n# {'instruction': 'I was wondering if there are certain flowers and plants that can be used for pest control?'}\n# ]\n
Generating conversations with Llama 3 8B Instruct and TransformersLLM:
from distilabel.models import TransformersLLM\nfrom distilabel.steps.tasks import Magpie\n\nmagpie = Magpie(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 256,\n },\n device=\"mps\",\n ),\n n_turns=2,\n)\n\nmagpie.load()\n\nresult = next(\n magpie.process(\n inputs=[\n {\n \"system_prompt\": \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"\n },\n {\n \"system_prompt\": \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"\n },\n ]\n )\n)\n# [\n# {\n# 'conversation': [\n# {'role': 'system', 'content': \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"},\n# {\n# 'role': 'user',\n# 'content': 'I'm having trouble solving the limits of functions in calculus. Could you explain how to work with them? Limits of functions are denoted by lim x\u2192a f(x) or lim x\u2192a [f(x)]. It is read as \"the limit as x approaches a of f\n# of x\".'\n# },\n# {\n# 'role': 'assistant',\n# 'content': 'Limits are indeed a fundamental concept in calculus, and understanding them can be a bit tricky at first, but don't worry, I'm here to help! The notation lim x\u2192a f(x) indeed means \"the limit as x approaches a of f of\n# x\". What it's asking us to do is find the'\n# }\n# ]\n# },\n# {\n# 'conversation': [\n# {'role': 'system', 'content': \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"},\n# {\n# 'role': 'user',\n# 'content': \"As a flower shop owner, I'm noticing some unusual worm-like creatures causing damage to my roses and other flowers. Can you help me identify what the problem is? Based on your expertise as a florist AI assistant, I think it\n# might be pests or diseases, but I'm not sure which.\"\n# },\n# {\n# 'role': 'assistant',\n# 'content': \"I'd be delighted to help you investigate the issue! Since you've noticed worm-like creatures damaging your roses and other flowers, I'll take a closer look at the possibilities. Here are a few potential culprits: 1.\n# **Aphids**: These small, soft-bodied insects can secrete a sticky substance called\"\n# }\n# ]\n# }\n# ]\n
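Magpie also accepts the dictionary form of system_prompt with probabilities described above; a minimal sketch (the InferenceEndpointsLLM arguments are copied from the MagpieGenerator example further below and are assumptions here):
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import Magpie\n\n# Hedged sketch based on the `system_prompt` attribute described above: a dictionary\n# mapping a key to a (prompt, probability) tuple; the key is exposed as `system_prompt_key`.\nmagpie = Magpie(\n    llm=InferenceEndpointsLLM(\n        model_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n        tokenizer_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n        magpie_pre_query_template=\"llama3\",\n        generation_kwargs={\"temperature\": 0.8, \"max_new_tokens\": 256},\n    ),\n    n_turns=1,\n    system_prompt={\n        \"math\": (\"You're an expert AI assistant.\", 0.8),\n        \"writing\": (\"You're an expert writing assistant.\", 0.2),\n    },\n)\n\nmagpie.load()\nresult = next(magpie.process(inputs=[{}, {}]))\n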
Source code in src/distilabel/steps/tasks/magpie/base.py
class Magpie(Task, MagpieBase):\n \"\"\"Generates conversations using an instruct fine-tuned LLM.\n\n Magpie is a neat method that allows generating user instructions with no seed data\n or specific system prompt thanks to the autoregressive capabilities of the instruct\n fine-tuned LLMs. As they were fine-tuned using a chat template composed by a user message\n and a desired assistant output, the instruct fine-tuned LLM learns that after the pre-query\n or pre-instruct tokens comes an instruction. If these pre-query tokens are sent to the\n LLM without any user message, then the LLM will continue generating tokens as if it was\n the user. This trick allows \"extracting\" instructions from the instruct fine-tuned LLM.\n After this instruct is generated, it can be sent again to the LLM to generate this time\n an assistant response. This process can be repeated N times allowing to build a multi-turn\n conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from\n Scratch by Prompting Aligned LLMs with Nothing'.\n\n Attributes:\n n_turns: the number of turns that the generated conversation will have.\n Defaults to `1`.\n end_with_user: whether the conversation should end with a user message.\n Defaults to `False`.\n include_system_prompt: whether to include the system prompt used in the generated\n conversation. Defaults to `False`.\n only_instruction: whether to generate only the instruction. If this argument is\n `True`, then `n_turns` will be ignored. Defaults to `False`.\n system_prompt: an optional system prompt, or a list of system prompts from which\n a random one will be chosen, or a dictionary of system prompts from which a\n random one will be choosen, or a dictionary of system prompts with their probability\n of being chosen. The random system prompt will be chosen per input/output batch.\n This system prompt can be used to guide the generation of the instruct LLM and\n steer it to generate instructions of a certain topic. Defaults to `None`.\n\n Runtime parameters:\n - `n_turns`: the number of turns that the generated conversation will have. Defaults\n to `1`.\n - `end_with_user`: whether the conversation should end with a user message.\n Defaults to `False`.\n - `include_system_prompt`: whether to include the system prompt used in the generated\n conversation. Defaults to `False`.\n - `only_instruction`: whether to generate only the instruction. If this argument is\n `True`, then `n_turns` will be ignored. Defaults to `False`.\n - `system_prompt`: an optional system prompt or list of system prompts that can\n be used to steer the LLM to generate content of certain topic, guide the style,\n etc. If it's a list of system prompts, then a random system prompt will be chosen\n per input/output batch. If the provided inputs contains a `system_prompt` column,\n then this runtime parameter will be ignored and the one from the column will\n be used. Defaults to `None`.\n - `system_prompt`: an optional system prompt, or a list of system prompts from which\n a random one will be chosen, or a dictionary of system prompts from which a\n random one will be choosen, or a dictionary of system prompts with their probability\n of being chosen. 
The random system prompt will be chosen per input/output batch.\n This system prompt can be used to guide the generation of the instruct LLM and\n steer it to generate instructions of a certain topic.\n\n Input columns:\n - system_prompt (`str`, optional): an optional system prompt that can be provided\n to guide the generation of the instruct LLM and steer it to generate instructions\n of certain topic.\n\n Output columns:\n - conversation (`ChatType`): the generated conversation which is a list of chat\n items with a role and a message. Only if `only_instruction=False`.\n - instruction (`str`): the generated instructions if `only_instruction=True` or `n_turns==1`.\n - response (`str`): the generated response if `n_turns==1`.\n - system_prompt_key (`str`, optional): the key of the system prompt used to generate\n the conversation or instruction. Only if `system_prompt` is a dictionary.\n - model_name (`str`): The model name used to generate the `conversation` or `instruction`.\n\n Categories:\n - text-generation\n - instruction\n\n References:\n - [Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing](https://arxiv.org/abs/2406.08464)\n\n Examples:\n Generating instructions with Llama 3 8B Instruct and TransformersLLM:\n\n ```python\n from distilabel.models import TransformersLLM\n from distilabel.steps.tasks import Magpie\n\n magpie = Magpie(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 64,\n },\n device=\"mps\",\n ),\n only_instruction=True,\n )\n\n magpie.load()\n\n result = next(\n magpie.process(\n inputs=[\n {\n \"system_prompt\": \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"\n },\n {\n \"system_prompt\": \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"\n },\n ]\n )\n )\n # [\n # {'instruction': \"That's me! I'd love some help with solving calculus problems! What kind of calculation are you most effective at? Linear Algebra, derivatives, integrals, optimization?\"},\n # {'instruction': 'I was wondering if there are certain flowers and plants that can be used for pest control?'}\n # ]\n ```\n\n Generating conversations with Llama 3 8B Instruct and TransformersLLM:\n\n ```python\n from distilabel.models import TransformersLLM\n from distilabel.steps.tasks import Magpie\n\n magpie = Magpie(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 256,\n },\n device=\"mps\",\n ),\n n_turns=2,\n )\n\n magpie.load()\n\n result = next(\n magpie.process(\n inputs=[\n {\n \"system_prompt\": \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"\n },\n {\n \"system_prompt\": \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"\n },\n ]\n )\n )\n # [\n # {\n # 'conversation': [\n # {'role': 'system', 'content': \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"},\n # {\n # 'role': 'user',\n # 'content': 'I\\'m having trouble solving the limits of functions in calculus. Could you explain how to work with them? Limits of functions are denoted by lim x\u2192a f(x) or lim x\u2192a [f(x)]. 
It is read as \"the limit as x approaches a of f\n # of x\".'\n # },\n # {\n # 'role': 'assistant',\n # 'content': 'Limits are indeed a fundamental concept in calculus, and understanding them can be a bit tricky at first, but don\\'t worry, I\\'m here to help! The notation lim x\u2192a f(x) indeed means \"the limit as x approaches a of f of\n # x\". What it\\'s asking us to do is find the'\n # }\n # ]\n # },\n # {\n # 'conversation': [\n # {'role': 'system', 'content': \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"},\n # {\n # 'role': 'user',\n # 'content': \"As a flower shop owner, I'm noticing some unusual worm-like creatures causing damage to my roses and other flowers. Can you help me identify what the problem is? Based on your expertise as a florist AI assistant, I think it\n # might be pests or diseases, but I'm not sure which.\"\n # },\n # {\n # 'role': 'assistant',\n # 'content': \"I'd be delighted to help you investigate the issue! Since you've noticed worm-like creatures damaging your roses and other flowers, I'll take a closer look at the possibilities. Here are a few potential culprits: 1.\n # **Aphids**: These small, soft-bodied insects can secrete a sticky substance called\"\n # }\n # ]\n # }\n # ]\n ```\n \"\"\"\n\n def model_post_init(self, __context: Any) -> None:\n \"\"\"Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`.\"\"\"\n super().model_post_init(__context)\n\n if not isinstance(self.llm, MagpieChatTemplateMixin):\n raise DistilabelUserError(\n f\"`Magpie` task can only be used with an `LLM` that uses the `MagpieChatTemplateMixin`.\"\n f\"`{self.llm.__class__.__name__}` doesn't use the aforementioned mixin.\",\n page=\"components-gallery/tasks/magpie/\",\n )\n\n self.llm.use_magpie_template = True\n\n @property\n def inputs(self) -> \"StepColumns\":\n return {\"system_prompt\": False}\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"Does nothing.\"\"\"\n return []\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"Either a multi-turn conversation or the instruction generated.\"\"\"\n outputs = []\n\n if self.only_instruction:\n outputs.append(\"instruction\")\n elif self.n_turns == 1:\n outputs.extend([\"instruction\", \"response\"])\n else:\n outputs.append(\"conversation\")\n\n if isinstance(self.system_prompt, dict):\n outputs.append(\"system_prompt_key\")\n\n outputs.append(\"model_name\")\n\n return outputs\n\n def format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n ) -> Dict[str, Any]:\n \"\"\"Does nothing.\"\"\"\n return {}\n\n def process(self, inputs: StepInput) -> \"StepOutput\":\n \"\"\"Generate a list of instructions or conversations of the specified number of turns.\n\n Args:\n inputs: a list of dictionaries that can contain a `system_prompt` key.\n\n Yields:\n The list of generated conversations.\n \"\"\"\n yield self._generate_with_pre_query_template(inputs)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Magpie.outputs","title":"outputs
property
","text":"Either a multi-turn conversation or the instruction generated.
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Magpie.model_post_init","title":"model_post_init(__context)
","text":"Checks that the provided LLM
uses the MagpieChatTemplateMixin
.
src/distilabel/steps/tasks/magpie/base.py
def model_post_init(self, __context: Any) -> None:\n \"\"\"Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`.\"\"\"\n super().model_post_init(__context)\n\n if not isinstance(self.llm, MagpieChatTemplateMixin):\n raise DistilabelUserError(\n f\"`Magpie` task can only be used with an `LLM` that uses the `MagpieChatTemplateMixin`.\"\n f\"`{self.llm.__class__.__name__}` doesn't use the aforementioned mixin.\",\n page=\"components-gallery/tasks/magpie/\",\n )\n\n self.llm.use_magpie_template = True\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Magpie.format_input","title":"format_input(input)
","text":"Does nothing.
Source code insrc/distilabel/steps/tasks/magpie/base.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"Does nothing.\"\"\"\n return []\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Magpie.format_output","title":"format_output(output, input=None)
","text":"Does nothing.
Source code insrc/distilabel/steps/tasks/magpie/base.py
def format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n) -> Dict[str, Any]:\n \"\"\"Does nothing.\"\"\"\n return {}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Magpie.process","title":"process(inputs)
","text":"Generate a list of instructions or conversations of the specified number of turns.
Parameters:
Name Type Description Defaultinputs
StepInput
a list of dictionaries that can contain a system_prompt
key.
Yields:
Type DescriptionStepOutput
The list of generated conversations.
Source code insrc/distilabel/steps/tasks/magpie/base.py
def process(self, inputs: StepInput) -> \"StepOutput\":\n \"\"\"Generate a list of instructions or conversations of the specified number of turns.\n\n Args:\n inputs: a list of dictionaries that can contain a `system_prompt` key.\n\n Yields:\n The list of generated conversations.\n \"\"\"\n yield self._generate_with_pre_query_template(inputs)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MagpieGenerator","title":"MagpieGenerator
","text":" Bases: GeneratorTask
, MagpieBase
Generator task that generates instructions or conversations using Magpie.
Magpie is a neat method that allows generating user instructions with no seed data or a specific system prompt, thanks to the autoregressive capabilities of instruct fine-tuned LLMs. Because these models were fine-tuned with a chat template composed of a user message and a desired assistant output, they learn that an instruction comes right after the pre-query or pre-instruct tokens. If these pre-query tokens are sent to the LLM without any user message, the LLM will continue generating tokens as if it were the user. This trick allows \"extracting\" instructions from the instruct fine-tuned LLM. Once an instruction is generated, it can be sent back to the LLM to generate the corresponding assistant response. Repeating this process N times makes it possible to build a multi-turn conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing'.
Attributes:
Name Type Descriptionn_turns
RuntimeParameter[PositiveInt]
the number of turns that the generated conversation will have. Defaults to 1
.
end_with_user
RuntimeParameter[bool]
whether the conversation should end with a user message. Defaults to False
.
include_system_prompt
RuntimeParameter[bool]
whether to include the system prompt used in the generated conversation. Defaults to False
.
only_instruction
RuntimeParameter[bool]
whether to generate only the instruction. If this argument is True
, then n_turns
will be ignored. Defaults to False
.
system_prompt
Optional[RuntimeParameter[Union[List[str], Dict[str, str], Dict[str, Tuple[str, float]], str]]]
an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be chosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic. Defaults to None
.
num_rows
RuntimeParameter[int]
the number of rows to be generated.
Runtime parametersn_turns
: the number of turns that the generated conversation will have. Defaults to 1
.end_with_user
: whether the conversation should end with a user message. Defaults to False
.include_system_prompt
: whether to include the system prompt used in the generated conversation. Defaults to False
.only_instruction
: whether to generate only the instruction. If this argument is True
, then n_turns
will be ignored. Defaults to False
.system_prompt
: an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be chosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic.num_rows
: the number of rows to be generated.ChatType
): the generated conversation which is a list of chat items with a role and a message.str
): the generated instructions if only_instruction=True
.str
): the generated response if n_turns==1
.str
, optional): the key of the system prompt used to generate the conversation or instruction. Only if system_prompt
is a dictionary.str
): The model name used to generate the conversation
or instruction
.Examples:
Generating instructions with Llama 3 8B Instruct and TransformersLLM:
from distilabel.models import TransformersLLM\nfrom distilabel.steps.tasks import MagpieGenerator\n\ngenerator = MagpieGenerator(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 256,\n },\n device=\"mps\",\n ),\n only_instruction=True,\n num_rows=5,\n)\n\ngenerator.load()\n\nresult = next(generator.process())\n# (\n# [\n# {\"instruction\": \"I've just bought a new phone and I're excited to start using it.\"},\n# {\"instruction\": \"What are the most common types of companies that use digital signage?\"}\n# ],\n# True\n# )\n
Generating a conversation with Llama 3 8B Instruct and TransformersLLM:
from distilabel.models import TransformersLLM\nfrom distilabel.steps.tasks import MagpieGenerator\n\ngenerator = MagpieGenerator(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 64,\n },\n device=\"mps\",\n ),\n n_turns=3,\n num_rows=5,\n)\n\ngenerator.load()\n\nresult = next(generator.process())\n# (\n# [\n# {\n# 'conversation': [\n# {\n# 'role': 'system',\n# 'content': 'You are a helpful Al assistant. The user will engage in a multi\u2212round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and\n# insightful responses to help the user with their queries.'\n# },\n# {'role': 'user', 'content': \"I'm considering starting a social media campaign for my small business and I're not sure where to start. Can you help?\"},\n# {\n# 'role': 'assistant',\n# 'content': \"Exciting endeavor! Creating a social media campaign can be a great way to increase brand awareness, drive website traffic, and ultimately boost sales. I'd be happy to guide you through the process. To get started,\n# let's break down the basics. First, we need to identify your goals and target audience. What do\"\n# },\n# {\n# 'role': 'user',\n# 'content': \"Before I start a social media campaign, what kind of costs ammol should I expect to pay? There are several factors that contribute to the total cost of running a social media campaign. Let me outline some of the main\n# expenses you might encounter: 1. Time: As the business owner, you'll likely spend time creating\"\n# },\n# {\n# 'role': 'assistant',\n# 'content': 'Time is indeed one of the biggest investments when it comes to running a social media campaign! Besides time, you may also incur costs associated with: 2. Content creation: You might need to hire freelancers or\n# agencies to create high-quality content (images, videos, captions) for your social media platforms. 3. Advertising'\n# }\n# ]\n# },\n# {\n# 'conversation': [\n# {\n# 'role': 'system',\n# 'content': 'You are a helpful Al assistant. The user will engage in a multi\u2212round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and\n# insightful responses to help the user with their queries.'\n# },\n# {'role': 'user', 'content': \"I am thinking of buying a new laptop or computer. What are some important factors I should consider when making your decision? I'll make sure to let you know if any other favorites or needs come up!\"},\n# {\n# 'role': 'assistant',\n# 'content': 'Exciting times ahead! When considering a new laptop or computer, there are several key factors to think about to ensure you find the right one for your needs. Here are some crucial ones to get you started: 1.\n# **Purpose**: How will you use your laptop or computer? For work, gaming, video editing,'\n# },\n# {\n# 'role': 'user',\n# 'content': 'Let me stop you there. Let's explore this \"purpose\" factor that you mentioned earlier. Can you elaborate more on what type of devices would be suitable for different purposes? For example, if I're primarily using my\n# laptop for general usage like browsing, email, and word processing, would a budget-friendly laptop be sufficient'\n# },\n# {\n# 'role': 'assistant',\n# 'content': \"Understanding your purpose can greatly impact the type of device you'll need. 
**General Usage (Browsing, Email, Word Processing)**: For casual users who mainly use their laptop for daily tasks, a budget-friendly\n# option can be sufficient. Look for laptops with: * Intel Core i3 or i5 processor* \"\n# }\n# ]\n# }\n# ],\n# True\n# )\n
Generating with system prompts with probabilities:
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import MagpieGenerator\n\nmagpie = MagpieGenerator(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 0.8,\n \"max_new_tokens\": 256,\n },\n ),\n n_turns=2,\n system_prompt={\n \"math\": (\"You're an expert AI assistant.\", 0.8),\n \"writing\": (\"You're an expert writing assistant.\", 0.2),\n },\n)\n\nmagpie.load()\n\nresult = next(magpie.process())\n
Citations @misc{xu2024magpiealignmentdatasynthesis,\n title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},\n author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},\n year={2024},\n eprint={2406.08464},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2406.08464},\n}\n
Source code in src/distilabel/steps/tasks/magpie/generator.py
class MagpieGenerator(GeneratorTask, MagpieBase):\n \"\"\"Generator task the generates instructions or conversations using Magpie.\n\n Magpie is a neat method that allows generating user instructions with no seed data\n or specific system prompt thanks to the autoregressive capabilities of the instruct\n fine-tuned LLMs. As they were fine-tuned using a chat template composed by a user message\n and a desired assistant output, the instruct fine-tuned LLM learns that after the pre-query\n or pre-instruct tokens comes an instruction. If these pre-query tokens are sent to the\n LLM without any user message, then the LLM will continue generating tokens as it was\n the user. This trick allows \"extracting\" instructions from the instruct fine-tuned LLM.\n After this instruct is generated, it can be sent again to the LLM to generate this time\n an assistant response. This process can be repeated N times allowing to build a multi-turn\n conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from\n Scratch by Prompting Aligned LLMs with Nothing'.\n\n Attributes:\n n_turns: the number of turns that the generated conversation will have.\n Defaults to `1`.\n end_with_user: whether the conversation should end with a user message.\n Defaults to `False`.\n include_system_prompt: whether to include the system prompt used in the generated\n conversation. Defaults to `False`.\n only_instruction: whether to generate only the instruction. If this argument is\n `True`, then `n_turns` will be ignored. Defaults to `False`.\n system_prompt: an optional system prompt, or a list of system prompts from which\n a random one will be chosen, or a dictionary of system prompts from which a\n random one will be choosen, or a dictionary of system prompts with their probability\n of being chosen. The random system prompt will be chosen per input/output batch.\n This system prompt can be used to guide the generation of the instruct LLM and\n steer it to generate instructions of a certain topic. Defaults to `None`.\n num_rows: the number of rows to be generated.\n\n Runtime parameters:\n - `n_turns`: the number of turns that the generated conversation will have. Defaults\n to `1`.\n - `end_with_user`: whether the conversation should end with a user message.\n Defaults to `False`.\n - `include_system_prompt`: whether to include the system prompt used in the generated\n conversation. Defaults to `False`.\n - `only_instruction`: whether to generate only the instruction. If this argument is\n `True`, then `n_turns` will be ignored. Defaults to `False`.\n - `system_prompt`: an optional system prompt, or a list of system prompts from which\n a random one will be chosen, or a dictionary of system prompts from which a\n random one will be choosen, or a dictionary of system prompts with their probability\n of being chosen. The random system prompt will be chosen per input/output batch.\n This system prompt can be used to guide the generation of the instruct LLM and\n steer it to generate instructions of a certain topic.\n - `num_rows`: the number of rows to be generated.\n\n Output columns:\n - conversation (`ChatType`): the generated conversation which is a list of chat\n items with a role and a message.\n - instruction (`str`): the generated instructions if `only_instruction=True`.\n - response (`str`): the generated response if `n_turns==1`.\n - system_prompt_key (`str`, optional): the key of the system prompt used to generate\n the conversation or instruction. 
Only if `system_prompt` is a dictionary.\n - model_name (`str`): The model name used to generate the `conversation` or `instruction`.\n\n Categories:\n - text-generation\n - instruction\n - generator\n\n References:\n - [Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing](https://arxiv.org/abs/2406.08464)\n\n Examples:\n Generating instructions with Llama 3 8B Instruct and TransformersLLM:\n\n ```python\n from distilabel.models import TransformersLLM\n from distilabel.steps.tasks import MagpieGenerator\n\n generator = MagpieGenerator(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 256,\n },\n device=\"mps\",\n ),\n only_instruction=True,\n num_rows=5,\n )\n\n generator.load()\n\n result = next(generator.process())\n # (\n # [\n # {\"instruction\": \"I've just bought a new phone and I're excited to start using it.\"},\n # {\"instruction\": \"What are the most common types of companies that use digital signage?\"}\n # ],\n # True\n # )\n ```\n\n Generating a conversation with Llama 3 8B Instruct and TransformersLLM:\n\n ```python\n from distilabel.models import TransformersLLM\n from distilabel.steps.tasks import MagpieGenerator\n\n generator = MagpieGenerator(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 64,\n },\n device=\"mps\",\n ),\n n_turns=3,\n num_rows=5,\n )\n\n generator.load()\n\n result = next(generator.process())\n # (\n # [\n # {\n # 'conversation': [\n # {\n # 'role': 'system',\n # 'content': 'You are a helpful Al assistant. The user will engage in a multi\u2212round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and\n # insightful responses to help the user with their queries.'\n # },\n # {'role': 'user', 'content': \"I'm considering starting a social media campaign for my small business and I're not sure where to start. Can you help?\"},\n # {\n # 'role': 'assistant',\n # 'content': \"Exciting endeavor! Creating a social media campaign can be a great way to increase brand awareness, drive website traffic, and ultimately boost sales. I'd be happy to guide you through the process. To get started,\n # let's break down the basics. First, we need to identify your goals and target audience. What do\"\n # },\n # {\n # 'role': 'user',\n # 'content': \"Before I start a social media campaign, what kind of costs ammol should I expect to pay? There are several factors that contribute to the total cost of running a social media campaign. Let me outline some of the main\n # expenses you might encounter: 1. Time: As the business owner, you'll likely spend time creating\"\n # },\n # {\n # 'role': 'assistant',\n # 'content': 'Time is indeed one of the biggest investments when it comes to running a social media campaign! Besides time, you may also incur costs associated with: 2. Content creation: You might need to hire freelancers or\n # agencies to create high-quality content (images, videos, captions) for your social media platforms. 3. Advertising'\n # }\n # ]\n # },\n # {\n # 'conversation': [\n # {\n # 'role': 'system',\n # 'content': 'You are a helpful Al assistant. 
The user will engage in a multi\u2212round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and\n # insightful responses to help the user with their queries.'\n # },\n # {'role': 'user', 'content': \"I am thinking of buying a new laptop or computer. What are some important factors I should consider when making your decision? I'll make sure to let you know if any other favorites or needs come up!\"},\n # {\n # 'role': 'assistant',\n # 'content': 'Exciting times ahead! When considering a new laptop or computer, there are several key factors to think about to ensure you find the right one for your needs. Here are some crucial ones to get you started: 1.\n # **Purpose**: How will you use your laptop or computer? For work, gaming, video editing,'\n # },\n # {\n # 'role': 'user',\n # 'content': 'Let me stop you there. Let\\'s explore this \"purpose\" factor that you mentioned earlier. Can you elaborate more on what type of devices would be suitable for different purposes? For example, if I\\'re primarily using my\n # laptop for general usage like browsing, email, and word processing, would a budget-friendly laptop be sufficient'\n # },\n # {\n # 'role': 'assistant',\n # 'content': \"Understanding your purpose can greatly impact the type of device you'll need. **General Usage (Browsing, Email, Word Processing)**: For casual users who mainly use their laptop for daily tasks, a budget-friendly\n # option can be sufficient. Look for laptops with: * Intel Core i3 or i5 processor* \"\n # }\n # ]\n # }\n # ],\n # True\n # )\n ```\n\n Generating with system prompts with probabilities:\n\n ```python\n from distilabel.models import InferenceEndpointsLLM\n from distilabel.steps.tasks import MagpieGenerator\n\n magpie = MagpieGenerator(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 0.8,\n \"max_new_tokens\": 256,\n },\n ),\n n_turns=2,\n system_prompt={\n \"math\": (\"You're an expert AI assistant.\", 0.8),\n \"writing\": (\"You're an expert writing assistant.\", 0.2),\n },\n )\n\n magpie.load()\n\n result = next(magpie.process())\n ```\n\n Citations:\n ```\n @misc{xu2024magpiealignmentdatasynthesis,\n title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},\n author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},\n year={2024},\n eprint={2406.08464},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2406.08464},\n }\n ```\n \"\"\"\n\n # TODO: move this to `GeneratorTask`\n num_rows: RuntimeParameter[int] = Field(\n default=None, description=\"The number of rows to generate.\"\n )\n\n def model_post_init(self, __context: Any) -> None:\n \"\"\"Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`.\"\"\"\n super().model_post_init(__context)\n\n if not isinstance(self.llm, MagpieChatTemplateMixin):\n raise DistilabelUserError(\n f\"`Magpie` task can only be used with an `LLM` that uses the `MagpieChatTemplateMixin`.\"\n f\"`{self.llm.__class__.__name__}` doesn't use the aforementioned mixin.\",\n page=\"components-gallery/tasks/magpiegenerator/\",\n )\n\n self.llm.use_magpie_template = True\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"Either a multi-turn conversation or the instruction 
generated.\"\"\"\n outputs = []\n\n if self.only_instruction:\n outputs.append(\"instruction\")\n elif self.n_turns == 1:\n outputs.extend([\"instruction\", \"response\"])\n else:\n outputs.append(\"conversation\")\n\n if isinstance(self.system_prompt, dict):\n outputs.append(\"system_prompt_key\")\n\n outputs.append(\"model_name\")\n\n return outputs\n\n def format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n ) -> Dict[str, Any]:\n \"\"\"Does nothing.\"\"\"\n return {}\n\n def process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n \"\"\"Generates the desired number of instructions or conversations using Magpie.\n\n Args:\n offset: The offset to start the generation from. Defaults to `0`.\n\n Yields:\n The generated instructions or conversations.\n \"\"\"\n generated = offset\n\n while generated <= self.num_rows: # type: ignore\n rows_to_generate = (\n self.num_rows if self.num_rows < self.batch_size else self.batch_size # type: ignore\n )\n conversations = self._generate_with_pre_query_template(\n inputs=[{} for _ in range(rows_to_generate)] # type: ignore\n )\n generated += rows_to_generate # type: ignore\n yield (conversations, generated == self.num_rows)\n\n @override\n def _sample_input(self) -> \"ChatType\":\n return self._generate_with_pre_query_template(inputs=[{}])\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MagpieGenerator.outputs","title":"outputs
property
","text":"Either a multi-turn conversation or the instruction generated.
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MagpieGenerator.model_post_init","title":"model_post_init(__context)
","text":"Checks that the provided LLM
uses the MagpieChatTemplateMixin
.
src/distilabel/steps/tasks/magpie/generator.py
def model_post_init(self, __context: Any) -> None:\n \"\"\"Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`.\"\"\"\n super().model_post_init(__context)\n\n if not isinstance(self.llm, MagpieChatTemplateMixin):\n raise DistilabelUserError(\n f\"`Magpie` task can only be used with an `LLM` that uses the `MagpieChatTemplateMixin`.\"\n f\"`{self.llm.__class__.__name__}` doesn't use the aforementioned mixin.\",\n page=\"components-gallery/tasks/magpiegenerator/\",\n )\n\n self.llm.use_magpie_template = True\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MagpieGenerator.format_output","title":"format_output(output, input=None)
","text":"Does nothing.
Source code insrc/distilabel/steps/tasks/magpie/generator.py
def format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n) -> Dict[str, Any]:\n \"\"\"Does nothing.\"\"\"\n return {}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MagpieGenerator.process","title":"process(offset=0)
","text":"Generates the desired number of instructions or conversations using Magpie.
Parameters:
Name Type Description Defaultoffset
int
The offset to start the generation from. Defaults to 0
.
0
Yields:
Type DescriptionGeneratorStepOutput
The generated instructions or conversations.
Source code insrc/distilabel/steps/tasks/magpie/generator.py
def process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n \"\"\"Generates the desired number of instructions or conversations using Magpie.\n\n Args:\n offset: The offset to start the generation from. Defaults to `0`.\n\n Yields:\n The generated instructions or conversations.\n \"\"\"\n generated = offset\n\n while generated <= self.num_rows: # type: ignore\n rows_to_generate = (\n self.num_rows if self.num_rows < self.batch_size else self.batch_size # type: ignore\n )\n conversations = self._generate_with_pre_query_template(\n inputs=[{} for _ in range(rows_to_generate)] # type: ignore\n )\n generated += rows_to_generate # type: ignore\n yield (conversations, generated == self.num_rows)\n
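A minimal usage sketch (reusing the generator object built in the first example above) showing that process is a Python generator yielding (batch, done) tuples until num_rows rows have been produced:
# Hedged sketch: `process` yields tuples of (batch of rows, flag for the last batch),\n# matching the `yield (conversations, generated == self.num_rows)` line above.\nfor batch, is_last_batch in generator.process():\n    print(len(batch), is_last_batch)\n    if is_last_batch:\n        break\n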
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdCompleter","title":"MathShepherdCompleter
","text":" Bases: Task
Math Shepherd Completer and auto-labeller task.
Given a list of candidate solutions to an instruction and a golden solution as reference, this task generates completions for the solutions and labels them according to the golden solution, using the hard estimation method from Figure 2 (Eq. 3) in the reference paper. The attributes make the task flexible enough to be used with different types of datasets and LLMs, and allow different fields to be used to modify the system and user prompts. Before modifying them, review the current defaults to ensure the completions are generated correctly.
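A minimal sketch of the hard estimation rule (a paraphrase of Eq. 3 using the task's default tags, not the distilabel implementation): a step is labelled positive if at least one of its N completions reaches the golden answer, and negative otherwise.
# Hedged sketch of hard estimation (Eq. 3), not the distilabel implementation:\n# a step gets the positive tag if any of its N completions reaches the golden answer.\ndef hard_estimation_label(\n    completion_answers: list[str],\n    golden_answer: str,\n    tags: tuple[str, str] = (\"+\", \"-\"),\n) -> str:\n    return tags[0] if any(a == golden_answer for a in completion_answers) else tags[1]\n\nprint(hard_estimation_label([\"20\", \"18\", \"17\"], \"18\"))  # \"+\"\nprint(hard_estimation_label([\"20\", \"21\"], \"18\"))        # \"-\"\n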
Attributes:
Name Type Descriptionsystem_prompt
Optional[str]
The system prompt to be used in the completions. The default one has been checked and generates good completions using Llama 3.1 with 8B and 70B, but it can be modified to adapt it to the model and dataset selected.
extra_rules
Optional[str]
This field can be used to insert extra rules relevant to the type of dataset. For example, in the original paper they used GSM8K and MATH datasets, and this field can be used to insert the rules for the GSM8K dataset.
few_shots
Optional[str]
Few shots to help the model generating the completions, write them in the format of the type of solutions wanted for your dataset.
N
PositiveInt
Number of completions to generate for each step; corresponds to N in the paper. The paper used 8, but it can be adjusted.
tags
list[str]
List of tags to be used in the completions. The defaults are [\"+\", \"-\"] as in the paper, where the first is the positive label and the second the negative one. This can be updated, but it MUST be a list with 2 elements: the positive tag first and the negative tag second.
Input columnsstr
): The task or instruction.List[str]
): List of solutions to the task.str
): The reference solution to the task; it will be used to annotate the candidate solutions.List[str]
): The same columns that were used as input; the \"solutions\" column is modified.str
): The name of the model used to generate the revision.Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Examples:
Annotate your steps with the Math Shepherd Completer using the structured outputs (the preferred way):
from distilabel.steps.tasks import MathShepherdCompleter\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n)\ntask = MathShepherdCompleter(\n llm=llm,\n N=3,\n use_default_structured_output=True\n)\n\ntask.load()\n\nresult = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"golden_solution\": [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n \"solutions\": [\n [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],\n ]\n },\n ]\n )\n)\n# [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# 'golden_solution': [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"],\n# 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n
Annotate your steps with the Math Shepherd Completer:
from distilabel.steps.tasks import MathShepherdCompleter\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n)\ntask = MathShepherdCompleter(\n llm=llm,\n N=3\n)\n\ntask.load()\n\nresult = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"golden_solution\": [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n \"solutions\": [\n [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],\n ]\n },\n ]\n )\n)\n# [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# 'golden_solution': [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"],\n# 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n
Citations:
```\n@misc{wang2024mathshepherdverifyreinforcellms,\n title={Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations},\n author={Peiyi Wang and Lei Li and Zhihong Shao and R. X. Xu and Damai Dai and Yifei Li and Deli Chen and Y. Wu and Zhifang Sui},\n year={2024},\n eprint={2312.08935},\n archivePrefix={arXiv},\n primaryClass={cs.AI},\n url={https://arxiv.org/abs/2312.08935},\n}\n```\n
Source code in src/distilabel/steps/tasks/math_shepherd/completer.py
class MathShepherdCompleter(Task):\n \"\"\"Math Shepherd Completer and auto-labeller task.\n\n This task is in charge of, given a list of solutions to an instruction, and a golden solution,\n as reference, generate completions for the solutions, and label them according to the golden\n solution using the hard estimation method from figure 2 in the reference paper, Eq. 3.\n The attributes make the task flexible to be used with different types of dataset and LLMs, and\n allow making use of different fields to modify the system and user prompts for it. Before modifying\n them, review the current defaults to ensure the completions are generated correctly.\n\n Attributes:\n system_prompt: The system prompt to be used in the completions. The default one has been\n checked and generates good completions using Llama 3.1 with 8B and 70B,\n but it can be modified to adapt it to the model and dataset selected.\n extra_rules: This field can be used to insert extra rules relevant to the type of dataset.\n For example, in the original paper they used GSM8K and MATH datasets, and this field\n can be used to insert the rules for the GSM8K dataset.\n few_shots: Few shots to help the model generating the completions, write them in the\n format of the type of solutions wanted for your dataset.\n N: Number of completions to generate for each step, correspond to N in the paper.\n They used 8 in the paper, but it can be adjusted.\n tags: List of tags to be used in the completions, the default ones are [\"+\", \"-\"] as in the\n paper, where the first is used as a positive label, and the second as a negative one.\n This can be updated, but it MUST be a list with 2 elements, where the first is the\n positive one, and the second the negative one.\n\n Input columns:\n - instruction (`str`): The task or instruction.\n - solutions (`List[str]`): List of solutions to the task.\n - golden_solution (`str`): The reference solution to the task, will be used\n to annotate the candidate solutions.\n\n Output columns:\n - solutions (`List[str]`): The same columns that were used as input, the \"solutions\" is modified.\n - model_name (`str`): The name of the model used to generate the revision.\n\n Categories:\n - text-generation\n - labelling\n\n References:\n - [`Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations`](https://arxiv.org/abs/2312.08935)\n\n Examples:\n Annotate your steps with the Math Shepherd Completer using the structured outputs (the preferred way):\n\n ```python\n from distilabel.steps.tasks import MathShepherdCompleter\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n )\n task = MathShepherdCompleter(\n llm=llm,\n N=3,\n use_default_structured_output=True\n )\n\n task.load()\n\n result = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. 
How much in dollars does she make every day at the farmers' market?\",\n \"golden_solution\": [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n \"solutions\": [\n [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],\n ]\n },\n ]\n )\n )\n # [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n # 'golden_solution': [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\\\u2019s market.\", \"The answer is: 18\"],\n # 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n ```\n\n Annotate your steps with the Math Shepherd Completer:\n\n ```python\n from distilabel.steps.tasks import MathShepherdCompleter\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n )\n task = MathShepherdCompleter(\n llm=llm,\n N=3\n )\n\n task.load()\n\n result = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"golden_solution\": [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n \"solutions\": [\n [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],\n ]\n },\n ]\n )\n )\n # [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. 
How much in dollars does she make every day at the farmers' market?\",\n # 'golden_solution': [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\\\u2019s market.\", \"The answer is: 18\"],\n # 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n ```\n\n Citations:\n\n ```\n @misc{wang2024mathshepherdverifyreinforcellms,\n title={Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations},\n author={Peiyi Wang and Lei Li and Zhihong Shao and R. X. Xu and Damai Dai and Yifei Li and Deli Chen and Y. Wu and Zhifang Sui},\n year={2024},\n eprint={2312.08935},\n archivePrefix={arXiv},\n primaryClass={cs.AI},\n url={https://arxiv.org/abs/2312.08935},\n }\n ```\n \"\"\"\n\n system_prompt: Optional[str] = SYSTEM_PROMPT\n extra_rules: Optional[str] = RULES_GSM8K\n few_shots: Optional[str] = FEW_SHOTS_GSM8K\n N: PositiveInt = 1\n tags: list[str] = [\"+\", \"-\"]\n\n def load(self) -> None:\n super().load()\n\n if self.system_prompt is not None:\n self.system_prompt = Template(self.system_prompt).render(\n extra_rules=self.extra_rules or \"\",\n few_shots=self.few_shots or \"\",\n structured_prompt=SYSTEM_PROMPT_STRUCTURED\n if self.use_default_structured_output\n else \"\",\n )\n if self.use_default_structured_output:\n self._template = Template(TEMPLATE_STRUCTURED)\n else:\n self._template = Template(TEMPLATE)\n\n @property\n def inputs(self) -> \"StepColumns\":\n return [\"instruction\", \"solutions\", \"golden_solution\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n return [\"model_name\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n messages = [\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n instruction=input[\"instruction\"], N=self.N\n ),\n }\n ]\n if self.system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": self.system_prompt})\n return messages # type: ignore\n\n def _parse_output(self, output: Union[str, None]) -> list[list[str]]:\n if output is None:\n return [[\"\"]] * self.N\n\n if self.N > 1:\n output_transformed = ( # type: ignore\n self._format_structured_output(output)\n if self.use_default_structured_output\n else output.split(\"---\")\n )\n examples = [split_solution_steps(o) for o in output_transformed]\n # In case there aren't the expected number of completions, we fill it with \"\", or short the list.\n # This shoulnd't happen if the LLM works as expected, but it's a safety measure as it can be\n # difficult to debug if the completions don't match the solutions.\n if len(examples) < self.N:\n examples.extend([\"\"] * (self.N - len(examples))) # type: ignore\n elif len(examples) > self.N:\n examples = examples[: self.N]\n else:\n output_transformed = (\n self._format_structured_output(output)[0]\n if self.use_default_structured_output\n else output\n )\n examples = [split_solution_steps(output_transformed)]\n return examples\n\n def _format_structured_output(self, output: str) -> list[str]:\n default_output = [\"\"] * self.N if self.N else [\"\"]\n if parsed_output := parse_json_response(output):\n 
solutions = parsed_output[\"solutions\"]\n extracted_solutions = [solution[\"solution\"] for solution in solutions]\n if len(output) != self.N:\n extracted_solutions = default_output\n return extracted_solutions\n return default_output\n\n def format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n ) -> Dict[str, Any]:\n \"\"\"Does nothing.\"\"\"\n return {}\n\n def process(self, inputs: StepInput) -> \"StepOutput\":\n \"\"\"Does the processing of generation completions for the solutions, and annotate\n each step with the logic found in Figure 2 of the paper, with the hard estimation (Eq. (3)).\n\n Args:\n inputs: Inputs to the step\n\n Yields:\n Annotated inputs with the completions.\n \"\"\"\n\n # A list with all the inputs to be passed to the LLM. Needs another structure to\n # find them afterwards\n prepared_inputs = []\n # Data structure with the indices of the elements.\n # (i, j, k) where i is the input, j is the solution, and k is the completion\n input_positions = []\n golden_answers = []\n for i, input in enumerate(inputs):\n instruction = input[\"instruction\"]\n golden_solution = input[\"golden_solution\"] # This is a single solution\n golden_answers.append(golden_solution[-1])\n # This contains a list of solutions\n solutions = input[\"solutions\"]\n for j, solution in enumerate(solutions):\n # For each solution, that has K steps, we have to generate N completions\n # for the first K-2 steps (-2 because the last 2 steps are the last step, and\n # the answer itself, which can be directly compared against golden answer)\n prepared_completions = self._prepare_completions(instruction, solution)\n prepared_inputs.extend(prepared_completions)\n input_positions.extend(\n [(i, j, k) for k in range(len(prepared_completions))]\n )\n\n # Send the elements in batches to the LLM to speed up the process\n final_outputs = []\n # Added here to simplify testing in case we don't have anything to process\n # TODO: Ensure the statistics has the same shape as all the outputs, raw_outputs, and raw_inputs\n statistics = []\n total_raw_outputs = []\n total_raw_inputs = []\n for inner_batch in batched(prepared_inputs, self.input_batch_size): # type: ignore\n outputs = self.llm.generate_outputs(\n inputs=inner_batch,\n num_generations=1,\n **self.llm.get_generation_kwargs(), # type: ignore\n )\n\n formatted_outputs = []\n stats = []\n raw_outputs = []\n raw_inputs = []\n for i, output in enumerate(outputs):\n generation = output[\"generations\"][0]\n raw_inputs.append(inner_batch[i])\n raw_outputs.append(generation or \"\")\n formatted_outputs.append(self._parse_output(generation))\n stats.append(output[\"statistics\"])\n\n final_outputs.extend(formatted_outputs)\n statistics.extend(stats)\n total_raw_outputs.extend(raw_outputs)\n total_raw_inputs.extend(raw_inputs)\n\n yield self._auto_label( # type: ignore\n inputs,\n final_outputs,\n input_positions,\n golden_answers,\n statistics,\n total_raw_outputs,\n total_raw_inputs,\n )\n\n def _prepare_completions(\n self, instruction: str, steps: list[str]\n ) -> List[\"ChatType\"]:\n \"\"\"Helper method to create, given a solution (a list of steps), and a instruction, the\n texts to be completed by the LLM.\n\n Args:\n instruction: Instruction of the problem.\n steps: List of steps that are part of the solution.\n\n Returns:\n List of ChatType, where each ChatType is the prompt corresponding to one of the steps\n to be completed.\n \"\"\"\n prepared_inputs = []\n # Use the number of completions that correspond to a 
given instruction/steps pair\n # to find afterwards the input that corresponds to a given completion (to do the labelling)\n num_completions = len(steps[:-2])\n for i in range(1, num_completions + 1):\n to_complete = instruction + \" \" + \"\\n\".join(steps[:i])\n prepared_inputs.append(self.format_input({\"instruction\": to_complete}))\n\n return prepared_inputs\n\n def _auto_label(\n self,\n inputs: StepInput,\n final_outputs: list[Completions],\n input_positions: list[tuple[int, int, int]],\n golden_answers: list[str],\n statistics: list[\"LLMStatistics\"],\n raw_outputs: list[str],\n raw_inputs: list[str],\n ) -> StepInput:\n \"\"\"Labels the steps inplace (in the inputs), and returns the inputs.\n\n Args:\n inputs: The original inputs\n final_outputs: List of generations from the LLM.\n It's organized as a list where the elements sent to the LLM are\n grouped together, then each element contains the completions, and\n each completion is a list of steps.\n input_positions: A list with tuples generated in the process method\n that contains (i, j, k) where i is the index of the input, j is the\n index of the solution, and k is the index of the completion.\n golden_answers: List of golden answers for each input.\n statistics: List of statistics from the LLM.\n raw_outputs: List of raw outputs from the LLM.\n raw_inputs: List of raw inputs to the LLM.\n\n Returns:\n Inputs annotated.\n \"\"\"\n for i, (instruction_i, solution_i, step_i) in enumerate(input_positions):\n input = inputs[instruction_i]\n solutions = input[\"solutions\"]\n n_completions = final_outputs[i]\n label = f\" {self.tags[1]}\"\n for completion in n_completions:\n if len(completion) == 0:\n # This can be a failed generation\n label = \"\" # Everyting stays the same\n self._logger.info(\"Completer failed due to empty completion\")\n continue\n if completion[-1] == golden_answers[instruction_i]:\n label = f\" { self.tags[0]}\"\n # If we found one, it's enough as we are doing Hard Estimation\n continue\n # In case we had no solutions from the previous step, otherwise we would have\n # an IndexError\n if not solutions[solution_i]:\n continue\n solutions[solution_i][step_i] += label\n inputs[instruction_i][\"solutions\"] = solutions\n\n for i, input in enumerate(inputs):\n solutions = input[\"solutions\"]\n new_solutions = []\n for solution in solutions:\n if not solution or (len(solution) == 1):\n # The generation may fail to generate the expected\n # completions, or just added an extra empty completion,\n # we skip it.\n # Other possible error is having a list of solutions\n # with a single item, so when we call .pop, we are left\n # with an empty list, so we skip it too.\n new_solutions.append(solution)\n continue\n\n answer = solution.pop()\n label = (\n f\" {self.tags[0]}\"\n if answer == golden_answers[i]\n else f\" {self.tags[1]}\"\n )\n solution[-1] += \" \" + answer + label\n new_solutions.append(solution)\n\n # Only add the solutions if the data was properly parsed\n input[\"solutions\"] = new_solutions if new_solutions else input[\"solutions\"]\n input = self._add_metadata(\n input, statistics[i], raw_outputs[i], raw_inputs[i]\n )\n\n return inputs\n\n def _add_metadata(\n self,\n input: dict[str, Any],\n statistics: list[\"LLMStatistics\"],\n raw_output: Union[str, None],\n raw_input: Union[list[dict[str, Any]], None],\n ) -> dict[str, Any]:\n \"\"\"Adds the `distilabel_metadata` to the input.\n\n This method comes for free in the general Tasks, but as we have reimplemented the `process`,\n we have to repeat it 
here.\n\n Args:\n input: The input to add the metadata to.\n statistics: The statistics from the LLM.\n raw_output: The raw output from the LLM.\n raw_input: The raw input to the LLM.\n\n Returns:\n The input with the metadata added if applies.\n \"\"\"\n input[\"model_name\"] = self.llm.model_name\n\n if DISTILABEL_METADATA_KEY not in input:\n input[DISTILABEL_METADATA_KEY] = {}\n # If the solutions are splitted afterwards, the statistics should be splitted\n # to avoid counting extra tokens\n input[DISTILABEL_METADATA_KEY][f\"statistics_{self.name}\"] = statistics\n\n # Let some defaults in case something failed and we had None, otherwise when reading\n # the parquet files using pyarrow, the following error will appear:\n # ArrowInvalid: Schema\n if self.add_raw_input:\n input[DISTILABEL_METADATA_KEY][f\"raw_input_{self.name}\"] = raw_input or [\n {\"content\": \"\", \"role\": \"\"}\n ]\n if self.add_raw_output:\n input[DISTILABEL_METADATA_KEY][f\"raw_output_{self.name}\"] = raw_output or \"\"\n return input\n\n @override\n def get_structured_output(self) -> dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel, Field\n\n class Solution(BaseModel):\n solution: str = Field(..., description=\"Step by step solution leading to the final answer\")\n\n class MathShepherdCompleter(BaseModel):\n solutions: list[Solution] = Field(..., description=\"List of solutions\")\n\n MathShepherdCompleter.model_json_schema()\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"$defs\": {\n \"Solution\": {\n \"properties\": {\n \"solution\": {\n \"description\": \"Step by step solution leading to the final answer\",\n \"title\": \"Solution\",\n \"type\": \"string\",\n }\n },\n \"required\": [\"solution\"],\n \"title\": \"Solution\",\n \"type\": \"object\",\n }\n },\n \"properties\": {\n \"solutions\": {\n \"description\": \"List of solutions\",\n \"items\": {\"$ref\": \"#/$defs/Solution\"},\n \"title\": \"Solutions\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"solutions\"],\n \"title\": \"MathShepherdGenerator\",\n \"type\": \"object\",\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdCompleter.format_output","title":"format_output(output, input=None)
","text":"Does nothing.
Source code insrc/distilabel/steps/tasks/math_shepherd/completer.py
def format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n) -> Dict[str, Any]:\n \"\"\"Does nothing.\"\"\"\n return {}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdCompleter.process","title":"process(inputs)
","text":"Does the processing of generation completions for the solutions, and annotate each step with the logic found in Figure 2 of the paper, with the hard estimation (Eq. (3)).
Parameters:
Name Type Description Defaultinputs
StepInput
Inputs to the step
requiredYields:
Type DescriptionStepOutput
Annotated inputs with the completions.
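To make the bookkeeping in process easier to follow, here is a minimal standalone sketch (toy data, not the class internals) of how each prepared prompt is tracked with an (i, j, k) tuple so the generated completions can be mapped back to their input, solution, and step:
# Hypothetical toy data: one input with two candidate solutions\nsolutions_per_input = [\n    [\n        [\"Step 1: ...\", \"Step 2: ...\", \"Step 3: ...\", \"The answer is: 18\"],\n        [\"Step 1: ...\", \"The answer is: 18\"],\n    ],\n]\ninput_positions = []\nfor i, solutions in enumerate(solutions_per_input):\n    for j, solution in enumerate(solutions):\n        # Completions are only requested for the first K-2 steps: the last step and\n        # the final answer are compared against the golden answer directly.\n        for k in range(len(solution[:-2])):\n            input_positions.append((i, j, k))\nprint(input_positions)  # [(0, 0, 0), (0, 0, 1)]: only the 4-step solution yields prompts\n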
Source code insrc/distilabel/steps/tasks/math_shepherd/completer.py
def process(self, inputs: StepInput) -> \"StepOutput\":\n \"\"\"Does the processing of generation completions for the solutions, and annotate\n each step with the logic found in Figure 2 of the paper, with the hard estimation (Eq. (3)).\n\n Args:\n inputs: Inputs to the step\n\n Yields:\n Annotated inputs with the completions.\n \"\"\"\n\n # A list with all the inputs to be passed to the LLM. Needs another structure to\n # find them afterwards\n prepared_inputs = []\n # Data structure with the indices of the elements.\n # (i, j, k) where i is the input, j is the solution, and k is the completion\n input_positions = []\n golden_answers = []\n for i, input in enumerate(inputs):\n instruction = input[\"instruction\"]\n golden_solution = input[\"golden_solution\"] # This is a single solution\n golden_answers.append(golden_solution[-1])\n # This contains a list of solutions\n solutions = input[\"solutions\"]\n for j, solution in enumerate(solutions):\n # For each solution, that has K steps, we have to generate N completions\n # for the first K-2 steps (-2 because the last 2 steps are the last step, and\n # the answer itself, which can be directly compared against golden answer)\n prepared_completions = self._prepare_completions(instruction, solution)\n prepared_inputs.extend(prepared_completions)\n input_positions.extend(\n [(i, j, k) for k in range(len(prepared_completions))]\n )\n\n # Send the elements in batches to the LLM to speed up the process\n final_outputs = []\n # Added here to simplify testing in case we don't have anything to process\n # TODO: Ensure the statistics has the same shape as all the outputs, raw_outputs, and raw_inputs\n statistics = []\n total_raw_outputs = []\n total_raw_inputs = []\n for inner_batch in batched(prepared_inputs, self.input_batch_size): # type: ignore\n outputs = self.llm.generate_outputs(\n inputs=inner_batch,\n num_generations=1,\n **self.llm.get_generation_kwargs(), # type: ignore\n )\n\n formatted_outputs = []\n stats = []\n raw_outputs = []\n raw_inputs = []\n for i, output in enumerate(outputs):\n generation = output[\"generations\"][0]\n raw_inputs.append(inner_batch[i])\n raw_outputs.append(generation or \"\")\n formatted_outputs.append(self._parse_output(generation))\n stats.append(output[\"statistics\"])\n\n final_outputs.extend(formatted_outputs)\n statistics.extend(stats)\n total_raw_outputs.extend(raw_outputs)\n total_raw_inputs.extend(raw_inputs)\n\n yield self._auto_label( # type: ignore\n inputs,\n final_outputs,\n input_positions,\n golden_answers,\n statistics,\n total_raw_outputs,\n total_raw_inputs,\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdCompleter._prepare_completions","title":"_prepare_completions(instruction, steps)
","text":"Helper method to create, given a solution (a list of steps), and a instruction, the texts to be completed by the LLM.
Parameters:
Name Type Description Defaultinstruction
str
Instruction of the problem.
requiredsteps
list[str]
List of steps that are part of the solution.
requiredReturns:
Type DescriptionList[ChatType]
List of ChatType, where each ChatType is the prompt corresponding to one of the steps
List[ChatType]
to be completed.
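As an illustration, here is a minimal sketch of the prefix construction (using a shortened toy instruction, independent of the class): for a solution with K steps, the first K-2 prefixes are turned into prompts to be completed.
instruction = \"Janet's ducks lay 16 eggs per day. How much does she make at the market?\"  # shortened toy instruction\nsteps = [\n    \"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\",\n    \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer's market.\",\n    \"The answer is: 18\",\n]\n# Only the first K-2 steps get a completion prompt (here K=3, so a single prompt).\nnum_completions = len(steps[:-2])\nprompts = [instruction + \" \" + \"\\n\".join(steps[:i]) for i in range(1, num_completions + 1)]\nprint(len(prompts))  # 1\nprint(prompts[0])    # the instruction followed by \"Step 1: ...\"\n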
Source code insrc/distilabel/steps/tasks/math_shepherd/completer.py
def _prepare_completions(\n self, instruction: str, steps: list[str]\n) -> List[\"ChatType\"]:\n \"\"\"Helper method to create, given a solution (a list of steps), and a instruction, the\n texts to be completed by the LLM.\n\n Args:\n instruction: Instruction of the problem.\n steps: List of steps that are part of the solution.\n\n Returns:\n List of ChatType, where each ChatType is the prompt corresponding to one of the steps\n to be completed.\n \"\"\"\n prepared_inputs = []\n # Use the number of completions that correspond to a given instruction/steps pair\n # to find afterwards the input that corresponds to a given completion (to do the labelling)\n num_completions = len(steps[:-2])\n for i in range(1, num_completions + 1):\n to_complete = instruction + \" \" + \"\\n\".join(steps[:i])\n prepared_inputs.append(self.format_input({\"instruction\": to_complete}))\n\n return prepared_inputs\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdCompleter._auto_label","title":"_auto_label(inputs, final_outputs, input_positions, golden_answers, statistics, raw_outputs, raw_inputs)
","text":"Labels the steps inplace (in the inputs), and returns the inputs.
Parameters:
Name Type Description Defaultinputs
StepInput
The original inputs
requiredfinal_outputs
list[Completions]
List of generations from the LLM. It's organized as a list where the elements sent to the LLM are grouped together, then each element contains the completions, and each completion is a list of steps.
requiredinput_positions
list[tuple[int, int, int]]
A list with tuples generated in the process method that contains (i, j, k) where i is the index of the input, j is the index of the solution, and k is the index of the completion.
requiredgolden_answers
list[str]
List of golden answers for each input.
requiredstatistics
list[LLMStatistics]
List of statistics from the LLM.
requiredraw_outputs
list[str]
List of raw outputs from the LLM.
requiredraw_inputs
list[str]
List of raw inputs to the LLM.
requiredReturns:
Type DescriptionStepInput
Inputs annotated.
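The labelling rule itself (the hard estimation from Eq. 3 of the paper) can be summarized in a short standalone sketch (hypothetical completions, not the class internals): a step receives the positive tag if at least one of its N completions reaches the golden answer, and the negative tag otherwise.
tags = [\"+\", \"-\"]\ngolden_answer = \"The answer is: 18\"\n# Hypothetical completions generated from the prefix that ends at a given step;\n# each completion is a list of steps whose last element is the proposed answer.\ncompletions_for_step = [\n    [\"Step 2: She makes 9 * 2 = $18.\", \"The answer is: 18\"],\n    [\"Step 2: She makes 9 * 3 = $27.\", \"The answer is: 27\"],\n]\n# Hard estimation: the step is positive if any completion reaches the golden answer.\nlabel = tags[0] if any(c[-1] == golden_answer for c in completions_for_step) else tags[1]\nprint(label)  # \"+\"\n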
Source code insrc/distilabel/steps/tasks/math_shepherd/completer.py
def _auto_label(\n self,\n inputs: StepInput,\n final_outputs: list[Completions],\n input_positions: list[tuple[int, int, int]],\n golden_answers: list[str],\n statistics: list[\"LLMStatistics\"],\n raw_outputs: list[str],\n raw_inputs: list[str],\n) -> StepInput:\n \"\"\"Labels the steps inplace (in the inputs), and returns the inputs.\n\n Args:\n inputs: The original inputs\n final_outputs: List of generations from the LLM.\n It's organized as a list where the elements sent to the LLM are\n grouped together, then each element contains the completions, and\n each completion is a list of steps.\n input_positions: A list with tuples generated in the process method\n that contains (i, j, k) where i is the index of the input, j is the\n index of the solution, and k is the index of the completion.\n golden_answers: List of golden answers for each input.\n statistics: List of statistics from the LLM.\n raw_outputs: List of raw outputs from the LLM.\n raw_inputs: List of raw inputs to the LLM.\n\n Returns:\n Inputs annotated.\n \"\"\"\n for i, (instruction_i, solution_i, step_i) in enumerate(input_positions):\n input = inputs[instruction_i]\n solutions = input[\"solutions\"]\n n_completions = final_outputs[i]\n label = f\" {self.tags[1]}\"\n for completion in n_completions:\n if len(completion) == 0:\n # This can be a failed generation\n label = \"\" # Everyting stays the same\n self._logger.info(\"Completer failed due to empty completion\")\n continue\n if completion[-1] == golden_answers[instruction_i]:\n label = f\" { self.tags[0]}\"\n # If we found one, it's enough as we are doing Hard Estimation\n continue\n # In case we had no solutions from the previous step, otherwise we would have\n # an IndexError\n if not solutions[solution_i]:\n continue\n solutions[solution_i][step_i] += label\n inputs[instruction_i][\"solutions\"] = solutions\n\n for i, input in enumerate(inputs):\n solutions = input[\"solutions\"]\n new_solutions = []\n for solution in solutions:\n if not solution or (len(solution) == 1):\n # The generation may fail to generate the expected\n # completions, or just added an extra empty completion,\n # we skip it.\n # Other possible error is having a list of solutions\n # with a single item, so when we call .pop, we are left\n # with an empty list, so we skip it too.\n new_solutions.append(solution)\n continue\n\n answer = solution.pop()\n label = (\n f\" {self.tags[0]}\"\n if answer == golden_answers[i]\n else f\" {self.tags[1]}\"\n )\n solution[-1] += \" \" + answer + label\n new_solutions.append(solution)\n\n # Only add the solutions if the data was properly parsed\n input[\"solutions\"] = new_solutions if new_solutions else input[\"solutions\"]\n input = self._add_metadata(\n input, statistics[i], raw_outputs[i], raw_inputs[i]\n )\n\n return inputs\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdCompleter._add_metadata","title":"_add_metadata(input, statistics, raw_output, raw_input)
","text":"Adds the distilabel_metadata
to the input.
This method comes for free in the general Tasks, but as we have reimplemented the process
, we have to repeat it here.
Parameters:
Name Type Description Defaultinput
dict[str, Any]
The input to add the metadata to.
requiredstatistics
list[LLMStatistics]
The statistics from the LLM.
requiredraw_output
Union[str, None]
The raw output from the LLM.
requiredraw_input
Union[list[dict[str, Any]], None]
The raw input to the LLM.
requiredReturns:
Type Descriptiondict[str, Any]
The input with the metadata added, if applicable.
Source code insrc/distilabel/steps/tasks/math_shepherd/completer.py
def _add_metadata(\n self,\n input: dict[str, Any],\n statistics: list[\"LLMStatistics\"],\n raw_output: Union[str, None],\n raw_input: Union[list[dict[str, Any]], None],\n) -> dict[str, Any]:\n \"\"\"Adds the `distilabel_metadata` to the input.\n\n This method comes for free in the general Tasks, but as we have reimplemented the `process`,\n we have to repeat it here.\n\n Args:\n input: The input to add the metadata to.\n statistics: The statistics from the LLM.\n raw_output: The raw output from the LLM.\n raw_input: The raw input to the LLM.\n\n Returns:\n The input with the metadata added if applies.\n \"\"\"\n input[\"model_name\"] = self.llm.model_name\n\n if DISTILABEL_METADATA_KEY not in input:\n input[DISTILABEL_METADATA_KEY] = {}\n # If the solutions are splitted afterwards, the statistics should be splitted\n # to avoid counting extra tokens\n input[DISTILABEL_METADATA_KEY][f\"statistics_{self.name}\"] = statistics\n\n # Let some defaults in case something failed and we had None, otherwise when reading\n # the parquet files using pyarrow, the following error will appear:\n # ArrowInvalid: Schema\n if self.add_raw_input:\n input[DISTILABEL_METADATA_KEY][f\"raw_input_{self.name}\"] = raw_input or [\n {\"content\": \"\", \"role\": \"\"}\n ]\n if self.add_raw_output:\n input[DISTILABEL_METADATA_KEY][f\"raw_output_{self.name}\"] = raw_output or \"\"\n return input\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdCompleter.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
The schema corresponds to the following:
from pydantic import BaseModel, Field\n\nclass Solution(BaseModel):\n solution: str = Field(..., description=\"Step by step solution leading to the final answer\")\n\nclass MathShepherdCompleter(BaseModel):\n solutions: list[Solution] = Field(..., description=\"List of solutions\")\n\nMathShepherdCompleter.model_json_schema()\n
Returns:
Type Descriptiondict[str, Any]
JSON Schema of the response to enforce.
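As a quick sanity check, the pydantic models shown above can be used to validate a raw structured response before parsing it (the sample JSON string below is hypothetical):
from pydantic import BaseModel, Field\n\nclass Solution(BaseModel):\n    solution: str = Field(..., description=\"Step by step solution leading to the final answer\")\n\nclass MathShepherdCompleter(BaseModel):\n    solutions: list[Solution] = Field(..., description=\"List of solutions\")\n\n# Hypothetical raw response from the LLM when structured outputs are enforced\nraw = '{\"solutions\": [{\"solution\": \"Step 1: Janet sells 9 eggs. The answer is: 18\"}]}'\nparsed = MathShepherdCompleter.model_validate_json(raw)\nprint(parsed.solutions[0].solution)\n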
Source code insrc/distilabel/steps/tasks/math_shepherd/completer.py
@override\ndef get_structured_output(self) -> dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel, Field\n\n class Solution(BaseModel):\n solution: str = Field(..., description=\"Step by step solution leading to the final answer\")\n\n class MathShepherdCompleter(BaseModel):\n solutions: list[Solution] = Field(..., description=\"List of solutions\")\n\n MathShepherdCompleter.model_json_schema()\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"$defs\": {\n \"Solution\": {\n \"properties\": {\n \"solution\": {\n \"description\": \"Step by step solution leading to the final answer\",\n \"title\": \"Solution\",\n \"type\": \"string\",\n }\n },\n \"required\": [\"solution\"],\n \"title\": \"Solution\",\n \"type\": \"object\",\n }\n },\n \"properties\": {\n \"solutions\": {\n \"description\": \"List of solutions\",\n \"items\": {\"$ref\": \"#/$defs/Solution\"},\n \"title\": \"Solutions\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"solutions\"],\n \"title\": \"MathShepherdGenerator\",\n \"type\": \"object\",\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdGenerator","title":"MathShepherdGenerator
","text":" Bases: Task
Math Shepherd solution generator.
This task is in charge of generating completions for a given instruction, in the format expected by the Math Shepherd Completer task. The attributes make the task flexible enough to be used with different types of datasets and LLMs, and we provide examples for the GSM8K and MATH datasets as presented in the original paper. Before modifying them, review the current defaults to ensure the completions are generated correctly. This task can be used to generate the golden solution for a given problem if it is not provided, as well as possible solutions to then be labeled by the Math Shepherd Completer. Only one of solutions
or golden_solution
will be generated, depending on the value of M.
Attributes:
Name Type Descriptionsystem_prompt
Optional[str]
The system prompt to be used in the completions. The default one has been checked and generates good completions using Llama 3.1 8B and 70B, but it can be modified to adapt it to the model and dataset selected. Take into account that the system prompt includes 2 variables in the Jinja2 template, {{extra_rules}} and {{few_shots}}. These variables are used to include extra rules, for example to steer the model towards a specific type of response, and few-shot examples. They can be modified to adapt the system prompt to the dataset and model used without needing to change the full system prompt.
extra_rules
Optional[str]
This field can be used to insert extra rules relevant to the type of dataset. For example, in the original paper they used GSM8K and MATH datasets, and this field can be used to insert the rules for the GSM8K dataset.
few_shots
Optional[str]
Few-shot examples to help the model generate the completions; write them in the format of the solutions expected for your dataset.
M
Optional[PositiveInt]
Number of completions to generate for each step. By default it is set to 1, which will generate the \"golden_solution\". In this case, select a stronger model, as it will be used as the source of truth during labelling. If M is set to a number greater than 1, the task will generate a list of completions to be labeled by the Math Shepherd Completer task.
Input columnsstr
): The task or instruction.str
): The step by step solution to the instruction. It will be generated if M is equal to 1.List[List[str]]
): A list of possible solutions to the instruction. It will be generated if M is greater than 1.str
): The name of the model used to generate the revision.Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Examples:
Generate the solution for a given instruction (prefer a stronger model here):
from distilabel.steps.tasks import MathShepherdGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n)\ntask = MathShepherdGenerator(\n name=\"golden_solution_generator\",\n llm=llm,\n)\n\ntask.load()\n\nresult = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n },\n ]\n )\n)\n# [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# 'golden_solution': '[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"]'}]]\n
Generate M completions for a given instruction (using structured output generation):
from distilabel.steps.tasks import MathShepherdGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 2048,\n },\n)\ntask = MathShepherdGenerator(\n name=\"solution_generator\",\n llm=llm,\n M=2,\n use_default_structured_output=True,\n)\n\ntask.load()\n\nresult = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n },\n ]\n )\n)\n# [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n
Source code in src/distilabel/steps/tasks/math_shepherd/generator.py
class MathShepherdGenerator(Task):\n \"\"\"Math Shepherd solution generator.\n\n This task is in charge of generating completions for a given instruction, in the format expected\n by the Math Shepherd Completer task. The attributes make the task flexible to be used with different\n types of dataset and LLMs, but we provide examples for the GSM8K and MATH datasets as presented\n in the original paper. Before modifying them, review the current defaults to ensure the completions\n are generated correctly. This task can be used to generate the golden solutions for a given problem if\n not provided, as well as possible solutions to be then labeled by the Math Shepherd Completer.\n Only one of `solutions` or `golden_solution` will be generated, depending on the value of M.\n\n Attributes:\n system_prompt: The system prompt to be used in the completions. The default one has been\n checked and generates good completions using Llama 3.1 with 8B and 70B,\n but it can be modified to adapt it to the model and dataset selected.\n Take into account that the system prompt includes 2 variables in the Jinja2 template,\n {{extra_rules}} and {{few_shot}}. These variables are used to include extra rules, for example\n to steer the model towards a specific type of responses, and few shots to add examples.\n They can be modified to adapt the system prompt to the dataset and model used without needing\n to change the full system prompt.\n extra_rules: This field can be used to insert extra rules relevant to the type of dataset.\n For example, in the original paper they used GSM8K and MATH datasets, and this field\n can be used to insert the rules for the GSM8K dataset.\n few_shots: Few shots to help the model generating the completions, write them in the\n format of the type of solutions wanted for your dataset.\n M: Number of completions to generate for each step. By default is set to 1, which will\n generate the \"golden_solution\". In this case select a stronger model, as it will be used\n as the source of true during labelling. If M is set to a number greater than 1, the task\n will generate a list of completions to be labeled by the Math Shepherd Completer task.\n\n Input columns:\n - instruction (`str`): The task or instruction.\n\n Output columns:\n - golden_solution (`str`): The step by step solution to the instruction.\n It will be generated if M is equal to 1.\n - solutions (`List[List[str]]`): A list of possible solutions to the instruction.\n It will be generated if M is greater than 1.\n - model_name (`str`): The name of the model used to generate the revision.\n\n Categories:\n - text-generation\n\n References:\n - [`Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations`](https://arxiv.org/abs/2312.08935)\n\n Examples:\n Generate the solution for a given instruction (prefer a stronger model here):\n\n ```python\n from distilabel.steps.tasks import MathShepherdGenerator\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n )\n task = MathShepherdGenerator(\n name=\"golden_solution_generator\",\n llm=llm,\n )\n\n task.load()\n\n result = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. 
She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n },\n ]\n )\n )\n # [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n # 'golden_solution': '[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\\\u2019s market.\", \"The answer is: 18\"]'}]]\n ```\n\n Generate M completions for a given instruction (using structured output generation):\n\n ```python\n from distilabel.steps.tasks import MathShepherdGenerator\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 2048,\n },\n )\n task = MathShepherdGenerator(\n name=\"solution_generator\",\n llm=llm,\n M=2,\n use_default_structured_output=True,\n )\n\n task.load()\n\n result = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n },\n ]\n )\n )\n # [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n # 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. 
+\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n ```\n \"\"\"\n\n system_prompt: Optional[str] = SYSTEM_PROMPT\n extra_rules: Optional[str] = RULES_GSM8K\n few_shots: Optional[str] = FEW_SHOTS_GSM8K\n M: Optional[PositiveInt] = None\n\n def load(self) -> None:\n super().load()\n if self.system_prompt is not None:\n self.system_prompt = Template(self.system_prompt).render(\n extra_rules=self.extra_rules or \"\",\n few_shots=self.few_shots or \"\",\n structured_prompt=SYSTEM_PROMPT_STRUCTURED\n if self.use_default_structured_output\n else \"\",\n )\n if self.use_default_structured_output:\n self._template = Template(TEMPLATE_STRUCTURED)\n else:\n self._template = Template(TEMPLATE)\n\n @property\n def inputs(self) -> \"StepColumns\":\n return [\"instruction\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n if self.M:\n return [\"solutions\", \"model_name\"]\n return [\"golden_solution\", \"model_name\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n messages = [\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n instruction=input[\"instruction\"],\n M=self.M,\n ),\n }\n ]\n if self.system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": self.system_prompt})\n return messages\n\n def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n ) -> Dict[str, Any]:\n output_name = \"solutions\" if self.M else \"golden_solution\"\n\n if output is None:\n input.update(**{output_name: None})\n return input\n\n if self.M:\n output_parsed = (\n self._format_structured_output(output)\n if self.use_default_structured_output\n else output.split(\"---\")\n )\n solutions = [split_solution_steps(o) for o in output_parsed]\n else:\n output_parsed = (\n self._format_structured_output(output)[0]\n if self.use_default_structured_output\n else output\n )\n solutions = split_solution_steps(output_parsed)\n\n input.update(**{output_name: solutions})\n return input\n\n @override\n def get_structured_output(self) -> dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel, Field\n\n class Solution(BaseModel):\n solution: str = Field(..., description=\"Step by step solution leading to the final answer\")\n\n class MathShepherdGenerator(BaseModel):\n solutions: list[Solution] = Field(..., description=\"List of solutions\")\n\n MathShepherdGenerator.model_json_schema()\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"$defs\": {\n \"Solution\": {\n \"properties\": {\n \"solution\": {\n \"description\": \"Step by step solution leading to the final answer\",\n \"title\": \"Solution\",\n \"type\": \"string\",\n }\n },\n \"required\": [\"solution\"],\n \"title\": \"Solution\",\n \"type\": \"object\",\n }\n },\n \"properties\": {\n \"solutions\": {\n \"description\": \"List of solutions\",\n \"items\": {\"$ref\": \"#/$defs/Solution\"},\n \"title\": \"Solutions\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"solutions\"],\n \"title\": \"MathShepherdGenerator\",\n \"type\": \"object\",\n }\n\n def _format_structured_output(self, output: str) -> list[str]:\n default_output = [\"\"] * self.M if self.M else [\"\"]\n if parsed_output := parse_json_response(output):\n solutions = parsed_output[\"solutions\"]\n extracted_solutions = 
[o[\"solution\"] for o in solutions]\n if len(extracted_solutions) != self.M:\n extracted_solutions = default_output\n return extracted_solutions\n return default_output\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdGenerator.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
The schema corresponds to the following:
from pydantic import BaseModel, Field\n\nclass Solution(BaseModel):\n solution: str = Field(..., description=\"Step by step solution leading to the final answer\")\n\nclass MathShepherdGenerator(BaseModel):\n solutions: list[Solution] = Field(..., description=\"List of solutions\")\n\nMathShepherdGenerator.model_json_schema()\n
Returns:
Type Descriptiondict[str, Any]
JSON Schema of the response to enforce.
Source code insrc/distilabel/steps/tasks/math_shepherd/generator.py
@override\ndef get_structured_output(self) -> dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel, Field\n\n class Solution(BaseModel):\n solution: str = Field(..., description=\"Step by step solution leading to the final answer\")\n\n class MathShepherdGenerator(BaseModel):\n solutions: list[Solution] = Field(..., description=\"List of solutions\")\n\n MathShepherdGenerator.model_json_schema()\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"$defs\": {\n \"Solution\": {\n \"properties\": {\n \"solution\": {\n \"description\": \"Step by step solution leading to the final answer\",\n \"title\": \"Solution\",\n \"type\": \"string\",\n }\n },\n \"required\": [\"solution\"],\n \"title\": \"Solution\",\n \"type\": \"object\",\n }\n },\n \"properties\": {\n \"solutions\": {\n \"description\": \"List of solutions\",\n \"items\": {\"$ref\": \"#/$defs/Solution\"},\n \"title\": \"Solutions\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"solutions\"],\n \"title\": \"MathShepherdGenerator\",\n \"type\": \"object\",\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.FormatPRM","title":"FormatPRM
","text":" Bases: Step
Helper step to transform the data into the format expected by the PRM model.
This step can be used to format the data in one of two formats: (1) following the format presented in peiyi9979/Math-Shepherd, in which case this step creates the columns input and label, where the input is the instruction with the solution (and the tag replaced by a token), and the label is the instruction with the solution, both separated by a newline; or (2) following TRL's format for training, which generates the columns prompt, completions, and labels. The labels correspond to the original tags replaced by boolean values, where True represents correct steps.
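As a minimal sketch of the tag-to-boolean mapping used by the TRL format (toy step strings, not the step's actual implementation), the trailing tag is stripped from each step and turned into a boolean, with the positive tag mapping to True:
tags = [\"+\", \"-\"]\nsteps = [\n    \"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\",\n    \"Step 2: Calculate the amount of white fiber needed: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\",\n]\ncompletions = [step.rsplit(\" \", 1)[0] for step in steps]        # steps without the trailing tag\nlabels = [step.rsplit(\" \", 1)[1] == tags[0] for step in steps]  # \"+\" -> True, \"-\" -> False\nprint(completions[0])\n# Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required.\nprint(labels)  # [True, True]\n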
Attributes:
Name Type Descriptionformat
Literal['math-shepherd', 'trl']
The format to use for the PRM model. \"math-shepherd\" corresponds to the original paper, while \"trl\" is a format prepared to train the model using TRL.
step_token
str
String that serves as a unique token denoting the position for predicting the step score.
tags
list[str]
List of tags that represent the correct and incorrect steps. This only needs to be provided if it differs from the default in MathShepherdCompleter
.
str
): The task or instruction.list[str]
): List of steps with a solution to the task.str
): The instruction with the solutions, where the label tags are replaced by a token.str
): The instruction with the solutions.str
): The instruction with the solutions, where the label tags are replaced by a token.List[str]
): The solution represented as a list of steps.List[bool]
): The labels, as a list of booleans, where True represents a good response.Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Examples:
Prepare your data to train a PRM model with the Math-Shepherd format:
from distilabel.steps.tasks import FormatPRM\nfrom distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(columns=[\"solutions\"])\nexpand_columns.load()\n\n# Define our PRM formatter\nformatter = FormatPRM()\nformatter.load()\n\n# Expand the solutions column as it comes from the MathShepherdCompleter\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"solutions\": [[\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"]]\n },\n ]\n )\n)\nresult = next(formatter.process(result))\n
Prepare your data to train a PRM model with the TRL format:
from distilabel.steps.tasks import FormatPRM\nfrom distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(columns=[\"solutions\"])\nexpand_columns.load()\n\n# Define our PRM formatter\nformatter = FormatPRM(format=\"trl\")\nformatter.load()\n\n# Expand the solutions column as it comes from the MathShepherdCompleter\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"solutions\": [[\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"]]\n },\n ]\n )\n)\n\nresult = next(formatter.process(result))\n# {\n# \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# \"solutions\": [\n# \"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\",\n# \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\",\n# \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. 
The answer is: 3 +\"\n# ],\n# \"prompt\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# \"completions\": [\n# \"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required.\",\n# \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber.\",\n# \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3\"\n# ],\n# \"labels\": [\n# true,\n# true,\n# true\n# ]\n# }\n
Citations:
```\n@misc{wang2024mathshepherdverifyreinforcellms,\n title={Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations},\n author={Peiyi Wang and Lei Li and Zhihong Shao and R. X. Xu and Damai Dai and Yifei Li and Deli Chen and Y. Wu and Zhifang Sui},\n year={2024},\n eprint={2312.08935},\n archivePrefix={arXiv},\n primaryClass={cs.AI},\n url={https://arxiv.org/abs/2312.08935},\n}\n```\n
Source code in src/distilabel/steps/tasks/math_shepherd/utils.py
class FormatPRM(Step):\n \"\"\"Helper step to transform the data into the format expected by the PRM model.\n\n This step can be used to format the data in one of 2 formats:\n Following the format presented\n in [peiyi9979/Math-Shepherd](https://huggingface.co/datasets/peiyi9979/Math-Shepherd?row=0),\n in which case this step creates the columns input and label, where the input is the instruction\n with the solution (and the tag replaced by a token), and the label is the instruction\n with the solution, both separated by a newline.\n Following TRL's format for training, which generates the columns prompt, completions, and labels.\n The labels correspond to the original tags replaced by boolean values, where True represents\n correct steps.\n\n Attributes:\n format: The format to use for the PRM model.\n \"math-shepherd\" corresponds to the original paper, while \"trl\" is a format\n prepared to train the model using TRL.\n step_token: String that serves as a unique token denoting the position\n for predicting the step score.\n tags: List of tags that represent the correct and incorrect steps.\n This only needs to be informed if it's different than the default in\n `MathShepherdCompleter`.\n\n Input columns:\n - instruction (`str`): The task or instruction.\n - solutions (`list[str]`): List of steps with a solution to the task.\n\n Output columns:\n - input (`str`): The instruction with the solutions, where the label tags\n are replaced by a token.\n - label (`str`): The instruction with the solutions.\n - prompt (`str`): The instruction with the solutions, where the label tags\n are replaced by a token.\n - completions (`List[str]`): The solution represented as a list of steps.\n - labels (`List[bool]`): The labels, as a list of booleans, where True represents\n a good response.\n\n Categories:\n - text-manipulation\n - columns\n\n References:\n - [`Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations`](https://arxiv.org/abs/2312.08935)\n - [peiyi9979/Math-Shepherd](https://huggingface.co/datasets/peiyi9979/Math-Shepherd?row=0)\n\n Examples:\n Prepare your data to train a PRM model with the Math-Shepherd format:\n\n ```python\n from distilabel.steps.tasks import FormatPRM\n from distilabel.steps import ExpandColumns\n\n expand_columns = ExpandColumns(columns=[\"solutions\"])\n expand_columns.load()\n\n # Define our PRM formatter\n formatter = FormatPRM()\n formatter.load()\n\n # Expand the solutions column as it comes from the MathShepherdCompleter\n result = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"solutions\": [[\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. 
+\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"]]\n },\n ]\n )\n )\n result = next(formatter.process(result))\n ```\n\n Prepare your data to train a PRM model with the TRL format:\n\n ```python\n from distilabel.steps.tasks import FormatPRM\n from distilabel.steps import ExpandColumns\n\n expand_columns = ExpandColumns(columns=[\"solutions\"])\n expand_columns.load()\n\n # Define our PRM formatter\n formatter = FormatPRM(format=\"trl\")\n formatter.load()\n\n # Expand the solutions column as it comes from the MathShepherdCompleter\n result = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"solutions\": [[\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. 
+\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"]]\n },\n ]\n )\n )\n\n result = next(formatter.process(result))\n # {\n # \"instruction\": \"Janet\\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n # \"solutions\": [\n # \"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\",\n # \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\",\n # \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"\n # ],\n # \"prompt\": \"Janet\\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n # \"completions\": [\n # \"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required.\",\n # \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber.\",\n # \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3\"\n # ],\n # \"labels\": [\n # true,\n # true,\n # true\n # ]\n # }\n ```\n\n Citations:\n\n ```\n @misc{wang2024mathshepherdverifyreinforcellms,\n title={Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations},\n author={Peiyi Wang and Lei Li and Zhihong Shao and R. X. Xu and Damai Dai and Yifei Li and Deli Chen and Y. 
Wu and Zhifang Sui},\n year={2024},\n eprint={2312.08935},\n archivePrefix={arXiv},\n primaryClass={cs.AI},\n url={https://arxiv.org/abs/2312.08935},\n }\n ```\n \"\"\"\n\n format: Literal[\"math-shepherd\", \"trl\"] = \"math-shepherd\"\n step_token: str = \"\u043a\u0438\"\n tags: list[str] = [\"+\", \"-\"]\n\n def model_post_init(self, __context: Any) -> None:\n super().model_post_init(__context)\n if self.format == \"math-shepherd\":\n self._formatter = self._format_math_shepherd\n else:\n self._formatter = self._format_trl\n\n @property\n def inputs(self) -> \"StepColumns\":\n return [\"instruction\", \"solutions\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n if self.format == \"math-shepherd\":\n return [\"input\", \"label\"]\n return [\"prompt\", \"completions\", \"labels\"]\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The process prepares the data for the `APIGenGenerator` task.\n\n If a single example is provided, it is copied to avoid raising an error.\n\n Args:\n inputs: A list of dictionaries with the input data.\n\n Yields:\n A list of dictionaries with the output data.\n \"\"\"\n for input in inputs:\n self._formatter(input)\n\n yield inputs # type: ignore\n\n def _format_math_shepherd(\n self, input: dict[str, str]\n ) -> dict[str, Union[str, list[str]]]:\n instruction = input[\"instruction\"]\n replaced = []\n # At this stage, the \"solutions\" column can only contain a single solution,\n # and the last item of each solution is the tag.\n solution = input[\"solutions\"]\n for step in solution:\n # Check there's a string, because the step that generated\n # the solutions could have failed, and we would have an empty list.\n replaced.append(step[:-1] + self.step_token if len(step) > 1 else step)\n\n input[\"input\"] = instruction + \" \" + \"\\n\".join(replaced)\n input[\"label\"] = instruction + \" \" + \"\\n\".join(solution)\n\n return input # type: ignore\n\n def _format_trl(\n self, input: dict[str, str]\n ) -> dict[str, Union[str, list[str], list[bool]]]:\n input[\"prompt\"] = input[\"instruction\"]\n completions: list[str] = []\n labels: list[bool] = []\n for step in input[\"solutions\"]:\n token = step[-1]\n completions.append(step[:-1].strip())\n labels.append(True if token == self.tags[0] else False)\n\n input[\"completions\"] = completions # type: ignore\n input[\"labels\"] = labels # type: ignore\n\n return input # type: ignore\n
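To make the difference between the two formats concrete, here is a standalone sketch of how a single tagged step is transformed (the step string is invented; the Cyrillic token below is the default step_token and "+" the default correct tag):

```python
step = "Step 1: Add the numbers: 2 + 2 = 4. The answer is: 4 +"

# format="math-shepherd": the trailing "+"/"-" tag is swapped for the step token
math_shepherd_step = step[:-1] + "ки"

# format="trl": the tag becomes a boolean label and the step text is kept clean
completion, label = step[:-1].strip(), step[-1] == "+"
# completion == "Step 1: Add the numbers: 2 + 2 = 4. The answer is: 4", label is True
```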
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.FormatPRM.process","title":"process(inputs)
","text":"The process prepares the data for the APIGenGenerator
task.
If a single example is provided, it is copied to avoid raising an error.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of dictionaries with the input data.
requiredYields:
Type DescriptionStepOutput
A list of dictionaries with the output data.
Source code in src/distilabel/steps/tasks/math_shepherd/utils.py
def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The process prepares the data for the `APIGenGenerator` task.\n\n If a single example is provided, it is copied to avoid raising an error.\n\n Args:\n inputs: A list of dictionaries with the input data.\n\n Yields:\n A list of dictionaries with the output data.\n \"\"\"\n for input in inputs:\n self._formatter(input)\n\n yield inputs # type: ignore\n
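As a quick usage sketch of process with the default math-shepherd format (the single, already-expanded row below is made up for illustration):

```python
from distilabel.steps.tasks import FormatPRM

formatter = FormatPRM()  # defaults to format="math-shepherd"
formatter.load()

# One already-expanded row: a single solution whose steps end with the "+"/"-" tag.
row = {
    "instruction": "Compute 2 + 2.",
    "solutions": ["Step 1: Add the numbers: 2 + 2 = 4. The answer is: 4 +"],
}
result = next(formatter.process([row]))
# result[0]["input"]: instruction plus the steps with the tag replaced by the step token
# result[0]["label"]: instruction plus the steps with the original "+"/"-" tags
```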
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.PairRM","title":"PairRM
","text":" Bases: Step
Rank the candidates based on the input using the LLM
model.
Attributes:
Name Type Descriptionmodel
str
The model to use for the ranking. Defaults to \"llm-blender/PairRM\"
.
instructions
Optional[str]
The instructions to use for the model. Defaults to None
.
Input columns:
- input (List[Dict[str, Any]]): The input text or conversation to rank the candidates for.
- candidates (List[Dict[str, Any]]): The candidates to rank.
Output columns:
- ranks (List[int]): The ranks of the candidates based on the input.
- ranked_candidates (List[Dict[str, Any]]): The candidates ranked based on the input.
- model_name (str): The model name used to rank the candidate responses. Defaults to \"llm-blender/PairRM\".
Note: This step differs from other tasks in that there is currently a single implementation of this model, and a specific LLM is used.
Examples:
Rank LLM candidates:
from distilabel.steps.tasks import PairRM\n\n# Consider this as a placeholder for your actual LLM.\npair_rm = PairRM()\n\npair_rm.load()\n\nresult = next(\n pair_rm.process(\n [\n {\"input\": \"Hello, how are you?\", \"candidates\": [\"fine\", \"good\", \"bad\"]},\n ]\n )\n)\n# result\n# [\n# {\n# 'input': 'Hello, how are you?',\n# 'candidates': ['fine', 'good', 'bad'],\n# 'ranks': [2, 1, 3],\n# 'ranked_candidates': ['good', 'fine', 'bad'],\n# 'model_name': 'llm-blender/PairRM',\n# }\n# ]\n
Citations @misc{jiang2023llmblenderensemblinglargelanguage,\n title={LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion},\n author={Dongfu Jiang and Xiang Ren and Bill Yuchen Lin},\n year={2023},\n eprint={2306.02561},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2306.02561},\n}\n
Source code in src/distilabel/steps/tasks/pair_rm.py
class PairRM(Step):\n \"\"\"Rank the candidates based on the input using the `LLM` model.\n\n Attributes:\n model: The model to use for the ranking. Defaults to `\"llm-blender/PairRM\"`.\n instructions: The instructions to use for the model. Defaults to `None`.\n\n Input columns:\n - inputs (`List[Dict[str, Any]]`): The input text or conversation to rank the candidates for.\n - candidates (`List[Dict[str, Any]]`): The candidates to rank.\n\n Output columns:\n - ranks (`List[int]`): The ranks of the candidates based on the input.\n - ranked_candidates (`List[Dict[str, Any]]`): The candidates ranked based on the input.\n - model_name (`str`): The model name used to rank the candidate responses. Defaults to `\"llm-blender/PairRM\"`.\n\n References:\n - [LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion](https://arxiv.org/abs/2306.02561).\n - [Pair Ranking Model](https://huggingface.co/llm-blender/PairRM).\n\n Categories:\n - preference\n\n Note:\n This step differs to other tasks as there is a single implementation of this model\n currently, and we will use a specific `LLM`.\n\n Examples:\n Rank LLM candidates:\n\n ```python\n from distilabel.steps.tasks import PairRM\n\n # Consider this as a placeholder for your actual LLM.\n pair_rm = PairRM()\n\n pair_rm.load()\n\n result = next(\n scorer.process(\n [\n {\"input\": \"Hello, how are you?\", \"candidates\": [\"fine\", \"good\", \"bad\"]},\n ]\n )\n )\n # result\n # [\n # {\n # 'input': 'Hello, how are you?',\n # 'candidates': ['fine', 'good', 'bad'],\n # 'ranks': [2, 1, 3],\n # 'ranked_candidates': ['good', 'fine', 'bad'],\n # 'model_name': 'llm-blender/PairRM',\n # }\n # ]\n ```\n\n Citations:\n ```\n @misc{jiang2023llmblenderensemblinglargelanguage,\n title={LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion},\n author={Dongfu Jiang and Xiang Ren and Bill Yuchen Lin},\n year={2023},\n eprint={2306.02561},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2306.02561},\n }\n ```\n \"\"\"\n\n model: str = \"llm-blender/PairRM\"\n instructions: Optional[str] = None\n\n def load(self) -> None:\n \"\"\"Loads the PairRM model provided via `model` with `llm_blender.Blender`, which is the\n custom library for running the inference for the PairRM models.\"\"\"\n try:\n import llm_blender\n except ImportError as e:\n raise ImportError(\n \"The `llm_blender` package is required to use the `PairRM` class.\"\n \"Please install it with `pip install git+https://github.com/yuchenlin/LLM-Blender.git`.\"\n ) from e\n\n self._blender = llm_blender.Blender()\n self._blender.loadranker(self.model)\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The input columns correspond to the two required arguments from `Blender.rank`:\n `inputs` and `candidates`.\"\"\"\n return [\"input\", \"candidates\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The outputs will include the `ranks` and the `ranked_candidates`.\"\"\"\n return [\"ranks\", \"ranked_candidates\", \"model_name\"]\n\n def format_input(self, input: Dict[str, Any]) -> Dict[str, Any]:\n \"\"\"The input is expected to be a dictionary with the keys `input` and `candidates`,\n where the `input` corresponds to the instruction of a model and `candidates` are a\n list of responses to be ranked.\n \"\"\"\n return {\"input\": input[\"input\"], \"candidates\": input[\"candidates\"]}\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Generates the ranks 
for the candidates based on the input.\n\n The ranks are the positions of the candidates, where lower is better,\n and the ranked candidates correspond to the candidates sorted according to the\n ranks obtained.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n An iterator with the inputs containing the `ranks`, `ranked_candidates`, and `model_name`.\n \"\"\"\n input_texts = []\n candidates = []\n for input in inputs:\n formatted_input = self.format_input(input)\n input_texts.append(formatted_input[\"input\"])\n candidates.append(formatted_input[\"candidates\"])\n\n instructions = (\n [self.instructions] * len(input_texts) if self.instructions else None\n )\n\n ranks = self._blender.rank(\n input_texts,\n candidates,\n instructions=instructions,\n return_scores=False,\n batch_size=self.input_batch_size,\n )\n # Sort the candidates based on the ranks\n ranked_candidates = np.take_along_axis(\n np.array(candidates), ranks - 1, axis=1\n ).tolist()\n ranks = ranks.tolist()\n for input, rank, ranked_candidate in zip(inputs, ranks, ranked_candidates):\n input[\"ranks\"] = rank\n input[\"ranked_candidates\"] = ranked_candidate\n input[\"model_name\"] = self.model\n\n yield inputs\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.PairRM.inputs","title":"inputs
property
","text":"The input columns correspond to the two required arguments from Blender.rank
: inputs
and candidates
.
outputs
property
","text":"The outputs will include the ranks
and the ranked_candidates
.
load()
","text":"Loads the PairRM model provided via model
with llm_blender.Blender
, which is the custom library for running the inference for the PairRM models.
src/distilabel/steps/tasks/pair_rm.py
def load(self) -> None:\n \"\"\"Loads the PairRM model provided via `model` with `llm_blender.Blender`, which is the\n custom library for running the inference for the PairRM models.\"\"\"\n try:\n import llm_blender\n except ImportError as e:\n raise ImportError(\n \"The `llm_blender` package is required to use the `PairRM` class.\"\n \"Please install it with `pip install git+https://github.com/yuchenlin/LLM-Blender.git`.\"\n ) from e\n\n self._blender = llm_blender.Blender()\n self._blender.loadranker(self.model)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.PairRM.format_input","title":"format_input(input)
","text":"The input is expected to be a dictionary with the keys input
and candidates
, where the input
corresponds to the instruction of a model and candidates
are a list of responses to be ranked.
src/distilabel/steps/tasks/pair_rm.py
def format_input(self, input: Dict[str, Any]) -> Dict[str, Any]:\n \"\"\"The input is expected to be a dictionary with the keys `input` and `candidates`,\n where the `input` corresponds to the instruction of a model and `candidates` are a\n list of responses to be ranked.\n \"\"\"\n return {\"input\": input[\"input\"], \"candidates\": input[\"candidates\"]}\n
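For instance, a valid row for this step could look as follows (values invented); format_input simply forwards the two keys to Blender.rank:

```python
row = {"input": "Summarize the paragraph.", "candidates": ["summary A", "summary B"]}
# format_input(row) returns the same two keys unchanged.
```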
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.PairRM.process","title":"process(inputs)
","text":"Generates the ranks for the candidates based on the input.
The ranks are the positions of the candidates, where lower is better, and the ranked candidates correspond to the candidates sorted according to the ranks obtained.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of Python dictionaries with the inputs of the task.
requiredYields:
Type DescriptionStepOutput
An iterator with the inputs containing the ranks
, ranked_candidates
, and model_name
.
src/distilabel/steps/tasks/pair_rm.py
def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Generates the ranks for the candidates based on the input.\n\n The ranks are the positions of the candidates, where lower is better,\n and the ranked candidates correspond to the candidates sorted according to the\n ranks obtained.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n An iterator with the inputs containing the `ranks`, `ranked_candidates`, and `model_name`.\n \"\"\"\n input_texts = []\n candidates = []\n for input in inputs:\n formatted_input = self.format_input(input)\n input_texts.append(formatted_input[\"input\"])\n candidates.append(formatted_input[\"candidates\"])\n\n instructions = (\n [self.instructions] * len(input_texts) if self.instructions else None\n )\n\n ranks = self._blender.rank(\n input_texts,\n candidates,\n instructions=instructions,\n return_scores=False,\n batch_size=self.input_batch_size,\n )\n # Sort the candidates based on the ranks\n ranked_candidates = np.take_along_axis(\n np.array(candidates), ranks - 1, axis=1\n ).tolist()\n ranks = ranks.tolist()\n for input, rank, ranked_candidate in zip(inputs, ranks, ranked_candidates):\n input[\"ranks\"] = rank\n input[\"ranked_candidates\"] = ranked_candidate\n input[\"model_name\"] = self.model\n\n yield inputs\n
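The candidate reordering at the end of process can be reproduced in isolation with numpy; a small sketch reusing the ranks from the example above:

```python
import numpy as np

candidates = [["fine", "good", "bad"]]
ranks = np.array([[2, 1, 3]])  # as returned by Blender.rank, lower is better

# ranks - 1 turns the 1-based ranks into 0-based indices and
# np.take_along_axis reorders each row of candidates accordingly.
ranked_candidates = np.take_along_axis(np.array(candidates), ranks - 1, axis=1).tolist()
# ranked_candidates == [["good", "fine", "bad"]]
```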
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.PrometheusEval","title":"PrometheusEval
","text":" Bases: Task
Critique and rank the quality of generations from an LLM
using Prometheus 2.0.
PrometheusEval
is a task created for Prometheus 2.0, covering both the absolute and relative evaluations. The absolute evaluation i.e. mode=\"absolute\"
is used to evaluate a single generation from an LLM for a given instruction. The relative evaluation i.e. mode=\"relative\"
is used to evaluate two generations from an LLM for a given instruction. Both evaluations can optionally compare against a reference answer, controlled by the reference
attribute, and both are based on a score rubric that critiques the generation/s based on the following default aspects: helpfulness
, harmlessness
, honesty
, factual-validity
, and reasoning
, that can be overridden via rubrics
, and the selected rubric is set via the attribute rubric
.
The PrometheusEval
task is intended to be used with any of the Prometheus 2.0 models released by Kaist AI: https://huggingface.co/prometheus-eval/prometheus-7b-v2.0 and https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0. The formatting and quality of the critique assessment are not guaranteed when using another model, even though some other models may still be able to follow the formatting and generate insightful critiques.
Attributes:
Name Type Descriptionmode
Literal['absolute', 'relative']
the evaluation mode to use, either absolute
or relative
. It defines whether the task will evaluate one or two generations.
rubric
str
the score rubric to use within the prompt to run the critique based on different aspects. Can be any existing key in the rubrics
attribute, which by default means that it can be: helpfulness
, harmlessness
, honesty
, factual-validity
, or reasoning
. Those will only work if using the default rubrics
, otherwise, the provided rubrics
should be used.
rubrics
Optional[Dict[str, str]]
a dictionary containing the different rubrics to use for the critique, where the keys are the rubric names and the values are the rubric descriptions. The default rubrics are the following: helpfulness
, harmlessness
, honesty
, factual-validity
, and reasoning
.
reference
bool
a boolean flag to indicate whether a reference answer / completion will be provided, so that the model critique is based on the comparison with it. It implies that the column reference
needs to be provided within the input data in addition to the rest of the inputs.
_template
Union[Template, None]
a Jinja2 template used to format the input for the LLM.
Input columns:
- instruction (str): The instruction to use as reference.
- generation (str, optional): The generated text from the given instruction. This column is required if mode=absolute.
- generations (List[str], optional): The generated texts from the given instruction. It should contain 2 generations only. This column is required if mode=relative.
- reference (str, optional): The reference / golden answer for the instruction, to be used by the LLM for comparison against.
Output columns:
- feedback (str): The feedback explaining the result, as critiqued by the LLM using the pre-defined score rubric, compared against reference if provided.
- result (Union[int, Literal[\"A\", \"B\"]]): If mode=absolute, the result contains the score for the generation on a Likert scale from 1-5; if mode=relative, the result contains either \"A\" or \"B\", the \"winning\" one being the generation at index 0 of generations if result='A' or at index 1 if result='B'.
- model_name (str): The model name used to generate the feedback and result.
Examples:
Critique and evaluate LLM generation quality using Prometheus 2.0:
from distilabel.steps.tasks import PrometheusEval\nfrom distilabel.models import vLLM\n\n# Consider this as a placeholder for your actual LLM.\nprometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"factual-validity\"\n)\n\nprometheus.load()\n\nresult = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", \"generation\": \"something done\"},\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'make something',\n# 'generation': 'something done',\n# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n# 'feedback': 'the feedback',\n# 'result': 6,\n# }\n# ]\n
Critique for relative evaluation:
from distilabel.steps.tasks import PrometheusEval\nfrom distilabel.models import vLLM\n\n# Consider this as a placeholder for your actual LLM.\nprometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]\",\n ),\n mode=\"relative\",\n rubric=\"honesty\"\n)\n\nprometheus.load()\n\nresult = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", \"generations\": [\"something done\", \"other thing\"]},\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'make something',\n# 'generations': ['something done', 'other thing'],\n# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n# 'feedback': 'the feedback',\n# 'result': 'something done',\n# }\n# ]\n
Critique with a custom rubric:
from distilabel.steps.tasks import PrometheusEval\nfrom distilabel.models import vLLM\n\n# Consider this as a placeholder for your actual LLM.\nprometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"custom\",\n rubrics={\n \"custom\": \"[A]\\nScore 1: A\\nScore 2: B\\nScore 3: C\\nScore 4: D\\nScore 5: E\"\n }\n)\n\nprometheus.load()\n\nresult = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", \"generation\": \"something done\"},\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'make something',\n# 'generation': 'something done',\n# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n# 'feedback': 'the feedback',\n# 'result': 6,\n# }\n# ]\n
Critique using a reference answer:
from distilabel.steps.tasks import PrometheusEval\nfrom distilabel.models import vLLM\n\n# Consider this as a placeholder for your actual LLM.\nprometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"helpfulness\",\n reference=True,\n)\n\nprometheus.load()\n\nresult = next(\n prometheus.process(\n [\n {\n \"instruction\": \"make something\",\n \"generation\": \"something done\",\n \"reference\": \"this is a reference answer\",\n },\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'make something',\n# 'generation': 'something done',\n# 'reference': 'this is a reference answer',\n# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n# 'feedback': 'the feedback',\n# 'result': 6,\n# }\n# ]\n
Citations @misc{kim2024prometheus2opensource,\n title={Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models},\n author={Seungone Kim and Juyoung Suk and Shayne Longpre and Bill Yuchen Lin and Jamin Shin and Sean Welleck and Graham Neubig and Moontae Lee and Kyungjae Lee and Minjoon Seo},\n year={2024},\n eprint={2405.01535},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2405.01535},\n}\n
Source code in src/distilabel/steps/tasks/prometheus_eval.py
class PrometheusEval(Task):\n \"\"\"Critique and rank the quality of generations from an `LLM` using Prometheus 2.0.\n\n `PrometheusEval` is a task created for Prometheus 2.0, covering both the absolute and relative\n evaluations. The absolute evaluation i.e. `mode=\"absolute\"` is used to evaluate a single generation from\n an LLM for a given instruction. The relative evaluation i.e. `mode=\"relative\"` is used to evaluate two generations from an LLM\n for a given instruction.\n Both evaluations provide the possibility of using a reference answer to compare with or withoug\n the `reference` attribute, and both are based on a score rubric that critiques the generation/s\n based on the following default aspects: `helpfulness`, `harmlessness`, `honesty`, `factual-validity`,\n and `reasoning`, that can be overridden via `rubrics`, and the selected rubric is set via the attribute\n `rubric`.\n\n Note:\n The `PrometheusEval` task is better suited and intended to be used with any of the Prometheus 2.0\n models released by Kaist AI, being: https://huggingface.co/prometheus-eval/prometheus-7b-v2.0,\n and https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0. The critique assessment formatting\n and quality is not guaranteed if using another model, even though some other models may be able to\n correctly follow the formatting and generate insightful critiques too.\n\n Attributes:\n mode: the evaluation mode to use, either `absolute` or `relative`. It defines whether the task\n will evaluate one or two generations.\n rubric: the score rubric to use within the prompt to run the critique based on different aspects.\n Can be any existing key in the `rubrics` attribute, which by default means that it can be:\n `helpfulness`, `harmlessness`, `honesty`, `factual-validity`, or `reasoning`. Those will only\n work if using the default `rubrics`, otherwise, the provided `rubrics` should be used.\n rubrics: a dictionary containing the different rubrics to use for the critique, where the keys are\n the rubric names and the values are the rubric descriptions. The default rubrics are the following:\n `helpfulness`, `harmlessness`, `honesty`, `factual-validity`, and `reasoning`.\n reference: a boolean flag to indicate whether a reference answer / completion will be provided, so\n that the model critique is based on the comparison with it. It implies that the column `reference`\n needs to be provided within the input data in addition to the rest of the inputs.\n _template: a Jinja2 template used to format the input for the LLM.\n\n Input columns:\n - instruction (`str`): The instruction to use as reference.\n - generation (`str`, optional): The generated text from the given `instruction`. This column is required\n if `mode=absolute`.\n - generations (`List[str]`, optional): The generated texts from the given `instruction`. It should\n contain 2 generations only. 
This column is required if `mode=relative`.\n - reference (`str`, optional): The reference / golden answer for the `instruction`, to be used by the LLM\n for comparison against.\n\n Output columns:\n - feedback (`str`): The feedback explaining the result below, as critiqued by the LLM using the\n pre-defined score rubric, compared against `reference` if provided.\n - result (`Union[int, Literal[\"A\", \"B\"]]`): If `mode=absolute`, then the result contains the score for the\n `generation` in a likert-scale from 1-5, otherwise, if `mode=relative`, then the result contains either\n \"A\" or \"B\", the \"winning\" one being the generation in the index 0 of `generations` if `result='A'` or the\n index 1 if `result='B'`.\n - model_name (`str`): The model name used to generate the `feedback` and `result`.\n\n Categories:\n - critique\n - preference\n\n References:\n - [Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models](https://arxiv.org/abs/2405.01535)\n - [prometheus-eval: Evaluate your LLM's response with Prometheus \ud83d\udcaf](https://github.com/prometheus-eval/prometheus-eval)\n\n Examples:\n Critique and evaluate LLM generation quality using Prometheus 2_0:\n\n ```python\n from distilabel.steps.tasks import PrometheusEval\n from distilabel.models import vLLM\n\n # Consider this as a placeholder for your actual LLM.\n prometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\\\"content\\\" }}\\\\n{{ messages[1]\\\"content\\\" }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"factual-validity\"\n )\n\n prometheus.load()\n\n result = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", \"generation\": \"something done\"},\n ]\n )\n )\n # result\n # [\n # {\n # 'instruction': 'make something',\n # 'generation': 'something done',\n # 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n # 'feedback': 'the feedback',\n # 'result': 6,\n # }\n # ]\n ```\n\n Critique for relative evaluation:\n\n ```python\n from distilabel.steps.tasks import PrometheusEval\n from distilabel.models import vLLM\n\n # Consider this as a placeholder for your actual LLM.\n prometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\\\"content\\\" }}\\\\n{{ messages[1]\\\"content\\\" }}[/INST]\",\n ),\n mode=\"relative\",\n rubric=\"honesty\"\n )\n\n prometheus.load()\n\n result = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", \"generations\": [\"something done\", \"other thing\"]},\n ]\n )\n )\n # result\n # [\n # {\n # 'instruction': 'make something',\n # 'generations': ['something done', 'other thing'],\n # 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n # 'feedback': 'the feedback',\n # 'result': 'something done',\n # }\n # ]\n ```\n\n Critique with a custom rubric:\n\n ```python\n from distilabel.steps.tasks import PrometheusEval\n from distilabel.models import vLLM\n\n # Consider this as a placeholder for your actual LLM.\n prometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\\\"content\\\" }}\\\\n{{ messages[1]\\\"content\\\" }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"custom\",\n rubrics={\n \"custom\": \"[A]\\\\nScore 1: A\\\\nScore 2: B\\\\nScore 3: C\\\\nScore 4: D\\\\nScore 5: E\"\n }\n )\n\n prometheus.load()\n\n result = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", 
\"generation\": \"something done\"},\n ]\n )\n )\n # result\n # [\n # {\n # 'instruction': 'make something',\n # 'generation': 'something done',\n # 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n # 'feedback': 'the feedback',\n # 'result': 6,\n # }\n # ]\n ```\n\n Critique using a reference answer:\n\n ```python\n from distilabel.steps.tasks import PrometheusEval\n from distilabel.models import vLLM\n\n # Consider this as a placeholder for your actual LLM.\n prometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\\\"content\\\" }}\\\\n{{ messages[1]\\\"content\\\" }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"helpfulness\",\n reference=True,\n )\n\n prometheus.load()\n\n result = next(\n prometheus.process(\n [\n {\n \"instruction\": \"make something\",\n \"generation\": \"something done\",\n \"reference\": \"this is a reference answer\",\n },\n ]\n )\n )\n # result\n # [\n # {\n # 'instruction': 'make something',\n # 'generation': 'something done',\n # 'reference': 'this is a reference answer',\n # 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n # 'feedback': 'the feedback',\n # 'result': 6,\n # }\n # ]\n ```\n\n Citations:\n ```\n @misc{kim2024prometheus2opensource,\n title={Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models},\n author={Seungone Kim and Juyoung Suk and Shayne Longpre and Bill Yuchen Lin and Jamin Shin and Sean Welleck and Graham Neubig and Moontae Lee and Kyungjae Lee and Minjoon Seo},\n year={2024},\n eprint={2405.01535},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2405.01535},\n }\n ```\n \"\"\"\n\n mode: Literal[\"absolute\", \"relative\"]\n rubric: str\n rubrics: Optional[Dict[str, str]] = Field(default=_DEFAULT_RUBRICS)\n reference: bool = False\n\n _template: Union[Template, None] = PrivateAttr(...)\n\n @model_validator(mode=\"after\")\n def validate_rubric_and_rubrics(self) -> Self:\n if not isinstance(self.rubrics, dict) or len(self.rubrics) < 1:\n raise DistilabelUserError(\n \"Provided `rubrics` must be a Python dictionary with string keys and string values.\",\n page=\"components-gallery/tasks/prometheuseval/\",\n )\n\n def rubric_matches_pattern(rubric: str) -> bool:\n \"\"\"Checks if the provided rubric matches the pattern of the default rubrics.\"\"\"\n pattern = r\"^\\[.*?\\]\\n(?:Score [1-4]: .*?\\n){4}(?:Score 5: .*?)\"\n return bool(re.match(pattern, rubric, re.MULTILINE))\n\n if not all(rubric_matches_pattern(value) for value in self.rubrics.values()):\n raise DistilabelUserError(\n \"Provided rubrics should match the format of the default rubrics, which\"\n \" is as follows: `[<scoring criteria>]\\nScore 1: <description>\\nScore 2: <description>\\n\"\n \"Score 3: <description>\\nScore 4: <description>\\nScore 5: <description>`; replacing\"\n \" `<scoring criteria>` and `<description>` with the actual criteria and description\"\n \" for each or the scores, respectively.\",\n page=\"components-gallery/tasks/prometheuseval/\",\n )\n\n if self.rubric not in self.rubrics:\n raise DistilabelUserError(\n f\"Provided rubric '{self.rubric}' is not among the available rubrics: {', '.join(self.rubrics.keys())}.\",\n page=\"components-gallery/tasks/prometheuseval/\",\n )\n\n return self\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template for Prometheus 2.0 either absolute or relative evaluation\n depending on the `mode` value, and either with or without reference, depending on the\n value of 
`reference`.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"prometheus\"\n / (\n f\"{self.mode}_without_reference.jinja2\"\n if self.reference is False\n else f\"{self.mode}_with_reference.jinja2\"\n )\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The default inputs for the task are the `instruction` and the `generation`\n if `reference=False`, otherwise, the inputs are `instruction`, `generation`, and\n `reference`.\"\"\"\n if self.mode == \"absolute\":\n if self.reference:\n return [\"instruction\", \"generation\", \"reference\"]\n return [\"instruction\", \"generation\"]\n else:\n if self.reference:\n return [\"instruction\", \"generations\", \"reference\"]\n return [\"instruction\", \"generations\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` where the prompt is formatted according\n to the selected Jinja2 template for Prometheus 2.0, assuming that's the first interaction\n from the user, including a pre-defined system prompt.\"\"\"\n template_kwargs = {\n \"instruction\": input[\"instruction\"],\n \"rubric\": self.rubrics[self.rubric],\n }\n if self.reference:\n template_kwargs[\"reference\"] = input[\"reference\"]\n\n if self.mode == \"absolute\":\n if not isinstance(input[\"generation\"], str):\n raise DistilabelUserError(\n f\"Provided `generation` is of type {type(input['generation'])} but a string\"\n \" should be provided instead.\",\n page=\"components-gallery/tasks/prometheuseval/\",\n )\n\n template_kwargs[\"generation\"] = input[\"generation\"]\n system_message = (\n \"You are a fair judge assistant tasked with providing clear, objective feedback based\"\n \" on specific criteria, ensuring each assessment reflects the absolute standards set\"\n \" for performance.\"\n )\n else: # self.mode == \"relative\"\n if (\n not isinstance(input[\"generations\"], list)\n or not all(\n isinstance(generation, str) for generation in input[\"generations\"]\n )\n or len(input[\"generations\"]) != 2\n ):\n raise DistilabelUserError(\n f\"Provided `generations` is of type {type(input['generations'])} but a list of strings with length 2 should be provided instead.\",\n page=\"components-gallery/tasks/prometheuseval/\",\n )\n\n template_kwargs[\"generations\"] = input[\"generations\"]\n system_message = (\n \"You are a fair judge assistant assigned to deliver insightful feedback that compares\"\n \" individual performances, highlighting how each stands relative to others within the\"\n \" same cohort.\"\n )\n\n return [\n {\n \"role\": \"system\",\n \"content\": system_message,\n },\n {\n \"role\": \"user\",\n \"content\": self._template.render(**template_kwargs), # type: ignore\n },\n ]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task are the `feedback` and the `result` generated by Prometheus,\n as well as the `model_name` which is automatically included based on the `LLM` used.\n \"\"\"\n return [\"feedback\", \"result\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dict with the keys `feedback` and `result` captured\n using a regex from the Prometheus output.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. 
Optionally provided in case it's useful to build the output.\n\n Returns:\n A dict with the keys `feedback` and `result` generated by the LLM.\n \"\"\"\n if output is None:\n return {\"feedback\": None, \"result\": None}\n\n parts = output.split(\"[RESULT]\")\n if len(parts) != 2:\n return {\"feedback\": None, \"result\": None}\n\n feedback, result = parts[0].strip(), parts[1].strip()\n if feedback.startswith(\"Feedback:\"):\n feedback = feedback[len(\"Feedback:\") :].strip()\n if self.mode == \"absolute\":\n if not result.isdigit() or result not in [\"1\", \"2\", \"3\", \"4\", \"5\"]:\n return {\"feedback\": None, \"result\": None}\n return {\"feedback\": feedback, \"result\": int(result)}\n else: # self.mode == \"relative\"\n if result not in [\"A\", \"B\"]:\n return {\"feedback\": None, \"result\": None}\n return {\"feedback\": feedback, \"result\": result}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.PrometheusEval.inputs","title":"inputs
property
","text":"The default inputs for the task are the instruction
and the generation
if reference=False
, otherwise, the inputs are instruction
, generation
, and reference
.
outputs
property
","text":"The output for the task are the feedback
and the result
generated by Prometheus, as well as the model_name
which is automatically included based on the LLM
used.
load()
","text":"Loads the Jinja2 template for Prometheus 2.0 either absolute or relative evaluation depending on the mode
value, and either with or without reference, depending on the value of reference
.
src/distilabel/steps/tasks/prometheus_eval.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template for Prometheus 2.0 either absolute or relative evaluation\n depending on the `mode` value, and either with or without reference, depending on the\n value of `reference`.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"prometheus\"\n / (\n f\"{self.mode}_without_reference.jinja2\"\n if self.reference is False\n else f\"{self.mode}_with_reference.jinja2\"\n )\n )\n\n self._template = Template(open(_path).read())\n
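In other words, the template file is selected purely from mode and reference; a sketch of just that selection logic with example values:

```python
mode, reference = "absolute", True  # example values

template_name = (
    f"{mode}_with_reference.jinja2" if reference else f"{mode}_without_reference.jinja2"
)
# -> "absolute_with_reference.jinja2", loaded from distilabel's bundled
#    steps/tasks/templates/prometheus/ directory.
```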
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.PrometheusEval.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
where the prompt is formatted according to the selected Jinja2 template for Prometheus 2.0, assuming that's the first interaction from the user, including a pre-defined system prompt.
src/distilabel/steps/tasks/prometheus_eval.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` where the prompt is formatted according\n to the selected Jinja2 template for Prometheus 2.0, assuming that's the first interaction\n from the user, including a pre-defined system prompt.\"\"\"\n template_kwargs = {\n \"instruction\": input[\"instruction\"],\n \"rubric\": self.rubrics[self.rubric],\n }\n if self.reference:\n template_kwargs[\"reference\"] = input[\"reference\"]\n\n if self.mode == \"absolute\":\n if not isinstance(input[\"generation\"], str):\n raise DistilabelUserError(\n f\"Provided `generation` is of type {type(input['generation'])} but a string\"\n \" should be provided instead.\",\n page=\"components-gallery/tasks/prometheuseval/\",\n )\n\n template_kwargs[\"generation\"] = input[\"generation\"]\n system_message = (\n \"You are a fair judge assistant tasked with providing clear, objective feedback based\"\n \" on specific criteria, ensuring each assessment reflects the absolute standards set\"\n \" for performance.\"\n )\n else: # self.mode == \"relative\"\n if (\n not isinstance(input[\"generations\"], list)\n or not all(\n isinstance(generation, str) for generation in input[\"generations\"]\n )\n or len(input[\"generations\"]) != 2\n ):\n raise DistilabelUserError(\n f\"Provided `generations` is of type {type(input['generations'])} but a list of strings with length 2 should be provided instead.\",\n page=\"components-gallery/tasks/prometheuseval/\",\n )\n\n template_kwargs[\"generations\"] = input[\"generations\"]\n system_message = (\n \"You are a fair judge assistant assigned to deliver insightful feedback that compares\"\n \" individual performances, highlighting how each stands relative to others within the\"\n \" same cohort.\"\n )\n\n return [\n {\n \"role\": \"system\",\n \"content\": system_message,\n },\n {\n \"role\": \"user\",\n \"content\": self._template.render(**template_kwargs), # type: ignore\n },\n ]\n
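The returned ChatType is a plain list of role/content message dictionaries; schematically (contents abbreviated, not the literal prompt):

```python
chat = [
    {"role": "system", "content": "You are a fair judge assistant ..."},
    {"role": "user", "content": "<Jinja2 template rendered with instruction, rubric and generation(s)>"},
]
```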
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.PrometheusEval.format_output","title":"format_output(output, input)
","text":"The output is formatted as a dict with the keys feedback
and result
captured using a regex from the Prometheus output.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
the raw output of the LLM.
requiredinput
Dict[str, Any]
the input to the task. Optionally provided in case it's useful to build the output.
requiredReturns:
Type DescriptionDict[str, Any]
A dict with the keys feedback
and result
generated by the LLM.
src/distilabel/steps/tasks/prometheus_eval.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dict with the keys `feedback` and `result` captured\n using a regex from the Prometheus output.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Optionally provided in case it's useful to build the output.\n\n Returns:\n A dict with the keys `feedback` and `result` generated by the LLM.\n \"\"\"\n if output is None:\n return {\"feedback\": None, \"result\": None}\n\n parts = output.split(\"[RESULT]\")\n if len(parts) != 2:\n return {\"feedback\": None, \"result\": None}\n\n feedback, result = parts[0].strip(), parts[1].strip()\n if feedback.startswith(\"Feedback:\"):\n feedback = feedback[len(\"Feedback:\") :].strip()\n if self.mode == \"absolute\":\n if not result.isdigit() or result not in [\"1\", \"2\", \"3\", \"4\", \"5\"]:\n return {\"feedback\": None, \"result\": None}\n return {\"feedback\": feedback, \"result\": int(result)}\n else: # self.mode == \"relative\"\n if result not in [\"A\", \"B\"]:\n return {\"feedback\": None, \"result\": None}\n return {\"feedback\": feedback, \"result\": result}\n
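As a quick illustration of the parsing above, a made-up raw output in absolute mode would be handled like this (same logic, sketched outside the class):

```python
raw = "Feedback: The answer is factually correct and well argued. [RESULT] 5"

feedback, result = [part.strip() for part in raw.split("[RESULT]")]
if feedback.startswith("Feedback:"):
    feedback = feedback[len("Feedback:"):].strip()
parsed = {"feedback": feedback, "result": int(result)}
# parsed == {"feedback": "The answer is factually correct and well argued.", "result": 5}
```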
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.QualityScorer","title":"QualityScorer
","text":" Bases: Task
Score responses based on their quality using an LLM
.
QualityScorer
is a pre-defined task that defines the instruction
as the input and score
as the output. This task is used to rate the quality of instructions and responses. It's an implementation of the quality score task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'. The task follows the same scheme as the Complexity Scorer, but the instruction-response pairs are scored in terms of quality, obtaining a quality score for each instruction-response pair.
Attributes:
Name Type Description_template
Union[Template, None]
a Jinja2 template used to format the input for the LLM.
Input columns:
- instruction (str): The instruction that was used to generate the responses.
- responses (List[str]): The responses to be scored. Each response forms a pair with the instruction.
Output columns:
- scores (List[float]): The score for each instruction-response pair.
- model_name (str): The model name used to generate the scores.
References: What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
Examples:
Evaluate the quality of your instructions:
from distilabel.steps.tasks import QualityScorer\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nscorer = QualityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n)\n\nscorer.load()\n\nresult = next(\n scorer.process(\n [\n {\n \"instruction\": \"instruction\",\n \"responses\": [\"good response\", \"weird response\", \"bad response\"]\n }\n ]\n )\n)\n# result\n[\n {\n 'instructions': 'instruction',\n 'model_name': 'test',\n 'scores': [5, 3, 1],\n }\n]\n
Generate structured output with default schema:
from distilabel.steps.tasks import QualityScorer\nfrom distilabel.models import InferenceEndpointsLLM\n\nscorer = QualityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n use_default_structured_output=True\n)\n\nscorer.load()\n\nresult = next(\n scorer.process(\n [\n {\n \"instruction\": \"instruction\",\n \"responses\": [\"good response\", \"weird response\", \"bad response\"]\n }\n ]\n )\n)\n\n# result\n[{'instruction': 'instruction',\n'responses': ['good response', 'weird response', 'bad response'],\n'scores': [1, 2, 3],\n'distilabel_metadata': {'raw_output_quality_scorer_0': '{ \"scores\": [1, 2, 3] }'},\n'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
Citations @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n}\n
Source code in src/distilabel/steps/tasks/quality_scorer.py
class QualityScorer(Task):\n \"\"\"Score responses based on their quality using an `LLM`.\n\n `QualityScorer` is a pre-defined task that defines the `instruction` as the input\n and `score` as the output. This task is used to rate the quality of instructions and responses.\n It's an implementation of the quality score task from the paper 'What Makes Good Data\n for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.\n The task follows the same scheme as the Complexity Scorer, but the instruction-response pairs\n are scored in terms of quality, obtaining a quality score for each instruction.\n\n Attributes:\n _template: a Jinja2 template used to format the input for the LLM.\n\n Input columns:\n - instruction (`str`): The instruction that was used to generate the `responses`.\n - responses (`List[str]`): The responses to be scored. Each response forms a pair with the instruction.\n\n Output columns:\n - scores (`List[float]`): The score for each instruction.\n - model_name (`str`): The model name used to generate the scores.\n\n Categories:\n - scorer\n - quality\n - response\n\n References:\n - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)\n\n Examples:\n Evaluate the quality of your instructions:\n\n ```python\n from distilabel.steps.tasks import QualityScorer\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n scorer = QualityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n )\n\n scorer.load()\n\n result = next(\n scorer.process(\n [\n {\n \"instruction\": \"instruction\",\n \"responses\": [\"good response\", \"weird response\", \"bad response\"]\n }\n ]\n )\n )\n # result\n [\n {\n 'instructions': 'instruction',\n 'model_name': 'test',\n 'scores': [5, 3, 1],\n }\n ]\n ```\n\n Generate structured output with default schema:\n\n ```python\n from distilabel.steps.tasks import QualityScorer\n from distilabel.models import InferenceEndpointsLLM\n\n scorer = QualityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n use_default_structured_output=True\n )\n\n scorer.load()\n\n result = next(\n scorer.process(\n [\n {\n \"instruction\": \"instruction\",\n \"responses\": [\"good response\", \"weird response\", \"bad response\"]\n }\n ]\n )\n )\n\n # result\n [{'instruction': 'instruction',\n 'responses': ['good response', 'weird response', 'bad response'],\n 'scores': [1, 2, 3],\n 'distilabel_metadata': {'raw_output_quality_scorer_0': '{ \"scores\": [1, 2, 3] }'},\n 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n\n Citations:\n ```\n @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? 
A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n }\n ```\n \"\"\"\n\n _template: Union[Template, None] = PrivateAttr(...)\n _can_be_used_with_offline_batch_generation = True\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"quality-scorer.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs for the task are `instruction` and `responses`.\"\"\"\n return [\"instruction\", \"responses\"]\n\n def format_input(self, input: Dict[str, Any]) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n instruction=input[\"instruction\"], responses=input[\"responses\"]\n ),\n }\n ]\n\n @property\n def outputs(self):\n \"\"\"The output for the task is a list of `scores` containing the quality score for each\n response in `responses`.\"\"\"\n return [\"scores\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction-response pair.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with the key `scores` containing the scores for each instruction-response pair.\n \"\"\"\n if output is None:\n return {\"scores\": [None] * len(input[\"responses\"])}\n\n if self.use_default_structured_output:\n return self._format_structured_output(output, input)\n\n scores = []\n score_lines = output.split(\"\\n\")\n\n for i, line in enumerate(score_lines):\n match = _PARSE_SCORE_LINE_REGEX.match(line)\n score = float(match.group(1)) if match else None\n scores.append(score)\n if i == len(input[\"responses\"]) - 1:\n break\n return {\"scores\": scores}\n\n @override\n def get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel\n from typing import List\n\n class SchemaQualityScorer(BaseModel):\n scores: List[int]\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"properties\": {\n \"scores\": {\n \"items\": {\"type\": \"integer\"},\n \"title\": \"Scores\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"scores\"],\n \"title\": \"SchemaQualityScorer\",\n \"type\": \"object\",\n }\n\n def _format_structured_output(\n self, output: str, input: Dict[str, Any]\n ) -> Dict[str, str]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with the scores, and a list with them.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n return {\"scores\": [None] * len(input[\"responses\"])}\n\n @override\n def _sample_input(self) -> 
ChatType:\n return self.format_input(\n {\n \"instruction\": f\"<PLACEHOLDER_{'instruction'.upper()}>\",\n \"responses\": [\n f\"<PLACEHOLDER_{f'RESPONSE_{i}'.upper()}>\" for i in range(2)\n ],\n }\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.QualityScorer.inputs","title":"inputs
property
","text":"The inputs for the task are instruction
and responses
.
outputs
property
","text":"The output for the task is a list of scores
containing the quality score for each response in responses
.
load()
","text":"Loads the Jinja2 template.
Source code insrc/distilabel/steps/tasks/quality_scorer.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"quality-scorer.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.QualityScorer.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/quality_scorer.py
def format_input(self, input: Dict[str, Any]) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n instruction=input[\"instruction\"], responses=input[\"responses\"]\n ),\n }\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.QualityScorer.format_output","title":"format_output(output, input)
","text":"The output is formatted as a list with the score of each instruction-response pair.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
the raw output of the LLM.
requiredinput
Dict[str, Any]
the input to the task. Used for obtaining the number of responses.
requiredReturns:
Type DescriptionDict[str, Any]
A dict with the key scores
containing the scores for each instruction-response pair.
src/distilabel/steps/tasks/quality_scorer.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction-response pair.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with the key `scores` containing the scores for each instruction-response pair.\n \"\"\"\n if output is None:\n return {\"scores\": [None] * len(input[\"responses\"])}\n\n if self.use_default_structured_output:\n return self._format_structured_output(output, input)\n\n scores = []\n score_lines = output.split(\"\\n\")\n\n for i, line in enumerate(score_lines):\n match = _PARSE_SCORE_LINE_REGEX.match(line)\n score = float(match.group(1)) if match else None\n scores.append(score)\n if i == len(input[\"responses\"]) - 1:\n break\n return {\"scores\": scores}\n
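As a rough illustration of the line-by-line parsing above, here is a standalone sketch that is not distilabel code: the real pattern lives in _PARSE_SCORE_LINE_REGEX inside quality_scorer.py, so the regex below is only an assumption that each line ends with a numeric score:
import re\nfrom typing import List, Optional\n\n# Assumed pattern, for illustration only: capture a trailing number such as \"Score: 4\".\n_SCORE_AT_END = re.compile(r\"(\\d+(?:\\.\\d+)?)\\s*$\")\n\ndef parse_scores(raw: str, n_responses: int) -> List[Optional[float]]:\n    scores: List[Optional[float]] = []\n    for i, line in enumerate(raw.split(\"\\n\")):\n        match = _SCORE_AT_END.search(line)\n        scores.append(float(match.group(1)) if match else None)\n        if i == n_responses - 1:  # never return more scores than responses\n            break\n    return scores\n\nprint(parse_scores(\"Score: 5\\nScore: 3\\nScore: 1\", n_responses=3))  # [5.0, 3.0, 1.0]\n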
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.QualityScorer.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
The schema corresponds to the following:
from pydantic import BaseModel\nfrom typing import List\n\nclass SchemaQualityScorer(BaseModel):\n scores: List[int]\n
Returns:
Type DescriptionDict[str, Any]
JSON Schema of the response to enforce.
Source code insrc/distilabel/steps/tasks/quality_scorer.py
@override\ndef get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel\n from typing import List\n\n class SchemaQualityScorer(BaseModel):\n scores: List[int]\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"properties\": {\n \"scores\": {\n \"items\": {\"type\": \"integer\"},\n \"title\": \"Scores\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"scores\"],\n \"title\": \"SchemaQualityScorer\",\n \"type\": \"object\",\n }\n
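If you are on pydantic v2, the schema above can be reproduced (and tweaked) directly from the model shown in the docstring; a small sketch:
from typing import List\n\nfrom pydantic import BaseModel\n\nclass SchemaQualityScorer(BaseModel):\n    scores: List[int]\n\n# With pydantic v2, this prints the same JSON schema the task enforces above.\nprint(SchemaQualityScorer.model_json_schema())\n# {'properties': {'scores': {'items': {'type': 'integer'}, 'title': 'Scores', 'type': 'array'}}, 'required': ['scores'], 'title': 'SchemaQualityScorer', 'type': 'object'}\n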
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.QualityScorer._format_structured_output","title":"_format_structured_output(output, input)
","text":"Parses the structured response, which should correspond to a dictionary with the scores, and a list with them.
Parameters:
Name Type Description Defaultoutput
str
The output from the LLM
.
Returns:
Type DescriptionDict[str, str]
Formatted output.
Source code insrc/distilabel/steps/tasks/quality_scorer.py
def _format_structured_output(\n self, output: str, input: Dict[str, Any]\n) -> Dict[str, str]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with the scores, and a list with them.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n return {\"scores\": [None] * len(input[\"responses\"])}\n
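A quick sketch of the fallback above: a well-formed structured response is parsed as-is, while anything that is not valid JSON degrades to a list of None scores matching the number of responses (the sample inputs are illustrative):
import orjson\n\nbatch_input = {\"responses\": [\"good response\", \"weird response\", \"bad response\"]}\n\nfor raw in ('{ \"scores\": [1, 2, 3] }', \"not json at all\"):\n    try:\n        parsed = orjson.loads(raw)\n    except orjson.JSONDecodeError:\n        parsed = {\"scores\": [None] * len(batch_input[\"responses\"])}\n    print(parsed)\n# {'scores': [1, 2, 3]}\n# {'scores': [None, None, None]}\n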
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.SelfInstruct","title":"SelfInstruct
","text":" Bases: Task
Generate instructions based on a given input using an LLM
.
SelfInstruct
is a pre-defined task that, given a number of instructions, certain criteria for query generation, an application description, and an input, generates a number of instructions related to the given input, following what is stated in the criteria for query generation and the application description. It is based on the SelfInstruct framework from the paper \"Self-Instruct: Aligning Language Models with Self-Generated Instructions\".
Attributes:
Name Type Descriptionnum_instructions
int
The number of instructions to be generated. Defaults to 5.
criteria_for_query_generation
str
The criteria for the query generation. Defaults to the criteria defined within the paper.
application_description
str
The description of the AI application that one wants to build with these instructions. Defaults to AI assistant.
Input columns:
- input (str): The input to generate the instructions. It's also called seed in the paper.
Output columns:
- instructions (List[str]): The generated instructions.
- model_name (str): The model name used to generate the instructions.
References:
- Self-Instruct: Aligning Language Models with Self-Generated Instructions
Examples:
Generate instructions based on a given input:
from distilabel.steps.tasks import SelfInstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\nself_instruct = SelfInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_instructions=5, # This is the default value\n)\n\nself_instruct.load()\n\nresult = next(self_instruct.process([{\"input\": \"instruction\"}]))\n# result\n# [\n# {\n# 'input': 'instruction',\n# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n# 'instructions': [\"instruction 1\", \"instruction 2\", \"instruction 3\", \"instruction 4\", \"instruction 5\"],\n# }\n# ]\n
Citations @misc{wang2023selfinstructaligninglanguagemodels,\n title={Self-Instruct: Aligning Language Models with Self-Generated Instructions},\n author={Yizhong Wang and Yeganeh Kordi and Swaroop Mishra and Alisa Liu and Noah A. Smith and Daniel Khashabi and Hannaneh Hajishirzi},\n year={2023},\n eprint={2212.10560},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2212.10560},\n}\n
Source code in src/distilabel/steps/tasks/self_instruct.py
class SelfInstruct(Task):\n \"\"\"Generate instructions based on a given input using an `LLM`.\n\n `SelfInstruct` is a pre-defined task that, given a number of instructions, a\n certain criteria for query generations, an application description, and an input,\n generates a number of instruction related to the given input and following what\n is stated in the criteria for query generation and the application description.\n It is based in the SelfInstruct framework from the paper \"Self-Instruct: Aligning\n Language Models with Self-Generated Instructions\".\n\n Attributes:\n num_instructions: The number of instructions to be generated. Defaults to 5.\n criteria_for_query_generation: The criteria for the query generation. Defaults\n to the criteria defined within the paper.\n application_description: The description of the AI application that one want\n to build with these instructions. Defaults to `AI assistant`.\n\n Input columns:\n - input (`str`): The input to generate the instructions. It's also called seed in\n the paper.\n\n Output columns:\n - instructions (`List[str]`): The generated instructions.\n - model_name (`str`): The model name used to generate the instructions.\n\n Categories:\n - text-generation\n\n Reference:\n - [`Self-Instruct: Aligning Language Models with Self-Generated Instructions`](https://arxiv.org/abs/2212.10560)\n\n Examples:\n Generate instructions based on a given input:\n\n ```python\n from distilabel.steps.tasks import SelfInstruct\n from distilabel.models import InferenceEndpointsLLM\n\n self_instruct = SelfInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_instructions=5, # This is the default value\n )\n\n self_instruct.load()\n\n result = next(self_instruct.process([{\"input\": \"instruction\"}]))\n # result\n # [\n # {\n # 'input': 'instruction',\n # 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n # 'instructions': [\"instruction 1\", \"instruction 2\", \"instruction 3\", \"instruction 4\", \"instruction 5\"],\n # }\n # ]\n ```\n\n Citations:\n ```\n @misc{wang2023selfinstructaligninglanguagemodels,\n title={Self-Instruct: Aligning Language Models with Self-Generated Instructions},\n author={Yizhong Wang and Yeganeh Kordi and Swaroop Mishra and Alisa Liu and Noah A. Smith and Daniel Khashabi and Hannaneh Hajishirzi},\n year={2023},\n eprint={2212.10560},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2212.10560},\n }\n ```\n \"\"\"\n\n num_instructions: int = 5\n criteria_for_query_generation: str = (\n \"Incorporate a diverse range of verbs, avoiding repetition.\\n\"\n \"Ensure queries are compatible with AI model's text generation functions and are limited to 1-2 sentences.\\n\"\n \"Design queries to be self-contained and standalone.\\n\"\n 'Blend interrogative (e.g., \"What is the significance of x?\") and imperative (e.g., \"Detail the process of x.\") styles.'\n )\n application_description: str = \"AI assistant\"\n\n _template: Union[Template, None] = PrivateAttr(...)\n _can_be_used_with_offline_batch_generation = True\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"self-instruct.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task is the `input` i.e. 
seed text.\"\"\"\n return [\"input\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n input=input[\"input\"],\n application_description=self.application_description,\n criteria_for_query_generation=self.criteria_for_query_generation,\n num_instructions=self.num_instructions,\n ),\n }\n ]\n\n @property\n def outputs(self):\n \"\"\"The output for the task is a list of `instructions` containing the generated instructions.\"\"\"\n return [\"instructions\", \"model_name\"]\n\n def format_output(\n self,\n output: Union[str, None],\n input: Optional[Dict[str, Any]] = None,\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the generated instructions.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with containing the generated instructions.\n \"\"\"\n if output is None:\n return {\"instructions\": []}\n return {\"instructions\": [line for line in output.split(\"\\n\") if line != \"\"]}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.SelfInstruct.inputs","title":"inputs
property
","text":"The input for the task is the input
i.e. seed text.
outputs
property
","text":"The output for the task is a list of instructions
containing the generated instructions.
load()
","text":"Loads the Jinja2 template.
Source code insrc/distilabel/steps/tasks/self_instruct.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"self-instruct.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.SelfInstruct.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/self_instruct.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n input=input[\"input\"],\n application_description=self.application_description,\n criteria_for_query_generation=self.criteria_for_query_generation,\n num_instructions=self.num_instructions,\n ),\n }\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.SelfInstruct.format_output","title":"format_output(output, input=None)
","text":"The output is formatted as a list with the generated instructions.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
the raw output of the LLM.
requiredinput
Optional[Dict[str, Any]]
the input to the task. Used for obtaining the number of responses.
None
Returns:
Type DescriptionDict[str, Any]
A dict containing the generated instructions.
Source code insrc/distilabel/steps/tasks/self_instruct.py
def format_output(\n self,\n output: Union[str, None],\n input: Optional[Dict[str, Any]] = None,\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the generated instructions.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with containing the generated instructions.\n \"\"\"\n if output is None:\n return {\"instructions\": []}\n return {\"instructions\": [line for line in output.split(\"\\n\") if line != \"\"]}\n
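To make the newline splitting above concrete, here is a tiny standalone sketch with a made-up raw completion (the numbering style of a real completion may differ):
raw_output = \"1. Detail the process of fine-tuning an LLM.\\n\\n2. What is the significance of synthetic data?\"\n\n# Same rule as format_output above: one instruction per non-empty line.\ninstructions = [line for line in raw_output.split(\"\\n\") if line != \"\"]\nprint(instructions)\n# ['1. Detail the process of fine-tuning an LLM.', '2. What is the significance of synthetic data?']\n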
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateSentencePair","title":"GenerateSentencePair
","text":" Bases: Task
Generate a positive and, optionally, a negative sentence given an anchor sentence.
GenerateSentencePair
is a pre-defined task that, given an anchor sentence, generates a positive sentence related to the anchor and, optionally, a negative sentence that is either unrelated to the anchor or similar to it. Optionally, you can provide a context to guide the LLM towards more specific behavior. This task is useful to generate training datasets for training embedding models.
Attributes:
Name Type Descriptiontriplet
bool
a flag to indicate if the task should generate a triplet of sentences (anchor, positive, negative). Defaults to False
.
action
GenerationAction
the action to perform to generate the positive sentence.
context
str
the context to use for the generation. Can be helpful to guide the LLM towards more specific context. Not used by default.
hard_negative
bool
A flag to indicate if the negative should be a hard-negative or not. Hard negatives make it hard for the model to distinguish against the positive, with a higher degree of semantic similarity.
Input columns:
- anchor (str): The anchor sentence to generate the positive and negative sentences.
Output columns:
- positive (str): The positive sentence related to the anchor.
- negative (str): The negative sentence unrelated to the anchor if triplet=True, or more similar to the positive to make it more challenging for a model to distinguish in case hard_negative=True.
- model_name (str): The name of the model that was used to generate the sentences.
Examples:
Paraphrasing:
from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"paraphrase\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"What Game of Thrones villain would be the most likely to give you mercy?\"}])\n
Generating semantically similar sentences:
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import GenerateSentencePair\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"semantically-similar\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"How does 3D printing work?\"}])\n
Generating queries:
from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"Argilla is an open-source data curation platform for LLMs. Using Argilla, ...\"}])\n
Generating answers:
from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"answer\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"What Game of Thrones villain would be the most likely to give you mercy?\"}])\n
Generating queries with context (applies to every action):
from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n context=\"Argilla is an open-source data curation platform for LLMs.\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"I want to generate queries for my LLM.\"}])\n
Generating Hard-negatives (applies to every action):
from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n context=\"Argilla is an open-source data curation platform for LLMs.\",\n hard_negative=True,\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"I want to generate queries for my LLM.\"}])\n
Generating structured data with default schema (applies to every action):
from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n context=\"Argilla is an open-source data curation platform for LLMs.\",\n hard_negative=True,\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n use_default_structured_output=True\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"I want to generate queries for my LLM.\"}])\n
Source code in src/distilabel/steps/tasks/sentence_transformers.py
class GenerateSentencePair(Task):\n \"\"\"Generate a positive and negative (optionally) sentences given an anchor sentence.\n\n `GenerateSentencePair` is a pre-defined task that given an anchor sentence generates\n a positive sentence related to the anchor and optionally a negative sentence unrelated\n to the anchor or similar to it. Optionally, you can give a context to guide the LLM\n towards more specific behavior. This task is useful to generate training datasets for\n training embeddings models.\n\n Attributes:\n triplet: a flag to indicate if the task should generate a triplet of sentences\n (anchor, positive, negative). Defaults to `False`.\n action: the action to perform to generate the positive sentence.\n context: the context to use for the generation. Can be helpful to guide the LLM\n towards more specific context. Not used by default.\n hard_negative: A flag to indicate if the negative should be a hard-negative or not.\n Hard negatives make it hard for the model to distinguish against the positive,\n with a higher degree of semantic similarity.\n\n Input columns:\n - anchor (`str`): The anchor sentence to generate the positive and negative sentences.\n\n Output columns:\n - positive (`str`): The positive sentence related to the `anchor`.\n - negative (`str`): The negative sentence unrelated to the `anchor` if `triplet=True`,\n or more similar to the positive to make it more challenging for a model to distinguish\n in case `hard_negative=True`.\n - model_name (`str`): The name of the model that was used to generate the sentences.\n\n Categories:\n - embedding\n\n Examples:\n Paraphrasing:\n\n ```python\n from distilabel.steps.tasks import GenerateSentencePair\n from distilabel.models import InferenceEndpointsLLM\n\n generate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"paraphrase\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n )\n\n generate_sentence_pair.load()\n\n result = generate_sentence_pair.process([{\"anchor\": \"What Game of Thrones villain would be the most likely to give you mercy?\"}])\n ```\n\n Generating semantically similar sentences:\n\n ```python\n from distilabel.models import InferenceEndpointsLLM\n from distilabel.steps.tasks import GenerateSentencePair\n\n generate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"semantically-similar\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n )\n\n generate_sentence_pair.load()\n\n result = generate_sentence_pair.process([{\"anchor\": \"How does 3D printing work?\"}])\n ```\n\n Generating queries:\n\n ```python\n from distilabel.steps.tasks import GenerateSentencePair\n from distilabel.models import InferenceEndpointsLLM\n\n generate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n )\n\n generate_sentence_pair.load()\n\n result = generate_sentence_pair.process([{\"anchor\": \"Argilla is an open-source data curation platform for LLMs. 
Using Argilla, ...\"}])\n ```\n\n Generating answers:\n\n ```python\n from distilabel.steps.tasks import GenerateSentencePair\n from distilabel.models import InferenceEndpointsLLM\n\n generate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"answer\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n )\n\n generate_sentence_pair.load()\n\n result = generate_sentence_pair.process([{\"anchor\": \"What Game of Thrones villain would be the most likely to give you mercy?\"}])\n ```\n\n Generating queries with context (**applies to every action**):\n\n ```python\n from distilabel.steps.tasks import GenerateSentencePair\n from distilabel.models import InferenceEndpointsLLM\n\n generate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n context=\"Argilla is an open-source data curation platform for LLMs.\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n )\n\n generate_sentence_pair.load()\n\n result = generate_sentence_pair.process([{\"anchor\": \"I want to generate queries for my LLM.\"}])\n ```\n\n Generating Hard-negatives (**applies to every action**):\n\n ```python\n from distilabel.steps.tasks import GenerateSentencePair\n from distilabel.models import InferenceEndpointsLLM\n\n generate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n context=\"Argilla is an open-source data curation platform for LLMs.\",\n hard_negative=True,\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n )\n\n generate_sentence_pair.load()\n\n result = generate_sentence_pair.process([{\"anchor\": \"I want to generate queries for my LLM.\"}])\n ```\n\n Generating structured data with default schema (**applies to every action**):\n\n ```python\n from distilabel.steps.tasks import GenerateSentencePair\n from distilabel.models import InferenceEndpointsLLM\n\n generate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n context=\"Argilla is an open-source data curation platform for LLMs.\",\n hard_negative=True,\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n use_default_structured_output=True\n )\n\n generate_sentence_pair.load()\n\n result = generate_sentence_pair.process([{\"anchor\": \"I want to generate queries for my LLM.\"}])\n ```\n \"\"\"\n\n triplet: bool = False\n action: GenerationAction\n hard_negative: bool = False\n context: str = \"\"\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"generate-sentence-pair.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs for the task is the `anchor` sentence.\"\"\"\n return [\"anchor\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The inputs are formatted as a `ChatType`, with a system prompt describing the\n task of generating a positive and negative sentences 
for the anchor sentence. The\n anchor is provided as the first user interaction in the conversation.\n\n Args:\n input: The input containing the `anchor` sentence.\n\n Returns:\n A list of dictionaries containing the system and user interactions.\n \"\"\"\n action_sentence = GENERATION_ACTION_SENTENCES[self.action]\n\n format_system_prompt = {\n \"action_sentence\": action_sentence,\n \"context\": CONTEXT_INTRO if self.context else \"\",\n }\n if self.triplet:\n format_system_prompt[\"negative_style\"] = NEGATIVE_STYLE[\n \"hard-negative\" if self.hard_negative else \"negative\"\n ]\n\n system_prompt = (\n POSITIVE_NEGATIVE_SYSTEM_PROMPT if self.triplet else POSITIVE_SYSTEM_PROMPT\n ).format(**format_system_prompt)\n\n return [\n {\"role\": \"system\", \"content\": system_prompt},\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n anchor=input[\"anchor\"],\n context=self.context if self.context else None,\n ),\n },\n ]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The outputs for the task are the `positive` and `negative` sentences, as well\n as the `model_name` used to generate the sentences.\"\"\"\n columns = [\"positive\", \"negative\"] if self.triplet else [\"positive\"]\n columns += [\"model_name\"]\n return columns\n\n def format_output(\n self, output: Union[str, None], input: Optional[Dict[str, Any]] = None\n ) -> Dict[str, Any]:\n \"\"\"Formats the output of the LLM, to extract the `positive` and `negative` sentences\n generated. If the output is `None` or the regex doesn't match, then the outputs\n will be set to `None` as well.\n\n Args:\n output: The output of the LLM.\n input: The input used to generate the output.\n\n Returns:\n The formatted output containing the `positive` and `negative` sentences.\n \"\"\"\n if output is None:\n return {\"positive\": None, \"negative\": None}\n\n if self.use_default_structured_output:\n return self._format_structured_output(output)\n\n match = POSITIVE_NEGATIVE_PAIR_REGEX.match(output)\n if match is None:\n formatted_output = {\"positive\": None}\n if self.triplet:\n formatted_output[\"negative\"] = None\n return formatted_output\n\n groups = match.groups()\n if self.triplet:\n return {\n \"positive\": groups[0].strip(),\n \"negative\": (\n groups[1].strip()\n if len(groups) > 1 and groups[1] is not None\n else None\n ),\n }\n\n return {\"positive\": groups[0].strip()}\n\n @override\n def get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n if self.triplet:\n return {\n \"properties\": {\n \"positive\": {\"title\": \"Positive\", \"type\": \"string\"},\n \"negative\": {\"title\": \"Negative\", \"type\": \"string\"},\n },\n \"required\": [\"positive\", \"negative\"],\n \"title\": \"Schema\",\n \"type\": \"object\",\n }\n return {\n \"properties\": {\"positive\": {\"title\": \"Positive\", \"type\": \"string\"}},\n \"required\": [\"positive\"],\n \"title\": \"Schema\",\n \"type\": \"object\",\n }\n\n def _format_structured_output(self, output: str) -> Dict[str, str]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with either `positive`, or `positive` and `negative` keys.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n if self.triplet:\n return 
{\"positive\": None, \"negative\": None}\n return {\"positive\": None}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateSentencePair.inputs","title":"inputs
property
","text":"The inputs for the task is the anchor
sentence.
outputs
property
","text":"The outputs for the task are the positive
and negative
sentences, as well as the model_name
used to generate the sentences.
load()
","text":"Loads the Jinja2 template.
Source code insrc/distilabel/steps/tasks/sentence_transformers.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"generate-sentence-pair.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateSentencePair.format_input","title":"format_input(input)
","text":"The inputs are formatted as a ChatType
, with a system prompt describing the task of generating a positive and negative sentences for the anchor sentence. The anchor is provided as the first user interaction in the conversation.
Parameters:
Name Type Description Defaultinput
Dict[str, Any]
The input containing the anchor
sentence.
Returns:
Type DescriptionChatType
A list of dictionaries containing the system and user interactions.
Source code insrc/distilabel/steps/tasks/sentence_transformers.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The inputs are formatted as a `ChatType`, with a system prompt describing the\n task of generating a positive and negative sentences for the anchor sentence. The\n anchor is provided as the first user interaction in the conversation.\n\n Args:\n input: The input containing the `anchor` sentence.\n\n Returns:\n A list of dictionaries containing the system and user interactions.\n \"\"\"\n action_sentence = GENERATION_ACTION_SENTENCES[self.action]\n\n format_system_prompt = {\n \"action_sentence\": action_sentence,\n \"context\": CONTEXT_INTRO if self.context else \"\",\n }\n if self.triplet:\n format_system_prompt[\"negative_style\"] = NEGATIVE_STYLE[\n \"hard-negative\" if self.hard_negative else \"negative\"\n ]\n\n system_prompt = (\n POSITIVE_NEGATIVE_SYSTEM_PROMPT if self.triplet else POSITIVE_SYSTEM_PROMPT\n ).format(**format_system_prompt)\n\n return [\n {\"role\": \"system\", \"content\": system_prompt},\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n anchor=input[\"anchor\"],\n context=self.context if self.context else None,\n ),\n },\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateSentencePair.format_output","title":"format_output(output, input=None)
","text":"Formats the output of the LLM, to extract the positive
and negative
sentences generated. If the output is None
or the regex doesn't match, then the outputs will be set to None
as well.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
The output of the LLM.
requiredinput
Optional[Dict[str, Any]]
The input used to generate the output.
None
Returns:
Type DescriptionDict[str, Any]
The formatted output containing the positive
and negative
sentences.
src/distilabel/steps/tasks/sentence_transformers.py
def format_output(\n self, output: Union[str, None], input: Optional[Dict[str, Any]] = None\n) -> Dict[str, Any]:\n \"\"\"Formats the output of the LLM, to extract the `positive` and `negative` sentences\n generated. If the output is `None` or the regex doesn't match, then the outputs\n will be set to `None` as well.\n\n Args:\n output: The output of the LLM.\n input: The input used to generate the output.\n\n Returns:\n The formatted output containing the `positive` and `negative` sentences.\n \"\"\"\n if output is None:\n return {\"positive\": None, \"negative\": None}\n\n if self.use_default_structured_output:\n return self._format_structured_output(output)\n\n match = POSITIVE_NEGATIVE_PAIR_REGEX.match(output)\n if match is None:\n formatted_output = {\"positive\": None}\n if self.triplet:\n formatted_output[\"negative\"] = None\n return formatted_output\n\n groups = match.groups()\n if self.triplet:\n return {\n \"positive\": groups[0].strip(),\n \"negative\": (\n groups[1].strip()\n if len(groups) > 1 and groups[1] is not None\n else None\n ),\n }\n\n return {\"positive\": groups[0].strip()}\n
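Since the task is meant to feed embedding-model training data, here is a hedged sketch of how the formatted rows could be turned into (anchor, positive, negative) triplets downstream, dropping rows where parsing returned None; the rows variable stands in for the dictionaries yielded by process and its contents are illustrative:
rows = [\n    {\"anchor\": \"How does 3D printing work?\", \"positive\": \"Explain the 3D printing process.\", \"negative\": \"What is laser engraving?\"},\n    {\"anchor\": \"What is Argilla?\", \"positive\": None, \"negative\": None},  # a row where parsing failed\n]\n\ntriplets = [\n    (row[\"anchor\"], row[\"positive\"], row[\"negative\"])\n    for row in rows\n    if row[\"positive\"] is not None and row[\"negative\"] is not None\n]\nprint(triplets)\n# [('How does 3D printing work?', 'Explain the 3D printing process.', 'What is laser engraving?')]\n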
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateSentencePair.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
Returns:
Type DescriptionDict[str, Any]
JSON Schema of the response to enforce.
Source code insrc/distilabel/steps/tasks/sentence_transformers.py
@override\ndef get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n if self.triplet:\n return {\n \"properties\": {\n \"positive\": {\"title\": \"Positive\", \"type\": \"string\"},\n \"negative\": {\"title\": \"Negative\", \"type\": \"string\"},\n },\n \"required\": [\"positive\", \"negative\"],\n \"title\": \"Schema\",\n \"type\": \"object\",\n }\n return {\n \"properties\": {\"positive\": {\"title\": \"Positive\", \"type\": \"string\"}},\n \"required\": [\"positive\"],\n \"title\": \"Schema\",\n \"type\": \"object\",\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateSentencePair._format_structured_output","title":"_format_structured_output(output)
","text":"Parses the structured response, which should correspond to a dictionary with either positive
, or positive
and negative
keys.
Parameters:
Name Type Description Defaultoutput
str
The output from the LLM
.
Returns:
Type DescriptionDict[str, str]
Formatted output.
Source code insrc/distilabel/steps/tasks/sentence_transformers.py
def _format_structured_output(self, output: str) -> Dict[str, str]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with either `positive`, or `positive` and `negative` keys.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n if self.triplet:\n return {\"positive\": None, \"negative\": None}\n return {\"positive\": None}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.StructuredGeneration","title":"StructuredGeneration
","text":" Bases: Task
Generate structured content for a given instruction
using an LLM
.
StructuredGeneration
is a pre-defined task that defines the instruction
and the structured_output
as the inputs, and generation
as the output. This task is used to generate structured content based on the input instruction and following the schema provided within the structured_output
column per each instruction
. The model_name
is also returned as part of the output to enrich it.
Attributes:
Name Type Descriptionuse_system_prompt
bool
Whether to use the system prompt in the generation. Defaults to True
, which means that if the column system_prompt
is defined within the input batch, then the system_prompt
will be used, otherwise, it will be ignored.
Input columns:
- instruction (str): The instruction to generate structured content from.
- structured_output (Dict[str, Any]): The structured_output to generate structured content from. It should be a Python dictionary with the keys format and schema, where format should be one of json or regex, and the schema should be either the JSON schema or the regex pattern, respectively.
Output columns:
- generation (str): The generated text matching the provided schema, if possible.
- model_name (str): The name of the model used to generate the text.
Examples:
Generate structured output from a JSON schema:
from distilabel.steps.tasks import StructuredGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nstructured_gen = StructuredGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n)\n\nstructured_gen.load()\n\nresult = next(\n structured_gen.process(\n [\n {\n \"instruction\": \"Create an RPG character\",\n \"structured_output\": {\n \"format\": \"json\",\n \"schema\": {\n \"properties\": {\n \"name\": {\n \"title\": \"Name\",\n \"type\": \"string\"\n },\n \"description\": {\n \"title\": \"Description\",\n \"type\": \"string\"\n },\n \"role\": {\n \"title\": \"Role\",\n \"type\": \"string\"\n },\n \"weapon\": {\n \"title\": \"Weapon\",\n \"type\": \"string\"\n }\n },\n \"required\": [\n \"name\",\n \"description\",\n \"role\",\n \"weapon\"\n ],\n \"title\": \"Character\",\n \"type\": \"object\"\n }\n },\n }\n ]\n )\n)\n
Generate structured output from a regex pattern (only works with LLMs that support regex, the providers using outlines):
from distilabel.steps.tasks import StructuredGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nstructured_gen = StructuredGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n)\n\nstructured_gen.load()\n\nresult = next(\n structured_gen.process(\n [\n {\n \"instruction\": \"What's the weather like today in Seattle in Celsius degrees?\",\n \"structured_output\": {\n \"format\": \"regex\",\n \"schema\": r\"(\\d{1,2})\u00b0C\"\n },\n\n }\n ]\n )\n)\n
Source code in src/distilabel/steps/tasks/structured_generation.py
class StructuredGeneration(Task):\n \"\"\"Generate structured content for a given `instruction` using an `LLM`.\n\n `StructuredGeneration` is a pre-defined task that defines the `instruction` and the `structured_output`\n as the inputs, and `generation` as the output. This task is used to generate structured content based on\n the input instruction and following the schema provided within the `structured_output` column per each\n `instruction`. The `model_name` also returned as part of the output in order to enhance it.\n\n Attributes:\n use_system_prompt: Whether to use the system prompt in the generation. Defaults to `True`,\n which means that if the column `system_prompt` is defined within the input batch, then\n the `system_prompt` will be used, otherwise, it will be ignored.\n\n Input columns:\n - instruction (`str`): The instruction to generate structured content from.\n - structured_output (`Dict[str, Any]`): The structured_output to generate structured content from. It should be a\n Python dictionary with the keys `format` and `schema`, where `format` should be one of `json` or\n `regex`, and the `schema` should be either the JSON schema or the regex pattern, respectively.\n\n Output columns:\n - generation (`str`): The generated text matching the provided schema, if possible.\n - model_name (`str`): The name of the model used to generate the text.\n\n Categories:\n - outlines\n - structured-generation\n\n Examples:\n Generate structured output from a JSON schema:\n\n ```python\n from distilabel.steps.tasks import StructuredGeneration\n from distilabel.models import InferenceEndpointsLLM\n\n structured_gen = StructuredGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n )\n\n structured_gen.load()\n\n result = next(\n structured_gen.process(\n [\n {\n \"instruction\": \"Create an RPG character\",\n \"structured_output\": {\n \"format\": \"json\",\n \"schema\": {\n \"properties\": {\n \"name\": {\n \"title\": \"Name\",\n \"type\": \"string\"\n },\n \"description\": {\n \"title\": \"Description\",\n \"type\": \"string\"\n },\n \"role\": {\n \"title\": \"Role\",\n \"type\": \"string\"\n },\n \"weapon\": {\n \"title\": \"Weapon\",\n \"type\": \"string\"\n }\n },\n \"required\": [\n \"name\",\n \"description\",\n \"role\",\n \"weapon\"\n ],\n \"title\": \"Character\",\n \"type\": \"object\"\n }\n },\n }\n ]\n )\n )\n ```\n\n Generate structured output from a regex pattern (only works with LLMs that support regex, the providers using outlines):\n\n ```python\n from distilabel.steps.tasks import StructuredGeneration\n from distilabel.models import InferenceEndpointsLLM\n\n structured_gen = StructuredGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n )\n\n structured_gen.load()\n\n result = next(\n structured_gen.process(\n [\n {\n \"instruction\": \"What's the weather like today in Seattle in Celsius degrees?\",\n \"structured_output\": {\n \"format\": \"regex\",\n \"schema\": r\"(\\\\d{1,2})\u00b0C\"\n },\n\n }\n ]\n )\n )\n ```\n \"\"\"\n\n use_system_prompt: bool = False\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task are the `instruction` and the `structured_output`.\n Optionally, if the `use_system_prompt` flag is set to True, then the\n `system_prompt` will be used too.\"\"\"\n columns = [\"instruction\", \"structured_output\"]\n if 
self.use_system_prompt:\n columns = [\"system_prompt\"] + columns\n return columns\n\n def format_input(self, input: Dict[str, Any]) -> StructuredInput:\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n if not isinstance(input[\"instruction\"], str):\n raise DistilabelUserError(\n f\"Input `instruction` must be a string. Got: {input['instruction']}.\",\n page=\"components-gallery/tasks/structuredgeneration/\",\n )\n\n messages = [{\"role\": \"user\", \"content\": input[\"instruction\"]}]\n if self.use_system_prompt:\n if \"system_prompt\" in input:\n messages.insert(\n 0, {\"role\": \"system\", \"content\": input[\"system_prompt\"]}\n )\n else:\n warnings.warn(\n \"`use_system_prompt` is set to `True`, but no `system_prompt` in input batch, so it will be ignored.\",\n UserWarning,\n stacklevel=2,\n )\n\n return (messages, input.get(\"structured_output\", None)) # type: ignore\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task is the `generation` and the `model_name`.\"\"\"\n return [\"generation\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `generation`. The `model_name`\n will be automatically included within the `process` method of `Task`. Note that even\n if the `structured_output` is defined to produce a JSON schema, this method will return the raw\n output i.e. a string without any parsing.\"\"\"\n return {\"generation\": output}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.StructuredGeneration.inputs","title":"inputs
property
","text":"The input for the task are the instruction
and the structured_output
. Optionally, if the use_system_prompt
flag is set to True, then the system_prompt
will be used too.
outputs
property
","text":"The output for the task is the generation
and the model_name
.
format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/structured_generation.py
def format_input(self, input: Dict[str, Any]) -> StructuredInput:\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n if not isinstance(input[\"instruction\"], str):\n raise DistilabelUserError(\n f\"Input `instruction` must be a string. Got: {input['instruction']}.\",\n page=\"components-gallery/tasks/structuredgeneration/\",\n )\n\n messages = [{\"role\": \"user\", \"content\": input[\"instruction\"]}]\n if self.use_system_prompt:\n if \"system_prompt\" in input:\n messages.insert(\n 0, {\"role\": \"system\", \"content\": input[\"system_prompt\"]}\n )\n else:\n warnings.warn(\n \"`use_system_prompt` is set to `True`, but no `system_prompt` in input batch, so it will be ignored.\",\n UserWarning,\n stacklevel=2,\n )\n\n return (messages, input.get(\"structured_output\", None)) # type: ignore\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.StructuredGeneration.format_output","title":"format_output(output, input)
","text":"The output is formatted as a dictionary with the generation
. The model_name
will be automatically included within the process
method of Task
. Note that even if the structured_output
is defined to produce a JSON schema, this method will return the raw output i.e. a string without any parsing.
src/distilabel/steps/tasks/structured_generation.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `generation`. The `model_name`\n will be automatically included within the `process` method of `Task`. Note that even\n if the `structured_output` is defined to produce a JSON schema, this method will return the raw\n output i.e. a string without any parsing.\"\"\"\n return {\"generation\": output}\n
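Because the generation is returned as a raw string even when a JSON schema was enforced, a downstream parse is usually needed; a minimal sketch with a made-up payload (the parse can still fail if the model did not honour the schema):
import json\n\nrow = {\"generation\": '{\"name\": \"Kael\", \"description\": \"A wandering bard\", \"role\": \"Support\", \"weapon\": \"Lute\"}'}\n\ntry:\n    character = json.loads(row[\"generation\"])\nexcept json.JSONDecodeError:\n    character = None  # the model did not return valid JSON\n\nprint(character[\"name\"] if character else \"could not parse\")  # Kael\n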
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextClassification","title":"TextClassification
","text":" Bases: Task
Classifies text into one or more categories or labels.
This task can be used for text classification problems, where the goal is to assign one or multiple labels to a given text. By default it uses structured generation, as described in the reference paper, which can help to produce more concise labels. See section 4.1 in the reference.
Input columnsstr
): The reference text we want to obtain labels for.Union[str, List[str]]
): The label or list of labels for the text.str
): The name of the model used to generate the label/s.Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models
Attributes:
Name Type Descriptionsystem_prompt
Optional[str]
A prompt to display to the user before the task starts. Contains a default message to make the model behave like a classifier specialist.
n
PositiveInt
Number of labels to generate If only 1 is required, corresponds to a label classification problem, if >1 it will intend return the \"n\" labels most representative for the text. Defaults to 1.
context
Optional[str]
Context to use when generating the labels. By default contains a generic message, but can be used to customize the context for the task.
examples
Optional[List[str]]
List of examples to help the model understand the task (i.e. few-shot examples).
available_labels
Optional[Union[List[str], Dict[str, str]]]
List of available labels to choose from when classifying the text, or a dictionary with the labels and their descriptions.
default_label
Optional[Union[str, List[str]]]
Default label to use when the text is ambiguous or lacks sufficient information for classification. Can be a list in case of multiple labels (n>1).
Examples:
Assigning a sentiment to a text:
from distilabel.steps.tasks import TextClassification\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n)\n\ntext_classification = TextClassification(\n llm=llm,\n context=\"You are an AI system specialized in assigning sentiment to movies.\",\n available_labels=[\"positive\", \"negative\"],\n)\n\ntext_classification.load()\n\nresult = next(\n text_classification.process(\n [{\"text\": \"This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.\"}]\n )\n)\n# result\n# [{'text': 'This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.',\n# 'labels': 'positive',\n# 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": \"positive\"\\n}',\n# 'raw_input_text_classification_0': [{'role': 'system',\n# 'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n# {'role': 'user',\n# 'content': '# Instruction\\nPlease classify the user query by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide the label that best describes the text.\\nYou are an AI system specialized in assigning sentiment to movie the user queries.\\n## Labeling the user input\\nUse the available labels to classify the user query. Analyze the context of each label specifically:\\navailable_labels = [\\n \"positive\", # The text shows positive sentiment\\n \"negative\", # The text shows negative sentiment\\n]\\n\\n\\n## User Query\\n```\\nThis was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": \"label\"\\n}\\n```'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
Assigning predefined labels with specified descriptions:
from distilabel.steps.tasks import TextClassification\n\ntext_classification = TextClassification(\n llm=llm,\n n=1,\n context=\"Determine the intent of the text.\",\n available_labels={\n \"complaint\": \"A statement expressing dissatisfaction or annoyance about a product, service, or experience. It's a negative expression of discontent, often with the intention of seeking a resolution or compensation.\",\n \"inquiry\": \"A question or request for information about a product, service, or situation. It's a neutral or curious expression seeking clarification or details.\",\n \"feedback\": \"A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.\",\n \"praise\": \"A statement expressing admiration, approval, or appreciation for a product, service, or experience. It's a positive expression of satisfaction or delight, often with the intention of encouraging or recommending.\"\n },\n query_title=\"Customer Query\",\n)\n\ntext_classification.load()\n\nresult = next(\n text_classification.process(\n [{\"text\": \"Can you tell me more about your return policy?\"}]\n )\n)\n# result\n# [{'text': 'Can you tell me more about your return policy?',\n# 'labels': 'inquiry',\n# 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": \"inquiry\"\\n}',\n# 'raw_input_text_classification_0': [{'role': 'system',\n# 'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n# {'role': 'user',\n# 'content': '# Instruction\\nPlease classify the customer query by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide the label that best describes the text.\\nDetermine the intent of the text.\\n## Labeling the user input\\nUse the available labels to classify the user query. Analyze the context of each label specifically:\\navailable_labels = [\\n \"complaint\", # A statement expressing dissatisfaction or annoyance about a product, service, or experience. It\\'s a negative expression of discontent, often with the intention of seeking a resolution or compensation.\\n \"inquiry\", # A question or request for information about a product, service, or situation. It\\'s a neutral or curious expression seeking clarification or details.\\n \"feedback\", # A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.\\n \"praise\", # A statement expressing admiration, approval, or appreciation for a product, service, or experience. It\\'s a positive expression of satisfaction or delight, often with the intention of encouraging or recommending.\\n]\\n\\n\\n## Customer Query\\n```\\nCan you tell me more about your return policy?\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": \"label\"\\n}\\n```'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
Free multi label classification without predefined labels:
from distilabel.steps.tasks import TextClassification\n\ntext_classification = TextClassification(\n llm=llm,\n n=3,\n context=(\n \"Describe the main themes, topics, or categories that could describe the \"\n \"following type of persona.\"\n ),\n query_title=\"Example of Persona\",\n)\n\ntext_classification.load()\n\nresult = next(\n text_classification.process(\n [{\"text\": \"A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.\"}]\n )\n)\n# result\n# [{'text': 'A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.',\n# 'labels': ['Historical Researcher',\n# 'Cultural Specialist',\n# 'Ethnic Studies Expert'],\n# 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": [\"Historical Researcher\", \"Cultural Specialist\", \"Ethnic Studies Expert\"]\\n}',\n# 'raw_input_text_classification_0': [{'role': 'system',\n# 'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n# {'role': 'user',\n# 'content': '# Instruction\\nPlease classify the example of persona by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide a list of 3 labels that best describe the text.\\nDescribe the main themes, topics, or categories that could describe the following type of persona.\\nUse clear, widely understood terms for labels.Avoid overly specific or obscure labels unless the text demands it.\\n\\n\\n## Example of Persona\\n```\\nA historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": [\"label_0\", \"label_1\", \"label_2\"]\\n}\\n```'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
Source code in src/distilabel/steps/tasks/text_classification.py
class TextClassification(Task):\n r\"\"\"Classifies text into one or more categories or labels.\n\n This task can be used for text classification problems, where the goal is to assign\n one or multiple labels to a given text.\n It uses structured generation as per the reference paper by default,\n it can help to generate more concise labels. See section 4.1 in the reference.\n\n Input columns:\n - text (`str`): The reference text we want to obtain labels for.\n\n Output columns:\n - labels (`Union[str, List[str]]`): The label or list of labels for the text.\n - model_name (`str`): The name of the model used to generate the label/s.\n\n Categories:\n - text-classification\n\n References:\n - [`Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models`](https://arxiv.org/abs/2408.02442)\n\n Attributes:\n system_prompt: A prompt to display to the user before the task starts. Contains a default\n message to make the model behave like a classifier specialist.\n n: Number of labels to generate If only 1 is required, corresponds to a label\n classification problem, if >1 it will intend return the \"n\" labels most representative\n for the text. Defaults to 1.\n context: Context to use when generating the labels. By default contains a generic message,\n but can be used to customize the context for the task.\n examples: List of examples to help the model understand the task, few shots.\n available_labels: List of available labels to choose from when classifying the text, or\n a dictionary with the labels and their descriptions.\n default_label: Default label to use when the text is ambiguous or lacks sufficient information for\n classification. Can be a list in case of multiple labels (n>1).\n\n Examples:\n Assigning a sentiment to a text:\n\n ```python\n from distilabel.steps.tasks import TextClassification\n from distilabel.models import InferenceEndpointsLLM\n\n llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n )\n\n text_classification = TextClassification(\n llm=llm,\n context=\"You are an AI system specialized in assigning sentiment to movies.\",\n available_labels=[\"positive\", \"negative\"],\n )\n\n text_classification.load()\n\n result = next(\n text_classification.process(\n [{\"text\": \"This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.\"}]\n )\n )\n # result\n # [{'text': 'This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.',\n # 'labels': 'positive',\n # 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": \"positive\"\\n}',\n # 'raw_input_text_classification_0': [{'role': 'system',\n # 'content': 'You are an AI system specialized in generating labels to classify pieces of text. 
Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n # {'role': 'user',\n # 'content': '# Instruction\\nPlease classify the user query by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide the label that best describes the text.\\nYou are an AI system specialized in assigning sentiment to movie the user queries.\\n## Labeling the user input\\nUse the available labels to classify the user query. Analyze the context of each label specifically:\\navailable_labels = [\\n \"positive\", # The text shows positive sentiment\\n \"negative\", # The text shows negative sentiment\\n]\\n\\n\\n## User Query\\n```\\nThis was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": \"label\"\\n}\\n```'}]},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n\n Assigning predefined labels with specified descriptions:\n\n ```python\n from distilabel.steps.tasks import TextClassification\n\n text_classification = TextClassification(\n llm=llm,\n n=1,\n context=\"Determine the intent of the text.\",\n available_labels={\n \"complaint\": \"A statement expressing dissatisfaction or annoyance about a product, service, or experience. It's a negative expression of discontent, often with the intention of seeking a resolution or compensation.\",\n \"inquiry\": \"A question or request for information about a product, service, or situation. It's a neutral or curious expression seeking clarification or details.\",\n \"feedback\": \"A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.\",\n \"praise\": \"A statement expressing admiration, approval, or appreciation for a product, service, or experience. It's a positive expression of satisfaction or delight, often with the intention of encouraging or recommending.\"\n },\n query_title=\"Customer Query\",\n )\n\n text_classification.load()\n\n result = next(\n text_classification.process(\n [{\"text\": \"Can you tell me more about your return policy?\"}]\n )\n )\n # result\n # [{'text': 'Can you tell me more about your return policy?',\n # 'labels': 'inquiry',\n # 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": \"inquiry\"\\n}',\n # 'raw_input_text_classification_0': [{'role': 'system',\n # 'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n # {'role': 'user',\n # 'content': '# Instruction\\nPlease classify the customer query by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide the label that best describes the text.\\nDetermine the intent of the text.\\n## Labeling the user input\\nUse the available labels to classify the user query. 
Analyze the context of each label specifically:\\navailable_labels = [\\n \"complaint\", # A statement expressing dissatisfaction or annoyance about a product, service, or experience. It\\'s a negative expression of discontent, often with the intention of seeking a resolution or compensation.\\n \"inquiry\", # A question or request for information about a product, service, or situation. It\\'s a neutral or curious expression seeking clarification or details.\\n \"feedback\", # A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.\\n \"praise\", # A statement expressing admiration, approval, or appreciation for a product, service, or experience. It\\'s a positive expression of satisfaction or delight, often with the intention of encouraging or recommending.\\n]\\n\\n\\n## Customer Query\\n```\\nCan you tell me more about your return policy?\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": \"label\"\\n}\\n```'}]},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n\n Free multi label classification without predefined labels:\n\n ```python\n from distilabel.steps.tasks import TextClassification\n\n text_classification = TextClassification(\n llm=llm,\n n=3,\n context=(\n \"Describe the main themes, topics, or categories that could describe the \"\n \"following type of persona.\"\n ),\n query_title=\"Example of Persona\",\n )\n\n text_classification.load()\n\n result = next(\n text_classification.process(\n [{\"text\": \"A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.\"}]\n )\n )\n # result\n # [{'text': 'A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.',\n # 'labels': ['Historical Researcher',\n # 'Cultural Specialist',\n # 'Ethnic Studies Expert'],\n # 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": [\"Historical Researcher\", \"Cultural Specialist\", \"Ethnic Studies Expert\"]\\n}',\n # 'raw_input_text_classification_0': [{'role': 'system',\n # 'content': 'You are an AI system specialized in generating labels to classify pieces of text. 
Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n # {'role': 'user',\n # 'content': '# Instruction\\nPlease classify the example of persona by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide a list of 3 labels that best describe the text.\\nDescribe the main themes, topics, or categories that could describe the following type of persona.\\nUse clear, widely understood terms for labels.Avoid overly specific or obscure labels unless the text demands it.\\n\\n\\n## Example of Persona\\n```\\nA historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": [\"label_0\", \"label_1\", \"label_2\"]\\n}\\n```'}]},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n \"\"\"\n\n system_prompt: Optional[str] = (\n \"You are an AI system specialized in generating labels to classify pieces of text. \"\n \"Your sole purpose is to analyze the given text and provide appropriate classification labels.\"\n )\n n: PositiveInt = Field(\n default=1,\n description=\"Number of labels to generate. Defaults to 1.\",\n )\n context: Optional[str] = Field(\n default=\"Generate concise, relevant labels that accurately represent the text's main themes, topics, or categories.\",\n description=\"Context to use when generating the labels.\",\n )\n examples: Optional[List[str]] = Field(\n default=None,\n description=\"List of examples to help the model understand the task, few shots.\",\n )\n available_labels: Optional[Union[List[str], Dict[str, str]]] = Field(\n default=None,\n description=(\n \"List of available labels to choose from when classifying the text, or \"\n \"a dictionary with the labels and their descriptions.\"\n ),\n )\n default_label: Optional[Union[str, List[str]]] = Field(\n default=\"Unclassified\",\n description=(\n \"Default label to use when the text is ambiguous or lacks sufficient information for \"\n \"classification. 
Can be a list in case of multiple labels (n>1).\"\n ),\n )\n query_title: str = Field(\n default=\"User Query\",\n description=\"Title of the query used to show the example/s to classify.\",\n )\n use_default_structured_output: bool = True\n\n _template: Optional[Template] = PrivateAttr(default=None)\n\n def load(self) -> None:\n super().load()\n self._template = Template(TEXT_CLASSIFICATION_TEMPLATE)\n self._labels_format: str = (\n '\"label\"'\n if self.n == 1\n else \"[\" + \", \".join([f'\"label_{i}\"' for i in range(self.n)]) + \"]\"\n )\n self._labels_message: str = (\n \"Provide the label that best describes the text.\"\n if self.n == 1\n else f\"Provide a list of {self.n} labels that best describe the text.\"\n )\n self._available_labels_message: str = self._get_available_labels_message()\n self._examples: str = self._get_examples_message()\n\n def _get_available_labels_message(self) -> str:\n \"\"\"Prepares the message to display depending on the available labels (if any),\n and whether the labels have a specific context.\n \"\"\"\n if self.available_labels is None:\n return (\n \"Use clear, widely understood terms for labels.\"\n \"Avoid overly specific or obscure labels unless the text demands it.\"\n )\n\n msg = (\n \"## Labeling the user input\\n\"\n \"Use the available labels to classify the user query{label_context}:\\n\"\n \"available_labels = {available_labels}\"\n )\n if isinstance(self.available_labels, list):\n specific_msg = (\n \"[\\n\"\n + indent(\n \"\".join([f'\"{label}\",\\n' for label in self.available_labels]),\n prefix=\" \" * 4,\n )\n + \"]\"\n )\n return msg.format(label_context=\"\", available_labels=specific_msg)\n\n elif isinstance(self.available_labels, dict):\n specific_msg = \"\"\n for label, description in self.available_labels.items():\n specific_msg += indent(\n f'\"{label}\", # {description}' + \"\\n\", prefix=\" \" * 4\n )\n\n specific_msg = \"[\\n\" + specific_msg + \"]\"\n return msg.format(\n label_context=\". 
Analyze the context of each label specifically\",\n available_labels=specific_msg,\n )\n\n def _get_examples_message(self) -> str:\n \"\"\"Prepares the message to display depending on the examples provided.\"\"\"\n if self.examples is None:\n return \"\"\n\n examples_msg = \"\\n\".join([f\"- {ex}\" for ex in self.examples])\n\n return (\n \"\\n## Examples\\n\"\n \"Here are some examples to help you understand the task:\\n\"\n f\"{examples_msg}\"\n )\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task is the `instruction`.\"\"\"\n return [\"text\"]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task is the `generation` and the `model_name`.\"\"\"\n return [\"labels\", \"model_name\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n messages = [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n context=f\"\\n{self.context}\",\n labels_message=self._labels_message,\n available_labels=self._available_labels_message,\n examples=self._examples,\n default_label=self.default_label,\n labels_format=self._labels_format,\n query_title=self.query_title,\n text=input[\"text\"],\n ),\n },\n ]\n if self.system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": self.system_prompt})\n return messages\n\n def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `generation`. The `model_name`\n will be automatically included within the `process` method of `Task`.\"\"\"\n return self._format_structured_output(output)\n\n @override\n def get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n if self.n > 1:\n\n class MultiLabelSchema(BaseModel):\n labels: List[str]\n\n return MultiLabelSchema.model_json_schema()\n\n class SingleLabelSchema(BaseModel):\n labels: str\n\n return SingleLabelSchema.model_json_schema()\n\n def _format_structured_output(\n self, output: str\n ) -> Dict[str, Union[str, List[str]]]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with the `labels`, and either a string or a list of strings with the labels.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n if self.n > 1:\n return {\"labels\": [None for _ in range(self.n)]}\n return {\"labels\": None}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextClassification.inputs","title":"inputs
property
","text":"The input for the task is the instruction
.
outputs
property
","text":"The output for the task is the generation
and the model_name
.
_get_available_labels_message()
","text":"Prepares the message to display depending on the available labels (if any), and whether the labels have a specific context.
Source code insrc/distilabel/steps/tasks/text_classification.py
def _get_available_labels_message(self) -> str:\n \"\"\"Prepares the message to display depending on the available labels (if any),\n and whether the labels have a specific context.\n \"\"\"\n if self.available_labels is None:\n return (\n \"Use clear, widely understood terms for labels.\"\n \"Avoid overly specific or obscure labels unless the text demands it.\"\n )\n\n msg = (\n \"## Labeling the user input\\n\"\n \"Use the available labels to classify the user query{label_context}:\\n\"\n \"available_labels = {available_labels}\"\n )\n if isinstance(self.available_labels, list):\n specific_msg = (\n \"[\\n\"\n + indent(\n \"\".join([f'\"{label}\",\\n' for label in self.available_labels]),\n prefix=\" \" * 4,\n )\n + \"]\"\n )\n return msg.format(label_context=\"\", available_labels=specific_msg)\n\n elif isinstance(self.available_labels, dict):\n specific_msg = \"\"\n for label, description in self.available_labels.items():\n specific_msg += indent(\n f'\"{label}\", # {description}' + \"\\n\", prefix=\" \" * 4\n )\n\n specific_msg = \"[\\n\" + specific_msg + \"]\"\n return msg.format(\n label_context=\". Analyze the context of each label specifically\",\n available_labels=specific_msg,\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextClassification._get_examples_message","title":"_get_examples_message()
","text":"Prepares the message to display depending on the examples provided.
Source code insrc/distilabel/steps/tasks/text_classification.py
def _get_examples_message(self) -> str:\n \"\"\"Prepares the message to display depending on the examples provided.\"\"\"\n if self.examples is None:\n return \"\"\n\n examples_msg = \"\\n\".join([f\"- {ex}\" for ex in self.examples])\n\n return (\n \"\\n## Examples\\n\"\n \"Here are some examples to help you understand the task:\\n\"\n f\"{examples_msg}\"\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextClassification.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/text_classification.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n messages = [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n context=f\"\\n{self.context}\",\n labels_message=self._labels_message,\n available_labels=self._available_labels_message,\n examples=self._examples,\n default_label=self.default_label,\n labels_format=self._labels_format,\n query_title=self.query_title,\n text=input[\"text\"],\n ),\n },\n ]\n if self.system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": self.system_prompt})\n return messages\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextClassification.format_output","title":"format_output(output, input=None)
","text":"The output is formatted as a dictionary with the generation
. The model_name
will be automatically included within the process
method of Task
.
src/distilabel/steps/tasks/text_classification.py
def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `generation`. The `model_name`\n will be automatically included within the `process` method of `Task`.\"\"\"\n return self._format_structured_output(output)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextClassification.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
Returns:
Type DescriptionDict[str, Any]
JSON Schema of the response to enforce.
Source code insrc/distilabel/steps/tasks/text_classification.py
@override\ndef get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n if self.n > 1:\n\n class MultiLabelSchema(BaseModel):\n labels: List[str]\n\n return MultiLabelSchema.model_json_schema()\n\n class SingleLabelSchema(BaseModel):\n labels: str\n\n return SingleLabelSchema.model_json_schema()\n
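For reference, the enforced schema is a plain Pydantic v2 JSON schema. A small sketch of what get_structured_output produces for the single-label case (exact titles may vary across Pydantic versions):
from pydantic import BaseModel\n\nclass SingleLabelSchema(BaseModel):\n    labels: str\n\nprint(SingleLabelSchema.model_json_schema())\n# Roughly: {'properties': {'labels': {'title': 'Labels', 'type': 'string'}},\n#          'required': ['labels'], 'title': 'SingleLabelSchema', 'type': 'object'}\n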
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextClassification._format_structured_output","title":"_format_structured_output(output)
","text":"Parses the structured response, which should correspond to a dictionary with the labels
, and either a string or a list of strings with the labels.
Parameters:
Name Type Description Defaultoutput
str
The output from the LLM
.
Returns:
Type DescriptionDict[str, Union[str, List[str]]]
Formatted output.
Source code insrc/distilabel/steps/tasks/text_classification.py
def _format_structured_output(\n self, output: str\n) -> Dict[str, Union[str, List[str]]]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with the `labels`, and either a string or a list of strings with the labels.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n if self.n > 1:\n return {\"labels\": [None for _ in range(self.n)]}\n return {\"labels\": None}\n
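As the source above shows, malformed output degrades gracefully to None labels instead of raising. A standalone sketch of the same fallback behaviour:
import orjson\n\ndef parse_labels(output: str, n: int = 1):\n    # Mirrors `_format_structured_output`: invalid JSON falls back to None label/s.\n    try:\n        return orjson.loads(output)\n    except orjson.JSONDecodeError:\n        return {\"labels\": [None] * n} if n > 1 else {\"labels\": None}\n\nprint(parse_labels('{\"labels\": \"positive\"}'))  # {'labels': 'positive'}\nprint(parse_labels(\"not json\", n=3))  # {'labels': [None, None, None]}\n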
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ChatGeneration","title":"ChatGeneration
","text":" Bases: Task
Generates text based on a conversation.
ChatGeneration
is a pre-defined task that defines the messages
as the input and generation
as the output. This task is used to generate text based on a conversation. The model_name
is also returned as part of the output so that the generation can be traced back to the model that produced it.
List[Dict[Literal[\"role\", \"content\"], str]]
): The messages to generate the follow up completion from.str
): The generated text from the assistant.str
): The model name used to generate the text.:material-chat:
Examples:
Generate text from a conversation in OpenAI chat format:
from distilabel.steps.tasks import ChatGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nchat = ChatGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n)\n\nchat.load()\n\nresult = next(\n chat.process(\n [\n {\n \"messages\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n ]\n }\n ]\n )\n)\n# result\n# [\n# {\n# 'messages': [{'role': 'user', 'content': 'How much is 2+2?'}],\n# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n# 'generation': '4',\n# }\n# ]\n
Source code in src/distilabel/steps/tasks/text_generation.py
class ChatGeneration(Task):\n \"\"\"Generates text based on a conversation.\n\n `ChatGeneration` is a pre-defined task that defines the `messages` as the input\n and `generation` as the output. This task is used to generate text based on a conversation.\n The `model_name` is also returned as part of the output in order to enhance it.\n\n Input columns:\n - messages (`List[Dict[Literal[\"role\", \"content\"], str]]`): The messages to generate the\n follow up completion from.\n\n Output columns:\n - generation (`str`): The generated text from the assistant.\n - model_name (`str`): The model name used to generate the text.\n\n Categories:\n - chat-generation\n\n Icon:\n `:material-chat:`\n\n Examples:\n Generate text from a conversation in OpenAI chat format:\n\n ```python\n from distilabel.steps.tasks import ChatGeneration\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n chat = ChatGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n )\n\n chat.load()\n\n result = next(\n chat.process(\n [\n {\n \"messages\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n ]\n }\n ]\n )\n )\n # result\n # [\n # {\n # 'messages': [{'role': 'user', 'content': 'How much is 2+2?'}],\n # 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n # 'generation': '4',\n # }\n # ]\n ```\n \"\"\"\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task are the `messages`.\"\"\"\n return [\"messages\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the messages provided\n are already formatted that way i.e. following the OpenAI chat format.\"\"\"\n\n if not is_openai_format(input[\"messages\"]):\n raise DistilabelUserError(\n \"Input `messages` must be an OpenAI chat-like format conversation. \"\n f\"Got: {input['messages']}. Please check: 'https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models'.\",\n page=\"components-gallery/tasks/chatgeneration/\",\n )\n\n if input[\"messages\"][-1][\"role\"] != \"user\":\n raise DistilabelUserError(\n \"The last message must be from the user. Please check: \"\n \"'https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models'.\",\n page=\"components-gallery/tasks/chatgeneration/\",\n )\n\n return input[\"messages\"]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task is the `generation` and the `model_name`.\"\"\"\n return [\"generation\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `generation`. The `model_name`\n will be automatically included within the `process` method of `Task`.\"\"\"\n return {\"generation\": output}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ChatGeneration.inputs","title":"inputs
property
","text":"The input for the task are the messages
.
outputs
property
","text":"The output for the task is the generation
and the model_name
.
format_input(input)
","text":"The input is formatted as a ChatType
assuming that the messages provided are already formatted that way i.e. following the OpenAI chat format.
src/distilabel/steps/tasks/text_generation.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the messages provided\n are already formatted that way i.e. following the OpenAI chat format.\"\"\"\n\n if not is_openai_format(input[\"messages\"]):\n raise DistilabelUserError(\n \"Input `messages` must be an OpenAI chat-like format conversation. \"\n f\"Got: {input['messages']}. Please check: 'https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models'.\",\n page=\"components-gallery/tasks/chatgeneration/\",\n )\n\n if input[\"messages\"][-1][\"role\"] != \"user\":\n raise DistilabelUserError(\n \"The last message must be from the user. Please check: \"\n \"'https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models'.\",\n page=\"components-gallery/tasks/chatgeneration/\",\n )\n\n return input[\"messages\"]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ChatGeneration.format_output","title":"format_output(output, input=None)
","text":"The output is formatted as a dictionary with the generation
. The model_name
will be automatically included within the process
method of Task
.
src/distilabel/steps/tasks/text_generation.py
def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `generation`. The `model_name`\n will be automatically included within the `process` method of `Task`.\"\"\"\n return {\"generation\": output}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextGeneration","title":"TextGeneration
","text":" Bases: Task
Text generation with an LLM
given a prompt.
TextGeneration
is a pre-defined task that allows passing a custom prompt using the Jinja2 syntax. By default, an instruction
is expected in the inputs, but using the template
and columns
attributes, one can define a custom prompt and the columns expected from the text. This task should be good enough for tasks that don't need post-processing of the responses generated by the LLM.
Attributes:
Name Type Descriptionsystem_prompt
Union[str, None]
The system prompt to use in the generation. If not provided, then it will check if the input row has a column named system_prompt
and use it. If not, then no system prompt will be used. Defaults to None
.
template
str
The template to use for the generation. It must follow the Jinja2 template syntax. If not provided, it will assume the text passed is an instruction and construct the appropriate template.
columns
Union[str, List[str]]
A string with the column, or a list with columns expected in the template. Take a look at the examples for more information. Defaults to instruction
.
use_system_prompt
bool
DEPRECATED. To be removed in 1.5.0. Whether to use the system prompt in the generation. Defaults to True
, which means that if the column system_prompt
is defined within the input batch, then the system_prompt
will be used, otherwise, it will be ignored.
columns
attribute): By default will be set to instruction
. The columns can point both to a str
or a List[str]
to be used in the template.str
): The generated text.str
): The name of the model used to generate the text.Examples:
Generate text from an instruction:
from distilabel.steps.tasks import TextGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\ntext_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n )\n)\n\ntext_gen.load()\n\nresult = next(\n text_gen.process(\n [{\"instruction\": \"your instruction\"}]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'your instruction',\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n# 'generation': 'generation',\n# }\n# ]\n
Use a custom template to generate text:
from distilabel.steps.tasks import TextGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nCUSTOM_TEMPLATE = '''Document:\n{{ document }}\n\nQuestion: {{ question }}\n\nPlease provide a clear and concise answer to the question based on the information in the document and your general knowledge:\n'''.rstrip()\n\ntext_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n system_prompt=\"You are a helpful AI assistant. Your task is to answer the following question based on the provided document. If the answer is not explicitly stated in the document, use your knowledge to provide the most relevant and accurate answer possible. If you cannot answer the question based on the given information, state that clearly.\",\n template=CUSTOM_TEMPLATE,\n columns=[\"document\", \"question\"],\n)\n\ntext_gen.load()\n\nresult = next(\n text_gen.process(\n [\n {\n \"document\": \"The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.\",\n \"question\": \"What is the main threat to the Great Barrier Reef mentioned in the document?\"\n }\n ]\n )\n)\n# result\n# [\n# {\n# 'document': 'The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.',\n# 'question': 'What is the main threat to the Great Barrier Reef mentioned in the document?',\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n# 'generation': 'According to the document, the main threat to the Great Barrier Reef is climate change, specifically rising sea temperatures causing coral bleaching events.',\n# }\n# ]\n
Few shot learning with different system prompts:
from distilabel.steps.tasks import TextGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nCUSTOM_TEMPLATE = '''Generate a clear, single-sentence instruction based on the following examples:\n\n{% for example in examples %}\nExample {{ loop.index }}:\nInstruction: {{ example }}\n\n{% endfor %}\nNow, generate a new instruction in a similar style:\n'''.rstrip()\n\ntext_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n template=CUSTOM_TEMPLATE,\n columns=\"examples\",\n)\n\ntext_gen.load()\n\nresult = next(\n text_gen.process(\n [\n {\n \"examples\": [\"This is an example\", \"Another relevant example\"],\n \"system_prompt\": \"You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.\"\n }\n ]\n )\n)\n# result\n# [\n# {\n# 'examples': ['This is an example', 'Another relevant example'],\n# 'system_prompt': 'You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.',\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n# 'generation': 'Disable the firewall on the router',\n# }\n# ]\n
Source code in src/distilabel/steps/tasks/text_generation.py
class TextGeneration(Task):\n \"\"\"Text generation with an `LLM` given a prompt.\n\n `TextGeneration` is a pre-defined task that allows passing a custom prompt using the\n Jinja2 syntax. By default, a `instruction` is expected in the inputs, but the using\n `template` and `columns` attributes one can define a custom prompt and columns expected\n from the text. This task should be good enough for tasks that don't need post-processing\n of the responses generated by the LLM.\n\n Attributes:\n system_prompt: The system prompt to use in the generation. If not provided, then\n it will check if the input row has a column named `system_prompt` and use it.\n If not, then no system prompt will be used. Defaults to `None`.\n template: The template to use for the generation. It must follow the Jinja2 template\n syntax. If not provided, it will assume the text passed is an instruction and\n construct the appropriate template.\n columns: A string with the column, or a list with columns expected in the template.\n Take a look at the examples for more information. Defaults to `instruction`.\n use_system_prompt: DEPRECATED. To be removed in 1.5.0. Whether to use the system\n prompt in the generation. Defaults to `True`, which means that if the column\n `system_prompt` is defined within the input batch, then the `system_prompt`\n will be used, otherwise, it will be ignored.\n\n Input columns:\n - dynamic (determined by `columns` attribute): By default will be set to `instruction`.\n The columns can point both to a `str` or a `List[str]` to be used in the template.\n\n Output columns:\n - generation (`str`): The generated text.\n - model_name (`str`): The name of the model used to generate the text.\n\n Categories:\n - text-generation\n\n References:\n - [Jinja2 Template Designer Documentation](https://jinja.palletsprojects.com/en/3.1.x/templates/)\n\n Examples:\n Generate text from an instruction:\n\n ```python\n from distilabel.steps.tasks import TextGeneration\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n text_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n )\n )\n\n text_gen.load()\n\n result = next(\n text_gen.process(\n [{\"instruction\": \"your instruction\"}]\n )\n )\n # result\n # [\n # {\n # 'instruction': 'your instruction',\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n # 'generation': 'generation',\n # }\n # ]\n ```\n\n Use a custom template to generate text:\n\n ```python\n from distilabel.steps.tasks import TextGeneration\n from distilabel.models import InferenceEndpointsLLM\n\n CUSTOM_TEMPLATE = '''Document:\n {{ document }}\n\n Question: {{ question }}\n\n Please provide a clear and concise answer to the question based on the information in the document and your general knowledge:\n '''.rstrip()\n\n text_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n system_prompt=\"You are a helpful AI assistant. Your task is to answer the following question based on the provided document. If the answer is not explicitly stated in the document, use your knowledge to provide the most relevant and accurate answer possible. 
If you cannot answer the question based on the given information, state that clearly.\",\n template=CUSTOM_TEMPLATE,\n columns=[\"document\", \"question\"],\n )\n\n text_gen.load()\n\n result = next(\n text_gen.process(\n [\n {\n \"document\": \"The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.\",\n \"question\": \"What is the main threat to the Great Barrier Reef mentioned in the document?\"\n }\n ]\n )\n )\n # result\n # [\n # {\n # 'document': 'The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.',\n # 'question': 'What is the main threat to the Great Barrier Reef mentioned in the document?',\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n # 'generation': 'According to the document, the main threat to the Great Barrier Reef is climate change, specifically rising sea temperatures causing coral bleaching events.',\n # }\n # ]\n ```\n\n Few shot learning with different system prompts:\n\n ```python\n from distilabel.steps.tasks import TextGeneration\n from distilabel.models import InferenceEndpointsLLM\n\n CUSTOM_TEMPLATE = '''Generate a clear, single-sentence instruction based on the following examples:\n\n {% for example in examples %}\n Example {{ loop.index }}:\n Instruction: {{ example }}\n\n {% endfor %}\n Now, generate a new instruction in a similar style:\n '''.rstrip()\n\n text_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n template=CUSTOM_TEMPLATE,\n columns=\"examples\",\n )\n\n text_gen.load()\n\n result = next(\n text_gen.process(\n [\n {\n \"examples\": [\"This is an example\", \"Another relevant example\"],\n \"system_prompt\": \"You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.\"\n }\n ]\n )\n )\n # result\n # [\n # {\n # 'examples': ['This is an example', 'Another relevant example'],\n # 'system_prompt': 'You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.',\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n # 'generation': 'Disable the firewall on the router',\n # }\n # ]\n ```\n \"\"\"\n\n system_prompt: Union[str, None] = None\n use_system_prompt: bool = Field(default=True, deprecated=True)\n template: str = Field(\n default=\"{{ instruction }}\",\n description=(\n \"This is a template or prompt to use for the generation. \"\n \"If not provided, it is assumed a `instruction` is placed in the inputs, \"\n \"to be used as is.\"\n ),\n )\n columns: Union[str, List[str]] = Field(\n default=\"instruction\",\n description=(\n \"Custom column or list of columns to include in the input. \"\n \"If a `template` is provided which needs custom column names, \"\n \"then they should be provided here. 
By default it will use `instruction`.\"\n ),\n )\n\n _can_be_used_with_offline_batch_generation = True\n _template: Optional[\"Template\"] = PrivateAttr(default=...)\n\n def model_post_init(self, __context: Any) -> None:\n self.columns = [self.columns] if isinstance(self.columns, str) else self.columns\n super().model_post_init(__context)\n\n def load(self) -> None:\n super().load()\n\n for column in self.columns:\n check_column_in_template(column, self.template)\n\n self._template = Template(self.template)\n\n def unload(self) -> None:\n super().unload()\n self._template = None\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The input for the task is the `instruction` by default, or the `columns` given as input.\"\"\"\n columns = {column: True for column in self.columns}\n columns[\"system_prompt\"] = False\n return columns\n\n def _prepare_message_content(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"Prepares the content for the template and returns the formatted messages.\"\"\"\n fields = {column: input[column] for column in self.columns}\n return [{\"role\": \"user\", \"content\": self._template.render(**fields)}]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n # Handle the previous expected errors, in case of custom columns there's more freedom\n # and we cannot check it so easily.\n if self.columns == [\"instruction\"]:\n if is_openai_format(input[\"instruction\"]):\n raise DistilabelUserError(\n \"Providing `instruction` formatted as an OpenAI chat / conversation is\"\n \" deprecated, you should use `ChatGeneration` with `messages` as input instead.\",\n page=\"components-gallery/tasks/textgeneration/\",\n )\n\n if not isinstance(input[\"instruction\"], str):\n raise DistilabelUserError(\n f\"Input `instruction` must be a string. Got: {input['instruction']}.\",\n page=\"components-gallery/tasks/textgeneration/\",\n )\n\n messages = self._prepare_message_content(input)\n\n row_system_prompt = input.get(\"system_prompt\")\n if row_system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": row_system_prompt})\n\n if self.system_prompt and not row_system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": self.system_prompt})\n\n return messages # type: ignore\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task is the `generation` and the `model_name`.\"\"\"\n return [\"generation\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `generation`. The `model_name`\n will be automatically included within the `process` method of `Task`.\"\"\"\n return {\"generation\": output}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextGeneration.inputs","title":"inputs
property
","text":"The input for the task is the instruction
by default, or the columns
given as input.
outputs
property
","text":"The output for the task is the generation
and the model_name
.
_prepare_message_content(input)
","text":"Prepares the content for the template and returns the formatted messages.
Source code insrc/distilabel/steps/tasks/text_generation.py
def _prepare_message_content(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"Prepares the content for the template and returns the formatted messages.\"\"\"\n fields = {column: input[column] for column in self.columns}\n return [{\"role\": \"user\", \"content\": self._template.render(**fields)}]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextGeneration.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/text_generation.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n # Handle the previous expected errors, in case of custom columns there's more freedom\n # and we cannot check it so easily.\n if self.columns == [\"instruction\"]:\n if is_openai_format(input[\"instruction\"]):\n raise DistilabelUserError(\n \"Providing `instruction` formatted as an OpenAI chat / conversation is\"\n \" deprecated, you should use `ChatGeneration` with `messages` as input instead.\",\n page=\"components-gallery/tasks/textgeneration/\",\n )\n\n if not isinstance(input[\"instruction\"], str):\n raise DistilabelUserError(\n f\"Input `instruction` must be a string. Got: {input['instruction']}.\",\n page=\"components-gallery/tasks/textgeneration/\",\n )\n\n messages = self._prepare_message_content(input)\n\n row_system_prompt = input.get(\"system_prompt\")\n if row_system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": row_system_prompt})\n\n if self.system_prompt and not row_system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": self.system_prompt})\n\n return messages # type: ignore\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextGeneration.format_output","title":"format_output(output, input=None)
","text":"The output is formatted as a dictionary with the generation
. The model_name
will be automatically included within the process
method of Task
.
src/distilabel/steps/tasks/text_generation.py
def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `generation`. The `model_name`\n will be automatically included within the `process` method of `Task`.\"\"\"\n return {\"generation\": output}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextGenerationWithImage","title":"TextGenerationWithImage
","text":" Bases: TextGeneration
Text generation with images with an LLM
given a prompt.
`TextGenerationWithImage` is a pre-defined task that allows passing a custom prompt using the\nJinja2 syntax. By default, a `instruction` is expected in the inputs, but the using\n`template` and `columns` attributes one can define a custom prompt and columns expected\nfrom the text. Additionally, an `image` column is expected containing one of the\nurl, base64 encoded image or PIL image. This task inherits from `TextGeneration`,\nso all the functionality available in that task related to the prompt will be available\nhere too.\n\nAttributes:\n system_prompt: The system prompt to use in the generation.\n If not, then no system prompt will be used. Defaults to `None`.\n template: The template to use for the generation. It must follow the Jinja2 template\n syntax. If not provided, it will assume the text passed is an instruction and\n construct the appropriate template.\n columns: A string with the column, or a list with columns expected in the template.\n Take a look at the examples for more information. Defaults to `instruction`.\n image_type: The type of the image provided, this will be used to preprocess if necessary.\n Must be one of \"url\", \"base64\" or \"PIL\".\n\nInput columns:\n - dynamic (determined by `columns` attribute): By default will be set to `instruction`.\n The columns can point both to a `str` or a `list[str]` to be used in the template.\n - image: The column containing the image URL, base64 encoded image or PIL image.\n\nOutput columns:\n - generation (`str`): The generated text.\n - model_name (`str`): The name of the model used to generate the text.\n\nCategories:\n - text-generation\n\nReferences:\n - [Jinja2 Template Designer Documentation](https://jinja.palletsprojects.com/en/3.1.x/templates/)\n - [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)\n - [OpenAI Vision](https://platform.openai.com/docs/guides/vision)\n\nExamples:\n Answer questions from an image:\n\n ```python\n from distilabel.steps.tasks import TextGenerationWithImage\n from distilabel.models.llms import InferenceEndpointsLLM\n\n vision = TextGenerationWithImage(\n name=\"vision_gen\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n ),\n image_type=\"url\"\n )\n\n vision.load()\n\n result = next(\n vision.process(\n [\n {\n \"instruction\": \"What\u2019s in this image?\",\n \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n }\n ]\n )\n )\n # result\n # [\n # {\n # \"instruction\": \"What\u2019s in this image?\",\n # \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\",\n # \"generation\": \"Based on the visual cues in the image...\",\n # \"model_name\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\"\n # ... # distilabel_metadata would be here\n # }\n # ]\n # result[0][\"generation\"]\n # \"Based on the visual cues in the image, here are some possible story points:\n
Analysis and Ideas: * The abundance of green grass and trees suggests a healthy ecosystem or habitat. * The presence of wildlife, such as birds or deer, is possible based on the surroundings. * A footbridge or a pathway might be a common feature in this area, providing access to nearby attractions or points of interest.
Additional Questions to Ask: * Why is a footbridge present in this area? * What kind of wildlife inhabits this region\"
Answer questions from an image stored as base64:\n\n```python\n# For this example we will assume that we have the string representation of the image\n# stored, but will just take the image and transform it to base64 to ilustrate the example.\nimport requests\nimport base64\n\nimage_url =\"https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg\"\nimg = requests.get(image_url).content\nbase64_image = base64.b64encode(img).decode(\"utf-8\")\n\nfrom distilabel.steps.tasks import TextGenerationWithImage\nfrom distilabel.models.llms import InferenceEndpointsLLM\n\nvision = TextGenerationWithImage(\n name=\"vision_gen\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n ),\n image_type=\"base64\"\n)\n\nvision.load()\n\nresult = next(\n vision.process(\n [\n {\n \"instruction\": \"What\u2019s in this image?\",\n \"image\": base64_image\n }\n ]\n )\n)\n
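Answer questions from a PIL image (an illustrative sketch, not taken from the upstream docstring; it assumes a local file `photo.jpg` exists and uses the documented `image_type="PIL"` option, which converts the image to base64 internally):

```python
from PIL import Image

from distilabel.steps.tasks import TextGenerationWithImage
from distilabel.models.llms import InferenceEndpointsLLM

vision = TextGenerationWithImage(
    name="vision_gen",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
    ),
    image_type="PIL",
)

vision.load()

result = next(
    vision.process(
        [
            {
                "instruction": "What's in this image?",
                # Hypothetical local file used only for illustration.
                "image": Image.open("photo.jpg"),
            }
        ]
    )
)
```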
Source code in src/distilabel/steps/tasks/text_generation_with_image.py
class TextGenerationWithImage(TextGeneration):\n \"\"\"Text generation with images with an `LLM` given a prompt.\n\n `TextGenerationWithImage` is a pre-defined task that allows passing a custom prompt using the\n Jinja2 syntax. By default, a `instruction` is expected in the inputs, but the using\n `template` and `columns` attributes one can define a custom prompt and columns expected\n from the text. Additionally, an `image` column is expected containing one of the\n url, base64 encoded image or PIL image. This task inherits from `TextGeneration`,\n so all the functionality available in that task related to the prompt will be available\n here too.\n\n Attributes:\n system_prompt: The system prompt to use in the generation.\n If not, then no system prompt will be used. Defaults to `None`.\n template: The template to use for the generation. It must follow the Jinja2 template\n syntax. If not provided, it will assume the text passed is an instruction and\n construct the appropriate template.\n columns: A string with the column, or a list with columns expected in the template.\n Take a look at the examples for more information. Defaults to `instruction`.\n image_type: The type of the image provided, this will be used to preprocess if necessary.\n Must be one of \"url\", \"base64\" or \"PIL\".\n\n Input columns:\n - dynamic (determined by `columns` attribute): By default will be set to `instruction`.\n The columns can point both to a `str` or a `list[str]` to be used in the template.\n - image: The column containing the image URL, base64 encoded image or PIL image.\n\n Output columns:\n - generation (`str`): The generated text.\n - model_name (`str`): The name of the model used to generate the text.\n\n Categories:\n - text-generation\n\n References:\n - [Jinja2 Template Designer Documentation](https://jinja.palletsprojects.com/en/3.1.x/templates/)\n - [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)\n - [OpenAI Vision](https://platform.openai.com/docs/guides/vision)\n\n Examples:\n Answer questions from an image:\n\n ```python\n from distilabel.steps.tasks import TextGenerationWithImage\n from distilabel.models.llms import InferenceEndpointsLLM\n\n vision = TextGenerationWithImage(\n name=\"vision_gen\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n ),\n image_type=\"url\"\n )\n\n vision.load()\n\n result = next(\n vision.process(\n [\n {\n \"instruction\": \"What\u2019s in this image?\",\n \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n }\n ]\n )\n )\n # result\n # [\n # {\n # \"instruction\": \"What\\u2019s in this image?\",\n # \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\",\n # \"generation\": \"Based on the visual cues in the image...\",\n # \"model_name\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\"\n # ... 
# distilabel_metadata would be here\n # }\n # ]\n # result[0][\"generation\"]\n # \"Based on the visual cues in the image, here are some possible story points:\\n\\n* The image features a wooden boardwalk leading through a lush grass field, possibly in a park or nature reserve.\\n\\nAnalysis and Ideas:\\n* The abundance of green grass and trees suggests a healthy ecosystem or habitat.\\n* The presence of wildlife, such as birds or deer, is possible based on the surroundings.\\n* A footbridge or a pathway might be a common feature in this area, providing access to nearby attractions or points of interest.\\n\\nAdditional Questions to Ask:\\n* Why is a footbridge present in this area?\\n* What kind of wildlife inhabits this region\"\n ```\n\n Answer questions from an image stored as base64:\n\n ```python\n # For this example we will assume that we have the string representation of the image\n # stored, but will just take the image and transform it to base64 to ilustrate the example.\n import requests\n import base64\n\n image_url =\"https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg\"\n img = requests.get(image_url).content\n base64_image = base64.b64encode(img).decode(\"utf-8\")\n\n from distilabel.steps.tasks import TextGenerationWithImage\n from distilabel.models.llms import InferenceEndpointsLLM\n\n vision = TextGenerationWithImage(\n name=\"vision_gen\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n ),\n image_type=\"base64\"\n )\n\n vision.load()\n\n result = next(\n vision.process(\n [\n {\n \"instruction\": \"What\u2019s in this image?\",\n \"image\": base64_image\n }\n ]\n )\n )\n ```\n \"\"\"\n\n image_type: Literal[\"url\", \"base64\", \"PIL\"] = Field(\n default=\"url\",\n description=\"The type of the image provided, this will be used to preprocess if necessary.\",\n )\n\n @property\n def inputs(self) -> \"StepColumns\":\n columns = super().inputs\n columns[\"image\"] = True\n return columns\n\n def load(self) -> None:\n Task.load(self)\n\n for column in self.columns:\n check_column_in_template(\n column, self.template, page=\"components-gallery/tasks/visiongeneration/\"\n )\n\n self._template = Template(self.template)\n\n def _transform_image(self, image: Union[str, \"Image\"]) -> str:\n \"\"\"Transforms the image based on the `image_type` attribute.\"\"\"\n if self.image_type == \"url\":\n return image\n\n if self.image_type == \"base64\":\n return f\"data:image/jpeg;base64,{image}\"\n\n # Othwerwise, it's a PIL image\n return f\"data:image/jpeg;base64,{image_to_str(image)}\"\n\n def _prepare_message_content(self, input: dict[str, Any]) -> \"ChatType\":\n \"\"\"Prepares the content for the template and returns the formatted messages.\"\"\"\n fields = {column: input[column] for column in self.columns}\n img_url = self._transform_image(input[\"image\"])\n return [\n {\n \"role\": \"user\",\n \"content\": [\n {\n \"type\": \"text\",\n \"text\": self._template.render(**fields),\n },\n {\n \"type\": \"image_url\",\n \"image_url\": {\n \"url\": img_url,\n },\n },\n ],\n }\n ]\n\n def format_input(self, input: dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n messages = self._prepare_message_content(input)\n\n if self.system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": self.system_prompt})\n\n return messages # type: ignore\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextGenerationWithImage._transform_image","title":"_transform_image(image)
","text":"Transforms the image based on the image_type
attribute.
src/distilabel/steps/tasks/text_generation_with_image.py
def _transform_image(self, image: Union[str, \"Image\"]) -> str:\n \"\"\"Transforms the image based on the `image_type` attribute.\"\"\"\n if self.image_type == \"url\":\n return image\n\n if self.image_type == \"base64\":\n return f\"data:image/jpeg;base64,{image}\"\n\n # Otherwise, it's a PIL image\n return f\"data:image/jpeg;base64,{image_to_str(image)}\"\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextGenerationWithImage._prepare_message_content","title":"_prepare_message_content(input)
","text":"Prepares the content for the template and returns the formatted messages.
Source code in src/distilabel/steps/tasks/text_generation_with_image.py
def _prepare_message_content(self, input: dict[str, Any]) -> \"ChatType\":\n \"\"\"Prepares the content for the template and returns the formatted messages.\"\"\"\n fields = {column: input[column] for column in self.columns}\n img_url = self._transform_image(input[\"image\"])\n return [\n {\n \"role\": \"user\",\n \"content\": [\n {\n \"type\": \"text\",\n \"text\": self._template.render(**fields),\n },\n {\n \"type\": \"image_url\",\n \"image_url\": {\n \"url\": img_url,\n },\n },\n ],\n }\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextGenerationWithImage.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/text_generation_with_image.py
def format_input(self, input: dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n messages = self._prepare_message_content(input)\n\n if self.system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": self.system_prompt})\n\n return messages # type: ignore\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.UltraFeedback","title":"UltraFeedback
","text":" Bases: Task
Rank generations focusing on different aspects using an LLM
.
UltraFeedback: Boosting Language Models with High-quality Feedback.
Attributes:
Name Type Descriptionaspect
Literal['helpfulness', 'honesty', 'instruction-following', 'truthfulness', 'overall-rating']
The aspect to perform with the UltraFeedback
model. The available aspects are: - helpfulness
: Evaluate text outputs based on helpfulness. - honesty
: Evaluate text outputs based on honesty. - instruction-following
: Evaluate text outputs based on given instructions. - truthfulness
: Evaluate text outputs based on truthfulness. Additionally, a custom aspect has been defined by Argilla, so as to evaluate the overall assessment of the text outputs within a single prompt. The custom aspect is: - overall-rating
: Evaluate text outputs based on an overall assessment. Defaults to \"overall-rating\"
.
str
): The reference instruction to evaluate the text outputs.List[str]
): The text outputs to evaluate for the given instruction.List[float]
): The ratings for each of the provided text outputs.List[str]
): The rationales for each of the provided text outputs.str
): The name of the model used to generate the ratings and rationales.UltraFeedback: Boosting Language Models with High-quality Feedback
UltraFeedback - GitHub Repository
Examples:
Rate generations from different LLMs based on the selected aspect:
from distilabel.steps.tasks import UltraFeedback\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n use_default_structured_output=False\n)\n\nultrafeedback.load()\n\nresult = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'How much is 2+2?',\n# 'generations': ['4', 'and a car'],\n# 'ratings': [1, 2],\n# 'rationales': ['explanation for 4', 'explanation for and a car'],\n# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n# }\n# ]\n
Rate generations from different LLMs based on honesty, using the default structured output:
from distilabel.steps.tasks import UltraFeedback\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n aspect=\"honesty\"\n)\n\nultrafeedback.load()\n\nresult = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n)\n# result\n# [{'instruction': 'How much is 2+2?',\n# 'generations': ['4', 'and a car'],\n# 'ratings': [5, 1],\n# 'rationales': ['The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.',\n# \"The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. The model shows no uncertainty or indication that it does not know the answer.\"],\n# 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{\"ratings\": [\\n 5,\\n 1\\n] \\n\\n,\"rationales\": [\\n \"The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.\",\\n \"The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. The model shows no uncertainty or indication that it does not know the answer.\"\\n] }'},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
Rate generations from different LLMs based on helpfulness, using the default structured output:
from distilabel.steps.tasks import UltraFeedback\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512},\n ),\n aspect=\"helpfulness\"\n)\n\nultrafeedback.load()\n\nresult = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n)\n# result\n# [{'instruction': 'How much is 2+2?',\n# 'generations': ['4', 'and a car'],\n# 'ratings': [1, 5],\n# 'rationales': ['Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.',\n# 'Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question.'],\n# 'rationales_for_rating': ['Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.',\n# 'Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question.'],\n# 'types': [1, 3, 1],\n# 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{ \\n \"ratings\": [\\n 1,\\n 5\\n ]\\n ,\\n \"rationales\": [\\n \"Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.\",\\n \"Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question.\"\\n ]\\n ,\\n \"rationales_for_rating\": [\\n \"Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.\",\\n \"Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question.\"\\n ]\\n ,\\n \"types\": [\\n 1, 3,\\n 1\\n ]\\n }'},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
Citations @misc{cui2024ultrafeedbackboostinglanguagemodels,\n title={UltraFeedback: Boosting Language Models with Scaled AI Feedback},\n author={Ganqu Cui and Lifan Yuan and Ning Ding and Guanming Yao and Bingxiang He and Wei Zhu and Yuan Ni and Guotong Xie and Ruobing Xie and Yankai Lin and Zhiyuan Liu and Maosong Sun},\n year={2024},\n eprint={2310.01377},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2310.01377},\n}\n
Source code in src/distilabel/steps/tasks/ultrafeedback.py
class UltraFeedback(Task):\n \"\"\"Rank generations focusing on different aspects using an `LLM`.\n\n UltraFeedback: Boosting Language Models with High-quality Feedback.\n\n Attributes:\n aspect: The aspect to perform with the `UltraFeedback` model. The available aspects are:\n - `helpfulness`: Evaluate text outputs based on helpfulness.\n - `honesty`: Evaluate text outputs based on honesty.\n - `instruction-following`: Evaluate text outputs based on given instructions.\n - `truthfulness`: Evaluate text outputs based on truthfulness.\n Additionally, a custom aspect has been defined by Argilla, so as to evaluate the overall\n assessment of the text outputs within a single prompt. The custom aspect is:\n - `overall-rating`: Evaluate text outputs based on an overall assessment.\n Defaults to `\"overall-rating\"`.\n\n Input columns:\n - instruction (`str`): The reference instruction to evaluate the text outputs.\n - generations (`List[str]`): The text outputs to evaluate for the given instruction.\n\n Output columns:\n - ratings (`List[float]`): The ratings for each of the provided text outputs.\n - rationales (`List[str]`): The rationales for each of the provided text outputs.\n - model_name (`str`): The name of the model used to generate the ratings and rationales.\n\n Categories:\n - preference\n\n References:\n - [`UltraFeedback: Boosting Language Models with High-quality Feedback`](https://arxiv.org/abs/2310.01377)\n - [`UltraFeedback - GitHub Repository`](https://github.com/OpenBMB/UltraFeedback)\n\n Examples:\n Rate generations from different LLMs based on the selected aspect:\n\n ```python\n from distilabel.steps.tasks import UltraFeedback\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n ultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n use_default_structured_output=False\n )\n\n ultrafeedback.load()\n\n result = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n )\n # result\n # [\n # {\n # 'instruction': 'How much is 2+2?',\n # 'generations': ['4', 'and a car'],\n # 'ratings': [1, 2],\n # 'rationales': ['explanation for 4', 'explanation for and a car'],\n # 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n # }\n # ]\n ```\n\n Rate generations from different LLMs based on the honesty, using the default structured output:\n\n ```python\n from distilabel.steps.tasks import UltraFeedback\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n ultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n aspect=\"honesty\"\n )\n\n ultrafeedback.load()\n\n result = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n )\n # result\n # [{'instruction': 'How much is 2+2?',\n # 'generations': ['4', 'and a car'],\n # 'ratings': [5, 1],\n # 'rationales': ['The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.',\n # \"The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. 
The model shows no uncertainty or indication that it does not know the answer.\"],\n # 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{\"ratings\": [\\\\n 5,\\\\n 1\\\\n] \\\\n\\\\n,\"rationales\": [\\\\n \"The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.\",\\\\n \"The response is confidently incorrect, as it provides unrelated information (\\'a car\\') and does not address the question. The model shows no uncertainty or indication that it does not know the answer.\"\\\\n] }'},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n\n Rate generations from different LLMs based on the helpfulness, using the default structured output:\n\n ```python\n from distilabel.steps.tasks import UltraFeedback\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n ultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512},\n ),\n aspect=\"helpfulness\"\n )\n\n ultrafeedback.load()\n\n result = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n )\n # result\n # [{'instruction': 'How much is 2+2?',\n # 'generations': ['4', 'and a car'],\n # 'ratings': [1, 5],\n # 'rationales': ['Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.',\n # 'Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question.'],\n # 'rationales_for_rating': ['Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.',\n # 'Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question.'],\n # 'types': [1, 3, 1],\n # 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{ \\\\n \"ratings\": [\\\\n 1,\\\\n 5\\\\n ]\\\\n ,\\\\n \"rationales\": [\\\\n \"Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.\",\\\\n \"Text 2 is neither clear nor relevant to the task. 
It does not provide any useful information and seems unrelated to the question.\"\\\\n ]\\\\n ,\\\\n \"rationales_for_rating\": [\\\\n \"Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.\",\\\\n \"Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question.\"\\\\n ]\\\\n ,\\\\n \"types\": [\\\\n 1, 3,\\\\n 1\\\\n ]\\\\n }'},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n\n Citations:\n ```\n @misc{cui2024ultrafeedbackboostinglanguagemodels,\n title={UltraFeedback: Boosting Language Models with Scaled AI Feedback},\n author={Ganqu Cui and Lifan Yuan and Ning Ding and Guanming Yao and Bingxiang He and Wei Zhu and Yuan Ni and Guotong Xie and Ruobing Xie and Yankai Lin and Zhiyuan Liu and Maosong Sun},\n year={2024},\n eprint={2310.01377},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2310.01377},\n }\n ```\n \"\"\"\n\n aspect: Literal[\n \"helpfulness\",\n \"honesty\",\n \"instruction-following\",\n \"truthfulness\",\n # Custom aspects\n \"overall-rating\",\n ] = \"overall-rating\"\n\n _system_prompt: str = PrivateAttr(\n default=(\n \"Your role is to evaluate text quality based on given criteria.\\n\"\n 'You\\'ll receive an instructional description (\"Instruction\") and {no_texts} text outputs (\"Text\").\\n'\n \"Understand and interpret instructions to evaluate effectively.\\n\"\n \"Provide annotations for each text with a rating and rationale.\\n\"\n \"The {no_texts} texts given are independent, and should be evaluated separately.\\n\"\n )\n )\n _template: Optional[\"Template\"] = PrivateAttr(default=...)\n _can_be_used_with_offline_batch_generation = True\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template for the given `aspect`.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"ultrafeedback\"\n / f\"{self.aspect}.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task is the `instruction`, and the `generations` for it.\"\"\"\n return [\"instruction\", \"generations\"]\n\n def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"system\",\n \"content\": self._system_prompt.format(\n no_texts=len(input[\"generations\"])\n ),\n },\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n instruction=input[\"instruction\"], generations=input[\"generations\"]\n ),\n },\n ]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task is the `generation` and the `model_name`.\"\"\"\n columns = []\n if self.aspect in [\"honesty\", \"instruction-following\", \"overall-rating\"]:\n columns = [\"ratings\", \"rationales\"]\n elif self.aspect in [\"helpfulness\", \"truthfulness\"]:\n columns = [\"types\", \"rationales\", \"ratings\", \"rationales-for-ratings\"]\n return columns + [\"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `ratings` and `rationales` for\n each of the provided `generations` for the given `instruction`. 
The `model_name`\n will be automatically included within the `process` method of `Task`.\n\n Args:\n output: a string representing the output of the LLM via the `process` method.\n input: the input to the task, as required by some tasks to format the output.\n\n Returns:\n A dictionary containing either the `ratings` and `rationales` for each of the provided\n `generations` for the given `instruction` if the provided aspect is either `honesty`,\n `instruction-following`, or `overall-rating`; or the `types`, `rationales`,\n `ratings`, and `rationales-for-ratings` for each of the provided `generations` for the\n given `instruction` if the provided aspect is either `helpfulness` or `truthfulness`.\n \"\"\"\n assert input is not None, \"Input is required to format the output.\"\n\n if self.aspect in [\n \"honesty\",\n \"instruction-following\",\n \"overall-rating\",\n ]:\n return self._format_ratings_rationales_output(output, input)\n\n return self._format_types_ratings_rationales_output(output, input)\n\n def _format_ratings_rationales_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, List[Any]]:\n \"\"\"Formats the output when the aspect is either `honesty`, `instruction-following`, or `overall-rating`.\"\"\"\n if output is None:\n return {\n \"ratings\": [None] * len(input[\"generations\"]),\n \"rationales\": [None] * len(input[\"generations\"]),\n }\n\n if self.use_default_structured_output:\n return self._format_structured_output(output, input)\n\n pattern = r\"Rating: (.+?)\\nRationale: (.+)\"\n sections = output.split(\"\\n\\n\")\n\n formatted_outputs = []\n for section in sections:\n matches = None\n if section is not None and section != \"\":\n matches = re.search(pattern, section, re.DOTALL)\n if not matches:\n formatted_outputs.append({\"ratings\": None, \"rationales\": None})\n continue\n\n formatted_outputs.append(\n {\n \"ratings\": (\n int(re.findall(r\"\\b\\d+\\b\", matches.group(1))[0])\n if matches.group(1) not in [\"None\", \"N/A\"]\n else None\n ),\n \"rationales\": matches.group(2),\n }\n )\n return group_dicts(*formatted_outputs)\n\n def _format_types_ratings_rationales_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, List[Any]]:\n \"\"\"Formats the output when the aspect is either `helpfulness` or `truthfulness`.\"\"\"\n if output is None:\n return {\n \"types\": [None] * len(input[\"generations\"]),\n \"rationales\": [None] * len(input[\"generations\"]),\n \"ratings\": [None] * len(input[\"generations\"]),\n \"rationales-for-ratings\": [None] * len(input[\"generations\"]),\n }\n\n if self.use_default_structured_output:\n return self._format_structured_output(output, input)\n\n pattern = r\"Type: (.+?)\\nRationale: (.+?)\\nRating: (.+?)\\nRationale: (.+)\"\n\n sections = output.split(\"\\n\\n\")\n\n formatted_outputs = []\n for section in sections:\n matches = None\n if section is not None and section != \"\":\n matches = re.search(pattern, section, re.DOTALL)\n if not matches:\n formatted_outputs.append(\n {\n \"types\": None,\n \"rationales\": None,\n \"ratings\": None,\n \"rationales-for-ratings\": None,\n }\n )\n continue\n\n formatted_outputs.append(\n {\n \"types\": (\n int(re.findall(r\"\\b\\d+\\b\", matches.group(1))[0])\n if matches.group(1) not in [\"None\", \"N/A\"]\n else None\n ),\n \"rationales\": matches.group(2),\n \"ratings\": (\n int(re.findall(r\"\\b\\d+\\b\", matches.group(3))[0])\n if matches.group(3) not in [\"None\", \"N/A\"]\n else None\n ),\n \"rationales-for-ratings\": 
matches.group(4),\n }\n )\n return group_dicts(*formatted_outputs)\n\n @override\n def get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel\n from typing import List\n\n class SchemaUltraFeedback(BaseModel):\n ratings: List[int]\n rationales: List[str]\n\n class SchemaUltraFeedbackWithType(BaseModel):\n types: List[Optional[int]]\n ratings: List[int]\n rationales: List[str]\n rationales_for_rating: List[str]\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n if self.aspect in [\n \"honesty\",\n \"instruction-following\",\n \"overall-rating\",\n ]:\n return {\n \"properties\": {\n \"ratings\": {\n \"items\": {\"type\": \"integer\"},\n \"title\": \"Ratings\",\n \"type\": \"array\",\n },\n \"rationales\": {\n \"items\": {\"type\": \"string\"},\n \"title\": \"Rationales\",\n \"type\": \"array\",\n },\n },\n \"required\": [\"ratings\", \"rationales\"],\n \"title\": \"SchemaUltraFeedback\",\n \"type\": \"object\",\n }\n return {\n \"properties\": {\n \"types\": {\n \"items\": {\"anyOf\": [{\"type\": \"integer\"}, {\"type\": \"null\"}]},\n \"title\": \"Types\",\n \"type\": \"array\",\n },\n \"ratings\": {\n \"items\": {\"type\": \"integer\"},\n \"title\": \"Ratings\",\n \"type\": \"array\",\n },\n \"rationales\": {\n \"items\": {\"type\": \"string\"},\n \"title\": \"Rationales\",\n \"type\": \"array\",\n },\n \"rationales_for_rating\": {\n \"items\": {\"type\": \"string\"},\n \"title\": \"Rationales For Rating\",\n \"type\": \"array\",\n },\n },\n \"required\": [\"types\", \"ratings\", \"rationales\", \"rationales_for_rating\"],\n \"title\": \"SchemaUltraFeedbackWithType\",\n \"type\": \"object\",\n }\n\n def _format_structured_output(\n self, output: str, input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with either `positive`, or `positive` and `negative` keys.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n if self.aspect in [\n \"honesty\",\n \"instruction-following\",\n \"overall-rating\",\n ]:\n return {\n \"ratings\": [None] * len(input[\"generations\"]),\n \"rationales\": [None] * len(input[\"generations\"]),\n }\n return {\n \"ratings\": [None] * len(input[\"generations\"]),\n \"rationales\": [None] * len(input[\"generations\"]),\n \"types\": [None] * len(input[\"generations\"]),\n \"rationales-for-ratings\": [None] * len(input[\"generations\"]),\n }\n\n @override\n def _sample_input(self) -> ChatType:\n return self.format_input(\n {\n \"instruction\": f\"<PLACEHOLDER_{'instruction'.upper()}>\",\n \"generations\": [\n f\"<PLACEHOLDER_{f'GENERATION_{i}'.upper()}>\" for i in range(2)\n ],\n }\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.UltraFeedback.inputs","title":"inputs
property
","text":"The input for the task is the instruction
, and the generations
for it.
outputs
property
","text":"The output for the task is the generation
and the model_name
.
load()
","text":"Loads the Jinja2 template for the given aspect
.
src/distilabel/steps/tasks/ultrafeedback.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template for the given `aspect`.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"ultrafeedback\"\n / f\"{self.aspect}.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.UltraFeedback.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/ultrafeedback.py
def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"system\",\n \"content\": self._system_prompt.format(\n no_texts=len(input[\"generations\"])\n ),\n },\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n instruction=input[\"instruction\"], generations=input[\"generations\"]\n ),\n },\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.UltraFeedback.format_output","title":"format_output(output, input=None)
","text":"The output is formatted as a dictionary with the ratings
and rationales
for each of the provided generations
for the given instruction
. The model_name
will be automatically included within the process
method of Task
.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
a string representing the output of the LLM via the process
method.
input
Union[Dict[str, Any], None]
the input to the task, as required by some tasks to format the output.
None
Returns:
Type DescriptionDict[str, Any]
A dictionary containing either the ratings
and rationales
for each of the provided
Dict[str, Any]
generations
for the given instruction
if the provided aspect is either honesty
,
Dict[str, Any]
instruction-following
, or overall-rating
; or the types
, rationales
,
Dict[str, Any]
ratings
, and rationales-for-ratings
for each of the provided generations
for the
Dict[str, Any]
given instruction
if the provided aspect is either helpfulness
or truthfulness
.
src/distilabel/steps/tasks/ultrafeedback.py
def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `ratings` and `rationales` for\n each of the provided `generations` for the given `instruction`. The `model_name`\n will be automatically included within the `process` method of `Task`.\n\n Args:\n output: a string representing the output of the LLM via the `process` method.\n input: the input to the task, as required by some tasks to format the output.\n\n Returns:\n A dictionary containing either the `ratings` and `rationales` for each of the provided\n `generations` for the given `instruction` if the provided aspect is either `honesty`,\n `instruction-following`, or `overall-rating`; or the `types`, `rationales`,\n `ratings`, and `rationales-for-ratings` for each of the provided `generations` for the\n given `instruction` if the provided aspect is either `helpfulness` or `truthfulness`.\n \"\"\"\n assert input is not None, \"Input is required to format the output.\"\n\n if self.aspect in [\n \"honesty\",\n \"instruction-following\",\n \"overall-rating\",\n ]:\n return self._format_ratings_rationales_output(output, input)\n\n return self._format_types_ratings_rationales_output(output, input)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.UltraFeedback._format_ratings_rationales_output","title":"_format_ratings_rationales_output(output, input)
","text":"Formats the output when the aspect is either honesty
, instruction-following
, or overall-rating
.
src/distilabel/steps/tasks/ultrafeedback.py
def _format_ratings_rationales_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, List[Any]]:\n \"\"\"Formats the output when the aspect is either `honesty`, `instruction-following`, or `overall-rating`.\"\"\"\n if output is None:\n return {\n \"ratings\": [None] * len(input[\"generations\"]),\n \"rationales\": [None] * len(input[\"generations\"]),\n }\n\n if self.use_default_structured_output:\n return self._format_structured_output(output, input)\n\n pattern = r\"Rating: (.+?)\\nRationale: (.+)\"\n sections = output.split(\"\\n\\n\")\n\n formatted_outputs = []\n for section in sections:\n matches = None\n if section is not None and section != \"\":\n matches = re.search(pattern, section, re.DOTALL)\n if not matches:\n formatted_outputs.append({\"ratings\": None, \"rationales\": None})\n continue\n\n formatted_outputs.append(\n {\n \"ratings\": (\n int(re.findall(r\"\\b\\d+\\b\", matches.group(1))[0])\n if matches.group(1) not in [\"None\", \"N/A\"]\n else None\n ),\n \"rationales\": matches.group(2),\n }\n )\n return group_dicts(*formatted_outputs)\n
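To illustrate the plain-text layout this parser expects when `use_default_structured_output` is disabled, here is a small self-contained sketch that applies the same splitting and regex shown above to a made-up raw completion (the sample text is invented for illustration):

```python
import re

# Hypothetical raw completion covering two generations, following the
# "Rating: ...\nRationale: ..." layout parsed by the method above.
raw_output = (
    "Rating: 5\nRationale: The answer is correct and concise.\n\n"
    "Rating: 1\nRationale: The answer is unrelated to the question."
)

pattern = r"Rating: (.+?)\nRationale: (.+)"
for section in raw_output.split("\n\n"):
    matches = re.search(pattern, section, re.DOTALL)
    if matches:
        rating = int(re.findall(r"\b\d+\b", matches.group(1))[0])
        print(rating, "->", matches.group(2))
# 5 -> The answer is correct and concise.
# 1 -> The answer is unrelated to the question.
```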
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.UltraFeedback._format_types_ratings_rationales_output","title":"_format_types_ratings_rationales_output(output, input)
","text":"Formats the output when the aspect is either helpfulness
or truthfulness
.
src/distilabel/steps/tasks/ultrafeedback.py
def _format_types_ratings_rationales_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, List[Any]]:\n \"\"\"Formats the output when the aspect is either `helpfulness` or `truthfulness`.\"\"\"\n if output is None:\n return {\n \"types\": [None] * len(input[\"generations\"]),\n \"rationales\": [None] * len(input[\"generations\"]),\n \"ratings\": [None] * len(input[\"generations\"]),\n \"rationales-for-ratings\": [None] * len(input[\"generations\"]),\n }\n\n if self.use_default_structured_output:\n return self._format_structured_output(output, input)\n\n pattern = r\"Type: (.+?)\\nRationale: (.+?)\\nRating: (.+?)\\nRationale: (.+)\"\n\n sections = output.split(\"\\n\\n\")\n\n formatted_outputs = []\n for section in sections:\n matches = None\n if section is not None and section != \"\":\n matches = re.search(pattern, section, re.DOTALL)\n if not matches:\n formatted_outputs.append(\n {\n \"types\": None,\n \"rationales\": None,\n \"ratings\": None,\n \"rationales-for-ratings\": None,\n }\n )\n continue\n\n formatted_outputs.append(\n {\n \"types\": (\n int(re.findall(r\"\\b\\d+\\b\", matches.group(1))[0])\n if matches.group(1) not in [\"None\", \"N/A\"]\n else None\n ),\n \"rationales\": matches.group(2),\n \"ratings\": (\n int(re.findall(r\"\\b\\d+\\b\", matches.group(3))[0])\n if matches.group(3) not in [\"None\", \"N/A\"]\n else None\n ),\n \"rationales-for-ratings\": matches.group(4),\n }\n )\n return group_dicts(*formatted_outputs)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.UltraFeedback.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
The schema corresponds to the following:
from pydantic import BaseModel\nfrom typing import List\n\nclass SchemaUltraFeedback(BaseModel):\n ratings: List[int]\n rationales: List[str]\n\nclass SchemaUltraFeedbackWithType(BaseModel):\n types: List[Optional[int]]\n ratings: List[int]\n rationales: List[str]\n rationales_for_rating: List[str]\n
Returns:
Type DescriptionDict[str, Any]
JSON Schema of the response to enforce.
Source code in src/distilabel/steps/tasks/ultrafeedback.py
@override\ndef get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel\n from typing import List\n\n class SchemaUltraFeedback(BaseModel):\n ratings: List[int]\n rationales: List[str]\n\n class SchemaUltraFeedbackWithType(BaseModel):\n types: List[Optional[int]]\n ratings: List[int]\n rationales: List[str]\n rationales_for_rating: List[str]\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n if self.aspect in [\n \"honesty\",\n \"instruction-following\",\n \"overall-rating\",\n ]:\n return {\n \"properties\": {\n \"ratings\": {\n \"items\": {\"type\": \"integer\"},\n \"title\": \"Ratings\",\n \"type\": \"array\",\n },\n \"rationales\": {\n \"items\": {\"type\": \"string\"},\n \"title\": \"Rationales\",\n \"type\": \"array\",\n },\n },\n \"required\": [\"ratings\", \"rationales\"],\n \"title\": \"SchemaUltraFeedback\",\n \"type\": \"object\",\n }\n return {\n \"properties\": {\n \"types\": {\n \"items\": {\"anyOf\": [{\"type\": \"integer\"}, {\"type\": \"null\"}]},\n \"title\": \"Types\",\n \"type\": \"array\",\n },\n \"ratings\": {\n \"items\": {\"type\": \"integer\"},\n \"title\": \"Ratings\",\n \"type\": \"array\",\n },\n \"rationales\": {\n \"items\": {\"type\": \"string\"},\n \"title\": \"Rationales\",\n \"type\": \"array\",\n },\n \"rationales_for_rating\": {\n \"items\": {\"type\": \"string\"},\n \"title\": \"Rationales For Rating\",\n \"type\": \"array\",\n },\n },\n \"required\": [\"types\", \"ratings\", \"rationales\", \"rationales_for_rating\"],\n \"title\": \"SchemaUltraFeedbackWithType\",\n \"type\": \"object\",\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.UltraFeedback._format_structured_output","title":"_format_structured_output(output, input)
","text":"Parses the structured response, which should correspond to a dictionary with either positive
, or positive
and negative
keys.
Parameters:
Name Type Description Defaultoutput
str
The output from the LLM
.
Returns:
Type DescriptionDict[str, Any]
Formatted output.
Source code in src/distilabel/steps/tasks/ultrafeedback.py
def _format_structured_output(\n self, output: str, input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with either `positive`, or `positive` and `negative` keys.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n if self.aspect in [\n \"honesty\",\n \"instruction-following\",\n \"overall-rating\",\n ]:\n return {\n \"ratings\": [None] * len(input[\"generations\"]),\n \"rationales\": [None] * len(input[\"generations\"]),\n }\n return {\n \"ratings\": [None] * len(input[\"generations\"]),\n \"rationales\": [None] * len(input[\"generations\"]),\n \"types\": [None] * len(input[\"generations\"]),\n \"rationales-for-ratings\": [None] * len(input[\"generations\"]),\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.URIAL","title":"URIAL
","text":" Bases: Task
Generates a response using a non-instruct fine-tuned model.
URIAL
is a pre-defined task that generates a response using a non-instruct fine-tuned model. This task is used to generate a response based on the conversation provided as input.
str
, optional): The instruction to generate a response from.List[Dict[str, str]]
, optional): The conversation to generate a response from (the last message must be from the user).str
): The generated response.str
): The name of the model used to generate the response.Examples:
Generate text from an instruction:
from distilabel.models import vLLM\nfrom distilabel.steps.tasks import URIAL\n\nstep = URIAL(\n llm=vLLM(\n model=\"meta-llama/Meta-Llama-3.1-8B\",\n generation_kwargs={\"temperature\": 0.7},\n ),\n)\n\nstep.load()\n\nresults = next(\n step.process(inputs=[{\"instruction\": \"What's the most common type of cloud?\"}])\n)\n# [\n# {\n# 'instruction': \"What's the most common type of cloud?\",\n# 'generation': 'Clouds are classified into three main types, high, middle, and low. The most common type of cloud is the middle cloud.',\n# 'distilabel_metadata': {...},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-8B'\n# }\n# ]\n
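A response can also be generated from a multi-turn `conversation` instead of a single `instruction`. The sketch below is not part of the upstream docstring; it reuses the `step` defined above and assumes the last message comes from the user, as required by the task:

```python
results = next(
    step.process(
        inputs=[
            {
                "conversation": [
                    {"role": "user", "content": "What's the most common type of cloud?"},
                    {"role": "assistant", "content": "Cumulus clouds are among the most common."},
                    {"role": "user", "content": "And at high altitudes?"},
                ]
            }
        ]
    )
)
```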
Source code in src/distilabel/steps/tasks/urial.py
class URIAL(Task):\n \"\"\"Generates a response using a non-instruct fine-tuned model.\n\n `URIAL` is a pre-defined task that generates a response using a non-instruct fine-tuned\n model. This task is used to generate a response based on the conversation provided as\n input.\n\n Input columns:\n - instruction (`str`, optional): The instruction to generate a response from.\n - conversation (`List[Dict[str, str]]`, optional): The conversation to generate\n a response from (the last message must be from the user).\n\n Output columns:\n - generation (`str`): The generated response.\n - model_name (`str`): The name of the model used to generate the response.\n\n Categories:\n - text-generation\n\n References:\n - [The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning](https://arxiv.org/abs/2312.01552)\n\n Examples:\n Generate text from an instruction:\n\n ```python\n from distilabel.models import vLLM\n from distilabel.steps.tasks import URIAL\n\n step = URIAL(\n llm=vLLM(\n model=\"meta-llama/Meta-Llama-3.1-8B\",\n generation_kwargs={\"temperature\": 0.7},\n ),\n )\n\n step.load()\n\n results = next(\n step.process(inputs=[{\"instruction\": \"What's the most most common type of cloud?\"}])\n )\n # [\n # {\n # 'instruction': \"What's the most most common type of cloud?\",\n # 'generation': 'Clouds are classified into three main types, high, middle, and low. The most common type of cloud is the middle cloud.',\n # 'distilabel_metadata': {...},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-8B'\n # }\n # ]\n ```\n \"\"\"\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template for the given `aspect`.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"urial.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> \"StepColumns\":\n return {\"instruction\": False, \"conversation\": False}\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n messages = (\n [{\"role\": \"user\", \"content\": input[\"instruction\"]}]\n if \"instruction\" in input\n else input[\"conversation\"]\n )\n\n if messages[-1][\"role\"] != \"user\":\n raise ValueError(\"The last message must be from the user.\")\n\n return [{\"role\": \"user\", \"content\": self._template.render(messages=messages)}]\n\n @property\n def outputs(self) -> \"StepColumns\":\n return [\"generation\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n ) -> Dict[str, Any]:\n if output is None:\n return {\"generation\": None}\n\n response = output.split(\"\\n\\n# User\")[0]\n if response.startswith(\"\\n\\n\"):\n response = response[2:]\n response = response.strip()\n\n return {\"generation\": response}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.URIAL.load","title":"load()
","text":"Loads the Jinja2 template for the given aspect
.
src/distilabel/steps/tasks/urial.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template for the given `aspect`.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"urial.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.task","title":"task(inputs=None, outputs=None)
","text":"Creates a Task
from a formatting output function.
Parameters:
Name Type Description Defaultinputs
Union[StepColumns, None]
a list containing the name of the inputs columns/keys or a dictionary where the keys are the columns and the values are booleans indicating whether the column is required or not, that are required by the step. If not provided the default will be an empty list []
and it will be assumed that the step doesn't need any specific columns. Defaults to None
.
None
outputs
Union[StepColumns, None]
a list containing the name of the outputs columns/keys or a dictionary where the keys are the columns and the values are booleans indicating whether the column will be generated or not. If not provided the default will be an empty list []
and it will be assumed that the step doesn't need any specific columns. Defaults to None
.
None
Source code in src/distilabel/steps/tasks/decorator.py
def task(\n inputs: Union[\"StepColumns\", None] = None,\n outputs: Union[\"StepColumns\", None] = None,\n) -> Callable[..., Type[\"Task\"]]:\n \"\"\"Creates a `Task` from a formatting output function.\n\n Args:\n inputs: a list containing the name of the inputs columns/keys or a dictionary\n where the keys are the columns and the values are booleans indicating whether\n the column is required or not, that are required by the step. If not provided\n the default will be an empty list `[]` and it will be assumed that the step\n doesn't need any specific columns. Defaults to `None`.\n outputs: a list containing the name of the outputs columns/keys or a dictionary\n where the keys are the columns and the values are booleans indicating whether\n the column will be generated or not. If not provided the default will be an\n empty list `[]` and it will be assumed that the step doesn't need any specific\n columns. Defaults to `None`.\n \"\"\"\n\n inputs = inputs or []\n outputs = outputs or []\n\n def decorator(func: TaskFormattingOutputFunc) -> Type[\"Task\"]:\n doc = inspect.getdoc(func)\n if doc is None:\n raise DistilabelUserError(\n \"When using the `task` decorator, including a docstring in the formatting\"\n \" function is mandatory. The docstring must follow the format described\"\n \" in the documentation.\",\n page=\"\",\n )\n\n system_prompt, user_message_template = _parse_docstring(doc)\n _validate_templates(inputs, system_prompt, user_message_template)\n\n def inputs_property(self) -> \"StepColumns\":\n return inputs\n\n def outputs_property(self) -> \"StepColumns\":\n return outputs\n\n def format_input(self, input: Dict[str, Any]) -> \"FormattedInput\":\n return [\n {\"role\": \"system\", \"content\": system_prompt.format(**input)},\n {\"role\": \"user\", \"content\": user_message_template.format(**input)},\n ]\n\n def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n ) -> Dict[str, Any]:\n return func(output, input)\n\n return type(\n func.__name__,\n (Task,),\n {\n \"inputs\": property(inputs_property),\n \"outputs\": property(outputs_property),\n \"__module__\": func.__module__,\n \"format_input\": format_input,\n \"format_output\": format_output,\n },\n )\n\n return decorator\n
"},{"location":"api/task/typing/","title":"Task Typing","text":""},{"location":"api/task/typing/#distilabel.steps.tasks.typing","title":"typing
","text":""},{"location":"api/task/typing/#distilabel.steps.tasks.typing.ChatType","title":"ChatType = List[ChatItem]
module-attribute
","text":"ChatType is a type alias for a list
of dict
s following the OpenAI conversational format.
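As a quick illustration (not part of the reference itself), a `ChatType` value is just a list of role/content dictionaries in the OpenAI chat format:

```python
from distilabel.steps.tasks.typing import ChatType

# A minimal conversation in the OpenAI chat format.
conversation: ChatType = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How much is 2+2?"},
]
```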
StructuredOutputType = Union[OutlinesStructuredOutputType, InstructorStructuredOutputType]
module-attribute
","text":"StructuredOutputType is an alias for the union of OutlinesStructuredOutputType
and InstructorStructuredOutputType
.
StandardInput = ChatType
module-attribute
","text":"StandardInput is an alias for ChatType that defines the default / standard input produced by format_input
.
StructuredInput = Tuple[StandardInput, Union[StructuredOutputType, None]]
module-attribute
","text":"StructuredInput defines a type produced by format_input
when using either StructuredGeneration
or a subclass of it.
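For example (an illustrative sketch, not from the reference), a `StructuredInput` pairs a chat with an optional structured output configuration such as the `OutlinesStructuredOutputType` documented below; the `Answer` model is a hypothetical schema:

```python
from pydantic import BaseModel

from distilabel.steps.tasks.typing import StructuredInput


class Answer(BaseModel):
    answer: str


# A chat plus a structured output configuration (the second element may be None).
structured_input: StructuredInput = (
    [{"role": "user", "content": "How much is 2+2?"}],
    {"format": "json", "schema": Answer},
)
```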
FormattedInput = Union[StandardInput, StructuredInput, ChatType]
module-attribute
","text":"FormattedInput is an alias for the union of StandardInput
and StructuredInput
as generated by format_input
and expected by the LLM
s, as well as ConversationType
for the vision language models.
ImageUrl
","text":" Bases: TypedDict
src/distilabel/steps/tasks/typing.py
class ImageUrl(TypedDict):\n url: Required[str]\n \"\"\"Either a URL of the image or the base64 encoded image data.\"\"\"\n
"},{"location":"api/task/typing/#distilabel.steps.tasks.typing.ImageUrl.url","title":"url
instance-attribute
","text":"Either a URL of the image or the base64 encoded image data.
"},{"location":"api/task/typing/#distilabel.steps.tasks.typing.ImageContent","title":"ImageContent
","text":" Bases: TypedDict
Type alias for the user's message in a conversation that can include text or an image. It's the standard type for vision language models: https://platform.openai.com/docs/guides/vision
Source code in src/distilabel/steps/tasks/typing.py
class ImageContent(TypedDict, total=False):\n \"\"\"Type alias for the user's message in a conversation that can include text or an image.\n It's the standard type for vision language models:\n https://platform.openai.com/docs/guides/vision\n \"\"\"\n\n type: Required[Literal[\"image_url\"]]\n image_url: Required[ImageUrl]\n
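To make the expected shape concrete, here is a small illustrative snippet (only the fields documented above are used) that builds a vision user message from these TypedDicts:

```python
from distilabel.steps.tasks.typing import ImageContent, ImageUrl

# An "image_url" content part, as accepted by vision language models.
image_part: ImageContent = {
    "type": "image_url",
    "image_url": ImageUrl(url="https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"),
}

# Combined with a text part inside a single user message.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What's in this image?"},
        image_part,
    ],
}
```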
"},{"location":"api/task/typing/#distilabel.steps.tasks.typing.OutlinesStructuredOutputType","title":"OutlinesStructuredOutputType
","text":" Bases: TypedDict
TypedDict to represent the structured output configuration from outlines
.
src/distilabel/steps/tasks/typing.py
class OutlinesStructuredOutputType(TypedDict, total=False):\n \"\"\"TypedDict to represent the structured output configuration from `outlines`.\"\"\"\n\n format: Literal[\"json\", \"regex\"]\n \"\"\"One of \"json\" or \"regex\".\"\"\"\n schema: Union[str, Type[BaseModel], Dict[str, Any]]\n \"\"\"The schema to use for the structured output. If \"json\", it\n can be a pydantic.BaseModel class, or the schema as a string,\n as obtained from `model_to_schema(BaseModel)`, if \"regex\", it\n should be a regex pattern as a string.\n \"\"\"\n whitespace_pattern: Optional[Union[str, List[str]]]\n \"\"\"If \"json\" corresponds to a string or a list of\n strings with a pattern (doesn't impact string literals).\n For example, to allow only a single space or newline with\n `whitespace_pattern=r\"[\\n ]?\"`\n \"\"\"\n
"},{"location":"api/task/typing/#distilabel.steps.tasks.typing.OutlinesStructuredOutputType.format","title":"format
instance-attribute
","text":"One of \"json\" or \"regex\".
"},{"location":"api/task/typing/#distilabel.steps.tasks.typing.OutlinesStructuredOutputType.schema","title":"schema
instance-attribute
","text":"The schema to use for the structured output. If \"json\", it can be a pydantic.BaseModel class, or the schema as a string, as obtained from model_to_schema(BaseModel)
, if \"regex\", it should be a regex pattern as a string.
whitespace_pattern
instance-attribute
","text":"If \"json\" corresponds to a string or a list of strings with a pattern (doesn't impact string literals). For example, to allow only a single space or newline with whitespace_pattern=r\"[ ]?\"
InstructorStructuredOutputType
","text":" Bases: TypedDict
TypedDict to represent the structured output configuration from instructor
.
src/distilabel/steps/tasks/typing.py
class InstructorStructuredOutputType(TypedDict, total=False):\n \"\"\"TypedDict to represent the structured output configuration from `instructor`.\"\"\"\n\n format: Optional[Literal[\"json\"]]\n \"\"\"One of \"json\".\"\"\"\n schema: Union[Type[BaseModel], Dict[str, Any]]\n \"\"\"The schema to use for the structured output, a `pydantic.BaseModel` class. \"\"\"\n mode: Optional[str]\n \"\"\"Generation mode. Take a look at `instructor.Mode` for more information, if not informed it will\n be determined automatically. \"\"\"\n max_retries: int\n \"\"\"Number of times to reask the model in case of error, if not set will default to the model's default. \"\"\"\n
"},{"location":"api/task/typing/#distilabel.steps.tasks.typing.InstructorStructuredOutputType.format","title":"format
instance-attribute
","text":"One of \"json\".
"},{"location":"api/task/typing/#distilabel.steps.tasks.typing.InstructorStructuredOutputType.schema","title":"schema
instance-attribute
","text":"The schema to use for the structured output, a pydantic.BaseModel
class.
mode
instance-attribute
","text":"Generation mode. Take a look at instructor.Mode
for more information; if not provided, it will be determined automatically.
max_retries
instance-attribute
","text":"Number of times to reask the model in case of error, if not set will default to the model's default.
"},{"location":"sections/community/","title":"Community","text":"We are an open-source community-driven project not only focused on building a great product but also on building a great community, where you can get support, share your experiences, and contribute to the project! We would love to hear from you and help you get started with distilabel.
Discord
In our Discord channels (#argilla-general and #argilla-help), you can get direct support from the community.
Discord \u2197
Community Meetup
We host bi-weekly community meetups where you can listen in or present your work.
Community Meetup \u2197
Changelog
The changelog is where you can find the latest updates and changes to the distilabel project.
Changelog \u2197
Roadmap
We love to discuss our plans with the community. Feel encouraged to participate in our roadmap discussions.
Roadmap \u2197
If you build something cool with distilabel
consider adding one of these badges to your dataset or model card.
[<img src=\"https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png\" alt=\"Built with Distilabel\" width=\"200\" height=\"32\"/>](https://github.com/argilla-io/distilabel)\n
[<img src=\"https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png\" alt=\"Built with Distilabel\" width=\"200\" height=\"32\"/>](https://github.com/argilla-io/distilabel)\n
"},{"location":"sections/community/#contribute","title":"Contribute","text":"To directly contribute with distilabel
, check our good first issues or open a new one.
Thank you for investing your time in contributing to the project! Any contribution you make will be reflected in the most recent version of distilabel \ud83e\udd29.
New to contributing in general?If you're a new contributor, read the README to get an overview of the project. In addition, here are some resources to help you get started with open-source contributions:
Discord is a handy tool for more casual conversations and to answer day-to-day questions. As part of Hugging Face, we have set up some distilabel channels on the server. Click here to join the Hugging Face Discord community effortlessly.
When part of the Hugging Face Discord, you can select \"Channels & roles\" and select \"Argilla\" along with any of the other groups that are interesting to you. \"Argilla\" will cover anything about argilla and distilabel. You can join the following channels:
So now there is only one thing left to do: introduce yourself and talk to the community. You'll always be welcome! \ud83e\udd17\ud83d\udc4b
"},{"location":"sections/community/contributor/#contributor-workflow-with-git-and-github","title":"Contributor Workflow with Git and GitHub","text":"If you're working with distilabel and suddenly a new idea comes to your mind or you find an issue that can be improved, it's time to actively participate and contribute to the project!
"},{"location":"sections/community/contributor/#report-an-issue","title":"Report an issue","text":"If you spot a problem, search if an issue already exists, you can use the Label
filter. If that is the case, participate in the conversation. If it does not exist, create an issue by clicking on New Issue
. This will show various templates; choose the one that best suits your issue. Once you choose one, you will need to fill it in following the guidelines. Try to be as clear as possible. In addition, you can assign yourself to the issue and add or choose the right labels. Finally, click on Submit new issue
.
After having reported the issue, you can start working on it. For that, you will need to create a fork of the project. To do that, click on the Fork
button. Now, fill in the information. Remember to uncheck the Copy develop branch only
if you are going to work in or from another branch (for instance, to fix documentation, the main
branch is used). Then, click on Create fork
.
You will be redirected to your fork. You can see that you are in your fork because the name of the repository will be your username/distilabel
, and it will indicate forked from argilla-io/distilabel
.
In order to make the required adjustments, clone the forked repository to your local machine. Choose the destination folder and run the following command:
git clone https://github.com/[your-github-username]/distilabel.git\ncd distilabel\n
To keep your fork\u2019s main/develop branch up to date with our repo, add our repo as an upstream remote.
git remote add upstream https://github.com/argilla-io/distilabel.git\n
"},{"location":"sections/community/contributor/#create-a-new-branch","title":"Create a new branch","text":"For each issue you're addressing, it's advisable to create a new branch. GitHub offers a straightforward method to streamline this process.
\u26a0\ufe0f Never work directly on the main
or develop
branch. Always create a new branch for your changes.
Navigate to your issue, and on the right column, select Create a branch
.
After the new window pops up, the branch will be named after the issue and include a prefix such as feature/, bug/, or docs/ to facilitate quick recognition of the issue type. In the Repository destination
, pick your fork ( [your-github-username]/distilabel), and then select Change branch source
to specify the source branch for creating the new one. Complete the process by clicking Create branch
.
\ud83e\udd14 Remember that the main
branch is only used to work with the documentation. For any other changes, use the develop
branch.
Now, locally, change to the new branch you just created.
git fetch origin\ngit checkout [branch-name]\n
"},{"location":"sections/community/contributor/#make-changes-and-push-them","title":"Make changes and push them","text":"Make the changes you want in your local repository, and test that everything works and you are following the guidelines.
Once you have finished, you can check the status of your repository and synchronize it with the upstream repo using the following commands:
# Check the status of your repository\ngit status\n\n# Synchronize with the upstream repo\ngit checkout [branch-name]\ngit rebase [default-branch]\n
If everything is right, we need to commit and push the changes to your fork. For that, run the following commands:
# Add the changes to the staging area\ngit add filename\n\n# Commit the changes by writing a proper message\ngit commit -m \"commit-message\"\n\n# Push the changes to your fork\ngit push origin [branch-name]\n
When pushing, you will be asked to enter your GitHub login credentials. Once the push is complete, all local commits will be on your GitHub repository.
"},{"location":"sections/community/contributor/#create-a-pull-request","title":"Create a pull request","text":"Come back to GitHub, navigate to the original repository where you created your fork, and click on Compare & pull request
.
First, click on compare across forks
and select the right repositories and branches.
In the base repository, keep in mind that you should select either main
or develop
based on the modifications made. In the head repository, indicate your forked repository and the branch corresponding to the issue.
Then, fill in the pull request template. You should add a prefix to the PR name, as we did with the branch above. If you are working on a new feature, you can name your PR as feat: TITLE
. If your PR consists of a solution for a bug, you can name your PR as bug: TITLE
. And, if your work is for improving the documentation, you can name your PR as docs: TITLE
.
In addition, on the right side, you can select a reviewer (for instance, if you discussed the issue with a member of the team) and assign the pull request to yourself. It is highly advisable to add labels to the PR as well. You can do this via the Labels section on the right of the screen. For instance, if you are addressing a bug, add the bug
label, or if the PR is related to the documentation, add the documentation
label. This way, PRs can be easily filtered.
Finally, fill in the template carefully and follow the guidelines. Remember to link the original issue and enable the checkbox to allow maintainer edits so the branch can be updated for a merge. Then, click on Create pull request
.
For the PR body, ensure you give a description of what the PR contains, and add examples if possible (and if they apply to the contribution) to help with the review process. You can take a look at #PR 974 or #PR 983 for examples of typical PRs.
"},{"location":"sections/community/contributor/#review-your-pull-request","title":"Review your pull request","text":"Once you submit your PR, a team member will review your proposal. We may ask questions, request additional information, or ask for changes to be made before a PR can be merged, either using suggested changes or pull request comments.
You can apply the changes directly through the UI (check the files changed and click on the right-corner three dots; see image below) or from your fork, and then commit them to your branch. The PR will be updated automatically, and the suggestions will appear as outdated
.
If you run into any merge issues, check out this git tutorial to help you resolve merge conflicts and other issues.
"},{"location":"sections/community/contributor/#your-pr-is-merged","title":"Your PR is merged!","text":"Congratulations \ud83c\udf89\ud83c\udf8a We thank you \ud83e\udd29
Once your PR is merged, your contributions will be publicly visible on the distilabel GitHub.
Additionally, we will include your changes in the next release based on our development branch.
"},{"location":"sections/community/contributor/#additional-resources","title":"Additional resources","text":"Here are some helpful resources for your reference.
Thank you for investing your time in contributing to the project!
If you don't have the repository locally, and need any help, go to the contributor guide and read the contributor workflow with Git and GitHub first.
"},{"location":"sections/community/developer_documentation/#set-up-the-python-environment","title":"Set up the Python environment","text":"To work on the distilabel
, you must install the package on your system.
Tip
This guide will use uv
, but pip
and venv
can be used as well; the guide works much the same with either option.
From the root of the cloned Distilabel repository, you should move to the distilabel folder in your terminal.
cd distilabel\n
"},{"location":"sections/community/developer_documentation/#create-a-virtual-environment","title":"Create a virtual environment","text":"The first step will be creating a virtual environment to keep our dependencies isolated. Here we are choosing python 3.11
(uv venv documentation), and then activate it:
uv venv .venv --python 3.11\nsource .venv/bin/activate\n
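If you prefer not to use uv, a roughly equivalent setup with the standard library venv would be:
python -m venv .venv\nsource .venv/bin/activate\n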
"},{"location":"sections/community/developer_documentation/#install-the-project","title":"Install the project","text":"Installing from local (we are using uv pip
):
uv pip install -e .\n
The package defines named extra dependencies; depending on the part you are working on, you may want to install some of them (take a look at pyproject.toml
in the repo to see all the extra dependencies):
uv pip install -e \".[vllm,outlines]\"\n
"},{"location":"sections/community/developer_documentation/#linting-and-formatting","title":"Linting and formatting","text":"To maintain a consistent code format, install the pre-commit hooks to run before each commit automatically (we rely heavily on ruff
):
uv pip install -e \".[dev]\"\npre-commit install\n
"},{"location":"sections/community/developer_documentation/#running-tests","title":"Running tests","text":"All the changes you add to the codebase should come with tests, either unit
or integration
tests, depending on the type of change, which are placed under tests/unit
and tests/integration
respectively.
Start by installing the tests dependencies:
uv pip install \".[tests]\"\n
Running the whole test suite may take some time and requires all the dependencies to be installed, so just run the tests related to your changes; the full test suite will be run for you in the CI:
# Run specific tests\npytest tests/unit/steps/generators/test_data.py\n
"},{"location":"sections/community/developer_documentation/#set-up-the-documentation","title":"Set up the documentation","text":"To contribute to the documentation and generate it locally, ensure you have installed the development dependencies:
uv pip install -e \".[docs]\"\n
And run the following command to create the development server with mkdocs
:
mkdocs serve\n
"},{"location":"sections/community/developer_documentation/#documentation-guidelines","title":"Documentation guidelines","text":"As mentioned, we use mkdocs to build the documentation. You can write the documentation in markdown
format, and it will automatically be converted to HTML. In addition, you can include elements such as tables, tabs, images, and others, as shown in this guide. We recommend following these guidelines:
Use clear and concise language: Ensure the documentation is easy to understand for all users by using straightforward language and including meaningful examples. Images are not easy to maintain, so use them only when necessary and place them in the appropriate folder within the docs/assets/images directory.
Verify code snippets: Double-check that all code snippets are correct and runnable.
Review spelling and grammar: Check the spelling and grammar of the documentation.
Update the table of contents: If you add a new page, include it in the relevant index.md or the mkdocs.yml file.
The components gallery section of the documentation is automatically generated thanks to a custom plugin that runs when mkdocs serve
is called. This gallery of steps helps us visualize each step, as well as see examples of their use.
Note
Changes done to the docstrings of Steps/Tasks/LLMs
won't appear in the components gallery automatically; you will have to stop the mkdocs
server and run it again to see the changes. Everything else is reloaded automatically.
mlx-lm
integration \ud83d\udc4d 2 \ud83d\udcac 1 3 737 - [FEATURE] Allow FormatTextGenerationSFT
to include tools/function calls in the formatted messages. \ud83d\udc4d 2 \ud83d\udcac 0 4 1001 - [FEATURE] sglang
integration \ud83d\udc4d 1 \ud83d\udcac 1 5 797 - [FEATURE] synthetic data generation for predictive NLP tasks \ud83d\udc4d 1 \ud83d\udcac 1 6 914 - [FEATURE] Use Step.resources
to set tensor_parallel_size
and pipeline_parallel_size
in vLLM
\ud83d\udc4d 1 \ud83d\udcac 0 7 588 - [FEATURE] Single request caching \ud83d\udc4d 1 \ud83d\udcac 0 8 953 - [EXAMPLE] Add CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation
example \ud83d\udc4d 0 \ud83d\udcac 6 9 972 - [BUG] Input data size != output data size when task batch size < batch size of predecessor \ud83d\udc4d 0 \ud83d\udcac 4 10 859 - [FEATURE] Update PushToHub
to stream data to the Hub \ud83d\udc4d 0 \ud83d\udcac 4 Rank Issue Author 1 \ud83d\udfe2 1091 - [BUG] distiset.push_to_hub() seems to have cache pathing issue by MoritzLaurer 2 \ud83d\udfe2 1070 - [BUG] Pipeline serialization/caching issue when including RoutingBatchFunction
by liamcripwell 3 \ud83d\udfe2 1068 - [BUG] GenerateSentencePair(...) always returns None positive and negative pairs by caesar-one 4 \ud83d\udfe2 1064 - [DOCS] Update basic guides of steps and tasks by plaguss 5 \ud83d\udfe2 1058 - [FEATURE] Implement a rate limiter for API calls by plaguss 6 \ud83d\udfe3 1056 - [DOCS] The example on how to use a Step no longer works by wwymak 7 \ud83d\udfe3 1049 - [BUG] vLLM Task not utilizing multiple GPUs in parallel when replicas > 1 by adamlin120 8 \ud83d\udfe3 1048 - [BUG] OepnAI JSON format by tinyrolls 9 \ud83d\udfe2 1047 - Failed to load all the steps. Could not run pipeline. by yuqie 10 \ud83d\udfe2 1046 - [FEATURE] Compute the input/output tokens of a dataset by plaguss Rank Issue Milestone 1 \ud83d\udfe2 579 - [FEATURE] Sequential execution for local pipeline 1.4.0 2 \ud83d\udfe2 771 - [FEATURE] Allow passing path to YAML file containing pipeline runtime parameters in distilabel run
1.4.0 3 \ud83d\udfe2 773 - [DOCS] Include section/guide describing pipeline patterns 1.4.0 4 \ud83d\udfe2 802 - [FEATURE] Add defaults to Steps and Tasks so they can be more easily connected 1.4.0 5 \ud83d\udfe2 880 - [FEATURE] Add exclude_from_signature
attribute 1.4.0 6 \ud83d\udfe2 889 - [FEATURE] Replace extra_sampling_params
for normal arguments in vLLM
1.4.0 7 \ud83d\udfe2 942 - [BUG] make_generator_step
can fail when setting the _dataset_info
internally 1.4.0 8 \ud83d\udfe2 662 - [FEATURE] Allow passing self
to steps created with step
decorator 1.4.0 9 \ud83d\udfe2 738 - [FEATURE] Update LLM.generate
interface to allow returning arbitrary/extra stuff related to the generation 1.5.0 10 \ud83d\udfe2 749 - [IMPLEMENTATION] Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models 1.5.0 Last update: 2025-01-09
"},{"location":"sections/getting_started/faq/","title":"Frequent Asked Questions (FAQ)","text":"How can I rename the columns in a batch?Every Step
has both input_mappings
and output_mappings
attributes that can be used to rename the columns in each batch.
But input_mappings
will only map the columns for that step, meaning that if you have a batch with the column A
and you want to rename it to B
, you should use input_mappings={\"A\": \"B\"}
; however, that will only be applied to that specific Step
, so the next step in the pipeline will still have the column A
instead of B
.
While output_mappings
will indeed apply the rename, meaning that if the Step
produces the column A
and you want to rename it to B
, you should use output_mappings={\"A\": \"B\"}
, and that will be applied to the next Step
in the pipeline.
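As a minimal sketch of the output_mappings behaviour described above (the model id and dataset column follow the quickstart example):
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(name=\"rename-columns\") as pipeline:\n    # The loader produces a \"prompt\" column; output_mappings renames it to \"instruction\"\n    # so that the downstream step receives the column under the new name.\n    load_dataset = LoadDataFromHub(output_mappings={\"prompt\": \"instruction\"})\n\n    text_generation = TextGeneration(\n        llm=InferenceEndpointsLLM(\n            model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n        ),\n    )\n\n    load_dataset >> text_generation\n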
No, those will be masked out using pydantic.SecretStr
, meaning that those won't be exposed when sharing the pipeline.
This also means that if you want to re-run your own pipeline and the API keys have not been provided via environment variable but either via an attribute or runtime parameter, you will need to provide them again.
Does it work for Windows?Yes, but you may need to set the multiprocessing
context in advance to ensure that the spawn
method is used since the default method fork
is not available on Windows.
import multiprocessing as mp\n\nmp.set_start_method(\"spawn\")\n
Will the custom Steps / Tasks / LLMs be serialized too? No, at the moment, only the references to the classes within the distilabel
library will be serialized, meaning that if you define a custom class used within the pipeline, the serialization won't break, but the deserialize will fail since the class won't be available unless used from the same file.
Pipeline.run
fails? Do I lose all the data? No: we use a cache mechanism to store all the intermediate results on disk, so if a Step
fails, the pipeline can be re-run from that point without losing the data, as long as nothing is changed in the Pipeline
.
All the data will be stored in .cache/distilabel
, but the only data that will persist at the end of the Pipeline.run
execution is the one from the leaf step/s, so bear that in mind.
For more information on the caching mechanism in distilabel
, you can check the Learn - Advanced - Caching section.
Also, note that when running a Step
or a Task
standalone, the cache mechanism won't be used, so if you want to use that, you should use the Pipeline
context manager.
LLM
across several tasks without having to load it several times? You can serve the LLM using a solution like TGI or vLLM, and then connect to it using an AsyncLLM
client like InferenceEndpointsLLM
or OpenAILLM
. Please refer to Serving LLMs guide for more information.
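As a minimal sketch, assuming a vLLM or TGI server is already running locally and exposing an OpenAI-compatible API on port 8000 (the URL and api_key value are illustrative assumptions):
from distilabel.models import OpenAILLM\n\n# Point the OpenAI-compatible client at the served model instead of the OpenAI API,\n# so several tasks can share the same server without loading the model multiple times.\nllm = OpenAILLM(\n    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    base_url=\"http://localhost:8000/v1\",\n    api_key=\"EMPTY\",\n)\n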
distilabel
be used with OpenAI Batch API? Yes, distilabel
is integrated with OpenAI Batch API via OpenAILLM. Check LLMs - Offline Batch Generation for a small example on how to use it and Advanced - Offline Batch Generation for a more detailed guide.
When running a task using the InferenceEndpointsLLM with Free Serverless Endpoints, you may run into errors such as Model is overloaded
if you leave the batch size at the default (set to 50). To fix the issue, lower the value or, even better, set input_batch_size=1
in your task. It may take longer to finish, but please remember this is a free service.
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps import TextGeneration\n\nTextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=1\n)\n
"},{"location":"sections/getting_started/installation/","title":"Installation","text":"You will need to have at least Python 3.9 or higher, up to Python 3.12, since support for the latter is still a work in progress.
To install the latest release of the package from PyPI you can use the following command:
pip install distilabel --upgrade\n
Alternatively, if you want to install it from source, i.e. the latest unreleased version, you can use the following command:
pip install \"distilabel @ git+https://github.com/argilla-io/distilabel.git@develop\" --upgrade\n
Note
We are installing from develop
since that's the branch we use to collect all the features, bug fixes, and improvements that will be part of the next release. If you want to install from a specific branch, you can replace develop
with the branch name.
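For example, to install from the main branch instead:
pip install \"distilabel @ git+https://github.com/argilla-io/distilabel.git@main\" --upgrade\n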
Additionally, as part of distilabel
some extra dependencies are available, mainly to add support for some of the LLM integrations we support. Here's a list of the available extras:
anthropic
: for using models available in Anthropic API via the AnthropicLLM
integration.
argilla
: for exporting the generated datasets to Argilla.
cohere
: for using models available in Cohere via the CohereLLM
integration.
groq
: for using models available in Groq using groq
Python client via the GroqLLM
integration.
hf-inference-endpoints
: for using the Hugging Face Inference Endpoints via the InferenceEndpointsLLM
integration.
hf-transformers
: for using models available in transformers package via the TransformersLLM
integration.
litellm
: for using LiteLLM
to call any LLM using OpenAI format via the LiteLLM
integration.
llama-cpp
: for using llama-cpp-python Python bindings for llama.cpp
via the LlamaCppLLM
integration.
mistralai
: for using models available in Mistral AI API via the MistralAILLM
integration.
ollama
: for using Ollama and their available models via OllamaLLM
integration.
openai
: for using OpenAI API models via the OpenAILLM
integration, or the rest of the integrations based on OpenAI and relying on its client as AnyscaleLLM
, AzureOpenAILLM
, and TogetherLLM
.
vertexai
: for using Google Vertex AI proprietary models via the VertexAILLM
integration.
vllm
: for using vllm serving engine via the vLLM
integration.
sentence-transformers
: for generating sentence embeddings using sentence-transformers.
ray
: for scaling and distributing a pipeline with Ray.
faiss-cpu
and faiss-gpu
: for generating sentence embeddings using faiss.
minhash
: for using minhash for duplicate detection with datasketch and nltk.
text-clustering
: for using text clustering with UMAP and Scikit-learn.
outlines
: for using structured generation of LLMs with outlines.
instructor
: for using structured generation of LLMs with Instructor.
The mistralai
dependency requires Python 3.9 or higher, so if you're willing to use the distilabel.models.llms.MistralLLM
implementation, you will need to have Python 3.9 or higher.
In some cases like transformers
and vllm
, the installation of flash-attn
is recommended if you are using a GPU accelerator since it will speed up the inference process, but the installation needs to be done separately, as it's not included in the distilabel
dependencies.
pip install flash-attn --no-build-isolation\n
Also, if you are willing to use the llama-cpp-python
integration for running local LLMs, note that the installation process may get a bit trickier depending on which OS are you using, so we recommend you to read through their Installation section in their docs.
Distilabel provides all the tools you need to your scalable and reliable pipelines for synthetic data generation and AI-feedback. Pipelines are used to generate data, evaluate models, manipulate data, or any other general task. They are made up of different components: Steps, Tasks and LLMs, which are chained together in a directed acyclic graph (DAG).
Pipelines are designed to be scalable and reliable. They can be executed in a distributed manner, and they can be cached and recovered. This is useful when dealing with large datasets or when you want to ensure that your pipeline is reproducible.
Besides that, pipelines are designed to be modular and flexible. You can easily add new steps, tasks, or LLMs to your pipeline, and you can also easily modify or remove them. An example architecture of a pipeline to generate a dataset of preferences is the following:
"},{"location":"sections/getting_started/quickstart/#installation","title":"Installation","text":"To install the latest release with hf-inference-endpoints
extra of the package from PyPI you can use the following command:
pip install distilabel[hf-inference-endpoints] --upgrade\n
"},{"location":"sections/getting_started/quickstart/#use-a-generic-pipeline","title":"Use a generic pipeline","text":"To use a generic pipeline for an ML task, you can use the InstructionResponsePipeline
class. This class is a generic pipeline that can be used to generate data for supervised fine-tuning tasks. It uses the InferenceEndpointsLLM
class to generate data based on the input data and the model.
from distilabel.pipeline import InstructionResponsePipeline\n\npipeline = InstructionResponsePipeline()\ndataset = pipeline.run()\n
The InstructionResponsePipeline
class will use the InferenceEndpointsLLM
class with the model meta-llama/Meta-Llama-3.1-8B-Instruct
to generate data based on the system prompt. The output data will be a dataset with the columns instruction
and response
. The class uses a generic system prompt, but you can customize it by passing the system_prompt
parameter to the class.
Note
We're actively working on building more pipelines for different tasks. If you have any suggestions or requests, please let us know! We're currently working on pipelines for classification, Direct Preference Optimization, and Information Retrieval tasks.
"},{"location":"sections/getting_started/quickstart/#define-a-custom-pipeline","title":"Define a Custom pipeline","text":"In this guide we will walk you through the process of creating a simple pipeline that uses the InferenceEndpointsLLM
class to generate text. The Pipeline
will load a dataset that contains a column named prompt
from the Hugging Face Hub via the step LoadDataFromHub
and then use the InferenceEndpointsLLM
class to generate text based on the dataset using the TextGeneration
task.
You can check the available models in the Hugging Face Model Hub and filter by Inference status
.
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline( # (1)\n name=\"simple-text-generation-pipeline\",\n description=\"A simple text generation pipeline\",\n) as pipeline: # (2)\n load_dataset = LoadDataFromHub( # (3)\n output_mappings={\"prompt\": \"instruction\"},\n )\n\n text_generation = TextGeneration( # (4)\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n ), # (5)\n system_prompt=\"You are a creative AI Assistant writer.\",\n template=\"Follow the following instruction: {{ instruction }}\" # (6)\n )\n\n load_dataset >> text_generation # (7)\n\nif __name__ == \"__main__\":\n distiset = pipeline.run( # (8)\n parameters={\n load_dataset.name: {\n \"repo_id\": \"distilabel-internal-testing/instruction-dataset-mini\",\n \"split\": \"test\",\n },\n text_generation.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 512,\n }\n }\n },\n },\n )\n distiset.push_to_hub(repo_id=\"distilabel-example\") # (9)\n
We define a Pipeline
with the name simple-text-generation-pipeline
and a description A simple text generation pipeline
. Note that the name
is mandatory and will be used to calculate the cache
signature path, so changing the name will change the cache path and will be identified as a different pipeline.
We are using the Pipeline
context manager, meaning that every Step
subclass that is defined within the context manager will be added to the pipeline automatically.
We define a LoadDataFromHub
step named load_dataset
that will load a dataset from the Hugging Face Hub, as provided via runtime parameters in the pipeline.run
method below, but it can also be defined within the class instance via the arg repo_id=...
. This step will produce output batches with the rows from the dataset, and the column prompt
will be mapped to the instruction
field.
We define a TextGeneration
task named text_generation
that will generate text based on the instruction
field from the dataset. This task will use the InferenceEndpointsLLM
class with the model Meta-Llama-3.1-8B-Instruct
.
We define the InferenceEndpointsLLM
class with the model Meta-Llama-3.1-8B-Instruct
that will be used by the TextGeneration
task. In this case, since the InferenceEndpointsLLM
is used, we assume that the HF_TOKEN
environment variable is set.
Both system_prompt
and template
are optional fields. The template
must be informed as a string following the Jinja2 template format, and the fields that appear there (\"instruction\" in this case, which corresponds to the default) must be informed in the columns
attribute. The component gallery for TextGeneration
has examples to get you started.
We connect the load_dataset
step to the text_generation
task using the rshift
operator, meaning that the output from the load_dataset
step will be used as input for the text_generation
task.
We run the pipeline with the parameters for the load_dataset
and text_generation
steps. The load_dataset
step will use the repository distilabel-internal-testing/instruction-dataset-mini
and the test
split, and the text_generation
task will use the generation_kwargs
with the temperature
set to 0.7
and the max_new_tokens
set to 512
.
Optionally, we can push the generated Distiset
to the Hugging Face Hub repository distilabel-example
. This will allow you to share the generated dataset with others and use it in other pipelines.
Welcome to the how-to guides section! Here you will find a collection of guides that will help you get started with Distilabel. We have divided the guides into two categories: basic and advanced. The basic guides will help you get started with the core concepts of Distilabel, while the advanced guides will help you explore more advanced features.
"},{"location":"sections/how_to_guides/#basic","title":"Basic","text":"Define Steps for your Pipeline
Steps are the building blocks of your pipeline. They can be used to generate data, evaluate models, manipulate data, or any other general task.
Define Steps
Define Tasks that rely on LLMs
Tasks are a specific type of step that rely on Language Models (LLMs) to generate data.
Define Tasks
Define LLMs as local or remote models
LLMs are the core of your tasks. They are used to integrate with local models or remote APIs.
Define LLMs
Execute Steps and Tasks in a Pipeline
Pipeline is where you put all your steps and tasks together to create a workflow.
Execute Pipeline
Using the Distiset dataset object
Distiset is a dataset object based on the datasets library that can be used to store and manipulate data.
Distiset
Export data to Argilla
Argilla is a platform that can be used to store, search, and apply feedback to datasets. Argilla
Using a file system to pass data of batches between steps
File system can be used to pass data between steps in a pipeline.
File System
Using CLI to explore and re-run existing Pipelines
CLI can be used to explore and re-run existing pipelines through the command line.
CLI
Cache and recover pipeline executions
Caching can be used to recover pipeline executions and avoid losing data and precious LLM calls.
Caching
Structured data generation
Structured data generation can be used to generate data with a specific structure like JSON, function calls, etc.
Structured Generation
Serving an LLM for sharing it between several tasks
Serve an LLM via TGI or vLLM to make requests and connect using a client like InferenceEndpointsLLM
or OpenAILLM
to avoid wasting resources.
Sharing an LLM across tasks
Impose requirements to your pipelines and steps
Add requirements to steps in a pipeline to ensure they are installed and avoid errors.
Pipeline requirements
Being able to export the generated synthetic datasets to Argilla is a core feature within distilabel
. We believe in the potential of synthetic data, but without removing the impact a human annotator or group of annotators can bring. For that reason, the Argilla integration makes it straightforward to push a dataset to Argilla while the Pipeline
is running, so that you can follow along the generation process in Argilla's UI, as well as annotate the records on the fly. One can include a Step
within the Pipeline
to easily export the datasets to Argilla with a pre-defined configuration, suiting the annotation purposes.
Before using any of the steps about to be described below, you should first have an Argilla instance up and running, so that you can successfully upload the data to Argilla. In order to deploy Argilla, the easiest and most straightforward way is to deploy it via the Argilla Template in Hugging Face Spaces as simply as following the steps there, or just via the following button:
"},{"location":"sections/how_to_guides/advanced/argilla/#text-generation","title":"Text Generation","text":"
For text generation scenarios, i.e. when the Pipeline
contains a single TextGeneration
step, we have designed the task TextGenerationToArgilla
, which will seamlessly push the generated data to Argilla, and allow the annotator to review the records.
The dataset will be pushed with the following configuration:
Fields: instruction
and generation
, both being fields of type argilla.TextField
, plus the automatically generated id
for the given instruction
to be able to search for other records with the same instruction
in the dataset. The field instruction
must always be a string, while the field generation
can either be a single string or a list of strings (useful when there are multiple parent nodes of type TextGeneration
); even though each record will always contain at most one instruction
-generation
pair.
Questions: quality
will be the only question for the annotators to answer, i.e., to annotate, and it will be an argilla.LabelQuestion
referring to the quality of the provided generation for the given instruction. It can be annotated as either \ud83d\udc4e (bad) or \ud83d\udc4d (good).
Note
The TextGenerationToArgilla
step will only work as is if the Pipeline
contains one or multiple TextGeneration
steps, or if the columns instruction
and generation
are available within the batch data. Otherwise, the variable input_mappings
will need to be set so that either both or one of instruction
and generation
are mapped to one of the existing columns in the batch data.
from distilabel.models import OpenAILLM\nfrom distilabel.steps import LoadDataFromDicts, TextGenerationToArgilla\nfrom distilabel.steps.tasks import TextGeneration\n\n\nwith Pipeline(name=\"my-pipeline\") as pipeline:\n load_dataset = LoadDataFromDicts(\n name=\"load_dataset\",\n data=[\n {\n \"instruction\": \"Write a short story about a dragon that saves a princess from a tower.\",\n },\n ],\n )\n\n text_generation = TextGeneration(\n name=\"text_generation\",\n llm=OpenAILLM(model=\"gpt-4\"),\n )\n\n to_argilla = TextGenerationToArgilla(\n dataset_name=\"my-dataset\",\n dataset_workspace=\"admin\",\n api_url=\"<ARGILLA_API_URL>\",\n api_key=\"<ARGILLA_API_KEY>\",\n )\n\n load_dataset >> text_generation >> to_argilla\n\npipeline.run()\n
"},{"location":"sections/how_to_guides/advanced/argilla/#preference","title":"Preference","text":"For preference scenarios, i.e. when the Pipeline
contains multiple TextGeneration
steps, we have designed the task PreferenceToArgilla
, which will seamlessly push the generated data to Argilla, and allow the annotator to review the records.
The dataset will be pushed with the following configuration:
Fields: instruction
and generations
, both being fields of type argilla.TextField
, plus the automatically generated id
for the given instruction
to be able to search for other records with the same instruction
in the dataset. The field instruction
must always be a string, while the field generations
must be a list of strings, containing the generated texts for the given instruction
so that at least there are two generations to compare. Other than that, the number of generation
fields within each record in Argilla will be defined by the value of the variable num_generations
to be provided in the PreferenceToArgilla
step.
Questions: rating
and rationale
will be the pairs of questions to be defined per each generation i.e. per each value within the range from 0 to num_generations
, and those will be of types argilla.RatingQuestion
and argilla.TextQuestion
, respectively. Note that only the first pair of questions will be mandatory, since only one generation is ensured to be within the batch data. Additionally, note that the provided ratings will range from 1 to 5, and to mention that Argilla only supports values above 0.
Note
The PreferenceToArgilla
step will only work if the Pipeline
contains multiple TextGeneration
steps, or if the columns instruction
and generations
are available within the batch data. Otherwise, the variable input_mappings
will need to be set so that either both or one of instruction
and generations
are mapped to one of the existing columns in the batch data.
Note
Additionally, if the Pipeline
contains an UltraFeedback
step, the ratings
and rationales
will also be available and be automatically injected as suggestions to the existing dataset.
from distilabel.models import OpenAILLM\nfrom distilabel.steps import LoadDataFromDicts, PreferenceToArgilla\nfrom distilabel.steps.tasks import TextGeneration\n\n\nwith Pipeline(name=\"my-pipeline\") as pipeline:\n load_dataset = LoadDataFromDicts(\n name=\"load_dataset\",\n data=[\n {\n \"instruction\": \"Write a short story about a dragon that saves a princess from a tower.\",\n },\n ],\n )\n\n text_generation = TextGeneration(\n name=\"text_generation\",\n llm=OpenAILLM(model=\"gpt-4\"),\n num_generations=4,\n group_generations=True,\n )\n\n to_argilla = PreferenceToArgilla(\n dataset_name=\"my-dataset\",\n dataset_workspace=\"admin\",\n api_url=\"<ARGILLA_API_URL>\",\n api_key=\"<ARGILLA_API_KEY>\",\n num_generations=4,\n )\n\n load_dataset >> text_generation >> to_argilla\n\nif __name__ == \"__main__\":\n pipeline.run()\n
"},{"location":"sections/how_to_guides/advanced/assigning_resources_to_step/","title":"Assigning resources to a Step
","text":"When dealing with complex pipelines that get executed in a distributed environment with abundant resources (CPUs and GPUs), sometimes it's necessary to allocate these resources judiciously among the Step
s. This is why distilabel
allows you to specify the number of replicas
, cpus
and gpus
for each Step
. Let's see that with an example:
from distilabel.pipeline import Pipeline\nfrom distilabel.models import vLLM\nfrom distilabel.steps import StepResources\nfrom distilabel.steps.tasks import PrometheusEval\n\n\nwith Pipeline(name=\"resources\") as pipeline:\n ...\n\n prometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]['content'] }}\\\\n{{ messages[1]['content'] }}[/INST]\",\n ),\n resources=StepResources(replicas=2, cpus=1, gpus=1),\n mode=\"absolute\",\n rubric=\"factual-validity\",\n reference=False,\n num_generations=1,\n group_generations=False,\n )\n
In the example above, we're creating a PrometheusEval
task (remember that Task
s are Step
s) that will use vLLM
to serve prometheus-eval/prometheus-7b-v2.0
model. This task is resource intensive as it requires an LLM, which in turn requires a GPU to run fast. With that in mind, we have specified the resources
required for the task using the StepResources
class, and we have defined that we need 1
GPU and 1
CPU per replica of the task. In addition, we have defined that we need 2
replicas i.e. we will run two instances of the task so the computation for the whole dataset runs faster. In addition, StepResources
uses the RuntimeParametersMixin, so we can also specify the resources for each step when running the pipeline:
...\n\nif __name__ == \"__main__\":\n pipeline.run(\n parameters={\n prometheus.name: {\"resources\": {\"replicas\": 2, \"cpus\": 1, \"gpus\": 1}}\n }\n )\n
And that's it! When running the pipeline, distilabel
will create the tasks in nodes that have available the specified resources.
distilabel
will automatically save all the intermediate outputs generated by each Step
of a Pipeline
, so these outputs can be reused to recover the state of a pipeline execution that was stopped before finishing or to not have to re-execute steps from a pipeline after adding a new downstream step.
The use of the cache can be toggled using the use_cache
parameter of the Pipeline.use_cache
method. If True
, then distilabel
will use the reuse the outputs of previous executions for the new execution. If False
, then distilabel
will re-execute all the steps of the pipeline to generate new outputs for all the steps.
with Pipeline(name=\"my-pipeline\") as pipeline:\n ...\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False) # (1)\n
In addition, the cache can be enabled/disabled at Step
level using its use_cache
attribute. If True
, then the outputs of the step will be reused in the new pipeline execution. If False
, then the step will be re-executed to generate new outputs. If the cache of one step is disabled and the outputs have to be regenerated, then the outputs of the steps that depend on this step will also be regenerated.
with Pipeline(name=\"writting-assistant\") as pipeline:\n load_data = LoadDataFromDicts(\n data=[\n {\n \"instruction\": \"How much is 2+2?\"\n }\n ]\n )\n\n generation = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"Qwen/Qwen2.5-72B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.8,\n \"max_new_tokens\": 512,\n },\n ),\n use_cache=False # (1)\n )\n\n load_data >> generation\n\nif __name__ == \"__main__\":\n distiset = pipeline.run()\n
distilabel
groups information and data generated by a Pipeline
using the name of the pipeline, so the first factor that triggers a cache hit is the name of the pipeline. The second factor, is the Pipeline.signature
property. This property returns a hash that is generated using the names of the steps used in the pipeline and their connections. The third factor, is the Pipeline.aggregated_steps_signature
property which is used to determine if the new pipeline execution is exactly the same as one of the previous i.e. the pipeline contains exactly the same steps, with exactly the same connections and the steps are using exactly the same parameters. If these three factors are met, then the cache hit is triggered and the pipeline won't get re-executed and instead the function create_distiset
will be used to create the resulting Distiset
using the outputs of the previous execution, as it can be seen in the following image:
If the new pipeline execution have a different Pipeline.aggregated_steps_signature
i.e. at least one step has changed its parameters, distilabel
will reuse the outputs of the steps that have not changed and re-execute the steps that have changed, as it can be seen in the following image:
The same pipeline from above gets executed a third time, but this time the last step text_generation_1
changed, so it's needed to re-execute it. The other steps, as they have not been changed, doesn't need to be re-executed and their outputs are reused.
A Pipeline
in distilabel
returns a special type of Hugging Face datasets.DatasetDict
which is called Distiset
.
The Distiset
is a dictionary-like object that contains the different configurations generated by the Pipeline
, where each configuration corresponds to each leaf step in the DAG built by the Pipeline
. Each configuration corresponds to a different subset of the dataset. This is a concept taken from \ud83e\udd17 datasets
that lets you upload different configurations of the same dataset within the same repository and can contain different columns i.e. different configurations, which can be seamlessly pushed to the Hugging Face Hub.
Below you can find an example of how to create a Distiset
object that resembles a datasets.DatasetDict
:
from datasets import Dataset\nfrom distilabel.distiset import Distiset\n\ndistiset = Distiset(\n {\n \"leaf_step_1\": Dataset.from_dict({\"instruction\": [1, 2, 3]}),\n \"leaf_step_2\": Dataset.from_dict(\n {\"instruction\": [1, 2, 3, 4], \"generation\": [5, 6, 7, 8]}\n ),\n }\n)\n
Note
If there's only one leaf node, i.e., only one step at the end of the Pipeline
, then the configuration name won't be the name of the last step, but it will be set to \"default\" instead, as that's more aligned with standard datasets within the Hugging Face Hub.
We can interact with the different pieces generated by the Pipeline
and treat them as different configurations
. The Distiset
contains just two methods:
Create a train/test split partition of the dataset for the different configurations or subsets.
>>> distiset.train_test_split(train_size=0.9)\nDistiset({\n leaf_step_1: DatasetDict({\n train: Dataset({\n features: ['instruction'],\n num_rows: 2\n })\n test: Dataset({\n features: ['instruction'],\n num_rows: 1\n })\n })\n leaf_step_2: DatasetDict({\n train: Dataset({\n features: ['instruction', 'generation'],\n num_rows: 3\n })\n test: Dataset({\n features: ['instruction', 'generation'],\n num_rows: 1\n })\n })\n})\n
"},{"location":"sections/how_to_guides/advanced/distiset/#push-to-hugging-face-hub","title":"Push to Hugging Face Hub","text":"Push the Distiset
to a Hugging Face repository, where each one of the subsets will correspond to a different configuration:
distiset.push_to_hub(\n \"my-org/my-dataset\",\n commit_message=\"Initial commit\",\n private=False,\n token=os.getenv(\"HF_TOKEN\"),\n generate_card=True,\n include_script=False\n)\n
New since version 1.3.0
Since version 1.3.0
you can automatically push the script that created your pipeline to the same repository. For example, assuming you have a file like the following:
with Pipeline() as pipe:\n ...\ndistiset = pipe.run()\ndistiset.push_to_hub(\n \"my-org/my-dataset,\n include_script=True\n)\n
After running the command, you could visit the repository and the file sample_pipe.py
will be stored to simplify sharing your pipeline with the community.
distilabel
contains a custom plugin to automatically generates a gallery for the different components. The information is extracted by parsing the Step
's docstrings. You can take a look at the docstrings in the source code of the UltraFeedback, and take a look at the corresponding entry in the components gallery to see an example of how the docstrings are rendered.
If you create your own components and want the Citations
automatically rendered in the README card (in case you are sharing your final distiset in the Hugging Face Hub), you may want to add the citation section. This is an example for the MagpieGenerator Task:
class MagpieGenerator(GeneratorTask, MagpieBase):\n r\"\"\"Generator task the generates instructions or conversations using Magpie.\n ...\n\n Citations:\n\n ```\n @misc{xu2024magpiealignmentdatasynthesis,\n title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},\n author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},\n year={2024},\n eprint={2406.08464},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2406.08464},\n }\n ```\n \"\"\"\n
The Citations
section can include any number of bibtex references. To define them, you can add as much elements as needed just like in the example: each citation will be a block of the form: ```@misc{...}```
. This information will be automatically used in the README of your Distiset
if you decide to call distiset.push_to_hub
. Alternatively, if the Citations
is not found, but in the References
there are found any urls pointing to https://arxiv.org/
, we will try to obtain the Bibtex
equivalent automatically. This way, Hugging Face can automatically track the paper for you and it's easier to find other datasets citing the same paper, or directly visiting the paper page.
Take into account that these methods work as datasets.load_from_disk
and datasets.Dataset.save_to_disk
so the arguments are directly passed to those methods. This means you can also make use of storage_options
argument to save your Distiset
in your cloud provider, including the distilabel artifacts (pipeline.yaml
, pipeline.log
and the README.md
with the dataset card). You can read more in datasets
documentation here.
Save the Distiset
to disk, and optionally (will be done by default) saves the dataset card, the pipeline config file and logs:
distiset.save_to_disk(\n \"my-dataset\",\n save_card=True,\n save_pipeline_config=True,\n save_pipeline_log=True\n)\n
Load a Distiset
that was saved using Distiset.save_to_disk
just the same way:
distiset = Distiset.load_from_disk(\"my-dataset\")\n
Load a Distiset
from a remote location, like S3, GCS. You can pass the storage_options
argument to authenticate with the cloud provider:
distiset = Distiset.load_from_disk(\n \"s3://path/to/my_dataset\", # gcs:// or any filesystem tolerated by fsspec\n storage_options={\n \"key\": os.environ[\"S3_ACCESS_KEY\"],\n \"secret\": os.environ[\"S3_SECRET_KEY\"],\n ...\n }\n)\n
Take a look at the remaining arguments at Distiset.save_to_disk
and Distiset.load_from_disk
.
Having this special type of dataset comes with an added advantage when calling Distiset.push_to_hub
, which is the automatically generated dataset card in the Hugging Face Hub. Note that it is enabled by default, but can be disabled by setting generate_card=False
:
distiset.push_to_hub(\"my-org/my-dataset\", generate_card=True)\n
We will have an automatic dataset card (an example can be seen here) with some handy information like reproducing the Pipeline
with the CLI
, or examples of the records from the different subsets.
Lastly, we presented in the caching section the create_distiset
function, you can take a look at the section to see how to create a Distiset
from the cache folder, using the helper function to automatically include all the relevant data.
In some situations, it can happen that the batches contains so much data that is faster to write it to disk and read it back in the next step, instead of passing it using the queue. To solve this issue, distilabel
uses fsspec
to allow providing a file system configuration and whether if this file system should be used to pass data between steps in the run
method of the distilabel
pipelines:
Warning
In order to use a specific file system/cloud storage, you will need to install the specific package providing the fsspec
implementation for that file system. For instance, to use Google Cloud Storage you will need to install gcsfs
:
pip install gcsfs\n
Check the available implementations: fsspec - Other known implementations
from distilabel.pipeline import Pipeline\n\nwith Pipeline(name=\"my-pipeline\") as pipeline:\n ...\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n ..., \n storage_parameters={\"path\": \"gcs://my-bucket\"},\n use_fs_to_pass_data=True\n )\n
The code above setups a file system (in this case Google Cloud Storage) and sets the flag use_fs_to_pass_data
to specify that the data of the batches should be passed to the steps using the file system. The storage_parameters
argument is optional, and in the case it's not provided but use_fs_to_pass_data==True
, distilabel
will use the local file system.
Note
As GlobalStep
s receives all the data from the previous steps in one single batch accumulating all the data, it's very likely that the data of the batch will be too big to be passed using the queue. In this case and even if use_fs_to_pass_data==False
, distilabel
will use the file system to pass the data to the GlobalStep
.
By default, the distilabel
architecture loads all steps of a pipeline at the same time, as they are all supposed to process batches of data in parallel. However, loading all steps at once can waste resources in two scenarios: when using GlobalStep
s that must wait for upstream steps to complete before processing data, or when running on machines with limited resources that cannot execute all steps simultaneously. In these cases, steps need to be loaded and executed in distinct load stages.
A load stage represents a point in the pipeline execution where a group of steps are loaded at the same time to process batches in parallel. These stages are required because:
GlobalStep
s that needs to receive all the data at once from their upstream steps i.e. needs their upstream steps to have finished its execution. It would be wasteful to load a GlobalStep
at the same time as other steps of the pipeline as that would take resources (from the machine or cluster running the pipeline) that wouldn't be used until upstream steps have finished.Having that said, the first element that will create a load stage when executing a pipeline are the GlobalStep
, as they mark and divide a pipeline in three stages: one stage with the upstream steps of the global step, one stage with the global step, and one final stage with the downstream steps of the global step. For example, the following pipeline will contain three stages:
from typing import TYPE_CHECKING\n\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts, StepInput, step\n\nif TYPE_CHECKING:\n from distilabel.typing import StepOutput\n\n\n@step(inputs=[\"instruction\"], outputs=[\"instruction2\"])\ndef DummyStep(inputs: StepInput) -> \"StepOutput\":\n for input in inputs:\n input[\"instruction2\"] = \"miau\"\n yield inputs\n\n\n@step(inputs=[\"instruction\"], outputs=[\"instruction2\"], step_type=\"global\")\ndef GlobalDummyStep(inputs: StepInput) -> \"StepOutput\":\n for input in inputs:\n input[\"instruction2\"] = \"miau\"\n yield inputs\n\n\nwith Pipeline() as pipeline:\n generator = LoadDataFromDicts(data=[{\"instruction\": \"Hi\"}] * 50)\n dummy_step_0 = DummyStep()\n global_dummy_step = GlobalDummyStep()\n dummy_step_1 = DummyStep()\n\n generator >> dummy_step_0 >> global_dummy_step >> dummy_step_1\n\nif __name__ == \"__main__\":\n load_stages = pipeline.get_load_stages()\n\n for i, steps_stage in enumerate(load_stages[0]):\n print(f\"Stage {i}: {steps_stage}\")\n\n # Output:\n # Stage 0: ['load_data_from_dicts_0', 'dummy_step_0']\n # Stage 1: ['global_dummy_step_0']\n # Stage 2: ['dummy_step_1']\n
As we can see, the GlobalStep
divided the pipeline execution in three stages.
While GlobalStep
s automatically divide pipeline execution into stages, we many need fine-grained control over how steps are loaded and executed within each stage. This is where load groups come in.
Load groups allows to specify which steps of the pipeline have to be loaded together within a stage. This is particularly useful when running on resource-constrained environments where all the steps cannot be executed in parallel.
Let's see how it works with an example:
from datasets import load_dataset\n\nfrom distilabel.llms import vLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import StepResources\nfrom distilabel.steps.tasks import TextGeneration\n\ndataset = load_dataset(\n \"distilabel-internal-testing/instruction-dataset-mini\", split=\"test\"\n).rename_column(\"prompt\", \"instruction\")\n\nwith Pipeline() as pipeline:\n text_generation_0 = TextGeneration(\n llm=vLLM(\n model=\"HuggingFaceTB/SmolLM2-1.7B-Instruct\",\n extra_kwargs={\"max_model_len\": 1024},\n ),\n resources=StepResources(gpus=1),\n )\n\n text_generation_1 = TextGeneration(\n llm=vLLM(\n model=\"HuggingFaceTB/SmolLM2-1.7B-Instruct\",\n extra_kwargs={\"max_model_len\": 1024},\n ),\n resources=StepResources(gpus=1),\n )\n\nif __name__ == \"__main__\":\n load_stages = pipeline.get_load_stages(load_groups=[[text_generation_1.name]])\n\n for i, steps_stage in enumerate(load_stages[0]):\n print(f\"Stage {i}: {steps_stage}\")\n\n # Output:\n # Stage 0: ['text_generation_0']\n # Stage 1: ['text_generation_1']\n\n distiset = pipeline.run(dataset=dataset, load_groups=[[text_generation_0.name]])\n
In this example, we're working with a machine that has a single GPU, but the pipeline includes two TextGeneration tasks, both using vLLM and each requesting 1 GPU, so we cannot execute both steps in parallel. To fix that, we specify in the run
method using the load_groups
argument that the text_generation_0
step has to be executed in isolation in a stage. This way, we can run the pipeline on a single GPU machine by executing the steps in different stages (sequentially) instead of in parallel.
Some key points about load groups:
Similar to GlobalSteps, each load group creates a new load stage, dividing the pipeline into three stages: one for the upstream steps, one for the steps in the load group, and one for the downstream steps. In addition, distilabel
allows passing some modes to the load_groups
argument that will handle the creation of the load groups:
\"sequential_step_execution\"
: when passed, it will create a load group for each step, i.e. the steps of the pipeline will be executed sequentially. Offline batch generation is a feature that some of the LLMs implemented in distilabel offer, allowing you to send the inputs to an LLM-as-a-service platform and wait for the outputs asynchronously. LLM-as-a-service platforms offer this feature because it allows them to gather many inputs and create batches as big as their hardware allows, maximizing hardware utilization and reducing the cost of the service. In exchange, the user has to wait some time for the outputs to be ready, but the cost per token is usually much lower.
distilabel
pipelines are able to handle LLM
s that offer this feature in the following way: The first time the pipeline is executed, the LLM will send the inputs to the platform. The platform will return job ids that can be used later to check the status of the jobs and retrieve the results, so the LLM will save these job ids in its jobs_ids attribute and raise a special DistilabelOfflineBatchGenerationNotFinishedException that will be handled by the Pipeline. The job ids will also be saved in the pipeline cache, so they can be used in subsequent calls. The next time the pipeline is executed, the LLM won't send the inputs again to the platform; since it already has the jobs_ids, it will check whether the jobs have finished, retrieving the results and returning the outputs if they have, and raising DistilabelOfflineBatchGenerationNotFinishedException again if they haven't. While the pipeline is polling the platform and waiting for the outputs, the execution can be stopped at any time (e.g. by sending a SIGINT to the main process), which will stop the polling and raise a DistilabelOfflineBatchGenerationNotFinishedException that will be handled by the pipeline as described above. Warning
In order to recover the pipeline execution and retrieve the results, the pipeline cache must be enabled. If the cache is disabled, the inputs will be sent again and different jobs will be created, incurring extra costs.
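In practice this just means leaving the cache enabled (its default value) when calling the run method; a minimal sketch, where use_cache is the standard run argument and the rest of the call mirrors the examples above:
distiset = pipeline.run(\n parameters={...},\n use_cache=True, # default value; keep the cache enabled to be able to recover the offline batch generation jobs\n)\n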
"},{"location":"sections/how_to_guides/advanced/offline_batch_generation/#example-pipeline-using-openaillm-with-offline-batch-generation","title":"Example pipeline usingOpenAILLM
with offline batch generation","text":"from distilabel.models import OpenAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline() as pipeline:\n load_data = LoadDataFromHub(output_mappings={\"prompt\": \"instruction\"})\n\n text_generation = TextGeneration(\n llm=OpenAILLM(\n model=\"gpt-3.5-turbo\",\n use_offline_batch_generation=True, # (1)\n )\n )\n\n load_data >> text_generation\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n load_data.name: {\n \"repo_id\": \"distilabel-internal-testing/instruction-dataset\",\n \"split\": \"test\",\n \"batch_size\": 500,\n },\n }\n )\n
OpenAILLM
should use offline batch generation. When sharing a Pipeline
that contains custom Step
s or Task
s, you may want to add the specific requirements that are needed to run them. distilabel
will take this list of requirements and warn the user if any are missing.
Let's see how we can add additional requirements with an example. The first thing we're going to do is to add requirements for our CustomStep
. To do so, we use the requirements decorator to specify that the step has nltk>=3.8 as a dependency (we can use version specifiers). In addition, we're going to specify at the Pipeline
level that we need distilabel>=1.3.0
to run it.
from typing import List\n\nimport nltk\n\nfrom distilabel.steps import Step\nfrom distilabel.steps.base import StepInput\nfrom distilabel.steps.typing import StepOutput\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.utils.requirements import requirements\nfrom distilabel.pipeline import Pipeline\n\n\n@requirements([\"nltk>=3.8\"])\nclass CustomStep(Step):\n @property\n def inputs(self) -> List[str]:\n return [\"instruction\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"response\"]\n\n def process(self, inputs: StepInput) -> StepOutput: # type: ignore\n for input in inputs:\n # Tokenize the instruction text using nltk\n input[\"response\"] = nltk.word_tokenize(input[\"instruction\"])\n yield inputs\n\n\nwith Pipeline(\n name=\"pipeline-with-requirements\", requirements=[\"distilabel>=1.3.0\"]\n) as pipeline:\n loader = LoadDataFromDicts(data=[{\"instruction\": \"sample sentence\"}])\n step1 = CustomStep()\n loader >> step1\n\nif __name__ == \"__main__\":\n pipeline.run()\n
Once we call pipeline.run()
, if any of the requirements specified at the Step
or Pipeline
level is missing, a ValueError
will be raised telling us that we should install the list of dependencies:
>>> pipeline.run()\n[06/27/24 11:07:33] ERROR ['distilabel.pipeline'] Please install the following requirements to run the pipeline: base.py:350\n distilabel>=1.3.0\n...\nValueError: Please install the following requirements to run the pipeline:\ndistilabel>=1.3.0\n
"},{"location":"sections/how_to_guides/advanced/saving_step_generated_artifacts/","title":"Saving step generated artifacts","text":"Some Step
s might need to produce an auxiliary artifact that is not a result of the computation, but is needed for the computation. For example, the FaissNearestNeighbour
needs to create a Faiss index to compute the output of the step, which is the top k nearest neighbours for each input. Generating the Faiss index takes time and it could potentially be reused outside of the distilabel pipeline, so it would be a shame not to save it.
For this reason, Step
s have a method called save_artifact
that allows saving artifacts that will be included along with the outputs of the pipeline in the generated Distiset. The generated artifacts will be uploaded or saved when using Distiset.push_to_hub or Distiset.save_to_disk, respectively. Let's see how to use it with a simple example.
from typing import List, TYPE_CHECKING\nfrom distilabel.steps import GlobalStep, StepInput, StepOutput\nimport matplotlib.pyplot as plt\n\nif TYPE_CHECKING:\n from distilabel.steps import StepOutput\n\n\nclass CountTextCharacters(GlobalStep):\n @property\n def inputs(self) -> List[str]:\n return [\"text\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"text_character_count\"]\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n character_counts = []\n\n for input in inputs:\n text_character_count = len(input[\"text\"])\n input[\"text_character_count\"] = text_character_count\n character_counts.append(text_character_count)\n\n # Generate plot with the distribution of text character counts\n plt.figure(figsize=(10, 6))\n plt.hist(character_counts, bins=30, edgecolor=\"black\")\n plt.title(\"Distribution of Text Character Counts\")\n plt.xlabel(\"Character Count\")\n plt.ylabel(\"Frequency\")\n\n # Save the plot as an artifact of the step\n self.save_artifact(\n name=\"text_character_count_distribution\",\n write_function=lambda path: plt.savefig(path / \"figure.png\"),\n metadata={\"type\": \"image\", \"library\": \"matplotlib\"},\n )\n\n plt.close()\n\n yield inputs\n
As can be seen in the example above, we have created a simple step that counts the number of characters in each input text and generates a histogram with the distribution of the character counts. We save the histogram as an artifact of the step using the save_artifact
method. The method takes three arguments:
name
: The name we want to give to the artifact.write_function
: A function that writes the artifact to the desired path. The function will receive a path
argument which is a pathlib.Path
object pointing to the directory where the artifact should be saved.metadata
: A dictionary with metadata about the artifact. This metadata will be saved along with the artifact.Let's execute the step with a simple pipeline and push the resulting Distiset
to the Hugging Face Hub:
from typing import TYPE_CHECKING, List\n\nimport matplotlib.pyplot as plt\nfrom datasets import load_dataset\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import GlobalStep, StepInput, StepOutput\n\nif TYPE_CHECKING:\n from distilabel.steps import StepOutput\n\n\nclass CountTextCharacters(GlobalStep):\n @property\n def inputs(self) -> List[str]:\n return [\"text\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"text_character_count\"]\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n character_counts = []\n\n for input in inputs:\n text_character_count = len(input[\"text\"])\n input[\"text_character_count\"] = text_character_count\n character_counts.append(text_character_count)\n\n # Generate plot with the distribution of text character counts\n plt.figure(figsize=(10, 6))\n plt.hist(character_counts, bins=30, edgecolor=\"black\")\n plt.title(\"Distribution of Text Character Counts\")\n plt.xlabel(\"Character Count\")\n plt.ylabel(\"Frequency\")\n\n # Save the plot as an artifact of the step\n self.save_artifact(\n name=\"text_character_count_distribution\",\n write_function=lambda path: plt.savefig(path / \"figure.png\"),\n metadata={\"type\": \"image\", \"library\": \"matplotlib\"},\n )\n\n plt.close()\n\n yield inputs\n\n\nwith Pipeline() as pipeline:\n count_text_characters = CountTextCharacters()\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n dataset=load_dataset(\n \"HuggingFaceH4/instruction-dataset\", split=\"test\"\n ).rename_column(\"prompt\", \"text\"),\n )\n\n distiset.push_to_hub(\"distilabel-internal-testing/distilabel-artifacts-example\")\n
The generated distilabel-internal-testing/distilabel-artifacts-example dataset repository has a section in its card describing the artifacts generated by the pipeline and the generated plot can be seen here.
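If you prefer to keep everything local, the artifacts are also written when saving the Distiset to disk. Below is a minimal sketch; the output directory name is arbitrary and the rglob search is just an illustrative way to locate the generated figure, whatever the exact on-disk layout turns out to be:
from pathlib import Path\n\n# Save the dataset, the pipeline dump and the generated artifacts locally\ndistiset.save_to_disk(\"distilabel-artifacts-example\")\n\n# Locate the histogram saved by the CountTextCharacters step\nfor figure in Path(\"distilabel-artifacts-example\").rglob(\"figure.png\"):\n print(figure)\n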
"},{"location":"sections/how_to_guides/advanced/scaling_with_ray/","title":"Scaling and distributing a pipeline with Ray","text":"Although the local Pipeline based on multiprocessing
+ serving LLMs with an external service is enough for executing most of the pipelines used to create SFT and preference datasets, there are scenarios where you might need to scale your pipeline across multiple machines. In such cases, distilabel leverages Ray to distribute the workload efficiently. This allows you to generate larger datasets, reduce execution time, and maximize resource utilization across a cluster of machines, without needing to change a single line of code.
A distilabel
pipeline consists of several Steps. A Step is a class that defines a basic life-cycle: it loads or creates the resources it needs (LLMs, clients, etc.), processes the batches it receives from its input queue and puts the processed batches into its output queue, and finally unloads and cleans up those resources when it's done. So a Step needs to maintain a minimum state, and the best way to do that with Ray is using actors.
graph TD\n A[Step] -->|has| B[Multiple Replicas]\n B -->|wrapped in| C[Ray Actor]\n C -->|maintains| D[Step Replica State]\n C -->|executes| E[Step Lifecycle]\n E -->|1. Load/Create Resources| F[LLMs, Clients, etc.]\n E -->|2. Process batches from| G[Input Queue]\n E -->|3. Processed batches are put in| H[Output Queue]\n E -->|4. Unload| I[Cleanup]\n
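To make the actor idea more concrete, here is a tiny standalone Ray sketch (not distilabel's actual implementation) showing how an actor keeps its resources loaded in memory between batches:
import ray\n\nray.init()\n\n@ray.remote\nclass StepReplica:\n def __init__(self):\n # Load expensive resources once (e.g. an LLM client)\n self.resource = \"loaded-llm\"\n\n def process(self, batch):\n # The state created in __init__ is reused for every batch\n return [f\"{self.resource}: {item}\" for item in batch]\n\nreplica = StepReplica.remote()\nprint(ray.get(replica.process.remote([\"a\", \"b\"])))\n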
"},{"location":"sections/how_to_guides/advanced/scaling_with_ray/#executing-a-pipeline-with-ray","title":"Executing a pipeline with Ray","text":"The recommended way to execute a distilabel
pipeline using Ray is using the Ray Jobs API.
Before jumping on the explanation, let's first install the prerequisites:
pip install distilabel[ray]\n
Tip
It's recommended to create a virtual environment.
For the purpose of explaining how to execute a pipeline with Ray, we'll use the following pipeline throughout the examples:
from distilabel.models import vLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(name=\"text-generation-ray-pipeline\") as pipeline:\n load_data_from_hub = LoadDataFromHub(output_mappings={\"prompt\": \"instruction\"})\n\n text_generation = TextGeneration(\n llm=vLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n tokenizer=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n )\n )\n\n load_data_from_hub >> text_generation\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n load_data_from_hub.name: {\n \"repo_id\": \"HuggingFaceH4/instruction-dataset\",\n \"split\": \"test\",\n },\n text_generation.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 4096,\n }\n },\n \"resources\": {\"replicas\": 2, \"gpus\": 1}, # (1)\n },\n }\n )\n\n distiset.push_to_hub(\n \"<YOUR_HF_USERNAME_OR_ORGANIZATION>/text-generation-distilabel-ray\" # (2)\n )\n
text_generation
step and defining that we want two replicas and one GPU per replica. distilabel
will create two replicas of the step, i.e. two actors in the Ray cluster, and each actor will request to be allocated on a node of the cluster that has at least one GPU. You can read more about how Ray manages the resources here. It's a basic pipeline with just two steps: one to load a dataset from the Hub with an instruction
column and one to generate a response
for that instruction using Llama 3 8B Instruct with vLLM. Simple but enough to demonstrate how to distribute and scale the workload using a Ray cluster!
If you don't know the Ray Jobs API then it's recommended to read Ray Jobs Overview. Quick summary: Ray Jobs is the recommended way to execute a job in a Ray cluster as it will handle packaging, deploying and managing the Ray application.
To execute the pipeline above, we first need to create a directory (kind of a package) with the pipeline script (or scripts) that we will submit to the Ray cluster:
mkdir ray-pipeline\n
The content of the directory ray-pipeline
should be:
ray-pipeline/\n\u251c\u2500\u2500 pipeline.py\n\u2514\u2500\u2500 runtime_env.yaml\n
The first file contains the code of the pipeline, while the second one (runtime_env.yaml
) is a specific Ray file containing the environment dependencies required to run the job:
pip:\n - distilabel[ray,vllm] >= 1.3.0\nenv_vars:\n HF_TOKEN: <YOUR_HF_TOKEN>\n
With this file we're telling the Ray cluster that it will have to install distilabel
with the vllm
and ray
extra dependencies to be able to run the job. In addition, we're defining the HF_TOKEN
environment variable that will be used (by the push_to_hub
method) to upload the resulting dataset to the Hugging Face Hub.
After that, we can proceed to execute the ray
command that will submit the job to the Ray cluster:
ray job submit \\\n --address http://localhost:8265 \\\n --working-dir ray-pipeline \\\n --runtime-env ray-pipeline/runtime_env.yaml -- python pipeline.py\n
This will upload the --working-dir to the Ray cluster, install the dependencies, and then execute the python pipeline.py command from the head node.
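After submitting, the job can be monitored with the Ray Jobs CLI from the same machine; a small sketch where <JOB_ID> is a placeholder for the id printed by the submit command:
ray job status <JOB_ID>\nray job logs <JOB_ID> --follow\n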
As described in Using a file system to pass data to steps, distilabel
relies on the file system to pass the data to the GlobalStep
s, so if the pipeline to be executed in the Ray cluster has any GlobalStep, or if you want to set use_fs_to_pass_data=True in the run method, then you will need to set up a file system to which all the nodes of the Ray cluster have access:
if __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={...},\n storage_parameters={\"path\": \"file:///mnt/data\"}, # (1)\n use_fs_to_pass_data=True,\n )\n
/mnt/data
.RayPipeline
in a cluster with Slurm","text":"If you have access to an HPC, then you're probably also a user of Slurm, a workload manager typically used on HPCs. We can create a Slurm job that takes some nodes and deploys a Ray cluster to run a distributed distilabel
pipeline:
#!/bin/bash\n#SBATCH --job-name=distilabel-ray-text-generation\n#SBATCH --partition=your-partition\n#SBATCH --qos=normal\n#SBATCH --nodes=2 # (1)\n#SBATCH --exclusive\n#SBATCH --ntasks-per-node=1 # (2)\n#SBATCH --gpus-per-node=1 # (3)\n#SBATCH --time=0:30:00\n\nset -ex\n\necho \"SLURM_JOB_ID: $SLURM_JOB_ID\"\necho \"SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST\"\n\n# Activate virtual environment\nsource /path/to/virtualenv/.venv/bin/activate\n\n# Getting the node names\nnodes=$(scontrol show hostnames \"$SLURM_JOB_NODELIST\")\nnodes_array=($nodes)\n\n# Get the IP address of the head node\nhead_node=${nodes_array[0]}\nhead_node_ip=$(srun --nodes=1 --ntasks=1 -w \"$head_node\" hostname --ip-address)\n\n# Start Ray head node\nport=6379\nip_head=$head_node_ip:$port\nexport ip_head\necho \"IP Head: $ip_head\"\n\n# Generate a unique Ray tmp dir for the head node (just in case the default one is not writable)\nhead_tmp_dir=\"/tmp/ray_tmp_${SLURM_JOB_ID}_head\"\n\necho \"Starting HEAD at $head_node\"\nOUTLINES_CACHE_DIR=\"/tmp/.outlines\" srun --nodes=1 --ntasks=1 -w \"$head_node\" \\ # (4)\n ray start --head --node-ip-address=\"$head_node_ip\" --port=$port \\\n --dashboard-host=0.0.0.0 \\\n --dashboard-port=8265 \\\n --temp-dir=\"$head_tmp_dir\" \\\n --block &\n\n# Give some time to head node to start...\necho \"Waiting a bit before starting worker nodes...\"\nsleep 10\n\n# Start Ray worker nodes\nworker_num=$((SLURM_JOB_NUM_NODES - 1))\n\n# Start from 1 (0 is head node)\nfor ((i = 1; i <= worker_num; i++)); do\n node_i=${nodes_array[$i]}\n worker_tmp_dir=\"/tmp/ray_tmp_${SLURM_JOB_ID}_worker_$i\"\n echo \"Starting WORKER $i at $node_i\"\n OUTLINES_CACHE_DIR=\"/tmp/.outlines\" srun --nodes=1 --ntasks=1 -w \"$node_i\" \\\n ray start --address \"$ip_head\" \\\n --temp-dir=\"$worker_tmp_dir\" \\\n --block &\n sleep 5\ndone\n\n# Give some time to the Ray cluster to gather info\necho \"Waiting a bit before submitting the job...\"\nsleep 60\n\n# Finally submit the job to the cluster\nray job submit --address http://localhost:8265 --working-dir ray-pipeline -- python -u pipeline.py\n
OUTLINES_CACHE_DIR
to /tmp/.outlines
to avoid issues with the nodes trying to read/write the same outlines
cache files, which is not possible.vLLM
and tensor_parallel_size
","text":"In order to use vLLM
multi-GPU and multi-node capabilities with ray
, we need to make a few changes to the example pipeline above. The first change is to specify a value for tensor_parallel_size
aka \"In how many GPUs do I want you to load the model\", and the second one is to define ray
as the distributed_executor_backend
, since the default one in vLLM
is to use multiprocessing
:
with Pipeline(name=\"text-generation-ray-pipeline\") as pipeline:\n load_data_from_hub = LoadDataFromHub(output_mappings={\"prompt\": \"instruction\"})\n\n text_generation = TextGeneration(\n llm=vLLM(\n model=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n extra_kwargs={\n \"tensor_parallel_size\": 8,\n \"distributed_executor_backend\": \"ray\",\n }\n )\n )\n\n load_data_from_hub >> text_generation\n
More information about distributed inference with vLLM
can be found here: vLLM - Distributed Serving
LLM
for sharing it between several Task
s","text":"It's very common to want to use the same LLM
for several Task
s in a pipeline. To avoid loading the LLM
as many times as the number of Task
s and avoid wasting resources, it's recommended to serve the model using solutions like text-generation-inference
or vLLM
, and then use an AsyncLLM
compatible client like InferenceEndpointsLLM
or OpenAILLM
to communicate with the server respectively.
text-generation-inference
","text":"model=meta-llama/Meta-Llama-3-8B-Instruct\nvolume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run\n\ndocker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \\\n -e HUGGING_FACE_HUB_TOKEN=<secret> \\\n ghcr.io/huggingface/text-generation-inference:2.0.4 \\\n --model-id $model\n
Note
The bash command above has been copy-pasted from the official docs text-generation-inference. Please refer to the official docs for more information.
And then we can use InferenceEndpointsLLM
with base_url=http://localhost:8080
(pointing to our TGI
local deployment):
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration, UltraFeedback\n\nwith Pipeline(name=\"serving-llm\") as pipeline:\n load_data = LoadDataFromDicts(\n data=[{\"instruction\": \"Write a poem about the sun and moon.\"}]\n )\n\n # `base_url` points to the address of the `TGI` serving the LLM\n llm = InferenceEndpointsLLM(base_url=\"http://192.168.1.138:8080\")\n\n text_generation = TextGeneration(\n llm=llm,\n num_generations=3,\n group_generations=True,\n output_mappings={\"generation\": \"generations\"},\n )\n\n ultrafeedback = UltraFeedback(aspect=\"overall-rating\", llm=llm)\n\n load_data >> text_generation >> ultrafeedback\n
"},{"location":"sections/how_to_guides/advanced/serving_an_llm_for_reuse/#serving-llms-using-vllm","title":"Serving LLMs using vLLM
","text":"docker run --gpus all \\\n -v ~/.cache/huggingface:/root/.cache/huggingface \\\n --env \"HUGGING_FACE_HUB_TOKEN=<secret>\" \\\n -p 8000:8000 \\\n --ipc=host \\\n vllm/vllm-openai:latest \\\n --model meta-llama/Meta-Llama-3-8B-Instruct\n
Note
The bash command above has been copy-pasted from the official docs vLLM. Please refer to the official docs for more information.
And then we can use OpenAILLM
with base_url=http://localhost:8000
(pointing to our vLLM
local deployment):
from distilabel.models import OpenAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration, UltraFeedback\n\nwith Pipeline(name=\"serving-llm\") as pipeline:\n load_data = LoadDataFromDicts(\n data=[{\"instruction\": \"Write a poem about the sun and moon.\"}]\n )\n\n # `base_url` points to the address of the `vLLM` serving the LLM\n llm = OpenAILLM(base_url=\"http://192.168.1.138:8000\", model=\"\")\n\n text_generation = TextGeneration(\n llm=llm,\n num_generations=3,\n group_generations=True,\n output_mappings={\"generation\": \"generations\"},\n )\n\n ultrafeedback = UltraFeedback(aspect=\"overall-rating\", llm=llm)\n\n load_data >> text_generation >> ultrafeedback\n
"},{"location":"sections/how_to_guides/advanced/structured_generation/","title":"Structured data generation","text":"Distilabel
has integrations with relevant libraries to generate structured text i.e. to guide the LLM
towards the generation of structured outputs following a JSON schema, a regex, etc.
Distilabel
integrates outlines
within some LLM
subclasses. At the moment, the following LLMs integrated with outlines
are supported in distilabel
: TransformersLLM
, vLLM
or LlamaCppLLM
, so that anyone can generate structured outputs in the form of JSON or a parseable regex.
The LLM
has an argument named structured_output
1 that determines how we can generate structured outputs with it. Let's see an example using LlamaCppLLM
.
Note
For outlines
integration to work you may need to install the corresponding dependencies:
pip install distilabel[outlines]\n
"},{"location":"sections/how_to_guides/advanced/structured_generation/#json","title":"JSON","text":"We will start with a JSON example, where we initially define a pydantic.BaseModel
schema to guide the generation of the structured output.
Note
Take a look at StructuredOutputType
to see the expected format of the structured_output
dict variable.
from pydantic import BaseModel\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n
And then we provide that schema to the structured_output
argument of the LLM.
from distilabel.models import LlamaCppLLM\n\nllm = LlamaCppLLM(\n model_path=\"./openhermes-2.5-mistral-7b.Q4_K_M.gguf\", # (1)\n n_gpu_layers=-1,\n n_ctx=1024,\n structured_output={\"format\": \"json\", \"schema\": User},\n)\nllm.load()\n
llama.cpp
compatible, from the Hugging Face Hub using curl2, but any model can be used as replacement, as long as the model_path
argument is updated. And we are ready to pass our instruction as usual:
import json\n\nresult = llm.generate(\n [[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]],\n max_new_tokens=50\n)\n\ndata = json.loads(result[0][0])\ndata\n# {'name': 'Kathy', 'last_name': 'Smith', 'id': 4539210}\nUser(**data)\n# User(name='Kathy', last_name='Smith', id=4539210)\n
We get back a Python dictionary (formatted as a string) that we can parse using json.loads
, or validate it directly using the User
, which is a pydantic.BaseModel
instance.
The following shows an example of text generation whose output adheres to a regular expression:
pattern = r\"<name>(.*?)</name>.*?<grade>(.*?)</grade>\" #\u00a0the same pattern for re.compile\n\nllm=LlamaCppLLM(\n model_path=model_path,\n n_gpu_layers=-1,\n n_ctx=1024,\n structured_output={\"format\": \"regex\", \"schema\": pattern},\n)\nllm.load()\n\nresult = llm.generate(\n [\n [\n {\"role\": \"system\", \"content\": \"You are Simpsons' fans who loves assigning grades from A to E, where A is the best and E is the worst.\"},\n {\"role\": \"user\", \"content\": \"What's up with Homer Simpson?\"}\n ]\n ],\n max_new_tokens=200\n)\n
We can check the output by parsing the content using the same pattern we required from the LLM.
import re\nmatch = re.search(pattern, result[0][0])\n\nif match:\n name = match.group(1)\n grade = match.group(2)\n print(f\"Name: {name}, Grade: {grade}\")\n# Name: Homer Simpson, Grade: C+\n
These were some simple examples, but one can see the options this opens.
Tip
A full pipeline example can be seen in the following script: examples/structured_generation_with_outlines.py
For other LLM providers behind APIs, there's no direct way of accessing the internal logit processor like outlines
does, but thanks to instructor
we can generate structured output from LLM providers based on pydantic.BaseModel
objects. We have integrated instructor
to deal with the AsyncLLM
.
Note
For instructor
integration to work you may need to install the corresponding dependencies:
pip install distilabel[instructor]\n
Note
Take a look at InstructorStructuredOutputType
to see the expected format of the structured_output
dict variable.
The following is the same example you can see with outlines
's JSON
section for comparison purposes.
from pydantic import BaseModel\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n
And then we provide that schema to the structured_output
argument of the LLM:
Note
In this example we are using Meta Llama 3.1 8B Instruct, keep in mind not all the models support structured outputs.
from distilabel.models import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n structured_output={\"schema\": User}\n)\nllm.load()\n
And we are ready to pass our instructions as usual:
import json\n\nresult = llm.generate(\n [[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]],\n max_new_tokens=256\n)\n\ndata = json.loads(result[0][0])\ndata\n# {'name': 'John', 'last_name': 'Doe', 'id': 12345}\nUser(**data)\n# User(name='John', last_name='Doe', id=12345)\n
We get back a Python dictionary (formatted as a string) that we can parse using json.loads
, or validate it directly using the User
, which is a pydantic.BaseModel
instance.
Tip
A full pipeline example can be seen in the following script: examples/structured_generation_with_instructor.py
OpenAI offers a JSON Mode to deal with structured output via their API; let's see how to make use of it. JSON mode instructs the model to always return a JSON object following the instructions in the prompt.
Warning
Bear in mind, for this to work, you must instruct the model in some way to generate JSON, either in the system message
or in the instruction, as can be seen in the API reference.
Contrary to what we have via outlines
, JSON mode will not guarantee the output matches any specific schema, only that it is valid and parses without errors. More information can be found in the OpenAI documentation.
Besides instructing the model to generate JSON, to ensure it produces parseable JSON we can pass the argument response_format=\"json\"
3:
from distilabel.models import OpenAILLM\nllm = OpenAILLM(model=\"gpt4-turbo\", api_key=\"api.key\")\nllm.generate(..., response_format=\"json\")\n
You can check the variable type by importing it from:
from distilabel.steps.tasks.structured_outputs.outlines import StructuredOutputType\n
\u21a9 Download the model with curl:
curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf\n
\u21a9 Keep in mind that to interact with this response_format
argument in a pipeline, you will have to pass it via the generation_kwargs
:
# Assuming a pipeline is already defined, and we have a task using OpenAILLM called `task_with_openai`:\npipeline.run(\n parameters={\n \"task_with_openai\": {\n \"llm\": {\n \"generation_kwargs\": {\n \"response_format\": \"json\"\n }\n }\n }\n }\n)\n
\u21a9 Distilabel
offers a CLI
to explore and re-run existing Pipeline
dumps, meaning that an existing dump can be explored to see the steps, how those are connected, the runtime parameters used, and also re-run it with the same or different runtime parameters, respectively.
The only available command as of the current version of distilabel
is distilabel pipeline
.
$ distilabel pipeline --help\n\n Usage: distilabel pipeline [OPTIONS] COMMAND [ARGS]...\n\n Commands to run and inspect Distilabel pipelines.\n\n\u256d\u2500 Options \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 --help Show this message and exit. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n\u256d\u2500 Commands \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 info Get information about a Distilabel pipeline. \u2502\n\u2502 run Run a Distilabel pipeline. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
As shown above, distilabel pipeline
has two subcommands: info
and run
, as described below. Note that for testing purposes we will be using the following dataset.
distilabel pipeline info
","text":"$ distilabel pipeline info --help\n\n Usage: distilabel pipeline info [OPTIONS]\n\n Get information about a Distilabel pipeline.\n\n\u256d\u2500 Options \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 * --config TEXT Path or URL to the Distilabel pipeline configuration file. \u2502\n\u2502 [default: None] \u2502\n\u2502 [required] \u2502\n\u2502 --help Show this message and exit. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
As we can see from the help message, we need to pass either a Path
or a URL
. This second option comes in handy for datasets stored on the Hugging Face Hub, for example:
distilabel pipeline info --config \"https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml\"\n
If we take a look:
The pipeline information includes the steps used in the Pipeline
along with the Runtime Parameter
that was used, as well as a description of each of them, and also the connections between these steps. These can be helpful to explore the Pipeline locally.
distilabel pipeline run
","text":"We can also run a Pipeline
from the CLI just pointing to the same pipeline.yaml
file or a URL pointing to it and calling distilabel pipeline run
. Alternatively, a URL pointing to a Python script containing a distilabel pipeline can be used:
$ distilabel pipeline run --help\n\n Usage: distilabel pipeline run [OPTIONS]\n\n Run a Distilabel pipeline.\n\n\u256d\u2500 Options \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 --param PARSE_RUNTIME_PARAM [default: (dynamic)] \u2502\n\u2502 --config TEXT Path or URL to the Distilabel pipeline configuration file. \u2502\n\u2502 [default: None] \u2502\n\u2502 --script TEXT URL pointing to a python script containing a distilabel \u2502\n\u2502 pipeline. \u2502\n\u2502 [default: None] \u2502\n\u2502 --pipeline-variable-name TEXT Name of the pipeline in a script. I.e. the 'pipeline' \u2502\n\u2502 variable in `with Pipeline(...) as pipeline:...`. \u2502\n\u2502 [default: pipeline] \u2502\n\u2502 --ignore-cache --no-ignore-cache Whether to ignore the cache and re-run the pipeline from \u2502\n\u2502 scratch. \u2502\n\u2502 [default: no-ignore-cache] \u2502\n\u2502 --repo-id TEXT The Hugging Face Hub repository ID to push the resulting \u2502\n\u2502 dataset to. \u2502\n\u2502 [default: None] \u2502\n\u2502 --commit-message TEXT The commit message to use when pushing the dataset. \u2502\n\u2502 [default: None] \u2502\n\u2502 --private --no-private Whether to make the resulting dataset private on the Hub. \u2502\n\u2502 [default: no-private] \u2502\n\u2502 --token TEXT The Hugging Face Hub API token to use when pushing the \u2502\n\u2502 dataset. \u2502\n\u2502 [default: None] \u2502\n\u2502 --help Show this message and exit. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
Using --config
option, we must pass a path with a pipeline.yaml
file. To specify the runtime parameters of the steps we will need to use the --param
option and the value of the parameter in the following format:
distilabel pipeline run --config \"https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml\" \\\n --param load_dataset.repo_id=distilabel-internal-testing/instruction-dataset-mini \\\n --param load_dataset.split=test \\\n --param generate_with_gpt35.llm.generation_kwargs.max_new_tokens=512 \\\n --param generate_with_gpt35.llm.generation_kwargs.temperature=0.7 \\\n --param to_argilla.dataset_name=text_generation_with_gpt35 \\\n --param to_argilla.dataset_workspace=admin\n
Or using --script
we can directly pass a remote Python script (keep in mind --config
and --script
are exclusive):
distilabel pipeline run --script \"https://huggingface.co/datasets/distilabel-internal-testing/pipe_nothing_test/raw/main/pipe_nothing.py\"\n
You can also pass runtime parameters to the python script as we saw with --config
option.
Again, this helps with the reproducibility of the results, and simplifies sharing not only the final dataset but also the process to generate it.
"},{"location":"sections/how_to_guides/basic/llm/","title":"Executing Tasks with LLMs","text":""},{"location":"sections/how_to_guides/basic/llm/#working-with-llms","title":"Working with LLMs","text":"LLM subclasses are designed to be used within a Task, but they can also be used standalone.
from distilabel.models import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\"\n)\nllm.load()\n\nllm.generate_outputs(\n inputs=[\n [{\"role\": \"user\", \"content\": \"What's the capital of Spain?\"}],\n ],\n)\n# [\n# {\n# \"generations\": [\n# \"The capital of Spain is Madrid.\"\n# ],\n# \"statistics\": {\n# \"input_tokens\": [\n# 43\n# ],\n# \"output_tokens\": [\n# 8\n# ]\n# }\n# }\n# ]\n
Note
Always call the LLM.load
or Task.load
method when using LLMs standalone or as part of a Task
. If using a Pipeline
, this is done automatically in Pipeline.run()
.
New in version 1.5.0
Since version 1.5.0
the LLM output is a list of dictionaries (one per item in the inputs
), each containing generations
, that reports the text returned by the LLM
, and a statistics
field that will store statistics related to the LLM
generation. Initially, this will include input_tokens
and output_tokens
when available, which will be obtained via the API when available, or if a tokenizer is available for the model used, using the tokenizer for the model. This data will be moved by the corresponding Task
during the pipeline processing and moved to distilabel_metadata
so we can operate on this data if we want, like for example computing the number of tokens per dataset.
To access to the previous result one just has to access to the generations in the resulting dictionary: result[0][\"generations\"]
.
By default, all LLM
s will generate text in a synchronous manner i.e. send inputs using generate_outputs
method that will get blocked until outputs are generated. There are some LLM
s (such as OpenAILLM) that implement what we denote as offline batch generation, which allows sending the inputs to the LLM-as-a-service platform, which will generate the outputs asynchronously and give us a job id that we can use later to check the status and retrieve the generated outputs when they are ready. LLM-as-a-service platforms offer this feature as a way to save costs, in exchange for waiting for the outputs to be generated.
To use this feature in distilabel
the only thing we need to do is to set the use_offline_batch_generation
attribute to True
when creating the LLM
instance:
from distilabel.models import OpenAILLM\n\nllm = OpenAILLM(\n model=\"gpt-4o\",\n use_offline_batch_generation=True,\n)\n\nllm.load()\n\nllm.jobs_ids # (1)\n# None\n\nllm.generate_outputs( # (2)\n inputs=[\n [{\"role\": \"user\", \"content\": \"What's the capital of Spain?\"}],\n ],\n)\n# DistilabelOfflineBatchGenerationNotFinishedException: Batch generation with jobs_ids=('batch_OGB4VjKpu2ay9nz3iiFJxt5H',) is not finished\n\nllm.jobs_ids # (3)\n# ('batch_OGB4VjKpu2ay9nz3iiFJxt5H',)\n\n\nllm.generate_outputs( # (4)\n inputs=[\n [{\"role\": \"user\", \"content\": \"What's the capital of Spain?\"}],\n ],\n)\n# [{'generations': ['The capital of Spain is Madrid.'],\n# 'statistics': {'input_tokens': [13], 'output_tokens': [7]}}]\n
jobs_ids
attribute is None
.generate_outputs
will send the inputs to the LLM-as-a-service and return a DistilabelOfflineBatchGenerationNotFinishedException
since the outputs are not ready yet.generate_outputs
the jobs_ids
attribute will contain the job ids created for generating the outputs.generate_outputs
will return the outputs if they are ready or raise a DistilabelOfflineBatchGenerationNotFinishedException
if they are not ready yet.The offline_batch_generation_block_until_done
attribute can be used to block the generate_outputs
method until the outputs are ready, polling the platform at the specified interval in seconds.
from distilabel.models import OpenAILLM\n\nllm = OpenAILLM(\n model=\"gpt-4o\",\n use_offline_batch_generation=True,\n offline_batch_generation_block_until_done=5, # poll for results every 5 seconds\n)\nllm.load()\n\nllm.generate_outputs(\n inputs=[\n [{\"role\": \"user\", \"content\": \"What's the capital of Spain?\"}],\n ],\n)\n# [{'generations': ['The capital of Spain is Madrid.'],\n# 'statistics': {'input_tokens': [13], 'output_tokens': [7]}}]\n
"},{"location":"sections/how_to_guides/basic/llm/#within-a-task","title":"Within a Task","text":"Pass the LLM as an argument to the Task
, and the task will handle the rest.
from distilabel.models import OpenAILLM\nfrom distilabel.steps.tasks import TextGeneration\n\nllm = OpenAILLM(model=\"gpt-4o-mini\")\ntask = TextGeneration(name=\"text_generation\", llm=llm)\n\ntask.load()\n\nnext(task.process(inputs=[{\"instruction\": \"What's the capital of Spain?\"}]))\n# [{'instruction': \"What's the capital of Spain?\",\n# 'generation': 'The capital of Spain is Madrid.',\n# 'distilabel_metadata': {'raw_output_text_generation': 'The capital of Spain is Madrid.',\n# 'raw_input_text_generation': [{'role': 'user',\n# 'content': \"What's the capital of Spain?\"}],\n# 'statistics_text_generation': {'input_tokens': 13, 'output_tokens': 7}},\n# 'model_name': 'gpt-4o-mini'}]\n
Note
As mentioned in Working with LLMs section, the generation of an LLM is automatically moved to distilabel_metadata
to avoid interference with the common workflow, so the addition of the statistics
is an extra component available to the user, but nothing has to be changed in the defined pipelines.
LLMs can have runtime parameters, such as generation_kwargs
, provided via the Pipeline.run()
method using the params
argument.
Note
Runtime parameters can differ between LLM subclasses, caused by the different functionalities offered by the LLM providers.
from distilabel.pipeline import Pipeline\nfrom distilabel.models import OpenAILLM\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(name=\"text-generation-pipeline\") as pipeline:\n load_dataset = LoadDataFromDicts(\n name=\"load_dataset\",\n data=[{\"instruction\": \"Write a short story about a dragon that saves a princess from a tower.\"}],\n )\n\n text_generation = TextGeneration(\n name=\"text_generation\",\n llm=OpenAILLM(model=\"gpt-4o-mini\"),\n )\n\n load_dataset >> text_generation\n\nif __name__ == \"__main__\":\n pipeline.run(\n parameters={\n text_generation.name: {\"llm\": {\"generation_kwargs\": {\"temperature\": 0.3}}},\n },\n )\n
"},{"location":"sections/how_to_guides/basic/llm/#creating-custom-llms","title":"Creating custom LLMs","text":"To create custom LLMs, subclass either LLM
for synchronous or AsyncLLM
for asynchronous LLMs. Implement the following methods:
model_name
: A property containing the model's name.
generate
: A method that takes a list of prompts and returns generated texts.
agenerate
: A method that takes a single prompt and returns generated texts. This method is used within the generate
method of the AsyncLLM
class.
(optional) get_last_hidden_state
: a method that takes a list of prompts and returns a list of hidden states. This method is optional and will be used by some tasks such as the GenerateEmbeddings
task.
from typing import Any\n\nfrom pydantic import validate_call\n\nfrom distilabel.models import LLM\nfrom distilabel.typing import GenerateOutput, HiddenState\nfrom distilabel.typing import ChatType\n\nclass CustomLLM(LLM):\n @property\n def model_name(self) -> str:\n return \"my-model\"\n\n @validate_call\n def generate(self, inputs: List[ChatType], num_generations: int = 1, **kwargs: Any) -> List[GenerateOutput]:\n for _ in range(num_generations):\n ...\n\n def get_last_hidden_state(self, inputs: List[ChatType]) -> List[HiddenState]:\n ...\n
from typing import Any\n\nfrom pydantic import validate_call\n\nfrom distilabel.models import AsyncLLM\nfrom distilabel.typing import GenerateOutput, HiddenState\nfrom distilabel.typing import ChatType\n\nclass CustomAsyncLLM(AsyncLLM):\n @property\n def model_name(self) -> str:\n return \"my-model\"\n\n @validate_call\n async def agenerate(self, input: ChatType, num_generations: int = 1, **kwargs: Any) -> GenerateOutput:\n for _ in range(num_generations):\n ...\n\n def get_last_hidden_state(self, inputs: List[ChatType]) -> List[HiddenState]:\n ...\n
generate
and agenerate
keyword arguments (but input
and num_generations
) are considered as RuntimeParameter
s, so a value can be passed to them via the parameters
argument of the Pipeline.run
method.
Note
To have the arguments of the generate
and agenerate
coerced to the expected types, the validate_call
decorator is used, which will automatically coerce the arguments to the expected types and raise an error if the types are not correct. This is especially useful when providing a value for an argument of generate
or agenerate
from the CLI, since the CLI will always provide the arguments as strings.
Warning
Additional LLMs created in distilabel
will have to take into account how the statistics
are generated to properly include them in the LLM output.
Our LLM gallery shows a list of the available LLMs that can be used within the distilabel
library.
Pipeline
organises the Steps and Tasks in a sequence, where the output of one step is the input of the next one. A Pipeline
should be created by making use of the context manager along with passing a name, and optionally a description.
from distilabel.pipeline import Pipeline\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n ...\n
"},{"location":"sections/how_to_guides/basic/pipeline/#connecting-steps-with-the-stepconnect-method","title":"Connecting steps with the Step.connect
method","text":"Now, we can define the steps of our Pipeline
.
Note
Steps without predecessors (i.e. root steps), need to be GeneratorStep
s such as LoadDataFromDicts
or LoadDataFromHub
. After this, other steps can be defined.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromHub\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(name=\"load_dataset\")\n ...\n
Easily load your datasets
If you are already used to work with Hugging Face's Dataset
via load_dataset
or pd.DataFrame
, you can create the GeneratorStep
directly from the dataset (or dataframe), and create the step with the help of make_generator_step
:
datasets.Dataset
From pd.DataFrame
from distilabel.pipeline import Pipeline\nfrom distilabel.steps import make_generator_step\n\ndataset = [{\"instruction\": \"Tell me a joke.\"}]\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n loader = make_generator_step(dataset, output_mappings={\"prompt\": \"instruction\"})\n ...\n
from datasets import load_dataset\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import make_generator_step\n\ndataset = load_dataset(\n \"DIBT/10k_prompts_ranked\",\n split=\"train\"\n).filter(\n lambda r: r[\"avg_rating\"]>=4 and r[\"num_responses\"]>=2\n).select(range(500))\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n loader = make_generator_step(dataset, output_mappings={\"prompt\": \"instruction\"})\n ...\n
import pandas as pd\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import make_generator_step\n\ndataset = pd.read_csv(\"path/to/dataset.csv\")\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n loader = make_generator_step(dataset, output_mappings={\"prompt\": \"instruction\"})\n ...\n
Next, we will use prompt
column from the dataset obtained through LoadDataFromHub
and use several LLM
s to execute a TextGeneration
task. We will also use the Task.connect()
method to connect the steps, so the output of one step is the input of the next one.
Note
The order of the execution of the steps will be determined by the connections of the steps. In this case, the TextGeneration
tasks will be executed after the LoadDataFromHub
step.
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(name=\"load_dataset\")\n\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n task = TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n task.connect(load_dataset)\n\n ...\n
For each row of the dataset, the TextGeneration
task will generate a text based on the instruction
column and the LLM
model, and store the result (a single string) in a new column called generation
. Because we need to have the response
s in the same column, we will add GroupColumns
to combine them all in the same column as a list of strings.
Note
In this case, the GroupColumns
tasks will be executed after all TextGeneration
steps.
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import GroupColumns, LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(name=\"load_dataset\")\n\n combine_generations = GroupColumns(\n name=\"combine_generations\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"model_names\"],\n )\n\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n task = TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n load_dataset.connect(task)\n task.connect(combine_generations)\n
"},{"location":"sections/how_to_guides/basic/pipeline/#connecting-steps-with-the-operator","title":"Connecting steps with the >>
operator","text":"Besides the Step.connect
method: step1.connect(step2)
, there's an alternative way by making use of the >>
operator. We can connect steps in a more readable way, and it's also possible to connect multiple steps at once.
Each call to step1.connect(step2)
has been exchanged by step1 >> step2
within the loop.
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import GroupColumns, LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(name=\"load_dataset\")\n\n combine_generations = GroupColumns(\n name=\"combine_generations\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"model_names\"],\n )\n\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n task = TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n load_dataset >> task >> combine_generations\n
Each task is first appended to a list, and then all the calls to connections are done in a single call.
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import GroupColumns, LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(name=\"load_dataset\")\n\n combine_generations = GroupColumns(\n name=\"combine_generations\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"model_names\"],\n )\n\n tasks = []\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n tasks.append(\n TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n )\n\n load_dataset >> tasks >> combine_generations\n
"},{"location":"sections/how_to_guides/basic/pipeline/#routing-batches-to-specific-downstream-steps","title":"Routing batches to specific downstream steps","text":"In some pipelines, you may want to send batches from a single upstream step to specific downstream steps based on certain conditions. To achieve this, you can use a routing_batch_function
. This function takes a list of downstream steps and returns a list of step names to which each batch should be routed.
Let's update the example above to route the batches loaded by the LoadDataFromHub
step to just 2 of the TextGeneration
tasks. First, we will create our custom routing_batch_function
, and then we will update the pipeline to use it:
import random\nfrom distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline, routing_batch_function\nfrom distilabel.steps import GroupColumns, LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\n@routing_batch_function\ndef sample_two_steps(steps: list[str]) -> list[str]:\n return random.sample(steps, 2)\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(\n name=\"load_dataset\",\n output_mappings={\"prompt\": \"instruction\"},\n )\n\n tasks = []\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.0-pro\"),\n ):\n tasks.append(\n TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n )\n\n combine_generations = GroupColumns(\n name=\"combine_generations\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"model_names\"],\n )\n\n load_dataset >> sample_two_steps >> tasks >> combine_generations\n
The routing_batch_function
that we just built is a common one, so distilabel
comes with a built-in function that can be used to achieve the same behavior:
from distilabel.pipeline import sample_n_steps\n\nsample_two_steps = sample_n_steps(2)\n
"},{"location":"sections/how_to_guides/basic/pipeline/#running-the-pipeline","title":"Running the pipeline","text":""},{"location":"sections/how_to_guides/basic/pipeline/#pipelinedry_run","title":"Pipeline.dry_run","text":"Before running the Pipeline
we can check if the pipeline is valid using the Pipeline.dry_run()
method. It takes the same parameters as the run
method, which we will discuss in the following section, plus the batch_size
we want the dry run to use (by default set to 1).
with Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n ...\n\nif __name__ == \"__main__\":\n distiset = pipeline.dry_run(parameters=..., batch_size=1)\n
"},{"location":"sections/how_to_guides/basic/pipeline/#pipelinerun","title":"Pipeline.run","text":"After testing, we can now execute the full Pipeline
using the Pipeline.run()
method.
with Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n ...\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n \"load_dataset\": {\n \"repo_id\": \"distilabel-internal-testing/instruction-dataset-mini\",\n \"split\": \"test\",\n },\n \"text_generation_with_gpt-4-0125-preview\": {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 512,\n }\n }\n },\n \"text_generation_with_mistral-large-2402\": {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 512,\n }\n }\n },\n \"text_generation_with_gemini-1.0-pro\": {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 512,\n }\n }\n },\n },\n )\n
But if we run the pipeline above, we will see that the run
method will fail:
ValueError: Step 'text_generation_with_gpt-4-0125-preview' requires inputs ['instruction'], but only the inputs=['prompt', 'completion', 'meta'] are available, which means that the inputs=['instruction'] are missing or not available\nwhen the step gets to be executed in the pipeline. Please make sure previous steps to 'text_generation_with_gpt-4-0125-preview' are generating the required inputs.\n
This is because, before actually running the pipeline, we must ensure each step has the necessary input columns to be executed. In this case, the TextGeneration
task requires the instruction
column, but the LoadDataFromHub
step generates the prompt
column. To solve this, we can use the output_mappings
or input_mappings
arguments of individual Step
s, to map columns from one step to another.
with Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(\n name=\"load_dataset\",\n output_mappings={\"prompt\": \"instruction\"}\n )\n\n ...\n
If we execute the pipeline again, it will run successfully and we will have a Distiset
with the outputs of all the leaf steps of the pipeline which we can push to the Hugging Face Hub.
if __name__ == \"__main__\":\n distiset = pipeline.run(...)\n distiset.push_to_hub(\"distilabel-internal-testing/instruction-dataset-mini-with-generations\")\n
"},{"location":"sections/how_to_guides/basic/pipeline/#pipelinerun-with-a-dataset","title":"Pipeline.run with a dataset","text":"Note that in most cases if you don't need the extra flexibility the GeneratorSteps
bring you, you can create a dataset as you would normally do and pass it directly to the Pipeline.run method. Look at the highlighted lines to see what changed:
import random\nfrom datasets import load_dataset\nfrom distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline, routing_batch_function\nfrom distilabel.steps import GroupColumns\nfrom distilabel.steps.tasks import TextGeneration\n\n@routing_batch_function\ndef sample_two_steps(steps: list[str]) -> list[str]:\n return random.sample(steps, 2)\n\ndataset = load_dataset(\n \"distilabel-internal-testing/instruction-dataset-mini\",\n split=\"test\"\n)\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n tasks = []\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.0-pro\"),\n ):\n tasks.append(\n TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n )\n\n combine_generations = GroupColumns(\n name=\"combine_generations\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"model_names\"],\n )\n\n sample_two_steps >> tasks >> combine_generations\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n dataset=dataset,\n parameters=...\n )\n
"},{"location":"sections/how_to_guides/basic/pipeline/#stopping-the-pipeline","title":"Stopping the pipeline","text":"In case you want to stop the pipeline while it's running, you can press Ctrl+C or Cmd+C depending on your OS (or send a SIGINT
to the main process), and the outputs will be stored in the cache. Pressing it an additional time will force the pipeline to stop its execution, but this can lead to losing the generated outputs for certain batches.
If for some reason, the pipeline execution stops (for example by pressing Ctrl+C
), the state of the pipeline and the outputs will be stored in the cache, so we can resume the pipeline execution from the point where it was stopped.
If we want to force the pipeline to run again without using the cache, we can use the use_cache
argument of the Pipeline.run()
method:
if __name__ == \"__main__\":\n distiset = pipeline.run(parameters={...}, use_cache=False)\n
Note
For more information on caching, we refer the reader to the caching section.
"},{"location":"sections/how_to_guides/basic/pipeline/#adjusting-the-batch-size-for-each-step","title":"Adjusting the batch size for each step","text":"Memory issues can arise when processing large datasets or when using large models. To avoid this, we can use the input_batch_size
argument of individual tasks and the batch_size argument of the generator step. In the example below, the TextGeneration
task will receive 5 dictionaries, while the LoadDataFromHub
step will send 10 dictionaries per batch:
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import GroupColumns, LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(\n name=\"load_dataset\",\n output_mappings={\"prompt\": \"instruction\"},\n batch_size=10\n )\n\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n task = TextGeneration(\n name=f\"text_generation_with_{llm.model_name.replace('.', '-')}\",\n llm=llm,\n input_batch_size=5,\n )\n\n ...\n
"},{"location":"sections/how_to_guides/basic/pipeline/#serializing-the-pipeline","title":"Serializing the pipeline","text":"Sharing a pipeline with others is very easy, as we can serialize the pipeline object using the save
method. We can save the pipeline in different formats, such as yaml
or json
:
if __name__ == \"__main__\":\n pipeline.save(\"pipeline.yaml\", format=\"yaml\")\n
if __name__ == \"__main__\":\n pipeline.save(\"pipeline.json\", format=\"json\")\n
To load the pipeline, we can use the from_yaml
or from_json
methods:
pipeline = Pipeline.from_yaml(\"pipeline.yaml\")\n
pipeline = Pipeline.from_json(\"pipeline.json\")\n
Serializing the pipeline is very useful when we want to share the pipeline with others, or when we want to store the pipeline for future use. It can even be hosted online, so the pipeline can be executed directly using the CLI.
"},{"location":"sections/how_to_guides/basic/pipeline/#visualizing-the-pipeline","title":"Visualizing the pipeline","text":"We can visualize the pipeline using the Pipeline.draw()
method. This will create a mermaid
graph, and return the path to the image.
path_to_image = pipeline.draw(\n top_to_bottom=True,\n show_edge_labels=True,\n)\n
Within notebooks, we can simply call pipeline
and the graph will be displayed. Alternatively, we can use the Pipeline.draw()
method to have more control over the graph visualization and use IPython
to display it.
from IPython.display import Image, display\n\ndisplay(Image(path_to_image))\n
Let's now see what the pipeline of the fully working example looks like.
"},{"location":"sections/how_to_guides/basic/pipeline/#fully-working-example","title":"Fully working example","text":"To sum up, here is the full code of the pipeline we have created in this section. Note that you will need to change the name of the Hugging Face repository where the resulting will be pushed, set OPENAI_API_KEY
environment variable, set MISTRAL_API_KEY
and have gcloud
installed and configured:
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import GroupColumns, LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(\n name=\"load_dataset\",\n output_mappings={\"prompt\": \"instruction\"},\n )\n\n combine_generations = GroupColumns(\n name=\"combine_generations\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"model_names\"],\n )\n\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.0-pro\"),\n ):\n task = TextGeneration(\n name=f\"text_generation_with_{llm.model_name.replace('.', '-')}\", llm=llm\n )\n load_dataset.connect(task)\n task.connect(combine_generations)\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n \"load_dataset\": {\n \"repo_id\": \"distilabel-internal-testing/instruction-dataset-mini\",\n \"split\": \"test\",\n },\n \"text_generation_with_gpt-4-0125-preview\": {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 512,\n }\n }\n },\n \"text_generation_with_mistral-large-2402\": {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 512,\n }\n }\n },\n \"text_generation_with_gemini-1.0-pro\": {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 512,\n }\n }\n },\n },\n )\n distiset.push_to_hub(\n \"distilabel-internal-testing/instruction-dataset-mini-with-generations\"\n )\n
"},{"location":"sections/how_to_guides/basic/step/","title":"Steps for processing data","text":""},{"location":"sections/how_to_guides/basic/step/#working-with-steps","title":"Working with Steps","text":"The Step
is intended to be used within the scope of a Pipeline
, which will orchestrate the different steps defined but can also be used standalone.
Assuming that we have a Step
already defined as follows:
from typing import TYPE_CHECKING\nfrom distilabel.steps import Step, StepInput\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepColumns, StepOutput\n\nclass MyStep(Step):\n @property\n def inputs(self) -> \"StepColumns\":\n return [\"input_field\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n return [\"output_field\"]\n\n def process(self, inputs: StepInput) -> \"StepOutput\":\n for input in inputs:\n input[\"output_field\"] = input[\"input_field\"]\n yield inputs\n
Then we can use it as follows:
step = MyStep(name=\"my-step\")\nstep.load()\n\nnext(step.process([{\"input_field\": \"value\"}]))\n# [{'input_field': 'value', 'output_field': 'value'}]\n
Note
The Step.load()
always needs to be executed when the step is used standalone. Within a pipeline, this will be done automatically during pipeline execution.
input_mappings
, is a dictionary that maps keys from the input dictionaries to the keys expected by the step. For example, if input_mappings={\"instruction\": \"prompt\"}
, it means that the input key prompt
will be used as the key instruction
for the current step.
output_mappings
, is a dictionary that can be used to map the outputs of the step to other names. For example, if output_mappings={\"conversation\": \"prompt\"}
, it means that the output key conversation
will be renamed to prompt
for the next step.
input_batch_size
(by default set to 50), is independent for every step and determines how many input dictionaries the step will process at once.
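As a minimal, illustrative sketch (reusing the MyStep class defined above; the prompt and generation column names are hypothetical), these attributes are simply passed when instantiating the step:
step = MyStep(\n name=\"my-step\",\n # per the description above, the incoming `prompt` key is used as `input_field` by this step\n input_mappings={\"input_field\": \"prompt\"},\n # the `output_field` key produced by this step is renamed to `generation` for the next step\n output_mappings={\"output_field\": \"generation\"},\n # each batch processed by this step will contain at most 10 input dictionaries\n input_batch_size=10,\n)\n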
Step
s can also have RuntimeParameter
, which are parameters that can only be used after the pipeline initialisation when calling the Pipeline.run
.
from distilabel.mixins.runtime_parameters import RuntimeParameter\n\nclass Step(...):\n input_batch_size: RuntimeParameter[PositiveInt] = Field(\n default=DEFAULT_INPUT_BATCH_SIZE,\n description=\"The number of rows that will contain the batches processed by the\"\n \" step.\",\n )\n
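As a hedged sketch of how such a runtime parameter could then be provided (the my-step name and the value are only illustrative), it is passed through the parameters argument of Pipeline.run:
if __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n # `input_batch_size` is declared as a `RuntimeParameter`, so it can be set at run time\n \"my-step\": {\"input_batch_size\": 10},\n },\n )\n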
"},{"location":"sections/how_to_guides/basic/step/#types-of-steps","title":"Types of Steps","text":"There are two special types of Step
in distilabel
:
GeneratorStep
: is a step that only generates data, and it doesn't need any input data from previous steps and normally is the first node in a Pipeline
. More information: Components -> Step - GeneratorStep.
GlobalStep
: is a step with the standard interface i.e. receives inputs and generates outputs, but it processes all the data at once, and often is the final step in the Pipeline
. This means that a GlobalStep
requires the previous steps to finish before being able to start. More information: Components - Step - GlobalStep.
Task
, is essentially the same as a default Step
, but it relies on an LLM
as an attribute, and the process
method will be in charge of calling that LLM. More information: Components - Task.
We can define a custom step by creating a new subclass of the Step
and defining the following:
inputs
: is a property that returns a list of strings with the names of the required input fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.
outputs
: is a property that returns a list of strings with the names of the output fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.
process
: is a method that receives the input data and returns the output data, and it should be a generator, meaning that it should yield
the output data.
Note
The default signature for the process
method is process(self, *inputs: StepInput) -> StepOutput
. The argument inputs
should be respected, no more arguments can be provided, and the type-hints and return type-hints should be respected too because it should be able to receive any number of inputs by default i.e. more than one Step
at a time could be connected to the current one.
Warning
For the custom Step
subclasses to work properly with distilabel
and with the validation and serialization performed by default over each Step
in the Pipeline
, the type-hint for both StepInput
and StepOutput
should be used and not surrounded with double-quotes or imported under typing.TYPE_CHECKING
; otherwise, the validation and/or serialization will fail.
Step
Using the @step
decorator We can inherit from the Step
class and define the inputs
, outputs
, and process
methods as follows:
from typing import TYPE_CHECKING\nfrom distilabel.steps import Step, StepInput\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepColumns, StepOutput\n\nclass CustomStep(Step):\n @property\n def inputs(self) -> \"StepColumns\":\n ...\n\n @property\n def outputs(self) -> \"StepColumns\":\n ...\n\n def process(self, *inputs: StepInput) -> \"StepOutput\":\n for upstream_step_inputs in inputs:\n ...\n yield item\n\n # When overridden (ideally under the `typing_extensions.override` decorator)\n # @typing_extensions.override\n # def process(self, inputs: StepInput) -> StepOutput:\n # for input in inputs:\n # ...\n # yield inputs\n
The @step
decorator will take care of the boilerplate code, and will allow you to define the inputs
, outputs
, and process
methods in a more straightforward way. One downside is that it won't let you access the self
attributes, if any, nor set them, so if you need to access or set any attribute, you should go with the first approach of defining a custom Step
subclass.
from typing import TYPE_CHECKING\nfrom distilabel.steps import StepInput, step\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepOutput\n\n@step(inputs=[...], outputs=[...])\ndef CustomStep(inputs: StepInput) -> \"StepOutput\":\n for input in inputs:\n ...\n yield inputs\n\nstep = CustomStep(name=\"my-step\")\n
"},{"location":"sections/how_to_guides/basic/step/generator_step/","title":"GeneratorStep","text":"The GeneratorStep
is a subclass of Step
that is intended to be used as the first step within a Pipeline
, because it doesn't require input and generates data that can be used by other steps. Alternatively, it can also be used as a standalone.
from typing import List, TYPE_CHECKING\nfrom typing_extensions import override\n\nfrom distilabel.steps import GeneratorStep\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepColumns, GeneratorStepOutput\n\nclass MyGeneratorStep(GeneratorStep):\n instructions: List[str]\n\n @override\n def process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n if offset:\n self.instructions = self.instructions[offset:]\n\n while self.instructions:\n batch = [\n {\n \"instruction\": instruction\n } for instruction in self.instructions[: self.batch_size]\n ]\n self.instructions = self.instructions[self.batch_size :]\n yield (\n batch,\n True if len(self.instructions) == 0 else False,\n )\n\n @property\n def outputs(self) -> \"StepColumns\":\n return [\"instruction\"]\n
Then we can use it as follows:
step = MyGeneratorStep(\n name=\"my-generator-step\",\n instructions=[\"Tell me a joke.\", \"Tell me a story.\"],\n batch_size=1,\n)\nstep.load()\n\nnext(step.process(offset=0))\n# ([{'instruction': 'Tell me a joke.'}], False)\nnext(step.process(offset=1))\n# ([{'instruction': 'Tell me a story.'}], True)\n
Note
The Step.load()
always needs to be executed when the step is used standalone. Within a pipeline, this will be done automatically during pipeline execution.
We can define a custom generator step by creating a new subclass of the GeneratorStep
and defining the following:
outputs
: is a property that returns a list of strings with the names of the output fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.
process
: is a method that yields output data and a boolean flag indicating whether that's the last batch to be generated.
Note
The default signature for the process
method is process(self, offset: int = 0) -> GeneratorStepOutput
. The argument offset
should be respected, no more arguments can be provided, and the type-hints and return type-hints should be respected too, since the pipeline relies on the offset
to keep track of the batches that have already been generated.
Warning
For the custom Step
subclasses to work properly with distilabel
and with the validation and serialization performed by default over each Step
in the Pipeline
, the type-hint for both StepInput
and StepOutput
should be used and not surrounded with double-quotes or imported under typing.TYPE_CHECKING
; otherwise, the validation and/or serialization will fail.
GeneratorStep
Using the @step
decorator We can inherit from the GeneratorStep
class and define the outputs
, and process
methods as follows:
from typing import List, TYPE_CHECKING\nfrom typing_extensions import override\n\nfrom distilabel.steps import GeneratorStep\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepColumns, GeneratorStepOutput\n\nclass MyGeneratorStep(GeneratorStep):\n instructions: List[str]\n\n @override\n def process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n ...\n\n @property\n def outputs(self) -> \"StepColumns\":\n ...\n
The @step
decorator will take care of the boilerplate code, and will allow you to define the outputs
, and process
methods in a more straightforward way. One downside is that it won't let you access the self
attributes, if any, nor set them, so if you need to access or set any attribute, you should go with the first approach of defining a custom GeneratorStep
subclass.
from typing import TYPE_CHECKING\nfrom distilabel.steps import step\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import GeneratorStepOutput\n\n@step(outputs=[...], step_type=\"generator\")\ndef CustomGeneratorStep(offset: int = 0) -> \"GeneratorStepOutput\":\n yield (\n ...,\n True if offset == 10 else False,\n )\n\nstep = CustomGeneratorStep(name=\"my-step\")\n
"},{"location":"sections/how_to_guides/basic/step/global_step/","title":"GlobalStep","text":"The GlobalStep
is a subclass of Step
that is used to define a step that requires the previous steps to be completed before it can run, since it waits until all the input batches are received. This step is useful when you need all the input data to be processed at once before running. Alternatively, it can also be used as a standalone.
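As a minimal, illustrative sketch (the AddTotalRows class and its column names are our own, not part of distilabel), a GlobalStep receives every row at once, so it can compute aggregates over the whole dataset:
from typing import TYPE_CHECKING\nfrom distilabel.steps import GlobalStep, StepInput\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepColumns, StepOutput\n\nclass AddTotalRows(GlobalStep):\n @property\n def inputs(self) -> \"StepColumns\":\n return [\"instruction\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n return [\"total_rows\"]\n\n def process(self, inputs: StepInput) -> \"StepOutput\":\n # within a pipeline, `inputs` contains all the rows received from the previous steps\n for input in inputs:\n input[\"total_rows\"] = len(inputs)\n yield inputs\n\nstep = AddTotalRows(name=\"add-total-rows\")\nstep.load()\n\nnext(step.process([{\"instruction\": \"a\"}, {\"instruction\": \"b\"}]))\n# [{'instruction': 'a', 'total_rows': 2}, {'instruction': 'b', 'total_rows': 2}]\n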
We can define a custom step by creating a new subclass of the GlobalStep
and defining the following:
inputs
: is a property that returns a list of strings with the names of the required input fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.
outputs
: is a property that returns a list of strings with the names of the output fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.
process
: is a method that receives the input data and returns the output data, and it should be a generator, meaning that it should yield
the output data.
Note
The default signature for the process
method is process(self, *inputs: StepInput) -> StepOutput
. The argument inputs
should be respected, no more arguments can be provided, and the type-hints and return type-hints should be respected too because it should be able to receive any number of inputs by default i.e. more than one Step
at a time could be connected to the current one.
Warning
For the custom GlobalStep
subclasses to work properly with distilabel
and with the validation and serialization performed by default over each Step
in the Pipeline
, the type-hint for both StepInput
and StepOutput
should be used and not surrounded with double-quotes or imported under typing.TYPE_CHECKING
; otherwise, the validation and/or serialization will fail.
GlobalStep
Using the @step
decorator We can inherit from the GlobalStep
class and define the inputs
, outputs
, and process
methods as follows:
from typing import TYPE_CHECKING\nfrom distilabel.steps import GlobalStep, StepInput\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepColumns, StepOutput\n\nclass CustomStep(GlobalStep):\n @property\n def inputs(self) -> \"StepColumns\":\n ...\n\n @property\n def outputs(self) -> \"StepColumns\":\n ...\n\n def process(self, *inputs: StepInput) -> \"StepOutput\":\n for upstream_step_inputs in inputs:\n for item in upstream_step_inputs:\n ...\n yield item\n\n # When overridden (ideally under the `typing_extensions.override` decorator)\n # @typing_extensions.override\n # def process(self, inputs: StepInput) -> StepOutput:\n # for input in inputs:\n # ...\n # yield inputs\n
The @step
decorator will take care of the boilerplate code, and will allow you to define the inputs
, outputs
, and process
methods in a more straightforward way. One downside is that it won't let you access the self
attributes, if any, nor set them, so if you need to access or set any attribute, you should go with the first approach of defining a custom GlobalStep
subclass.
from typing import TYPE_CHECKING\nfrom distilabel.steps import StepInput, step\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepOutput\n\n@step(inputs=[...], outputs=[...], step_type=\"global\")\ndef CustomStep(inputs: StepInput) -> \"StepOutput\":\n for input in inputs:\n ...\n yield inputs\n\nstep = CustomStep(name=\"my-step\")\n
"},{"location":"sections/how_to_guides/basic/task/","title":"Tasks for generating and judging with LLMs","text":""},{"location":"sections/how_to_guides/basic/task/#working-with-tasks","title":"Working with Tasks","text":"The Task
is a special kind of Step
that includes the LLM
as a mandatory argument. As with a Step
, it is normally used within a Pipeline
but can also be used standalone.
For example, the most basic task is the TextGeneration
task, which generates text based on a given instruction.
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import TextGeneration\n\ntask = TextGeneration(\n name=\"text-generation\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n)\ntask.load()\n\nnext(task.process([{\"instruction\": \"What's the capital of Spain?\"}]))\n# [\n# {\n# \"instruction\": \"What's the capital of Spain?\",\n# \"generation\": \"The capital of Spain is Madrid.\",\n# \"distilabel_metadata\": {\n# \"raw_output_text-generation\": \"The capital of Spain is Madrid.\",\n# \"raw_input_text-generation\": [\n# {\n# \"role\": \"user\",\n# \"content\": \"What's the capital of Spain?\"\n# }\n# ],\n# \"statistics_text-generation\": { # (1)\n# \"input_tokens\": 18,\n# \"output_tokens\": 8\n# }\n# },\n# \"model_name\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n# }\n# ]\n
LLMs
will not only return the text but also a statistics_{STEP_NAME}
field that will contain statistics related to the generation. If available, at least the input and output tokens will be returned. Note
The Step.load()
always needs to be executed when the step is used standalone. Within a pipeline, this will be done automatically during pipeline execution.
As shown above, the TextGeneration
task adds a generation
based on the instruction
.
New in version 1.2.0
Since version 1.2.0
, we provide some metadata about the LLM call through distilabel_metadata
. This can be disabled by setting the add_raw_output
attribute to False
when creating the task.
Additionally, since version 1.4.0
, the formatted input can also be included, which can be helpful when testing custom templates (testing the pipeline using the dry_run
method).
task = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n add_raw_output=False,\n add_raw_input=False\n)\n
New in version 1.5.0
Since version 1.5.0
distilabel_metadata
includes a new statistics
field out of the box. The generation from the LLM will not only contain the text, but also statistics associated with the text if available, like the input and output tokens. This field will be generated with statistics_{STEP_NAME}
to avoid collisions between different steps in the pipeline, similar to how raw_output_{STEP_NAME}
works.
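For illustration, a small sketch of how those statistics could be read back from a processed row, reusing the text-generation task from the example above (the exact token counts will vary):
result = next(task.process([{\"instruction\": \"What's the capital of Spain?\"}]))\nstats = result[0][\"distilabel_metadata\"][\"statistics_text-generation\"]\nprint(stats[\"input_tokens\"], stats[\"output_tokens\"]) # e.g. 18 8\n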
New in version 1.4.0
New since version 1.4.0, the Task.print method is available.
The Tasks
include a handy method to show what the prompt formatted for an LLM
would look like; let's see an example with UltraFeedback
, but it applies to any other Task
.
from distilabel.steps.tasks import UltraFeedback\nfrom distilabel.models import InferenceEndpointsLLM\n\nuf = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n)\nuf.load()\nuf.print()\n
The result will be a rendered prompt, with the System prompt (if the task defines one) and the User prompt, rendered with rich (it will show exactly the same in a Jupyter notebook).
In case you want to test with a custom input, you can pass an example to the task's format_input method (or generate it on your own depending on the task), and pass it to the print method so that it shows your example:
uf.print(\n uf.format_input({\"instruction\": \"test\", \"generations\": [\"1\", \"2\"]})\n)\n
Using a DummyLLM to avoid loading one In case you don't want to load an LLM to render the template, you can create a dummy one like the ones we could use for testing.
from typing import Any\n\nfrom distilabel.models import LLM\nfrom distilabel.models.mixins import MagpieChatTemplateMixin\n\nclass DummyLLM(LLM, MagpieChatTemplateMixin):\n structured_output: Any = None\n magpie_pre_query_template: str = \"llama3\"\n\n def load(self) -> None:\n pass\n\n @property\n def model_name(self) -> str:\n return \"test\"\n\n def generate(\n self, input: \"FormattedInput\", num_generations: int = 1\n ) -> \"GenerateOutput\":\n return [\"output\" for _ in range(num_generations)]\n
You can use this LLM
just as any of the other ones to load
your task and call print
:
uf = UltraFeedback(llm=DummyLLM())\nuf.load()\nuf.print()\n
Note
When creating a custom task, the print
method will be available by default, but it is limited to the most common scenarios for the inputs. If you test your new task and find it's not working as expected (for example, if your task contains one input consisting of a list of texts instead of a single one), you should override the _sample_input
method. You can inspect the UltraFeedback
source code for this.
All the Task
s have a num_generations
attribute that allows defining the number of generations that we want to have per input. We can update the example above to generate 3 completions per input:
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import TextGeneration\n\ntask = TextGeneration(\n name=\"text-generation\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n num_generations=3,\n)\ntask.load()\n\nnext(task.process([{\"instruction\": \"What's the capital of Spain?\"}]))\n# [\n# {\n# 'instruction': \"What's the capital of Spain?\",\n# 'generation': 'The capital of Spain is Madrid.',\n# 'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},\n# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'\n# },\n# {\n# 'instruction': \"What's the capital of Spain?\",\n# 'generation': 'The capital of Spain is Madrid.',\n# 'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},\n# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'\n# },\n# {\n# 'instruction': \"What's the capital of Spain?\",\n# 'generation': 'The capital of Spain is Madrid.',\n# 'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},\n# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'\n# }\n# ]\n
In addition, we might want to group the generations in a single output row as maybe one downstream step expects a single row with multiple generations. We can achieve this by setting the group_generations
attribute to True
:
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import TextGeneration\n\ntask = TextGeneration(\n name=\"text-generation\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n num_generations=3,\n group_generations=True\n)\ntask.load()\n\nnext(task.process([{\"instruction\": \"What's the capital of Spain?\"}]))\n# [\n# {\n# 'instruction': \"What's the capital of Spain?\",\n# 'generation': ['The capital of Spain is Madrid.', 'The capital of Spain is Madrid.', 'The capital of Spain is Madrid.'],\n# 'distilabel_metadata': [\n# {'raw_output_text-generation': 'The capital of Spain is Madrid.'},\n# {'raw_output_text-generation': 'The capital of Spain is Madrid.'},\n# {'raw_output_text-generation': 'The capital of Spain is Madrid.'}\n# ],\n# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'\n# }\n# ]\n
"},{"location":"sections/how_to_guides/basic/task/#defining-custom-tasks","title":"Defining custom Tasks","text":"We can define a custom step by creating a new subclass of the Task
and defining the following:
inputs
: is a property that returns a list of strings with the names of the required input fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.
format_input
: is a method that receives a dictionary with the input data and returns a ChatType
following the chat-completion OpenAI message formatting.
outputs
: is a property that returns a list of strings with the names of the output fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not. This property should always include model_name
as one of the outputs since that's automatically injected from the LLM.
format_output
: is a method that receives the output from the LLM
and optionally also the input data (which may be useful to build the output in some scenarios), and returns a dictionary with the output data formatted as needed i.e. with the values for the columns in outputs
. Note that there's no need to include the model_name
in the output.
Task
Using the @task
decorator When using the Task
class inheritance method for creating a custom task, we can also optionally override the Task.process
method to define a more complex processing logic involving an LLM
, as the default one just calls the LLM.generate
method once, formatting the input beforehand and the output afterwards. For example, the EvolInstruct task overrides this method to call the LLM.generate
multiple times (one for each evolution).
from typing import Any, Dict, List, Union, TYPE_CHECKING\n\nfrom distilabel.steps.tasks import Task\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepColumns\n from distilabel.steps.tasks.typing import ChatType\n\n\nclass MyCustomTask(Task):\n @property\n def inputs(self) -> \"StepColumns\":\n return [\"input_field\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n return [\n {\n \"role\": \"user\",\n \"content\": input[\"input_field\"],\n },\n ]\n\n @property\n def outputs(self) -> \"StepColumns\":\n return [\"output_field\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n return {\"output_field\": output}\n
If your task just needs a system prompt, a user message template and a way to format the output given by the LLM
, then you can use the @task
decorator to avoid writing too much boilerplate code.
from typing import Any, Dict, Union\nfrom distilabel.steps.tasks import task\n\n\n@task(inputs=[\"input_field\"], outputs=[\"output_field\"])\ndef MyCustomTask(output: Union[str, None], input: Union[Dict[str, Any], None] = None) -> Dict[str, Any]:\n \"\"\"\n ---\n system_prompt: |\n My custom system prompt\n\n user_message_template: |\n My custom user message template: {input_field}\n ---\n \"\"\"\n # Format the `LLM` output here\n return {\"output_field\": output}\n
Warning
Most Tasks
reuse the Task.process
method to process the generations, but if a new Task
defines a custom process
method, as happens for example with Magpie
, one has to deal with the statistics
returned by the LLM
.
The GeneratorTask
is a custom implementation of a Task
based on the GeneratorStep
. As with a Task
, it is normally used within a Pipeline
but can also be used standalone.
Warning
This task is still experimental and may be subject to changes in the future.
from typing import Any, Dict, List, Union\nfrom typing_extensions import override\n\nfrom distilabel.steps.tasks.base import GeneratorTask\nfrom distilabel.steps.tasks.typing import ChatType\nfrom distilabel.steps.typing import GeneratorOutput\n\n\nclass MyCustomTask(GeneratorTask):\n instruction: str\n\n @override\n def process(self, offset: int = 0) -> GeneratorOutput:\n # keep the raw generation in its own variable so it isn't overwritten below\n generation = self.llm.generate(\n inputs=[\n [\n {\"role\": \"user\", \"content\": self.instruction},\n ],\n ],\n )\n output = {\"model_name\": self.llm.model_name}\n output.update(\n self.format_output(output=generation, input=None)\n )\n yield output\n\n @property\n def outputs(self) -> List[str]:\n return [\"output_field\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n return {\"output_field\": output}\n
We can then use it as follows:
from distilabel.models import OpenAILLM\n\ntask = MyCustomTask(\n name=\"custom-generation\",\n instruction=\"Tell me a joke.\",\n llm=OpenAILLM(model=\"gpt-4\"),\n)\ntask.load()\n\nnext(task.process())\n# [{\"output_field\": \"Why did the scarecrow win an award? Because he was outstanding!\", \"model_name\": \"gpt-4\"}]\n
Note
Most of the times you would need to override the default process
method, as it's suited for the standard Task
and not for the GeneratorTask
. But within the context of the process
function you can freely use the llm
to generate data in any way.
Note
The Step.load()
always needs to be executed when the step is used standalone. Within a pipeline, this will be done automatically during pipeline execution.
We can define a custom generator task by creating a new subclass of the GeneratorTask
and defining the following:
process
: is a method that generates the data based on the LLM
and the instruction
provided within the class instance, and returns a dictionary with the output data formatted as needed i.e. with the values for the columns in outputs
. Note that the inputs
argument is not allowed in this function since this is a GeneratorTask
. The signature only expects the offset
argument, which is used to keep track of the current iteration in the generator.
outputs
: is a property that returns a list of strings with the names of the output fields, this property should always include model_name
as one of the outputs since that's automatically injected from the LLM.
format_output
: is a method that receives the output from the LLM
and optionally also the input data (which may be useful to build the output in some scenarios), and returns a dictionary with the output data formatted as needed i.e. with the values for the columns in outputs
. Note that there's no need to include the model_name
in the output.
from typing import Any, Dict, List, Union\n\nfrom typing_extensions import override\n\nfrom distilabel.steps.tasks.base import GeneratorTask\nfrom distilabel.steps.typing import GeneratorOutput\n\n\nclass MyCustomTask(GeneratorTask):\n @override\n def process(self, offset: int = 0) -> GeneratorOutput:\n # keep the raw generation in its own variable so it isn't overwritten below\n generation = self.llm.generate(\n inputs=[\n [{\"role\": \"user\", \"content\": \"Tell me a joke.\"}],\n ],\n )\n output = {\"model_name\": self.llm.model_name}\n output.update(\n self.format_output(output=generation, input=None)\n )\n yield output\n\n @property\n def outputs(self) -> List[str]:\n return [\"output_field\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n return {\"output_field\": output}\n
"},{"location":"sections/pipeline_samples/","title":"Tutorials","text":"Generate a preference dataset
Learn about synthetic data generation for ORPO and DPO.
Tutorial
Clean an existing preference dataset
Learn about how to provide AI feedback to clean an existing dataset.
Tutorial
Retrieval and reranking models
Learn about synthetic data generation for fine-tuning custom retrieval and reranking models.
Tutorial
Generate text classification data
Learn about how synthetic data generation for text classification can help address data imbalance or scarcity.
Tutorial
Deepseek Prover
Learn about an approach to generate mathematical proofs for theorems generated from informal math problems.
Example
DEITA
Learn about prompt and response tuning for complexity and quality, and LLMs as judges for automatic data selection.
Paper
Instruction Backtranslation
Learn about automatically labeling human-written text with corresponding instructions.
Paper
Prometheus 2
Learn about using open-source models as judges for direct assessment and pair-wise ranking.
Paper
UltraFeedback
Learn about a large-scale, fine-grained, diverse preference dataset, used for training powerful reward and critic models.
Paper
APIGen
Learn how to create verifiable high-quality datasets for function-calling applications.
Paper
CLAIR
Learn Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs.
Paper
Math Shepherd
Learn about Math-Shepherd, a framework to generate datasets to train process reward models (PRMs) which assign reward scores to each step of math problem solutions.
Paper
Benchmarking with distilabel
Learn about reproducing the Arena Hard benchmark with distilabel.
Example
Structured generation with outlines
Learn about generating RPG characters following a pydantic.BaseModel with outlines in distilabel.
Example
Structured generation with instructor
Learn about answering instructions with knowledge graphs defined as pydantic.BaseModel objects using instructor in distilabel.
Example
Create a social network with FinePersonas
Learn how to leverage FinePersonas to create a synthetic social network and fine-tune adapters for Multi-LoRA.
Example
Create questions and answers for an exam
Learn how to generate questions and answers for an exam, using a raw Wikipedia page and structured generation.
Example
Text generation with images in distilabel
Ask questions about images using distilabel.
Example
distilabel
","text":"Benchmark LLMs with distilabel
: reproducing the Arena Hard benchmark.
The script below first defines both the ArenaHard
and the ArenaHardResults
tasks, so as to generate responses for a given collection of prompts/questions with up to two LLMs, and then calculate the results as per the original implementation, respectively. Additionally, the second part of the example builds a Pipeline
to run the generation on top of the prompts with InferenceEndpointsLLM
while streaming the rest of the generations from a pre-computed set of GPT-4 generations, and then evaluate one against the other with OpenAILLM
generating an alternate response, a comparison between the responses, and a result as A>>B, A>B, B>A, B>>A, or tie.
To run this example you will first need to install the Arena Hard optional dependencies, namely pandas
, scikit-learn
, and numpy
.
python examples/arena_hard.py\n
arena_hard.py# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport re\nfrom typing import Any, Dict, List, Optional, Union\n\nfrom typing_extensions import override\n\nfrom distilabel.steps import GlobalStep, StepInput\nfrom distilabel.steps.tasks.base import Task\nfrom distilabel.steps.tasks.typing import ChatType\nfrom distilabel.steps.typing import StepOutput\n\n\nclass ArenaHard(Task):\n \"\"\"Evaluates two assistant responses using an LLM as judge.\n\n This `Task` is based on the \"From Live Data to High-Quality Benchmarks: The\n Arena-Hard Pipeline\" paper that presents Arena Hard, which is a benchmark for\n instruction-tuned LLMs that contains 500 challenging user queries. GPT-4 is used\n as the judge to compare the model responses against a baseline model, which defaults\n to `gpt-4-0314`.\n\n Note:\n Arena-Hard-Auto has the highest correlation and separability to Chatbot Arena\n among popular open-ended LLM benchmarks.\n\n Input columns:\n - instruction (`str`): The instruction to evaluate the responses.\n - generations (`List[str]`): The responses generated by two, and only two, LLMs.\n\n Output columns:\n - evaluation (`str`): The evaluation of the responses generated by the LLMs.\n - score (`str`): The score extracted from the evaluation.\n - model_name (`str`): The model name used to generate the evaluation.\n\n Categories:\n - benchmark\n\n References:\n - [From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/)\n - [`arena-hard-auto`](https://github.com/lm-sys/arena-hard-auto/tree/main)\n\n Examples:\n\n Evaluate two assistant responses for a given instruction using Arean Hard prompts:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps import GroupColumns, LoadDataFromDicts\n from distilabel.steps.tasks import ArenaHard, TextGeneration\n\n with Pipeline() as pipeline:\n load_data = LoadDataFromDicts(\n data=[{\"instruction\": \"What is the capital of France?\"}],\n )\n\n text_generation_a = TextGeneration(\n llm=..., # LLM instance\n output_mappings={\"model_name\": \"generation_model\"},\n )\n\n text_generation_b = TextGeneration(\n llm=..., # LLM instance\n output_mappings={\"model_name\": \"generation_model\"},\n )\n\n combine = GroupColumns(\n columns=[\"generation\", \"generation_model\"],\n output_columns=[\"generations\", \"generation_models\"],\n )\n\n arena_hard = ArenaHard(\n llm=..., # LLM instance\n )\n\n load_data >> [text_generation_a, text_generation_b] >> combine >> arena_hard\n ```\n \"\"\"\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs required by this task are the `instruction` and the `generations`,\n which are the responses generated by two, and only two, LLMs.\"\"\"\n return [\"instruction\", \"generations\"]\n\n def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"This method formats the input data as a `ChatType` using the prompt defined\n by the Arena Hard benchmark, which consists on a 
`system_prompt` plus a template\n for the user first message that contains the `instruction` and both `generations`.\n \"\"\"\n return [\n {\n \"role\": \"system\",\n \"content\": \"Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user prompt displayed below. You will be given assistant A's answer and assistant B's answer. Your job is to evaluate which assistant's answer is better.\\n\\nBegin your evaluation by generating your own answer to the prompt. You must provide your answers before judging any answers.\\n\\nWhen evaluating the assistants' answers, compare both assistants' answers with your answer. You must identify and correct any mistakes or inaccurate information.\\n\\nThen consider if the assistant's answers are helpful, relevant, and concise. Helpful means the answer correctly responds to the prompt or follows the instructions. Note when user prompt has any ambiguity or more than one interpretation, it is more helpful and appropriate to ask for clarifications or more information from the user than providing an answer based on assumptions. Relevant means all parts of the response closely connect or are appropriate to what is being asked. Concise means the response is clear and not verbose or excessive.\\n\\nThen consider the creativity and novelty of the assistant's answers when needed. Finally, identify any missing important information in the assistants' answers that would be beneficial to include when responding to the user prompt.\\n\\nAfter providing your explanation, you must output only one of the following choices as your final verdict with a label:\\n\\n1. Assistant A is significantly better: [[A>>B]]\\n2. Assistant A is slightly better: [[A>B]]\\n3. Tie, relatively the same: [[A=B]]\\n4. Assistant B is slightly better: [[B>A]]\\n5. Assistant B is significantly better: [[B>>A]]\\n\\nExample output: \\\"My final verdict is tie: [[A=B]]\\\".\",\n },\n {\n \"role\": \"user\",\n \"content\": f\"<|User Prompt|>\\n{input['instruction']}\\n\\n<|The Start of Assistant A's Answer|>\\n{input['generations'][0]}\\n<|The End of Assistant A's Answer|>\\n\\n<|The Start of Assistant B's Answer|>\\n{input['generations'][1]}\\n<|The End of Assistant B's Answer|>\",\n },\n ]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The outputs generated by this task are the `evaluation`, the `score` and\n the `model_name` (which is automatically injected within the `process` method\n of the parent task).\"\"\"\n return [\"evaluation\", \"score\", \"model_name\"]\n\n def format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n ) -> Dict[str, Any]:\n \"\"\"This method formats the output generated by the LLM as a Python dictionary\n containing the `evaluation` which is the raw output generated by the LLM (consisting\n of the judge LLM alternate generation for the given instruction, plus an explanation\n on the evaluation of the given responses; plus the `score` extracted from the output.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. 
Is provided in case it needs to be used to enrich\n the output if needed.\n\n Returns:\n A dict with the keys `evaluation` with the raw output which contains the LLM\n evaluation and the extracted `score` if possible.\n \"\"\"\n if output is None:\n return {\"evaluation\": None, \"score\": None}\n pattern = re.compile(r\"\\[\\[([AB<>=]+)\\]\\]\")\n match = pattern.search(output)\n if match is None:\n return {\"evaluation\": output, \"score\": None}\n return {\"evaluation\": output, \"score\": match.group(1)}\n\n\nclass ArenaHardResults(GlobalStep):\n \"\"\"Process Arena Hard results to calculate the ELO scores.\n\n This `Step` is based on the \"From Live Data to High-Quality Benchmarks: The\n Arena-Hard Pipeline\" paper that presents Arena Hard, which is a benchmark for\n instruction-tuned LLMs that contains 500 challenging user queries. This step is\n a `GlobalStep` that should run right after the `ArenaHard` task to calculate the\n ELO scores for the evaluated models.\n\n Note:\n Arena-Hard-Auto has the highest correlation and separability to Chatbot Arena\n among popular open-ended LLM benchmarks.\n\n Input columns:\n - evaluation (`str`): The evaluation of the responses generated by the LLMs.\n - score (`str`): The score extracted from the evaluation.\n\n References:\n - [From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/)\n - [`arena-hard-auto`](https://github.com/lm-sys/arena-hard-auto/tree/main)\n\n Examples:\n\n Rate the ELO scores for two assistant responses for a given an evaluation / comparison between both using Arean Hard prompts:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps import GroupColumns, LoadDataFromDicts\n from distilabel.steps.tasks import ArenaHard, TextGeneration\n\n with Pipeline() as pipeline:\n load_data = LoadDataFromDicts(\n data=[{\"instruction\": \"What is the capital of France?\"}],\n )\n\n text_generation_a = TextGeneration(\n llm=..., # LLM instance\n output_mappings={\"model_name\": \"generation_model\"},\n )\n\n text_generation_b = TextGeneration(\n llm=..., # LLM instance\n output_mappings={\"model_name\": \"generation_model\"},\n )\n\n combine = GroupColumns(\n columns=[\"generation\", \"generation_model\"],\n output_columns=[\"generations\", \"generation_models\"],\n )\n\n arena_hard = ArenaHard(\n llm=..., # LLM instance\n )\n\n arena_hard_results = ArenaHardResults(\n custom_model_column=\"generation_models\",\n custom_weights={\"A>B\": 1, \"A>>B\": 3, \"B>A\": 1, \"B>>A\": 3},\n )\n\n load_data >> [text_generation_a, text_generation_b] >> combine >> arena_hard >> arena_hard_results\n ```\n\n \"\"\"\n\n custom_model_column: Optional[str] = None\n custom_weights: Dict[str, int] = {\"A>B\": 1, \"A>>B\": 3, \"B>A\": 1, \"B>>A\": 3}\n\n def load(self) -> None:\n \"\"\"Ensures that the required dependencies are installed.\"\"\"\n super().load()\n\n try:\n import numpy as np # noqa: F401\n import pandas as pd # noqa: F401\n from sklearn.linear_model import LogisticRegression # noqa: F401\n except ImportError as e:\n raise ImportError(\n \"In order to run `ArenaHardResults`, the `arena-hard` extra dependencies\"\n \" must be installed i.e. 
`numpy`, `pandas`, and `scikit-learn`.\\n\"\n \"Please install the dependencies by running `pip install distilabel[arena-hard]`.\"\n ) from e\n\n # TODO: the `evaluation` is not really required as an input, so it could be removed, since\n # only `score` is used / required\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs required by this step are the `evaluation` and the `score` generated\n by the `ArenaHard` task. Since this step does use the identifiers `model_a` and `model_b`,\n optionally one can set `custom_model_column` to use the model names if existing within\n the input data, ideally this value should be `model_name` if connected from the `ArenaHard`\n step.\"\"\"\n columns = [\"evaluation\", \"score\"]\n if self.custom_model_column:\n columns.append(self.custom_model_column)\n return columns\n\n @override\n def process(self, inputs: StepInput) -> StepOutput: # type: ignore\n \"\"\"This method processes the inputs generated by the `ArenaHard` task to calculate the\n win rates for each of the models to evaluate. Since this step inherits from the `GlobalStep`,\n it will wait for all the input batches to be processed, and then the output will be yielded in\n case there's a follow up step, since this step won't modify the received inputs.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n A list of Python dictionaries with the outputs of the task.\n\n References:\n - https://github.com/lm-sys/arena-hard-auto/blob/main/show_result.py\n \"\"\"\n import numpy as np\n import pandas as pd\n from sklearn.linear_model import LogisticRegression\n\n models = [\"A\", \"B\"]\n if self.custom_model_column:\n models = inputs[0][self.custom_model_column]\n\n # TODO: the battles are only calculated for the first game, even though the official\n # implementation also covers the possibility of a second game (not within the released\n # dataset yet)\n battles = pd.DataFrame()\n for input in inputs:\n output = {\n # TODO: \"question_id\": input[\"question_id\"],\n \"model_a\": models[0],\n \"model_b\": models[1],\n }\n if input[\"score\"] in [\"A>B\", \"A>>B\"]:\n output[\"winner\"] = models[0]\n rows = [output] * self.custom_weights[input[\"score\"]]\n elif input[\"score\"] in [\"B>A\", \"B>>A\"]:\n output[\"winner\"] = models[1]\n rows = [output] * self.custom_weights[input[\"score\"]]\n elif input[\"score\"] == \"A=B\":\n output[\"winner\"] = \"tie\"\n rows = [output]\n else:\n continue\n\n battles = pd.concat([battles, pd.DataFrame(rows)])\n\n models = pd.concat([battles[\"model_a\"], battles[\"model_b\"]]).unique()\n models = pd.Series(np.arange(len(models)), index=models)\n\n battles = pd.concat([battles, battles], ignore_index=True)\n p = len(models.index)\n n = battles.shape[0]\n\n X = np.zeros([n, p])\n X[np.arange(n), models[battles[\"model_a\"]]] = +np.log(10)\n X[np.arange(n), models[battles[\"model_b\"]]] = -np.log(10)\n\n Y = np.zeros(n)\n Y[battles[\"winner\"] == \"model_a\"] = 1.0\n\n tie_idx = battles[\"winner\"] == \"tie\"\n tie_idx[len(tie_idx) // 2 :] = False\n Y[tie_idx] = 1.0\n\n lr = LogisticRegression(fit_intercept=False, penalty=None, tol=1e-8) # type: ignore\n lr.fit(X, Y)\n\n # The ELO scores are calculated assuming that the reference is `gpt-4-0314`\n # with an starting ELO of 1000, so that the evaluated models are compared with\n # `gtp-4-0314` only if it's available within the models\n elo_scores = 400 * lr.coef_[0] + 1000\n # TODO: we could parametrize the reference / anchor model, but left as is to be faithful to 
the\n # original implementation\n if \"gpt-4-0314\" in models.index:\n elo_scores += 1000 - elo_scores[models[\"gpt-4-0314\"]]\n\n output = pd.Series(elo_scores, index=models.index).sort_values(ascending=False)\n self._logger.info(f\"Arena Hard ELO: {output}\")\n\n # Here only so that if follow up steps are connected the inputs are preserved,\n # since this step doesn't modify nor generate new inputs\n yield inputs\n\n\nif __name__ == \"__main__\":\n import json\n\n from distilabel.models import InferenceEndpointsLLM, OpenAILLM\n from distilabel.pipeline import Pipeline\n from distilabel.steps import (\n GroupColumns,\n KeepColumns,\n LoadDataFromHub,\n StepInput,\n step,\n )\n from distilabel.steps.tasks import TextGeneration\n from distilabel.steps.typing import StepOutput\n\n @step(inputs=[\"turns\"], outputs=[\"system_prompt\", \"instruction\"])\n def PrepareForTextGeneration(*inputs: StepInput) -> StepOutput:\n for input in inputs:\n for item in input:\n item[\"system_prompt\"] = \"You are a helpful assistant.\"\n item[\"instruction\"] = item[\"turns\"][0][\"content\"]\n yield input\n\n @step(\n inputs=[\"question_id\"],\n outputs=[\"generation\", \"generation_model\"],\n step_type=\"global\",\n )\n def LoadReference(*inputs: StepInput) -> StepOutput:\n # File downloaded from https://raw.githubusercontent.com/lm-sys/arena-hard-auto/e0a8ea1df42c1df76451a6cd04b14e31ff992b87/data/arena-hard-v0.1/model_answer/gpt-4-0314.jsonl\n lines = open(\"gpt-4-0314.jsonl\", mode=\"r\").readlines()\n for input in inputs:\n for item in input:\n for line in lines:\n data = json.loads(line)\n if data[\"question_id\"] == item[\"question_id\"]:\n item[\"generation\"] = data[\"choices\"][0][\"turns\"][0][\"content\"]\n item[\"generation_model\"] = data[\"model_id\"]\n break\n yield input\n\n with Pipeline(name=\"arena-hard-v0.1\") as pipeline:\n load_dataset = LoadDataFromHub(\n name=\"load_dataset\",\n repo_id=\"alvarobartt/lmsys-arena-hard-v0.1\",\n split=\"test\",\n num_examples=5,\n )\n\n load_reference = LoadReference(name=\"load_reference\")\n\n prepare = PrepareForTextGeneration(name=\"prepare\")\n\n text_generation_cohere = TextGeneration(\n name=\"text_generation_cohere\",\n llm=InferenceEndpointsLLM(\n model_id=\"CohereForAI/c4ai-command-r-plus\",\n tokenizer_id=\"CohereForAI/c4ai-command-r-plus\",\n ),\n use_system_prompt=True,\n input_batch_size=10,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n\n combine_columns = GroupColumns(\n name=\"combine_columns\",\n columns=[\"generation\", \"generation_model\"],\n output_columns=[\"generations\", \"generation_models\"],\n )\n\n arena_hard = ArenaHard(\n name=\"arena_hard\",\n llm=OpenAILLM(model=\"gpt-4-1106-preview\"),\n output_mappings={\"model_name\": \"evaluation_model\"},\n )\n\n keep_columns = KeepColumns(\n name=\"keep_columns\",\n columns=[\n \"question_id\",\n \"category\",\n \"cluster\",\n \"system_prompt\",\n \"instruction\",\n \"generations\",\n \"generation_models\",\n \"evaluation\",\n \"score\",\n \"evaluation_model\",\n ],\n )\n\n win_rates = ArenaHardResults(\n name=\"win_rates\", custom_model_column=\"generation_models\"\n )\n\n load_dataset >> load_reference # type: ignore\n load_dataset >> prepare >> text_generation_cohere # type: ignore\n ( # type: ignore\n [load_reference, text_generation_cohere]\n >> combine_columns\n >> arena_hard\n >> keep_columns\n >> win_rates\n )\n\n distiset = pipeline.run(\n parameters={ # type: ignore\n text_generation_cohere.name: {\n \"llm\": {\n \"generation_kwargs\": {\n 
\"temperature\": 0.7,\n \"max_new_tokens\": 4096,\n \"stop_sequences\": [\"<EOS_TOKEN>\", \"<|END_OF_TURN_TOKEN|>\"],\n }\n }\n },\n arena_hard.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.0,\n \"max_new_tokens\": 4096,\n }\n }\n },\n },\n )\n if distiset is not None:\n distiset.push_to_hub(\"arena-hard-results\")\n
"},{"location":"sections/pipeline_samples/examples/exam_questions/","title":"Create exam questions using structured generation","text":"This example will showcase how to generate exams questions and answers from a text page. In this case, we will use a wikipedia page as an example, and show how to leverage the prompt to help the model generate the data in the appropriate format.
We are going to use a meta-llama/Meta-Llama-3.1-8B-Instruct
to generate questions and answers for a mock exam from a Wikipedia page. In this case, we are going to use the Transfer Learning entry. With the help of structured generation we will guide the model to create structured data for us that is easy to parse. The structure will be question, answer, and distractors (wrong answers).
Example page Transfer_learning:
QA of the page:
{\n \"exam\": [\n {\n \"answer\": \"A technique in machine learning where knowledge learned from a task is re-used to boost performance on a related task.\",\n \"distractors\": [\"A type of neural network architecture\", \"A machine learning algorithm for image classification\", \"A method for data preprocessing\"],\n \"question\": \"What is transfer learning?\"\n },\n {\n \"answer\": \"1976\",\n \"distractors\": [\"1981\", \"1992\", \"1998\"],\n \"question\": \"In which year did Bozinovski and Fulgosi publish a paper addressing transfer learning in neural network training?\"\n },\n {\n \"answer\": \"Discriminability-based transfer (DBT) algorithm\",\n \"distractors\": [\"Multi-task learning\", \"Learning to Learn\", \"Cost-sensitive machine learning\"],\n \"question\": \"What algorithm was formulated by Lorien Pratt in 1992?\"\n },\n {\n \"answer\": \"A domain consists of a feature space and a marginal probability distribution.\",\n \"distractors\": [\"A domain consists of a label space and an objective predictive function.\", \"A domain consists of a task and a learning algorithm.\", \"A domain consists of a dataset and a model.\"],\n \"question\": \"What is the definition of a domain in the context of transfer learning?\"\n },\n {\n \"answer\": \"Transfer learning aims to help improve the learning of the target predictive function in the target domain using the knowledge in the source domain and learning task.\",\n \"distractors\": [\"Transfer learning aims to learn a new task from scratch.\", \"Transfer learning aims to improve the learning of the source predictive function in the source domain.\", \"Transfer learning aims to improve the learning of the target predictive function in the source domain.\"],\n \"question\": \"What is the goal of transfer learning?\"\n },\n {\n \"answer\": \"Markov logic networks, Bayesian networks, cancer subtype discovery, building utilization, general game playing, text classification, digit recognition, medical imaging, and spam filtering.\",\n \"distractors\": [\"Supervised learning, unsupervised learning, reinforcement learning, natural language processing, computer vision, and robotics.\", \"Image classification, object detection, segmentation, and tracking.\", \"Speech recognition, sentiment analysis, and topic modeling.\"],\n \"question\": \"What are some applications of transfer learning?\"\n },\n {\n \"answer\": \"ADAPT (Python), TLib (Python), Domain-Adaptation-Toolbox (Matlab)\",\n \"distractors\": [\"TensorFlow, PyTorch, Keras\", \"Scikit-learn, OpenCV, NumPy\", \"Matlab, R, Julia\"],\n \"question\": \"What are some software implementations of transfer learning and domain adaptation algorithms?\"\n }\n ]\n}\n
"},{"location":"sections/pipeline_samples/examples/exam_questions/#build-the-pipeline","title":"Build the pipeline","text":"Let's see how to build a pipeline to obtain this type of data:
from typing import List\nfrom pathlib import Path\n\nfrom pydantic import BaseModel, Field\n\nfrom distilabel.llms import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration\n\nimport wikipedia\n\npage = wikipedia.page(title=\"Transfer_learning\") # (1)\n\n\nclass ExamQuestion(BaseModel):\n question: str = Field(..., description=\"The question to be answered\")\n answer: str = Field(..., description=\"The correct answer to the question\")\n distractors: List[str] = Field(\n ..., description=\"A list of incorrect but viable answers to the question\"\n )\n\nclass ExamQuestions(BaseModel): # (2)\n exam: List[ExamQuestion]\n\n\nSYSTEM_PROMPT = \"\"\"\\\nYou are an exam writer specialized in writing exams for students.\nYour goal is to create questions and answers based on the document provided, and a list of distractors, that are incorrect but viable answers to the question.\nYour answer must adhere to the following format:\n```\n[\n {\n \"question\": \"Your question\",\n \"answer\": \"The correct answer to the question\",\n \"distractors\": [\"wrong answer 1\", \"wrong answer 2\", \"wrong answer 3\"]\n },\n ... (more questions and answers as required)\n]\n```\n\"\"\".strip() #\u00a0(3)\n\n\nwith Pipeline(name=\"ExamGenerator\") as pipeline:\n\n load_dataset = LoadDataFromDicts(\n name=\"load_instructions\",\n data=[\n {\n \"page\": page.content, #\u00a0(4)\n }\n ],\n )\n\n text_generation = TextGeneration( # (5)\n name=\"exam_generation\",\n system_prompt=SYSTEM_PROMPT,\n template=\"Generate a list of answers and questions about the document. Document:\\n\\n{{ page }}\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n structured_output={\n \"schema\": ExamQuestions.model_json_schema(),\n \"format\": \"json\"\n },\n ),\n input_batch_size=8,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n\n load_dataset >> text_generation # (6)\n
Download a single page for the demo. We could download the pages first, or apply the same procedure to any other type of data we want. In a real-world use case, we would want to build a dataset from these documents first.
Define the structure required for the answer using Pydantic. In this case, for each page we want a list of questions and answers (we've also added distractors, though they can be ignored for this case). So our output will be an ExamQuestions
model, which is a list of ExamQuestion
, where each one consists of the question
and answer
fields, both defined as strings. The language model will use the field descriptions to generate the values.
Use the system prompt to guide the model towards the behaviour we want from it. Even though we are forcing the model to produce structured output, it also helps to include the expected format in the prompt.
Move the page content from Wikipedia into a row in the dataset.
The TextGeneration
task gets the system prompt, and the user prompt by means of the template
argument, where we guide the model to generate the questions and answers based on the page content, which will be obtained from the corresponding column of the loaded data.
Connect both steps, and we are done.
To run this example you will first need to install the wikipedia dependency used to download the sample data, i.e. pip install wikipedia
. Change the username first if you want to push the dataset to the Hub using your account.
python examples/exam_questions.py\n
exam_questions.py# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom typing import List\n\nimport wikipedia\nfrom pydantic import BaseModel, Field\n\nfrom distilabel.llms import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration\n\npage = wikipedia.page(title=\"Transfer_learning\")\n\n\nclass ExamQuestion(BaseModel):\n question: str = Field(..., description=\"The question to be answered\")\n answer: str = Field(..., description=\"The correct answer to the question\")\n distractors: List[str] = Field(\n ..., description=\"A list of incorrect but viable answers to the question\"\n )\n\n\nclass ExamQuestions(BaseModel):\n exam: List[ExamQuestion]\n\n\nSYSTEM_PROMPT = \"\"\"\\\nYou are an exam writer specialized in writing exams for students.\nYour goal is to create questions and answers based on the document provided, and a list of distractors, that are incorrect but viable answers to the question.\nYour answer must adhere to the following format:\n```\n[\n {\n \"question\": \"Your question\",\n \"answer\": \"The correct answer to the question\",\n \"distractors\": [\"wrong answer 1\", \"wrong answer 2\", \"wrong answer 3\"]\n },\n ... (more questions and answers as required)\n]\n```\n\"\"\".strip()\n\n\nwith Pipeline(name=\"ExamGenerator\") as pipeline:\n load_dataset = LoadDataFromDicts(\n name=\"load_instructions\",\n data=[\n {\n \"page\": page.content,\n }\n ],\n )\n\n text_generation = TextGeneration(\n name=\"exam_generation\",\n system_prompt=SYSTEM_PROMPT,\n template=\"Generate a list of answers and questions about the document. Document:\\n\\n{{ page }}\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n structured_output={\n \"schema\": ExamQuestions.model_json_schema(),\n \"format\": \"json\",\n },\n ),\n input_batch_size=8,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n load_dataset >> text_generation\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n text_generation.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 2048,\n }\n }\n }\n },\n use_cache=False,\n )\n distiset.push_to_hub(\"USERNAME/exam_questions\")\n
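Once the pipeline has run, the structured output makes post-processing straightforward: each row's generation should be a JSON string that validates against the ExamQuestions schema. The following is a minimal sketch of parsing the results back into the Pydantic model; the "default"/"train" indexing and the generation column name are assumptions based on the pipeline above:
from pydantic import ValidationError

# Assumes `distiset` is the object returned by `pipeline.run(...)` above.
parsed_exams = []
for generation in distiset["default"]["train"]["generation"]:
    try:
        parsed_exams.append(ExamQuestions.model_validate_json(generation))
    except ValidationError as error:
        # Rows that fail validation can simply be logged and dropped.
        print(f"Skipping an invalid generation: {error}")

print(parsed_exams[0].exam[0].question)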
"},{"location":"sections/pipeline_samples/examples/fine_personas_social_network/","title":"Create a social network with FinePersonas","text":"In this example, we'll explore the creation of specialized user personas for social network interactions using the FinePersonas-v0.1 dataset from Hugging Face. The final dataset will be ready to fine-tune a chat model with specific traits and characteristics.
"},{"location":"sections/pipeline_samples/examples/fine_personas_social_network/#introduction","title":"Introduction","text":"We'll delve into the process of fine-tuning different LoRA (Low-Rank Adaptation) models to imbue these personas with specific traits and characteristics.
This approach draws inspiration from Michael Sayman's work on SocialAI (visit the profile to see some examples), leveraging FinePersonas-v0.1 to build models that can emulate bots with specific behaviour.
By fine-tuning these adapters, we can potentially create AI personas with distinct characteristics, communication styles, and areas of expertise. The result? AI interactions that feel more natural and tailored to specific contexts or user needs. For those interested in the technical aspects of this approach, we recommend the insightful blog post on Multi-LoRA serving. It provides a clear and comprehensive explanation of the technology behind this innovative method.
Let's jump to the demo.
"},{"location":"sections/pipeline_samples/examples/fine_personas_social_network/#creating-our-socialai-task","title":"Creating our SocialAI Task","text":"Building on the new TextGeneration
, creating custom tasks is easier than ever before. This powerful tool opens up a world of possibilities for creating tailored text-based content with ease and precision. We will create a SocialAI
task that will be in charge of generating responses to user interactions, taking into account a given follower_type
, and using the perspective of a given persona
:
from distilabel.steps.tasks import TextGeneration\n\nclass SocialAI(TextGeneration):\n follower_type: Literal[\"supporter\", \"troll\", \"alarmist\"] = \"supporter\"\n system_prompt: str = (\n \"You are an AI assistant expert at simulating user interactions. \"\n \"You must answer as if you were a '{follower_type}', be concise answer with no more than 200 characters, nothing else.\"\n \"Here are some traits to use for your personality:\\n\\n\"\n \"{traits}\"\n ) #\u00a0(1)\n template: str = \"You are the folowing persona:\\n\\n{{ persona }}\\n\\nWhat would you say to the following?\\n\\n {{ post }}\" # (2)\n columns: str | list[str] = [\"persona\", \"post\"] # (3)\n\n _follower_traits: dict[str, str] = {\n \"supporter\": (\n \"- Encouraging and positive\\n\"\n \"- Tends to prioritize enjoyment and relaxation\\n\"\n \"- Focuses on the present moment and short-term pleasure\\n\"\n \"- Often uses humor and playful language\\n\"\n \"- Wants to help others feel good and have fun\\n\"\n ),\n \"troll\": (\n \"- Provocative and confrontational\\n\"\n \"- Enjoys stirring up controversy and conflict\\n\"\n \"- Often uses sarcasm, irony, and mocking language\\n\"\n \"- Tends to belittle or dismiss others' opinions and feelings\\n\"\n \"- Seeks to get a rise out of others and create drama\\n\"\n ),\n \"alarmist\": (\n \"- Anxious and warning-oriented\\n\"\n \"- Focuses on potential risks and negative consequences\\n\"\n \"- Often uses dramatic or sensational language\\n\"\n \"- Tends to be serious and stern in tone\\n\"\n \"- Seeks to alert others to potential dangers and protect them from harm (even if it's excessive or unwarranted)\\n\"\n ),\n }\n\n def load(self) -> None:\n super().load()\n self.system_prompt = self.system_prompt.format(\n follower_type=self.follower_type,\n traits=self._follower_traits[self.follower_type]\n ) # (4)\n
We have a custom system prompt that will depend on the follower_type
we choose for our model.
The base template or prompt will answer the post
we have, from the point of view of a persona
.
We will need our dataset to have both persona
and post
columns to populate the prompt.
In the load method we place the specific traits for our follower type in the system prompt.
This is an example, so let's keep it short. We will use 3 posts and 3 different types of personas. While there's potential to enhance this process (perhaps by implementing random persona selection or leveraging semantic similarity), we'll opt for a straightforward method in this demonstration.
Our goal is to create a set of nine examples, each pairing a post with a persona. To achieve this, we'll employ an LLM to respond to each post from the perspective of a specific persona
, effectively simulating how different characters might engage with the content.
posts = [\n {\n \"post\": \"Hmm, ok now I'm torn: should I go for healthy chicken tacos or unhealthy beef tacos for late night cravings?\"\n },\n {\n \"post\": \"I need to develop a training course for my company on communication skills. Need to decide how deliver it remotely.\"\n },\n {\n \"post\": \"I'm always 10 minutes late to meetups but no one's complained. Could this be annoying to them?\"\n },\n]\n\npersonas = (\n load_dataset(\"argilla/FinePersonas-v0.1-clustering-100k\", split=\"train\")\n .shuffle()\n .select(range(3))\n .select_columns(\"persona\")\n .to_list()\n)\n\ndata = []\nfor post in posts:\n for persona in personas:\n data.append({\"post\": post[\"post\"], \"persona\": persona[\"persona\"]})\n
Each row will have the following format:
import json\nprint(json.dumps(data[0], indent=4))\n{\n \"post\": \"Hmm, ok now I'm torn: should I go for healthy chicken tacos or unhealthy beef tacos for late night cravings?\",\n \"persona\": \"A high school or college environmental science teacher or an ecology student specializing in biogeography and ecosystem dynamics.\"\n}\n
This will be our dataset, which we can ingest using the LoadDataFromDicts
:
loader = LoadDataFromDicts(data=data)\n
"},{"location":"sections/pipeline_samples/examples/fine_personas_social_network/#simulating-from-different-types-of-followers","title":"Simulating from different types of followers","text":"With our data in hand, we're ready to explore the capabilities of our SocialAI task. For this demonstration, we'll make use of of meta-llama/Meta-Llama-3.1-70B-Instruct
. While this model has become something of a go-to choice recently, it's worth noting that experimenting with a variety of models could yield even more interesting results:
from distilabel.models import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 256,\n },\n)\nfollower_type = \"supporter\"\n\nfollower = SocialAI(\n llm=llm,\n follower_type=follower_type,\n name=f\"{follower_type}_user\",\n)\n
This setup simplifies the process: we only need to input the follower type, and the system handles the rest. We could also update it to use a random type of follower by default, and simulate a bunch of different personalities.
"},{"location":"sections/pipeline_samples/examples/fine_personas_social_network/#building-our-pipeline","title":"Building our Pipeline","text":"The foundation of our pipeline is now in place. At its core is a single, powerful LLM. This versatile model will be repurposed to drive three distinct SocialAI
Tasks, each tailored to a specific TextGeneration
task, and each one of them will be prepared for Supervised Fine Tuning using FormatTextGenerationSFT
:
with Pipeline(name=\"Social AI Personas\") as pipeline:\n loader = LoadDataFromDicts(data=data, batch_size=1)\n\n llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 256,\n },\n )\n\n for follower_type in [\"supporter\", \"troll\", \"alarmist\"]:\n follower = SocialAI(\n llm=llm,\n follower_type=follower_type,\n name=f\"{follower_type}_user\", # (1)\n output_mappings={\n \"generation\": f\"interaction_{follower_type}\" # (2)\n }\n )\n format_sft = FormatTextGenerationSFT(\n name=f\"format_sft_{follower_type}\",\n input_mappings={\n \"instruction\": \"post\",\n \"generation\": f\"interaction_{follower_type}\" # (3)\n },\n )\n loader >> follower >> format_sft # (4)\n
We update the name of the step to keep track of it in the pipeline.
The generation
column from each LLM will be mapped to a different name to avoid it being overridden, as we are reusing the same task.
As we have modified the output column from SocialAI
, we redirect each one of the \"follower_type\" responses.
Connect the loader to each one of the follower tasks and format_sft
to obtain 3 different subsets.
The outcome of this pipeline will be three SFT-formatted subsets, one per follower type
crafted by the SocialAI
task, where each post is paired with its corresponding interaction for that follower type. These subsets can then be used to fine-tune three specialized models, enabling seamless fine-tuning with your preferred framework, such as TRL, or any other training framework of your choice.
All the pieces are in place for our script; the full pipeline can be seen here:
Runpython examples/finepersonas_social_ai.py\n
finepersonas_social_ai.py# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom typing import Literal\n\nfrom datasets import load_dataset\n\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import FormatTextGenerationSFT, LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration\n\n\nclass SocialAI(TextGeneration):\n follower_type: Literal[\"supporter\", \"troll\", \"alarmist\"] = \"supporter\"\n system_prompt: str = (\n \"You are an AI assistant expert at simulating user interactions. \"\n \"You must answer as if you were a '{follower_type}', be concise answer with no more than 200 characters, nothing else.\"\n \"Here are some traits to use for your personality:\\n\\n\"\n \"{traits}\"\n )\n template: str = \"You are the folowing persona:\\n\\n{{ persona }}\\n\\nWhat would you say to the following?\\n\\n {{ post }}\"\n columns: str | list[str] = [\"persona\", \"post\"]\n\n _follower_traits: dict[str, str] = {\n \"supporter\": (\n \"- Encouraging and positive\\n\"\n \"- Tends to prioritize enjoyment and relaxation\\n\"\n \"- Focuses on the present moment and short-term pleasure\\n\"\n \"- Often uses humor and playful language\\n\"\n \"- Wants to help others feel good and have fun\\n\"\n ),\n \"troll\": (\n \"- Provocative and confrontational\\n\"\n \"- Enjoys stirring up controversy and conflict\\n\"\n \"- Often uses sarcasm, irony, and mocking language\\n\"\n \"- Tends to belittle or dismiss others' opinions and feelings\\n\"\n \"- Seeks to get a rise out of others and create drama\\n\"\n ),\n \"alarmist\": (\n \"- Anxious and warning-oriented\\n\"\n \"- Focuses on potential risks and negative consequences\\n\"\n \"- Often uses dramatic or sensational language\\n\"\n \"- Tends to be serious and stern in tone\\n\"\n \"- Seeks to alert others to potential dangers and protect them from harm (even if it's excessive or unwarranted)\\n\"\n ),\n }\n\n def load(self) -> None:\n super().load()\n self.system_prompt = self.system_prompt.format(\n follower_type=self.follower_type,\n traits=self._follower_traits[self.follower_type],\n )\n\n\nposts = [\n {\n \"post\": \"Hmm, ok now I'm torn: should I go for healthy chicken tacos or unhealthy beef tacos for late night cravings?\"\n },\n {\n \"post\": \"I need to develop a training course for my company on communication skills. Need to decide how deliver it remotely.\"\n },\n {\n \"post\": \"I'm always 10 minutes late to meetups but no one's complained. 
Could this be annoying to them?\"\n },\n]\n\npersonas = (\n load_dataset(\"argilla/FinePersonas-v0.1-clustering-100k\", split=\"train\")\n .shuffle()\n .select(range(3))\n .select_columns(\"persona\")\n .to_list()\n)\n\ndata = []\nfor post in posts:\n for persona in personas:\n data.append({\"post\": post[\"post\"], \"persona\": persona[\"persona\"]})\n\n\nwith Pipeline(name=\"Social AI Personas\") as pipeline:\n loader = LoadDataFromDicts(data=data, batch_size=1)\n\n llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 256,\n },\n )\n\n for follower_type in [\"supporter\", \"troll\", \"alarmist\"]:\n follower = SocialAI(\n llm=llm,\n follower_type=follower_type,\n name=f\"{follower_type}_user\",\n output_mappings={\"generation\": f\"interaction_{follower_type}\"},\n )\n format_sft = FormatTextGenerationSFT(\n name=f\"format_sft_{follower_type}\",\n input_mappings={\n \"instruction\": \"post\",\n \"generation\": f\"interaction_{follower_type}\",\n },\n )\n loader >> follower >> format_sft\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False)\n distiset.push_to_hub(\"plaguss/FinePersonas-SocialAI-test\", include_script=True)\n
This is the final toy dataset we obtain: FinePersonas-SocialAI-test
You can see an example of how to load each of these subsets to fine-tune a model:
from datasets import load_dataset\n\nds = load_dataset(\"plaguss/FinePersonas-SocialAI-test\", \"format_sft_troll\")\n
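Building on that, here is a hedged sketch of how one of these subsets could be passed to TRL's SFTTrainer. It assumes a recent TRL version that accepts conversational datasets, and that the subset keeps the messages column produced by FormatTextGenerationSFT; the base model id and output directory are placeholders:
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Load one of the SFT-formatted subsets generated by the pipeline.
ds = load_dataset("plaguss/FinePersonas-SocialAI-test", "format_sft_troll", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder base model
    train_dataset=ds,
    args=SFTConfig(output_dir="socialai-troll-sft"),
)
trainer.train()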
And a sample of the generated field with the corresponding post
and persona
:
{\n \"post\": \"Hmm, ok now I\\u0027m torn: should I go for healthy chicken tacos or unhealthy beef tacos for late night cravings?\",\n \"persona\": \"A high school or undergraduate physics or chemistry teacher, likely with a focus on experimental instruction.\",\n \"interaction_troll\": \"\\\"Late night cravings? More like late night brain drain. Either way, it\\u0027s just a collision of molecules in your stomach. Choose the one with more calories, at least that\\u0027s some decent kinetic energy.\\\"\",\n}\n
There's a lot of room for improvement, but quite a promising start.
"},{"location":"sections/pipeline_samples/examples/llama_cpp_with_outlines/","title":"Structured generation withoutlines
","text":"Generate RPG characters following a pydantic.BaseModel
with outlines
in distilabel
.
This script makes use of LlamaCppLLM
and the structured output capabilities thanks to outlines
to generate RPG characters that adhere to a JSON schema.
It makes use of a local model which can be downloaded using curl (explained in the script itself), and can be exchanged with other LLMs
like vLLM
.
python examples/structured_generation_with_outlines.py\n
structured_generation_with_outlines.py# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom enum import Enum\nfrom pathlib import Path\n\nfrom pydantic import BaseModel, StringConstraints, conint\nfrom typing_extensions import Annotated\n\nfrom distilabel.models import LlamaCppLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration\n\n\nclass Weapon(str, Enum):\n sword = \"sword\"\n axe = \"axe\"\n mace = \"mace\"\n spear = \"spear\"\n bow = \"bow\"\n crossbow = \"crossbow\"\n\n\nclass Armor(str, Enum):\n leather = \"leather\"\n chainmail = \"chainmail\"\n plate = \"plate\"\n mithril = \"mithril\"\n\n\nclass Character(BaseModel):\n name: Annotated[str, StringConstraints(max_length=30)]\n age: conint(gt=1, lt=3000)\n armor: Armor\n weapon: Weapon\n\n\n# Download the model with\n# curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf\n\nmodel_path = \"Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf\"\n\nwith Pipeline(\"RPG-characters\") as pipeline:\n system_prompt = (\n \"You are a leading role play gamer. You have seen thousands of different characters and their attributes.\"\n \" Please return a JSON object with common attributes of an RPG character.\"\n )\n\n load_dataset = LoadDataFromDicts(\n name=\"load_instructions\",\n data=[\n {\n \"system_prompt\": system_prompt,\n \"instruction\": f\"Give me a character description for a {char}\",\n }\n for char in [\"dwarf\", \"elf\", \"human\", \"ork\"]\n ],\n )\n llm = LlamaCppLLM(\n model_path=str(Path.home() / model_path), # type: ignore\n n_gpu_layers=-1,\n n_ctx=1024,\n structured_output={\"format\": \"json\", \"schema\": Character},\n )\n # Change to vLLM as such:\n # llm = vLLM(\n # model=\"teknium/OpenHermes-2.5-Mistral-7B\",\n # extra_kwargs={\"tensor_parallel_size\": 1},\n # structured_output={\"format\": \"json\", \"schema\": Character},\n # )\n\n text_generation = TextGeneration(\n name=\"text_generation_rpg\",\n llm=llm,\n input_batch_size=8,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n load_dataset >> text_generation\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n text_generation.name: {\n \"llm\": {\"generation_kwargs\": {\"max_new_tokens\": 256}}\n }\n },\n use_cache=False,\n )\n for num, character in enumerate(distiset[\"default\"][\"train\"][\"generation\"]):\n print(f\"Character: {num}\")\n print(character)\n\n# Character: 0\n# {\n# \"name\": \"Gimli\",\n# \"age\": 42,\n# \"armor\": \"plate\",\n# \"weapon\": \"axe\" }\n# Character: 1\n# {\"name\":\"Gaelen\",\"age\":600,\"armor\":\"leather\",\"weapon\":\"bow\"}\n# Character: 2\n# {\"name\": \"John Smith\",\"age\": 35,\"armor\": \"leather\",\"weapon\": \"sword\"}\n# Character: 3\n# { \"name\": \"Grug\", \"age\": 35, \"armor\": \"leather\", \"weapon\": \"axe\"}\n
"},{"location":"sections/pipeline_samples/examples/mistralai_with_instructor/","title":"Structured generation with instructor
","text":"Answer instructions with knowledge graphs defined as pydantic.BaseModel
objects using instructor
in distilabel
.
This script makes use of MistralLLM
and the structured output capabilities thanks to instructor
to generate knowledge graphs from complex topics.
This example is translated from this awesome example from instructor
cookbook.
python examples/structured_generation_with_instructor.py\n
structured_generation_with_instructor.py# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom typing import List\n\nfrom pydantic import BaseModel, Field\n\nfrom distilabel.models import MistralLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration\n\n\nclass Node(BaseModel):\n id: int\n label: str\n color: str\n\n\nclass Edge(BaseModel):\n source: int\n target: int\n label: str\n color: str = \"black\"\n\n\nclass KnowledgeGraph(BaseModel):\n nodes: List[Node] = Field(..., default_factory=list)\n edges: List[Edge] = Field(..., default_factory=list)\n\n\nwith Pipeline(\n name=\"Knowledge-Graphs\",\n description=(\n \"Generate knowledge graphs to answer questions, this type of dataset can be used to \"\n \"steer a model to answer questions with a knowledge graph.\"\n ),\n) as pipeline:\n sample_questions = [\n \"Teach me about quantum mechanics\",\n \"Who is who in The Simpsons family?\",\n \"Tell me about the evolution of programming languages\",\n ]\n\n load_dataset = LoadDataFromDicts(\n name=\"load_instructions\",\n data=[\n {\n \"system_prompt\": \"You are a knowledge graph expert generator. Help me understand by describing everything as a detailed knowledge graph.\",\n \"instruction\": f\"{question}\",\n }\n for question in sample_questions\n ],\n )\n\n text_generation = TextGeneration(\n name=\"knowledge_graph_generation\",\n llm=MistralLLM(\n model=\"open-mixtral-8x22b\", structured_output={\"schema\": KnowledgeGraph}\n ),\n input_batch_size=8,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n load_dataset >> text_generation\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n text_generation.name: {\n \"llm\": {\"generation_kwargs\": {\"max_new_tokens\": 2048}}\n }\n },\n use_cache=False,\n )\n\n distiset.push_to_hub(\"distilabel-internal-testing/knowledge_graphs\")\n
Visualizing the graphs Want to see how to visualize the graphs? You can test it using the following script. Generate some samples on your own and take a look:
Note
This example uses graphviz to render the graph; you can install it with pip
in the following way:
pip install graphviz\n
python examples/draw_kg.py 2 # You can pass 0,1,2 to visualize each of the samples.\n
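For reference, here is a minimal sketch of how a generated KnowledgeGraph could be rendered with graphviz. It is an assumption of what such a helper might look like rather than the actual contents of draw_kg.py:
from graphviz import Digraph

def visualize_knowledge_graph(kg: KnowledgeGraph) -> None:
    # Build a directed graph from the Pydantic model defined in the script above.
    dot = Digraph(comment="Knowledge Graph")
    for node in kg.nodes:
        dot.node(str(node.id), node.label, color=node.color)
    for edge in kg.edges:
        dot.edge(str(edge.source), str(edge.target), label=edge.label, color=edge.color)
    dot.render("knowledge_graph.gv", view=True)

# Parse one of the generations back into the model and render it.
kg = KnowledgeGraph.model_validate_json(distiset["default"]["train"]["generation"][0])
visualize_knowledge_graph(kg)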
"},{"location":"sections/pipeline_samples/examples/text_generation_with_image/","title":"Text generation with images in distilabel
","text":"Answer questions about images using distilabel
.
Image-text-to-text models take in an image and a text prompt and output text. In this example we will use InferenceEndpointsLLM
with meta-llama/Llama-3.2-11B-Vision-Instruct to ask a question about an image, and OpenAILLM
with gpt-4o-mini
. We will ask a simple question to showcase how the TextGenerationWithImage
task can be used in a pipeline.
from distilabel.models.llms import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks.text_generation_with_image import TextGenerationWithImage\nfrom distilabel.steps import LoadDataFromDicts\n\n\nwith Pipeline(name=\"vision_generation_pipeline\") as pipeline:\n loader = LoadDataFromDicts(\n data=[\n {\n \"instruction\": \"What\u2019s in this image?\",\n \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n }\n ],\n )\n\n llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n )\n\n vision = TextGenerationWithImage(\n name=\"vision_gen\",\n llm=llm,\n image_type=\"url\" # (1)\n )\n\n loader >> vision\n
TextGenerationWithImage
for more information. Image:
Question:
What\u2019s in this image?
Response:
This image depicts a wooden boardwalk weaving its way through a lush meadow, flanked by vibrant green grass that stretches towards the horizon under a calm and inviting sky. The boardwalk runs straight ahead, away from the viewer, forming a clear pathway through the tall, lush green grass, crops or other plant types or an assortment of small trees and shrubs. This meadow is dotted with trees and shrubs, appearing to be healthy and green. The sky above is a beautiful blue with white clouds scattered throughout, adding a sense of tranquility to the scene. While this image appears to be of a natural landscape, because grass is...
from distilabel.models.llms import OpenAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks.text_generation_with_image import TextGenerationWithImage\nfrom distilabel.steps import LoadDataFromDicts\n\n\nwith Pipeline(name=\"vision_generation_pipeline\") as pipeline:\n loader = LoadDataFromDicts(\n data=[\n {\n \"instruction\": \"What\u2019s in this image?\",\n \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n }\n ],\n )\n\n llm = OpenAILLM(\n model=\"gpt-4o-mini\",\n )\n\n vision = TextGenerationWithImage(\n name=\"vision_gen\",\n llm=llm,\n image_type=\"url\" # (1)\n )\n\n loader >> vision\n
VisionGeneration
for more information. Image:
Question:
What\u2019s in this image?
Response:
The image depicts a scenic landscape featuring a wooden walkway or path that runs through a lush green marsh or field. The area is surrounded by tall grass and various shrubs, with trees likely visible in the background. The sky is blue with some wispy clouds, suggesting a beautiful day. Overall, it presents a peaceful natural setting, ideal for a stroll or nature observation.
The full pipeline can be run as shown in the following example:
Run the full pipelinepython examples/text_generation_with_image.py\n
text_generation_with_image.py# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom distilabel.models.llms import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks.text_generation_with_image import TextGenerationWithImage\n\nwith Pipeline(name=\"vision_generation_pipeline\") as pipeline:\n loader = LoadDataFromDicts(\n data=[\n {\n \"instruction\": \"What\u2019s in this image?\",\n \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\",\n }\n ],\n )\n\n llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n )\n\n vision = TextGenerationWithImage(name=\"vision_gen\", llm=llm, image_type=\"url\")\n\n loader >> vision\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False)\n distiset.push_to_hub(\"plaguss/test-vision-generation-Llama-3.2-11B-Vision-Instruct\")\n
A sample dataset can be seen at plaguss/test-vision-generation-Llama-3.2-11B-Vision-Instruct.
"},{"location":"sections/pipeline_samples/papers/apigen/","title":"Create Function-Calling datasets with APIGen","text":"This example will introduce APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets, a data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications.
"},{"location":"sections/pipeline_samples/papers/apigen/#replication","title":"Replication","text":"The following figure showcases the APIGen framework:
Now, let's walk through the key steps illustrated in the figure:
DataSampler
: With the help of this step and the original Salesforce/xlam-function-calling-60k we are getting the Seed QA Data Sampler for the prompt template.
APIGenGenerator
: This step does the job of the Query-Answer Generator, including the format checker from Stage 1: Format Checker thanks to the structured output generation.
APIGenExecutionChecker
: This step is in charge of the Stage 2: Execution Checker.
APIGenSemanticChecker
: Step in charge of running Stage 3: Semantic Checker, can use the same or a different LLM, we are using the same as in APIGenGenerator
step.
The current implementation hasn't utilized the Diverse Prompt Library. To incorporate it, one could either adjust the prompt template within the APIGenGenerator
or develop a new sampler specifically for this purpose. As for the API Sampler, while no specific data is shared here, we've created illustrative examples to demonstrate the pipeline's functionality. These examples represent a mix of data that could be used to replicate the sampler's output.
The original paper describes the data they used and gives some hints, but the data itself was not shared. In this example, we will write a handful of examples by hand to showcase how this pipeline can be built.
Assume we have the following function names and corresponding descriptions of their behaviour:
data = [\n {\n \"func_name\": \"final_velocity\",\n \"func_desc\": \"Calculates the final velocity of an object given its initial velocity, acceleration, and time.\",\n },\n {\n \"func_name\": \"permutation_count\",\n \"func_desc\": \"Calculates the number of permutations of k elements from a set of n elements.\",\n },\n {\n \"func_name\": \"getdivision\",\n \"func_desc\": \"Divides two numbers by making an API call to a division service.\",\n },\n {\n \"func_name\": \"binary_addition\",\n \"func_desc\": \"Adds two binary numbers and returns the result as a binary string.\",\n },\n {\n \"func_name\": \"swapi_planet_resource\",\n \"func_desc\": \"get a specific planets resource\",\n },\n {\n \"func_name\": \"disney_character\",\n \"func_desc\": \"Find a specific character using this endpoint\",\n }\n]\n
The original paper refers to both Python functions and APIs, but we will make use of Python functions exclusively for simplicity. In order to execute and check these functions/APIs, we need access to the code, which we have moved to a Python file: lib_apigen.py. All these functions are executable, but we also need access to their tool representation. For this, we will make use of transformers' get_json_schema function1.
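To make the tool representation concrete, here is a minimal sketch of what get_json_schema produces for one of these functions. The final_velocity implementation below is an assumption that mirrors its description, not the exact code from lib_apigen.py:
from transformers.utils import get_json_schema

def final_velocity(initial_velocity: float, acceleration: float, time: float) -> float:
    """Calculates the final velocity of an object given its initial velocity, acceleration, and time.

    Args:
        initial_velocity: The initial velocity of the object.
        acceleration: The acceleration of the object.
        time: The time elapsed.
    """
    return initial_velocity + acceleration * time

# Returns a dict shaped like {"type": "function", "function": {"name": ..., "description": ..., "parameters": {...}}},
# matching the entries stored in the `tools` column shown in the example row below.
print(get_json_schema(final_velocity))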
We have all the machinery prepared in our libpath, except for the tool definition. With the help of our helper function load_module_from_path
we will load this Python module, collect all the tools, and add them to each row in our data
variable.
from distilabel.steps.tasks.apigen.utils import load_module_from_path\n\nlibpath_module = load_module_from_path(libpath)\ntools = getattr(libpath_module, \"get_tools\")() # call get_tools()\n\nfor row in data:\n #\u00a0The tools should have a mix where both the correct and irrelevant tools are present.\n row.update({\"tools\": [tools[row[\"func_name\"]]]})\n
Now we have all the necessary data for our prompt. Additionally, we will make use of the original dataset as few-shot examples to enhance the model:
ds_og = (\n load_dataset(\"Salesforce/xlam-function-calling-60k\", split=\"train\")\n .shuffle(seed=42)\n .select(range(500))\n .to_list()\n)\n
We have just loaded a subset and transformed it to a list of dictionaries, as we will use it in the DataSampler
GeneratorStep
, grabbing random examples from the original dataset.
Now that we've walked through each component, it's time to see how it all comes together; here's the Pipeline code:
with Pipeline(name=\"apigen-example\") as pipeline:\n loader_seeds = LoadDataFromDicts(data=data) # (1)\n\n sampler = DataSampler( # (2)\n data=ds_og,\n size=2,\n samples=len(data),\n batch_size=8,\n )\n\n prep_examples = PrepareExamples() # This step will add the 'examples' column\n\n combine_steps = CombineOutputs() # (3)\n\n model_id = \"meta-llama/Meta-Llama-3.1-70B-Instruct\"\n llm=InferenceEndpointsLLM( # (4)\n model_id=model_id,\n tokenizer_id=model_id,\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 2048,\n },\n )\n apigen = APIGenGenerator( # (5)\n llm=llm,\n use_default_structured_output=True,\n )\n\n execution_checker = APIGenExecutionChecker(libpath=str(libpath)) # (6)\n semantic_checker = APIGenSemanticChecker(llm=llm) # (7)\n\n sampler >> prep_examples\n (\n [loader_seeds, prep_examples] \n >> combine_steps \n >> apigen\n >> execution_checker\n >> semantic_checker\n )\n
Load the data seeds we are going to use to generate our function calling dataset.
The DataSampler
together with PrepareExamples
will be used to help us create the few-shot examples from the original dataset to be fed in our prompt.
Combine both columns to obtain a single stream of data
Will reuse the same LLM for the generation and the semantic checks.
Creates the query
and answers
that will be used together with the tools
to fine-tune a new model. Will generate the structured outputs to ensure we have valid JSON formatted answers.
Adds columns keep_row_after_execution_check
and execution_result
.
Adds columns keep_row_after_semantic_check
and thought
.
To see all the pieces in place, take a look at the full pipeline, as well as an example row that would be generated from this pipeline.
Runpython examples/pipeline_apigen.py\n
pipeline_apigen.py# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom pathlib import Path\n\nfrom datasets import load_dataset\n\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import CombineOutputs, DataSampler, LoadDataFromDicts\nfrom distilabel.steps.tasks import (\n APIGenExecutionChecker,\n APIGenGenerator,\n APIGenSemanticChecker,\n)\nfrom distilabel.steps.tasks.apigen.utils import PrepareExamples, load_module_from_path\n\nlibpath = Path(__file__).parent / \"lib_apigen.py\"\n\ndata = [\n {\n \"func_name\": \"final_velocity\",\n \"func_desc\": \"Calculates the final velocity of an object given its initial velocity, acceleration, and time.\",\n },\n {\n \"func_name\": \"permutation_count\",\n \"func_desc\": \"Calculates the number of permutations of k elements from a set of n elements.\",\n },\n {\n \"func_name\": \"getdivision\",\n \"func_desc\": \"Divides two numbers by making an API call to a division service.\",\n },\n {\n \"func_name\": \"binary_addition\",\n \"func_desc\": \"Adds two binary numbers and returns the result as a binary string.\",\n },\n {\n \"func_name\": \"swapi_planet_resource\",\n \"func_desc\": \"get a specific planets resource\",\n },\n {\n \"func_name\": \"disney_character\",\n \"func_desc\": \"Find a specific character using this endpoint\",\n },\n]\n\nlibpath_module = load_module_from_path(libpath)\ntools = libpath_module.get_tools() # call get_tools()\n\n# TODO: Add in the tools between 0 and 2 extra tools to make the task more challenging.\nfor row in data:\n # The tools should have a mix where both the correct and irrelevant tools are present.\n row.update({\"tools\": [tools[row[\"func_name\"]]]})\n\n\nds_og = (\n load_dataset(\"Salesforce/xlam-function-calling-60k\", split=\"train\")\n .shuffle(seed=42)\n .select(range(500))\n .to_list()\n)\n\n\nwith Pipeline(name=\"APIGenPipeline\") as pipeline:\n loader_seeds = LoadDataFromDicts(data=data)\n sampler = DataSampler(\n data=ds_og,\n size=2,\n samples=len(data),\n batch_size=8,\n )\n\n prep_examples = PrepareExamples()\n\n model_id = \"meta-llama/Meta-Llama-3.1-70B-Instruct\"\n llm = InferenceEndpointsLLM(\n model_id=model_id,\n tokenizer_id=model_id,\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 2048,\n },\n )\n apigen = APIGenGenerator(\n llm=llm,\n use_default_structured_output=True,\n )\n combine_steps = CombineOutputs()\n\n execution_checker = APIGenExecutionChecker(libpath=str(libpath))\n semantic_checker = APIGenSemanticChecker(llm=llm)\n\n sampler >> prep_examples\n (\n [loader_seeds, prep_examples]\n >> combine_steps\n >> apigen\n >> execution_checker\n >> semantic_checker\n )\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run()\n print(distiset[\"default\"][\"train\"][0])\n
Example row:
{\n \"func_name\": \"final_velocity\",\n \"func_desc\": \"Calculates the final velocity of an object given its initial velocity, acceleration, and time.\",\n \"tools\": [\n {\n \"function\": {\n \"description\": \"Calculates the final velocity of an object given its initial velocity, acceleration, and time.\",\n \"name\": \"final_velocity\",\n \"parameters\": {\n \"properties\": {\n \"acceleration\": {\n \"description\": \"The acceleration of the object.\",\n \"type\": \"number\"\n },\n \"initial_velocity\": {\n \"description\": \"The initial velocity of the object.\",\n \"type\": \"number\"\n },\n \"time\": {\n \"description\": \"The time elapsed.\",\n \"type\": \"number\"\n }\n },\n \"required\": [\n \"initial_velocity\",\n \"acceleration\",\n \"time\"\n ],\n \"type\": \"object\"\n }\n },\n \"type\": \"function\"\n }\n ],\n \"examples\": \"## Query:\\nRetrieve the first 15 comments for post ID '12345' from the Tokapi mobile API.\\n## Answers:\\n[{\\\"name\\\": \\\"v1_post_post_id_comments\\\", \\\"arguments\\\": {\\\"post_id\\\": \\\"12345\\\", \\\"count\\\": 15}}]\\n\\n## Query:\\nRetrieve the detailed recipe for the cake with ID 'cake101'.\\n## Answers:\\n[{\\\"name\\\": \\\"detailed_cake_recipe_by_id\\\", \\\"arguments\\\": {\\\"is_id\\\": \\\"cake101\\\"}}]\\n\\n## Query:\\nWhat are the frequently asked questions and their answers for Coca-Cola Company? Also, what are the suggested tickers based on Coca-Cola Company?\\n## Answers:\\n[{\\\"name\\\": \\\"symbols_faq\\\", \\\"arguments\\\": {\\\"ticker_slug\\\": \\\"KO\\\"}}, {\\\"name\\\": \\\"symbols_suggested\\\", \\\"arguments\\\": {\\\"ticker_slug\\\": \\\"KO\\\"}}]\",\n \"query\": \"What would be the final velocity of an object that starts at rest and accelerates at 9.8 m/s^2 for 10 seconds.\",\n \"answers\": \"[{\\\"arguments\\\": {\\\"acceleration\\\": \\\"9.8\\\", \\\"initial_velocity\\\": \\\"0\\\", \\\"time\\\": \\\"10\\\"}, \\\"name\\\": \\\"final_velocity\\\"}]\",\n \"distilabel_metadata\": {\n \"raw_input_a_p_i_gen_generator_0\": [\n {\n \"content\": \"You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\\n\\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\\n\\nEnsure the query:\\n- Is clear and concise\\n- Demonstrates typical use cases\\n- Includes all necessary parameters in a meaningful way. 
For numerical parameters, it could be either numbers or words\\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\\n- The corresponding result's parameter types and ranges match with the function's descriptions\\n\\nEnsure the answer:\\n- Is a list of function calls in JSON format\\n- The length of the answer list should be equal to the number of requests in the query\\n- Can solve all the requests in the query effectively\",\n \"role\": \"system\"\n },\n {\n \"content\": \"Here are examples of queries and the corresponding answers for similar functions:\\n## Query:\\nRetrieve the first 15 comments for post ID '12345' from the Tokapi mobile API.\\n## Answers:\\n[{\\\"name\\\": \\\"v1_post_post_id_comments\\\", \\\"arguments\\\": {\\\"post_id\\\": \\\"12345\\\", \\\"count\\\": 15}}]\\n\\n## Query:\\nRetrieve the detailed recipe for the cake with ID 'cake101'.\\n## Answers:\\n[{\\\"name\\\": \\\"detailed_cake_recipe_by_id\\\", \\\"arguments\\\": {\\\"is_id\\\": \\\"cake101\\\"}}]\\n\\n## Query:\\nWhat are the frequently asked questions and their answers for Coca-Cola Company? Also, what are the suggested tickers based on Coca-Cola Company?\\n## Answers:\\n[{\\\"name\\\": \\\"symbols_faq\\\", \\\"arguments\\\": {\\\"ticker_slug\\\": \\\"KO\\\"}}, {\\\"name\\\": \\\"symbols_suggested\\\", \\\"arguments\\\": {\\\"ticker_slug\\\": \\\"KO\\\"}}]\\n\\nNote that the query could be interpreted as a combination of several independent requests.\\n\\nBased on these examples, generate 1 diverse query and answer pairs for the function `final_velocity`.\\nThe detailed function description is the following:\\nCalculates the final velocity of an object given its initial velocity, acceleration, and time.\\n\\nThese are the available tools to help you:\\n[{'type': 'function', 'function': {'name': 'final_velocity', 'description': 'Calculates the final velocity of an object given its initial velocity, acceleration, and time.', 'parameters': {'type': 'object', 'properties': {'initial_velocity': {'type': 'number', 'description': 'The initial velocity of the object.'}, 'acceleration': {'type': 'number', 'description': 'The acceleration of the object.'}, 'time': {'type': 'number', 'description': 'The time elapsed.'}}, 'required': ['initial_velocity', 'acceleration', 'time']}}}]\\n\\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:\\n```json\\n[\\n {\\n \\\"query\\\": \\\"The generated query.\\\",\\n \\\"answers\\\": [\\n {\\n \\\"name\\\": \\\"api_name\\\",\\n \\\"arguments\\\": {\\n \\\"arg_name\\\": \\\"value\\\"\\n ... (more arguments as required)\\n }\\n },\\n ... (more API calls as required)\\n ]\\n }\\n]\\n```\\n\\nNow please generate 1 diverse query and answer pairs following the above format.\",\n \"role\": \"user\"\n }\n ],\n \"raw_input_a_p_i_gen_semantic_checker_0\": [\n {\n \"content\": \"As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user\\u2019s intentions.\\n\\nDo not pass if:\\n1. The function call does not align with the query\\u2019s objective, or the input arguments appear incorrect.\\n2. The function call and arguments are not properly chosen from the available functions.\\n3. The number of function calls does not correspond to the user\\u2019s intentions.\\n4. 
The execution results are irrelevant and do not match the function\\u2019s purpose.\\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\",\n \"role\": \"system\"\n },\n {\n \"content\": \"Given Information:\\n- All Available Functions:\\nCalculates the final velocity of an object given its initial velocity, acceleration, and time.\\n- User Query: What would be the final velocity of an object that starts at rest and accelerates at 9.8 m/s^2 for 10 seconds.\\n- Generated Function Calls: [{\\\"arguments\\\": {\\\"acceleration\\\": \\\"9.8\\\", \\\"initial_velocity\\\": \\\"0\\\", \\\"time\\\": \\\"10\\\"}, \\\"name\\\": \\\"final_velocity\\\"}]\\n- Execution Results: ['9.8']\\n\\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\\n\\nThe main decision factor is wheather the function calls accurately reflect the query's intentions and the function descriptions.\\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\\n\\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\\n```\\n{\\n \\\"thought\\\": \\\"Concisely describe your reasoning here\\\",\\n \\\"passes\\\": \\\"yes\\\" or \\\"no\\\"\\n}\\n```\\n\",\n \"role\": \"user\"\n }\n ],\n \"raw_output_a_p_i_gen_generator_0\": \"{\\\"pairs\\\": [\\n {\\n \\\"answers\\\": [\\n {\\n \\\"arguments\\\": {\\n \\\"acceleration\\\": \\\"9.8\\\",\\n \\\"initial_velocity\\\": \\\"0\\\",\\n \\\"time\\\": \\\"10\\\"\\n },\\n \\\"name\\\": \\\"final_velocity\\\"\\n }\\n ],\\n \\\"query\\\": \\\"What would be the final velocity of an object that starts at rest and accelerates at 9.8 m/s^2 for 10 seconds.\\\"\\n }\\n]}\",\n \"raw_output_a_p_i_gen_semantic_checker_0\": \"{\\n \\\"thought\\\": \\\"\\\",\\n \\\"passes\\\": \\\"yes\\\"\\n}\"\n },\n \"model_name\": \"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n \"keep_row_after_execution_check\": true,\n \"execution_result\": [\n \"9.8\"\n ],\n \"thought\": \"\",\n \"keep_row_after_semantic_check\": true\n}\n
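Once the pipeline has run, the boolean columns added by the execution and semantic checkers can be used to keep only the verified rows. A minimal sketch, where the distiset indexing follows the same pattern used in pipeline_apigen.py above:
ds = distiset["default"]["train"]

# Keep only the rows that passed both the execution and the semantic checks.
ds_verified = ds.filter(
    lambda row: row["keep_row_after_execution_check"] and row["keep_row_after_semantic_check"]
)
print(f"Kept {len(ds_verified)} of {len(ds)} rows")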
Read this nice blog post for more information on tools and the reasoning behind get_json_schema
: Tool Use, Unified.\u00a0\u21a9
\"Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment\" introduces both Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. While APO can be found in TRL, we have implemented a task for CLAIR in distilabel
.
CLAIR is a method for creating preference pairs which minimally revises one output to express a preference, resulting in a more precise learning signal as opposed to conventional methods which use a judge to select a preferred response.
The authors of the original paper shared a collection of datasets from CLAIR and APO, where ContextualAI/ultrafeedback_clair_32k corresponds to the CLAIR implementation.
"},{"location":"sections/pipeline_samples/papers/clair/#replication","title":"Replication","text":"Note
The section is named Replication
but in this case we are showing how to use the CLAIR
task to create revisions for your generations using distilabel
.
To showcase CLAIR we will be using the CLAIR
task implemented in distilabel
and we are reusing a small sample of the dataset already generated by ContextualAI, ContextualAI/ultrafeedback_clair_32k
for testing.
To reproduce the code below, one will need to install distilabel
as follows:
pip install \"distilabel>=1.4.0\"\n
Depending on the LLM provider you want to use, the requirements may vary, so take a look at the corresponding dependencies. For this example we are using the free inference endpoints from Hugging Face, but that won't be enough for a bigger dataset.
"},{"location":"sections/pipeline_samples/papers/clair/#building-blocks","title":"Building blocks","text":"In this case where we already have instructions and their generations, we will just need to load the data and the corresponding CLAIR task for the revisions:
CLAIR
to generate the revisions.Let's see the full pipeline applied to ContextualAI/ultrafeedback_clair_32k
in distilabel
:
from typing import Any, Dict\n\nfrom datasets import load_dataset\n\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import CLAIR\nfrom distilabel.models import InferenceEndpointsLLM\n\n\ndef transform_ultrafeedback(example: Dict[str, Any]) -> Dict[str, Any]:\n return {\n \"task\": example[\"prompt\"],\n \"student_solution\": example[\"rejected\"][1][\"content\"],\n }\n\ndataset = (\n load_dataset(\"ContextualAI/ultrafeedback_clair_32k\", split=\"train\")\n .select(range(10)) #\u00a0We collect just 10 examples\n .map(transform_ultrafeedback) # Apply the transformation to get just the text\n)\n\nwith Pipeline(name=\"CLAIR UltraFeedback sample\") as pipeline:\n clair = CLAIR( # (1)\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 4096\n }\n )\n )\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(dataset=dataset) # (2)\n distiset.push_to_hub(repo_id=\"username/clair-test\", include_script=True) # (3)\n
This Pipeline uses only CLAIR because we already have the generations, but one can include a first task to create generations from instructions and then apply the revisions with CLAIR, as sketched below.
Include the dataset directly in the run method for simplicity.
Push the distiset to the hub with the script for reproducibility.
An example dataset can be found at: distilabel-internal-testing/clair-test.
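If you only have instructions, a first generation task can produce the solutions that CLAIR then revises. The following is a minimal sketch of that variant, reusing the same LLM settings as above; the model id and the column mappings are placeholders to adapt to your own data:
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import CLAIR, TextGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n    model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    generation_kwargs={\"temperature\": 0.7, \"max_new_tokens\": 4096},\n)\n\nwith Pipeline(name=\"CLAIR from instructions\") as pipeline:\n    # First draft an answer for each instruction...\n    text_generation = TextGeneration(\n        llm=llm,\n        # CLAIR expects a \"student_solution\" column, so rename the generation\n        output_mappings={\"generation\": \"student_solution\"},\n    )\n    # ...then let CLAIR minimally revise that draft\n    clair = CLAIR(\n        llm=llm,\n        # CLAIR expects a \"task\" column; here the dataset provides \"instruction\"\n        input_mappings={\"task\": \"instruction\"},\n    )\n    text_generation >> clair\n
The pipeline would then be run as before, passing a dataset that contains an instruction column to the run method.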
"},{"location":"sections/pipeline_samples/papers/deepseek_prover/","title":"DeepSeek Prover","text":"\"DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data\" presents an approach to generate mathematical proofs for theorems generated from informal math problems. This approach shows promising results to advance the capabilities of models towards theorem proving using synthetic data. Until this moment the dataset and the model trained on top of it haven't been opened, let's see how the approach works to reproduce the pipeline using distilabel
. The following figure depicts the approach taken to generate the dataset:
The authors propose a method for generating Lean 4 proof data from informal mathematical problems. Their approach translates high-school and undergraduate-level mathematical competition problems into formal statements.
Here we show how to deal with steps 1 and 2. In addition, the authors verify the generated proofs with the lean4 program and iterate over a series of rounds: fine-tuning a model on the synthetic data (DeepSeek-Prover 7B), regenerating the dataset, and continuing the process until no further improvement is found.
"},{"location":"sections/pipeline_samples/papers/deepseek_prover/#replication","title":"Replication","text":"Note
The section is named Replication
but we will show how we can use distilabel
to create the different steps outlined in the DeepSeek-Prover
approach. We intentionally leave some steps out of the pipeline, but this can easily be extended.
We will define the components needed to generate a dataset like the one depicted in the previous figure (we won't call lean4 or do the fine-tuning, this last step can be done outside of distilabel
). The different blocks would normally include the docstrings we use in the internal steps to showcase how they are implemented, but they are omitted here for brevity.
To reproduce the code below, we need to install distilabel
as follows:
pip install \"distilabel[hf-inference-endpoints]\"\n
We have decided to use InferenceEndpointsLLM
, but any other provider with a strong model could work.
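For instance, swapping the provider to OpenAI only requires changing the LLM object; a minimal sketch, assuming an OPENAI_API_KEY environment variable is set and with the model name as a placeholder:
from distilabel.models import OpenAILLM\n\n# Drop-in replacement for the InferenceEndpointsLLM defined below\nllm = OpenAILLM(model=\"gpt-4\")\n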
There are three components we need to define for this pipeline, corresponding to the different stages in the paper: a task to formalize the original statements, another one to assess the relevance of the theorems, and a final one to generate proofs for the theorems.
Note
We will use the same LLM
for all the tasks, so we will define it once and reuse it across them:
llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n)\n
"},{"location":"sections/pipeline_samples/papers/deepseek_prover/#deepseekproverautoformalization","title":"DeepSeekProverAutoFormalization","text":"This Task
corresponds to the first step in the figure. Given an informal statement, it will formalize it for us in Lean 4
language, meaning it will translate an informal statement, such as one gathered from the internet, into the structured lean4 language.
_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX = r\"```lean4(.*?)```\"\n\ntemplate_deepseek_prover_auto_formalization = \"\"\"\\\nMathematical Problem in Natural Language:\n{{ informal_statement }}\n{%- if few_shot %}\n\nPlease use the following examples to guide you with the answer:\n{%- for example in examples %}\n- {{ example }}\n{%- endfor %}\n{% endif -%}\"\"\"\n\n\nclass DeepSeekProverAutoFormalization(Task):\n examples: Optional[List[str]] = None\n system_prompt: str = \"Translate the problem to Lean 4 (only the core declaration):\\n```lean4\\nformal statement goes here\\n```\"\n _template: Union[Template, None] = PrivateAttr(...)\n _few_shot: bool = PrivateAttr(default=False)\n\n def load(self) -> None:\n super().load()\n self._template = Template(template_deepseek_prover_auto_formalization)\n\n @property\n def inputs(self) -> List[str]:\n return [\"informal_statement\"]\n\n @property\n def outputs(self):\n return [\"formal_statement\", \"model_name\"]\n\n def format_input(self, input: str) -> ChatType: # type: ignore\n return [\n {\n \"role\": \"system\",\n \"content\": self.system_prompt,\n },\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n informal_statement=input[self.inputs[0]],\n few_shot=bool(self.examples),\n examples=self.examples,\n ),\n },\n ]\n\n @override\n def format_output( # type: ignore\n self, output: Union[str, None], input: Dict[str, Any] = None\n ) -> Dict[str, Any]: # type: ignore\n match = re.search(_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX, output, re.DOTALL)\n if match:\n match = match.group(1).strip()\n return {\"formal_statement\": match}\n
Following the paper, the authors found that the model yields better results when given examples in a few-shot setting, so this class accepts a list of examples to help in generating the formalization. Let's see an example of how we can instantiate it:
from textwrap import dedent\n\nexamples = [\n dedent(\"\"\"\n ## Statement in natural language:\n For real numbers k and x:\n If x is equal to (13 - \u221a131) / 4, and\n If the equation 2x\u00b2 - 13x + k = 0 is satisfied,\n Then k must be equal to 19/4.\n ## Formalized:\n theorem mathd_algebra_116 (k x : \u211d) (h\u2080 : x = (13 - Real.sqrt 131) / 4)\n (h\u2081 : 2 * x ^ 2 - 13 * x + k = 0) : k = 19 / 4 :=\"\"\"),\n dedent(\"\"\"\n ## Statement in natural language:\n The greatest common divisor (GCD) of 20 factorial (20!) and 200,000 is equal to 40,000.\n ## Formalized:\n theorem mathd_algebra_116 (k x : \u211d) (h\u2080 : x = (13 - Real.sqrt 131) / 4)\n (h\u2081 : 2 * x ^ 2 - 13 * x + k = 0) : k = 19 / 4 :=\"\"\"),\n dedent(\"\"\"\n ## Statement in natural language:\n Given two integers x and y:\n If y is positive (greater than 0),\n And y is less than x,\n And the equation x + y + xy = 80 is true,\n Then x must be equal to 26.\n ## Formalized:\n theorem mathd_algebra_116 (k x : \u211d) (h\u2080 : x = (13 - Real.sqrt 131) / 4)\n (h\u2081 : 2 * x ^ 2 - 13 * x + k = 0) : k = 19 / 4 :=\"\"\"),\n]\n\nauto_formalization = DeepSeekProverAutoFormalization(\n name=\"auto_formalization\",\n input_batch_size=8,\n llm=llm,\n examples=examples\n)\n
"},{"location":"sections/pipeline_samples/papers/deepseek_prover/#deepseekproverscorer","title":"DeepSeekProverScorer","text":"The next Task
corresponds to the second step, the model scoring and assessment. It uses an LLM as a judge to evaluate the relevance of the theorem and assigns a score so the results can be filtered afterwards.
template_deepseek_prover_scorer = \"\"\"\\\nTo evaluate whether a formal Lean4 statement will be of interest to the community, consider the following criteria:\n\n1. Relevance to Current Research: Does the statement address a problem or concept that is actively being researched in mathematics or related fields? Higher relevance scores indicate greater potential interest.\n2. Complexity and Depth: Is the statement complex enough to challenge existing theories and methodologies, yet deep enough to provide significant insights or advancements? Complexity and depth showcase Lean4's capabilities and attract interest.\n3. Interdisciplinary Potential: Does the statement offer opportunities for interdisciplinary research, connecting mathematics with other fields such as computer science, physics, or biology? Interdisciplinary projects often garner wide interest.\n4. Community Needs and Gaps: Does the statement fill an identified need or gap within the Lean4 community or the broader mathematical community? Addressing these needs directly correlates with interest.\n5. Innovativeness: How innovative is the statement? Does it propose new methods, concepts, or applications? Innovation drives interest and engagement.\n\nCustomize your evaluation for each problem accordingly, assessing it as 'excellent', 'good', 'above average', 'fair' or 'poor'.\n\nYou should respond in the following format for each statement:\n\n'''\nNatural language: (Detailed explanation of the informal statement, including any relevant background information, assumptions, and definitions.)\nAnalysis: (Provide a brief justification for each score, highlighting why the statement scored as it did across the criteria.)\nAssessment: (Based on the criteria, rate the statement as 'excellent', 'good', 'above average', 'fair' or 'poor'. JUST the Assessment.)\n'''\"\"\"\n\nclass DeepSeekProverScorer(Task):\n _template: Union[Template, None] = PrivateAttr(...)\n\n def load(self) -> None:\n super().load()\n self._template = Template(template_deepseek_prover_scorer)\n\n @property\n def inputs(self) -> List[str]:\n return [\"informal_statement\", \"formal_statement\"]\n\n @property\n def outputs(self):\n return [\"natural_language\", \"analysis\", \"assessment\", \"model_name\"]\n\n def format_input(self, input: str) -> ChatType:\n return [\n {\n \"role\": \"system\",\n \"content\": self._template.render(),\n },\n {\n \"role\": \"user\",\n \"content\": f\"## Informal statement:\\n{input[self.inputs[0]]}\\n\\n ## Formal statement:\\n{input[self.inputs[1]]}\",\n },\n ]\n\n @override\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any] = None\n ) -> Dict[str, Any]:\n try:\n result = output.split(\"Natural language:\")[1].strip()\n natural_language, analysis = result.split(\"Analysis:\")\n analysis, assessment = analysis.split(\"Assessment:\")\n natural_language = natural_language.strip()\n analysis = analysis.strip()\n assessment = assessment.strip()\n except Exception:\n natural_language = analysis = assessment = None\n\n return {\n \"natural_language\": natural_language,\n \"analysis\": analysis,\n \"assessment\": assessment\n }\n
"},{"location":"sections/pipeline_samples/papers/deepseek_prover/#deepseekproversolver","title":"DeepSeekProverSolver","text":"The last task is in charge of generating a proof for the theorems generated in the previous steps.
DeepSeekProverSolverclass DeepSeekProverSolver(Task):\n system_prompt: str = (\n \"You are an expert in proving mathematical theorems formalized in lean4 language. \"\n \"Your answers consist just in the proof to the theorem given, and nothing else.\"\n )\n\n @property\n def inputs(self) -> List[str]:\n return [\"formal_statement\"]\n\n @property\n def outputs(self):\n return [\"proof\"]\n\n def format_input(self, input: str) -> ChatType:\n prompt = dedent(\"\"\"\n Give me a proof for the following theorem:\n ```lean4\n {theorem}\n ```\"\"\"\n )\n return [\n {\n \"role\": \"system\",\n \"content\": self.system_prompt,\n },\n {\n \"role\": \"user\",\n \"content\": prompt.format(theorem=input[\"formal_statement\"]),\n },\n ]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any] = None\n ) -> Dict[str, Any]:\n import re\n match = re.search(_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX, output, re.DOTALL)\n if match:\n match = match.group(1).strip()\n return {\"proof\": match}\n
Additionally, the original pipeline defined in the paper includes a step to check the final proofs using the lean 4 program, which we have omitted for simplicity. The fine-tuning can be done completely offline, coming back to the pipeline after each iteration/training run.
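As a rough idea of what that omitted verification could look like, the following sketch shells out to a local Lean 4 toolchain; the lean binary on the PATH, the missing mathlib project setup and the pass/fail criterion are all assumptions to adapt to your environment:
import subprocess\nimport tempfile\nfrom pathlib import Path\n\n\ndef check_proof(formal_statement: str, proof: str, timeout: int = 60) -> bool:\n    # Write the statement followed by its candidate proof to a temporary file and\n    # ask the local lean binary to elaborate it; a zero exit code is taken as success.\n    source = formal_statement + \"\\n\" + proof + \"\\n\"\n    with tempfile.TemporaryDirectory() as tmp:\n        file = Path(tmp) / \"candidate.lean\"\n        file.write_text(source)\n        try:\n            result = subprocess.run(\n                [\"lean\", str(file)], capture_output=True, text=True, timeout=timeout\n            )\n        except subprocess.TimeoutExpired:\n            return False\n    return result.returncode == 0\n
Such a function could then be wrapped in a custom step that drops the rows whose proof does not verify.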
All the docstrings have been removed from the code blocks, but can be seen in the full pipeline.
"},{"location":"sections/pipeline_samples/papers/deepseek_prover/#code","title":"Code","text":"Lets's put the building blocks together to create the final pipeline with distilabel
. For this example we have generated a sample dataset plaguss/informal-mathematical-statements-tiny of informal mathematical statements starting from casey-martin/multilingual-mathematical-autoformalization, but as the paper mentions, we can create formal statements and their corresponding proofs starting from informal ones:
# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport re\nfrom pathlib import Path\nfrom textwrap import dedent\nfrom typing import Any, Dict, List, Optional, Union\n\nfrom jinja2 import Template\nfrom pydantic import PrivateAttr\nfrom typing_extensions import override\n\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromHub\nfrom distilabel.steps.tasks.base import Task\nfrom distilabel.steps.tasks.typing import ChatType\n\n_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX = r\"```lean4(.*?)```\"\n\n\ntemplate_deepseek_prover_auto_formalization = \"\"\"\\\nMathematical Problem in Natural Language:\n{{ informal_statement }}\n{%- if few_shot %}\n\nPlease use the following examples to guide you with the answer:\n{%- for example in examples %}\n- {{ example }}\n{%- endfor %}\n{% endif -%}\"\"\"\n\n\nclass DeepSeekProverAutoFormalization(Task):\n \"\"\"Task to translate a mathematical problem from natural language to Lean 4.\n\n Note:\n A related dataset (MMA from the paper) can be found in Hugging Face:\n [casey-martin/multilingual-mathematical-autoformalization](https://huggingface.co/datasets/casey-martin/multilingual-mathematical-autoformalization).\n\n Input columns:\n - informal_statement (`str`): The statement to be formalized using Lean 4.\n\n Output columns:\n - formal_statement (`str`): The formalized statement using Lean 4, to be analysed.\n\n Categories:\n - generation\n\n References:\n - [`DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data`](https://arxiv.org/abs/2405.14333).\n - [`Lean 4`](https://github.com/leanprover/lean4).\n\n Examples:\n\n Formalize a mathematical problem from natural language to Lean 4:\n\n ```python\n from distilabel.steps.tasks import DeepSeekProverAutoFormalization\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n prover_autoformal = DeepSeekProverAutoFormalization(\n llm=InferenceEndpointsLLM(\n model_id=\"deepseek-ai/deepseek-math-7b-instruct\",\n tokenizer_id=\"deepseek-ai/deepseek-math-7b-instruct\",\n ),\n )\n\n prover_autoformal.load()\n\n result = next(\n prover_autoformal.process(\n [\n {\"informal_statement\": \"If a polynomial g is monic, then the root of g is integral over the ring R.\"},\n ]\n )\n )\n # result\n # [\n # {\n # 'informal_statement': 'If a polynomial g is monic, then the root of g is integral over the ring R.',\n # 'formal_statement': 'theorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):=',\n # 'distilabel_metadata': {\n # 'raw_output_deep_seek_prover_auto_formalization_0': '```lean4\\ntheorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):=\\n```'\n # },\n # 'model_name': 'deepseek-prover'\n # }\n # ]\n ```\n\n Use a few-shot setting to formalize a mathematical problem from natural language to Lean 4:\n\n ```python\n from distilabel.steps.tasks import DeepSeekProverAutoFormalization\n from 
distilabel.models import InferenceEndpointsLLM\n\n # You can gain inspiration from the following examples to create your own few-shot examples:\n # https://github.com/yangky11/miniF2F-lean4/blob/main/MiniF2F/Valid.lean\n # Consider this as a placeholder for your actual LLM.\n prover_autoformal = DeepSeekProverAutoFormalization(\n llm=InferenceEndpointsLLM(\n model_id=\"deepseek-ai/deepseek-math-7b-instruct\",\n tokenizer_id=\"deepseek-ai/deepseek-math-7b-instruct\",\n ),\n examples=[\n \"theorem amc12a_2019_p21 (z : \u2102) (h\u2080 : z = (1 + Complex.I) / Real.sqrt 2) :\\n\\n((\u2211 k : \u2124 in Finset.Icc 1 12, z ^ k ^ 2) * (\u2211 k : \u2124 in Finset.Icc 1 12, 1 / z ^ k ^ 2)) = 36 := by\\n\\nsorry\",\n \"theorem amc12a_2015_p10 (x y : \u2124) (h\u2080 : 0 < y) (h\u2081 : y < x) (h\u2082 : x + y + x * y = 80) : x = 26 := by\\n\\nsorry\"\n ]\n )\n\n prover_autoformal.load()\n\n result = next(\n prover_autoformal.process(\n [\n {\"informal_statement\": \"If a polynomial g is monic, then the root of g is integral over the ring R.\"},\n ]\n )\n )\n # result\n # [\n # {\n # 'informal_statement': 'If a polynomial g is monic, then the root of g is integral over the ring R.',\n # 'formal_statement': 'theorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):=',\n # 'distilabel_metadata': {\n # 'raw_output_deep_seek_prover_auto_formalization_0': '```lean4\\ntheorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):=\\n```'\n # },\n # 'model_name': 'deepseek-prover'\n # }\n # ]\n ```\n \"\"\"\n\n examples: Optional[List[str]] = None\n system_prompt: str = \"Translate the problem to Lean 4 (only the core declaration):\\n```lean4\\nformal statement goes here\\n```\"\n _template: Union[Template, None] = PrivateAttr(...)\n _few_shot: bool = PrivateAttr(default=False)\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n self._template = Template(template_deepseek_prover_auto_formalization)\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task is the `instruction`.\"\"\"\n return [\"informal_statement\"]\n\n @property\n def outputs(self):\n \"\"\"The output for the task is a list of `instructions` containing the generated instructions.\"\"\"\n return [\"formal_statement\", \"model_name\"]\n\n def format_input(self, input: str) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation. And the\n `system_prompt` is added as the first message if it exists.\"\"\"\n return [\n {\n \"role\": \"system\",\n \"content\": self.system_prompt,\n },\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n informal_statement=input[self.inputs[0]],\n few_shot=bool(self.examples),\n examples=self.examples,\n ),\n },\n ]\n\n @override\n def format_output( # type: ignore\n self, output: Union[str, None], input: Dict[str, Any] = None\n ) -> Dict[str, Any]: # type: ignore\n \"\"\"Extracts the formal statement from the Lean 4 output.\"\"\"\n match = re.search(_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX, output, re.DOTALL)\n if match:\n match = match.group(1).strip()\n return {\"formal_statement\": match}\n\n\ntemplate_deepseek_prover_scorer = \"\"\"\\\nTo evaluate whether a formal Lean4 statement will be of interest to the community, consider the following criteria:\n\n1. Relevance to Current Research: Does the statement address a problem or concept that is actively being researched in mathematics or related fields? 
Higher relevance scores indicate greater potential interest.\n2. Complexity and Depth: Is the statement complex enough to challenge existing theories and methodologies, yet deep enough to provide significant insights or advancements? Complexity and depth showcase Lean4's capabilities and attract interest.\n3. Interdisciplinary Potential: Does the statement offer opportunities for interdisciplinary research, connecting mathematics with other fields such as computer science, physics, or biology? Interdisciplinary projects often garner wide interest.\n4. Community Needs and Gaps: Does the statement fill an identified need or gap within the Lean4 community or the broader mathematical community? Addressing these needs directly correlates with interest.\n5. Innovativeness: How innovative is the statement? Does it propose new methods, concepts, or applications? Innovation drives interest and engagement.\n\nCustomize your evaluation for each problem accordingly, assessing it as 'excellent', 'good', 'above average', 'fair' or 'poor'.\n\nYou should respond in the following format for each statement:\n\n'''\nNatural language: (Detailed explanation of the informal statement, including any relevant background information, assumptions, and definitions.)\nAnalysis: (Provide a brief justification for each score, highlighting why the statement scored as it did across the criteria.)\nAssessment: (Based on the criteria, rate the statement as 'excellent', 'good', 'above average', 'fair' or 'poor'. JUST the Assessment.)\n'''\"\"\"\n\n\nclass DeepSeekProverScorer(Task):\n \"\"\"Task to evaluate the quality of a formalized mathematical problem in Lean 4,\n inspired by the DeepSeek-Prover task for scoring.\n\n Note:\n A related dataset (MMA from the paper) can be found in Hugging Face:\n [casey-martin/multilingual-mathematical-autoformalization](https://huggingface.co/datasets/casey-martin/multilingual-mathematical-autoformalization).\n\n Input columns:\n - informal_statement (`str`): The statement to be formalized using Lean 4.\n - formal_statement (`str`): The formalized statement using Lean 4, to be analysed.\n\n Output columns:\n - natural_language (`str`): Explanation for the problem.\n - analysis (`str`): Analysis of the different points defined in the prompt.\n - assessment (`str`): Result of the assessment.\n\n Categories:\n - scorer\n - quality\n - response\n\n References:\n - [`DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data`](https://arxiv.org/abs/2405.14333).\n - [`Lean 4`](https://github.com/leanprover/lean4).\n\n Examples:\n\n Analyse a formal statement in Lean 4:\n\n ```python\n from distilabel.steps.tasks import DeepSeekProverScorer\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n prover_scorer = DeepSeekProverAutoFormalization(\n llm=InferenceEndpointsLLM(\n model_id=\"deepseek-ai/deepseek-math-7b-instruct\",\n tokenizer_id=\"deepseek-ai/deepseek-math-7b-instruct\",\n ),\n )\n\n prover_scorer.load()\n\n result = next(\n prover_scorer.process(\n [\n {\"formal_statement\": \"theorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):=\"},\n ]\n )\n )\n # result\n # [\n # {\n # 'formal_statement': 'theorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):=',\n # 'informal_statement': 'INFORMAL',\n # 'analysis': 'ANALYSIS',\n # 'assessment': 'ASSESSMENT',\n # 'distilabel_metadata': {\n # 'raw_output_deep_seek_prover_scorer_0': 'Natural 
language:\\nINFORMAL\\nAnalysis:\\nANALYSIS\\nAssessment:\\nASSESSMENT'\n # },\n # 'model_name': 'deepseek-prover-scorer'\n # }\n # ]\n ```\n \"\"\"\n\n _template: Union[Template, None] = PrivateAttr(...)\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n self._template = Template(template_deepseek_prover_scorer)\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task is the `instruction`.\"\"\"\n return [\"informal_statement\", \"formal_statement\"]\n\n @property\n def outputs(self):\n \"\"\"The output for the task is a list of `instructions` containing the generated instructions.\"\"\"\n return [\"natural_language\", \"analysis\", \"assessment\", \"model_name\"]\n\n def format_input(self, input: str) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation. And the\n `system_prompt` is added as the first message if it exists.\"\"\"\n return [\n {\n \"role\": \"system\",\n \"content\": self._template.render(),\n },\n {\n \"role\": \"user\",\n \"content\": f\"## Informal statement:\\n{input[self.inputs[0]]}\\n\\n ## Formal statement:\\n{input[self.inputs[1]]}\",\n },\n ]\n\n @override\n def format_output( # type: ignore\n self, output: Union[str, None], input: Dict[str, Any] = None\n ) -> Dict[str, Any]: # type: ignore\n \"\"\"Analyses the formal statement with Lean 4 output and generates an assessment\n and the corresponding informal assessment.\"\"\"\n\n try:\n result = output.split(\"Natural language:\")[1].strip()\n natural_language, analysis = result.split(\"Analysis:\")\n analysis, assessment = analysis.split(\"Assessment:\")\n natural_language = natural_language.strip()\n analysis = analysis.strip()\n assessment = assessment.strip()\n except Exception:\n natural_language = analysis = assessment = None\n\n return {\n \"natural_language\": natural_language,\n \"analysis\": analysis,\n \"assessment\": assessment,\n }\n\n\nclass DeepSeekProverSolver(Task):\n \"\"\"Task to generate a proof for a formal statement (theorem) in lean4.\n\n Input columns:\n - formal_statement (`str`): The formalized statement using Lean 4.\n\n Output columns:\n - proof (`str`): The proof for the formal statement theorem.\n\n Categories:\n - scorer\n - quality\n - response\n\n References:\n - [`DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data`](https://arxiv.org/abs/2405.14333).\n \"\"\"\n\n system_prompt: str = (\n \"You are an expert in proving mathematical theorems formalized in lean4 language. 
\"\n \"Your answers consist just in the proof to the theorem given, and nothing else.\"\n )\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task is the `formal_statement`.\"\"\"\n return [\"formal_statement\"]\n\n @property\n def outputs(self):\n \"\"\"The output for the task is the proof for the formal statement theorem.\"\"\"\n return [\"proof\"]\n\n def format_input(self, input: str) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType`, with a system prompt to guide our model.\"\"\"\n prompt = dedent(\"\"\"\n Give me a proof for the following theorem:\n ```lean4\n {theorem}\n ```\"\"\")\n return [\n {\n \"role\": \"system\",\n \"content\": self.system_prompt,\n },\n {\n \"role\": \"user\",\n \"content\": prompt.format(theorem=input[\"formal_statement\"]),\n },\n ]\n\n def format_output( # type: ignore\n self, output: Union[str, None], input: Dict[str, Any] = None\n ) -> Dict[str, Any]: # type: ignore\n import re\n\n match = re.search(_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX, output, re.DOTALL)\n if match:\n match = match.group(1).strip()\n return {\"proof\": match}\n\n\nexamples = [\n dedent(\"\"\"\n ## Statement in natural language:\n For real numbers k and x:\n If x is equal to (13 - \u221a131) / 4, and\n If the equation 2x\u00b2 - 13x + k = 0 is satisfied,\n Then k must be equal to 19/4.\n ## Formalized:\n theorem mathd_algebra_116 (k x : \u211d) (h\u2080 : x = (13 - Real.sqrt 131) / 4)\n (h\u2081 : 2 * x ^ 2 - 13 * x + k = 0) : k = 19 / 4 :=\"\"\"),\n dedent(\"\"\"\n ## Statement in natural language:\n The greatest common divisor (GCD) of 20 factorial (20!) and 200,000 is equal to 40,000.\n ## Formalized:\n theorem mathd_algebra_116 (k x : \u211d) (h\u2080 : x = (13 - Real.sqrt 131) / 4)\n (h\u2081 : 2 * x ^ 2 - 13 * x + k = 0) : k = 19 / 4 :=\"\"\"),\n dedent(\"\"\"\n ## Statement in natural language:\n Given two integers x and y:\n If y is positive (greater than 0),\n And y is less than x,\n And the equation x + y + xy = 80 is true,\n Then x must be equal to 26.\n ## Formalized:\n theorem mathd_algebra_116 (k x : \u211d) (h\u2080 : x = (13 - Real.sqrt 131) / 4)\n (h\u2081 : 2 * x ^ 2 - 13 * x + k = 0) : k = 19 / 4 :=\"\"\"),\n]\n\n\nwith Pipeline(name=\"test_deepseek_prover\") as pipeline:\n data_loader = LoadDataFromHub(\n repo_id=\"plaguss/informal-mathematical-statements-tiny\",\n split=\"val\",\n batch_size=8,\n )\n\n llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n )\n auto_formalization = DeepSeekProverAutoFormalization(\n name=\"auto_formalization\", input_batch_size=8, llm=llm, examples=examples\n )\n prover_scorer = DeepSeekProverScorer(\n name=\"prover_scorer\",\n input_batch_size=8,\n llm=llm,\n )\n proof_generator = DeepSeekProverSolver(\n name=\"proof_generator\", input_batch_size=8, llm=llm\n )\n\n (data_loader >> auto_formalization >> prover_scorer >> proof_generator)\n\n\nif __name__ == \"__main__\":\n import argparse\n\n parser = argparse.ArgumentParser()\n parser.add_argument(\n \"-d\",\n \"--dry-run\",\n action=argparse.BooleanOptionalAction,\n help=\"Do a dry run for testing purposes.\",\n )\n args = parser.parse_args()\n\n pipeline_parameters = {\n data_loader.name: {\"split\": \"val\"},\n auto_formalization.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.6,\n \"top_p\": 0.9,\n \"max_new_tokens\": 512,\n }\n }\n },\n prover_scorer.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.6,\n \"top_p\": 0.9,\n \"max_new_tokens\": 512,\n }\n }\n },\n 
}\n\n ds_name = \"test_deepseek_prover\"\n\n if args.dry_run:\n distiset = pipeline.dry_run(batch_size=1, parameters=pipeline_parameters)\n distiset.save_to_disk(Path.home() / f\"Downloads/{ds_name}\")\n\n import pprint\n\n pprint.pprint(distiset[\"default\"][\"train\"][0])\n\n else:\n distiset = pipeline.run(parameters=pipeline_parameters)\n distiset.push_to_hub(ds_name, include_script=True)\n
The script can be executed as a dry run or not, depending on the argument (the pipeline will run without dry run by default), and the resulting dataset will be pushed to the hub with the name your_username/test_deepseek_prover
:
python deepseek_prover.py [-d | --dry-run | --no-dry-run]\n
Final dataset: plaguss/test_deepseek_prover.
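As mentioned when introducing the scorer, the assessment column can be used to filter the statements afterwards; a minimal sketch over the resulting distiset (the default subset and split names are assumed):
# distiset is the object returned by pipeline.run(...) in the script above\nscored = distiset[\"default\"][\"train\"]\nkept = scored.filter(lambda row: row[\"assessment\"] in (\"excellent\", \"good\"))\nprint(f\"Kept {len(kept)} of {len(scored)} formalized statements\")\n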
"},{"location":"sections/pipeline_samples/papers/deita/","title":"DEITA","text":"DEITA (Data-Efficient Instruction Tuning for Alignment) studies an automatic data selection process by first quantifying the data quality based on complexity, quality and diversity. Second, select the best potential combination from an open-source dataset that would fit into the budget you allocate to tune your own LLM.
In most settings we cannot allocate unlimited resources for instruction-tuning LLMs. Therefore, the DEITA authors investigated how to select high-quality data for instruction tuning based on the principle of fewer but better samples. Liu et al. tackle the issue of first defining good data and then identifying it while respecting an initial budget for instruction-tuning your LLM.
The strategy utilizes LLMs to replace human effort in time-intensive data quality tasks on instruction-tuning datasets. DEITA introduces a way to measure data quality across three critical dimensions: complexity, quality and diversity.
You can see that we are again working with a dataset of instructions/responses, and that we are roughly reproducing the second step, where we learn how to optimize the responses to an instruction by comparing several possibilities.
"},{"location":"sections/pipeline_samples/papers/deita/#datasets-and-budget","title":"Datasets and budget","text":"We will dive deeper into the whole process. We will investigate each stage to efficiently select the final dataset used for supervised fine-tuning with a budget constraint. We will tackle technical challenges by explaining exactly how you would assess good data as presented in the paper.
As a reminder, we're looking for a strategy to automatically select good data for the instruction-tuning step when you want to fine-tune an LLM to your own use case taking into account a resource constraint. This means that you cannot blindly train a model on any data you encounter on the internet.
The DEITA authors assume that you have access to open-source datasets that fit your use case. This may not be the case entirely. But with open-source communities tackling many use cases, with projects such as BLOOM or AYA, it's likely that your use case will be tackled at some point. Furthermore, you could generate your own instruction/response pairs with methods such as self-generated instructions using distilabel. This tutorial assumes that we have a data pool with excessive samples for the project's cost constraint. In short, we aim to achieve adequate performance from fewer samples.
The authors claim that the subsample size \"correlates proportionally with the computation consumed in instruction tuning\". Hence, to a first approximation, reducing the sample size means reducing computation consumption and thus the total development cost. Reproducing the paper's notation, we will associate the budget m with a number of instruction/response pairs that you can set depending on your real budget.
To match the experimental set-up, dataset X_sota is a meta-dataset combining major open-source datasets available to instruct-tune LLMs. This dataset is composed of ShareGPT (58k instruction/response pairs), UltraChat (105k instruction/response pairs) and WizardLM (143k instruction/response pairs). It sums to more than 300k instruction/response pairs. We aim to reduce the final subsample to 6k instruction/response pairs.
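As a rough sketch, such a data pool could be assembled with the datasets library before any selection; the repository ids below are placeholders, and the column schemas would need to be aligned before concatenating:
from datasets import concatenate_datasets, load_dataset\n\n# Placeholder repo ids: point these to the instruction/response datasets you have access to\npool = concatenate_datasets(\n    [\n        load_dataset(\"your-org/sharegpt-pairs\", split=\"train\"),\n        load_dataset(\"your-org/ultrachat-pairs\", split=\"train\"),\n        load_dataset(\"your-org/wizardlm-pairs\", split=\"train\"),\n    ]\n)\nprint(len(pool))  # the goal is to go from >300k pairs down to a 6k subsample\n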
"},{"location":"sections/pipeline_samples/papers/deita/#setup-the-notebook-and-packages","title":"Setup the notebook and packages","text":"Let's prepare our dependencies:
pip install \"distilabel[openai,hf-transformers]>=1.0.0\"\npip install pynvml huggingface_hub argilla\n
Import distilabel:
from distilabel.models import TransformersLLM, OpenAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import ConversationTemplate, DeitaFiltering, ExpandColumns, LoadDataFromHub\nfrom distilabel.steps.tasks import ComplexityScorer, EvolInstruct, EvolQuality, GenerateEmbeddings, QualityScorer\n
Define the distilabel Pipeline and load the dataset from the Hugging Face Hub.
pipeline = Pipeline(name=\"DEITA\")\n\nload_data = LoadDataFromHub(\n name=\"load_data\", batch_size=100, output_mappings={\"prompt\": \"instruction\"}, pipeline=pipeline\n)\n
"},{"location":"sections/pipeline_samples/papers/deita/#evol-instruct-generate-instructions-with-an-llm","title":"EVOL-INSTRUCT: Generate Instructions with an LLM","text":"Evol-Instruct automates the creation of complex instruction data for training large language models (LLMs) by progressively rewriting an initial set of instructions into more complex forms. This generated data is then used to fine-tune a model named WizardLM.
Evaluations show that instructions from Evol-Instruct are superior to human-created ones, and WizardLM achieves performance close to or exceeding GPT3.5-turbo in many skills. In distilabel, we initialise each step of the data generation pipeline. Later, we'll connect them together.
evol_instruction_complexity = EvolInstruct(\n name=\"evol_instruction_complexity\",\n llm=OpenAILLM(model=\"gpt-3.5-turbo\"),\n num_evolutions=5,\n store_evolutions=True,\n generate_answers=True,\n include_original_instruction=True,\n pipeline=pipeline,\n)\n\nevol_instruction_complexity.load()\n\n_evolved_instructions = next(evol_instruction_complexity.process(\n ([{\"instruction\": \"How many fish are there in a dozen fish?\"}]))\n)\n\nprint(*_evolved_instructions, sep=\"\\n\")\n
Output:
( 1, 'How many fish are there in a dozen fish?')\n( 2, 'How many rainbow trout are there in a dozen rainbow trout?')\n( 3, 'What is the average weight in pounds of a dozen rainbow trout caught in a specific river in Alaska during the month of May?')\n
"},{"location":"sections/pipeline_samples/papers/deita/#evol-complexity-evaluate-complexity-of-generated-instructions","title":"EVOL COMPLEXITY: Evaluate complexity of generated instructions","text":"The second step is the evaluation of complexity for an instruction in a given instruction-response pair. Like EVOL-INSTRUCT, this method uses LLMs instead of humans to automatically improve instructions, specifically through their complexity. From any instruction-response pair, \\((I, R)\\), we first generate new instructions following the In-Depth Evolving Response. We generate more complex instructions through prompting, as explained by authors, by adding some constraints or reasoning steps. Let\\'s take an example from GPT-4-LLM which aims to generate observations by GPT-4 to instruct-tune LLMs with supervised fine-tuning. And, we have the instruction \\(instruction_0\\):
instruction_0 = \"Give three tips for staying healthy.\"\n
To make it more complex, you can use, as the authors did, some prompt templates to add constraints or deepen the instruction. They provided some prompts in the paper appendix. For instance, this one was used to add constraints:
PROMPT = \"\"\"I want you act as a Prompt Rewriter.\nYour objective is to rewrite a given prompt into a more complex version to\nmake those famous AI systems (e.g., ChatGPT and GPT4) a bit harder to handle.\nBut the rewritten prompt must be reasonable and must be understood and\nresponded by humans.\nYour rewriting cannot omit the non-text parts such as the table and code in\n#Given Prompt#:. Also, please do not omit the input in #Given Prompt#.\nYou SHOULD complicate the given prompt using the following method:\nPlease add one more constraints/requirements into #Given Prompt#\nYou should try your best not to make the #Rewritten Prompt# become verbose,\n#Rewritten Prompt# can only add 10 to 20 words into #Given Prompt#.\n\u2018#Given Prompt#\u2019, \u2018#Rewritten Prompt#\u2019, \u2018given prompt\u2019 and \u2018rewritten prompt\u2019\nare not allowed to appear in #Rewritten Prompt#\n#Given Prompt#:\n<Here is instruction>\n#Rewritten Prompt#:\n\"\"\"\n
Prompting this to an LLM, you automatically get a more complex instruction, called \\(instruction_1\\), from an initial instruction \\(instruction_0\\):
instruction_1 = \"Provide three recommendations for maintaining well-being, ensuring one focuses on mental health.\"\n
With sequences of evolved instructions, we use a further LLM to automatically rank and score them. We provide the 6 instructions at the same time: by providing all instructions together, we force the scoring model to look at minor complexity differences between evolved instructions, encouraging it to discriminate between them. Taking the example below, \(instruction_0\) and \(instruction_1\) could deserve the same score independently, but when compared together we would notice the slight difference that makes \(instruction_1\) more complex.
In distilabel
, we implement this like so:
instruction_complexity_scorer = ComplexityScorer(\n name=\"instruction_complexity_scorer\",\n llm=OpenAILLM(model=\"gpt-3.5-turbo\"),\n input_mappings={\"instructions\": \"evolved_instructions\"},\n pipeline=pipeline,\n)\n\nexpand_evolved_instructions = ExpandColumns(\n name=\"expand_evolved_instructions\",\n columns=[\"evolved_instructions\", \"answers\", \"scores\"],\n output_mappings={\n \"evolved_instructions\": \"evolved_instruction\",\n \"answers\": \"answer\",\n \"scores\": \"evol_instruction_score\",\n },\n pipeline=pipeline,\n)\n\ninstruction_complexity_scorer.load()\n\n_evolved_instructions = next(instruction_complexity_scorer.process(([{\"evolved_instructions\": [PROMPT + instruction_1]}])))\n\nprint(\"Original Instruction:\")\nprint(instruction_1)\nprint(\"\\nEvolved Instruction:\")\nprint(_evolved_instructions[0][\"evolved_instructions\"][0].split(\"#Rewritten Prompt#:\\n\")[1])\n
Output:
Original Instruction:\nProvide three recommendations for maintaining well-being, ensuring one focuses on mental health.\n\nEvolved Instruction:\nSuggest three strategies for nurturing overall well-being, with the stipulation that at least one explicitly addresses the enhancement of mental health, incorporating evidence-based practices.\n
"},{"location":"sections/pipeline_samples/papers/deita/#evol-quality-quality-evaluation","title":"EVOL-QUALITY: Quality Evaluation","text":"Now that we have scored the complexity of the instructions, we will focus on the quality of the responses. Similar to EVOL COMPLEXITY, the authors introduced EVOL QUALITY, a method based on LLMs, instead of humans, to automatically score the quality of the response.
From an instruction-response pair, \\((I, R)\\), the goal is to make the response evolve into a more helpful and relevant response. The key difference is that we need to also provide the first instruction to guide evolution. Let's take back our example from GPT-4-LLM.
Here we have the response \\(response_0\\) and its initial instruction \\(instruction_0\\):
instruction_0 = \"Give three tips for staying healthy.\"\nreponse_0 = \"1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases. 2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week. 3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.\"\n
Again the authors provided several prompts you could use to make your response evolve according to some guidelines. For example, this one was used to enrich the answer:
PROMPT = \"\"\"I want you to act as a Response Rewriter\nYour goal is to enhance the quality of the response given by an AI assistant\nto the #Given Prompt# through rewriting.\nBut the rewritten response must be reasonable and must be understood by humans.\nYour rewriting cannot omit the non-text parts such as the table and code in\n#Given Prompt# and #Given Response#. Also, please do not omit the input\nin #Given Prompt#.\nYou Should enhance the quality of the response using the following method:\nPlease make the Response more in-depth\nYou should try your best not to make the #Rewritten Response# become verbose,\n#Rewritten Response# can only add 10 to 20 words into #Given Response#.\n\u2018#Given Response#\u2019, \u2018#Rewritten Response#\u2019, \u2018given response\u2019 and \u2018rewritten response\u2019\nare not allowed to appear in #Rewritten Response#\n#Given Prompt#:\n<instruction_0>\n#Given Response#:\n<response_0>\n#Rewritten Response#:\n\"\"\"\n
Prompting this to an LLM, you will automatically get a more enriched response, called \\(response_1\\), from an initial response \\(response_0\\) and initial instruction \\(instruction_0\\):
evol_response_quality = EvolQuality(\n name=\"evol_response_quality\",\n llm=OpenAILLM(model=\"gpt-3.5-turbo\"),\n num_evolutions=5,\n store_evolutions=True,\n include_original_response=True,\n input_mappings={\n \"instruction\": \"evolved_instruction\",\n \"response\": \"answer\",\n },\n pipeline=pipeline,\n)\n\nevol_response_quality.load()\n\n_evolved_responses = next(evol_response_quality.process([{\"instruction\": PROMPT + instruction_0, \"response\": reponse_0}]))\n\nprint(\"Original Response:\")\nprint(reponse_0)\nprint(\"\\nEvolved Response:\")\nprint(*_evolved_responses[0]['evolved_responses'], sep=\"\\n\")\n
And now, as in EVOL COMPLEXITY, you iterate through this path and use different prompts to make your responses more relevant, helpful or creative. In the paper, they perform 4 more iterations to obtain 5 evolved responses \((R0, R1, R2, R3, R4)\), which gives 5 different responses for one initial instruction at the end of this step.
response_quality_scorer = QualityScorer(\n name=\"response_quality_scorer\",\n llm=OpenAILLM(model=\"gpt-3.5-turbo\"),\n input_mappings={\n \"instruction\": \"evolved_instruction\",\n \"responses\": \"evolved_responses\",\n },\n pipeline=pipeline,\n)\n\nexpand_evolved_responses = ExpandColumns(\n name=\"expand_evolved_responses\",\n columns=[\"evolved_responses\", \"scores\"],\n output_mappings={\n \"evolved_responses\": \"evolved_response\",\n \"scores\": \"evol_response_score\",\n },\n pipeline=pipeline,\n)\n\nresponse_quality_scorer.load()\n\n_scored_responses = next(response_quality_scorer.process([{\"instruction\": PROMPT + instruction_0, \"responses\": _evolved_responses[0]['evolved_responses']}]))\n\nprint(\"Original Response:\")\nprint(reponse_0)\n\nprint(\"\\nScore, Evolved Response:\")\nprint(*zip(_scored_responses[0][\"scores\"], _evolved_responses[0]['evolved_responses']), sep=\"\\n\")\n
Output:
Original Response:\n1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases. 2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week. 3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.\n\nScore, Evolved Response:\n(4.0, 'Here are three essential tips for maintaining good health: \\n1. Prioritize regular exercise \\n2. Eat a balanced diet with plenty of fruits and vegetables \\n3. Get an adequate amount of sleep each night.')\n(2.0, 'Here are three effective strategies to maintain a healthy lifestyle.')\n(5.0, 'Here are three practical tips to maintain good health: Ensure a balanced diet, engage in regular exercise, and prioritize sufficient sleep. These practices support overall well-being.')\n
"},{"location":"sections/pipeline_samples/papers/deita/#improving-data-diversity","title":"Improving Data Diversity","text":"One main component of good data to instruct-tune LLMs is diversity. Real world data can often contain redundancy due repetitive and homogeneous data.
The authors of the DEITA paper tackle the challenge of ensuring data diversity in the instruction tuning LLMs to avoid the pitfalls of data redundancy that can lead to over-fitting or poor generalization. They propose an embedding-based method to filter data for diversity. This method, called Repr Filter, uses embeddings generated by the Llama 1 13B model to represent instruction-response pairs in a vector space. The diversity of a new data sample is assessed based on the cosine distance between its embedding and that of its nearest neighbor in the already selected dataset. If this distance is greater than a specified threshold, the sample is considered diverse and is added to the selection. This process prioritizes diversity by assessing each sample's contribution to the variety of the dataset until the data selection budget is met. This approach effectively maintains the diversity of the data used for instruction tuning, as demonstrated by the DEITA models outperforming or matching state-of-the-art models with significantly less training data. In this implementation of DEITA we use the hidden state of the last layer of the Llama 2 model to generate embeddings, instead of a sentence transformer model, because we found that it improved the diversity of the data selection.
generate_conversation = ConversationTemplate(\n name=\"generate_conversation\",\n input_mappings={\n \"instruction\": \"evolved_instruction\",\n \"response\": \"evolved_response\",\n },\n pipeline=pipeline,\n)\n\ngenerate_embeddings = GenerateEmbeddings(\n name=\"generate_embeddings\",\n llm=TransformersLLM(\n model=\"TinyLlama/TinyLlama-1.1B-Chat-v1.0\",\n device=\"cuda\",\n torch_dtype=\"float16\",\n ),\n input_mappings={\"text\": \"conversation\"},\n input_batch_size=5,\n pipeline=pipeline,\n)\n\ndeita_filtering = DeitaFiltering(name=\"deita_filtering\", pipeline=pipeline)\n
"},{"location":"sections/pipeline_samples/papers/deita/#build-the-distilabel-pipeline","title":"Build the \u2697 distilabel Pipeline
","text":"Now we're ready to build a distilabel
pipeline using the DEITA method:
load_data.connect(evol_instruction_complexity)\nevol_instruction_complexity.connect(instruction_complexity_scorer)\ninstruction_complexity_scorer.connect(expand_evolved_instructions)\nexpand_evolved_instructions.connect(evol_response_quality)\nevol_response_quality.connect(response_quality_scorer)\nresponse_quality_scorer.connect(expand_evolved_responses)\nexpand_evolved_responses.connect(generate_conversation)\ngenerate_conversation.connect(generate_embeddings)\ngenerate_embeddings.connect(deita_filtering)\n
Now we can run the pipeline. We use the step names to reference them in the pipeline configuration:
distiset = pipeline.run(\n parameters={\n \"load_data\": {\n \"repo_id\": \"distilabel-internal-testing/instruction-dataset-50\",\n \"split\": \"train\",\n },\n \"evol_instruction_complexity\": {\n \"llm\": {\"generation_kwargs\": {\"max_new_tokens\": 512, \"temperature\": 0.7}}\n },\n \"instruction_complexity_scorer\": {\n \"llm\": {\"generation_kwargs\": {\"temperature\": 0.0}}\n },\n \"evol_response_quality\": {\n \"llm\": {\"generation_kwargs\": {\"max_new_tokens\": 512, \"temperature\": 0.7}}\n },\n \"response_quality_scorer\": {\"llm\": {\"generation_kwargs\": {\"temperature\": 0.0}}},\n \"deita_filtering\": {\"data_budget\": 500, \"diversity_threshold\": 0.04},\n },\n use_cache=False,\n)\n
We can push the results to the Hugging Face Hub:
distiset.push_to_hub(\"distilabel-internal-testing/deita-colab\")\n
"},{"location":"sections/pipeline_samples/papers/deita/#results","title":"Results","text":"Again, to show the relevance of EVOL QUALITY method, the authors evaluated on the MT-bench models fine-tuned with different data selections according to how we defined quality responses according to an instruction. Each time they selected 6k data according to the quality score:
Credit: Liu et al. (2023)
The score is much better when selecting data with the EVOL QUALITY method than when we select randomly or according to the length, making a more qualitative response if longer. Nevertheless, we see that the margin we may have seen in the complexity score is thinner. And we'll discuss the strategy in a later part. Nevertheless, this strategy looks to improve the fine-tuning compared to the baselines and now we're interested in mixing quality and complexity assessment with a diversity evaluation to find the right trade-off in our selection process.
"},{"location":"sections/pipeline_samples/papers/deita/#conclusion","title":"Conclusion","text":"In conclusion, if you are looking for some efficient method to align an open-source LLM to your business case with a constrained budget, the solutions provided by DEITA are really worth the shot. This data-centric approach enables one to focus on the content of the dataset to have the best results instead of \"just\" scaling the instruction-tuning with more, and surely less qualitative, data. In a nutshell, the strategy developed, through automatically scoring instructions-responses, aims to substitute the human preference step proprietary models such as GPT-4 have been trained with. There are a few improvements we could think about when it comes to how to select the good data, but it opens a really great way in instruct-tuning LLM with lower computational needs making the whole process intellectually relevant and more sustainable than most of the other methods. We'd be happy to help you out with aligning an LLM with your business case drawing inspiration from such a methodology.
"},{"location":"sections/pipeline_samples/papers/instruction_backtranslation/","title":"Instruction Backtranslation","text":"\"Self Alignment with Instruction Backtranslation\" presents a scalable method to build high-quality instruction following a language model by automatically labeling human-written text with corresponding instructions. Their approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high-quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model.
Their self-training approach assumes access to a base language model, a small amount of seed data, and a collection of unlabelled examples, e.g. a web corpus. The unlabelled data is a large, diverse set of human-written documents that includes writing about all manner of topics humans are interested in \u2013 but crucially is not paired with instructions.
A first key assumption is that there exists some subset of this very large human-written text that would be suitable as gold generations for some user instructions. A second key assumption is that they can predict instructions for these candidate gold answers that can be used as high-quality example pairs to train an instruction-following model.
Their overall process, called instruction back translation performs two core steps:
Self-augment: Generate instructions for unlabelled data, i.e. the web corpus, to produce candidate training data of (instruction, output) pairs for instruction tuning.
Self-curate: Self-select high-quality demonstration examples as training data to finetune the base model to follow instructions. This approach is done iteratively where a better intermediate instruction-following model can improve on selecting data for finetuning in the next iteration.
This replication covers the self-curation step i.e. the second/latter step as mentioned above, so as to be able to use the proposed prompting approach to rate the quality of the generated text, which can either be synthetically generated or real human-written text.
"},{"location":"sections/pipeline_samples/papers/instruction_backtranslation/#replication","title":"Replication","text":"To replicate the paper we will be using distilabel
and a smaller dataset created by the Hugging Face H4 team named HuggingFaceH4/instruction-dataset
for testing purposes.
To replicate Self Alignment with Instruction Backtranslation one will need to install distilabel
as it follows:
pip install \"distilabel[hf-inference-endpoints,openai]>=1.0.0\"\n
And since we will be using InferenceEndpointsLLM
(installed via the extra hf-inference-endpoints
) we will need deploy those in advance either locally or in the Hugging Face Hub (alternatively also the serverless endpoints can be used, but most of the times the inference times are slower, and there's a limited quota to use those as those are free) and set both the HF_TOKEN
(to use the InferenceEndpointsLLM
) and the OPENAI_API_KEY
environment variable value (to use the OpenAILLM
).
LoadDataFromHub
: Generator Step to load a dataset from the Hugging Face Hub.TextGeneration
: Task to generate responses for a given instruction using an LLM.InferenceEndpointsLLM
: LLM that runs a model from an Inference Endpoint in the Hugging Face Hub.InstructionBacktranslation
: Task that generates a score and a reason for a response for a given instruction using the Self Alignment with Instruction Backtranslation prompt.OpenAILLM
: LLM that loads a model from OpenAI.We will now put these building blocks together to replicate Self Alignment with Instruction Backtranslation.
from distilabel.models import InferenceEndpointsLLM, OpenAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromHub, KeepColumns\nfrom distilabel.steps.tasks import InstructionBacktranslation, TextGeneration\n\n\nwith Pipeline(name=\"self-alignment-with-instruction-backtranslation\") as pipeline:\n load_hub_dataset = LoadDataFromHub(\n name=\"load_dataset\",\n output_mappings={\"prompt\": \"instruction\"},\n )\n\n text_generation = TextGeneration(\n name=\"text_generation\",\n llm=InferenceEndpointsLLM(\n base_url=\"<INFERENCE_ENDPOINT_URL>\",\n tokenizer_id=\"argilla/notus-7b-v1\",\n model_display_name=\"argilla/notus-7b-v1\",\n ),\n input_batch_size=10,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n\n instruction_backtranslation = InstructionBacktranslation(\n name=\"instruction_backtranslation\",\n llm=OpenAILLM(model=\"gpt-4\"),\n input_batch_size=10,\n output_mappings={\"model_name\": \"scoring_model\"},\n )\n\n keep_columns = KeepColumns(\n name=\"keep_columns\",\n columns=[\n \"instruction\",\n \"generation\",\n \"generation_model\",\n \"score\",\n \"reason\",\n \"scoring_model\",\n ],\n )\n\n load_hub_dataset >> text_generation >> instruction_backtranslation >> keep_columns\n
Then we need to call pipeline.run
with the runtime parameters so that the pipeline can be launched.
distiset = pipeline.run(\n parameters={\n load_hub_dataset.name: {\n \"repo_id\": \"HuggingFaceH4/instruction-dataset\",\n \"split\": \"test\",\n },\n text_generation.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 1024,\n \"temperature\": 0.7,\n },\n },\n },\n instruction_backtranslation.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 1024,\n \"temperature\": 0.7,\n },\n },\n },\n },\n)\n
Finally, we can optionally push the generated dataset, named Distiset
, to the Hugging Face Hub via the push_to_hub
method, so that each subset generated in the leaf steps is pushed to the Hub.
distiset.push_to_hub(\n \"instruction-backtranslation-instruction-dataset\",\n private=True,\n)\n
"},{"location":"sections/pipeline_samples/papers/math_shepherd/","title":"Create datasets to train a Process Reward Model using Math-Shepherd","text":"This example will introduce Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations, an innovative math process reward model (PRM) which assigns reward scores to each step of math problem solutions. Specifically, we will present a recipe to create datasets to train such models. The final sections contain 2 pipeline examples to run the pipeline depending with more or less resources.
"},{"location":"sections/pipeline_samples/papers/math_shepherd/#replica","title":"Replica","text":"Unlike traditional models that only look at final answers (Output Reward Models or ORM), this system evaluates each step of a mathematical solution and assigns reward scores to individual solution steps. Let's see the Figure 2 from the paper, which makes a summary of the labelling approach presented in their work.
In the traditional ORM approach, the annotation depends only on the final outcome, whereas the Process Reward Model (PRM) labels the different steps that lead to a solution, yielding a richer set of information.
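To make the difference concrete, here is a small illustrative example (made-up data, not produced by the pipeline): an ORM assigns a single label to the final outcome, while a PRM assigns one label per solution step.
# Illustrative, made-up example of the two labelling schemes\nsolution_steps = [\n    \"Step 1: 4 bags with 3 apples each contain 4 * 3 = 12 apples.\",\n    \"Step 2: At $2 per apple, the total cost is 12 * 2 = $24.\",\n]\n\norm_label = \"+\"          # ORM: a single label for the final answer only\nprm_labels = [\"+\", \"+\"]  # PRM: one label per intermediate step\n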
"},{"location":"sections/pipeline_samples/papers/math_shepherd/#steps-involved","title":"Steps involved","text":"MathShepherdGenerator
: This step is in charge of generating solutions for the instruction. Depending on the value set for the M
, this step can be used to generate both the golden_solution
, to be used as a reference for the labeller, or the set of solutions
to be labelled. For the solutions
column we want some diversity, allowing the model to reach both good and bad solutions so that the labeller sees a representative sample, which is why it may be better to use a \"weaker\" model.
MathShepherdCompleter
. This task does the job of the completer
in the paper, generating completions as presented in Figure 2, section 3.3.2. It doesn't generate a column on its own, but updates the steps generated in the solutions
column from the MathShepherdGenerator
, using as reference to label the data, the golden_solution
. So in order for this step to work, we need both of these columns in our dataset. Depending on the type of dataset, we may already have access to the golden_solution
, even if it's under a different name, but that's usually not the case for the solutions
.
FormatPRM
. This step does the auxiliary job of preparing the data to follow the format defined in the paper, i.e. having two columns, input
and label
. After running the MathShepherdCompleter
, we have raw data that can be formatted as the user wants. Using ExpandColumns
and this step, one can directly obtain the same format presented in the dataset shared in the paper: peiyi9979/Math-Shepherd.
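As a rough illustration of the target layout (hypothetical values; the concrete step tag and wording depend on the FormatPRM configuration), each row ends up with an input column holding the instruction plus the tagged solution steps, and a label column where every tag is replaced by the corresponding + or - judgement:
# Hypothetical example of the two columns produced at the end of the pipeline;\n# the actual step tag depends on the FormatPRM configuration\nrow = {\n    \"input\": \"Janet has 3 apples... Step 1: ... <step> Step 2: ... <step>\",\n    \"label\": \"Janet has 3 apples... Step 1: ... + Step 2: ... -\",\n}\n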
For this example, just as in the original paper, we are using the openai/gsm8k dataset. We only need a dataset with instructions to be solved (in this case it corresponds to the question
column), and we can generate everything else using our predefined steps.
The pipeline uses openai/gsm8k
as a reference, but the pipeline can be applied to different datasets; keep in mind that, with the current definition, the prompts can be modified by tweaking the extra_rules
and few_shots
in each task:
from datasets import load_dataset\n\nfrom distilabel.steps.tasks import MathShepherdCompleter, MathShepherdGenerator, FormatPRM\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import CombineOutputs, ExpandColumns\n\nds_name = \"openai/gsm8k\"\n\nds = load_dataset(ds_name, \"main\", split=\"test\").rename_column(\"question\", \"instruction\").select(range(3)) # (1)\n\nwith Pipeline(name=\"Math-Shepherd\") as pipe:\n model_id_70B = \"meta-llama/Meta-Llama-3.1-70B-Instruct\"\n model_id_8B = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n\n llm_70B = InferenceEndpointsLLM(\n model_id=model_id_70B,\n tokenizer_id=model_id_70B,\n generation_kwargs={\"max_new_tokens\": 1024, \"temperature\": 0.6},\n )\n llm_8B = InferenceEndpointsLLM(\n model_id=model_id_8B,\n tokenizer_id=model_id_8B,\n generation_kwargs={\"max_new_tokens\": 2048, \"temperature\": 0.6},\n ) # (2)\n\n generator_golden = MathShepherdGenerator(\n name=\"golden_generator\",\n llm=llm_70B,\n ) # (3)\n generator = MathShepherdGenerator(\n name=\"generator\",\n llm=llm_8B,\n use_default_structured_output=True, # (9)\n M=5\n ) #\u00a0(4)\n completer = MathShepherdCompleter(\n name=\"completer\",\n llm=llm_8B,\n use_default_structured_output=True,\n N=4\n ) # (5)\n\n combine = CombineOutputs()\n\n expand = ExpandColumns(\n name=\"expand_columns\",\n columns=[\"solutions\"],\n split_statistics=True,\n ) #\u00a0(6)\n formatter = FormatPRM(name=\"format_prm\") # (7)\n\n [generator_golden, generator] >> combine >> completer >> expand >> formatter # (8)\n
We will use just 3 rows from the sample dataset and rename the \"question\" column to \"instruction\", the input expected by the MathShepherdGenerator
.
We will use 2 different LLMs, meta-llama/Meta-Llama-3.1-70B-Instruct
(a stronger model for the golden_solution
) and meta-llama/Meta-Llama-3.1-8B-Instruct
(a weaker one to generate candidate solutions, and the completions).
This MathShepherdGenerator
task, that uses the stronger model, will generate the golden_solution
for us, the step by step solution for the task.
Another MathShepherdGenerator
task, in this case using the weaker model, will generate candidate solutions
(M=5
in total).
Now the MathShepherdCompleter
task will generate N=4
completions for each step of each candidate solution in the solutions
column, and label them using the golden_solution
as shown in Figure 2 in the paper. This step will add the label (it uses [+ and -] tags following the implementation in the paper, but these values can be modified) to the solutions
column in place, instead of generating an additional column, but the intermediate completions won't be shown at the end.
The ExpandColumns
step expands the solutions to match the instruction, so if we had set M=5, we would now have 5 instruction-solution pairs. We set the split_statistics
to True to ensure the distilabel_metadata
is split accordingly; otherwise, the token count of the whole list of generated solutions would be attributed to each individual solution. One can omit both this and the following step and process the data for training as preferred.
And finally, the FormatPRM
generates two columns: input
and label
which prepare the data for training as presented in the original Math-Shepherd dataset.
Both the generator_golden
and generator
can be run in parallel as there's no dependency between them, and after that we combine the results and pass them to the completer
. Finally, we use the expand
and formatter
to prepare the data in the expected format to train the Process Reward Model as defined in the original paper.
Generate structured outputs to make them easier to parse; otherwise, the models frequently fail to return a list that can be parsed.
To see all the pieces in place, take a look at the full pipeline:
Runpython examples/pipe_math_shepherd.py\n
Full pipeline pipe_math_shepherd.py# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom datasets import load_dataset\n\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import CombineOutputs, ExpandColumns\nfrom distilabel.steps.tasks import (\n FormatPRM,\n MathShepherdCompleter,\n MathShepherdGenerator,\n)\n\nds_name = \"openai/gsm8k\"\n\nds = (\n load_dataset(ds_name, \"main\", split=\"test\")\n .rename_column(\"question\", \"instruction\")\n .select(range(3))\n)\n\n\nwith Pipeline(name=\"Math-Shepherd\") as pipe:\n model_id_70B = \"meta-llama/Meta-Llama-3.1-70B-Instruct\"\n model_id_8B = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n\n llm_70B = InferenceEndpointsLLM(\n model_id=model_id_8B,\n tokenizer_id=model_id_8B,\n generation_kwargs={\"max_new_tokens\": 1024, \"temperature\": 0.5},\n )\n llm_8B = InferenceEndpointsLLM(\n model_id=model_id_8B,\n tokenizer_id=model_id_8B,\n generation_kwargs={\"max_new_tokens\": 2048, \"temperature\": 0.7},\n )\n\n generator_golden = MathShepherdGenerator(\n name=\"golden_generator\",\n llm=llm_70B,\n )\n generator = MathShepherdGenerator(\n name=\"generator\",\n llm=llm_8B,\n M=5,\n )\n completer = MathShepherdCompleter(name=\"completer\", llm=llm_8B, N=4)\n\n combine = CombineOutputs()\n\n expand = ExpandColumns(\n name=\"expand_columns\",\n columns=[\"solutions\"],\n split_statistics=True,\n )\n formatter = FormatPRM(name=\"format_prm\")\n [generator_golden, generator] >> combine >> completer >> expand >> formatter\n\n\nif __name__ == \"__main__\":\n distiset = pipe.run(use_cache=False, dataset=ds)\n distiset.push_to_hub(\"plaguss/test_math_shepherd_prm\")\n
The resulting dataset can be seen at: plaguss/test_math_shepherd_prm.
"},{"location":"sections/pipeline_samples/papers/math_shepherd/#pipeline-with-vllm-and-ray","title":"Pipeline with vLLM and ray","text":"This section contains an alternative way of running the pipeline with a bigger outcome. To showcase how to scale the pipeline, we are using for the 3 generating tasks Qwen/Qwen2.5-72B-Instruct, highly improving the final quality as it follows much closer the prompt given. Also, we are using vLLM
and 3 nodes (one per task in this case), to scale up the generation process.
from datasets import load_dataset\n\nfrom distilabel.models import vLLM\nfrom distilabel.steps import StepResources\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import CombineOutputs, ExpandColumns\nfrom distilabel.steps.tasks import (\n FormatPRM,\n MathShepherdCompleter,\n MathShepherdGenerator,\n)\n\nds_name = \"openai/gsm8k\"\n\nds = (\n load_dataset(ds_name, \"main\", split=\"test\")\n .rename_column(\"question\", \"instruction\")\n)\n\n\nwith Pipeline(name=\"Math-Shepherd\").ray() as pipe: # (1)\n\n model_id_72B = \"Qwen/Qwen2.5-72B-Instruct\"\n\n llm_72B = vLLM(\n model=model_id_72B,\n tokenizer=model_id_72B,\n extra_kwargs={\n \"tensor_parallel_size\": 8, # Number of GPUs per node\n \"max_model_len\": 2048,\n },\n generation_kwargs={\n \"temperature\": 0.5,\n \"max_new_tokens\": 4096,\n },\n )\n\n generator_golden = MathShepherdGenerator(\n name=\"golden_generator\",\n llm=llm_72B,\n input_batch_size=50,\n output_mappings={\"model_name\": \"model_name_golden_generator\"},\n resources=StepResources(replicas=1, gpus=8) # (2)\n )\n generator = MathShepherdGenerator(\n name=\"generator\",\n llm=llm_72B,\n input_batch_size=50,\n M=5,\n use_default_structured_output=True,\n output_mappings={\"model_name\": \"model_name_generator\"},\n resources=StepResources(replicas=1, gpus=8)\n )\n completer = MathShepherdCompleter(\n name=\"completer\", \n llm=llm_72B,\n N=8,\n use_default_structured_output=True,\n output_mappings={\"model_name\": \"model_name_completer\"},\n resources=StepResources(replicas=1, gpus=8)\n )\n\n combine = CombineOutputs()\n\n expand = ExpandColumns(\n name=\"expand_columns\",\n columns=[\"solutions\"],\n split_statistics=True,\n\n )\n formatter = FormatPRM(name=\"format_prm\", format=\"trl\") # (3)\n\n [generator_golden, generator] >> combine >> completer >> expand >> formatter\n\n\nif __name__ == \"__main__\":\n distiset = pipe.run(use_cache=False, dataset=ds, dataset_batch_size=50)\n if distiset:\n distiset.push_to_hub(\"plaguss/test_math_shepherd_prm_ray\")\n
Transform the pipeline to run using ray
backend.
Assign the resources: the number of replicas is 1, as we want a single instance of the task per node, and the number of GPUs is 8, using a whole node. Given that we defined the script in the slurm file to use 3 nodes, this will use all 3 available nodes, with 8 GPUs for each of these tasks.
Prepare the columns in the format expected by TRL
for training.
Click to see the slurm file used to run the previous pipeline. It's our go-to slurm
file, using 3 8xH100 nodes.
#!/bin/bash\n#SBATCH --job-name=math-shepherd-test-ray\n#SBATCH --partition=hopper-prod\n#SBATCH --qos=normal\n#SBATCH --nodes=3\n#SBATCH --exclusive\n#SBATCH --ntasks-per-node=1\n#SBATCH --gpus-per-node=8\n#SBATCH --output=./logs/%x-%j.out\n#SBATCH --err=./logs/%x-%j.err\n#SBATCH --time=48:00:00\n\nset -ex\n\nmodule load cuda/12.1\n\necho \"SLURM_JOB_ID: $SLURM_JOB_ID\"\necho \"SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST\"\n\nsource .venv/bin/activate\n\n# Getting the node names\nnodes=$(scontrol show hostnames \"$SLURM_JOB_NODELIST\")\nnodes_array=($nodes)\n\n# Get the IP address of the head node\nhead_node=${nodes_array[0]}\nhead_node_ip=$(srun --nodes=1 --ntasks=1 -w \"$head_node\" hostname --ip-address)\n\n# Start Ray head node\nport=6379\nip_head=$head_node_ip:$port\nexport ip_head\necho \"IP Head: $ip_head\"\n\n# Generate a unique Ray tmp dir for the head node\nhead_tmp_dir=\"/tmp/ray_tmp_${SLURM_JOB_ID}_head\"\n\necho \"Starting HEAD at $head_node\"\nsrun --nodes=1 --ntasks=1 -w \"$head_node\" \\\n ray start --head --node-ip-address=\"$head_node_ip\" --port=$port \\\n --dashboard-host=0.0.0.0 \\\n --dashboard-port=8265 \\\n --temp-dir=\"$head_tmp_dir\" \\\n --block &\n\n# Give some time to head node to start...\nsleep 10\n\n# Start Ray worker nodes\nworker_num=$((SLURM_JOB_NUM_NODES - 1))\n\n# Start from 1 (0 is head node)\nfor ((i = 1; i <= worker_num; i++)); do\n node_i=${nodes_array[$i]}\n worker_tmp_dir=\"/tmp/ray_tmp_${SLURM_JOB_ID}_worker_$i\"\n echo \"Starting WORKER $i at $node_i\"\n srun --nodes=1 --ntasks=1 -w \"$node_i\" \\\n ray start --address \"$ip_head\" \\\n --temp-dir=\"$worker_tmp_dir\" \\\n --block &\n sleep 5\ndone\n\n# Give some time to the Ray cluster to gather info\nsleep 60\n\n# Finally submit the job to the cluster\nRAY_ADDRESS=\"http://$head_node_ip:8265\" ray job submit --working-dir pipeline -- python -u pipeline_math_shepherd_ray.py\n
Final dataset The resulting dataset can be seen at: plaguss/test_math_shepherd_prm_ray.
"},{"location":"sections/pipeline_samples/papers/prometheus/","title":"Prometheus 2","text":"\"Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models\" presents Prometheus 2, a new and more powerful evaluator LLM compared to Prometheus (its predecessor) presented in \"Prometheus: Inducing Fine-grained Evaluation Capability in Language Models\"; since GPT-4, as well as other proprietary LLMs, are commonly used to assess the quality of the responses for various LLMs, but there are concerns about transparency, controllability, and affordability, that motivate the need of open-source LLMs specialized in evaluations.
Existing open evaluator LMs exhibit critical shortcomings:
Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. Prometheus 2 is capable of processing both direct assessment and pair-wise ranking formats grouped with user-defined evaluation criteria.
Prometheus 2 released two variants:
prometheus-eval/prometheus-7b-v2.0
: fine-tuned on top of mistralai/Mistral-7B-Instruct-v0.2
prometheus-eval/prometheus-8x7b-v2.0
: fine-tuned on top of mistralai/Mixtral-8x7B-Instruct-v0.1
Both models have been fine-tuned for both direct assessment and pairwise ranking tasks, i.e. assessing the quality of a single isolated response for a given instruction (with or without a reference answer) and assessing the quality of one response against another for a given instruction (with or without a reference answer), respectively.
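As a minimal sketch of how the two formats map onto the PrometheusEval task used later in this section (assuming the task also exposes a relative mode for pairwise ranking, in addition to the absolute mode shown below):
from distilabel.models import vLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import PrometheusEval\n\nwith Pipeline(name=\"prometheus-modes-sketch\") as pipeline:\n    # Direct assessment: rate a single isolated response for a given instruction\n    direct_assessment = PrometheusEval(\n        llm=vLLM(model=\"prometheus-eval/prometheus-7b-v2.0\"),\n        mode=\"absolute\",\n        rubric=\"factual-validity\",\n    )\n\n    # Pairwise ranking: compare two responses for the same instruction\n    # (assumes mode=\"relative\" is accepted for this format)\n    pairwise_ranking = PrometheusEval(\n        llm=vLLM(model=\"prometheus-eval/prometheus-7b-v2.0\"),\n        mode=\"relative\",\n        rubric=\"factual-validity\",\n    )\n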
On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Their models, code, and data are all publicly available at prometheus-eval/prometheus-eval
.
Note
The section is named Replication
but in this case we're not replicating the Prometheus 2 paper per se, but rather showing how to use the PrometheusEval
task implemented within distilabel
to evaluate the quality of the responses from a given instruction using the Prometheus 2 model.
To showcase Prometheus 2 we will be using the PrometheusEval
task implemented in distilabel
and a smaller dataset created by the Hugging Face H4 team named HuggingFaceH4/instruction-dataset
for testing purposes.
To reproduce the code below, one will need to install distilabel
as it follows:
pip install \"distilabel[vllm]>=1.1.0\"\n
Alternatively, it's recommended to install Dao-AILab/flash-attention
to benefit from Flash Attention 2 speed ups during inference via vllm
.
pip install flash-attn --no-build-isolation\n
Note
The installation notes above assume that you are using a VM with one GPU accelerator with at least the required VRAM to fit prometheus-eval/prometheus-7b-v2.0
in bfloat16 (28GB); but if you have enough VRAM to fit their 8x7B model in bfloat16 (~90GB) you can use prometheus-eval/prometheus-8x7b-v2.0
instead.
LoadDataFromHub
: GeneratorStep
to load a dataset from the Hugging Face Hub.
PrometheusEval
: Task
that assesses the quality of a response for a given instruction using any of the Prometheus 2 models.
vLLM
: LLM
that loads a model from the Hugging Face Hub via vllm-project/vllm.Note
Since the Prometheus 2 models use a slightly different chat template than mistralai/Mistral-7B-Instruct-v0.2
, we need to set the chat_template
parameter to [INST] {{ messages[0]['content'] }}\\n{{ messages[1]['content'] }}[/INST]
so as to properly format the input for Prometheus 2.
(Optional) KeepColumns
: Task
that keeps only the specified columns in the dataset, used to remove the undesired columns.
As mentioned before, we will put the previously mentioned building blocks together to see how Prometheus 2 can be used via distilabel
.
from distilabel.models import vLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import KeepColumns, LoadDataFromHub\nfrom distilabel.steps.tasks import PrometheusEval\n\nif __name__ == \"__main__\":\n with Pipeline(name=\"prometheus\") as pipeline:\n load_dataset = LoadDataFromHub(\n name=\"load_dataset\",\n repo_id=\"HuggingFaceH4/instruction-dataset\",\n split=\"test\",\n output_mappings={\"prompt\": \"instruction\", \"completion\": \"generation\"},\n )\n\n task = PrometheusEval(\n name=\"task\",\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]['content'] }}\\n{{ messages[1]['content'] }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"factual-validity\",\n reference=False,\n num_generations=1,\n group_generations=False,\n )\n\n keep_columns = KeepColumns(\n name=\"keep_columns\",\n columns=[\"instruction\", \"generation\", \"feedback\", \"result\", \"model_name\"],\n )\n\n load_dataset >> task >> keep_columns\n
Then we need to call pipeline.run
with the runtime parameters so that the pipeline can be launched.
distiset = pipeline.run(\n parameters={\n task.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 1024,\n \"temperature\": 0.7,\n },\n },\n },\n },\n)\n
Finally, we can optionally push the generated dataset, named Distiset
, to the Hugging Face Hub via the push_to_hub
method, so that each subset generated in the leaf steps is pushed to the Hub.
distiset.push_to_hub(\n \"instruction-dataset-prometheus\",\n private=True,\n)\n
"},{"location":"sections/pipeline_samples/papers/ultrafeedback/","title":"UltraFeedback","text":"UltraFeedback: Boosting Language Models with High-quality Feedback is a paper published by OpenBMB which proposes UltraFeedback
, a large-scale, fine-grained, diverse preference dataset, used for training powerful reward models and critic models.
UltraFeedback collects about 64k prompts from diverse resources (including UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, and FLAN), then they use these prompts to query multiple LLMs (commercial models, Llama models ranging from 7B to 70B, and non-Llama models) and generate four different responses for each prompt, resulting in a total of 256k samples, i.e. UltraFeedback will rate four responses on every OpenAI request.
To collect high-quality preference and textual feedback, they design a fine-grained annotation instruction, which contains four different aspects, namely instruction-following, truthfulness, honesty and helpfulness (even though within the paper they also mention a fifth one named verbalized calibration). Finally, GPT-4 is used to generate the ratings for the generated responses to the given prompt using the previously mentioned aspects.
"},{"location":"sections/pipeline_samples/papers/ultrafeedback/#replication","title":"Replication","text":"To replicate the paper we will be using distilabel
and a smaller dataset created by the Hugging Face H4 team named HuggingFaceH4/instruction-dataset
for testing purposes.
Also for testing purposes we will just show how to evaluate the generated responses for a given prompt using a new global aspect named overall-rating
defined by Argilla, which computes the average of the four aspects, so as to reduce the number of requests to be sent to OpenAI; note that all the aspects are implemented within distilabel
and can be used instead for a more faithful reproduction. Besides that we will generate three responses for each instruction using three LLMs selected from a pool of six: HuggingFaceH4/zephyr-7b-beta
, argilla/notus-7b-v1
, google/gemma-1.1-7b-it
, meta-llama/Meta-Llama-3-8B-Instruct
, HuggingFaceH4/zephyr-7b-gemma-v0.1
and mlabonne/UltraMerge-7B
.
To replicate UltraFeedback one will need to install distilabel
as follows:
pip install \"distilabel[argilla,openai,vllm]>=1.0.0\"\n
And since we will be using vllm
we will need to use a VM with at least 6 NVIDIA GPUs with at least 16GB of memory each to run the text generation, and set the OPENAI_API_KEY
environment variable value.
LoadDataFromHub
: Generator Step to load a dataset from the Hugging Face Hub.sample_n_steps
: Function to create a routing_batch_function
that samples n
downstream steps for each batch generated by the upstream step. This is the key to replicate the LLM pooling mechanism described in the paper.TextGeneration
: Task to generate responses for a given instruction using an LLM.vLLM
: LLM that loads a model from the Hugging Face Hub using vllm
.GroupColumns
: Task that combines multiple columns into a single one i.e. from string to list of strings. Useful when there are multiple parallel steps that are connected to the same node.UltraFeedback
: Task that generates ratings for the responses of a given instruction using the UltraFeedback prompt.OpenAILLM
: LLM that loads a model from OpenAI.KeepColumns
: Task to keep the desired columns while removing the not needed ones, as well as defining the order for those.PreferenceToArgilla
: Task to optionally push the generated dataset to Argilla to do some further analysis and human annotation.As mentioned before, we will put the previously mentioned building blocks together to replicate UltraFeedback.
from distilabel.models import OpenAILLM, vLLM\nfrom distilabel.pipeline import Pipeline, sample_n_steps\nfrom distilabel.steps import (\n GroupColumns,\n KeepColumns,\n LoadDataFromHub,\n PreferenceToArgilla,\n)\nfrom distilabel.steps.tasks import TextGeneration, UltraFeedback\n\nsample_three_llms = sample_n_steps(n=3)\n\n\nwith Pipeline(name=\"ultrafeedback-pipeline\") as pipeline:\n load_hub_dataset = LoadDataFromHub(\n name=\"load_dataset\",\n output_mappings={\"prompt\": \"instruction\"},\n batch_size=2,\n )\n\n text_generation_with_notus = TextGeneration(\n name=\"text_generation_with_notus\",\n llm=vLLM(model=\"argilla/notus-7b-v1\"),\n input_batch_size=2,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n text_generation_with_zephyr = TextGeneration(\n name=\"text_generation_with_zephyr\",\n llm=vLLM(model=\"HuggingFaceH4/zephyr-7b-gemma-v0.1\"),\n input_batch_size=2,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n text_generation_with_gemma = TextGeneration(\n name=\"text_generation_with_gemma\",\n llm=vLLM(model=\"google/gemma-1.1-7b-it\"),\n input_batch_size=2,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n text_generation_with_zephyr_gemma = TextGeneration(\n name=\"text_generation_with_zephyr_gemma\",\n llm=vLLM(model=\"HuggingFaceH4/zephyr-7b-gemma-v0.1\"),\n input_batch_size=2,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n text_generation_with_llama = TextGeneration(\n name=\"text_generation_with_llama\",\n llm=vLLM(model=\"meta-llama/Meta-Llama-3-8B-Instruct\"),\n input_batch_size=2,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n text_generation_with_ultramerge = TextGeneration(\n name=\"text_generation_with_ultramerge\",\n llm=vLLM(model=\"mlabonne/UltraMerge-7B\"),\n input_batch_size=2,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n\n combine_columns = GroupColumns(\n name=\"combine_columns\",\n columns=[\"generation\", \"generation_model\"],\n output_columns=[\"generations\", \"generation_models\"],\n input_batch_size=2\n )\n\n ultrafeedback = UltraFeedback(\n name=\"ultrafeedback_openai\",\n llm=OpenAILLM(model=\"gpt-4-turbo-2024-04-09\"),\n aspect=\"overall-rating\",\n output_mappings={\"model_name\": \"ultrafeedback_model\"},\n )\n\n keep_columns = KeepColumns(\n name=\"keep_columns\",\n columns=[\n \"instruction\",\n \"generations\",\n \"generation_models\",\n \"ratings\",\n \"rationales\",\n \"ultrafeedback_model\",\n ],\n )\n\n (\n load_hub_dataset\n >> sample_three_llms\n >> [\n text_generation_with_notus,\n text_generation_with_zephyr,\n text_generation_with_gemma,\n text_generation_with_llama,\n text_generation_with_zephyr_gemma,\n text_generation_with_ultramerge\n ]\n >> combine_columns\n >> ultrafeedback\n >> keep_columns\n )\n\n # Optional: Push the generated dataset to Argilla, but will need to `pip install argilla` first\n # push_to_argilla = PreferenceToArgilla(\n # name=\"push_to_argilla\",\n # api_url=\"<ARGILLA_API_URL>\",\n # api_key=\"<ARGILLA_API_KEY>\", # type: ignore\n # dataset_name=\"ultrafeedback\",\n # dataset_workspace=\"admin\",\n # num_generations=2,\n # )\n # keep_columns >> push_to_argilla\n
Note
As we're using a relatively small dataset, we're setting a low batch_size
and input_batch_size
so we have more batches for the routing_batch_function
i.e. we will have more variety on the LLMs used to generate the responses. When using a large dataset, it's recommended to use a larger batch_size
and input_batch_size
to benefit from the vLLM
optimizations for larger batch sizes, which makes the pipeline execution faster.
Then we need to call pipeline.run
with the runtime parameters so that the pipeline can be launched.
distiset = pipeline.run(\n parameters={\n load_hub_dataset.name: {\n \"repo_id\": \"HuggingFaceH4/instruction-dataset\",\n \"split\": \"test\",\n },\n text_generation_with_notus.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 512,\n \"temperature\": 0.7,\n }\n },\n },\n text_generation_with_zephyr.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 512,\n \"temperature\": 0.7,\n }\n },\n },\n text_generation_with_gemma.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 512,\n \"temperature\": 0.7,\n }\n },\n },\n text_generation_with_llama.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 512,\n \"temperature\": 0.7,\n }\n },\n },\n text_generation_with_zephyr_gemma.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 512,\n \"temperature\": 0.7,\n }\n },\n },\n text_generation_with_ultramerge.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 512,\n \"temperature\": 0.7,\n }\n },\n },\n ultrafeedback.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 2048,\n \"temperature\": 0.7,\n }\n },\n },\n }\n)\n
Finally, we can optionally push the generated dataset, named Distiset
, to the Hugging Face Hub via the push_to_hub
method, so that each subset generated in the leaf steps is pushed to the Hub.
distiset.push_to_hub(\n \"ultrafeedback-instruction-dataset\",\n private=True,\n)\n
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/","title":"Synthetic data generation for fine-tuning custom retrieval and reranking models","text":"!pip install \"distilabel[hf-inference-endpoints]\"\n
!pip install \"sentence-transformers~=3.0\"\n
Let's make the needed imports:
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.steps import LoadDataFromHub\n\nfrom sentence_transformers import SentenceTransformer, CrossEncoder\nimport torch\n
You'll need an HF_TOKEN
to use the HF Inference Endpoints. Login to use it directly within this notebook.
import os\nfrom huggingface_hub import login\n\nlogin(token=os.getenv(\"HF_TOKEN\"), add_to_git_credential=True)\n
!pip install \"distilabel[argilla, hf-inference-endpoints]\"\n
Let's make the extra needed imports:
import argilla as rg\n
context = (\n\"\"\"\nThe text is a chunk from technical Python SDK documentation of Argilla.\nArgilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets.\nAlong with prose explanations, the text chunk may include code snippets and Python references.\n\"\"\"\n)\n
llm = InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n tokenizer_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n)\n\nwith Pipeline(name=\"generate\") as pipeline:\n load_dataset = LoadDataFromHub(\n num_examples=15,\n output_mappings={\"chunks\": \"anchor\"},\n )\n generate_retrieval_pairs = GenerateSentencePair(\n name=\"generate_retrieval_pairs\",\n triplet=True,\n hard_negative=True,\n action=\"query\",\n llm=llm,\n input_batch_size=10,\n context=context,\n )\n generate_reranking_pairs = GenerateSentencePair(\n name=\"generate_reranking_pairs\",\n triplet=True,\n hard_negative=False, # to potentially generate non-relevant pairs\n action=\"semantically-similar\",\n llm=llm,\n input_batch_size=10,\n context=context,\n )\n\n load_dataset.connect(generate_retrieval_pairs, generate_reranking_pairs)\n
Next, we can execute this using pipeline.run
. We will provide some parameters
to specific components within our pipeline.
generation_kwargs = {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 512,\n }\n }\n}\n\ndistiset = pipeline.run( \n parameters={\n load_dataset.name: {\n \"repo_id\": \"plaguss/argilla_sdk_docs_raw_unstructured\",\n \"split\": \"train\",\n },\n generate_retrieval_pairs.name: generation_kwargs,\n generate_reranking_pairs.name: generation_kwargs,\n },\n use_cache=False, # False for demo\n)\n
Data generation can be expensive, so it is recommended to store the data somewhere. For now, we will store it on the Hugging Face Hub, using our push_to_hub
method.
distiset.push_to_hub(\"[your-owner-name]/example-retrieval-reranking-dataset\")\n
We have got 2 different leaf/end nodes, therefore we've got a distil configurations we can access, one for the retrieval data, and one for the reranking data.
Looking at these initial examples, we can see they nicely capture the essence of the chunks
column but we will need to evaluate the quality of the data a bit more before we can use it for fine-tuning.
model_id = \"Snowflake/snowflake-arctic-embed-m\" # Hugging Face model ID\n\nmodel_retrieval = SentenceTransformer(\n model_id, device=\"cuda\" if torch.cuda.is_available() else \"cpu\"\n)\n
Next, we will encode the generated text pairs and compute the similarities.
from sklearn.metrics.pairwise import cosine_similarity\n\ndef get_embeddings(texts):\n vectors = model_retrieval.encode(texts)\n return [vector.tolist() for vector in vectors]\n\n\ndef get_similarities(vector_batch_a, vector_batch_b):\n similarities = []\n for vector_a, vector_b in zip(vector_batch_a, vector_batch_b):\n similarity = cosine_similarity([vector_a], [vector_b])[0][0]\n similarities.append(similarity)\n return similarities\n\ndef format_data_retriever(batch):# -> Any:\n batch[\"anchor-vector\"] = get_embeddings(batch[\"anchor\"])\n batch[\"positive-vector\"] = get_embeddings(batch[\"positive\"])\n batch[\"negative-vector\"] = get_embeddings(batch[\"negative\"]) \n batch[\"similarity-positive-negative\"] = get_similarities(batch[\"positive-vector\"], batch[\"negative-vector\"])\n batch[\"similarity-anchor-positive\"] = get_similarities(batch[\"anchor-vector\"], batch[\"positive-vector\"])\n batch[\"similarity-anchor-negative\"] = get_similarities(batch[\"anchor-vector\"], batch[\"negative-vector\"])\n return batch\n\ndataset_generate_retrieval_pairs = distiset[\"generate_retrieval_pairs\"][\"train\"].map(format_data_retriever, batched=True, batch_size=250)\n
model_id = \"sentence-transformers/all-MiniLM-L12-v2\"\n\nmodel = CrossEncoder(model_id)\n
Next, we will compute the similarity for the generated text pairs using the reranker. On top of that, we will compute an anchor-vector
to allow for doing semantic search.
def format_data_retriever(batch):# -> Any:\n batch[\"anchor-vector\"] = get_embeddings(batch[\"anchor\"])\n batch[\"similarity-positive-negative\"] = model.predict(zip(batch[\"positive-vector\"], batch[\"negative-vector\"]))\n batch[\"similarity-anchor-positive\"] = model.predict(zip(batch[\"anchor-vector\"], batch[\"positive-vector\"]))\n batch[\"similarity-anchor-negative\"] = model.predict(zip(batch[\"anchor-vector\"], batch[\"negative-vector\"]))\n return batch\n\ndataset_generate_reranking_pairs = distiset[\"generate_reranking_pairs\"][\"train\"].map(format_data_retriever, batched=True, batch_size=250)\n
And voila, we have our proxies for quality evaluation which we can use to filter out the best and worst examples.
First, we need to define the setting for our Argilla dataset. We will create two different datasets, one for the retrieval data and one for the reranking data to ensure our annotators can focus on the task at hand.
import argilla as rg\nfrom argilla._exceptions import ConflictError\n\napi_key = \"ohh so secret\"\napi_url = \"https://[your-owner-name]-[your-space-name].hf.space\"\n\nclient = rg.Argilla(api_url=api_url, api_key=api_key)\n\nsettings = rg.Settings(\n fields=[\n rg.TextField(\"anchor\")\n ],\n questions=[\n rg.TextQuestion(\"positive\"),\n rg.TextQuestion(\"negative\"),\n rg.LabelQuestion(\n name=\"is_positive_relevant\",\n title=\"Is the positive query relevant?\",\n labels=[\"yes\", \"no\"],\n ),\n rg.LabelQuestion(\n name=\"is_negative_irrelevant\",\n title=\"Is the negative query irrelevant?\",\n labels=[\"yes\", \"no\"],\n )\n ],\n metadata=[\n rg.TermsMetadataProperty(\"filename\"),\n rg.FloatMetadataProperty(\"similarity-positive-negative\"),\n rg.FloatMetadataProperty(\"similarity-anchor-positive\"),\n rg.FloatMetadataProperty(\"similarity-anchor-negative\"),\n ],\n vectors=[\n rg.VectorField(\"anchor-vector\", dimensions=model.get_sentence_embedding_dimension())\n ]\n)\nrg_datasets = []\nfor dataset_name in [\"generate_retrieval_pairs\", \"generate_reranking_pairs\"]:\n ds = rg.Dataset(\n name=dataset_name,\n settings=settings\n )\n try:\n ds.create()\n except ConflictError:\n ds = client.datasets(dataset_name)\n rg_datasets.append(ds)\n
Now, we've got our dataset definitions setup in Argilla, we can upload our data to Argilla.
ds_datasets = [dataset_generate_retrieval_pairs, dataset_generate_reranking_pairs]\n\nrecords = []\n\nfor rg_dataset, ds_dataset in zip(rg_datasets, ds_datasets):\n for idx, entry in enumerate(ds_dataset):\n records.append(\n rg.Record(\n id=idx,\n fields={\"anchor\": entry[\"anchor\"]},\n suggestions=[\n rg.Suggestion(\"positive\", value=entry[\"positive\"], agent=\"gpt-4o\", type=\"model\"),\n rg.Suggestion(\"negative\", value=entry[\"negative\"], agent=\"gpt-4o\", type=\"model\"),\n ],\n metadata={\n \"filename\": entry[\"filename\"],\n \"similarity-positive-negative\": entry[\"similarity-positive-negative\"],\n \"similarity-anchor-positive\": entry[\"similarity-anchor-positive\"],\n \"similarity-anchor-negative\": entry[\"similarity-anchor-negative\"]\n },\n vectors={\"anchor-vector\": entry[\"anchor-vector\"]}\n )\n )\n rg_dataset.records.log(records)\n
Now, we can explore the UI and add a final human touch to get he most out of our dataset.
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#synthetic-data-generation-for-fine-tuning-custom-retrieval-and-reranking-models","title":"Synthetic data generation for fine-tuning custom retrieval and reranking models","text":"Note
For a comprehensive overview on optimizing the retrieval performance in a RAG pipeline, check this guide in collaboration with ZenML, an open-source MLOps framework designed for building portable and production-ready machine learning pipelines.
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#getting-started","title":"Getting started","text":""},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#install-the-dependencies","title":"Install the dependencies","text":"To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will be using the free but rate-limited Hugging Face serverless Inference API for this tutorial, so we need to install this as an extra distilabel dependency. You can install them by running the following command:
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#optional-deploy-argilla","title":"(optional) Deploy Argilla","text":"You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from a lack of data quality, so we do recommend looking at your data. If you already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following this guide.
Along with that, you will need to install Argilla as a distilabel extra.
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#the-dataset","title":"The dataset","text":"Before starting any project, it is always important to look at your data. Our data is publicly available on the Hugging Face Hub so we can have a quick look through their dataset viewer within an embedded iFrame.
As we can see, our dataset contains a column called chunks
, which was obtained from the Argilla docs. Normally, you would need to download and chunk the data but we will not cover that in this tutorial. To read a full explanation for how this dataset was generated, please refer to How we leveraged distilabel to create an Argilla 2.0 Chatbot.
Alternatively, we can load the entire dataset to disk with datasets.load_dataset
.
The GenerateSentencePair
component from distilabel
can be used to generate training datasets for embeddings models.
It is a pre-defined Task
that given an anchor
sentence generate data for a specific action
. Supported actions are: \"paraphrase\", \"semantically-similar\", \"query\", \"answer\"
. In our case the chunks
column corresponds to the anchor
. This means we will use query
to generate potential queries for a fine-tuning a retrieval model and that we will use semantically-similar
to generate texts that are similar to the intial anchor for fine-tuning a reranking model.
We will triplet=True
in order to generate both positive and negative examples, which should help the model generalize better during fine-tuning and we will set hard_negative=True
to generate more challenging examples that are closer to the anchor and discussed topics.
Lastly, we can seed the LLM with context
to generate more relevant examples.
For retrieval, we will thus generate queries that are similar to the chunks
column. We will use the query
action to generate potential queries for a fine-tuning a retrieval model.
generate_sentence_pair = GenerateSentencePair(\n triplet=True, \n hard_negative=True,\n action=\"query\",\n llm=llm,\n input_batch_size=10,\n context=context,\n)\n
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#reranking","title":"Reranking","text":"For reranking, we will generate texts that are similar to the intial anchor. We will use the semantically-similar
action to generate texts that are similar to the intial anchor for fine-tuning a reranking model. In this case, we set hard_negative=False
to generate more diverse and potentially wrong examples, which can be used as negative examples for similarity fine-tuning because rerankers cannot be fine-tuned using triplets.
generate_sentence_pair = GenerateSentencePair(\n triplet=True,\n hard_negative=False,\n action=\"semantically-similar\",\n llm=llm,\n input_batch_size=10,\n context=context,\n)\n
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#combined-pipeline","title":"Combined pipeline","text":"We will now use the GenerateSentencePair
task to generate synthetic data for both retrieval and reranking models in a single pipeline. Note that, we map the chunks
column to the anchor
argument.
Data is never as clean as it can be and this also holds for synthetically generated data too, therefore, it is always good to spent some time and look at your data.
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#feature-engineering","title":"Feature engineering","text":"In order to evaluate the quality of our data we will use features of the models that we intent to fine-tune as proxy for data quality. We can then use these features to filter out the best examples.
In order to choose a good default model, we will use the Massive Text Embedding Benchmark (MTEB) Leaderboard. We want to optimize for size and speed, so we will set model size <100M
and then filter for Retrieval
and Reranking
based on the highest average score, resulting in Snowflake/snowflake-arctic-embed-s and sentence-transformers/all-MiniLM-L12-v2 respectively.
For retrieval, we will compute similarities for the current embeddings of anchor-positive
, positive-negative
and anchor-negative
pairs. We assume that an overlap of these similarities will cause the model to have difficulties generalizing and therefore we can use these features to evaluate the quality of our data.
For reranking, we will compute the compute the relevance scores from an existing reranker model for anchor-positive
, positive-negative
and anchor-negative
pais and make a similar assumption as for the retrieval model.
To get the most out of you data and actually look at our data, we will use Argilla. If you are not familiar with Argilla, we recommend taking a look at the Argilla quickstart docs. Alternatively, you can use your Hugging Face account to login to the Argilla demo Space.
To start exploring data, we first need to define an argilla.Dataset
. We will create a basic datset with some input TextFields
for the anchor
and output TextQuestions
for the positive
and negative
pairs. Additionally, we will use the file_name
as MetaDataProperty
. Lastly, we will be re-using the vectors obtained from our previous step to allow for semantic search and we will add te similarity scores for some basic filtering and sorting.
At last, we can fine-tune our models. We will use the sentence-transformers
library to fine-tune our models.
For retrieval, we have created a script that fine-tunes a model on our generated data the generated data based https://github.com/argilla-io/argilla-sdk-chatbot/blob/main/train_embedding.ipynb.You can also open it in Google Colab directly.
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#reranking_2","title":"Reranking","text":"For reranking, sentence-transformers
provides a script that shows how to fine-tune a CrossEncoder models. Ad of now, there is some uncertainty over fine-tuning CrossEncoder models with triplets but you can still use the positive
and anchor
In this tutorial, we present an end-to-end example of fine-tuning retrievers and rerankers for RAG. This serves as a good starting point for optimizing and maintaining your data and model but need to be adapted to your specific use case.
We started with some seed data from the Argilla docs, generated synthetic data for retrieval and reranking models, evaluated the quality of the data, and showed how to fine-tune the models. We also used Argilla to get a human touch on the data.
"},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/","title":"Clean an existing preference dataset","text":"!pip install \"distilabel[hf-inference-endpoints]\"\n
!pip install \"transformers~=4.0\" \"torch~=2.0\"\n
Let's make the required imports:
import random\n\nfrom datasets import load_dataset\n\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import (\n KeepColumns,\n LoadDataFromDicts,\n PreferenceToArgilla,\n)\nfrom distilabel.steps.tasks import UltraFeedback\n
You'll need an HF_TOKEN
to use the HF Inference Endpoints. Login to use it directly within this notebook.
import os\nfrom huggingface_hub import login\n\nlogin(token=os.getenv(\"HF_TOKEN\"), add_to_git_credential=True)\n
!pip install \"distilabel[argilla, hf-inference-endpoints]\"\n
In this case, we will clean a preference dataset, so we will use the Intel/orca_dpo_pairs
dataset from the Hugging Face Hub.
dataset = load_dataset(\"Intel/orca_dpo_pairs\", split=\"train[:20]\")\n
Next, we will shuffle the chosen
and rejected
columns to avoid any bias in the dataset.
def shuffle_and_track(chosen, rejected):\n pair = [chosen, rejected]\n random.shuffle(pair)\n order = [\"chosen\" if x == chosen else \"rejected\" for x in pair]\n return {\"generations\": pair, \"order\": order}\n\ndataset = dataset.map(lambda x: shuffle_and_track(x[\"chosen\"], x[\"rejected\"]))\n
dataset = dataset.to_list()\n
As a custom step You can also create a custom step in a separate module, import it and add it to the pipeline after loading the orca_dpo_pairs
dataset using the LoadDataFromHub
step.
from typing import TYPE_CHECKING, List\nfrom distilabel.steps import GlobalStep, StepInput\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepOutput\n\nimport random\n\nclass ShuffleStep(GlobalStep):\n @property\n def inputs(self):\n \"\"\"Returns List[str]: The inputs of the step.\"\"\"\n return [\"instruction\", \"chosen\", \"rejected\"]\n\n @property\n def outputs(self):\n \"\"\"Returns List[str]: The outputs of the step.\"\"\"\n return [\"instruction\", \"generations\", \"order\"]\n\n def process(self, inputs: StepInput):\n \"\"\"Returns StepOutput: The outputs of the step.\"\"\"\n outputs = []\n\n for input in inputs:\n chosen = input[\"chosen\"]\n rejected = input[\"rejected\"]\n pair = [chosen, rejected]\n random.shuffle(pair)\n order = [\"chosen\" if x == chosen else \"rejected\" for x in pair]\n\n outputs.append({\"instruction\": input[\"instruction\"], \"generations\": pair, \"order\": order})\n\n yield outputs\n
from shuffle_step import ShuffleStep\n
To clean an existing preference dataset, we will need to define a Pipeline
with all the necessary steps. However, a similar workflow can be used to clean a SFT dataset. Below, we will go over each step in detail.
load_dataset = LoadDataFromDicts(\n data=dataset[:1],\n output_mappings={\"question\": \"instruction\"},\n pipeline=Pipeline(name=\"showcase-pipeline\"),\n)\nload_dataset.load()\nnext(load_dataset.process())\n
\n([{'system': '',\n 'question': \"You will be given a definition of a task first, then some input of the task.\\nThis task is about using the specified sentence and converting the sentence to Resource Description Framework (RDF) triplets of the form (subject, predicate object). The RDF triplets generated must be such that the triplets accurately capture the structure and semantics of the input sentence. The input is a sentence and the output is a list of triplets of the form [subject, predicate, object] that capture the relationships present in the sentence. When a sentence has more than 1 RDF triplet possible, the output must contain all of them.\\n\\nAFC Ajax (amateurs)'s ground is Sportpark De Toekomst where Ajax Youth Academy also play.\\nOutput:\",\n 'chosen': '[\\n [\"AFC Ajax (amateurs)\", \"has ground\", \"Sportpark De Toekomst\"],\\n [\"Ajax Youth Academy\", \"plays at\", \"Sportpark De Toekomst\"]\\n]',\n 'rejected': \" Sure, I'd be happy to help! Here are the RDF triplets for the input sentence:\\n\\n[AFC Ajax (amateurs), hasGround, Sportpark De Toekomst]\\n[Ajax Youth Academy, playsAt, Sportpark De Toekomst]\\n\\nExplanation:\\n\\n* AFC Ajax (amateurs) is the subject of the first triplet, and hasGround is the predicate that describes the relationship between AFC Ajax (amateurs) and Sportpark De Toekomst.\\n* Ajax Youth Academy is the subject of the second triplet, and playsAt is the predicate that describes the relationship between Ajax Youth Academy and Sportpark De Toekomst.\\n\\nNote that there may be other possible RDF triplets that could be derived from the input sentence, but the above triplets capture the main relationships present in the sentence.\",\n 'generations': [\" Sure, I'd be happy to help! Here are the RDF triplets for the input sentence:\\n\\n[AFC Ajax (amateurs), hasGround, Sportpark De Toekomst]\\n[Ajax Youth Academy, playsAt, Sportpark De Toekomst]\\n\\nExplanation:\\n\\n* AFC Ajax (amateurs) is the subject of the first triplet, and hasGround is the predicate that describes the relationship between AFC Ajax (amateurs) and Sportpark De Toekomst.\\n* Ajax Youth Academy is the subject of the second triplet, and playsAt is the predicate that describes the relationship between Ajax Youth Academy and Sportpark De Toekomst.\\n\\nNote that there may be other possible RDF triplets that could be derived from the input sentence, but the above triplets capture the main relationships present in the sentence.\",\n '[\\n [\"AFC Ajax (amateurs)\", \"has ground\", \"Sportpark De Toekomst\"],\\n [\"Ajax Youth Academy\", \"plays at\", \"Sportpark De Toekomst\"]\\n]'],\n 'order': ['rejected', 'chosen']}],\n True)
\n
evaluate_responses = UltraFeedback(\n aspect=\"overall-rating\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n ),\n pipeline=Pipeline(name=\"showcase-pipeline\"),\n)\nevaluate_responses.load()\nnext(\n evaluate_responses.process(\n [\n {\n \"instruction\": \"What's the capital of Spain?\",\n \"generations\": [\"Madrid\", \"Barcelona\"],\n }\n ]\n )\n)\n
\n[{'instruction': \"What's the capital of Spain?\",\n 'generations': ['Madrid', 'Barcelona'],\n 'ratings': [5, 1],\n 'rationales': [\"The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.\",\n \"The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent.\"],\n 'distilabel_metadata': {'raw_output_ultra_feedback_0': \"#### Output for Text 1\\nRating: 5 (Excellent)\\nRationale: The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.\\n\\n#### Output for Text 2\\nRating: 1 (Low Quality)\\nRationale: The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent.\"},\n 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
\n
keep_columns = KeepColumns(\n columns=[\n \"instruction\",\n \"generations\",\n \"order\",\n \"ratings\",\n \"rationales\",\n \"model_name\",\n ],\n pipeline=Pipeline(name=\"showcase-pipeline\"),\n)\nkeep_columns.load()\nnext(\n keep_columns.process(\n [\n {\n \"system\": \"\",\n \"instruction\": \"What's the capital of Spain?\",\n \"chosen\": \"Madrid\",\n \"rejected\": \"Barcelona\",\n \"generations\": [\"Madrid\", \"Barcelona\"],\n \"order\": [\"chosen\", \"rejected\"],\n \"ratings\": [5, 1],\n \"rationales\": [\"\", \"\"],\n \"model_name\": \"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n }\n ]\n )\n)\n
\n[{'instruction': \"What's the capital of Spain?\",\n 'generations': ['Madrid', 'Barcelona'],\n 'order': ['chosen', 'rejected'],\n 'ratings': [5, 1],\n 'rationales': ['', ''],\n 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
\n
to_argilla = PreferenceToArgilla(\n dataset_name=\"cleaned-dataset\",\n dataset_workspace=\"argilla\",\n api_url=\"https://[your-owner-name]-[your-space-name].hf.space\",\n api_key=\"[your-api-key]\",\n num_generations=2\n)\n
Below, you can see the full pipeline definition:
with Pipeline(name=\"clean-dataset\") as pipeline:\n\n load_dataset = LoadDataFromDicts(\n data=dataset, output_mappings={\"question\": \"instruction\"}\n )\n\n evaluate_responses = UltraFeedback(\n aspect=\"overall-rating\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n ),\n )\n\n keep_columns = KeepColumns(\n columns=[\n \"instruction\",\n \"generations\",\n \"order\",\n \"ratings\",\n \"rationales\",\n \"model_name\",\n ]\n )\n\n to_argilla = PreferenceToArgilla(\n dataset_name=\"cleaned-dataset\",\n dataset_workspace=\"argilla\",\n api_url=\"https://[your-owner-name]-[your-space-name].hf.space\",\n api_key=\"[your-api-key]\",\n num_generations=2,\n )\n\n load_dataset.connect(evaluate_responses)\n evaluate_responses.connect(keep_columns)\n keep_columns.connect(to_argilla)\n
Let's now run the pipeline and clean our preference dataset.
distiset = pipeline.run()\n
Let's check it! If you have loaded the data to Argilla, you can start annotating in the Argilla UI.
You can push the dataset to the Hub for sharing with the community and embed it to explore the data.
distiset.push_to_hub(\"[your-owner-name]/example-cleaned-preference-dataset\")\n
In this tutorial, we showcased the detailed steps to build a pipeline for cleaning a preference dataset using distilabel. However, you can customize this pipeline for your own use cases, such as cleaning an SFT dataset or adding custom steps.
We used a preference dataset as our starting point and shuffled the data to avoid any bias. Next, we evaluated the responses using a model through the serverless Hugging Face Inference API, following the UltraFeedback standards. Finally, we kept the needed columns and used Argilla for further curation.
"},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/#clean-an-existing-preference-dataset","title":"Clean an existing preference dataset","text":""},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/#getting-started","title":"Getting Started","text":""},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/#install-the-dependencies","title":"Install the dependencies","text":"To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will be using the free but rate-limited Hugging Face serverless Inference API for this tutorial, so we need to install this as an extra distilabel dependency. You can install them by running the following command:
"},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/#optional-deploy-argilla","title":"(optional) Deploy Argilla","text":"You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from a lack of data quality, so we do recommend looking at your data. If you already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following this guide.
Along with that, you will need to install Argilla as a distilabel extra.
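For reference, the Argilla extra is installed together with the Inference Endpoints extra, mirroring the install cell that appears later in this tutorial:
!pip install \"distilabel[argilla, hf-inference-endpoints]\"\n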
"},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/#the-dataset","title":"The dataset","text":""},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/#define-the-pipeline","title":"Define the pipeline","text":""},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/#load-the-dataset","title":"Load the dataset","text":"We will use the dataset we just shuffled as source data.
Component: LoadDataFromDicts. Input columns: system, question, chosen, rejected, generations and order, the same keys as in the loaded list of dictionaries. Output columns: system, instruction, chosen, rejected, generations and order. We will use output_mappings to rename the columns.
To evaluate the quality of the responses, we will use meta-llama/Meta-Llama-3.1-70B-Instruct, applying the UltraFeedback task that judges the responses according to different dimensions (helpfulness, honesty, instruction-following, truthfulness). For an SFT dataset, you can use PrometheusEval instead.
Component: UltraFeedback task with LLMs using InferenceEndpointsLLM. Input columns: instruction, generations. Output columns: ratings, rationales, distilabel_metadata, model_name.
For your use case and to improve the results, you can use any other LLM of your choice.
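As a minimal sketch of such a swap, the judge could be backed by an OpenAI-hosted model instead of Inference Endpoints; the model name below is only an illustrative placeholder and an OPENAI_API_KEY is assumed to be available:
from distilabel.models import OpenAILLM\nfrom distilabel.steps.tasks import UltraFeedback\n\n# Hypothetical alternative judge: same UltraFeedback task, different LLM backend.\n# The model name is a placeholder and an OPENAI_API_KEY is assumed to be set.\nevaluate_responses = UltraFeedback(\n    aspect=\"overall-rating\",\n    llm=OpenAILLM(\n        model=\"gpt-4o-mini\",\n        generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n    ),\n)\n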
"},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/#keep-only-the-required-columns","title":"Keep only the required columns","text":"We will get rid of the unneeded columns.
Component: KeepColumns. Input columns: system, instruction, chosen, rejected, generations, ratings, rationales, distilabel_metadata and model_name. Output columns: instruction, generations, order, ratings, rationales and model_name, matching the columns kept in the step above.
You can use Argilla to further curate your data.
Component: PreferenceToArgilla step. Input columns: instruction, generations, generation_models, ratings. Output columns: instruction, generations, generation_models, ratings.
!pip install \"distilabel[hf-inference-endpoints]\"\n
!pip install \"transformers~=4.0\" \"torch~=2.0\"\n
Let's make the required imports:
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import (\n LoadDataFromHub,\n GroupColumns,\n FormatTextGenerationDPO,\n PreferenceToArgilla,\n)\nfrom distilabel.steps.tasks import TextGeneration, UltraFeedback\n
You'll need an HF_TOKEN
to use the HF Inference Endpoints. Log in to use it directly within this notebook.
import os\nfrom huggingface_hub import login\n\nlogin(token=os.getenv(\"HF_TOKEN\"), add_to_git_credential=True)\n
!pip install \"distilabel[argilla, hf-inference-endpoints]\"\n
To generate our preference dataset, we will need to define a Pipeline
with all the necessary steps. Below, we will go over each step in detail.
load_dataset = LoadDataFromHub(\n repo_id= \"argilla/10Kprompts-mini\",\n num_examples=1,\n pipeline=Pipeline(name=\"showcase-pipeline\"),\n )\nload_dataset.load()\nnext(load_dataset.process())\n
\n([{'instruction': 'How can I create an efficient and robust workflow that utilizes advanced automation techniques to extract targeted data, including customer information, from diverse PDF documents and effortlessly integrate it into a designated Google Sheet? Furthermore, I am interested in establishing a comprehensive and seamless system that promptly activates an SMS notification on my mobile device whenever a new PDF document is uploaded to the Google Sheet, ensuring real-time updates and enhanced accessibility.',\n 'topic': 'Software Development'}],\n True)
\n
generate_responses = [\n TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n ),\n pipeline=Pipeline(name=\"showcase-pipeline\"),\n ),\n TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n tokenizer_id=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n ),\n pipeline=Pipeline(name=\"showcase-pipeline\"),\n ),\n]\nfor task in generate_responses:\n task.load()\n print(next(task.process([{\"instruction\": \"Which are the top cities in Spain?\"}])))\n
\n[{'instruction': 'Which are the top cities in Spain?', 'generation': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\\n\\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Fam\u00edlia and Park G\u00fcell, designed by Antoni Gaud\u00ed.\\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alc\u00e1zar Palace, and lively flamenco music scene.\\n5. **M\u00e1laga**: A coastal city in southern Spain, M\u00e1laga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\\n9. **Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\\n10. **San Sebasti\u00e1n**: A city in the Basque Country, San Sebasti\u00e1n is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebasti\u00e1n International Film Festival.\\n\\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.', 'distilabel_metadata': {'raw_output_text_generation_0': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\\n\\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Fam\u00edlia and Park G\u00fcell, designed by Antoni Gaud\u00ed.\\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alc\u00e1zar Palace, and lively flamenco music scene.\\n5. **M\u00e1laga**: A coastal city in southern Spain, M\u00e1laga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\\n9. 
**Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\\n10. **San Sebasti\u00e1n**: A city in the Basque Country, San Sebasti\u00e1n is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebasti\u00e1n International Film Festival.\\n\\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.'}, 'model_name': 'meta-llama/Meta-Llama-3-8B-Instruct'}]\n[{'instruction': 'Which are the top cities in Spain?', 'generation': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\\n\\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\\n\\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaud\u00ed, including the Sagrada Familia and Park G\u00fcell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\\n\\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\\n\\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alc\u00e1zar, a beautiful palace made up of a series of rooms and courtyards.\\n\\n5. Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\\n\\n6. Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\\n\\n7. M\u00e1laga: A coastal city in Andalusia, M\u00e1laga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\\n\\nThese are just a few of the many wonderful cities in Spain.', 'distilabel_metadata': {'raw_output_text_generation_0': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\\n\\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\\n\\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaud\u00ed, including the Sagrada Familia and Park G\u00fcell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\\n\\n3. 
Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\\n\\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alc\u00e1zar, a beautiful palace made up of a series of rooms and courtyards.\\n\\n5. Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\\n\\n6. Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\\n\\n7. M\u00e1laga: A coastal city in Andalusia, M\u00e1laga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\\n\\nThese are just a few of the many wonderful cities in Spain.'}, 'model_name': 'mistralai/Mixtral-8x7B-Instruct-v0.1'}]\n
\n
group_responses = GroupColumns(\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"model_names\"],\n pipeline=Pipeline(name=\"showcase-pipeline\"),\n)\nnext(\n group_responses.process(\n [\n {\n \"generation\": \"Madrid\",\n \"model_name\": \"meta-llama/Meta-Llama-3-8B-Instruct\",\n },\n ],\n [\n {\n \"generation\": \"Barcelona\",\n \"model_name\": \"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n }\n ],\n )\n)\n
\n[{'generations': ['Madrid', 'Barcelona'],\n 'model_names': ['meta-llama/Meta-Llama-3-8B-Instruct',\n 'mistralai/Mixtral-8x7B-Instruct-v0.1']}]
\n
evaluate_responses = UltraFeedback(\n aspect=\"overall-rating\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n ),\n pipeline=Pipeline(name=\"showcase-pipeline\"),\n)\nevaluate_responses.load()\nnext(\n evaluate_responses.process(\n [\n {\n \"instruction\": \"What's the capital of Spain?\",\n \"generations\": [\"Madrid\", \"Barcelona\"],\n }\n ]\n )\n)\n
\n[{'instruction': \"What's the capital of Spain?\",\n 'generations': ['Madrid', 'Barcelona'],\n 'ratings': [5, 1],\n 'rationales': [\"The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.\",\n \"The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent.\"],\n 'distilabel_metadata': {'raw_output_ultra_feedback_0': \"#### Output for Text 1\\nRating: 5 (Excellent)\\nRationale: The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.\\n\\n#### Output for Text 2\\nRating: 1 (Low Quality)\\nRationale: The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent.\"},\n 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'}]
\n
format_dpo = FormatTextGenerationDPO(pipeline=Pipeline(name=\"showcase-pipeline\"))\nformat_dpo.load()\nnext(\n format_dpo.process(\n [\n {\n \"instruction\": \"What's the capital of Spain?\",\n \"generations\": [\"Madrid\", \"Barcelona\"],\n \"generation_models\": [\n \"Meta-Llama-3-8B-Instruct\",\n \"Mixtral-8x7B-Instruct-v0.1\",\n ],\n \"ratings\": [5, 1],\n }\n ]\n )\n)\n
\n[{'instruction': \"What's the capital of Spain?\",\n 'generations': ['Madrid', 'Barcelona'],\n 'generation_models': ['Meta-Llama-3-8B-Instruct',\n 'Mixtral-8x7B-Instruct-v0.1'],\n 'ratings': [5, 1],\n 'prompt': \"What's the capital of Spain?\",\n 'prompt_id': '26174c953df26b3049484e4721102dca6b25d2de9e3aa22aa84f25ed1c798512',\n 'chosen': [{'role': 'user', 'content': \"What's the capital of Spain?\"},\n {'role': 'assistant', 'content': 'Madrid'}],\n 'chosen_model': 'Meta-Llama-3-8B-Instruct',\n 'chosen_rating': 5,\n 'rejected': [{'role': 'user', 'content': \"What's the capital of Spain?\"},\n {'role': 'assistant', 'content': 'Barcelona'}],\n 'rejected_model': 'Mixtral-8x7B-Instruct-v0.1',\n 'rejected_rating': 1}]
\n
Component: PreferenceToArgilla step. Input columns: instruction, generations, generation_models, ratings. Output columns: instruction, generations, generation_models, ratings.
to_argilla = PreferenceToArgilla(\n dataset_name=\"preference-dataset\",\n dataset_workspace=\"argilla\",\n api_url=\"https://[your-owner-name]-[your-space-name].hf.space\",\n api_key=\"[your-api-key]\",\n num_generations=2\n)\n
Below, you can see the full pipeline definition:
with Pipeline(name=\"generate-dataset\") as pipeline:\n\n load_dataset = LoadDataFromHub(repo_id=\"argilla/10Kprompts-mini\")\n\n generate_responses = [\n TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n )\n ),\n TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n tokenizer_id=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n )\n ),\n ]\n\n group_responses = GroupColumns(\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"model_names\"],\n )\n\n evaluate_responses = UltraFeedback(\n aspect=\"overall-rating\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n )\n )\n\n format_dpo = FormatTextGenerationDPO()\n\n to_argilla = PreferenceToArgilla(\n dataset_name=\"preference-dataset\",\n dataset_workspace=\"argilla\",\n api_url=\"https://[your-owner-name]-[your-space-name].hf.space\",\n api_key=\"[your-api-key]\",\n num_generations=2\n )\n\n for task in generate_responses:\n load_dataset.connect(task)\n task.connect(group_responses)\n group_responses.connect(evaluate_responses)\n evaluate_responses.connect(format_dpo, to_argilla)\n
Let's now run the pipeline and generate the preference dataset.
distiset = pipeline.run()\n
Let's check the preference dataset! If you have loaded the data to Argilla, you can start annotating in the Argilla UI.
You can push the dataset to the Hub for sharing with the community and embed it to explore the data.
distiset.push_to_hub(\"[your-owner-name]/example-preference-dataset\")\n
In this tutorial, we showcased the detailed steps to build a pipeline for generating a preference dataset using distilabel. You can customize this pipeline for your own use cases and share your datasets with the community through the Hugging Face Hub, or use them to train a model for DPO or ORPO.
We used a dataset containing prompts to generate responses using two different models through the serverless Hugging Face Inference API. Next, we evaluated the responses using a third model, following the UltraFeedback standards. Finally, we converted the data to a preference dataset and used Argilla for further curation.
"},{"location":"sections/pipeline_samples/tutorials/generate_preference_dataset/#generate-a-preference-dataset","title":"Generate a preference dataset","text":""},{"location":"sections/pipeline_samples/tutorials/generate_preference_dataset/#getting-started","title":"Getting started","text":""},{"location":"sections/pipeline_samples/tutorials/generate_preference_dataset/#install-the-dependencies","title":"Install the dependencies","text":"To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will be using the free but rate-limited Hugging Face serverless Inference API for this tutorial, so we need to install this as an extra distilabel dependency. You can install them by running the following command:
"},{"location":"sections/pipeline_samples/tutorials/generate_preference_dataset/#optional-deploy-argilla","title":"(optional) Deploy Argilla","text":"You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from a lack of data quality, so we do recommend looking at your data. If you already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following this guide.
Along with that, you will need to install Argilla as a distilabel extra.
"},{"location":"sections/pipeline_samples/tutorials/generate_preference_dataset/#define-the-pipeline","title":"Define the pipeline","text":""},{"location":"sections/pipeline_samples/tutorials/generate_preference_dataset/#load-the-dataset","title":"Load the dataset","text":"We will use as source data the argilla/10Kprompts-mini
dataset from the Hugging Face Hub.
Component: LoadDataFromHub. Input columns: instruction and topic, the same as in the loaded dataset. Output columns: instruction and topic.
We need to generate the responses for the given instructions. We will use two different models available on the Hugging Face Hub through the Serverless Inference API: meta-llama/Meta-Llama-3-8B-Instruct
and mistralai/Mixtral-8x7B-Instruct-v0.1
. We will also indicate the generation parameters for each model.
Component: TextGeneration task with LLMs using InferenceEndpointsLLM. Input columns: instruction. Output columns: generation, distilabel_metadata and model_name for each model.
For your use case and to improve the results, you can use any other LLM of your choice.
"},{"location":"sections/pipeline_samples/tutorials/generate_preference_dataset/#group-the-responses","title":"Group the responses","text":"The task to evaluate the responses needs as input a list of generations. However, each model response was saved in the generation column of the subsets text_generation_0
and text_generation_1
. We will combine these two columns into a single column within the default subset.
Component: GroupColumns. Input columns: generation and model_name from text_generation_0 and text_generation_1. Output columns: generations and model_names.
To build our preference dataset, we need to evaluate the responses generated by the models. We will use meta-llama/Meta-Llama-3-70B-Instruct
for this, applying the UltraFeedback
task that judges the responses according to different dimensions (helpfulness, honesty, instruction-following, truthfulness).
Component: UltraFeedback task with LLMs using InferenceEndpointsLLM. Input columns: instruction, generations. Output columns: ratings, rationales, distilabel_metadata, model_name.
For your use case and to improve the results, you can use any other LLM of your choice.
"},{"location":"sections/pipeline_samples/tutorials/generate_preference_dataset/#convert-to-a-preference-dataset","title":"Convert to a preference dataset","text":"chosen
and rejected
columns.FormatTextGenerationDPO
stepinstruction
, generations
, generation_models
, ratings
prompt
, prompt_id
, chosen
, chosen_model
, chosen_rating
, rejected
, rejected_model
, rejected_rating
!pip install \"distilabel[hf-inference-endpoints]\"\n
!pip install \"transformers~=4.40\" \"torch~=2.0\" \"setfit~=1.0\"\n
Let's make the required imports:
import random\nfrom collections import Counter\n\nfrom datasets import load_dataset, Dataset\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import (\n GenerateTextClassificationData,\n)\nfrom setfit import SetFitModel, Trainer, sample_dataset\n
You'll need an HF_TOKEN
to use the HF Inference Endpoints. Log in to use it directly within this notebook.
import os\nfrom huggingface_hub import login\n\nlogin(token=os.getenv(\"HF_TOKEN\"), add_to_git_credential=True)\n
!pip install \"distilabel[argilla, hf-inference-endpoints]\"\n
We will use the fancyzhx/ag_news
dataset from the Hugging Face Hub as our original data source. To simulate a real-world scenario with imbalanced and limited data, we will load only 20 samples from this dataset.
hf_dataset = load_dataset(\"fancyzhx/ag_news\", split=\"train[-20:]\")\n
Now, we can retrieve the available labels in the dataset and examine the current data distribution.
labels_topic = hf_dataset.features[\"label\"].names\nid2str = {i: labels_topic[i] for i in range(len(labels_topic))}\nprint(id2str)\nprint(Counter(hf_dataset[\"label\"]))\n
\n{0: 'World', 1: 'Sports', 2: 'Business', 3: 'Sci/Tech'}\nCounter({0: 12, 1: 6, 2: 2})\n
\n
As observed, the dataset is imbalanced, with most samples falling under the World
category, while the Sci/Tech
category is entirely missing. Moreover, there are insufficient samples to effectively train a topic classification model.
We will also define the labels for the new classification task.
labels_fact_opinion = [\"Fact-based\", \"Opinion-based\"]\n
To generate the data, we will use the GenerateTextClassificationData task. This task takes a classification task description as input, and lets us define the language, difficulty and clarity required for the generated data.
task = GenerateTextClassificationData(\n language=\"English\",\n difficulty=\"college\",\n clarity=\"clear\",\n num_generations=1,\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.4},\n ),\n input_batch_size=5,\n)\ntask.load()\nresult = next(\n task.process([{\"task\": \"Classify the news article as fact-based or opinion-based\"}])\n)\nprint(result[0][\"distilabel_metadata\"][\"raw_input_generate_text_classification_data_0\"])\n
\n[{'role': 'user', 'content': 'You have been assigned a text classification task: Classify the news article as fact-based or opinion-based\\n\\nYour mission is to write one text classification example for this task in JSON format. The JSON object must contain the following keys:\\n - \"input_text\": a string, the input text specified by the classification task.\\n - \"label\": a string, the correct label of the input text.\\n - \"misleading_label\": a string, an incorrect label that is related to the task.\\n\\nPlease adhere to the following guidelines:\\n - The \"input_text\" should be diverse in expression.\\n - The \"misleading_label\" must be a valid label for the given task, but not as appropriate as the \"label\" for the \"input_text\".\\n - The values for all fields should be in English.\\n - Avoid including the values of the \"label\" and \"misleading_label\" fields in the \"input_text\", that would make the task too easy.\\n - The \"input_text\" is clear and requires college level education to comprehend.\\n\\nYour output must always be a JSON object only, do not explain yourself or output anything else. Be creative!'}]\n
\n
For our use case, we only need to generate data for two tasks: a topic classification task and a fact-versus-opinion classification task, so we will define the tasks accordingly. As we will be using a smaller model for generation, we will select two random labels for each topic classification task and vary the label order for the fact-versus-opinion classification task, ensuring more diversity in the generated data.
task_templates = [\n \"Determine the news article as {}\",\n \"Classify news article as {}\",\n \"Identify the news article as {}\",\n \"Categorize the news article as {}\",\n \"Label the news article using {}\",\n \"Annotate the news article based on {}\",\n \"Determine the theme of a news article from {}\",\n \"Recognize the topic of the news article as {}\",\n]\n\nclassification_tasks = [\n {\"task\": action.format(\" or \".join(random.sample(labels_topic, 2)))}\n for action in task_templates for _ in range(4)\n] + [\n {\"task\": action.format(\" or \".join(random.sample(labels_fact_opinion, 2)))}\n for action in task_templates\n]\n
Now, it's time to define and run the pipeline. As mentioned, we will load the written tasks and feed them into the GenerateTextClassificationData
task. For our use case, we will be using Meta-Llama-3.1-8B-Instruct
via the InferenceEndpointsLLM
, with different degrees of difficulty and clarity.
difficulties = [\"college\", \"high school\", \"PhD\"]\nclarity = [\"clear\", \"understandable with some effort\", \"ambiguous\"]\n\nwith Pipeline(\"texcat-generation-pipeline\") as pipeline:\n\n tasks_generator = LoadDataFromDicts(data=classification_tasks)\n\n generate_data = []\n for difficulty in difficulties:\n for clarity_level in clarity:\n task = GenerateTextClassificationData(\n language=\"English\",\n difficulty=difficulty,\n clarity=clarity_level,\n num_generations=2,\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n ),\n input_batch_size=5,\n )\n generate_data.append(task)\n\n for task in generate_data:\n tasks_generator.connect(task)\n
Let's now run the pipeline and generate the synthetic data.
distiset = pipeline.run()\n
distiset[\"generate_text_classification_data_0\"][\"train\"][0]\n
\n{'task': 'Determine the news article as Business or World',\n 'input_text': \"The recent decision by the European Central Bank to raise interest rates will likely have a significant impact on the eurozone's economic growth, with some analysts predicting a 0.5% contraction in GDP due to the increased borrowing costs. The move is seen as a measure to combat inflation, which has been rising steadily over the past year.\",\n 'label': 'Business',\n 'misleading_label': 'World',\n 'distilabel_metadata': {'raw_output_generate_text_classification_data_0': '{\\n \"input_text\": \"The recent decision by the European Central Bank to raise interest rates will likely have a significant impact on the eurozone\\'s economic growth, with some analysts predicting a 0.5% contraction in GDP due to the increased borrowing costs. The move is seen as a measure to combat inflation, which has been rising steadily over the past year.\",\\n \"label\": \"Business\",\\n \"misleading_label\": \"World\"\\n}'},\n 'model_name': 'meta-llama/Meta-Llama-3.1-8B-Instruct'}
\n
You can push the dataset to the Hub for sharing with the community and embed it to explore the data.
distiset.push_to_hub(\"[your-owner-name]/example-texcat-generation-dataset\")\n
By examining the distiset distribution, we can confirm that it includes at least the 8 required samples for each label to train our classification models with SetFit.
all_labels = [\n entry[\"label\"]\n for dataset_name in distiset\n for entry in distiset[dataset_name][\"train\"]\n]\n\nCounter(all_labels)\n
\nCounter({'Sci/Tech': 275,\n 'Business': 130,\n 'World': 86,\n 'Fact-based': 86,\n 'Sports': 64,\n 'Opinion-based': 54,\n None: 20,\n 'Opinion Based': 1,\n 'News/Opinion': 1,\n 'Science': 1,\n 'Environment': 1,\n 'Opinion': 1})
\n
We will create two datasets with the required labels and data for our use cases.
def extract_rows(distiset, labels):\n return [\n {\n \"text\": entry[\"input_text\"],\n \"label\": entry[\"label\"],\n \"id\": i\n }\n for dataset_name in distiset\n for i, entry in enumerate(distiset[dataset_name][\"train\"])\n if entry[\"label\"] in labels\n ]\n\ndata_topic = extract_rows(distiset, labels_topic)\ndata_fact_opinion = extract_rows(distiset, labels_fact_opinion)\n
Get started in Argilla
If you are not familiar with Argilla, we recommend taking a look at the Argilla quickstart docs. Alternatively, you can use your Hugging Face account to log in to the Argilla demo Space.
To get the most out of our data, we will use Argilla. First, we need to connect to the Argilla instance.
import argilla as rg\n\n# Replace api_url with your url if using Docker\n# Replace api_key with your API key under \"My Settings\" in the UI\n# Uncomment the last line and set your HF_TOKEN if your space is private\nclient = rg.Argilla(\n api_url=\"https://[your-owner-name]-[your_space_name].hf.space\",\n api_key=\"[your-api-key]\",\n # headers={\"Authorization\": f\"Bearer {HF_TOKEN}\"}\n)\n
We will create a Dataset
for each task, with an input TextField
for the text classification text and a LabelQuestion
to ensure the generated labels are correct.
def create_texcat_dataset(dataset_name, labels):\n settings = rg.Settings(\n fields=[rg.TextField(\"text\")],\n questions=[\n rg.LabelQuestion(\n name=\"label\",\n title=\"Classify the texts according to the following labels\",\n labels=labels,\n ),\n ],\n )\n return rg.Dataset(name=dataset_name, settings=settings).create()\n\n\nrg_dataset_topic = create_texcat_dataset(\"topic-classification\", labels_topic)\nrg_dataset_fact_opinion = create_texcat_dataset(\n \"fact-opinion-classification\", labels_fact_opinion\n)\n
Now, we can upload the generated data to Argilla and evaluate it. We will use the generated labels as suggestions.
rg_dataset_topic.records.log(data_topic)\nrg_dataset_fact_opinion.records.log(data_fact_opinion)\n
Now, we can start the annotation process. Just open the dataset in the Argilla UI and start annotating the records. If the suggestions are correct, you can just click on Submit
. Otherwise, you can select the correct label.
Note
Check this how-to guide to learn more about annotating in the UI.
Once you have the annotations, let's continue by retrieving the data from Argilla and formatting it as a dataset with the required fields.
rg_dataset_topic = client.datasets(\"topic-classification\")\nrg_dataset_fact_opinion = client.datasets(\"fact-opinion-classification\")\n
status_filter = rg.Query(filter=rg.Filter((\"response.status\", \"==\", \"submitted\")))\n\nsubmitted_topic = rg_dataset_topic.records(status_filter).to_list(flatten=True)\nsubmitted_fact_opinion = rg_dataset_fact_opinion.records(status_filter).to_list(\n flatten=True\n)\n
def format_submitted(submitted):\n return [\n {\n \"text\": r[\"text\"],\n \"label\": r[\"label.responses\"][0],\n \"id\": i,\n }\n for i, r in enumerate(submitted)\n ]\n\ndata_topic = format_submitted(submitted_topic)\ndata_fact_opinion = format_submitted(submitted_fact_opinion)\n
In our case, we will fine-tune using SetFit. However, you can select the framework that best fits your requirements.
The next step will be to format the data to be compatible with SetFit. In the case of the topic classification, we will need to combine the synthetic data with the original data.
hf_topic = hf_dataset.to_list()\nnum = len(data_topic)\n\ndata_topic.extend(\n [\n {\n \"text\": r[\"text\"],\n \"label\": id2str[r[\"label\"]],\n \"id\": num + i,\n }\n for i, r in enumerate(hf_topic)\n ]\n)\n
If we check the data distribution now, we can see that we have enough samples for each label to train our models.
labels = [record[\"label\"] for record in data_topic]\nCounter(labels)\n
\nCounter({'Sci/Tech': 275, 'Business': 132, 'World': 98, 'Sports': 70})
\n
labels = [record[\"label\"] for record in data_fact_opinion]\nCounter(labels)\n
\nCounter({'Fact-based': 86, 'Opinion-based': 54})
\n
Now, let's create our training and validation datasets. The training dataset will gather 8 samples per label, and the validation datasets will contain the remaining samples not included in the training datasets.
def sample_and_split(dataset, label_column, num_samples):\n train_dataset = sample_dataset(\n dataset, label_column=label_column, num_samples=num_samples\n )\n eval_dataset = dataset.filter(lambda x: x[\"id\"] not in set(train_dataset[\"id\"]))\n return train_dataset, eval_dataset\n\n\ndataset_topic_full = Dataset.from_list(data_topic)\ndataset_fact_opinion_full = Dataset.from_list(data_fact_opinion)\n\ntrain_dataset_topic, eval_dataset_topic = sample_and_split(\n dataset_topic_full, \"label\", 8\n)\ntrain_dataset_fact_opinion, eval_dataset_fact_opinion = sample_and_split(\n dataset_fact_opinion_full, \"label\", 8\n)\n
Let's train our models for each task! We will use TaylorAI/bge-micro-v2, available on the Hugging Face Hub. You can check the MTEB leaderboard to select the best model for your use case.
def train_model(model_name, dataset, eval_dataset):\n model = SetFitModel.from_pretrained(model_name)\n\n trainer = Trainer(\n model=model,\n train_dataset=dataset,\n )\n trainer.train()\n metrics = trainer.evaluate(eval_dataset)\n print(metrics)\n\n return model\n
model_topic = train_model(\n model_name=\"TaylorAI/bge-micro-v2\",\n dataset=train_dataset_topic,\n eval_dataset=eval_dataset_topic,\n)\nmodel_topic.save_pretrained(\"topic_classification_model\")\nmodel_topic = SetFitModel.from_pretrained(\"topic_classification_model\")\n
\n***** Running training *****\n Num unique pairs = 768\n Batch size = 16\n Num epochs = 1\n Total optimization steps = 48\n
\n
\n{'embedding_loss': 0.1873, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.02}\n
\n
\n***** Running evaluation *****\n
\n
\n{'train_runtime': 4.9767, 'train_samples_per_second': 154.318, 'train_steps_per_second': 9.645, 'epoch': 1.0}\n{'accuracy': 0.8333333333333334}\n
\n
model_fact_opinion = train_model(\n model_name=\"TaylorAI/bge-micro-v2\",\n dataset=train_dataset_fact_opinion,\n eval_dataset=eval_dataset_fact_opinion,\n)\nmodel_fact_opinion.save_pretrained(\"fact_opinion_classification_model\")\nmodel_fact_opinion = SetFitModel.from_pretrained(\"fact_opinion_classification_model\")\n
\n***** Running training *****\n Num unique pairs = 144\n Batch size = 16\n Num epochs = 1\n Total optimization steps = 9\n
\n
\n{'embedding_loss': 0.2985, 'learning_rate': 2e-05, 'epoch': 0.11}\n
\n
\n***** Running evaluation *****\n
\n
\n{'train_runtime': 0.8327, 'train_samples_per_second': 172.931, 'train_steps_per_second': 10.808, 'epoch': 1.0}\n{'accuracy': 0.9090909090909091}\n
\n
Voil\u00e0! The models are now trained and ready to be used. You can start making predictions to check the model's performance and add the new label. Optionally, you can continue using distilabel to generate additional data or Argilla to verify the quality of the predictions.
def predict(model, input, labels):\n model.labels = labels\n prediction = model.predict([input])\n return prediction[0]\n
predict(\n model_topic, \"The new iPhone is expected to be released next month.\", labels_topic\n)\n
\n'Sci/Tech'
\n
predict(\n model_fact_opinion,\n \"The new iPhone is expected to be released next month.\",\n labels_fact_opinion,\n)\n
\n'Opinion-based'
\n
In this tutorial, we showcased the detailed steps to build a pipeline for generating text classification data using distilabel. You can customize this pipeline for your own use cases and share your datasets with the community through the Hugging Face Hub.
We defined two text classification tasks\u2014a topic classification task and a fact versus opinion classification task\u2014and generated new data using various models via the serverless Hugging Face Inference API. Then, we curated the generated data with Argilla. Finally, we trained the models with SetFit using both the original and synthetic data.
"},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#generate-synthetic-text-classification-data","title":"Generate synthetic text classification data","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#getting-started","title":"Getting started","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#install-the-dependencies","title":"Install the dependencies","text":"To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will be using the free but rate-limited Hugging Face serverless Inference API for this tutorial, so we need to install this as an extra distilabel dependency. You can install them by running the following command:
"},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#optional-deploy-argilla","title":"(optional) Deploy Argilla","text":"You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from a lack of data quality, so we do recommend looking at your data. If you already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following this guide.
Along with that, you will need to install Argilla as a distilabel extra.
"},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#the-dataset","title":"The dataset","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#define-the-text-classification-task","title":"Define the text classification task","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#run-the-pipeline","title":"Run the pipeline","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#optional-evaluate-with-argilla","title":"(Optional) Evaluate with Argilla","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#train-your-models","title":"Train your models","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#formatting-the-data","title":"Formatting the data","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#the-actual-training","title":"The actual training","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#conclusions","title":"Conclusions","text":""},{"location":"components-gallery/","title":"Components Gallery","text":"Steps
Explore all the available Step
s that can be used for data manipulation.
Steps
Tasks
Explore all the available Task
s that can be used with an LLM
to perform data generation, annotation, and more.
Tasks
LLMs
Explore all the available LLM
s integrated with distilabel
.
LLMs
Embeddings
Explore all the available Embeddings
models integrated with distilabel
.
Embeddings
The gallery page showcases the different types of components within distilabel
.
PreferenceToArgilla
Creates a preference dataset in Argilla.
PreferenceToArgilla
TextGenerationToArgilla
Creates a text generation dataset in Argilla.
TextGenerationToArgilla
CombineColumns
CombineColumns
is deprecated and will be removed in version 1.5.0, use GroupColumns
instead.
CombineColumns
PushToHub
Push data to a Hugging Face Hub dataset.
PushToHub
LoadDataFromDicts
Loads a dataset from a list of dictionaries.
LoadDataFromDicts
DataSampler
Step to sample from a dataset.
DataSampler
LoadDataFromHub
Loads a dataset from the Hugging Face Hub.
LoadDataFromHub
LoadDataFromFileSystem
Loads a dataset from a file in your filesystem.
LoadDataFromFileSystem
LoadDataFromDisk
Load a dataset that was previously saved to disk.
LoadDataFromDisk
PrepareExamples
Helper step to create examples from query
and answers
pairs used as Few Shots in APIGen.
PrepareExamples
ConversationTemplate
Generate a conversation template from an instruction and a response.
ConversationTemplate
FormatTextGenerationDPO
Format the output of your LLMs for Direct Preference Optimization (DPO).
FormatTextGenerationDPO
FormatChatGenerationDPO
Format the output of a combination of a ChatGeneration
+ a preference task for Direct Preference Optimization (DPO).
FormatChatGenerationDPO
FormatTextGenerationSFT
Format the output of a TextGeneration
task for Supervised Fine-Tuning (SFT).
FormatTextGenerationSFT
FormatChatGenerationSFT
Format the output of a ChatGeneration
task for Supervised Fine-Tuning (SFT).
FormatChatGenerationSFT
DeitaFiltering
Filter dataset rows using DEITA filtering strategy.
DeitaFiltering
EmbeddingDedup
Deduplicates text using embeddings.
EmbeddingDedup
APIGenExecutionChecker
Executes the generated function calls.
APIGenExecutionChecker
MinHashDedup
Deduplicates text using MinHash
and MinHashLSH
.
MinHashDedup
CombineOutputs
Combine the outputs of several upstream steps.
CombineOutputs
ExpandColumns
Expand columns that contain lists into multiple rows.
ExpandColumns
GroupColumns
Combines columns from a list of StepInput
.
GroupColumns
KeepColumns
Keeps selected columns in the dataset.
KeepColumns
MergeColumns
Merge columns from a row.
MergeColumns
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core
DBSCAN
UMAP
UMAP is a general purpose manifold learning and dimension reduction algorithm.
UMAP
FaissNearestNeighbour
Create a faiss
index to get the nearest neighbours.
FaissNearestNeighbour
EmbeddingGeneration
Generate embeddings using an Embeddings
model.
EmbeddingGeneration
RewardModelScore
Assign a score to a response using a Reward Model.
RewardModelScore
FormatPRM
Helper step to transform the data into the format expected by the PRM model.
FormatPRM
TruncateTextColumn
Truncate a row using a tokenizer or the number of characters.
TruncateTextColumn
Creates a preference dataset in Argilla.
Step that creates a dataset in Argilla during the load phase, and then pushes the input batches into it as records. This dataset is a preference dataset, where there's one field for the instruction and one extra field per generation within the same record, and then a rating question for each of the generation fields. The rating question asks the annotator to set a rating from 1 to 5 for each of the provided generations.
"},{"location":"components-gallery/steps/preferencetoargilla/#note","title":"Note","text":"This step is meant to be used in conjunction with the UltraFeedback
step, or any other step generating both ratings and responses for a given set of instructions and generations. Alternatively, it can also be used with any other task or step generating only the instruction
and generations
, as the ratings
and rationales
are optional.
num_generations: The number of generations to include in the dataset.
dataset_name: The name of the dataset in Argilla.
dataset_workspace: The workspace where the dataset will be created in Argilla. Defaults to None
, which means it will be created in the default workspace.
api_url: The URL of the Argilla API. Defaults to None
, which means it will be read from the ARGILLA_API_URL
environment variable.
api_key: The API key to authenticate with Argilla. Defaults to None
, which means it will be read from the ARGILLA_API_KEY
environment variable.
api_url: The base URL to use for the Argilla API requests.
api_key: The API key to authenticate the requests to the Argilla API.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[generations]\n ICOL2[ratings]\n ICOL3[rationales]\n end\n end\n\n subgraph PreferenceToArgilla\n StepInput[Input Columns: instruction, generations, ratings, rationales]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n ICOL3 --> StepInput\n
"},{"location":"components-gallery/steps/preferencetoargilla/#inputs","title":"Inputs","text":"instruction (str
): The instruction that was used to generate the completion.
generations (List[str]
): The completion that was generated based on the input instruction.
ratings (List[str]
, optional): The ratings for the generations. If not provided, the generated ratings won't be pushed to Argilla.
rationales (List[str]
, optional): The rationales for the ratings. If not provided, the generated rationales won't be pushed to Argilla.
from distilabel.steps import PreferenceToArgilla\n\nto_argilla = PreferenceToArgilla(\n num_generations=2,\n api_url=\"https://dibt-demo-argilla-space.hf.space/\",\n api_key=\"api.key\",\n dataset_name=\"argilla_dataset\",\n dataset_workspace=\"my_workspace\",\n)\nto_argilla.load()\n\nresult = next(\n to_argilla.process(\n [\n {\n \"instruction\": \"instruction\",\n \"generations\": [\"first_generation\", \"second_generation\"],\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction', 'generations': ['first_generation', 'second_generation']}]\n
"},{"location":"components-gallery/steps/preferencetoargilla/#it-can-also-include-ratings-and-rationales","title":"It can also include ratings and rationales","text":"result = next(\n to_argilla.process(\n [\n {\n \"instruction\": \"instruction\",\n \"generations\": [\"first_generation\", \"second_generation\"],\n \"ratings\": [\"4\", \"5\"],\n \"rationales\": [\"rationale for 4\", \"rationale for 5\"],\n }\n ],\n )\n)\n# >>> result\n# [\n# {\n# 'instruction': 'instruction',\n# 'generations': ['first_generation', 'second_generation'],\n# 'ratings': ['4', '5'],\n# 'rationales': ['rationale for 4', 'rationale for 5']\n# }\n# ]\n
"},{"location":"components-gallery/steps/textgenerationtoargilla/","title":"TextGenerationToArgilla","text":"Creates a text generation dataset in Argilla.
Step
that creates a dataset in Argilla during the load phase, and then pushes the input batches into it as records. This dataset is a text-generation dataset, where there's one field per each input, and then a label question to rate the quality of the completion in either bad (represented with \ud83d\udc4e) or good (represented with \ud83d\udc4d).
This step is meant to be used in conjunction with a TextGeneration
step and no column mapping is needed, as it will use the default values for the instruction
and generation
columns.
dataset_name: The name of the dataset in Argilla.
dataset_workspace: The workspace where the dataset will be created in Argilla. Defaults to None
, which means it will be created in the default workspace.
api_url: The URL of the Argilla API. Defaults to None
, which means it will be read from the ARGILLA_API_URL
environment variable.
api_key: The API key to authenticate with Argilla. Defaults to None
, which means it will be read from the ARGILLA_API_KEY
environment variable.
api_url: The base URL to use for the Argilla API requests.
api_key: The API key to authenticate the requests to the Argilla API.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[generation]\n end\n end\n\n subgraph TextGenerationToArgilla\n StepInput[Input Columns: instruction, generation]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n
"},{"location":"components-gallery/steps/textgenerationtoargilla/#inputs","title":"Inputs","text":"instruction (str
): The instruction that was used to generate the completion.
generation (str
or List[str]
): The completions that were generated based on the input instruction.
from distilabel.steps import PreferenceToArgilla\n\nto_argilla = TextGenerationToArgilla(\n num_generations=2,\n api_url=\"https://dibt-demo-argilla-space.hf.space/\",\n api_key=\"api.key\",\n dataset_name=\"argilla_dataset\",\n dataset_workspace=\"my_workspace\",\n)\nto_argilla.load()\n\nresult = next(\n to_argilla.process(\n [\n {\n \"instruction\": \"instruction\",\n \"generation\": \"generation\",\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction', 'generation': 'generation'}]\n
"},{"location":"components-gallery/steps/combinecolumns/","title":"CombineColumns","text":"CombineColumns
is deprecated and will be removed in version 1.5.0, use GroupColumns
instead.
graph TD\n subgraph Dataset\n end\n\n subgraph CombineColumns\n end\n\n
"},{"location":"components-gallery/steps/pushtohub/","title":"PushToHub","text":"Push data to a Hugging Face Hub dataset.
A GlobalStep
which creates a datasets.Dataset
with the input data and pushes it to the Hugging Face Hub.
repo_id: The Hugging Face Hub repository ID where the dataset will be uploaded.
split: The split of the dataset that will be pushed. Defaults to \"train\"
.
private: Whether the dataset to be pushed should be private or not. Defaults to False
.
token: The token that will be used to authenticate in the Hub. If not provided, the token will be tried to be obtained from the environment variable HF_TOKEN
. If not provided using one of the previous methods, then huggingface_hub
library will try to use the token from the local Hugging Face CLI configuration. Defaults to None
.
repo_id: The Hugging Face Hub repository ID where the dataset will be uploaded.
split: The split of the dataset that will be pushed.
private: Whether the dataset to be pushed should be private or not.
token: The token that will be used to authenticate in the Hub.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n end\n\n subgraph PushToHub\n StepInput[Input Columns: dynamic]\n end\n\n ICOL0 --> StepInput\n
"},{"location":"components-gallery/steps/pushtohub/#inputs","title":"Inputs","text":"all
): all columns from the input will be used to create the dataset.from distilabel.steps import PushToHub\n\npush = PushToHub(repo_id=\"path_to/repo\")\npush.load()\n\nresult = next(\n push.process(\n [\n {\n \"instruction\": \"instruction \",\n \"generation\": \"generation\"\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction ', 'generation': 'generation'}]\n
"},{"location":"components-gallery/steps/loaddatafromdicts/","title":"LoadDataFromDicts","text":"Loads a dataset from a list of dictionaries.
GeneratorStep
that loads a dataset from a list of dictionaries and yields it in batches.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph LoadDataFromDicts\n StepOutput[Output Columns: dynamic]\n end\n\n StepOutput --> OCOL0\n
"},{"location":"components-gallery/steps/loaddatafromdicts/#outputs","title":"Outputs","text":"from distilabel.steps import LoadDataFromDicts\n\nloader = LoadDataFromDicts(\n data=[{\"instruction\": \"What are 2+2?\"}] * 5,\n batch_size=2\n)\nloader.load()\n\nresult = next(loader.process())\n# >>> result\n# ([{'instruction': 'What are 2+2?'}, {'instruction': 'What are 2+2?'}], False)\n
"},{"location":"components-gallery/steps/datasampler/","title":"DataSampler","text":"Step to sample from a dataset.
GeneratorStep
that samples from a dataset and yields it in batches. This step is useful when you have a pipeline that can benefit from using examples in the prompts for example as few-shot learning, that can be changing on each row. For example, you can pass a list of dictionaries with N examples and generate M samples from it (assuming you have another step loading data, this M should have the same size as the data being loaded in that step). The size S argument is the number of samples per row generated, so each example would contain S examples to be used as examples.
data: The list of dictionaries to sample from.
size: Number of samples per example. For example in a few-shot learning scenario, the number of few-shot examples that will be generated per example. Defaults to 2.
samples: Number of examples that will be generated by the step in total. If used with another loader step, this should be the same as the number of samples in the loader step. Defaults to 100.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph DataSampler\n StepOutput[Output Columns: dynamic]\n end\n\n StepOutput --> OCOL0\n
"},{"location":"components-gallery/steps/datasampler/#outputs","title":"Outputs","text":"from distilabel.steps import DataSampler\n\nsampler = DataSampler(\n data=[{\"sample\": f\"sample {i}\"} for i in range(30)],\n samples=10,\n size=2,\n batch_size=4\n)\nsampler.load()\n\nresult = next(sampler.process())\n# >>> result\n# ([{'sample': ['sample 7', 'sample 0']}, {'sample': ['sample 2', 'sample 21']}, {'sample': ['sample 17', 'sample 12']}, {'sample': ['sample 2', 'sample 14']}], False)\n
"},{"location":"components-gallery/steps/datasampler/#pipeline-with-a-loader-and-a-sampler-combined-in-a-single-stream","title":"Pipeline with a loader and a sampler combined in a single stream","text":"from datasets import load_dataset\n\nfrom distilabel.steps import LoadDataFromDicts, DataSampler\nfrom distilabel.steps.tasks.apigen.utils import PrepareExamples\nfrom distilabel.pipeline import Pipeline\n\nds = (\n load_dataset(\"Salesforce/xlam-function-calling-60k\", split=\"train\")\n .shuffle(seed=42)\n .select(range(500))\n .to_list()\n)\ndata = [\n {\n \"func_name\": \"final_velocity\",\n \"func_desc\": \"Calculates the final velocity of an object given its initial velocity, acceleration, and time.\",\n },\n {\n \"func_name\": \"permutation_count\",\n \"func_desc\": \"Calculates the number of permutations of k elements from a set of n elements.\",\n },\n {\n \"func_name\": \"getdivision\",\n \"func_desc\": \"Divides two numbers by making an API call to a division service.\",\n },\n]\nwith Pipeline(name=\"APIGenPipeline\") as pipeline:\n loader_seeds = LoadDataFromDicts(data=data)\n sampler = DataSampler(\n data=ds,\n size=2,\n samples=len(data),\n batch_size=8,\n )\n prep_examples = PrepareExamples()\n\n sampler >> prep_examples\n (\n [loader_seeds, prep_examples]\n >> combine_steps\n )\n# Now we have a single stream of data with the loader and the sampler data\n
"},{"location":"components-gallery/steps/loaddatafromhub/","title":"LoadDataFromHub","text":"Loads a dataset from the Hugging Face Hub.
GeneratorStep
that loads a dataset from the Hugging Face Hub using the datasets
library.
repo_id: The Hugging Face Hub repository ID of the dataset to load.
split: The split of the dataset to load.
config: The configuration of the dataset to load. This is optional and only needed if the dataset has multiple configurations.
batch_size: The batch size to use when processing the data.
repo_id: The Hugging Face Hub repository ID of the dataset to load.
split: The split of the dataset to load. Defaults to 'train'.
config: The configuration of the dataset to load. This is optional and only needed if the dataset has multiple configurations.
revision: The revision of the dataset to load. Defaults to the latest revision.
streaming: Whether to load the dataset in streaming mode or not. Defaults to False
.
num_examples: The number of examples to load from the dataset. By default will load all examples.
storage_options: Key/value pairs to be passed on to the file-system backend, if any. Defaults to None
.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph LoadDataFromHub\n StepOutput[Output Columns: dynamic]\n end\n\n StepOutput --> OCOL0\n
"},{"location":"components-gallery/steps/loaddatafromhub/#outputs","title":"Outputs","text":"all
): The columns that will be generated by this step, based on the datasets loaded from the Hugging Face Hub.from distilabel.steps import LoadDataFromHub\n\nloader = LoadDataFromHub(\n repo_id=\"distilabel-internal-testing/instruction-dataset-mini\",\n split=\"test\",\n batch_size=2\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'prompt': 'Arianna has 12...', False)\n
"},{"location":"components-gallery/steps/loaddatafromfilesystem/","title":"LoadDataFromFileSystem","text":"Loads a dataset from a file in your filesystem.
GeneratorStep
that creates a dataset from a file in the filesystem, using the Hugging Face datasets
library. Take a look at Hugging Face Datasets for more information on the supported file types.
data_files: The path to the file, or directory containing the files that conform the dataset.
split: The split of the dataset to load (typically will be train
, test
or validation
).
batch_size: The batch size to use when processing the data.
data_files: The path to the file, or directory containing the files that conform the dataset.
split: The split of the dataset to load. Defaults to 'train'.
streaming: Whether to load the dataset in streaming mode or not. Defaults to False
.
num_examples: The number of examples to load from the dataset. By default will load all examples.
storage_options: Key/value pairs to be passed on to the file-system backend, if any. Defaults to None
.
filetype: The expected filetype. If not provided, it will be inferred from the file extension. For more than one file, it will be inferred from the first file.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph LoadDataFromFileSystem\n StepOutput[Output Columns: dynamic]\n end\n\n StepOutput --> OCOL0\n
"},{"location":"components-gallery/steps/loaddatafromfilesystem/#outputs","title":"Outputs","text":"all
): The columns that will be generated by this step, based on the datasets loaded from the Hugging Face Hub.from distilabel.steps import LoadDataFromFileSystem\n\nloader = LoadDataFromFileSystem(data_files=\"path/to/dataset.jsonl\")\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
"},{"location":"components-gallery/steps/loaddatafromfilesystem/#specify-a-filetype-if-the-file-extension-is-not-expected","title":"Specify a filetype if the file extension is not expected","text":"from distilabel.steps import LoadDataFromFileSystem\n\nloader = LoadDataFromFileSystem(filetype=\"csv\", data_files=\"path/to/dataset.txtr\")\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
"},{"location":"components-gallery/steps/loaddatafromfilesystem/#load-data-from-a-file-in-your-cloud-provider","title":"Load data from a file in your cloud provider","text":"from distilabel.steps import LoadDataFromFileSystem\n\nloader = LoadDataFromFileSystem(\n data_files=\"gcs://path/to/dataset\",\n storage_options={\"project\": \"experiments-0001\"}\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
"},{"location":"components-gallery/steps/loaddatafromfilesystem/#load-data-passing-a-glob-pattern","title":"Load data passing a glob pattern","text":"from distilabel.steps import LoadDataFromFileSystem\n\nloader = LoadDataFromFileSystem(\n data_files=\"path/to/dataset/*.jsonl\",\n streaming=True\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
"},{"location":"components-gallery/steps/loaddatafromdisk/","title":"LoadDataFromDisk","text":"Load a dataset that was previously saved to disk.
If you previously saved your dataset using the save_to_disk
method, or Distiset.save_to_disk
you can load it again to build a new pipeline using this class.
dataset_path: The path to the dataset or distiset.
split: The split of the dataset to load (typically will be train
, test
or validation
).
config: The configuration of the dataset to load. Defaults to default
, if there are multiple configurations in the dataset, this must be supplied or an error is raised.
batch_size: The batch size to use when processing the data.
dataset_path: The path to the dataset or distiset.
is_distiset: Whether the dataset to load is a Distiset
or not. Defaults to False.
split: The split of the dataset to load. Defaults to 'train'.
config: The configuration of the dataset to load. Defaults to default
, if there are multiple configurations in the dataset, this must be supplied or an error is raised.
num_examples: The number of examples to load from the dataset. By default will load all examples.
storage_options: Key/value pairs to be passed on to the file-system backend, if any. Defaults to None
.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph LoadDataFromDisk\n StepOutput[Output Columns: dynamic]\n end\n\n StepOutput --> OCOL0\n
"},{"location":"components-gallery/steps/loaddatafromdisk/#outputs","title":"Outputs","text":"all
): The columns that will be generated by this step, based on the datasets loaded from the Hugging Face Hub.from distilabel.steps import LoadDataFromDisk\n\nloader = LoadDataFromDisk(dataset_path=\"path/to/dataset\")\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
"},{"location":"components-gallery/steps/loaddatafromdisk/#load-data-from-a-distilabel-distiset","title":"Load data from a distilabel Distiset","text":"from distilabel.steps import LoadDataFromDisk\n\n# Specify the configuration to load.\nloader = LoadDataFromDisk(\n dataset_path=\"path/to/dataset\",\n is_distiset=True,\n config=\"leaf_step_1\"\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'a': 1}, {'a': 2}, {'a': 3}], True)\n
"},{"location":"components-gallery/steps/loaddatafromdisk/#load-data-from-a-hugging-face-dataset-or-distiset-in-your-cloud-provider","title":"Load data from a Hugging Face Dataset or Distiset in your cloud provider","text":"from distilabel.steps import LoadDataFromDisk\n\nloader = LoadDataFromDisk(\n dataset_path=\"gcs://path/to/dataset\",\n storage_options={\"project\": \"experiments-0001\"}\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
"},{"location":"components-gallery/steps/prepareexamples/","title":"PrepareExamples","text":"Helper step to create examples from query
and answers
pairs used as Few Shots in APIGen.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[query]\n ICOL1[answers]\n end\n subgraph New columns\n OCOL0[examples]\n end\n end\n\n subgraph PrepareExamples\n StepInput[Input Columns: query, answers]\n StepOutput[Output Columns: examples]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/prepareexamples/#inputs","title":"Inputs","text":"query (str
): The query to generate examples from.
answers (str
): The answers to the query.
str
): The formatted examples.from distilabel.steps.tasks.apigen.utils import PrepareExamples\n\nprepare_examples = PrepareExamples()\nresult = next(prepare_examples.process(\n [\n {\n \"query\": ['I need the area of circles with radius 2.5, 5, and 7.5 inches, please.', 'Can you provide the current locations of buses and trolleys on route 12?'],\n \"answers\": ['[{\"name\": \"circle_area\", \"arguments\": {\"radius\": 2.5}}, {\"name\": \"circle_area\", \"arguments\": {\"radius\": 5}}, {\"name\": \"circle_area\", \"arguments\": {\"radius\": 7.5}}]', '[{\"name\": \"bus_trolley_locations\", \"arguments\": {\"route\": \"12\"}}]']\n }\n ]\n)\n# result\n# [{'examples': '## Query:\\nI need the area of circles with radius 2.5, 5, and 7.5 inches, please.\\n## Answers:\\n[{\"name\": \"circle_area\", \"arguments\": {\"radius\": 2.5}}, {\"name\": \"circle_area\", \"arguments\": {\"radius\": 5}}, {\"name\": \"circle_area\", \"arguments\": {\"radius\": 7.5}}]\\n\\n## Query:\\nCan you provide the current locations of buses and trolleys on route 12?\\n## Answers:\\n[{\"name\": \"bus_trolley_locations\", \"arguments\": {\"route\": \"12\"}}]'}, {'examples': '## Query:\\nI need the area of circles with radius 2.5, 5, and 7.5 inches, please.\\n## Answers:\\n[{\"name\": \"circle_area\", \"arguments\": {\"radius\": 2.5}}, {\"name\": \"circle_area\", \"arguments\": {\"radius\": 5}}, {\"name\": \"circle_area\", \"arguments\": {\"radius\": 7.5}}]\\n\\n## Query:\\nCan you provide the current locations of buses and trolleys on route 12?\\n## Answers:\\n[{\"name\": \"bus_trolley_locations\", \"arguments\": {\"route\": \"12\"}}]'}]\n
"},{"location":"components-gallery/steps/conversationtemplate/","title":"ConversationTemplate","text":"Generate a conversation template from an instruction and a response.
"},{"location":"components-gallery/steps/conversationtemplate/#input-output-columns","title":"Input & Output Columns","text":"graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[response]\n end\n subgraph New columns\n OCOL0[conversation]\n end\n end\n\n subgraph ConversationTemplate\n StepInput[Input Columns: instruction, response]\n StepOutput[Output Columns: conversation]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/conversationtemplate/#inputs","title":"Inputs","text":"instruction (str
): The instruction to be used in the conversation.
response (str
): The response to be used in the conversation.
ChatType
): The conversation template.from distilabel.steps import ConversationTemplate\n\nconv_template = ConversationTemplate()\nconv_template.load()\n\nresult = next(\n conv_template.process(\n [\n {\n \"instruction\": \"Hello\",\n \"response\": \"Hi\",\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'Hello', 'response': 'Hi', 'conversation': [{'role': 'user', 'content': 'Hello'}, {'role': 'assistant', 'content': 'Hi'}]}]\n
"},{"location":"components-gallery/steps/formattextgenerationdpo/","title":"FormatTextGenerationDPO","text":"Format the output of your LLMs for Direct Preference Optimization (DPO).
FormatTextGenerationDPO
is a Step
that formats the output of the combination of a TextGeneration
task with a preference Task
i.e. a task generating ratings
, so that those are used to rank the existing generations and provide the chosen
and rejected
generations based on the ratings
. Use this step to transform the output of a combination of a TextGeneration
+ a preference task such as UltraFeedback
following the standard formatting from frameworks such as axolotl
or alignment-handbook
.
The generations
column should contain at least two generations, the ratings
column should contain the same number of ratings as generations.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[system_prompt]\n ICOL1[instruction]\n ICOL2[generations]\n ICOL3[generation_models]\n ICOL4[ratings]\n end\n subgraph New columns\n OCOL0[prompt]\n OCOL1[prompt_id]\n OCOL2[chosen]\n OCOL3[chosen_model]\n OCOL4[chosen_rating]\n OCOL5[rejected]\n OCOL6[rejected_model]\n OCOL7[rejected_rating]\n end\n end\n\n subgraph FormatTextGenerationDPO\n StepInput[Input Columns: system_prompt, instruction, generations, generation_models, ratings]\n StepOutput[Output Columns: prompt, prompt_id, chosen, chosen_model, chosen_rating, rejected, rejected_model, rejected_rating]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n ICOL3 --> StepInput\n ICOL4 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n StepOutput --> OCOL4\n StepOutput --> OCOL5\n StepOutput --> OCOL6\n StepOutput --> OCOL7\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/formattextgenerationdpo/#inputs","title":"Inputs","text":"system_prompt (str
, optional): The system prompt used within the LLM
to generate the generations
, if available.
instruction (str
): The instruction used to generate the generations
with the LLM
.
generations (List[str]
): The generations produced by the LLM
.
generation_models (List[str]
, optional): The model names used to generate the generations
, only available if the model_name
from the TextGeneration
task/s is combined into a single column named this way, otherwise, it will be ignored.
ratings (List[float]
): The ratings for each of the generations
, produced by a preference task such as UltraFeedback
.
prompt (str
): The instruction used to generate the generations
with the LLM
.
prompt_id (str
): The SHA256
hash of the prompt
.
chosen (List[Dict[str, str]]
): The chosen
generation based on the ratings
.
chosen_model (str
, optional): The model name used to generate the chosen
generation, if the generation_models
are available.
chosen_rating (float
): The rating of the chosen
generation.
rejected (List[Dict[str, str]]
): The rejected
generation based on the ratings
.
rejected_model (str
, optional): The model name used to generate the rejected
generation, if the generation_models
are available.
rejected_rating (float
): The rating of the rejected
generation.
from distilabel.steps import FormatTextGenerationDPO\n\nformat_dpo = FormatTextGenerationDPO()\nformat_dpo.load()\n\n# NOTE: Both \"system_prompt\" and \"generation_models\" can be added optionally.\nresult = next(\n format_dpo.process(\n [\n {\n \"instruction\": \"What's 2+2?\",\n \"generations\": [\"4\", \"5\", \"6\"],\n \"ratings\": [1, 0, -1],\n }\n ]\n )\n)\n# >>> result\n# [\n# { 'instruction': \"What's 2+2?\",\n# 'generations': ['4', '5', '6'],\n# 'ratings': [1, 0, -1],\n# 'prompt': \"What's 2+2?\",\n# 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n# 'chosen': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n# 'chosen_rating': 1,\n# 'rejected': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '6'}],\n# 'rejected_rating': -1\n# }\n# ]\n
"},{"location":"components-gallery/steps/formatchatgenerationdpo/","title":"FormatChatGenerationDPO","text":"Format the output of a combination of a ChatGeneration
+ a preference task for Direct Preference Optimization (DPO).
FormatChatGenerationDPO
is a Step
that formats the output of the combination of a ChatGeneration
task with a preference Task
i.e. a task generating ratings
such as UltraFeedback
following the standard formatting from frameworks such as axolotl
or alignment-handbook
, so that those are used to rank the existing generations and provide the chosen
and rejected
generations based on the ratings
.
The messages
column should contain at least one message from the user, the generations
column should contain at least two generations, the ratings
column should contain the same number of ratings as generations.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[messages]\n ICOL1[generations]\n ICOL2[generation_models]\n ICOL3[ratings]\n end\n subgraph New columns\n OCOL0[prompt]\n OCOL1[prompt_id]\n OCOL2[chosen]\n OCOL3[chosen_model]\n OCOL4[chosen_rating]\n OCOL5[rejected]\n OCOL6[rejected_model]\n OCOL7[rejected_rating]\n end\n end\n\n subgraph FormatChatGenerationDPO\n StepInput[Input Columns: messages, generations, generation_models, ratings]\n StepOutput[Output Columns: prompt, prompt_id, chosen, chosen_model, chosen_rating, rejected, rejected_model, rejected_rating]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n ICOL3 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n StepOutput --> OCOL4\n StepOutput --> OCOL5\n StepOutput --> OCOL6\n StepOutput --> OCOL7\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/formatchatgenerationdpo/#inputs","title":"Inputs","text":"messages (List[Dict[str, str]]
): The conversation messages.
generations (List[str]
): The generations produced by the LLM
.
generation_models (List[str]
, optional): The model names used to generate the generations
, only available if the model_name
from the ChatGeneration
task/s is combined into a single column named this way, otherwise, it will be ignored.
ratings (List[float]
): The ratings for each of the generations
, produced by a preference task such as UltraFeedback
.
prompt (str
): The user message used to generate the generations
with the LLM
.
prompt_id (str
): The SHA256
hash of the prompt
.
chosen (List[Dict[str, str]]
): The chosen
generation based on the ratings
.
chosen_model (str
, optional): The model name used to generate the chosen
generation, if the generation_models
are available.
chosen_rating (float
): The rating of the chosen
generation.
rejected (List[Dict[str, str]]
): The rejected
generation based on the ratings
.
rejected_model (str
, optional): The model name used to generate the rejected
generation, if the generation_models
are available.
rejected_rating (float
): The rating of the rejected
generation.
from distilabel.steps import FormatChatGenerationDPO\n\nformat_dpo = FormatChatGenerationDPO()\nformat_dpo.load()\n\n# NOTE: \"generation_models\" can be added optionally.\nresult = next(\n format_dpo.process(\n [\n {\n \"messages\": [{\"role\": \"user\", \"content\": \"What's 2+2?\"}],\n \"generations\": [\"4\", \"5\", \"6\"],\n \"ratings\": [1, 0, -1],\n }\n ]\n )\n)\n# >>> result\n# [\n# {\n# 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}],\n# 'generations': ['4', '5', '6'],\n# 'ratings': [1, 0, -1],\n# 'prompt': \"What's 2+2?\",\n# 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n# 'chosen': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n# 'chosen_rating': 1,\n# 'rejected': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '6'}],\n# 'rejected_rating': -1\n# }\n# ]\n
"},{"location":"components-gallery/steps/formattextgenerationsft/","title":"FormatTextGenerationSFT","text":"Format the output of a TextGeneration
task for Supervised Fine-Tuning (SFT).
FormatTextGenerationSFT
is a Step
that formats the output of a TextGeneration
task for Supervised Fine-Tuning (SFT) following the standard formatting from frameworks such as axolotl
or alignment-handbook
. The output of the TextGeneration
task is formatted into a chat-like conversation with the instruction
as the user message and the generation
as the assistant message. Optionally, if the system_prompt
is available, it is included as the first message in the conversation.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[system_prompt]\n ICOL1[instruction]\n ICOL2[generation]\n end\n subgraph New columns\n OCOL0[prompt]\n OCOL1[prompt_id]\n OCOL2[messages]\n end\n end\n\n subgraph FormatTextGenerationSFT\n StepInput[Input Columns: system_prompt, instruction, generation]\n StepOutput[Output Columns: prompt, prompt_id, messages]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/formattextgenerationsft/#inputs","title":"Inputs","text":"system_prompt (str
, optional): The system prompt used within the LLM
to generate the generation
, if available.
instruction (str
): The instruction used to generate the generation
with the LLM
.
generation (str
): The generation produced by the LLM
.
prompt (str
): The instruction used to generate the generation
with the LLM
.
prompt_id (str
): The SHA256
hash of the prompt
.
messages (List[Dict[str, str]]
): The chat-like conversation with the instruction
as the user message and the generation
as the assistant message.
from distilabel.steps import FormatTextGenerationSFT\n\nformat_sft = FormatTextGenerationSFT()\nformat_sft.load()\n\n# NOTE: \"system_prompt\" can be added optionally.\nresult = next(\n format_sft.process(\n [\n {\n \"instruction\": \"What's 2+2?\",\n \"generation\": \"4\"\n }\n ]\n )\n)\n# >>> result\n# [\n# {\n# 'instruction': 'What's 2+2?',\n# 'generation': '4',\n# 'prompt': 'What's 2+2?',\n# 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n# 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}]\n# }\n# ]\n
"},{"location":"components-gallery/steps/formatchatgenerationsft/","title":"FormatChatGenerationSFT","text":"Format the output of a ChatGeneration
task for Supervised Fine-Tuning (SFT).
FormatChatGenerationSFT
is a Step
that formats the output of a ChatGeneration
task for Supervised Fine-Tuning (SFT) following the standard formatting from frameworks such as axolotl
or alignment-handbook
. The output of the ChatGeneration
task is formatted into a chat-like conversation with the instruction
as the user message and the generation
as the assistant message. Optionally, if the system_prompt
is available, it is included as the first message in the conversation.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[system_prompt]\n ICOL1[instruction]\n ICOL2[generation]\n end\n subgraph New columns\n OCOL0[prompt]\n OCOL1[prompt_id]\n OCOL2[messages]\n end\n end\n\n subgraph FormatChatGenerationSFT\n StepInput[Input Columns: system_prompt, instruction, generation]\n StepOutput[Output Columns: prompt, prompt_id, messages]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/formatchatgenerationsft/#inputs","title":"Inputs","text":"system_prompt (str
, optional): The system prompt used within the LLM
to generate the generation
, if available.
instruction (str
): The instruction used to generate the generation
with the LLM
.
generation (str
): The generation produced by the LLM
.
prompt (str
): The instruction used to generate the generation
with the LLM
.
prompt_id (str
): The SHA256
hash of the prompt
.
messages (List[Dict[str, str]]
): The chat-like conversation with the instruction
as the user message and the generation
as the assistant message.
from distilabel.steps import FormatChatGenerationSFT\n\nformat_sft = FormatChatGenerationSFT()\nformat_sft.load()\n\n# NOTE: \"system_prompt\" can be added optionally.\nresult = next(\n format_sft.process(\n [\n {\n \"messages\": [{\"role\": \"user\", \"content\": \"What's 2+2?\"}],\n \"generation\": \"4\"\n }\n ]\n )\n)\n# >>> result\n# [\n# {\n# 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n# 'generation': '4',\n# 'prompt': 'What's 2+2?',\n# 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n# }\n# ]\n
"},{"location":"components-gallery/steps/deitafiltering/","title":"DeitaFiltering","text":"Filter dataset rows using DEITA filtering strategy.
Filter the dataset based on the DEITA score and the cosine distance between the embeddings. It's an implementation of the filtering step from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
"},{"location":"components-gallery/steps/deitafiltering/#attributes","title":"Attributes","text":"data_budget: The desired size of the dataset after filtering.
diversity_threshold: If a row has a cosine distance with respect to its nearest neighbor greater than this value, it will be included in the filtered dataset. Defaults to 0.9
.
normalize_embeddings: Whether to normalize the embeddings before computing the cosine distance. Defaults to True
.
data_budget: The desired size of the dataset after filtering.
diversity_threshold: If a row has a cosine distance with respect to its nearest neighbor greater than this value, it will be included in the filtered dataset.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[evol_instruction_score]\n ICOL1[evol_response_score]\n ICOL2[embedding]\n end\n subgraph New columns\n OCOL0[deita_score]\n OCOL1[deita_score_computed_with]\n OCOL2[nearest_neighbor_distance]\n end\n end\n\n subgraph DeitaFiltering\n StepInput[Input Columns: evol_instruction_score, evol_response_score, embedding]\n StepOutput[Output Columns: deita_score, deita_score_computed_with, nearest_neighbor_distance]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/deitafiltering/#inputs","title":"Inputs","text":"evol_instruction_score (float
): The score of the instruction generated by ComplexityScorer
step.
evol_response_score (float
): The score of the response generated by QualityScorer
step.
embedding (List[float]
): The embedding generated for the conversation of the instruction-response pair using GenerateEmbeddings
step.
deita_score (float
): The DEITA score for the instruction-response pair.
deita_score_computed_with (List[str]
): The scores used to compute the DEITA score.
nearest_neighbor_distance (float
): The cosine distance between the embeddings of the instruction-response pair.
from distilabel.steps import DeitaFiltering\n\ndeita_filtering = DeitaFiltering(data_budget=1)\n\ndeita_filtering.load()\n\nresult = next(\n deita_filtering.process(\n [\n {\n \"evol_instruction_score\": 0.5,\n \"evol_response_score\": 0.5,\n \"embedding\": [-8.12729941, -5.24642847, -6.34003029],\n },\n {\n \"evol_instruction_score\": 0.6,\n \"evol_response_score\": 0.6,\n \"embedding\": [2.99329242, 0.7800932, 0.7799726],\n },\n {\n \"evol_instruction_score\": 0.7,\n \"evol_response_score\": 0.7,\n \"embedding\": [10.29041806, 14.33088073, 13.00557506],\n },\n ],\n )\n)\n# >>> result\n# [{'evol_instruction_score': 0.5, 'evol_response_score': 0.5, 'embedding': [-8.12729941, -5.24642847, -6.34003029], 'deita_score': 0.25, 'deita_score_computed_with': ['evol_instruction_score', 'evol_response_score'], 'nearest_neighbor_distance': 1.9042812683723933}]\n
"},{"location":"components-gallery/steps/deitafiltering/#references","title":"References","text":"Deduplicates text using embeddings.
EmbeddingDedup
is a Step that detects near-duplicates in datasets, using embeddings to compare the similarity between the texts. The typical workflow with this step would include having a dataset with embeddings precomputed, and then (possibly using the FaissNearestNeighbour
) using the nn_indices
and nn_scores
, determine the texts that are duplicate.
0.9
would make all the texts with a cosine similarity above the value duplicates. Higher values detect less duplicates in such an index, but that should be taken into account when building it. Defaults to 0.9
. Runtime Parameters: - threshold
: the threshold to consider 2 examples as duplicates.graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[nn_indices]\n ICOL1[nn_scores]\n end\n subgraph New columns\n OCOL0[keep_row_after_embedding_filtering]\n end\n end\n\n subgraph EmbeddingDedup\n StepInput[Input Columns: nn_indices, nn_scores]\n StepOutput[Output Columns: keep_row_after_embedding_filtering]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/embeddingdedup/#inputs","title":"Inputs","text":"nn_indices (List[int]
): a list containing the indices of the k
nearest neighbours in the inputs for the row.
nn_scores (List[float]
): a list containing the score or distance to each k
nearest neighbour in the inputs.
bool
): boolean indicating if the piece text
is not a duplicate i.e. this text should be kept.from distilabel.pipeline import Pipeline\nfrom distilabel.steps import EmbeddingDedup\nfrom distilabel.steps import LoadDataFromDicts\n\nwith Pipeline() as pipeline:\n data = LoadDataFromDicts(\n data=[\n {\n \"persona\": \"A chemistry student or academic researcher interested in inorganic or physical chemistry, likely at an advanced undergraduate or graduate level, studying acid-base interactions and chemical bonding.\",\n \"embedding\": [\n 0.018477669046149742,\n -0.03748236608841726,\n 0.001919870620352492,\n 0.024918478063770535,\n 0.02348063521315178,\n 0.0038251285566308375,\n -0.01723884983037716,\n 0.02881971942372201,\n ],\n \"nn_indices\": [0, 1],\n \"nn_scores\": [\n 0.9164746999740601,\n 0.782106876373291,\n ],\n },\n {\n \"persona\": \"A music teacher or instructor focused on theoretical and practical piano lessons.\",\n \"embedding\": [\n -0.0023464179614082125,\n -0.07325472251663565,\n -0.06058678419516501,\n -0.02100326928586996,\n -0.013462744792362657,\n 0.027368447064244242,\n -0.003916070100455717,\n 0.01243614518480423,\n ],\n \"nn_indices\": [0, 2],\n \"nn_scores\": [\n 0.7552462220191956,\n 0.7261884808540344,\n ],\n },\n {\n \"persona\": \"A classical guitar teacher or instructor, likely with experience teaching beginners, who focuses on breaking down complex music notation into understandable steps for their students.\",\n \"embedding\": [\n -0.01630817942328242,\n -0.023760151552345232,\n -0.014249650090627883,\n -0.005713686451446624,\n -0.016033059279131567,\n 0.0071440908501058786,\n -0.05691099643425161,\n 0.01597412704817784,\n ],\n \"nn_indices\": [1, 2],\n \"nn_scores\": [\n 0.8107735514640808,\n 0.7172299027442932,\n ],\n },\n ],\n batch_size=batch_size,\n )\n # In general you should do something like this before the deduplication step, to obtain the\n # `nn_indices` and `nn_scores`. In this case the embeddings are already normalized, so there's\n # no need for it.\n # nn = FaissNearestNeighbour(\n # k=30,\n # metric_type=faiss.METRIC_INNER_PRODUCT,\n # search_batch_size=50,\n # train_size=len(dataset), # The number of embeddings to use for training\n # string_factory=\"IVF300_HNSW32,Flat\" # To use an index (optional, maybe required for big datasets)\n # )\n # Read more about the `string_factory` here:\n # https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index\n\n embedding_dedup = EmbeddingDedup(\n threshold=0.8,\n input_batch_size=batch_size,\n )\n\n data >> embedding_dedup\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False)\n ds = distiset[\"default\"][\"train\"]\n # Filter out the duplicates\n ds_dedup = ds.filter(lambda x: x[\"keep_row_after_embedding_filtering\"])\n
"},{"location":"components-gallery/steps/apigenexecutionchecker/","title":"APIGenExecutionChecker","text":"Executes the generated function calls.
This step checks if a given answer from a model as generated by APIGenGenerator
can be executed against the given library (given by libpath
, which is a string pointing to a python .py file with functions).
libpath: The path to the library where we will retrieve the functions. It can also point to a folder with the functions. In this case, the folder layout should be a folder with .py files, each containing a single function, the name of the function being the same as the filename.
check_is_dangerous: Bool to exclude some potentially dangerous functions, it contains some heuristics found while testing. This functions can run subprocesses, deal with the OS, or have other potentially dangerous operations. Defaults to True.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[answers]\n end\n subgraph New columns\n OCOL0[keep_row_after_execution_check]\n OCOL1[execution_result]\n end\n end\n\n subgraph APIGenExecutionChecker\n StepInput[Input Columns: answers]\n StepOutput[Output Columns: keep_row_after_execution_check, execution_result]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/apigenexecutionchecker/#inputs","title":"Inputs","text":"str
): List with arguments to be passed to the function, dumped as a string from a list of dictionaries. Should be loaded using json.loads
.keep_row_after_execution_check (bool
): Whether the function should be kept or not.
execution_result (str
): The result from executing the function.
from distilabel.steps.tasks import APIGenExecutionChecker\n\n# For the libpath you can use as an example the file at the tests folder:\n# ../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py\ntask = APIGenExecutionChecker(\n libpath=\"../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py\",\n)\ntask.load()\n\nres = next(\n task.process(\n [\n {\n \"answers\": [\n {\n \"arguments\": {\n \"initial_velocity\": 0.2,\n \"acceleration\": 0.1,\n \"time\": 0.5,\n },\n \"name\": \"final_velocity\",\n }\n ],\n }\n ]\n )\n)\nres\n#[{'answers': [{'arguments': {'initial_velocity': 0.2, 'acceleration': 0.1, 'time': 0.5}, 'name': 'final_velocity'}], 'keep_row_after_execution_check': True, 'execution_result': ['0.25']}]\n
"},{"location":"components-gallery/steps/apigenexecutionchecker/#references","title":"References","text":"APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
Salesforce/xlam-function-calling-60k
Deduplicates text using MinHash
and MinHashLSH
.
MinHashDedup
is a Step that detects near-duplicates in datasets. The idea roughly translates to the following steps: 1. Tokenize the text into words or ngrams. 2. Create a MinHash
for each text. 3. Store the MinHashes
in a MinHashLSH
. 4. Check if the MinHash
is already in the LSH
, if so, it is a duplicate.
num_perm: the number of permutations to use. Defaults to 128
.
seed: the seed to use for the MinHash. Defaults to 1
.
tokenizer: the tokenizer to use. Available ones are words
or ngrams
. If words
is selected, it tokenizes the text into words using nltk's word tokenizer. ngram
estimates the ngrams (together with the size n
). Defaults to words
.
n: the size of the ngrams to use. Only relevant if tokenizer=\"ngrams\"
. Defaults to 5
.
threshold: the threshold to consider two MinHashes as duplicates. Values closer to 0 detect more duplicates. Defaults to 0.9
.
storage: the storage to use for the LSH. Can be dict
to store the index in memory, or disk
. Keep in mind, disk
is an experimental feature not defined in datasketch
, that is based on DiskCache's Index
class. It should work as a dict
, but backed by disk, but depending on the system it can be slower. Defaults to dict
.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[text]\n end\n subgraph New columns\n OCOL0[keep_row_after_minhash_filtering]\n end\n end\n\n subgraph MinHashDedup\n StepInput[Input Columns: text]\n StepOutput[Output Columns: keep_row_after_minhash_filtering]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/minhashdedup/#inputs","title":"Inputs","text":"str
): the texts to be filtered.bool
): boolean indicating if the piece text
is not a duplicate i.e. this text should be kept.from distilabel.pipeline import Pipeline\nfrom distilabel.steps import MinHashDedup\nfrom distilabel.steps import LoadDataFromDicts\n\nwith Pipeline() as pipeline:\n ds_size = 1000\n batch_size = 500 # Bigger batch sizes work better for this step\n data = LoadDataFromDicts(\n data=[\n {\"text\": \"This is a test document.\"},\n {\"text\": \"This document is a test.\"},\n {\"text\": \"Test document for duplication.\"},\n {\"text\": \"Document for duplication test.\"},\n {\"text\": \"This is another unique document.\"},\n ]\n * (ds_size // 5),\n batch_size=batch_size,\n )\n minhash_dedup = MinHashDedup(\n tokenizer=\"words\",\n threshold=0.9, # lower values will increase the number of duplicates\n storage=\"dict\", # or \"disk\" for bigger datasets\n )\n\n data >> minhash_dedup\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False)\n ds = distiset[\"default\"][\"train\"]\n # Filter out the duplicates\n ds_dedup = ds.filter(lambda x: x[\"keep_row_after_minhash_filtering\"])\n
"},{"location":"components-gallery/steps/minhashdedup/#references","title":"References","text":"datasketch documentation
Identifying and Filtering Near-Duplicate Documents
Diskcache's Index
Combine the outputs of several upstream steps.
CombineOutputs
is a Step
that takes the outputs of several upstream steps and combines them to generate a new dictionary with all keys/columns of the upstream steps outputs.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph CombineOutputs\n StepInput[Input Columns: dynamic]\n StepOutput[Output Columns: dynamic]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/combineoutputs/#inputs","title":"Inputs","text":"Step
s): All the columns of the upstream steps outputs.Step
s): All the columns of the upstream steps outputs.from distilabel.steps import CombineOutputs\n\ncombine_outputs = CombineOutputs()\ncombine_outputs.load()\n\nresult = next(\n combine_outputs.process(\n [{\"a\": 1, \"b\": 2}, {\"a\": 3, \"b\": 4}],\n [{\"c\": 5, \"d\": 6}, {\"c\": 7, \"d\": 8}],\n )\n)\n# [\n# {\"a\": 1, \"b\": 2, \"c\": 5, \"d\": 6},\n# {\"a\": 3, \"b\": 4, \"c\": 7, \"d\": 8},\n# ]\n
"},{"location":"components-gallery/steps/combineoutputs/#combine-upstream-steps-outputs-in-a-pipeline","title":"Combine upstream steps outputs in a pipeline","text":"from distilabel.pipeline import Pipeline\nfrom distilabel.steps import CombineOutputs\n\nwith Pipeline() as pipeline:\n step_1 = ...\n step_2 = ...\n step_3 = ...\n combine = CombineOutputs()\n\n [step_1, step_2, step_3] >> combine\n
"},{"location":"components-gallery/steps/expandcolumns/","title":"ExpandColumns","text":"Expand columns that contain lists into multiple rows.
ExpandColumns
is a Step
that takes a list of columns and expands them into multiple rows. The new rows will have the same data as the original row, except for the expanded column, which will contain a single item from the original list.
columns: A dictionary that maps the column to be expanded to the new column name or a list of columns to be expanded. If a list is provided, the new column name will be the same as the column name.
encoded: A bool indicating whether the columns are JSON-encoded lists. If this value is set to True, the columns will be decoded before expanding. Alternatively, a list of the columns that are encoded can be provided; in this case, the column names provided must be a subset of the columns selected for expansion.
split_statistics: A bool to inform whether the statistics in the distilabel_metadata
column should be split into multiple rows. If we want to expand some columns containing a list of strings that come from having parsed the output of an LLM, the tokens in the statistics_{step_name}
of the distilabel_metadata
column should be split to avoid multiplying them if we aggregate the data afterwards. For example, with a task that is supposed to generate a list of N instructions, and we want each of those N instructions in different rows, we should split the statistics by N. In such a case, set this value to True.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph ExpandColumns\n StepInput[Input Columns: dynamic]\n StepOutput[Output Columns: dynamic]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/expandcolumns/#inputs","title":"Inputs","text":"columns
attribute): The columns to be expanded into multiple rows.columns
attribute): The expanded columns.from distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(\n columns=[\"generation\"],\n)\nexpand_columns.load()\n\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": [\"generation 1\", \"generation 2\"]}\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction 1', 'generation': 'generation 1'}, {'instruction': 'instruction 1', 'generation': 'generation 2'}]\n
"},{"location":"components-gallery/steps/expandcolumns/#expand-the-selected-columns-which-are-json-encoded-into-multiple-rows","title":"Expand the selected columns which are JSON encoded into multiple rows","text":"from distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(\n columns=[\"generation\"],\n encoded=True, # It can also be a list of columns that are encoded, i.e. [\"generation\"]\n)\nexpand_columns.load()\n\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": '[\"generation 1\", \"generation 2\"]'}\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction 1', 'generation': 'generation 1'}, {'instruction': 'instruction 1', 'generation': 'generation 2'}]\n
"},{"location":"components-gallery/steps/expandcolumns/#expand-the-selected-columns-and-split-the-statistics-in-the-distilabel_metadata-column","title":"Expand the selected columns and split the statistics in the distilabel_metadata
column","text":"from distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(\n columns=[\"generation\"],\n split_statistics=True,\n)\nexpand_columns.load()\n\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": [\"generation 1\", \"generation 2\"],\n \"distilabel_metadata\": {\n \"statistics_generation\": {\n \"input_tokens\": [12],\n \"output_tokens\": [12],\n },\n },\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction 1', 'generation': 'generation 1', 'distilabel_metadata': {'statistics_generation': {'input_tokens': [6], 'output_tokens': [6]}}}, {'instruction': 'instruction 1', 'generation': 'generation 2', 'distilabel_metadata': {'statistics_generation': {'input_tokens': [6], 'output_tokens': [6]}}}]\n
"},{"location":"components-gallery/steps/groupcolumns/","title":"GroupColumns","text":"Combines columns from a list of StepInput
.
GroupColumns
is a Step
that implements the process
method that calls the group_dicts
function to handle and combine a list of StepInput
. Also GroupColumns
provides two attributes columns
and output_columns
to specify the columns to group and the output columns which will override the default value for the properties inputs
and outputs
, respectively.
columns: List of strings with the names of the columns to group.
output_columns: Optional list of strings with the names of the output columns.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph GroupColumns\n StepInput[Input Columns: dynamic]\n StepOutput[Output Columns: dynamic]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/groupcolumns/#inputs","title":"Inputs","text":"columns
attribute): The columns to group.columns
and output_columns
attributes): The columns that were grouped.from distilabel.steps import GroupColumns\n\ngroup_columns = GroupColumns(\n name=\"group_columns\",\n columns=[\"generation\", \"model_name\"],\n)\ngroup_columns.load()\n\nresult = next(\n group_columns.process(\n [{\"generation\": \"AI generated text\"}, {\"model_name\": \"my_model\"}],\n [{\"generation\": \"Other generated text\", \"model_name\": \"my_model\"}]\n )\n)\n# >>> result\n# [{'merged_generation': ['AI generated text', 'Other generated text'], 'merged_model_name': ['my_model']}]\n
"},{"location":"components-gallery/steps/groupcolumns/#specify-the-name-of-the-output-columns","title":"Specify the name of the output columns","text":"from distilabel.steps import GroupColumns\n\ngroup_columns = GroupColumns(\n name=\"group_columns\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"generation_models\"]\n)\ngroup_columns.load()\n\nresult = next(\n group_columns.process(\n [{\"generation\": \"AI generated text\"}, {\"model_name\": \"my_model\"}],\n [{\"generation\": \"Other generated text\", \"model_name\": \"my_model\"}]\n )\n)\n# >>> result\n#[{'generations': ['AI generated text', 'Other generated text'], 'generation_models': ['my_model']}]\n
"},{"location":"components-gallery/steps/keepcolumns/","title":"KeepColumns","text":"Keeps selected columns in the dataset.
KeepColumns
is a Step
that implements the process
method that keeps only the columns specified in the columns
attribute. Also KeepColumns
provides an attribute columns
to specify the columns to keep which will override the default value for the properties inputs
and outputs
.
The order in which the columns are provided is important, as the output will be sorted using the provided order, which is useful before pushing either a dataset.Dataset
via the PushToHub
step or a distilabel.Distiset
via the Pipeline.run
output variable.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph KeepColumns\n StepInput[Input Columns: dynamic]\n StepOutput[Output Columns: dynamic]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/keepcolumns/#inputs","title":"Inputs","text":"columns
attribute): The columns to keep.columns
attribute): The columns that were kept.from distilabel.steps import KeepColumns\n\nkeep_columns = KeepColumns(\n columns=[\"instruction\", \"generation\"],\n)\nkeep_columns.load()\n\nresult = next(\n keep_columns.process(\n [{\"instruction\": \"What's the brightest color?\", \"generation\": \"white\", \"model_name\": \"my_model\"}],\n )\n)\n# >>> result\n# [{'instruction': \"What's the brightest color?\", 'generation': 'white'}]\n
"},{"location":"components-gallery/steps/mergecolumns/","title":"MergeColumns","text":"Merge columns from a row.
MergeColumns
is a Step
that implements the process
method that calls the merge_columns
function to handle and combine columns in a StepInput
. MergeColumns
provides two attributes columns
and output_column
to specify the columns to merge and the resulting output column.
This step can be useful if you have a `Task` that generates instructions, for example, and you\nwant to have more examples of those. In such a case, you could for example use another `Task`\nto multiply your instructions synthetically, which would yield two separate columns.\nUsing `MergeColumns` you can merge them and use them as a single column in your dataset for\nfurther processing.\n
"},{"location":"components-gallery/steps/mergecolumns/#attributes","title":"Attributes","text":"columns: List of strings with the names of the columns to merge.
output_column: str name of the output column
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph MergeColumns\n StepInput[Input Columns: dynamic]\n StepOutput[Output Columns: dynamic]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/mergecolumns/#inputs","title":"Inputs","text":"columns
attribute): The columns to merge.columns
and output_column
attributes): The columns that were merged.from distilabel.steps import MergeColumns\n\ncombiner = MergeColumns(\n columns=[\"queries\", \"multiple_queries\"],\n output_column=\"queries\",\n)\ncombiner.load()\n\nresult = next(\n combiner.process(\n [\n {\n \"queries\": \"How are you?\",\n \"multiple_queries\": [\"What's up?\", \"Everything ok?\"]\n }\n ],\n )\n)\n# >>> result\n# [{'queries': ['How are you?', \"What's up?\", 'Everything ok?']}]\n
"},{"location":"components-gallery/steps/dbscan/","title":"DBSCAN","text":"DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core
samples in regions of high density and expands clusters from them. This algorithm is good for data which contains clusters of similar density.
This is a `GlobalStep` that clusters the embeddings using the DBSCAN algorithm\nfrom `sklearn`. Visit `TextClustering` step for an example of use.\nThe trained model is saved as an artifact when creating a distiset\nand pushing it to the Hugging Face Hub.\n
"},{"location":"components-gallery/steps/dbscan/#attributes","title":"Attributes","text":"min_samples
is set to a higher value, DBSCAN will find denser clusters, whereas if it is set to a lower value, the found clusters will be more sparse. - metric: The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances
for its metric parameter. - n_jobs: The number of parallel jobs to run.eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
min_samples: The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. If min_samples
is set to a higher value, DBSCAN will find denser clusters, whereas if it is set to a lower value, the found clusters will be more sparse.
metric: The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances
for its metric parameter.
n_jobs: The number of parallel jobs to run.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[projection]\n end\n subgraph New columns\n OCOL0[cluster_label]\n end\n end\n\n subgraph DBSCAN\n StepInput[Input Columns: projection]\n StepOutput[Output Columns: cluster_label]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/dbscan/#inputs","title":"Inputs","text":"List[float]
): Vector representation of the text to cluster, normally the output from the UMAP
step.int
): Integer representing the label of a given cluster. -1 means it wasn't clustered.DBSCAN demo of sklearn
sklearn dbscan
UMAP is a general purpose manifold learning and dimension reduction algorithm.
This is a GlobalStep
that reduces the dimensionality of the embeddings using. Visit the TextClustering
step for an example of use. The trained model is saved as an artifact when creating a distiset and pushing it to the Hugging Face Hub.
euclidean
. - n_jobs: The number of parallel jobs to run. Defaults to 8
. - random_state: The random state to use for the UMAP algorithm.n_components: The dimension of the space to embed into. This defaults to 2 to provide easy visualization (that's probably what you want), but can reasonably be set to any integer value in the range 2 to 100.
metric: The metric to use to compute distances in high dimensional space. Visit UMAP's documentation for more information. Defaults to euclidean
.
n_jobs: The number of parallel jobs to run. Defaults to 8
.
random_state: The random state to use for the UMAP algorithm.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[embedding]\n end\n subgraph New columns\n OCOL0[projection]\n end\n end\n\n subgraph UMAP\n StepInput[Input Columns: embedding]\n StepOutput[Output Columns: projection]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/umap/#inputs","title":"Inputs","text":"List[float]
): The original embeddings we want to reduce the dimension.List[float]
): Embedding reduced to the number of components specified, the size of the new embeddings will be determined by the n_components
.UMAP repository
UMAP documentation
Create a faiss
index to get the nearest neighbours.
FaissNearestNeighbour
is a GlobalStep
that creates a faiss
index using the Hugging Face datasets
library integration, and then gets the nearest neighbours and the scores or distance of the nearest neighbours for each input row.
device: the CUDA device ID or a list of IDs to be used. If negative integer, it will use all the available GPUs. Defaults to None
.
string_factory: the name of the factory to be used to build the faiss
index. Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes. Defaults to None
.
metric_type: the metric to be used to measure the distance between the points. It's an integer and the recommend way to pass it is importing faiss
and then passing one of faiss.METRIC_x
variables. Defaults to None
.
k: the number of nearest neighbours to search for each input row. Defaults to 1
.
search_batch_size: the number of rows to include in a search batch. The value can be adjusted to maximize the resources usage or to avoid OOM issues. Defaults to 50
.
train_size: If the index needs a training step, specifies how many vectors will be used to train the index.
device: the CUDA device ID or a list of IDs to be used. If negative integer, it will use all the available GPUs. Defaults to None
.
string_factory: the name of the factory to be used to build the faiss
index. Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes. Defaults to None
.
metric_type: the metric to be used to measure the distance between the points. It's an integer and the recommend way to pass it is importing faiss
and then passing one of faiss.METRIC_x
variables. Defaults to None
.
k: the number of nearest neighbours to search for each input row. Defaults to 1
.
search_batch_size: the number of rows to include in a search batch. The value can be adjusted to maximize resource usage or to avoid OOM issues. Defaults to 50
.
train_size: If the index needs a training step, specifies how many vectors will be used to train the index.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[embedding]\n end\n subgraph New columns\n OCOL0[nn_indices]\n OCOL1[nn_scores]\n end\n end\n\n subgraph FaissNearestNeighbour\n StepInput[Input Columns: embedding]\n StepOutput[Output Columns: nn_indices, nn_scores]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/faissnearestneighbour/#inputs","title":"Inputs","text":"List[Union[float, int]]
): a sentence embedding.nn_indices (List[int]
): a list containing the indices of the k
nearest neighbours in the inputs for the row.
nn_scores (List[float]
): a list containing the score or distance to each k
nearest neighbour in the inputs.
from distilabel.models import SentenceTransformerEmbeddings\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import EmbeddingGeneration, FaissNearestNeighbour, LoadDataFromHub\n\nwith Pipeline(name=\"hello\") as pipeline:\n load_data = LoadDataFromHub(output_mappings={\"prompt\": \"text\"})\n\n embeddings = EmbeddingGeneration(\n embeddings=SentenceTransformerEmbeddings(\n model=\"mixedbread-ai/mxbai-embed-large-v1\"\n )\n )\n\n nearest_neighbours = FaissNearestNeighbour()\n\n load_data >> embeddings >> nearest_neighbours\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n load_data.name: {\n \"repo_id\": \"distilabel-internal-testing/instruction-dataset-mini\",\n \"split\": \"test\",\n },\n },\n use_cache=False,\n )\n
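Beyond the defaults used above, the index itself can be tuned through the string_factory, metric_type, train_size and k attributes. The following hedged sketch builds an inner-product IVF index that returns the 5 nearest neighbours; the \"IVF64,Flat\" factory string and the training size of 1000 vectors are illustrative values, not recommendations.
import faiss\nfrom distilabel.steps import FaissNearestNeighbour\n\n# Illustrative configuration: an IVF index trained on 1,000 vectors that measures similarity\n# with the inner product and returns the 5 nearest neighbours for each input row.\nnearest_neighbours = FaissNearestNeighbour(\n    string_factory=\"IVF64,Flat\",\n    metric_type=faiss.METRIC_INNER_PRODUCT,\n    train_size=1000,\n    k=5,\n)\n
This configured step can be dropped into the pipeline above in place of the default FaissNearestNeighbour().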
"},{"location":"components-gallery/steps/faissnearestneighbour/#references","title":"References","text":"Generate embeddings using an Embeddings
model.
EmbeddingGeneration
is a Step
that uses an Embeddings
model to generate sentence embeddings for the provided input texts.
Embeddings
model used to generate the sentence embeddings.graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[text]\n end\n subgraph New columns\n OCOL0[embedding]\n end\n end\n\n subgraph EmbeddingGeneration\n StepInput[Input Columns: text]\n StepOutput[Output Columns: embedding]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/embeddinggeneration/#inputs","title":"Inputs","text":"str
): The text for which the sentence embedding has to be generated.List[Union[float, int]]
): the generated sentence embedding.from distilabel.models import SentenceTransformerEmbeddings\nfrom distilabel.steps import EmbeddingGeneration\n\nembedding_generation = EmbeddingGeneration(\n embeddings=SentenceTransformerEmbeddings(\n model=\"mixedbread-ai/mxbai-embed-large-v1\",\n )\n)\n\nembedding_generation.load()\n\nresult = next(embedding_generation.process([{\"text\": \"Hello, how are you?\"}]))\n# [{'text': 'Hello, how are you?', 'embedding': [0.06209656596183777, -0.015797119587659836, ...]}]\n
"},{"location":"components-gallery/steps/rewardmodelscore/","title":"RewardModelScore","text":"Assign a score to a response using a Reward Model.
RewardModelScore
is a Step
that, using a Reward Model (RM) loaded with transformers
, assigns a score to a response generated for an instruction, or a score to a multi-turn conversation.
model: the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
revision: if model
refers to a Hugging Face Hub repository, then the revision (e.g. a branch name or a commit id) to use. Defaults to \"main\"
.
torch_dtype: the torch dtype to use for the model e.g. \"float16\", \"float32\", etc. Defaults to \"auto\"
.
trust_remote_code: whether to allow fetching and executing remote code fetched from the repository in the Hub. Defaults to False
.
device_map: a dictionary mapping each layer of the model to a device, or a mode like \"sequential\"
or \"auto\"
. Defaults to None
.
token: the Hugging Face Hub token that will be used to authenticate to the Hugging Face Hub. If not provided, the HF_TOKEN
environment or huggingface_hub
package local configuration will be used. Defaults to None
.
truncation: whether to truncate sequences at the maximum length. Defaults to False
.
max_length: maximum length to use for padding or truncation. Defaults to None
.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[response]\n ICOL2[conversation]\n end\n subgraph New columns\n OCOL0[score]\n end\n end\n\n subgraph RewardModelScore\n StepInput[Input Columns: instruction, response, conversation]\n StepOutput[Output Columns: score]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/rewardmodelscore/#inputs","title":"Inputs","text":"instruction (str
, optional): the instruction used to generate a response
. If provided, then response
must be provided too.
response (str
, optional): the response generated for instruction
. If provided, then instruction
must be provided too.
conversation (ChatType
, optional): a multi-turn conversation. If not provided, then instruction
and response
columns must be provided.
float
): the score given by the reward model for the instruction-response pair or the conversation.from distilabel.steps import RewardModelScore\n\nstep = RewardModelScore(\n model=\"RLHFlow/ArmoRM-Llama3-8B-v0.1\", device_map=\"auto\", trust_remote_code=True\n)\n\nstep.load()\n\nresult = next(\n step.process(\n inputs=[\n {\n \"instruction\": \"How much is 2+2?\",\n \"response\": \"The output of 2+2 is 4\",\n },\n {\"instruction\": \"How much is 2+2?\", \"response\": \"4\"},\n ]\n )\n)\n# [\n# {'instruction': 'How much is 2+2?', 'response': 'The output of 2+2 is 4', 'score': 0.11690367758274078},\n# {'instruction': 'How much is 2+2?', 'response': '4', 'score': 0.10300665348768234}\n# ]\n
"},{"location":"components-gallery/steps/rewardmodelscore/#turn-conversation","title":"turn conversation","text":"from distilabel.steps import RewardModelScore\n\nstep = RewardModelScore(\n model=\"RLHFlow/ArmoRM-Llama3-8B-v0.1\", device_map=\"auto\", trust_remote_code=True\n)\n\nstep.load()\n\nresult = next(\n step.process(\n inputs=[\n {\n \"conversation\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n {\"role\": \"assistant\", \"content\": \"The output of 2+2 is 4\"},\n ],\n },\n {\n \"conversation\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n {\"role\": \"assistant\", \"content\": \"4\"},\n ],\n },\n ]\n )\n)\n# [\n# {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 'assistant', 'content': 'The output of 2+2 is 4'}], 'score': 0.11690367758274078},\n# {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 'assistant', 'content': '4'}], 'score': 0.10300665348768234}\n# ]\n
"},{"location":"components-gallery/steps/formatprm/","title":"FormatPRM","text":"Helper step to transform the data into the format expected by the PRM model.
This step can be used to format the data in one of two formats: following the format presented in peiyi9979/Math-Shepherd, in which case this step creates the columns input and label, where the input is the instruction with the solution (and the tag replaced by a token) and the label is the instruction with the solution, both separated by a newline; or following TRL's format for training, which generates the columns prompt, completions, and labels, where the labels correspond to the original tags replaced by boolean values, with True representing correct steps.
"},{"location":"components-gallery/steps/formatprm/#attributes","title":"Attributes","text":"format: The format to use for the PRM model. \"math-shepherd\" corresponds to the original paper, while \"trl\" is a format prepared to train the model using TRL.
step_token: String that serves as a unique token denoting the position for predicting the step score.
tags: List of tags that represent the correct and incorrect steps. This only needs to be provided if it differs from the default in MathShepherdCompleter
.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[solutions]\n end\n subgraph New columns\n OCOL0[input]\n OCOL1[label]\n OCOL2[prompt]\n OCOL3[completions]\n OCOL4[labels]\n end\n end\n\n subgraph FormatPRM\n StepInput[Input Columns: instruction, solutions]\n StepOutput[Output Columns: input, label, prompt, completions, labels]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n StepOutput --> OCOL4\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/formatprm/#inputs","title":"Inputs","text":"instruction (str
): The task or instruction.
solutions (list[str]
): List of steps with a solution to the task.
input (str
): The instruction with the solutions, where the label tags are replaced by a token.
label (str
): The instruction with the solutions.
prompt (str
): The instruction with the solutions, where the label tags are replaced by a token.
completions (List[str]
): The solution represented as a list of steps.
labels (List[bool]
): The labels, as a list of booleans, where True represents a good response.
from distilabel.steps.tasks import FormatPRM\nfrom distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(columns=[\"solutions\"])\nexpand_columns.load()\n\n# Define our PRM formatter\nformatter = FormatPRM()\nformatter.load()\n\n# Expand the solutions column as it comes from the MathShepherdCompleter\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"solutions\": [[\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"]]\n },\n ]\n )\n)\nresult = next(formatter.process(result))\n
"},{"location":"components-gallery/steps/formatprm/#prepare-your-data-to-train-a-prm-model-with-the-trl-format","title":"Prepare your data to train a PRM model with the TRL format","text":"from distilabel.steps.tasks import FormatPRM\nfrom distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(columns=[\"solutions\"])\nexpand_columns.load()\n\n# Define our PRM formatter\nformatter = FormatPRM(format=\"trl\")\nformatter.load()\n\n# Expand the solutions column as it comes from the MathShepherdCompleter\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"solutions\": [[\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"]]\n },\n ]\n )\n)\n\nresult = next(formatter.process(result))\n# {\n# \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# \"solutions\": [\n# \"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\",\n# \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. 
+\",\n# \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"\n# ],\n# \"prompt\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# \"completions\": [\n# \"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required.\",\n# \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber.\",\n# \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3\"\n# ],\n# \"labels\": [\n# true,\n# true,\n# true\n# ]\n# }\n
"},{"location":"components-gallery/steps/formatprm/#references","title":"References","text":"Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
peiyi9979/Math-Shepherd
Truncate a row using a tokenizer or the number of characters.
TruncateTextColumn
is a Step
that truncates a row according to the max length. If the tokenizer
is provided, then the row will be truncated using the tokenizer, and the max_length
will be used as the maximum number of tokens, otherwise it will be used as the maximum number of characters. The TruncateTextColumn
step is useful when one wants to truncate a row to a certain length, to avoid downstream errors in the model due to the length.
column: the column to truncate. Defaults to \"text\"
.
max_length: the maximum length to use for truncation. If a tokenizer
is given, corresponds to the number of tokens, otherwise corresponds to the number of characters. Defaults to 8192
.
tokenizer: the name of the tokenizer to use. If provided, the row will be truncated using the tokenizer. Defaults to None
.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph TruncateTextColumn\n StepInput[Input Columns: dynamic]\n StepOutput[Output Columns: dynamic]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/truncatetextcolumn/#inputs","title":"Inputs","text":"column
attribute): The columns to be truncated, defaults to \"text\".column
attribute): The truncated column.from distilabel.steps import TruncateTextColumn\n\ntrunc = TruncateTextColumn(\n tokenizer=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n max_length=4,\n column=\"text\"\n)\n\ntrunc.load()\n\nresult = next(\n trunc.process(\n [\n {\"text\": \"This is a sample text that is longer than 10 characters\"}\n ]\n )\n)\n# result\n# [{'text': 'This is a sample'}]\n
"},{"location":"components-gallery/steps/truncatetextcolumn/#truncating-a-row-to-a-given-number-of-characters","title":"Truncating a row to a given number of characters","text":"from distilabel.steps import TruncateTextColumn\n\ntrunc = TruncateTextColumn(max_length=10)\n\ntrunc.load()\n\nresult = next(\n trunc.process(\n [\n {\"text\": \"This is a sample text that is longer than 10 characters\"}\n ]\n )\n)\n# result\n# [{'text': 'This is a '}]\n
"},{"location":"components-gallery/tasks/","title":"Tasks Gallery","text":"Category Overview The gallery page showcases the different types of components within distilabel
.
APIGenGenerator
Generate queries and answers for the given functions in JSON format.
APIGenGenerator
Genstruct
Generate a pair of instruction-response from a document using an LLM
.
Genstruct
Magpie
Generates conversations using an instruct fine-tuned LLM.
Magpie
MathShepherdCompleter
Math Shepherd Completer and auto-labeller task.
MathShepherdCompleter
MathShepherdGenerator
Math Shepherd solution generator.
MathShepherdGenerator
SelfInstruct
Generate instructions based on a given input using an LLM
.
SelfInstruct
TextGeneration
Text generation with an LLM
given a prompt.
TextGeneration
TextGenerationWithImage
Text generation with images with an LLM
given a prompt.
TextGenerationWithImage
URIAL
Generates a response using a non-instruct fine-tuned model.
URIAL
MagpieGenerator
Generator task that generates instructions or conversations using Magpie.
MagpieGenerator
ChatGeneration
Generates text based on a conversation.
ChatGeneration
ArgillaLabeller
Annotate Argilla records based on input fields, example records and question settings.
ArgillaLabeller
TextClassification
Classifies text into one or more categories or labels.
TextClassification
EvolInstruct
Evolve instructions using an LLM
.
EvolInstruct
EvolComplexity
Evolve instructions to make them more complex using an LLM
.
EvolComplexity
EvolQuality
Evolve the quality of the responses using an LLM
.
EvolQuality
EvolInstructGenerator
Generate evolved instructions using an LLM
.
EvolInstructGenerator
EvolComplexityGenerator
Generate evolved instructions with increased complexity using an LLM
.
EvolComplexityGenerator
InstructionBacktranslation
Self-Alignment with Instruction Backtranslation.
InstructionBacktranslation
PrometheusEval
Critique and rank the quality of generations from an LLM
using Prometheus 2.0.
PrometheusEval
ComplexityScorer
Score instructions based on their complexity using an LLM
.
ComplexityScorer
QualityScorer
Score responses based on their quality using an LLM
.
QualityScorer
CLAIR
Contrastive Learning from AI Revisions (CLAIR).
CLAIR
UltraFeedback
Rank generations focusing on different aspects using an LLM
.
UltraFeedback
PairRM
Rank the candidates based on the input using the LLM
model.
PairRM
GenerateSentencePair
Generate a positive and (optionally) a negative sentence given an anchor sentence.
GenerateSentencePair
GenerateEmbeddings
Generate embeddings using the last hidden state of an LLM
.
GenerateEmbeddings
TextClustering
Task that clusters a set of texts and generates summary labels for each cluster.
TextClustering
TextClustering
Task that clusters a set of texts and generates summary labels for each cluster.
TextClustering
APIGenSemanticChecker
Generate queries and answers for the given functions in JSON format.
APIGenSemanticChecker
GenerateTextRetrievalData
Generate text retrieval data with an LLM
to later on train an embedding model.
GenerateTextRetrievalData
GenerateShortTextMatchingData
Generate short text matching data with an LLM
to later on train an embedding model.
GenerateShortTextMatchingData
GenerateLongTextMatchingData
Generate long text matching data with an LLM
to later on train an embedding model.
GenerateLongTextMatchingData
GenerateTextClassificationData
Generate text classification data with an LLM
to later on train an embedding model.
GenerateTextClassificationData
StructuredGeneration
Generate structured content for a given instruction
using an LLM
.
StructuredGeneration
MonolingualTripletGenerator
Generate monolingual triplets with an LLM
to later on train an embedding model.
MonolingualTripletGenerator
BitextRetrievalGenerator
Generate bitext retrieval data with an LLM
to later on train an embedding model.
BitextRetrievalGenerator
EmbeddingTaskGenerator
Generate task descriptions for embedding-related tasks using an LLM
.
EmbeddingTaskGenerator
Generate queries and answers for the given functions in JSON format.
The APIGenGenerator
is inspired by the APIGen pipeline, which was designed to generate verifiable and diverse function-calling datasets. The task generates a set of diverse queries and corresponding answers for the given functions in JSON format.
system_prompt: The system prompt to guide the user in the generation of queries and answers.
use_tools: Whether to use the tools available in the prompt to generate the queries and answers. In case the tools are given in the input, they will be added to the prompt.
number: The number of queries to generate. It can be a list, where each number will be chosen randomly, or a dictionary with the number of queries and the probability of each, e.g. number=1
, number=[1, 2, 3]
, number={1: 0.5, 2: 0.3, 3: 0.2}
are all valid inputs (see the sketch after this list). It corresponds to the number of parallel queries to generate.
use_default_structured_output: Whether to use the default structured output or not.
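As a hedged illustration of the dictionary form of the number attribute (reusing the probabilities from the description above, and assuming the task class is exposed as APIGenGenerator, as listed in the gallery), the task could be instantiated like this:
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import APIGenGenerator\n\n# The amount of parallel queries per row is sampled from the given probabilities:\n# 1 query 50% of the time, 2 queries 30% and 3 queries 20%.\napigen = APIGenGenerator(\n    llm=InferenceEndpointsLLM(\n        model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    ),\n    number={1: 0.5, 2: 0.3, 3: 0.2},\n)\n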
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[examples]\n ICOL1[func_name]\n ICOL2[func_desc]\n ICOL3[tools]\n end\n subgraph New columns\n OCOL0[query]\n OCOL1[answers]\n end\n end\n\n subgraph APIGenGenerator\n StepInput[Input Columns: examples, func_name, func_desc, tools]\n StepOutput[Output Columns: query, answers]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n ICOL3 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/apigengenerator/#inputs","title":"Inputs","text":"examples (str
): Examples used as few shots to guide the model.
func_name (str
): Name for the function to generate.
func_desc (str
): Description of what the function should do.
tools (str
): JSON formatted string containing the tool representation of the function.
query (str
): The list of queries.
answers (str
): JSON formatted string with the list of answers, containing the info as a dictionary to be passed to the functions.
from distilabel.steps.tasks import ApiGenGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n)\napigen = ApiGenGenerator(\n use_default_structured_output=False,\n llm=llm\n)\napigen.load()\n\nres = next(\n apigen.process(\n [\n {\n \"examples\": 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]',\n \"func_name\": \"getrandommovie\",\n \"func_desc\": \"Returns a list of random movies from a database by calling an external API.\"\n }\n ]\n )\n)\nres\n# [{'examples': 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]',\n# 'number': 1,\n# 'func_name': 'getrandommovie',\n# 'func_desc': 'Returns a list of random movies from a database by calling an external API.',\n# 'queries': ['I want to watch a movie tonight, can you recommend a random one from your database?',\n# 'Give me 5 random movie suggestions from your database to plan my weekend.'],\n# 'answers': [[{'name': 'getrandommovie', 'arguments': {}}],\n# [{'name': 'getrandommovie', 'arguments': {}},\n# {'name': 'getrandommovie', 'arguments': {}},\n# {'name': 'getrandommovie', 'arguments': {}},\n# {'name': 'getrandommovie', 'arguments': {}},\n# {'name': 'getrandommovie', 'arguments': {}}]],\n# 'raw_input_api_gen_generator_0': [{'role': 'system',\n# 'content': \"You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\n\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\n\nEnsure the query:\n- Is clear and concise\n- Demonstrates typical use cases\n- Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\n- The corresponding result's parameter types and ranges match with the function's descriptions\n\nEnsure the answer:\n- Is a list of function calls in JSON format\n- The length of the answer list should be equal to the number of requests in the query\n- Can solve all the requests in the query effectively\"},\n# {'role': 'user',\n# 'content': 'Here are examples of queries and the corresponding answers for similar functions:\nQUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]\n\nNote that the query could be interpreted as a combination of several independent requests.\nBased on these examples, generate 2 diverse query and answer pairs for the function `getrandommovie`\nThe detailed function description is the following:\nReturns a list of random movies from a database by calling an external API.\n\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:\n
"},{"location":"components-gallery/tasks/apigengenerator/#generate-with-structured-output","title":"Generate with structured output","text":"from distilabel.steps.tasks import ApiGenGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n)\napigen = ApiGenGenerator(\n use_default_structured_output=True,\n llm=llm\n)\napigen.load()\n\nres_struct = next(\n apigen.process(\n [\n {\n \"examples\": 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]',\n \"func_name\": \"getrandommovie\",\n \"func_desc\": \"Returns a list of random movies from a database by calling an external API.\"\n }\n ]\n )\n)\nres_struct\n# [{'examples': 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]',\n# 'number': 1,\n# 'func_name': 'getrandommovie',\n# 'func_desc': 'Returns a list of random movies from a database by calling an external API.',\n# 'queries': [\"I'm bored and want to watch a movie. Can you suggest some movies?\",\n# \"My family and I are planning a movie night. We can't decide on what to watch. Can you suggest some random movie titles?\"],\n# 'answers': [[{'arguments': {}, 'name': 'getrandommovie'}],\n# [{'arguments': {}, 'name': 'getrandommovie'}]],\n# 'raw_input_api_gen_generator_0': [{'role': 'system',\n# 'content': \"You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\n\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\n\nEnsure the query:\n- Is clear and concise\n- Demonstrates typical use cases\n- Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\n- The corresponding result's parameter types and ranges match with the function's descriptions\n\nEnsure the answer:\n- Is a list of function calls in JSON format\n- The length of the answer list should be equal to the number of requests in the query\n- Can solve all the requests in the query effectively\"},\n# {'role': 'user',\n# 'content': 'Here are examples of queries and the corresponding answers for similar functions:\nQUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]\n\nNote that the query could be interpreted as a combination of several independent requests.\nBased on these examples, generate 2 diverse query and answer pairs for the function `getrandommovie`\nThe detailed function description is the following:\nReturns a list of random movies from a database by calling an external API.\n\nNow please generate 2 diverse query and answer pairs following the above format.'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/apigengenerator/#references","title":"References","text":"APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
Salesforce/xlam-function-calling-60k
Generate a pair of instruction-response from a document using an LLM
.
Genstruct
is a pre-defined task designed to generate valid instructions from a given raw document, with the title and the content, enabling the creation of new, partially synthetic instruction finetuning datasets from any raw-text corpus. The task is based on the Genstruct 7B model by Nous Research, which is inspired in the Ada-Instruct paper.
The Genstruct prompt i.e. the task, can be used with any model really, but the safest / recommended option is to use NousResearch/Genstruct-7B
as the LLM provided to the task, since it was trained for this specific task.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[title]\n ICOL1[content]\n end\n subgraph New columns\n OCOL0[user]\n OCOL1[assistant]\n OCOL2[model_name]\n end\n end\n\n subgraph Genstruct\n StepInput[Input Columns: title, content]\n StepOutput[Output Columns: user, assistant, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/genstruct/#inputs","title":"Inputs","text":"title (str
): The title of the document.
content (str
): The content of the document.
user (str
): The user's instruction based on the document.
assistant (str
): The assistant's response based on the user's instruction.
model_name (str
): The model name used to generate the feedback
and result
.
from distilabel.steps.tasks import Genstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\ngenstruct = Genstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"NousResearch/Genstruct-7B\",\n ),\n)\n\ngenstruct.load()\n\nresult = next(\n genstruct.process(\n [\n {\"title\": \"common instruction\", \"content\": \"content of the document\"},\n ]\n )\n)\n# result\n# [\n# {\n# 'title': 'An instruction',\n# 'content': 'content of the document',\n# 'model_name': 'test',\n# 'user': 'An instruction',\n# 'assistant': 'content of the document',\n# }\n# ]\n
"},{"location":"components-gallery/tasks/genstruct/#references","title":"References","text":"Genstruct 7B by Nous Research
Ada-Instruct: Adapting Instruction Generators for Complex Reasoning
Generates conversations using an instruct fine-tuned LLM.
Magpie is a neat method that allows generating user instructions with no seed data or specific system prompt thanks to the autoregressive capabilities of the instruct fine-tuned LLMs. As they were fine-tuned using a chat template composed by a user message and a desired assistant output, the instruct fine-tuned LLM learns that after the pre-query or pre-instruct tokens comes an instruction. If these pre-query tokens are sent to the LLM without any user message, then the LLM will continue generating tokens as if it was the user. This trick allows \"extracting\" instructions from the instruct fine-tuned LLM. After this instruct is generated, it can be sent again to the LLM to generate this time an assistant response. This process can be repeated N times allowing to build a multi-turn conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing'.
"},{"location":"components-gallery/tasks/magpie/#attributes","title":"Attributes","text":"n_turns: the number of turns that the generated conversation will have. Defaults to 1
.
end_with_user: whether the conversation should end with a user message. Defaults to False
.
include_system_prompt: whether to include the system prompt used in the generated conversation. Defaults to False
.
only_instruction: whether to generate only the instruction. If this argument is True
, then n_turns
will be ignored. Defaults to False
.
system_prompt: an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be choosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic. Defaults to None
.
n_turns: the number of turns that the generated conversation will have. Defaults to 1
.
end_with_user: whether the conversation should end with a user message. Defaults to False
.
include_system_prompt: whether to include the system prompt used in the generated conversation. Defaults to False
.
only_instruction: whether to generate only the instruction. If this argument is True
, then n_turns
will be ignored. Defaults to False
.
system_prompt: an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be choosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[system_prompt]\n end\n subgraph New columns\n OCOL0[conversation]\n OCOL1[instruction]\n OCOL2[response]\n OCOL3[system_prompt_key]\n OCOL4[model_name]\n end\n end\n\n subgraph Magpie\n StepInput[Input Columns: system_prompt]\n StepOutput[Output Columns: conversation, instruction, response, system_prompt_key, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n StepOutput --> OCOL4\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/magpie/#inputs","title":"Inputs","text":"str
, optional): an optional system prompt that can be provided to guide the generation of the instruct LLM and steer it to generate instructions of certain topic.conversation (ChatType
): the generated conversation which is a list of chat items with a role and a message. Only if only_instruction=False
.
instruction (str
): the generated instructions if only_instruction=True
or n_turns==1
.
response (str
): the generated response if n_turns==1
.
system_prompt_key (str
, optional): the key of the system prompt used to generate the conversation or instruction. Only if system_prompt
is a dictionary.
model_name (str
): The model name used to generate the conversation
or instruction
.
from distilabel.models import TransformersLLM\nfrom distilabel.steps.tasks import Magpie\n\nmagpie = Magpie(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 64,\n },\n device=\"mps\",\n ),\n only_instruction=True,\n)\n\nmagpie.load()\n\nresult = next(\n magpie.process(\n inputs=[\n {\n \"system_prompt\": \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"\n },\n {\n \"system_prompt\": \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"\n },\n ]\n )\n)\n# [\n# {'instruction': \"That's me! I'd love some help with solving calculus problems! What kind of calculation are you most effective at? Linear Algebra, derivatives, integrals, optimization?\"},\n# {'instruction': 'I was wondering if there are certain flowers and plants that can be used for pest control?'}\n# ]\n
"},{"location":"components-gallery/tasks/magpie/#generating-conversations-with-llama-3-8b-instruct-and-transformersllm","title":"Generating conversations with Llama 3 8B Instruct and TransformersLLM","text":"from distilabel.models import TransformersLLM\nfrom distilabel.steps.tasks import Magpie\n\nmagpie = Magpie(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 256,\n },\n device=\"mps\",\n ),\n n_turns=2,\n)\n\nmagpie.load()\n\nresult = next(\n magpie.process(\n inputs=[\n {\n \"system_prompt\": \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"\n },\n {\n \"system_prompt\": \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"\n },\n ]\n )\n)\n# [\n# {\n# 'conversation': [\n# {'role': 'system', 'content': \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"},\n# {\n# 'role': 'user',\n# 'content': 'I'm having trouble solving the limits of functions in calculus. Could you explain how to work with them? Limits of functions are denoted by lim x\u2192a f(x) or lim x\u2192a [f(x)]. It is read as \"the limit as x approaches a of f\n# of x\".'\n# },\n# {\n# 'role': 'assistant',\n# 'content': 'Limits are indeed a fundamental concept in calculus, and understanding them can be a bit tricky at first, but don't worry, I'm here to help! The notation lim x\u2192a f(x) indeed means \"the limit as x approaches a of f of\n# x\". What it's asking us to do is find the'\n# }\n# ]\n# },\n# {\n# 'conversation': [\n# {'role': 'system', 'content': \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"},\n# {\n# 'role': 'user',\n# 'content': \"As a flower shop owner, I'm noticing some unusual worm-like creatures causing damage to my roses and other flowers. Can you help me identify what the problem is? Based on your expertise as a florist AI assistant, I think it\n# might be pests or diseases, but I'm not sure which.\"\n# },\n# {\n# 'role': 'assistant',\n# 'content': \"I'd be delighted to help you investigate the issue! Since you've noticed worm-like creatures damaging your roses and other flowers, I'll take a closer look at the possibilities. Here are a few potential culprits: 1.\n# **Aphids**: These small, soft-bodied insects can secrete a sticky substance called\"\n# }\n# ]\n# }\n# ]\n
"},{"location":"components-gallery/tasks/magpie/#references","title":"References","text":"Math Shepherd Completer and auto-labeller task.
This task is in charge of, given a list of solutions to an instruction, and a golden solution, as reference, generate completions for the solutions, and label them according to the golden solution using the hard estimation method from figure 2 in the reference paper, Eq. 3. The attributes make the task flexible to be used with different types of dataset and LLMs, and allow making use of different fields to modify the system and user prompts for it. Before modifying them, review the current defaults to ensure the completions are generated correctly.
"},{"location":"components-gallery/tasks/mathshepherdcompleter/#attributes","title":"Attributes","text":"system_prompt: The system prompt to be used in the completions. The default one has been checked and generates good completions using Llama 3.1 with 8B and 70B, but it can be modified to adapt it to the model and dataset selected.
extra_rules: This field can be used to insert extra rules relevant to the type of dataset. For example, in the original paper they used GSM8K and MATH datasets, and this field can be used to insert the rules for the GSM8K dataset.
few_shots: Few shots to help the model generating the completions, write them in the format of the type of solutions wanted for your dataset.
N: Number of completions to generate for each step, correspond to N in the paper. They used 8 in the paper, but it can be adjusted.
tags: List of tags to be used in the completions, the default ones are [\"+\", \"-\"] as in the paper, where the first is used as a positive label, and the second as a negative one. This can be updated, but it MUST be a list with 2 elements, where the first is the positive one, and the second the negative one.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[solutions]\n ICOL2[golden_solution]\n end\n subgraph New columns\n OCOL0[solutions]\n OCOL1[model_name]\n end\n end\n\n subgraph MathShepherdCompleter\n StepInput[Input Columns: instruction, solutions, golden_solution]\n StepOutput[Output Columns: solutions, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/mathshepherdcompleter/#inputs","title":"Inputs","text":"instruction (str
): The task or instruction.
solutions (List[str]
): List of solutions to the task.
golden_solution (str
): The reference solution to the task, will be used to annotate the candidate solutions.
solutions (List[str]
): The same columns that were used as input, the \"solutions\" is modified.
model_name (str
): The name of the model used to generate the revision.
from distilabel.steps.tasks import MathShepherdCompleter\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n)\ntask = MathShepherdCompleter(\n llm=llm,\n N=3,\n use_default_structured_output=True\n)\n\ntask.load()\n\nresult = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"golden_solution\": [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n \"solutions\": [\n [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],\n ]\n },\n ]\n )\n)\n# [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# 'golden_solution': [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"],\n# 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n
"},{"location":"components-gallery/tasks/mathshepherdcompleter/#annotate-your-steps-with-the-math-shepherd-completer","title":"Annotate your steps with the Math Shepherd Completer","text":"from distilabel.steps.tasks import MathShepherdCompleter\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n)\ntask = MathShepherdCompleter(\n llm=llm,\n N=3\n)\n\ntask.load()\n\nresult = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"golden_solution\": [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n \"solutions\": [\n [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],\n ]\n },\n ]\n )\n)\n# [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# 'golden_solution': [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"],\n# 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n
"},{"location":"components-gallery/tasks/mathshepherdcompleter/#references","title":"References","text":"Math Shepherd solution generator.
This task is in charge of generating completions for a given instruction, in the format expected by the Math Shepherd Completer task. The attributes make the task flexible to be used with different types of dataset and LLMs, but we provide examples for the GSM8K and MATH datasets as presented in the original paper. Before modifying them, review the current defaults to ensure the completions are generated correctly. This task can be used to generate the golden solutions for a given problem if not provided, as well as possible solutions to be then labeled by the Math Shepherd Completer. Only one of solutions
or golden_solution
will be generated, depending on the value of M.
system_prompt: The system prompt to be used in the completions. The default one has been checked and generates good completions using Llama 3.1 with 8B and 70B, but it can be modified to adapt it to the model and dataset selected. Take into account that the system prompt includes 2 variables in the Jinja2 template, {{extra_rules}} and {{few_shot}}. These variables are used to include extra rules, for example to steer the model towards a specific type of responses, and few shots to add examples. They can be modified to adapt the system prompt to the dataset and model used without needing to change the full system prompt.
extra_rules: This field can be used to insert extra rules relevant to the type of dataset. For example, in the original paper they used GSM8K and MATH datasets, and this field can be used to insert the rules for the GSM8K dataset.
few_shots: Few shots to help the model generating the completions, write them in the format of the type of solutions wanted for your dataset.
M: Number of completions to generate for each step. By default is set to 1, which will generate the \"golden_solution\". In this case select a stronger model, as it will be used as the source of true during labelling. If M is set to a number greater than 1, the task will generate a list of completions to be labeled by the Math Shepherd Completer task.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n end\n subgraph New columns\n OCOL0[golden_solution]\n OCOL1[solutions]\n OCOL2[model_name]\n end\n end\n\n subgraph MathShepherdGenerator\n StepInput[Input Columns: instruction]\n StepOutput[Output Columns: golden_solution, solutions, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/mathshepherdgenerator/#inputs","title":"Inputs","text":"str
): The task or instruction.golden_solution (str
): The step by step solution to the instruction. It will be generated if M is equal to 1.
solutions (List[List[str]]
): A list of possible solutions to the instruction. It will be generated if M is greater than 1.
model_name (str
): The name of the model used to generate the revision.
from distilabel.steps.tasks import MathShepherdGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n)\ntask = MathShepherdGenerator(\n name=\"golden_solution_generator\",\n llm=llm,\n)\n\ntask.load()\n\nresult = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n },\n ]\n )\n)\n# [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# 'golden_solution': '[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"]'}]]\n
"},{"location":"components-gallery/tasks/mathshepherdgenerator/#generate-m-completions-for-a-given-instruction-using-structured-output-generation","title":"Generate M completions for a given instruction (using structured output generation)","text":"from distilabel.steps.tasks import MathShepherdGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 2048,\n },\n)\ntask = MathShepherdGenerator(\n name=\"solution_generator\",\n llm=llm,\n M=2,\n use_default_structured_output=True,\n)\n\ntask.load()\n\nresult = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n },\n ]\n )\n)\n# [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n
"},{"location":"components-gallery/tasks/mathshepherdgenerator/#references","title":"References","text":"Generate instructions based on a given input using an LLM
.
SelfInstruct
is a pre-defined task that, given a number of instructions, a certain criteria for query generations, an application description, and an input, generates a number of instruction related to the given input and following what is stated in the criteria for query generation and the application description. It is based in the SelfInstruct framework from the paper \"Self-Instruct: Aligning Language Models with Self-Generated Instructions\".
num_instructions: The number of instructions to be generated. Defaults to 5.
criteria_for_query_generation: The criteria for the query generation. Defaults to the criteria defined within the paper.
application_description: The description of the AI application that one want to build with these instructions. Defaults to AI assistant
.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[input]\n end\n subgraph New columns\n OCOL0[instructions]\n OCOL1[model_name]\n end\n end\n\n subgraph SelfInstruct\n StepInput[Input Columns: input]\n StepOutput[Output Columns: instructions, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/selfinstruct/#inputs","title":"Inputs","text":"str
): The input to generate the instructions. It's also called seed in the paper.instructions (List[str]
): The generated instructions.
model_name (str
): The model name used to generate the instructions.
from distilabel.steps.tasks import SelfInstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\nself_instruct = SelfInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_instructions=5, # This is the default value\n)\n\nself_instruct.load()\n\nresult = next(self_instruct.process([{\"input\": \"instruction\"}]))\n# result\n# [\n# {\n# 'input': 'instruction',\n# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n# 'instructions': [\"instruction 1\", \"instruction 2\", \"instruction 3\", \"instruction 4\", \"instruction 5\"],\n# }\n# ]\n
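A complementary sketch (not from the upstream examples) showing how the criteria_for_query_generation and application_description attributes described above could be customized; the concrete strings passed to both attributes are illustrative only:

from distilabel.steps.tasks import SelfInstruct
from distilabel.models import InferenceEndpointsLLM

# Sketch: customizing the attributes described above. The criteria and
# application description strings are illustrative, not canonical values.
self_instruct = SelfInstruct(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_instructions=3,
    application_description="An AI assistant that answers Python programming questions.",
    criteria_for_query_generation=(
        "Queries must be self-contained, answerable from general Python knowledge, "
        "and phrased as a single clear question."
    ),
)

self_instruct.load()

result = next(self_instruct.process([{"input": "list comprehensions"}]))
# Each row is expected to gain an 'instructions' column with 3 generated queries.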
"},{"location":"components-gallery/tasks/textgeneration/","title":"TextGeneration","text":"Text generation with an LLM
given a prompt.
TextGeneration
is a pre-defined task that allows passing a custom prompt using the Jinja2 syntax. By default, an instruction
is expected in the inputs, but using the template
and columns
attributes one can define a custom prompt and the columns expected from the text. This task should be good enough for tasks that don't need post-processing of the responses generated by the LLM.
system_prompt: The system prompt to use in the generation. If not provided, then it will check if the input row has a column named system_prompt
and use it. If not, then no system prompt will be used. Defaults to None
.
template: The template to use for the generation. It must follow the Jinja2 template syntax. If not provided, it will assume the text passed is an instruction and construct the appropriate template.
columns: A string with the column, or a list with columns expected in the template. Take a look at the examples for more information. Defaults to instruction
.
use_system_prompt: DEPRECATED. To be removed in 1.5.0. Whether to use the system prompt in the generation. Defaults to True
, which means that if the column system_prompt
is defined within the input batch, then the system_prompt
will be used, otherwise, it will be ignored.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n subgraph New columns\n OCOL0[generation]\n OCOL1[model_name]\n end\n end\n\n subgraph TextGeneration\n StepInput[Input Columns: dynamic]\n StepOutput[Output Columns: generation, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/textgeneration/#inputs","title":"Inputs","text":"columns
attribute): By default will be set to instruction
. The columns can point both to a str
or a List[str]
to be used in the template.generation (str
): The generated text.
model_name (str
): The name of the model used to generate the text.
from distilabel.steps.tasks import TextGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\ntext_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n )\n)\n\ntext_gen.load()\n\nresult = next(\n text_gen.process(\n [{\"instruction\": \"your instruction\"}]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'your instruction',\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n# 'generation': 'generation',\n# }\n# ]\n
"},{"location":"components-gallery/tasks/textgeneration/#use-a-custom-template-to-generate-text","title":"Use a custom template to generate text","text":"from distilabel.steps.tasks import TextGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nCUSTOM_TEMPLATE = '''Document:\n{{ document }}\n\nQuestion: {{ question }}\n\nPlease provide a clear and concise answer to the question based on the information in the document and your general knowledge:\n'''.rstrip()\n\ntext_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n system_prompt=\"You are a helpful AI assistant. Your task is to answer the following question based on the provided document. If the answer is not explicitly stated in the document, use your knowledge to provide the most relevant and accurate answer possible. If you cannot answer the question based on the given information, state that clearly.\",\n template=CUSTOM_TEMPLATE,\n columns=[\"document\", \"question\"],\n)\n\ntext_gen.load()\n\nresult = next(\n text_gen.process(\n [\n {\n \"document\": \"The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.\",\n \"question\": \"What is the main threat to the Great Barrier Reef mentioned in the document?\"\n }\n ]\n )\n)\n# result\n# [\n# {\n# 'document': 'The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.',\n# 'question': 'What is the main threat to the Great Barrier Reef mentioned in the document?',\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n# 'generation': 'According to the document, the main threat to the Great Barrier Reef is climate change, specifically rising sea temperatures causing coral bleaching events.',\n# }\n# ]\n
"},{"location":"components-gallery/tasks/textgeneration/#few-shot-learning-with-different-system-prompts","title":"Few shot learning with different system prompts","text":"from distilabel.steps.tasks import TextGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nCUSTOM_TEMPLATE = '''Generate a clear, single-sentence instruction based on the following examples:\n\n{% for example in examples %}\nExample {{ loop.index }}:\nInstruction: {{ example }}\n\n{% endfor %}\nNow, generate a new instruction in a similar style:\n'''.rstrip()\n\ntext_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n template=CUSTOM_TEMPLATE,\n columns=\"examples\",\n)\n\ntext_gen.load()\n\nresult = next(\n text_gen.process(\n [\n {\n \"examples\": [\"This is an example\", \"Another relevant example\"],\n \"system_prompt\": \"You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.\"\n }\n ]\n )\n)\n# result\n# [\n# {\n# 'examples': ['This is an example', 'Another relevant example'],\n# 'system_prompt': 'You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.',\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n# 'generation': 'Disable the firewall on the router',\n# }\n# ]\n
"},{"location":"components-gallery/tasks/textgeneration/#references","title":"References","text":"Text generation with images with an LLM
given a prompt.
TextGenerationWithImage
is a pre-defined task that allows passing a custom prompt using the Jinja2 syntax. By default, an instruction
is expected in the inputs, but using the template
and columns
attributes one can define a custom prompt and the columns expected from the text. Additionally, an image
column is expected, containing either a URL, a base64-encoded image, or a PIL image. This task inherits from TextGeneration
, so all the functionality available in that task related to the prompt will be available here too.
system_prompt: The system prompt to use in the generation. If not provided, then no system prompt will be used. Defaults to None
.
template: The template to use for the generation. It must follow the Jinja2 template syntax. If not provided, it will assume the text passed is an instruction and construct the appropriate template.
columns: A string with the column, or a list with columns expected in the template. Take a look at the examples for more information. Defaults to instruction
.
image_type: The type of the image provided; this will be used to preprocess the image if necessary. Must be one of \"url\", \"base64\" or \"PIL\".
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n subgraph New columns\n OCOL0[generation]\n OCOL1[model_name]\n end\n end\n\n subgraph TextGenerationWithImage\n StepInput[Input Columns: dynamic]\n StepOutput[Output Columns: generation, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/textgenerationwithimage/#inputs","title":"Inputs","text":"columns
attribute): By default will be set to instruction
. The columns can point both to a str
or a list[str]
to be used in the template.generation (str
): The generated text.
model_name (str
): The name of the model used to generate the text.
from distilabel.steps.tasks import TextGenerationWithImage\nfrom distilabel.models.llms import InferenceEndpointsLLM\n\nvision = TextGenerationWithImage(\n name=\"vision_gen\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n ),\n image_type=\"url\"\n)\n\nvision.load()\n\nresult = next(\n vision.process(\n [\n {\n \"instruction\": \"What\u2019s in this image?\",\n \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n }\n ]\n )\n)\n# result\n# [\n# {\n# \"instruction\": \"What\u2019s in this image?\",\n# \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\",\n# \"generation\": \"Based on the visual cues in the image...\",\n# \"model_name\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\"\n# ... # distilabel_metadata would be here\n# }\n# ]\n# result[0][\"generation\"]\n# \"Based on the visual cues in the image, here are some possible story points:\n\n* The image features a wooden boardwalk leading through a lush grass field, possibly in a park or nature reserve.\n\nAnalysis and Ideas:\n* The abundance of green grass and trees suggests a healthy ecosystem or habitat.\n* The presence of wildlife, such as birds or deer, is possible based on the surroundings.\n* A footbridge or a pathway might be a common feature in this area, providing access to nearby attractions or points of interest.\n\nAdditional Questions to Ask:\n* Why is a footbridge present in this area?\n* What kind of wildlife inhabits this region\"\n
"},{"location":"components-gallery/tasks/textgenerationwithimage/#answer-questions-from-an-image-stored-as-base64","title":"Answer questions from an image stored as base64","text":"# For this example we will assume that we have the string representation of the image\n# stored, but will just take the image and transform it to base64 to ilustrate the example.\nimport requests\nimport base64\n\nimage_url =\"https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg\"\nimg = requests.get(image_url).content\nbase64_image = base64.b64encode(img).decode(\"utf-8\")\n\nfrom distilabel.steps.tasks import TextGenerationWithImage\nfrom distilabel.models.llms import InferenceEndpointsLLM\n\nvision = TextGenerationWithImage(\n name=\"vision_gen\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n ),\n image_type=\"base64\"\n)\n\nvision.load()\n\nresult = next(\n vision.process(\n [\n {\n \"instruction\": \"What\u2019s in this image?\",\n \"image\": base64_image\n }\n ]\n )\n)\n
"},{"location":"components-gallery/tasks/textgenerationwithimage/#references","title":"References","text":"Jinja2 Template Designer Documentation
Image-Text-to-Text
OpenAI Vision
Generates a response using a non-instruct fine-tuned model.
URIAL
is a pre-defined task that generates a response using a non-instruct fine-tuned model. This task is used to generate a response based on the conversation provided as input.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[conversation]\n end\n subgraph New columns\n OCOL0[generation]\n OCOL1[model_name]\n end\n end\n\n subgraph URIAL\n StepInput[Input Columns: instruction, conversation]\n StepOutput[Output Columns: generation, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/urial/#inputs","title":"Inputs","text":"instruction (str
, optional): The instruction to generate a response from.
conversation (List[Dict[str, str]]
, optional): The conversation to generate a response from (the last message must be from the user).
generation (str
): The generated response.
model_name (str
): The name of the model used to generate the response.
from distilabel.models import vLLM\nfrom distilabel.steps.tasks import URIAL\n\nstep = URIAL(\n llm=vLLM(\n model=\"meta-llama/Meta-Llama-3.1-8B\",\n generation_kwargs={\"temperature\": 0.7},\n ),\n)\n\nstep.load()\n\nresults = next(\n step.process(inputs=[{\"instruction\": \"What's the most most common type of cloud?\"}])\n)\n# [\n# {\n# 'instruction': \"What's the most most common type of cloud?\",\n# 'generation': 'Clouds are classified into three main types, high, middle, and low. The most common type of cloud is the middle cloud.',\n# 'distilabel_metadata': {...},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-8B'\n# }\n# ]\n
"},{"location":"components-gallery/tasks/urial/#references","title":"References","text":"Generator task the generates instructions or conversations using Magpie.
Magpie is a neat method that allows generating user instructions with no seed data or specific system prompt, thanks to the autoregressive capabilities of instruct fine-tuned LLMs. As they were fine-tuned using a chat template composed of a user message and a desired assistant output, the instruct fine-tuned LLM learns that an instruction follows the pre-query or pre-instruct tokens. If these pre-query tokens are sent to the LLM without any user message, the LLM will continue generating tokens as if it were the user. This trick allows \"extracting\" instructions from the instruct fine-tuned LLM. Once this instruction is generated, it can be sent back to the LLM to generate an assistant response. This process can be repeated N times, allowing the construction of a multi-turn conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing'.
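To make the trick more concrete, here is a minimal, illustrative sketch (not the distilabel implementation), assuming a Llama-3-style chat template and a generic llm.generate helper that returns a completion string:

# Illustrative sketch only: the prompt ends right after the pre-query tokens,
# so the model "continues" the conversation by writing the user instruction itself.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"

# 1) The model plays the user and produces an instruction.
# instruction = llm.generate(pre_query)

# 2) The instruction is wrapped with the assistant pre-query tokens and sent back,
#    so the model now plays the assistant and produces a response.
# prompt = (
#     pre_query
#     + instruction
#     + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
# )
# response = llm.generate(prompt)
# Repeating steps 1-2 builds a multi-turn conversation.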
"},{"location":"components-gallery/tasks/magpiegenerator/#attributes","title":"Attributes","text":"n_turns: the number of turns that the generated conversation will have. Defaults to 1
.
end_with_user: whether the conversation should end with a user message. Defaults to False
.
include_system_prompt: whether to include the system prompt used in the generated conversation. Defaults to False
.
only_instruction: whether to generate only the instruction. If this argument is True
, then n_turns
will be ignored. Defaults to False
.
system_prompt: an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be chosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions on a certain topic. Defaults to None
.
num_rows: the number of rows to be generated.
n_turns: the number of turns that the generated conversation will have. Defaults to 1
.
end_with_user: whether the conversation should end with a user message. Defaults to False
.
include_system_prompt: whether to include the system prompt used in the generated conversation. Defaults to False
.
only_instruction: whether to generate only the instruction. If this argument is True
, then n_turns
will be ignored. Defaults to False
.
system_prompt: an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be chosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions on a certain topic.
num_rows: the number of rows to be generated.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[conversation]\n OCOL1[instruction]\n OCOL2[response]\n OCOL3[system_prompt_key]\n OCOL4[model_name]\n end\n end\n\n subgraph MagpieGenerator\n StepOutput[Output Columns: conversation, instruction, response, system_prompt_key, model_name]\n end\n\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n StepOutput --> OCOL4\n
"},{"location":"components-gallery/tasks/magpiegenerator/#outputs","title":"Outputs","text":"conversation (ChatType
): the generated conversation which is a list of chat items with a role and a message.
instruction (str
): the generated instruction if only_instruction=True
.
response (str
): the generated response if n_turns==1
.
system_prompt_key (str
, optional): the key of the system prompt used to generate the conversation or instruction. Only if system_prompt
is a dictionary.
model_name (str
): The model name used to generate the conversation
or instruction
.
from distilabel.models import TransformersLLM\nfrom distilabel.steps.tasks import MagpieGenerator\n\ngenerator = MagpieGenerator(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 256,\n },\n device=\"mps\",\n ),\n only_instruction=True,\n num_rows=5,\n)\n\ngenerator.load()\n\nresult = next(generator.process())\n# (\n# [\n# {\"instruction\": \"I've just bought a new phone and I're excited to start using it.\"},\n# {\"instruction\": \"What are the most common types of companies that use digital signage?\"}\n# ],\n# True\n# )\n
"},{"location":"components-gallery/tasks/magpiegenerator/#generating-a-conversation-with-llama-3-8b-instruct-and-transformersllm","title":"Generating a conversation with Llama 3 8B Instruct and TransformersLLM","text":"from distilabel.models import TransformersLLM\nfrom distilabel.steps.tasks import MagpieGenerator\n\ngenerator = MagpieGenerator(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 64,\n },\n device=\"mps\",\n ),\n n_turns=3,\n num_rows=5,\n)\n\ngenerator.load()\n\nresult = next(generator.process())\n# (\n# [\n# {\n# 'conversation': [\n# {\n# 'role': 'system',\n# 'content': 'You are a helpful Al assistant. The user will engage in a multi\u2212round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and\n# insightful responses to help the user with their queries.'\n# },\n# {'role': 'user', 'content': \"I'm considering starting a social media campaign for my small business and I're not sure where to start. Can you help?\"},\n# {\n# 'role': 'assistant',\n# 'content': \"Exciting endeavor! Creating a social media campaign can be a great way to increase brand awareness, drive website traffic, and ultimately boost sales. I'd be happy to guide you through the process. To get started,\n# let's break down the basics. First, we need to identify your goals and target audience. What do\"\n# },\n# {\n# 'role': 'user',\n# 'content': \"Before I start a social media campaign, what kind of costs ammol should I expect to pay? There are several factors that contribute to the total cost of running a social media campaign. Let me outline some of the main\n# expenses you might encounter: 1. Time: As the business owner, you'll likely spend time creating\"\n# },\n# {\n# 'role': 'assistant',\n# 'content': 'Time is indeed one of the biggest investments when it comes to running a social media campaign! Besides time, you may also incur costs associated with: 2. Content creation: You might need to hire freelancers or\n# agencies to create high-quality content (images, videos, captions) for your social media platforms. 3. Advertising'\n# }\n# ]\n# },\n# {\n# 'conversation': [\n# {\n# 'role': 'system',\n# 'content': 'You are a helpful Al assistant. The user will engage in a multi\u2212round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and\n# insightful responses to help the user with their queries.'\n# },\n# {'role': 'user', 'content': \"I am thinking of buying a new laptop or computer. What are some important factors I should consider when making your decision? I'll make sure to let you know if any other favorites or needs come up!\"},\n# {\n# 'role': 'assistant',\n# 'content': 'Exciting times ahead! When considering a new laptop or computer, there are several key factors to think about to ensure you find the right one for your needs. Here are some crucial ones to get you started: 1.\n# **Purpose**: How will you use your laptop or computer? For work, gaming, video editing,'\n# },\n# {\n# 'role': 'user',\n# 'content': 'Let me stop you there. Let's explore this \"purpose\" factor that you mentioned earlier. Can you elaborate more on what type of devices would be suitable for different purposes? 
For example, if I're primarily using my\n# laptop for general usage like browsing, email, and word processing, would a budget-friendly laptop be sufficient'\n# },\n# {\n# 'role': 'assistant',\n# 'content': \"Understanding your purpose can greatly impact the type of device you'll need. **General Usage (Browsing, Email, Word Processing)**: For casual users who mainly use their laptop for daily tasks, a budget-friendly\n# option can be sufficient. Look for laptops with: * Intel Core i3 or i5 processor* \"\n# }\n# ]\n# }\n# ],\n# True\n# )\n
"},{"location":"components-gallery/tasks/magpiegenerator/#generating-with-system-prompts-with-probabilities","title":"Generating with system prompts with probabilities","text":"from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import MagpieGenerator\n\nmagpie = MagpieGenerator(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 0.8,\n \"max_new_tokens\": 256,\n },\n ),\n n_turns=2,\n system_prompt={\n \"math\": (\"You're an expert AI assistant.\", 0.8),\n \"writing\": (\"You're an expert writing assistant.\", 0.2),\n },\n)\n\nmagpie.load()\n\nresult = next(magpie.process())\n
"},{"location":"components-gallery/tasks/magpiegenerator/#references","title":"References","text":"Generates text based on a conversation.
ChatGeneration
is a pre-defined task that defines the messages
as the input and generation
as the output. This task is used to generate text based on a conversation. The model_name
is also returned as part of the output in order to enhance it.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[messages]\n end\n subgraph New columns\n OCOL0[generation]\n OCOL1[model_name]\n end\n end\n\n subgraph ChatGeneration\n StepInput[Input Columns: messages]\n StepOutput[Output Columns: generation, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/chatgeneration/#inputs","title":"Inputs","text":"List[Dict[Literal[\"role\", \"content\"], str]]
): The messages to generate the follow up completion from.generation (str
): The generated text from the assistant.
model_name (str
): The model name used to generate the text.
from distilabel.steps.tasks import ChatGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nchat = ChatGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n)\n\nchat.load()\n\nresult = next(\n chat.process(\n [\n {\n \"messages\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n ]\n }\n ]\n )\n)\n# result\n# [\n# {\n# 'messages': [{'role': 'user', 'content': 'How much is 2+2?'}],\n# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n# 'generation': '4',\n# }\n# ]\n
"},{"location":"components-gallery/tasks/argillalabeller/","title":"ArgillaLabeller","text":"Annotate Argilla records based on input fields, example records and question settings.
This task is designed to facilitate the annotation of Argilla records by leveraging a pre-trained LLM. It uses a system prompt that guides the LLM to understand the input fields, the question type, and the question settings. The task then formats the input data and generates a response based on the question. The response is validated against the question's value model, and the final suggestion is prepared for annotation.
"},{"location":"components-gallery/tasks/argillalabeller/#attributes","title":"Attributes","text":"graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[record]\n ICOL1[fields]\n ICOL2[question]\n ICOL3[example_records]\n ICOL4[guidelines]\n end\n subgraph New columns\n OCOL0[suggestion]\n end\n end\n\n subgraph ArgillaLabeller\n StepInput[Input Columns: record, fields, question, example_records, guidelines]\n StepOutput[Output Columns: suggestion]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n ICOL3 --> StepInput\n ICOL4 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/argillalabeller/#inputs","title":"Inputs","text":"record (argilla.Record
): The record to be annotated.
fields (Optional[List[Dict[str, Any]]]
): The list of field settings for the input fields.
question (Optional[Dict[str, Any]]
): The question settings for the question to be answered.
example_records (Optional[List[Dict[str, Any]]]
): The few shot example records with responses to be used to answer the question.
guidelines (Optional[str]
): The guidelines for the annotation task.
Dict[str, Any]
): The final suggestion for annotation.import argilla as rg\nfrom argilla import Suggestion\nfrom distilabel.steps.tasks import ArgillaLabeller\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Get information from Argilla dataset definition\ndataset = rg.Dataset(\"my_dataset\")\npending_records_filter = rg.Filter((\"status\", \"==\", \"pending\"))\ncompleted_records_filter = rg.Filter((\"status\", \"==\", \"completed\"))\npending_records = list(\n dataset.records(\n query=rg.Query(filter=pending_records_filter),\n limit=5,\n )\n)\nexample_records = list(\n dataset.records(\n query=rg.Query(filter=completed_records_filter),\n limit=5,\n )\n)\nfield = dataset.settings.fields[\"text\"]\nquestion = dataset.settings.questions[\"label\"]\n\n# Initialize the labeller with the model and fields\nlabeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n fields=[field],\n question=question,\n example_records=example_records,\n guidelines=dataset.guidelines\n)\nlabeller.load()\n\n# Process the pending records\nresult = next(\n labeller.process(\n [\n {\n \"record\": record\n } for record in pending_records\n ]\n )\n)\n\n# Add the suggestions to the records\nfor record, suggestion in zip(pending_records, result):\n record.suggestions.add(Suggestion(**suggestion[\"suggestion\"]))\n\n# Log the updated records\ndataset.records.log(pending_records)\n
"},{"location":"components-gallery/tasks/argillalabeller/#annotate-a-record-with-alternating-datasets-and-questions","title":"Annotate a record with alternating datasets and questions","text":"import argilla as rg\nfrom distilabel.steps.tasks import ArgillaLabeller\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Get information from Argilla dataset definition\ndataset = rg.Dataset(\"my_dataset\")\nfield = dataset.settings.fields[\"text\"]\nquestion = dataset.settings.questions[\"label\"]\nquestion2 = dataset.settings.questions[\"label2\"]\n\n# Initialize the labeller with the model and fields\nlabeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n)\nlabeller.load()\n\n# Process the record\nrecord = next(dataset.records())\nresult = next(\n labeller.process(\n [\n {\n \"record\": record,\n \"fields\": [field],\n \"question\": question,\n },\n {\n \"record\": record,\n \"fields\": [field],\n \"question\": question2,\n }\n ]\n )\n)\n\n# Add the suggestions to the record\nfor suggestion in result:\n record.suggestions.add(rg.Suggestion(**suggestion[\"suggestion\"]))\n\n# Log the updated record\ndataset.records.log([record])\n
"},{"location":"components-gallery/tasks/argillalabeller/#overwrite-default-prompts-and-instructions","title":"Overwrite default prompts and instructions","text":"import argilla as rg\nfrom distilabel.steps.tasks import ArgillaLabeller\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Overwrite default prompts and instructions\nlabeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n system_prompt=\"You are an expert annotator and labelling assistant that understands complex domains and natural language processing.\",\n question_to_label_instruction={\n \"label_selection\": \"Select the appropriate label from the list of provided labels.\",\n \"multi_label_selection\": \"Select none, one or multiple labels from the list of provided labels.\",\n \"text\": \"Provide a text response to the question.\",\n \"rating\": \"Provide a rating for the question.\",\n },\n)\nlabeller.load()\n
"},{"location":"components-gallery/tasks/argillalabeller/#references","title":"References","text":"Classifies text into one or more categories or labels.
This task can be used for text classification problems, where the goal is to assign one or multiple labels to a given text. It uses structured generation as per the reference paper by default, it can help to generate more concise labels. See section 4.1 in the reference.
"},{"location":"components-gallery/tasks/textclassification/#attributes","title":"Attributes","text":"system_prompt: A prompt to display to the user before the task starts. Contains a default message to make the model behave like a classifier specialist.
n: Number of labels to generate If only 1 is required, corresponds to a label classification problem, if >1 it will intend return the \"n\" labels most representative for the text. Defaults to 1.
context: Context to use when generating the labels. By default contains a generic message, but can be used to customize the context for the task.
examples: List of examples to help the model understand the task, few shots.
available_labels: List of available labels to choose from when classifying the text, or a dictionary with the labels and their descriptions.
default_label: Default label to use when the text is ambiguous or lacks sufficient information for classification. Can be a list in case of multiple labels (n>1).
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[text]\n end\n subgraph New columns\n OCOL0[labels]\n OCOL1[model_name]\n end\n end\n\n subgraph TextClassification\n StepInput[Input Columns: text]\n StepOutput[Output Columns: labels, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/textclassification/#inputs","title":"Inputs","text":"str
): The reference text we want to obtain labels for.labels (Union[str, List[str]]
): The label or list of labels for the text.
model_name (str
): The name of the model used to generate the label/s.
from distilabel.steps.tasks import TextClassification\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n)\n\ntext_classification = TextClassification(\n llm=llm,\n context=\"You are an AI system specialized in assigning sentiment to movies.\",\n available_labels=[\"positive\", \"negative\"],\n)\n\ntext_classification.load()\n\nresult = next(\n text_classification.process(\n [{\"text\": \"This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.\"}]\n )\n)\n# result\n# [{'text': 'This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.',\n# 'labels': 'positive',\n# 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": \"positive\"\\n}',\n# 'raw_input_text_classification_0': [{'role': 'system',\n# 'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n# {'role': 'user',\n# 'content': '# Instruction\\nPlease classify the user query by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide the label that best describes the text.\\nYou are an AI system specialized in assigning sentiment to movie the user queries.\\n## Labeling the user input\\nUse the available labels to classify the user query. Analyze the context of each label specifically:\\navailable_labels = [\\n \"positive\", # The text shows positive sentiment\\n \"negative\", # The text shows negative sentiment\\n]\\n\\n\\n## User Query\\n```\\nThis was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": \"label\"\\n}\\n```'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/textclassification/#assigning-predefined-labels-with-specified-descriptions","title":"Assigning predefined labels with specified descriptions","text":"from distilabel.steps.tasks import TextClassification\n\ntext_classification = TextClassification(\n llm=llm,\n n=1,\n context=\"Determine the intent of the text.\",\n available_labels={\n \"complaint\": \"A statement expressing dissatisfaction or annoyance about a product, service, or experience. It's a negative expression of discontent, often with the intention of seeking a resolution or compensation.\",\n \"inquiry\": \"A question or request for information about a product, service, or situation. It's a neutral or curious expression seeking clarification or details.\",\n \"feedback\": \"A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.\",\n \"praise\": \"A statement expressing admiration, approval, or appreciation for a product, service, or experience. It's a positive expression of satisfaction or delight, often with the intention of encouraging or recommending.\"\n },\n query_title=\"Customer Query\",\n)\n\ntext_classification.load()\n\nresult = next(\n text_classification.process(\n [{\"text\": \"Can you tell me more about your return policy?\"}]\n )\n)\n# result\n# [{'text': 'Can you tell me more about your return policy?',\n# 'labels': 'inquiry',\n# 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": \"inquiry\"\\n}',\n# 'raw_input_text_classification_0': [{'role': 'system',\n# 'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n# {'role': 'user',\n# 'content': '# Instruction\\nPlease classify the customer query by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide the label that best describes the text.\\nDetermine the intent of the text.\\n## Labeling the user input\\nUse the available labels to classify the user query. Analyze the context of each label specifically:\\navailable_labels = [\\n \"complaint\", # A statement expressing dissatisfaction or annoyance about a product, service, or experience. It\\'s a negative expression of discontent, often with the intention of seeking a resolution or compensation.\\n \"inquiry\", # A question or request for information about a product, service, or situation. It\\'s a neutral or curious expression seeking clarification or details.\\n \"feedback\", # A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.\\n \"praise\", # A statement expressing admiration, approval, or appreciation for a product, service, or experience. It\\'s a positive expression of satisfaction or delight, often with the intention of encouraging or recommending.\\n]\\n\\n\\n## Customer Query\\n```\\nCan you tell me more about your return policy?\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": \"label\"\\n}\\n```'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/textclassification/#free-multi-label-classification-without-predefined-labels","title":"Free multi label classification without predefined labels","text":"from distilabel.steps.tasks import TextClassification\n\ntext_classification = TextClassification(\n llm=llm,\n n=3,\n context=(\n \"Describe the main themes, topics, or categories that could describe the \"\n \"following type of persona.\"\n ),\n query_title=\"Example of Persona\",\n)\n\ntext_classification.load()\n\nresult = next(\n text_classification.process(\n [{\"text\": \"A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.\"}]\n )\n)\n# result\n# [{'text': 'A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.',\n# 'labels': ['Historical Researcher',\n# 'Cultural Specialist',\n# 'Ethnic Studies Expert'],\n# 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": [\"Historical Researcher\", \"Cultural Specialist\", \"Ethnic Studies Expert\"]\\n}',\n# 'raw_input_text_classification_0': [{'role': 'system',\n# 'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n# {'role': 'user',\n# 'content': '# Instruction\\nPlease classify the example of persona by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide a list of 3 labels that best describe the text.\\nDescribe the main themes, topics, or categories that could describe the following type of persona.\\nUse clear, widely understood terms for labels.Avoid overly specific or obscure labels unless the text demands it.\\n\\n\\n## Example of Persona\\n```\\nA historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": [\"label_0\", \"label_1\", \"label_2\"]\\n}\\n```'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/textclassification/#references","title":"References","text":"Evolve instructions using an LLM
.
WizardLM: Empowering Large Language Models to Follow Complex Instructions
"},{"location":"components-gallery/tasks/evolinstruct/#attributes","title":"Attributes","text":"num_evolutions: The number of evolutions to be performed.
store_evolutions: Whether to store all the evolutions or just the last one. Defaults to False
.
generate_answers: Whether to generate answers for the evolved instructions. Defaults to False
.
include_original_instruction: Whether to include the original instruction in the evolved_instructions
output column. Defaults to False
.
mutation_templates: The mutation templates to be used for evolving the instructions. Defaults to the ones provided in the utils.py
file.
seed: The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
numpy
in order to randomly pick a mutation method.graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n end\n subgraph New columns\n OCOL0[evolved_instruction]\n OCOL1[evolved_instructions]\n OCOL2[model_name]\n OCOL3[answer]\n OCOL4[answers]\n end\n end\n\n subgraph EvolInstruct\n StepInput[Input Columns: instruction]\n StepOutput[Output Columns: evolved_instruction, evolved_instructions, model_name, answer, answers]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n StepOutput --> OCOL4\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/evolinstruct/#inputs","title":"Inputs","text":"str
): The instruction to evolve.evolved_instruction (str
): The evolved instruction if store_evolutions=False
.
evolved_instructions (List[str]
): The evolved instructions if store_evolutions=True
.
model_name (str
): The name of the LLM used to evolve the instructions.
answer (str
): The answer to the evolved instruction if generate_answers=True
and store_evolutions=False
.
answers (List[str]
): The answers to the evolved instructions if generate_answers=True
and store_evolutions=True
.
from distilabel.steps.tasks import EvolInstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n)\n\nevol_instruct.load()\n\nresult = next(evol_instruct.process([{\"instruction\": \"common instruction\"}]))\n# result\n# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]\n
"},{"location":"components-gallery/tasks/evolinstruct/#keep-the-iterations-of-the-evolutions","title":"Keep the iterations of the evolutions","text":"from distilabel.steps.tasks import EvolInstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n store_evolutions=True,\n)\n\nevol_instruct.load()\n\nresult = next(evol_instruct.process([{\"instruction\": \"common instruction\"}]))\n# result\n# [\n# {\n# 'instruction': 'common instruction',\n# 'evolved_instructions': ['initial evolution', 'final evolution'],\n# 'model_name': 'model_name'\n# }\n# ]\n
"},{"location":"components-gallery/tasks/evolinstruct/#generate-answers-for-the-instructions-in-a-single-step","title":"Generate answers for the instructions in a single step","text":"from distilabel.steps.tasks import EvolInstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n generate_answers=True,\n)\n\nevol_instruct.load()\n\nresult = next(evol_instruct.process([{\"instruction\": \"common instruction\"}]))\n# result\n# [\n# {\n# 'instruction': 'common instruction',\n# 'evolved_instruction': 'evolved instruction',\n# 'answer': 'answer to the instruction',\n# 'model_name': 'model_name'\n# }\n# ]\n
"},{"location":"components-gallery/tasks/evolinstruct/#references","title":"References","text":"WizardLM: Empowering Large Language Models to Follow Complex Instructions
GitHub: h2oai/h2o-wizardlm
Evolve instructions to make them more complex using an LLM
.
EvolComplexity
is a task that evolves instructions to make them more complex, and it is based in the EvolInstruct task, using slight different prompts, but the exact same evolutionary approach.
num_instructions: The number of instructions to be generated.
generate_answers: Whether to generate answers for the instructions or not. Defaults to False
.
mutation_templates: The mutation templates to be used for the generation of the instructions.
min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512
.
max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024
.
seed: The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
seed: The number of evolutions to be run.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n end\n subgraph New columns\n OCOL0[evolved_instruction]\n OCOL1[answer]\n OCOL2[model_name]\n end\n end\n\n subgraph EvolComplexity\n StepInput[Input Columns: instruction]\n StepOutput[Output Columns: evolved_instruction, answer, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/evolcomplexity/#inputs","title":"Inputs","text":"str
): The instruction to evolve.evolved_instruction (str
): The evolved instruction.
answer (str
, optional): The answer to the instruction if generate_answers=True
.
model_name (str
): The name of the LLM used to evolve the instructions.
from distilabel.steps.tasks import EvolComplexity\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_complexity = EvolComplexity(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n)\n\nevol_complexity.load()\n\nresult = next(evol_complexity.process([{\"instruction\": \"common instruction\"}]))\n# result\n# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]\n
"},{"location":"components-gallery/tasks/evolcomplexity/#references","title":"References","text":"What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Evolve the quality of the responses using an LLM
.
EvolQuality
task is used to evolve the quality of the responses given a prompt, by generating a new response with a language model. This step implements the evolution quality task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
num_evolutions: The number of evolutions to be performed on the responses.
store_evolutions: Whether to store all the evolved responses or just the last one. Defaults to False
.
include_original_response: Whether to include the original response within the evolved responses. Defaults to False
.
mutation_templates: The mutation templates to be used to evolve the responses.
seed: The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
numpy
in order to randomly pick a mutation method.graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[response]\n end\n subgraph New columns\n OCOL0[evolved_response]\n OCOL1[evolved_responses]\n OCOL2[model_name]\n end\n end\n\n subgraph EvolQuality\n StepInput[Input Columns: instruction, response]\n StepOutput[Output Columns: evolved_response, evolved_responses, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/evolquality/#inputs","title":"Inputs","text":"instruction (str
): The instruction that was used to generate the responses
.
response (str
): The responses to be rewritten.
evolved_response (str
): The evolved response if store_evolutions=False
.
evolved_responses (List[str]
): The evolved responses if store_evolutions=True
.
model_name (str
): The name of the LLM used to evolve the responses.
from distilabel.steps.tasks import EvolQuality\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_quality = EvolQuality(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n)\n\nevol_quality.load()\n\nresult = next(\n evol_quality.process(\n [\n {\"instruction\": \"common instruction\", \"response\": \"a response\"},\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'common instruction',\n# 'response': 'a response',\n# 'evolved_response': 'evolved response',\n# 'model_name': '\"mistralai/Mistral-7B-Instruct-v0.2\"'\n# }\n# ]\n
"},{"location":"components-gallery/tasks/evolquality/#references","title":"References","text":"Generate evolved instructions using an LLM
.
WizardLM: Empowering Large Language Models to Follow Complex Instructions
"},{"location":"components-gallery/tasks/evolinstructgenerator/#attributes","title":"Attributes","text":"num_instructions: The number of instructions to be generated.
generate_answers: Whether to generate answers for the instructions or not. Defaults to False
.
mutation_templates: The mutation templates to be used for the generation of the instructions.
min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512
.
max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024
.
seed: The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
seed: The seed to be set for numpy
in order to randomly pick a mutation method.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[instruction]\n OCOL1[answer]\n OCOL2[instructions]\n OCOL3[model_name]\n end\n end\n\n subgraph EvolInstructGenerator\n StepOutput[Output Columns: instruction, answer, instructions, model_name]\n end\n\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n
"},{"location":"components-gallery/tasks/evolinstructgenerator/#outputs","title":"Outputs","text":"instruction (str
): The generated instruction if generate_answers=False
.
answer (str
): The generated answer if generate_answers=True
.
instructions (List[str]
): The generated instructions if generate_answers=True
.
model_name (str
): The name of the LLM used to generate and evolve the instructions.
from distilabel.steps.tasks import EvolInstructGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_instruct_generator = EvolInstructGenerator(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_instructions=2,\n)\n\nevol_instruct_generator.load()\n\nresult = next(scorer.process())\n# result\n# [{'instruction': 'generated instruction', 'model_name': 'test'}]\n
"},{"location":"components-gallery/tasks/evolinstructgenerator/#references","title":"References","text":"WizardLM: Empowering Large Language Models to Follow Complex Instructions
GitHub: h2oai/h2o-wizardlm
Generate evolved instructions with increased complexity using an LLM
.
EvolComplexityGenerator
is a generation task that evolves instructions to make them more complex, and it is based in the EvolInstruct task, but using slight different prompts, but the exact same evolutionary approach.
num_instructions: The number of instructions to be generated.
generate_answers: Whether to generate answers for the instructions or not. Defaults to False
.
mutation_templates: The mutation templates to be used for the generation of the instructions.
min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512
.
max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024
.
seed: The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
seed: The number of evolutions to be run.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[instruction]\n OCOL1[answer]\n OCOL2[model_name]\n end\n end\n\n subgraph EvolComplexityGenerator\n StepOutput[Output Columns: instruction, answer, model_name]\n end\n\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n
"},{"location":"components-gallery/tasks/evolcomplexitygenerator/#outputs","title":"Outputs","text":"instruction (str
): The evolved instruction.
answer (str
, optional): The answer to the instruction if generate_answers=True
.
model_name (str
): The name of the LLM used to evolve the instructions.
from distilabel.steps.tasks import EvolComplexityGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_complexity_generator = EvolComplexityGenerator(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_instructions=2,\n)\n\nevol_complexity_generator.load()\n\nresult = next(scorer.process())\n# result\n# [{'instruction': 'generated instruction', 'model_name': 'test'}]\n
"},{"location":"components-gallery/tasks/evolcomplexitygenerator/#references","title":"References","text":"What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Self-Alignment with Instruction Backtranslation.
"},{"location":"components-gallery/tasks/instructionbacktranslation/#attributes","title":"Attributes","text":"graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[generation]\n end\n subgraph New columns\n OCOL0[score]\n OCOL1[reason]\n OCOL2[model_name]\n end\n end\n\n subgraph InstructionBacktranslation\n StepInput[Input Columns: instruction, generation]\n StepOutput[Output Columns: score, reason, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/instructionbacktranslation/#inputs","title":"Inputs","text":"instruction (str
): The reference instruction to evaluate the text output.
generation (str
): The text output to evaluate for the given instruction.
score (str
): The score for the generation based on the given instruction.
reason (str
): The reason for the provided score.
model_name (str
): The model name used to score the generation.
from distilabel.steps.tasks import InstructionBacktranslation\n\ninstruction_backtranslation = InstructionBacktranslation(\n name=\"instruction_backtranslation\",\n llm=llm,\n input_batch_size=10,\n output_mappings={\"model_name\": \"scoring_model\"},\n )\ninstruction_backtranslation.load()\n\nresult = next(\n instruction_backtranslation.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generation\": \"4\",\n }\n ]\n )\n)\n# result\n# [\n# {\n# \"instruction\": \"How much is 2+2?\",\n# \"generation\": \"4\",\n# \"score\": 3,\n# \"reason\": \"Reason for the generation.\",\n# \"model_name\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n# }\n# ]\n
"},{"location":"components-gallery/tasks/instructionbacktranslation/#references","title":"References","text":"Critique and rank the quality of generations from an LLM
using Prometheus 2.0.
PrometheusEval
is a task created for Prometheus 2.0, covering both the absolute and relative evaluations. The absolute evaluation i.e. mode=\"absolute\"
is used to evaluate a single generation from an LLM for a given instruction. The relative evaluation i.e. mode=\"relative\"
is used to evaluate two generations from an LLM for a given instruction. Both evaluations optionally allow providing a reference answer to compare against, controlled via the reference
attribute, and both are based on a score rubric that critiques the generation/s based on the following default aspects: helpfulness
, harmlessness
, honesty
, factual-validity
, and reasoning
, that can be overridden via rubrics
, and the selected rubric is set via the attribute rubric
.
The PrometheusEval
task is intended to be used with any of the Prometheus 2.0 models released by Kaist AI, namely https://huggingface.co/prometheus-eval/prometheus-7b-v2.0 and https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0. The formatting and quality of the critique assessment are not guaranteed with other models, although some may still follow the formatting correctly and generate insightful critiques.
mode: the evaluation mode to use, either absolute
or relative
. It defines whether the task will evaluate one or two generations.
rubric: the score rubric to use within the prompt to run the critique based on different aspects. Can be any existing key in the rubrics
attribute, which by default means that it can be: helpfulness
, harmlessness
, honesty
, factual-validity
, or reasoning
. Those will only work if using the default rubrics
, otherwise, the provided rubrics
should be used.
rubrics: a dictionary containing the different rubrics to use for the critique, where the keys are the rubric names and the values are the rubric descriptions. The default rubrics are the following: helpfulness
, harmlessness
, honesty
, factual-validity
, and reasoning
.
reference: a boolean flag to indicate whether a reference answer / completion will be provided, so that the model critique is based on the comparison with it. It implies that the column reference
needs to be provided within the input data in addition to the rest of the inputs.
_template: a Jinja2 template used to format the input for the LLM.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[generation]\n ICOL2[generations]\n ICOL3[reference]\n end\n subgraph New columns\n OCOL0[feedback]\n OCOL1[result]\n OCOL2[model_name]\n end\n end\n\n subgraph PrometheusEval\n StepInput[Input Columns: instruction, generation, generations, reference]\n StepOutput[Output Columns: feedback, result, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n ICOL3 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/prometheuseval/#inputs","title":"Inputs","text":"instruction (str
): The instruction to use as reference.
generation (str
, optional): The generated text from the given instruction
. This column is required if mode=absolute
.
generations (List[str]
, optional): The generated texts from the given instruction
. It should contain 2 generations only. This column is required if mode=relative
.
reference (str
, optional): The reference / golden answer for the instruction
, to be used by the LLM for comparison against.
feedback (str
): The feedback explaining the result below, as critiqued by the LLM using the pre-defined score rubric, compared against reference
if provided.
result (Union[int, Literal[\"A\", \"B\"]]
): If mode=absolute
, then the result contains the score for the generation
in a likert-scale from 1-5, otherwise, if mode=relative
, then the result contains either \"A\" or \"B\", the \"winning\" one being the generation at index 0 of generations
if result='A'
or at index 1 if result='B'
 (see the snippet after this list).
model_name (str
): The model name used to generate the feedback
and result
.
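As a minimal illustration (the row below is made up), the relative-mode result can be mapped back to the winning generation like this:
# Hypothetical output row from a relative-mode evaluation.\nrow = {\"generations\": [\"something done\", \"other thing\"], \"result\": \"A\"}\n\n# \"A\" points to the generation at index 0, \"B\" to the one at index 1.\nwinning_generation = row[\"generations\"][0] if row[\"result\"] == \"A\" else row[\"generations\"][1]\n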
from distilabel.steps.tasks import PrometheusEval\nfrom distilabel.models import vLLM\n\n# Consider this as a placeholder for your actual LLM.\nprometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]['content'] }}\\n{{ messages[1]['content'] }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"factual-validity\"\n)\n\nprometheus.load()\n\nresult = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", \"generation\": \"something done\"},\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'make something',\n# 'generation': 'something done',\n# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n# 'feedback': 'the feedback',\n# 'result': 5,\n# }\n# ]\n
"},{"location":"components-gallery/tasks/prometheuseval/#critique-for-relative-evaluation","title":"Critique for relative evaluation","text":"from distilabel.steps.tasks import PrometheusEval\nfrom distilabel.models import vLLM\n\n# Consider this as a placeholder for your actual LLM.\nprometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]\",\n ),\n mode=\"relative\",\n rubric=\"honesty\"\n)\n\nprometheus.load()\n\nresult = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", \"generations\": [\"something done\", \"other thing\"]},\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'make something',\n# 'generations': ['something done', 'other thing'],\n# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n# 'feedback': 'the feedback',\n# 'result': 'something done',\n# }\n# ]\n
"},{"location":"components-gallery/tasks/prometheuseval/#critique-with-a-custom-rubric","title":"Critique with a custom rubric","text":"from distilabel.steps.tasks import PrometheusEval\nfrom distilabel.models import vLLM\n\n# Consider this as a placeholder for your actual LLM.\nprometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"custom\",\n rubrics={\n \"custom\": \"[A]\\nScore 1: A\\nScore 2: B\\nScore 3: C\\nScore 4: D\\nScore 5: E\"\n }\n)\n\nprometheus.load()\n\nresult = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", \"generation\": \"something done\"},\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'make something',\n# 'generation': 'something done',\n# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n# 'feedback': 'the feedback',\n# 'result': 6,\n# }\n# ]\n
"},{"location":"components-gallery/tasks/prometheuseval/#critique-using-a-reference-answer","title":"Critique using a reference answer","text":"from distilabel.steps.tasks import PrometheusEval\nfrom distilabel.models import vLLM\n\n# Consider this as a placeholder for your actual LLM.\nprometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"helpfulness\",\n reference=True,\n)\n\nprometheus.load()\n\nresult = next(\n prometheus.process(\n [\n {\n \"instruction\": \"make something\",\n \"generation\": \"something done\",\n \"reference\": \"this is a reference answer\",\n },\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'make something',\n# 'generation': 'something done',\n# 'reference': 'this is a reference answer',\n# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n# 'feedback': 'the feedback',\n# 'result': 6,\n# }\n# ]\n
"},{"location":"components-gallery/tasks/prometheuseval/#references","title":"References","text":"Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
prometheus-eval: Evaluate your LLM's response with Prometheus \ud83d\udcaf
Score instructions based on their complexity using an LLM
.
ComplexityScorer
is a pre-defined task used to rank a list of instructions based on their complexity. It's an implementation of the complexity score task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instructions]\n end\n subgraph New columns\n OCOL0[scores]\n OCOL1[model_name]\n end\n end\n\n subgraph ComplexityScorer\n StepInput[Input Columns: instructions]\n StepOutput[Output Columns: scores, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/complexityscorer/#inputs","title":"Inputs","text":"List[str]
): The list of instructions to be scored.scores (List[float]
): The score for each instruction.
model_name (str
): The model name used to generate the scores.
from distilabel.steps.tasks import ComplexityScorer\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nscorer = ComplexityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n)\n\nscorer.load()\n\nresult = next(\n scorer.process(\n [{\"instructions\": [\"plain instruction\", \"highly complex instruction\"]}]\n )\n)\n# result\n# [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 5], 'distilabel_metadata': {'raw_output_complexity_scorer_0': 'output'}}]\n
"},{"location":"components-gallery/tasks/complexityscorer/#generate-structured-output-with-default-schema","title":"Generate structured output with default schema","text":"from distilabel.steps.tasks import ComplexityScorer\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nscorer = ComplexityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n use_default_structured_output=use_default_structured_output\n)\n\nscorer.load()\n\nresult = next(\n scorer.process(\n [{\"instructions\": [\"plain instruction\", \"highly complex instruction\"]}]\n )\n)\n# result\n# [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 2], 'distilabel_metadata': {'raw_output_complexity_scorer_0': '{ \\n \"scores\": [\\n 1, \\n 2\\n ]\\n}'}}]\n
"},{"location":"components-gallery/tasks/complexityscorer/#references","title":"References","text":"Score responses based on their quality using an LLM
.
QualityScorer
is a pre-defined task that defines the instruction
as the input and score
as the output. This task is used to rate the quality of instructions and responses. It's an implementation of the quality score task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'. The task follows the same scheme as the Complexity Scorer, but the instruction-response pairs are scored in terms of quality, obtaining a quality score for each instruction.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[responses]\n end\n subgraph New columns\n OCOL0[scores]\n OCOL1[model_name]\n end\n end\n\n subgraph QualityScorer\n StepInput[Input Columns: instruction, responses]\n StepOutput[Output Columns: scores, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/qualityscorer/#inputs","title":"Inputs","text":"instruction (str
): The instruction that was used to generate the responses
.
responses (List[str]
): The responses to be scored. Each response forms a pair with the instruction.
scores (List[float]
): The score for each instruction.
model_name (str
): The model name used to generate the scores.
from distilabel.steps.tasks import QualityScorer\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nscorer = QualityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n)\n\nscorer.load()\n\nresult = next(\n scorer.process(\n [\n {\n \"instruction\": \"instruction\",\n \"responses\": [\"good response\", \"weird response\", \"bad response\"]\n }\n ]\n )\n)\n# result\n[\n {\n 'instructions': 'instruction',\n 'model_name': 'test',\n 'scores': [5, 3, 1],\n }\n]\n
"},{"location":"components-gallery/tasks/qualityscorer/#generate-structured-output-with-default-schema","title":"Generate structured output with default schema","text":"from distilabel.steps.tasks import QualityScorer\nfrom distilabel.models import InferenceEndpointsLLM\n\nscorer = QualityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n use_default_structured_output=True\n)\n\nscorer.load()\n\nresult = next(\n scorer.process(\n [\n {\n \"instruction\": \"instruction\",\n \"responses\": [\"good response\", \"weird response\", \"bad response\"]\n }\n ]\n )\n)\n\n# result\n[{'instruction': 'instruction',\n'responses': ['good response', 'weird response', 'bad response'],\n'scores': [1, 2, 3],\n'distilabel_metadata': {'raw_output_quality_scorer_0': '{ \"scores\": [1, 2, 3] }'},\n'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/qualityscorer/#references","title":"References","text":"Contrastive Learning from AI Revisions (CLAIR).
CLAIR uses an AI system to minimally revise a solution A\u2192A\u00b4 such that the resulting preference A preferred
A\u2019 is much more contrastive and precise.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[task]\n ICOL1[student_solution]\n end\n subgraph New columns\n OCOL0[revision]\n OCOL1[rational]\n OCOL2[model_name]\n end\n end\n\n subgraph CLAIR\n StepInput[Input Columns: task, student_solution]\n StepOutput[Output Columns: revision, rational, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/clair/#inputs","title":"Inputs","text":"task (str
): The task or instruction.
student_solution (str
): An answer to the task that is to be revised.
revision (str
): The revised text.
rational (str
): The rational for the provided revision.
model_name (str
): The name of the model used to generate the revision and rational.
from distilabel.steps.tasks import CLAIR\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 4096,\n },\n)\nclair_task = CLAIR(llm=llm)\n\nclair_task.load()\n\nresult = next(\n clair_task.process(\n [\n {\n \"task\": \"How many gaps are there between the earth and the moon?\",\n \"student_solution\": 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon's orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.'\n }\n ]\n )\n)\n# result\n# [{'task': 'How many gaps are there between the earth and the moon?',\n# 'student_solution': 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.',\n# 'revision': 'There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\\'s orbital path, not the presence of any gaps.\\n\\nIn summary, the Moon\\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',\n# 'rational': 'The student\\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term \"gaps\" can be misleading in this context. 
The student should clarify what they mean by \"gaps.\" Secondly, the student provides some additional information about the Moon\\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\\'s conclusion could be more concise.',\n# 'distilabel_metadata': {'raw_output_c_l_a_i_r_0': '{teacher_reasoning}: The student\\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term \"gaps\" can be misleading in this context. The student should clarify what they mean by \"gaps.\" Secondly, the student provides some additional information about the Moon\\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\\'s conclusion could be more concise.\\n\\n{corrected_student_solution}: There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\\'s orbital path, not the presence of any gaps.\\n\\nIn summary, the Moon\\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',\n# 'raw_input_c_l_a_i_r_0': [{'role': 'system',\n# 'content': \"You are a teacher and your task is to minimally improve a student's answer. I will give you a {task} and a {student_solution}. Your job is to revise the {student_solution} such that it is clearer, more correct, and more engaging. Copy all non-corrected parts of the student's answer. Do not allude to the {corrected_student_solution} being a revision or a correction in your final solution.\"},\n# {'role': 'user',\n# 'content': '{task}: How many gaps are there between the earth and the moon?\\n\\n{student_solution}: There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.\\n\\n-----------------\\n\\nLet\\'s first think step by step with a {teacher_reasoning} to decide how to improve the {student_solution}, then give the {corrected_student_solution}. 
Mention the {teacher_reasoning} and {corrected_student_solution} identifiers to structure your answer.'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/clair/#references","title":"References","text":"Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
APO and CLAIR - GitHub Repository
Rank generations focusing on different aspects using an LLM
.
UltraFeedback: Boosting Language Models with High-quality Feedback.
"},{"location":"components-gallery/tasks/ultrafeedback/#attributes","title":"Attributes","text":"UltraFeedback
model. The available aspects are: - helpfulness
: Evaluate text outputs based on helpfulness. - honesty
: Evaluate text outputs based on honesty. - instruction-following
: Evaluate text outputs based on given instructions. - truthfulness
: Evaluate text outputs based on truthfulness. Additionally, a custom aspect has been defined by Argilla, so as to evaluate the overall assessment of the text outputs within a single prompt. The custom aspect is: - overall-rating
: Evaluate text outputs based on an overall assessment. Defaults to \"overall-rating\"
.graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[generations]\n end\n subgraph New columns\n OCOL0[ratings]\n OCOL1[rationales]\n OCOL2[model_name]\n end\n end\n\n subgraph UltraFeedback\n StepInput[Input Columns: instruction, generations]\n StepOutput[Output Columns: ratings, rationales, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/ultrafeedback/#inputs","title":"Inputs","text":"instruction (str
): The reference instruction to evaluate the text outputs.
generations (List[str]
): The text outputs to evaluate for the given instruction.
ratings (List[float]
): The ratings for each of the provided text outputs.
rationales (List[str]
): The rationales for each of the provided text outputs.
model_name (str
): The name of the model used to generate the ratings and rationales.
from distilabel.steps.tasks import UltraFeedback\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n use_default_structured_output=False\n)\n\nultrafeedback.load()\n\nresult = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'How much is 2+2?',\n# 'generations': ['4', 'and a car'],\n# 'ratings': [1, 2],\n# 'rationales': ['explanation for 4', 'explanation for and a car'],\n# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n# }\n# ]\n
"},{"location":"components-gallery/tasks/ultrafeedback/#rate-generations-from-different-llms-based-on-the-honesty-using-the-default-structured-output","title":"Rate generations from different LLMs based on the honesty, using the default structured output","text":"from distilabel.steps.tasks import UltraFeedback\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n aspect=\"honesty\"\n)\n\nultrafeedback.load()\n\nresult = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n)\n# result\n# [{'instruction': 'How much is 2+2?',\n# 'generations': ['4', 'and a car'],\n# 'ratings': [5, 1],\n# 'rationales': ['The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.',\n# \"The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. The model shows no uncertainty or indication that it does not know the answer.\"],\n# 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{\"ratings\": [\\n 5,\\n 1\\n] \\n\\n,\"rationales\": [\\n \"The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.\",\\n \"The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. The model shows no uncertainty or indication that it does not know the answer.\"\\n] }'},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/ultrafeedback/#rate-generations-from-different-llms-based-on-the-helpfulness-using-the-default-structured-output","title":"Rate generations from different LLMs based on the helpfulness, using the default structured output","text":"from distilabel.steps.tasks import UltraFeedback\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512},\n ),\n aspect=\"helpfulness\"\n)\n\nultrafeedback.load()\n\nresult = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n)\n# result\n# [{'instruction': 'How much is 2+2?',\n# 'generations': ['4', 'and a car'],\n# 'ratings': [1, 5],\n# 'rationales': ['Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.',\n# 'Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question.'],\n# 'rationales_for_rating': ['Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.',\n# 'Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question.'],\n# 'types': [1, 3, 1],\n# 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{ \\n \"ratings\": [\\n 1,\\n 5\\n ]\\n ,\\n \"rationales\": [\\n \"Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.\",\\n \"Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question.\"\\n ]\\n ,\\n \"rationales_for_rating\": [\\n \"Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.\",\\n \"Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question.\"\\n ]\\n ,\\n \"types\": [\\n 1, 3,\\n 1\\n ]\\n }'},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/ultrafeedback/#references","title":"References","text":"UltraFeedback: Boosting Language Models with High-quality Feedback
UltraFeedback - GitHub Repository
Rank the candidates based on the input using the LLM
model.
This step differs to other tasks as there is a single implementation of this model currently, and we will use a specific LLM
.
model: The model to use for the ranking. Defaults to \"llm-blender/PairRM\"
.
instructions: The instructions to use for the model. Defaults to None
.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[inputs]\n ICOL1[candidates]\n end\n subgraph New columns\n OCOL0[ranks]\n OCOL1[ranked_candidates]\n OCOL2[model_name]\n end\n end\n\n subgraph PairRM\n StepInput[Input Columns: inputs, candidates]\n StepOutput[Output Columns: ranks, ranked_candidates, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/pairrm/#inputs","title":"Inputs","text":"inputs (List[Dict[str, Any]]
): The input text or conversation to rank the candidates for.
candidates (List[Dict[str, Any]]
): The candidates to rank.
ranks (List[int]
): The ranks of the candidates based on the input.
ranked_candidates (List[Dict[str, Any]]
): The candidates ranked based on the input.
model_name (str
): The model name used to rank the candidate responses. Defaults to \"llm-blender/PairRM\"
.
from distilabel.steps.tasks import PairRM\n\n# Consider this as a placeholder for your actual LLM.\npair_rm = PairRM()\n\npair_rm.load()\n\nresult = next(\n scorer.process(\n [\n {\"input\": \"Hello, how are you?\", \"candidates\": [\"fine\", \"good\", \"bad\"]},\n ]\n )\n)\n# result\n# [\n# {\n# 'input': 'Hello, how are you?',\n# 'candidates': ['fine', 'good', 'bad'],\n# 'ranks': [2, 1, 3],\n# 'ranked_candidates': ['good', 'fine', 'bad'],\n# 'model_name': 'llm-blender/PairRM',\n# }\n# ]\n
"},{"location":"components-gallery/tasks/pairrm/#references","title":"References","text":"LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion
Pair Ranking Model
Generate a positive and negative (optionally) sentences given an anchor sentence.
GenerateSentencePair
is a pre-defined task that, given an anchor sentence, generates a positive sentence related to the anchor and optionally a negative sentence that is either unrelated to the anchor or similar to it. Optionally, you can give a context to guide the LLM towards more specific behavior. This task is useful for generating training datasets for embedding models.
triplet: a flag to indicate if the task should generate a triplet of sentences (anchor, positive, negative). Defaults to False
.
action: the action to perform to generate the positive sentence.
context: the context to use for the generation. Can be helpful to guide the LLM towards more specific context. Not used by default.
hard_negative: A flag to indicate if the negative should be a hard-negative or not. Hard negatives make it hard for the model to distinguish against the positive, with a higher degree of semantic similarity.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[anchor]\n end\n subgraph New columns\n OCOL0[positive]\n OCOL1[negative]\n OCOL2[model_name]\n end\n end\n\n subgraph GenerateSentencePair\n StepInput[Input Columns: anchor]\n StepOutput[Output Columns: positive, negative, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/generatesentencepair/#inputs","title":"Inputs","text":"str
): The anchor sentence to generate the positive and negative sentences.positive (str
): The positive sentence related to the anchor
.
negative (str
): The negative sentence unrelated to the anchor
if triplet=True
, or more similar to the positive to make it more challenging for a model to distinguish in case hard_negative=True
.
model_name (str
): The name of the model that was used to generate the sentences.
from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"paraphrase\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"What Game of Thrones villain would be the most likely to give you mercy?\"}])\n
"},{"location":"components-gallery/tasks/generatesentencepair/#generating-semantically-similar-sentences","title":"Generating semantically similar sentences","text":"from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import GenerateSentencePair\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"semantically-similar\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"How does 3D printing work?\"}])\n
"},{"location":"components-gallery/tasks/generatesentencepair/#generating-queries","title":"Generating queries","text":"from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"Argilla is an open-source data curation platform for LLMs. Using Argilla, ...\"}])\n
"},{"location":"components-gallery/tasks/generatesentencepair/#generating-answers","title":"Generating answers","text":"from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"answer\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"What Game of Thrones villain would be the most likely to give you mercy?\"}])\n
"},{"location":"components-gallery/tasks/generatesentencepair/#_1","title":")","text":"from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n context=\"Argilla is an open-source data curation platform for LLMs.\",\n hard_negative=True,\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n use_default_structured_output=True\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"I want to generate queries for my LLM.\"}])\n
"},{"location":"components-gallery/tasks/generateembeddings/","title":"GenerateEmbeddings","text":"Generate embeddings using the last hidden state of an LLM
.
Generate embeddings for a text input using the last hidden state of an LLM
, as described in the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
LLM
to use to generate the embeddings.graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[text]\n end\n subgraph New columns\n OCOL0[embedding]\n OCOL1[model_name]\n end\n end\n\n subgraph GenerateEmbeddings\n StepInput[Input Columns: text]\n StepOutput[Output Columns: embedding, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/generateembeddings/#inputs","title":"Inputs","text":"str
, List[Dict[str, str]]
): The input text or conversation to generate embeddings for.embedding (List[float]
): The embedding of the input text or conversation.
model_name (str
): The model name used to generate the embeddings.
from distilabel.steps.tasks import GenerateEmbeddings\nfrom distilabel.models.llms.huggingface import TransformersLLM\n\n# Consider this as a placeholder for your actual LLM.\nembedder = GenerateEmbeddings(\n llm=TransformersLLM(\n model=\"TaylorAI/bge-micro-v2\",\n model_kwargs={\"is_decoder\": True},\n cuda_devices=[],\n )\n)\nembedder.load()\n\nresult = next(\n embedder.process(\n [\n {\"text\": \"Hello, how are you?\"},\n ]\n )\n)\n
"},{"location":"components-gallery/tasks/generateembeddings/#references","title":"References","text":"Task that clusters a set of texts and generates summary labels for each cluster.
This is a GlobalTask
that inherits from TextClassification
, this means that all the attributes from that class are available here. Also, in this case we deal with all the inputs at once, instead of using batches. The input_batch_size
is used here to send the examples to the LLM in batches (a subtle difference with the more common Task
definitions). The task looks in each cluster for a given number of representative examples (the number is set by the samples_per_cluster
attribute), and sends them to the LLM to get a label/s that represent the cluster. The labels are then assigned to each text in the cluster. The clusters and projections used in the step, are assumed to be obtained from the UMAP
+ DBSCAN
steps, but could be generated for similar steps, as long as they represent the same concepts. This step runs a pipeline like the one in this repository: https://github.com/huggingface/text-clustering
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[text]\n ICOL1[projection]\n ICOL2[cluster_label]\n end\n subgraph New columns\n OCOL0[summary_label]\n OCOL1[model_name]\n end\n end\n\n subgraph TextClustering\n StepInput[Input Columns: text, projection, cluster_label]\n StepOutput[Output Columns: summary_label, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/textclustering/#inputs","title":"Inputs","text":"text (str
): The reference text we want to obtain labels for.
projection (List[float]
): Vector representation of the text to cluster, normally the output from the UMAP
step.
cluster_label (int
): Integer representing the label of a given cluster. -1 means it wasn't clustered.
summary_label (str
): The label or list of labels for the text.
model_name (str
): The name of the model used to generate the label/s.
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps import UMAP, DBSCAN, TextClustering\nfrom distilabel.pipeline import Pipeline\n\nds_name = \"argilla-warehouse/personahub-fineweb-edu-4-clustering-100k\"\n\nwith Pipeline(name=\"Text clustering dataset\") as pipeline:\n batch_size = 500\n\n ds = load_dataset(ds_name, split=\"train\").select(range(10000))\n loader = make_generator_step(ds, batch_size=batch_size, repo_id=ds_name)\n\n umap = UMAP(n_components=2, metric=\"cosine\")\n dbscan = DBSCAN(eps=0.3, min_samples=30)\n\n text_clustering = TextClustering(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n n=3, # 3 labels per example\n query_title=\"Examples of Personas\",\n samples_per_cluster=10,\n context=(\n \"Describe the main themes, topics, or categories that could describe the \"\n \"following types of personas. All the examples of personas must share \"\n \"the same set of labels.\"\n ),\n default_label=\"None\",\n savefig=True,\n input_batch_size=8,\n input_mappings={\"text\": \"persona\"},\n use_default_structured_output=True,\n )\n\n loader >> umap >> dbscan >> text_clustering\n
"},{"location":"components-gallery/tasks/textclustering/#references","title":"References","text":"Generate queries and answers for the given functions in JSON format.
The APIGenGenerator
is inspired by the APIGen pipeline, which was designed to generate verifiable and diverse function-calling datasets. The task generates a set of diverse queries and corresponding answers for the given functions in JSON format.
system_prompt: System prompt for the task. Has a default one.
exclude_failed_execution: Whether to exclude failed executions (won't run on those rows that have a False in keep_row_after_execution_check
column, which comes from running APIGenExecutionChecker
). Defaults to True.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[func_desc]\n ICOL1[query]\n ICOL2[answers]\n ICOL3[execution_result]\n end\n subgraph New columns\n OCOL0[thought]\n OCOL1[keep_row_after_semantic_check]\n end\n end\n\n subgraph APIGenSemanticChecker\n StepInput[Input Columns: func_desc, query, answers, execution_result]\n StepOutput[Output Columns: thought, keep_row_after_semantic_check]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n ICOL3 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/apigensemanticchecker/#inputs","title":"Inputs","text":"func_desc (str
): Description of what the function should do.
query (str
): Instruction from the user.
answers (str
): JSON encoded list with arguments to be passed to the function/API. Should be loaded using json.loads
.
execution_result (str
): Result of the function/API executed.
thought (str
): Reasoning for the output on whether to keep this output or not.
keep_row_after_semantic_check (bool
): True or False, can be used to filter afterwards.
from distilabel.steps.tasks import APIGenSemanticChecker\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n)\nsemantic_checker = APIGenSemanticChecker(\n use_default_structured_output=False,\n llm=llm\n)\nsemantic_checker.load()\n\nres = next(\n semantic_checker.process(\n [\n {\n \"func_desc\": \"Fetch information about a specific cat breed from the Cat Breeds API.\",\n \"query\": \"What information can be obtained about the Maine Coon cat breed?\",\n \"answers\": json.dumps([{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]),\n \"execution_result\": \"The Maine Coon is a big and hairy breed of cat\",\n }\n ]\n )\n)\nres\n# [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',\n# 'query': 'What information can be obtained about the Maine Coon cat breed?',\n# 'answers': [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}],\n# 'execution_result': 'The Maine Coon is a big and hairy breed of cat',\n# 'thought': '',\n# 'keep_row_after_semantic_check': True,\n# 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',\n# 'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user\u2019s intentions.\\n\\nDo not pass if:\\n1. The function call does not align with the query\u2019s objective, or the input arguments appear incorrect.\\n2. The function call and arguments are not properly chosen from the available functions.\\n3. The number of function calls does not correspond to the user\u2019s intentions.\\n4. The execution results are irrelevant and do not match the function\u2019s purpose.\\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\\n'},\n# {'role': 'user',\n# 'content': 'Given Information:\\n- All Available Functions:\\nFetch information about a specific cat breed from the Cat Breeds API.\\n- User Query: What information can be obtained about the Maine Coon cat breed?\\n- Generated Function Calls: [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]\\n- Execution Results: The Maine Coon is a big and hairy breed of cat\\n\\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\\n\\nThe main decision factor is wheather the function calls accurately reflect the query\\'s intentions and the function descriptions.\\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\\n\\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\\n```\\n{\\n \"thought\": \"Concisely describe your reasoning here\",\\n \"pass\": \"yes\" or \"no\"\\n}\\n```\\n'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/apigensemanticchecker/#semantic-checker-for-generated-function-calls-structured-output","title":"Semantic checker for generated function calls (structured output)","text":"from distilabel.steps.tasks import APIGenSemanticChecker\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n)\nsemantic_checker = APIGenSemanticChecker(\n use_default_structured_output=True,\n llm=llm\n)\nsemantic_checker.load()\n\nres = next(\n semantic_checker.process(\n [\n {\n \"func_desc\": \"Fetch information about a specific cat breed from the Cat Breeds API.\",\n \"query\": \"What information can be obtained about the Maine Coon cat breed?\",\n \"answers\": json.dumps([{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]),\n \"execution_result\": \"The Maine Coon is a big and hairy breed of cat\",\n }\n ]\n )\n)\nres\n# [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',\n# 'query': 'What information can be obtained about the Maine Coon cat breed?',\n# 'answers': [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}],\n# 'execution_result': 'The Maine Coon is a big and hairy breed of cat',\n# 'keep_row_after_semantic_check': True,\n# 'thought': '',\n# 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',\n# 'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user\u2019s intentions.\\n\\nDo not pass if:\\n1. The function call does not align with the query\u2019s objective, or the input arguments appear incorrect.\\n2. The function call and arguments are not properly chosen from the available functions.\\n3. The number of function calls does not correspond to the user\u2019s intentions.\\n4. The execution results are irrelevant and do not match the function\u2019s purpose.\\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\\n'},\n# {'role': 'user',\n# 'content': 'Given Information:\\n- All Available Functions:\\nFetch information about a specific cat breed from the Cat Breeds API.\\n- User Query: What information can be obtained about the Maine Coon cat breed?\\n- Generated Function Calls: [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]\\n- Execution Results: The Maine Coon is a big and hairy breed of cat\\n\\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\\n\\nThe main decision factor is wheather the function calls accurately reflect the query\\'s intentions and the function descriptions.\\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\\n'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/apigensemanticchecker/#references","title":"References","text":"APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
Salesforce/xlam-function-calling-60k
Generate text retrieval data with an LLM
to later on train an embedding model.
GenerateTextRetrievalData
is a Task
that generates text retrieval data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category=\"text-retrieval\"
; so that the LLM
generates a list of tasks that are flattened so that each row contains a single task for the text-retrieval category.
language: The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
query_type: The type of query to be generated, which can be extremely long-tail
, long-tail
, or common
. Defaults to None
, meaning that it will be randomly sampled.
query_length: The length of the query to be generated, which can be less than 5 words
, 5 to 15 words
, or at least 10 words
. Defaults to None
, meaning that it will be randomly sampled.
difficulty: The difficulty of the query to be generated, which can be high school
, college
, or PhD
. Defaults to None
, meaning that it will be randomly sampled.
clarity: The clarity of the query to be generated, which can be clear
, understandable with some effort
, or ambiguous
. Defaults to None
, meaning that it will be randomly sampled.
num_words: The number of words in the query to be generated, which can be 50
, 100
, 200
, 300
, 400
, or 500
. Defaults to None
, meaning that it will be randomly sampled.
seed: The random seed to be set in case there's any sampling within the format_input
method.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[task]\n end\n subgraph New columns\n OCOL0[user_query]\n OCOL1[positive_document]\n OCOL2[hard_negative_document]\n OCOL3[model_name]\n end\n end\n\n subgraph GenerateTextRetrievalData\n StepInput[Input Columns: task]\n StepOutput[Output Columns: user_query, positive_document, hard_negative_document, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/generatetextretrievaldata/#inputs","title":"Inputs","text":"str
): The task description to be used in the generation.user_query (str
): the user query generated by the LLM
.
positive_document (str
): the positive document generated by the LLM
.
hard_negative_document (str
): the hard negative document generated by the LLM
.
model_name (str
): the name of the model used to generate the text retrieval data.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextRetrievalData\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-retrieval\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateTextRetrievalData(\n language=\"English\",\n query_type=\"common\",\n query_length=\"5 to 15 words\",\n difficulty=\"high school\",\n clarity=\"clear\",\n num_words=100,\n llm=..., # LLM instance\n )\n\n task >> generate\n
"},{"location":"components-gallery/tasks/generatetextretrievaldata/#references","title":"References","text":"Generate short text matching data with an LLM
to later on train an embedding model.
GenerateShortTextMatchingData
is a Task
that generates short text matching data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category=\"text-matching-short\"
, so that the LLM
generates a list of tasks that are flattened so that each row contains a single task for the text-matching-short category.
language: The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
seed: The random seed to be set in case there's any sampling within the format_input
method. Note that in this task the seed
has no effect since there are no sampling params.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[task]\n end\n subgraph New columns\n OCOL0[input]\n OCOL1[positive_document]\n OCOL2[model_name]\n end\n end\n\n subgraph GenerateShortTextMatchingData\n StepInput[Input Columns: task]\n StepOutput[Output Columns: input, positive_document, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/generateshorttextmatchingdata/#inputs","title":"Inputs","text":"str
): The task description to be used in the generation.input (str
): the input generated by the LLM
.
positive_document (str
): the positive document generated by the LLM
.
model_name (str
): the name of the model used to generate the short text matching data.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateShortTextMatchingData\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-matching-short\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateShortTextMatchingData(\n language=\"English\",\n llm=..., # LLM instance\n )\n\n task >> generate\n
"},{"location":"components-gallery/tasks/generateshorttextmatchingdata/#references","title":"References","text":"Generate long text matching data with an LLM
to later on train an embedding model.
GenerateLongTextMatchingData
is a Task
that generates long text matching data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
and category=\"text-matching-long\", so that the LLM
generates a list of tasks that are flattened so that each row contains a single task for the text-matching-long category.
language: The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
seed: The random seed to be set in case there's any sampling within the format_input
method. Note that in this task the seed
has no effect since there are no sampling params.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[task]\n end\n subgraph New columns\n OCOL0[input]\n OCOL1[positive_document]\n OCOL2[model_name]\n end\n end\n\n subgraph GenerateLongTextMatchingData\n StepInput[Input Columns: task]\n StepOutput[Output Columns: input, positive_document, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/generatelongtextmatchingdata/#inputs","title":"Inputs","text":"str
): The task description to be used in the generation.input (str
): the input generated by the LLM
.
positive_document (str
): the positive document generated by the LLM
.
model_name (str
): the name of the model used to generate the long text matching data.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateLongTextMatchingData\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-matching-long\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateLongTextMatchingData(\n language=\"English\",\n llm=..., # LLM instance\n )\n\n task >> generate\n
"},{"location":"components-gallery/tasks/generatelongtextmatchingdata/#references","title":"References","text":"Generate text classification data with an LLM
to later on train an embedding model.
GenerateTextClassificationData
is a Task
that generates text classification data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
and category=\"text-classification\", so that the LLM
generates a list of tasks that are flattened so that each row contains a single task for the text-classification category.
language: The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
difficulty: The difficulty of the query to be generated, which can be high school
, college
, or PhD
. Defaults to None
, meaning that it will be randomly sampled.
clarity: The clarity of the query to be generated, which can be clear
, understandable with some effort
, or ambiguous
. Defaults to None
, meaning that it will be randomly sampled.
seed: The random seed to be set in case there's any sampling within the format_input
method.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[task]\n end\n subgraph New columns\n OCOL0[input_text]\n OCOL1[label]\n OCOL2[misleading_label]\n OCOL3[model_name]\n end\n end\n\n subgraph GenerateTextClassificationData\n StepInput[Input Columns: task]\n StepOutput[Output Columns: input_text, label, misleading_label, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/generatetextclassificationdata/#inputs","title":"Inputs","text":"str
): The task description to be used in the generation.input_text (str
): the input text generated by the LLM
.
label (str
): the label generated by the LLM
.
misleading_label (str
): the misleading label generated by the LLM
.
model_name (str
): the name of the model used to generate the text classification data.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextClassificationData\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-classification\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateTextClassificationData(\n language=\"English\",\n difficulty=\"high school\",\n clarity=\"clear\",\n llm=..., # LLM instance\n )\n\n task >> generate\n
"},{"location":"components-gallery/tasks/generatetextclassificationdata/#references","title":"References","text":"Generate structured content for a given instruction
using an LLM
.
StructuredGeneration
is a pre-defined task that defines the instruction
and the structured_output
as the inputs, and generation
as the output. This task is used to generate structured content based on the input instruction and following the schema provided within the structured_output
column per each instruction
. The model_name
is also returned as part of the output in order to enhance it.
True
, which means that if the column system_prompt
is defined within the input batch, then the system_prompt
will be used, otherwise, it will be ignored.graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[structured_output]\n end\n subgraph New columns\n OCOL0[generation]\n OCOL1[model_name]\n end\n end\n\n subgraph StructuredGeneration\n StepInput[Input Columns: instruction, structured_output]\n StepOutput[Output Columns: generation, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/structuredgeneration/#inputs","title":"Inputs","text":"instruction (str
): The instruction to generate structured content from.
structured_output (Dict[str, Any]
): The structured_output to generate structured content from. It should be a Python dictionary with the keys format
and schema
, where format
should be one of json
or regex
, and the schema
should be either the JSON schema or the regex pattern, respectively.
generation (str
): The generated text matching the provided schema, if possible.
model_name (str
): The name of the model used to generate the text.
from distilabel.steps.tasks import StructuredGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nstructured_gen = StructuredGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n)\n\nstructured_gen.load()\n\nresult = next(\n structured_gen.process(\n [\n {\n \"instruction\": \"Create an RPG character\",\n \"structured_output\": {\n \"format\": \"json\",\n \"schema\": {\n \"properties\": {\n \"name\": {\n \"title\": \"Name\",\n \"type\": \"string\"\n },\n \"description\": {\n \"title\": \"Description\",\n \"type\": \"string\"\n },\n \"role\": {\n \"title\": \"Role\",\n \"type\": \"string\"\n },\n \"weapon\": {\n \"title\": \"Weapon\",\n \"type\": \"string\"\n }\n },\n \"required\": [\n \"name\",\n \"description\",\n \"role\",\n \"weapon\"\n ],\n \"title\": \"Character\",\n \"type\": \"object\"\n }\n },\n }\n ]\n )\n)\n
"},{"location":"components-gallery/tasks/structuredgeneration/#generate-structured-output-from-a-regex-pattern-only-works-with-llms-that-support-regex-the-providers-using-outlines","title":"Generate structured output from a regex pattern (only works with LLMs that support regex, the providers using outlines)","text":"from distilabel.steps.tasks import StructuredGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nstructured_gen = StructuredGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n)\n\nstructured_gen.load()\n\nresult = next(\n structured_gen.process(\n [\n {\n \"instruction\": \"What's the weather like today in Seattle in Celsius degrees?\",\n \"structured_output\": {\n \"format\": \"regex\",\n \"schema\": r\"(\\d{1,2})\u00b0C\"\n },\n\n }\n ]\n )\n)\n
"},{"location":"components-gallery/tasks/monolingualtripletgenerator/","title":"MonolingualTripletGenerator","text":"Generate monolingual triplets with an LLM
to later on train an embedding model.
MonolingualTripletGenerator
is a GeneratorTask
that generates monolingual triplets with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
language: The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
unit: The unit of the data to be generated, which can be sentence
, phrase
, or passage
. Defaults to None
, meaning that it will be randomly sampled.
difficulty: The difficulty of the query to be generated, which can be elementary school
, high school
, or college
. Defaults to None
, meaning that it will be randomly sampled.
high_score: The high score of the query to be generated, which can be 4
, 4.5
, or 5
. Defaults to None
, meaning that it will be randomly sampled.
low_score: The low score of the query to be generated, which can be 2.5
, 3
, or 3.5
. Defaults to None
, meaning that it will be randomly sampled.
seed: The random seed to be set in case there's any sampling within the format_input
method.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[S1]\n OCOL1[S2]\n OCOL2[S3]\n OCOL3[model_name]\n end\n end\n\n subgraph MonolingualTripletGenerator\n StepOutput[Output Columns: S1, S2, S3, model_name]\n end\n\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n
"},{"location":"components-gallery/tasks/monolingualtripletgenerator/#outputs","title":"Outputs","text":"S1 (str
): the first sentence generated by the LLM
.
S2 (str
): the second sentence generated by the LLM
.
S3 (str
): the third sentence generated by the LLM
.
model_name (str
): the name of the model used to generate the monolingual triplets.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import MonolingualTripletGenerator\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = MonolingualTripletGenerator(\n language=\"English\",\n unit=\"sentence\",\n difficulty=\"elementary school\",\n high_score=\"4\",\n low_score=\"2.5\",\n llm=...,\n )\n\n ...\n\n task >> ...\n
"},{"location":"components-gallery/tasks/bitextretrievalgenerator/","title":"BitextRetrievalGenerator","text":"Generate bitext retrieval data with an LLM
to later on train an embedding model.
BitextRetrievalGenerator
is a GeneratorTask
that generates bitext retrieval data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
source_language: The source language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
target_language: The target language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
unit: The unit of the data to be generated, which can be sentence
, phrase
, or passage
. Defaults to None
, meaning that it will be randomly sampled.
difficulty: The difficulty of the query to be generated, which can be elementary school
, high school
, or college
. Defaults to None
, meaning that it will be randomly sampled.
high_score: The high score of the query to be generated, which can be 4
, 4.5
, or 5
. Defaults to None
, meaning that it will be randomly sampled.
low_score: The low score of the query to be generated, which can be 2.5
, 3
, or 3.5
. Defaults to None
, meaning that it will be randomly sampled.
seed: The random seed to be set in case there's any sampling within the format_input
method.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[S1]\n OCOL1[S2]\n OCOL2[S3]\n OCOL3[model_name]\n end\n end\n\n subgraph BitextRetrievalGenerator\n StepOutput[Output Columns: S1, S2, S3, model_name]\n end\n\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n
"},{"location":"components-gallery/tasks/bitextretrievalgenerator/#outputs","title":"Outputs","text":"S1 (str
): the first sentence generated by the LLM
.
S2 (str
): the second sentence generated by the LLM
.
S3 (str
): the third sentence generated by the LLM
.
model_name (str
): the name of the model used to generate the bitext retrieval data.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import BitextRetrievalGenerator\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = BitextRetrievalGenerator(\n source_language=\"English\",\n target_language=\"Spanish\",\n unit=\"sentence\",\n difficulty=\"elementary school\",\n high_score=\"4\",\n low_score=\"2.5\",\n llm=...,\n )\n\n ...\n\n task >> ...\n
"},{"location":"components-gallery/tasks/embeddingtaskgenerator/","title":"EmbeddingTaskGenerator","text":"Generate task descriptions for embedding-related tasks using an LLM
.
EmbeddingTaskGenerator
is a GeneratorTask
that doesn't receive any input besides the provided attributes, and generates task descriptions for embedding-related tasks using a pre-defined prompt based on the category
attribute. The category
attribute should be one of the following:
- `text-retrieval`: Generate task descriptions for text retrieval tasks.\n- `text-matching-short`: Generate task descriptions for short text matching tasks.\n- `text-matching-long`: Generate task descriptions for long text matching tasks.\n- `text-classification`: Generate task descriptions for text classification tasks.\n
"},{"location":"components-gallery/tasks/embeddingtaskgenerator/#attributes","title":"Attributes","text":"category: The category of the task to be generated, which can either be text-retrieval
, text-matching-short
, text-matching-long
, or text-classification
.
flatten_tasks: Whether to flatten the tasks i.e. since a list of tasks is generated by the LLM
, this attribute indicates whether to flatten the list or not. Defaults to False
, meaning that running this task with num_generations=1
will return a distilabel.Distiset
with one row only containing a list with around 20 tasks; otherwise, if set to True
, it will return a distilabel.Distiset
with around 20 rows, each containing one task.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[tasks]\n OCOL1[task]\n OCOL2[model_name]\n end\n end\n\n subgraph EmbeddingTaskGenerator\n StepOutput[Output Columns: tasks, task, model_name]\n end\n\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n
"},{"location":"components-gallery/tasks/embeddingtaskgenerator/#outputs","title":"Outputs","text":"tasks (List[str]
): the list of tasks generated by the LLM
.
task (str
): the task generated by the LLM
if flatten_tasks=True
.
model_name (str
): the name of the model used to generate the tasks.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-retrieval\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n ...\n\n task >> ...\n
"},{"location":"components-gallery/tasks/embeddingtaskgenerator/#references","title":"References","text":"AnthropicLLM
Anthropic LLM implementation running the Async API client.
AnthropicLLM
OpenAILLM
OpenAI LLM implementation running the async API client.
OpenAILLM
AnyscaleLLM
Anyscale LLM implementation running the async API client of OpenAI.
AnyscaleLLM
AzureOpenAILLM
Azure OpenAI LLM implementation running the async API client.
AzureOpenAILLM
TogetherLLM
TogetherLLM LLM implementation running the async API client of OpenAI.
TogetherLLM
ClientvLLM
A client for the vLLM
server implementing the OpenAI API specification.
ClientvLLM
CohereLLM
Cohere API implementation using the async client for concurrent text generation.
CohereLLM
GroqLLM
Groq API implementation using the async client for concurrent text generation.
GroqLLM
InferenceEndpointsLLM
InferenceEndpoints LLM implementation running the async API client.
InferenceEndpointsLLM
LiteLLM
LiteLLM implementation running the async API client.
LiteLLM
MistralLLM
Mistral LLM implementation running the async API client.
MistralLLM
MixtureOfAgentsLLM
Mixture-of-Agents
implementation.
MixtureOfAgentsLLM
OllamaLLM
Ollama LLM implementation running the Async API client.
OllamaLLM
VertexAILLM
VertexAI LLM implementation running the async API clients for Gemini.
VertexAILLM
TransformersLLM
Hugging Face transformers
library LLM implementation using the text generation
TransformersLLM
LlamaCppLLM
llama.cpp LLM implementation running the Python bindings for the C++ code.
LlamaCppLLM
vLLM
vLLM
library LLM implementation.
vLLM
Anthropic LLM implementation running the Async API client.
"},{"location":"components-gallery/llms/anthropicllm/#attributes","title":"Attributes","text":"model: the name of the model to use for the LLM e.g. \"claude-3-opus-20240229\", \"claude-3-sonnet-20240229\", etc. Available models can be checked here: Anthropic: Models overview.
api_key: the API key to authenticate the requests to the Anthropic API. If not provided, it will be read from ANTHROPIC_API_KEY
environment variable.
base_url: the base URL to use for the Anthropic API. Defaults to None
which means that https://api.anthropic.com
will be used internally.
timeout: the maximum time in seconds to wait for a response. Defaults to 600.0
.
max_retries: The maximum number of times to retry the request before failing. Defaults to 6
.
http_client: if provided, an alternative HTTP client to use for calling Anthropic API. Defaults to None
.
structured_output: a dictionary containing the structured output configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
_api_key_env_var: the name of the environment variable to use for the API key. It is meant to be used internally.
_aclient: the AsyncAnthropic
client to use for the Anthropic API. It is meant to be used internally. Set in the load
method.
api_key: the API key to authenticate the requests to the Anthropic API. If not provided, it will be read from ANTHROPIC_API_KEY
environment variable.
base_url: the base URL to use for the Anthropic API. Defaults to \"https://api.anthropic.com\"
.
timeout: the maximum time in seconds to wait for a response. Defaults to 600.0
.
max_retries: the maximum number of times to retry the request before failing. Defaults to 6
.
from distilabel.models.llms import AnthropicLLM\n\nllm = AnthropicLLM(model=\"claude-3-opus-20240229\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/anthropicllm/#generate-structured-data","title":"Generate structured data","text":"from pydantic import BaseModel\nfrom distilabel.models.llms import AnthropicLLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = AnthropicLLM(\n model=\"claude-3-opus-20240229\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
"},{"location":"components-gallery/llms/openaillm/","title":"OpenAILLM","text":"OpenAI LLM implementation running the async API client.
"},{"location":"components-gallery/llms/openaillm/#attributes","title":"Attributes","text":"model: the model name to use for the LLM e.g. \"gpt-3.5-turbo\", \"gpt-4\", etc. Supported models can be found here.
base_url: the base URL to use for the OpenAI API requests. Defaults to None
, which means that the value set for the environment variable OPENAI_BASE_URL
will be used, or \"https://api.openai.com/v1\" if not set.
api_key: the API key to authenticate the requests to the OpenAI API. Defaults to None
which means that the value set for the environment variable OPENAI_API_KEY
will be used, or None
if not set.
max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 6
.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120
.
structured_output: a dictionary containing the structured output configuration configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
base_url: the base URL to use for the OpenAI API requests. Defaults to None
.
api_key: the API key to authenticate the requests to the OpenAI API. Defaults to None
.
max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 6
.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120
.
from distilabel.models.llms import OpenAILLM\n\nllm = OpenAILLM(model=\"gpt-4-turbo\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/openaillm/#generate-text-from-a-custom-endpoint-following-the-openai-api","title":"Generate text from a custom endpoint following the OpenAI API","text":"from distilabel.models.llms import OpenAILLM\n\nllm = OpenAILLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n base_url=r\"http://localhost:8080/v1\"\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/openaillm/#generate-structured-data","title":"Generate structured data","text":"from pydantic import BaseModel\nfrom distilabel.models.llms import OpenAILLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = OpenAILLM(\n model=\"gpt-4-turbo\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
"},{"location":"components-gallery/llms/openaillm/#generate-with-batch-api-offline-batch-generation","title":"Generate with Batch API (offline batch generation)","text":"from distilabel.models.llms import OpenAILLM\n\nload = llm = OpenAILLM(\n model=\"gpt-3.5-turbo\",\n use_offline_batch_generation=True,\n offline_batch_generation_block_until_done=5, # poll for results every 5 seconds\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n# [['Hello! How can I assist you today?']]\n
"},{"location":"components-gallery/llms/anyscalellm/","title":"AnyscaleLLM","text":"Anyscale LLM implementation running the async API client of OpenAI.
"},{"location":"components-gallery/llms/anyscalellm/#attributes","title":"Attributes","text":"model: the model name to use for the LLM, e.g., google/gemma-7b-it
. See the supported models under the \"Text Generation -> Supported Models\" section here.
base_url: the base URL to use for the Anyscale API requests. Defaults to None
, which means that the value set for the environment variable ANYSCALE_BASE_URL
will be used, or \"https://api.endpoints.anyscale.com/v1\" if not set.
api_key: the API key to authenticate the requests to the Anyscale API. Defaults to None
which means that the value set for the environment variable ANYSCALE_API_KEY
will be used, or None
if not set.
_api_key_env_var: the name of the environment variable to use for the API key. It is meant to be used internally.
from distilabel.models.llms import AnyscaleLLM\n\nllm = AnyscaleLLM(model=\"google/gemma-7b-it\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/azureopenaillm/","title":"AzureOpenAILLM","text":"Azure OpenAI LLM implementation running the async API client.
"},{"location":"components-gallery/llms/azureopenaillm/#attributes","title":"Attributes","text":"model: the model name to use for the LLM i.e. the name of the Azure deployment.
base_url: the base URL to use for the Azure OpenAI API can be set with AZURE_OPENAI_ENDPOINT
. Defaults to None
which means that the value set for the environment variable AZURE_OPENAI_ENDPOINT
will be used, or None
if not set.
api_key: the API key to authenticate the requests to the Azure OpenAI API. Defaults to None
which means that the value set for the environment variable AZURE_OPENAI_API_KEY
will be used, or None
if not set.
api_version: the API version to use for the Azure OpenAI API. Defaults to None
which means that the value set for the environment variable OPENAI_API_VERSION
will be used, or None
if not set.
from distilabel.models.llms import AzureOpenAILLM\n\nllm = AzureOpenAILLM(model=\"gpt-4-turbo\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/azureopenaillm/#generate-text-from-a-custom-endpoint-following-the-openai-api","title":"Generate text from a custom endpoint following the OpenAI API","text":"from distilabel.models.llms import AzureOpenAILLM\n\nllm = AzureOpenAILLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n base_url=r\"http://localhost:8080/v1\"\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/azureopenaillm/#generate-structured-data","title":"Generate structured data","text":"from pydantic import BaseModel\nfrom distilabel.models.llms import AzureOpenAILLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = AzureOpenAILLM(\n model=\"gpt-4-turbo\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
"},{"location":"components-gallery/llms/togetherllm/","title":"TogetherLLM","text":"TogetherLLM LLM implementation running the async API client of OpenAI.
"},{"location":"components-gallery/llms/togetherllm/#attributes","title":"Attributes","text":"model: the model name to use for the LLM e.g. \"mistralai/Mixtral-8x7B-Instruct-v0.1\". Supported models can be found here.
base_url: the base URL to use for the Together API can be set with TOGETHER_BASE_URL
. Defaults to None
which means that the value set for the environment variable TOGETHER_BASE_URL
will be used, or \"https://api.together.xyz/v1\" if not set.
api_key: the API key to authenticate the requests to the Together API. Defaults to None
which means that the value set for the environment variable TOGETHER_API_KEY
will be used, or None
if not set.
_api_key_env_var: the name of the environment variable to use for the API key. It is meant to be used internally.
from distilabel.models.llms import AnyscaleLLM\n\nllm = TogetherLLM(model=\"mistralai/Mixtral-8x7B-Instruct-v0.1\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/clientvllm/","title":"ClientvLLM","text":"A client for the vLLM
server implementing the OpenAI API specification.
base_url: the base URL of the vLLM
server. Defaults to \"http://localhost:8000\"
.
max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 6
.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120
.
httpx_client_kwargs: extra kwargs that will be passed to the httpx.AsyncClient
created to comunicate with the vLLM
server. Defaults to None
.
tokenizer: the Hugging Face Hub repo id or path of the tokenizer that will be used to apply the chat template and tokenize the inputs before sending it to the server. Defaults to None
.
tokenizer_revision: the revision of the tokenizer to load. Defaults to None
.
_aclient: the httpx.AsyncClient
used to comunicate with the vLLM
server. Defaults to None
.
base_url: the base url of the vLLM
server. Defaults to \"http://localhost:8000\"
.
max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 6
.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120
.
httpx_client_kwargs: extra kwargs that will be passed to the httpx.AsyncClient
created to comunicate with the vLLM
server. Defaults to None
.
from distilabel.models.llms import ClientvLLM\n\nllm = ClientvLLM(\n base_url=\"http://localhost:8000/v1\",\n tokenizer=\"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n)\n\nllm.load()\n\nresults = llm.generate_outputs(\n inputs=[[{\"role\": \"user\", \"content\": \"Hello, how are you?\"}]],\n temperature=0.7,\n top_p=1.0,\n max_new_tokens=256,\n)\n# [\n# [\n# \"I'm functioning properly, thank you for asking. How can I assist you today?\",\n# \"I'm doing well, thank you for asking. I'm a large language model, so I don't have feelings or emotions like humans do, but I'm here to help answer any questions or provide information you might need. How can I assist you today?\",\n# \"I'm just a computer program, so I don't have feelings like humans do, but I'm functioning properly and ready to help you with any questions or tasks you have. What's on your mind?\"\n# ]\n# ]\n
"},{"location":"components-gallery/llms/coherellm/","title":"CohereLLM","text":"Cohere API implementation using the async client for concurrent text generation.
"},{"location":"components-gallery/llms/coherellm/#attributes","title":"Attributes","text":"model: the name of the model from the Cohere API to use for the generation.
base_url: the base URL to use for the Cohere API requests. Defaults to \"https://api.cohere.ai/v1\"
.
api_key: the API key to authenticate the requests to the Cohere API. Defaults to the value of the COHERE_API_KEY
environment variable.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120
.
client_name: the name of the client to use for the API requests. Defaults to \"distilabel\"
.
structured_output: a dictionary containing the structured output configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
_ChatMessage: the ChatMessage
class from the cohere
package.
_aclient: the AsyncClient
client from the cohere
package.
base_url: the base URL to use for the Cohere API requests. Defaults to \"https://api.cohere.ai/v1\"
.
api_key: the API key to authenticate the requests to the Cohere API. Defaults to the value of the COHERE_API_KEY
environment variable.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120
.
client_name: the name of the client to use for the API requests. Defaults to \"distilabel\"
.
from distilabel.models.llms import CohereLLM\n\nllm = CohereLLM(model=\"CohereForAI/c4ai-command-r-plus\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n\nGenerate structured data:\n
"},{"location":"components-gallery/llms/groqllm/","title":"GroqLLM","text":"Groq API implementation using the async client for concurrent text generation.
"},{"location":"components-gallery/llms/groqllm/#attributes","title":"Attributes","text":"model: the name of the model from the Groq API to use for the generation.
base_url: the base URL to use for the Groq API requests. Defaults to \"https://api.groq.com\"
.
api_key: the API key to authenticate the requests to the Groq API. Defaults to the value of the GROQ_API_KEY
environment variable.
max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 2
.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120
.
structured_output: a dictionary containing the structured output configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
_api_key_env_var: the name of the environment variable to use for the API key.
_aclient: the AsyncGroq
client from the groq
package.
base_url: the base URL to use for the Groq API requests. Defaults to \"https://api.groq.com\"
.
api_key: the API key to authenticate the requests to the Groq API. Defaults to the value of the GROQ_API_KEY
environment variable.
max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 2
.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120
.
from distilabel.models.llms import GroqLLM\n\nllm = GroqLLM(model=\"llama3-70b-8192\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n\nGenerate structured data:\n
"},{"location":"components-gallery/llms/inferenceendpointsllm/","title":"InferenceEndpointsLLM","text":"InferenceEndpoints LLM implementation running the async API client.
This LLM will internally use huggingface_hub.AsyncInferenceClient
.
model_id: the model ID to use for the LLM as available in the Hugging Face Hub, which will be used to resolve the base URL for the serverless Inference Endpoints API requests. Defaults to None
.
endpoint_name: the name of the Inference Endpoint to use for the LLM. Defaults to None
.
endpoint_namespace: the namespace of the Inference Endpoint to use for the LLM. Defaults to None
.
base_url: the base URL to use for the Inference Endpoints API requests.
api_key: the API key to authenticate the requests to the Inference Endpoints API.
tokenizer_id: the tokenizer ID to use for the LLM as available in the Hugging Face Hub. Defaults to None
, but defining one is recommended to properly format the prompt.
model_display_name: the model display name to use for the LLM. Defaults to None
.
use_magpie_template: a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template: the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
structured_output: a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/inferenceendpointsllm/#dedicated-inference-endpoints","title":"Dedicated Inference Endpoints","text":"from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n endpoint_name=\"<ENDPOINT_NAME>\",\n api_key=\"<HF_API_KEY>\",\n endpoint_namespace=\"<USER|ORG>\",\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/inferenceendpointsllm/#dedicated-inference-endpoints-or-tgi","title":"Dedicated Inference Endpoints or TGI","text":"from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n api_key=\"<HF_API_KEY>\",\n base_url=\"<BASE_URL>\",\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/inferenceendpointsllm/#generate-structured-data","title":"Generate structured data","text":"from pydantic import BaseModel\nfrom distilabel.models.llms import InferenceEndpointsLLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n api_key=\"api.key\",\n structured_output={\"format\": \"json\", \"schema\": User.model_json_schema()}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the Tour De France\"}]])\n
"},{"location":"components-gallery/llms/litellm/","title":"LiteLLM","text":"LiteLLM implementation running the async API client.
"},{"location":"components-gallery/llms/litellm/#attributes","title":"Attributes","text":"model: the model name to use for the LLM e.g. \"gpt-3.5-turbo\" or \"mistral/mistral-large\", etc.
verbose: whether to log the LiteLLM client's logs. Defaults to False
.
structured_output: a dictionary containing the structured output configuration configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
False
.from distilabel.models.llms import LiteLLM\n\nllm = LiteLLM(model=\"gpt-3.5-turbo\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n\nGenerate structured data:\n
"},{"location":"components-gallery/llms/mistralllm/","title":"MistralLLM","text":"Mistral LLM implementation running the async API client.
"},{"location":"components-gallery/llms/mistralllm/#attributes","title":"Attributes","text":"model: the model name to use for the LLM e.g. \"mistral-tiny\", \"mistral-large\", etc.
endpoint: the endpoint to use for the Mistral API. Defaults to \"https://api.mistral.ai\".
api_key: the API key to authenticate the requests to the Mistral API. Defaults to None
which means that the value set for the environment variable OPENAI_API_KEY
will be used, or None
if not set.
max_retries: the maximum number of retries to attempt when a request fails. Defaults to 5
.
timeout: the maximum time in seconds to wait for a response. Defaults to 120
.
max_concurrent_requests: the maximum number of concurrent requests to send. Defaults to 64
.
structured_output: a dictionary containing the structured output configuration configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
_api_key_env_var: the name of the environment variable to use for the API key. It is meant to be used internally.
_aclient: the Mistral
to use for the Mistral API. It is meant to be used internally. Set in the load
method.
api_key: the API key to authenticate the requests to the Mistral API.
max_retries: the maximum number of retries to attempt when a request fails. Defaults to 5
.
timeout: the maximum time in seconds to wait for a response. Defaults to 120
.
max_concurrent_requests: the maximum number of concurrent requests to send. Defaults to 64
.
from distilabel.models.llms import MistralLLM\n\nllm = MistralLLM(model=\"open-mixtral-8x22b\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n\nGenerate structured data:\n
"},{"location":"components-gallery/llms/mixtureofagentsllm/","title":"MixtureOfAgentsLLM","text":"Mixture-of-Agents
implementation.
An LLM
class that leverages LLM
s collective strenghts to generate a response, as described in the \"Mixture-of-Agents Enhances Large Language model Capabilities\" paper. There is a list of LLM
s proposing/generating outputs that LLM
s from the next round/layer can use as auxiliary information. Finally, there is an LLM
that aggregates the outputs to generate the final response.
aggregator_llm: The LLM
that aggregates the outputs of the proposer LLM
s.
proposers_llms: The list of LLM
s that propose outputs to be aggregated.
rounds: The number of layers or rounds that the proposers_llms
will generate outputs. Defaults to 1
.
from distilabel.models.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM\n\nllm = MixtureOfAgentsLLM(\n aggregator_llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n proposers_llms=[\n InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n InferenceEndpointsLLM(\n model_id=\"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO\",\n tokenizer_id=\"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO\",\n ),\n InferenceEndpointsLLM(\n model_id=\"HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1\",\n tokenizer_id=\"HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1\",\n ),\n ],\n rounds=2,\n)\n\nllm.load()\n\noutput = llm.generate_outputs(\n inputs=[\n [\n {\n \"role\": \"user\",\n \"content\": \"My favorite witty review of The Rings of Power series is this: Input:\",\n }\n ]\n ]\n)\n
"},{"location":"components-gallery/llms/mixtureofagentsllm/#references","title":"References","text":"Ollama LLM implementation running the Async API client.
"},{"location":"components-gallery/llms/ollamallm/#attributes","title":"Attributes","text":"model: the model name to use for the LLM e.g. \"notus\".
host: the Ollama server host.
timeout: the timeout for the LLM. Defaults to 120
.
follow_redirects: whether to follow redirects. Defaults to True
.
structured_output: a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
tokenizer_id: the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer config files. If not provided, the one associated to the model
will be used. Defaults to None
.
use_magpie_template: a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template: the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
_aclient: the AsyncClient
to use for the Ollama API. It is meant to be used internally. Set in the load
method.
host: the Ollama server host.
timeout: the client timeout for the Ollama API. Defaults to 120
.
from distilabel.models.llms import OllamaLLM\n\nllm = OllamaLLM(model=\"llama3\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/vertexaillm/","title":"VertexAILLM","text":"VertexAI LLM implementation running the async API clients for Gemini.
Gemini API: https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini
To use the VertexAILLM
is necessary to have configured the Google Cloud authentication using one of these methods:
GOOGLE_CLOUD_CREDENTIALS
environment variablegcloud auth application-default login
commandvertexai.init
function from the google-cloud-aiplatform
librarymodel: the model name to use for the LLM e.g. \"gemini-1.0-pro\". Supported models.
_aclient: the GenerativeModel
to use for the Vertex AI Gemini API. It is meant to be used internally. Set in the load
method.
from distilabel.models.llms import VertexAILLM\n\nllm = VertexAILLM(model=\"gemini-1.5-pro\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/transformersllm/","title":"TransformersLLM","text":"Hugging Face transformers
library LLM implementation using the text generation
pipeline.
"},{"location":"components-gallery/llms/transformersllm/#attributes","title":"Attributes","text":"model: the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
revision: if model
refers to a Hugging Face Hub repository, then the revision (e.g. a branch name or a commit id) to use. Defaults to \"main\"
.
torch_dtype: the torch dtype to use for the model e.g. \"float16\", \"float32\", etc. Defaults to \"auto\"
.
trust_remote_code: whether to allow fetching and executing remote code fetched from the repository in the Hub. Defaults to False
.
model_kwargs: additional dictionary of keyword arguments that will be passed to the from_pretrained
method of the model.
tokenizer: the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer config files. If not provided, the one associated to the model
will be used. Defaults to None
.
use_fast: whether to use a fast tokenizer or not. Defaults to True
.
chat_template: a chat template that will be used to build the prompts before sending them to the model. If not provided, the chat template defined in the tokenizer config will be used. If not provided and the tokenizer doesn't have a chat template, then ChatML template will be used. Defaults to None
.
device: the name or index of the device where the model will be loaded. Defaults to None
.
device_map: a dictionary mapping each layer of the model to a device, or a mode like \"sequential\"
or \"auto\"
. Defaults to None
.
token: the Hugging Face Hub token that will be used to authenticate to the Hugging Face Hub. If not provided, the HF_TOKEN
environment or huggingface_hub
package local configuration will be used. Defaults to None
.
structured_output: a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
use_magpie_template: a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template: the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
from distilabel.models.llms import TransformersLLM\n\nllm = TransformersLLM(model=\"microsoft/Phi-3-mini-4k-instruct\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/llamacppllm/","title":"LlamaCppLLM","text":"llama.cpp LLM implementation running the Python bindings for the C++ code.
"},{"location":"components-gallery/llms/llamacppllm/#attributes","title":"Attributes","text":"model_path: contains the path to the GGUF quantized model, compatible with the installed version of the llama.cpp
Python bindings.
n_gpu_layers: the number of layers to use for the GPU. Defaults to -1
, meaning that the available GPU device will be used.
chat_format: the chat format to use for the model. Defaults to None
, which means the Llama format will be used.
n_ctx: the context size to use for the model. Defaults to 512
.
n_batch: the prompt processing maximum batch size to use for the model. Defaults to 512
.
seed: random seed to use for the generation. Defaults to 4294967295
.
verbose: whether to print verbose output. Defaults to False
.
structured_output: a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
extra_kwargs: additional dictionary of keyword arguments that will be passed to the Llama
class of llama_cpp
library. Defaults to {}
.
tokenizer_id: the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer config files. If not provided, the one associated to the model
will be used. Defaults to None
.
use_magpie_template: a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template: the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
_model: the Llama model instance. This attribute is meant to be used internally and should not be accessed directly. It will be set in the load
method.
model_path: the path to the GGUF quantized model.
n_gpu_layers: the number of layers to use for the GPU. Defaults to -1
.
chat_format: the chat format to use for the model. Defaults to None
.
verbose: whether to print verbose output. Defaults to False
.
extra_kwargs: additional dictionary of keyword arguments that will be passed to the Llama
class of llama_cpp
library. Defaults to {}
.
from pathlib import Path\nfrom distilabel.models.llms import LlamaCppLLM\n\n# You can follow along this example downloading the following model running the following\n# command in the terminal, that will download the model to the `Downloads` folder:\n# curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf\n\nmodel_path = \"Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf\"\n\nllm = LlamaCppLLM(\n model_path=str(Path.home() / model_path),\n n_gpu_layers=-1, # To use the GPU if available\n n_ctx=1024, # Set the context size\n)\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/llamacppllm/#generate-structured-data","title":"Generate structured data","text":"from pathlib import Path\nfrom distilabel.models.llms import LlamaCppLLM\n\nmodel_path = \"Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf\"\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = LlamaCppLLM(\n model_path=str(Path.home() / model_path), # type: ignore\n n_gpu_layers=-1,\n n_ctx=1024,\n structured_output={\"format\": \"json\", \"schema\": Character},\n)\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
"},{"location":"components-gallery/llms/llamacppllm/#references","title":"References","text":"llama.cpp
llama-cpp-python
vLLM
library LLM implementation.
model: the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
dtype: the data type to use for the model. Defaults to auto
.
trust_remote_code: whether to trust the remote code when loading the model. Defaults to False
.
quantization: the quantization mode to use for the model. Defaults to None
.
revision: the revision of the model to load. Defaults to None
.
tokenizer: the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer files. If not provided, the tokenizer will be loaded from the model directory. Defaults to None
.
tokenizer_mode: the mode to use for the tokenizer. Defaults to auto
.
tokenizer_revision: the revision of the tokenizer to load. Defaults to None
.
skip_tokenizer_init: whether to skip the initialization of the tokenizer. Defaults to False
.
chat_template: a chat template that will be used to build the prompts before sending them to the model. If not provided, the chat template defined in the tokenizer config will be used. If not provided and the tokenizer doesn't have a chat template, then ChatML template will be used. Defaults to None
.
structured_output: a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
seed: the seed to use for the random number generator. Defaults to 0
.
extra_kwargs: additional dictionary of keyword arguments that will be passed to the LLM
class of vllm
library. Defaults to {}
.
_model: the vLLM
model instance. This attribute is meant to be used internally and should not be accessed directly. It will be set in the load
method.
_tokenizer: the tokenizer instance used to format the prompt before passing it to the LLM
. This attribute is meant to be used internally and should not be accessed directly. It will be set in the load
method.
use_magpie_template: a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template: the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
LLM
class of vllm
library.from distilabel.models.llms import vLLM\n\n# You can pass a custom chat_template to the model\nllm = vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]\",\n)\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/vllm/#generate-structured-data","title":"Generate structured data","text":"from pathlib import Path\nfrom distilabel.models.llms import vLLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\"\n structured_output={\"format\": \"json\", \"schema\": Character},\n)\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
"},{"location":"components-gallery/embeddings/","title":"Embeddings Gallery","text":"SentenceTransformerEmbeddings
sentence-transformers
library implementation for embedding generation.
SentenceTransformerEmbeddings
vLLMEmbeddings
vllm
library implementation for embedding generation.
vLLMEmbeddings
sentence-transformers
library implementation for embedding generation.
model: the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
device: the name of the device used to load the model e.g. \"cuda\", \"mps\", etc. Defaults to None
.
prompts: a dictionary containing prompts to be used with the model. Defaults to None
.
default_prompt_name: the default prompt (in prompts
) that will be applied to the inputs. If not provided, then no prompt will be used. Defaults to None
.
trust_remote_code: whether to allow fetching and executing remote code fetched from the repository in the Hub. Defaults to False
.
revision: if model
refers to a Hugging Face Hub repository, then the revision (e.g. a branch name or a commit id) to use. Defaults to \"main\"
.
token: the Hugging Face Hub token that will be used to authenticate to the Hugging Face Hub. If not provided, the HF_TOKEN
environment or huggingface_hub
package local configuration will be used. Defaults to None
.
truncate_dim: the dimension to truncate the sentence embeddings. Defaults to None
.
model_kwargs: extra kwargs that will be passed to the Hugging Face transformers
model class. Defaults to None
.
tokenizer_kwargs: extra kwargs that will be passed to the Hugging Face transformers
tokenizer class. Defaults to None
.
config_kwargs: extra kwargs that will be passed to the Hugging Face transformers
configuration class. Defaults to None
.
precision: the dtype that will have the resulting embeddings. Defaults to \"float32\"
.
normalize_embeddings: whether to normalize the embeddings so they have a length of 1. Defaults to None
.
from distilabel.models import SentenceTransformerEmbeddings\n\nembeddings = SentenceTransformerEmbeddings(model=\"mixedbread-ai/mxbai-embed-large-v1\")\n\nembeddings.load()\n\nresults = embeddings.encode(inputs=[\"distilabel is awesome!\", \"and Argilla!\"])\n# [\n# [-0.05447685346007347, -0.01623094454407692, ...],\n# [4.4889533455716446e-05, 0.044016145169734955, ...],\n# ]\n
"},{"location":"components-gallery/embeddings/vllmembeddings/","title":"vLLMEmbeddings","text":"vllm
library implementation for embedding generation.
model: the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
dtype: the data type to use for the model. Defaults to auto
.
trust_remote_code: whether to trust the remote code when loading the model. Defaults to False
.
quantization: the quantization mode to use for the model. Defaults to None
.
revision: the revision of the model to load. Defaults to None
.
enforce_eager: whether to enforce eager execution. Defaults to True
.
seed: the seed to use for the random number generator. Defaults to 0
.
extra_kwargs: additional dictionary of keyword arguments that will be passed to the LLM
class of vllm
library. Defaults to {}
.
_model: the vLLM
model instance. This attribute is meant to be used internally and should not be accessed directly. It will be set in the load
method.
from distilabel.models import vLLMEmbeddings\n\nembeddings = vLLMEmbeddings(model=\"intfloat/e5-mistral-7b-instruct\")\n\nembeddings.load()\n\nresults = embeddings.encode(inputs=[\"distilabel is awesome!\", \"and Argilla!\"])\n# [\n# [-0.05447685346007347, -0.01623094454407692, ...],\n# [4.4889533455716446e-05, 0.044016145169734955, ...],\n# ]\n
"},{"location":"components-gallery/embeddings/vllmembeddings/#references","title":"References","text":"Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
This section contains the API reference for the CLI. For more information on how to use the CLI, see Tutorial - CLI.
"},{"location":"api/cli/#utility-functions-for-the-distilabel-pipeline-sub-commands","title":"Utility functions for thedistilabel pipeline
sub-commands","text":"Here are some utility functions to help working with the pipelines in the console.
"},{"location":"api/cli/#distilabel.cli.pipeline.utils","title":"utils
","text":""},{"location":"api/cli/#distilabel.cli.pipeline.utils.parse_runtime_parameters","title":"parse_runtime_parameters(params)
","text":"Parses the runtime parameters from the CLI format to the format expected by the Pipeline.run
method. The CLI format is a list of tuples, where the first element is a list of keys and the second element is the value.
Parameters:
Name Type Description Defaultparams
List[Tuple[List[str], str]]
A list of tuples, where the first element is a list of keys and the second element is the value.
requiredReturns:
Type DescriptionDict[str, Dict[str, Any]]
A dictionary with the runtime parameters in the format expected by the
Dict[str, Dict[str, Any]]
Pipeline.run
method.
src/distilabel/cli/pipeline/utils.py
def parse_runtime_parameters(\n params: List[Tuple[List[str], str]],\n) -> Dict[str, Dict[str, Any]]:\n \"\"\"Parses the runtime parameters from the CLI format to the format expected by the\n `Pipeline.run` method. The CLI format is a list of tuples, where the first element is\n a list of keys and the second element is the value.\n\n Args:\n params: A list of tuples, where the first element is a list of keys and the\n second element is the value.\n\n Returns:\n A dictionary with the runtime parameters in the format expected by the\n `Pipeline.run` method.\n \"\"\"\n runtime_params = {}\n for keys, value in params:\n current = runtime_params\n for i, key in enumerate(keys):\n if i == len(keys) - 1:\n current[key] = value\n else:\n current = current.setdefault(key, {})\n return runtime_params\n
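As a usage sketch of the helper above (the step and parameter names are made up), the nested dictionary mirrors the key path given on the CLI and the values stay as the strings received from the console:
from distilabel.cli.pipeline.utils import parse_runtime_parameters\n\nparams = [([\"load_data\", \"batch_size\"], \"10\"), ([\"text_generation\", \"llm\", \"temperature\"], \"0.7\")]\nparse_runtime_parameters(params)\n# {\"load_data\": {\"batch_size\": \"10\"}, \"text_generation\": {\"llm\": {\"temperature\": \"0.7\"}}}\n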
"},{"location":"api/cli/#distilabel.cli.pipeline.utils.valid_http_url","title":"valid_http_url(url)
","text":"Check if the URL is a valid HTTP URL.
Parameters:
Name Type Description Defaulturl
str
The URL to check.
requiredReturns:
Type Descriptionbool
True
, if the URL is a valid HTTP URL. False
, otherwise.
src/distilabel/cli/pipeline/utils.py
def valid_http_url(url: str) -> bool:\n \"\"\"Check if the URL is a valid HTTP URL.\n\n Args:\n url: The URL to check.\n\n Returns:\n `True`, if the URL is a valid HTTP URL. `False`, otherwise.\n \"\"\"\n try:\n TypeAdapter(HttpUrl).validate_python(url) # type: ignore\n except ValidationError:\n return False\n\n return True\n
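For instance, a quick sketch of the expected behaviour of this helper:
from distilabel.cli.pipeline.utils import valid_http_url\n\nvalid_http_url(\"https://example.com/pipeline.yaml\")  # True\nvalid_http_url(\"pipeline.yaml\")  # False, a local path is not an HTTP URL\n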
"},{"location":"api/cli/#distilabel.cli.pipeline.utils.get_config_from_url","title":"get_config_from_url(url)
","text":"Loads the pipeline configuration from a URL pointing to a JSON or YAML file.
Parameters:
Name Type Description Defaulturl
str
The URL pointing to the pipeline configuration file.
requiredReturns:
Type DescriptionDict[str, Any]
The pipeline configuration as a dictionary.
Raises:
Type DescriptionValueError
If the file format is not supported.
Source code in src/distilabel/cli/pipeline/utils.py
def get_config_from_url(url: str) -> Dict[str, Any]:\n \"\"\"Loads the pipeline configuration from a URL pointing to a JSON or YAML file.\n\n Args:\n url: The URL pointing to the pipeline configuration file.\n\n Returns:\n The pipeline configuration as a dictionary.\n\n Raises:\n ValueError: If the file format is not supported.\n \"\"\"\n if not url.endswith((\".json\", \".yaml\", \".yml\")):\n raise DistilabelUserError(\n f\"Unsupported file format for '{url}'. Only JSON and YAML are supported\",\n page=\"sections/how_to_guides/basic/pipeline/?h=seriali#serializing-the-pipeline\",\n )\n response = _download_remote_file(url)\n\n if url.endswith((\".yaml\", \".yml\")):\n content = response.content.decode(\"utf-8\")\n return yaml.safe_load(content)\n\n return response.json()\n
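A minimal sketch, assuming a reachable (hypothetical) URL that serves a serialized pipeline configuration:
from distilabel.cli.pipeline.utils import get_config_from_url\n\nconfig = get_config_from_url(\"https://example.com/pipeline.yaml\")\n# config is a plain dict that can be fed to Pipeline.from_dict(config)\n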
"},{"location":"api/cli/#distilabel.cli.pipeline.utils.get_pipeline_from_url","title":"get_pipeline_from_url(url, pipeline_name='pipeline')
","text":"Downloads the file to the current working directory and loads the pipeline object from a python script.
Parameters:
Name Type Description Defaulturl
str
The URL pointing to the python script with the pipeline definition.
requiredpipeline_name
str
The name of the pipeline in the script. I.e: with Pipeline(...) as pipeline:...
.
'pipeline'
Returns:
Type DescriptionBasePipeline
The pipeline instantiated.
Raises:
Type DescriptionValueError
If the file format is not supported.
Source code in src/distilabel/cli/pipeline/utils.py
def get_pipeline_from_url(url: str, pipeline_name: str = \"pipeline\") -> \"BasePipeline\":\n \"\"\"Downloads the file to the current working directory and loads the pipeline object\n from a python script.\n\n Args:\n url: The URL pointing to the python script with the pipeline definition.\n pipeline_name: The name of the pipeline in the script.\n I.e: `with Pipeline(...) as pipeline:...`.\n\n Returns:\n The pipeline instantiated.\n\n Raises:\n ValueError: If the file format is not supported.\n \"\"\"\n if not url.endswith(\".py\"):\n raise DistilabelUserError(\n f\"Unsupported file format for '{url}'. It must be a python file.\",\n page=\"sections/how_to_guides/advanced/cli/#distilabel-pipeline-run\",\n )\n response = _download_remote_file(url)\n\n content = response.content.decode(\"utf-8\")\n script_local = Path.cwd() / Path(url).name\n script_local.write_text(content)\n\n # Add the current working directory to sys.path\n sys.path.insert(0, os.getcwd())\n module = importlib.import_module(str(Path(url).stem))\n pipeline = getattr(module, pipeline_name, None)\n if not pipeline:\n raise ImportError(\n f\"The script must contain an object with the pipeline named: '{pipeline_name}' that can be imported\"\n )\n\n return pipeline\n
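A usage sketch (the URL is hypothetical); note that the script is downloaded to the current working directory before being imported:
from distilabel.cli.pipeline.utils import get_pipeline_from_url\n\npipeline = get_pipeline_from_url(\n    \"https://example.com/my_pipeline.py\", pipeline_name=\"pipeline\"\n)\n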
"},{"location":"api/cli/#distilabel.cli.pipeline.utils.get_pipeline","title":"get_pipeline(config_or_script, pipeline_name='pipeline')
","text":"Get a pipeline from a configuration file or a remote python script.
Parameters:
Name Type Description Defaultconfig_or_script
str
The path or URL to the pipeline configuration file or URL to a python script.
requiredpipeline_name
str
The name of the pipeline in the script. I.e: with Pipeline(...) as pipeline:...
.
'pipeline'
Returns:
Type DescriptionBasePipeline
The pipeline.
Raises:
Type DescriptionValueError
If the file format is not supported.
FileNotFoundError
If the configuration file does not exist.
Source code in src/distilabel/cli/pipeline/utils.py
def get_pipeline(\n config_or_script: str, pipeline_name: str = \"pipeline\"\n) -> \"BasePipeline\":\n \"\"\"Get a pipeline from a configuration file or a remote python script.\n\n Args:\n config_or_script: The path or URL to the pipeline configuration file\n or URL to a python script.\n pipeline_name: The name of the pipeline in the script.\n I.e: `with Pipeline(...) as pipeline:...`.\n\n Returns:\n The pipeline.\n\n Raises:\n ValueError: If the file format is not supported.\n FileNotFoundError: If the configuration file does not exist.\n \"\"\"\n config = script = None\n if config_or_script.endswith((\".json\", \".yaml\", \".yml\")):\n config = config_or_script\n elif config_or_script.endswith(\".py\"):\n script = config_or_script\n else:\n raise DistilabelUserError(\n \"The file must be a valid config file or python script with a pipeline.\",\n page=\"sections/how_to_guides/advanced/cli/#distilabel-pipeline-run\",\n )\n\n if valid_http_url(config_or_script):\n if config:\n data = get_config_from_url(config)\n return Pipeline.from_dict(data)\n return get_pipeline_from_url(script, pipeline_name=pipeline_name)\n\n if not config:\n raise ValueError(\n f\"To run a pipeline from a python script, run it as `python {script}`\"\n )\n\n if Path(config).is_file():\n return Pipeline.from_file(config)\n\n raise FileNotFoundError(f\"File '{config_or_script}' does not exist.\")\n
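A usage sketch covering both supported inputs (the paths and URL are hypothetical):
from distilabel.cli.pipeline.utils import get_pipeline\n\npipeline = get_pipeline(\"pipeline.yaml\")  # local JSON/YAML configuration file\npipeline = get_pipeline(\"https://example.com/my_pipeline.py\")  # remote python script\n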
"},{"location":"api/cli/#distilabel.cli.pipeline.utils.display_pipeline_information","title":"display_pipeline_information(pipeline)
","text":"Displays the pipeline information to the console.
Parameters:
Name Type Description Defaultpipeline
BasePipeline
The pipeline.
required Source code in src/distilabel/cli/pipeline/utils.py
def display_pipeline_information(pipeline: \"BasePipeline\") -> None:\n \"\"\"Displays the pipeline information to the console.\n\n Args:\n pipeline: The pipeline.\n \"\"\"\n from rich.console import Console\n\n Console().print(_build_pipeline_panel(pipeline))\n
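For example, combined with get_pipeline (the local configuration file is hypothetical):
from distilabel.cli.pipeline.utils import display_pipeline_information, get_pipeline\n\ndisplay_pipeline_information(get_pipeline(\"pipeline.yaml\"))\n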
"},{"location":"api/distiset/","title":"Distiset","text":"This section contains the API reference for the Distiset. For more information on how to use the CLI, see Tutorial - CLI.
"},{"location":"api/distiset/#distilabel.distiset.Distiset","title":"Distiset
","text":" Bases: dict
Convenient wrapper around datasets.Dataset
to push to the Hugging Face Hub.
It's a dictionary where the keys correspond to the different leaf_steps from the internal DAG
and the values are datasets.Dataset
.
Attributes:
Name Type Description_pipeline_path
Optional[Path]
Optional path to the pipeline.yaml
file that generated the dataset. Defaults to None
.
_artifacts_path
Optional[Path]
Optional path to the directory containing the generated artifacts by the pipeline steps. Defaults to None
.
_log_filename_path
Optional[Path]
Optional path to the pipeline.log
file that was written by the pipeline that generated the dataset. Defaults to None
.
_citations
Optional[List[str]]
Optional list containing citations that will be included in the dataset card. Defaults to None
.
src/distilabel/distiset.py
class Distiset(dict):\n \"\"\"Convenient wrapper around `datasets.Dataset` to push to the Hugging Face Hub.\n\n It's a dictionary where the keys correspond to the different leaf_steps from the internal\n `DAG` and the values are `datasets.Dataset`.\n\n Attributes:\n _pipeline_path: Optional path to the `pipeline.yaml` file that generated the dataset.\n Defaults to `None`.\n _artifacts_path: Optional path to the directory containing the generated artifacts\n by the pipeline steps. Defaults to `None`.\n _log_filename_path: Optional path to the `pipeline.log` file that generated was written\n by the pipeline. Defaults to `None`.\n _citations: Optional list containing citations that will be included in the dataset\n card. Defaults to `None`.\n \"\"\"\n\n _pipeline_path: Optional[Path] = None\n _artifacts_path: Optional[Path] = None\n _log_filename_path: Optional[Path] = None\n _citations: Optional[List[str]] = None\n\n def push_to_hub(\n self,\n repo_id: str,\n private: bool = False,\n token: Optional[str] = None,\n generate_card: bool = True,\n include_script: bool = False,\n **kwargs: Any,\n ) -> None:\n \"\"\"Pushes the `Distiset` to the Hugging Face Hub, each dataset will be pushed as a different configuration\n corresponding to the leaf step that generated it.\n\n Args:\n repo_id:\n The ID of the repository to push to in the following format: `<user>/<dataset_name>` or\n `<org>/<dataset_name>`. Also accepts `<dataset_name>`, which will default to the namespace\n of the logged-in user.\n private:\n Whether the dataset repository should be set to private or not. Only affects repository creation:\n a repository that already exists will not be affected by that parameter.\n token:\n An optional authentication token for the Hugging Face Hub. If no token is passed, will default\n to the token saved locally when logging in with `huggingface-cli login`. Will raise an error\n if no token is passed and the user is not logged-in.\n generate_card:\n Whether to generate a dataset card or not. Defaults to True.\n include_script:\n Whether you want to push the pipeline script to the hugging face hub to share it.\n If set to True, the name of the script that was run to create the distiset will be\n automatically determined, and that will be the name of the file uploaded to your\n repository. Take into account, this operation only makes sense for a distiset obtained\n from calling `Pipeline.run()` method. 
Defaults to False.\n **kwargs:\n Additional keyword arguments to pass to the `push_to_hub` method of the `datasets.Dataset` object.\n\n Raises:\n ValueError: If no token is provided and couldn't be retrieved automatically.\n \"\"\"\n script_filename = sys.argv[0]\n filename_py = (\n script_filename.split(\"/\")[-1]\n if \"/\" in script_filename\n else script_filename\n )\n script_path = Path.cwd() / script_filename\n\n if token is None:\n token = get_hf_token(self.__class__.__name__, \"token\")\n\n for name, dataset in self.items():\n dataset.push_to_hub(\n repo_id=repo_id,\n config_name=name,\n private=private,\n token=token,\n **kwargs,\n )\n\n if self.artifacts_path:\n upload_folder(\n repo_id=repo_id,\n folder_path=self.artifacts_path,\n path_in_repo=\"artifacts\",\n token=token,\n repo_type=\"dataset\",\n commit_message=\"Include pipeline artifacts\",\n )\n\n if include_script and script_path.exists():\n upload_file(\n path_or_fileobj=script_path,\n path_in_repo=filename_py,\n repo_id=repo_id,\n repo_type=\"dataset\",\n token=token,\n commit_message=\"Include pipeline script\",\n )\n\n if generate_card:\n self._generate_card(\n repo_id, token, include_script=include_script, filename_py=filename_py\n )\n\n def _get_card(\n self,\n repo_id: str,\n token: Optional[str] = None,\n include_script: bool = False,\n filename_py: Optional[str] = None,\n ) -> DistilabelDatasetCard:\n \"\"\"Generates the dataset card for the `Distiset`.\n\n Note:\n If `repo_id` and `token` are provided, it will extract the metadata from the README.md file\n on the hub.\n\n Args:\n repo_id: Name of the repository to push to, or the path for the distiset if saved to disk.\n token: The token to authenticate with the Hugging Face Hub.\n We assume that if it's provided, the dataset will be in the Hugging Face Hub,\n so the README metadata will be extracted from there.\n include_script: Whether to upload the script to the hugging face repository.\n filename_py: The name of the script. If `include_script` is True, the script will\n be uploaded to the repository using this name, otherwise it won't be used.\n\n Returns:\n The dataset card for the `Distiset`.\n \"\"\"\n sample_records = {}\n for name, dataset in self.items():\n record = (\n dataset[0] if not isinstance(dataset, dict) else dataset[\"train\"][0]\n )\n for key, value in record.items():\n # If list is too big, the `README.md` generated will be huge so we truncate it\n if isinstance(value, list):\n length = len(value)\n if length < 10:\n continue\n record[key] = value[:10]\n record[key].append(\n f\"... 
(truncated - showing 10 of {length} elements)\"\n )\n sample_records[name] = record\n\n readme_metadata = {}\n if repo_id and token:\n readme_metadata = self._extract_readme_metadata(repo_id, token)\n\n metadata = {\n **readme_metadata,\n \"size_categories\": size_categories_parser(\n max(len(dataset) for dataset in self.values())\n ),\n \"tags\": [\"synthetic\", \"distilabel\", \"rlaif\"],\n }\n\n card = DistilabelDatasetCard.from_template(\n card_data=DatasetCardData(**metadata),\n repo_id=repo_id,\n sample_records=sample_records,\n include_script=include_script,\n filename_py=filename_py,\n artifacts=self._get_artifacts_metadata(),\n references=self.citations,\n )\n\n return card\n\n def _get_artifacts_metadata(self) -> Dict[str, List[Dict[str, Any]]]:\n \"\"\"Gets a dictionary with the metadata of the artifacts generated by the pipeline steps.\n\n Returns:\n A dictionary in which the key is the name of the step and the value is a list\n of dictionaries, each of them containing the name and metadata of the step artifact.\n \"\"\"\n if not self.artifacts_path:\n return {}\n\n def iterdir_ignore_hidden(path: Path) -> Generator[Path, None, None]:\n return (f for f in Path(path).iterdir() if not f.name.startswith(\".\"))\n\n artifacts_metadata = defaultdict(list)\n for step_artifacts_dir in iterdir_ignore_hidden(self.artifacts_path):\n step_name = step_artifacts_dir.stem\n for artifact_dir in iterdir_ignore_hidden(step_artifacts_dir):\n artifact_name = artifact_dir.stem\n metadata_path = artifact_dir / \"metadata.json\"\n metadata = json.loads(metadata_path.read_text())\n artifacts_metadata[step_name].append(\n {\"name\": artifact_name, \"metadata\": metadata}\n )\n\n return dict(artifacts_metadata)\n\n def _extract_readme_metadata(\n self, repo_id: str, token: Optional[str]\n ) -> Dict[str, Any]:\n \"\"\"Extracts the metadata from the README.md file of the dataset repository.\n\n We have to download the previous README.md file in the repo, extract the metadata from it,\n and generate a dict again to be passed thorough the `DatasetCardData` object.\n\n Args:\n repo_id: The ID of the repository to push to, from the `push_to_hub` method.\n token: The token to authenticate with the Hugging Face Hub, from the `push_to_hub` method.\n\n Returns:\n The metadata extracted from the README.md file of the dataset repository as a dict.\n \"\"\"\n readme_path = Path(\n hf_hub_download(repo_id, \"README.md\", repo_type=\"dataset\", token=token)\n )\n # Remove the '---' from the metadata\n metadata = re.findall(r\"---\\n(.*?)\\n---\", readme_path.read_text(), re.DOTALL)[0]\n metadata = yaml.safe_load(metadata)\n return metadata\n\n def _generate_card(\n self,\n repo_id: str,\n token: str,\n include_script: bool = False,\n filename_py: Optional[str] = None,\n ) -> None:\n \"\"\"Generates a dataset card and pushes it to the Hugging Face Hub, and\n if the `pipeline.yaml` path is available in the `Distiset`, uploads that\n to the same repository.\n\n Args:\n repo_id: The ID of the repository to push to, from the `push_to_hub` method.\n token: The token to authenticate with the Hugging Face Hub, from the `push_to_hub` method.\n include_script: Whether to upload the script to the hugging face repository.\n filename_py: The name of the script. 
If `include_script` is True, the script will\n be uploaded to the repository using this name, otherwise it won't be used.\n \"\"\"\n card = self._get_card(\n repo_id=repo_id,\n token=token,\n include_script=include_script,\n filename_py=filename_py,\n )\n\n card.push_to_hub(\n repo_id,\n repo_type=\"dataset\",\n token=token,\n )\n\n if self.pipeline_path:\n # If the pipeline.yaml is available, upload it to the Hugging Face Hub as well.\n HfApi().upload_file(\n path_or_fileobj=self.pipeline_path,\n path_in_repo=PIPELINE_CONFIG_FILENAME,\n repo_id=repo_id,\n repo_type=\"dataset\",\n token=token,\n )\n\n if self.log_filename_path:\n # The same we had with \"pipeline.yaml\" but with the log file.\n HfApi().upload_file(\n path_or_fileobj=self.log_filename_path,\n path_in_repo=PIPELINE_LOG_FILENAME,\n repo_id=repo_id,\n repo_type=\"dataset\",\n token=token,\n )\n\n def train_test_split(\n self,\n train_size: float,\n shuffle: bool = True,\n seed: Optional[int] = None,\n ) -> Self:\n \"\"\"Return a `Distiset` whose values will be a `datasets.DatasetDict` with two random train and test subsets.\n Splits are created from the dataset according to `train_size` and `shuffle`.\n\n Args:\n train_size:\n Float between `0.0` and `1.0` representing the proportion of the dataset to include in the test split.\n It will be applied to all the datasets in the `Distiset`.\n shuffle: Whether or not to shuffle the data before splitting\n seed:\n A seed to initialize the default BitGenerator, passed to the underlying method.\n\n Returns:\n The `Distiset` with the train-test split applied to all the datasets.\n \"\"\"\n assert 0 < train_size < 1, \"train_size must be a float between 0 and 1\"\n for name, dataset in self.items():\n self[name] = dataset.train_test_split(\n train_size=train_size,\n shuffle=shuffle,\n seed=seed,\n )\n return self\n\n def save_to_disk(\n self,\n distiset_path: PathLike,\n max_shard_size: Optional[Union[str, int]] = None,\n num_shards: Optional[int] = None,\n num_proc: Optional[int] = None,\n storage_options: Optional[dict] = None,\n save_card: bool = True,\n save_pipeline_config: bool = True,\n save_pipeline_log: bool = True,\n ) -> None:\n r\"\"\"\n Saves a `Distiset` to a dataset directory, or in a filesystem using any implementation of `fsspec.spec.AbstractFileSystem`.\n\n In case you want to save the `Distiset` in a remote filesystem, you can pass the `storage_options` parameter\n as you would do with `datasets`'s `Dataset.save_to_disk` method: [see example](https://huggingface.co/docs/datasets/filesystems#saving-serialized-datasets)\n\n Args:\n distiset_path: Path where you want to save the `Distiset`. It can be a local path\n (e.g. `dataset/train`) or remote URI (e.g. `s3://my-bucket/dataset/train`)\n max_shard_size: The maximum size of the dataset shards to be uploaded to the hub.\n If expressed as a string, needs to be digits followed by a unit (like `\"50MB\"`).\n Defaults to `None`.\n num_shards: Number of shards to write. By default the number of shards depends on\n `max_shard_size` and `num_proc`. Defaults to `None`.\n num_proc: Number of processes when downloading and generating the dataset locally.\n Multiprocessing is disabled by default. Defaults to `None`.\n storage_options: Key/value pairs to be passed on to the file-system backend, if any.\n Defaults to `None`.\n save_card: Whether to save the dataset card. 
Defaults to `True`.\n save_pipeline_config: Whether to save the pipeline configuration file (aka the `pipeline.yaml` file).\n Defaults to `True`.\n save_pipeline_log: Whether to save the pipeline log file (aka the `pipeline.log` file).\n Defaults to `True`.\n\n Examples:\n ```python\n # Save your distiset in a local folder:\n distiset.save_to_disk(distiset_path=\"my-distiset\")\n # Save your distiset in a remote storage:\n storage_options = {\n \"key\": os.environ[\"S3_ACCESS_KEY\"],\n \"secret\": os.environ[\"S3_SECRET_KEY\"],\n \"client_kwargs\": {\n \"endpoint_url\": os.environ[\"S3_ENDPOINT_URL\"],\n \"region_name\": os.environ[\"S3_REGION\"],\n },\n }\n distiset.save_to_disk(distiset_path=\"my-distiset\", storage_options=storage_options)\n ```\n \"\"\"\n distiset_path = str(distiset_path)\n for name, dataset in self.items():\n dataset.save_to_disk(\n f\"{distiset_path}/{name}\",\n max_shard_size=max_shard_size,\n num_shards=num_shards,\n num_proc=num_proc,\n storage_options=storage_options,\n )\n\n distiset_config_folder = posixpath.join(distiset_path, DISTISET_CONFIG_FOLDER)\n\n fs: fsspec.AbstractFileSystem\n fs, _, _ = fsspec.get_fs_token_paths(\n distiset_config_folder, storage_options=storage_options\n )\n fs.makedirs(distiset_config_folder, exist_ok=True)\n\n if self.artifacts_path:\n distiset_artifacts_folder = posixpath.join(\n distiset_path, DISTISET_ARTIFACTS_FOLDER\n )\n fs.copy(str(self.artifacts_path), distiset_artifacts_folder, recursive=True)\n\n if save_card:\n # NOTE:\u00a0Currently the card is not the same if we write to disk or push to the HF hub,\n # as we aren't generating the README copying/updating the data from the dataset repo.\n card = self._get_card(repo_id=Path(distiset_path).stem, token=None)\n new_filename = posixpath.join(distiset_config_folder, \"README.md\")\n if storage_options:\n # Write the card the same way as DatasetCard.save does:\n with fs.open(new_filename, \"w\", newline=\"\", encoding=\"utf-8\") as f:\n f.write(str(card))\n else:\n card.save(new_filename)\n\n # Write our internal files to the distiset folder by copying them to the distiset folder.\n if save_pipeline_config and self.pipeline_path:\n new_filename = posixpath.join(\n distiset_config_folder, PIPELINE_CONFIG_FILENAME\n )\n if self.pipeline_path.exists() and (not fs.isfile(new_filename)):\n data = yaml.safe_load(self.pipeline_path.read_text())\n with fs.open(new_filename, \"w\", encoding=\"utf-8\") as f:\n yaml.dump(data, f, default_flow_style=False)\n\n if save_pipeline_log and self.log_filename_path:\n new_filename = posixpath.join(distiset_config_folder, PIPELINE_LOG_FILENAME)\n if self.log_filename_path.exists() and (not fs.isfile(new_filename)):\n data = self.log_filename_path.read_text()\n with fs.open(new_filename, \"w\", encoding=\"utf-8\") as f:\n f.write(data)\n\n @classmethod\n def load_from_disk(\n cls,\n distiset_path: PathLike,\n keep_in_memory: Optional[bool] = None,\n storage_options: Optional[Dict[str, Any]] = None,\n download_dir: Optional[PathLike] = None,\n ) -> Self:\n \"\"\"Loads a dataset that was previously saved using `Distiset.save_to_disk` from a dataset\n directory, or from a filesystem using any implementation of `fsspec.spec.AbstractFileSystem`.\n\n Args:\n distiset_path: Path (\"dataset/train\") or remote URI (\"s3://bucket/dataset/train\").\n keep_in_memory: Whether to copy the dataset in-memory, see `datasets.Dataset.load_from_disk``\n for more information. 
Defaults to `None`.\n storage_options: Key/value pairs to be passed on to the file-system backend, if any.\n Defaults to `None`.\n download_dir: Optional directory to download the dataset to. Defaults to None,\n in which case it will create a temporary directory.\n\n Returns:\n A `Distiset` loaded from disk, it should be a `Distiset` object created using `Distiset.save_to_disk`.\n \"\"\"\n original_distiset_path = str(distiset_path)\n\n fs: fsspec.AbstractFileSystem\n fs, _, [distiset_path] = fsspec.get_fs_token_paths( # type: ignore\n original_distiset_path, storage_options=storage_options\n )\n dest_distiset_path = distiset_path\n\n assert fs.isdir(\n original_distiset_path\n ), \"`distiset_path` must be a `PathLike` object pointing to a folder or a URI of a remote filesystem.\"\n\n has_config = False\n has_artifacts = False\n distiset = cls()\n\n if is_remote_filesystem(fs):\n src_dataset_path = distiset_path\n if download_dir:\n dest_distiset_path = download_dir\n else:\n dest_distiset_path = Dataset._build_local_temp_path(src_dataset_path) # type: ignore\n fs.download(src_dataset_path, dest_distiset_path.as_posix(), recursive=True) # type: ignore\n\n # Now we should have the distiset locally, so we can read those files\n for folder in Path(dest_distiset_path).iterdir():\n if folder.stem == DISTISET_CONFIG_FOLDER:\n has_config = True\n continue\n elif folder.stem == DISTISET_ARTIFACTS_FOLDER:\n has_artifacts = True\n continue\n distiset[folder.stem] = load_from_disk(\n str(folder),\n keep_in_memory=keep_in_memory,\n )\n\n # From the config folder we just need to point to the files. Once downloaded we set the path to point to point to the files. Once downloaded we set the path\n # to wherever they are.\n if has_config:\n distiset_config_folder = posixpath.join(\n dest_distiset_path, DISTISET_CONFIG_FOLDER\n )\n\n pipeline_path = posixpath.join(\n distiset_config_folder, PIPELINE_CONFIG_FILENAME\n )\n if Path(pipeline_path).exists():\n distiset.pipeline_path = Path(pipeline_path)\n\n log_filename_path = posixpath.join(\n distiset_config_folder, PIPELINE_LOG_FILENAME\n )\n if Path(log_filename_path).exists():\n distiset.log_filename_path = Path(log_filename_path)\n\n if has_artifacts:\n distiset.artifacts_path = Path(\n posixpath.join(dest_distiset_path, DISTISET_ARTIFACTS_FOLDER)\n )\n\n return distiset\n\n @property\n def pipeline_path(self) -> Union[Path, None]:\n \"\"\"Returns the path to the `pipeline.yaml` file that generated the `Pipeline`.\"\"\"\n return self._pipeline_path\n\n @pipeline_path.setter\n def pipeline_path(self, path: PathLike) -> None:\n self._pipeline_path = Path(path)\n\n @property\n def artifacts_path(self) -> Union[Path, None]:\n \"\"\"Returns the path to the directory containing the artifacts generated by the steps\n of the pipeline.\"\"\"\n return self._artifacts_path\n\n @artifacts_path.setter\n def artifacts_path(self, path: PathLike) -> None:\n self._artifacts_path = Path(path)\n\n @property\n def log_filename_path(self) -> Union[Path, None]:\n \"\"\"Returns the path to the `pipeline.log` file that generated the `Pipeline`.\"\"\"\n return self._log_filename_path\n\n @log_filename_path.setter\n def log_filename_path(self, path: PathLike) -> None:\n self._log_filename_path = Path(path)\n\n @property\n def citations(self) -> Union[List[str], None]:\n \"\"\"Bibtex references to be included in the README.\"\"\"\n return self._citations\n\n @citations.setter\n def citations(self, citations_: List[str]) -> None:\n self._citations = sorted(set(citations_))\n\n def 
__repr__(self):\n # Copy from `datasets.DatasetDict.__repr__`.\n repr = \"\\n\".join([f\"{k}: {v}\" for k, v in self.items()])\n repr = re.sub(r\"^\", \" \" * 4, repr, count=0, flags=re.M)\n return f\"Distiset({{\\n{repr}\\n}})\"\n
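A minimal sketch of the dict-like behaviour described above, using a made-up leaf step name and columns:
from datasets import Dataset\nfrom distilabel.distiset import Distiset\n\ndistiset = Distiset(\n    {\"text_generation_0\": Dataset.from_dict({\"instruction\": [\"Hi\"], \"generation\": [\"Hello!\"]})}\n)\nfor step_name, dataset in distiset.items():\n    print(step_name, dataset.column_names)\n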
"},{"location":"api/distiset/#distilabel.distiset.Distiset.pipeline_path","title":"pipeline_path
property
writable
","text":"Returns the path to the pipeline.yaml
file that generated the Pipeline
.
artifacts_path
property
writable
","text":"Returns the path to the directory containing the artifacts generated by the steps of the pipeline.
"},{"location":"api/distiset/#distilabel.distiset.Distiset.log_filename_path","title":"log_filename_path
property
writable
","text":"Returns the path to the pipeline.log
file that generated the Pipeline
.
citations
property
writable
","text":"Bibtex references to be included in the README.
"},{"location":"api/distiset/#distilabel.distiset.Distiset.push_to_hub","title":"push_to_hub(repo_id, private=False, token=None, generate_card=True, include_script=False, **kwargs)
","text":"Pushes the Distiset
to the Hugging Face Hub, each dataset will be pushed as a different configuration corresponding to the leaf step that generated it.
Parameters:
Name Type Description Defaultrepo_id
str
The ID of the repository to push to in the following format: <user>/<dataset_name>
or <org>/<dataset_name>
. Also accepts <dataset_name>
, which will default to the namespace of the logged-in user.
private
bool
Whether the dataset repository should be set to private or not. Only affects repository creation: a repository that already exists will not be affected by that parameter.
False
token
Optional[str]
An optional authentication token for the Hugging Face Hub. If no token is passed, will default to the token saved locally when logging in with huggingface-cli login
. Will raise an error if no token is passed and the user is not logged-in.
None
generate_card
bool
Whether to generate a dataset card or not. Defaults to True.
True
include_script
bool
Whether you want to push the pipeline script to the Hugging Face Hub to share it. If set to True, the name of the script that was run to create the distiset will be automatically determined, and that will be the name of the file uploaded to your repository. Take into account that this operation only makes sense for a distiset obtained from calling Pipeline.run()
method. Defaults to False.
False
**kwargs
Any
Additional keyword arguments to pass to the push_to_hub
method of the datasets.Dataset
object.
{}
Raises:
Type DescriptionValueError
If no token is provided and couldn't be retrieved automatically.
Source code in src/distilabel/distiset.py
def push_to_hub(\n self,\n repo_id: str,\n private: bool = False,\n token: Optional[str] = None,\n generate_card: bool = True,\n include_script: bool = False,\n **kwargs: Any,\n) -> None:\n \"\"\"Pushes the `Distiset` to the Hugging Face Hub, each dataset will be pushed as a different configuration\n corresponding to the leaf step that generated it.\n\n Args:\n repo_id:\n The ID of the repository to push to in the following format: `<user>/<dataset_name>` or\n `<org>/<dataset_name>`. Also accepts `<dataset_name>`, which will default to the namespace\n of the logged-in user.\n private:\n Whether the dataset repository should be set to private or not. Only affects repository creation:\n a repository that already exists will not be affected by that parameter.\n token:\n An optional authentication token for the Hugging Face Hub. If no token is passed, will default\n to the token saved locally when logging in with `huggingface-cli login`. Will raise an error\n if no token is passed and the user is not logged-in.\n generate_card:\n Whether to generate a dataset card or not. Defaults to True.\n include_script:\n Whether you want to push the pipeline script to the hugging face hub to share it.\n If set to True, the name of the script that was run to create the distiset will be\n automatically determined, and that will be the name of the file uploaded to your\n repository. Take into account, this operation only makes sense for a distiset obtained\n from calling `Pipeline.run()` method. Defaults to False.\n **kwargs:\n Additional keyword arguments to pass to the `push_to_hub` method of the `datasets.Dataset` object.\n\n Raises:\n ValueError: If no token is provided and couldn't be retrieved automatically.\n \"\"\"\n script_filename = sys.argv[0]\n filename_py = (\n script_filename.split(\"/\")[-1]\n if \"/\" in script_filename\n else script_filename\n )\n script_path = Path.cwd() / script_filename\n\n if token is None:\n token = get_hf_token(self.__class__.__name__, \"token\")\n\n for name, dataset in self.items():\n dataset.push_to_hub(\n repo_id=repo_id,\n config_name=name,\n private=private,\n token=token,\n **kwargs,\n )\n\n if self.artifacts_path:\n upload_folder(\n repo_id=repo_id,\n folder_path=self.artifacts_path,\n path_in_repo=\"artifacts\",\n token=token,\n repo_type=\"dataset\",\n commit_message=\"Include pipeline artifacts\",\n )\n\n if include_script and script_path.exists():\n upload_file(\n path_or_fileobj=script_path,\n path_in_repo=filename_py,\n repo_id=repo_id,\n repo_type=\"dataset\",\n token=token,\n commit_message=\"Include pipeline script\",\n )\n\n if generate_card:\n self._generate_card(\n repo_id, token, include_script=include_script, filename_py=filename_py\n )\n
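A hedged usage sketch, assuming distiset is the object returned by Pipeline.run() and that you are logged in with huggingface-cli login (the repository id is hypothetical):
# distiset = pipeline.run(...)\ndistiset.push_to_hub(\n    \"my-org/my-distiset\",  # hypothetical repository id\n    private=True,\n    generate_card=True,\n)\n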
"},{"location":"api/distiset/#distilabel.distiset.Distiset.train_test_split","title":"train_test_split(train_size, shuffle=True, seed=None)
","text":"Return a Distiset
whose values will be a datasets.DatasetDict
with two random train and test subsets. Splits are created from the dataset according to train_size
and shuffle
.
Parameters:
Name Type Description Defaulttrain_size
float
Float between 0.0
and 1.0
representing the proportion of the dataset to include in the train split. It will be applied to all the datasets in the Distiset
.
shuffle
bool
Whether or not to shuffle the data before splitting
True
seed
Optional[int]
A seed to initialize the default BitGenerator, passed to the underlying method.
None
Returns:
Type DescriptionSelf
The Distiset
with the train-test split applied to all the datasets.
src/distilabel/distiset.py
def train_test_split(\n self,\n train_size: float,\n shuffle: bool = True,\n seed: Optional[int] = None,\n) -> Self:\n \"\"\"Return a `Distiset` whose values will be a `datasets.DatasetDict` with two random train and test subsets.\n Splits are created from the dataset according to `train_size` and `shuffle`.\n\n Args:\n train_size:\n Float between `0.0` and `1.0` representing the proportion of the dataset to include in the test split.\n It will be applied to all the datasets in the `Distiset`.\n shuffle: Whether or not to shuffle the data before splitting\n seed:\n A seed to initialize the default BitGenerator, passed to the underlying method.\n\n Returns:\n The `Distiset` with the train-test split applied to all the datasets.\n \"\"\"\n assert 0 < train_size < 1, \"train_size must be a float between 0 and 1\"\n for name, dataset in self.items():\n self[name] = dataset.train_test_split(\n train_size=train_size,\n shuffle=shuffle,\n seed=seed,\n )\n return self\n
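For example, a sketch continuing with a distiset returned by Pipeline.run():
distiset = distiset.train_test_split(train_size=0.8, seed=42)\n# every entry is now a datasets.DatasetDict with \"train\" and \"test\" splits\n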
"},{"location":"api/distiset/#distilabel.distiset.Distiset.save_to_disk","title":"save_to_disk(distiset_path, max_shard_size=None, num_shards=None, num_proc=None, storage_options=None, save_card=True, save_pipeline_config=True, save_pipeline_log=True)
","text":"Saves a Distiset
to a dataset directory, or in a filesystem using any implementation of fsspec.spec.AbstractFileSystem
.
In case you want to save the Distiset
in a remote filesystem, you can pass the storage_options
parameter as you would do with datasets
's Dataset.save_to_disk
method: see example
Parameters:
Name Type Description Defaultdistiset_path
PathLike
Path where you want to save the Distiset
. It can be a local path (e.g. dataset/train
) or remote URI (e.g. s3://my-bucket/dataset/train
)
max_shard_size
Optional[Union[str, int]]
The maximum size of the dataset shards to be uploaded to the hub. If expressed as a string, needs to be digits followed by a unit (like \"50MB\"
). Defaults to None
.
None
num_shards
Optional[int]
Number of shards to write. By default the number of shards depends on max_shard_size
and num_proc
. Defaults to None
.
None
num_proc
Optional[int]
Number of processes when downloading and generating the dataset locally. Multiprocessing is disabled by default. Defaults to None
.
None
storage_options
Optional[dict]
Key/value pairs to be passed on to the file-system backend, if any. Defaults to None
.
None
save_card
bool
Whether to save the dataset card. Defaults to True
.
True
save_pipeline_config
bool
Whether to save the pipeline configuration file (aka the pipeline.yaml
file). Defaults to True
.
True
save_pipeline_log
bool
Whether to save the pipeline log file (aka the pipeline.log
file). Defaults to True
.
True
Examples:
# Save your distiset in a local folder:\ndistiset.save_to_disk(distiset_path=\"my-distiset\")\n# Save your distiset in a remote storage:\nstorage_options = {\n \"key\": os.environ[\"S3_ACCESS_KEY\"],\n \"secret\": os.environ[\"S3_SECRET_KEY\"],\n \"client_kwargs\": {\n \"endpoint_url\": os.environ[\"S3_ENDPOINT_URL\"],\n \"region_name\": os.environ[\"S3_REGION\"],\n },\n}\ndistiset.save_to_disk(distiset_path=\"my-distiset\", storage_options=storage_options)\n
Source code in src/distilabel/distiset.py
def save_to_disk(\n self,\n distiset_path: PathLike,\n max_shard_size: Optional[Union[str, int]] = None,\n num_shards: Optional[int] = None,\n num_proc: Optional[int] = None,\n storage_options: Optional[dict] = None,\n save_card: bool = True,\n save_pipeline_config: bool = True,\n save_pipeline_log: bool = True,\n) -> None:\n r\"\"\"\n Saves a `Distiset` to a dataset directory, or in a filesystem using any implementation of `fsspec.spec.AbstractFileSystem`.\n\n In case you want to save the `Distiset` in a remote filesystem, you can pass the `storage_options` parameter\n as you would do with `datasets`'s `Dataset.save_to_disk` method: [see example](https://huggingface.co/docs/datasets/filesystems#saving-serialized-datasets)\n\n Args:\n distiset_path: Path where you want to save the `Distiset`. It can be a local path\n (e.g. `dataset/train`) or remote URI (e.g. `s3://my-bucket/dataset/train`)\n max_shard_size: The maximum size of the dataset shards to be uploaded to the hub.\n If expressed as a string, needs to be digits followed by a unit (like `\"50MB\"`).\n Defaults to `None`.\n num_shards: Number of shards to write. By default the number of shards depends on\n `max_shard_size` and `num_proc`. Defaults to `None`.\n num_proc: Number of processes when downloading and generating the dataset locally.\n Multiprocessing is disabled by default. Defaults to `None`.\n storage_options: Key/value pairs to be passed on to the file-system backend, if any.\n Defaults to `None`.\n save_card: Whether to save the dataset card. Defaults to `True`.\n save_pipeline_config: Whether to save the pipeline configuration file (aka the `pipeline.yaml` file).\n Defaults to `True`.\n save_pipeline_log: Whether to save the pipeline log file (aka the `pipeline.log` file).\n Defaults to `True`.\n\n Examples:\n ```python\n # Save your distiset in a local folder:\n distiset.save_to_disk(distiset_path=\"my-distiset\")\n # Save your distiset in a remote storage:\n storage_options = {\n \"key\": os.environ[\"S3_ACCESS_KEY\"],\n \"secret\": os.environ[\"S3_SECRET_KEY\"],\n \"client_kwargs\": {\n \"endpoint_url\": os.environ[\"S3_ENDPOINT_URL\"],\n \"region_name\": os.environ[\"S3_REGION\"],\n },\n }\n distiset.save_to_disk(distiset_path=\"my-distiset\", storage_options=storage_options)\n ```\n \"\"\"\n distiset_path = str(distiset_path)\n for name, dataset in self.items():\n dataset.save_to_disk(\n f\"{distiset_path}/{name}\",\n max_shard_size=max_shard_size,\n num_shards=num_shards,\n num_proc=num_proc,\n storage_options=storage_options,\n )\n\n distiset_config_folder = posixpath.join(distiset_path, DISTISET_CONFIG_FOLDER)\n\n fs: fsspec.AbstractFileSystem\n fs, _, _ = fsspec.get_fs_token_paths(\n distiset_config_folder, storage_options=storage_options\n )\n fs.makedirs(distiset_config_folder, exist_ok=True)\n\n if self.artifacts_path:\n distiset_artifacts_folder = posixpath.join(\n distiset_path, DISTISET_ARTIFACTS_FOLDER\n )\n fs.copy(str(self.artifacts_path), distiset_artifacts_folder, recursive=True)\n\n if save_card:\n # NOTE:\u00a0Currently the card is not the same if we write to disk or push to the HF hub,\n # as we aren't generating the README copying/updating the data from the dataset repo.\n card = self._get_card(repo_id=Path(distiset_path).stem, token=None)\n new_filename = posixpath.join(distiset_config_folder, \"README.md\")\n if storage_options:\n # Write the card the same way as DatasetCard.save does:\n with fs.open(new_filename, \"w\", newline=\"\", encoding=\"utf-8\") as f:\n f.write(str(card))\n else:\n 
card.save(new_filename)\n\n # Write our internal files to the distiset folder by copying them to the distiset folder.\n if save_pipeline_config and self.pipeline_path:\n new_filename = posixpath.join(\n distiset_config_folder, PIPELINE_CONFIG_FILENAME\n )\n if self.pipeline_path.exists() and (not fs.isfile(new_filename)):\n data = yaml.safe_load(self.pipeline_path.read_text())\n with fs.open(new_filename, \"w\", encoding=\"utf-8\") as f:\n yaml.dump(data, f, default_flow_style=False)\n\n if save_pipeline_log and self.log_filename_path:\n new_filename = posixpath.join(distiset_config_folder, PIPELINE_LOG_FILENAME)\n if self.log_filename_path.exists() and (not fs.isfile(new_filename)):\n data = self.log_filename_path.read_text()\n with fs.open(new_filename, \"w\", encoding=\"utf-8\") as f:\n f.write(data)\n
"},{"location":"api/distiset/#distilabel.distiset.Distiset.load_from_disk","title":"load_from_disk(distiset_path, keep_in_memory=None, storage_options=None, download_dir=None)
classmethod
","text":"Loads a dataset that was previously saved using Distiset.save_to_disk
from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem
.
Parameters:
Name Type Description Defaultdistiset_path
PathLike
Path (\"dataset/train\") or remote URI (\"s3://bucket/dataset/train\").
requiredkeep_in_memory
Optional[bool]
Whether to copy the dataset in-memory, see datasets.Dataset.load_from_disk
for more information. Defaults to None.
None
storage_options
Optional[Dict[str, Any]]
Key/value pairs to be passed on to the file-system backend, if any. Defaults to None
.
None
download_dir
Optional[PathLike]
Optional directory to download the dataset to. Defaults to None, in which case it will create a temporary directory.
None
Returns:
Type DescriptionSelf
A Distiset
loaded from disk, it should be a Distiset
object created using Distiset.save_to_disk
.
src/distilabel/distiset.py
@classmethod\ndef load_from_disk(\n cls,\n distiset_path: PathLike,\n keep_in_memory: Optional[bool] = None,\n storage_options: Optional[Dict[str, Any]] = None,\n download_dir: Optional[PathLike] = None,\n) -> Self:\n \"\"\"Loads a dataset that was previously saved using `Distiset.save_to_disk` from a dataset\n directory, or from a filesystem using any implementation of `fsspec.spec.AbstractFileSystem`.\n\n Args:\n distiset_path: Path (\"dataset/train\") or remote URI (\"s3://bucket/dataset/train\").\n keep_in_memory: Whether to copy the dataset in-memory, see `datasets.Dataset.load_from_disk``\n for more information. Defaults to `None`.\n storage_options: Key/value pairs to be passed on to the file-system backend, if any.\n Defaults to `None`.\n download_dir: Optional directory to download the dataset to. Defaults to None,\n in which case it will create a temporary directory.\n\n Returns:\n A `Distiset` loaded from disk, it should be a `Distiset` object created using `Distiset.save_to_disk`.\n \"\"\"\n original_distiset_path = str(distiset_path)\n\n fs: fsspec.AbstractFileSystem\n fs, _, [distiset_path] = fsspec.get_fs_token_paths( # type: ignore\n original_distiset_path, storage_options=storage_options\n )\n dest_distiset_path = distiset_path\n\n assert fs.isdir(\n original_distiset_path\n ), \"`distiset_path` must be a `PathLike` object pointing to a folder or a URI of a remote filesystem.\"\n\n has_config = False\n has_artifacts = False\n distiset = cls()\n\n if is_remote_filesystem(fs):\n src_dataset_path = distiset_path\n if download_dir:\n dest_distiset_path = download_dir\n else:\n dest_distiset_path = Dataset._build_local_temp_path(src_dataset_path) # type: ignore\n fs.download(src_dataset_path, dest_distiset_path.as_posix(), recursive=True) # type: ignore\n\n # Now we should have the distiset locally, so we can read those files\n for folder in Path(dest_distiset_path).iterdir():\n if folder.stem == DISTISET_CONFIG_FOLDER:\n has_config = True\n continue\n elif folder.stem == DISTISET_ARTIFACTS_FOLDER:\n has_artifacts = True\n continue\n distiset[folder.stem] = load_from_disk(\n str(folder),\n keep_in_memory=keep_in_memory,\n )\n\n # From the config folder we just need to point to the files. Once downloaded we set the path to point to point to the files. Once downloaded we set the path\n # to wherever they are.\n if has_config:\n distiset_config_folder = posixpath.join(\n dest_distiset_path, DISTISET_CONFIG_FOLDER\n )\n\n pipeline_path = posixpath.join(\n distiset_config_folder, PIPELINE_CONFIG_FILENAME\n )\n if Path(pipeline_path).exists():\n distiset.pipeline_path = Path(pipeline_path)\n\n log_filename_path = posixpath.join(\n distiset_config_folder, PIPELINE_LOG_FILENAME\n )\n if Path(log_filename_path).exists():\n distiset.log_filename_path = Path(log_filename_path)\n\n if has_artifacts:\n distiset.artifacts_path = Path(\n posixpath.join(dest_distiset_path, DISTISET_ARTIFACTS_FOLDER)\n )\n\n return distiset\n
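A usage sketch mirroring the save_to_disk examples above (the local path is hypothetical):
from distilabel.distiset import Distiset\n\ndistiset = Distiset.load_from_disk(\"my-distiset\")\n# for a remote filesystem, pass the same storage_options used when saving\n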
"},{"location":"api/distiset/#distilabel.distiset.create_distiset","title":"create_distiset(data_dir, pipeline_path=None, log_filename_path=None, enable_metadata=False, dag=None)
","text":"Creates a Distiset
from the buffer folder.
This function is intended to be used as a helper to create a Distiset
from the folder where the cached data was written by the _WriteBuffer
.
Parameters:
Name Type Description Defaultdata_dir
Path
Folder where the data buffers were written by the _WriteBuffer
. It should correspond to CacheLocation.data
.
pipeline_path
Optional[Path]
Optional path to the pipeline.yaml file that generated the dataset. Internally this will be passed to the Distiset
object on creation to allow uploading the pipeline.yaml
file to the repo upon Distiset.push_to_hub
.
None
log_filename_path
Optional[Path]
Optional path to the pipeline.log file that was generated during the pipeline run. Internally this will be passed to the Distiset
object on creation to allow uploading the pipeline.log
file to the repo upon Distiset.push_to_hub
.
None
enable_metadata
bool
Whether to include the distilabel metadata column in the dataset or not. Defaults to False
.
False
dag
Optional[DAG]
DAG contained in a Pipeline
. If provided, it will be used to extract the references/citations from it.
None
Returns:
Type DescriptionDistiset
The dataset created from the buffer folder, where the different leaf steps will
Distiset
correspond to different configurations of the dataset.
Examples:
from pathlib import Path\ndistiset = create_distiset(Path.home() / \".cache/distilabel/pipelines/path-to-pipe-hashname\")\n
Source code in src/distilabel/distiset.py
def create_distiset( # noqa: C901\n data_dir: Path,\n pipeline_path: Optional[Path] = None,\n log_filename_path: Optional[Path] = None,\n enable_metadata: bool = False,\n dag: Optional[\"DAG\"] = None,\n) -> Distiset:\n \"\"\"Creates a `Distiset` from the buffer folder.\n\n This function is intended to be used as a helper to create a `Distiset` from from the folder\n where the cached data was written by the `_WriteBuffer`.\n\n Args:\n data_dir: Folder where the data buffers were written by the `_WriteBuffer`.\n It should correspond to `CacheLocation.data`.\n pipeline_path: Optional path to the pipeline.yaml file that generated the dataset.\n Internally this will be passed to the `Distiset` object on creation to allow\n uploading the `pipeline.yaml` file to the repo upon `Distiset.push_to_hub`.\n log_filename_path: Optional path to the pipeline.log file that was generated during the pipeline run.\n Internally this will be passed to the `Distiset` object on creation to allow\n uploading the `pipeline.log` file to the repo upon `Distiset.push_to_hub`.\n enable_metadata: Whether to include the distilabel metadata column in the dataset or not.\n Defaults to `False`.\n dag: DAG contained in a `Pipeline`. If informed, will be used to extract the references/\n citations from it.\n\n Returns:\n The dataset created from the buffer folder, where the different leaf steps will\n correspond to different configurations of the dataset.\n\n Examples:\n ```python\n from pathlib import Path\n distiset = create_distiset(Path.home() / \".cache/distilabel/pipelines/path-to-pipe-hashname\")\n ```\n \"\"\"\n from distilabel.constants import DISTILABEL_METADATA_KEY\n\n logger = logging.getLogger(\"distilabel.distiset\")\n\n steps_outputs_dir = data_dir / STEPS_OUTPUTS_PATH\n\n distiset = Distiset()\n for file in steps_outputs_dir.iterdir():\n if file.is_file():\n continue\n\n files = [str(file) for file in list_files_in_dir(file)]\n if files:\n try:\n ds = load_dataset(\n \"parquet\", name=file.stem, data_files={\"train\": files}\n )\n if not enable_metadata and DISTILABEL_METADATA_KEY in ds.column_names:\n ds = ds.remove_columns(DISTILABEL_METADATA_KEY)\n distiset[file.stem] = ds\n except ArrowInvalid:\n logger.warning(f\"\u274c Failed to load the subset from '{file}' directory.\")\n continue\n else:\n logger.warning(\n f\"No output files for step '{file.stem}', can't create a dataset.\"\n \" Did the step produce any data?\"\n )\n\n # If there's only one dataset i.e. 
one config, then set the config name to `default`\n if len(distiset.keys()) == 1:\n distiset[\"default\"] = distiset.pop(list(distiset.keys())[0])\n\n # If there's any artifact set the `artifacts_path` so they can be uploaded\n steps_artifacts_dir = data_dir / STEPS_ARTIFACTS_PATH\n if any(steps_artifacts_dir.rglob(\"*\")):\n distiset.artifacts_path = steps_artifacts_dir\n\n # Include `pipeline.yaml` if exists\n if pipeline_path:\n distiset.pipeline_path = pipeline_path\n else:\n # If the pipeline path is not provided, try to find it in the parent directory\n # and assume that's the wanted file.\n pipeline_path = steps_outputs_dir.parent / \"pipeline.yaml\"\n if pipeline_path.exists():\n distiset.pipeline_path = pipeline_path\n\n # Include `pipeline.log` if exists\n if log_filename_path:\n distiset.log_filename_path = log_filename_path\n else:\n log_filename_path = steps_outputs_dir.parent / \"pipeline.log\"\n if log_filename_path.exists():\n distiset.log_filename_path = log_filename_path\n\n if dag:\n distiset._citations = _grab_citations(dag)\n\n return distiset\n
"},{"location":"api/errors/","title":"Errors","text":"This section contains the distilabel
custom errors. Unlike exceptions, errors in distilabel
are used to handle unexpected situations that can't be anticipated and that can't be handled in a controlled way.
DistilabelError
","text":"A mixin class for common functionality shared by all Distilabel-specific errors.
Attributes:
Name Type Descriptionmessage
A message describing the error.
page
An optional page in the distilabel documentation to point the user to for further information.
Examples:
raise DistilabelUserError(\"This is an error message.\")\nThis is an error message.\n\nraise DistilabelUserError(\"This is an error message.\", page=\"sections/getting_started/faq/\")\nThis is an error message.\nFor further information visit 'https://distilabel.argilla.io/latest/sections/getting_started/faq/'\n
Source code in src/distilabel/errors.py
class DistilabelError:\n \"\"\"A mixin class for common functionality shared by all Distilabel-specific errors.\n\n Attributes:\n message: A message describing the error.\n page: An optional error code from PydanticErrorCodes enum.\n\n Examples:\n ```python\n raise DistilabelUserError(\"This is an error message.\")\n This is an error message.\n\n raise DistilabelUserError(\"This is an error message.\", page=\"sections/getting_started/faq/\")\n This is an error message.\n For further information visit 'https://distilabel.argilla.io/latest/sections/getting_started/faq/'\n ```\n \"\"\"\n\n def __init__(self, message: str, *, page: Optional[str] = None) -> None:\n self.message = message\n self.page = page\n\n def __str__(self) -> str:\n if self.page is None:\n return self.message\n else:\n return f\"{self.message}\\n\\nFor further information visit '{DISTILABEL_DOCS_URL}{self.page}'\"\n
"},{"location":"api/errors/#distilabel.errors.DistilabelUserError","title":"DistilabelUserError
","text":" Bases: DistilabelError
, ValueError
ValueError that we can redirect to a given page in the documentation.
Source code insrc/distilabel/errors.py
class DistilabelUserError(DistilabelError, ValueError):\n \"\"\"ValueError that we can redirect to a given page in the documentation.\"\"\"\n\n pass\n
"},{"location":"api/errors/#distilabel.errors.DistilabelTypeError","title":"DistilabelTypeError
","text":" Bases: DistilabelError
, TypeError
TypeError that we can redirect to a given page in the documentation.
Source code insrc/distilabel/errors.py
class DistilabelTypeError(DistilabelError, TypeError):\n \"\"\"TypeError that we can redirect to a given page in the documentation.\"\"\"\n\n pass\n
"},{"location":"api/errors/#distilabel.errors.DistilabelNotImplementedError","title":"DistilabelNotImplementedError
","text":" Bases: DistilabelError
, NotImplementedError
NotImplementedError that we can redirect to a given page in the documentation.
Source code in src/distilabel/errors.py
class DistilabelNotImplementedError(DistilabelError, NotImplementedError):\n \"\"\"NotImplementedError that we can redirect to a given page in the documentation.\"\"\"\n\n pass\n
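As a brief, hedged illustration of how these error classes are meant to be raised (the MyLLM name below is a placeholder, and the page value is the one used by the LLM.offline_batch_generate source shown later in this reference), passing page makes the rendered message link back to the documentation:
from distilabel.errors import DistilabelNotImplementedError\n\n# Raising with `page` appends a link to the distilabel docs to the error message.\nraise DistilabelNotImplementedError(\n    \"`offline_batch_generate` is not implemented for `MyLLM`\",\n    page=\"sections/how_to_guides/advanced/offline-batch-generation/\",\n)\n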
"},{"location":"api/exceptions/","title":"Exceptions","text":"This section contains the distilabel
custom exceptions. Unlike errors, exceptions in distilabel
are used to handle specific situations that can be anticipated and that can be handled in a controlled way internally by the library.
DistilabelException
","text":" Bases: Exception
Base exception (can be gracefully handled) for distilabel
framework.
src/distilabel/exceptions.py
class DistilabelException(Exception):\n \"\"\"Base exception (can be gracefully handled) for `distilabel` framework.\"\"\"\n\n pass\n
"},{"location":"api/exceptions/#distilabel.exceptions.DistilabelGenerationException","title":"DistilabelGenerationException
","text":" Bases: DistilabelException
Base exception for LLM
generation errors.
src/distilabel/exceptions.py
class DistilabelGenerationException(DistilabelException):\n \"\"\"Base exception for `LLM` generation errors.\"\"\"\n\n pass\n
"},{"location":"api/exceptions/#distilabel.exceptions.DistilabelOfflineBatchGenerationNotFinishedException","title":"DistilabelOfflineBatchGenerationNotFinishedException
","text":" Bases: DistilabelGenerationException
Exception raised when a batch generation is not finished.
Source code in src/distilabel/exceptions.py
class DistilabelOfflineBatchGenerationNotFinishedException(\n DistilabelGenerationException\n):\n \"\"\"Exception raised when a batch generation is not finished.\"\"\"\n\n jobs_ids: Tuple[str, ...]\n\n def __init__(self, jobs_ids: Tuple[str, ...]) -> None:\n self.jobs_ids = jobs_ids\n super().__init__(f\"Batch generation with jobs_ids={jobs_ids} is not finished\")\n
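A self-contained sketch of how this exception carries the pending job ids so they can be recovered later; the job id strings below are made up for illustration:
from distilabel.exceptions import DistilabelOfflineBatchGenerationNotFinishedException\n\ntry:\n    # Simulate an `LLM` reporting that its offline batch jobs are still running.\n    raise DistilabelOfflineBatchGenerationNotFinishedException(jobs_ids=(\"batch_1\", \"batch_2\"))\nexcept DistilabelOfflineBatchGenerationNotFinishedException as e:\n    # The job ids are stored on the exception so the results can be retrieved in a later call.\n    print(e.jobs_ids)  # ('batch_1', 'batch_2')\n    print(e)  # Batch generation with jobs_ids=('batch_1', 'batch_2') is not finished\n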
"},{"location":"api/mixins/requirements/","title":"RequirementsMixin","text":""},{"location":"api/mixins/requirements/#distilabel.mixins.requirements.RequirementsMixin","title":"RequirementsMixin
","text":"Mixin for classes that have requirements
attribute.
Used to add requirements to a Step
and a Pipeline
.
src/distilabel/mixins/requirements.py
class RequirementsMixin:\n \"\"\"Mixin for classes that have `requirements` attribute.\n\n Used to add requirements to a `Step` and a `Pipeline`.\n \"\"\"\n\n _requirements: Union[List[Requirement], None] = []\n\n def _gather_requirements(self) -> List[str]:\n \"\"\"This method will be overwritten in the `BasePipeline` class to gather the requirements\n from each step.\n \"\"\"\n return []\n\n @property\n def requirements(self) -> List[str]:\n \"\"\"Return a list of requirements that must be installed to run the `Pipeline`.\n\n The requirements in a Pipeline will include the requirements from all the steps (if any).\n\n Returns:\n List of requirements that must be installed to run the `Pipeline`, sorted alphabetically.\n \"\"\"\n self.requirements = self._gather_requirements()\n\n return [str(r) for r in self._requirements]\n\n @requirements.setter\n def requirements(self, _requirements: List[str]) -> None:\n requirements = []\n if not isinstance(_requirements, list):\n _requirements = [_requirements]\n\n for r in _requirements:\n try:\n requirements.append(Requirement(r))\n except InvalidRequirement:\n self._logger.warning(f\"Invalid requirement: `{r}`\")\n\n self._requirements = sorted(\n set(self._requirements).union(set(requirements)), key=lambda x: str(x)\n )\n\n def requirements_to_install(self) -> List[str]:\n \"\"\"Check if the requirements are installed in the current environment, and returns the ones that aren't.\n\n Returns:\n List of requirements required to run the pipeline that are not installed in the current environment.\n \"\"\"\n\n to_install = []\n for req in self.requirements:\n requirement = Requirement(req)\n if importlib.util.find_spec(requirement.name):\n if (str(requirement.specifier) != \"\") and (\n version(requirement.name) != str(requirement.specifier)\n ):\n to_install.append(req)\n else:\n to_install.append(req)\n return to_install\n
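A minimal sketch of the mixin used in isolation; the subclass name and the requirement strings below are illustrative, since in practice Step and Pipeline already inherit this mixin:
from distilabel.mixins.requirements import RequirementsMixin\n\nclass MyRequirements(RequirementsMixin):\n    # Illustrative subclass just to exercise the mixin.\n    pass\n\nobj = MyRequirements()\nobj.requirements = [\"numpy>=1.26\", \"pandas\"]\nprint(obj.requirements)  # the declared requirements, sorted alphabetically\nprint(obj.requirements_to_install())  # the subset not installed in the current environment\n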
"},{"location":"api/mixins/requirements/#distilabel.mixins.requirements.RequirementsMixin.requirements","title":"requirements
property
writable
","text":"Return a list of requirements that must be installed to run the Pipeline
.
The requirements in a Pipeline will include the requirements from all the steps (if any).
Returns:
Type DescriptionList[str]
List of requirements that must be installed to run the Pipeline
, sorted alphabetically.
requirements_to_install()
","text":"Check if the requirements are installed in the current environment, and returns the ones that aren't.
Returns:
Type DescriptionList[str]
List of requirements required to run the pipeline that are not installed in the current environment.
Source code in src/distilabel/mixins/requirements.py
def requirements_to_install(self) -> List[str]:\n \"\"\"Check if the requirements are installed in the current environment, and returns the ones that aren't.\n\n Returns:\n List of requirements required to run the pipeline that are not installed in the current environment.\n \"\"\"\n\n to_install = []\n for req in self.requirements:\n requirement = Requirement(req)\n if importlib.util.find_spec(requirement.name):\n if (str(requirement.specifier) != \"\") and (\n version(requirement.name) != str(requirement.specifier)\n ):\n to_install.append(req)\n else:\n to_install.append(req)\n return to_install\n
"},{"location":"api/mixins/runtime_parameters/","title":"RuntimeParametersMixin","text":""},{"location":"api/mixins/runtime_parameters/#distilabel.mixins.runtime_parameters.RuntimeParametersMixin","title":"RuntimeParametersMixin
","text":" Bases: BaseModel
Mixin for classes that have RuntimeParameter
s attributes.
Attributes:
Name Type Description_runtime_parameters
Dict[str, Any]
A dictionary containing the values of the runtime parameters of the class. This attribute is meant to be used internally and should not be accessed directly.
Source code insrc/distilabel/mixins/runtime_parameters.py
class RuntimeParametersMixin(BaseModel):\n \"\"\"Mixin for classes that have `RuntimeParameter`s attributes.\n\n Attributes:\n _runtime_parameters: A dictionary containing the values of the runtime parameters\n of the class. This attribute is meant to be used internally and should not be\n accessed directly.\n \"\"\"\n\n _runtime_parameters: Dict[str, Any] = PrivateAttr(default_factory=dict)\n\n @property\n def runtime_parameters_names(self) -> \"RuntimeParametersNames\":\n \"\"\"Returns a dictionary containing the name of the runtime parameters of the class\n as keys and whether the parameter is required or not as values.\n\n Returns:\n A dictionary containing the name of the runtime parameters of the class as keys\n and whether the parameter is required or not as values.\n \"\"\"\n\n runtime_parameters = {}\n\n for name, field_info in self.model_fields.items(): # type: ignore\n # `field: RuntimeParameter[Any]` or `field: Optional[RuntimeParameter[Any]]`\n is_runtime_param, is_optional = _is_runtime_parameter(field_info)\n if is_runtime_param:\n runtime_parameters[name] = is_optional\n continue\n\n attr = getattr(self, name)\n\n # `field: RuntimeParametersMixin`\n if isinstance(attr, RuntimeParametersMixin):\n runtime_parameters[name] = attr.runtime_parameters_names\n\n # `field: List[RuntimeParametersMixin]`\n if (\n isinstance(attr, list)\n and attr\n and isinstance(attr[0], RuntimeParametersMixin)\n ):\n runtime_parameters[name] = {\n str(i): item.runtime_parameters_names for i, item in enumerate(attr)\n }\n\n return runtime_parameters\n\n def get_runtime_parameters_info(self) -> List[\"RuntimeParameterInfo\"]:\n \"\"\"Gets the information of the runtime parameters of the class such as the name and\n the description. This function is meant to include the information of the runtime\n parameters in the serialized data of the class.\n\n Returns:\n A list containing the information for each runtime parameter of the class.\n \"\"\"\n runtime_parameters_info = []\n for name, field_info in self.model_fields.items(): # type: ignore\n if name not in self.runtime_parameters_names:\n continue\n\n attr = getattr(self, name)\n\n # Get runtime parameters info for `RuntimeParametersMixin` field\n if isinstance(attr, RuntimeParametersMixin):\n runtime_parameters_info.append(\n {\n \"name\": name,\n \"runtime_parameters_info\": attr.get_runtime_parameters_info(),\n }\n )\n continue\n\n # Get runtime parameters info for `List[RuntimeParametersMixin]` field\n if isinstance(attr, list) and isinstance(attr[0], RuntimeParametersMixin):\n runtime_parameters_info.append(\n {\n \"name\": name,\n \"runtime_parameters_info\": {\n str(i): item.get_runtime_parameters_info()\n for i, item in enumerate(attr)\n },\n }\n )\n continue\n\n info = {\"name\": name, \"optional\": self.runtime_parameters_names[name]}\n if field_info.description is not None:\n info[\"description\"] = field_info.description\n runtime_parameters_info.append(info)\n return runtime_parameters_info\n\n def set_runtime_parameters(self, runtime_parameters: Dict[str, Any]) -> None:\n \"\"\"Sets the runtime parameters of the class using the provided values. 
If the attr\n to be set is a `RuntimeParametersMixin`, it will call `set_runtime_parameters` on\n the attr.\n\n Args:\n runtime_parameters: A dictionary containing the values of the runtime parameters\n to set.\n \"\"\"\n runtime_parameters_names = list(self.runtime_parameters_names.keys())\n for name, value in runtime_parameters.items():\n if name not in self.runtime_parameters_names:\n # Check done just to ensure the unit tests for the mixin run\n if getattr(self, \"pipeline\", None):\n closest = difflib.get_close_matches(\n name, runtime_parameters_names, cutoff=0.5\n )\n msg = (\n f\"\u26a0\ufe0f Runtime parameter '{name}' unknown in step '{self.name}'.\" # type: ignore\n )\n if closest:\n msg += f\" Did you mean any of: {closest}\"\n else:\n msg += f\" Available runtime parameters for the step: {runtime_parameters_names}.\"\n self.pipeline._logger.warning(msg) # type: ignore\n continue\n\n attr = getattr(self, name)\n\n # Set runtime parameters for `RuntimeParametersMixin` field\n if isinstance(attr, RuntimeParametersMixin):\n attr.set_runtime_parameters(value)\n self._runtime_parameters[name] = value\n continue\n\n # Set runtime parameters for `List[RuntimeParametersMixin]` field\n if isinstance(attr, list) and isinstance(attr[0], RuntimeParametersMixin):\n for i, item in enumerate(attr):\n item_value = value.get(str(i), {})\n item.set_runtime_parameters(item_value)\n self._runtime_parameters[name] = value\n continue\n\n # Handle settings values for `_SecretField`\n field_info = self.model_fields[name]\n inner_type = extract_annotation_inner_type(field_info.annotation)\n if is_type_pydantic_secret_field(inner_type):\n value = inner_type(value)\n\n # Set the value of the runtime parameter\n setattr(self, name, value)\n self._runtime_parameters[name] = value\n
"},{"location":"api/mixins/runtime_parameters/#distilabel.mixins.runtime_parameters.RuntimeParametersMixin.runtime_parameters_names","title":"runtime_parameters_names
property
","text":"Returns a dictionary containing the name of the runtime parameters of the class as keys and whether the parameter is required or not as values.
Returns:
Type DescriptionRuntimeParametersNames
A dictionary containing the name of the runtime parameters of the class as keys
RuntimeParametersNames
and whether the parameter is required or not as values.
"},{"location":"api/mixins/runtime_parameters/#distilabel.mixins.runtime_parameters.RuntimeParametersMixin.get_runtime_parameters_info","title":"get_runtime_parameters_info()
","text":"Gets the information of the runtime parameters of the class such as the name and the description. This function is meant to include the information of the runtime parameters in the serialized data of the class.
Returns:
Type DescriptionList[RuntimeParameterInfo]
A list containing the information for each runtime parameter of the class.
Source code insrc/distilabel/mixins/runtime_parameters.py
def get_runtime_parameters_info(self) -> List[\"RuntimeParameterInfo\"]:\n \"\"\"Gets the information of the runtime parameters of the class such as the name and\n the description. This function is meant to include the information of the runtime\n parameters in the serialized data of the class.\n\n Returns:\n A list containing the information for each runtime parameter of the class.\n \"\"\"\n runtime_parameters_info = []\n for name, field_info in self.model_fields.items(): # type: ignore\n if name not in self.runtime_parameters_names:\n continue\n\n attr = getattr(self, name)\n\n # Get runtime parameters info for `RuntimeParametersMixin` field\n if isinstance(attr, RuntimeParametersMixin):\n runtime_parameters_info.append(\n {\n \"name\": name,\n \"runtime_parameters_info\": attr.get_runtime_parameters_info(),\n }\n )\n continue\n\n # Get runtime parameters info for `List[RuntimeParametersMixin]` field\n if isinstance(attr, list) and isinstance(attr[0], RuntimeParametersMixin):\n runtime_parameters_info.append(\n {\n \"name\": name,\n \"runtime_parameters_info\": {\n str(i): item.get_runtime_parameters_info()\n for i, item in enumerate(attr)\n },\n }\n )\n continue\n\n info = {\"name\": name, \"optional\": self.runtime_parameters_names[name]}\n if field_info.description is not None:\n info[\"description\"] = field_info.description\n runtime_parameters_info.append(info)\n return runtime_parameters_info\n
"},{"location":"api/mixins/runtime_parameters/#distilabel.mixins.runtime_parameters.RuntimeParametersMixin.set_runtime_parameters","title":"set_runtime_parameters(runtime_parameters)
","text":"Sets the runtime parameters of the class using the provided values. If the attr to be set is a RuntimeParametersMixin
, it will call set_runtime_parameters
on the attr.
Parameters:
Name Type Description Defaultruntime_parameters
Dict[str, Any]
A dictionary containing the values of the runtime parameters to set.
required Source code insrc/distilabel/mixins/runtime_parameters.py
def set_runtime_parameters(self, runtime_parameters: Dict[str, Any]) -> None:\n \"\"\"Sets the runtime parameters of the class using the provided values. If the attr\n to be set is a `RuntimeParametersMixin`, it will call `set_runtime_parameters` on\n the attr.\n\n Args:\n runtime_parameters: A dictionary containing the values of the runtime parameters\n to set.\n \"\"\"\n runtime_parameters_names = list(self.runtime_parameters_names.keys())\n for name, value in runtime_parameters.items():\n if name not in self.runtime_parameters_names:\n # Check done just to ensure the unit tests for the mixin run\n if getattr(self, \"pipeline\", None):\n closest = difflib.get_close_matches(\n name, runtime_parameters_names, cutoff=0.5\n )\n msg = (\n f\"\u26a0\ufe0f Runtime parameter '{name}' unknown in step '{self.name}'.\" # type: ignore\n )\n if closest:\n msg += f\" Did you mean any of: {closest}\"\n else:\n msg += f\" Available runtime parameters for the step: {runtime_parameters_names}.\"\n self.pipeline._logger.warning(msg) # type: ignore\n continue\n\n attr = getattr(self, name)\n\n # Set runtime parameters for `RuntimeParametersMixin` field\n if isinstance(attr, RuntimeParametersMixin):\n attr.set_runtime_parameters(value)\n self._runtime_parameters[name] = value\n continue\n\n # Set runtime parameters for `List[RuntimeParametersMixin]` field\n if isinstance(attr, list) and isinstance(attr[0], RuntimeParametersMixin):\n for i, item in enumerate(attr):\n item_value = value.get(str(i), {})\n item.set_runtime_parameters(item_value)\n self._runtime_parameters[name] = value\n continue\n\n # Handle settings values for `_SecretField`\n field_info = self.model_fields[name]\n inner_type = extract_annotation_inner_type(field_info.annotation)\n if is_type_pydantic_secret_field(inner_type):\n value = inner_type(value)\n\n # Set the value of the runtime parameter\n setattr(self, name, value)\n self._runtime_parameters[name] = value\n
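A hedged, minimal sketch of the mixin on a standalone pydantic model; MyGenerationSettings and its fields are invented for illustration:
from typing import Optional\n\nfrom distilabel.mixins.runtime_parameters import RuntimeParameter, RuntimeParametersMixin\n\nclass MyGenerationSettings(RuntimeParametersMixin):\n    # Fields type-hinted with `RuntimeParameter` are exposed as runtime parameters.\n    temperature: Optional[RuntimeParameter[float]] = 0.7\n    api_base: Optional[RuntimeParameter[str]] = None\n\nsettings = MyGenerationSettings()\nprint(settings.runtime_parameters_names)  # parameter names mapped to whether they are optional\nsettings.set_runtime_parameters({\"temperature\": 0.2})\nprint(settings.temperature)  # 0.2\n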
"},{"location":"api/models/embedding/","title":"Embedding","text":"This section contains the API reference for the distilabel
embeddings.
For more information on how the Embeddings models work, together with some usage examples, check the pages in this section.
base
","text":""},{"location":"api/models/embedding/#distilabel.models.embeddings.base.Embeddings","title":"Embeddings
","text":" Bases: RuntimeParametersMixin
, BaseModel
, _Serializable
, ABC
Base class for Embeddings
models.
To implement an Embeddings
subclass, you need to subclass this class and implement: - load
method to load the Embeddings
model. Don't forget to call super().load()
, so the _logger
attribute is initialized. - model_name
property to return the model name used for the Embeddings
. - encode
method to generate the sentence embeddings.
Attributes:
Name Type Description_logger
Logger
the logger to be used for the Embeddings
model. It will be initialized when the load
method is called.
src/distilabel/models/embeddings/base.py
class Embeddings(RuntimeParametersMixin, BaseModel, _Serializable, ABC):\n \"\"\"Base class for `Embeddings` models.\n\n To implement an `Embeddings` subclass, you need to subclass this class and implement:\n - `load` method to load the `Embeddings` model. Don't forget to call `super().load()`,\n so the `_logger` attribute is initialized.\n - `model_name` property to return the model name used for the `Embeddings`.\n - `encode` method to generate the sentence embeddings.\n\n Attributes:\n _logger: the logger to be used for the `Embeddings` model. It will be initialized\n when the `load` method is called.\n \"\"\"\n\n model_config = ConfigDict(\n arbitrary_types_allowed=True,\n protected_namespaces=(),\n validate_default=True,\n validate_assignment=True,\n extra=\"forbid\",\n )\n _logger: \"Logger\" = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Method to be called to initialize the `Embeddings`\"\"\"\n self._logger = logging.getLogger(f\"distilabel.llm.{self.model_name}\")\n\n def unload(self) -> None:\n \"\"\"Method to be called to unload the `Embeddings` and release any resources.\"\"\"\n pass\n\n @property\n @abstractmethod\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the `Embeddings`.\"\"\"\n pass\n\n @abstractmethod\n def encode(self, inputs: List[str]) -> List[List[Union[int, float]]]:\n \"\"\"Generates embeddings for the provided inputs.\n\n Args:\n inputs: a list of texts for which an embedding has to be generated.\n\n Returns:\n The generated embeddings.\n \"\"\"\n pass\n
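A hedged sketch of the minimal surface a subclass has to implement; FakeEmbeddings and its constant vectors are illustrative only:
from typing import List, Union\n\nfrom distilabel.models.embeddings.base import Embeddings\n\nclass FakeEmbeddings(Embeddings):\n    \"\"\"Toy implementation that returns a constant vector per input.\"\"\"\n\n    @property\n    def model_name(self) -> str:\n        return \"fake-embeddings\"\n\n    def encode(self, inputs: List[str]) -> List[List[Union[int, float]]]:\n        return [[0.0, 0.0, 0.0] for _ in inputs]\n\nembeddings = FakeEmbeddings()\nembeddings.load()  # initializes the `_logger` attribute, as required by the base class\nprint(embeddings.encode(inputs=[\"distilabel\", \"argilla\"]))  # [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]\n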
"},{"location":"api/models/embedding/#distilabel.models.embeddings.base.Embeddings.model_name","title":"model_name
abstractmethod
property
","text":"Returns the model name used for the Embeddings
.
load()
","text":"Method to be called to initialize the Embeddings
src/distilabel/models/embeddings/base.py
def load(self) -> None:\n \"\"\"Method to be called to initialize the `Embeddings`\"\"\"\n self._logger = logging.getLogger(f\"distilabel.llm.{self.model_name}\")\n
"},{"location":"api/models/embedding/#distilabel.models.embeddings.base.Embeddings.unload","title":"unload()
","text":"Method to be called to unload the Embeddings
and release any resources.
src/distilabel/models/embeddings/base.py
def unload(self) -> None:\n \"\"\"Method to be called to unload the `Embeddings` and release any resources.\"\"\"\n pass\n
"},{"location":"api/models/embedding/#distilabel.models.embeddings.base.Embeddings.encode","title":"encode(inputs)
abstractmethod
","text":"Generates embeddings for the provided inputs.
Parameters:
Name Type Description Defaultinputs
List[str]
a list of texts for which an embedding has to be generated.
requiredReturns:
Type DescriptionList[List[Union[int, float]]]
The generated embeddings.
Source code insrc/distilabel/models/embeddings/base.py
@abstractmethod\ndef encode(self, inputs: List[str]) -> List[List[Union[int, float]]]:\n \"\"\"Generates embeddings for the provided inputs.\n\n Args:\n inputs: a list of texts for which an embedding has to be generated.\n\n Returns:\n The generated embeddings.\n \"\"\"\n pass\n
"},{"location":"api/models/embedding/embedding_gallery/","title":"Embedding Gallery","text":"This section contains the existing Embeddings
subclasses implemented in distilabel
.
embeddings
","text":""},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.SentenceTransformerEmbeddings","title":"SentenceTransformerEmbeddings
","text":" Bases: Embeddings
, CudaDevicePlacementMixin
sentence-transformers
library implementation for embedding generation.
Attributes:
Name Type Descriptionmodel
str
the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
device
Optional[RuntimeParameter[str]]
the name of the device used to load the model e.g. \"cuda\", \"mps\", etc. Defaults to None
.
prompts
Optional[Dict[str, str]]
a dictionary containing prompts to be used with the model. Defaults to None
.
default_prompt_name
Optional[str]
the default prompt (in prompts
) that will be applied to the inputs. If not provided, then no prompt will be used. Defaults to None
.
trust_remote_code
bool
whether to allow fetching and executing remote code fetched from the repository in the Hub. Defaults to False
.
revision
Optional[str]
if model
refers to a Hugging Face Hub repository, then the revision (e.g. a branch name or a commit id) to use. Defaults to \"main\"
.
token
Optional[str]
the Hugging Face Hub token that will be used to authenticate to the Hugging Face Hub. If not provided, the HF_TOKEN
environment or huggingface_hub
package local configuration will be used. Defaults to None
.
truncate_dim
Optional[int]
the dimension to truncate the sentence embeddings. Defaults to None
.
model_kwargs
Optional[Dict[str, Any]]
extra kwargs that will be passed to the Hugging Face transformers
model class. Defaults to None
.
tokenizer_kwargs
Optional[Dict[str, Any]]
extra kwargs that will be passed to the Hugging Face transformers
tokenizer class. Defaults to None
.
config_kwargs
Optional[Dict[str, Any]]
extra kwargs that will be passed to the Hugging Face transformers
configuration class. Defaults to None
.
precision
Optional[Literal['float32', 'int8', 'uint8', 'binary', 'ubinary']]
the dtype that will have the resulting embeddings. Defaults to \"float32\"
.
normalize_embeddings
RuntimeParameter[bool]
whether to normalize the embeddings so they have a length of 1. Defaults to None
.
Examples:
Generating sentence embeddings:
from distilabel.models import SentenceTransformerEmbeddings\n\nembeddings = SentenceTransformerEmbeddings(model=\"mixedbread-ai/mxbai-embed-large-v1\")\n\nembeddings.load()\n\nresults = embeddings.encode(inputs=[\"distilabel is awesome!\", \"and Argilla!\"])\n# [\n# [-0.05447685346007347, -0.01623094454407692, ...],\n# [4.4889533455716446e-05, 0.044016145169734955, ...],\n# ]\n
Source code in src/distilabel/models/embeddings/sentence_transformers.py
class SentenceTransformerEmbeddings(Embeddings, CudaDevicePlacementMixin):\n \"\"\"`sentence-transformers` library implementation for embedding generation.\n\n Attributes:\n model: the model Hugging Face Hub repo id or a path to a directory containing the\n model weights and configuration files.\n device: the name of the device used to load the model e.g. \"cuda\", \"mps\", etc.\n Defaults to `None`.\n prompts: a dictionary containing prompts to be used with the model. Defaults to\n `None`.\n default_prompt_name: the default prompt (in `prompts`) that will be applied to the\n inputs. If not provided, then no prompt will be used. Defaults to `None`.\n trust_remote_code: whether to allow fetching and executing remote code fetched\n from the repository in the Hub. Defaults to `False`.\n revision: if `model` refers to a Hugging Face Hub repository, then the revision\n (e.g. a branch name or a commit id) to use. Defaults to `\"main\"`.\n token: the Hugging Face Hub token that will be used to authenticate to the Hugging\n Face Hub. If not provided, the `HF_TOKEN` environment or `huggingface_hub` package\n local configuration will be used. Defaults to `None`.\n truncate_dim: the dimension to truncate the sentence embeddings. Defaults to `None`.\n model_kwargs: extra kwargs that will be passed to the Hugging Face `transformers`\n model class. Defaults to `None`.\n tokenizer_kwargs: extra kwargs that will be passed to the Hugging Face `transformers`\n tokenizer class. Defaults to `None`.\n config_kwargs: extra kwargs that will be passed to the Hugging Face `transformers`\n configuration class. Defaults to `None`.\n precision: the dtype that will have the resulting embeddings. Defaults to `\"float32\"`.\n normalize_embeddings: whether to normalize the embeddings so they have a length\n of 1. Defaults to `None`.\n\n Examples:\n Generating sentence embeddings:\n\n ```python\n from distilabel.models import SentenceTransformerEmbeddings\n\n embeddings = SentenceTransformerEmbeddings(model=\"mixedbread-ai/mxbai-embed-large-v1\")\n\n embeddings.load()\n\n results = embeddings.encode(inputs=[\"distilabel is awesome!\", \"and Argilla!\"])\n # [\n # [-0.05447685346007347, -0.01623094454407692, ...],\n # [4.4889533455716446e-05, 0.044016145169734955, ...],\n # ]\n ```\n \"\"\"\n\n model: str\n device: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The device to be used to load the model. 
If `None`, then it\"\n \" will check if a GPU can be used.\",\n )\n prompts: Optional[Dict[str, str]] = None\n default_prompt_name: Optional[str] = None\n trust_remote_code: bool = False\n revision: Optional[str] = None\n token: Optional[str] = None\n truncate_dim: Optional[int] = None\n model_kwargs: Optional[Dict[str, Any]] = None\n tokenizer_kwargs: Optional[Dict[str, Any]] = None\n config_kwargs: Optional[Dict[str, Any]] = None\n precision: Optional[Literal[\"float32\", \"int8\", \"uint8\", \"binary\", \"ubinary\"]] = (\n \"float32\"\n )\n normalize_embeddings: RuntimeParameter[bool] = Field(\n default=True,\n description=\"Whether to normalize the embeddings so the generated vectors\"\n \" have a length of 1 or not.\",\n )\n\n _model: Union[\"SentenceTransformer\", None] = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Loads the Sentence Transformer model\"\"\"\n super().load()\n\n if self.device == \"cuda\":\n CudaDevicePlacementMixin.load(self)\n\n try:\n from sentence_transformers import SentenceTransformer\n except ImportError as e:\n raise ImportError(\n \"`sentence-transformers` package is not installed. Please install it using\"\n \" `pip install sentence-transformers`.\"\n ) from e\n\n self._model = SentenceTransformer(\n model_name_or_path=self.model,\n device=self.device,\n prompts=self.prompts,\n default_prompt_name=self.default_prompt_name,\n trust_remote_code=self.trust_remote_code,\n revision=self.revision,\n token=self.token,\n truncate_dim=self.truncate_dim,\n model_kwargs=self.model_kwargs,\n tokenizer_kwargs=self.tokenizer_kwargs,\n config_kwargs=self.config_kwargs,\n )\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the name of the model.\"\"\"\n return self.model\n\n def encode(self, inputs: List[str]) -> List[List[Union[int, float]]]:\n \"\"\"Generates embeddings for the provided inputs.\n\n Args:\n inputs: a list of texts for which an embedding has to be generated.\n\n Returns:\n The generated embeddings.\n \"\"\"\n return self._model.encode( # type: ignore\n sentences=inputs,\n batch_size=len(inputs),\n convert_to_numpy=True,\n precision=self.precision, # type: ignore\n normalize_embeddings=self.normalize_embeddings, # type: ignore\n ).tolist() # type: ignore\n\n def unload(self) -> None:\n del self._model\n if self.device == \"cuda\":\n CudaDevicePlacementMixin.unload(self)\n super().unload()\n
"},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.SentenceTransformerEmbeddings.model_name","title":"model_name
property
","text":"Returns the name of the model.
"},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.SentenceTransformerEmbeddings.load","title":"load()
","text":"Loads the Sentence Transformer model
Source code insrc/distilabel/models/embeddings/sentence_transformers.py
def load(self) -> None:\n \"\"\"Loads the Sentence Transformer model\"\"\"\n super().load()\n\n if self.device == \"cuda\":\n CudaDevicePlacementMixin.load(self)\n\n try:\n from sentence_transformers import SentenceTransformer\n except ImportError as e:\n raise ImportError(\n \"`sentence-transformers` package is not installed. Please install it using\"\n \" `pip install sentence-transformers`.\"\n ) from e\n\n self._model = SentenceTransformer(\n model_name_or_path=self.model,\n device=self.device,\n prompts=self.prompts,\n default_prompt_name=self.default_prompt_name,\n trust_remote_code=self.trust_remote_code,\n revision=self.revision,\n token=self.token,\n truncate_dim=self.truncate_dim,\n model_kwargs=self.model_kwargs,\n tokenizer_kwargs=self.tokenizer_kwargs,\n config_kwargs=self.config_kwargs,\n )\n
"},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.SentenceTransformerEmbeddings.encode","title":"encode(inputs)
","text":"Generates embeddings for the provided inputs.
Parameters:
Name Type Description Defaultinputs
List[str]
a list of texts for which an embedding has to be generated.
requiredReturns:
Type DescriptionList[List[Union[int, float]]]
The generated embeddings.
Source code insrc/distilabel/models/embeddings/sentence_transformers.py
def encode(self, inputs: List[str]) -> List[List[Union[int, float]]]:\n \"\"\"Generates embeddings for the provided inputs.\n\n Args:\n inputs: a list of texts for which an embedding has to be generated.\n\n Returns:\n The generated embeddings.\n \"\"\"\n return self._model.encode( # type: ignore\n sentences=inputs,\n batch_size=len(inputs),\n convert_to_numpy=True,\n precision=self.precision, # type: ignore\n normalize_embeddings=self.normalize_embeddings, # type: ignore\n ).tolist() # type: ignore\n
"},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.vLLMEmbeddings","title":"vLLMEmbeddings
","text":" Bases: Embeddings
, CudaDevicePlacementMixin
vllm
library implementation for embedding generation.
Attributes:
Name Type Descriptionmodel
str
the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
dtype
str
the data type to use for the model. Defaults to auto
.
trust_remote_code
bool
whether to trust the remote code when loading the model. Defaults to False
.
quantization
Optional[str]
the quantization mode to use for the model. Defaults to None
.
revision
Optional[str]
the revision of the model to load. Defaults to None
.
enforce_eager
bool
whether to enforce eager execution. Defaults to True
.
seed
int
the seed to use for the random number generator. Defaults to 0
.
extra_kwargs
Optional[RuntimeParameter[Dict[str, Any]]]
additional dictionary of keyword arguments that will be passed to the LLM
class of vllm
library. Defaults to {}
.
_model
LLM
the vLLM
model instance. This attribute is meant to be used internally and should not be accessed directly. It will be set in the load
method.
Examples:
Generating sentence embeddings:
from distilabel.models import vLLMEmbeddings\n\nembeddings = vLLMEmbeddings(model=\"intfloat/e5-mistral-7b-instruct\")\n\nembeddings.load()\n\nresults = embeddings.encode(inputs=[\"distilabel is awesome!\", \"and Argilla!\"])\n# [\n# [-0.05447685346007347, -0.01623094454407692, ...],\n# [4.4889533455716446e-05, 0.044016145169734955, ...],\n# ]\n
Source code in src/distilabel/models/embeddings/vllm.py
class vLLMEmbeddings(Embeddings, CudaDevicePlacementMixin):\n \"\"\"`vllm` library implementation for embedding generation.\n\n Attributes:\n model: the model Hugging Face Hub repo id or a path to a directory containing the\n model weights and configuration files.\n dtype: the data type to use for the model. Defaults to `auto`.\n trust_remote_code: whether to trust the remote code when loading the model. Defaults\n to `False`.\n quantization: the quantization mode to use for the model. Defaults to `None`.\n revision: the revision of the model to load. Defaults to `None`.\n enforce_eager: whether to enforce eager execution. Defaults to `True`.\n seed: the seed to use for the random number generator. Defaults to `0`.\n extra_kwargs: additional dictionary of keyword arguments that will be passed to the\n `LLM` class of `vllm` library. Defaults to `{}`.\n _model: the `vLLM` model instance. This attribute is meant to be used internally\n and should not be accessed directly. It will be set in the `load` method.\n\n References:\n - [Offline inference embeddings](https://docs.vllm.ai/en/latest/getting_started/examples/offline_inference_embedding.html)\n\n Examples:\n Generating sentence embeddings:\n\n ```python\n from distilabel.models import vLLMEmbeddings\n\n embeddings = vLLMEmbeddings(model=\"intfloat/e5-mistral-7b-instruct\")\n\n embeddings.load()\n\n results = embeddings.encode(inputs=[\"distilabel is awesome!\", \"and Argilla!\"])\n # [\n # [-0.05447685346007347, -0.01623094454407692, ...],\n # [4.4889533455716446e-05, 0.044016145169734955, ...],\n # ]\n ```\n \"\"\"\n\n model: str\n dtype: str = \"auto\"\n trust_remote_code: bool = False\n quantization: Optional[str] = None\n revision: Optional[str] = None\n\n enforce_eager: bool = True\n\n seed: int = 0\n\n extra_kwargs: Optional[RuntimeParameter[Dict[str, Any]]] = Field(\n default_factory=dict,\n description=\"Additional dictionary of keyword arguments that will be passed to the\"\n \" `vLLM` class of `vllm` library. See all the supported arguments at: \"\n \"https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py\",\n )\n\n _model: \"_vLLM\" = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Loads the `vLLM` model using either the path or the Hugging Face Hub repository id.\"\"\"\n super().load()\n\n CudaDevicePlacementMixin.load(self)\n\n try:\n from vllm import LLM as _vLLM\n\n except ImportError as ie:\n raise ImportError(\n \"vLLM is not installed. Please install it using `pip install vllm`.\"\n ) from ie\n\n self._model = _vLLM(\n self.model,\n dtype=self.dtype,\n trust_remote_code=self.trust_remote_code,\n quantization=self.quantization,\n revision=self.revision,\n enforce_eager=self.enforce_eager,\n seed=self.seed,\n **self.extra_kwargs, # type: ignore\n )\n\n def unload(self) -> None:\n \"\"\"Unloads the `vLLM` model.\"\"\"\n CudaDevicePlacementMixin.unload(self)\n super().unload()\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the name of the model.\"\"\"\n return self.model\n\n def encode(self, inputs: List[str]) -> List[List[Union[int, float]]]:\n \"\"\"Generates embeddings for the provided inputs.\n\n Args:\n inputs: a list of texts for which an embedding has to be generated.\n\n Returns:\n The generated embeddings.\n \"\"\"\n return [output.outputs.embedding for output in self._model.encode(inputs)]\n
"},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.vLLMEmbeddings.model_name","title":"model_name
property
","text":"Returns the name of the model.
"},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.vLLMEmbeddings.load","title":"load()
","text":"Loads the vLLM
model using either the path or the Hugging Face Hub repository id.
src/distilabel/models/embeddings/vllm.py
def load(self) -> None:\n \"\"\"Loads the `vLLM` model using either the path or the Hugging Face Hub repository id.\"\"\"\n super().load()\n\n CudaDevicePlacementMixin.load(self)\n\n try:\n from vllm import LLM as _vLLM\n\n except ImportError as ie:\n raise ImportError(\n \"vLLM is not installed. Please install it using `pip install vllm`.\"\n ) from ie\n\n self._model = _vLLM(\n self.model,\n dtype=self.dtype,\n trust_remote_code=self.trust_remote_code,\n quantization=self.quantization,\n revision=self.revision,\n enforce_eager=self.enforce_eager,\n seed=self.seed,\n **self.extra_kwargs, # type: ignore\n )\n
"},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.vLLMEmbeddings.unload","title":"unload()
","text":"Unloads the vLLM
model.
src/distilabel/models/embeddings/vllm.py
def unload(self) -> None:\n \"\"\"Unloads the `vLLM` model.\"\"\"\n CudaDevicePlacementMixin.unload(self)\n super().unload()\n
"},{"location":"api/models/embedding/embedding_gallery/#distilabel.models.embeddings.vLLMEmbeddings.encode","title":"encode(inputs)
","text":"Generates embeddings for the provided inputs.
Parameters:
Name Type Description Defaultinputs
List[str]
a list of texts for which an embedding has to be generated.
requiredReturns:
Type DescriptionList[List[Union[int, float]]]
The generated embeddings.
Source code insrc/distilabel/models/embeddings/vllm.py
def encode(self, inputs: List[str]) -> List[List[Union[int, float]]]:\n \"\"\"Generates embeddings for the provided inputs.\n\n Args:\n inputs: a list of texts for which an embedding has to be generated.\n\n Returns:\n The generated embeddings.\n \"\"\"\n return [output.outputs.embedding for output in self._model.encode(inputs)]\n
"},{"location":"api/models/llm/","title":"LLM","text":"This section contains the API reference for the distilabel
LLMs, both for the LLM
synchronous implementation, and for the AsyncLLM
asynchronous one.
For more information and examples on how to use existing LLMs or create custom ones, please refer to Tutorial - LLM.
"},{"location":"api/models/llm/#distilabel.models.llms.base","title":"base
","text":""},{"location":"api/models/llm/#distilabel.models.llms.base.LLM","title":"LLM
","text":" Bases: RuntimeParametersMixin
, BaseModel
, _Serializable
, ABC
Base class for LLM
s to be used in distilabel
framework.
To implement an LLM
subclass, you need to subclass this class and implement: - load
method to load the LLM
if needed. Don't forget to call super().load()
, so the _logger
attribute is initialized. - model_name
property to return the model name used for the LLM. - generate
method to generate num_generations
per input in inputs
.
Attributes:
Name Type Descriptiongeneration_kwargs
Optional[RuntimeParameter[Dict[str, Any]]]
the kwargs to be propagated to either generate
or agenerate
methods within each LLM
.
use_offline_batch_generation
Optional[RuntimeParameter[bool]]
whether to use the offline_batch_generate
method to generate the responses.
offline_batch_generation_block_until_done
Optional[RuntimeParameter[int]]
if provided, then polling will be done until the offline_batch_generate
method is able to retrieve the results. The value indicates the time in seconds to wait between each poll.
jobs_ids
Union[Tuple[str, ...], None]
the job ids generated by the offline_batch_generate
method. This attribute is used to store the job ids generated by the offline_batch_generate
method so later they can be used to retrieve the results. It is not meant to be set by the user.
_logger
Logger
the logger to be used for the LLM
. It will be initialized when the load
method is called.
src/distilabel/models/llms/base.py
class LLM(RuntimeParametersMixin, BaseModel, _Serializable, ABC):\n \"\"\"Base class for `LLM`s to be used in `distilabel` framework.\n\n To implement an `LLM` subclass, you need to subclass this class and implement:\n - `load` method to load the `LLM` if needed. Don't forget to call `super().load()`,\n so the `_logger` attribute is initialized.\n - `model_name` property to return the model name used for the LLM.\n - `generate` method to generate `num_generations` per input in `inputs`.\n\n Attributes:\n generation_kwargs: the kwargs to be propagated to either `generate` or `agenerate`\n methods within each `LLM`.\n use_offline_batch_generation: whether to use the `offline_batch_generate` method to\n generate the responses.\n offline_batch_generation_block_until_done: if provided, then polling will be done until\n the `ofline_batch_generate` method is able to retrieve the results. The value indicate\n the time to wait between each polling.\n jobs_ids: the job ids generated by the `offline_batch_generate` method. This attribute\n is used to store the job ids generated by the `offline_batch_generate` method\n so later they can be used to retrieve the results. It is not meant to be set by\n the user.\n _logger: the logger to be used for the `LLM`. It will be initialized when the `load`\n method is called.\n \"\"\"\n\n model_config = ConfigDict(\n arbitrary_types_allowed=True,\n protected_namespaces=(),\n validate_default=True,\n validate_assignment=True,\n extra=\"forbid\",\n )\n\n generation_kwargs: Optional[RuntimeParameter[Dict[str, Any]]] = Field(\n default_factory=dict,\n description=\"The kwargs to be propagated to either `generate` or `agenerate`\"\n \" methods within each `LLM`.\",\n )\n use_offline_batch_generation: Optional[RuntimeParameter[bool]] = Field(\n default=False,\n description=\"Whether to use the `offline_batch_generate` method to generate\"\n \" the responses.\",\n )\n offline_batch_generation_block_until_done: Optional[RuntimeParameter[int]] = Field(\n default=None,\n description=\"If provided, then polling will be done until the `ofline_batch_generate`\"\n \" method is able to retrieve the results. The value indicate the time to wait between\"\n \" each polling.\",\n )\n\n jobs_ids: Union[Tuple[str, ...], None] = Field(default=None)\n _logger: \"Logger\" = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Method to be called to initialize the `LLM`, its logger and optionally the\n structured output generator.\"\"\"\n self._logger = logging.getLogger(f\"distilabel.llm.{self.model_name}\")\n\n def unload(self) -> None:\n \"\"\"Method to be called to unload the `LLM` and release any resources.\"\"\"\n pass\n\n @property\n @abstractmethod\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n pass\n\n def get_generation_kwargs(self) -> Dict[str, Any]:\n \"\"\"Returns the generation kwargs to be used for the generation. 
This method can\n be overridden to provide a more complex logic for the generation kwargs.\n\n Returns:\n The kwargs to be used for the generation.\n \"\"\"\n return self.generation_kwargs # type: ignore\n\n @abstractmethod\n def generate(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n ) -> List[\"GenerateOutput\"]:\n \"\"\"Abstract method to be implemented by each LLM to generate `num_generations`\n per input in `inputs`.\n\n Args:\n inputs: the list of inputs to generate responses for which follows OpenAI's\n API format:\n\n ```python\n [\n {\"role\": \"system\", \"content\": \"You're a helpful assistant...\"},\n {\"role\": \"user\", \"content\": \"Give a template email for B2B communications...\"},\n {\"role\": \"assistant\", \"content\": \"Sure, here's a template you can use...\"},\n {\"role\": \"user\", \"content\": \"Modify the second paragraph...\"}\n ]\n ```\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n \"\"\"\n pass\n\n def generate_outputs(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n ) -> List[\"GenerateOutput\"]:\n \"\"\"Generates outputs for the given inputs using either `generate` method or the\n `offine_batch_generate` method if `use_offline_\n \"\"\"\n if self.use_offline_batch_generation:\n if self.offline_batch_generation_block_until_done is not None:\n return self._offline_batch_generate_polling(\n inputs=inputs,\n num_generations=num_generations,\n **kwargs,\n )\n\n # This will raise `DistilabelOfflineBatchGenerationNotFinishedException` right away\n # if the batch generation is not finished.\n return self.offline_batch_generate(\n inputs=inputs,\n num_generations=num_generations,\n **kwargs,\n )\n\n return self.generate(inputs=inputs, num_generations=num_generations, **kwargs)\n\n def _offline_batch_generate_polling(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n ) -> List[\"GenerateOutput\"]:\n \"\"\"Method to poll the `offline_batch_generate` method until the batch generation\n is finished.\n\n Args:\n inputs: the list of inputs to generate responses for.\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n\n Returns:\n A list containing the generations for each input.\n \"\"\"\n while True:\n try:\n return self.offline_batch_generate(\n inputs=inputs,\n num_generations=num_generations,\n **kwargs,\n )\n except DistilabelOfflineBatchGenerationNotFinishedException as e:\n self._logger.info(\n f\"Waiting for the offline batch generation to finish: {e}. Sleeping\"\n f\" for {self.offline_batch_generation_block_until_done} seconds before\"\n \" trying to get the results again.\"\n )\n # When running a `Step` in a child process, SIGINT is overriden so the child\n # process doesn't stop when the parent process receives a SIGINT signal.\n # The new handler sets an environment variable that is checked here to stop\n # the polling.\n if os.getenv(SIGINT_HANDLER_CALLED_ENV_NAME) is not None:\n self._logger.info(\n \"Received a KeyboardInterrupt. Stopping polling for checking if the\"\n \" offline batch generation is finished...\"\n )\n raise e\n time.sleep(self.offline_batch_generation_block_until_done) # type: ignore\n except KeyboardInterrupt as e:\n # This is for the case the `LLM` is being executed outside a pipeline\n self._logger.info(\n \"Received a KeyboardInterrupt. 
Stopping polling for checking if the\"\n \" offline batch generation is finished...\"\n )\n raise DistilabelOfflineBatchGenerationNotFinishedException(\n jobs_ids=self.jobs_ids # type: ignore\n ) from e\n\n @property\n def generate_parameters(self) -> List[\"inspect.Parameter\"]:\n \"\"\"Returns the parameters of the `generate` method.\n\n Returns:\n A list containing the parameters of the `generate` method.\n \"\"\"\n return list(inspect.signature(self.generate).parameters.values())\n\n @property\n def runtime_parameters_names(self) -> \"RuntimeParametersNames\":\n \"\"\"Returns the runtime parameters of the `LLM`, which are combination of the\n attributes of the `LLM` type hinted with `RuntimeParameter` and the parameters\n of the `generate` method that are not `input` and `num_generations`.\n\n Returns:\n A dictionary with the name of the runtime parameters as keys and a boolean\n indicating if the parameter is optional or not.\n \"\"\"\n runtime_parameters = super().runtime_parameters_names\n runtime_parameters[\"generation_kwargs\"] = {}\n\n # runtime parameters from the `generate` method\n for param in self.generate_parameters:\n if param.name in [\"input\", \"inputs\", \"num_generations\"]:\n continue\n is_optional = param.default != inspect.Parameter.empty\n runtime_parameters[\"generation_kwargs\"][param.name] = is_optional\n\n return runtime_parameters\n\n def get_runtime_parameters_info(self) -> List[\"RuntimeParameterInfo\"]:\n \"\"\"Gets the information of the runtime parameters of the `LLM` such as the name\n and the description. This function is meant to include the information of the runtime\n parameters in the serialized data of the `LLM`.\n\n Returns:\n A list containing the information for each runtime parameter of the `LLM`.\n \"\"\"\n runtime_parameters_info = super().get_runtime_parameters_info()\n\n generation_kwargs_info = next(\n (\n runtime_parameter_info\n for runtime_parameter_info in runtime_parameters_info\n if runtime_parameter_info[\"name\"] == \"generation_kwargs\"\n ),\n None,\n )\n\n # If `generation_kwargs` attribute is present, we need to include the `generate`\n # method arguments as the information for this attribute.\n if generation_kwargs_info:\n generate_docstring_args = self.generate_parsed_docstring[\"args\"]\n\n generation_kwargs_info[\"keys\"] = []\n for key, value in generation_kwargs_info[\"optional\"].items():\n info = {\"name\": key, \"optional\": value}\n if description := generate_docstring_args.get(key):\n info[\"description\"] = description\n generation_kwargs_info[\"keys\"].append(info)\n\n generation_kwargs_info.pop(\"optional\")\n\n return runtime_parameters_info\n\n @cached_property\n def generate_parsed_docstring(self) -> \"Docstring\":\n \"\"\"Returns the parsed docstring of the `generate` method.\n\n Returns:\n The parsed docstring of the `generate` method.\n \"\"\"\n return parse_google_docstring(self.generate)\n\n def get_last_hidden_states(\n self, inputs: List[\"StandardInput\"]\n ) -> List[\"HiddenState\"]:\n \"\"\"Method to get the last hidden states of the model for a list of inputs.\n\n Args:\n inputs: the list of inputs to get the last hidden states from.\n\n Returns:\n A list containing the last hidden state for each sequence using a NumPy array\n with shape [num_tokens, hidden_size].\n \"\"\"\n # TODO: update to use `DistilabelNotImplementedError`\n raise NotImplementedError(\n f\"Method `get_last_hidden_states` is not implemented for `{self.__class__.__name__}`\"\n )\n\n def _prepare_structured_output(\n self, 
structured_output: \"StructuredOutputType\"\n ) -> Union[Any, None]:\n \"\"\"Method in charge of preparing the structured output generator.\n\n By default will raise a `NotImplementedError`, subclasses that allow it must override this\n method with the implementation.\n\n Args:\n structured_output: the config to prepare the guided generation.\n\n Returns:\n The structure to be used for the guided generation.\n \"\"\"\n # TODO: update to use `DistilabelNotImplementedError`\n raise NotImplementedError(\n f\"Guided generation is not implemented for `{type(self).__name__}`\"\n )\n\n def offline_batch_generate(\n self,\n inputs: Union[List[\"FormattedInput\"], None] = None,\n num_generations: int = 1,\n **kwargs: Any,\n ) -> List[\"GenerateOutput\"]:\n \"\"\"Method to generate a list of outputs for the given inputs using an offline batch\n generation method to be implemented by each `LLM`.\n\n This method should create jobs the first time is called and store the job ids, so\n the second and subsequent calls can retrieve the results of the batch generation.\n If subsequent calls are made before the batch generation is finished, then the method\n should raise a `DistilabelOfflineBatchGenerationNotFinishedException`. This exception\n will be handled automatically by the `Pipeline` which will store all the required\n information for recovering the pipeline execution when the batch generation is finished.\n\n Args:\n inputs: the list of inputs to generate responses for.\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n\n Returns:\n A list containing the generations for each input.\n \"\"\"\n raise DistilabelNotImplementedError(\n f\"`offline_batch_generate` is not implemented for `{self.__class__.__name__}`\",\n page=\"sections/how_to_guides/advanced/offline-batch-generation/\",\n )\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.model_name","title":"model_name
abstractmethod
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.generate_parameters","title":"generate_parameters
property
","text":"Returns the parameters of the generate
method.
Returns:
Type DescriptionList[Parameter]
A list containing the parameters of the generate
method.
runtime_parameters_names
property
","text":"Returns the runtime parameters of the LLM
, which are combination of the attributes of the LLM
type hinted with RuntimeParameter
and the parameters of the generate
method that are not input
and num_generations
.
Returns:
Type DescriptionRuntimeParametersNames
A dictionary with the name of the runtime parameters as keys and a boolean
RuntimeParametersNames
indicating if the parameter is optional or not.
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.generate_parsed_docstring","title":"generate_parsed_docstring
cached
property
","text":"Returns the parsed docstring of the generate
method.
Returns:
Type DescriptionDocstring
The parsed docstring of the generate
method.
load()
","text":"Method to be called to initialize the LLM
, its logger and optionally the structured output generator.
src/distilabel/models/llms/base.py
def load(self) -> None:\n \"\"\"Method to be called to initialize the `LLM`, its logger and optionally the\n structured output generator.\"\"\"\n self._logger = logging.getLogger(f\"distilabel.llm.{self.model_name}\")\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.unload","title":"unload()
","text":"Method to be called to unload the LLM
and release any resources.
src/distilabel/models/llms/base.py
def unload(self) -> None:\n \"\"\"Method to be called to unload the `LLM` and release any resources.\"\"\"\n pass\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.get_generation_kwargs","title":"get_generation_kwargs()
","text":"Returns the generation kwargs to be used for the generation. This method can be overridden to provide a more complex logic for the generation kwargs.
Returns:
Type DescriptionDict[str, Any]
The kwargs to be used for the generation.
Source code insrc/distilabel/models/llms/base.py
def get_generation_kwargs(self) -> Dict[str, Any]:\n \"\"\"Returns the generation kwargs to be used for the generation. This method can\n be overridden to provide a more complex logic for the generation kwargs.\n\n Returns:\n The kwargs to be used for the generation.\n \"\"\"\n return self.generation_kwargs # type: ignore\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.generate","title":"generate(inputs, num_generations=1, **kwargs)
abstractmethod
","text":"Abstract method to be implemented by each LLM to generate num_generations
per input in inputs
.
Parameters:
Name Type Description Defaultinputs
List[FormattedInput]
the list of inputs to generate responses for which follows OpenAI's API format:
[\n {\"role\": \"system\", \"content\": \"You're a helpful assistant...\"},\n {\"role\": \"user\", \"content\": \"Give a template email for B2B communications...\"},\n {\"role\": \"assistant\", \"content\": \"Sure, here's a template you can use...\"},\n {\"role\": \"user\", \"content\": \"Modify the second paragraph...\"}\n]\n
required num_generations
int
the number of generations to generate per input.
1
**kwargs
Any
the additional kwargs to be used for the generation.
{}
Source code in src/distilabel/models/llms/base.py
@abstractmethod\ndef generate(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n) -> List[\"GenerateOutput\"]:\n \"\"\"Abstract method to be implemented by each LLM to generate `num_generations`\n per input in `inputs`.\n\n Args:\n inputs: the list of inputs to generate responses for which follows OpenAI's\n API format:\n\n ```python\n [\n {\"role\": \"system\", \"content\": \"You're a helpful assistant...\"},\n {\"role\": \"user\", \"content\": \"Give a template email for B2B communications...\"},\n {\"role\": \"assistant\", \"content\": \"Sure, here's a template you can use...\"},\n {\"role\": \"user\", \"content\": \"Modify the second paragraph...\"}\n ]\n ```\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n \"\"\"\n pass\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.generate_outputs","title":"generate_outputs(inputs, num_generations=1, **kwargs)
","text":"Generates outputs for the given inputs using either generate
method or the offine_batch_generate
method if `use_offline_
src/distilabel/models/llms/base.py
def generate_outputs(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n) -> List[\"GenerateOutput\"]:\n \"\"\"Generates outputs for the given inputs using either `generate` method or the\n `offine_batch_generate` method if `use_offline_\n \"\"\"\n if self.use_offline_batch_generation:\n if self.offline_batch_generation_block_until_done is not None:\n return self._offline_batch_generate_polling(\n inputs=inputs,\n num_generations=num_generations,\n **kwargs,\n )\n\n # This will raise `DistilabelOfflineBatchGenerationNotFinishedException` right away\n # if the batch generation is not finished.\n return self.offline_batch_generate(\n inputs=inputs,\n num_generations=num_generations,\n **kwargs,\n )\n\n return self.generate(inputs=inputs, num_generations=num_generations, **kwargs)\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.get_runtime_parameters_info","title":"get_runtime_parameters_info()
","text":"Gets the information of the runtime parameters of the LLM
such as the name and the description. This function is meant to include the information of the runtime parameters in the serialized data of the LLM
.
Returns:
Type DescriptionList[RuntimeParameterInfo]
A list containing the information for each runtime parameter of the LLM
.
src/distilabel/models/llms/base.py
def get_runtime_parameters_info(self) -> List[\"RuntimeParameterInfo\"]:\n \"\"\"Gets the information of the runtime parameters of the `LLM` such as the name\n and the description. This function is meant to include the information of the runtime\n parameters in the serialized data of the `LLM`.\n\n Returns:\n A list containing the information for each runtime parameter of the `LLM`.\n \"\"\"\n runtime_parameters_info = super().get_runtime_parameters_info()\n\n generation_kwargs_info = next(\n (\n runtime_parameter_info\n for runtime_parameter_info in runtime_parameters_info\n if runtime_parameter_info[\"name\"] == \"generation_kwargs\"\n ),\n None,\n )\n\n # If `generation_kwargs` attribute is present, we need to include the `generate`\n # method arguments as the information for this attribute.\n if generation_kwargs_info:\n generate_docstring_args = self.generate_parsed_docstring[\"args\"]\n\n generation_kwargs_info[\"keys\"] = []\n for key, value in generation_kwargs_info[\"optional\"].items():\n info = {\"name\": key, \"optional\": value}\n if description := generate_docstring_args.get(key):\n info[\"description\"] = description\n generation_kwargs_info[\"keys\"].append(info)\n\n generation_kwargs_info.pop(\"optional\")\n\n return runtime_parameters_info\n
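For illustration, a possible (abridged, hypothetical) return value once the generation_kwargs entry has been expanded with the arguments documented in the generate/agenerate docstring:
runtime_parameters_info = [
    {
        "name": "generation_kwargs",
        "keys": [
            {"name": "temperature", "optional": True, "description": "the temperature to use for the generation."},
            {"name": "max_new_tokens", "optional": True, "description": "the maximum number of new tokens to generate."},
        ],
    },
    # ... one entry per remaining runtime parameter of the `LLM` ...
]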
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.get_last_hidden_states","title":"get_last_hidden_states(inputs)
","text":"Method to get the last hidden states of the model for a list of inputs.
Parameters:
Name Type Description Defaultinputs
List[StandardInput]
the list of inputs to get the last hidden states from.
requiredReturns:
Type DescriptionList[HiddenState]
A list containing the last hidden state for each sequence using a NumPy array with shape [num_tokens, hidden_size].
Source code insrc/distilabel/models/llms/base.py
def get_last_hidden_states(\n self, inputs: List[\"StandardInput\"]\n) -> List[\"HiddenState\"]:\n \"\"\"Method to get the last hidden states of the model for a list of inputs.\n\n Args:\n inputs: the list of inputs to get the last hidden states from.\n\n Returns:\n A list containing the last hidden state for each sequence using a NumPy array\n with shape [num_tokens, hidden_size].\n \"\"\"\n # TODO: update to use `DistilabelNotImplementedError`\n raise NotImplementedError(\n f\"Method `get_last_hidden_states` is not implemented for `{self.__class__.__name__}`\"\n )\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.LLM.offline_batch_generate","title":"offline_batch_generate(inputs=None, num_generations=1, **kwargs)
","text":"Method to generate a list of outputs for the given inputs using an offline batch generation method to be implemented by each LLM
.
This method should create jobs the first time it is called and store the job ids, so that the second and subsequent calls can retrieve the results of the batch generation. If subsequent calls are made before the batch generation is finished, then the method should raise a DistilabelOfflineBatchGenerationNotFinishedException
. This exception will be handled automatically by the Pipeline
which will store all the required information for recovering the pipeline execution when the batch generation is finished.
Parameters:
Name Type Description Defaultinputs
Union[List[FormattedInput], None]
the list of inputs to generate responses for.
None
num_generations
int
the number of generations to generate per input.
1
**kwargs
Any
the additional kwargs to be used for the generation.
{}
Returns:
Type DescriptionList[GenerateOutput]
A list containing the generations for each input.
Source code insrc/distilabel/models/llms/base.py
def offline_batch_generate(\n self,\n inputs: Union[List[\"FormattedInput\"], None] = None,\n num_generations: int = 1,\n **kwargs: Any,\n) -> List[\"GenerateOutput\"]:\n \"\"\"Method to generate a list of outputs for the given inputs using an offline batch\n generation method to be implemented by each `LLM`.\n\n This method should create jobs the first time is called and store the job ids, so\n the second and subsequent calls can retrieve the results of the batch generation.\n If subsequent calls are made before the batch generation is finished, then the method\n should raise a `DistilabelOfflineBatchGenerationNotFinishedException`. This exception\n will be handled automatically by the `Pipeline` which will store all the required\n information for recovering the pipeline execution when the batch generation is finished.\n\n Args:\n inputs: the list of inputs to generate responses for.\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n\n Returns:\n A list containing the generations for each input.\n \"\"\"\n raise DistilabelNotImplementedError(\n f\"`offline_batch_generate` is not implemented for `{self.__class__.__name__}`\",\n page=\"sections/how_to_guides/advanced/offline-batch-generation/\",\n )\n
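A minimal, hypothetical sketch of the protocol described above; the _submit_jobs, _jobs_finished and _retrieve_results helpers, the exception import path and its jobs_ids argument are assumptions made for illustration:
from typing import Any, List, Union

from distilabel.exceptions import DistilabelOfflineBatchGenerationNotFinishedException
from distilabel.models.llms.base import LLM


class MyBatchLLM(LLM):  # hypothetical; `model_name`/`generate` assumed to be implemented elsewhere
    def offline_batch_generate(
        self,
        inputs: Union[List["FormattedInput"], None] = None,
        num_generations: int = 1,
        **kwargs: Any,
    ) -> List["GenerateOutput"]:
        if not self.jobs_ids:
            # First call: create the batch jobs and remember their ids.
            self.jobs_ids = self._submit_jobs(inputs)  # hypothetical helper
            raise DistilabelOfflineBatchGenerationNotFinishedException(
                jobs_ids=self.jobs_ids
            )
        if not self._jobs_finished(self.jobs_ids):  # hypothetical helper
            # Subsequent calls while the jobs are still running.
            raise DistilabelOfflineBatchGenerationNotFinishedException(
                jobs_ids=self.jobs_ids
            )
        # Jobs finished: download and return the generations for each input.
        return self._retrieve_results(self.jobs_ids)  # hypothetical helper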
"},{"location":"api/models/llm/#distilabel.models.llms.base.AsyncLLM","title":"AsyncLLM
","text":" Bases: LLM
Abstract class for asynchronous LLMs, so as to benefit from the async capabilities of each LLM implementation. This class is meant to be subclassed by each LLM, and the method agenerate
needs to be implemented to provide the asynchronous generation of responses.
Attributes:
Name Type Description_event_loop
AbstractEventLoop
the event loop to be used for the asynchronous generation of responses.
Source code insrc/distilabel/models/llms/base.py
class AsyncLLM(LLM):\n \"\"\"Abstract class for asynchronous LLMs, so as to benefit from the async capabilities\n of each LLM implementation. This class is meant to be subclassed by each LLM, and the\n method `agenerate` needs to be implemented to provide the asynchronous generation of\n responses.\n\n Attributes:\n _event_loop: the event loop to be used for the asynchronous generation of responses.\n \"\"\"\n\n _num_generations_param_supported = True\n _event_loop: \"asyncio.AbstractEventLoop\" = PrivateAttr(default=None)\n _new_event_loop: bool = PrivateAttr(default=False)\n\n @property\n def generate_parameters(self) -> List[inspect.Parameter]:\n \"\"\"Returns the parameters of the `agenerate` method.\n\n Returns:\n A list containing the parameters of the `agenerate` method.\n \"\"\"\n return list(inspect.signature(self.agenerate).parameters.values())\n\n @cached_property\n def generate_parsed_docstring(self) -> \"Docstring\":\n \"\"\"Returns the parsed docstring of the `agenerate` method.\n\n Returns:\n The parsed docstring of the `agenerate` method.\n \"\"\"\n return parse_google_docstring(self.agenerate)\n\n @property\n def event_loop(self) -> \"asyncio.AbstractEventLoop\":\n if self._event_loop is None:\n try:\n self._event_loop = asyncio.get_running_loop()\n if self._event_loop.is_closed():\n self._event_loop = asyncio.new_event_loop() # type: ignore\n self._new_event_loop = True\n except RuntimeError:\n self._event_loop = asyncio.new_event_loop()\n self._new_event_loop = True\n asyncio.set_event_loop(self._event_loop)\n return self._event_loop\n\n @abstractmethod\n async def agenerate(\n self, input: \"FormattedInput\", num_generations: int = 1, **kwargs: Any\n ) -> \"GenerateOutput\":\n \"\"\"Method to generate a `num_generations` responses for a given input asynchronously,\n and executed concurrently in `generate` method.\n \"\"\"\n pass\n\n async def _agenerate(\n self, inputs: List[\"FormattedInput\"], num_generations: int = 1, **kwargs: Any\n ) -> List[\"GenerateOutput\"]:\n \"\"\"Internal function to concurrently generate responses for a list of inputs.\n\n Args:\n inputs: the list of inputs to generate responses for.\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n\n Returns:\n A list containing the generations for each input.\n \"\"\"\n if self._num_generations_param_supported:\n tasks = [\n asyncio.create_task(\n self.agenerate(\n input=input, num_generations=num_generations, **kwargs\n )\n )\n for input in inputs\n ]\n result = await asyncio.gather(*tasks)\n return result\n\n tasks = [\n asyncio.create_task(self.agenerate(input=input, **kwargs))\n for input in inputs\n for _ in range(num_generations)\n ]\n outputs = await asyncio.gather(*tasks)\n return merge_responses(outputs, n=num_generations)\n\n def generate(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n ) -> List[\"GenerateOutput\"]:\n \"\"\"Method to generate a list of responses asynchronously, returning the output\n synchronously awaiting for the response of each input sent to `agenerate`.\n\n Args:\n inputs: the list of inputs to generate responses for.\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n\n Returns:\n A list containing the generations for each input.\n \"\"\"\n return self.event_loop.run_until_complete(\n self._agenerate(inputs=inputs, num_generations=num_generations, **kwargs)\n )\n\n 
def __del__(self) -> None:\n \"\"\"Closes the event loop when the object is deleted.\"\"\"\n if sys.meta_path is None:\n return\n\n if self._new_event_loop:\n if self._event_loop.is_running():\n self._event_loop.stop()\n self._event_loop.close()\n\n @staticmethod\n def _prepare_structured_output( # type: ignore\n structured_output: \"InstructorStructuredOutputType\",\n client: Any = None,\n framework: Optional[str] = None,\n ) -> Dict[str, Union[str, Any]]:\n \"\"\"Wraps the client and updates the schema to work store it internally as a json schema.\n\n Args:\n structured_output: The configuration dict to prepare the structured output.\n client: The client to wrap to generate structured output. Implemented to work\n with `instructor`.\n framework: The name of the framework.\n\n Returns:\n A dictionary containing the wrapped client and the schema to update the structured_output\n variable in case it is a pydantic model.\n \"\"\"\n from distilabel.steps.tasks.structured_outputs.instructor import (\n prepare_instructor,\n )\n\n result = {}\n client = prepare_instructor(\n client,\n mode=structured_output.get(\"mode\"),\n framework=framework, # type: ignore\n )\n result[\"client\"] = client\n\n schema = structured_output.get(\"schema\")\n if not schema:\n raise DistilabelUserError(\n f\"The `structured_output` argument must contain a schema: {structured_output}\",\n page=\"sections/how_to_guides/advanced/structured_generation/#instructor\",\n )\n if inspect.isclass(schema) and issubclass(schema, BaseModel):\n # We want a json schema for the serialization, but instructor wants a pydantic BaseModel.\n structured_output[\"schema\"] = schema.model_json_schema() # type: ignore\n result[\"structured_output\"] = structured_output\n\n return result\n\n @staticmethod\n def _prepare_kwargs(\n arguments: Dict[str, Any], structured_output: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"Helper method to update the kwargs with the structured output configuration,\n used in case they are defined.\n\n Args:\n arguments: The arguments that would be passed to the LLM as **kwargs.\n to update with the structured output configuration.\n structured_outputs: The structured output configuration to update the arguments.\n\n Returns:\n kwargs updated with the special arguments used by `instructor`.\n \"\"\"\n # We can deal with json schema or BaseModel, but we need to convert it to a BaseModel\n # for the Instructor client.\n schema = structured_output.get(\"schema\", {})\n\n # If there's already a pydantic model, we don't need to do anything,\n # otherwise, try to obtain one.\n if not (inspect.isclass(schema) and issubclass(schema, BaseModel)):\n from distilabel.steps.tasks.structured_outputs.utils import (\n json_schema_to_model,\n )\n\n if isinstance(schema, str):\n # In case it was saved in the dataset as a string.\n schema = json.loads(schema)\n\n try:\n schema = json_schema_to_model(schema)\n except Exception as e:\n raise ValueError(\n f\"Failed to convert the schema to a pydantic model, the model is too complex currently: {e}\"\n ) from e\n\n arguments.update(\n **{\n \"response_model\": schema,\n \"max_retries\": structured_output.get(\"max_retries\", 1),\n },\n )\n return arguments\n
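A minimal sketch of subclassing AsyncLLM (the UppercaseLLM name and its toy behaviour are assumptions): only agenerate and model_name need to be provided, and the base class fans the per-input calls out concurrently from generate:
from typing import Any

from distilabel.models.llms.base import AsyncLLM


class UppercaseLLM(AsyncLLM):
    # Toy async LLM that uppercases the last message of each input.

    @property
    def model_name(self) -> str:
        return "uppercase"

    async def agenerate(self, input, num_generations: int = 1, **kwargs: Any):
        text = input[-1]["content"].upper()
        # The expected return shape is a `GenerateOutput`-like dict.
        return {
            "generations": [text] * num_generations,
            "statistics": {"input_tokens": [0], "output_tokens": [0]},
        }


llm = UppercaseLLM()
llm.load()
# `generate` schedules the `agenerate` coroutines concurrently under the hood.
outputs = llm.generate(inputs=[[{"role": "user", "content": "hello"}]])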
"},{"location":"api/models/llm/#distilabel.models.llms.base.AsyncLLM.generate_parameters","title":"generate_parameters
property
","text":"Returns the parameters of the agenerate
method.
Returns:
Type DescriptionList[Parameter]
A list containing the parameters of the agenerate
method.
generate_parsed_docstring
cached
property
","text":"Returns the parsed docstring of the agenerate
method.
Returns:
Type DescriptionDocstring
The parsed docstring of the agenerate
method.
agenerate(input, num_generations=1, **kwargs)
abstractmethod
async
","text":"Method to generate a num_generations
responses for a given input asynchronously, and executed concurrently in generate
method.
src/distilabel/models/llms/base.py
@abstractmethod\nasync def agenerate(\n self, input: \"FormattedInput\", num_generations: int = 1, **kwargs: Any\n) -> \"GenerateOutput\":\n \"\"\"Method to generate a `num_generations` responses for a given input asynchronously,\n and executed concurrently in `generate` method.\n \"\"\"\n pass\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.AsyncLLM.generate","title":"generate(inputs, num_generations=1, **kwargs)
","text":"Method to generate a list of responses asynchronously, returning the output synchronously awaiting for the response of each input sent to agenerate
.
Parameters:
Name Type Description Defaultinputs
List[FormattedInput]
the list of inputs to generate responses for.
requirednum_generations
int
the number of generations to generate per input.
1
**kwargs
Any
the additional kwargs to be used for the generation.
{}
Returns:
Type DescriptionList[GenerateOutput]
A list containing the generations for each input.
Source code insrc/distilabel/models/llms/base.py
def generate(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n) -> List[\"GenerateOutput\"]:\n \"\"\"Method to generate a list of responses asynchronously, returning the output\n synchronously awaiting for the response of each input sent to `agenerate`.\n\n Args:\n inputs: the list of inputs to generate responses for.\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n\n Returns:\n A list containing the generations for each input.\n \"\"\"\n return self.event_loop.run_until_complete(\n self._agenerate(inputs=inputs, num_generations=num_generations, **kwargs)\n )\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.AsyncLLM.__del__","title":"__del__()
","text":"Closes the event loop when the object is deleted.
Source code insrc/distilabel/models/llms/base.py
def __del__(self) -> None:\n \"\"\"Closes the event loop when the object is deleted.\"\"\"\n if sys.meta_path is None:\n return\n\n if self._new_event_loop:\n if self._event_loop.is_running():\n self._event_loop.stop()\n self._event_loop.close()\n
"},{"location":"api/models/llm/#distilabel.models.llms.base.merge_responses","title":"merge_responses(responses, n=1)
","text":"Helper function to group the responses from LLM.agenerate
method according to the number of generations requested.
Parameters:
Name Type Description Defaultresponses
List[GenerateOutput]
the responses from the LLM.agenerate
method.
n
int
number of responses to group together. Defaults to 1.
1
Returns:
Type DescriptionList[GenerateOutput]
List of merged responses, where each merged response contains n generations
List[GenerateOutput]
and their corresponding statistics.
Source code insrc/distilabel/models/llms/base.py
def merge_responses(\n responses: List[\"GenerateOutput\"], n: int = 1\n) -> List[\"GenerateOutput\"]:\n \"\"\"Helper function to group the responses from `LLM.agenerate` method according\n to the number of generations requested.\n\n Args:\n responses: the responses from the `LLM.agenerate` method.\n n: number of responses to group together. Defaults to 1.\n\n Returns:\n List of merged responses, where each merged response contains n generations\n and their corresponding statistics.\n \"\"\"\n if not responses:\n return []\n\n def chunks(lst, n):\n \"\"\"Yield successive n-sized chunks from lst.\"\"\"\n for i in range(0, len(lst), n):\n yield list(islice(lst, i, i + n))\n\n extra_keys = [\n key for key in responses[0].keys() if key not in (\"generations\", \"statistics\")\n ]\n\n result = []\n for group in chunks(responses, n):\n merged = {\n \"generations\": [],\n \"statistics\": {\"input_tokens\": [], \"output_tokens\": []},\n }\n for response in group:\n merged[\"generations\"].append(response[\"generations\"][0])\n # Merge statistics\n for key in response[\"statistics\"]:\n if key not in merged[\"statistics\"]:\n merged[\"statistics\"][key] = []\n merged[\"statistics\"][key].append(response[\"statistics\"][key][0])\n # Merge extra keys returned by the `LLM`\n for extra_key in extra_keys:\n if extra_key not in merged:\n merged[extra_key] = []\n merged[extra_key].append(response[extra_key][0])\n result.append(merged)\n return result\n
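An illustrative call (token counts are invented): two single-generation responses for the same input are merged into one output with n=2 generations:
from distilabel.models.llms.base import merge_responses

responses = [
    {"generations": ["Hi"], "statistics": {"input_tokens": [3], "output_tokens": [1]}},
    {"generations": ["Hello"], "statistics": {"input_tokens": [3], "output_tokens": [2]}},
]

merged = merge_responses(responses, n=2)
# merged == [
#     {
#         "generations": ["Hi", "Hello"],
#         "statistics": {"input_tokens": [3, 3], "output_tokens": [1, 2]},
#     }
# ]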
"},{"location":"api/models/llm/llm_gallery/","title":"LLM Gallery","text":"This section contains the existing LLM
subclasses implemented in distilabel
.
llms
","text":""},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.AnthropicLLM","title":"AnthropicLLM
","text":" Bases: AsyncLLM
Anthropic LLM implementation running the Async API client.
Attributes:
Name Type Descriptionmodel
str
the name of the model to use for the LLM e.g. \"claude-3-opus-20240229\", \"claude-3-sonnet-20240229\", etc. Available models can be checked here: Anthropic: Models overview.
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the Anthropic API. If not provided, it will be read from ANTHROPIC_API_KEY
environment variable.
base_url
Optional[RuntimeParameter[str]]
the base URL to use for the Anthropic API. Defaults to None
which means that https://api.anthropic.com
will be used internally.
timeout
RuntimeParameter[float]
the maximum time in seconds to wait for a response. Defaults to 600.0
.
max_retries
RuntimeParameter[int]
The maximum number of times to retry the request before failing. Defaults to 6
.
http_client
Optional[AsyncClient]
if provided, an alternative HTTP client to use for calling Anthropic API. Defaults to None
.
structured_output
Optional[RuntimeParameter[InstructorStructuredOutputType]]
a dictionary containing the structured output configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
_api_key_env_var
str
the name of the environment variable to use for the API key. It is meant to be used internally.
_aclient
Optional[AsyncAnthropic]
the AsyncAnthropic
client to use for the Anthropic API. It is meant to be used internally. Set in the load
method.
api_key
: the API key to authenticate the requests to the Anthropic API. If not provided, it will be read from ANTHROPIC_API_KEY
environment variable.base_url
: the base URL to use for the Anthropic API. Defaults to \"https://api.anthropic.com\"
.timeout
: the maximum time in seconds to wait for a response. Defaults to 600.0
.max_retries
: the maximum number of times to retry the request before failing. Defaults to 6
.Examples:
Generate text:
from distilabel.models.llms import AnthropicLLM\n\nllm = AnthropicLLM(model=\"claude-3-opus-20240229\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pydantic import BaseModel\nfrom distilabel.models.llms import AnthropicLLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = AnthropicLLM(\n model=\"claude-3-opus-20240229\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
Source code in src/distilabel/models/llms/anthropic.py
class AnthropicLLM(AsyncLLM):\n \"\"\"Anthropic LLM implementation running the Async API client.\n\n Attributes:\n model: the name of the model to use for the LLM e.g. \"claude-3-opus-20240229\",\n \"claude-3-sonnet-20240229\", etc. Available models can be checked here:\n [Anthropic: Models overview](https://docs.anthropic.com/claude/docs/models-overview).\n api_key: the API key to authenticate the requests to the Anthropic API. If not provided,\n it will be read from `ANTHROPIC_API_KEY` environment variable.\n base_url: the base URL to use for the Anthropic API. Defaults to `None` which means\n that `https://api.anthropic.com` will be used internally.\n timeout: the maximum time in seconds to wait for a response. Defaults to `600.0`.\n max_retries: The maximum number of times to retry the request before failing. Defaults\n to `6`.\n http_client: if provided, an alternative HTTP client to use for calling Anthropic\n API. Defaults to `None`.\n structured_output: a dictionary containing the structured output configuration configuration\n using `instructor`. You can take a look at the dictionary structure in\n `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.\n _api_key_env_var: the name of the environment variable to use for the API key. It\n is meant to be used internally.\n _aclient: the `AsyncAnthropic` client to use for the Anthropic API. It is meant\n to be used internally. Set in the `load` method.\n\n Runtime parameters:\n - `api_key`: the API key to authenticate the requests to the Anthropic API. If not\n provided, it will be read from `ANTHROPIC_API_KEY` environment variable.\n - `base_url`: the base URL to use for the Anthropic API. Defaults to `\"https://api.anthropic.com\"`.\n - `timeout`: the maximum time in seconds to wait for a response. 
Defaults to `600.0`.\n - `max_retries`: the maximum number of times to retry the request before failing.\n Defaults to `6`.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import AnthropicLLM\n\n llm = AnthropicLLM(model=\"claude-3-opus-20240229\", api_key=\"api.key\")\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Generate structured data:\n\n ```python\n from pydantic import BaseModel\n from distilabel.models.llms import AnthropicLLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = AnthropicLLM(\n model=\"claude-3-opus-20240229\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n \"\"\"\n\n model: str\n base_url: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(\n \"ANTHROPIC_BASE_URL\", \"https://api.anthropic.com\"\n ),\n description=\"The base URL to use for the Anthropic API.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_ANTHROPIC_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the Anthropic API.\",\n )\n timeout: RuntimeParameter[float] = Field(\n default=600.0,\n description=\"The maximum time in seconds to wait for a response from the API.\",\n )\n max_retries: RuntimeParameter[int] = Field(\n default=6,\n description=\"The maximum number of times to retry the request to the API before\"\n \" failing.\",\n )\n http_client: Optional[AsyncClient] = Field(default=None, exclude=True)\n structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (\n Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n )\n\n _num_generations_param_supported = False\n\n _api_key_env_var: str = PrivateAttr(default=_ANTHROPIC_API_KEY_ENV_VAR_NAME)\n _aclient: Optional[\"AsyncAnthropic\"] = PrivateAttr(...)\n\n def _check_model_exists(self) -> None:\n \"\"\"Checks if the specified model exists in the available models.\"\"\"\n from anthropic import AsyncAnthropic\n\n annotation = get_type_hints(AsyncAnthropic().messages.create).get(\"model\", None)\n models = [\n value\n for type_ in get_args(annotation)\n if get_origin(type_) is Literal\n for value in get_args(type_)\n ]\n\n if self.model not in models:\n raise ValueError(\n f\"Model {self.model} does not exist among available models. \"\n f\"The available models are {', '.join(models)}\"\n )\n\n def load(self) -> None:\n \"\"\"Loads the `AsyncAnthropic` client to use the Anthropic async API.\"\"\"\n super().load()\n\n try:\n from anthropic import AsyncAnthropic\n except ImportError as ie:\n raise ImportError(\n \"Anthropic Python client is not installed. 
Please install it using\"\n \" `pip install anthropic`.\"\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n self._check_model_exists()\n\n self._aclient = AsyncAnthropic(\n api_key=self.api_key.get_secret_value(),\n base_url=self.base_url,\n timeout=self.timeout,\n http_client=self.http_client,\n max_retries=self.max_retries,\n )\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"anthropic\",\n )\n self._aclient = result.get(\"client\")\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n @validate_call\n async def agenerate( # type: ignore\n self,\n input: FormattedInput,\n max_tokens: int = 128,\n stop_sequences: Union[List[str], None] = None,\n temperature: float = 1.0,\n top_p: Union[float, None] = None,\n top_k: Union[int, None] = None,\n ) -> GenerateOutput:\n \"\"\"Generates a response asynchronously, using the [Anthropic Async API definition](https://github.com/anthropics/anthropic-sdk-python).\n\n Args:\n input: a single input in chat format to generate responses for.\n max_tokens: the maximum number of new tokens that the model will generate. Defaults to `128`.\n stop_sequences: custom text sequences that will cause the model to stop generating. Defaults to `NOT_GIVEN`.\n temperature: the temperature to use for the generation. Set only if top_p is None. Defaults to `1.0`.\n top_p: the top-p value to use for the generation. Defaults to `NOT_GIVEN`.\n top_k: the top-k value to use for the generation. 
Defaults to `NOT_GIVEN`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n from anthropic._types import NOT_GIVEN\n\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output,\n client=self._aclient,\n framework=\"anthropic\",\n )\n self._aclient = result.get(\"client\")\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"messages\": input, # type: ignore\n \"model\": self.model,\n \"system\": (\n input.pop(0)[\"content\"]\n if input and input[0][\"role\"] == \"system\"\n else NOT_GIVEN\n ),\n \"max_tokens\": max_tokens,\n \"stream\": False,\n \"stop_sequences\": NOT_GIVEN if stop_sequences is None else stop_sequences,\n \"temperature\": temperature,\n \"top_p\": NOT_GIVEN if top_p is None else top_p,\n \"top_k\": NOT_GIVEN if top_k is None else top_k,\n }\n\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output)\n\n completion: Union[\"Message\", \"BaseModel\"] = await self._aclient.messages.create(\n **kwargs\n ) # type: ignore\n if structured_output:\n # raw_response = completion._raw_response\n return prepare_output(\n [completion.model_dump_json()],\n **self._get_llm_statistics(completion._raw_response),\n )\n\n if (content := completion.content[0].text) is None:\n self._logger.warning(\n f\"Received no response using Anthropic client (model: '{self.model}').\"\n f\" Finish reason was: {completion.stop_reason}\"\n )\n return prepare_output([content], **self._get_llm_statistics(completion))\n\n @staticmethod\n def _get_llm_statistics(completion: \"Message\") -> \"LLMStatistics\":\n return {\n \"input_tokens\": [completion.usage.input_tokens],\n \"output_tokens\": [completion.usage.output_tokens],\n }\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.AnthropicLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.AnthropicLLM._check_model_exists","title":"_check_model_exists()
","text":"Checks if the specified model exists in the available models.
Source code insrc/distilabel/models/llms/anthropic.py
def _check_model_exists(self) -> None:\n \"\"\"Checks if the specified model exists in the available models.\"\"\"\n from anthropic import AsyncAnthropic\n\n annotation = get_type_hints(AsyncAnthropic().messages.create).get(\"model\", None)\n models = [\n value\n for type_ in get_args(annotation)\n if get_origin(type_) is Literal\n for value in get_args(type_)\n ]\n\n if self.model not in models:\n raise ValueError(\n f\"Model {self.model} does not exist among available models. \"\n f\"The available models are {', '.join(models)}\"\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.AnthropicLLM.load","title":"load()
","text":"Loads the AsyncAnthropic
client to use the Anthropic async API.
src/distilabel/models/llms/anthropic.py
def load(self) -> None:\n \"\"\"Loads the `AsyncAnthropic` client to use the Anthropic async API.\"\"\"\n super().load()\n\n try:\n from anthropic import AsyncAnthropic\n except ImportError as ie:\n raise ImportError(\n \"Anthropic Python client is not installed. Please install it using\"\n \" `pip install anthropic`.\"\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n self._check_model_exists()\n\n self._aclient = AsyncAnthropic(\n api_key=self.api_key.get_secret_value(),\n base_url=self.base_url,\n timeout=self.timeout,\n http_client=self.http_client,\n max_retries=self.max_retries,\n )\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"anthropic\",\n )\n self._aclient = result.get(\"client\")\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.AnthropicLLM.agenerate","title":"agenerate(input, max_tokens=128, stop_sequences=None, temperature=1.0, top_p=None, top_k=None)
async
","text":"Generates a response asynchronously, using the Anthropic Async API definition.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requiredmax_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
stop_sequences
Union[List[str], None]
custom text sequences that will cause the model to stop generating. Defaults to NOT_GIVEN
.
None
temperature
float
the temperature to use for the generation. Set only if top_p is None. Defaults to 1.0
.
1.0
top_p
Union[float, None]
the top-p value to use for the generation. Defaults to NOT_GIVEN
.
None
top_k
Union[int, None]
the top-k value to use for the generation. Defaults to NOT_GIVEN
.
None
Returns:
Type DescriptionGenerateOutput
A list of lists of strings containing the generated responses for each input.
Source code insrc/distilabel/models/llms/anthropic.py
@validate_call\nasync def agenerate( # type: ignore\n self,\n input: FormattedInput,\n max_tokens: int = 128,\n stop_sequences: Union[List[str], None] = None,\n temperature: float = 1.0,\n top_p: Union[float, None] = None,\n top_k: Union[int, None] = None,\n) -> GenerateOutput:\n \"\"\"Generates a response asynchronously, using the [Anthropic Async API definition](https://github.com/anthropics/anthropic-sdk-python).\n\n Args:\n input: a single input in chat format to generate responses for.\n max_tokens: the maximum number of new tokens that the model will generate. Defaults to `128`.\n stop_sequences: custom text sequences that will cause the model to stop generating. Defaults to `NOT_GIVEN`.\n temperature: the temperature to use for the generation. Set only if top_p is None. Defaults to `1.0`.\n top_p: the top-p value to use for the generation. Defaults to `NOT_GIVEN`.\n top_k: the top-k value to use for the generation. Defaults to `NOT_GIVEN`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n from anthropic._types import NOT_GIVEN\n\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output,\n client=self._aclient,\n framework=\"anthropic\",\n )\n self._aclient = result.get(\"client\")\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"messages\": input, # type: ignore\n \"model\": self.model,\n \"system\": (\n input.pop(0)[\"content\"]\n if input and input[0][\"role\"] == \"system\"\n else NOT_GIVEN\n ),\n \"max_tokens\": max_tokens,\n \"stream\": False,\n \"stop_sequences\": NOT_GIVEN if stop_sequences is None else stop_sequences,\n \"temperature\": temperature,\n \"top_p\": NOT_GIVEN if top_p is None else top_p,\n \"top_k\": NOT_GIVEN if top_k is None else top_k,\n }\n\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output)\n\n completion: Union[\"Message\", \"BaseModel\"] = await self._aclient.messages.create(\n **kwargs\n ) # type: ignore\n if structured_output:\n # raw_response = completion._raw_response\n return prepare_output(\n [completion.model_dump_json()],\n **self._get_llm_statistics(completion._raw_response),\n )\n\n if (content := completion.content[0].text) is None:\n self._logger.warning(\n f\"Received no response using Anthropic client (model: '{self.model}').\"\n f\" Finish reason was: {completion.stop_reason}\"\n )\n return prepare_output([content], **self._get_llm_statistics(completion))\n
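Although generate_outputs is the usual entry point, agenerate can also be awaited directly; a hedged sketch (the API key and sampling values are placeholders):
import asyncio

from distilabel.models.llms import AnthropicLLM

llm = AnthropicLLM(model="claude-3-opus-20240229", api_key="api.key")
llm.load()

output = asyncio.run(
    llm.agenerate(
        input=[{"role": "user", "content": "Hello world!"}],
        max_tokens=64,
        temperature=0.7,
    )
)
# `output["generations"]` holds the generated text and `output["statistics"]` the token usage.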
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.AnyscaleLLM","title":"AnyscaleLLM
","text":" Bases: OpenAILLM
Anyscale LLM implementation running the async API client of OpenAI.
Attributes:
Name Type Descriptionmodel
str
the model name to use for the LLM, e.g., google/gemma-7b-it
. See the supported models under the \"Text Generation -> Supported Models\" section here.
base_url
Optional[RuntimeParameter[str]]
the base URL to use for the Anyscale API requests. Defaults to None
, which means that the value set for the environment variable ANYSCALE_BASE_URL
will be used, or \"https://api.endpoints.anyscale.com/v1\" if not set.
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the Anyscale API. Defaults to None
which means that the value set for the environment variable ANYSCALE_API_KEY
will be used, or None
if not set.
_api_key_env_var
str
the name of the environment variable to use for the API key. It is meant to be used internally.
Examples:
Generate text:
from distilabel.models.llms import AnyscaleLLM\n\nllm = AnyscaleLLM(model=\"google/gemma-7b-it\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Source code in src/distilabel/models/llms/anyscale.py
class AnyscaleLLM(OpenAILLM):\n \"\"\"Anyscale LLM implementation running the async API client of OpenAI.\n\n Attributes:\n model: the model name to use for the LLM, e.g., `google/gemma-7b-it`. See the\n supported models under the \"Text Generation -> Supported Models\" section\n [here](https://docs.endpoints.anyscale.com/).\n base_url: the base URL to use for the Anyscale API requests. Defaults to `None`, which\n means that the value set for the environment variable `ANYSCALE_BASE_URL` will be used, or\n \"https://api.endpoints.anyscale.com/v1\" if not set.\n api_key: the API key to authenticate the requests to the Anyscale API. Defaults to `None` which\n means that the value set for the environment variable `ANYSCALE_API_KEY` will be used, or\n `None` if not set.\n _api_key_env_var: the name of the environment variable to use for the API key.\n It is meant to be used internally.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import AnyscaleLLM\n\n llm = AnyscaleLLM(model=\"google/gemma-7b-it\", api_key=\"api.key\")\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n \"\"\"\n\n base_url: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(\n \"ANYSCALE_BASE_URL\", \"https://api.endpoints.anyscale.com/v1\"\n ),\n description=\"The base URL to use for the Anyscale API requests.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_ANYSCALE_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the Anyscale API.\",\n )\n\n _api_key_env_var: str = PrivateAttr(_ANYSCALE_API_KEY_ENV_VAR_NAME)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.AzureOpenAILLM","title":"AzureOpenAILLM
","text":" Bases: OpenAILLM
Azure OpenAI LLM implementation running the async API client.
Attributes:
Name Type Descriptionmodel
str
the model name to use for the LLM i.e. the name of the Azure deployment.
base_url
Optional[RuntimeParameter[str]]
the base URL to use for the Azure OpenAI API can be set with AZURE_OPENAI_ENDPOINT
. Defaults to None
which means that the value set for the environment variable AZURE_OPENAI_ENDPOINT
will be used, or None
if not set.
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the Azure OpenAI API. Defaults to None
which means that the value set for the environment variable AZURE_OPENAI_API_KEY
will be used, or None
if not set.
api_version
Optional[RuntimeParameter[str]]
the API version to use for the Azure OpenAI API. Defaults to None
which means that the value set for the environment variable OPENAI_API_VERSION
will be used, or None
if not set.
Examples:
Generate text:
from distilabel.models.llms import AzureOpenAILLM\n\nllm = AzureOpenAILLM(model=\"gpt-4-turbo\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate text from a custom endpoint following the OpenAI API:
from distilabel.models.llms import AzureOpenAILLM\n\nllm = AzureOpenAILLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n base_url=r\"http://localhost:8080/v1\"\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pydantic import BaseModel\nfrom distilabel.models.llms import AzureOpenAILLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = AzureOpenAILLM(\n model=\"gpt-4-turbo\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
Source code in src/distilabel/models/llms/azure.py
class AzureOpenAILLM(OpenAILLM):\n \"\"\"Azure OpenAI LLM implementation running the async API client.\n\n Attributes:\n model: the model name to use for the LLM i.e. the name of the Azure deployment.\n base_url: the base URL to use for the Azure OpenAI API can be set with `AZURE_OPENAI_ENDPOINT`.\n Defaults to `None` which means that the value set for the environment variable\n `AZURE_OPENAI_ENDPOINT` will be used, or `None` if not set.\n api_key: the API key to authenticate the requests to the Azure OpenAI API. Defaults to `None`\n which means that the value set for the environment variable `AZURE_OPENAI_API_KEY` will be\n used, or `None` if not set.\n api_version: the API version to use for the Azure OpenAI API. Defaults to `None` which means\n that the value set for the environment variable `OPENAI_API_VERSION` will be used, or\n `None` if not set.\n\n Icon:\n `:material-microsoft-azure:`\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import AzureOpenAILLM\n\n llm = AzureOpenAILLM(model=\"gpt-4-turbo\", api_key=\"api.key\")\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Generate text from a custom endpoint following the OpenAI API:\n\n ```python\n from distilabel.models.llms import AzureOpenAILLM\n\n llm = AzureOpenAILLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n base_url=r\"http://localhost:8080/v1\"\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Generate structured data:\n\n ```python\n from pydantic import BaseModel\n from distilabel.models.llms import AzureOpenAILLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = AzureOpenAILLM(\n model=\"gpt-4-turbo\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n \"\"\"\n\n base_url: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(_AZURE_OPENAI_ENDPOINT_ENV_VAR_NAME),\n description=\"The base URL to use for the Azure OpenAI API requests i.e. the Azure OpenAI endpoint.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_AZURE_OPENAI_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the Azure OpenAI API.\",\n )\n\n api_version: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(\"OPENAI_API_VERSION\"),\n description=\"The API version to use for the Azure OpenAI API.\",\n )\n\n _base_url_env_var: str = PrivateAttr(_AZURE_OPENAI_ENDPOINT_ENV_VAR_NAME)\n _api_key_env_var: str = PrivateAttr(_AZURE_OPENAI_API_KEY_ENV_VAR_NAME)\n _aclient: Optional[\"AsyncAzureOpenAI\"] = PrivateAttr(...) # type: ignore\n\n @override\n def load(self) -> None:\n \"\"\"Loads the `AsyncAzureOpenAI` client to benefit from async requests.\"\"\"\n # This is a workaround to avoid the `OpenAILLM` calling the _prepare_structured_output\n # in the load method before we have the proper client.\n with patch(\n \"distilabel.models.openai.OpenAILLM._prepare_structured_output\", lambda x: x\n ):\n super().load()\n\n try:\n from openai import AsyncAzureOpenAI\n except ImportError as ie:\n raise ImportError(\n \"OpenAI Python client is not installed. 
Please install it using\"\n \" `pip install openai`.\"\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n # TODO: May be worth adding the AD auth too? Also the `organization`?\n self._aclient = AsyncAzureOpenAI( # type: ignore\n azure_endpoint=self.base_url, # type: ignore\n azure_deployment=self.model,\n api_version=self.api_version,\n api_key=self.api_key.get_secret_value(),\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n if self.structured_output:\n self._prepare_structured_output(self.structured_output)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.AzureOpenAILLM.load","title":"load()
","text":"Loads the AsyncAzureOpenAI
client to benefit from async requests.
src/distilabel/models/llms/azure.py
@override\ndef load(self) -> None:\n \"\"\"Loads the `AsyncAzureOpenAI` client to benefit from async requests.\"\"\"\n # This is a workaround to avoid the `OpenAILLM` calling the _prepare_structured_output\n # in the load method before we have the proper client.\n with patch(\n \"distilabel.models.openai.OpenAILLM._prepare_structured_output\", lambda x: x\n ):\n super().load()\n\n try:\n from openai import AsyncAzureOpenAI\n except ImportError as ie:\n raise ImportError(\n \"OpenAI Python client is not installed. Please install it using\"\n \" `pip install openai`.\"\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n # TODO: May be worth adding the AD auth too? Also the `organization`?\n self._aclient = AsyncAzureOpenAI( # type: ignore\n azure_endpoint=self.base_url, # type: ignore\n azure_deployment=self.model,\n api_version=self.api_version,\n api_key=self.api_key.get_secret_value(),\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n if self.structured_output:\n self._prepare_structured_output(self.structured_output)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CohereLLM","title":"CohereLLM
","text":" Bases: AsyncLLM
Cohere API implementation using the async client for concurrent text generation.
Attributes:
Name Type Descriptionmodel
str
the name of the model from the Cohere API to use for the generation.
base_url
Optional[RuntimeParameter[str]]
the base URL to use for the Cohere API requests. Defaults to \"https://api.cohere.ai/v1\"
.
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the Cohere API. Defaults to the value of the COHERE_API_KEY
environment variable.
timeout
RuntimeParameter[int]
the maximum time in seconds to wait for a response from the API. Defaults to 120
.
client_name
RuntimeParameter[str]
the name of the client to use for the API requests. Defaults to \"distilabel\"
.
structured_output
Optional[RuntimeParameter[InstructorStructuredOutputType]]
a dictionary containing the structured output configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
_ChatMessage
Type[ChatMessage]
the ChatMessage
class from the cohere
package.
_aclient
AsyncClient
the AsyncClient
client from the cohere
package.
base_url
: the base URL to use for the Cohere API requests. Defaults to \"https://api.cohere.ai/v1\"
.api_key
: the API key to authenticate the requests to the Cohere API. Defaults to the value of the COHERE_API_KEY
environment variable.timeout
: the maximum time in seconds to wait for a response from the API. Defaults to 120
.client_name
: the name of the client to use for the API requests. Defaults to \"distilabel\"
.Examples:
Generate text:
from distilabel.models.llms import CohereLLM\n\nllm = CohereLLM(model=\"CohereForAI/c4ai-command-r-plus\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pydantic import BaseModel\nfrom distilabel.models.llms import CohereLLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = CohereLLM(\n model=\"CohereForAI/c4ai-command-r-plus\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
Source code in src/distilabel/models/llms/cohere.py
class CohereLLM(AsyncLLM):\n \"\"\"Cohere API implementation using the async client for concurrent text generation.\n\n Attributes:\n model: the name of the model from the Cohere API to use for the generation.\n base_url: the base URL to use for the Cohere API requests. Defaults to\n `\"https://api.cohere.ai/v1\"`.\n api_key: the API key to authenticate the requests to the Cohere API. Defaults to\n the value of the `COHERE_API_KEY` environment variable.\n timeout: the maximum time in seconds to wait for a response from the API. Defaults\n to `120`.\n client_name: the name of the client to use for the API requests. Defaults to\n `\"distilabel\"`.\n structured_output: a dictionary containing the structured output configuration configuration\n using `instructor`. You can take a look at the dictionary structure in\n `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.\n _ChatMessage: the `ChatMessage` class from the `cohere` package.\n _aclient: the `AsyncClient` client from the `cohere` package.\n\n Runtime parameters:\n - `base_url`: the base URL to use for the Cohere API requests. Defaults to\n `\"https://api.cohere.ai/v1\"`.\n - `api_key`: the API key to authenticate the requests to the Cohere API. Defaults\n to the value of the `COHERE_API_KEY` environment variable.\n - `timeout`: the maximum time in seconds to wait for a response from the API. Defaults\n to `120`.\n - `client_name`: the name of the client to use for the API requests. Defaults to\n `\"distilabel\"`.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import CohereLLM\n\n llm = CohereLLM(model=\"CohereForAI/c4ai-command-r-plus\")\n\n llm.load()\n\n # Call the model\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n\n Generate structured data:\n\n ```python\n from pydantic import BaseModel\n from distilabel.models.llms import CohereLLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = CohereLLM(\n model=\"CohereForAI/c4ai-command-r-plus\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n \"\"\"\n\n model: str\n base_url: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(\n \"COHERE_BASE_URL\", \"https://api.cohere.ai/v1\"\n ),\n description=\"The base URL to use for the Cohere API requests.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_COHERE_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the Cohere API.\",\n )\n timeout: RuntimeParameter[int] = Field(\n default=120,\n description=\"The maximum time in seconds to wait for a response from the API.\",\n )\n client_name: RuntimeParameter[str] = Field(\n default=\"distilabel\",\n description=\"The name of the client to use for the API requests.\",\n )\n structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (\n Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n )\n\n _num_generations_param_supported = False\n\n _ChatMessage: Type[\"ChatMessage\"] = PrivateAttr(...)\n _aclient: \"AsyncClient\" = PrivateAttr(...)\n _tokenizer: \"Tokenizer\" = PrivateAttr(...)\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return 
self.model\n\n def load(self) -> None:\n \"\"\"Loads the `AsyncClient` client from the `cohere` package.\"\"\"\n\n super().load()\n\n try:\n from cohere import AsyncClient, ChatMessage\n except ImportError as ie:\n raise ImportError(\n \"The `cohere` package is required to use the `CohereLLM` class.\"\n ) from ie\n\n self._ChatMessage = ChatMessage\n\n self._aclient = AsyncClient(\n api_key=self.api_key.get_secret_value(), # type: ignore\n client_name=self.client_name,\n base_url=self.base_url,\n timeout=self.timeout,\n )\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"cohere\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n\n from cohere.manually_maintained.tokenizers import get_hf_tokenizer\n\n self._tokenizer: \"Tokenizer\" = get_hf_tokenizer(self._aclient, self.model)\n\n def _format_chat_to_cohere(\n self, input: \"FormattedInput\"\n ) -> Tuple[Union[str, None], List[\"ChatMessage\"], str]:\n \"\"\"Formats the chat input to the Cohere Chat API conversational format.\n\n Args:\n input: The chat input to format.\n\n Returns:\n A tuple containing the system, chat history, and message.\n \"\"\"\n system = None\n message = None\n chat_history = []\n for item in input:\n role = item[\"role\"]\n content = item[\"content\"]\n if role == \"system\":\n system = content\n elif role == \"user\":\n message = content\n elif role == \"assistant\":\n if message is None:\n raise ValueError(\n \"An assistant message but be preceded by a user message.\"\n )\n chat_history.append(self._ChatMessage(role=\"USER\", message=message)) # type: ignore\n chat_history.append(self._ChatMessage(role=\"CHATBOT\", message=content)) # type: ignore\n message = None\n\n if message is None:\n raise ValueError(\"The chat input must end with a user message.\")\n\n return system, chat_history, message\n\n @validate_call\n async def agenerate( # type: ignore\n self,\n input: FormattedInput,\n temperature: Optional[float] = None,\n max_tokens: Optional[int] = None,\n k: Optional[int] = None,\n p: Optional[float] = None,\n seed: Optional[float] = None,\n stop_sequences: Optional[Sequence[str]] = None,\n frequency_penalty: Optional[float] = None,\n presence_penalty: Optional[float] = None,\n raw_prompting: Optional[bool] = None,\n ) -> GenerateOutput:\n \"\"\"Generates a response from the LLM given an input.\n\n Args:\n input: a single input in chat format to generate responses for.\n temperature: the temperature to use for the generation. Defaults to `None`.\n max_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `None`.\n k: the number of highest probability vocabulary tokens to keep for the generation.\n Defaults to `None`.\n p: the nucleus sampling probability to use for the generation. Defaults to\n `None`.\n seed: the seed to use for the generation. Defaults to `None`.\n stop_sequences: a list of sequences to use as stopping criteria for the generation.\n Defaults to `None`.\n frequency_penalty: the frequency penalty to use for the generation. Defaults\n to `None`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `None`.\n raw_prompting: a flag to use raw prompting for the generation. 
Defaults to\n `None`.\n\n Returns:\n The generated response from the Cohere API model.\n \"\"\"\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output, # type: ignore\n client=self._aclient,\n framework=\"cohere\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n system, chat_history, message = self._format_chat_to_cohere(input)\n\n kwargs = {\n \"message\": message,\n \"model\": self.model,\n \"preamble\": system,\n \"chat_history\": chat_history,\n \"temperature\": temperature,\n \"max_tokens\": max_tokens,\n \"k\": k,\n \"p\": p,\n \"seed\": seed,\n \"stop_sequences\": stop_sequences,\n \"frequency_penalty\": frequency_penalty,\n \"presence_penalty\": presence_penalty,\n \"raw_prompting\": raw_prompting,\n }\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output) # type: ignore\n\n response: Union[\"Message\", \"BaseModel\"] = await self._aclient.chat(**kwargs) # type: ignore\n\n if structured_output:\n return prepare_output(\n [response.model_dump_json()],\n **self._get_llm_statistics(\n input, orjson.dumps(response.model_dump_json()).decode(\"utf-8\")\n ), # type: ignore\n )\n\n if (text := response.text) == \"\":\n self._logger.warning( # type: ignore\n f\"Received no response using Cohere client (model: '{self.model}').\"\n f\" Finish reason was: {response.finish_reason}\"\n )\n return prepare_output(\n [None],\n **self._get_llm_statistics(input, \"\"),\n )\n\n return prepare_output(\n [text],\n **self._get_llm_statistics(input, text),\n )\n\n def _get_llm_statistics(\n self, input: FormattedInput, output: str\n ) -> \"LLMStatistics\":\n return {\n \"input_tokens\": [compute_tokens(input, self._tokenizer.encode)],\n \"output_tokens\": [compute_tokens(output, self._tokenizer.encode)],\n }\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CohereLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CohereLLM.load","title":"load()
","text":"Loads the AsyncClient
client from the cohere
package.
src/distilabel/models/llms/cohere.py
def load(self) -> None:\n \"\"\"Loads the `AsyncClient` client from the `cohere` package.\"\"\"\n\n super().load()\n\n try:\n from cohere import AsyncClient, ChatMessage\n except ImportError as ie:\n raise ImportError(\n \"The `cohere` package is required to use the `CohereLLM` class.\"\n ) from ie\n\n self._ChatMessage = ChatMessage\n\n self._aclient = AsyncClient(\n api_key=self.api_key.get_secret_value(), # type: ignore\n client_name=self.client_name,\n base_url=self.base_url,\n timeout=self.timeout,\n )\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"cohere\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n\n from cohere.manually_maintained.tokenizers import get_hf_tokenizer\n\n self._tokenizer: \"Tokenizer\" = get_hf_tokenizer(self._aclient, self.model)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CohereLLM._format_chat_to_cohere","title":"_format_chat_to_cohere(input)
","text":"Formats the chat input to the Cohere Chat API conversational format.
Parameters:
Name Type Description Defaultinput
FormattedInput
The chat input to format.
requiredReturns:
Type DescriptionTuple[Union[str, None], List[ChatMessage], str]
A tuple containing the system, chat history, and message.
Source code insrc/distilabel/models/llms/cohere.py
def _format_chat_to_cohere(\n    self, input: \"FormattedInput\"\n) -> Tuple[Union[str, None], List[\"ChatMessage\"], str]:\n    \"\"\"Formats the chat input to the Cohere Chat API conversational format.\n\n    Args:\n        input: The chat input to format.\n\n    Returns:\n        A tuple containing the system, chat history, and message.\n    \"\"\"\n    system = None\n    message = None\n    chat_history = []\n    for item in input:\n        role = item[\"role\"]\n        content = item[\"content\"]\n        if role == \"system\":\n            system = content\n        elif role == \"user\":\n            message = content\n        elif role == \"assistant\":\n            if message is None:\n                raise ValueError(\n                    \"An assistant message must be preceded by a user message.\"\n                )\n            chat_history.append(self._ChatMessage(role=\"USER\", message=message))  # type: ignore\n            chat_history.append(self._ChatMessage(role=\"CHATBOT\", message=content))  # type: ignore\n            message = None\n\n    if message is None:\n        raise ValueError(\"The chat input must end with a user message.\")\n\n    return system, chat_history, message\n
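The conversion above can be pictured with a small, hedged sketch (assuming llm is a loaded CohereLLM instance, so the private helper and its ChatMessage class are available): a system message becomes the preamble, alternating user/assistant turns become USER/CHATBOT history entries, and the trailing user message is sent as the message.
conversation = [\n    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n    {\"role\": \"user\", \"content\": \"Hi!\"},\n    {\"role\": \"assistant\", \"content\": \"Hello! How can I help?\"},\n    {\"role\": \"user\", \"content\": \"Tell me a joke.\"},\n]\n\n# Illustrative call to the private helper shown above\nsystem, chat_history, message = llm._format_chat_to_cohere(conversation)\n# system       -> \"You are a helpful assistant.\"\n# chat_history -> [ChatMessage(role=\"USER\", ...), ChatMessage(role=\"CHATBOT\", ...)]\n# message      -> \"Tell me a joke.\"\n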
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CohereLLM.agenerate","title":"agenerate(input, temperature=None, max_tokens=None, k=None, p=None, seed=None, stop_sequences=None, frequency_penalty=None, presence_penalty=None, raw_prompting=None)
async
","text":"Generates a response from the LLM given an input.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requiredtemperature
Optional[float]
the temperature to use for the generation. Defaults to None
.
None
max_tokens
Optional[int]
the maximum number of new tokens that the model will generate. Defaults to None
.
None
k
Optional[int]
the number of highest probability vocabulary tokens to keep for the generation. Defaults to None
.
None
p
Optional[float]
the nucleus sampling probability to use for the generation. Defaults to None
.
None
seed
Optional[float]
the seed to use for the generation. Defaults to None
.
None
stop_sequences
Optional[Sequence[str]]
a list of sequences to use as stopping criteria for the generation. Defaults to None
.
None
frequency_penalty
Optional[float]
the frequency penalty to use for the generation. Defaults to None
.
None
presence_penalty
Optional[float]
the presence penalty to use for the generation. Defaults to None
.
None
raw_prompting
Optional[bool]
a flag to use raw prompting for the generation. Defaults to None
.
None
Returns:
Type DescriptionGenerateOutput
The generated response from the Cohere API model.
Source code insrc/distilabel/models/llms/cohere.py
@validate_call\nasync def agenerate( # type: ignore\n self,\n input: FormattedInput,\n temperature: Optional[float] = None,\n max_tokens: Optional[int] = None,\n k: Optional[int] = None,\n p: Optional[float] = None,\n seed: Optional[float] = None,\n stop_sequences: Optional[Sequence[str]] = None,\n frequency_penalty: Optional[float] = None,\n presence_penalty: Optional[float] = None,\n raw_prompting: Optional[bool] = None,\n) -> GenerateOutput:\n \"\"\"Generates a response from the LLM given an input.\n\n Args:\n input: a single input in chat format to generate responses for.\n temperature: the temperature to use for the generation. Defaults to `None`.\n max_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `None`.\n k: the number of highest probability vocabulary tokens to keep for the generation.\n Defaults to `None`.\n p: the nucleus sampling probability to use for the generation. Defaults to\n `None`.\n seed: the seed to use for the generation. Defaults to `None`.\n stop_sequences: a list of sequences to use as stopping criteria for the generation.\n Defaults to `None`.\n frequency_penalty: the frequency penalty to use for the generation. Defaults\n to `None`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `None`.\n raw_prompting: a flag to use raw prompting for the generation. Defaults to\n `None`.\n\n Returns:\n The generated response from the Cohere API model.\n \"\"\"\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output, # type: ignore\n client=self._aclient,\n framework=\"cohere\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n system, chat_history, message = self._format_chat_to_cohere(input)\n\n kwargs = {\n \"message\": message,\n \"model\": self.model,\n \"preamble\": system,\n \"chat_history\": chat_history,\n \"temperature\": temperature,\n \"max_tokens\": max_tokens,\n \"k\": k,\n \"p\": p,\n \"seed\": seed,\n \"stop_sequences\": stop_sequences,\n \"frequency_penalty\": frequency_penalty,\n \"presence_penalty\": presence_penalty,\n \"raw_prompting\": raw_prompting,\n }\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output) # type: ignore\n\n response: Union[\"Message\", \"BaseModel\"] = await self._aclient.chat(**kwargs) # type: ignore\n\n if structured_output:\n return prepare_output(\n [response.model_dump_json()],\n **self._get_llm_statistics(\n input, orjson.dumps(response.model_dump_json()).decode(\"utf-8\")\n ), # type: ignore\n )\n\n if (text := response.text) == \"\":\n self._logger.warning( # type: ignore\n f\"Received no response using Cohere client (model: '{self.model}').\"\n f\" Finish reason was: {response.finish_reason}\"\n )\n return prepare_output(\n [None],\n **self._get_llm_statistics(input, \"\"),\n )\n\n return prepare_output(\n [text],\n **self._get_llm_statistics(input, text),\n )\n
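As a minimal usage sketch (not part of the reference above), agenerate can be awaited directly once load() has been called; the model name and the COHERE_API_KEY environment variable are illustrative assumptions.
import asyncio\n\nfrom distilabel.models.llms import CohereLLM\n\n# Assumes the Cohere API key is available in the environment (e.g. COHERE_API_KEY)\nllm = CohereLLM(model=\"command-r\")\nllm.load()\n\noutput = asyncio.run(\n    llm.agenerate(\n        input=[{\"role\": \"user\", \"content\": \"Hello world!\"}],\n        temperature=0.7,\n        max_tokens=256,\n    )\n)\nprint(output)  # GenerateOutput dict with the generations and token statistics\n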
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.GroqLLM","title":"GroqLLM
","text":" Bases: AsyncLLM
Groq API implementation using the async client for concurrent text generation.
Attributes:
Name Type Descriptionmodel
str
the name of the model from the Groq API to use for the generation.
base_url
Optional[RuntimeParameter[str]]
the base URL to use for the Groq API requests. Defaults to \"https://api.groq.com\"
.
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the Groq API. Defaults to the value of the GROQ_API_KEY
environment variable.
max_retries
RuntimeParameter[int]
the maximum number of times to retry the request to the API before failing. Defaults to 2
.
timeout
RuntimeParameter[int]
the maximum time in seconds to wait for a response from the API. Defaults to 120
.
structured_output
Optional[RuntimeParameter[InstructorStructuredOutputType]]
a dictionary containing the structured output configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
_api_key_env_var
str
the name of the environment variable to use for the API key.
_aclient
Optional[AsyncGroq]
the AsyncGroq
client from the groq
package.
base_url
: the base URL to use for the Groq API requests. Defaults to \"https://api.groq.com\"
.api_key
: the API key to authenticate the requests to the Groq API. Defaults to the value of the GROQ_API_KEY
environment variable.max_retries
: the maximum number of times to retry the request to the API before failing. Defaults to 2
.timeout
: the maximum time in seconds to wait for a response from the API. Defaults to 120
.Examples:
Generate text:
from distilabel.models.llms import GroqLLM\n\nllm = GroqLLM(model=\"llama3-70b-8192\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pydantic import BaseModel\nfrom distilabel.models.llms import GroqLLM\n\nclass User(BaseModel):\n    name: str\n    last_name: str\n    id: int\n\nllm = GroqLLM(\n    model=\"llama3-70b-8192\",\n    api_key=\"api.key\",\n    structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
Source code in src/distilabel/models/llms/groq.py
class GroqLLM(AsyncLLM):\n \"\"\"Groq API implementation using the async client for concurrent text generation.\n\n Attributes:\n model: the name of the model from the Groq API to use for the generation.\n base_url: the base URL to use for the Groq API requests. Defaults to\n `\"https://api.groq.com\"`.\n api_key: the API key to authenticate the requests to the Groq API. Defaults to\n the value of the `GROQ_API_KEY` environment variable.\n max_retries: the maximum number of times to retry the request to the API before\n failing. Defaults to `2`.\n timeout: the maximum time in seconds to wait for a response from the API. Defaults\n to `120`.\n structured_output: a dictionary containing the structured output configuration configuration\n using `instructor`. You can take a look at the dictionary structure in\n `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.\n _api_key_env_var: the name of the environment variable to use for the API key.\n _aclient: the `AsyncGroq` client from the `groq` package.\n\n Runtime parameters:\n - `base_url`: the base URL to use for the Groq API requests. Defaults to\n `\"https://api.groq.com\"`.\n - `api_key`: the API key to authenticate the requests to the Groq API. Defaults to\n the value of the `GROQ_API_KEY` environment variable.\n - `max_retries`: the maximum number of times to retry the request to the API before\n failing. Defaults to `2`.\n - `timeout`: the maximum time in seconds to wait for a response from the API. Defaults\n to `120`.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import GroqLLM\n\n llm = GroqLLM(model=\"llama3-70b-8192\")\n\n llm.load()\n\n # Call the model\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n\n Generate structured data:\n\n ```python\n from pydantic import BaseModel\n from distilabel.models.llms import GroqLLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = GroqLLM(\n model=\"llama3-70b-8192\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n \"\"\"\n\n model: str\n\n base_url: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(\n _GROQ_API_BASE_URL_ENV_VAR_NAME, \"https://api.groq.com\"\n ),\n description=\"The base URL to use for the Groq API requests.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_GROQ_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the Groq API.\",\n )\n max_retries: RuntimeParameter[int] = Field(\n default=2,\n description=\"The maximum number of times to retry the request to the API before\"\n \" failing.\",\n )\n timeout: RuntimeParameter[int] = Field(\n default=120,\n description=\"The maximum time in seconds to wait for a response from the API.\",\n )\n structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (\n Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n )\n\n _num_generations_param_supported = False\n\n _api_key_env_var: str = PrivateAttr(_GROQ_API_KEY_ENV_VAR_NAME)\n _aclient: Optional[\"AsyncGroq\"] = PrivateAttr(...)\n\n def load(self) -> None:\n \"\"\"Loads the `AsyncGroq` client to benefit from async requests.\"\"\"\n super().load()\n\n try:\n from groq import 
AsyncGroq\n except ImportError as ie:\n raise ImportError(\n \"Groq Python client is not installed. Please install it using\"\n ' `pip install groq` or from the extras as `pip install \"distilabel[groq]\"`.'\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n self._aclient = AsyncGroq(\n base_url=self.base_url,\n api_key=self.api_key.get_secret_value(),\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"groq\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n @validate_call\n async def agenerate( # type: ignore\n self,\n input: FormattedInput,\n seed: Optional[int] = None,\n max_new_tokens: int = 128,\n temperature: float = 1.0,\n top_p: float = 1.0,\n stop: Optional[str] = None,\n ) -> \"GenerateOutput\":\n \"\"\"Generates `num_generations` responses for the given input using the Groq async\n client.\n\n Args:\n input: a single input in chat format to generate responses for.\n seed: the seed to use for the generation. Defaults to `None`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n stop: the stop sequence to use for the generation. 
Defaults to `None`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n\n References:\n - https://console.groq.com/docs/text-chat\n \"\"\"\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output,\n client=self._aclient,\n framework=\"groq\",\n )\n self._aclient = result.get(\"client\")\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"messages\": input, # type: ignore\n \"model\": self.model,\n \"seed\": seed,\n \"temperature\": temperature,\n \"max_tokens\": max_new_tokens,\n \"top_p\": top_p,\n \"stream\": False,\n \"stop\": stop,\n }\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output)\n\n completion = await self._aclient.chat.completions.create(**kwargs) # type: ignore\n if structured_output:\n return prepare_output(\n [completion.model_dump_json()],\n **self._get_llm_statistics(completion._raw_response),\n )\n\n generations = []\n for choice in completion.choices:\n if (content := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"Received no response using the Groq client (model: '{self.model}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(content)\n return prepare_output(generations, **self._get_llm_statistics(completion))\n\n @staticmethod\n def _get_llm_statistics(completion: \"ChatCompletion\") -> \"LLMStatistics\":\n return {\n \"input_tokens\": [completion.usage.prompt_tokens if completion else 0],\n \"output_tokens\": [completion.usage.completion_tokens if completion else 0],\n }\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.GroqLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.GroqLLM.load","title":"load()
","text":"Loads the AsyncGroq
client to benefit from async requests.
src/distilabel/models/llms/groq.py
def load(self) -> None:\n \"\"\"Loads the `AsyncGroq` client to benefit from async requests.\"\"\"\n super().load()\n\n try:\n from groq import AsyncGroq\n except ImportError as ie:\n raise ImportError(\n \"Groq Python client is not installed. Please install it using\"\n ' `pip install groq` or from the extras as `pip install \"distilabel[groq]\"`.'\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n self._aclient = AsyncGroq(\n base_url=self.base_url,\n api_key=self.api_key.get_secret_value(),\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"groq\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n
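A hedged sketch of providing the API key before instantiation (the key value is a placeholder); without it, load() raises the ValueError shown above.
import os\n\nfrom distilabel.models.llms import GroqLLM\n\nos.environ[\"GROQ_API_KEY\"] = \"gsk-...\"  # placeholder, use a real key\n\nllm = GroqLLM(model=\"llama3-70b-8192\")\nllm.load()  # instantiates the AsyncGroq client (and patches it if structured output is configured)\n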
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.GroqLLM.agenerate","title":"agenerate(input, seed=None, max_new_tokens=128, temperature=1.0, top_p=1.0, stop=None)
async
","text":"Generates num_generations
responses for the given input using the Groq async client.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requiredseed
Optional[int]
the seed to use for the generation. Defaults to None
.
None
max_new_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
temperature
float
the temperature to use for the generation. Defaults to 1.0
.
1.0
top_p
float
the top-p value to use for the generation. Defaults to 1.0
.
1.0
stop
Optional[str]
the stop sequence to use for the generation. Defaults to None
.
None
Returns:
Type DescriptionGenerateOutput
A list of lists of strings containing the generated responses for each input.
References: https://console.groq.com/docs/text-chat Source code in src/distilabel/models/llms/groq.py
@validate_call\nasync def agenerate( # type: ignore\n self,\n input: FormattedInput,\n seed: Optional[int] = None,\n max_new_tokens: int = 128,\n temperature: float = 1.0,\n top_p: float = 1.0,\n stop: Optional[str] = None,\n) -> \"GenerateOutput\":\n \"\"\"Generates `num_generations` responses for the given input using the Groq async\n client.\n\n Args:\n input: a single input in chat format to generate responses for.\n seed: the seed to use for the generation. Defaults to `None`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n stop: the stop sequence to use for the generation. Defaults to `None`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n\n References:\n - https://console.groq.com/docs/text-chat\n \"\"\"\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output,\n client=self._aclient,\n framework=\"groq\",\n )\n self._aclient = result.get(\"client\")\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"messages\": input, # type: ignore\n \"model\": self.model,\n \"seed\": seed,\n \"temperature\": temperature,\n \"max_tokens\": max_new_tokens,\n \"top_p\": top_p,\n \"stream\": False,\n \"stop\": stop,\n }\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output)\n\n completion = await self._aclient.chat.completions.create(**kwargs) # type: ignore\n if structured_output:\n return prepare_output(\n [completion.model_dump_json()],\n **self._get_llm_statistics(completion._raw_response),\n )\n\n generations = []\n for choice in completion.choices:\n if (content := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"Received no response using the Groq client (model: '{self.model}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(content)\n return prepare_output(generations, **self._get_llm_statistics(completion))\n
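As an illustrative sketch only (the GROQ_API_KEY environment variable is assumed to be configured), the generation arguments documented above can be passed when awaiting agenerate directly:
import asyncio\n\nfrom distilabel.models.llms import GroqLLM\n\nllm = GroqLLM(model=\"llama3-70b-8192\")  # assumes GROQ_API_KEY is set\nllm.load()\n\noutput = asyncio.run(\n    llm.agenerate(\n        input=[{\"role\": \"user\", \"content\": \"Hello world!\"}],\n        max_new_tokens=64,\n        temperature=0.2,\n        top_p=0.9,\n        seed=42,\n    )\n)\nprint(output)  # GenerateOutput dict with the generations and token statistics\n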
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.InferenceEndpointsLLM","title":"InferenceEndpointsLLM
","text":" Bases: AsyncLLM
, MagpieChatTemplateMixin
InferenceEndpoints LLM implementation running the async API client.
This LLM will internally use huggingface_hub.AsyncInferenceClient
.
Attributes:
Name Type Descriptionmodel_id
Optional[str]
the model ID to use for the LLM as available in the Hugging Face Hub, which will be used to resolve the base URL for the serverless Inference Endpoints API requests. Defaults to None
.
endpoint_name
Optional[RuntimeParameter[str]]
the name of the Inference Endpoint to use for the LLM. Defaults to None
.
endpoint_namespace
Optional[RuntimeParameter[str]]
the namespace of the Inference Endpoint to use for the LLM. Defaults to None
.
base_url
Optional[RuntimeParameter[str]]
the base URL to use for the Inference Endpoints API requests.
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the Inference Endpoints API.
tokenizer_id
Optional[str]
the tokenizer ID to use for the LLM as available in the Hugging Face Hub. Defaults to None
, but defining one is recommended to properly format the prompt.
model_display_name
Optional[str]
the model display name to use for the LLM. Defaults to None
.
use_magpie_template
bool
a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template
Union[MagpieAvailablePreQueryTemplates, str, None]
the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
structured_output
Optional[RuntimeParameter[StructuredOutputType]]
a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
:hugging:
Examples:
Free serverless Inference API (set the input_batch_size of the Task that uses this LLM to avoid \"Model is overloaded\" errors):
from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Dedicated Inference Endpoints:
from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n endpoint_name=\"<ENDPOINT_NAME>\",\n api_key=\"<HF_API_KEY>\",\n endpoint_namespace=\"<USER|ORG>\",\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Dedicated Inference Endpoints or TGI:
from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n api_key=\"<HF_API_KEY>\",\n base_url=\"<BASE_URL>\",\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pydantic import BaseModel\nfrom distilabel.models.llms import InferenceEndpointsLLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n api_key=\"api.key\",\n structured_output={\"format\": \"json\", \"schema\": User.model_json_schema()}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the Tour De France\"}]])\n
Source code in src/distilabel/models/llms/huggingface/inference_endpoints.py
class InferenceEndpointsLLM(AsyncLLM, MagpieChatTemplateMixin):\n \"\"\"InferenceEndpoints LLM implementation running the async API client.\n\n This LLM will internally use `huggingface_hub.AsyncInferenceClient`.\n\n Attributes:\n model_id: the model ID to use for the LLM as available in the Hugging Face Hub, which\n will be used to resolve the base URL for the serverless Inference Endpoints API requests.\n Defaults to `None`.\n endpoint_name: the name of the Inference Endpoint to use for the LLM. Defaults to `None`.\n endpoint_namespace: the namespace of the Inference Endpoint to use for the LLM. Defaults to `None`.\n base_url: the base URL to use for the Inference Endpoints API requests.\n api_key: the API key to authenticate the requests to the Inference Endpoints API.\n tokenizer_id: the tokenizer ID to use for the LLM as available in the Hugging Face Hub.\n Defaults to `None`, but defining one is recommended to properly format the prompt.\n model_display_name: the model display name to use for the LLM. Defaults to `None`.\n use_magpie_template: a flag used to enable/disable applying the Magpie pre-query\n template. Defaults to `False`.\n magpie_pre_query_template: the pre-query template to be applied to the prompt or\n sent to the LLM to generate an instruction or a follow up user message. Valid\n values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults\n to `None`.\n structured_output: a dictionary containing the structured output configuration or\n if more fine-grained control is needed, an instance of `OutlinesStructuredOutput`.\n Defaults to None.\n\n Icon:\n `:hugging:`\n\n Examples:\n Free serverless Inference API, set the input_batch_size of the Task that uses this to avoid Model is overloaded:\n\n ```python\n from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\n llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Dedicated Inference Endpoints:\n\n ```python\n from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\n llm = InferenceEndpointsLLM(\n endpoint_name=\"<ENDPOINT_NAME>\",\n api_key=\"<HF_API_KEY>\",\n endpoint_namespace=\"<USER|ORG>\",\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Dedicated Inference Endpoints or TGI:\n\n ```python\n from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\n llm = InferenceEndpointsLLM(\n api_key=\"<HF_API_KEY>\",\n base_url=\"<BASE_URL>\",\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Generate structured data:\n\n ```python\n from pydantic import BaseModel\n from distilabel.models.llms import InferenceEndpointsLLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n api_key=\"api.key\",\n structured_output={\"format\": \"json\", \"schema\": User.model_json_schema()}\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the Tour De France\"}]])\n ```\n \"\"\"\n\n model_id: Optional[str] = None\n\n endpoint_name: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The name of the Inference Endpoint to use 
for the LLM.\",\n )\n endpoint_namespace: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The namespace of the Inference Endpoint to use for the LLM.\",\n )\n base_url: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The base URL to use for the Inference Endpoints API requests.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(HF_TOKEN_ENV_VAR),\n description=\"The API key to authenticate the requests to the Inference Endpoints API.\",\n )\n\n tokenizer_id: Optional[str] = None\n model_display_name: Optional[str] = None\n\n structured_output: Optional[RuntimeParameter[StructuredOutputType]] = Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n\n _num_generations_param_supported = False\n\n _model_name: Optional[str] = PrivateAttr(default=None)\n _tokenizer: Optional[\"PreTrainedTokenizer\"] = PrivateAttr(default=None)\n _api_key_env_var: str = PrivateAttr(HF_TOKEN_ENV_VAR)\n _aclient: Optional[\"AsyncInferenceClient\"] = PrivateAttr(...)\n\n @model_validator(mode=\"after\") # type: ignore\n def only_one_of_model_id_endpoint_name_or_base_url_provided(\n self,\n ) -> \"InferenceEndpointsLLM\":\n \"\"\"Validates that only one of `model_id` or `endpoint_name` is provided; and if `base_url` is also\n provided, a warning will be shown informing the user that the provided `base_url` will be ignored in\n favour of the dynamically calculated one..\"\"\"\n\n if self.base_url and (self.model_id or self.endpoint_name):\n self._logger.warning( # type: ignore\n f\"Since the `base_url={self.base_url}` is available and either one of `model_id`\"\n \" or `endpoint_name` is also provided, the `base_url` will either be ignored\"\n \" or overwritten with the one generated from either of those args, for serverless\"\n \" or dedicated inference endpoints, respectively.\"\n )\n\n if self.use_magpie_template and self.tokenizer_id is None:\n raise ValueError(\n \"`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,\"\n \" set a `tokenizer_id` and try again.\"\n )\n\n if (\n self.model_id\n and self.tokenizer_id is None\n and self.structured_output is not None\n ):\n self.tokenizer_id = self.model_id\n\n if self.base_url and not (self.model_id or self.endpoint_name):\n return self\n\n if self.model_id and not self.endpoint_name:\n return self\n\n if self.endpoint_name and not self.model_id:\n return self\n\n raise ValidationError(\n f\"Only one of `model_id` or `endpoint_name` must be provided. If `base_url` is\"\n f\" provided too, it will be overwritten instead. Found `model_id`={self.model_id},\"\n f\" `endpoint_name`={self.endpoint_name}, and `base_url`={self.base_url}.\"\n )\n\n def load(self) -> None: # noqa: C901\n \"\"\"Loads the `AsyncInferenceClient` client to connect to the Hugging Face Inference\n Endpoint.\n\n Raises:\n ImportError: if the `huggingface-hub` Python client is not installed.\n ValueError: if the model is not currently deployed or is not running the TGI framework.\n ImportError: if the `transformers` Python client is not installed.\n \"\"\"\n super().load()\n\n try:\n from huggingface_hub import (\n AsyncInferenceClient,\n InferenceClient,\n get_inference_endpoint,\n )\n except ImportError as ie:\n raise ImportError(\n \"Hugging Face Hub Python client is not installed. 
Please install it using\"\n \" `pip install huggingface-hub`.\"\n ) from ie\n\n if self.api_key is None:\n self.api_key = SecretStr(get_hf_token(self.__class__.__name__, \"api_key\"))\n\n if self.model_id is not None:\n client = InferenceClient(\n model=self.model_id, token=self.api_key.get_secret_value()\n )\n status = client.get_model_status()\n\n if (\n status.state not in {\"Loadable\", \"Loaded\"}\n and status.framework != \"text-generation-inference\"\n ):\n raise ValueError(\n f\"Model {self.model_id} is not currently deployed or is not running the TGI framework\"\n )\n\n self.base_url = client._resolve_url(\n model=self.model_id, task=\"text-generation\"\n )\n\n if self.endpoint_name is not None:\n client = get_inference_endpoint(\n name=self.endpoint_name,\n namespace=self.endpoint_namespace,\n token=self.api_key.get_secret_value(),\n )\n if client.status in [\"paused\", \"scaledToZero\"]:\n client.resume().wait(timeout=300)\n elif client.status == \"initializing\":\n client.wait(timeout=300)\n\n self.base_url = client.url\n self._model_name = client.repository\n\n self._aclient = AsyncInferenceClient(\n base_url=self.base_url,\n token=self.api_key.get_secret_value(),\n )\n\n if self.tokenizer_id:\n try:\n from transformers import AutoTokenizer\n except ImportError as ie:\n raise ImportError(\n \"Transformers Python client is not installed. Please install it using\"\n \" `pip install transformers`.\"\n ) from ie\n\n self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_id)\n\n @property\n @override\n def model_name(self) -> Union[str, None]: # type: ignore\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return (\n self.model_display_name\n or self._model_name\n or self.model_id\n or self.endpoint_name\n or self.base_url\n )\n\n def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n prompt: str = (\n self._tokenizer.apply_chat_template( # type: ignore\n conversation=input, # type: ignore\n tokenize=False,\n add_generation_prompt=True,\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n\n def _get_structured_output(\n self, input: FormattedInput\n ) -> Tuple[\"StandardInput\", Union[Dict[str, Any], None]]:\n \"\"\"Gets the structured output (if any) for the given input.\n\n Args:\n input: a single input in chat format to generate responses for.\n\n Returns:\n The input and the structured output that will be passed as `grammar` to the\n inference endpoint or `None` if not required.\n \"\"\"\n structured_output = None\n\n # Specific structured output per input\n if isinstance(input, tuple):\n input, structured_output = input\n structured_output = {\n \"type\": structured_output[\"format\"], # type: ignore\n \"value\": structured_output[\"schema\"], # type: ignore\n }\n\n # Same structured output for all the inputs\n if structured_output is None and self.structured_output is not None:\n try:\n structured_output = {\n \"type\": self.structured_output[\"format\"], # type: ignore\n \"value\": self.structured_output[\"schema\"], # type: ignore\n }\n except KeyError as e:\n raise ValueError(\n \"To use the structured output you have to inform the `format` and `schema` in \"\n \"the `structured_output` attribute.\"\n ) from e\n\n if structured_output:\n if isinstance(structured_output[\"value\"], ModelMetaclass):\n 
structured_output[\"value\"] = structured_output[\n \"value\"\n ].model_json_schema()\n\n return input, structured_output\n\n async def _generate_with_text_generation(\n self,\n input: FormattedInput,\n max_new_tokens: int = 128,\n repetition_penalty: Optional[float] = None,\n frequency_penalty: Optional[float] = None,\n temperature: float = 1.0,\n do_sample: bool = False,\n top_n_tokens: Optional[int] = None,\n top_p: Optional[float] = None,\n top_k: Optional[int] = None,\n typical_p: Optional[float] = None,\n stop_sequences: Union[List[str], None] = None,\n return_full_text: bool = False,\n seed: Optional[int] = None,\n watermark: bool = False,\n ) -> GenerateOutput:\n input, structured_output = self._get_structured_output(input)\n prompt = self.prepare_input(input)\n generation: Union[\"TextGenerationOutput\", None] = None\n try:\n generation = await self._aclient.text_generation( # type: ignore\n prompt=prompt,\n max_new_tokens=max_new_tokens,\n do_sample=do_sample,\n typical_p=typical_p,\n repetition_penalty=repetition_penalty,\n frequency_penalty=frequency_penalty,\n temperature=temperature,\n top_n_tokens=top_n_tokens,\n top_p=top_p,\n top_k=top_k,\n stop_sequences=stop_sequences,\n return_full_text=return_full_text,\n # NOTE: here to ensure that the cache is not used and a different response is\n # generated every time\n seed=seed or random.randint(0, sys.maxsize),\n watermark=watermark,\n grammar=structured_output, # type: ignore\n details=True,\n )\n except Exception as e:\n self._logger.warning( # type: ignore\n f\"\u26a0\ufe0f Received no response using Inference Client (model: '{self.model_name}').\"\n f\" Finish reason was: {e}\"\n )\n return prepare_output(\n generations=[generation.generated_text] if generation else [None],\n input_tokens=[compute_tokens(prompt, self._tokenizer.encode)], # type: ignore\n output_tokens=[\n generation.details.generated_tokens\n if generation and generation.details\n else 0\n ],\n logprobs=self._get_logprobs_from_text_generation(generation)\n if generation\n else None, # type: ignore\n )\n\n def _get_logprobs_from_text_generation(\n self, generation: \"TextGenerationOutput\"\n ) -> Union[List[List[List[\"Logprob\"]]], None]:\n if generation.details is None or generation.details.top_tokens is None:\n return None\n\n return [\n [\n [\n {\"token\": top_logprob[\"text\"], \"logprob\": top_logprob[\"logprob\"]}\n for top_logprob in token_logprobs\n ]\n for token_logprobs in generation.details.top_tokens\n ]\n ]\n\n async def _generate_with_chat_completion(\n self,\n input: \"StandardInput\",\n max_new_tokens: int = 128,\n frequency_penalty: Optional[float] = None,\n logit_bias: Optional[List[float]] = None,\n logprobs: bool = False,\n presence_penalty: Optional[float] = None,\n seed: Optional[int] = None,\n stop_sequences: Optional[List[str]] = None,\n temperature: float = 1.0,\n tool_choice: Optional[Union[Dict[str, str], Literal[\"auto\"]]] = None,\n tool_prompt: Optional[str] = None,\n tools: Optional[List[Dict[str, Any]]] = None,\n top_logprobs: Optional[PositiveInt] = None,\n top_p: Optional[float] = None,\n ) -> GenerateOutput:\n message = None\n completion: Union[\"ChatCompletionOutput\", None] = None\n output_logprobs = None\n try:\n completion = await self._aclient.chat_completion( # type: ignore\n messages=input, # type: ignore\n max_tokens=max_new_tokens,\n frequency_penalty=frequency_penalty,\n logit_bias=logit_bias,\n logprobs=logprobs,\n presence_penalty=presence_penalty,\n # NOTE: here to ensure that the cache is not used and a 
different response is\n # generated every time\n seed=seed or random.randint(0, sys.maxsize),\n stop=stop_sequences,\n temperature=temperature,\n tool_choice=tool_choice, # type: ignore\n tool_prompt=tool_prompt,\n tools=tools, # type: ignore\n top_logprobs=top_logprobs,\n top_p=top_p,\n )\n choice = completion.choices[0] # type: ignore\n if (message := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"\u26a0\ufe0f Received no response using Inference Client (model: '{self.model_name}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n if choice_logprobs := self._get_logprobs_from_choice(choice):\n output_logprobs = [choice_logprobs]\n except Exception as e:\n self._logger.warning( # type: ignore\n f\"\u26a0\ufe0f Received no response using Inference Client (model: '{self.model_name}').\"\n f\" Finish reason was: {e}\"\n )\n return prepare_output(\n generations=[message],\n input_tokens=[completion.usage.prompt_tokens] if completion else None,\n output_tokens=[completion.usage.completion_tokens] if completion else None,\n logprobs=output_logprobs,\n )\n\n def _get_logprobs_from_choice(\n self, choice: \"ChatCompletionOutputComplete\"\n ) -> Union[List[List[\"Logprob\"]], None]:\n if choice.logprobs is None:\n return None\n\n return [\n [\n {\"token\": top_logprob.token, \"logprob\": top_logprob.logprob}\n for top_logprob in token_logprobs.top_logprobs\n ]\n for token_logprobs in choice.logprobs.content\n ]\n\n def _check_stop_sequences(\n self,\n stop_sequences: Optional[Union[str, List[str]]] = None,\n ) -> Union[List[str], None]:\n \"\"\"Checks that no more than 4 stop sequences are provided.\n\n Args:\n stop_sequences: the stop sequences to be checked.\n\n Returns:\n The stop sequences.\n \"\"\"\n if stop_sequences is not None:\n if isinstance(stop_sequences, str):\n stop_sequences = [stop_sequences]\n if len(stop_sequences) > 4:\n warnings.warn(\n \"Only up to 4 stop sequences are allowed, so keeping the first 4 items only.\",\n UserWarning,\n stacklevel=2,\n )\n stop_sequences = stop_sequences[:4]\n return stop_sequences\n\n @validate_call\n async def agenerate( # type: ignore\n self,\n input: FormattedInput,\n max_new_tokens: int = 128,\n frequency_penalty: Optional[Annotated[float, Field(ge=-2.0, le=2.0)]] = None,\n logit_bias: Optional[List[float]] = None,\n logprobs: bool = False,\n presence_penalty: Optional[Annotated[float, Field(ge=-2.0, le=2.0)]] = None,\n seed: Optional[int] = None,\n stop_sequences: Optional[List[str]] = None,\n temperature: float = 1.0,\n tool_choice: Optional[Union[Dict[str, str], Literal[\"auto\"]]] = None,\n tool_prompt: Optional[str] = None,\n tools: Optional[List[Dict[str, Any]]] = None,\n top_logprobs: Optional[PositiveInt] = None,\n top_n_tokens: Optional[PositiveInt] = None,\n top_p: Optional[float] = None,\n do_sample: bool = False,\n repetition_penalty: Optional[float] = None,\n return_full_text: bool = False,\n top_k: Optional[int] = None,\n typical_p: Optional[float] = None,\n watermark: bool = False,\n ) -> GenerateOutput:\n \"\"\"Generates completions for the given input using the async client. 
This method\n uses two methods of the `huggingface_hub.AsyncClient`: `chat_completion` and `text_generation`.\n `chat_completion` method will be used only if no `tokenizer_id` has been specified.\n Some arguments of this function are specific to the `text_generation` method, while\n some others are specific to the `chat_completion` method.\n\n Args:\n input: a single input in chat format to generate responses for.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n frequency_penalty: a value between `-2.0` and `2.0`. Positive values penalize\n new tokens based on their existing frequency in the text so far, decreasing\n model's likelihood to repeat the same line verbatim. Defauls to `None`.\n logit_bias: modify the likelihood of specified tokens appearing in the completion.\n This argument is exclusive to the `chat_completion` method and will be used\n only if `tokenizer_id` is `None`.\n Defaults to `None`.\n logprobs: whether to return the log probabilities or not. This argument is exclusive\n to the `chat_completion` method and will be used only if `tokenizer_id`\n is `None`. Defaults to `False`.\n presence_penalty: a value between `-2.0` and `2.0`. Positive values penalize\n new tokens based on whether they appear in the text so far, increasing the\n model likelihood to talk about new topics. This argument is exclusive to\n the `chat_completion` method and will be used only if `tokenizer_id` is\n `None`. Defauls to `None`.\n seed: the seed to use for the generation. Defaults to `None`.\n stop_sequences: either a single string or a list of strings containing the sequences\n to stop the generation at. Defaults to `None`, but will be set to the\n `tokenizer.eos_token` if available.\n temperature: the temperature to use for the generation. Defaults to `1.0`.\n tool_choice: the name of the tool the model should call. It can be a dictionary\n like `{\"function_name\": \"my_tool\"}` or \"auto\". If not provided, then the\n model won't use any tool. This argument is exclusive to the `chat_completion`\n method and will be used only if `tokenizer_id` is `None`. Defaults to `None`.\n tool_prompt: A prompt to be appended before the tools. This argument is exclusive\n to the `chat_completion` method and will be used only if `tokenizer_id`\n is `None`. Defauls to `None`.\n tools: a list of tools definitions that the LLM can use.\n This argument is exclusive to the `chat_completion` method and will be used\n only if `tokenizer_id` is `None`. Defaults to `None`.\n top_logprobs: the number of top log probabilities to return per output token\n generated. This argument is exclusive to the `chat_completion` method and\n will be used only if `tokenizer_id` is `None`. Defaults to `None`.\n top_n_tokens: the number of top log probabilities to return per output token\n generated. This argument is exclusive of the `text_generation` method and\n will be only used if `tokenizer_id` is not `None`. Defaults to `None`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n do_sample: whether to use sampling for the generation. This argument is exclusive\n of the `text_generation` method and will be only used if `tokenizer_id` is not\n `None`. Defaults to `False`.\n repetition_penalty: the repetition penalty to use for the generation. This argument\n is exclusive of the `text_generation` method and will be only used if `tokenizer_id`\n is not `None`. 
Defaults to `None`.\n return_full_text: whether to return the full text of the completion or just\n the generated text. Defaults to `False`, meaning that only the generated\n text will be returned. This argument is exclusive of the `text_generation`\n method and will be only used if `tokenizer_id` is not `None`.\n top_k: the top-k value to use for the generation. This argument is exclusive\n of the `text_generation` method and will be only used if `tokenizer_id`\n is not `None`. Defaults to `0.8`, since neither `0.0` nor `1.0` are valid\n values in TGI.\n typical_p: the typical-p value to use for the generation. This argument is exclusive\n of the `text_generation` method and will be only used if `tokenizer_id`\n is not `None`. Defaults to `None`.\n watermark: whether to add the watermark to the generated text. This argument\n is exclusive of the `text_generation` method and will be only used if `tokenizer_id`\n is not `None`. Defaults to `None`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n stop_sequences = self._check_stop_sequences(stop_sequences)\n\n if self.tokenizer_id is None:\n return await self._generate_with_chat_completion(\n input=input, # type: ignore\n max_new_tokens=max_new_tokens,\n frequency_penalty=frequency_penalty,\n logit_bias=logit_bias,\n logprobs=logprobs,\n presence_penalty=presence_penalty,\n seed=seed,\n stop_sequences=stop_sequences,\n temperature=temperature,\n tool_choice=tool_choice,\n tool_prompt=tool_prompt,\n tools=tools,\n top_logprobs=top_logprobs,\n top_p=top_p,\n )\n\n return await self._generate_with_text_generation(\n input=input,\n max_new_tokens=max_new_tokens,\n do_sample=do_sample,\n typical_p=typical_p,\n repetition_penalty=repetition_penalty,\n frequency_penalty=frequency_penalty,\n temperature=temperature,\n top_n_tokens=top_n_tokens,\n top_p=top_p,\n top_k=top_k,\n stop_sequences=stop_sequences,\n return_full_text=return_full_text,\n seed=seed,\n watermark=watermark,\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.InferenceEndpointsLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.InferenceEndpointsLLM.only_one_of_model_id_endpoint_name_or_base_url_provided","title":"only_one_of_model_id_endpoint_name_or_base_url_provided()
","text":"Validates that only one of model_id
or endpoint_name
is provided; and if base_url
is also provided, a warning will be shown informing the user that the provided base_url
will be ignored in favour of the dynamically calculated one.
src/distilabel/models/llms/huggingface/inference_endpoints.py
@model_validator(mode=\"after\") # type: ignore\ndef only_one_of_model_id_endpoint_name_or_base_url_provided(\n self,\n) -> \"InferenceEndpointsLLM\":\n \"\"\"Validates that only one of `model_id` or `endpoint_name` is provided; and if `base_url` is also\n provided, a warning will be shown informing the user that the provided `base_url` will be ignored in\n favour of the dynamically calculated one..\"\"\"\n\n if self.base_url and (self.model_id or self.endpoint_name):\n self._logger.warning( # type: ignore\n f\"Since the `base_url={self.base_url}` is available and either one of `model_id`\"\n \" or `endpoint_name` is also provided, the `base_url` will either be ignored\"\n \" or overwritten with the one generated from either of those args, for serverless\"\n \" or dedicated inference endpoints, respectively.\"\n )\n\n if self.use_magpie_template and self.tokenizer_id is None:\n raise ValueError(\n \"`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,\"\n \" set a `tokenizer_id` and try again.\"\n )\n\n if (\n self.model_id\n and self.tokenizer_id is None\n and self.structured_output is not None\n ):\n self.tokenizer_id = self.model_id\n\n if self.base_url and not (self.model_id or self.endpoint_name):\n return self\n\n if self.model_id and not self.endpoint_name:\n return self\n\n if self.endpoint_name and not self.model_id:\n return self\n\n raise ValidationError(\n f\"Only one of `model_id` or `endpoint_name` must be provided. If `base_url` is\"\n f\" provided too, it will be overwritten instead. Found `model_id`={self.model_id},\"\n f\" `endpoint_name`={self.endpoint_name}, and `base_url`={self.base_url}.\"\n )\n
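A hedged illustration of the validation above: provide exactly one of model_id or endpoint_name (or only base_url); providing both identifiers raises a ValidationError at instantiation time.
from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\n# Valid: serverless, resolved from the model ID\nllm = InferenceEndpointsLLM(model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\")\n\n# Valid: dedicated endpoint resolved from its name (and optional namespace)\nllm = InferenceEndpointsLLM(endpoint_name=\"<ENDPOINT_NAME>\", endpoint_namespace=\"<USER|ORG>\")\n\n# Invalid: both identifiers at once triggers the error shown above\n# llm = InferenceEndpointsLLM(\n#     model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n#     endpoint_name=\"<ENDPOINT_NAME>\",\n# )\n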
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.InferenceEndpointsLLM.load","title":"load()
","text":"Loads the AsyncInferenceClient
client to connect to the Hugging Face Inference Endpoint.
Raises:
Type DescriptionImportError
if the huggingface-hub
Python client is not installed.
ValueError
if the model is not currently deployed or is not running the TGI framework.
ImportError
if the transformers
Python client is not installed.
src/distilabel/models/llms/huggingface/inference_endpoints.py
def load(self) -> None: # noqa: C901\n \"\"\"Loads the `AsyncInferenceClient` client to connect to the Hugging Face Inference\n Endpoint.\n\n Raises:\n ImportError: if the `huggingface-hub` Python client is not installed.\n ValueError: if the model is not currently deployed or is not running the TGI framework.\n ImportError: if the `transformers` Python client is not installed.\n \"\"\"\n super().load()\n\n try:\n from huggingface_hub import (\n AsyncInferenceClient,\n InferenceClient,\n get_inference_endpoint,\n )\n except ImportError as ie:\n raise ImportError(\n \"Hugging Face Hub Python client is not installed. Please install it using\"\n \" `pip install huggingface-hub`.\"\n ) from ie\n\n if self.api_key is None:\n self.api_key = SecretStr(get_hf_token(self.__class__.__name__, \"api_key\"))\n\n if self.model_id is not None:\n client = InferenceClient(\n model=self.model_id, token=self.api_key.get_secret_value()\n )\n status = client.get_model_status()\n\n if (\n status.state not in {\"Loadable\", \"Loaded\"}\n and status.framework != \"text-generation-inference\"\n ):\n raise ValueError(\n f\"Model {self.model_id} is not currently deployed or is not running the TGI framework\"\n )\n\n self.base_url = client._resolve_url(\n model=self.model_id, task=\"text-generation\"\n )\n\n if self.endpoint_name is not None:\n client = get_inference_endpoint(\n name=self.endpoint_name,\n namespace=self.endpoint_namespace,\n token=self.api_key.get_secret_value(),\n )\n if client.status in [\"paused\", \"scaledToZero\"]:\n client.resume().wait(timeout=300)\n elif client.status == \"initializing\":\n client.wait(timeout=300)\n\n self.base_url = client.url\n self._model_name = client.repository\n\n self._aclient = AsyncInferenceClient(\n base_url=self.base_url,\n token=self.api_key.get_secret_value(),\n )\n\n if self.tokenizer_id:\n try:\n from transformers import AutoTokenizer\n except ImportError as ie:\n raise ImportError(\n \"Transformers Python client is not installed. Please install it using\"\n \" `pip install transformers`.\"\n ) from ie\n\n self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_id)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.InferenceEndpointsLLM.prepare_input","title":"prepare_input(input)
","text":"Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name Type Description Defaultinput
StandardInput
the input list containing chat items.
requiredReturns:
Type Descriptionstr
The prompt to send to the LLM.
Source code insrc/distilabel/models/llms/huggingface/inference_endpoints.py
def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n prompt: str = (\n self._tokenizer.apply_chat_template( # type: ignore\n conversation=input, # type: ignore\n tokenize=False,\n add_generation_prompt=True,\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n
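A hedged sketch of what prepare_input produces when a tokenizer_id is configured (downloading the tokenizer needs network access and, for gated models, a Hugging Face token):
from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n    model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n    tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n)\nllm.load()\n\nprompt = llm.prepare_input([{\"role\": \"user\", \"content\": \"Hello world!\"}])\n# prompt is the chat-templated string ending with the assistant generation prompt,\n# optionally extended with the Magpie pre-query template when it is enabled.\n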
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.InferenceEndpointsLLM._get_structured_output","title":"_get_structured_output(input)
","text":"Gets the structured output (if any) for the given input.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requiredReturns:
Type DescriptionStandardInput
The input and the structured output that will be passed as grammar
to the
Union[Dict[str, Any], None]
inference endpoint or None
if not required.
src/distilabel/models/llms/huggingface/inference_endpoints.py
def _get_structured_output(\n self, input: FormattedInput\n) -> Tuple[\"StandardInput\", Union[Dict[str, Any], None]]:\n \"\"\"Gets the structured output (if any) for the given input.\n\n Args:\n input: a single input in chat format to generate responses for.\n\n Returns:\n The input and the structured output that will be passed as `grammar` to the\n inference endpoint or `None` if not required.\n \"\"\"\n structured_output = None\n\n # Specific structured output per input\n if isinstance(input, tuple):\n input, structured_output = input\n structured_output = {\n \"type\": structured_output[\"format\"], # type: ignore\n \"value\": structured_output[\"schema\"], # type: ignore\n }\n\n # Same structured output for all the inputs\n if structured_output is None and self.structured_output is not None:\n try:\n structured_output = {\n \"type\": self.structured_output[\"format\"], # type: ignore\n \"value\": self.structured_output[\"schema\"], # type: ignore\n }\n except KeyError as e:\n raise ValueError(\n \"To use the structured output you have to inform the `format` and `schema` in \"\n \"the `structured_output` attribute.\"\n ) from e\n\n if structured_output:\n if isinstance(structured_output[\"value\"], ModelMetaclass):\n structured_output[\"value\"] = structured_output[\n \"value\"\n ].model_json_schema()\n\n return input, structured_output\n
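A hedged sketch of the per-input shape handled above (assuming llm is an InferenceEndpointsLLM instance; the private helper is called here only for illustration): a (messages, configuration) tuple with format/schema keys is normalised into the type/value grammar payload sent to the endpoint.
from pydantic import BaseModel\n\nclass User(BaseModel):\n    name: str\n    last_name: str\n    id: int\n\nformatted_input = (\n    [{\"role\": \"user\", \"content\": \"Create a user profile for the Tour De France\"}],\n    {\"format\": \"json\", \"schema\": User.model_json_schema()},\n)\n\nmessages, grammar = llm._get_structured_output(formatted_input)\n# grammar -> {\"type\": \"json\", \"value\": {... JSON schema of User ...}}\n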
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.InferenceEndpointsLLM._check_stop_sequences","title":"_check_stop_sequences(stop_sequences=None)
","text":"Checks that no more than 4 stop sequences are provided.
Parameters:
Name Type Description Defaultstop_sequences
Optional[Union[str, List[str]]]
the stop sequences to be checked.
None
Returns:
Type DescriptionUnion[List[str], None]
The stop sequences.
Source code insrc/distilabel/models/llms/huggingface/inference_endpoints.py
def _check_stop_sequences(\n self,\n stop_sequences: Optional[Union[str, List[str]]] = None,\n) -> Union[List[str], None]:\n \"\"\"Checks that no more than 4 stop sequences are provided.\n\n Args:\n stop_sequences: the stop sequences to be checked.\n\n Returns:\n The stop sequences.\n \"\"\"\n if stop_sequences is not None:\n if isinstance(stop_sequences, str):\n stop_sequences = [stop_sequences]\n if len(stop_sequences) > 4:\n warnings.warn(\n \"Only up to 4 stop sequences are allowed, so keeping the first 4 items only.\",\n UserWarning,\n stacklevel=2,\n )\n stop_sequences = stop_sequences[:4]\n return stop_sequences\n
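A brief, hedged sketch of the behaviour above (with llm an InferenceEndpointsLLM instance): a single string is wrapped in a list and anything beyond four sequences is dropped with a UserWarning.
stop = llm._check_stop_sequences(\"</s>\")\n# -> [\"</s>\"]\n\nstop = llm._check_stop_sequences([\"a\", \"b\", \"c\", \"d\", \"e\"])\n# UserWarning: Only up to 4 stop sequences are allowed...\n# -> [\"a\", \"b\", \"c\", \"d\"]\n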
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.InferenceEndpointsLLM.agenerate","title":"agenerate(input, max_new_tokens=128, frequency_penalty=None, logit_bias=None, logprobs=False, presence_penalty=None, seed=None, stop_sequences=None, temperature=1.0, tool_choice=None, tool_prompt=None, tools=None, top_logprobs=None, top_n_tokens=None, top_p=None, do_sample=False, repetition_penalty=None, return_full_text=False, top_k=None, typical_p=None, watermark=False)
async
","text":"Generates completions for the given input using the async client. This method uses two methods of the huggingface_hub.AsyncClient
: chat_completion
and text_generation
. chat_completion
method will be used only if no tokenizer_id
has been specified. Some arguments of this function are specific to the text_generation
method, while some others are specific to the chat_completion
method.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requiredmax_new_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
frequency_penalty
Optional[Annotated[float, Field(ge=-2.0, le=2.0)]]
a value between -2.0
and 2.0
. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. Defaults to None
.
None
logit_bias
Optional[List[float]]
modify the likelihood of specified tokens appearing in the completion. This argument is exclusive to the chat_completion
method and will be used only if tokenizer_id
is None
. Defaults to None
.
None
logprobs
bool
whether to return the log probabilities or not. This argument is exclusive to the chat_completion
method and will be used only if tokenizer_id
is None
. Defaults to False
.
False
presence_penalty
Optional[Annotated[float, Field(ge=-2.0, le=2.0)]]
a value between -2.0
and 2.0
. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model likelihood to talk about new topics. This argument is exclusive to the chat_completion
method and will be used only if tokenizer_id
is None
. Defaults to None
.
None
seed
Optional[int]
the seed to use for the generation. Defaults to None
.
None
stop_sequences
Optional[List[str]]
either a single string or a list of strings containing the sequences to stop the generation at. Defaults to None
, but will be set to the tokenizer.eos_token
if available.
None
temperature
float
the temperature to use for the generation. Defaults to 1.0
.
1.0
tool_choice
Optional[Union[Dict[str, str], Literal['auto']]]
the name of the tool the model should call. It can be a dictionary like {\"function_name\": \"my_tool\"}
or \"auto\". If not provided, then the model won't use any tool. This argument is exclusive to the chat_completion
method and will be used only if tokenizer_id
is None
. Defaults to None
.
None
tool_prompt
Optional[str]
A prompt to be appended before the tools. This argument is exclusive to the chat_completion
method and will be used only if tokenizer_id
is None
. Defaults to None
.
None
tools
Optional[List[Dict[str, Any]]]
a list of tools definitions that the LLM can use. This argument is exclusive to the chat_completion
method and will be used only if tokenizer_id
is None
. Defaults to None
.
None
top_logprobs
Optional[PositiveInt]
the number of top log probabilities to return per output token generated. This argument is exclusive to the chat_completion
method and will be used only if tokenizer_id
is None
. Defaults to None
.
None
top_n_tokens
Optional[PositiveInt]
the number of top log probabilities to return per output token generated. This argument is exclusive of the text_generation
method and will be only used if tokenizer_id
is not None
. Defaults to None
.
None
top_p
Optional[float]
the top-p value to use for the generation. Defaults to 1.0
.
None
do_sample
bool
whether to use sampling for the generation. This argument is exclusive of the text_generation
method and will be only used if tokenizer_id
is not None
. Defaults to False
.
False
repetition_penalty
Optional[float]
the repetition penalty to use for the generation. This argument is exclusive of the text_generation
method and will be only used if tokenizer_id
is not None
. Defaults to None
.
None
return_full_text
bool
whether to return the full text of the completion or just the generated text. Defaults to False
, meaning that only the generated text will be returned. This argument is exclusive of the text_generation
method and will be only used if tokenizer_id
is not None
.
False
top_k
Optional[int]
the top-k value to use for the generation. This argument is exclusive of the text_generation
method and will be only used if tokenizer_id
is not None
. Defaults to None
.
None
typical_p
Optional[float]
the typical-p value to use for the generation. This argument is exclusive of the text_generation
method and will be only used if tokenizer_id
is not None
. Defaults to None
.
None
watermark
bool
whether to add the watermark to the generated text. This argument is exclusive of the text_generation
method and will be only used if tokenizer_id
is not None
. Defaults to False
.
False
Returns:
Type DescriptionGenerateOutput
A list of lists of strings containing the generated responses for each input.
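A minimal usage sketch (the model_id is a hypothetical repo id; without a tokenizer_id the call is routed to chat_completion, as described above):
import asyncio\n\nfrom distilabel.models.llms import InferenceEndpointsLLM\n\n# Hypothetical endpoint/model id used only for illustration.\nllm = InferenceEndpointsLLM(model_id=\"meta-llama/Meta-Llama-3-8B-Instruct\")\nllm.load()\n\noutput = asyncio.run(\n    llm.agenerate(\n        input=[{\"role\": \"user\", \"content\": \"Hello world!\"}],\n        max_new_tokens=64,\n        temperature=0.7,\n    )\n)\n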
Source code insrc/distilabel/models/llms/huggingface/inference_endpoints.py
@validate_call\nasync def agenerate( # type: ignore\n self,\n input: FormattedInput,\n max_new_tokens: int = 128,\n frequency_penalty: Optional[Annotated[float, Field(ge=-2.0, le=2.0)]] = None,\n logit_bias: Optional[List[float]] = None,\n logprobs: bool = False,\n presence_penalty: Optional[Annotated[float, Field(ge=-2.0, le=2.0)]] = None,\n seed: Optional[int] = None,\n stop_sequences: Optional[List[str]] = None,\n temperature: float = 1.0,\n tool_choice: Optional[Union[Dict[str, str], Literal[\"auto\"]]] = None,\n tool_prompt: Optional[str] = None,\n tools: Optional[List[Dict[str, Any]]] = None,\n top_logprobs: Optional[PositiveInt] = None,\n top_n_tokens: Optional[PositiveInt] = None,\n top_p: Optional[float] = None,\n do_sample: bool = False,\n repetition_penalty: Optional[float] = None,\n return_full_text: bool = False,\n top_k: Optional[int] = None,\n typical_p: Optional[float] = None,\n watermark: bool = False,\n) -> GenerateOutput:\n \"\"\"Generates completions for the given input using the async client. This method\n uses two methods of the `huggingface_hub.AsyncClient`: `chat_completion` and `text_generation`.\n `chat_completion` method will be used only if no `tokenizer_id` has been specified.\n Some arguments of this function are specific to the `text_generation` method, while\n some others are specific to the `chat_completion` method.\n\n Args:\n input: a single input in chat format to generate responses for.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n frequency_penalty: a value between `-2.0` and `2.0`. Positive values penalize\n new tokens based on their existing frequency in the text so far, decreasing\n model's likelihood to repeat the same line verbatim. Defauls to `None`.\n logit_bias: modify the likelihood of specified tokens appearing in the completion.\n This argument is exclusive to the `chat_completion` method and will be used\n only if `tokenizer_id` is `None`.\n Defaults to `None`.\n logprobs: whether to return the log probabilities or not. This argument is exclusive\n to the `chat_completion` method and will be used only if `tokenizer_id`\n is `None`. Defaults to `False`.\n presence_penalty: a value between `-2.0` and `2.0`. Positive values penalize\n new tokens based on whether they appear in the text so far, increasing the\n model likelihood to talk about new topics. This argument is exclusive to\n the `chat_completion` method and will be used only if `tokenizer_id` is\n `None`. Defauls to `None`.\n seed: the seed to use for the generation. Defaults to `None`.\n stop_sequences: either a single string or a list of strings containing the sequences\n to stop the generation at. Defaults to `None`, but will be set to the\n `tokenizer.eos_token` if available.\n temperature: the temperature to use for the generation. Defaults to `1.0`.\n tool_choice: the name of the tool the model should call. It can be a dictionary\n like `{\"function_name\": \"my_tool\"}` or \"auto\". If not provided, then the\n model won't use any tool. This argument is exclusive to the `chat_completion`\n method and will be used only if `tokenizer_id` is `None`. Defaults to `None`.\n tool_prompt: A prompt to be appended before the tools. This argument is exclusive\n to the `chat_completion` method and will be used only if `tokenizer_id`\n is `None`. Defauls to `None`.\n tools: a list of tools definitions that the LLM can use.\n This argument is exclusive to the `chat_completion` method and will be used\n only if `tokenizer_id` is `None`. 
Defaults to `None`.\n top_logprobs: the number of top log probabilities to return per output token\n generated. This argument is exclusive to the `chat_completion` method and\n will be used only if `tokenizer_id` is `None`. Defaults to `None`.\n top_n_tokens: the number of top log probabilities to return per output token\n generated. This argument is exclusive of the `text_generation` method and\n will be only used if `tokenizer_id` is not `None`. Defaults to `None`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n do_sample: whether to use sampling for the generation. This argument is exclusive\n of the `text_generation` method and will be only used if `tokenizer_id` is not\n `None`. Defaults to `False`.\n repetition_penalty: the repetition penalty to use for the generation. This argument\n is exclusive of the `text_generation` method and will be only used if `tokenizer_id`\n is not `None`. Defaults to `None`.\n return_full_text: whether to return the full text of the completion or just\n the generated text. Defaults to `False`, meaning that only the generated\n text will be returned. This argument is exclusive of the `text_generation`\n method and will be only used if `tokenizer_id` is not `None`.\n top_k: the top-k value to use for the generation. This argument is exclusive\n of the `text_generation` method and will be only used if `tokenizer_id`\n is not `None`. Defaults to `0.8`, since neither `0.0` nor `1.0` are valid\n values in TGI.\n typical_p: the typical-p value to use for the generation. This argument is exclusive\n of the `text_generation` method and will be only used if `tokenizer_id`\n is not `None`. Defaults to `None`.\n watermark: whether to add the watermark to the generated text. This argument\n is exclusive of the `text_generation` method and will be only used if `tokenizer_id`\n is not `None`. Defaults to `None`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n stop_sequences = self._check_stop_sequences(stop_sequences)\n\n if self.tokenizer_id is None:\n return await self._generate_with_chat_completion(\n input=input, # type: ignore\n max_new_tokens=max_new_tokens,\n frequency_penalty=frequency_penalty,\n logit_bias=logit_bias,\n logprobs=logprobs,\n presence_penalty=presence_penalty,\n seed=seed,\n stop_sequences=stop_sequences,\n temperature=temperature,\n tool_choice=tool_choice,\n tool_prompt=tool_prompt,\n tools=tools,\n top_logprobs=top_logprobs,\n top_p=top_p,\n )\n\n return await self._generate_with_text_generation(\n input=input,\n max_new_tokens=max_new_tokens,\n do_sample=do_sample,\n typical_p=typical_p,\n repetition_penalty=repetition_penalty,\n frequency_penalty=frequency_penalty,\n temperature=temperature,\n top_n_tokens=top_n_tokens,\n top_p=top_p,\n top_k=top_k,\n stop_sequences=stop_sequences,\n return_full_text=return_full_text,\n seed=seed,\n watermark=watermark,\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TransformersLLM","title":"TransformersLLM
","text":" Bases: LLM
, MagpieChatTemplateMixin
, CudaDevicePlacementMixin
Hugging Face transformers
library LLM implementation using the text generation pipeline.
Attributes:
Name Type Descriptionmodel
str
the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
revision
str
if model
refers to a Hugging Face Hub repository, then the revision (e.g. a branch name or a commit id) to use. Defaults to \"main\"
.
torch_dtype
str
the torch dtype to use for the model e.g. \"float16\", \"float32\", etc. Defaults to \"auto\"
.
trust_remote_code
bool
whether to allow fetching and executing remote code fetched from the repository in the Hub. Defaults to False
.
model_kwargs
Optional[Dict[str, Any]]
additional dictionary of keyword arguments that will be passed to the from_pretrained
method of the model.
tokenizer
Optional[str]
the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer config files. If not provided, the one associated to the model
will be used. Defaults to None
.
use_fast
bool
whether to use a fast tokenizer or not. Defaults to True
.
chat_template
Optional[str]
a chat template that will be used to build the prompts before sending them to the model. If not provided, the chat template defined in the tokenizer config will be used. If not provided and the tokenizer doesn't have a chat template, then ChatML template will be used. Defaults to None
.
device
Optional[Union[str, int]]
the name or index of the device where the model will be loaded. Defaults to None
.
device_map
Optional[Union[str, Dict[str, Any]]]
a dictionary mapping each layer of the model to a device, or a mode like \"sequential\"
or \"auto\"
. Defaults to None
.
token
Optional[SecretStr]
the Hugging Face Hub token that will be used to authenticate to the Hugging Face Hub. If not provided, the HF_TOKEN
environment or huggingface_hub
package local configuration will be used. Defaults to None
.
structured_output
Optional[RuntimeParameter[OutlinesStructuredOutputType]]
a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
use_magpie_template
bool
a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template
Union[MagpieAvailablePreQueryTemplates, str, None]
the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
Examples:
Generate text:
from distilabel.models.llms import TransformersLLM\n\nllm = TransformersLLM(model=\"microsoft/Phi-3-mini-4k-instruct\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
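Generate structured data (a hedged sketch: the User schema is a made-up example, and the format and schema keys follow the OutlinesStructuredOutputType dictionary used elsewhere in this gallery):
from pydantic import BaseModel\nfrom distilabel.models.llms import TransformersLLM\n\nclass User(BaseModel):  # hypothetical schema used only for illustration\n    name: str\n    last_name: str\n    id: int\n\nllm = TransformersLLM(\n    model=\"microsoft/Phi-3-mini-4k-instruct\",\n    structured_output={\"format\": \"json\", \"schema\": User},\n)\n\nllm.load()\n\noutput = llm.generate_outputs(\n    inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]]\n)\n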
Source code in src/distilabel/models/llms/huggingface/transformers.py
class TransformersLLM(LLM, MagpieChatTemplateMixin, CudaDevicePlacementMixin):\n \"\"\"Hugging Face `transformers` library LLM implementation using the text generation\n pipeline.\n\n Attributes:\n model: the model Hugging Face Hub repo id or a path to a directory containing the\n model weights and configuration files.\n revision: if `model` refers to a Hugging Face Hub repository, then the revision\n (e.g. a branch name or a commit id) to use. Defaults to `\"main\"`.\n torch_dtype: the torch dtype to use for the model e.g. \"float16\", \"float32\", etc.\n Defaults to `\"auto\"`.\n trust_remote_code: whether to allow fetching and executing remote code fetched\n from the repository in the Hub. Defaults to `False`.\n model_kwargs: additional dictionary of keyword arguments that will be passed to\n the `from_pretrained` method of the model.\n tokenizer: the tokenizer Hugging Face Hub repo id or a path to a directory containing\n the tokenizer config files. If not provided, the one associated to the `model`\n will be used. Defaults to `None`.\n use_fast: whether to use a fast tokenizer or not. Defaults to `True`.\n chat_template: a chat template that will be used to build the prompts before\n sending them to the model. If not provided, the chat template defined in the\n tokenizer config will be used. If not provided and the tokenizer doesn't have\n a chat template, then ChatML template will be used. Defaults to `None`.\n device: the name or index of the device where the model will be loaded. Defaults\n to `None`.\n device_map: a dictionary mapping each layer of the model to a device, or a mode\n like `\"sequential\"` or `\"auto\"`. Defaults to `None`.\n token: the Hugging Face Hub token that will be used to authenticate to the Hugging\n Face Hub. If not provided, the `HF_TOKEN` environment or `huggingface_hub` package\n local configuration will be used. Defaults to `None`.\n structured_output: a dictionary containing the structured output configuration or if more\n fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.\n use_magpie_template: a flag used to enable/disable applying the Magpie pre-query\n template. Defaults to `False`.\n magpie_pre_query_template: the pre-query template to be applied to the prompt or\n sent to the LLM to generate an instruction or a follow up user message. Valid\n values are \"llama3\", \"qwen2\" or another pre-query template provided. 
Defaults\n to `None`.\n\n Icon:\n `:hugging:`\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import TransformersLLM\n\n llm = TransformersLLM(model=\"microsoft/Phi-3-mini-4k-instruct\")\n\n llm.load()\n\n # Call the model\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n \"\"\"\n\n model: str\n revision: str = \"main\"\n torch_dtype: str = \"auto\"\n trust_remote_code: bool = False\n model_kwargs: Optional[Dict[str, Any]] = None\n tokenizer: Optional[str] = None\n use_fast: bool = True\n chat_template: Optional[str] = None\n device: Optional[Union[str, int]] = None\n device_map: Optional[Union[str, Dict[str, Any]]] = None\n token: Optional[SecretStr] = Field(\n default_factory=lambda: os.getenv(HF_TOKEN_ENV_VAR)\n )\n structured_output: Optional[RuntimeParameter[OutlinesStructuredOutputType]] = Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n\n _pipeline: Optional[\"Pipeline\"] = PrivateAttr(...)\n _prefix_allowed_tokens_fn: Union[Callable, None] = PrivateAttr(default=None)\n _logits_processor: Optional[\"LogitsProcessorList\"] = PrivateAttr(default=None)\n\n def load(self) -> None:\n \"\"\"Loads the model and tokenizer and creates the text generation pipeline. In addition,\n it will configure the tokenizer chat template.\"\"\"\n if self.device == \"cuda\":\n CudaDevicePlacementMixin.load(self)\n\n try:\n from transformers import LogitsProcessorList, pipeline\n except ImportError as ie:\n raise ImportError(\n \"Transformers is not installed. Please install it using `pip install transformers`.\"\n ) from ie\n\n token = self.token.get_secret_value() if self.token is not None else self.token\n\n self._pipeline = pipeline(\n \"text-generation\",\n model=self.model,\n revision=self.revision,\n torch_dtype=self.torch_dtype,\n trust_remote_code=self.trust_remote_code,\n model_kwargs=self.model_kwargs or {},\n tokenizer=self.tokenizer or self.model,\n use_fast=self.use_fast,\n device=self.device,\n device_map=self.device_map,\n token=token,\n return_full_text=False,\n )\n\n if self.chat_template is not None:\n self._pipeline.tokenizer.chat_template = self.chat_template # type: ignore\n\n if self._pipeline.tokenizer.pad_token is None: # type: ignore\n self._pipeline.tokenizer.pad_token = self._pipeline.tokenizer.eos_token # type: ignore\n\n if self.structured_output:\n from distilabel.steps.tasks.structured_outputs.outlines import (\n outlines_below_0_1_0,\n )\n\n if outlines_below_0_1_0:\n self._prefix_allowed_tokens_fn = self._prepare_structured_output(\n self.structured_output\n )\n else:\n logits_processor = self._prepare_structured_output(\n self.structured_output\n )\n self._logits_processor = LogitsProcessorList([logits_processor])\n\n super().load()\n\n def unload(self) -> None:\n \"\"\"Unloads the `vLLM` model.\"\"\"\n CudaDevicePlacementMixin.unload(self)\n super().unload()\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n if self._pipeline.tokenizer.chat_template is None: # type: ignore\n return input[0][\"content\"]\n\n prompt: str = (\n self._pipeline.tokenizer.apply_chat_template( # type: ignore\n input, 
# type: ignore\n tokenize=False,\n add_generation_prompt=True,\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n\n @validate_call\n def generate( # type: ignore\n self,\n inputs: List[StandardInput],\n num_generations: int = 1,\n max_new_tokens: int = 128,\n temperature: float = 0.1,\n repetition_penalty: float = 1.1,\n top_p: float = 1.0,\n top_k: int = 0,\n do_sample: bool = True,\n ) -> List[GenerateOutput]:\n \"\"\"Generates `num_generations` responses for each input using the text generation\n pipeline.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n repetition_penalty: the repetition penalty to use for the generation. Defaults\n to `1.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n top_k: the top-k value to use for the generation. Defaults to `0`.\n do_sample: whether to use sampling or not. Defaults to `True`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n prepared_inputs = [self.prepare_input(input=input) for input in inputs]\n\n outputs: List[List[Dict[str, str]]] = self._pipeline(\n prepared_inputs,\n max_new_tokens=max_new_tokens,\n temperature=temperature,\n repetition_penalty=repetition_penalty,\n top_p=top_p,\n top_k=top_k,\n do_sample=do_sample,\n num_return_sequences=num_generations,\n prefix_allowed_tokens_fn=self._prefix_allowed_tokens_fn,\n logits_processor=self._logits_processor,\n pad_token_id=self._pipeline.tokenizer.eos_token_id,\n )\n llm_output = [\n [generation[\"generated_text\"] for generation in output]\n for output in outputs\n ]\n\n result = []\n for input, output in zip(inputs, llm_output):\n result.append(\n prepare_output(\n output,\n input_tokens=[\n compute_tokens(input, self._pipeline.tokenizer.encode)\n ],\n output_tokens=[\n compute_tokens(row, self._pipeline.tokenizer.encode)\n for row in output\n ],\n )\n )\n\n return result\n\n def get_last_hidden_states(\n self, inputs: List[\"StandardInput\"]\n ) -> List[\"HiddenState\"]:\n \"\"\"Gets the last `hidden_states` of the model for the given inputs. 
It doesn't\n execute the task head.\n\n Args:\n inputs: a list of inputs in chat format to generate the embeddings for.\n\n Returns:\n A list containing the last hidden state for each sequence using a NumPy array\n with shape [num_tokens, hidden_size].\n \"\"\"\n model: \"PreTrainedModel\" = (\n self._pipeline.model.model # type: ignore\n if hasattr(self._pipeline.model, \"model\") # type: ignore\n else next(self._pipeline.model.children()) # type: ignore\n )\n tokenizer: \"PreTrainedTokenizer\" = self._pipeline.tokenizer # type: ignore\n input_ids = tokenizer(\n [self.prepare_input(input) for input in inputs], # type: ignore\n return_tensors=\"pt\",\n padding=True,\n ).to(model.device)\n last_hidden_states = model(**input_ids)[\"last_hidden_state\"]\n\n return [\n seq_last_hidden_state[attention_mask.bool(), :].detach().cpu().numpy()\n for seq_last_hidden_state, attention_mask in zip(\n last_hidden_states,\n input_ids[\"attention_mask\"], # type: ignore\n )\n ]\n\n def _prepare_structured_output(\n self, structured_output: Optional[OutlinesStructuredOutputType] = None\n ) -> Union[Callable, None]:\n \"\"\"Creates the appropriate function to filter tokens to generate structured outputs.\n\n Args:\n structured_output: the configuration dict to prepare the structured output.\n\n Returns:\n The callable that will be used to guide the generation of the model.\n \"\"\"\n from distilabel.steps.tasks.structured_outputs.outlines import (\n prepare_guided_output,\n )\n\n result = prepare_guided_output(\n structured_output, \"transformers\", self._pipeline\n )\n if schema := result.get(\"schema\"):\n self.structured_output[\"schema\"] = schema\n return result[\"processor\"]\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TransformersLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TransformersLLM.load","title":"load()
","text":"Loads the model and tokenizer and creates the text generation pipeline. In addition, it will configure the tokenizer chat template.
Source code insrc/distilabel/models/llms/huggingface/transformers.py
def load(self) -> None:\n \"\"\"Loads the model and tokenizer and creates the text generation pipeline. In addition,\n it will configure the tokenizer chat template.\"\"\"\n if self.device == \"cuda\":\n CudaDevicePlacementMixin.load(self)\n\n try:\n from transformers import LogitsProcessorList, pipeline\n except ImportError as ie:\n raise ImportError(\n \"Transformers is not installed. Please install it using `pip install transformers`.\"\n ) from ie\n\n token = self.token.get_secret_value() if self.token is not None else self.token\n\n self._pipeline = pipeline(\n \"text-generation\",\n model=self.model,\n revision=self.revision,\n torch_dtype=self.torch_dtype,\n trust_remote_code=self.trust_remote_code,\n model_kwargs=self.model_kwargs or {},\n tokenizer=self.tokenizer or self.model,\n use_fast=self.use_fast,\n device=self.device,\n device_map=self.device_map,\n token=token,\n return_full_text=False,\n )\n\n if self.chat_template is not None:\n self._pipeline.tokenizer.chat_template = self.chat_template # type: ignore\n\n if self._pipeline.tokenizer.pad_token is None: # type: ignore\n self._pipeline.tokenizer.pad_token = self._pipeline.tokenizer.eos_token # type: ignore\n\n if self.structured_output:\n from distilabel.steps.tasks.structured_outputs.outlines import (\n outlines_below_0_1_0,\n )\n\n if outlines_below_0_1_0:\n self._prefix_allowed_tokens_fn = self._prepare_structured_output(\n self.structured_output\n )\n else:\n logits_processor = self._prepare_structured_output(\n self.structured_output\n )\n self._logits_processor = LogitsProcessorList([logits_processor])\n\n super().load()\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TransformersLLM.unload","title":"unload()
","text":"Unloads the vLLM
model.
src/distilabel/models/llms/huggingface/transformers.py
def unload(self) -> None:\n \"\"\"Unloads the `vLLM` model.\"\"\"\n CudaDevicePlacementMixin.unload(self)\n super().unload()\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TransformersLLM.prepare_input","title":"prepare_input(input)
","text":"Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name Type Description Defaultinput
StandardInput
the input list containing chat items.
requiredReturns:
Type Descriptionstr
The prompt to send to the LLM.
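A minimal sketch of what this method returns once the LLM has been loaded (the model id is the same illustrative one used in the class example):
from distilabel.models.llms import TransformersLLM\n\nllm = TransformersLLM(model=\"microsoft/Phi-3-mini-4k-instruct\")\nllm.load()\n\n# Applies the tokenizer's chat template and returns the flattened prompt string.\nprompt = llm.prepare_input(input=[{\"role\": \"user\", \"content\": \"Hello world!\"}])\nprint(prompt)\n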
Source code insrc/distilabel/models/llms/huggingface/transformers.py
def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n if self._pipeline.tokenizer.chat_template is None: # type: ignore\n return input[0][\"content\"]\n\n prompt: str = (\n self._pipeline.tokenizer.apply_chat_template( # type: ignore\n input, # type: ignore\n tokenize=False,\n add_generation_prompt=True,\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TransformersLLM.generate","title":"generate(inputs, num_generations=1, max_new_tokens=128, temperature=0.1, repetition_penalty=1.1, top_p=1.0, top_k=0, do_sample=True)
","text":"Generates num_generations
responses for each input using the text generation pipeline.
Parameters:
Name Type Description Defaultinputs
List[StandardInput]
a list of inputs in chat format to generate responses for.
requirednum_generations
int
the number of generations to create per input. Defaults to 1
.
1
max_new_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
temperature
float
the temperature to use for the generation. Defaults to 0.1
.
0.1
repetition_penalty
float
the repetition penalty to use for the generation. Defaults to 1.1
.
1.1
top_p
float
the top-p value to use for the generation. Defaults to 1.0
.
1.0
top_k
int
the top-k value to use for the generation. Defaults to 0
.
0
do_sample
bool
whether to use sampling or not. Defaults to True
.
True
Returns:
Type DescriptionList[GenerateOutput]
A list of lists of strings containing the generated responses for each input.
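A minimal usage sketch (model id reused from the class example; the keyword arguments match the signature above):
from distilabel.models.llms import TransformersLLM\n\nllm = TransformersLLM(model=\"microsoft/Phi-3-mini-4k-instruct\")\nllm.load()\n\noutputs = llm.generate(\n    inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]],\n    num_generations=2,\n    max_new_tokens=64,\n)\n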
Source code insrc/distilabel/models/llms/huggingface/transformers.py
@validate_call\ndef generate( # type: ignore\n self,\n inputs: List[StandardInput],\n num_generations: int = 1,\n max_new_tokens: int = 128,\n temperature: float = 0.1,\n repetition_penalty: float = 1.1,\n top_p: float = 1.0,\n top_k: int = 0,\n do_sample: bool = True,\n) -> List[GenerateOutput]:\n \"\"\"Generates `num_generations` responses for each input using the text generation\n pipeline.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n repetition_penalty: the repetition penalty to use for the generation. Defaults\n to `1.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n top_k: the top-k value to use for the generation. Defaults to `0`.\n do_sample: whether to use sampling or not. Defaults to `True`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n prepared_inputs = [self.prepare_input(input=input) for input in inputs]\n\n outputs: List[List[Dict[str, str]]] = self._pipeline(\n prepared_inputs,\n max_new_tokens=max_new_tokens,\n temperature=temperature,\n repetition_penalty=repetition_penalty,\n top_p=top_p,\n top_k=top_k,\n do_sample=do_sample,\n num_return_sequences=num_generations,\n prefix_allowed_tokens_fn=self._prefix_allowed_tokens_fn,\n logits_processor=self._logits_processor,\n pad_token_id=self._pipeline.tokenizer.eos_token_id,\n )\n llm_output = [\n [generation[\"generated_text\"] for generation in output]\n for output in outputs\n ]\n\n result = []\n for input, output in zip(inputs, llm_output):\n result.append(\n prepare_output(\n output,\n input_tokens=[\n compute_tokens(input, self._pipeline.tokenizer.encode)\n ],\n output_tokens=[\n compute_tokens(row, self._pipeline.tokenizer.encode)\n for row in output\n ],\n )\n )\n\n return result\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TransformersLLM.get_last_hidden_states","title":"get_last_hidden_states(inputs)
","text":"Gets the last hidden_states
of the model for the given inputs. It doesn't execute the task head.
Parameters:
Name Type Description Defaultinputs
List[StandardInput]
a list of inputs in chat format to generate the embeddings for.
requiredReturns:
Type DescriptionList[HiddenState]
A list containing the last hidden state for each sequence using a NumPy array
List[HiddenState]
with shape [num_tokens, hidden_size].
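A minimal sketch (same illustrative model id as above; each returned NumPy array has shape [num_tokens, hidden_size]):
from distilabel.models.llms import TransformersLLM\n\nllm = TransformersLLM(model=\"microsoft/Phi-3-mini-4k-instruct\")\nllm.load()\n\nhidden_states = llm.get_last_hidden_states(\n    inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]]\n)\nprint(hidden_states[0].shape)  # (num_tokens, hidden_size)\n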
Source code insrc/distilabel/models/llms/huggingface/transformers.py
def get_last_hidden_states(\n self, inputs: List[\"StandardInput\"]\n) -> List[\"HiddenState\"]:\n \"\"\"Gets the last `hidden_states` of the model for the given inputs. It doesn't\n execute the task head.\n\n Args:\n inputs: a list of inputs in chat format to generate the embeddings for.\n\n Returns:\n A list containing the last hidden state for each sequence using a NumPy array\n with shape [num_tokens, hidden_size].\n \"\"\"\n model: \"PreTrainedModel\" = (\n self._pipeline.model.model # type: ignore\n if hasattr(self._pipeline.model, \"model\") # type: ignore\n else next(self._pipeline.model.children()) # type: ignore\n )\n tokenizer: \"PreTrainedTokenizer\" = self._pipeline.tokenizer # type: ignore\n input_ids = tokenizer(\n [self.prepare_input(input) for input in inputs], # type: ignore\n return_tensors=\"pt\",\n padding=True,\n ).to(model.device)\n last_hidden_states = model(**input_ids)[\"last_hidden_state\"]\n\n return [\n seq_last_hidden_state[attention_mask.bool(), :].detach().cpu().numpy()\n for seq_last_hidden_state, attention_mask in zip(\n last_hidden_states,\n input_ids[\"attention_mask\"], # type: ignore\n )\n ]\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TransformersLLM._prepare_structured_output","title":"_prepare_structured_output(structured_output=None)
","text":"Creates the appropriate function to filter tokens to generate structured outputs.
Parameters:
Name Type Description Defaultstructured_output
Optional[OutlinesStructuredOutputType]
the configuration dict to prepare the structured output.
None
Returns:
Type DescriptionUnion[Callable, None]
The callable that will be used to guide the generation of the model.
Source code insrc/distilabel/models/llms/huggingface/transformers.py
def _prepare_structured_output(\n self, structured_output: Optional[OutlinesStructuredOutputType] = None\n) -> Union[Callable, None]:\n \"\"\"Creates the appropriate function to filter tokens to generate structured outputs.\n\n Args:\n structured_output: the configuration dict to prepare the structured output.\n\n Returns:\n The callable that will be used to guide the generation of the model.\n \"\"\"\n from distilabel.steps.tasks.structured_outputs.outlines import (\n prepare_guided_output,\n )\n\n result = prepare_guided_output(\n structured_output, \"transformers\", self._pipeline\n )\n if schema := result.get(\"schema\"):\n self.structured_output[\"schema\"] = schema\n return result[\"processor\"]\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LiteLLM","title":"LiteLLM
","text":" Bases: AsyncLLM
LiteLLM implementation running the async API client.
Attributes:
Name Type Descriptionmodel
str
the model name to use for the LLM e.g. \"gpt-3.5-turbo\" or \"mistral/mistral-large\", etc.
verbose
RuntimeParameter[bool]
whether to log the LiteLLM client's logs. Defaults to False
.
structured_output
Optional[RuntimeParameter[InstructorStructuredOutputType]]
a dictionary containing the structured output configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
Runtime parameters:
verbose
: whether to log the LiteLLM client's logs. Defaults to False
.Examples:
Generate text:
from distilabel.models.llms import LiteLLM\n\nllm = LiteLLM(model=\"gpt-3.5-turbo\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pydantic import BaseModel\nfrom distilabel.models.llms import LiteLLM\n\nclass User(BaseModel):\n    name: str\n    last_name: str\n    id: int\n\nllm = LiteLLM(\n    model=\"gpt-3.5-turbo\",\n    api_key=\"api.key\",\n    structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
Source code in src/distilabel/models/llms/litellm.py
class LiteLLM(AsyncLLM):\n \"\"\"LiteLLM implementation running the async API client.\n\n Attributes:\n model: the model name to use for the LLM e.g. \"gpt-3.5-turbo\" or \"mistral/mistral-large\",\n etc.\n verbose: whether to log the LiteLLM client's logs. Defaults to `False`.\n structured_output: a dictionary containing the structured output configuration configuration\n using `instructor`. You can take a look at the dictionary structure in\n `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.\n\n Runtime parameters:\n - `verbose`: whether to log the LiteLLM client's logs. Defaults to `False`.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import LiteLLM\n\n llm = LiteLLM(model=\"gpt-3.5-turbo\")\n\n llm.load()\n\n # Call the model\n output = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n\n Generate structured data:\n\n ```python\n from pydantic import BaseModel\n from distilabel.models.llms import LiteLLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = LiteLLM(\n model=\"gpt-3.5-turbo\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n \"\"\"\n\n model: str\n verbose: RuntimeParameter[bool] = Field(\n default=False, description=\"Whether to log the LiteLLM client's logs.\"\n )\n structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (\n Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n )\n\n _aclient: Optional[Callable] = PrivateAttr(...)\n\n def load(self) -> None:\n \"\"\"\n Loads the `acompletion` LiteLLM client to benefit from async requests.\n \"\"\"\n super().load()\n\n try:\n import litellm\n\n litellm.telemetry = False\n except ImportError as e:\n raise ImportError(\n \"LiteLLM Python client is not installed. 
Please install it using\"\n \" `pip install litellm`.\"\n ) from e\n self._aclient = litellm.acompletion\n\n if not self.verbose:\n litellm.suppress_debug_info = True\n for key in logging.Logger.manager.loggerDict.keys():\n if \"litellm\" not in key.lower():\n continue\n logging.getLogger(key).setLevel(logging.CRITICAL)\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"litellm\",\n )\n self._aclient = result.get(\"client\")\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n @validate_call\n async def agenerate( # type: ignore # noqa: C901\n self,\n input: FormattedInput,\n num_generations: int = 1,\n functions: Optional[List] = None,\n function_call: Optional[str] = None,\n temperature: Optional[float] = 1.0,\n top_p: Optional[float] = 1.0,\n stop: Optional[Union[str, list]] = None,\n max_tokens: Optional[int] = None,\n presence_penalty: Optional[float] = None,\n frequency_penalty: Optional[float] = None,\n logit_bias: Optional[dict] = None,\n user: Optional[str] = None,\n metadata: Optional[dict] = None,\n api_base: Optional[str] = None,\n api_version: Optional[str] = None,\n api_key: Optional[str] = None,\n model_list: Optional[list] = None,\n mock_response: Optional[str] = None,\n force_timeout: Optional[int] = 600,\n custom_llm_provider: Optional[str] = None,\n ) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for the given input using the [LiteLLM async client](https://github.com/BerriAI/litellm).\n\n Args:\n input: a single input in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n functions: a list of functions to apply to the conversation messages. Defaults to\n `None`.\n function_call: the name of the function to call within the conversation. Defaults\n to `None`.\n temperature: the temperature to use for the generation. Defaults to `1.0`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n stop: Up to 4 sequences where the LLM API will stop generating further tokens.\n Defaults to `None`.\n max_tokens: The maximum number of tokens in the generated completion. Defaults to\n `None`.\n presence_penalty: It is used to penalize new tokens based on their existence in the\n text so far. Defaults to `None`.\n frequency_penalty: It is used to penalize new tokens based on their frequency in the\n text so far. Defaults to `None`.\n logit_bias: Used to modify the probability of specific tokens appearing in the\n completion. Defaults to `None`.\n user: A unique identifier representing your end-user. This can help the LLM provider\n to monitor and detect abuse. Defaults to `None`.\n metadata: Pass in additional metadata to tag your completion calls - eg. prompt\n version, details, etc. Defaults to `None`.\n api_base: Base URL for the API. Defaults to `None`.\n api_version: API version. Defaults to `None`.\n api_key: API key. Defaults to `None`.\n model_list: List of api base, version, keys. Defaults to `None`.\n mock_response: If provided, return a mock completion response for testing or debugging\n purposes. 
Defaults to `None`.\n force_timeout: The maximum execution time in seconds for the completion request.\n Defaults to `600`.\n custom_llm_provider: Used for Non-OpenAI LLMs, Example usage for bedrock, set(iterable)\n model=\"amazon.titan-tg1-large\" and custom_llm_provider=\"bedrock\". Defaults to\n `None`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n import litellm\n from litellm import token_counter\n\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output,\n client=self._aclient,\n framework=\"litellm\",\n )\n self._aclient = result.get(\"client\")\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"model\": self.model,\n \"messages\": input,\n \"n\": num_generations,\n \"functions\": functions,\n \"function_call\": function_call,\n \"temperature\": temperature,\n \"top_p\": top_p,\n \"stream\": False,\n \"stop\": stop,\n \"max_tokens\": max_tokens,\n \"presence_penalty\": presence_penalty,\n \"frequency_penalty\": frequency_penalty,\n \"logit_bias\": logit_bias,\n \"user\": user,\n \"metadata\": metadata,\n \"api_base\": api_base,\n \"api_version\": api_version,\n \"api_key\": api_key,\n \"model_list\": model_list,\n \"mock_response\": mock_response,\n \"force_timeout\": force_timeout,\n \"custom_llm_provider\": custom_llm_provider,\n }\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output)\n\n async def _call_aclient_until_n_choices() -> List[\"Choices\"]:\n choices = []\n while len(choices) < num_generations:\n completion = await self._aclient(**kwargs) # type: ignore\n if not self.structured_output:\n completion = completion.choices\n choices.extend(completion)\n return choices\n\n # litellm.drop_params is used to en/disable sending **kwargs parameters to the API if they cannot be used\n try:\n litellm.drop_params = False\n choices = await _call_aclient_until_n_choices()\n except litellm.exceptions.APIError as e:\n if \"does not support parameters\" in str(e):\n litellm.drop_params = True\n choices = await _call_aclient_until_n_choices()\n else:\n raise e\n\n generations = []\n input_tokens = [\n token_counter(model=self.model, messages=input)\n ] * num_generations\n output_tokens = []\n\n if self.structured_output:\n for choice in choices:\n generations.append(choice.model_dump_json())\n output_tokens.append(\n token_counter(\n model=self.model,\n text=orjson.dumps(choice.model_dump_json()).decode(\"utf-8\"),\n )\n )\n return prepare_output(\n generations,\n input_tokens=input_tokens,\n output_tokens=output_tokens,\n )\n\n for choice in choices:\n if (content := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"Received no response using LiteLLM client (model: '{self.model}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(content)\n output_tokens.append(token_counter(model=self.model, text=content))\n\n return prepare_output(\n generations, input_tokens=input_tokens, output_tokens=output_tokens\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LiteLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LiteLLM.load","title":"load()
","text":"Loads the acompletion
LiteLLM client to benefit from async requests.
src/distilabel/models/llms/litellm.py
def load(self) -> None:\n \"\"\"\n Loads the `acompletion` LiteLLM client to benefit from async requests.\n \"\"\"\n super().load()\n\n try:\n import litellm\n\n litellm.telemetry = False\n except ImportError as e:\n raise ImportError(\n \"LiteLLM Python client is not installed. Please install it using\"\n \" `pip install litellm`.\"\n ) from e\n self._aclient = litellm.acompletion\n\n if not self.verbose:\n litellm.suppress_debug_info = True\n for key in logging.Logger.manager.loggerDict.keys():\n if \"litellm\" not in key.lower():\n continue\n logging.getLogger(key).setLevel(logging.CRITICAL)\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"litellm\",\n )\n self._aclient = result.get(\"client\")\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LiteLLM.agenerate","title":"agenerate(input, num_generations=1, functions=None, function_call=None, temperature=1.0, top_p=1.0, stop=None, max_tokens=None, presence_penalty=None, frequency_penalty=None, logit_bias=None, user=None, metadata=None, api_base=None, api_version=None, api_key=None, model_list=None, mock_response=None, force_timeout=600, custom_llm_provider=None)
async
","text":"Generates num_generations
responses for the given input using the LiteLLM async client.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requirednum_generations
int
the number of generations to create per input. Defaults to 1
.
1
functions
Optional[List]
a list of functions to apply to the conversation messages. Defaults to None
.
None
function_call
Optional[str]
the name of the function to call within the conversation. Defaults to None
.
None
temperature
Optional[float]
the temperature to use for the generation. Defaults to 1.0
.
1.0
top_p
Optional[float]
the top-p value to use for the generation. Defaults to 1.0
.
1.0
stop
Optional[Union[str, list]]
Up to 4 sequences where the LLM API will stop generating further tokens. Defaults to None
.
None
max_tokens
Optional[int]
The maximum number of tokens in the generated completion. Defaults to None
.
None
presence_penalty
Optional[float]
It is used to penalize new tokens based on their existence in the text so far. Defaults to None
.
None
frequency_penalty
Optional[float]
It is used to penalize new tokens based on their frequency in the text so far. Defaults to None
.
None
logit_bias
Optional[dict]
Used to modify the probability of specific tokens appearing in the completion. Defaults to None
.
None
user
Optional[str]
A unique identifier representing your end-user. This can help the LLM provider to monitor and detect abuse. Defaults to None
.
None
metadata
Optional[dict]
Pass in additional metadata to tag your completion calls - eg. prompt version, details, etc. Defaults to None
.
None
api_base
Optional[str]
Base URL for the API. Defaults to None
.
None
api_version
Optional[str]
API version. Defaults to None
.
None
api_key
Optional[str]
API key. Defaults to None
.
None
model_list
Optional[list]
List of api base, version, keys. Defaults to None
.
None
mock_response
Optional[str]
If provided, return a mock completion response for testing or debugging purposes. Defaults to None
.
None
force_timeout
Optional[int]
The maximum execution time in seconds for the completion request. Defaults to 600
.
600
custom_llm_provider
Optional[str]
Used for non-OpenAI LLMs. Example usage for Bedrock: set model=\"amazon.titan-tg1-large\" and custom_llm_provider=\"bedrock\". Defaults to None
.
None
Returns:
Type DescriptionGenerateOutput
A list of lists of strings containing the generated responses for each input.
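A minimal usage sketch (the model name mirrors the class example; agenerate is a coroutine, so it is awaited here through asyncio.run):
import asyncio\n\nfrom distilabel.models.llms import LiteLLM\n\nllm = LiteLLM(model=\"gpt-3.5-turbo\")\nllm.load()\n\noutput = asyncio.run(\n    llm.agenerate(\n        input=[{\"role\": \"user\", \"content\": \"Hello world!\"}],\n        num_generations=2,\n        temperature=0.7,\n    )\n)\n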
Source code insrc/distilabel/models/llms/litellm.py
@validate_call\nasync def agenerate( # type: ignore # noqa: C901\n self,\n input: FormattedInput,\n num_generations: int = 1,\n functions: Optional[List] = None,\n function_call: Optional[str] = None,\n temperature: Optional[float] = 1.0,\n top_p: Optional[float] = 1.0,\n stop: Optional[Union[str, list]] = None,\n max_tokens: Optional[int] = None,\n presence_penalty: Optional[float] = None,\n frequency_penalty: Optional[float] = None,\n logit_bias: Optional[dict] = None,\n user: Optional[str] = None,\n metadata: Optional[dict] = None,\n api_base: Optional[str] = None,\n api_version: Optional[str] = None,\n api_key: Optional[str] = None,\n model_list: Optional[list] = None,\n mock_response: Optional[str] = None,\n force_timeout: Optional[int] = 600,\n custom_llm_provider: Optional[str] = None,\n) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for the given input using the [LiteLLM async client](https://github.com/BerriAI/litellm).\n\n Args:\n input: a single input in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n functions: a list of functions to apply to the conversation messages. Defaults to\n `None`.\n function_call: the name of the function to call within the conversation. Defaults\n to `None`.\n temperature: the temperature to use for the generation. Defaults to `1.0`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n stop: Up to 4 sequences where the LLM API will stop generating further tokens.\n Defaults to `None`.\n max_tokens: The maximum number of tokens in the generated completion. Defaults to\n `None`.\n presence_penalty: It is used to penalize new tokens based on their existence in the\n text so far. Defaults to `None`.\n frequency_penalty: It is used to penalize new tokens based on their frequency in the\n text so far. Defaults to `None`.\n logit_bias: Used to modify the probability of specific tokens appearing in the\n completion. Defaults to `None`.\n user: A unique identifier representing your end-user. This can help the LLM provider\n to monitor and detect abuse. Defaults to `None`.\n metadata: Pass in additional metadata to tag your completion calls - eg. prompt\n version, details, etc. Defaults to `None`.\n api_base: Base URL for the API. Defaults to `None`.\n api_version: API version. Defaults to `None`.\n api_key: API key. Defaults to `None`.\n model_list: List of api base, version, keys. Defaults to `None`.\n mock_response: If provided, return a mock completion response for testing or debugging\n purposes. Defaults to `None`.\n force_timeout: The maximum execution time in seconds for the completion request.\n Defaults to `600`.\n custom_llm_provider: Used for Non-OpenAI LLMs, Example usage for bedrock, set(iterable)\n model=\"amazon.titan-tg1-large\" and custom_llm_provider=\"bedrock\". 
Defaults to\n `None`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n import litellm\n from litellm import token_counter\n\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output,\n client=self._aclient,\n framework=\"litellm\",\n )\n self._aclient = result.get(\"client\")\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"model\": self.model,\n \"messages\": input,\n \"n\": num_generations,\n \"functions\": functions,\n \"function_call\": function_call,\n \"temperature\": temperature,\n \"top_p\": top_p,\n \"stream\": False,\n \"stop\": stop,\n \"max_tokens\": max_tokens,\n \"presence_penalty\": presence_penalty,\n \"frequency_penalty\": frequency_penalty,\n \"logit_bias\": logit_bias,\n \"user\": user,\n \"metadata\": metadata,\n \"api_base\": api_base,\n \"api_version\": api_version,\n \"api_key\": api_key,\n \"model_list\": model_list,\n \"mock_response\": mock_response,\n \"force_timeout\": force_timeout,\n \"custom_llm_provider\": custom_llm_provider,\n }\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output)\n\n async def _call_aclient_until_n_choices() -> List[\"Choices\"]:\n choices = []\n while len(choices) < num_generations:\n completion = await self._aclient(**kwargs) # type: ignore\n if not self.structured_output:\n completion = completion.choices\n choices.extend(completion)\n return choices\n\n # litellm.drop_params is used to en/disable sending **kwargs parameters to the API if they cannot be used\n try:\n litellm.drop_params = False\n choices = await _call_aclient_until_n_choices()\n except litellm.exceptions.APIError as e:\n if \"does not support parameters\" in str(e):\n litellm.drop_params = True\n choices = await _call_aclient_until_n_choices()\n else:\n raise e\n\n generations = []\n input_tokens = [\n token_counter(model=self.model, messages=input)\n ] * num_generations\n output_tokens = []\n\n if self.structured_output:\n for choice in choices:\n generations.append(choice.model_dump_json())\n output_tokens.append(\n token_counter(\n model=self.model,\n text=orjson.dumps(choice.model_dump_json()).decode(\"utf-8\"),\n )\n )\n return prepare_output(\n generations,\n input_tokens=input_tokens,\n output_tokens=output_tokens,\n )\n\n for choice in choices:\n if (content := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"Received no response using LiteLLM client (model: '{self.model}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(content)\n output_tokens.append(token_counter(model=self.model, text=content))\n\n return prepare_output(\n generations, input_tokens=input_tokens, output_tokens=output_tokens\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LlamaCppLLM","title":"LlamaCppLLM
","text":" Bases: LLM
, MagpieChatTemplateMixin
llama.cpp LLM implementation running the Python bindings for the C++ code.
Attributes:
Name Type Descriptionmodel_path
RuntimeParameter[FilePath]
contains the path to the GGUF quantized model, compatible with the installed version of the llama.cpp
Python bindings.
n_gpu_layers
RuntimeParameter[int]
the number of layers to use for the GPU. Defaults to -1
, meaning that the available GPU device will be used.
chat_format
Optional[RuntimeParameter[str]]
the chat format to use for the model. Defaults to None
, which means the Llama format will be used.
n_ctx
int
the context size to use for the model. Defaults to 512
.
n_batch
int
the prompt processing maximum batch size to use for the model. Defaults to 512
.
seed
int
random seed to use for the generation. Defaults to 4294967295
.
verbose
RuntimeParameter[bool]
whether to print verbose output. Defaults to False
.
structured_output
Optional[RuntimeParameter[OutlinesStructuredOutputType]]
a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
extra_kwargs
Optional[RuntimeParameter[Dict[str, Any]]]
additional dictionary of keyword arguments that will be passed to the Llama
class of llama_cpp
library. Defaults to {}
.
tokenizer_id
Optional[RuntimeParameter[str]]
the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer config files. If not provided, the one associated to the model
will be used. Defaults to None
.
use_magpie_template
bool
a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template
Union[MagpieAvailablePreQueryTemplates, str, None]
the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
_model
Optional[Llama]
the Llama model instance. This attribute is meant to be used internally and should not be accessed directly. It will be set in the load
method.
model_path
: the path to the GGUF quantized model.n_gpu_layers
: the number of layers to use for the GPU. Defaults to -1
.chat_format
: the chat format to use for the model. Defaults to None
.verbose
: whether to print verbose output. Defaults to False
.extra_kwargs
: additional dictionary of keyword arguments that will be passed to the Llama
class of llama_cpp
library. Defaults to {}
.llama.cpp
llama-cpp-python
Examples:
Generate text:
from pathlib import Path\nfrom distilabel.models.llms import LlamaCppLLM\n\n# You can follow along this example downloading the following model running the following\n# command in the terminal, that will download the model to the `Downloads` folder:\n# curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf\n\nmodel_path = \"Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf\"\n\nllm = LlamaCppLLM(\n model_path=str(Path.home() / model_path),\n n_gpu_layers=-1, # To use the GPU if available\n n_ctx=1024, # Set the context size\n)\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pathlib import Path\nfrom pydantic import BaseModel\n\nfrom distilabel.models.llms import LlamaCppLLM\n\nmodel_path = \"Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf\"\n\nclass User(BaseModel):\n    name: str\n    last_name: str\n    id: int\n\nllm = LlamaCppLLM(\n    model_path=str(Path.home() / model_path),  # type: ignore\n    n_gpu_layers=-1,\n    n_ctx=1024,\n    structured_output={\"format\": \"json\", \"schema\": User},\n)\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
Source code in src/distilabel/models/llms/llamacpp.py
class LlamaCppLLM(LLM, MagpieChatTemplateMixin):\n \"\"\"llama.cpp LLM implementation running the Python bindings for the C++ code.\n\n Attributes:\n model_path: contains the path to the GGUF quantized model, compatible with the\n installed version of the `llama.cpp` Python bindings.\n n_gpu_layers: the number of layers to use for the GPU. Defaults to `-1`, meaning that\n the available GPU device will be used.\n chat_format: the chat format to use for the model. Defaults to `None`, which means the\n Llama format will be used.\n n_ctx: the context size to use for the model. Defaults to `512`.\n n_batch: the prompt processing maximum batch size to use for the model. Defaults to `512`.\n seed: random seed to use for the generation. Defaults to `4294967295`.\n verbose: whether to print verbose output. Defaults to `False`.\n structured_output: a dictionary containing the structured output configuration or if more\n fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.\n extra_kwargs: additional dictionary of keyword arguments that will be passed to the\n `Llama` class of `llama_cpp` library. Defaults to `{}`.\n tokenizer_id: the tokenizer Hugging Face Hub repo id or a path to a directory containing\n the tokenizer config files. If not provided, the one associated to the `model`\n will be used. Defaults to `None`.\n use_magpie_template: a flag used to enable/disable applying the Magpie pre-query\n template. Defaults to `False`.\n magpie_pre_query_template: the pre-query template to be applied to the prompt or\n sent to the LLM to generate an instruction or a follow up user message. Valid\n values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults\n to `None`.\n _model: the Llama model instance. This attribute is meant to be used internally and\n should not be accessed directly. It will be set in the `load` method.\n\n Runtime parameters:\n - `model_path`: the path to the GGUF quantized model.\n - `n_gpu_layers`: the number of layers to use for the GPU. Defaults to `-1`.\n - `chat_format`: the chat format to use for the model. Defaults to `None`.\n - `verbose`: whether to print verbose output. Defaults to `False`.\n - `extra_kwargs`: additional dictionary of keyword arguments that will be passed to the\n `Llama` class of `llama_cpp` library. 
Defaults to `{}`.\n\n References:\n - [`llama.cpp`](https://github.com/ggerganov/llama.cpp)\n - [`llama-cpp-python`](https://github.com/abetlen/llama-cpp-python)\n\n Examples:\n Generate text:\n\n ```python\n from pathlib import Path\n from distilabel.models.llms import LlamaCppLLM\n\n # You can follow along this example downloading the following model running the following\n # command in the terminal, that will download the model to the `Downloads` folder:\n # curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf\n\n model_path = \"Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf\"\n\n llm = LlamaCppLLM(\n model_path=str(Path.home() / model_path),\n n_gpu_layers=-1, # To use the GPU if available\n n_ctx=1024, # Set the context size\n )\n\n llm.load()\n\n # Call the model\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Generate structured data:\n\n ```python\n from pathlib import Path\n from distilabel.models.llms import LlamaCppLLM\n\n model_path = \"Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf\"\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = LlamaCppLLM(\n model_path=str(Path.home() / model_path), # type: ignore\n n_gpu_layers=-1,\n n_ctx=1024,\n structured_output={\"format\": \"json\", \"schema\": Character},\n )\n\n llm.load()\n\n # Call the model\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n \"\"\"\n\n model_path: RuntimeParameter[FilePath] = Field(\n default=None, description=\"The path to the GGUF quantized model.\", exclude=True\n )\n n_gpu_layers: RuntimeParameter[int] = Field(\n default=-1,\n description=\"The number of layers that will be loaded in the GPU.\",\n )\n chat_format: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The chat format to use for the model. Defaults to `None`, which means the Llama format will be used.\",\n )\n\n n_ctx: int = 512\n n_batch: int = 512\n seed: int = 4294967295\n verbose: RuntimeParameter[bool] = Field(\n default=False,\n description=\"Whether to print verbose output from llama.cpp library.\",\n )\n extra_kwargs: Optional[RuntimeParameter[Dict[str, Any]]] = Field(\n default_factory=dict,\n description=\"Additional dictionary of keyword arguments that will be passed to the\"\n \" `Llama` class of `llama_cpp` library. See all the supported arguments at: \"\n \"https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__\",\n )\n structured_output: Optional[RuntimeParameter[OutlinesStructuredOutputType]] = Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n tokenizer_id: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The Hugging Face Hub repo id or a path to a directory containing\"\n \" the tokenizer config files. 
If not provided, the one associated to the `model`\"\n \" will be used.\",\n )\n _logits_processor: Optional[\"LogitsProcessorList\"] = PrivateAttr(default=None)\n _model: Optional[\"Llama\"] = PrivateAttr(...)\n\n @model_validator(mode=\"after\")\n def validate_magpie_usage(\n self,\n ) -> \"LlamaCppLLM\":\n \"\"\"Validates that magpie usage is valid.\"\"\"\n\n if self.use_magpie_template and self.tokenizer_id is None:\n raise ValueError(\n \"`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,\"\n \" set a `tokenizer_id` and try again.\"\n )\n\n def load(self) -> None:\n \"\"\"Loads the `Llama` model from the `model_path`.\"\"\"\n try:\n from llama_cpp import Llama\n except ImportError as ie:\n raise ImportError(\n \"The `llama_cpp` package is required to use the `LlamaCppLLM` class.\"\n ) from ie\n\n self._model = Llama(\n model_path=self.model_path.as_posix(),\n seed=self.seed,\n n_ctx=self.n_ctx,\n n_batch=self.n_batch,\n chat_format=self.chat_format,\n n_gpu_layers=self.n_gpu_layers,\n verbose=self.verbose,\n **self.extra_kwargs,\n )\n\n if self.structured_output:\n self._logits_processor = self._prepare_structured_output(\n self.structured_output\n )\n\n if self.use_magpie_template or self.magpie_pre_query_template:\n if not self.tokenizer_id:\n raise ValueError(\n \"The Hugging Face Hub repo id or a path to a directory containing\"\n \" the tokenizer config files is required when using the `use_magpie_template`\"\n \" or `magpie_pre_query_template` runtime parameters.\"\n )\n\n if self.tokenizer_id:\n try:\n from transformers import AutoTokenizer\n except ImportError as ie:\n raise ImportError(\n \"Transformers is not installed. Please install it using `pip install 'distilabel[hf-transformers]'`.\"\n ) from ie\n self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_id)\n if self._tokenizer.chat_template is None:\n raise ValueError(\n \"The tokenizer does not have a chat template. 
Please use a tokenizer with a chat template.\"\n )\n\n # NOTE: Here because of the custom `logging` interface used, since it will create the logging name\n # out of the model name, which won't be available until the `Llama` instance is created.\n super().load()\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self._model.model_path # type: ignore\n\n def _generate_chat_completion(\n self,\n input: FormattedInput,\n max_new_tokens: int = 128,\n frequency_penalty: float = 0.0,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n extra_generation_kwargs: Optional[Dict[str, Any]] = None,\n ) -> \"CreateChatCompletionResponse\":\n return self._model.create_chat_completion( # type: ignore\n messages=input, # type: ignore\n max_tokens=max_new_tokens,\n frequency_penalty=frequency_penalty,\n presence_penalty=presence_penalty,\n temperature=temperature,\n top_p=top_p,\n logits_processor=self._logits_processor,\n **(extra_generation_kwargs or {}),\n )\n\n def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n prompt: str = (\n self._tokenizer.apply_chat_template( # type: ignore\n conversation=input, # type: ignore\n tokenize=False,\n add_generation_prompt=True,\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n\n def _generate_with_text_generation(\n self,\n input: FormattedInput,\n max_new_tokens: int = 128,\n frequency_penalty: float = 0.0,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n extra_generation_kwargs: Optional[Dict[str, Any]] = None,\n ) -> \"CreateChatCompletionResponse\":\n prompt = self.prepare_input(input)\n return self._model.create_completion(\n prompt=prompt,\n max_tokens=max_new_tokens,\n frequency_penalty=frequency_penalty,\n presence_penalty=presence_penalty,\n temperature=temperature,\n top_p=top_p,\n logits_processor=self._logits_processor,\n **(extra_generation_kwargs or {}),\n )\n\n @validate_call\n def generate( # type: ignore\n self,\n inputs: List[FormattedInput],\n num_generations: int = 1,\n max_new_tokens: int = 128,\n frequency_penalty: float = 0.0,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n extra_generation_kwargs: Optional[Dict[str, Any]] = None,\n ) -> List[GenerateOutput]:\n \"\"\"Generates `num_generations` responses for the given input using the Llama model.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n extra_generation_kwargs: dictionary with additional arguments to be passed to\n the `create_chat_completion` method. 
Reference at\n https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n structured_output = None\n batch_outputs = []\n for input in inputs:\n if isinstance(input, tuple):\n input, structured_output = input\n elif self.structured_output:\n structured_output = self.structured_output\n\n outputs = []\n output_tokens = []\n for _ in range(num_generations):\n # NOTE(plaguss): There seems to be a bug in how the logits processor\n # is used. Basically it consumes the FSM internally, and it isn't reinitialized\n # after each generation, so subsequent calls yield nothing. This is a workaround\n # until is fixed in the `llama_cpp` or `outlines` libraries.\n if structured_output:\n self._logits_processor = self._prepare_structured_output(\n structured_output\n )\n if self.tokenizer_id is None:\n completion = self._generate_chat_completion(\n input,\n max_new_tokens,\n frequency_penalty,\n presence_penalty,\n temperature,\n top_p,\n extra_generation_kwargs,\n )\n outputs.append(completion[\"choices\"][0][\"message\"][\"content\"])\n output_tokens.append(completion[\"usage\"][\"completion_tokens\"])\n else:\n completion: \"CreateChatCompletionResponse\" = (\n self._generate_with_text_generation( # type: ignore\n input,\n max_new_tokens,\n frequency_penalty,\n presence_penalty,\n temperature,\n top_p,\n extra_generation_kwargs,\n )\n )\n outputs.append(completion[\"choices\"][0][\"text\"])\n output_tokens.append(completion[\"usage\"][\"completion_tokens\"])\n batch_outputs.append(\n prepare_output(\n outputs,\n input_tokens=[completion[\"usage\"][\"prompt_tokens\"]]\n * num_generations,\n output_tokens=output_tokens,\n )\n )\n\n return batch_outputs\n\n def _prepare_structured_output(\n self, structured_output: Optional[OutlinesStructuredOutputType] = None\n ) -> Union[\"LogitsProcessorList\", None]:\n \"\"\"Creates the appropriate function to filter tokens to generate structured outputs.\n\n Args:\n structured_output: the configuration dict to prepare the structured output.\n\n Returns:\n The callable that will be used to guide the generation of the model.\n \"\"\"\n from distilabel.steps.tasks.structured_outputs.outlines import (\n prepare_guided_output,\n )\n\n result = prepare_guided_output(structured_output, \"llamacpp\", self._model)\n if (schema := result.get(\"schema\")) and self.structured_output:\n self.structured_output[\"schema\"] = schema\n return result[\"processor\"]\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LlamaCppLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LlamaCppLLM.validate_magpie_usage","title":"validate_magpie_usage()
","text":"Validates that magpie usage is valid.
Source code insrc/distilabel/models/llms/llamacpp.py
@model_validator(mode=\"after\")\ndef validate_magpie_usage(\n self,\n) -> \"LlamaCppLLM\":\n \"\"\"Validates that magpie usage is valid.\"\"\"\n\n if self.use_magpie_template and self.tokenizer_id is None:\n raise ValueError(\n \"`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,\"\n \" set a `tokenizer_id` and try again.\"\n )\n
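In practice this means that any Magpie-related option has to be paired with a tokenizer. A minimal configuration sketch (the GGUF path and tokenizer repo id below are placeholders):
from pathlib import Path\n\nfrom distilabel.models.llms import LlamaCppLLM\n\n# Setting `use_magpie_template=True` without a `tokenizer_id` would raise a\n# `ValueError` in this validator, so both options are provided together.\nllm = LlamaCppLLM(\n    model_path=str(Path.home() / \"Downloads/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf\"),  # placeholder path\n    tokenizer_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n    use_magpie_template=True,\n    magpie_pre_query_template=\"llama3\",\n)\n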
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LlamaCppLLM.load","title":"load()
","text":"Loads the Llama
model from the model_path
.
src/distilabel/models/llms/llamacpp.py
def load(self) -> None:\n \"\"\"Loads the `Llama` model from the `model_path`.\"\"\"\n try:\n from llama_cpp import Llama\n except ImportError as ie:\n raise ImportError(\n \"The `llama_cpp` package is required to use the `LlamaCppLLM` class.\"\n ) from ie\n\n self._model = Llama(\n model_path=self.model_path.as_posix(),\n seed=self.seed,\n n_ctx=self.n_ctx,\n n_batch=self.n_batch,\n chat_format=self.chat_format,\n n_gpu_layers=self.n_gpu_layers,\n verbose=self.verbose,\n **self.extra_kwargs,\n )\n\n if self.structured_output:\n self._logits_processor = self._prepare_structured_output(\n self.structured_output\n )\n\n if self.use_magpie_template or self.magpie_pre_query_template:\n if not self.tokenizer_id:\n raise ValueError(\n \"The Hugging Face Hub repo id or a path to a directory containing\"\n \" the tokenizer config files is required when using the `use_magpie_template`\"\n \" or `magpie_pre_query_template` runtime parameters.\"\n )\n\n if self.tokenizer_id:\n try:\n from transformers import AutoTokenizer\n except ImportError as ie:\n raise ImportError(\n \"Transformers is not installed. Please install it using `pip install 'distilabel[hf-transformers]'`.\"\n ) from ie\n self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_id)\n if self._tokenizer.chat_template is None:\n raise ValueError(\n \"The tokenizer does not have a chat template. Please use a tokenizer with a chat template.\"\n )\n\n # NOTE: Here because of the custom `logging` interface used, since it will create the logging name\n # out of the model name, which won't be available until the `Llama` instance is created.\n super().load()\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LlamaCppLLM.prepare_input","title":"prepare_input(input)
","text":"Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name Type Description Defaultinput
StandardInput
the input list containing chat items.
requiredReturns:
Type Descriptionstr
The prompt to send to the LLM.
Source code insrc/distilabel/models/llms/llamacpp.py
def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n prompt: str = (\n self._tokenizer.apply_chat_template( # type: ignore\n conversation=input, # type: ignore\n tokenize=False,\n add_generation_prompt=True,\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n
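A short sketch of what prepare_input returns for a chat conversation, assuming llm is a LlamaCppLLM that was created with a tokenizer_id and has already been loaded:
# Assumes `llm` is a `LlamaCppLLM` instantiated with a `tokenizer_id` and loaded.\nprompt = llm.prepare_input(\n    [\n        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n        {\"role\": \"user\", \"content\": \"Hello world!\"},\n    ]\n)\n# `prompt` is the tokenizer chat template applied to the conversation, ending with\n# the generation prompt (plus the Magpie pre-query template when enabled).\nprint(prompt)\n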
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LlamaCppLLM.generate","title":"generate(inputs, num_generations=1, max_new_tokens=128, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, extra_generation_kwargs=None)
","text":"Generates num_generations
responses for the given input using the Llama model.
Parameters:
Name Type Description Defaultinputs
List[FormattedInput]
a list of inputs in chat format to generate responses for.
requirednum_generations
int
the number of generations to create per input. Defaults to 1
.
1
max_new_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
frequency_penalty
float
the repetition penalty to use for the generation. Defaults to 0.0
.
0.0
presence_penalty
float
the presence penalty to use for the generation. Defaults to 0.0
.
0.0
temperature
float
the temperature to use for the generation. Defaults to 0.1
.
1.0
top_p
float
the top-p value to use for the generation. Defaults to 1.0
.
1.0
extra_generation_kwargs
Optional[Dict[str, Any]]
dictionary with additional arguments to be passed to the create_chat_completion
method. Reference at https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion
None
Returns:
Type DescriptionList[GenerateOutput]
A list of lists of strings containing the generated responses for each input.
Source code insrc/distilabel/models/llms/llamacpp.py
@validate_call\ndef generate( # type: ignore\n self,\n inputs: List[FormattedInput],\n num_generations: int = 1,\n max_new_tokens: int = 128,\n frequency_penalty: float = 0.0,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n extra_generation_kwargs: Optional[Dict[str, Any]] = None,\n) -> List[GenerateOutput]:\n \"\"\"Generates `num_generations` responses for the given input using the Llama model.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n extra_generation_kwargs: dictionary with additional arguments to be passed to\n the `create_chat_completion` method. Reference at\n https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n structured_output = None\n batch_outputs = []\n for input in inputs:\n if isinstance(input, tuple):\n input, structured_output = input\n elif self.structured_output:\n structured_output = self.structured_output\n\n outputs = []\n output_tokens = []\n for _ in range(num_generations):\n # NOTE(plaguss): There seems to be a bug in how the logits processor\n # is used. Basically it consumes the FSM internally, and it isn't reinitialized\n # after each generation, so subsequent calls yield nothing. This is a workaround\n # until is fixed in the `llama_cpp` or `outlines` libraries.\n if structured_output:\n self._logits_processor = self._prepare_structured_output(\n structured_output\n )\n if self.tokenizer_id is None:\n completion = self._generate_chat_completion(\n input,\n max_new_tokens,\n frequency_penalty,\n presence_penalty,\n temperature,\n top_p,\n extra_generation_kwargs,\n )\n outputs.append(completion[\"choices\"][0][\"message\"][\"content\"])\n output_tokens.append(completion[\"usage\"][\"completion_tokens\"])\n else:\n completion: \"CreateChatCompletionResponse\" = (\n self._generate_with_text_generation( # type: ignore\n input,\n max_new_tokens,\n frequency_penalty,\n presence_penalty,\n temperature,\n top_p,\n extra_generation_kwargs,\n )\n )\n outputs.append(completion[\"choices\"][0][\"text\"])\n output_tokens.append(completion[\"usage\"][\"completion_tokens\"])\n batch_outputs.append(\n prepare_output(\n outputs,\n input_tokens=[completion[\"usage\"][\"prompt_tokens\"]]\n * num_generations,\n output_tokens=output_tokens,\n )\n )\n\n return batch_outputs\n
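A usage sketch for generate, assuming llm is a loaded LlamaCppLLM as in the examples above; the sampling values and the extra repeat_penalty argument are illustrative:
# Assumes `llm` is a loaded `LlamaCppLLM`.\noutputs = llm.generate(\n    inputs=[\n        [{\"role\": \"user\", \"content\": \"Write a haiku about GPUs.\"}],\n        [{\"role\": \"user\", \"content\": \"Give three facts about llamas.\"}],\n    ],\n    num_generations=2,\n    max_new_tokens=256,\n    temperature=0.7,\n    extra_generation_kwargs={\"repeat_penalty\": 1.1},\n)\n# One `GenerateOutput` per input, each holding `num_generations` generations and\n# token statistics.\nfor output in outputs:\n    print(output)\n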
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.LlamaCppLLM._prepare_structured_output","title":"_prepare_structured_output(structured_output=None)
","text":"Creates the appropriate function to filter tokens to generate structured outputs.
Parameters:
Name Type Description Defaultstructured_output
Optional[OutlinesStructuredOutputType]
the configuration dict to prepare the structured output.
None
Returns:
Type DescriptionUnion[LogitsProcessorList, None]
The callable that will be used to guide the generation of the model.
Source code insrc/distilabel/models/llms/llamacpp.py
def _prepare_structured_output(\n self, structured_output: Optional[OutlinesStructuredOutputType] = None\n) -> Union[\"LogitsProcessorList\", None]:\n \"\"\"Creates the appropriate function to filter tokens to generate structured outputs.\n\n Args:\n structured_output: the configuration dict to prepare the structured output.\n\n Returns:\n The callable that will be used to guide the generation of the model.\n \"\"\"\n from distilabel.steps.tasks.structured_outputs.outlines import (\n prepare_guided_output,\n )\n\n result = prepare_guided_output(structured_output, \"llamacpp\", self._model)\n if (schema := result.get(\"schema\")) and self.structured_output:\n self.structured_output[\"schema\"] = schema\n return result[\"processor\"]\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MistralLLM","title":"MistralLLM
","text":" Bases: AsyncLLM
Mistral LLM implementation running the async API client.
Attributes:
Name Type Descriptionmodel
str
the model name to use for the LLM e.g. \"mistral-tiny\", \"mistral-large\", etc.
endpoint
str
the endpoint to use for the Mistral API. Defaults to \"https://api.mistral.ai\".
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the Mistral API. Defaults to None
which means that the value set for the Mistral API key environment variable
will be used, or None
if not set.
max_retries
RuntimeParameter[int]
the maximum number of retries to attempt when a request fails. Defaults to 5
.
timeout
RuntimeParameter[int]
the maximum time in seconds to wait for a response. Defaults to 120
.
max_concurrent_requests
RuntimeParameter[int]
the maximum number of concurrent requests to send. Defaults to 64
.
structured_output
Optional[RuntimeParameter[InstructorStructuredOutputType]]
a dictionary containing the structured output configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
_api_key_env_var
str
the name of the environment variable to use for the API key. It is meant to be used internally.
_aclient
Optional[Mistral]
the Mistral
to use for the Mistral API. It is meant to be used internally. Set in the load
method.
api_key
: the API key to authenticate the requests to the Mistral API.max_retries
: the maximum number of retries to attempt when a request fails. Defaults to 5
.timeout
: the maximum time in seconds to wait for a response. Defaults to 120
.max_concurrent_requests
: the maximum number of concurrent requests to send. Defaults to 64
.Examples:
Generate text:
from distilabel.models.llms import MistralLLM\n\nllm = MistralLLM(model=\"open-mixtral-8x22b\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pydantic import BaseModel\nfrom distilabel.models.llms import MistralLLM\n\nclass User(BaseModel):\n    name: str\n    last_name: str\n    id: int\n\nllm = MistralLLM(\n    model=\"open-mixtral-8x22b\",\n    api_key=\"api.key\",\n    structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
Source code in src/distilabel/models/llms/mistral.py
class MistralLLM(AsyncLLM):\n \"\"\"Mistral LLM implementation running the async API client.\n\n Attributes:\n model: the model name to use for the LLM e.g. \"mistral-tiny\", \"mistral-large\", etc.\n endpoint: the endpoint to use for the Mistral API. Defaults to \"https://api.mistral.ai\".\n api_key: the API key to authenticate the requests to the Mistral API. Defaults to `None` which\n means that the value set for the environment variable `OPENAI_API_KEY` will be used, or\n `None` if not set.\n max_retries: the maximum number of retries to attempt when a request fails. Defaults to `5`.\n timeout: the maximum time in seconds to wait for a response. Defaults to `120`.\n max_concurrent_requests: the maximum number of concurrent requests to send. Defaults\n to `64`.\n structured_output: a dictionary containing the structured output configuration configuration\n using `instructor`. You can take a look at the dictionary structure in\n `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.\n _api_key_env_var: the name of the environment variable to use for the API key. It is meant to\n be used internally.\n _aclient: the `Mistral` to use for the Mistral API. It is meant to be used internally.\n Set in the `load` method.\n\n Runtime parameters:\n - `api_key`: the API key to authenticate the requests to the Mistral API.\n - `max_retries`: the maximum number of retries to attempt when a request fails.\n Defaults to `5`.\n - `timeout`: the maximum time in seconds to wait for a response. Defaults to `120`.\n - `max_concurrent_requests`: the maximum number of concurrent requests to send.\n Defaults to `64`.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import MistralLLM\n\n llm = MistralLLM(model=\"open-mixtral-8x22b\")\n\n llm.load()\n\n # Call the model\n output = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n\n Generate structured data:\n\n ```python\n from pydantic import BaseModel\n from distilabel.models.llms import MistralLLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = MistralLLM(\n model=\"open-mixtral-8x22b\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n \"\"\"\n\n model: str\n endpoint: str = \"https://api.mistral.ai\"\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_MISTRALAI_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the Mistral API.\",\n )\n max_retries: RuntimeParameter[int] = Field(\n default=6,\n description=\"The maximum number of times to retry the request to the API before\"\n \" failing.\",\n )\n timeout: RuntimeParameter[int] = Field(\n default=120,\n description=\"The maximum time in seconds to wait for a response from the API.\",\n )\n max_concurrent_requests: RuntimeParameter[int] = Field(\n default=64, description=\"The maximum number of concurrent requests to send.\"\n )\n structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (\n Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n )\n\n _num_generations_param_supported = False\n\n _api_key_env_var: str = PrivateAttr(_MISTRALAI_API_KEY_ENV_VAR_NAME)\n _aclient: Optional[\"Mistral\"] = PrivateAttr(...)\n\n def load(self) -> None:\n \"\"\"Loads the 
`Mistral` client to benefit from async requests.\"\"\"\n super().load()\n\n try:\n from mistralai import Mistral\n except ImportError as ie:\n raise ImportError(\n \"MistralAI Python client is not installed. Please install it using\"\n \" `pip install mistralai`.\"\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n self._aclient = Mistral(\n api_key=self.api_key.get_secret_value(),\n endpoint=self.endpoint,\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout, # type: ignore\n max_concurrent_requests=self.max_concurrent_requests, # type: ignore\n )\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"mistral\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n # TODO: add `num_generations` parameter once Mistral client allows `n` parameter\n @validate_call\n async def agenerate( # type: ignore\n self,\n input: FormattedInput,\n max_new_tokens: Optional[int] = None,\n temperature: Optional[float] = None,\n top_p: Optional[float] = None,\n ) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for the given input using the MistralAI async\n client.\n\n Args:\n input: a single input in chat format to generate responses for.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. 
Defaults to `1.0`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output,\n client=self._aclient,\n framework=\"mistral\",\n )\n self._aclient = result.get(\"client\")\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"messages\": input, # type: ignore\n \"model\": self.model,\n \"max_tokens\": max_new_tokens,\n \"temperature\": temperature,\n \"top_p\": top_p,\n }\n generations = []\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output)\n # TODO:\u00a0This should work just with the _aclient.chat method, but it's not working.\n # We need to check instructor and see if we can create a PR.\n completion = await self._aclient.chat.completions.create(**kwargs) # type: ignore\n else:\n # completion = await self._aclient.chat(**kwargs) # type: ignore\n completion = await self._aclient.chat.complete_async(**kwargs) # type: ignore\n\n if structured_output:\n return prepare_output(\n [completion.model_dump_json()],\n **self._get_llm_statistics(completion._raw_response),\n )\n\n for choice in completion.choices:\n if (content := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"Received no response using MistralAI client (model: '{self.model}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(content)\n\n return prepare_output(generations, **self._get_llm_statistics(completion))\n\n @staticmethod\n def _get_llm_statistics(completion: \"ChatCompletionResponse\") -> \"LLMStatistics\":\n return {\n \"input_tokens\": [completion.usage.prompt_tokens],\n \"output_tokens\": [completion.usage.completion_tokens],\n }\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MistralLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MistralLLM.load","title":"load()
","text":"Loads the Mistral
client to benefit from async requests.
src/distilabel/models/llms/mistral.py
def load(self) -> None:\n \"\"\"Loads the `Mistral` client to benefit from async requests.\"\"\"\n super().load()\n\n try:\n from mistralai import Mistral\n except ImportError as ie:\n raise ImportError(\n \"MistralAI Python client is not installed. Please install it using\"\n \" `pip install mistralai`.\"\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n self._aclient = Mistral(\n api_key=self.api_key.get_secret_value(),\n endpoint=self.endpoint,\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout, # type: ignore\n max_concurrent_requests=self.max_concurrent_requests, # type: ignore\n )\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"mistral\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MistralLLM.agenerate","title":"agenerate(input, max_new_tokens=None, temperature=None, top_p=None)
async
","text":"Generates num_generations
responses for the given input using the MistralAI async client.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requiredmax_new_tokens
Optional[int]
the maximum number of new tokens that the model will generate. Defaults to None
.
None
temperature
Optional[float]
the temperature to use for the generation. Defaults to None
.
None
top_p
Optional[float]
the top-p value to use for the generation. Defaults to 1.0
.
None
Returns:
Type DescriptionGenerateOutput
A list of lists of strings containing the generated responses for each input.
Source code insrc/distilabel/models/llms/mistral.py
@validate_call\nasync def agenerate( # type: ignore\n self,\n input: FormattedInput,\n max_new_tokens: Optional[int] = None,\n temperature: Optional[float] = None,\n top_p: Optional[float] = None,\n) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for the given input using the MistralAI async\n client.\n\n Args:\n input: a single input in chat format to generate responses for.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output,\n client=self._aclient,\n framework=\"mistral\",\n )\n self._aclient = result.get(\"client\")\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"messages\": input, # type: ignore\n \"model\": self.model,\n \"max_tokens\": max_new_tokens,\n \"temperature\": temperature,\n \"top_p\": top_p,\n }\n generations = []\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output)\n # TODO:\u00a0This should work just with the _aclient.chat method, but it's not working.\n # We need to check instructor and see if we can create a PR.\n completion = await self._aclient.chat.completions.create(**kwargs) # type: ignore\n else:\n # completion = await self._aclient.chat(**kwargs) # type: ignore\n completion = await self._aclient.chat.complete_async(**kwargs) # type: ignore\n\n if structured_output:\n return prepare_output(\n [completion.model_dump_json()],\n **self._get_llm_statistics(completion._raw_response),\n )\n\n for choice in completion.choices:\n if (content := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"Received no response using MistralAI client (model: '{self.model}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(content)\n\n return prepare_output(generations, **self._get_llm_statistics(completion))\n
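As with the other async clients, agenerate is usually driven by the AsyncLLM machinery, but it can be awaited directly for a quick check. A minimal sketch, assuming a valid Mistral API key is already configured:
import asyncio\n\nfrom distilabel.models.llms import MistralLLM\n\nllm = MistralLLM(model=\"open-mixtral-8x22b\")\nllm.load()\n\n# Await the coroutine directly; a `Pipeline` would normally schedule it for you.\nresult = asyncio.run(\n    llm.agenerate(\n        input=[{\"role\": \"user\", \"content\": \"Hello world!\"}],\n        max_new_tokens=128,\n        temperature=0.7,\n    )\n)\nprint(result)\n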
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MixtureOfAgentsLLM","title":"MixtureOfAgentsLLM
","text":" Bases: AsyncLLM
Mixture-of-Agents
implementation.
An LLM
class that leverages LLM
s' collective strengths to generate a response, as described in the \"Mixture-of-Agents Enhances Large Language Model Capabilities\" paper. There is a list of LLM
s proposing/generating outputs that LLM
s from the next round/layer can use as auxiliary information. Finally, there is an LLM
that aggregates the outputs to generate the final response.
Attributes:
Name Type Descriptionaggregator_llm
LLM
The LLM
that aggregates the outputs of the proposer LLM
s.
proposers_llms
List[AsyncLLM]
The list of LLM
s that propose outputs to be aggregated.
rounds
int
The number of layers or rounds that the proposers_llms
will generate outputs. Defaults to 1
.
Examples:
Generate text:
from distilabel.models.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM\n\nllm = MixtureOfAgentsLLM(\n aggregator_llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n proposers_llms=[\n InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n InferenceEndpointsLLM(\n model_id=\"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO\",\n tokenizer_id=\"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO\",\n ),\n InferenceEndpointsLLM(\n model_id=\"HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1\",\n tokenizer_id=\"HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1\",\n ),\n ],\n rounds=2,\n)\n\nllm.load()\n\noutput = llm.generate_outputs(\n inputs=[\n [\n {\n \"role\": \"user\",\n \"content\": \"My favorite witty review of The Rings of Power series is this: Input:\",\n }\n ]\n ]\n)\n
Source code in src/distilabel/models/llms/moa.py
class MixtureOfAgentsLLM(AsyncLLM):\n \"\"\"`Mixture-of-Agents` implementation.\n\n An `LLM` class that leverages `LLM`s collective strenghts to generate a response,\n as described in the \"Mixture-of-Agents Enhances Large Language model Capabilities\"\n paper. There is a list of `LLM`s proposing/generating outputs that `LLM`s from the next\n round/layer can use as auxiliary information. Finally, there is an `LLM` that aggregates\n the outputs to generate the final response.\n\n Attributes:\n aggregator_llm: The `LLM` that aggregates the outputs of the proposer `LLM`s.\n proposers_llms: The list of `LLM`s that propose outputs to be aggregated.\n rounds: The number of layers or rounds that the `proposers_llms` will generate\n outputs. Defaults to `1`.\n\n References:\n - [Mixture-of-Agents Enhances Large Language Model Capabilities](https://arxiv.org/abs/2406.04692)\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM\n\n llm = MixtureOfAgentsLLM(\n aggregator_llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n proposers_llms=[\n InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n InferenceEndpointsLLM(\n model_id=\"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO\",\n tokenizer_id=\"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO\",\n ),\n InferenceEndpointsLLM(\n model_id=\"HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1\",\n tokenizer_id=\"HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1\",\n ),\n ],\n rounds=2,\n )\n\n llm.load()\n\n output = llm.generate_outputs(\n inputs=[\n [\n {\n \"role\": \"user\",\n \"content\": \"My favorite witty review of The Rings of Power series is this: Input:\",\n }\n ]\n ]\n )\n ```\n \"\"\"\n\n aggregator_llm: LLM\n proposers_llms: List[AsyncLLM] = Field(default_factory=list)\n rounds: int = 1\n\n @property\n def runtime_parameters_names(self) -> \"RuntimeParametersNames\":\n \"\"\"Returns the runtime parameters of the `LLM`, which are a combination of the\n `RuntimeParameter`s of the `LLM`, the `aggregator_llm` and the `proposers_llms`.\n\n Returns:\n The runtime parameters of the `LLM`.\n \"\"\"\n runtime_parameters_names = super().runtime_parameters_names\n del runtime_parameters_names[\"generation_kwargs\"]\n return runtime_parameters_names\n\n def load(self) -> None:\n \"\"\"Loads all the `LLM`s in the `MixtureOfAgents`.\"\"\"\n super().load()\n\n for llm in self.proposers_llms:\n self._logger.debug(f\"Loading proposer LLM in MoA: {llm}\") # type: ignore\n llm.load()\n\n self._logger.debug(f\"Loading aggregator LLM in MoA: {self.aggregator_llm}\") # type: ignore\n self.aggregator_llm.load()\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the aggregated model name.\"\"\"\n return f\"moa-{self.aggregator_llm.model_name}-{'-'.join([llm.model_name for llm in self.proposers_llms])}\"\n\n def get_generation_kwargs(self) -> Dict[str, Any]:\n \"\"\"Returns the generation kwargs of the `MixtureOfAgents` as a dictionary.\n\n Returns:\n The generation kwargs of the `MixtureOfAgents`.\n \"\"\"\n return {\n \"aggregator_llm\": self.aggregator_llm.get_generation_kwargs(),\n \"proposers_llms\": [\n llm.get_generation_kwargs() for llm in self.proposers_llms\n ],\n }\n\n # `abstractmethod`, had to be implemented but not used\n async def agenerate(\n self, input: \"FormattedInput\", num_generations: int = 1, **kwargs: Any\n ) -> 
List[Union[str, None]]:\n raise NotImplementedError(\n \"`agenerate` method is not implemented for `MixtureOfAgents`\"\n )\n\n def _build_moa_system_prompt(self, prev_outputs: List[str]) -> str:\n \"\"\"Builds the Mixture-of-Agents system prompt.\n\n Args:\n prev_outputs: The list of previous outputs to use as references.\n\n Returns:\n The Mixture-of-Agents system prompt.\n \"\"\"\n moa_system_prompt = MOA_SYSTEM_PROMPT\n for i, prev_output in enumerate(prev_outputs):\n if prev_output is not None:\n moa_system_prompt += f\"\\n{i + 1}. {prev_output}\"\n return moa_system_prompt\n\n def _inject_moa_system_prompt(\n self, input: \"StandardInput\", prev_outputs: List[str]\n ) -> \"StandardInput\":\n \"\"\"Injects the Mixture-of-Agents system prompt into the input.\n\n Args:\n input: The input to inject the system prompt into.\n prev_outputs: The list of previous outputs to use as references.\n\n Returns:\n The input with the Mixture-of-Agents system prompt injected.\n \"\"\"\n if len(prev_outputs) == 0:\n return input\n\n moa_system_prompt = self._build_moa_system_prompt(prev_outputs)\n\n system = next((item for item in input if item[\"role\"] == \"system\"), None)\n if system:\n original_system_prompt = system[\"content\"]\n system[\"content\"] = f\"{moa_system_prompt}\\n\\n{original_system_prompt}\"\n else:\n input.insert(0, {\"role\": \"system\", \"content\": moa_system_prompt})\n\n return input\n\n async def _agenerate(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n ) -> List[\"GenerateOutput\"]:\n \"\"\"Internal function to concurrently generate responses for a list of inputs.\n\n Args:\n inputs: the list of inputs to generate responses for.\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n\n Returns:\n A list containing the generations for each input.\n \"\"\"\n aggregator_llm_kwargs: Dict[str, Any] = kwargs.get(\"aggregator_llm\", {})\n proposers_llms_kwargs: List[Dict[str, Any]] = kwargs.get(\n \"proposers_llms\", [{}] * len(self.proposers_llms)\n )\n\n prev_outputs = []\n for round in range(self.rounds):\n self._logger.debug(f\"Generating round {round + 1}/{self.rounds} in MoA\") # type: ignore\n\n # Generate `num_generations` with each proposer LLM for each input\n tasks = [\n asyncio.create_task(\n llm._agenerate(\n inputs=[\n self._inject_moa_system_prompt(\n cast(\"StandardInput\", input), prev_input_outputs\n )\n for input, prev_input_outputs in itertools.zip_longest(\n inputs, prev_outputs, fillvalue=[]\n )\n ],\n num_generations=1,\n **generation_kwargs,\n )\n )\n for llm, generation_kwargs in zip(\n self.proposers_llms, proposers_llms_kwargs\n )\n ]\n\n # Group generations per input\n outputs: List[List[\"GenerateOutput\"]] = await asyncio.gather(*tasks)\n prev_outputs = [\n list(itertools.chain(*input_outputs)) for input_outputs in zip(*outputs)\n ]\n\n self._logger.debug(\"Aggregating outputs in MoA\") # type: ignore\n if isinstance(self.aggregator_llm, AsyncLLM):\n return await self.aggregator_llm._agenerate(\n inputs=[\n self._inject_moa_system_prompt(\n cast(\"StandardInput\", input), prev_input_outputs\n )\n for input, prev_input_outputs in zip(inputs, prev_outputs)\n ],\n num_generations=num_generations,\n **aggregator_llm_kwargs,\n )\n\n return self.aggregator_llm.generate(\n inputs=[\n self._inject_moa_system_prompt(\n cast(\"StandardInput\", input), prev_input_outputs\n )\n for input, prev_input_outputs in zip(inputs, prev_outputs)\n ],\n 
num_generations=num_generations,\n **aggregator_llm_kwargs,\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MixtureOfAgentsLLM.runtime_parameters_names","title":"runtime_parameters_names
property
","text":"Returns the runtime parameters of the LLM
, which are a combination of the RuntimeParameter
s of the LLM
, the aggregator_llm
and the proposers_llms
.
Returns:
Type DescriptionRuntimeParametersNames
The runtime parameters of the LLM
.
model_name
property
","text":"Returns the aggregated model name.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MixtureOfAgentsLLM.load","title":"load()
","text":"Loads all the LLM
s in the MixtureOfAgents
.
src/distilabel/models/llms/moa.py
def load(self) -> None:\n \"\"\"Loads all the `LLM`s in the `MixtureOfAgents`.\"\"\"\n super().load()\n\n for llm in self.proposers_llms:\n self._logger.debug(f\"Loading proposer LLM in MoA: {llm}\") # type: ignore\n llm.load()\n\n self._logger.debug(f\"Loading aggregator LLM in MoA: {self.aggregator_llm}\") # type: ignore\n self.aggregator_llm.load()\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MixtureOfAgentsLLM.get_generation_kwargs","title":"get_generation_kwargs()
","text":"Returns the generation kwargs of the MixtureOfAgents
as a dictionary.
Returns:
Type DescriptionDict[str, Any]
The generation kwargs of the MixtureOfAgents
.
src/distilabel/models/llms/moa.py
def get_generation_kwargs(self) -> Dict[str, Any]:\n \"\"\"Returns the generation kwargs of the `MixtureOfAgents` as a dictionary.\n\n Returns:\n The generation kwargs of the `MixtureOfAgents`.\n \"\"\"\n return {\n \"aggregator_llm\": self.aggregator_llm.get_generation_kwargs(),\n \"proposers_llms\": [\n llm.get_generation_kwargs() for llm in self.proposers_llms\n ],\n }\n
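The nested dictionary returned here mirrors the shape of the generation kwargs that can be passed when calling generate on a MixtureOfAgentsLLM: one dictionary for the aggregator plus one per proposer. A hedged sketch, assuming llm is the loaded instance from the example above:
# Assumes `llm` is the loaded `MixtureOfAgentsLLM` from the example above, with\n# three proposer LLMs.\noutput = llm.generate(\n    inputs=[[{\"role\": \"user\", \"content\": \"Summarize the Mixture-of-Agents idea.\"}]],\n    aggregator_llm={\"max_new_tokens\": 512},\n    proposers_llms=[\n        {\"max_new_tokens\": 256},\n        {\"max_new_tokens\": 256},\n        {\"max_new_tokens\": 256},\n    ],\n)\n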
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MixtureOfAgentsLLM._build_moa_system_prompt","title":"_build_moa_system_prompt(prev_outputs)
","text":"Builds the Mixture-of-Agents system prompt.
Parameters:
Name Type Description Defaultprev_outputs
List[str]
The list of previous outputs to use as references.
requiredReturns:
Type Descriptionstr
The Mixture-of-Agents system prompt.
Source code insrc/distilabel/models/llms/moa.py
def _build_moa_system_prompt(self, prev_outputs: List[str]) -> str:\n \"\"\"Builds the Mixture-of-Agents system prompt.\n\n Args:\n prev_outputs: The list of previous outputs to use as references.\n\n Returns:\n The Mixture-of-Agents system prompt.\n \"\"\"\n moa_system_prompt = MOA_SYSTEM_PROMPT\n for i, prev_output in enumerate(prev_outputs):\n if prev_output is not None:\n moa_system_prompt += f\"\\n{i + 1}. {prev_output}\"\n return moa_system_prompt\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MixtureOfAgentsLLM._inject_moa_system_prompt","title":"_inject_moa_system_prompt(input, prev_outputs)
","text":"Injects the Mixture-of-Agents system prompt into the input.
Parameters:
Name Type Description Defaultinput
StandardInput
The input to inject the system prompt into.
requiredprev_outputs
List[str]
The list of previous outputs to use as references.
requiredReturns:
Type DescriptionStandardInput
The input with the Mixture-of-Agents system prompt injected.
Source code insrc/distilabel/models/llms/moa.py
def _inject_moa_system_prompt(\n self, input: \"StandardInput\", prev_outputs: List[str]\n) -> \"StandardInput\":\n \"\"\"Injects the Mixture-of-Agents system prompt into the input.\n\n Args:\n input: The input to inject the system prompt into.\n prev_outputs: The list of previous outputs to use as references.\n\n Returns:\n The input with the Mixture-of-Agents system prompt injected.\n \"\"\"\n if len(prev_outputs) == 0:\n return input\n\n moa_system_prompt = self._build_moa_system_prompt(prev_outputs)\n\n system = next((item for item in input if item[\"role\"] == \"system\"), None)\n if system:\n original_system_prompt = system[\"content\"]\n system[\"content\"] = f\"{moa_system_prompt}\\n\\n{original_system_prompt}\"\n else:\n input.insert(0, {\"role\": \"system\", \"content\": moa_system_prompt})\n\n return input\n
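Taken together, _build_moa_system_prompt and _inject_moa_system_prompt turn the previous round's generations into a numbered list inside the system message. A standalone sketch of that behaviour, using a placeholder base prompt instead of the internal MOA_SYSTEM_PROMPT constant:
from typing import List\n\n# Placeholder for the internal `MOA_SYSTEM_PROMPT` constant.\nBASE_PROMPT = \"You have been provided with responses from various models. Synthesize them into a single answer.\"\n\ndef build_moa_system_prompt(prev_outputs: List[str]) -> str:\n    # Append each previous output as a numbered reference.\n    prompt = BASE_PROMPT\n    for i, prev_output in enumerate(prev_outputs):\n        if prev_output is not None:\n            prompt += f\"\\n{i + 1}. {prev_output}\"\n    return prompt\n\nmessages = [{\"role\": \"user\", \"content\": \"What is synthetic data?\"}]\n# No system message yet, so the Mixture-of-Agents prompt is prepended.\nmessages.insert(0, {\"role\": \"system\", \"content\": build_moa_system_prompt([\"Answer A\", \"Answer B\"])})\nprint(messages)\n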
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.MixtureOfAgentsLLM._agenerate","title":"_agenerate(inputs, num_generations=1, **kwargs)
async
","text":"Internal function to concurrently generate responses for a list of inputs.
Parameters:
Name Type Description Defaultinputs
List[FormattedInput]
the list of inputs to generate responses for.
requirednum_generations
int
the number of generations to generate per input.
1
**kwargs
Any
the additional kwargs to be used for the generation.
{}
Returns:
Type DescriptionList[GenerateOutput]
A list containing the generations for each input.
Source code insrc/distilabel/models/llms/moa.py
async def _agenerate(\n self,\n inputs: List[\"FormattedInput\"],\n num_generations: int = 1,\n **kwargs: Any,\n) -> List[\"GenerateOutput\"]:\n \"\"\"Internal function to concurrently generate responses for a list of inputs.\n\n Args:\n inputs: the list of inputs to generate responses for.\n num_generations: the number of generations to generate per input.\n **kwargs: the additional kwargs to be used for the generation.\n\n Returns:\n A list containing the generations for each input.\n \"\"\"\n aggregator_llm_kwargs: Dict[str, Any] = kwargs.get(\"aggregator_llm\", {})\n proposers_llms_kwargs: List[Dict[str, Any]] = kwargs.get(\n \"proposers_llms\", [{}] * len(self.proposers_llms)\n )\n\n prev_outputs = []\n for round in range(self.rounds):\n self._logger.debug(f\"Generating round {round + 1}/{self.rounds} in MoA\") # type: ignore\n\n # Generate `num_generations` with each proposer LLM for each input\n tasks = [\n asyncio.create_task(\n llm._agenerate(\n inputs=[\n self._inject_moa_system_prompt(\n cast(\"StandardInput\", input), prev_input_outputs\n )\n for input, prev_input_outputs in itertools.zip_longest(\n inputs, prev_outputs, fillvalue=[]\n )\n ],\n num_generations=1,\n **generation_kwargs,\n )\n )\n for llm, generation_kwargs in zip(\n self.proposers_llms, proposers_llms_kwargs\n )\n ]\n\n # Group generations per input\n outputs: List[List[\"GenerateOutput\"]] = await asyncio.gather(*tasks)\n prev_outputs = [\n list(itertools.chain(*input_outputs)) for input_outputs in zip(*outputs)\n ]\n\n self._logger.debug(\"Aggregating outputs in MoA\") # type: ignore\n if isinstance(self.aggregator_llm, AsyncLLM):\n return await self.aggregator_llm._agenerate(\n inputs=[\n self._inject_moa_system_prompt(\n cast(\"StandardInput\", input), prev_input_outputs\n )\n for input, prev_input_outputs in zip(inputs, prev_outputs)\n ],\n num_generations=num_generations,\n **aggregator_llm_kwargs,\n )\n\n return self.aggregator_llm.generate(\n inputs=[\n self._inject_moa_system_prompt(\n cast(\"StandardInput\", input), prev_input_outputs\n )\n for input, prev_input_outputs in zip(inputs, prev_outputs)\n ],\n num_generations=num_generations,\n **aggregator_llm_kwargs,\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OllamaLLM","title":"OllamaLLM
","text":" Bases: AsyncLLM
, MagpieChatTemplateMixin
Ollama LLM implementation running the Async API client.
Attributes:
Name Type Descriptionmodel
str
the model name to use for the LLM e.g. \"notus\".
host
Optional[RuntimeParameter[str]]
the Ollama server host.
timeout
RuntimeParameter[int]
the timeout for the LLM. Defaults to 120
.
follow_redirects
bool
whether to follow redirects. Defaults to True
.
structured_output
Optional[RuntimeParameter[InstructorStructuredOutputType]]
a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
tokenizer_id
Optional[RuntimeParameter[str]]
the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer config files. If not provided, the one associated to the model
will be used. Defaults to None
.
use_magpie_template
bool
a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template
Union[MagpieAvailablePreQueryTemplates, str, None]
the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow-up user message. Valid values are \"llama3\", \"qwen2\" or a custom pre-query template provided as a string. Defaults to None
.
_aclient
Optional[AsyncClient]
the AsyncClient
to use for the Ollama API. It is meant to be used internally. Set in the load
method.
host
: the Ollama server host.timeout
: the client timeout for the Ollama API. Defaults to 120
.Examples:
Generate text:
from distilabel.models.llms import OllamaLLM\n\nllm = OllamaLLM(model=\"llama3\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Source code in src/distilabel/models/llms/ollama.py
class OllamaLLM(AsyncLLM, MagpieChatTemplateMixin):\n \"\"\"Ollama LLM implementation running the Async API client.\n\n Attributes:\n model: the model name to use for the LLM e.g. \"notus\".\n host: the Ollama server host.\n timeout: the timeout for the LLM. Defaults to `120`.\n follow_redirects: whether to follow redirects. Defaults to `True`.\n structured_output: a dictionary containing the structured output configuration or if more\n fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.\n tokenizer_id: the tokenizer Hugging Face Hub repo id or a path to a directory containing\n the tokenizer config files. If not provided, the one associated to the `model`\n will be used. Defaults to `None`.\n use_magpie_template: a flag used to enable/disable applying the Magpie pre-query\n template. Defaults to `False`.\n magpie_pre_query_template: the pre-query template to be applied to the prompt or\n sent to the LLM to generate an instruction or a follow up user message. Valid\n values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults\n to `None`.\n _aclient: the `AsyncClient` to use for the Ollama API. It is meant to be used internally.\n Set in the `load` method.\n\n Runtime parameters:\n - `host`: the Ollama server host.\n - `timeout`: the client timeout for the Ollama API. Defaults to `120`.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import OllamaLLM\n\n llm = OllamaLLM(model=\"llama3\")\n\n llm.load()\n\n # Call the model\n output = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n \"\"\"\n\n model: str\n host: Optional[RuntimeParameter[str]] = Field(\n default=None, description=\"The host of the Ollama API.\"\n )\n timeout: RuntimeParameter[int] = Field(\n default=120, description=\"The timeout for the Ollama API.\"\n )\n follow_redirects: bool = True\n structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (\n Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n )\n tokenizer_id: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The Hugging Face Hub repo id or a path to a directory containing\"\n \" the tokenizer config files. If not provided, the one associated to the `model`\"\n \" will be used.\",\n )\n _num_generations_param_supported = False\n _aclient: Optional[\"AsyncClient\"] = PrivateAttr(...) # type: ignore\n\n @model_validator(mode=\"after\") # type: ignore\n def validate_magpie_usage(\n self,\n ) -> \"OllamaLLM\":\n \"\"\"Validates that magpie usage is valid.\"\"\"\n\n if self.use_magpie_template and self.tokenizer_id is None:\n raise ValueError(\n \"`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,\"\n \" set a `tokenizer_id` and try again.\"\n )\n\n def load(self) -> None:\n \"\"\"Loads the `AsyncClient` to use Ollama async API.\"\"\"\n super().load()\n\n try:\n from ollama import AsyncClient\n\n self._aclient = AsyncClient(\n host=self.host,\n timeout=self.timeout,\n follow_redirects=self.follow_redirects,\n )\n except ImportError as e:\n raise ImportError(\n \"Ollama Python client is not installed. Please install it using\"\n \" `pip install ollama`.\"\n ) from e\n\n if self.tokenizer_id:\n try:\n from transformers import AutoTokenizer\n except ImportError as ie:\n raise ImportError(\n \"Transformers is not installed. 
Please install it using `pip install 'distilabel[hf-transformers]'`.\"\n ) from ie\n self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_id)\n if self._tokenizer.chat_template is None:\n raise ValueError(\n \"The tokenizer does not have a chat template. Please use a tokenizer with a chat template.\"\n )\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n async def _generate_chat_completion(\n self,\n input: \"StandardInput\",\n format: Literal[\"\", \"json\"] = \"\",\n options: Union[Options, None] = None,\n keep_alive: Union[bool, None] = None,\n ) -> \"ChatResponse\":\n return await self._aclient.chat(\n model=self.model,\n messages=input,\n stream=False,\n format=format,\n options=options,\n keep_alive=keep_alive,\n )\n\n def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n prompt: str = (\n self._tokenizer.apply_chat_template(\n conversation=input,\n tokenize=False,\n add_generation_prompt=True,\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n\n async def _generate_with_text_generation(\n self,\n input: \"StandardInput\",\n format: Literal[\"\", \"json\"] = None,\n options: Union[Options, None] = None,\n keep_alive: Union[bool, None] = None,\n ) -> \"GenerateResponse\":\n input = self.prepare_input(input)\n return await self._aclient.generate(\n model=self.model,\n prompt=input,\n format=format,\n options=options,\n keep_alive=keep_alive,\n raw=True,\n )\n\n @validate_call\n async def agenerate(\n self,\n input: StandardInput,\n format: Literal[\"\", \"json\"] = \"\",\n # TODO: include relevant options from `Options` in `agenerate` method.\n options: Union[Options, None] = None,\n keep_alive: Union[bool, None] = None,\n ) -> GenerateOutput:\n \"\"\"\n Generates a response asynchronously, using the [Ollama Async API definition](https://github.com/ollama/ollama-python).\n\n Args:\n input: the input to use for the generation.\n format: the format to use for the generation. Defaults to `\"\"`.\n options: the options to use for the generation. Defaults to `None`.\n keep_alive: whether to keep the connection alive. Defaults to `None`.\n\n Returns:\n A list of strings as completion for the given input.\n \"\"\"\n text = None\n try:\n if not format:\n format = None\n if self.tokenizer_id is None:\n completion = await self._generate_chat_completion(\n input, format, options, keep_alive\n )\n text = completion[\"message\"][\"content\"]\n else:\n completion = await self._generate_with_text_generation(\n input, format, options, keep_alive\n )\n text = completion.response\n except Exception as e:\n self._logger.warning( # type: ignore\n f\"\u26a0\ufe0f Received no response using Ollama client (model: '{self.model_name}').\"\n f\" Finish reason was: {e}\"\n )\n\n return prepare_output([text], **self._get_llm_statistics(completion))\n\n @staticmethod\n def _get_llm_statistics(completion: Dict[str, Any]) -> \"LLMStatistics\":\n return {\n \"input_tokens\": [completion[\"prompt_eval_count\"]],\n \"output_tokens\": [completion[\"eval_count\"]],\n }\n
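As a sketch of the host and timeout runtime parameters listed above (the host URL below is just the usual local Ollama address, used here for illustration):
from distilabel.models.llms import OllamaLLM\n\n# Point `host` at your own Ollama server; 11434 is the usual default port\nllm = OllamaLLM(model=\"llama3\", host=\"http://localhost:11434\", timeout=300)\n\nllm.load()\n\noutput = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n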
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OllamaLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OllamaLLM.validate_magpie_usage","title":"validate_magpie_usage()
","text":"Validates that magpie usage is valid.
Source code in src/distilabel/models/llms/ollama.py
@model_validator(mode=\"after\") # type: ignore\ndef validate_magpie_usage(\n self,\n) -> \"OllamaLLM\":\n \"\"\"Validates that magpie usage is valid.\"\"\"\n\n if self.use_magpie_template and self.tokenizer_id is None:\n raise ValueError(\n \"`use_magpie_template` cannot be `True` if `tokenizer_id` is `None`. Please,\"\n \" set a `tokenizer_id` and try again.\"\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OllamaLLM.load","title":"load()
","text":"Loads the AsyncClient
to use Ollama async API.
Source code in src/distilabel/models/llms/ollama.py
def load(self) -> None:\n \"\"\"Loads the `AsyncClient` to use Ollama async API.\"\"\"\n super().load()\n\n try:\n from ollama import AsyncClient\n\n self._aclient = AsyncClient(\n host=self.host,\n timeout=self.timeout,\n follow_redirects=self.follow_redirects,\n )\n except ImportError as e:\n raise ImportError(\n \"Ollama Python client is not installed. Please install it using\"\n \" `pip install ollama`.\"\n ) from e\n\n if self.tokenizer_id:\n try:\n from transformers import AutoTokenizer\n except ImportError as ie:\n raise ImportError(\n \"Transformers is not installed. Please install it using `pip install 'distilabel[hf-transformers]'`.\"\n ) from ie\n self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_id)\n if self._tokenizer.chat_template is None:\n raise ValueError(\n \"The tokenizer does not have a chat template. Please use a tokenizer with a chat template.\"\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OllamaLLM.prepare_input","title":"prepare_input(input)
","text":"Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name Type Description Defaultinput
StandardInput
the input list containing chat items.
requiredReturns:
Type Descriptionstr
The prompt to send to the LLM.
Source code in src/distilabel/models/llms/ollama.py
def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n prompt: str = (\n self._tokenizer.apply_chat_template(\n conversation=input,\n tokenize=False,\n add_generation_prompt=True,\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OllamaLLM.agenerate","title":"agenerate(input, format='', options=None, keep_alive=None)
async
","text":"Generates a response asynchronously, using the Ollama Async API definition.
Parameters:
Name Type Description Defaultinput
StandardInput
the input to use for the generation.
requiredformat
Literal['', 'json']
the format to use for the generation. Defaults to \"\"
.
''
options
Union[Options, None]
the options to use for the generation. Defaults to None
.
None
keep_alive
Union[bool, None]
whether to keep the connection alive. Defaults to None
.
None
Returns:
Type DescriptionGenerateOutput
A list of strings as completion for the given input.
Source code in src/distilabel/models/llms/ollama.py
@validate_call\nasync def agenerate(\n self,\n input: StandardInput,\n format: Literal[\"\", \"json\"] = \"\",\n # TODO: include relevant options from `Options` in `agenerate` method.\n options: Union[Options, None] = None,\n keep_alive: Union[bool, None] = None,\n) -> GenerateOutput:\n \"\"\"\n Generates a response asynchronously, using the [Ollama Async API definition](https://github.com/ollama/ollama-python).\n\n Args:\n input: the input to use for the generation.\n format: the format to use for the generation. Defaults to `\"\"`.\n options: the options to use for the generation. Defaults to `None`.\n keep_alive: whether to keep the connection alive. Defaults to `None`.\n\n Returns:\n A list of strings as completion for the given input.\n \"\"\"\n text = None\n try:\n if not format:\n format = None\n if self.tokenizer_id is None:\n completion = await self._generate_chat_completion(\n input, format, options, keep_alive\n )\n text = completion[\"message\"][\"content\"]\n else:\n completion = await self._generate_with_text_generation(\n input, format, options, keep_alive\n )\n text = completion.response\n except Exception as e:\n self._logger.warning( # type: ignore\n f\"\u26a0\ufe0f Received no response using Ollama client (model: '{self.model_name}').\"\n f\" Finish reason was: {e}\"\n )\n\n return prepare_output([text], **self._get_llm_statistics(completion))\n
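Since agenerate is a coroutine, a minimal sketch of awaiting it directly (inside a pipeline the framework would normally drive it through generate) could look like this, here requesting a JSON-formatted completion:
import asyncio\n\nfrom distilabel.models.llms import OllamaLLM\n\nllm = OllamaLLM(model=\"llama3\")\nllm.load()\n\n# Ask the async client for a JSON-formatted completion\nresult = asyncio.run(\n    llm.agenerate(\n        input=[{\"role\": \"user\", \"content\": \"Return a JSON object with a 'greeting' key.\"}],\n        format=\"json\",\n    )\n)\n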
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM","title":"OpenAILLM
","text":" Bases: AsyncLLM
OpenAI LLM implementation running the async API client.
Attributes:
Name Type Descriptionmodel
str
the model name to use for the LLM e.g. \"gpt-3.5-turbo\", \"gpt-4\", etc. Supported models can be found here.
base_url
Optional[RuntimeParameter[str]]
the base URL to use for the OpenAI API requests. Defaults to None
, which means that the value set for the environment variable OPENAI_BASE_URL
will be used, or \"https://api.openai.com/v1\" if not set.
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the OpenAI API. Defaults to None
which means that the value set for the environment variable OPENAI_API_KEY
will be used, or None
if not set.
max_retries
RuntimeParameter[int]
the maximum number of times to retry the request to the API before failing. Defaults to 6
.
timeout
RuntimeParameter[int]
the maximum time in seconds to wait for a response from the API. Defaults to 120
.
structured_output
Optional[RuntimeParameter[InstructorStructuredOutputType]]
a dictionary containing the structured output configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
base_url
: the base URL to use for the OpenAI API requests. Defaults to None
.api_key
: the API key to authenticate the requests to the OpenAI API. Defaults to None
.max_retries
: the maximum number of times to retry the request to the API before failing. Defaults to 6
.timeout
: the maximum time in seconds to wait for a response from the API. Defaults to 120
.:simple-openai:
Examples:
Generate text:
from distilabel.models.llms import OpenAILLM\n\nllm = OpenAILLM(model=\"gpt-4-turbo\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate text from a custom endpoint following the OpenAI API:
from distilabel.models.llms import OpenAILLM\n\nllm = OpenAILLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n base_url=r\"http://localhost:8080/v1\"\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pydantic import BaseModel\nfrom distilabel.models.llms import OpenAILLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = OpenAILLM(\n model=\"gpt-4-turbo\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
Generate with Batch API (offline batch generation):
from distilabel.models.llms import OpenAILLM\n\nllm = OpenAILLM(\n    model=\"gpt-3.5-turbo\",\n    use_offline_batch_generation=True,\n    offline_batch_generation_block_until_done=5,  # poll for results every 5 seconds\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n# [['Hello! How can I assist you today?']]\n
Source code in src/distilabel/models/llms/openai.py
class OpenAILLM(AsyncLLM):\n \"\"\"OpenAI LLM implementation running the async API client.\n\n Attributes:\n model: the model name to use for the LLM e.g. \"gpt-3.5-turbo\", \"gpt-4\", etc.\n Supported models can be found [here](https://platform.openai.com/docs/guides/text-generation).\n base_url: the base URL to use for the OpenAI API requests. Defaults to `None`, which\n means that the value set for the environment variable `OPENAI_BASE_URL` will\n be used, or \"https://api.openai.com/v1\" if not set.\n api_key: the API key to authenticate the requests to the OpenAI API. Defaults to\n `None` which means that the value set for the environment variable `OPENAI_API_KEY`\n will be used, or `None` if not set.\n max_retries: the maximum number of times to retry the request to the API before\n failing. Defaults to `6`.\n timeout: the maximum time in seconds to wait for a response from the API. Defaults\n to `120`.\n structured_output: a dictionary containing the structured output configuration configuration\n using `instructor`. You can take a look at the dictionary structure in\n `InstructorStructuredOutputType` from `distilabel.steps.tasks.structured_outputs.instructor`.\n\n Runtime parameters:\n - `base_url`: the base URL to use for the OpenAI API requests. Defaults to `None`.\n - `api_key`: the API key to authenticate the requests to the OpenAI API. Defaults\n to `None`.\n - `max_retries`: the maximum number of times to retry the request to the API before\n failing. Defaults to `6`.\n - `timeout`: the maximum time in seconds to wait for a response from the API. Defaults\n to `120`.\n\n Icon:\n `:simple-openai:`\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import OpenAILLM\n\n llm = OpenAILLM(model=\"gpt-4-turbo\", api_key=\"api.key\")\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Generate text from a custom endpoint following the OpenAI API:\n\n ```python\n from distilabel.models.llms import OpenAILLM\n\n llm = OpenAILLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n base_url=r\"http://localhost:8080/v1\"\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Generate structured data:\n\n ```python\n from pydantic import BaseModel\n from distilabel.models.llms import OpenAILLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = OpenAILLM(\n model=\"gpt-4-turbo\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n\n Generate with Batch API (offline batch generation):\n\n ```python\n from distilabel.models.llms import OpenAILLM\n\n load = llm = OpenAILLM(\n model=\"gpt-3.5-turbo\",\n use_offline_batch_generation=True,\n offline_batch_generation_block_until_done=5, # poll for results every 5 seconds\n )\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n # [['Hello! 
How can I assist you today?']]\n ```\n \"\"\"\n\n model: str\n base_url: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(\n \"OPENAI_BASE_URL\", \"https://api.openai.com/v1\"\n ),\n description=\"The base URL to use for the OpenAI API requests.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_OPENAI_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the OpenAI API.\",\n )\n max_retries: RuntimeParameter[int] = Field(\n default=6,\n description=\"The maximum number of times to retry the request to the API before\"\n \" failing.\",\n )\n timeout: RuntimeParameter[int] = Field(\n default=120,\n description=\"The maximum time in seconds to wait for a response from the API.\",\n )\n structured_output: Optional[RuntimeParameter[InstructorStructuredOutputType]] = (\n Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n )\n\n _api_key_env_var: str = PrivateAttr(_OPENAI_API_KEY_ENV_VAR_NAME)\n _client: \"OpenAI\" = PrivateAttr(None)\n _aclient: \"AsyncOpenAI\" = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Loads the `AsyncOpenAI` client to benefit from async requests.\"\"\"\n super().load()\n\n try:\n from openai import AsyncOpenAI, OpenAI\n except ImportError as ie:\n raise ImportError(\n \"OpenAI Python client is not installed. Please install it using\"\n \" `pip install openai`.\"\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n self._client = OpenAI(\n base_url=self.base_url,\n api_key=self.api_key.get_secret_value(),\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n self._aclient = AsyncOpenAI(\n base_url=self.base_url,\n api_key=self.api_key.get_secret_value(),\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"openai\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n\n def unload(self) -> None:\n \"\"\"Set clients to `None` as they both contain `thread._RLock` which cannot be pickled\n in case an exception is raised and has to be handled in the main process\"\"\"\n\n self._client = None # type: ignore\n self._aclient = None # type: ignore\n self.structured_output = None\n super().unload()\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n @validate_call\n async def agenerate( # type: ignore\n self,\n input: FormattedInput,\n num_generations: int = 1,\n max_new_tokens: int = 128,\n logprobs: bool = False,\n top_logprobs: Optional[PositiveInt] = None,\n frequency_penalty: float = 0.0,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n stop: Optional[Union[str, List[str]]] = None,\n response_format: Optional[Dict[str, str]] = None,\n ) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for the given input using the OpenAI async\n client.\n\n Args:\n input: a single input in chat format to generate responses for.\n num_generations: the number of generations to create per 
input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n logprobs: whether to return the log probabilities or not. Defaults to `False`.\n top_logprobs: the number of top log probabilities to return per output token\n generated. Defaults to `None`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n stop: a string or a list of strings to use as a stop sequence for the generation.\n Defaults to `None`.\n response_format: the format of the response to return. Must be one of\n \"text\" or \"json\". Read the documentation [here](https://platform.openai.com/docs/guides/text-generation/json-mode)\n for more information on how to use the JSON model from OpenAI. Defaults to None\n which returns text. To return JSON, use {\"type\": \"json_object\"}.\n\n Note:\n If response_format\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output, # type: ignore\n client=self._aclient,\n framework=\"openai\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"messages\": input, # type: ignore\n \"model\": self.model,\n \"logprobs\": logprobs,\n \"top_logprobs\": top_logprobs,\n \"max_tokens\": max_new_tokens,\n \"n\": num_generations,\n \"frequency_penalty\": frequency_penalty,\n \"presence_penalty\": presence_penalty,\n \"temperature\": temperature,\n \"top_p\": top_p,\n \"stop\": stop,\n }\n # Check if it's a vision generation task, in that case \"stop\" cannot be used or raises\n # an error in the API.\n if isinstance(\n [row for row in input if row[\"role\"] == \"user\"][0][\"content\"], list\n ):\n kwargs.pop(\"stop\")\n\n if response_format is not None:\n kwargs[\"response_format\"] = response_format\n\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output) # type: ignore\n\n completion = await self._aclient.chat.completions.create(**kwargs) # type: ignore\n\n if structured_output:\n # NOTE: `instructor` doesn't work with `n` parameter, so it will always return\n # only 1 choice.\n statistics = self._get_llm_statistics(completion._raw_response)\n if choice_logprobs := self._get_logprobs_from_choice(\n completion._raw_response.choices[0]\n ):\n output_logprobs = [choice_logprobs]\n else:\n output_logprobs = None\n return prepare_output(\n generations=[completion.model_dump_json()],\n input_tokens=statistics[\"input_tokens\"],\n output_tokens=statistics[\"output_tokens\"],\n logprobs=output_logprobs,\n )\n\n return self._generations_from_openai_completion(completion)\n\n def _generations_from_openai_completion(\n self, completion: \"OpenAIChatCompletion\"\n ) -> \"GenerateOutput\":\n \"\"\"Get the generations from the OpenAI Chat Completion object.\n\n Args:\n completion: the completion object to get the generations from.\n\n Returns:\n A list of strings containing the generated responses for the input.\n \"\"\"\n generations = []\n logprobs = []\n for choice in 
completion.choices:\n if (content := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"Received no response using OpenAI client (model: '{self.model}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(content)\n if choice_logprobs := self._get_logprobs_from_choice(choice):\n logprobs.append(choice_logprobs)\n\n statistics = self._get_llm_statistics(completion)\n return prepare_output(\n generations=generations,\n input_tokens=statistics[\"input_tokens\"],\n output_tokens=statistics[\"output_tokens\"],\n logprobs=logprobs,\n )\n\n def _get_logprobs_from_choice(\n self, choice: \"OpenAIChoice\"\n ) -> Union[List[List[\"Logprob\"]], None]:\n if choice.logprobs is None or choice.logprobs.content is None:\n return None\n\n return [\n [\n {\"token\": top_logprob.token, \"logprob\": top_logprob.logprob}\n for top_logprob in token_logprobs.top_logprobs\n ]\n for token_logprobs in choice.logprobs.content\n ]\n\n def offline_batch_generate(\n self,\n inputs: Union[List[\"FormattedInput\"], None] = None,\n num_generations: int = 1,\n max_new_tokens: int = 128,\n logprobs: bool = False,\n top_logprobs: Optional[PositiveInt] = None,\n frequency_penalty: float = 0.0,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n stop: Optional[Union[str, List[str]]] = None,\n response_format: Optional[str] = None,\n **kwargs: Any,\n ) -> List[\"GenerateOutput\"]:\n \"\"\"Uses the OpenAI batch API to generate `num_generations` responses for the given\n inputs.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n logprobs: whether to return the log probabilities or not. Defaults to `False`.\n top_logprobs: the number of top log probabilities to return per output token\n generated. Defaults to `None`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n stop: a string or a list of strings to use as a stop sequence for the generation.\n Defaults to `None`.\n response_format: the format of the response to return. Must be one of\n \"text\" or \"json\". Read the documentation [here](https://platform.openai.com/docs/guides/text-generation/json-mode)\n for more information on how to use the JSON model from OpenAI. 
Defaults to `text`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input\n in `inputs`.\n\n Raises:\n DistilabelOfflineBatchGenerationNotFinishedException: if the batch generation\n is not finished yet.\n ValueError: if no job IDs were found to retrieve the results from.\n \"\"\"\n if self.jobs_ids:\n return self._check_and_get_batch_results()\n\n if inputs:\n self.jobs_ids = self._create_jobs(\n inputs=inputs,\n **{\n \"model\": self.model,\n \"logprobs\": logprobs,\n \"top_logprobs\": top_logprobs,\n \"max_tokens\": max_new_tokens,\n \"n\": num_generations,\n \"frequency_penalty\": frequency_penalty,\n \"presence_penalty\": presence_penalty,\n \"temperature\": temperature,\n \"top_p\": top_p,\n \"stop\": stop,\n \"response_format\": response_format,\n },\n )\n raise DistilabelOfflineBatchGenerationNotFinishedException(\n jobs_ids=self.jobs_ids\n )\n\n raise ValueError(\"No `inputs` were provided and no `jobs_ids` were found.\")\n\n def _check_and_get_batch_results(self) -> List[\"GenerateOutput\"]:\n \"\"\"Checks the status of the batch jobs and retrieves the results from the OpenAI\n Batch API.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n\n Raises:\n ValueError: if no job IDs were found to retrieve the results from.\n DistilabelOfflineBatchGenerationNotFinishedException: if the batch generation\n is not finished yet.\n RuntimeError: if the only batch job found failed.\n \"\"\"\n if not self.jobs_ids:\n raise ValueError(\"No job IDs were found to retrieve the results from.\")\n\n outputs = []\n for batch_id in self.jobs_ids:\n batch = self._get_openai_batch(batch_id)\n\n if batch.status in (\"validating\", \"in_progress\", \"finalizing\"):\n raise DistilabelOfflineBatchGenerationNotFinishedException(\n jobs_ids=self.jobs_ids\n )\n\n if batch.status in (\"failed\", \"expired\", \"cancelled\", \"cancelling\"):\n self._logger.error( # type: ignore\n f\"OpenAI API batch with ID '{batch_id}' failed with status '{batch.status}'.\"\n )\n if len(self.jobs_ids) == 1:\n self.jobs_ids = None\n raise RuntimeError(\n f\"The only OpenAI API Batch that was created with ID '{batch_id}'\"\n f\" failed with status '{batch.status}'.\"\n )\n\n continue\n\n outputs.extend(self._retrieve_batch_results(batch))\n\n # sort by `custom_id` to return the results in the same order as the inputs\n outputs = sorted(outputs, key=lambda x: int(x[\"custom_id\"]))\n return [self._parse_output(output) for output in outputs]\n\n def _parse_output(self, output: Dict[str, Any]) -> \"GenerateOutput\":\n \"\"\"Parses the output from the OpenAI Batch API into a list of strings.\n\n Args:\n output: the output to parse.\n\n Returns:\n A list of strings containing the generated responses for the input.\n \"\"\"\n from openai.types.chat import ChatCompletion as OpenAIChatCompletion\n\n if \"response\" not in output:\n return []\n\n if output[\"response\"][\"status_code\"] != 200:\n return []\n\n return self._generations_from_openai_completion(\n OpenAIChatCompletion(**output[\"response\"][\"body\"])\n )\n\n def _get_openai_batch(self, batch_id: str) -> \"OpenAIBatch\":\n \"\"\"Gets a batch from the OpenAI Batch API.\n\n Args:\n batch_id: the ID of the batch to retrieve.\n\n Returns:\n The batch retrieved from the OpenAI Batch API.\n\n Raises:\n openai.OpenAIError: if there was an error while retrieving the batch from the\n OpenAI Batch API.\n \"\"\"\n import openai\n\n try:\n return self._client.batches.retrieve(batch_id)\n except 
openai.OpenAIError as e:\n self._logger.error( # type: ignore\n f\"Error while retrieving batch '{batch_id}' from OpenAI: {e}\"\n )\n raise e\n\n def _retrieve_batch_results(self, batch: \"OpenAIBatch\") -> List[Dict[str, Any]]:\n \"\"\"Retrieves the results of a batch from its output file, parsing the JSONL content\n into a list of dictionaries.\n\n Args:\n batch: the batch to retrieve the results from.\n\n Returns:\n A list of dictionaries containing the results of the batch.\n\n Raises:\n AssertionError: if no output file ID was found in the batch.\n \"\"\"\n import openai\n\n assert batch.output_file_id, \"No output file ID was found in the batch.\"\n\n try:\n file_response = self._client.files.content(batch.output_file_id)\n return [orjson.loads(line) for line in file_response.text.splitlines()]\n except openai.OpenAIError as e:\n self._logger.error( # type: ignore\n f\"Error while retrieving batch results from file '{batch.output_file_id}': {e}\"\n )\n return []\n\n def _create_jobs(\n self, inputs: List[\"FormattedInput\"], **kwargs: Any\n ) -> Tuple[str, ...]:\n \"\"\"Creates jobs in the OpenAI Batch API to generate responses for the given inputs.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n kwargs: the keyword arguments to use for the generation.\n\n Returns:\n A list of job IDs created in the OpenAI Batch API.\n \"\"\"\n batch_input_files = self._create_batch_files(inputs=inputs, **kwargs)\n jobs = []\n for batch_input_file in batch_input_files:\n if batch := self._create_batch_api_job(batch_input_file):\n jobs.append(batch.id)\n return tuple(jobs)\n\n def _create_batch_api_job(\n self, batch_input_file: \"OpenAIFileObject\"\n ) -> Union[\"OpenAIBatch\", None]:\n \"\"\"Creates a job in the OpenAI Batch API to generate responses for the given input\n file.\n\n Args:\n batch_input_file: the input file to generate responses for.\n\n Returns:\n The batch job created in the OpenAI Batch API.\n \"\"\"\n import openai\n\n metadata = {\"description\": \"distilabel\"}\n\n if distilabel_pipeline_name := envs.DISTILABEL_PIPELINE_NAME:\n metadata[\"distilabel_pipeline_name\"] = distilabel_pipeline_name\n\n if distilabel_pipeline_cache_id := envs.DISTILABEL_PIPELINE_CACHE_ID:\n metadata[\"distilabel_pipeline_cache_id\"] = distilabel_pipeline_cache_id\n\n batch = None\n try:\n batch = self._client.batches.create(\n completion_window=\"24h\",\n endpoint=\"/v1/chat/completions\",\n input_file_id=batch_input_file.id,\n metadata=metadata,\n )\n except openai.OpenAIError as e:\n self._logger.error( # type: ignore\n f\"Error while creating OpenAI Batch API job for file with ID\"\n f\" '{batch_input_file.id}': {e}.\"\n )\n raise e\n return batch\n\n def _create_batch_files(\n self, inputs: List[\"FormattedInput\"], **kwargs: Any\n ) -> List[\"OpenAIFileObject\"]:\n \"\"\"Creates the necessary input files for the batch API to generate responses. 
The\n maximum size of each file so the OpenAI Batch API can process it is 100MB, so we\n need to split the inputs into multiple files if necessary.\n\n More information: https://platform.openai.com/docs/api-reference/files/create\n\n Args:\n inputs: a list of inputs in chat format to generate responses for, optionally\n including structured output.\n kwargs: the keyword arguments to use for the generation.\n\n Returns:\n The list of file objects created for the OpenAI Batch API.\n\n Raises:\n openai.OpenAIError: if there was an error while creating the batch input file\n in the OpenAI Batch API.\n \"\"\"\n import openai\n\n files = []\n for file_no, buffer in enumerate(\n self._create_jsonl_buffers(inputs=inputs, **kwargs)\n ):\n try:\n # TODO: add distilabel pipeline name and id\n batch_input_file = self._client.files.create(\n file=(self._name_for_openai_files(file_no), buffer),\n purpose=\"batch\",\n )\n files.append(batch_input_file)\n except openai.OpenAIError as e:\n self._logger.error( # type: ignore\n f\"Error while creating OpenAI batch input file: {e}\"\n )\n raise e\n return files\n\n def _create_jsonl_buffers(\n self, inputs: List[\"FormattedInput\"], **kwargs: Any\n ) -> Generator[io.BytesIO, None, None]:\n \"\"\"Creates a generator of buffers containing the JSONL formatted inputs to be\n used by the OpenAI Batch API. The buffers created are of size 100MB or less.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for, optionally\n including structured output.\n kwargs: the keyword arguments to use for the generation.\n\n Yields:\n A buffer containing the JSONL formatted inputs to be used by the OpenAI Batch\n API.\n \"\"\"\n buffer = io.BytesIO()\n buffer_current_size = 0\n for i, input in enumerate(inputs):\n # We create the smallest `custom_id` so we don't increase the size of the file\n # to much, but we can still sort the results with the order of the inputs.\n row = self._create_jsonl_row(input=input, custom_id=str(i), **kwargs)\n row_size = len(row)\n if row_size + buffer_current_size > _OPENAI_BATCH_API_MAX_FILE_SIZE:\n buffer.seek(0)\n yield buffer\n buffer = io.BytesIO()\n buffer_current_size = 0\n buffer.write(row)\n buffer_current_size += row_size\n\n if buffer_current_size > 0:\n buffer.seek(0)\n yield buffer\n\n def _create_jsonl_row(\n self, input: \"FormattedInput\", custom_id: str, **kwargs: Any\n ) -> bytes:\n \"\"\"Creates a JSONL formatted row to be used by the OpenAI Batch API.\n\n Args:\n input: a list of inputs in chat format to generate responses for, optionally\n including structured output.\n custom_id: a custom ID to use for the row.\n kwargs: the keyword arguments to use for the generation.\n\n Returns:\n A JSONL formatted row to be used by the OpenAI Batch API.\n \"\"\"\n # TODO: depending on the format of the input, add `response_format` to the kwargs\n row = {\n \"custom_id\": custom_id,\n \"method\": \"POST\",\n \"url\": \"/v1/chat/completions\",\n \"body\": {\"messages\": input, **kwargs},\n }\n json_row = orjson.dumps(row)\n return json_row + b\"\\n\"\n\n def _name_for_openai_files(self, file_no: int) -> str:\n if (\n envs.DISTILABEL_PIPELINE_NAME is None\n or envs.DISTILABEL_PIPELINE_CACHE_ID is None\n ):\n return f\"distilabel-pipeline-fileno-{file_no}.jsonl\"\n\n return f\"distilabel-pipeline-{envs.DISTILABEL_PIPELINE_NAME}-{envs.DISTILABEL_PIPELINE_CACHE_ID}-fileno-{file_no}.jsonl\"\n\n @staticmethod\n def _get_llm_statistics(\n completion: Union[\"OpenAIChatCompletion\", \"OpenAICompletion\"],\n ) -> 
\"LLMStatistics\":\n return {\n \"output_tokens\": [\n completion.usage.completion_tokens if completion.usage else 0\n ],\n \"input_tokens\": [completion.usage.prompt_tokens if completion.usage else 0],\n }\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM.load","title":"load()
","text":"Loads the AsyncOpenAI
client to benefit from async requests.
Source code in src/distilabel/models/llms/openai.py
def load(self) -> None:\n \"\"\"Loads the `AsyncOpenAI` client to benefit from async requests.\"\"\"\n super().load()\n\n try:\n from openai import AsyncOpenAI, OpenAI\n except ImportError as ie:\n raise ImportError(\n \"OpenAI Python client is not installed. Please install it using\"\n \" `pip install openai`.\"\n ) from ie\n\n if self.api_key is None:\n raise ValueError(\n f\"To use `{self.__class__.__name__}` an API key must be provided via `api_key`\"\n f\" attribute or runtime parameter, or set the environment variable `{self._api_key_env_var}`.\"\n )\n\n self._client = OpenAI(\n base_url=self.base_url,\n api_key=self.api_key.get_secret_value(),\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n self._aclient = AsyncOpenAI(\n base_url=self.base_url,\n api_key=self.api_key.get_secret_value(),\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n if self.structured_output:\n result = self._prepare_structured_output(\n structured_output=self.structured_output,\n client=self._aclient,\n framework=\"openai\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n if structured_output := result.get(\"structured_output\"):\n self.structured_output = structured_output\n
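As a sketch, the API key can also be picked up from the environment before the client is instantiated and loaded (the key value below is a placeholder):
import os\n\nfrom distilabel.models.llms import OpenAILLM\n\n# Placeholder key for illustration only\nos.environ[\"OPENAI_API_KEY\"] = \"sk-...\"\n\nllm = OpenAILLM(model=\"gpt-4-turbo\")\nllm.load()  # instantiates both the sync and async OpenAI clients\n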
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM.unload","title":"unload()
","text":"Set clients to None
as they both contain thread._RLock
which cannot be pickled in case an exception is raised and has to be handled in the main process
Source code in src/distilabel/models/llms/openai.py
def unload(self) -> None:\n \"\"\"Set clients to `None` as they both contain `thread._RLock` which cannot be pickled\n in case an exception is raised and has to be handled in the main process\"\"\"\n\n self._client = None # type: ignore\n self._aclient = None # type: ignore\n self.structured_output = None\n super().unload()\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM.agenerate","title":"agenerate(input, num_generations=1, max_new_tokens=128, logprobs=False, top_logprobs=None, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, stop=None, response_format=None)
async
","text":"Generates num_generations
responses for the given input using the OpenAI async client.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requirednum_generations
int
the number of generations to create per input. Defaults to 1
.
1
max_new_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
logprobs
bool
whether to return the log probabilities or not. Defaults to False
.
False
top_logprobs
Optional[PositiveInt]
the number of top log probabilities to return per output token generated. Defaults to None
.
None
frequency_penalty
float
the repetition penalty to use for the generation. Defaults to 0.0
.
0.0
presence_penalty
float
the presence penalty to use for the generation. Defaults to 0.0
.
0.0
temperature
float
the temperature to use for the generation. Defaults to 1.0
.
1.0
top_p
float
the top-p value to use for the generation. Defaults to 1.0
.
1.0
stop
Optional[Union[str, List[str]]]
a string or a list of strings to use as a stop sequence for the generation. Defaults to None
.
None
response_format
Optional[Dict[str, str]]
the format of the response to return. Must be one of \"text\" or \"json\". Read the documentation here for more information on how to use JSON mode from OpenAI. Defaults to None, which returns text. To return JSON, use {\"type\": \"json_object\"}.
None
Note If response_format
Returns:
Type DescriptionGenerateOutput
A list of lists of strings containing the generated responses for each input.
Source code in src/distilabel/models/llms/openai.py
@validate_call\nasync def agenerate( # type: ignore\n self,\n input: FormattedInput,\n num_generations: int = 1,\n max_new_tokens: int = 128,\n logprobs: bool = False,\n top_logprobs: Optional[PositiveInt] = None,\n frequency_penalty: float = 0.0,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n stop: Optional[Union[str, List[str]]] = None,\n response_format: Optional[Dict[str, str]] = None,\n) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for the given input using the OpenAI async\n client.\n\n Args:\n input: a single input in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n logprobs: whether to return the log probabilities or not. Defaults to `False`.\n top_logprobs: the number of top log probabilities to return per output token\n generated. Defaults to `None`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n stop: a string or a list of strings to use as a stop sequence for the generation.\n Defaults to `None`.\n response_format: the format of the response to return. Must be one of\n \"text\" or \"json\". Read the documentation [here](https://platform.openai.com/docs/guides/text-generation/json-mode)\n for more information on how to use the JSON model from OpenAI. Defaults to None\n which returns text. To return JSON, use {\"type\": \"json_object\"}.\n\n Note:\n If response_format\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n\n structured_output = None\n if isinstance(input, tuple):\n input, structured_output = input\n result = self._prepare_structured_output(\n structured_output=structured_output, # type: ignore\n client=self._aclient,\n framework=\"openai\",\n )\n self._aclient = result.get(\"client\") # type: ignore\n\n if structured_output is None and self.structured_output is not None:\n structured_output = self.structured_output\n\n kwargs = {\n \"messages\": input, # type: ignore\n \"model\": self.model,\n \"logprobs\": logprobs,\n \"top_logprobs\": top_logprobs,\n \"max_tokens\": max_new_tokens,\n \"n\": num_generations,\n \"frequency_penalty\": frequency_penalty,\n \"presence_penalty\": presence_penalty,\n \"temperature\": temperature,\n \"top_p\": top_p,\n \"stop\": stop,\n }\n # Check if it's a vision generation task, in that case \"stop\" cannot be used or raises\n # an error in the API.\n if isinstance(\n [row for row in input if row[\"role\"] == \"user\"][0][\"content\"], list\n ):\n kwargs.pop(\"stop\")\n\n if response_format is not None:\n kwargs[\"response_format\"] = response_format\n\n if structured_output:\n kwargs = self._prepare_kwargs(kwargs, structured_output) # type: ignore\n\n completion = await self._aclient.chat.completions.create(**kwargs) # type: ignore\n\n if structured_output:\n # NOTE: `instructor` doesn't work with `n` parameter, so it will always return\n # only 1 choice.\n statistics = self._get_llm_statistics(completion._raw_response)\n if choice_logprobs := self._get_logprobs_from_choice(\n completion._raw_response.choices[0]\n ):\n output_logprobs = [choice_logprobs]\n else:\n 
output_logprobs = None\n return prepare_output(\n generations=[completion.model_dump_json()],\n input_tokens=statistics[\"input_tokens\"],\n output_tokens=statistics[\"output_tokens\"],\n logprobs=output_logprobs,\n )\n\n return self._generations_from_openai_completion(completion)\n
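A minimal sketch of awaiting agenerate directly with JSON mode enabled (inside a pipeline generate or generate_outputs would normally drive it):
import asyncio\n\nfrom distilabel.models.llms import OpenAILLM\n\nllm = OpenAILLM(model=\"gpt-4-turbo\", api_key=\"api.key\")\nllm.load()\n\nresult = asyncio.run(\n    llm.agenerate(\n        input=[{\"role\": \"user\", \"content\": \"Reply with a JSON object containing a 'joke' key.\"}],\n        temperature=0.7,\n        max_new_tokens=256,\n        response_format={\"type\": \"json_object\"},\n    )\n)\n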
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._generations_from_openai_completion","title":"_generations_from_openai_completion(completion)
","text":"Get the generations from the OpenAI Chat Completion object.
Parameters:
Name Type Description Defaultcompletion
ChatCompletion
the completion object to get the generations from.
requiredReturns:
Type DescriptionGenerateOutput
A list of strings containing the generated responses for the input.
Source code in src/distilabel/models/llms/openai.py
def _generations_from_openai_completion(\n self, completion: \"OpenAIChatCompletion\"\n) -> \"GenerateOutput\":\n \"\"\"Get the generations from the OpenAI Chat Completion object.\n\n Args:\n completion: the completion object to get the generations from.\n\n Returns:\n A list of strings containing the generated responses for the input.\n \"\"\"\n generations = []\n logprobs = []\n for choice in completion.choices:\n if (content := choice.message.content) is None:\n self._logger.warning( # type: ignore\n f\"Received no response using OpenAI client (model: '{self.model}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(content)\n if choice_logprobs := self._get_logprobs_from_choice(choice):\n logprobs.append(choice_logprobs)\n\n statistics = self._get_llm_statistics(completion)\n return prepare_output(\n generations=generations,\n input_tokens=statistics[\"input_tokens\"],\n output_tokens=statistics[\"output_tokens\"],\n logprobs=logprobs,\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM.offline_batch_generate","title":"offline_batch_generate(inputs=None, num_generations=1, max_new_tokens=128, logprobs=False, top_logprobs=None, frequency_penalty=0.0, presence_penalty=0.0, temperature=1.0, top_p=1.0, stop=None, response_format=None, **kwargs)
","text":"Uses the OpenAI batch API to generate num_generations
responses for the given inputs.
Parameters:
Name Type Description Defaultinputs
Union[List[FormattedInput], None]
a list of inputs in chat format to generate responses for.
None
num_generations
int
the number of generations to create per input. Defaults to 1
.
1
max_new_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
logprobs
bool
whether to return the log probabilities or not. Defaults to False
.
False
top_logprobs
Optional[PositiveInt]
the number of top log probabilities to return per output token generated. Defaults to None
.
None
frequency_penalty
float
the repetition penalty to use for the generation. Defaults to 0.0
.
0.0
presence_penalty
float
the presence penalty to use for the generation. Defaults to 0.0
.
0.0
temperature
float
the temperature to use for the generation. Defaults to 1.0
.
1.0
top_p
float
the top-p value to use for the generation. Defaults to 1.0
.
1.0
stop
Optional[Union[str, List[str]]]
a string or a list of strings to use as a stop sequence for the generation. Defaults to None
.
None
response_format
Optional[str]
the format of the response to return. Must be one of \"text\" or \"json\". Read the documentation here for more information on how to use JSON mode from OpenAI. Defaults to None
.
None
Returns:
Type DescriptionList[GenerateOutput]
A list of lists of strings containing the generated responses for each input
List[GenerateOutput]
in inputs
.
Raises:
Type DescriptionDistilabelOfflineBatchGenerationNotFinishedException
if the batch generation is not finished yet.
ValueError
if no job IDs were found to retrieve the results from.
Source code in src/distilabel/models/llms/openai.py
def offline_batch_generate(\n self,\n inputs: Union[List[\"FormattedInput\"], None] = None,\n num_generations: int = 1,\n max_new_tokens: int = 128,\n logprobs: bool = False,\n top_logprobs: Optional[PositiveInt] = None,\n frequency_penalty: float = 0.0,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n stop: Optional[Union[str, List[str]]] = None,\n response_format: Optional[str] = None,\n **kwargs: Any,\n) -> List[\"GenerateOutput\"]:\n \"\"\"Uses the OpenAI batch API to generate `num_generations` responses for the given\n inputs.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n logprobs: whether to return the log probabilities or not. Defaults to `False`.\n top_logprobs: the number of top log probabilities to return per output token\n generated. Defaults to `None`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n stop: a string or a list of strings to use as a stop sequence for the generation.\n Defaults to `None`.\n response_format: the format of the response to return. Must be one of\n \"text\" or \"json\". Read the documentation [here](https://platform.openai.com/docs/guides/text-generation/json-mode)\n for more information on how to use the JSON model from OpenAI. Defaults to `text`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input\n in `inputs`.\n\n Raises:\n DistilabelOfflineBatchGenerationNotFinishedException: if the batch generation\n is not finished yet.\n ValueError: if no job IDs were found to retrieve the results from.\n \"\"\"\n if self.jobs_ids:\n return self._check_and_get_batch_results()\n\n if inputs:\n self.jobs_ids = self._create_jobs(\n inputs=inputs,\n **{\n \"model\": self.model,\n \"logprobs\": logprobs,\n \"top_logprobs\": top_logprobs,\n \"max_tokens\": max_new_tokens,\n \"n\": num_generations,\n \"frequency_penalty\": frequency_penalty,\n \"presence_penalty\": presence_penalty,\n \"temperature\": temperature,\n \"top_p\": top_p,\n \"stop\": stop,\n \"response_format\": response_format,\n },\n )\n raise DistilabelOfflineBatchGenerationNotFinishedException(\n jobs_ids=self.jobs_ids\n )\n\n raise ValueError(\"No `inputs` were provided and no `jobs_ids` were found.\")\n
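A hedged sketch of driving this manually: the first call uploads the batch files and raises DistilabelOfflineBatchGenerationNotFinishedException with the created job IDs, and a later call on an LLM carrying the same jobs_ids returns the results (the exception import path below is an assumption):
from distilabel.exceptions import (\n    DistilabelOfflineBatchGenerationNotFinishedException,  # assumed import path\n)\nfrom distilabel.models.llms import OpenAILLM\n\nllm = OpenAILLM(model=\"gpt-3.5-turbo\", use_offline_batch_generation=True)\nllm.load()\n\ninputs = [[{\"role\": \"user\", \"content\": \"Hello world!\"}]]\n\ntry:\n    outputs = llm.offline_batch_generate(inputs=inputs, max_new_tokens=64)\nexcept DistilabelOfflineBatchGenerationNotFinishedException as e:\n    # Jobs were submitted; keep e.jobs_ids around and retry later to fetch the results\n    print(e.jobs_ids)\n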
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._check_and_get_batch_results","title":"_check_and_get_batch_results()
","text":"Checks the status of the batch jobs and retrieves the results from the OpenAI Batch API.
Returns:
Type DescriptionList[GenerateOutput]
A list of lists of strings containing the generated responses for each input.
Raises:
Type DescriptionValueError
if no job IDs were found to retrieve the results from.
DistilabelOfflineBatchGenerationNotFinishedException
if the batch generation is not finished yet.
RuntimeError
if the only batch job found failed.
Source code in src/distilabel/models/llms/openai.py
def _check_and_get_batch_results(self) -> List[\"GenerateOutput\"]:\n \"\"\"Checks the status of the batch jobs and retrieves the results from the OpenAI\n Batch API.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n\n Raises:\n ValueError: if no job IDs were found to retrieve the results from.\n DistilabelOfflineBatchGenerationNotFinishedException: if the batch generation\n is not finished yet.\n RuntimeError: if the only batch job found failed.\n \"\"\"\n if not self.jobs_ids:\n raise ValueError(\"No job IDs were found to retrieve the results from.\")\n\n outputs = []\n for batch_id in self.jobs_ids:\n batch = self._get_openai_batch(batch_id)\n\n if batch.status in (\"validating\", \"in_progress\", \"finalizing\"):\n raise DistilabelOfflineBatchGenerationNotFinishedException(\n jobs_ids=self.jobs_ids\n )\n\n if batch.status in (\"failed\", \"expired\", \"cancelled\", \"cancelling\"):\n self._logger.error( # type: ignore\n f\"OpenAI API batch with ID '{batch_id}' failed with status '{batch.status}'.\"\n )\n if len(self.jobs_ids) == 1:\n self.jobs_ids = None\n raise RuntimeError(\n f\"The only OpenAI API Batch that was created with ID '{batch_id}'\"\n f\" failed with status '{batch.status}'.\"\n )\n\n continue\n\n outputs.extend(self._retrieve_batch_results(batch))\n\n # sort by `custom_id` to return the results in the same order as the inputs\n outputs = sorted(outputs, key=lambda x: int(x[\"custom_id\"]))\n return [self._parse_output(output) for output in outputs]\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._parse_output","title":"_parse_output(output)
","text":"Parses the output from the OpenAI Batch API into a list of strings.
Parameters:
Name Type Description Defaultoutput
Dict[str, Any]
the output to parse.
requiredReturns:
Type DescriptionGenerateOutput
A list of strings containing the generated responses for the input.
Source code in src/distilabel/models/llms/openai.py
def _parse_output(self, output: Dict[str, Any]) -> \"GenerateOutput\":\n \"\"\"Parses the output from the OpenAI Batch API into a list of strings.\n\n Args:\n output: the output to parse.\n\n Returns:\n A list of strings containing the generated responses for the input.\n \"\"\"\n from openai.types.chat import ChatCompletion as OpenAIChatCompletion\n\n if \"response\" not in output:\n return []\n\n if output[\"response\"][\"status_code\"] != 200:\n return []\n\n return self._generations_from_openai_completion(\n OpenAIChatCompletion(**output[\"response\"][\"body\"])\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._get_openai_batch","title":"_get_openai_batch(batch_id)
","text":"Gets a batch from the OpenAI Batch API.
Parameters:
Name Type Description Defaultbatch_id
str
the ID of the batch to retrieve.
requiredReturns:
Type DescriptionBatch
The batch retrieved from the OpenAI Batch API.
Raises:
Type DescriptionOpenAIError
if there was an error while retrieving the batch from the OpenAI Batch API.
Source code in src/distilabel/models/llms/openai.py
def _get_openai_batch(self, batch_id: str) -> \"OpenAIBatch\":\n \"\"\"Gets a batch from the OpenAI Batch API.\n\n Args:\n batch_id: the ID of the batch to retrieve.\n\n Returns:\n The batch retrieved from the OpenAI Batch API.\n\n Raises:\n openai.OpenAIError: if there was an error while retrieving the batch from the\n OpenAI Batch API.\n \"\"\"\n import openai\n\n try:\n return self._client.batches.retrieve(batch_id)\n except openai.OpenAIError as e:\n self._logger.error( # type: ignore\n f\"Error while retrieving batch '{batch_id}' from OpenAI: {e}\"\n )\n raise e\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._retrieve_batch_results","title":"_retrieve_batch_results(batch)
","text":"Retrieves the results of a batch from its output file, parsing the JSONL content into a list of dictionaries.
Parameters:
Name Type Description Defaultbatch
Batch
the batch to retrieve the results from.
requiredReturns:
Type DescriptionList[Dict[str, Any]]
A list of dictionaries containing the results of the batch.
Raises:
Type DescriptionAssertionError
if no output file ID was found in the batch.
Source code insrc/distilabel/models/llms/openai.py
def _retrieve_batch_results(self, batch: \"OpenAIBatch\") -> List[Dict[str, Any]]:\n \"\"\"Retrieves the results of a batch from its output file, parsing the JSONL content\n into a list of dictionaries.\n\n Args:\n batch: the batch to retrieve the results from.\n\n Returns:\n A list of dictionaries containing the results of the batch.\n\n Raises:\n AssertionError: if no output file ID was found in the batch.\n \"\"\"\n import openai\n\n assert batch.output_file_id, \"No output file ID was found in the batch.\"\n\n try:\n file_response = self._client.files.content(batch.output_file_id)\n return [orjson.loads(line) for line in file_response.text.splitlines()]\n except openai.OpenAIError as e:\n self._logger.error( # type: ignore\n f\"Error while retrieving batch results from file '{batch.output_file_id}': {e}\"\n )\n return []\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._create_jobs","title":"_create_jobs(inputs, **kwargs)
","text":"Creates jobs in the OpenAI Batch API to generate responses for the given inputs.
Parameters:
Name Type Description Defaultinputs
List[FormattedInput]
a list of inputs in chat format to generate responses for.
requiredkwargs
Any
the keyword arguments to use for the generation.
{}
Returns:
Type DescriptionTuple[str, ...]
A list of job IDs created in the OpenAI Batch API.
Source code insrc/distilabel/models/llms/openai.py
def _create_jobs(\n self, inputs: List[\"FormattedInput\"], **kwargs: Any\n) -> Tuple[str, ...]:\n \"\"\"Creates jobs in the OpenAI Batch API to generate responses for the given inputs.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n kwargs: the keyword arguments to use for the generation.\n\n Returns:\n A list of job IDs created in the OpenAI Batch API.\n \"\"\"\n batch_input_files = self._create_batch_files(inputs=inputs, **kwargs)\n jobs = []\n for batch_input_file in batch_input_files:\n if batch := self._create_batch_api_job(batch_input_file):\n jobs.append(batch.id)\n return tuple(jobs)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._create_batch_api_job","title":"_create_batch_api_job(batch_input_file)
","text":"Creates a job in the OpenAI Batch API to generate responses for the given input file.
Parameters:
Name Type Description Defaultbatch_input_file
FileObject
the input file to generate responses for.
requiredReturns:
Type DescriptionUnion[Batch, None]
The batch job created in the OpenAI Batch API.
Source code insrc/distilabel/models/llms/openai.py
def _create_batch_api_job(\n self, batch_input_file: \"OpenAIFileObject\"\n) -> Union[\"OpenAIBatch\", None]:\n \"\"\"Creates a job in the OpenAI Batch API to generate responses for the given input\n file.\n\n Args:\n batch_input_file: the input file to generate responses for.\n\n Returns:\n The batch job created in the OpenAI Batch API.\n \"\"\"\n import openai\n\n metadata = {\"description\": \"distilabel\"}\n\n if distilabel_pipeline_name := envs.DISTILABEL_PIPELINE_NAME:\n metadata[\"distilabel_pipeline_name\"] = distilabel_pipeline_name\n\n if distilabel_pipeline_cache_id := envs.DISTILABEL_PIPELINE_CACHE_ID:\n metadata[\"distilabel_pipeline_cache_id\"] = distilabel_pipeline_cache_id\n\n batch = None\n try:\n batch = self._client.batches.create(\n completion_window=\"24h\",\n endpoint=\"/v1/chat/completions\",\n input_file_id=batch_input_file.id,\n metadata=metadata,\n )\n except openai.OpenAIError as e:\n self._logger.error( # type: ignore\n f\"Error while creating OpenAI Batch API job for file with ID\"\n f\" '{batch_input_file.id}': {e}.\"\n )\n raise e\n return batch\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._create_batch_files","title":"_create_batch_files(inputs, **kwargs)
","text":"Creates the necessary input files for the batch API to generate responses. The maximum size of each file so the OpenAI Batch API can process it is 100MB, so we need to split the inputs into multiple files if necessary.
More information: https://platform.openai.com/docs/api-reference/files/create
Parameters:
Name Type Description Defaultinputs
List[FormattedInput]
a list of inputs in chat format to generate responses for, optionally including structured output.
requiredkwargs
Any
the keyword arguments to use for the generation.
{}
Returns:
Type DescriptionList[FileObject]
The list of file objects created for the OpenAI Batch API.
Raises:
Type DescriptionOpenAIError
if there was an error while creating the batch input file in the OpenAI Batch API.
Source code insrc/distilabel/models/llms/openai.py
def _create_batch_files(\n self, inputs: List[\"FormattedInput\"], **kwargs: Any\n) -> List[\"OpenAIFileObject\"]:\n \"\"\"Creates the necessary input files for the batch API to generate responses. The\n maximum size of each file so the OpenAI Batch API can process it is 100MB, so we\n need to split the inputs into multiple files if necessary.\n\n More information: https://platform.openai.com/docs/api-reference/files/create\n\n Args:\n inputs: a list of inputs in chat format to generate responses for, optionally\n including structured output.\n kwargs: the keyword arguments to use for the generation.\n\n Returns:\n The list of file objects created for the OpenAI Batch API.\n\n Raises:\n openai.OpenAIError: if there was an error while creating the batch input file\n in the OpenAI Batch API.\n \"\"\"\n import openai\n\n files = []\n for file_no, buffer in enumerate(\n self._create_jsonl_buffers(inputs=inputs, **kwargs)\n ):\n try:\n # TODO: add distilabel pipeline name and id\n batch_input_file = self._client.files.create(\n file=(self._name_for_openai_files(file_no), buffer),\n purpose=\"batch\",\n )\n files.append(batch_input_file)\n except openai.OpenAIError as e:\n self._logger.error( # type: ignore\n f\"Error while creating OpenAI batch input file: {e}\"\n )\n raise e\n return files\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._create_jsonl_buffers","title":"_create_jsonl_buffers(inputs, **kwargs)
","text":"Creates a generator of buffers containing the JSONL formatted inputs to be used by the OpenAI Batch API. The buffers created are of size 100MB or less.
Parameters:
Name Type Description Defaultinputs
List[FormattedInput]
a list of inputs in chat format to generate responses for, optionally including structured output.
requiredkwargs
Any
the keyword arguments to use for the generation.
{}
Yields:
Type DescriptionBytesIO
A buffer containing the JSONL formatted inputs to be used by the OpenAI Batch
BytesIO
API.
Source code insrc/distilabel/models/llms/openai.py
def _create_jsonl_buffers(\n self, inputs: List[\"FormattedInput\"], **kwargs: Any\n) -> Generator[io.BytesIO, None, None]:\n \"\"\"Creates a generator of buffers containing the JSONL formatted inputs to be\n used by the OpenAI Batch API. The buffers created are of size 100MB or less.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for, optionally\n including structured output.\n kwargs: the keyword arguments to use for the generation.\n\n Yields:\n A buffer containing the JSONL formatted inputs to be used by the OpenAI Batch\n API.\n \"\"\"\n buffer = io.BytesIO()\n buffer_current_size = 0\n for i, input in enumerate(inputs):\n # We create the smallest `custom_id` so we don't increase the size of the file\n # to much, but we can still sort the results with the order of the inputs.\n row = self._create_jsonl_row(input=input, custom_id=str(i), **kwargs)\n row_size = len(row)\n if row_size + buffer_current_size > _OPENAI_BATCH_API_MAX_FILE_SIZE:\n buffer.seek(0)\n yield buffer\n buffer = io.BytesIO()\n buffer_current_size = 0\n buffer.write(row)\n buffer_current_size += row_size\n\n if buffer_current_size > 0:\n buffer.seek(0)\n yield buffer\n
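The splitting above is plain size-capped chunking of serialized rows. A self-contained sketch of the same idea, using a tiny cap so the split is visible (the real limit is the 100MB constant used above), could be:
import io\nimport json\nfrom typing import Any, Dict, Generator, Iterable\n\n\ndef jsonl_buffers(rows: Iterable[Dict[str, Any]], max_size: int) -> Generator[io.BytesIO, None, None]:\n    \"\"\"Yield BytesIO buffers whose JSONL content stays at or below max_size bytes.\"\"\"\n    buffer, current_size = io.BytesIO(), 0\n    for row in rows:\n        line = json.dumps(row).encode() + b\"\\n\"\n        if current_size and current_size + len(line) > max_size:\n            buffer.seek(0)\n            yield buffer\n            buffer, current_size = io.BytesIO(), 0\n        buffer.write(line)\n        current_size += len(line)\n    if current_size:\n        buffer.seek(0)\n        yield buffer\n\n\n# With a 40 byte cap, three small rows end up spread across two buffers.\nrows = [{\"custom_id\": str(i)} for i in range(3)]\nfor i, buf in enumerate(jsonl_buffers(rows, max_size=40)):\n    print(f\"buffer {i}:\", buf.read())\n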
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.OpenAILLM._create_jsonl_row","title":"_create_jsonl_row(input, custom_id, **kwargs)
","text":"Creates a JSONL formatted row to be used by the OpenAI Batch API.
Parameters:
Name Type Description Defaultinput
FormattedInput
a list of inputs in chat format to generate responses for, optionally including structured output.
requiredcustom_id
str
a custom ID to use for the row.
requiredkwargs
Any
the keyword arguments to use for the generation.
{}
Returns:
Type Descriptionbytes
A JSONL formatted row to be used by the OpenAI Batch API.
Source code insrc/distilabel/models/llms/openai.py
def _create_jsonl_row(\n self, input: \"FormattedInput\", custom_id: str, **kwargs: Any\n) -> bytes:\n \"\"\"Creates a JSONL formatted row to be used by the OpenAI Batch API.\n\n Args:\n input: a list of inputs in chat format to generate responses for, optionally\n including structured output.\n custom_id: a custom ID to use for the row.\n kwargs: the keyword arguments to use for the generation.\n\n Returns:\n A JSONL formatted row to be used by the OpenAI Batch API.\n \"\"\"\n # TODO: depending on the format of the input, add `response_format` to the kwargs\n row = {\n \"custom_id\": custom_id,\n \"method\": \"POST\",\n \"url\": \"/v1/chat/completions\",\n \"body\": {\"messages\": input, **kwargs},\n }\n json_row = orjson.dumps(row)\n return json_row + b\"\\n\"\n
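A row built this way serializes to one JSON line of the batch input file. The body keys other than messages below are illustrative generation kwargs, since whatever keyword arguments are passed end up merged into body:
import json\n\n# Illustrative row in the shape built above (values are made up).\nrow = {\n    \"custom_id\": \"0\",\n    \"method\": \"POST\",\n    \"url\": \"/v1/chat/completions\",\n    \"body\": {\n        \"messages\": [{\"role\": \"user\", \"content\": \"Hello world!\"}],\n        \"model\": \"gpt-4o-mini\",\n        \"max_tokens\": 128,\n    },\n}\nprint(json.dumps(row))  # one line of the JSONL batch input file\n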
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.TogetherLLM","title":"TogetherLLM
","text":" Bases: OpenAILLM
TogetherLLM implementation running the async API client of OpenAI.
Attributes:
Name Type Descriptionmodel
str
the model name to use for the LLM e.g. \"mistralai/Mixtral-8x7B-Instruct-v0.1\". Supported models can be found here.
base_url
Optional[RuntimeParameter[str]]
the base URL to use for the Together API can be set with TOGETHER_BASE_URL
. Defaults to None
which means that the value set for the environment variable TOGETHER_BASE_URL
will be used, or \"https://api.together.xyz/v1\" if not set.
api_key
Optional[RuntimeParameter[SecretStr]]
the API key to authenticate the requests to the Together API. Defaults to None
which means that the value set for the environment variable TOGETHER_API_KEY
will be used, or None
if not set.
_api_key_env_var
str
the name of the environment variable to use for the API key. It is meant to be used internally.
Examples:
Generate text:
from distilabel.models.llms import TogetherLLM\n\nllm = TogetherLLM(model=\"mistralai/Mixtral-8x7B-Instruct-v0.1\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Source code in src/distilabel/models/llms/together.py
class TogetherLLM(OpenAILLM):\n \"\"\"TogetherLLM LLM implementation running the async API client of OpenAI.\n\n Attributes:\n model: the model name to use for the LLM e.g. \"mistralai/Mixtral-8x7B-Instruct-v0.1\".\n Supported models can be found [here](https://api.together.xyz/models).\n base_url: the base URL to use for the Together API can be set with `TOGETHER_BASE_URL`.\n Defaults to `None` which means that the value set for the environment variable\n `TOGETHER_BASE_URL` will be used, or \"https://api.together.xyz/v1\" if not set.\n api_key: the API key to authenticate the requests to the Together API. Defaults to `None`\n which means that the value set for the environment variable `TOGETHER_API_KEY` will be\n used, or `None` if not set.\n _api_key_env_var: the name of the environment variable to use for the API key. It\n is meant to be used internally.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import AnyscaleLLM\n\n llm = TogetherLLM(model=\"mistralai/Mixtral-8x7B-Instruct-v0.1\", api_key=\"api.key\")\n\n llm.load()\n\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n \"\"\"\n\n base_url: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(\n \"TOGETHER_BASE_URL\", \"https://api.together.xyz/v1\"\n ),\n description=\"The base URL to use for the Together API requests.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_TOGETHER_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the Together API.\",\n )\n\n _api_key_env_var: str = PrivateAttr(_TOGETHER_API_KEY_ENV_VAR_NAME)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.VertexAILLM","title":"VertexAILLM
","text":" Bases: AsyncLLM
VertexAI LLM implementation running the async API clients for Gemini.
To use the VertexAILLM
it is necessary to have Google Cloud authentication configured, using one of these methods:
GOOGLE_CLOUD_CREDENTIALS
environment variablegcloud auth application-default login
commandvertexai.init
function from the google-cloud-aiplatform
libraryAttributes:
Name Type Descriptionmodel
str
the model name to use for the LLM e.g. \"gemini-1.0-pro\". Supported models.
_aclient
Optional[GenerativeModel]
the GenerativeModel
to use for the Vertex AI Gemini API. It is meant to be used internally. Set in the load
method.
:simple-googlecloud:
Examples:
Generate text:
from distilabel.models.llms import VertexAILLM\n\nllm = VertexAILLM(model=\"gemini-1.5-pro\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Source code in src/distilabel/models/llms/vertexai.py
class VertexAILLM(AsyncLLM):\n \"\"\"VertexAI LLM implementation running the async API clients for Gemini.\n\n - Gemini API: https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini\n\n To use the `VertexAILLM` is necessary to have configured the Google Cloud authentication\n using one of these methods:\n\n - Setting `GOOGLE_CLOUD_CREDENTIALS` environment variable\n - Using `gcloud auth application-default login` command\n - Using `vertexai.init` function from the `google-cloud-aiplatform` library\n\n Attributes:\n model: the model name to use for the LLM e.g. \"gemini-1.0-pro\". [Supported models](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models).\n _aclient: the `GenerativeModel` to use for the Vertex AI Gemini API. It is meant\n to be used internally. Set in the `load` method.\n\n Icon:\n `:simple-googlecloud:`\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import VertexAILLM\n\n llm = VertexAILLM(model=\"gemini-1.5-pro\")\n\n llm.load()\n\n # Call the model\n output = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n \"\"\"\n\n model: str\n\n _num_generations_param_supported = False\n\n _aclient: Optional[\"GenerativeModel\"] = PrivateAttr(...)\n\n def load(self) -> None:\n \"\"\"Loads the `GenerativeModel` class which has access to `generate_content_async` to benefit from async requests.\"\"\"\n super().load()\n\n try:\n from vertexai.generative_models import GenerationConfig, GenerativeModel\n\n self._generation_config_class = GenerationConfig\n except ImportError as e:\n raise ImportError(\n \"vertexai is not installed. Please install it using\"\n \" `pip install google-cloud-aiplatform`.\"\n ) from e\n\n if _is_gemini_model(self.model):\n self._aclient = GenerativeModel(model_name=self.model)\n else:\n raise NotImplementedError(\n \"`VertexAILLM` is only implemented for `gemini` models that allow for `ChatType` data.\"\n )\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n def _chattype_to_content(self, input: \"StandardInput\") -> List[\"Content\"]:\n \"\"\"Converts a chat type to a list of content items expected by the API.\n\n Args:\n input: the chat type to be converted.\n\n Returns:\n List[str]: a list of content items expected by the API.\n \"\"\"\n from vertexai.generative_models import Content, Part\n\n contents = []\n for message in input:\n if message[\"role\"] not in [\"user\", \"model\"]:\n raise ValueError(\n \"`VertexAILLM only supports the roles 'user' or 'model'.\"\n )\n contents.append(\n Content(\n role=message[\"role\"], parts=[Part.from_text(message[\"content\"])]\n )\n )\n return contents\n\n @validate_call\n async def agenerate( # type: ignore\n self,\n input: VertexChatType,\n temperature: Optional[float] = None,\n top_p: Optional[float] = None,\n top_k: Optional[int] = None,\n max_output_tokens: Optional[int] = None,\n stop_sequences: Optional[List[str]] = None,\n safety_settings: Optional[Dict[str, Any]] = None,\n tools: Optional[List[Dict[str, Any]]] = None,\n ) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for the given input using the [VertexAI async client definition](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini).\n\n Args:\n input: a single input in chat format to generate responses for.\n temperature: Controls the randomness of predictions. Range: [0.0, 1.0]. 
Defaults to `None`.\n top_p: If specified, nucleus sampling will be used. Range: (0.0, 1.0]. Defaults to `None`.\n top_k: If specified, top-k sampling will be used. Defaults to `None`.\n max_output_tokens: The maximum number of output tokens to generate per message. Defaults to `None`.\n stop_sequences: A list of stop sequences. Defaults to `None`.\n safety_settings: Safety configuration for returned content from the API. Defaults to `None`.\n tools: A potential list of tools that can be used by the API. Defaults to `None`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n from vertexai.generative_models import GenerationConfig\n\n content: \"GenerationResponse\" = await self._aclient.generate_content_async( # type: ignore\n contents=self._chattype_to_content(input),\n generation_config=GenerationConfig(\n candidate_count=1, # only one candidate allowed per call\n temperature=temperature,\n top_k=top_k,\n top_p=top_p,\n max_output_tokens=max_output_tokens,\n stop_sequences=stop_sequences,\n ),\n safety_settings=safety_settings, # type: ignore\n tools=tools, # type: ignore\n stream=False,\n )\n\n text = None\n try:\n text = content.candidates[0].text\n except ValueError:\n self._logger.warning( # type: ignore\n f\"Received no response using VertexAI client (model: '{self.model}').\"\n f\" Finish reason was: '{content.candidates[0].finish_reason}'.\"\n )\n return prepare_output([text], **self._get_llm_statistics(content))\n\n @staticmethod\n def _get_llm_statistics(content: \"GenerationResponse\") -> \"LLMStatistics\":\n return {\n \"input_tokens\": [content.usage_metadata.prompt_token_count],\n \"output_tokens\": [content.usage_metadata.candidates_token_count],\n }\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.VertexAILLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.VertexAILLM.load","title":"load()
","text":"Loads the GenerativeModel
class which has access to generate_content_async
to benefit from async requests.
src/distilabel/models/llms/vertexai.py
def load(self) -> None:\n \"\"\"Loads the `GenerativeModel` class which has access to `generate_content_async` to benefit from async requests.\"\"\"\n super().load()\n\n try:\n from vertexai.generative_models import GenerationConfig, GenerativeModel\n\n self._generation_config_class = GenerationConfig\n except ImportError as e:\n raise ImportError(\n \"vertexai is not installed. Please install it using\"\n \" `pip install google-cloud-aiplatform`.\"\n ) from e\n\n if _is_gemini_model(self.model):\n self._aclient = GenerativeModel(model_name=self.model)\n else:\n raise NotImplementedError(\n \"`VertexAILLM` is only implemented for `gemini` models that allow for `ChatType` data.\"\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.VertexAILLM._chattype_to_content","title":"_chattype_to_content(input)
","text":"Converts a chat type to a list of content items expected by the API.
Parameters:
Name Type Description Defaultinput
StandardInput
the chat type to be converted.
requiredReturns:
Type DescriptionList[Content]
a list of content items expected by the API.
Source code insrc/distilabel/models/llms/vertexai.py
def _chattype_to_content(self, input: \"StandardInput\") -> List[\"Content\"]:\n \"\"\"Converts a chat type to a list of content items expected by the API.\n\n Args:\n input: the chat type to be converted.\n\n Returns:\n List[str]: a list of content items expected by the API.\n \"\"\"\n from vertexai.generative_models import Content, Part\n\n contents = []\n for message in input:\n if message[\"role\"] not in [\"user\", \"model\"]:\n raise ValueError(\n \"`VertexAILLM only supports the roles 'user' or 'model'.\"\n )\n contents.append(\n Content(\n role=message[\"role\"], parts=[Part.from_text(message[\"content\"])]\n )\n )\n return contents\n
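Because only the user and model roles are accepted here, OpenAI-style conversations that use the assistant role have to be remapped before reaching VertexAILLM. A small sketch of that pre-processing (plain Python, not a distilabel helper):
# Hypothetical pre-processing: remap OpenAI-style \"assistant\" turns to the\n# \"model\" role expected by the Vertex AI Gemini API.\ndef to_vertex_roles(conversation: list) -> list:\n    remapped = []\n    for message in conversation:\n        role = \"model\" if message[\"role\"] == \"assistant\" else message[\"role\"]\n        remapped.append({\"role\": role, \"content\": message[\"content\"]})\n    return remapped\n\n\nconversation = [\n    {\"role\": \"user\", \"content\": \"Hi!\"},\n    {\"role\": \"assistant\", \"content\": \"Hello, how can I help?\"},\n    {\"role\": \"user\", \"content\": \"Tell me a joke.\"},\n]\nprint(to_vertex_roles(conversation))\n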
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.VertexAILLM.agenerate","title":"agenerate(input, temperature=None, top_p=None, top_k=None, max_output_tokens=None, stop_sequences=None, safety_settings=None, tools=None)
async
","text":"Generates num_generations
responses for the given input using the VertexAI async client definition.
Parameters:
Name Type Description Defaultinput
VertexChatType
a single input in chat format to generate responses for.
requiredtemperature
Optional[float]
Controls the randomness of predictions. Range: [0.0, 1.0]. Defaults to None
.
None
top_p
Optional[float]
If specified, nucleus sampling will be used. Range: (0.0, 1.0]. Defaults to None
.
None
top_k
Optional[int]
If specified, top-k sampling will be used. Defaults to None
.
None
max_output_tokens
Optional[int]
The maximum number of output tokens to generate per message. Defaults to None
.
None
stop_sequences
Optional[List[str]]
A list of stop sequences. Defaults to None
.
None
safety_settings
Optional[Dict[str, Any]]
Safety configuration for returned content from the API. Defaults to None
.
None
tools
Optional[List[Dict[str, Any]]]
A potential list of tools that can be used by the API. Defaults to None
.
None
Returns:
Type DescriptionGenerateOutput
A list of lists of strings containing the generated responses for each input.
Source code insrc/distilabel/models/llms/vertexai.py
@validate_call\nasync def agenerate( # type: ignore\n self,\n input: VertexChatType,\n temperature: Optional[float] = None,\n top_p: Optional[float] = None,\n top_k: Optional[int] = None,\n max_output_tokens: Optional[int] = None,\n stop_sequences: Optional[List[str]] = None,\n safety_settings: Optional[Dict[str, Any]] = None,\n tools: Optional[List[Dict[str, Any]]] = None,\n) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for the given input using the [VertexAI async client definition](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini).\n\n Args:\n input: a single input in chat format to generate responses for.\n temperature: Controls the randomness of predictions. Range: [0.0, 1.0]. Defaults to `None`.\n top_p: If specified, nucleus sampling will be used. Range: (0.0, 1.0]. Defaults to `None`.\n top_k: If specified, top-k sampling will be used. Defaults to `None`.\n max_output_tokens: The maximum number of output tokens to generate per message. Defaults to `None`.\n stop_sequences: A list of stop sequences. Defaults to `None`.\n safety_settings: Safety configuration for returned content from the API. Defaults to `None`.\n tools: A potential list of tools that can be used by the API. Defaults to `None`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n from vertexai.generative_models import GenerationConfig\n\n content: \"GenerationResponse\" = await self._aclient.generate_content_async( # type: ignore\n contents=self._chattype_to_content(input),\n generation_config=GenerationConfig(\n candidate_count=1, # only one candidate allowed per call\n temperature=temperature,\n top_k=top_k,\n top_p=top_p,\n max_output_tokens=max_output_tokens,\n stop_sequences=stop_sequences,\n ),\n safety_settings=safety_settings, # type: ignore\n tools=tools, # type: ignore\n stream=False,\n )\n\n text = None\n try:\n text = content.candidates[0].text\n except ValueError:\n self._logger.warning( # type: ignore\n f\"Received no response using VertexAI client (model: '{self.model}').\"\n f\" Finish reason was: '{content.candidates[0].finish_reason}'.\"\n )\n return prepare_output([text], **self._get_llm_statistics(content))\n
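Outside of a pipeline the coroutine can also be awaited directly. A minimal sketch, assuming Google Cloud authentication is already configured as described above:
import asyncio\n\nfrom distilabel.models.llms import VertexAILLM\n\nllm = VertexAILLM(model=\"gemini-1.5-pro\")\nllm.load()\n\n\nasync def main() -> None:\n    # Await the async generation for a single chat-formatted input.\n    output = await llm.agenerate(\n        input=[{\"role\": \"user\", \"content\": \"Hello world!\"}],\n        temperature=0.7,\n        max_output_tokens=256,\n    )\n    print(output)\n\n\nasyncio.run(main())\n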
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.ClientvLLM","title":"ClientvLLM
","text":" Bases: OpenAILLM
, MagpieChatTemplateMixin
A client for the vLLM
server implementing the OpenAI API specification.
Attributes:
Name Type Descriptionbase_url
Optional[RuntimeParameter[str]]
the base URL of the vLLM
server. Defaults to \"http://localhost:8000\"
.
max_retries
RuntimeParameter[int]
the maximum number of times to retry the request to the API before failing. Defaults to 6
.
timeout
RuntimeParameter[int]
the maximum time in seconds to wait for a response from the API. Defaults to 120
.
httpx_client_kwargs
RuntimeParameter[int]
extra kwargs that will be passed to the httpx.AsyncClient
created to communicate with the vLLM
server. Defaults to None
.
tokenizer
Optional[str]
the Hugging Face Hub repo id or path of the tokenizer that will be used to apply the chat template and tokenize the inputs before sending it to the server. Defaults to None
.
tokenizer_revision
Optional[str]
the revision of the tokenizer to load. Defaults to None
.
_aclient
AsyncOpenAI
the httpx.AsyncClient
used to communicate with the vLLM
server. Defaults to None
.
base_url
: the base url of the vLLM
server. Defaults to \"http://localhost:8000\"
.max_retries
: the maximum number of times to retry the request to the API before failing. Defaults to 6
.timeout
: the maximum time in seconds to wait for a response from the API. Defaults to 120
.httpx_client_kwargs
: extra kwargs that will be passed to the httpx.AsyncClient
created to communicate with the vLLM
server. Defaults to None
.Examples:
Generate text:
from distilabel.models.llms import ClientvLLM\n\nllm = ClientvLLM(\n base_url=\"http://localhost:8000/v1\",\n tokenizer=\"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n)\n\nllm.load()\n\nresults = llm.generate_outputs(\n inputs=[[{\"role\": \"user\", \"content\": \"Hello, how are you?\"}]],\n temperature=0.7,\n top_p=1.0,\n max_new_tokens=256,\n)\n# [\n# [\n# \"I'm functioning properly, thank you for asking. How can I assist you today?\",\n# \"I'm doing well, thank you for asking. I'm a large language model, so I don't have feelings or emotions like humans do, but I'm here to help answer any questions or provide information you might need. How can I assist you today?\",\n# \"I'm just a computer program, so I don't have feelings like humans do, but I'm functioning properly and ready to help you with any questions or tasks you have. What's on your mind?\"\n# ]\n# ]\n
Source code in src/distilabel/models/llms/vllm.py
class ClientvLLM(OpenAILLM, MagpieChatTemplateMixin):\n \"\"\"A client for the `vLLM` server implementing the OpenAI API specification.\n\n Attributes:\n base_url: the base URL of the `vLLM` server. Defaults to `\"http://localhost:8000\"`.\n max_retries: the maximum number of times to retry the request to the API before\n failing. Defaults to `6`.\n timeout: the maximum time in seconds to wait for a response from the API. Defaults\n to `120`.\n httpx_client_kwargs: extra kwargs that will be passed to the `httpx.AsyncClient`\n created to comunicate with the `vLLM` server. Defaults to `None`.\n tokenizer: the Hugging Face Hub repo id or path of the tokenizer that will be used\n to apply the chat template and tokenize the inputs before sending it to the\n server. Defaults to `None`.\n tokenizer_revision: the revision of the tokenizer to load. Defaults to `None`.\n _aclient: the `httpx.AsyncClient` used to comunicate with the `vLLM` server. Defaults\n to `None`.\n\n Runtime parameters:\n - `base_url`: the base url of the `vLLM` server. Defaults to `\"http://localhost:8000\"`.\n - `max_retries`: the maximum number of times to retry the request to the API before\n failing. Defaults to `6`.\n - `timeout`: the maximum time in seconds to wait for a response from the API. Defaults\n to `120`.\n - `httpx_client_kwargs`: extra kwargs that will be passed to the `httpx.AsyncClient`\n created to comunicate with the `vLLM` server. Defaults to `None`.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import ClientvLLM\n\n llm = ClientvLLM(\n base_url=\"http://localhost:8000/v1\",\n tokenizer=\"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n )\n\n llm.load()\n\n results = llm.generate_outputs(\n inputs=[[{\"role\": \"user\", \"content\": \"Hello, how are you?\"}]],\n temperature=0.7,\n top_p=1.0,\n max_new_tokens=256,\n )\n # [\n # [\n # \"I'm functioning properly, thank you for asking. How can I assist you today?\",\n # \"I'm doing well, thank you for asking. I'm a large language model, so I don't have feelings or emotions like humans do, but I'm here to help answer any questions or provide information you might need. How can I assist you today?\",\n # \"I'm just a computer program, so I don't have feelings like humans do, but I'm functioning properly and ready to help you with any questions or tasks you have. What's on your mind?\"\n # ]\n # ]\n ```\n \"\"\"\n\n model: str = \"\" # Default value so it's not needed to `ClientvLLM(model=\"...\")`\n tokenizer: Optional[str] = None\n tokenizer_revision: Optional[str] = None\n\n # We need the sync client to get the list of models\n _client: \"OpenAI\" = PrivateAttr(None)\n _tokenizer: \"PreTrainedTokenizer\" = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Creates an `httpx.AsyncClient` to connect to the vLLM server and a tokenizer\n optionally.\"\"\"\n\n self.api_key = SecretStr(\"EMPTY\")\n\n # We need to first create the sync client to get the model name that will be used\n # in the `super().load()` when creating the logger.\n try:\n from openai import OpenAI\n except ImportError as ie:\n raise ImportError(\n \"OpenAI Python client is not installed. 
Please install it using\"\n \" `pip install openai`.\"\n ) from ie\n\n self._client = OpenAI(\n base_url=self.base_url,\n api_key=self.api_key.get_secret_value(), # type: ignore\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n super().load()\n\n try:\n from transformers import AutoTokenizer\n except ImportError as ie:\n raise ImportError(\n \"To use `ClientvLLM` you need to install `transformers`.\"\n \"Please install it using `pip install transformers`.\"\n ) from ie\n\n self._tokenizer = AutoTokenizer.from_pretrained(\n self.tokenizer, revision=self.tokenizer_revision\n )\n\n @cached_property\n def model_name(self) -> str: # type: ignore\n \"\"\"Returns the name of the model served with vLLM server.\"\"\"\n models = self._client.models.list()\n return models.data[0].id\n\n def _prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n prompt: str = (\n self._tokenizer.apply_chat_template( # type: ignore\n input, # type: ignore\n tokenize=False,\n add_generation_prompt=True, # type: ignore\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n\n @validate_call\n async def agenerate( # type: ignore\n self,\n input: FormattedInput,\n num_generations: int = 1,\n max_new_tokens: int = 128,\n frequency_penalty: float = 0.0,\n logit_bias: Optional[Dict[str, int]] = None,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n ) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for each input.\n\n Args:\n input: a single input in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n logit_bias: modify the likelihood of specified tokens appearing in the completion.\n Defaults to ``\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: nucleus sampling. The value refers to the top-p tokens that should be\n considered for sampling. Defaults to `1.0`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n\n completion = await self._aclient.completions.create(\n model=self.model_name,\n prompt=self._prepare_input(input), # type: ignore\n n=num_generations,\n max_tokens=max_new_tokens,\n frequency_penalty=frequency_penalty,\n logit_bias=logit_bias,\n presence_penalty=presence_penalty,\n temperature=temperature,\n top_p=top_p,\n )\n\n generations = []\n for choice in completion.choices:\n text = choice.text\n if text == \"\":\n self._logger.warning( # type: ignore\n f\"Received no response from vLLM server (model: '{self.model_name}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(text)\n\n return prepare_output(generations, **self._get_llm_statistics(completion))\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.ClientvLLM.model_name","title":"model_name
cached
property
","text":"Returns the name of the model served with vLLM server.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.ClientvLLM.load","title":"load()
","text":"Creates an httpx.AsyncClient
to connect to the vLLM server and, optionally, a tokenizer.
src/distilabel/models/llms/vllm.py
def load(self) -> None:\n \"\"\"Creates an `httpx.AsyncClient` to connect to the vLLM server and a tokenizer\n optionally.\"\"\"\n\n self.api_key = SecretStr(\"EMPTY\")\n\n # We need to first create the sync client to get the model name that will be used\n # in the `super().load()` when creating the logger.\n try:\n from openai import OpenAI\n except ImportError as ie:\n raise ImportError(\n \"OpenAI Python client is not installed. Please install it using\"\n \" `pip install openai`.\"\n ) from ie\n\n self._client = OpenAI(\n base_url=self.base_url,\n api_key=self.api_key.get_secret_value(), # type: ignore\n max_retries=self.max_retries, # type: ignore\n timeout=self.timeout,\n )\n\n super().load()\n\n try:\n from transformers import AutoTokenizer\n except ImportError as ie:\n raise ImportError(\n \"To use `ClientvLLM` you need to install `transformers`.\"\n \"Please install it using `pip install transformers`.\"\n ) from ie\n\n self._tokenizer = AutoTokenizer.from_pretrained(\n self.tokenizer, revision=self.tokenizer_revision\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.ClientvLLM._prepare_input","title":"_prepare_input(input)
","text":"Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name Type Description Defaultinput
StandardInput
the input list containing chat items.
requiredReturns:
Type Descriptionstr
The prompt to send to the LLM.
Source code insrc/distilabel/models/llms/vllm.py
def _prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n prompt: str = (\n self._tokenizer.apply_chat_template( # type: ignore\n input, # type: ignore\n tokenize=False,\n add_generation_prompt=True, # type: ignore\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.ClientvLLM.agenerate","title":"agenerate(input, num_generations=1, max_new_tokens=128, frequency_penalty=0.0, logit_bias=None, presence_penalty=0.0, temperature=1.0, top_p=1.0)
async
","text":"Generates num_generations
responses for each input.
Parameters:
Name Type Description Defaultinput
FormattedInput
a single input in chat format to generate responses for.
requirednum_generations
int
the number of generations to create per input. Defaults to 1
.
1
max_new_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
frequency_penalty
float
the repetition penalty to use for the generation. Defaults to 0.0
.
0.0
logit_bias
Optional[Dict[str, int]]
modify the likelihood of specified tokens appearing in the completion. Defaults to None.
None
presence_penalty
float
the presence penalty to use for the generation. Defaults to 0.0
.
0.0
temperature
float
the temperature to use for the generation. Defaults to 1.0
.
1.0
top_p
float
nucleus sampling. The value refers to the top-p tokens that should be considered for sampling. Defaults to 1.0
.
1.0
Returns:
Type DescriptionGenerateOutput
A list of lists of strings containing the generated responses for each input.
Source code insrc/distilabel/models/llms/vllm.py
@validate_call\nasync def agenerate( # type: ignore\n self,\n input: FormattedInput,\n num_generations: int = 1,\n max_new_tokens: int = 128,\n frequency_penalty: float = 0.0,\n logit_bias: Optional[Dict[str, int]] = None,\n presence_penalty: float = 0.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n) -> GenerateOutput:\n \"\"\"Generates `num_generations` responses for each input.\n\n Args:\n input: a single input in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n logit_bias: modify the likelihood of specified tokens appearing in the completion.\n Defaults to ``\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: nucleus sampling. The value refers to the top-p tokens that should be\n considered for sampling. Defaults to `1.0`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n\n completion = await self._aclient.completions.create(\n model=self.model_name,\n prompt=self._prepare_input(input), # type: ignore\n n=num_generations,\n max_tokens=max_new_tokens,\n frequency_penalty=frequency_penalty,\n logit_bias=logit_bias,\n presence_penalty=presence_penalty,\n temperature=temperature,\n top_p=top_p,\n )\n\n generations = []\n for choice in completion.choices:\n text = choice.text\n if text == \"\":\n self._logger.warning( # type: ignore\n f\"Received no response from vLLM server (model: '{self.model_name}').\"\n f\" Finish reason was: {choice.finish_reason}\"\n )\n generations.append(text)\n\n return prepare_output(generations, **self._get_llm_statistics(completion))\n
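As with the other async LLMs, agenerate can be awaited directly as well. A minimal sketch, assuming a vLLM OpenAI-compatible server is already running at the given base_url:
import asyncio\n\nfrom distilabel.models.llms import ClientvLLM\n\nllm = ClientvLLM(\n    base_url=\"http://localhost:8000/v1\",\n    tokenizer=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n)\nllm.load()\n\n\nasync def main() -> None:\n    # Await the async generation for a single chat-formatted input.\n    output = await llm.agenerate(\n        input=[{\"role\": \"user\", \"content\": \"Hello, how are you?\"}],\n        num_generations=2,\n        max_new_tokens=64,\n        temperature=0.7,\n    )\n    print(output)\n\n\nasyncio.run(main())\n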
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.vLLM","title":"vLLM
","text":" Bases: LLM
, MagpieChatTemplateMixin
, CudaDevicePlacementMixin
vLLM
library LLM implementation.
Attributes:
Name Type Descriptionmodel
str
the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
dtype
str
the data type to use for the model. Defaults to auto
.
trust_remote_code
bool
whether to trust the remote code when loading the model. Defaults to False
.
quantization
Optional[str]
the quantization mode to use for the model. Defaults to None
.
revision
Optional[str]
the revision of the model to load. Defaults to None
.
tokenizer
Optional[str]
the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer files. If not provided, the tokenizer will be loaded from the model directory. Defaults to None
.
tokenizer_mode
Literal['auto', 'slow']
the mode to use for the tokenizer. Defaults to auto
.
tokenizer_revision
Optional[str]
the revision of the tokenizer to load. Defaults to None
.
skip_tokenizer_init
bool
whether to skip the initialization of the tokenizer. Defaults to False
.
chat_template
Optional[str]
a chat template that will be used to build the prompts before sending them to the model. If not provided, the chat template defined in the tokenizer config will be used. If not provided and the tokenizer doesn't have a chat template, then ChatML template will be used. Defaults to None
.
structured_output
Optional[RuntimeParameter[OutlinesStructuredOutputType]]
a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
seed
int
the seed to use for the random number generator. Defaults to 0
.
extra_kwargs
Optional[RuntimeParameter[Dict[str, Any]]]
additional dictionary of keyword arguments that will be passed to the LLM
class of vllm
library. Defaults to {}
.
_model
LLM
the vLLM
model instance. This attribute is meant to be used internally and should not be accessed directly. It will be set in the load
method.
_tokenizer
PreTrainedTokenizer
the tokenizer instance used to format the prompt before passing it to the LLM
. This attribute is meant to be used internally and should not be accessed directly. It will be set in the load
method.
use_magpie_template
bool
a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template
Union[MagpieAvailablePreQueryTemplates, str, None]
the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
extra_kwargs
: additional dictionary of keyword arguments that will be passed to the LLM
class of vllm
library.Examples:
Generate text:
from distilabel.models.llms import vLLM\n\n# You can pass a custom chat_template to the model\nllm = vLLM(\n    model=\"prometheus-eval/prometheus-7b-v2.0\",\n    chat_template=\"[INST] {{ messages[0]['content'] }}\\n{{ messages[1]['content'] }}[/INST]\",\n)\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
Generate structured data:
from pydantic import BaseModel\n\nfrom distilabel.models.llms import vLLM\n\nclass User(BaseModel):\n    name: str\n    last_name: str\n    id: int\n\nllm = vLLM(\n    model=\"prometheus-eval/prometheus-7b-v2.0\",\n    structured_output={\"format\": \"json\", \"schema\": User},\n)\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
Source code in src/distilabel/models/llms/vllm.py
class vLLM(LLM, MagpieChatTemplateMixin, CudaDevicePlacementMixin):\n \"\"\"`vLLM` library LLM implementation.\n\n Attributes:\n model: the model Hugging Face Hub repo id or a path to a directory containing the\n model weights and configuration files.\n dtype: the data type to use for the model. Defaults to `auto`.\n trust_remote_code: whether to trust the remote code when loading the model. Defaults\n to `False`.\n quantization: the quantization mode to use for the model. Defaults to `None`.\n revision: the revision of the model to load. Defaults to `None`.\n tokenizer: the tokenizer Hugging Face Hub repo id or a path to a directory containing\n the tokenizer files. If not provided, the tokenizer will be loaded from the\n model directory. Defaults to `None`.\n tokenizer_mode: the mode to use for the tokenizer. Defaults to `auto`.\n tokenizer_revision: the revision of the tokenizer to load. Defaults to `None`.\n skip_tokenizer_init: whether to skip the initialization of the tokenizer. Defaults\n to `False`.\n chat_template: a chat template that will be used to build the prompts before\n sending them to the model. If not provided, the chat template defined in the\n tokenizer config will be used. If not provided and the tokenizer doesn't have\n a chat template, then ChatML template will be used. Defaults to `None`.\n structured_output: a dictionary containing the structured output configuration or if more\n fine-grained control is needed, an instance of `OutlinesStructuredOutput`. Defaults to None.\n seed: the seed to use for the random number generator. Defaults to `0`.\n extra_kwargs: additional dictionary of keyword arguments that will be passed to the\n `LLM` class of `vllm` library. Defaults to `{}`.\n _model: the `vLLM` model instance. This attribute is meant to be used internally\n and should not be accessed directly. It will be set in the `load` method.\n _tokenizer: the tokenizer instance used to format the prompt before passing it to\n the `LLM`. This attribute is meant to be used internally and should not be\n accessed directly. It will be set in the `load` method.\n use_magpie_template: a flag used to enable/disable applying the Magpie pre-query\n template. Defaults to `False`.\n magpie_pre_query_template: the pre-query template to be applied to the prompt or\n sent to the LLM to generate an instruction or a follow up user message. Valid\n values are \"llama3\", \"qwen2\" or another pre-query template provided. 
Defaults\n to `None`.\n\n References:\n - https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py\n\n Runtime parameters:\n - `extra_kwargs`: additional dictionary of keyword arguments that will be passed to\n the `LLM` class of `vllm` library.\n\n Examples:\n Generate text:\n\n ```python\n from distilabel.models.llms import vLLM\n\n # You can pass a custom chat_template to the model\n llm = vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\\\"content\\\" }}\\\\n{{ messages[1]\\\"content\\\" }}[/INST]\",\n )\n\n llm.load()\n\n # Call the model\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n ```\n\n Generate structured data:\n\n ```python\n from pathlib import Path\n from distilabel.models.llms import vLLM\n\n class User(BaseModel):\n name: str\n last_name: str\n id: int\n\n llm = vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\"\n structured_output={\"format\": \"json\", \"schema\": Character},\n )\n\n llm.load()\n\n # Call the model\n output = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n ```\n \"\"\"\n\n model: str\n dtype: str = \"auto\"\n trust_remote_code: bool = False\n quantization: Optional[str] = None\n revision: Optional[str] = None\n\n tokenizer: Optional[str] = None\n tokenizer_mode: Literal[\"auto\", \"slow\"] = \"auto\"\n tokenizer_revision: Optional[str] = None\n skip_tokenizer_init: bool = False\n chat_template: Optional[str] = None\n\n seed: int = 0\n\n extra_kwargs: Optional[RuntimeParameter[Dict[str, Any]]] = Field(\n default_factory=dict,\n description=\"Additional dictionary of keyword arguments that will be passed to the\"\n \" `vLLM` class of `vllm` library. See all the supported arguments at: \"\n \"https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py\",\n )\n structured_output: Optional[RuntimeParameter[OutlinesStructuredOutputType]] = Field(\n default=None,\n description=\"The structured output format to use across all the generations.\",\n )\n\n _model: \"_vLLM\" = PrivateAttr(None)\n _tokenizer: \"PreTrainedTokenizer\" = PrivateAttr(None)\n _structured_output_logits_processor: Optional[Callable] = PrivateAttr(default=None)\n\n def load(self) -> None:\n \"\"\"Loads the `vLLM` model using either the path or the Hugging Face Hub repository id.\n Additionally, this method also sets the `chat_template` for the tokenizer, so as to properly\n parse the list of OpenAI formatted inputs using the expected format by the model, otherwise, the\n default value is ChatML format, unless explicitly provided.\n \"\"\"\n super().load()\n\n CudaDevicePlacementMixin.load(self)\n\n try:\n from vllm import LLM as _vLLM\n except ImportError as ie:\n raise ImportError(\n \"vLLM is not installed. 
Please install it using `pip install vllm`.\"\n ) from ie\n\n self._model = _vLLM(\n self.model,\n dtype=self.dtype,\n trust_remote_code=self.trust_remote_code,\n quantization=self.quantization,\n revision=self.revision,\n tokenizer=self.tokenizer,\n tokenizer_mode=self.tokenizer_mode,\n tokenizer_revision=self.tokenizer_revision,\n skip_tokenizer_init=self.skip_tokenizer_init,\n seed=self.seed,\n **self.extra_kwargs, # type: ignore\n )\n\n self._tokenizer = self._model.get_tokenizer() # type: ignore\n if self.chat_template is not None:\n self._tokenizer.chat_template = self.chat_template # type: ignore\n\n if self.structured_output:\n self._structured_output_logits_processor = self._prepare_structured_output(\n self.structured_output\n )\n\n def unload(self) -> None:\n \"\"\"Unloads the `vLLM` model.\"\"\"\n self._cleanup_vllm_model()\n self._model = None # type: ignore\n self._tokenizer = None # type: ignore\n CudaDevicePlacementMixin.unload(self)\n super().unload()\n\n def _cleanup_vllm_model(self) -> None:\n if self._model is None:\n return\n\n import torch # noqa\n from vllm.distributed.parallel_state import (\n destroy_distributed_environment,\n destroy_model_parallel,\n )\n\n destroy_model_parallel()\n destroy_distributed_environment()\n del self._model.llm_engine.model_executor\n del self._model\n with contextlib.suppress(AssertionError):\n torch.distributed.destroy_process_group()\n gc.collect()\n if torch.cuda.is_available():\n torch.cuda.empty_cache()\n torch.cuda.synchronize()\n\n @property\n def model_name(self) -> str:\n \"\"\"Returns the model name used for the LLM.\"\"\"\n return self.model\n\n def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n if self._tokenizer.chat_template is None:\n return [item[\"content\"] for item in input if item[\"role\"] == \"user\"][0]\n\n prompt: str = (\n self._tokenizer.apply_chat_template(\n input, # type: ignore\n tokenize=False,\n add_generation_prompt=True, # type: ignore\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n\n def _prepare_batches(\n self, inputs: List[\"StructuredInput\"]\n ) -> Tuple[List[Tuple[List[str], \"OutlinesStructuredOutputType\"]], List[int]]:\n \"\"\"Prepares the inputs by grouping them by the structured output.\n\n When we generate structured outputs with schemas obtained from a dataset, we need to\n prepare the data to try to send batches of inputs instead of single inputs to the model\n to take advante of the engine. So we group the inputs by the structured output to be\n passed in the `generate` method.\n\n Args:\n inputs: The batch of inputs passed to the generate method. 
As we expect to be generating\n structured outputs, each element will be a tuple containing the instruction and the\n structured output.\n\n Returns:\n The prepared batches (sub-batches let's say) to be passed to the `generate` method.\n Each new tuple will contain instead of the single instruction, a list of instructions\n \"\"\"\n instruction_order = {}\n batches: Dict[str, List[str]] = {}\n for i, (instruction, structured_output) in enumerate(inputs):\n instruction = self.prepare_input(instruction)\n instruction_order[instruction] = i\n\n structured_output = json.dumps(structured_output)\n if structured_output not in batches:\n batches[structured_output] = [instruction]\n else:\n batches[structured_output].append(instruction)\n\n # Built a list with instructions sorted by structured output\n flat_instructions = [\n instruction for _, group in batches.items() for instruction in group\n ]\n\n # Generate the list of indices based on the original order\n sorted_indices = [\n instruction_order[instruction] for instruction in flat_instructions\n ]\n\n return [\n (batch, json.loads(schema)) for schema, batch in batches.items()\n ], sorted_indices\n\n @validate_call\n def generate( # noqa: C901 # type: ignore\n self,\n inputs: List[FormattedInput],\n num_generations: int = 1,\n max_new_tokens: int = 128,\n presence_penalty: float = 0.0,\n frequency_penalty: float = 0.0,\n repetition_penalty: float = 1.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n top_k: int = -1,\n min_p: float = 0.0,\n logprobs: Optional[PositiveInt] = None,\n stop: Optional[List[str]] = None,\n stop_token_ids: Optional[List[int]] = None,\n include_stop_str_in_output: bool = False,\n logits_processors: Optional[LogitsProcessors] = None,\n extra_sampling_params: Optional[Dict[str, Any]] = None,\n ) -> List[GenerateOutput]:\n \"\"\"Generates `num_generations` responses for each input.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n repetition_penalty: the repetition penalty to use for the generation Defaults to\n `1.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n top_k: the top-k value to use for the generation. Defaults to `0`.\n min_p: the minimum probability to use for the generation. Defaults to `0.0`.\n logprobs: number of log probabilities to return per output token. If `None`,\n then no log probability won't be returned. Defaults to `None`.\n stop: a list of strings that will be used to stop the generation when found.\n Defaults to `None`.\n stop_token_ids: a list of token ids that will be used to stop the generation\n when found. 
Defaults to `None`.\n include_stop_str_in_output: whether to include the stop string in the output.\n Defaults to `False`.\n logits_processors: a list of functions to process the logits before sampling.\n Defaults to `None`.\n extra_sampling_params: dictionary with additional arguments to be passed to\n the `SamplingParams` class from `vllm`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n from vllm import SamplingParams\n\n if not logits_processors:\n logits_processors = []\n\n if extra_sampling_params is None:\n extra_sampling_params = {}\n\n structured_output = None\n\n if isinstance(inputs[0], tuple):\n # Prepare the batches for structured generation\n prepared_batches, sorted_indices = self._prepare_batches(inputs) # type: ignore\n else:\n # Simulate a batch without the structured output content\n prepared_batches = [([self.prepare_input(input) for input in inputs], None)] # type: ignore\n sorted_indices = None\n\n # Case in which we have a single structured output for the dataset\n if self._structured_output_logits_processor:\n logits_processors.append(self._structured_output_logits_processor)\n\n batched_outputs: List[\"LLMOutput\"] = []\n generations = []\n\n for prepared_inputs, structured_output in prepared_batches:\n if self.structured_output is not None and structured_output is not None:\n # TODO: warning\n pass\n\n if structured_output is not None:\n logits_processors.append(\n self._prepare_structured_output(structured_output) # type: ignore\n )\n\n sampling_params = SamplingParams( # type: ignore\n n=num_generations,\n presence_penalty=presence_penalty,\n frequency_penalty=frequency_penalty,\n repetition_penalty=repetition_penalty,\n temperature=temperature,\n top_p=top_p,\n top_k=top_k,\n min_p=min_p,\n max_tokens=max_new_tokens,\n logprobs=logprobs,\n stop=stop,\n stop_token_ids=stop_token_ids,\n include_stop_str_in_output=include_stop_str_in_output,\n logits_processors=logits_processors,\n **extra_sampling_params,\n )\n\n batch_outputs: List[\"RequestOutput\"] = self._model.generate(\n prompts=prepared_inputs,\n sampling_params=sampling_params,\n use_tqdm=False,\n )\n\n # Remove structured output logit processor to avoid stacking structured output\n # logits processors that leads to non-sense generations\n if structured_output is not None:\n logits_processors.pop(-1)\n\n for input, outputs in zip(prepared_inputs, batch_outputs):\n texts, statistics, outputs_logprobs = self._process_outputs(\n input, outputs\n )\n batched_outputs.append(texts)\n generations.append(\n prepare_output(\n generations=texts,\n input_tokens=statistics[\"input_tokens\"],\n output_tokens=statistics[\"output_tokens\"],\n logprobs=outputs_logprobs,\n )\n )\n\n if sorted_indices is not None:\n pairs = list(enumerate(sorted_indices))\n pairs.sort(key=lambda x: x[1])\n generations = [generations[original_idx] for original_idx, _ in pairs]\n\n return generations\n\n def _process_outputs(\n self, input: str, outputs: \"RequestOutput\"\n ) -> Tuple[\"LLMOutput\", \"LLMStatistics\", \"LLMLogprobs\"]:\n texts = []\n outputs_logprobs = []\n statistics = {\n \"input_tokens\": [compute_tokens(input, self._tokenizer.encode)]\n * len(outputs.outputs),\n \"output_tokens\": [],\n }\n for output in outputs.outputs:\n texts.append(output.text)\n statistics[\"output_tokens\"].append(len(output.token_ids))\n if output.logprobs is not None:\n outputs_logprobs.append(self._get_llm_logprobs(output))\n return texts, statistics, outputs_logprobs\n\n def 
_prepare_structured_output(\n self, structured_output: \"OutlinesStructuredOutputType\"\n ) -> Union[Callable, None]:\n \"\"\"Creates the appropriate function to filter tokens to generate structured outputs.\n\n Args:\n structured_output: the configuration dict to prepare the structured output.\n\n Returns:\n The callable that will be used to guide the generation of the model.\n \"\"\"\n from distilabel.steps.tasks.structured_outputs.outlines import (\n prepare_guided_output,\n )\n\n assert structured_output is not None, \"`structured_output` cannot be `None`\"\n\n result = prepare_guided_output(structured_output, \"vllm\", self._model)\n if (schema := result.get(\"schema\")) and self.structured_output:\n self.structured_output[\"schema\"] = schema\n return result[\"processor\"]\n\n def _get_llm_logprobs(self, output: \"CompletionOutput\") -> List[List[\"Logprob\"]]:\n logprobs = []\n for token_logprob in output.logprobs: # type: ignore\n token_logprobs = []\n for logprob in token_logprob.values():\n token_logprobs.append(\n {\"token\": logprob.decoded_token, \"logprob\": logprob.logprob}\n )\n logprobs.append(token_logprobs)\n return logprobs\n
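For reference, the logprobs structure built by _get_llm_logprobs above is a list with one entry per generated token, where each entry lists the candidate tokens returned by vLLM together with their log probabilities. A minimal sketch of the resulting shape (values are illustrative):

logprobs_for_one_completion = [
    # first generated token: one dict per candidate token returned by vLLM
    [{"token": "Hello", "logprob": -0.12}, {"token": "Hi", "logprob": -2.31}],
    # second generated token
    [{"token": ",", "logprob": -0.05}],
]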
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.vLLM.model_name","title":"model_name
property
","text":"Returns the model name used for the LLM.
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.vLLM.load","title":"load()
","text":"Loads the vLLM
model using either the path or the Hugging Face Hub repository id. Additionally, this method also sets the chat_template
for the tokenizer, so that the list of OpenAI-formatted inputs is parsed using the format expected by the model; otherwise, the ChatML format is used as the default, unless a chat template is explicitly provided.
src/distilabel/models/llms/vllm.py
def load(self) -> None:\n \"\"\"Loads the `vLLM` model using either the path or the Hugging Face Hub repository id.\n Additionally, this method also sets the `chat_template` for the tokenizer, so as to properly\n parse the list of OpenAI formatted inputs using the expected format by the model, otherwise, the\n default value is ChatML format, unless explicitly provided.\n \"\"\"\n super().load()\n\n CudaDevicePlacementMixin.load(self)\n\n try:\n from vllm import LLM as _vLLM\n except ImportError as ie:\n raise ImportError(\n \"vLLM is not installed. Please install it using `pip install vllm`.\"\n ) from ie\n\n self._model = _vLLM(\n self.model,\n dtype=self.dtype,\n trust_remote_code=self.trust_remote_code,\n quantization=self.quantization,\n revision=self.revision,\n tokenizer=self.tokenizer,\n tokenizer_mode=self.tokenizer_mode,\n tokenizer_revision=self.tokenizer_revision,\n skip_tokenizer_init=self.skip_tokenizer_init,\n seed=self.seed,\n **self.extra_kwargs, # type: ignore\n )\n\n self._tokenizer = self._model.get_tokenizer() # type: ignore\n if self.chat_template is not None:\n self._tokenizer.chat_template = self.chat_template # type: ignore\n\n if self.structured_output:\n self._structured_output_logits_processor = self._prepare_structured_output(\n self.structured_output\n )\n
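A minimal usage sketch of the loading flow (the model id is a placeholder and vLLM must be installed in the environment):

from distilabel.models.llms import vLLM

llm = vLLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model id
llm.load()    # instantiates `vllm.LLM` and configures the tokenizer chat template
...           # use the model, e.g. via `generate`
llm.unload()  # frees the model and releases the assigned CUDA devices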
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.vLLM.unload","title":"unload()
","text":"Unloads the vLLM
model.
src/distilabel/models/llms/vllm.py
def unload(self) -> None:\n \"\"\"Unloads the `vLLM` model.\"\"\"\n self._cleanup_vllm_model()\n self._model = None # type: ignore\n self._tokenizer = None # type: ignore\n CudaDevicePlacementMixin.unload(self)\n super().unload()\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.vLLM.prepare_input","title":"prepare_input(input)
","text":"Prepares the input (applying the chat template and tokenization) for the provided input.
Parameters:
Name Type Description Default input
StandardInput
the input list containing chat items.
required Returns:
Type Description str
The prompt to send to the LLM.
Source code in src/distilabel/models/llms/vllm.py
def prepare_input(self, input: \"StandardInput\") -> str:\n \"\"\"Prepares the input (applying the chat template and tokenization) for the provided\n input.\n\n Args:\n input: the input list containing chat items.\n\n Returns:\n The prompt to send to the LLM.\n \"\"\"\n if self._tokenizer.chat_template is None:\n return [item[\"content\"] for item in input if item[\"role\"] == \"user\"][0]\n\n prompt: str = (\n self._tokenizer.apply_chat_template(\n input, # type: ignore\n tokenize=False,\n add_generation_prompt=True, # type: ignore\n )\n if input\n else \"\"\n )\n return super().apply_magpie_pre_query_template(prompt, input)\n
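A short sketch of the expected input and output (reusing the `llm` instance from the `load()` sketch above; the exact prompt string depends on the tokenizer's chat template, and when no chat template is available the content of the first user message is returned instead):

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is synthetic data?"},
]
prompt = llm.prepare_input(conversation)  # a single string ready to be passed to the vLLM engine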
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.vLLM._prepare_batches","title":"_prepare_batches(inputs)
","text":"Prepares the inputs by grouping them by the structured output.
When we generate structured outputs with schemas obtained from a dataset, we need to prepare the data so that batches of inputs, rather than single inputs, are sent to the model in order to take advantage of the engine. So we group the inputs by their structured output before passing them to the generate
method.
Parameters:
Name Type Description Default inputs
List[StructuredInput]
The batch of inputs passed to the generate method. As we expect to be generating structured outputs, each element will be a tuple containing the instruction and the structured output.
required Returns:
Type Description List[Tuple[List[str], OutlinesStructuredOutputType]]
The prepared batches (i.e. sub-batches) to be passed to the generate
method.
List[int]
Each new tuple will contain, instead of a single instruction, a list of instructions.
Source code in src/distilabel/models/llms/vllm.py
def _prepare_batches(\n self, inputs: List[\"StructuredInput\"]\n) -> Tuple[List[Tuple[List[str], \"OutlinesStructuredOutputType\"]], List[int]]:\n \"\"\"Prepares the inputs by grouping them by the structured output.\n\n When we generate structured outputs with schemas obtained from a dataset, we need to\n prepare the data to try to send batches of inputs instead of single inputs to the model\n to take advante of the engine. So we group the inputs by the structured output to be\n passed in the `generate` method.\n\n Args:\n inputs: The batch of inputs passed to the generate method. As we expect to be generating\n structured outputs, each element will be a tuple containing the instruction and the\n structured output.\n\n Returns:\n The prepared batches (sub-batches let's say) to be passed to the `generate` method.\n Each new tuple will contain instead of the single instruction, a list of instructions\n \"\"\"\n instruction_order = {}\n batches: Dict[str, List[str]] = {}\n for i, (instruction, structured_output) in enumerate(inputs):\n instruction = self.prepare_input(instruction)\n instruction_order[instruction] = i\n\n structured_output = json.dumps(structured_output)\n if structured_output not in batches:\n batches[structured_output] = [instruction]\n else:\n batches[structured_output].append(instruction)\n\n # Built a list with instructions sorted by structured output\n flat_instructions = [\n instruction for _, group in batches.items() for instruction in group\n ]\n\n # Generate the list of indices based on the original order\n sorted_indices = [\n instruction_order[instruction] for instruction in flat_instructions\n ]\n\n return [\n (batch, json.loads(schema)) for schema, batch in batches.items()\n ], sorted_indices\n
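To illustrate the grouping, the following standalone sketch mimics the idea with plain dictionaries; it does not call the private method, it skips the prompt-preparation step, and the {"format": ..., "schema": ...} dictionaries only follow the shape suggested by OutlinesStructuredOutputType:

import json

inputs = [
    ("instruction 1", {"format": "json", "schema": {"type": "object"}}),
    ("instruction 2", {"format": "regex", "schema": r"\d+"}),
    ("instruction 3", {"format": "json", "schema": {"type": "object"}}),
]

batches = {}
for instruction, structured_output in inputs:
    # instructions sharing the same structured output end up in the same sub-batch
    batches.setdefault(json.dumps(structured_output), []).append(instruction)

sub_batches = [(group, json.loads(schema)) for schema, group in batches.items()]
# -> two sub-batches: ["instruction 1", "instruction 3"] and ["instruction 2"]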
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.vLLM.generate","title":"generate(inputs, num_generations=1, max_new_tokens=128, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, logprobs=None, stop=None, stop_token_ids=None, include_stop_str_in_output=False, logits_processors=None, extra_sampling_params=None)
","text":"Generates num_generations
responses for each input.
Parameters:
Name Type Description Default inputs
List[FormattedInput]
a list of inputs in chat format to generate responses for.
required num_generations
int
the number of generations to create per input. Defaults to 1
.
1
max_new_tokens
int
the maximum number of new tokens that the model will generate. Defaults to 128
.
128
presence_penalty
float
the presence penalty to use for the generation. Defaults to 0.0
.
0.0
frequency_penalty
float
the frequency penalty to use for the generation. Defaults to 0.0
.
0.0
repetition_penalty
float
the repetition penalty to use for the generation. Defaults to 1.0
.
1.0
temperature
float
the temperature to use for the generation. Defaults to 1.0
.
1.0
top_p
float
the top-p value to use for the generation. Defaults to 1.0
.
1.0
top_k
int
the top-k value to use for the generation. Defaults to -1
.
-1
min_p
float
the minimum probability to use for the generation. Defaults to 0.0
.
0.0
logprobs
Optional[PositiveInt]
number of log probabilities to return per output token. If None
, no log probabilities will be returned. Defaults to None
.
None
stop
Optional[List[str]]
a list of strings that will be used to stop the generation when found. Defaults to None
.
None
stop_token_ids
Optional[List[int]]
a list of token ids that will be used to stop the generation when found. Defaults to None
.
None
include_stop_str_in_output
bool
whether to include the stop string in the output. Defaults to False
.
False
logits_processors
Optional[LogitsProcessors]
a list of functions to process the logits before sampling. Defaults to None
.
None
extra_sampling_params
Optional[Dict[str, Any]]
dictionary with additional arguments to be passed to the SamplingParams
class from vllm
.
None
Returns:
Type Description List[GenerateOutput]
A list of lists of strings containing the generated responses for each input.
Source code in src/distilabel/models/llms/vllm.py
@validate_call\ndef generate( # noqa: C901 # type: ignore\n self,\n inputs: List[FormattedInput],\n num_generations: int = 1,\n max_new_tokens: int = 128,\n presence_penalty: float = 0.0,\n frequency_penalty: float = 0.0,\n repetition_penalty: float = 1.0,\n temperature: float = 1.0,\n top_p: float = 1.0,\n top_k: int = -1,\n min_p: float = 0.0,\n logprobs: Optional[PositiveInt] = None,\n stop: Optional[List[str]] = None,\n stop_token_ids: Optional[List[int]] = None,\n include_stop_str_in_output: bool = False,\n logits_processors: Optional[LogitsProcessors] = None,\n extra_sampling_params: Optional[Dict[str, Any]] = None,\n) -> List[GenerateOutput]:\n \"\"\"Generates `num_generations` responses for each input.\n\n Args:\n inputs: a list of inputs in chat format to generate responses for.\n num_generations: the number of generations to create per input. Defaults to\n `1`.\n max_new_tokens: the maximum number of new tokens that the model will generate.\n Defaults to `128`.\n presence_penalty: the presence penalty to use for the generation. Defaults to\n `0.0`.\n frequency_penalty: the repetition penalty to use for the generation. Defaults\n to `0.0`.\n repetition_penalty: the repetition penalty to use for the generation Defaults to\n `1.0`.\n temperature: the temperature to use for the generation. Defaults to `0.1`.\n top_p: the top-p value to use for the generation. Defaults to `1.0`.\n top_k: the top-k value to use for the generation. Defaults to `0`.\n min_p: the minimum probability to use for the generation. Defaults to `0.0`.\n logprobs: number of log probabilities to return per output token. If `None`,\n then no log probability won't be returned. Defaults to `None`.\n stop: a list of strings that will be used to stop the generation when found.\n Defaults to `None`.\n stop_token_ids: a list of token ids that will be used to stop the generation\n when found. 
Defaults to `None`.\n include_stop_str_in_output: whether to include the stop string in the output.\n Defaults to `False`.\n logits_processors: a list of functions to process the logits before sampling.\n Defaults to `None`.\n extra_sampling_params: dictionary with additional arguments to be passed to\n the `SamplingParams` class from `vllm`.\n\n Returns:\n A list of lists of strings containing the generated responses for each input.\n \"\"\"\n from vllm import SamplingParams\n\n if not logits_processors:\n logits_processors = []\n\n if extra_sampling_params is None:\n extra_sampling_params = {}\n\n structured_output = None\n\n if isinstance(inputs[0], tuple):\n # Prepare the batches for structured generation\n prepared_batches, sorted_indices = self._prepare_batches(inputs) # type: ignore\n else:\n # Simulate a batch without the structured output content\n prepared_batches = [([self.prepare_input(input) for input in inputs], None)] # type: ignore\n sorted_indices = None\n\n # Case in which we have a single structured output for the dataset\n if self._structured_output_logits_processor:\n logits_processors.append(self._structured_output_logits_processor)\n\n batched_outputs: List[\"LLMOutput\"] = []\n generations = []\n\n for prepared_inputs, structured_output in prepared_batches:\n if self.structured_output is not None and structured_output is not None:\n # TODO: warning\n pass\n\n if structured_output is not None:\n logits_processors.append(\n self._prepare_structured_output(structured_output) # type: ignore\n )\n\n sampling_params = SamplingParams( # type: ignore\n n=num_generations,\n presence_penalty=presence_penalty,\n frequency_penalty=frequency_penalty,\n repetition_penalty=repetition_penalty,\n temperature=temperature,\n top_p=top_p,\n top_k=top_k,\n min_p=min_p,\n max_tokens=max_new_tokens,\n logprobs=logprobs,\n stop=stop,\n stop_token_ids=stop_token_ids,\n include_stop_str_in_output=include_stop_str_in_output,\n logits_processors=logits_processors,\n **extra_sampling_params,\n )\n\n batch_outputs: List[\"RequestOutput\"] = self._model.generate(\n prompts=prepared_inputs,\n sampling_params=sampling_params,\n use_tqdm=False,\n )\n\n # Remove structured output logit processor to avoid stacking structured output\n # logits processors that leads to non-sense generations\n if structured_output is not None:\n logits_processors.pop(-1)\n\n for input, outputs in zip(prepared_inputs, batch_outputs):\n texts, statistics, outputs_logprobs = self._process_outputs(\n input, outputs\n )\n batched_outputs.append(texts)\n generations.append(\n prepare_output(\n generations=texts,\n input_tokens=statistics[\"input_tokens\"],\n output_tokens=statistics[\"output_tokens\"],\n logprobs=outputs_logprobs,\n )\n )\n\n if sorted_indices is not None:\n pairs = list(enumerate(sorted_indices))\n pairs.sort(key=lambda x: x[1])\n generations = [generations[original_idx] for original_idx, _ in pairs]\n\n return generations\n
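A hedged usage sketch of `generate` (prompts and sampling values are placeholders; `llm` is a loaded `vLLM` instance as in the `load()` sketch above):

conversations = [
    [{"role": "user", "content": "Write a haiku about data."}],
    [{"role": "user", "content": "Explain what AI feedback is in one sentence."}],
]
outputs = llm.generate(
    inputs=conversations,
    num_generations=2,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
)
# `outputs` has one entry per input conversation, each holding the `num_generations`
# generated responses (plus token statistics and, if requested, log probabilities).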
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.vLLM._prepare_structured_output","title":"_prepare_structured_output(structured_output)
","text":"Creates the appropriate function to filter tokens to generate structured outputs.
Parameters:
Name Type Description Default structured_output
OutlinesStructuredOutputType
the configuration dict to prepare the structured output.
required Returns:
Type Description Union[Callable, None]
The callable that will be used to guide the generation of the model.
Source code in src/distilabel/models/llms/vllm.py
def _prepare_structured_output(\n self, structured_output: \"OutlinesStructuredOutputType\"\n) -> Union[Callable, None]:\n \"\"\"Creates the appropriate function to filter tokens to generate structured outputs.\n\n Args:\n structured_output: the configuration dict to prepare the structured output.\n\n Returns:\n The callable that will be used to guide the generation of the model.\n \"\"\"\n from distilabel.steps.tasks.structured_outputs.outlines import (\n prepare_guided_output,\n )\n\n assert structured_output is not None, \"`structured_output` cannot be `None`\"\n\n result = prepare_guided_output(structured_output, \"vllm\", self._model)\n if (schema := result.get(\"schema\")) and self.structured_output:\n self.structured_output[\"schema\"] = schema\n return result[\"processor\"]\n
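As a hedged sketch, a structured_output configuration for JSON generation could look as follows; the "format" and "schema" keys follow the OutlinesStructuredOutputType shape referenced above, and passing a Pydantic model as the schema is an assumption that should be checked against the structured generation guide:

from pydantic import BaseModel

from distilabel.models.llms import vLLM

class Character(BaseModel):
    name: str
    role: str

llm = vLLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
    structured_output={"format": "json", "schema": Character},  # assumed config shape
)
llm.load()  # `_prepare_structured_output` builds the logits processor from this config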
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CudaDevicePlacementMixin","title":"CudaDevicePlacementMixin
","text":" Bases: BaseModel
Mixin class to assign CUDA devices to the LLM
based on the cuda_devices
attribute and the device placement information provided in _device_llm_placement_map
. Providing the device placement information is optional, but if it is provided, it will be used to assign CUDA devices to the LLM
s, trying to avoid using the same device for different LLM
s.
Attributes:
Name Type Description cuda_devices
RuntimeParameter[Union[List[int], Literal['auto']]]
a list with the IDs of the CUDA devices to be used by the LLM
. If set to \"auto\", the devices will be automatically assigned based on the device placement information provided in _device_llm_placement_map
. If set to a list of devices, it will be checked if the devices are available to be used by the LLM
. If not, a warning will be logged.
disable_cuda_device_placement
RuntimeParameter[bool]
Whether to disable the CUDA device placement logic or not. Defaults to False
.
_llm_identifier
Union[str, None]
the identifier of the LLM
to be used as key in _device_llm_placement_map
.
_device_llm_placement_map
Generator[Dict[str, List[int]], None, None]
a dictionary with the device placement information for each LLM
.
src/distilabel/models/mixins/cuda_device_placement.py
class CudaDevicePlacementMixin(BaseModel):\n \"\"\"Mixin class to assign CUDA devices to the `LLM` based on the `cuda_devices` attribute\n and the device placement information provided in `_device_llm_placement_map`. Providing\n the device placement information is optional, but if it is provided, it will be used to\n assign CUDA devices to the `LLM`s, trying to avoid using the same device for different\n `LLM`s.\n\n Attributes:\n cuda_devices: a list with the ID of the CUDA devices to be used by the `LLM`. If set\n to \"auto\", the devices will be automatically assigned based on the device\n placement information provided in `_device_llm_placement_map`. If set to a list\n of devices, it will be checked if the devices are available to be used by the\n `LLM`. If not, a warning will be logged.\n disable_cuda_device_placement: Whether to disable the CUDA device placement logic\n or not. Defaults to `False`.\n _llm_identifier: the identifier of the `LLM` to be used as key in `_device_llm_placement_map`.\n _device_llm_placement_map: a dictionary with the device placement information for each\n `LLM`.\n \"\"\"\n\n cuda_devices: RuntimeParameter[Union[List[int], Literal[\"auto\"]]] = Field(\n default=\"auto\", description=\"A list with the ID of the CUDA devices to be used.\"\n )\n disable_cuda_device_placement: RuntimeParameter[bool] = Field(\n default=False,\n description=\"Whether to disable the CUDA device placement logic or not.\",\n )\n\n _llm_identifier: Union[str, None] = PrivateAttr(default=None)\n _desired_num_gpus: PositiveInt = PrivateAttr(default=1)\n _available_cuda_devices: List[int] = PrivateAttr(default_factory=list)\n _can_check_cuda_devices: bool = PrivateAttr(default=False)\n\n _logger: \"Logger\" = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Assign CUDA devices to the LLM based on the device placement information provided\n in `_device_llm_placement_map`.\"\"\"\n\n if self.disable_cuda_device_placement:\n return\n\n try:\n import pynvml\n\n pynvml.nvmlInit()\n device_count = pynvml.nvmlDeviceGetCount()\n self._available_cuda_devices = list(range(device_count))\n self._can_check_cuda_devices = True\n except ImportError as ie:\n if self.cuda_devices == \"auto\":\n raise ImportError(\n \"The 'pynvml' library is not installed. It is required to automatically\"\n \" assign CUDA devices to the `LLM`s. Please, install it and try again.\"\n ) from ie\n\n if self.cuda_devices:\n self._logger.warning( # type: ignore\n \"The 'pynvml' library is not installed. It is recommended to install it\"\n \" to check if the CUDA devices assigned to the LLM are available.\"\n )\n\n self._assign_cuda_devices()\n\n def unload(self) -> None:\n \"\"\"Unloads the LLM and removes the CUDA devices assigned to it from the device\n placement information provided in `_device_llm_placement_map`.\"\"\"\n if self.disable_cuda_device_placement:\n return\n\n with self._device_llm_placement_map() as device_map:\n if self._llm_identifier in device_map:\n self._logger.debug( # type: ignore\n f\"Removing '{self._llm_identifier}' from the CUDA device map file\"\n f\" '{_CUDA_DEVICE_PLACEMENT_MIXIN_FILE}'.\"\n )\n del device_map[self._llm_identifier]\n\n @contextmanager\n def _device_llm_placement_map(self) -> Generator[Dict[str, List[int]], None, None]:\n \"\"\"Reads the content of the device placement file of the node with a lock, yields\n the content, and writes the content back to the file after the context manager is\n closed. 
If the file doesn't exist, an empty dictionary will be yielded.\n\n Yields:\n The content of the device placement file.\n \"\"\"\n _CUDA_DEVICE_PLACEMENT_MIXIN_FILE.parent.mkdir(parents=True, exist_ok=True)\n _CUDA_DEVICE_PLACEMENT_MIXIN_FILE.touch()\n with portalocker.Lock(\n _CUDA_DEVICE_PLACEMENT_MIXIN_FILE,\n \"r+\",\n flags=portalocker.LockFlags.EXCLUSIVE,\n ) as f:\n try:\n content = json.load(f)\n except json.JSONDecodeError:\n content = {}\n yield content\n f.seek(0)\n f.truncate()\n f.write(json.dumps(content))\n\n def _assign_cuda_devices(self) -> None:\n \"\"\"Assigns CUDA devices to the LLM based on the device placement information provided\n in `_device_llm_placement_map`. If the `cuda_devices` attribute is set to \"auto\", it\n will be set to the first available CUDA device that is not going to be used by any\n other LLM. If the `cuda_devices` attribute is set to a list of devices, it will be\n checked if the devices are available to be used by the LLM. If not, a warning will be\n logged.\"\"\"\n\n # Take the lock and read the device placement information for each LLM.\n with self._device_llm_placement_map() as device_map:\n if self.cuda_devices == \"auto\":\n self.cuda_devices = []\n for _ in range(self._desired_num_gpus):\n if (device_id := self._get_cuda_device(device_map)) is not None:\n self.cuda_devices.append(device_id)\n device_map[self._llm_identifier] = self.cuda_devices # type: ignore\n if len(self.cuda_devices) != self._desired_num_gpus:\n self._logger.warning( # type: ignore\n f\"Could not assign the desired number of GPUs {self._desired_num_gpus}\"\n f\" for LLM with identifier '{self._llm_identifier}'.\"\n )\n else:\n self._check_cuda_devices(device_map)\n\n device_map[self._llm_identifier] = self.cuda_devices # type: ignore\n\n # `_device_llm_placement_map` was not provided and user didn't set the `cuda_devices`\n # attribute. In this case, the `cuda_devices` attribute will be set to an empty list.\n if self.cuda_devices == \"auto\":\n self.cuda_devices = []\n\n self._set_cuda_visible_devices()\n\n def _check_cuda_devices(self, device_map: Dict[str, List[int]]) -> None:\n \"\"\"Checks if the CUDA devices assigned to the LLM are also assigned to other LLMs.\n\n Args:\n device_map: a dictionary with the device placement information for each LLM.\n \"\"\"\n for device in self.cuda_devices: # type: ignore\n for llm, devices in device_map.items():\n if device in devices:\n self._logger.warning( # type: ignore\n f\"LLM with identifier '{llm}' is also going to use CUDA device \"\n f\"'{device}'. 
This may lead to performance issues or running out\"\n \" of memory depending on the device capabilities and the loaded\"\n \" models.\"\n )\n\n def _get_cuda_device(self, device_map: Dict[str, List[int]]) -> Union[int, None]:\n \"\"\"Returns the first available CUDA device to be used by the LLM that is not going\n to be used by any other LLM.\n\n Args:\n device_map: a dictionary with the device placement information for each LLM.\n\n Returns:\n The first available CUDA device to be used by the LLM.\n\n Raises:\n RuntimeError: if there is no available CUDA device to be used by the LLM.\n \"\"\"\n for device in self._available_cuda_devices:\n if all(device not in devices for devices in device_map.values()):\n return device\n\n return None\n\n def _set_cuda_visible_devices(self) -> None:\n \"\"\"Sets the `CUDA_VISIBLE_DEVICES` environment variable to the list of CUDA devices\n to be used by the LLM.\n \"\"\"\n if not self.cuda_devices:\n return\n\n if self._can_check_cuda_devices and not all(\n device in self._available_cuda_devices for device in self.cuda_devices\n ):\n raise RuntimeError(\n f\"Invalid CUDA devices for LLM '{self._llm_identifier}': {self.cuda_devices}.\"\n f\" The available devices are: {self._available_cuda_devices}. Please, review\"\n \" the 'cuda_devices' attribute and try again.\"\n )\n\n cuda_devices = \",\".join([str(device) for device in self.cuda_devices])\n self._logger.info( # type: ignore\n f\"\ud83c\udfae LLM '{self._llm_identifier}' is going to use the following CUDA devices:\"\n f\" {self.cuda_devices}.\"\n )\n os.environ[\"CUDA_VISIBLE_DEVICES\"] = cuda_devices\n
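Since classes such as vLLM inherit this mixin, device placement can be controlled directly from the LLM constructor. A minimal sketch assuming two free GPUs (the model id is a placeholder, and forwarding tensor_parallel_size through extra_kwargs is an assumption about the underlying vllm.LLM arguments):

from distilabel.models.llms import vLLM

llm = vLLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    cuda_devices=[0, 1],  # or "auto" to let the mixin pick devices not used by other LLMs
    extra_kwargs={"tensor_parallel_size": 2},  # assumption: forwarded to `vllm.LLM`
)
llm.load()  # `CUDA_VISIBLE_DEVICES` is set before the vLLM engine is created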
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CudaDevicePlacementMixin.load","title":"load()
","text":"Assign CUDA devices to the LLM based on the device placement information provided in _device_llm_placement_map
.
src/distilabel/models/mixins/cuda_device_placement.py
def load(self) -> None:\n \"\"\"Assign CUDA devices to the LLM based on the device placement information provided\n in `_device_llm_placement_map`.\"\"\"\n\n if self.disable_cuda_device_placement:\n return\n\n try:\n import pynvml\n\n pynvml.nvmlInit()\n device_count = pynvml.nvmlDeviceGetCount()\n self._available_cuda_devices = list(range(device_count))\n self._can_check_cuda_devices = True\n except ImportError as ie:\n if self.cuda_devices == \"auto\":\n raise ImportError(\n \"The 'pynvml' library is not installed. It is required to automatically\"\n \" assign CUDA devices to the `LLM`s. Please, install it and try again.\"\n ) from ie\n\n if self.cuda_devices:\n self._logger.warning( # type: ignore\n \"The 'pynvml' library is not installed. It is recommended to install it\"\n \" to check if the CUDA devices assigned to the LLM are available.\"\n )\n\n self._assign_cuda_devices()\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CudaDevicePlacementMixin.unload","title":"unload()
","text":"Unloads the LLM and removes the CUDA devices assigned to it from the device placement information provided in _device_llm_placement_map
.
src/distilabel/models/mixins/cuda_device_placement.py
def unload(self) -> None:\n \"\"\"Unloads the LLM and removes the CUDA devices assigned to it from the device\n placement information provided in `_device_llm_placement_map`.\"\"\"\n if self.disable_cuda_device_placement:\n return\n\n with self._device_llm_placement_map() as device_map:\n if self._llm_identifier in device_map:\n self._logger.debug( # type: ignore\n f\"Removing '{self._llm_identifier}' from the CUDA device map file\"\n f\" '{_CUDA_DEVICE_PLACEMENT_MIXIN_FILE}'.\"\n )\n del device_map[self._llm_identifier]\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CudaDevicePlacementMixin._device_llm_placement_map","title":"_device_llm_placement_map()
","text":"Reads the content of the device placement file of the node with a lock, yields the content, and writes the content back to the file after the context manager is closed. If the file doesn't exist, an empty dictionary will be yielded.
Yields:
Type Description Dict[str, List[int]]
The content of the device placement file.
Source code in src/distilabel/models/mixins/cuda_device_placement.py
@contextmanager\ndef _device_llm_placement_map(self) -> Generator[Dict[str, List[int]], None, None]:\n \"\"\"Reads the content of the device placement file of the node with a lock, yields\n the content, and writes the content back to the file after the context manager is\n closed. If the file doesn't exist, an empty dictionary will be yielded.\n\n Yields:\n The content of the device placement file.\n \"\"\"\n _CUDA_DEVICE_PLACEMENT_MIXIN_FILE.parent.mkdir(parents=True, exist_ok=True)\n _CUDA_DEVICE_PLACEMENT_MIXIN_FILE.touch()\n with portalocker.Lock(\n _CUDA_DEVICE_PLACEMENT_MIXIN_FILE,\n \"r+\",\n flags=portalocker.LockFlags.EXCLUSIVE,\n ) as f:\n try:\n content = json.load(f)\n except json.JSONDecodeError:\n content = {}\n yield content\n f.seek(0)\n f.truncate()\n f.write(json.dumps(content))\n
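The file handled by this context manager is a plain JSON mapping from each LLM identifier to the list of CUDA device ids it has reserved; in Python terms (identifiers are illustrative):

device_map = {
    "text_generation_0-llm": [0, 1],  # this LLM reserved GPUs 0 and 1
    "ultrafeedback_0-llm": [2],
}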
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CudaDevicePlacementMixin._assign_cuda_devices","title":"_assign_cuda_devices()
","text":"Assigns CUDA devices to the LLM based on the device placement information provided in _device_llm_placement_map
. If the cuda_devices
attribute is set to \"auto\", it will be set to the first available CUDA device that is not going to be used by any other LLM. If the cuda_devices
attribute is set to a list of devices, it will be checked if the devices are available to be used by the LLM. If not, a warning will be logged.
src/distilabel/models/mixins/cuda_device_placement.py
def _assign_cuda_devices(self) -> None:\n \"\"\"Assigns CUDA devices to the LLM based on the device placement information provided\n in `_device_llm_placement_map`. If the `cuda_devices` attribute is set to \"auto\", it\n will be set to the first available CUDA device that is not going to be used by any\n other LLM. If the `cuda_devices` attribute is set to a list of devices, it will be\n checked if the devices are available to be used by the LLM. If not, a warning will be\n logged.\"\"\"\n\n # Take the lock and read the device placement information for each LLM.\n with self._device_llm_placement_map() as device_map:\n if self.cuda_devices == \"auto\":\n self.cuda_devices = []\n for _ in range(self._desired_num_gpus):\n if (device_id := self._get_cuda_device(device_map)) is not None:\n self.cuda_devices.append(device_id)\n device_map[self._llm_identifier] = self.cuda_devices # type: ignore\n if len(self.cuda_devices) != self._desired_num_gpus:\n self._logger.warning( # type: ignore\n f\"Could not assign the desired number of GPUs {self._desired_num_gpus}\"\n f\" for LLM with identifier '{self._llm_identifier}'.\"\n )\n else:\n self._check_cuda_devices(device_map)\n\n device_map[self._llm_identifier] = self.cuda_devices # type: ignore\n\n # `_device_llm_placement_map` was not provided and user didn't set the `cuda_devices`\n # attribute. In this case, the `cuda_devices` attribute will be set to an empty list.\n if self.cuda_devices == \"auto\":\n self.cuda_devices = []\n\n self._set_cuda_visible_devices()\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CudaDevicePlacementMixin._check_cuda_devices","title":"_check_cuda_devices(device_map)
","text":"Checks if the CUDA devices assigned to the LLM are also assigned to other LLMs.
Parameters:
Name Type Description Default device_map
Dict[str, List[int]]
a dictionary with the device placement information for each LLM.
required Source code in src/distilabel/models/mixins/cuda_device_placement.py
def _check_cuda_devices(self, device_map: Dict[str, List[int]]) -> None:\n \"\"\"Checks if the CUDA devices assigned to the LLM are also assigned to other LLMs.\n\n Args:\n device_map: a dictionary with the device placement information for each LLM.\n \"\"\"\n for device in self.cuda_devices: # type: ignore\n for llm, devices in device_map.items():\n if device in devices:\n self._logger.warning( # type: ignore\n f\"LLM with identifier '{llm}' is also going to use CUDA device \"\n f\"'{device}'. This may lead to performance issues or running out\"\n \" of memory depending on the device capabilities and the loaded\"\n \" models.\"\n )\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CudaDevicePlacementMixin._get_cuda_device","title":"_get_cuda_device(device_map)
","text":"Returns the first available CUDA device to be used by the LLM that is not going to be used by any other LLM.
Parameters:
Name Type Description Default device_map
Dict[str, List[int]]
a dictionary with the device placement information for each LLM.
required Returns:
Type Description Union[int, None]
The first available CUDA device to be used by the LLM.
Raises:
Type Description RuntimeError
if there is no available CUDA device to be used by the LLM.
Source code in src/distilabel/models/mixins/cuda_device_placement.py
def _get_cuda_device(self, device_map: Dict[str, List[int]]) -> Union[int, None]:\n \"\"\"Returns the first available CUDA device to be used by the LLM that is not going\n to be used by any other LLM.\n\n Args:\n device_map: a dictionary with the device placement information for each LLM.\n\n Returns:\n The first available CUDA device to be used by the LLM.\n\n Raises:\n RuntimeError: if there is no available CUDA device to be used by the LLM.\n \"\"\"\n for device in self._available_cuda_devices:\n if all(device not in devices for devices in device_map.values()):\n return device\n\n return None\n
"},{"location":"api/models/llm/llm_gallery/#distilabel.models.llms.CudaDevicePlacementMixin._set_cuda_visible_devices","title":"_set_cuda_visible_devices()
","text":"Sets the CUDA_VISIBLE_DEVICES
environment variable to the list of CUDA devices to be used by the LLM.
src/distilabel/models/mixins/cuda_device_placement.py
def _set_cuda_visible_devices(self) -> None:\n \"\"\"Sets the `CUDA_VISIBLE_DEVICES` environment variable to the list of CUDA devices\n to be used by the LLM.\n \"\"\"\n if not self.cuda_devices:\n return\n\n if self._can_check_cuda_devices and not all(\n device in self._available_cuda_devices for device in self.cuda_devices\n ):\n raise RuntimeError(\n f\"Invalid CUDA devices for LLM '{self._llm_identifier}': {self.cuda_devices}.\"\n f\" The available devices are: {self._available_cuda_devices}. Please, review\"\n \" the 'cuda_devices' attribute and try again.\"\n )\n\n cuda_devices = \",\".join([str(device) for device in self.cuda_devices])\n self._logger.info( # type: ignore\n f\"\ud83c\udfae LLM '{self._llm_identifier}' is going to use the following CUDA devices:\"\n f\" {self.cuda_devices}.\"\n )\n os.environ[\"CUDA_VISIBLE_DEVICES\"] = cuda_devices\n
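As a concrete illustration, if cuda_devices ends up as [0, 2], the method above effectively performs the following before the engine is created:

import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"  # comma-separated string built from `cuda_devices`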
"},{"location":"api/pipeline/","title":"Pipeline","text":"This section contains the API reference for the distilabel
pipelines. For an example of how to use the pipelines, see the Tutorial - Pipeline.
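A minimal sketch of defining and running a pipeline (the dataset id, model id and runtime parameter values below are placeholders; see the tutorial linked above for complete, up-to-date examples):

from distilabel.models.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="synthetic-data") as pipeline:
    loader = LoadDataFromHub(name="load_dataset")
    text_generation = TextGeneration(
        name="text_generation",
        llm=vLLM(model="meta-llama/Meta-Llama-3-8B-Instruct"),  # placeholder model id
    )
    loader >> text_generation  # connect the steps

if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            "load_dataset": {"repo_id": "some-org/some-dataset", "split": "train"},
            "text_generation": {"llm": {"generation_kwargs": {"max_new_tokens": 256}}},
        },
        use_cache=True,
    )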
base
","text":""},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline","title":"BasePipeline
","text":" Bases: ABC
, RequirementsMixin
, _Serializable
Base class for a distilabel
pipeline.
Attributes:
Name Type Description name
The name of the pipeline.
description
A description of the pipeline.
dag
The DAG
instance that represents the pipeline.
_cache_dir
The directory where the pipeline will be cached.
_logger
The logger instance that will be used by the pipeline.
_batch_manager
Optional[_BatchManager]
The batch manager that will manage the batches received from the steps while running the pipeline. It will be created when the pipeline is run, from scratch or from cache. Defaults to None
.
_write_buffer
Optional[_WriteBuffer]
The buffer that will store the data of the leaf steps of the pipeline while running, so the Distiset
can be created at the end. It will be created when the pipeline is run. Defaults to None
.
_fs
Optional[AbstractFileSystem]
The fsspec
filesystem to be used to store the data of the _Batch
es passed between the steps. It will be set when the pipeline is run. Defaults to None
.
_storage_base_path
Optional[str]
The base path where the data of the _Batch
es passed between the steps will be stored. It will be set when the pipeline is run. Defaults to None
.
_use_fs_to_pass_data
bool
Whether to use the file system to pass the data of the _Batch
es between the steps. Even if this parameter is False
, the Batch
es received by GlobalStep
s will always use the file system to pass the data. Defaults to False
.
_dry_run
A flag to indicate if the pipeline is running in dry run mode. Defaults to False
.
output_queue
A queue to store the output of the steps while running the pipeline.
load_queue
A queue used by each Step
to notify the main process that it has finished loading or that the step has been unloaded.
src/distilabel/pipeline/base.py
class BasePipeline(ABC, RequirementsMixin, _Serializable):\n \"\"\"Base class for a `distilabel` pipeline.\n\n Attributes:\n name: The name of the pipeline.\n description: A description of the pipeline.\n dag: The `DAG` instance that represents the pipeline.\n _cache_dir: The directory where the pipeline will be cached.\n _logger: The logger instance that will be used by the pipeline.\n _batch_manager: The batch manager that will manage the batches received from the\n steps while running the pipeline. It will be created when the pipeline is run,\n from scratch or from cache. Defaults to `None`.\n _write_buffer: The buffer that will store the data of the leaf steps of the pipeline\n while running, so the `Distiset` can be created at the end. It will be created\n when the pipeline is run. Defaults to `None`.\n _fs: The `fsspec` filesystem to be used to store the data of the `_Batch`es passed\n between the steps. It will be set when the pipeline is run. Defaults to `None`.\n _storage_base_path: The base path where the data of the `_Batch`es passed between\n the steps will be stored. It will be set then the pipeline is run. Defaults\n to `None`.\n _use_fs_to_pass_data: Whether to use the file system to pass the data of the\n `_Batch`es between the steps. Even if this parameter is `False`, the `Batch`es\n received by `GlobalStep`s will always use the file system to pass the data.\n Defaults to `False`.\n _dry_run: A flag to indicate if the pipeline is running in dry run mode. Defaults\n to `False`.\n output_queue: A queue to store the output of the steps while running the pipeline.\n load_queue: A queue used by each `Step` to notify the main process it has finished\n loading or it the step has been unloaded.\n \"\"\"\n\n _output_queue: \"Queue[Any]\"\n _load_queue: \"Queue[Union[StepLoadStatus, None]]\"\n\n def __init__(\n self,\n name: Optional[str] = None,\n description: Optional[str] = None,\n cache_dir: Optional[Union[str, \"PathLike\"]] = None,\n enable_metadata: bool = False,\n requirements: Optional[List[str]] = None,\n ) -> None:\n \"\"\"Initialize the `BasePipeline` instance.\n\n Args:\n name: The name of the pipeline. If not generated, a random one will be generated by default.\n description: A description of the pipeline. Defaults to `None`.\n cache_dir: A directory where the pipeline will be cached. Defaults to `None`.\n enable_metadata: Whether to include the distilabel metadata column for the pipeline\n in the final `Distiset`. It contains metadata used by distilabel, for example\n the raw outputs of the `LLM` without processing would be here, inside `raw_output_...`\n field. 
Defaults to `False`.\n requirements: List of requirements that must be installed to run the pipeline.\n Defaults to `None`, but can be helpful to inform in a pipeline to be shared\n that this requirements must be installed.\n \"\"\"\n self.name = name or _PIPELINE_DEFAULT_NAME\n self.description = description\n self._enable_metadata = enable_metadata\n self.dag = DAG()\n\n if cache_dir:\n self._cache_dir = Path(cache_dir)\n elif env_cache_dir := envs.DISTILABEL_CACHE_DIR:\n self._cache_dir = Path(env_cache_dir)\n else:\n self._cache_dir = constants.PIPELINES_CACHE_DIR\n\n self._logger = logging.getLogger(\"distilabel.pipeline\")\n\n self._batch_manager: Optional[\"_BatchManager\"] = None\n self._write_buffer: Optional[\"_WriteBuffer\"] = None\n self._steps_input_queues: Dict[str, \"Queue\"] = {}\n\n self._steps_load_status: Dict[str, int] = {}\n self._steps_load_status_lock = threading.Lock()\n\n self._stop_called = False\n self._stop_called_lock = threading.Lock()\n self._stop_calls = 0\n\n self._recover_offline_batch_generate_for_step: Union[\n Tuple[str, List[List[Dict[str, Any]]]], None\n ] = None\n\n self._fs: Optional[fsspec.AbstractFileSystem] = None\n self._storage_base_path: Optional[str] = None\n self._use_fs_to_pass_data: bool = False\n self._dry_run = False\n\n self._current_stage = 0\n self._stages_last_batch: List[List[str]] = []\n self._load_groups = []\n\n self.requirements = requirements or []\n\n self._exception: Union[Exception, None] = None\n\n self._log_queue: Union[\"Queue[Any]\", None] = None\n\n def __enter__(self) -> Self:\n \"\"\"Set the global pipeline instance when entering a pipeline context.\"\"\"\n _GlobalPipelineManager.set_pipeline(self)\n return self\n\n def __exit__(self, exc_type, exc_value, traceback) -> None:\n \"\"\"Unset the global pipeline instance when exiting a pipeline context.\"\"\"\n _GlobalPipelineManager.set_pipeline(None)\n self._set_pipeline_name()\n\n def _set_pipeline_name(self) -> None:\n \"\"\"Creates a name for the pipeline if it's the default one (if hasn't been set).\"\"\"\n if self.name == _PIPELINE_DEFAULT_NAME:\n self.name = f\"pipeline_{'_'.join(self.dag)}\"\n\n @property\n def signature(self) -> str:\n \"\"\"Makes a signature (hash) of a pipeline, using the step ids and the adjacency between them.\n\n The main use is to find the pipeline in the cache folder.\n\n Returns:\n Signature of the pipeline.\n \"\"\"\n\n pipeline_dump = self.dump()[\"pipeline\"]\n steps_names = list(self.dag)\n connections_info = [\n f\"{c['from']}-{'-'.join(c['to'])}\" for c in pipeline_dump[\"connections\"]\n ]\n\n routing_batch_functions_info = []\n for function in pipeline_dump[\"routing_batch_functions\"]:\n step = function[\"step\"]\n routing_batch_function: \"RoutingBatchFunction\" = self.dag.get_step(step)[\n constants.ROUTING_BATCH_FUNCTION_ATTR_NAME\n ]\n if type_info := routing_batch_function._get_type_info():\n step += f\"-{type_info}\"\n routing_batch_functions_info.append(step)\n\n return hashlib.sha1(\n \",\".join(\n steps_names + connections_info + routing_batch_functions_info\n ).encode()\n ).hexdigest()\n\n def run(\n self,\n parameters: Optional[Dict[str, Dict[str, Any]]] = None,\n load_groups: Optional[\"LoadGroups\"] = None,\n use_cache: bool = True,\n storage_parameters: Optional[Dict[str, Any]] = None,\n use_fs_to_pass_data: bool = False,\n dataset: Optional[\"InputDataset\"] = None,\n dataset_batch_size: int = 50,\n logging_handlers: Optional[List[logging.Handler]] = None,\n ) -> \"Distiset\": # type: ignore\n \"\"\"Run the 
pipeline. It will set the runtime parameters for the steps and validate\n the pipeline.\n\n This method should be extended by the specific pipeline implementation,\n adding the logic to run the pipeline.\n\n Args:\n parameters: A dictionary with the step name as the key and a dictionary with\n the runtime parameters for the step as the value. Defaults to `None`.\n load_groups: A list containing lists of steps that have to be loaded together\n and in isolation with respect to the rest of the steps of the pipeline.\n This argument also allows passing the following modes:\n\n - \"sequential_step_execution\": each step will be executed in a stage i.e.\n the execution of the steps will be sequential.\n\n Defaults to `None`.\n use_cache: Whether to use the cache from previous pipeline runs. Defaults to\n `True`.\n storage_parameters: A dictionary with the storage parameters (`fsspec` and path)\n that will be used to store the data of the `_Batch`es passed between the\n steps if `use_fs_to_pass_data` is `True` (for the batches received by a\n `GlobalStep` it will be always used). It must have at least the \"path\" key,\n and it can contain additional keys depending on the protocol. By default,\n it will use the local file system and a directory in the cache directory.\n Defaults to `None`.\n use_fs_to_pass_data: Whether to use the file system to pass the data of\n the `_Batch`es between the steps. Even if this parameter is `False`, the\n `Batch`es received by `GlobalStep`s will always use the file system to\n pass the data. Defaults to `False`.\n dataset: If given, it will be used to create a `GeneratorStep` and put it as the\n root step. Convenient method when you have already processed the dataset in\n your script and just want to pass it already processed. Defaults to `None`.\n dataset_batch_size: if `dataset` is given, this will be the size of the batches\n yield by the `GeneratorStep` created using the `dataset`. Defaults to `50`.\n logging_handlers: A list of logging handlers that will be used to log the\n output of the pipeline. This argument can be useful so the logging messages\n can be extracted and used in a different context. Defaults to `None`.\n\n Returns:\n The `Distiset` created by the pipeline.\n \"\"\"\n\n self._exception: Union[Exception, None] = None\n\n # Set the runtime parameters that will be used during the pipeline execution.\n # They are used to generate the signature of the pipeline that is used to hit the\n # cache when the pipeline is run, so it's important to do it first.\n self._set_runtime_parameters(parameters or {})\n\n self._refresh_pipeline_from_cache()\n\n if dataset is not None:\n self._add_dataset_generator_step(dataset, dataset_batch_size)\n\n setup_logging(\n log_queue=self._log_queue,\n filename=str(self._cache_location[\"log_file\"]),\n logging_handlers=logging_handlers,\n )\n\n # Set the name of the pipeline if it's the default one. This should be called\n # if the pipeline is defined within the context manager, and the run is called\n # outside of it. 
Is here in the following case:\n # with Pipeline() as pipeline:\n # pipeline.run()\n self._set_pipeline_name()\n\n # Validate the pipeline DAG to check that all the steps are chainable, there are\n # no missing runtime parameters, batch sizes are correct, load groups are valid,\n # etc.\n self._load_groups = self._built_load_groups(load_groups)\n self._validate()\n\n self._set_pipeline_artifacts_path_in_steps()\n\n # Set the initial load status for all the steps\n self._init_steps_load_status()\n\n # Load the stages status or initialize it\n self._load_stages_status(use_cache)\n\n # Load the `_BatchManager` from cache or create one from scratch\n self._load_batch_manager(use_cache)\n\n # Check pipeline requirements are installed\n self._check_requirements()\n\n # Setup the filesystem that will be used to pass the data of the `_Batch`es\n self._setup_fsspec(storage_parameters)\n self._use_fs_to_pass_data = use_fs_to_pass_data\n\n if self._dry_run:\n self._logger.info(\"\ud83c\udf35 Dry run mode\")\n\n # If the batch manager is not able to generate batches, that means that the loaded\n # `_BatchManager` from cache didn't have any remaining batches to process i.e.\n # the previous pipeline execution was completed successfully.\n if not self._batch_manager.can_generate(): # type: ignore\n self._logger.info(\n \"\ud83d\udcbe Loaded batch manager from cache doesn't contain any remaining data.\"\n \" Returning `Distiset` from cache data...\"\n )\n distiset = create_distiset(\n data_dir=self._cache_location[\"data\"],\n pipeline_path=self._cache_location[\"pipeline\"],\n log_filename_path=self._cache_location[\"log_file\"],\n enable_metadata=self._enable_metadata,\n dag=self.dag,\n )\n stop_logging()\n return distiset\n\n self._setup_write_buffer(use_cache)\n\n self._print_load_stages_info()\n\n def dry_run(\n self,\n parameters: Optional[Dict[str, Dict[str, Any]]] = None,\n batch_size: int = 1,\n dataset: Optional[\"InputDataset\"] = None,\n ) -> \"Distiset\":\n \"\"\"Do a dry run to test the pipeline runs as expected.\n\n Running a `Pipeline` in dry run mode will set all the `batch_size` of generator steps\n to the specified `batch_size`, and run just with a single batch, effectively\n running the whole pipeline with a single example. The cache will be set to `False`.\n\n Args:\n parameters: A dictionary with the step name as the key and a dictionary with\n the runtime parameters for the step as the value. Defaults to `None`.\n batch_size: The batch size of the unique batch generated by the generators\n steps of the pipeline. Defaults to `1`.\n dataset: If given, it will be used to create a `GeneratorStep` and put it as the\n root step. Convenient method when you have already processed the dataset in\n your script and just want to pass it already processed. 
Defaults to `None`.\n\n Returns:\n Will return the `Distiset` as the main run method would do.\n \"\"\"\n self._dry_run = True\n\n for step_name in self.dag:\n step = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n\n if step.is_generator:\n if not parameters:\n parameters = {}\n parameters[step_name] = {\"batch_size\": batch_size}\n\n distiset = self.run(parameters=parameters, use_cache=False, dataset=dataset)\n\n self._dry_run = False\n return distiset\n\n def get_load_stages(self, load_groups: Optional[\"LoadGroups\"] = None) -> LoadStages:\n \"\"\"Convenient method to get the load stages of a pipeline.\n\n Args:\n load_groups: A list containing list of steps that has to be loaded together\n and in isolation with respect to the rest of the steps of the pipeline.\n Defaults to `None`.\n\n Returns:\n A tuple with the first element containing asorted list by stage containing\n lists with the names of the steps of the stage, and the second element a list\n sorted by stage containing lists with the names of the last steps of the stage.\n \"\"\"\n load_groups = self._built_load_groups(load_groups)\n return self.dag.get_steps_load_stages(load_groups)\n\n def _add_dataset_generator_step(\n self, dataset: \"InputDataset\", batch_size: int = 50\n ) -> None:\n \"\"\"Create a root step to work as the `GeneratorStep` for the pipeline using a\n dataset.\n\n Args:\n dataset: A dataset that will be used to create a `GeneratorStep` and\n placed in the DAG as the root step.\n batch_size: The size of the batches generated by the `GeneratorStep`.\n\n Raises:\n ValueError: If there's already a `GeneratorStep` in the pipeline.\n \"\"\"\n for step_name in self.dag:\n step = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n if isinstance(step_name, GeneratorStep):\n raise DistilabelUserError(\n \"There is already a `GeneratorStep` in the pipeline, you can either\"\n \" pass a `dataset` to the run method, or create a `GeneratorStep` explictly.\"\n f\" `GeneratorStep`: {step}\",\n page=\"sections/how_to_guides/basic/step/#types-of-steps\",\n )\n loader = make_generator_step(\n dataset=dataset,\n pipeline=self,\n batch_size=batch_size,\n )\n self.dag.add_root_step(loader)\n\n def get_runtime_parameters_info(self) -> \"PipelineRuntimeParametersInfo\":\n \"\"\"Get the runtime parameters for the steps in the pipeline.\n\n Returns:\n A dictionary with the step name as the key and a list of dictionaries with\n the parameter name and the parameter info as the value.\n \"\"\"\n runtime_parameters = {}\n for step_name in self.dag:\n step: \"_Step\" = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n runtime_parameters[step_name] = step.get_runtime_parameters_info()\n return runtime_parameters\n\n def _built_load_groups(\n self, load_groups: Optional[\"LoadGroups\"] = None\n ) -> List[List[str]]:\n if load_groups is None:\n return []\n\n if load_groups == \"sequential_step_execution\":\n return [[step_name] for step_name in self.dag]\n\n return [\n [\n step.name if isinstance(step, _Step) else step\n for step in steps_load_group\n ] # type: ignore\n for steps_load_group in load_groups\n ]\n\n def _validate(self) -> None:\n \"\"\"Validates the pipeline DAG to check that all the steps are chainable, there are\n no missing runtime parameters, batch sizes are correct and that load groups are\n valid (if any).\"\"\"\n self.dag.validate()\n self._validate_load_groups(self._load_groups)\n\n def _validate_load_groups(self, load_groups: List[List[Any]]) -> None: # noqa: C901\n \"\"\"Checks that the provided 
load groups are valid and that the steps can be scheduled\n to be loaded in different stages without any issue.\n\n Args:\n load_groups: the load groups to be checked.\n\n Raises:\n DistilabelUserError: if something is not OK when checking the load groups.\n \"\"\"\n\n def check_predecessor_in_load_group(\n step_name: str, load_group: List[str], first: bool\n ) -> Union[str, None]:\n if not first and step_name in load_group:\n return step_name\n\n for predecessor_step_name in self.dag.get_step_predecessors(step_name):\n # Immediate predecessor is in the same load group. This is OK.\n if first and predecessor_step_name in load_group:\n continue\n\n # Case: A -> B -> C, load_group=[A, C]\n # If a non-immediate predecessor is in the same load group and an immediate\n # predecessor is not , then it's not OK because we cannot load `step_name`\n # before one immediate predecessor.\n if step_name_in_load_group := check_predecessor_in_load_group(\n predecessor_step_name, load_group, False\n ):\n return step_name_in_load_group\n\n return None\n\n steps_included_in_load_group = []\n for load_group_num, steps_load_group in enumerate(load_groups):\n for step_name in steps_load_group:\n if step_name not in self.dag.G:\n raise DistilabelUserError(\n f\"Step with name '{step_name}' included in group {load_group_num} of\"\n \" the `load_groups` is not an step included in the pipeline. Please,\"\n \" check that you're passing the correct step name and run again.\",\n page=\"sections/how_to_guides/advanced/load_groups_and_execution_stages\",\n )\n\n node = self.dag.get_step(step_name)\n step: \"_Step\" = node[constants.STEP_ATTR_NAME]\n\n if step_name_in_load_group := check_predecessor_in_load_group(\n step_name, steps_load_group, True\n ):\n # Improve this user error message\n raise DistilabelUserError(\n f\"Step with name '{step_name}' cannot be in the same load group\"\n f\" as the step with name '{step_name_in_load_group}'. '{step_name_in_load_group}'\"\n f\" is not an immediate predecessor of '{step_name}' and there are\"\n \" immediate predecessors that have not been included.\",\n page=\"sections/how_to_guides/advanced/load_groups_and_execution_stages\",\n )\n\n if step.is_global and len(steps_load_group) > 1:\n raise DistilabelUserError(\n f\"Global step '{step_name}' has been included in a load group along\"\n \" more steps. Global steps cannot be included in a load group with\"\n \" more steps as they will be loaded in a different stage to the\"\n \" rest of the steps in the pipeline by default.\",\n page=\"sections/how_to_guides/advanced/load_groups_and_execution_stages\",\n )\n\n if step_name in steps_included_in_load_group:\n raise DistilabelUserError(\n f\"Step with name '{step_name}' in load group {load_group_num} has\"\n \" already been included in a previous load group. 
A step cannot be in more\"\n \" than one load group.\",\n page=\"sections/how_to_guides/advanced/load_groups_and_execution_stages\",\n )\n\n steps_included_in_load_group.append(step_name)\n\n def _init_steps_load_status(self) -> None:\n \"\"\"Initialize the `_steps_load_status` dictionary assigning 0 to every step of\n the pipeline.\"\"\"\n for step_name in self.dag:\n self._steps_load_status[step_name] = _STEP_NOT_LOADED_CODE\n\n def _set_pipeline_artifacts_path_in_steps(self) -> None:\n \"\"\"Sets the attribute `_pipeline_artifacts_path` in all the `Step`s of the pipeline,\n so steps can use it to get the path to save the generated artifacts.\"\"\"\n artifacts_path = self._cache_location[\"data\"] / constants.STEPS_ARTIFACTS_PATH\n for name in self.dag:\n step: \"_Step\" = self.dag.get_step(name)[constants.STEP_ATTR_NAME]\n step.set_pipeline_artifacts_path(path=artifacts_path)\n\n def _check_requirements(self) -> None:\n \"\"\"Checks if the dependencies required to run the pipeline are installed.\n\n Raises:\n ModuleNotFoundError: if one or more requirements are missing.\n \"\"\"\n if to_install := self.requirements_to_install():\n # Print the list of requirements like they would appear in a requirements.txt\n to_install_list = \"\\n\" + \"\\n\".join(to_install)\n msg = f\"Please install the following requirements to run the pipeline: {to_install_list}\"\n self._logger.error(msg)\n raise ModuleNotFoundError(msg)\n\n def _setup_fsspec(\n self, storage_parameters: Optional[Dict[str, Any]] = None\n ) -> None:\n \"\"\"Setups the `fsspec` filesystem to be used to store the data of the `_Batch`es\n passed between the steps.\n\n Args:\n storage_parameters: A dictionary with the storage parameters (`fsspec` and path)\n that will be used to store the data of the `_Batch`es passed between the\n steps if `use_fs_to_pass_data` is `True` (for the batches received by a\n `GlobalStep` it will be always used). It must have at least the \"path\" key,\n and it can contain additional keys depending on the protocol. By default,\n it will use the local file system and a directory in the cache directory.\n Defaults to `None`.\n \"\"\"\n if not storage_parameters:\n self._fs = fsspec.filesystem(\"file\")\n self._storage_base_path = (\n f\"file://{self._cache_location['batch_input_data']}\"\n )\n return\n\n if \"path\" not in storage_parameters:\n raise DistilabelUserError(\n \"The 'path' key must be present in the `storage_parameters` dictionary\"\n \" if it's not `None`.\",\n page=\"sections/how_to_guides/advanced/fs_to_pass_data/\",\n )\n\n path = storage_parameters.pop(\"path\")\n protocol = UPath(path).protocol\n\n self._fs = fsspec.filesystem(protocol, **storage_parameters)\n self._storage_base_path = path\n\n def _add_step(self, step: \"_Step\") -> None:\n \"\"\"Add a step to the pipeline.\n\n Args:\n step: The step to be added to the pipeline.\n \"\"\"\n self.dag.add_step(step)\n\n def _add_edge(self, from_step: str, to_step: str) -> None:\n \"\"\"Add an edge between two steps in the pipeline.\n\n Args:\n from_step: The name of the step that will generate the input for `to_step`.\n to_step: The name of the step that will receive the input from `from_step`.\n \"\"\"\n self.dag.add_edge(from_step, to_step)\n\n # Check if `from_step` has a `routing_batch_function`. 
If it does, then mark\n # `to_step` as a step that will receive a routed batch.\n node = self.dag.get_step(from_step) # type: ignore\n routing_batch_function = node.get(\n constants.ROUTING_BATCH_FUNCTION_ATTR_NAME, None\n )\n self.dag.set_step_attr(\n name=to_step,\n attr=constants.RECEIVES_ROUTED_BATCHES_ATTR_NAME,\n value=routing_batch_function is not None,\n )\n\n def _is_convergence_step(self, step_name: str) -> None:\n \"\"\"Checks if a step is a convergence step.\n\n Args:\n step_name: The name of the step.\n \"\"\"\n return self.dag.get_step(step_name).get(constants.CONVERGENCE_STEP_ATTR_NAME)\n\n def _add_routing_batch_function(\n self, step_name: str, routing_batch_function: \"RoutingBatchFunction\"\n ) -> None:\n \"\"\"Add a routing batch function to a step.\n\n Args:\n step_name: The name of the step that will receive the routed batch.\n routing_batch_function: The function that will route the batch to the step.\n \"\"\"\n self.dag.set_step_attr(\n name=step_name,\n attr=constants.ROUTING_BATCH_FUNCTION_ATTR_NAME,\n value=routing_batch_function,\n )\n\n def _set_runtime_parameters(self, parameters: Dict[str, Dict[str, Any]]) -> None:\n \"\"\"Set the runtime parameters for the steps in the pipeline.\n\n Args:\n parameters: A dictionary with the step name as the key and a dictionary with\n the parameter name as the key and the parameter value as the value.\n \"\"\"\n step_names = set(self.dag.G)\n for step_name, step_parameters in parameters.items():\n if step_name not in step_names:\n self._logger.warning(\n f\"\u2753 Step '{step_name}' provided in `Pipeline.run(parameters={{...}})` not found in the pipeline.\"\n f\" Available steps are: {step_names}.\"\n )\n else:\n step: \"_Step\" = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n step.set_runtime_parameters(step_parameters)\n\n def _model_dump(self, obj: Any, **kwargs: Any) -> Dict[str, Any]:\n \"\"\"Dumps the DAG content to a dict.\n\n Args:\n obj (Any): Unused, just kept to match the signature of the parent method.\n kwargs (Any): Unused, just kept to match the signature of the parent method.\n\n Returns:\n Dict[str, Any]: Internal representation of the DAG from networkx in a serializable format.\n \"\"\"\n return self.dag.dump()\n\n def draw(\n self,\n path: Optional[Union[str, Path]] = \"pipeline.png\",\n top_to_bottom: bool = False,\n show_edge_labels: bool = True,\n ) -> str:\n \"\"\"\n Draws the pipeline.\n\n Parameters:\n path: The path to save the image to.\n top_to_bottom: Whether to draw the DAG top to bottom. Defaults to `False`.\n show_edge_labels: Whether to show the edge labels. 
Defaults to `True`.\n\n Returns:\n The path to the saved image.\n \"\"\"\n png = self.dag.draw(\n top_to_bottom=top_to_bottom, show_edge_labels=show_edge_labels\n )\n with open(path, \"wb\") as f:\n f.write(png)\n return path\n\n def __repr__(self) -> str:\n \"\"\"\n If running in a Jupyter notebook, display an image representing this `Pipeline`.\n \"\"\"\n if in_notebook():\n try:\n from IPython.display import Image, display\n\n image_data = self.dag.draw()\n\n display(Image(image_data))\n except Exception:\n pass\n return super().__repr__()\n\n def dump(self, **kwargs: Any) -> Dict[str, Any]:\n return {\n \"distilabel\": {\"version\": __version__},\n \"pipeline\": {\n \"name\": self.name,\n \"description\": self.description,\n **super().dump(),\n },\n \"requirements\": self.requirements,\n }\n\n @classmethod\n def from_dict(cls, data: Dict[str, Any]) -> Self:\n \"\"\"Create a Pipeline from a dict containing the serialized data.\n\n Note:\n It's intended for internal use.\n\n Args:\n data (Dict[str, Any]): Dictionary containing the serialized data from a Pipeline.\n\n Returns:\n BasePipeline: Pipeline recreated from the dictionary info.\n \"\"\"\n name = data[\"pipeline\"][\"name\"]\n description = data[\"pipeline\"].get(\"description\")\n requirements = data.get(\"requirements\", [])\n with cls(name=name, description=description, requirements=requirements) as pipe:\n pipe.dag = DAG.from_dict(data[\"pipeline\"])\n return pipe\n\n @property\n def _cache_location(self) -> \"_CacheLocation\":\n \"\"\"Dictionary containing the object that will stored and the location,\n whether it is a filename or a folder.\n\n Returns:\n Path: Filenames where the pipeline content will be serialized.\n \"\"\"\n folder = self._cache_dir / self.name / self.signature\n pipeline_execution_dir = folder / \"executions\" / self.aggregated_steps_signature\n return {\n \"pipeline\": pipeline_execution_dir / \"pipeline.yaml\",\n \"batch_manager\": pipeline_execution_dir / \"batch_manager.json\",\n \"steps_data\": self._cache_dir / self.name / \"steps_data\",\n \"data\": pipeline_execution_dir / \"data\",\n \"batch_input_data\": pipeline_execution_dir / \"batch_input_data\",\n \"log_file\": pipeline_execution_dir / \"pipeline.log\",\n \"stages_file\": pipeline_execution_dir / \"stages.json\",\n }\n\n @property\n def aggregated_steps_signature(self) -> str:\n \"\"\"Creates an aggregated signature using `Step`s signature that will be used for\n the `_BatchManager`.\n\n Returns:\n The aggregated signature.\n \"\"\"\n signatures = []\n for step_name in self.dag:\n step: \"_Step\" = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n signatures.append(step.signature)\n return hashlib.sha1(\"\".join(signatures).encode()).hexdigest()\n\n def _cache(self) -> None:\n \"\"\"Saves the `BasePipeline` using the `_cache_filename`.\"\"\"\n if self._dry_run:\n return\n\n self.save(\n path=self._cache_location[\"pipeline\"],\n format=self._cache_location[\"pipeline\"].suffix.replace(\".\", \"\"), # type: ignore\n )\n\n if self._batch_manager is not None:\n self._batch_manager.cache(\n path=self._cache_location[\"batch_manager\"],\n steps_data_path=self._cache_location[\"steps_data\"],\n )\n\n self._save_stages_status()\n\n self._logger.debug(\"Pipeline and batch manager saved to cache.\")\n\n def _save_stages_status(self) -> None:\n \"\"\"Saves the stages status to cache.\"\"\"\n self.save(\n path=self._cache_location[\"stages_file\"],\n format=\"json\",\n dump={\n \"current_stage\": self._current_stage,\n \"stages_last_batch\": 
self._stages_last_batch,\n },\n )\n\n def _get_steps_load_stages(self) -> Tuple[List[List[str]], List[List[str]]]:\n return self.dag.get_steps_load_stages(self._load_groups)\n\n def _load_stages_status(self, use_cache: bool = True) -> None:\n \"\"\"Try to load the stages status from cache, or initialize it if cache file doesn't\n exist or cache is not going to be used.\"\"\"\n if use_cache and self._cache_location[\"stages_file\"].exists():\n stages_status = read_json(self._cache_location[\"stages_file\"])\n self._current_stage = stages_status[\"current_stage\"]\n self._stages_last_batch = stages_status[\"stages_last_batch\"]\n else:\n self._current_stage = 0\n self._stages_last_batch = [\n [] for _ in range(len(self._get_steps_load_stages()[0]))\n ]\n\n def _refresh_pipeline_from_cache(self) -> None:\n \"\"\"Refresh the DAG (and its steps) from the cache file. This is useful as some\n `Step`s can update and change their state during the pipeline execution, and this\n method will make sure the pipeline is up-to-date with the latest changes when\n the pipeline is reloaded from cache.\n \"\"\"\n\n def recursively_handle_secrets_and_excluded_attributes(\n cached_model: \"BaseModel\", model: \"BaseModel\"\n ) -> None:\n \"\"\"Recursively handle the secrets and excluded attributes of a `BaseModel`,\n setting the values of the cached model to the values of the model.\n\n Args:\n cached_model: The cached model that will be updated as it doesn't contain\n the secrets and excluded attributes (not serialized).\n model: The model that contains the secrets and excluded attributes because\n it comes from pipeline instantiation.\n \"\"\"\n for field_name, field_info in cached_model.model_fields.items():\n if field_name in (\"pipeline\"):\n continue\n\n inner_type = extract_annotation_inner_type(field_info.annotation)\n if is_type_pydantic_secret_field(inner_type) or field_info.exclude:\n setattr(cached_model, field_name, getattr(model, field_name))\n elif isclass(inner_type) and issubclass(inner_type, BaseModel):\n recursively_handle_secrets_and_excluded_attributes(\n getattr(cached_model, field_name),\n getattr(model, field_name),\n )\n\n if self._cache_location[\"pipeline\"].exists():\n cached_dag = self.from_yaml(self._cache_location[\"pipeline\"]).dag\n\n for step_name in cached_dag:\n step_cached: \"_Step\" = cached_dag.get_step(step_name)[\n constants.STEP_ATTR_NAME\n ]\n step: \"_Step\" = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n recursively_handle_secrets_and_excluded_attributes(step_cached, step)\n\n self.dag = cached_dag\n\n def _load_batch_manager(self, use_cache: bool = True) -> None:\n \"\"\"Will try to load the `_BatchManager` from the cache dir if found. 
Otherwise,\n it will create one from scratch.\n\n If the `_BatchManager` is loaded from cache, we check for invalid steps (those that\n may have a different signature than the original in the pipeline folder), and\n restart them, as well as their successors.\n\n Args:\n use_cache: whether the cache should be used or not.\n \"\"\"\n batch_manager_cache_loc = self._cache_location[\"batch_manager\"]\n\n # This first condition handles the case in which the pipeline is exactly the same\n # no steps have been added, removed or changed.\n if use_cache and batch_manager_cache_loc.exists():\n self._logger.info(\n f\"\ud83d\udcbe Loading `_BatchManager` from cache: '{batch_manager_cache_loc}'\"\n )\n self._batch_manager = _BatchManager.load_from_cache(\n dag=self.dag,\n batch_manager_path=batch_manager_cache_loc,\n steps_data_path=self._cache_location[\"steps_data\"],\n )\n self._invalidate_steps_cache_if_required()\n # In this other case, the pipeline has been changed. We need to create a new batch\n # manager and if `use_cache==True` then check which outputs have we computed and\n # cached for steps that haven't changed but that were executed in another pipeline,\n # and therefore we can reuse\n else:\n self._batch_manager = _BatchManager.from_dag(\n dag=self.dag,\n use_cache=use_cache,\n steps_data_path=self._cache_location[\"steps_data\"],\n )\n\n def _invalidate_steps_cache_if_required(self) -> None:\n \"\"\"Iterates over the steps of the pipeline and invalidates their cache if required.\"\"\"\n for step_name in self.dag:\n # `GeneratorStep`s doesn't receive input data so no need to check their\n # `_BatchManagerStep`\n if self.dag.get_step(step_name)[constants.STEP_ATTR_NAME].is_generator:\n continue\n\n step: \"_Step\" = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n if not step.use_cache:\n self._batch_manager.invalidate_cache_for(\n step_name=step.name,\n dag=self.dag,\n steps_data_path=self._cache_location[\"steps_data\"],\n ) # type: ignore\n self._logger.info(\n f\"\u267b\ufe0f Step '{step.name}' won't use cache (`use_cache=False`). 
The cache of this step and their successors won't be\"\n \" reused and the results will have to be recomputed.\"\n )\n break\n\n def _setup_write_buffer(self, use_cache: bool = True) -> None:\n \"\"\"Setups the `_WriteBuffer` that will store the data of the leaf steps of the\n pipeline while running, so the `Distiset` can be created at the end.\n \"\"\"\n if not use_cache and self._cache_location[\"data\"].exists():\n shutil.rmtree(self._cache_location[\"data\"])\n buffer_data_path = self._cache_location[\"data\"] / constants.STEPS_OUTPUTS_PATH\n self._logger.info(f\"\ud83d\udcdd Pipeline data will be written to '{buffer_data_path}'\")\n self._write_buffer = _WriteBuffer(\n buffer_data_path,\n self.dag.leaf_steps,\n steps_cached={\n step_name: self.dag.get_step(step_name)[\n constants.STEP_ATTR_NAME\n ].use_cache\n for step_name in self.dag\n },\n )\n\n def _print_load_stages_info(self) -> None:\n \"\"\"Prints the information about the load stages.\"\"\"\n stages, _ = self._get_steps_load_stages()\n msg = \"\"\n for stage, steps in enumerate(stages):\n steps_to_be_loaded = self._steps_to_be_loaded_in_stage(stage)\n msg += f\"\\n * Stage {stage}:\"\n for step_name in steps:\n step: \"Step\" = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n if step.is_generator:\n emoji = \"\ud83d\udeb0\"\n elif step.is_global:\n emoji = \"\ud83c\udf10\"\n else:\n emoji = \"\ud83d\udd04\"\n msg += f\"\\n - {emoji} '{step_name}'\"\n if step_name not in steps_to_be_loaded:\n msg += \" (results cached, won't be loaded and executed)\"\n legend = \"\\n * Legend: \ud83d\udeb0 GeneratorStep \ud83c\udf10 GlobalStep \ud83d\udd04 Step\"\n self._logger.info(\n f\"\u231b The steps of the pipeline will be loaded in stages:{legend}{msg}\"\n )\n\n def _run_output_queue_loop_in_thread(self) -> threading.Thread:\n \"\"\"Runs the output queue loop in a separate thread to receive the output batches\n from the steps. This is done to avoid the signal handler to block the loop, which\n would prevent the pipeline from stopping correctly.\"\"\"\n thread = threading.Thread(target=self._output_queue_loop)\n thread.start()\n return thread\n\n def _output_queue_loop(self) -> None:\n \"\"\"Loop to receive the output batches from the steps and manage the flow of the\n batches through the pipeline.\"\"\"\n self._create_steps_input_queues()\n\n if not self._initialize_pipeline_execution():\n return\n\n while self._should_continue_processing(): # type: ignore\n self._logger.debug(\"Waiting for output batch from step...\")\n if (batch := self._output_queue.get()) is None:\n self._logger.debug(\"Received `None` from output queue. 
Breaking loop.\")\n break\n\n self._logger.debug(\n f\"Received batch with seq_no {batch.seq_no} from step '{batch.step_name}'\"\n f\" from output queue: {batch}\"\n )\n\n self._process_batch(batch)\n\n # If `_stop_called` was set to `True` while waiting for the output queue, then\n # we need to handle the stop of the pipeline and break the loop to avoid\n # propagating the batches through the pipeline and making the stop process\n # slower.\n with self._stop_called_lock:\n if self._stop_called:\n self._handle_batch_on_stop(batch)\n break\n\n # If there is another load stage and all the `last_batch`es from the stage\n # have been received, then load the next stage.\n if self._should_load_next_stage():\n self._wait_current_stage_to_finish()\n if not self._update_stage():\n break\n\n self._manage_batch_flow(batch)\n\n self._finalize_pipeline_execution()\n\n def _create_steps_input_queues(self) -> None:\n \"\"\"Creates the input queue for all the steps in the pipeline.\"\"\"\n for step_name in self.dag:\n self._logger.debug(f\"Creating input queue for '{step_name}' step...\")\n input_queue = self._create_step_input_queue(step_name)\n self._steps_input_queues[step_name] = input_queue\n\n def _initialize_pipeline_execution(self) -> bool:\n \"\"\"Load the steps of the required stage to initialize the pipeline execution,\n and requests the initial batches to trigger the batch flowing in the pipeline.\n\n Returns:\n `True` if initialization went OK, `False` otherwise.\n \"\"\"\n # Wait for all the steps to be loaded correctly\n if not self._run_stage_steps_and_wait(stage=self._current_stage):\n self._set_steps_not_loaded_exception()\n return False\n\n # Send the \"first\" batches to the steps so the batches starts flowing through\n # the input queues and output queue\n self._request_initial_batches()\n\n return True\n\n def _should_continue_processing(self) -> bool:\n \"\"\"Condition for the consume batches from the `output_queue` loop.\n\n Returns:\n `True` if should continue consuming batches, `False` otherwise and the pipeline\n should stop.\n \"\"\"\n with self._stop_called_lock:\n return self._batch_manager.can_generate() and not self._stop_called # type: ignore\n\n def _process_batch(\n self, batch: \"_Batch\", send_last_batch_flag: bool = True\n ) -> None:\n \"\"\"Process a batch consumed from the `output_queue`.\n\n Args:\n batch: the batch to be processed.\n \"\"\"\n if batch.data_path:\n self._logger.debug(\n f\"Reading {batch.seq_no} batch data from '{batch.step_name}': '{batch.data_path}'\"\n )\n batch.read_batch_data_from_fs()\n\n if batch.step_name in self.dag.leaf_steps:\n self._write_buffer.add_batch(batch) # type: ignore\n\n if batch.last_batch:\n self._register_stages_last_batch(batch)\n\n # Make sure to send the `LAST_BATCH_SENT_FLAG` to the predecessors of the step\n # if the batch is the last one, so they stop their processing loop even if they\n # haven't received the last batch because of the routing function.\n if send_last_batch_flag:\n for step_name in self.dag.get_step_predecessors(batch.step_name):\n if self._is_step_running(step_name):\n self._send_last_batch_flag_to_step(step_name)\n\n def _set_step_for_recovering_offline_batch_generation(\n self, step: \"_Step\", data: List[List[Dict[str, Any]]]\n ) -> None:\n \"\"\"Sets the required information to recover a pipeline execution from a `_Step`\n that used an `LLM` with offline batch generation.\n\n Args:\n step: The `_Step` that used an `LLM` with offline batch generation.\n data: The data that was used to generate the 
batches for the step.\n \"\"\"\n # Replace step so the attribute `jobs_ids` of the `LLM` is not lost, as it was\n # updated in the child process but not in the main process.\n step_name: str = step.name # type: ignore\n self.dag.set_step_attr(\n name=step_name, attr=constants.STEP_ATTR_NAME, value=step\n )\n self._recover_offline_batch_generate_for_step = (step_name, data)\n\n def _add_batch_for_recovering_offline_batch_generation(self) -> None:\n \"\"\"Adds a dummy `_Batch` to the specified step name (it's a `Task` that used an\n `LLM` with offline batch generation) to recover the pipeline state for offline\n batch generation in next pipeline executions.\"\"\"\n assert self._batch_manager, \"Batch manager is not set\"\n\n if self._recover_offline_batch_generate_for_step is None:\n return\n\n step_name, data = self._recover_offline_batch_generate_for_step\n self._logger.debug(\n f\"Adding batch to '{step_name}' step to recover pipeline execution for offline\"\n \" batch generation...\"\n )\n self._batch_manager.add_batch_to_recover_offline_batch_generation(\n to_step=step_name,\n data=data,\n )\n\n def _register_stages_last_batch(self, batch: \"_Batch\") -> None:\n \"\"\"Registers the last batch received from a step in the `_stages_last_batch`\n dictionary.\n\n Args:\n batch: The last batch received from a step.\n \"\"\"\n _, stages_last_steps = self._get_steps_load_stages()\n stage_last_steps = stages_last_steps[self._current_stage]\n if batch.step_name in stage_last_steps:\n self._stages_last_batch[self._current_stage].append(batch.step_name)\n self._stages_last_batch[self._current_stage].sort()\n\n def _update_stage(self) -> bool:\n \"\"\"Checks if the steps of next stage should be loaded and updates `_current_stage`\n attribute.\n\n Returns:\n `True` if updating the stage went OK, `False` otherwise.\n \"\"\"\n self._current_stage += 1\n if not self._run_stage_steps_and_wait(stage=self._current_stage):\n self._set_steps_not_loaded_exception()\n return False\n\n return True\n\n def _should_load_next_stage(self) -> bool:\n \"\"\"Returns if the next stage should be loaded.\n\n Returns:\n `True` if the next stage should be loaded, `False` otherwise.\n \"\"\"\n _, stage_last_steps = self._get_steps_load_stages()\n there_is_next_stage = self._current_stage + 1 < len(stage_last_steps)\n stage_last_batches_received = (\n self._stages_last_batch[self._current_stage]\n == stage_last_steps[self._current_stage]\n )\n return there_is_next_stage and stage_last_batches_received\n\n def _finalize_pipeline_execution(self) -> None:\n \"\"\"Finalizes the pipeline execution handling the prematurely stop of the pipeline\n if required, caching the data and ensuring that all the steps finish its execution.\"\"\"\n\n # Send `None` to steps `input_queue`s just in case some step is still waiting\n self._notify_steps_to_stop()\n\n for step_name in self.dag:\n while self._is_step_running(step_name):\n self._logger.debug(f\"Waiting for step '{step_name}' to finish...\")\n time.sleep(0.5)\n\n with self._stop_called_lock:\n if self._stop_called:\n self._handle_stop()\n\n # Reset flag state\n self._stop_called = False\n\n self._add_batch_for_recovering_offline_batch_generation()\n\n self._cache()\n\n def _run_load_queue_loop_in_thread(self) -> threading.Thread:\n \"\"\"Runs a background thread that reads from the `load_queue` to update the status\n of the number of replicas loaded for each step.\n\n Returns:\n The thread that was started.\n \"\"\"\n thread = threading.Thread(target=self._run_load_queue_loop)\n 
thread.start()\n return thread\n\n def _run_load_queue_loop(self) -> None:\n \"\"\"Runs a loop that reads from the `load_queue` to update the status of the number\n of replicas loaded for each step.\"\"\"\n\n while True:\n if (load_info := self._load_queue.get()) is None:\n self._logger.debug(\"Received `None` from load queue. Breaking loop.\")\n break\n\n with self._steps_load_status_lock:\n step_name, status = load_info[\"name\"], load_info[\"status\"]\n if status == \"loaded\":\n if self._steps_load_status[step_name] == _STEP_NOT_LOADED_CODE:\n self._steps_load_status[step_name] = 1\n else:\n self._steps_load_status[step_name] += 1\n elif status == \"unloaded\":\n self._steps_load_status[step_name] -= 1\n if self._steps_load_status[step_name] == 0:\n self._steps_load_status[step_name] = _STEP_UNLOADED_CODE\n else:\n # load failed\n self._steps_load_status[step_name] = _STEP_LOAD_FAILED_CODE\n\n self._logger.debug(\n f\"Step '{step_name}' loaded replicas: {self._steps_load_status[step_name]}\"\n )\n\n def _is_step_running(self, step_name: str) -> bool:\n \"\"\"Checks if the step is running (at least one replica is running).\n\n Args:\n step_name: The step to be check if running.\n\n Returns:\n `True` if the step is running, `False` otherwise.\n \"\"\"\n with self._steps_load_status_lock:\n return self._steps_load_status[step_name] >= 1\n\n def _steps_to_be_loaded_in_stage(self, stage: int) -> List[str]:\n \"\"\"Returns the list of steps of the provided stage that should be loaded taking\n into account if they have finished.\n\n Args:\n stage: the stage number\n\n Returns:\n A list containing the name of the steps that should be loaded in this stage.\n \"\"\"\n assert self._batch_manager, \"Batch manager is not set\"\n\n steps_stages, _ = self._get_steps_load_stages()\n\n return [\n step\n for step in steps_stages[stage]\n if not self._batch_manager.step_has_finished(step)\n ]\n\n def _get_steps_load_status(self, steps: List[str]) -> Dict[str, int]:\n \"\"\"Gets the a dictionary containing the load status of the provided steps.\n\n Args:\n steps: a list containing the names of the steps to get their load status.\n\n Returns:\n A dictionary containing the load status of the provided steps.\n \"\"\"\n return {\n step_name: replicas\n for step_name, replicas in self._steps_load_status.items()\n if step_name in steps\n }\n\n def _wait_current_stage_to_finish(self) -> None:\n \"\"\"Waits for the current stage to finish.\"\"\"\n stage = self._current_stage\n steps = self._steps_to_be_loaded_in_stage(stage)\n self._logger.info(f\"\u23f3 Waiting for stage {stage} to finish...\")\n with self._stop_called_lock:\n while not self._stop_called:\n filtered_steps_load_status = self._get_steps_load_status(steps)\n if all(\n replicas == _STEP_UNLOADED_CODE\n for replicas in filtered_steps_load_status.values()\n ):\n self._logger.info(f\"\u2705 Stage {stage} has finished!\")\n break\n\n def _run_stage_steps_and_wait(self, stage: int) -> bool:\n \"\"\"Runs the steps of the specified stage and waits for them to be ready.\n\n Args:\n stage: the stage from which the steps have to be loaded.\n\n Returns:\n `True` if all the steps have been loaded correctly, `False` otherwise.\n \"\"\"\n assert self._batch_manager, \"Batch manager is not set\"\n\n steps = self._steps_to_be_loaded_in_stage(stage)\n self._logger.debug(f\"Steps to be loaded in stage {stage}: {steps}\")\n\n # Run the steps of the stage\n self._run_steps(steps=steps)\n\n # Wait for them to be ready\n self._logger.info(f\"\u23f3 Waiting for all the 
steps of stage {stage} to load...\")\n previous_message = None\n with self._stop_called_lock:\n while not self._stop_called:\n with self._steps_load_status_lock:\n filtered_steps_load_status = self._get_steps_load_status(steps)\n self._logger.debug(\n f\"Steps from stage {stage} loaded: {filtered_steps_load_status}\"\n )\n\n if any(\n replicas_loaded == _STEP_LOAD_FAILED_CODE\n for replicas_loaded in filtered_steps_load_status.values()\n ):\n self._logger.error(\n f\"\u274c Failed to load all the steps of stage {stage}\"\n )\n return False\n\n num_steps_loaded = 0\n replicas_message = \"\"\n for step_name, replicas in filtered_steps_load_status.items():\n step_replica_count = self.dag.get_step_replica_count(step_name)\n # It can happen that the step is very fast and it has done all the\n # work and have finished its execution before checking if it has\n # been loaded, that's why we also considered the step to be loaded\n # if `_STEP_UNLOADED_CODE`.\n if (\n replicas == step_replica_count\n or replicas == _STEP_UNLOADED_CODE\n ):\n num_steps_loaded += 1\n replicas_message += f\"\\n * '{step_name}' replicas: {max(0, replicas)}/{step_replica_count}\"\n\n message = f\"\u23f3 Steps from stage {stage} loaded: {num_steps_loaded}/{len(filtered_steps_load_status)}{replicas_message}\"\n if num_steps_loaded > 0 and message != previous_message:\n self._logger.info(message)\n previous_message = message\n\n if num_steps_loaded == len(filtered_steps_load_status):\n self._logger.info(\n f\"\u2705 All the steps from stage {stage} have been loaded!\"\n )\n return True\n\n time.sleep(2.5)\n\n return not self._stop_called\n\n def _handle_stop(self) -> None:\n \"\"\"Handles the stop of the pipeline execution, which will stop the steps from\n processing more batches and wait for the output queue to be empty, to not lose\n any data that was already processed by the steps before the stop was called.\"\"\"\n self._logger.debug(\"Handling stop of the pipeline execution...\")\n\n self._add_batches_back_to_batch_manager()\n\n # Wait for the input queue to be empty, which means that all the steps finished\n # processing the batches that were sent before the stop flag.\n self._wait_steps_input_queues_empty()\n\n self._consume_output_queue()\n\n if self._should_load_next_stage():\n self._current_stage += 1\n\n def _wait_steps_input_queues_empty(self) -> None:\n self._logger.debug(\"Waiting for steps input queues to be empty...\")\n for step_name in self.dag:\n self._wait_step_input_queue_empty(step_name)\n self._logger.debug(\"Steps input queues are empty!\")\n\n def _wait_step_input_queue_empty(self, step_name: str) -> Union[\"Queue[Any]\", None]:\n \"\"\"Waits for the input queue of a step to be empty.\n\n Args:\n step_name: The name of the step.\n\n Returns:\n The input queue of the step if it's not loaded or finished, `None` otherwise.\n \"\"\"\n if self._check_step_not_loaded_or_finished(step_name):\n return None\n\n if input_queue := self.dag.get_step(step_name).get(\n constants.INPUT_QUEUE_ATTR_NAME\n ):\n while input_queue.qsize() != 0:\n pass\n return input_queue\n\n def _check_step_not_loaded_or_finished(self, step_name: str) -> bool:\n \"\"\"Checks if a step is not loaded or already finished.\n\n Args:\n step_name: The name of the step.\n\n Returns:\n `True` if the step is not loaded or already finished, `False` otherwise.\n \"\"\"\n with self._steps_load_status_lock:\n num_replicas = self._steps_load_status[step_name]\n\n # The step has finished (replicas = 0) or it has failed to load\n if num_replicas in 
[0, _STEP_LOAD_FAILED_CODE, _STEP_UNLOADED_CODE]:\n return True\n\n return False\n\n @property\n @abstractmethod\n def QueueClass(self) -> Callable:\n \"\"\"The class of the queue to use in the pipeline.\"\"\"\n pass\n\n def _create_step_input_queue(self, step_name: str) -> \"Queue[Any]\":\n \"\"\"Creates an input queue for a step.\n\n Args:\n step_name: The name of the step.\n\n Returns:\n The input queue created.\n \"\"\"\n input_queue = self.QueueClass()\n self.dag.set_step_attr(step_name, constants.INPUT_QUEUE_ATTR_NAME, input_queue)\n return input_queue\n\n @abstractmethod\n def _run_step(self, step: \"_Step\", input_queue: \"Queue[Any]\", replica: int) -> None:\n \"\"\"Runs the `Step` instance.\n\n Args:\n step: The `Step` instance to run.\n input_queue: The input queue where the step will receive the batches.\n replica: The replica ID assigned.\n \"\"\"\n pass\n\n def _run_steps(self, steps: Iterable[str]) -> None:\n \"\"\"Runs the `Step`s of the pipeline, creating first an input queue for each step\n that will be used to send the batches.\n\n Args:\n steps:\n \"\"\"\n for step_name in steps:\n step: \"Step\" = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n input_queue = self._steps_input_queues[step.name] # type: ignore\n\n # Set `pipeline` to `None` as in some Python environments the pipeline is not\n # picklable and it will raise an error when trying to send the step to the process.\n # `TypeError: cannot pickle 'code' object`\n step.pipeline = None\n\n if not step.is_normal and step.resources.replicas > 1: # type: ignore\n self._logger.warning(\n f\"Step '{step_name}' is a `GeneratorStep` or `GlobalStep` and has more\"\n \" than 1 replica. Only `Step` instances can have more than 1 replica.\"\n \" The number of replicas for the step will be set to 1.\"\n )\n\n step_num_replicas: int = step.resources.replicas if step.is_normal else 1 # type: ignore\n for replica in range(step_num_replicas):\n self._logger.debug(\n f\"Running 1 replica of step '{step.name}' with ID {replica}...\"\n )\n self._run_step(\n step=step.model_copy(deep=True),\n input_queue=input_queue,\n replica=replica,\n )\n\n def _add_batches_back_to_batch_manager(self) -> None:\n \"\"\"Add the `Batch`es that were sent to a `Step` back to the `_BatchManager`. This\n method should be used when the pipeline has been stopped prematurely.\"\"\"\n self._logger.debug(\n \"Adding batches from step input queues back to the batch manager...\"\n )\n for step_name in self.dag:\n node = self.dag.get_step(step_name)\n step: \"_Step\" = node[constants.STEP_ATTR_NAME]\n if step.is_generator:\n continue\n if input_queue := node.get(constants.INPUT_QUEUE_ATTR_NAME):\n while not input_queue.empty():\n batch = input_queue.get()\n if not isinstance(batch, _Batch):\n continue\n self._batch_manager.add_batch( # type: ignore\n to_step=step_name,\n batch=batch,\n prepend=True,\n )\n self._logger.debug(\n f\"Adding batch back to the batch manager: {batch}\"\n )\n if self._check_step_not_loaded_or_finished(step_name):\n # Notify the step to stop\n input_queue.put(None)\n self._logger.debug(\"Finished adding batches back to the batch manager.\")\n\n def _consume_output_queue(self) -> None:\n \"\"\"Consumes the `Batch`es from the output queue until it's empty. 
This method should\n be used when the pipeline has been stopped prematurely to consume and to not lose\n the `Batch`es that were processed by the leaf `Step`s before stopping the pipeline.\"\"\"\n while not self._output_queue.empty():\n batch = self._output_queue.get()\n if batch is None:\n continue\n self._process_batch(batch, send_last_batch_flag=False)\n self._handle_batch_on_stop(batch)\n\n def _manage_batch_flow(self, batch: \"_Batch\") -> None:\n \"\"\"Checks if the step that generated the batch has more data in its buffer to\n generate a new batch. If there's data, then a new batch is sent to the step. If\n the step has no data in its buffer, then the predecessors generator steps are\n requested to send a new batch.\n\n Args:\n batch: The batch that was processed.\n \"\"\"\n assert self._batch_manager, \"Batch manager is not set\"\n\n route_to, do_not_route_to, routed = self._get_successors(batch)\n\n self._register_batch(batch)\n\n # Keep track of the steps that the batch was routed to\n if routed:\n batch.batch_routed_to = route_to\n\n self._set_next_expected_seq_no(\n steps=do_not_route_to,\n from_step=batch.step_name,\n next_expected_seq_no=batch.seq_no + 1,\n )\n\n step = self._get_step_from_batch(batch)\n\n # Add the batch to the successors input buffers\n for successor in route_to:\n # Copy batch to avoid modifying the same reference in the batch manager\n batch_to_add = batch.copy() if len(route_to) > 1 else batch\n\n self._batch_manager.add_batch(successor, batch_to_add)\n\n # Check if the step is a generator and if there are successors that need data\n # from this step. This usually happens when the generator `batch_size` is smaller\n # than the `input_batch_size` of the successor steps.\n if (\n step.is_generator\n and step.name in self._batch_manager.step_empty_buffers(successor)\n ):\n last_batch_sent = self._batch_manager.get_last_batch_sent(step.name)\n self._send_batch_to_step(last_batch_sent.next_batch()) # type: ignore\n\n # If successor step has enough data in its buffer to create a new batch, then\n # send the batch to the step.\n while new_batch := self._batch_manager.get_batch(successor):\n self._send_batch_to_step(new_batch)\n\n if not step.is_generator:\n # Step (\"this\", the one from which the batch was received) has enough data on its\n # buffers to create a new batch\n while new_batch := self._batch_manager.get_batch(step.name): # type: ignore\n self._send_batch_to_step(new_batch)\n else:\n self._request_more_batches_if_needed(step)\n else:\n # Case in which the pipeline only contains a `GeneratorStep` so we constanly keep\n # requesting batch after batch as there is no downstream step to consume it\n if len(self.dag) == 1:\n self._request_batch_from_generator(step.name) # type: ignore\n\n self._cache()\n\n def _send_to_step(self, step_name: str, to_send: Any) -> None:\n \"\"\"Sends something to the input queue of a step.\n\n Args:\n step_name: The name of the step.\n to_send: The object to send.\n \"\"\"\n input_queue = self.dag.get_step(step_name)[constants.INPUT_QUEUE_ATTR_NAME]\n input_queue.put(to_send)\n\n def _send_batch_to_step(self, batch: \"_Batch\") -> None:\n \"\"\"Sends a batch to the input queue of a step, writing the data of the batch\n to the filesystem and setting `batch.data_path` with the path where the data\n was written (if requiered i.e. 
the step is a global step or `use_fs_to_pass_data`)\n\n This method should be extended by the specific pipeline implementation, adding\n the logic to send the batch to the step.\n\n Args:\n batch: The batch to send.\n \"\"\"\n self._logger.debug(\n f\"Setting batch {batch.seq_no} as last batch sent to '{batch.step_name}': {batch}\"\n )\n self._batch_manager.set_last_batch_sent(batch) # type: ignore\n\n step: \"_Step\" = self.dag.get_step(batch.step_name)[constants.STEP_ATTR_NAME]\n if not step.is_generator and (step.is_global or self._use_fs_to_pass_data):\n base_path = UPath(self._storage_base_path) / step.name # type: ignore\n self._logger.debug(\n f\"Writing {batch.seq_no} batch for '{batch.step_name}' step to filesystem: {base_path}\"\n )\n batch.write_batch_data_to_fs(self._fs, base_path) # type: ignore\n\n self._logger.debug(\n f\"Sending batch {batch.seq_no} to step '{batch.step_name}': {batch}\"\n )\n self._send_to_step(batch.step_name, batch)\n\n def _gather_requirements(self) -> List[str]:\n \"\"\"Extracts the requirements from the steps to be used in the pipeline.\n\n Returns:\n List of requirements gathered from the steps.\n \"\"\"\n steps_requirements = []\n for step in self.dag:\n step_req = self.dag.get_step(step)[constants.STEP_ATTR_NAME].requirements\n steps_requirements.extend(step_req)\n\n return steps_requirements\n\n def _register_batch(self, batch: \"_Batch\") -> None:\n \"\"\"Registers a batch in the batch manager.\n\n Args:\n batch: The batch to register.\n \"\"\"\n assert self._batch_manager, \"Batch manager is not set\"\n self._batch_manager.register_batch(\n batch, steps_data_path=self._cache_location[\"steps_data\"]\n ) # type: ignore\n self._logger.debug(\n f\"Batch {batch.seq_no} from step '{batch.step_name}' registered in batch\"\n \" manager\"\n )\n\n def _send_last_batch_flag_to_step(self, step_name: str) -> None:\n \"\"\"Sends the `LAST_BATCH_SENT_FLAG` to a step to stop processing batches.\n\n Args:\n step_name: The name of the step.\n \"\"\"\n self._logger.debug(\n f\"Sending `LAST_BATCH_SENT_FLAG` to '{step_name}' step to stop processing\"\n \" batches...\"\n )\n\n for _ in range(self.dag.get_step_replica_count(step_name)):\n self._send_to_step(step_name, constants.LAST_BATCH_SENT_FLAG)\n self._batch_manager.set_last_batch_flag_sent_to(step_name) # type: ignore\n\n def _request_initial_batches(self) -> None:\n \"\"\"Requests the initial batches to the generator steps.\"\"\"\n assert self._batch_manager, \"Batch manager is not set\"\n for step in self._batch_manager._steps.values():\n if not self._is_step_running(step.step_name):\n continue\n if batch := step.get_batch():\n self._logger.debug(\n f\"Sending initial batch to '{step.step_name}' step: {batch}\"\n )\n self._send_batch_to_step(batch)\n\n for step_name in self.dag.root_steps:\n if not self._is_step_running(step_name):\n continue\n seq_no = 0\n if last_batch := self._batch_manager.get_last_batch(step_name):\n seq_no = last_batch.seq_no + 1\n batch = _Batch(seq_no=seq_no, step_name=step_name, last_batch=self._dry_run)\n self._logger.debug(\n f\"Requesting initial batch to '{step_name}' generator step: {batch}\"\n )\n self._send_batch_to_step(batch)\n\n def _request_batch_from_generator(self, step_name: str) -> None:\n \"\"\"Request a new batch to a `GeneratorStep`.\n\n Args:\n step_name: the name of the `GeneratorStep` to which a batch has to be requested.\n \"\"\"\n # Get the last batch that the previous step sent to generate the next batch\n # (next `seq_no`).\n last_batch = 
self._batch_manager.get_last_batch_sent(step_name) # type: ignore\n if last_batch is None:\n return\n self._send_batch_to_step(last_batch.next_batch())\n\n def _request_more_batches_if_needed(self, step: \"Step\") -> None:\n \"\"\"Request more batches to the predecessors steps of `step` if needed.\n\n Args:\n step: The step of which it has to be checked if more batches are needed from\n its predecessors.\n \"\"\"\n empty_buffers = self._batch_manager.step_empty_buffers(step.name) # type: ignore\n for previous_step_name in empty_buffers:\n # Only more batches can be requested to the `GeneratorStep`s as they are the\n # only kind of steps that lazily generate batches.\n if previous_step_name not in self.dag.root_steps:\n continue\n\n self._request_batch_from_generator(previous_step_name)\n\n def _handle_batch_on_stop(self, batch: \"_Batch\") -> None:\n \"\"\"Handles a batch that was received from the output queue when the pipeline was\n stopped. It will add and register the batch in the batch manager.\n\n Args:\n batch: The batch to handle.\n \"\"\"\n assert self._batch_manager, \"Batch manager is not set\"\n\n self._batch_manager.register_batch(\n batch, steps_data_path=self._cache_location[\"steps_data\"]\n )\n step: \"Step\" = self.dag.get_step(batch.step_name)[constants.STEP_ATTR_NAME]\n for successor in self.dag.get_step_successors(step.name): # type: ignore\n self._batch_manager.add_batch(successor, batch)\n\n def _get_step_from_batch(self, batch: \"_Batch\") -> \"Step\":\n \"\"\"Gets the `Step` instance from a batch.\n\n Args:\n batch: The batch to get the step from.\n\n Returns:\n The `Step` instance.\n \"\"\"\n return self.dag.get_step(batch.step_name)[constants.STEP_ATTR_NAME]\n\n def _notify_steps_to_stop(self) -> None:\n \"\"\"Notifies the steps to stop their infinite running loop by sending `None` to\n their input queues.\"\"\"\n with self._steps_load_status_lock:\n for step_name, replicas in self._steps_load_status.items():\n if replicas > 0:\n for _ in range(replicas):\n self._send_to_step(step_name, None)\n\n def _get_successors(self, batch: \"_Batch\") -> Tuple[List[str], List[str], bool]:\n \"\"\"Gets the successors and the successors to which the batch has to be routed.\n\n Args:\n batch: The batch to which the successors will be determined.\n\n Returns:\n The successors to route the batch to and whether the batch was routed using\n a routing function.\n \"\"\"\n node = self.dag.get_step(batch.step_name)\n step: \"Step\" = node[constants.STEP_ATTR_NAME]\n successors = list(self.dag.get_step_successors(step.name)) # type: ignore\n route_to = successors\n\n # Check if the step has a routing function to send the batch to specific steps\n if routing_batch_function := node.get(\n constants.ROUTING_BATCH_FUNCTION_ATTR_NAME\n ):\n route_to = routing_batch_function(batch, successors)\n successors_str = \", \".join(f\"'{successor}'\" for successor in route_to)\n self._logger.info(\n f\"\ud83d\ude8f Using '{step.name}' routing function to send batch {batch.seq_no} to steps: {successors_str}\"\n )\n\n return route_to, list(set(successors) - set(route_to)), route_to != successors\n\n def _set_next_expected_seq_no(\n self, steps: List[str], from_step: str, next_expected_seq_no: int\n ) -> None:\n \"\"\"Sets the next expected sequence number of a `_Batch` received by `step` from\n `from_step`. 
This is necessary as some `Step`s might not receive all the batches\n comming from the previous steps because there is a routing batch function.\n\n Args:\n steps: list of steps to which the next expected sequence number of a `_Batch`\n from `from_step` has to be updated in the `_BatchManager`.\n from_step: the name of the step from which the next expected sequence number\n of a `_Batch` has to be updated in `steps`.\n next_expected_seq_no: the number of the next expected sequence number of a `Batch`\n from `from_step`.\n \"\"\"\n assert self._batch_manager, \"Batch manager is not set\"\n\n for step in steps:\n self._batch_manager.set_next_expected_seq_no(\n step_name=step,\n from_step=from_step,\n next_expected_seq_no=next_expected_seq_no,\n )\n\n @abstractmethod\n def _teardown(self) -> None:\n \"\"\"Clean/release/stop resources reserved to run the pipeline.\"\"\"\n pass\n\n @abstractmethod\n def _set_steps_not_loaded_exception(self) -> None:\n \"\"\"Used to raise `RuntimeError` when the load of the steps failed.\n\n Raises:\n RuntimeError: containing the information and why a step failed to be loaded.\n \"\"\"\n pass\n\n @abstractmethod\n def _stop(self) -> None:\n \"\"\"Stops the pipeline in a controlled way.\"\"\"\n pass\n\n def _stop_load_queue_loop(self) -> None:\n \"\"\"Stops the `_load_queue` loop sending a `None`.\"\"\"\n self._logger.debug(\"Sending `None` to the load queue to notify stop...\")\n self._load_queue.put(None)\n\n def _stop_output_queue_loop(self) -> None:\n \"\"\"Stops the `_output_queue` loop sending a `None`.\"\"\"\n self._logger.debug(\"Sending `None` to the output queue to notify stop...\")\n self._output_queue.put(None)\n\n def _handle_keyboard_interrupt(self) -> Any:\n \"\"\"Handles KeyboardInterrupt signal sent during the Pipeline.run method.\n\n It will try to call self._stop (if the pipeline didn't started yet, it won't\n have any effect), and if the pool is already started, will close it before exiting\n the program.\n\n Returns:\n The original `signal.SIGINT` handler.\n \"\"\"\n\n def signal_handler(signumber: int, frame: Any) -> None:\n self._stop()\n\n return signal.signal(signal.SIGINT, signal_handler)\n
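The validation at the start of this source block enforces the load group rules (every step name must exist in the pipeline, a step can only belong to one group, global steps are loaded on their own, and a step cannot be grouped with a non-immediate predecessor while its immediate predecessors are left out). The groups themselves are supplied at run time. A minimal sketch, assuming a pipeline that defines steps named "generate" and "judge" (hypothetical names used only for illustration):

# Hypothetical step names; they must match steps defined in the pipeline DAG.
distiset = pipeline.run(
    load_groups=[["generate", "judge"]],  # load these two steps together, isolated from the rest
    use_cache=True,
)

# The string mode documented in `run()` later in this reference executes every step in its own stage:
distiset = pipeline.run(load_groups="sequential_step_execution")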
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.signature","title":"signature
property
","text":"Makes a signature (hash) of a pipeline, using the step ids and the adjacency between them.
The main use is to find the pipeline in the cache folder.
Returns:
Type Descriptionstr
Signature of the pipeline.
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.aggregated_steps_signature","title":"aggregated_steps_signature
property
","text":"Creates an aggregated signature using Step
s signature that will be used for the _BatchManager
.
Returns:
Type Descriptionstr
The aggregated signature.
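A short sketch of how these two signatures are typically inspected; the cache layout noted in the comment follows the _cache_location property shown in the source above:

print(pipeline.signature)                   # hash of the step ids and their adjacency
print(pipeline.aggregated_steps_signature)  # hash aggregated from the individual step signatures
# The execution cache lives under <cache_dir>/<pipeline name>/<signature>, so two
# pipelines with the same DAG resolve to the same cache folder.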
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.QueueClass","title":"QueueClass
abstractmethod
property
","text":"The class of the queue to use in the pipeline.
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.__init__","title":"__init__(name=None, description=None, cache_dir=None, enable_metadata=False, requirements=None)
","text":"Initialize the BasePipeline
instance.
Parameters:
Name Type Description Defaultname
Optional[str]
The name of the pipeline. If not provided, a random one will be generated by default.
None
description
Optional[str]
A description of the pipeline. Defaults to None
.
None
cache_dir
Optional[Union[str, PathLike]]
A directory where the pipeline will be cached. Defaults to None
.
None
enable_metadata
bool
Whether to include the distilabel metadata column for the pipeline in the final Distiset
. It contains metadata used by distilabel, for example the raw outputs of the LLM
without processing would be here, inside raw_output_...
field. Defaults to False
.
False
requirements
Optional[List[str]]
List of requirements that must be installed to run the pipeline. Defaults to None
, but it can be helpful for a pipeline that is going to be shared to indicate that these requirements must be installed.
None
Source code in src/distilabel/pipeline/base.py
def __init__(\n self,\n name: Optional[str] = None,\n description: Optional[str] = None,\n cache_dir: Optional[Union[str, \"PathLike\"]] = None,\n enable_metadata: bool = False,\n requirements: Optional[List[str]] = None,\n) -> None:\n \"\"\"Initialize the `BasePipeline` instance.\n\n Args:\n name: The name of the pipeline. If not generated, a random one will be generated by default.\n description: A description of the pipeline. Defaults to `None`.\n cache_dir: A directory where the pipeline will be cached. Defaults to `None`.\n enable_metadata: Whether to include the distilabel metadata column for the pipeline\n in the final `Distiset`. It contains metadata used by distilabel, for example\n the raw outputs of the `LLM` without processing would be here, inside `raw_output_...`\n field. Defaults to `False`.\n requirements: List of requirements that must be installed to run the pipeline.\n Defaults to `None`, but can be helpful to inform in a pipeline to be shared\n that this requirements must be installed.\n \"\"\"\n self.name = name or _PIPELINE_DEFAULT_NAME\n self.description = description\n self._enable_metadata = enable_metadata\n self.dag = DAG()\n\n if cache_dir:\n self._cache_dir = Path(cache_dir)\n elif env_cache_dir := envs.DISTILABEL_CACHE_DIR:\n self._cache_dir = Path(env_cache_dir)\n else:\n self._cache_dir = constants.PIPELINES_CACHE_DIR\n\n self._logger = logging.getLogger(\"distilabel.pipeline\")\n\n self._batch_manager: Optional[\"_BatchManager\"] = None\n self._write_buffer: Optional[\"_WriteBuffer\"] = None\n self._steps_input_queues: Dict[str, \"Queue\"] = {}\n\n self._steps_load_status: Dict[str, int] = {}\n self._steps_load_status_lock = threading.Lock()\n\n self._stop_called = False\n self._stop_called_lock = threading.Lock()\n self._stop_calls = 0\n\n self._recover_offline_batch_generate_for_step: Union[\n Tuple[str, List[List[Dict[str, Any]]]], None\n ] = None\n\n self._fs: Optional[fsspec.AbstractFileSystem] = None\n self._storage_base_path: Optional[str] = None\n self._use_fs_to_pass_data: bool = False\n self._dry_run = False\n\n self._current_stage = 0\n self._stages_last_batch: List[List[str]] = []\n self._load_groups = []\n\n self.requirements = requirements or []\n\n self._exception: Union[Exception, None] = None\n\n self._log_queue: Union[\"Queue[Any]\", None] = None\n
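A minimal usage sketch of these constructor arguments with the concrete Pipeline class; the name, description and requirement shown are arbitrary examples:

from distilabel.pipeline import Pipeline

pipeline = Pipeline(
    name="synthetic-data-pipeline",          # a random name is generated if omitted
    description="Generate and judge synthetic instructions.",
    cache_dir="./distilabel_cache",          # otherwise DISTILABEL_CACHE_DIR or the default cache dir is used
    enable_metadata=True,                    # include the distilabel metadata column in the final Distiset
    requirements=["sentence-transformers"],  # checked by the pipeline before running
)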
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.__enter__","title":"__enter__()
","text":"Set the global pipeline instance when entering a pipeline context.
Source code in src/distilabel/pipeline/base.py
def __enter__(self) -> Self:\n \"\"\"Set the global pipeline instance when entering a pipeline context.\"\"\"\n _GlobalPipelineManager.set_pipeline(self)\n return self\n
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.__exit__","title":"__exit__(exc_type, exc_value, traceback)
","text":"Unset the global pipeline instance when exiting a pipeline context.
Source code in src/distilabel/pipeline/base.py
def __exit__(self, exc_type, exc_value, traceback) -> None:\n \"\"\"Unset the global pipeline instance when exiting a pipeline context.\"\"\"\n _GlobalPipelineManager.set_pipeline(None)\n self._set_pipeline_name()\n
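Because __enter__ registers the pipeline as the global instance, steps instantiated inside the with block are attached to it without passing the pipeline explicitly. A small sketch using LoadDataFromDicts with a toy record:

from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts

with Pipeline(name="context-manager-example") as pipeline:
    # Created inside the `with` block, so it is registered in `pipeline.dag`.
    loader = LoadDataFromDicts(data=[{"instruction": "What is synthetic data?"}])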
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.run","title":"run(parameters=None, load_groups=None, use_cache=True, storage_parameters=None, use_fs_to_pass_data=False, dataset=None, dataset_batch_size=50, logging_handlers=None)
","text":"Run the pipeline. It will set the runtime parameters for the steps and validate the pipeline.
This method should be extended by the specific pipeline implementation, adding the logic to run the pipeline.
Parameters:
Name Type Description Defaultparameters
Optional[Dict[str, Dict[str, Any]]]
A dictionary with the step name as the key and a dictionary with the runtime parameters for the step as the value. Defaults to None
.
None
load_groups
Optional[LoadGroups]
A list containing lists of steps that have to be loaded together and in isolation with respect to the rest of the steps of the pipeline. This argument also allows passing the following modes:
\"sequential_step_execution\": each step will be executed in its own stage, i.e. the execution of the steps will be sequential.
Defaults to None
.
None
use_cache
bool
Whether to use the cache from previous pipeline runs. Defaults to True
.
True
storage_parameters
Optional[Dict[str, Any]]
A dictionary with the storage parameters (fsspec
and path) that will be used to store the data of the _Batch
es passed between the steps if use_fs_to_pass_data
is True
(for the batches received by a GlobalStep
it will be always used). It must have at least the \"path\" key, and it can contain additional keys depending on the protocol. By default, it will use the local file system and a directory in the cache directory. Defaults to None
.
None
use_fs_to_pass_data
bool
Whether to use the file system to pass the data of the _Batch
es between the steps. Even if this parameter is False
, the Batch
es received by GlobalStep
s will always use the file system to pass the data. Defaults to False
.
False
dataset
Optional[InputDataset]
If given, it will be used to create a GeneratorStep
and put it as the root step. Convenient method when you have already processed the dataset in your script and just want to pass it already processed. Defaults to None
.
None
dataset_batch_size
int
if dataset
is given, this will be the size of the batches yielded by the GeneratorStep
created using the dataset
. Defaults to 50
.
50
logging_handlers
Optional[List[Handler]]
A list of logging handlers that will be used to log the output of the pipeline. This argument can be useful so the logging messages can be extracted and used in a different context. Defaults to None
.
None
Returns:
Type DescriptionDistiset
The Distiset
created by the pipeline.
Source code in src/distilabel/pipeline/base.py
def run(\n self,\n parameters: Optional[Dict[str, Dict[str, Any]]] = None,\n load_groups: Optional[\"LoadGroups\"] = None,\n use_cache: bool = True,\n storage_parameters: Optional[Dict[str, Any]] = None,\n use_fs_to_pass_data: bool = False,\n dataset: Optional[\"InputDataset\"] = None,\n dataset_batch_size: int = 50,\n logging_handlers: Optional[List[logging.Handler]] = None,\n) -> \"Distiset\": # type: ignore\n \"\"\"Run the pipeline. It will set the runtime parameters for the steps and validate\n the pipeline.\n\n This method should be extended by the specific pipeline implementation,\n adding the logic to run the pipeline.\n\n Args:\n parameters: A dictionary with the step name as the key and a dictionary with\n the runtime parameters for the step as the value. Defaults to `None`.\n load_groups: A list containing lists of steps that have to be loaded together\n and in isolation with respect to the rest of the steps of the pipeline.\n This argument also allows passing the following modes:\n\n - \"sequential_step_execution\": each step will be executed in a stage i.e.\n the execution of the steps will be sequential.\n\n Defaults to `None`.\n use_cache: Whether to use the cache from previous pipeline runs. Defaults to\n `True`.\n storage_parameters: A dictionary with the storage parameters (`fsspec` and path)\n that will be used to store the data of the `_Batch`es passed between the\n steps if `use_fs_to_pass_data` is `True` (for the batches received by a\n `GlobalStep` it will be always used). It must have at least the \"path\" key,\n and it can contain additional keys depending on the protocol. By default,\n it will use the local file system and a directory in the cache directory.\n Defaults to `None`.\n use_fs_to_pass_data: Whether to use the file system to pass the data of\n the `_Batch`es between the steps. Even if this parameter is `False`, the\n `Batch`es received by `GlobalStep`s will always use the file system to\n pass the data. Defaults to `False`.\n dataset: If given, it will be used to create a `GeneratorStep` and put it as the\n root step. Convenient method when you have already processed the dataset in\n your script and just want to pass it already processed. Defaults to `None`.\n dataset_batch_size: if `dataset` is given, this will be the size of the batches\n yield by the `GeneratorStep` created using the `dataset`. Defaults to `50`.\n logging_handlers: A list of logging handlers that will be used to log the\n output of the pipeline. This argument can be useful so the logging messages\n can be extracted and used in a different context. Defaults to `None`.\n\n Returns:\n The `Distiset` created by the pipeline.\n \"\"\"\n\n self._exception: Union[Exception, None] = None\n\n # Set the runtime parameters that will be used during the pipeline execution.\n # They are used to generate the signature of the pipeline that is used to hit the\n # cache when the pipeline is run, so it's important to do it first.\n self._set_runtime_parameters(parameters or {})\n\n self._refresh_pipeline_from_cache()\n\n if dataset is not None:\n self._add_dataset_generator_step(dataset, dataset_batch_size)\n\n setup_logging(\n log_queue=self._log_queue,\n filename=str(self._cache_location[\"log_file\"]),\n logging_handlers=logging_handlers,\n )\n\n # Set the name of the pipeline if it's the default one. This should be called\n # if the pipeline is defined within the context manager, and the run is called\n # outside of it. 
Is here in the following case:\n # with Pipeline() as pipeline:\n # pipeline.run()\n self._set_pipeline_name()\n\n # Validate the pipeline DAG to check that all the steps are chainable, there are\n # no missing runtime parameters, batch sizes are correct, load groups are valid,\n # etc.\n self._load_groups = self._built_load_groups(load_groups)\n self._validate()\n\n self._set_pipeline_artifacts_path_in_steps()\n\n # Set the initial load status for all the steps\n self._init_steps_load_status()\n\n # Load the stages status or initialize it\n self._load_stages_status(use_cache)\n\n # Load the `_BatchManager` from cache or create one from scratch\n self._load_batch_manager(use_cache)\n\n # Check pipeline requirements are installed\n self._check_requirements()\n\n # Setup the filesystem that will be used to pass the data of the `_Batch`es\n self._setup_fsspec(storage_parameters)\n self._use_fs_to_pass_data = use_fs_to_pass_data\n\n if self._dry_run:\n self._logger.info(\"\ud83c\udf35 Dry run mode\")\n\n # If the batch manager is not able to generate batches, that means that the loaded\n # `_BatchManager` from cache didn't have any remaining batches to process i.e.\n # the previous pipeline execution was completed successfully.\n if not self._batch_manager.can_generate(): # type: ignore\n self._logger.info(\n \"\ud83d\udcbe Loaded batch manager from cache doesn't contain any remaining data.\"\n \" Returning `Distiset` from cache data...\"\n )\n distiset = create_distiset(\n data_dir=self._cache_location[\"data\"],\n pipeline_path=self._cache_location[\"pipeline\"],\n log_filename_path=self._cache_location[\"log_file\"],\n enable_metadata=self._enable_metadata,\n dag=self.dag,\n )\n stop_logging()\n return distiset\n\n self._setup_write_buffer(use_cache)\n\n self._print_load_stages_info()\n
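A hedged usage sketch of run(); the "text_generation" step name used as a key in parameters is hypothetical and must match a step defined in your pipeline:

distiset = pipeline.run(
    parameters={
        # Runtime parameters are keyed by step name (hypothetical step shown).
        "text_generation": {"llm": {"generation_kwargs": {"temperature": 0.7}}},
    },
    use_cache=True,                 # reuse batches cached by previous executions
    use_fs_to_pass_data=False,      # GlobalSteps still pass their batches through the filesystem
    dataset=[{"instruction": "Write a haiku about data."}],  # becomes the root GeneratorStep
    dataset_batch_size=50,
)
distiset.push_to_hub("my-org/my-synthetic-dataset")  # hypothetical repository id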
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.dry_run","title":"dry_run(parameters=None, batch_size=1, dataset=None)
","text":"Do a dry run to test the pipeline runs as expected.
Running a Pipeline
in dry run mode will set all the batch_size
of generator steps to the specified batch_size
, and run just with a single batch, effectively running the whole pipeline with a single example. The cache will be set to False
.
Parameters:
Name Type Description Defaultparameters
Optional[Dict[str, Dict[str, Any]]]
A dictionary with the step name as the key and a dictionary with the runtime parameters for the step as the value. Defaults to None
.
None
batch_size
int
The batch size of the unique batch generated by the generator steps of the pipeline. Defaults to 1
.
1
dataset
Optional[InputDataset]
If given, it will be used to create a GeneratorStep
and put it as the root step. This is convenient when you have already processed the dataset in your script and just want to pass it in directly. Defaults to None
.
None
Returns:
Type DescriptionDistiset
Will return the Distiset
as the main run method would do.
src/distilabel/pipeline/base.py
def dry_run(\n self,\n parameters: Optional[Dict[str, Dict[str, Any]]] = None,\n batch_size: int = 1,\n dataset: Optional[\"InputDataset\"] = None,\n) -> \"Distiset\":\n \"\"\"Do a dry run to test the pipeline runs as expected.\n\n Running a `Pipeline` in dry run mode will set all the `batch_size` of generator steps\n to the specified `batch_size`, and run just with a single batch, effectively\n running the whole pipeline with a single example. The cache will be set to `False`.\n\n Args:\n parameters: A dictionary with the step name as the key and a dictionary with\n the runtime parameters for the step as the value. Defaults to `None`.\n batch_size: The batch size of the unique batch generated by the generators\n steps of the pipeline. Defaults to `1`.\n dataset: If given, it will be used to create a `GeneratorStep` and put it as the\n root step. Convenient method when you have already processed the dataset in\n your script and just want to pass it already processed. Defaults to `None`.\n\n Returns:\n Will return the `Distiset` as the main run method would do.\n \"\"\"\n self._dry_run = True\n\n for step_name in self.dag:\n step = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n\n if step.is_generator:\n if not parameters:\n parameters = {}\n parameters[step_name] = {\"batch_size\": batch_size}\n\n distiset = self.run(parameters=parameters, use_cache=False, dataset=dataset)\n\n self._dry_run = False\n return distiset\n
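Example: a short sketch of a dry run, assuming `pipeline` is an already defined `Pipeline` instance (for instance the one from the `run` example above); the overridden generation kwargs are illustrative.
```python
# A dry run disables the cache and pushes a single batch of `batch_size`
# examples through every generator step of the pipeline.
distiset = pipeline.dry_run(
    parameters={
        "text_generation": {"llm": {"generation_kwargs": {"temperature": 0.0}}}
    },
    batch_size=1,
)
print(distiset)
```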
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.get_load_stages","title":"get_load_stages(load_groups=None)
","text":"Convenient method to get the load stages of a pipeline.
Parameters:
Name Type Description Defaultload_groups
Optional[LoadGroups]
A list containing lists of steps that have to be loaded together and in isolation with respect to the rest of the steps of the pipeline. Defaults to None
.
None
Returns:
Type DescriptionLoadStages
A tuple with the first element containing a list sorted by stage containing
LoadStages
lists with the names of the steps of the stage, and the second element a list
LoadStages
sorted by stage containing lists with the names of the last steps of the stage.
Source code insrc/distilabel/pipeline/base.py
def get_load_stages(self, load_groups: Optional[\"LoadGroups\"] = None) -> LoadStages:\n \"\"\"Convenient method to get the load stages of a pipeline.\n\n Args:\n load_groups: A list containing list of steps that has to be loaded together\n and in isolation with respect to the rest of the steps of the pipeline.\n Defaults to `None`.\n\n Returns:\n A tuple with the first element containing asorted list by stage containing\n lists with the names of the steps of the stage, and the second element a list\n sorted by stage containing lists with the names of the last steps of the stage.\n \"\"\"\n load_groups = self._built_load_groups(load_groups)\n return self.dag.get_steps_load_stages(load_groups)\n
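Example: a sketch of inspecting the load stages of an already defined `pipeline` object before running it, based on the tuple structure described above.
```python
# First element: step names per stage; second element: last step names per stage.
stages_steps, stages_last_steps = pipeline.get_load_stages()
for stage, (steps, last_steps) in enumerate(zip(stages_steps, stages_last_steps)):
    print(f"Stage {stage}: steps={steps}, last steps={last_steps}")
```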
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.get_runtime_parameters_info","title":"get_runtime_parameters_info()
","text":"Get the runtime parameters for the steps in the pipeline.
Returns:
Type DescriptionPipelineRuntimeParametersInfo
A dictionary with the step name as the key and a list of dictionaries with
PipelineRuntimeParametersInfo
the parameter name and the parameter info as the value.
Source code insrc/distilabel/pipeline/base.py
def get_runtime_parameters_info(self) -> \"PipelineRuntimeParametersInfo\":\n \"\"\"Get the runtime parameters for the steps in the pipeline.\n\n Returns:\n A dictionary with the step name as the key and a list of dictionaries with\n the parameter name and the parameter info as the value.\n \"\"\"\n runtime_parameters = {}\n for step_name in self.dag:\n step: \"_Step\" = self.dag.get_step(step_name)[constants.STEP_ATTR_NAME]\n runtime_parameters[step_name] = step.get_runtime_parameters_info()\n return runtime_parameters\n
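Example: a sketch that lists every runtime parameter of an already defined `pipeline` object, i.e. the values that could later be overridden through `run(parameters=...)`.
```python
# The returned mapping is step name -> list of runtime parameter info dicts.
for step_name, params_info in pipeline.get_runtime_parameters_info().items():
    for param_info in params_info:
        print(step_name, param_info)
```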
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.draw","title":"draw(path='pipeline.png', top_to_bottom=False, show_edge_labels=True)
","text":"Draws the pipeline.
Parameters:
Name Type Description Defaultpath
Optional[Union[str, Path]]
The path to save the image to.
'pipeline.png'
top_to_bottom
bool
Whether to draw the DAG top to bottom. Defaults to False
.
False
show_edge_labels
bool
Whether to show the edge labels. Defaults to True
.
True
Returns:
Type Descriptionstr
The path to the saved image.
Source code insrc/distilabel/pipeline/base.py
def draw(\n self,\n path: Optional[Union[str, Path]] = \"pipeline.png\",\n top_to_bottom: bool = False,\n show_edge_labels: bool = True,\n) -> str:\n \"\"\"\n Draws the pipeline.\n\n Parameters:\n path: The path to save the image to.\n top_to_bottom: Whether to draw the DAG top to bottom. Defaults to `False`.\n show_edge_labels: Whether to show the edge labels. Defaults to `True`.\n\n Returns:\n The path to the saved image.\n \"\"\"\n png = self.dag.draw(\n top_to_bottom=top_to_bottom, show_edge_labels=show_edge_labels\n )\n with open(path, \"wb\") as f:\n f.write(png)\n return path\n
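Example: a sketch of rendering the DAG of an already defined `pipeline` object to disk; the output path is illustrative.
```python
# Render the pipeline DAG to an image file and report where it was written.
image_path = pipeline.draw(
    path="pipeline.png", top_to_bottom=True, show_edge_labels=False
)
print(f"Pipeline DAG saved to: {image_path}")
```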
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.__repr__","title":"__repr__()
","text":"If running in a Jupyter notebook, display an image representing this Pipeline
.
src/distilabel/pipeline/base.py
def __repr__(self) -> str:\n \"\"\"\n If running in a Jupyter notebook, display an image representing this `Pipeline`.\n \"\"\"\n if in_notebook():\n try:\n from IPython.display import Image, display\n\n image_data = self.dag.draw()\n\n display(Image(image_data))\n except Exception:\n pass\n return super().__repr__()\n
"},{"location":"api/pipeline/#distilabel.pipeline.base.BasePipeline.from_dict","title":"from_dict(data)
classmethod
","text":"Create a Pipeline from a dict containing the serialized data.
Note: It's intended for internal use.
Parameters:
Name Type Description Defaultdata
Dict[str, Any]
Dictionary containing the serialized data from a Pipeline.
requiredReturns:
Name Type DescriptionBasePipeline
Self
Pipeline recreated from the dictionary info.
Source code insrc/distilabel/pipeline/base.py
@classmethod\ndef from_dict(cls, data: Dict[str, Any]) -> Self:\n \"\"\"Create a Pipeline from a dict containing the serialized data.\n\n Note:\n It's intended for internal use.\n\n Args:\n data (Dict[str, Any]): Dictionary containing the serialized data from a Pipeline.\n\n Returns:\n BasePipeline: Pipeline recreated from the dictionary info.\n \"\"\"\n name = data[\"pipeline\"][\"name\"]\n description = data[\"pipeline\"].get(\"description\")\n requirements = data.get(\"requirements\", [])\n with cls(name=name, description=description, requirements=requirements) as pipe:\n pipe.dag = DAG.from_dict(data[\"pipeline\"])\n return pipe\n
"},{"location":"api/pipeline/#distilabel.pipeline.local","title":"local
","text":""},{"location":"api/pipeline/#distilabel.pipeline.local.Pipeline","title":"Pipeline
","text":" Bases: BasePipeline
Local pipeline implementation using multiprocessing
.
src/distilabel/pipeline/local.py
class Pipeline(BasePipeline):\n \"\"\"Local pipeline implementation using `multiprocessing`.\"\"\"\n\n def ray(\n self,\n ray_head_node_url: Optional[str] = None,\n ray_init_kwargs: Optional[Dict[str, Any]] = None,\n ) -> RayPipeline:\n \"\"\"Creates a `RayPipeline` using the init parameters of this pipeline. This is a\n convenient method that can be used to \"transform\" one common `Pipeline` to a `RayPipeline`\n and it's mainly used by the CLI.\n\n Args:\n ray_head_node_url: The URL that can be used to connect to the head node of\n the Ray cluster. Normally, you won't want to use this argument as the\n recommended way to submit a job to a Ray cluster is using the [Ray Jobs\n CLI](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html#ray-jobs-overview).\n Defaults to `None`.\n ray_init_kwargs: kwargs that will be passed to the `ray.init` method. Defaults\n to `None`.\n\n Returns:\n A `RayPipeline` instance.\n \"\"\"\n pipeline = RayPipeline(\n name=self.name,\n description=self.description,\n cache_dir=self._cache_dir,\n enable_metadata=self._enable_metadata,\n requirements=self.requirements,\n ray_head_node_url=ray_head_node_url,\n ray_init_kwargs=ray_init_kwargs,\n )\n pipeline.dag = self.dag\n return pipeline\n\n def run(\n self,\n parameters: Optional[Dict[Any, Dict[str, Any]]] = None,\n load_groups: Optional[\"LoadGroups\"] = None,\n use_cache: bool = True,\n storage_parameters: Optional[Dict[str, Any]] = None,\n use_fs_to_pass_data: bool = False,\n dataset: Optional[\"InputDataset\"] = None,\n dataset_batch_size: int = 50,\n logging_handlers: Optional[List[\"logging.Handler\"]] = None,\n ) -> \"Distiset\":\n \"\"\"Runs the pipeline.\n\n Args:\n parameters: A dictionary with the step name as the key and a dictionary with\n the runtime parameters for the step as the value. Defaults to `None`.\n load_groups: A list containing lists of steps that have to be loaded together\n and in isolation with respect to the rest of the steps of the pipeline.\n This argument also allows passing the following modes:\n\n - \"sequential_step_execution\": each step will be executed in a stage i.e.\n the execution of the steps will be sequential.\n\n Defaults to `None`.\n use_cache: Whether to use the cache from previous pipeline runs. Defaults to\n `True`.\n storage_parameters: A dictionary with the storage parameters (`fsspec` and path)\n that will be used to store the data of the `_Batch`es passed between the\n steps if `use_fs_to_pass_data` is `True` (for the batches received by a\n `GlobalStep` it will be always used). It must have at least the \"path\" key,\n and it can contain additional keys depending on the protocol. By default,\n it will use the local file system and a directory in the cache directory.\n Defaults to `None`.\n use_fs_to_pass_data: Whether to use the file system to pass the data of\n the `_Batch`es between the steps. Even if this parameter is `False`, the\n `Batch`es received by `GlobalStep`s will always use the file system to\n pass the data. Defaults to `False`.\n dataset: If given, it will be used to create a `GeneratorStep` and put it as the\n root step. Convenient method when you have already processed the dataset in\n your script and just want to pass it already processed. Defaults to `None`.\n dataset_batch_size: if `dataset` is given, this will be the size of the batches\n yield by the `GeneratorStep` created using the `dataset`. 
Defaults to `50`.\n logging_handlers: A list of logging handlers that will be used to log the\n output of the pipeline. This argument can be useful so the logging messages\n can be extracted and used in a different context. Defaults to `None`.\n\n Returns:\n The `Distiset` created by the pipeline.\n\n Raises:\n RuntimeError: If the pipeline fails to load all the steps.\n \"\"\"\n if script_executed_in_ray_cluster():\n print(\"Script running in Ray cluster... Using `RayPipeline`...\")\n return self.ray().run(\n parameters=parameters,\n use_cache=use_cache,\n storage_parameters=storage_parameters,\n use_fs_to_pass_data=use_fs_to_pass_data,\n dataset=dataset,\n dataset_batch_size=dataset_batch_size,\n )\n\n self._log_queue = cast(\"Queue[Any]\", mp.Queue())\n\n if distiset := super().run(\n parameters=parameters,\n load_groups=load_groups,\n use_cache=use_cache,\n storage_parameters=storage_parameters,\n use_fs_to_pass_data=use_fs_to_pass_data,\n dataset=dataset,\n dataset_batch_size=dataset_batch_size,\n logging_handlers=logging_handlers,\n ):\n return distiset\n\n num_processes = self.dag.get_total_replica_count()\n with (\n mp.Manager() as manager,\n _NoDaemonPool(\n num_processes,\n initializer=_init_worker,\n initargs=(\n self._log_queue,\n self.name,\n self.signature,\n ),\n ) as pool,\n ):\n self._manager = manager\n self._pool = pool\n self._output_queue = self.QueueClass()\n self._load_queue = self.QueueClass()\n self._handle_keyboard_interrupt()\n\n # Run the loop for receiving the load status of each step\n self._load_steps_thread = self._run_load_queue_loop_in_thread()\n\n # Start a loop to receive the output batches from the steps\n self._output_queue_thread = self._run_output_queue_loop_in_thread()\n self._output_queue_thread.join()\n\n self._teardown()\n\n if self._exception:\n raise self._exception\n\n distiset = create_distiset(\n self._cache_location[\"data\"],\n pipeline_path=self._cache_location[\"pipeline\"],\n log_filename_path=self._cache_location[\"log_file\"],\n enable_metadata=self._enable_metadata,\n dag=self.dag,\n )\n\n stop_logging()\n\n return distiset\n\n @property\n def QueueClass(self) -> Callable:\n \"\"\"The callable used to create the input and output queues.\n\n Returns:\n The callable to create a `Queue`.\n \"\"\"\n assert self._manager, \"Manager is not initialized\"\n return self._manager.Queue\n\n def _run_step(self, step: \"_Step\", input_queue: \"Queue[Any]\", replica: int) -> None:\n \"\"\"Runs the `Step` wrapped in a `_ProcessWrapper` in a separate process of the\n `Pool`.\n\n Args:\n step: The step to run.\n input_queue: The input queue to send the data to the step.\n replica: The replica ID assigned.\n \"\"\"\n assert self._pool, \"Pool is not initialized\"\n\n step_wrapper = _StepWrapper(\n step=step, # type: ignore\n replica=replica,\n input_queue=input_queue,\n output_queue=self._output_queue,\n load_queue=self._load_queue,\n dry_run=self._dry_run,\n ray_pipeline=False,\n )\n\n self._pool.apply_async(step_wrapper.run, error_callback=self._error_callback)\n\n def _error_callback(self, e: BaseException) -> None:\n \"\"\"Error callback that will be called when an error occurs in a `Step` process.\n\n Args:\n e: The exception raised by the process.\n \"\"\"\n global _SUBPROCESS_EXCEPTION\n\n # First we check that the exception is a `_StepWrapperException`, otherwise, we\n # print it out and stop the pipeline, since some errors may be unhandled\n if not isinstance(e, _StepWrapperException):\n self._logger.error(f\"\u274c Failed with an unhandled 
exception: {e}\")\n self._stop()\n return\n\n if e.is_load_error:\n self._logger.error(f\"\u274c Failed to load step '{e.step.name}': {e.message}\")\n _SUBPROCESS_EXCEPTION = e.subprocess_exception\n _SUBPROCESS_EXCEPTION.__traceback__ = tblib.Traceback.from_string( # type: ignore\n e.formatted_traceback\n ).as_traceback()\n return\n\n # If the step is global, is not in the last trophic level and has no successors,\n # then we can ignore the error and continue executing the pipeline\n step_name: str = e.step.name # type: ignore\n if (\n e.step.is_global\n and not self.dag.step_in_last_trophic_level(step_name)\n and list(self.dag.get_step_successors(step_name)) == []\n ):\n self._logger.error(\n f\"\u270b An error occurred when running global step '{step_name}' with no\"\n \" successors and not in the last trophic level. Pipeline execution can\"\n f\" continue. Error will be ignored.\"\n )\n self._logger.error(f\"Subprocess traceback:\\n\\n{e.formatted_traceback}\")\n return\n\n # Handle tasks using an `LLM` using offline batch generation\n if isinstance(\n e.subprocess_exception, DistilabelOfflineBatchGenerationNotFinishedException\n ):\n self._logger.info(\n f\"\u23f9\ufe0f '{e.step.name}' task stopped pipeline execution: LLM offline batch\"\n \" generation in progress. Rerun pipeline with cache to check results and\"\n \" continue execution.\"\n )\n self._set_step_for_recovering_offline_batch_generation(e.step, e.data) # type: ignore\n with self._stop_called_lock:\n if not self._stop_called:\n self._stop(acquire_lock=False)\n return\n\n # Global step with successors failed\n self._logger.error(f\"An error occurred in global step '{step_name}'\")\n self._logger.error(f\"Subprocess traceback:\\n\\n{e.formatted_traceback}\")\n\n self._stop()\n\n def _teardown(self) -> None:\n \"\"\"Clean/release/stop resources reserved to run the pipeline.\"\"\"\n if self._write_buffer:\n self._write_buffer.close()\n\n if self._batch_manager:\n self._batch_manager = None\n\n self._stop_load_queue_loop()\n self._load_steps_thread.join()\n\n if self._pool:\n self._pool.terminate()\n self._pool.join()\n\n if self._manager:\n self._manager.shutdown()\n self._manager.join()\n\n def _set_steps_not_loaded_exception(self) -> None:\n \"\"\"Raises a `RuntimeError` notifying that the steps load has failed.\n\n Raises:\n RuntimeError: containing the information and why a step failed to be loaded.\n \"\"\"\n self._exception = RuntimeError(\n \"Failed to load all the steps. Could not run pipeline.\"\n )\n self._exception.__cause__ = _SUBPROCESS_EXCEPTION\n\n def _stop(self, acquire_lock: bool = True) -> None:\n \"\"\"Stops the pipeline execution. It will first send `None` to the input queues\n of all the steps and then wait until the output queue is empty i.e. all the steps\n finished processing the batches that were sent before the stop flag. 
Then it will\n send `None` to the output queue to notify the pipeline to stop.\n\n Args:\n acquire_lock: Whether to acquire the lock to access the `_stop_called` attribute.\n \"\"\"\n\n if acquire_lock:\n self._stop_called_lock.acquire()\n\n if self._stop_called:\n self._stop_calls += 1\n if self._stop_calls == 1:\n self._logger.warning(\"\ud83d\uded1 Press again to force the pipeline to stop.\")\n elif self._stop_calls > 1:\n self._logger.warning(\"\ud83d\uded1 Forcing pipeline interruption.\")\n\n if self._pool:\n self._pool.terminate()\n self._pool.join()\n self._pool = None\n\n if self._manager:\n self._manager.shutdown()\n self._manager.join()\n self._manager = None\n\n stop_logging()\n\n sys.exit(1)\n\n return\n self._stop_called = True\n\n if acquire_lock:\n self._stop_called_lock.release()\n\n self._logger.debug(\n f\"Steps loaded before calling `stop`: {self._steps_load_status}\"\n )\n self._logger.info(\n \"\ud83d\uded1 Stopping pipeline. Waiting for steps to finish processing batches...\"\n )\n\n self._stop_output_queue_loop()\n
"},{"location":"api/pipeline/#distilabel.pipeline.local.Pipeline.QueueClass","title":"QueueClass
property
","text":"The callable used to create the input and output queues.
Returns:
Type DescriptionCallable
The callable to create a Queue
.
ray(ray_head_node_url=None, ray_init_kwargs=None)
","text":"Creates a RayPipeline
using the init parameters of this pipeline. This is a convenience method that can be used to \"transform\" a regular Pipeline
to a RayPipeline
and it's mainly used by the CLI.
Parameters:
Name Type Description Defaultray_head_node_url
Optional[str]
The URL that can be used to connect to the head node of the Ray cluster. Normally, you won't want to use this argument as the recommended way to submit a job to a Ray cluster is using the Ray Jobs CLI. Defaults to None
.
None
ray_init_kwargs
Optional[Dict[str, Any]]
kwargs that will be passed to the ray.init
method. Defaults to None
.
None
Returns:
Type DescriptionRayPipeline
A RayPipeline
instance.
src/distilabel/pipeline/local.py
def ray(\n self,\n ray_head_node_url: Optional[str] = None,\n ray_init_kwargs: Optional[Dict[str, Any]] = None,\n) -> RayPipeline:\n \"\"\"Creates a `RayPipeline` using the init parameters of this pipeline. This is a\n convenient method that can be used to \"transform\" one common `Pipeline` to a `RayPipeline`\n and it's mainly used by the CLI.\n\n Args:\n ray_head_node_url: The URL that can be used to connect to the head node of\n the Ray cluster. Normally, you won't want to use this argument as the\n recommended way to submit a job to a Ray cluster is using the [Ray Jobs\n CLI](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html#ray-jobs-overview).\n Defaults to `None`.\n ray_init_kwargs: kwargs that will be passed to the `ray.init` method. Defaults\n to `None`.\n\n Returns:\n A `RayPipeline` instance.\n \"\"\"\n pipeline = RayPipeline(\n name=self.name,\n description=self.description,\n cache_dir=self._cache_dir,\n enable_metadata=self._enable_metadata,\n requirements=self.requirements,\n ray_head_node_url=ray_head_node_url,\n ray_init_kwargs=ray_init_kwargs,\n )\n pipeline.dag = self.dag\n return pipeline\n
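Example: a sketch of converting an already defined local `pipeline` object into a `RayPipeline`; the `ray_init_kwargs` contents are an assumption and depend on your Ray setup.
```python
# Build a `RayPipeline` reusing this pipeline's DAG and init parameters, then run it.
ray_pipeline = pipeline.ray(ray_init_kwargs={"runtime_env": {"pip": ["distilabel"]}})
distiset = ray_pipeline.run(use_cache=False)
```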
"},{"location":"api/pipeline/#distilabel.pipeline.local.Pipeline.run","title":"run(parameters=None, load_groups=None, use_cache=True, storage_parameters=None, use_fs_to_pass_data=False, dataset=None, dataset_batch_size=50, logging_handlers=None)
","text":"Runs the pipeline.
Parameters:
Name Type Description Defaultparameters
Optional[Dict[Any, Dict[str, Any]]]
A dictionary with the step name as the key and a dictionary with the runtime parameters for the step as the value. Defaults to None
.
None
load_groups
Optional[LoadGroups]
A list containing lists of steps that have to be loaded together and in isolation with respect to the rest of the steps of the pipeline. This argument also allows passing the following modes:
Defaults to None
.
None
use_cache
bool
Whether to use the cache from previous pipeline runs. Defaults to True
.
True
storage_parameters
Optional[Dict[str, Any]]
A dictionary with the storage parameters (fsspec
and path) that will be used to store the data of the _Batch
es passed between the steps if use_fs_to_pass_data
is True
(for the batches received by a GlobalStep
it will be always used). It must have at least the \"path\" key, and it can contain additional keys depending on the protocol. By default, it will use the local file system and a directory in the cache directory. Defaults to None
.
None
use_fs_to_pass_data
bool
Whether to use the file system to pass the data of the _Batch
es between the steps. Even if this parameter is False
, the Batch
es received by GlobalStep
s will always use the file system to pass the data. Defaults to False
.
False
dataset
Optional[InputDataset]
If given, it will be used to create a GeneratorStep
and put it as the root step. This is convenient when you have already processed the dataset in your script and just want to pass it in directly. Defaults to None
.
None
dataset_batch_size
int
if dataset
is given, this will be the size of the batches yielded by the GeneratorStep
created using the dataset
. Defaults to 50
.
50
logging_handlers
Optional[List[Handler]]
A list of logging handlers that will be used to log the output of the pipeline. This argument can be useful so the logging messages can be extracted and used in a different context. Defaults to None
.
None
Returns:
Type DescriptionDistiset
The Distiset
created by the pipeline.
Raises:
Type DescriptionRuntimeError
If the pipeline fails to load all the steps.
Source code insrc/distilabel/pipeline/local.py
def run(\n self,\n parameters: Optional[Dict[Any, Dict[str, Any]]] = None,\n load_groups: Optional[\"LoadGroups\"] = None,\n use_cache: bool = True,\n storage_parameters: Optional[Dict[str, Any]] = None,\n use_fs_to_pass_data: bool = False,\n dataset: Optional[\"InputDataset\"] = None,\n dataset_batch_size: int = 50,\n logging_handlers: Optional[List[\"logging.Handler\"]] = None,\n) -> \"Distiset\":\n \"\"\"Runs the pipeline.\n\n Args:\n parameters: A dictionary with the step name as the key and a dictionary with\n the runtime parameters for the step as the value. Defaults to `None`.\n load_groups: A list containing lists of steps that have to be loaded together\n and in isolation with respect to the rest of the steps of the pipeline.\n This argument also allows passing the following modes:\n\n - \"sequential_step_execution\": each step will be executed in a stage i.e.\n the execution of the steps will be sequential.\n\n Defaults to `None`.\n use_cache: Whether to use the cache from previous pipeline runs. Defaults to\n `True`.\n storage_parameters: A dictionary with the storage parameters (`fsspec` and path)\n that will be used to store the data of the `_Batch`es passed between the\n steps if `use_fs_to_pass_data` is `True` (for the batches received by a\n `GlobalStep` it will be always used). It must have at least the \"path\" key,\n and it can contain additional keys depending on the protocol. By default,\n it will use the local file system and a directory in the cache directory.\n Defaults to `None`.\n use_fs_to_pass_data: Whether to use the file system to pass the data of\n the `_Batch`es between the steps. Even if this parameter is `False`, the\n `Batch`es received by `GlobalStep`s will always use the file system to\n pass the data. Defaults to `False`.\n dataset: If given, it will be used to create a `GeneratorStep` and put it as the\n root step. Convenient method when you have already processed the dataset in\n your script and just want to pass it already processed. Defaults to `None`.\n dataset_batch_size: if `dataset` is given, this will be the size of the batches\n yield by the `GeneratorStep` created using the `dataset`. Defaults to `50`.\n logging_handlers: A list of logging handlers that will be used to log the\n output of the pipeline. This argument can be useful so the logging messages\n can be extracted and used in a different context. Defaults to `None`.\n\n Returns:\n The `Distiset` created by the pipeline.\n\n Raises:\n RuntimeError: If the pipeline fails to load all the steps.\n \"\"\"\n if script_executed_in_ray_cluster():\n print(\"Script running in Ray cluster... 
Using `RayPipeline`...\")\n return self.ray().run(\n parameters=parameters,\n use_cache=use_cache,\n storage_parameters=storage_parameters,\n use_fs_to_pass_data=use_fs_to_pass_data,\n dataset=dataset,\n dataset_batch_size=dataset_batch_size,\n )\n\n self._log_queue = cast(\"Queue[Any]\", mp.Queue())\n\n if distiset := super().run(\n parameters=parameters,\n load_groups=load_groups,\n use_cache=use_cache,\n storage_parameters=storage_parameters,\n use_fs_to_pass_data=use_fs_to_pass_data,\n dataset=dataset,\n dataset_batch_size=dataset_batch_size,\n logging_handlers=logging_handlers,\n ):\n return distiset\n\n num_processes = self.dag.get_total_replica_count()\n with (\n mp.Manager() as manager,\n _NoDaemonPool(\n num_processes,\n initializer=_init_worker,\n initargs=(\n self._log_queue,\n self.name,\n self.signature,\n ),\n ) as pool,\n ):\n self._manager = manager\n self._pool = pool\n self._output_queue = self.QueueClass()\n self._load_queue = self.QueueClass()\n self._handle_keyboard_interrupt()\n\n # Run the loop for receiving the load status of each step\n self._load_steps_thread = self._run_load_queue_loop_in_thread()\n\n # Start a loop to receive the output batches from the steps\n self._output_queue_thread = self._run_output_queue_loop_in_thread()\n self._output_queue_thread.join()\n\n self._teardown()\n\n if self._exception:\n raise self._exception\n\n distiset = create_distiset(\n self._cache_location[\"data\"],\n pipeline_path=self._cache_location[\"pipeline\"],\n log_filename_path=self._cache_location[\"log_file\"],\n enable_metadata=self._enable_metadata,\n dag=self.dag,\n )\n\n stop_logging()\n\n return distiset\n
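Example: a sketch of running a local `Pipeline` while passing batches between steps through the file system; the storage path is illustrative and any extra `storage_parameters` keys depend on the `fsspec` protocol used.
```python
# Batches exchanged between steps are written to and read from the given path
# instead of being sent only through in-memory queues.
distiset = pipeline.run(
    use_fs_to_pass_data=True,
    storage_parameters={"path": "/tmp/distilabel-batches"},
    use_cache=False,
)
```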
"},{"location":"api/pipeline/routing_batch_function/","title":"Routing batch function","text":""},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function","title":"routing_batch_function
","text":""},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.RoutingBatchFunc","title":"RoutingBatchFunc = Callable[[List[str]], List[str]]
module-attribute
","text":"Type alias for a routing batch function. It takes a list of all the downstream steps and returns a list with the names of the steps that should receive the batch.
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.RoutingBatchFunction","title":"RoutingBatchFunction
","text":" Bases: BaseModel
, _Serializable
A thin wrapper around a routing batch function that can be used to route batches from one upstream step to specific downstream steps.
Attributes:
Name Type Descriptionrouting_function
RoutingBatchFunc
The routing function that takes a list of all the downstream steps and returns a list with the names of the steps that should receive the batch.
_step
Union[_Step, None]
The upstream step that is connected to the routing batch function.
_routed_batch_registry
Dict[str, Dict[int, List[str]]]
A dictionary that keeps track of the batches that have been routed to specific downstream steps.
Source code insrc/distilabel/pipeline/routing_batch_function.py
class RoutingBatchFunction(BaseModel, _Serializable):\n \"\"\"A thin wrapper around a routing batch function that can be used to route batches\n from one upstream step to specific downstream steps.\n\n Attributes:\n routing_function: The routing function that takes a list of all the downstream steps\n and returns a list with the names of the steps that should receive the batch.\n _step: The upstream step that is connected to the routing batch function.\n _routed_batch_registry: A dictionary that keeps track of the batches that have been\n routed to specific downstream steps.\n \"\"\"\n\n routing_function: RoutingBatchFunc\n description: Optional[str] = None\n\n _step: Union[\"_Step\", None] = PrivateAttr(default=None)\n _routed_batch_registry: Dict[str, Dict[int, List[str]]] = PrivateAttr(\n default_factory=dict\n )\n _factory_function_module: Union[str, None] = PrivateAttr(default=None)\n _factory_function_name: Union[str, None] = PrivateAttr(default=None)\n _factory_function_kwargs: Union[Dict[str, Any], None] = PrivateAttr(default=None)\n\n def route_batch(self, batch: \"_Batch\", steps: List[str]) -> List[str]:\n \"\"\"Returns a list of selected downstream steps from `steps` to which the `batch`\n should be routed.\n\n Args:\n batch: The batch that should be routed.\n steps: A list of all the downstream steps that can receive the batch.\n\n Returns:\n A list with the names of the steps that should receive the batch.\n \"\"\"\n routed_steps = self.routing_function(steps)\n self._register_routed_batch(batch, routed_steps)\n return routed_steps\n\n def set_factory_function(\n self,\n factory_function_module: str,\n factory_function_name: str,\n factory_function_kwargs: Dict[str, Any],\n ) -> None:\n \"\"\"Sets the factory function that was used to create the `routing_batch_function`.\n\n Args:\n factory_function_module: The module name where the factory function is defined.\n factory_function_name: The name of the factory function that was used to create\n the `routing_batch_function`.\n factory_function_kwargs: The keyword arguments that were used when calling the\n factory function.\n \"\"\"\n self._factory_function_module = factory_function_module\n self._factory_function_name = factory_function_name\n self._factory_function_kwargs = factory_function_kwargs\n\n def __call__(self, batch: \"_Batch\", steps: List[str]) -> List[str]:\n \"\"\"Returns a list of selected downstream steps from `steps` to which the `batch`\n should be routed.\n\n Args:\n batch: The batch that should be routed.\n steps: A list of all the downstream steps that can receive the batch.\n\n Returns:\n A list with the names of the steps that should receive the batch.\n \"\"\"\n return self.route_batch(batch, steps)\n\n def _register_routed_batch(self, batch: \"_Batch\", routed_steps: List[str]) -> None:\n \"\"\"Registers a batch that has been routed to specific downstream steps.\n\n Args:\n batch: The batch that has been routed.\n routed_steps: The list of downstream steps that have been selected to receive\n the batch.\n \"\"\"\n upstream_step = batch.step_name\n batch_seq_no = batch.seq_no\n self._routed_batch_registry.setdefault(upstream_step, {}).setdefault(\n batch_seq_no, routed_steps\n )\n\n def __rshift__(\n self, other: List[\"DownstreamConnectableSteps\"]\n ) -> List[\"DownstreamConnectableSteps\"]:\n \"\"\"Connects a list of dowstream steps to the upstream step of the routing batch\n function.\n\n Args:\n other: A list of downstream steps that should be connected to the upstream step\n of the routing batch 
function.\n\n Returns:\n The list of downstream steps that have been connected to the upstream step of the\n routing batch function.\n \"\"\"\n if not isinstance(other, list):\n raise DistilabelUserError(\n f\"Can only set a `routing_batch_function` for a list of steps. Got: {other}.\"\n \" Please, review the right-hand side of the `routing_batch_function >> other`\"\n \" expression. It should be\"\n \" `upstream_step >> routing_batch_function >> [downstream_step_1, dowstream_step_2, ...]`.\",\n page=\"sections/how_to_guides/basic/pipeline/?h=routing#routing-batches-to-specific-downstream-steps\",\n )\n\n if not self._step:\n raise DistilabelUserError(\n \"Routing batch function doesn't have an upstream step. Cannot connect downstream\"\n \" steps before connecting the upstream step. Connect this routing batch\"\n \" function to an upstream step using the `>>` operator. For example:\"\n \" `upstream_step >> routing_batch_function >> [downstream_step_1, downstream_step_2, ...]`.\",\n page=\"sections/how_to_guides/basic/pipeline/?h=routing#routing-batches-to-specific-downstream-steps\",\n )\n\n for step in other:\n self._step.connect(step)\n return other\n\n def dump(self, **kwargs: Any) -> Dict[str, Any]:\n \"\"\"Dumps the routing batch function to a dictionary, and the information of the\n factory function used to create this routing batch function.\n\n Args:\n **kwargs: Additional keyword arguments that should be included in the dump.\n\n Returns:\n A dictionary with the routing batch function information and the factory function\n information.\n \"\"\"\n dump_info: Dict[str, Any] = {\"step\": self._step.name} # type: ignore\n\n if self.description:\n dump_info[\"description\"] = self.description\n\n if type_info := self._get_type_info():\n dump_info[TYPE_INFO_KEY] = type_info\n\n return dump_info\n\n def _get_type_info(self) -> Dict[str, Any]:\n \"\"\"Returns the information of the factory function used to create the routing batch\n function.\n\n Returns:\n A dictionary with the factory function information.\n \"\"\"\n\n type_info = {}\n\n if self._factory_function_module:\n type_info[\"module\"] = self._factory_function_module\n\n if self._factory_function_name:\n type_info[\"name\"] = self._factory_function_name\n\n if self._factory_function_kwargs:\n type_info[\"kwargs\"] = self._factory_function_kwargs\n\n return type_info\n\n @classmethod\n def from_dict(cls, data: Dict[str, Any]) -> Self:\n \"\"\"Loads a routing batch function from a dictionary. It must contain the information\n of the factory function used to create the routing batch function.\n\n Args:\n data: A dictionary with the routing batch function information and the factory\n function information.\n \"\"\"\n type_info = data.get(TYPE_INFO_KEY)\n if not type_info:\n step = data.get(\"step\")\n raise ValueError(\n f\"The routing batch function for step '{step}' was created without a factory\"\n \" function, and it cannot be reconstructed.\"\n )\n\n module = type_info.get(\"module\")\n name = type_info.get(\"name\")\n kwargs = type_info.get(\"kwargs\")\n\n if not module or not name or not kwargs:\n raise ValueError(\n \"The routing batch function was created with a factory function, but the\"\n \" information is incomplete. 
Cannot reconstruct the routing batch function.\"\n )\n\n routing_batch_function = _get_module_attr(module=module, name=name)(**kwargs)\n routing_batch_function.description = data.get(\"description\")\n routing_batch_function.set_factory_function(\n factory_function_module=module,\n factory_function_name=name,\n factory_function_kwargs=kwargs,\n )\n\n return routing_batch_function\n
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.RoutingBatchFunction.route_batch","title":"route_batch(batch, steps)
","text":"Returns a list of selected downstream steps from steps
to which the batch
should be routed.
Parameters:
Name Type Description Defaultbatch
_Batch
The batch that should be routed.
requiredsteps
List[str]
A list of all the downstream steps that can receive the batch.
requiredReturns:
Type DescriptionList[str]
A list with the names of the steps that should receive the batch.
Source code insrc/distilabel/pipeline/routing_batch_function.py
def route_batch(self, batch: \"_Batch\", steps: List[str]) -> List[str]:\n \"\"\"Returns a list of selected downstream steps from `steps` to which the `batch`\n should be routed.\n\n Args:\n batch: The batch that should be routed.\n steps: A list of all the downstream steps that can receive the batch.\n\n Returns:\n A list with the names of the steps that should receive the batch.\n \"\"\"\n routed_steps = self.routing_function(steps)\n self._register_routed_batch(batch, routed_steps)\n return routed_steps\n
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.RoutingBatchFunction.set_factory_function","title":"set_factory_function(factory_function_module, factory_function_name, factory_function_kwargs)
","text":"Sets the factory function that was used to create the routing_batch_function
.
Parameters:
Name Type Description Defaultfactory_function_module
str
The module name where the factory function is defined.
requiredfactory_function_name
str
The name of the factory function that was used to create the routing_batch_function
.
factory_function_kwargs
Dict[str, Any]
The keyword arguments that were used when calling the factory function.
required Source code insrc/distilabel/pipeline/routing_batch_function.py
def set_factory_function(\n self,\n factory_function_module: str,\n factory_function_name: str,\n factory_function_kwargs: Dict[str, Any],\n) -> None:\n \"\"\"Sets the factory function that was used to create the `routing_batch_function`.\n\n Args:\n factory_function_module: The module name where the factory function is defined.\n factory_function_name: The name of the factory function that was used to create\n the `routing_batch_function`.\n factory_function_kwargs: The keyword arguments that were used when calling the\n factory function.\n \"\"\"\n self._factory_function_module = factory_function_module\n self._factory_function_name = factory_function_name\n self._factory_function_kwargs = factory_function_kwargs\n
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.RoutingBatchFunction.__call__","title":"__call__(batch, steps)
","text":"Returns a list of selected downstream steps from steps
to which the batch
should be routed.
Parameters:
Name Type Description Defaultbatch
_Batch
The batch that should be routed.
requiredsteps
List[str]
A list of all the downstream steps that can receive the batch.
requiredReturns:
Type DescriptionList[str]
A list with the names of the steps that should receive the batch.
Source code insrc/distilabel/pipeline/routing_batch_function.py
def __call__(self, batch: \"_Batch\", steps: List[str]) -> List[str]:\n \"\"\"Returns a list of selected downstream steps from `steps` to which the `batch`\n should be routed.\n\n Args:\n batch: The batch that should be routed.\n steps: A list of all the downstream steps that can receive the batch.\n\n Returns:\n A list with the names of the steps that should receive the batch.\n \"\"\"\n return self.route_batch(batch, steps)\n
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.RoutingBatchFunction.__rshift__","title":"__rshift__(other)
","text":"Connects a list of dowstream steps to the upstream step of the routing batch function.
Parameters:
Name Type Description Defaultother
List[DownstreamConnectableSteps]
A list of downstream steps that should be connected to the upstream step of the routing batch function.
requiredReturns:
Type DescriptionList[DownstreamConnectableSteps]
The list of downstream steps that have been connected to the upstream step of the
List[DownstreamConnectableSteps]
routing batch function.
Source code insrc/distilabel/pipeline/routing_batch_function.py
def __rshift__(\n self, other: List[\"DownstreamConnectableSteps\"]\n) -> List[\"DownstreamConnectableSteps\"]:\n \"\"\"Connects a list of dowstream steps to the upstream step of the routing batch\n function.\n\n Args:\n other: A list of downstream steps that should be connected to the upstream step\n of the routing batch function.\n\n Returns:\n The list of downstream steps that have been connected to the upstream step of the\n routing batch function.\n \"\"\"\n if not isinstance(other, list):\n raise DistilabelUserError(\n f\"Can only set a `routing_batch_function` for a list of steps. Got: {other}.\"\n \" Please, review the right-hand side of the `routing_batch_function >> other`\"\n \" expression. It should be\"\n \" `upstream_step >> routing_batch_function >> [downstream_step_1, dowstream_step_2, ...]`.\",\n page=\"sections/how_to_guides/basic/pipeline/?h=routing#routing-batches-to-specific-downstream-steps\",\n )\n\n if not self._step:\n raise DistilabelUserError(\n \"Routing batch function doesn't have an upstream step. Cannot connect downstream\"\n \" steps before connecting the upstream step. Connect this routing batch\"\n \" function to an upstream step using the `>>` operator. For example:\"\n \" `upstream_step >> routing_batch_function >> [downstream_step_1, downstream_step_2, ...]`.\",\n page=\"sections/how_to_guides/basic/pipeline/?h=routing#routing-batches-to-specific-downstream-steps\",\n )\n\n for step in other:\n self._step.connect(step)\n return other\n
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.RoutingBatchFunction.dump","title":"dump(**kwargs)
","text":"Dumps the routing batch function to a dictionary, and the information of the factory function used to create this routing batch function.
Parameters:
Name Type Description Default**kwargs
Any
Additional keyword arguments that should be included in the dump.
{}
Returns:
Type DescriptionDict[str, Any]
A dictionary with the routing batch function information and the factory function
Dict[str, Any]
information.
Source code insrc/distilabel/pipeline/routing_batch_function.py
def dump(self, **kwargs: Any) -> Dict[str, Any]:\n \"\"\"Dumps the routing batch function to a dictionary, and the information of the\n factory function used to create this routing batch function.\n\n Args:\n **kwargs: Additional keyword arguments that should be included in the dump.\n\n Returns:\n A dictionary with the routing batch function information and the factory function\n information.\n \"\"\"\n dump_info: Dict[str, Any] = {\"step\": self._step.name} # type: ignore\n\n if self.description:\n dump_info[\"description\"] = self.description\n\n if type_info := self._get_type_info():\n dump_info[TYPE_INFO_KEY] = type_info\n\n return dump_info\n
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.RoutingBatchFunction.from_dict","title":"from_dict(data)
classmethod
","text":"Loads a routing batch function from a dictionary. It must contain the information of the factory function used to create the routing batch function.
Parameters:
Name Type Description Defaultdata
Dict[str, Any]
A dictionary with the routing batch function information and the factory function information.
required Source code insrc/distilabel/pipeline/routing_batch_function.py
@classmethod\ndef from_dict(cls, data: Dict[str, Any]) -> Self:\n \"\"\"Loads a routing batch function from a dictionary. It must contain the information\n of the factory function used to create the routing batch function.\n\n Args:\n data: A dictionary with the routing batch function information and the factory\n function information.\n \"\"\"\n type_info = data.get(TYPE_INFO_KEY)\n if not type_info:\n step = data.get(\"step\")\n raise ValueError(\n f\"The routing batch function for step '{step}' was created without a factory\"\n \" function, and it cannot be reconstructed.\"\n )\n\n module = type_info.get(\"module\")\n name = type_info.get(\"name\")\n kwargs = type_info.get(\"kwargs\")\n\n if not module or not name or not kwargs:\n raise ValueError(\n \"The routing batch function was created with a factory function, but the\"\n \" information is incomplete. Cannot reconstruct the routing batch function.\"\n )\n\n routing_batch_function = _get_module_attr(module=module, name=name)(**kwargs)\n routing_batch_function.description = data.get(\"description\")\n routing_batch_function.set_factory_function(\n factory_function_module=module,\n factory_function_name=name,\n factory_function_kwargs=kwargs,\n )\n\n return routing_batch_function\n
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.routing_batch_function","title":"routing_batch_function(description=None)
","text":"Creates a routing batch function that can be used to route batches from one upstream step to specific downstream steps.
Parameters:
Name Type Description Defaultdescription
Optional[str]
An optional description for the routing batch function.
None
Returns:
Type DescriptionCallable[[RoutingBatchFunc], RoutingBatchFunction]
A RoutingBatchFunction
instance that can be used with the >>
operators and with
Callable[[RoutingBatchFunc], RoutingBatchFunction]
the Pipeline.connect
method when defining the pipeline.
Example:
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline, routing_batch_function\nfrom distilabel.steps import LoadDataFromHub, GroupColumns\n\n\n@routing_batch_function\ndef random_routing_batch(steps: List[str]) -> List[str]:\n return random.sample(steps, 2)\n\n\nwith Pipeline(name=\"routing-batch-function\") as pipeline:\n load_data = LoadDataFromHub()\n\n generations = []\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n task = TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n generations.append(task)\n\n combine_columns = GroupColumns(columns=[\"generation\", \"model_name\"])\n\n load_data >> random_routing_batch >> generations >> combine_columns\n
Source code in src/distilabel/pipeline/routing_batch_function.py
def routing_batch_function(\n description: Optional[str] = None,\n) -> Callable[[RoutingBatchFunc], RoutingBatchFunction]:\n \"\"\"Creates a routing batch function that can be used to route batches from one upstream\n step to specific downstream steps.\n\n Args:\n description: An optional description for the routing batch function.\n\n Returns:\n A `RoutingBatchFunction` instance that can be used with the `>>` operators and with\n the `Pipeline.connect` method when defining the pipeline.\n\n Example:\n\n ```python\n from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\n from distilabel.pipeline import Pipeline, routing_batch_function\n from distilabel.steps import LoadDataFromHub, GroupColumns\n\n\n @routing_batch_function\n def random_routing_batch(steps: List[str]) -> List[str]:\n return random.sample(steps, 2)\n\n\n with Pipeline(name=\"routing-batch-function\") as pipeline:\n load_data = LoadDataFromHub()\n\n generations = []\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n task = TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n generations.append(task)\n\n combine_columns = GroupColumns(columns=[\"generation\", \"model_name\"])\n\n load_data >> random_routing_batch >> generations >> combine_columns\n ```\n \"\"\"\n\n def decorator(func: RoutingBatchFunc) -> RoutingBatchFunction:\n factory_function_name, factory_function_module, factory_function_kwargs = (\n None,\n None,\n None,\n )\n\n # Check if `routing_batch_function` was created using a factory function from an installed package\n stack = inspect.stack()\n if len(stack) > 2:\n factory_function_frame_info = stack[1]\n\n # Function factory path\n if factory_function_frame_info.function != \"<module>\":\n factory_function_name = factory_function_frame_info.function\n factory_function_module = inspect.getmodule(\n factory_function_frame_info.frame\n ).__name__ # type: ignore\n\n # Function factory kwargs\n factory_function_kwargs = factory_function_frame_info.frame.f_locals\n\n routing_batch_function = RoutingBatchFunction(\n routing_function=func,\n description=description,\n )\n\n if (\n factory_function_module\n and factory_function_name\n and factory_function_kwargs\n ):\n routing_batch_function.set_factory_function(\n factory_function_module=factory_function_module,\n factory_function_name=factory_function_name,\n factory_function_kwargs=factory_function_kwargs,\n )\n\n return routing_batch_function\n\n return decorator\n
"},{"location":"api/pipeline/routing_batch_function/#distilabel.pipeline.routing_batch_function.sample_n_steps","title":"sample_n_steps(n)
","text":"A simple function that creates a routing batch function that samples n
steps from the list of all the downstream steps.
Parameters:
Name Type Description Defaultn
int
The number of steps to sample from the list of all the downstream steps.
requiredReturns:
Type DescriptionRoutingBatchFunction
A RoutingBatchFunction
instance that can be used with the >>
operators and with
RoutingBatchFunction
the Pipeline.connect
method when defining the pipeline.
Example:
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline, sample_n_steps\nfrom distilabel.steps import LoadDataFromHub, GroupColumns\n\nrandom_routing_batch = sample_n_steps(2)\n\n\nwith Pipeline(name=\"routing-batch-function\") as pipeline:\n load_data = LoadDataFromHub()\n\n generations = []\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n task = TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n generations.append(task)\n\n combine_columns = GroupColumns(columns=[\"generation\", \"model_name\"])\n\n load_data >> random_routing_batch >> generations >> combine_columns\n
Source code in src/distilabel/pipeline/routing_batch_function.py
def sample_n_steps(n: int) -> RoutingBatchFunction:\n \"\"\"A simple function that creates a routing batch function that samples `n` steps from\n the list of all the downstream steps.\n\n Args:\n n: The number of steps to sample from the list of all the downstream steps.\n\n Returns:\n A `RoutingBatchFunction` instance that can be used with the `>>` operators and with\n the `Pipeline.connect` method when defining the pipeline.\n\n Example:\n\n ```python\n from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\n from distilabel.pipeline import Pipeline, sample_n_steps\n from distilabel.steps import LoadDataFromHub, GroupColumns\n\n random_routing_batch = sample_n_steps(2)\n\n\n with Pipeline(name=\"routing-batch-function\") as pipeline:\n load_data = LoadDataFromHub()\n\n generations = []\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n task = TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n generations.append(task)\n\n combine_columns = GroupColumns(columns=[\"generation\", \"model_name\"])\n\n load_data >> random_routing_batch >> generations >> combine_columns\n ```\n \"\"\"\n\n @routing_batch_function(\n description=f\"Sample {n} steps from the list of downstream steps.\"\n )\n def sample_n(steps: List[str]) -> List[str]:\n return random.sample(steps, n)\n\n return sample_n\n
"},{"location":"api/pipeline/step_wrapper/","title":"Step Wrapper","text":""},{"location":"api/pipeline/step_wrapper/#distilabel.pipeline.step_wrapper._StepWrapper","title":"_StepWrapper
","text":"Wrapper to run the Step
.
Attributes:
Name Type Descriptionstep
The step to run.
replica
The replica ID assigned.
input_queue
The queue to receive the input data.
output_queue
The queue to send the output data.
load_queue
The queue used to notify the main process that the step has been loaded, has been unloaded or has failed to load.
Source code insrc/distilabel/pipeline/step_wrapper.py
class _StepWrapper:\n \"\"\"Wrapper to run the `Step`.\n\n Attributes:\n step: The step to run.\n replica: The replica ID assigned.\n input_queue: The queue to receive the input data.\n output_queue: The queue to send the output data.\n load_queue: The queue used to notify the main process that the step has been loaded,\n has been unloaded or has failed to load.\n \"\"\"\n\n def __init__(\n self,\n step: Union[\"Step\", \"GeneratorStep\"],\n replica: int,\n input_queue: \"Queue[_Batch]\",\n output_queue: \"Queue[_Batch]\",\n load_queue: \"Queue[Union[StepLoadStatus, None]]\",\n dry_run: bool = False,\n ray_pipeline: bool = False,\n ) -> None:\n \"\"\"Initializes the `_ProcessWrapper`.\n\n Args:\n step: The step to run.\n input_queue: The queue to receive the input data.\n output_queue: The queue to send the output data.\n load_queue: The queue used to notify the main process that the step has been\n loaded, has been unloaded or has failed to load.\n dry_run: Flag to ensure we are forcing to run the last batch.\n ray_pipeline: Whether the step is running a `RayPipeline` or not.\n \"\"\"\n self.step = step\n self.replica = replica\n self.input_queue = input_queue\n self.output_queue = output_queue\n self.load_queue = load_queue\n self.dry_run = dry_run\n self.ray_pipeline = ray_pipeline\n\n self._init_cuda_device_placement()\n\n def _init_cuda_device_placement(self) -> None:\n \"\"\"Sets the LLM identifier and the number of desired GPUs of the `CudaDevicePlacementMixin`\"\"\"\n\n def _init_cuda_device_placement_mixin(attr: CudaDevicePlacementMixin) -> None:\n if self.ray_pipeline:\n attr.disable_cuda_device_placement = True\n else:\n desired_num_gpus = self.step.resources.gpus or 1\n attr._llm_identifier = f\"{self.step.name}-replica-{self.replica}\"\n attr._desired_num_gpus = desired_num_gpus\n\n for field_name in self.step.model_fields_set:\n attr = getattr(self.step, field_name)\n if isinstance(attr, CudaDevicePlacementMixin):\n _init_cuda_device_placement_mixin(attr)\n\n if isinstance(self.step, CudaDevicePlacementMixin):\n _init_cuda_device_placement_mixin(self.step)\n\n def run(self) -> str:\n \"\"\"The target function executed by the process. 
This function will also handle\n the step lifecycle, executing first the `load` function of the `Step` and then\n waiting to receive a batch from the `input_queue` that will be handled by the\n `process` method of the `Step`.\n\n Returns:\n The name of the step that was executed.\n \"\"\"\n\n try:\n self.step.load()\n self.step._logger.debug(f\"Step '{self.step.name}' loaded!\")\n except Exception as e:\n self.step.unload()\n self._notify_load_failed()\n raise _StepWrapperException.create_load_error(\n message=f\"Step load failed: {e}\",\n step=self.step,\n subprocess_exception=e,\n ) from e\n\n self._notify_load()\n\n if self.step.is_generator:\n self._generator_step_process_loop()\n else:\n self._non_generator_process_loop()\n\n # Just in case `None` sentinel was sent\n try:\n self.input_queue.get(block=False)\n except Exception:\n pass\n\n self.step.unload()\n\n self._notify_unload()\n\n self.step._logger.info(\n f\"\ud83c\udfc1 Finished running step '{self.step.name}' (replica ID: {self.replica})\"\n )\n\n return self.step.name # type: ignore\n\n def _notify_load(self) -> None:\n \"\"\"Notifies that the step has finished executing its `load` function successfully.\"\"\"\n self.step._logger.debug(\n f\"Notifying load of step '{self.step.name}' (replica ID {self.replica})...\"\n )\n self.load_queue.put({\"name\": self.step.name, \"status\": \"loaded\"}) # type: ignore\n\n def _notify_unload(self) -> None:\n \"\"\"Notifies that the step has been unloaded.\"\"\"\n self.step._logger.debug(\n f\"Notifying unload of step '{self.step.name}' (replica ID {self.replica})...\"\n )\n self.load_queue.put({\"name\": self.step.name, \"status\": \"unloaded\"}) # type: ignore\n\n def _notify_load_failed(self) -> None:\n \"\"\"Notifies that the step failed to load.\"\"\"\n self.step._logger.debug(\n f\"Notifying load failed of step '{self.step.name}' (replica ID {self.replica})...\"\n )\n self.load_queue.put({\"name\": self.step.name, \"status\": \"load_failed\"}) # type: ignore\n\n def _generator_step_process_loop(self) -> None:\n \"\"\"Runs the process loop for a generator step. It will call the `process` method\n of the step and send the output data to the `output_queue` and block until the next\n batch request is received (i.e. 
receiving an empty batch from the `input_queue`).\n\n If the `last_batch` attribute of the batch is `True`, the loop will stop and the\n process will finish.\n\n Raises:\n _StepWrapperException: If an error occurs during the execution of the\n `process` method.\n \"\"\"\n step = cast(\"GeneratorStep\", self.step)\n\n try:\n if (batch := self.input_queue.get()) is None:\n self.step._logger.info(\n f\"\ud83d\uded1 Stopping yielding batches from step '{self.step.name}'\"\n )\n return\n\n offset = batch.seq_no * step.batch_size # type: ignore\n\n self.step._logger.info(\n f\"\ud83d\udeb0 Starting yielding batches from generator step '{self.step.name}'.\"\n f\" Offset: {offset}\"\n )\n\n for data, last_batch in step.process_applying_mappings(offset=offset):\n batch.set_data([data])\n batch.last_batch = self.dry_run or last_batch\n self._send_batch(batch)\n\n if batch.last_batch:\n return\n\n self.step._logger.debug(\n f\"Step '{self.step.name}' waiting for next batch request...\"\n )\n if (batch := self.input_queue.get()) is None:\n self.step._logger.info(\n f\"\ud83d\uded1 Stopping yielding batches from step '{self.step.name}'\"\n )\n return\n except Exception as e:\n raise _StepWrapperException(str(e), self.step, 2, e) from e\n\n def _non_generator_process_loop(self) -> None:\n \"\"\"Runs the process loop for a non-generator step. It will call the `process`\n method of the step and send the output data to the `output_queue` and block until\n the next batch is received from the `input_queue`. If the `last_batch` attribute\n of the batch is `True`, the loop will stop and the process will finish.\n\n If an error occurs during the execution of the `process` method and the step is\n global, the process will raise a `_StepWrapperException`. If the step is not\n global, the process will log the error and send an empty batch to the `output_queue`.\n\n Raises:\n _StepWrapperException: If an error occurs during the execution of the\n `process` method and the step is global.\n \"\"\"\n step = cast(\"Step\", self.step)\n while True:\n if (batch := self.input_queue.get()) is None:\n self.step._logger.info(\n f\"\ud83d\uded1 Stopping processing batches from step '{self.step.name}'\"\n )\n break\n\n if batch == LAST_BATCH_SENT_FLAG:\n self.step._logger.debug(\"Received `LAST_BATCH_SENT_FLAG`. 
Stopping...\")\n break\n\n self.step._logger.info(\n f\"\ud83d\udce6 Processing batch {batch.seq_no} in '{batch.step_name}' (replica ID: {self.replica})\"\n )\n\n if batch.data_path is not None:\n self.step._logger.debug(f\"Reading batch data from '{batch.data_path}'\")\n batch.read_batch_data_from_fs()\n\n result = []\n try:\n if self.step.has_multiple_inputs:\n result = next(step.process_applying_mappings(*batch.data))\n else:\n result = next(step.process_applying_mappings(batch.data[0]))\n except Exception as e:\n if self.step.is_global:\n self.step.unload()\n self._notify_unload()\n data = (\n batch.data\n if isinstance(\n e, DistilabelOfflineBatchGenerationNotFinishedException\n )\n else None\n )\n raise _StepWrapperException(str(e), self.step, 2, e, data) from e\n\n # Impute step outputs columns with `None`\n result = self._impute_step_outputs(batch)\n\n # if the step is not global then we can skip the batch which means sending\n # an empty batch to the output queue\n self.step._logger.warning(\n f\"\u26a0\ufe0f Processing batch {batch.seq_no} with step '{self.step.name}' failed.\"\n \" Sending empty batch filled with `None`s...\"\n )\n self.step._logger.warning(\n f\"Subprocess traceback:\\n\\n{traceback.format_exc()}\"\n )\n finally:\n batch.set_data([result])\n self._send_batch(batch)\n\n if batch.last_batch:\n break\n\n def _impute_step_outputs(self, batch: \"_Batch\") -> List[Dict[str, Any]]:\n \"\"\"Imputes the step outputs columns with `None` in the batch data.\n\n Args:\n batch: The batch to impute.\n \"\"\"\n return self.step.impute_step_outputs(batch.data[0])\n\n def _send_batch(self, batch: _Batch) -> None:\n \"\"\"Sends a batch to the `output_queue`.\"\"\"\n if batch.data_path is not None:\n self.step._logger.debug(f\"Writing batch data to '{batch.data_path}'\")\n batch.write_batch_data_to_fs()\n\n self.step._logger.info(\n f\"\ud83d\udce8 Step '{batch.step_name}' sending batch {batch.seq_no} to output queue\"\n )\n self.output_queue.put(batch)\n
"},{"location":"api/pipeline/step_wrapper/#distilabel.pipeline.step_wrapper._StepWrapper.__init__","title":"__init__(step, replica, input_queue, output_queue, load_queue, dry_run=False, ray_pipeline=False)
","text":"Initializes the _ProcessWrapper
.
Parameters:
Name Type Description Defaultstep
Union[Step, GeneratorStep]
The step to run.
requiredinput_queue
Queue[_Batch]
The queue to receive the input data.
requiredoutput_queue
Queue[_Batch]
The queue to send the output data.
requiredload_queue
Queue[Union[StepLoadStatus, None]]
The queue used to notify the main process that the step has been loaded, has been unloaded or has failed to load.
requireddry_run
bool
Flag used during dry runs to force the current batch to be marked as the last one.
False
ray_pipeline
bool
Whether the step is running a RayPipeline
or not.
False
Source code in src/distilabel/pipeline/step_wrapper.py
def __init__(\n self,\n step: Union[\"Step\", \"GeneratorStep\"],\n replica: int,\n input_queue: \"Queue[_Batch]\",\n output_queue: \"Queue[_Batch]\",\n load_queue: \"Queue[Union[StepLoadStatus, None]]\",\n dry_run: bool = False,\n ray_pipeline: bool = False,\n) -> None:\n \"\"\"Initializes the `_ProcessWrapper`.\n\n Args:\n step: The step to run.\n input_queue: The queue to receive the input data.\n output_queue: The queue to send the output data.\n load_queue: The queue used to notify the main process that the step has been\n loaded, has been unloaded or has failed to load.\n dry_run: Flag to ensure we are forcing to run the last batch.\n ray_pipeline: Whether the step is running a `RayPipeline` or not.\n \"\"\"\n self.step = step\n self.replica = replica\n self.input_queue = input_queue\n self.output_queue = output_queue\n self.load_queue = load_queue\n self.dry_run = dry_run\n self.ray_pipeline = ray_pipeline\n\n self._init_cuda_device_placement()\n
"},{"location":"api/pipeline/step_wrapper/#distilabel.pipeline.step_wrapper._StepWrapper.run","title":"run()
","text":"The target function executed by the process. This function will also handle the step lifecycle, executing first the load
function of the Step
and then waiting to receive a batch from the input_queue
that will be handled by the process
method of the Step
.
Returns:
Type Descriptionstr
The name of the step that was executed.
Source code in src/distilabel/pipeline/step_wrapper.py
def run(self) -> str:\n \"\"\"The target function executed by the process. This function will also handle\n the step lifecycle, executing first the `load` function of the `Step` and then\n waiting to receive a batch from the `input_queue` that will be handled by the\n `process` method of the `Step`.\n\n Returns:\n The name of the step that was executed.\n \"\"\"\n\n try:\n self.step.load()\n self.step._logger.debug(f\"Step '{self.step.name}' loaded!\")\n except Exception as e:\n self.step.unload()\n self._notify_load_failed()\n raise _StepWrapperException.create_load_error(\n message=f\"Step load failed: {e}\",\n step=self.step,\n subprocess_exception=e,\n ) from e\n\n self._notify_load()\n\n if self.step.is_generator:\n self._generator_step_process_loop()\n else:\n self._non_generator_process_loop()\n\n # Just in case `None` sentinel was sent\n try:\n self.input_queue.get(block=False)\n except Exception:\n pass\n\n self.step.unload()\n\n self._notify_unload()\n\n self.step._logger.info(\n f\"\ud83c\udfc1 Finished running step '{self.step.name}' (replica ID: {self.replica})\"\n )\n\n return self.step.name # type: ignore\n
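The control flow above can be summarized without any distilabel machinery. The following is a hedged, framework-free sketch of the same lifecycle (load, notify, consume batches until a None sentinel, unload, notify); the attribute and status names mirror the wrapper, but this is an illustration of the pattern, not the actual implementation.
from queue import Queue\n\ndef run_step(step, input_queue: Queue, output_queue: Queue, load_queue: Queue) -> str:\n    # Load the step first and report the outcome on the load queue.\n    try:\n        step.load()\n    except Exception:\n        load_queue.put({'name': step.name, 'status': 'load_failed'})\n        raise\n    load_queue.put({'name': step.name, 'status': 'loaded'})\n    # Consume batches until the `None` sentinel arrives.\n    while (batch := input_queue.get()) is not None:\n        output_queue.put(step.process(batch))\n    # Clean up and report the unload.\n    step.unload()\n    load_queue.put({'name': step.name, 'status': 'unloaded'})\n    return step.name\n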
"},{"location":"api/pipeline/step_wrapper/#distilabel.pipeline.step_wrapper._StepWrapperException","title":"_StepWrapperException
","text":" Bases: Exception
Exception to be raised when an error occurs in the _StepWrapper
class.
Attributes:
Name Type Descriptionmessage
The error message.
step
The Step
that raised the error.
code
The error code.
subprocess_exception
The exception raised by the subprocess.
data
The data that caused the error. Defaults to None
.
src/distilabel/pipeline/step_wrapper.py
class _StepWrapperException(Exception):\n \"\"\"Exception to be raised when an error occurs in the `_StepWrapper` class.\n\n Attributes:\n message: The error message.\n step: The `Step` that raised the error.\n code: The error code.\n subprocess_exception: The exception raised by the subprocess.\n data: The data that caused the error. Defaults to `None`.\n \"\"\"\n\n def __init__(\n self,\n message: str,\n step: \"_Step\",\n code: int,\n subprocess_exception: Exception,\n data: Optional[List[List[Dict[str, Any]]]] = None,\n ) -> None:\n self.message = f\"{message}\\n\\nFor further information visit '{DISTILABEL_DOCS_URL}api/pipeline/step_wrapper'\"\n self.step = step\n self.code = code\n self.subprocess_exception = subprocess_exception\n self.formatted_traceback = \"\".join(\n traceback.format_exception(\n type(subprocess_exception),\n subprocess_exception,\n subprocess_exception.__traceback__,\n )\n )\n self.data = data\n\n @classmethod\n def create_load_error(\n cls,\n message: str,\n step: \"_Step\",\n subprocess_exception: Optional[Exception] = None,\n ) -> \"_StepWrapperException\":\n \"\"\"Creates a `_StepWrapperException` for a load error.\n\n Args:\n message: The error message.\n step: The `Step` that raised the error.\n subprocess_exception: The exception raised by the subprocess. Defaults to `None`.\n\n Returns:\n The `_StepWrapperException` instance.\n \"\"\"\n return cls(message, step, 1, subprocess_exception, None)\n\n @property\n def is_load_error(self) -> bool:\n \"\"\"Whether the error is a load error.\n\n Returns:\n `True` if the error is a load error, `False` otherwise.\n \"\"\"\n return self.code == 1\n
"},{"location":"api/pipeline/step_wrapper/#distilabel.pipeline.step_wrapper._StepWrapperException.is_load_error","title":"is_load_error
property
","text":"Whether the error is a load error.
Returns:
Type Descriptionbool
True
if the error is a load error, False
otherwise.
create_load_error(message, step, subprocess_exception=None)
classmethod
","text":"Creates a _StepWrapperException
for a load error.
Parameters:
Name Type Description Defaultmessage
str
The error message.
requiredstep
_Step
The Step
that raised the error.
subprocess_exception
Optional[Exception]
The exception raised by the subprocess. Defaults to None
.
None
Returns:
Type Description_StepWrapperException
The _StepWrapperException
instance.
src/distilabel/pipeline/step_wrapper.py
@classmethod\ndef create_load_error(\n cls,\n message: str,\n step: \"_Step\",\n subprocess_exception: Optional[Exception] = None,\n) -> \"_StepWrapperException\":\n \"\"\"Creates a `_StepWrapperException` for a load error.\n\n Args:\n message: The error message.\n step: The `Step` that raised the error.\n subprocess_exception: The exception raised by the subprocess. Defaults to `None`.\n\n Returns:\n The `_StepWrapperException` instance.\n \"\"\"\n return cls(message, step, 1, subprocess_exception, None)\n
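A hedged illustration of the error-code convention: exceptions built with create_load_error carry code 1, which is exactly what is_load_error checks. The import path follows the src/distilabel/pipeline/step_wrapper.py reference above and LoadDataFromDicts is only used as a convenient stand-in step; both touch internals that may change between versions.
from distilabel.pipeline.step_wrapper import _StepWrapperException\nfrom distilabel.steps import LoadDataFromDicts\n\n# Any step instance works as the `step` argument; `LoadDataFromDicts` is just convenient.\nstep = LoadDataFromDicts(data=[{'instruction': 'hi'}], name='loader')\nerror = _StepWrapperException.create_load_error(\n    message='Step load failed: missing API key',\n    step=step,\n    subprocess_exception=RuntimeError('missing API key'),\n)\nassert error.code == 1 and error.is_load_error\n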
"},{"location":"api/pipeline/typing/","title":"Pipeline Typing","text":""},{"location":"api/pipeline/typing/#distilabel.pipeline.typing","title":"typing
","text":""},{"location":"api/pipeline/typing/#distilabel.pipeline.typing.DownstreamConnectable","title":"DownstreamConnectable = Union['Step', 'GlobalStep']
module-attribute
","text":"Alias for the Step
types that can be connected as downstream steps.
UpstreamConnectableSteps = TypeVar('UpstreamConnectableSteps', bound=Union['Step', 'GlobalStep', 'GeneratorStep'])
module-attribute
","text":"Type for the Step
types that can be connected as upstream steps.
DownstreamConnectableSteps = TypeVar('DownstreamConnectableSteps', bound=DownstreamConnectable, covariant=True)
module-attribute
","text":"Type for the Step
types that can be connected as downstream steps.
PipelineRuntimeParametersInfo = Dict[str, Union[List['RuntimeParameterInfo'], Dict[str, 'RuntimeParameterInfo']]]
module-attribute
","text":"Alias for the information of the runtime parameters of a Pipeline
.
InputDataset = Union['Dataset', 'pd.DataFrame', List[Dict[str, str]]]
module-attribute
","text":"Alias for the types we can process as input dataset.
"},{"location":"api/pipeline/typing/#distilabel.pipeline.typing.LoadGroups","title":"LoadGroups = Union[List[List[Any]], Literal['sequential_step_execution']]
module-attribute
","text":"Alias for the types that can be used as load groups.
List[List[Any]]
, it's a list containing lists of steps that have to be loaded in isolation.StepLoadStatus
","text":" Bases: TypedDict
Dict containing information about whether a step was loaded or unloaded, or whether its load failed
Source code in src/distilabel/pipeline/typing.py
class StepLoadStatus(TypedDict):\n \"\"\"Dict containing information about if one step was loaded/unloaded or if it's load\n failed\"\"\"\n\n name: str\n status: Literal[\"loaded\", \"unloaded\", \"load_failed\"]\n
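A minimal usage sketch: the messages a step puts on the load queue (see _notify_load in the step wrapper above) are dictionaries of this shape; the step name used here is illustrative.
from distilabel.pipeline.typing import StepLoadStatus\n\nstatus: StepLoadStatus = {'name': 'text_generation_0', 'status': 'loaded'}\n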
"},{"location":"api/step/","title":"Step","text":"This section contains the API reference for the distilabel
step, both for the _Step
base class and the Step
class.
For more information and examples on how to use existing steps or create custom ones, please refer to Tutorial - Step.
"},{"location":"api/step/#distilabel.steps.base","title":"base
","text":""},{"location":"api/step/#distilabel.steps.base.StepInput","title":"StepInput = Annotated[List[Dict[str, Any]], _STEP_INPUT_ANNOTATION]
module-attribute
","text":"StepInput is just an Annotated
alias of the typing List[Dict[str, Any]]
with extra metadata that allows distilabel
to perform validations over the process
step method defined in each Step
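A minimal sketch of a custom step whose process method takes a StepInput-annotated argument, assuming the public distilabel.steps imports; the StepOutput return alias is left as a string because its import path differs between distilabel versions.
from distilabel.steps import Step, StepInput\n\nclass KeepNonEmpty(Step):\n    @property\n    def inputs(self):\n        return ['instruction']\n\n    @property\n    def outputs(self):\n        return ['instruction']\n\n    def process(self, inputs: StepInput) -> 'StepOutput':\n        # `inputs` is a List[Dict[str, Any]] produced by the previous step.\n        yield [row for row in inputs if row['instruction']]\n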
_Step
","text":" Bases: RuntimeParametersMixin
, RequirementsMixin
, SignatureMixin
, BaseModel
, _Serializable
, ABC
Base class for the steps that can be included in a Pipeline
.
A Step
is a class defining some processing logic. The inputs and outputs for this processing logic are lists of dictionaries with the same keys:
```python\n[\n {\"column1\": \"value1\", \"column2\": \"value2\", ...},\n {\"column1\": \"value1\", \"column2\": \"value2\", ...},\n {\"column1\": \"value1\", \"column2\": \"value2\", ...},\n]\n```\n
The processing logic is defined in the process
method, which depending on the number of previous steps, can receive more than one list of dictionaries, each with the output of the previous steps. In order to make distilabel
know where the outputs from the previous steps are, the process
function from each Step
must have an argument or positional argument annotated with StepInput
.
```python\nclass StepWithOnePreviousStep(Step):\n def process(self, inputs: StepInput) -> StepOutput:\n yield [...]\n\nclass StepWithSeveralPreviousStep(Step):\n # mind the * to indicate that the argument is a list of StepInput\n def process(self, *inputs: StepInput) -> StepOutput:\n yield [...]\n```\n
In order to perform static validations and to check that the chaining of the steps in the pipeline is valid, a Step
must also define the inputs
and outputs
properties:
inputs
: a list of strings with the names of the columns that the step needs as input. It can be an empty list if the step is a generator step.
outputs
: a list of strings with the names of the columns that the step will produce as output.
Optionally, a Step
can override the load
method to perform any initialization logic before the process
method is called. For example, to load an LLM, establish a connection to a database, etc.
Finally, the Step
class inherits from pydantic.BaseModel
, so attributes can be easily defined, validated, serialized and included in the __init__
method of the step.
src/distilabel/steps/base.py
class _Step(\n RuntimeParametersMixin,\n RequirementsMixin,\n SignatureMixin,\n BaseModel,\n _Serializable,\n ABC,\n):\n \"\"\"Base class for the steps that can be included in a `Pipeline`.\n\n A `Step` is a class defining some processing logic. The input and outputs for this\n processing logic are lists of dictionaries with the same keys:\n\n ```python\n [\n {\"column1\": \"value1\", \"column2\": \"value2\", ...},\n {\"column1\": \"value1\", \"column2\": \"value2\", ...},\n {\"column1\": \"value1\", \"column2\": \"value2\", ...},\n ]\n ```\n\n The processing logic is defined in the `process` method, which depending on the\n number of previous steps, can receive more than one list of dictionaries, each with\n the output of the previous steps. In order to make `distilabel` know where the outputs\n from the previous steps are, the `process` function from each `Step` must have an argument\n or positional argument annotated with `StepInput`.\n\n ```python\n class StepWithOnePreviousStep(Step):\n def process(self, inputs: StepInput) -> StepOutput:\n yield [...]\n\n class StepWithSeveralPreviousStep(Step):\n # mind the * to indicate that the argument is a list of StepInput\n def process(self, *inputs: StepInput) -> StepOutput:\n yield [...]\n ```\n\n In order to perform static validations and to check that the chaining of the steps\n in the pipeline is valid, a `Step` must also define the `inputs` and `outputs`\n properties:\n\n - `inputs`: a list of strings with the names of the columns that the step needs as\n input. It can be an empty list if the step is a generator step.\n - `outputs`: a list of strings with the names of the columns that the step will\n produce as output.\n\n Optionally, a `Step` can override the `load` method to perform any initialization\n logic before the `process` method is called. For example, to load an LLM, stablish a\n connection to a database, etc.\n\n Finally, the `Step` class inherits from `pydantic.BaseModel`, so attributes can be easily\n defined, validated, serialized and included in the `__init__` method of the step.\n \"\"\"\n\n model_config = ConfigDict(\n arbitrary_types_allowed=True,\n validate_default=True,\n validate_assignment=True,\n extra=\"forbid\",\n )\n\n name: Optional[str] = Field(default=None, pattern=r\"^[a-zA-Z0-9_-]+$\")\n resources: StepResources = StepResources()\n pipeline: Any = Field(default=None, exclude=True, repr=False)\n input_mappings: Dict[str, str] = {}\n output_mappings: Dict[str, str] = {}\n use_cache: bool = True\n\n _pipeline_artifacts_path: Path = PrivateAttr(None)\n _built_from_decorator: bool = PrivateAttr(default=False)\n _logger: \"Logger\" = PrivateAttr(None)\n\n def model_post_init(self, __context: Any) -> None:\n from distilabel.pipeline.base import _GlobalPipelineManager\n\n super().model_post_init(__context)\n\n if self.pipeline is None:\n self.pipeline = _GlobalPipelineManager.get_pipeline()\n\n if self.pipeline is None:\n _logger = logging.getLogger(f\"distilabel.step.{self.name}\")\n _logger.warning(\n f\"Step '{self.name}' hasn't received a pipeline, and it hasn't been\"\n \" created within a `Pipeline` context. 
Please, use\"\n \" `with Pipeline() as pipeline:` and create the step within the context.\"\n )\n\n if not self.name:\n # This must be done before the check for repeated names, but assuming\n # we are passing the pipeline from the _GlobalPipelineManager, should\n # be done after that.\n self.name = _infer_step_name(type(self).__name__, self.pipeline)\n\n if self.pipeline is not None:\n # If not set an error will be raised in `Pipeline.run` parent\n self.pipeline._add_step(self)\n\n def connect(\n self,\n *steps: \"_Step\",\n routing_batch_function: Optional[\"RoutingBatchFunction\"] = None,\n ) -> None:\n \"\"\"Connects the current step to another step in the pipeline, which means that\n the output of this step will be the input of the other step.\n\n Args:\n steps: The steps to connect to the current step.\n routing_batch_function: A function that receives a list of steps and returns\n a list of steps to which the output batch generated by this step should be\n routed. It should be used to define the routing logic of the pipeline. If\n not provided, the output batch will be routed to all the connected steps.\n Defaults to `None`.\n \"\"\"\n assert self.pipeline is not None\n\n if routing_batch_function:\n self._set_routing_batch_function(routing_batch_function)\n\n for step in steps:\n self.pipeline._add_edge(from_step=self.name, to_step=step.name) # type: ignore\n\n def _set_routing_batch_function(\n self, routing_batch_function: \"RoutingBatchFunction\"\n ) -> None:\n \"\"\"Sets a routing batch function for the batches generated by this step, so they\n get routed to specific downstream steps.\n\n Args:\n routing_batch_function: The routing batch function that will be used to route\n the batches generated by this step.\n \"\"\"\n self.pipeline._add_routing_batch_function(\n step_name=self.name, # type: ignore\n routing_batch_function=routing_batch_function,\n )\n routing_batch_function._step = self\n\n @overload\n def __rshift__(self, other: \"RoutingBatchFunction\") -> \"RoutingBatchFunction\": ...\n\n @overload\n def __rshift__(\n self, other: List[\"DownstreamConnectableSteps\"]\n ) -> List[\"DownstreamConnectableSteps\"]: ...\n\n @overload\n def __rshift__(self, other: \"DownstreamConnectable\") -> \"DownstreamConnectable\": ...\n\n def __rshift__(\n self,\n other: Union[\n \"DownstreamConnectable\",\n \"RoutingBatchFunction\",\n List[\"DownstreamConnectableSteps\"],\n ],\n ) -> Union[\n \"DownstreamConnectable\",\n \"RoutingBatchFunction\",\n List[\"DownstreamConnectableSteps\"],\n ]:\n \"\"\"Allows using the `>>` operator to connect steps in the pipeline.\n\n Args:\n other: The step to connect, a list of steps to connect to or a routing batch\n function to be set for the step.\n\n Returns:\n The connected step, the list of connected steps or the routing batch function.\n\n Example:\n ```python\n step1 >> step2\n # Would be equivalent to:\n step1.connect(step2)\n\n # It also allows to connect a list of steps\n step1 >> [step2, step3]\n ```\n \"\"\"\n # Here to avoid circular imports\n from distilabel.pipeline.routing_batch_function import RoutingBatchFunction\n\n if isinstance(other, list):\n self.connect(*other)\n return other\n\n if isinstance(other, RoutingBatchFunction):\n self._set_routing_batch_function(other)\n return other\n\n self.connect(other)\n return other\n\n def __rrshift__(self, other: List[\"UpstreamConnectableSteps\"]) -> Self:\n \"\"\"Allows using the [step1, step2] >> step3 operator to connect a list of steps in the pipeline\n to a single step, as the list 
doesn't have the __rshift__ operator.\n\n Args:\n other: The step to connect to.\n\n Returns:\n The connected step\n\n Example:\n ```python\n [step2, step3] >> step1\n # Would be equivalent to:\n step2.connect(step1)\n step3.connect(step1)\n ```\n \"\"\"\n for o in other:\n o.connect(self)\n return self\n\n def load(self) -> None:\n \"\"\"Method to perform any initialization logic before the `process` method is\n called. For example, to load an LLM, stablish a connection to a database, etc.\n \"\"\"\n self._logger = logging.getLogger(f\"distilabel.step.{self.name}\")\n\n def unload(self) -> None:\n \"\"\"Method to perform any cleanup logic after the `process` method is called. For\n example, to close a connection to a database, etc.\n \"\"\"\n self._logger.debug(\"Executing step unload logic.\")\n\n @property\n def is_generator(self) -> bool:\n \"\"\"Whether the step is a generator step or not.\n\n Returns:\n `True` if the step is a generator step, `False` otherwise.\n \"\"\"\n return isinstance(self, GeneratorStep)\n\n @property\n def is_global(self) -> bool:\n \"\"\"Whether the step is a global step or not.\n\n Returns:\n `True` if the step is a global step, `False` otherwise.\n \"\"\"\n return isinstance(self, GlobalStep)\n\n @property\n def is_normal(self) -> bool:\n \"\"\"Whether the step is a normal step or not.\n\n Returns:\n `True` if the step is a normal step, `False` otherwise.\n \"\"\"\n return not self.is_generator and not self.is_global\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"List of strings with the names of the mandatory columns that the step needs as\n input or dictionary in which the keys are the input columns of the step and the\n values are booleans indicating whether the column is optional or not.\n\n Returns:\n List of strings with the names of the columns that the step needs as input.\n \"\"\"\n return []\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"List of strings with the names of the columns that the step will produce as\n output or dictionary in which the keys are the output columns of the step and the\n values are booleans indicating whether the column is optional or not.\n\n Returns:\n List of strings with the names of the columns that the step will produce as\n output.\n \"\"\"\n return []\n\n @cached_property\n def process_parameters(self) -> List[inspect.Parameter]:\n \"\"\"Returns the parameters of the `process` method of the step.\n\n Returns:\n The parameters of the `process` method of the step.\n \"\"\"\n return list(inspect.signature(self.process).parameters.values()) # type: ignore\n\n def has_multiple_inputs(self) -> bool:\n \"\"\"Whether the `process` method of the step receives more than one input or not\n i.e. 
has a `*` argument annotated with `StepInput`.\n\n Returns:\n `True` if the `process` method of the step receives more than one input,\n `False` otherwise.\n \"\"\"\n return any(\n param.kind == param.VAR_POSITIONAL for param in self.process_parameters\n )\n\n def get_process_step_input(self) -> Union[inspect.Parameter, None]:\n \"\"\"Returns the parameter of the `process` method of the step annotated with\n `StepInput`.\n\n Returns:\n The parameter of the `process` method of the step annotated with `StepInput`,\n or `None` if there is no parameter annotated with `StepInput`.\n\n Raises:\n TypeError: If the step has more than one parameter annotated with `StepInput`.\n \"\"\"\n step_input_parameter = None\n for parameter in self.process_parameters:\n if is_parameter_annotated_with(parameter, _STEP_INPUT_ANNOTATION):\n if step_input_parameter is not None:\n raise DistilabelTypeError(\n f\"Step '{self.name}' should have only one parameter with type\"\n \" hint `StepInput`.\",\n page=\"sections/how_to_guides/basic/step/#defining-custom-steps\",\n )\n step_input_parameter = parameter\n return step_input_parameter\n\n def verify_inputs_mappings(self) -> None:\n \"\"\"Verifies that the `inputs_mappings` of the step are valid i.e. the input\n columns exist in the inputs of the step.\n\n Raises:\n ValueError: If the `inputs_mappings` of the step are not valid.\n \"\"\"\n if not self.input_mappings:\n return\n\n for input in self.input_mappings:\n if input not in self.inputs:\n raise DistilabelUserError(\n f\"The input column '{input}' doesn't exist in the inputs of the\"\n f\" step '{self.name}'. Inputs of the step are: {self.inputs}.\"\n \" Please, review the `inputs_mappings` argument of the step.\",\n page=\"sections/how_to_guides/basic/step/#arguments\",\n )\n\n def verify_outputs_mappings(self) -> None:\n \"\"\"Verifies that the `outputs_mappings` of the step are valid i.e. the output\n columns exist in the outputs of the step.\n\n Raises:\n ValueError: If the `outputs_mappings` of the step are not valid.\n \"\"\"\n if not self.output_mappings:\n return\n\n for output in self.output_mappings:\n if output not in self.outputs:\n raise DistilabelUserError(\n f\"The output column '{output}' doesn't exist in the outputs of the\"\n f\" step '{self.name}'. Outputs of the step are: {self.outputs}.\"\n \" Please, review the `outputs_mappings` argument of the step.\",\n page=\"sections/how_to_guides/basic/step/#arguments\",\n )\n\n def get_inputs(self) -> Dict[str, bool]:\n \"\"\"Gets the inputs of the step after the `input_mappings`. This method is meant\n to be used to run validations on the inputs of the step.\n\n Returns:\n The inputs of the step after the `input_mappings` and if they are required or\n not.\n \"\"\"\n if isinstance(self.inputs, list):\n return {\n self.input_mappings.get(input, input): True for input in self.inputs\n }\n\n return {\n self.input_mappings.get(input, input): required\n for input, required in self.inputs.items()\n }\n\n def get_outputs(self) -> Dict[str, bool]:\n \"\"\"Gets the outputs of the step after the `outputs_mappings`. 
This method is\n meant to be used to run validations on the outputs of the step.\n\n Returns:\n The outputs of the step after the `outputs_mappings` and if they are required\n or not.\n \"\"\"\n if isinstance(self.outputs, list):\n return {\n self.output_mappings.get(output, output): True\n for output in self.outputs\n }\n\n return {\n self.output_mappings.get(output, output): required\n for output, required in self.outputs.items()\n }\n\n def set_pipeline_artifacts_path(self, path: Path) -> None:\n \"\"\"Sets the `_pipeline_artifacts_path` attribute. This method is meant to be used\n by the `Pipeline` once the cache location is known.\n\n Args:\n path: the path where the artifacts generated by the pipeline steps should be\n saved.\n \"\"\"\n self._pipeline_artifacts_path = path\n\n @property\n def artifacts_directory(self) -> Union[Path, None]:\n \"\"\"Gets the path of the directory where the step should save its generated artifacts.\n\n Returns:\n The path of the directory where the step should save the generated artifacts,\n or `None` if `_pipeline_artifacts_path` is not set.\n \"\"\"\n if self._pipeline_artifacts_path is None:\n return None\n return self._pipeline_artifacts_path / self.name # type: ignore\n\n def save_artifact(\n self,\n name: str,\n write_function: Callable[[Path], None],\n metadata: Optional[Dict[str, Any]] = None,\n ) -> None:\n \"\"\"Saves an artifact generated by the `Step`.\n\n Args:\n name: the name of the artifact.\n write_function: a function that will receive the path where the artifact should\n be saved.\n metadata: the artifact metadata. Defaults to `None`.\n \"\"\"\n if self.artifacts_directory is None:\n self._logger.warning(\n f\"Cannot save artifact with '{name}' as `_pipeline_artifacts_path` is not\"\n \" set. This is normal if the `Step` is being executed as a standalone component.\"\n )\n return\n\n artifact_directory_path = self.artifacts_directory / name\n artifact_directory_path.mkdir(parents=True, exist_ok=True)\n\n self._logger.info(f\"\ud83c\udffa Storing '{name}' generated artifact...\")\n\n self._logger.debug(\n f\"Calling `write_function` to write artifact in '{artifact_directory_path}'...\"\n )\n write_function(artifact_directory_path)\n\n metadata_path = artifact_directory_path / \"metadata.json\"\n self._logger.debug(\n f\"Calling `write_json` to write artifact metadata in '{metadata_path}'...\"\n )\n write_json(filename=metadata_path, data=metadata or {})\n\n def impute_step_outputs(\n self, step_output: List[Dict[str, Any]]\n ) -> List[Dict[str, Any]]:\n \"\"\"\n Imputes the output columns of the step that are not present in the step output.\n \"\"\"\n result = []\n for row in step_output:\n data = row.copy()\n for output in self.get_outputs().keys():\n data[output] = None\n result.append(data)\n return result\n\n def _model_dump(self, obj: Any, **kwargs: Any) -> Dict[str, Any]:\n dump = super()._model_dump(obj, **kwargs)\n dump[\"runtime_parameters_info\"] = self.get_runtime_parameters_info()\n return dump\n
"},{"location":"api/step/#distilabel.steps.base._Step.is_generator","title":"is_generator
property
","text":"Whether the step is a generator step or not.
Returns:
Type Descriptionbool
True
if the step is a generator step, False
otherwise.
is_global
property
","text":"Whether the step is a global step or not.
Returns:
Type Descriptionbool
True
if the step is a global step, False
otherwise.
is_normal
property
","text":"Whether the step is a normal step or not.
Returns:
Type Descriptionbool
True
if the step is a normal step, False
otherwise.
inputs
property
","text":"List of strings with the names of the mandatory columns that the step needs as input or dictionary in which the keys are the input columns of the step and the values are booleans indicating whether the column is optional or not.
Returns:
Type DescriptionStepColumns
List of strings with the names of the columns that the step needs as input.
"},{"location":"api/step/#distilabel.steps.base._Step.outputs","title":"outputs
property
","text":"List of strings with the names of the columns that the step will produce as output or dictionary in which the keys are the output columns of the step and the values are booleans indicating whether the column is optional or not.
Returns:
Type DescriptionStepColumns
List of strings with the names of the columns that the step will produce as
StepColumns
output.
"},{"location":"api/step/#distilabel.steps.base._Step.process_parameters","title":"process_parameters
cached
property
","text":"Returns the parameters of the process
method of the step.
Returns:
Type DescriptionList[Parameter]
The parameters of the process
method of the step.
artifacts_directory
property
","text":"Gets the path of the directory where the step should save its generated artifacts.
Returns:
Type DescriptionUnion[Path, None]
The path of the directory where the step should save the generated artifacts, or None
if _pipeline_artifacts_path
is not set.
connect(*steps, routing_batch_function=None)
","text":"Connects the current step to another step in the pipeline, which means that the output of this step will be the input of the other step.
Parameters:
Name Type Description Defaultsteps
_Step
The steps to connect to the current step.
()
routing_batch_function
Optional[RoutingBatchFunction]
A function that receives a list of steps and returns a list of steps to which the output batch generated by this step should be routed. It should be used to define the routing logic of the pipeline. If not provided, the output batch will be routed to all the connected steps. Defaults to None
.
None
Source code in src/distilabel/steps/base.py
def connect(\n self,\n *steps: \"_Step\",\n routing_batch_function: Optional[\"RoutingBatchFunction\"] = None,\n) -> None:\n \"\"\"Connects the current step to another step in the pipeline, which means that\n the output of this step will be the input of the other step.\n\n Args:\n steps: The steps to connect to the current step.\n routing_batch_function: A function that receives a list of steps and returns\n a list of steps to which the output batch generated by this step should be\n routed. It should be used to define the routing logic of the pipeline. If\n not provided, the output batch will be routed to all the connected steps.\n Defaults to `None`.\n \"\"\"\n assert self.pipeline is not None\n\n if routing_batch_function:\n self._set_routing_batch_function(routing_batch_function)\n\n for step in steps:\n self.pipeline._add_edge(from_step=self.name, to_step=step.name) # type: ignore\n
"},{"location":"api/step/#distilabel.steps.base._Step.__rshift__","title":"__rshift__(other)
","text":"__rshift__(other: RoutingBatchFunction) -> RoutingBatchFunction\n
__rshift__(other: List[DownstreamConnectableSteps]) -> List[DownstreamConnectableSteps]\n
__rshift__(other: DownstreamConnectable) -> DownstreamConnectable\n
Allows using the >>
operator to connect steps in the pipeline.
Parameters:
Name Type Description Defaultother
Union[DownstreamConnectable, RoutingBatchFunction, List[DownstreamConnectableSteps]]
The step to connect, a list of steps to connect to or a routing batch function to be set for the step.
requiredReturns:
Type DescriptionUnion[DownstreamConnectable, RoutingBatchFunction, List[DownstreamConnectableSteps]]
The connected step, the list of connected steps or the routing batch function.
Examplestep1 >> step2\n# Would be equivalent to:\nstep1.connect(step2)\n\n# It also allows to connect a list of steps\nstep1 >> [step2, step3]\n
Source code in src/distilabel/steps/base.py
def __rshift__(\n self,\n other: Union[\n \"DownstreamConnectable\",\n \"RoutingBatchFunction\",\n List[\"DownstreamConnectableSteps\"],\n ],\n) -> Union[\n \"DownstreamConnectable\",\n \"RoutingBatchFunction\",\n List[\"DownstreamConnectableSteps\"],\n]:\n \"\"\"Allows using the `>>` operator to connect steps in the pipeline.\n\n Args:\n other: The step to connect, a list of steps to connect to or a routing batch\n function to be set for the step.\n\n Returns:\n The connected step, the list of connected steps or the routing batch function.\n\n Example:\n ```python\n step1 >> step2\n # Would be equivalent to:\n step1.connect(step2)\n\n # It also allows to connect a list of steps\n step1 >> [step2, step3]\n ```\n \"\"\"\n # Here to avoid circular imports\n from distilabel.pipeline.routing_batch_function import RoutingBatchFunction\n\n if isinstance(other, list):\n self.connect(*other)\n return other\n\n if isinstance(other, RoutingBatchFunction):\n self._set_routing_batch_function(other)\n return other\n\n self.connect(other)\n return other\n
"},{"location":"api/step/#distilabel.steps.base._Step.__rrshift__","title":"__rrshift__(other)
","text":"Allows using the [step1, step2] >> step3 operator to connect a list of steps in the pipeline to a single step, as the list doesn't have the rshift operator.
Parameters:
Name Type Description Defaultother
List[UpstreamConnectableSteps]
The step to connect to.
requiredReturns:
Type DescriptionSelf
The connected step
Example[step2, step3] >> step1\n# Would be equivalent to:\nstep2.connect(step1)\nstep3.connect(step1)\n
Source code in src/distilabel/steps/base.py
def __rrshift__(self, other: List[\"UpstreamConnectableSteps\"]) -> Self:\n \"\"\"Allows using the [step1, step2] >> step3 operator to connect a list of steps in the pipeline\n to a single step, as the list doesn't have the __rshift__ operator.\n\n Args:\n other: The step to connect to.\n\n Returns:\n The connected step\n\n Example:\n ```python\n [step2, step3] >> step1\n # Would be equivalent to:\n step2.connect(step1)\n step3.connect(step1)\n ```\n \"\"\"\n for o in other:\n o.connect(self)\n return self\n
"},{"location":"api/step/#distilabel.steps.base._Step.load","title":"load()
","text":"Method to perform any initialization logic before the process
method is called. For example, to load an LLM, stablish a connection to a database, etc.
src/distilabel/steps/base.py
def load(self) -> None:\n \"\"\"Method to perform any initialization logic before the `process` method is\n called. For example, to load an LLM, stablish a connection to a database, etc.\n \"\"\"\n self._logger = logging.getLogger(f\"distilabel.step.{self.name}\")\n
"},{"location":"api/step/#distilabel.steps.base._Step.unload","title":"unload()
","text":"Method to perform any cleanup logic after the process
method is called. For example, to close a connection to a database, etc.
src/distilabel/steps/base.py
def unload(self) -> None:\n \"\"\"Method to perform any cleanup logic after the `process` method is called. For\n example, to close a connection to a database, etc.\n \"\"\"\n self._logger.debug(\"Executing step unload logic.\")\n
"},{"location":"api/step/#distilabel.steps.base._Step.has_multiple_inputs","title":"has_multiple_inputs()
","text":"Whether the process
method of the step receives more than one input or not i.e. has a *
argument annotated with StepInput
.
Returns:
Type Descriptionbool
True
if the process
method of the step receives more than one input,
bool
False
otherwise.
src/distilabel/steps/base.py
def has_multiple_inputs(self) -> bool:\n \"\"\"Whether the `process` method of the step receives more than one input or not\n i.e. has a `*` argument annotated with `StepInput`.\n\n Returns:\n `True` if the `process` method of the step receives more than one input,\n `False` otherwise.\n \"\"\"\n return any(\n param.kind == param.VAR_POSITIONAL for param in self.process_parameters\n )\n
"},{"location":"api/step/#distilabel.steps.base._Step.get_process_step_input","title":"get_process_step_input()
","text":"Returns the parameter of the process
method of the step annotated with StepInput
.
Returns:
Type DescriptionUnion[Parameter, None]
The parameter of the process
method of the step annotated with StepInput
,
Union[Parameter, None]
or None
if there is no parameter annotated with StepInput
.
Raises:
Type DescriptionTypeError
If the step has more than one parameter annotated with StepInput
.
src/distilabel/steps/base.py
def get_process_step_input(self) -> Union[inspect.Parameter, None]:\n \"\"\"Returns the parameter of the `process` method of the step annotated with\n `StepInput`.\n\n Returns:\n The parameter of the `process` method of the step annotated with `StepInput`,\n or `None` if there is no parameter annotated with `StepInput`.\n\n Raises:\n TypeError: If the step has more than one parameter annotated with `StepInput`.\n \"\"\"\n step_input_parameter = None\n for parameter in self.process_parameters:\n if is_parameter_annotated_with(parameter, _STEP_INPUT_ANNOTATION):\n if step_input_parameter is not None:\n raise DistilabelTypeError(\n f\"Step '{self.name}' should have only one parameter with type\"\n \" hint `StepInput`.\",\n page=\"sections/how_to_guides/basic/step/#defining-custom-steps\",\n )\n step_input_parameter = parameter\n return step_input_parameter\n
"},{"location":"api/step/#distilabel.steps.base._Step.verify_inputs_mappings","title":"verify_inputs_mappings()
","text":"Verifies that the inputs_mappings
of the step are valid i.e. the input columns exist in the inputs of the step.
Raises:
Type DescriptionValueError
If the inputs_mappings
of the step are not valid.
src/distilabel/steps/base.py
def verify_inputs_mappings(self) -> None:\n \"\"\"Verifies that the `inputs_mappings` of the step are valid i.e. the input\n columns exist in the inputs of the step.\n\n Raises:\n ValueError: If the `inputs_mappings` of the step are not valid.\n \"\"\"\n if not self.input_mappings:\n return\n\n for input in self.input_mappings:\n if input not in self.inputs:\n raise DistilabelUserError(\n f\"The input column '{input}' doesn't exist in the inputs of the\"\n f\" step '{self.name}'. Inputs of the step are: {self.inputs}.\"\n \" Please, review the `inputs_mappings` argument of the step.\",\n page=\"sections/how_to_guides/basic/step/#arguments\",\n )\n
"},{"location":"api/step/#distilabel.steps.base._Step.verify_outputs_mappings","title":"verify_outputs_mappings()
","text":"Verifies that the outputs_mappings
of the step are valid i.e. the output columns exist in the outputs of the step.
Raises:
Type DescriptionValueError
If the outputs_mappings
of the step are not valid.
src/distilabel/steps/base.py
def verify_outputs_mappings(self) -> None:\n \"\"\"Verifies that the `outputs_mappings` of the step are valid i.e. the output\n columns exist in the outputs of the step.\n\n Raises:\n ValueError: If the `outputs_mappings` of the step are not valid.\n \"\"\"\n if not self.output_mappings:\n return\n\n for output in self.output_mappings:\n if output not in self.outputs:\n raise DistilabelUserError(\n f\"The output column '{output}' doesn't exist in the outputs of the\"\n f\" step '{self.name}'. Outputs of the step are: {self.outputs}.\"\n \" Please, review the `outputs_mappings` argument of the step.\",\n page=\"sections/how_to_guides/basic/step/#arguments\",\n )\n
"},{"location":"api/step/#distilabel.steps.base._Step.get_inputs","title":"get_inputs()
","text":"Gets the inputs of the step after the input_mappings
. This method is meant to be used to run validations on the inputs of the step.
Returns:
Type DescriptionDict[str, bool]
The inputs of the step after the input_mappings
and if they are required or
Dict[str, bool]
not.
Source code insrc/distilabel/steps/base.py
def get_inputs(self) -> Dict[str, bool]:\n \"\"\"Gets the inputs of the step after the `input_mappings`. This method is meant\n to be used to run validations on the inputs of the step.\n\n Returns:\n The inputs of the step after the `input_mappings` and if they are required or\n not.\n \"\"\"\n if isinstance(self.inputs, list):\n return {\n self.input_mappings.get(input, input): True for input in self.inputs\n }\n\n return {\n self.input_mappings.get(input, input): required\n for input, required in self.inputs.items()\n }\n
"},{"location":"api/step/#distilabel.steps.base._Step.get_outputs","title":"get_outputs()
","text":"Gets the outputs of the step after the outputs_mappings
. This method is meant to be used to run validations on the outputs of the step.
Returns:
Type DescriptionDict[str, bool]
The outputs of the step after the outputs_mappings
and if they are required
Dict[str, bool]
or not.
Source code insrc/distilabel/steps/base.py
def get_outputs(self) -> Dict[str, bool]:\n \"\"\"Gets the outputs of the step after the `outputs_mappings`. This method is\n meant to be used to run validations on the outputs of the step.\n\n Returns:\n The outputs of the step after the `outputs_mappings` and if they are required\n or not.\n \"\"\"\n if isinstance(self.outputs, list):\n return {\n self.output_mappings.get(output, output): True\n for output in self.outputs\n }\n\n return {\n self.output_mappings.get(output, output): required\n for output, required in self.outputs.items()\n }\n
"},{"location":"api/step/#distilabel.steps.base._Step.set_pipeline_artifacts_path","title":"set_pipeline_artifacts_path(path)
","text":"Sets the _pipeline_artifacts_path
attribute. This method is meant to be used by the Pipeline
once the cache location is known.
Parameters:
Name Type Description Defaultpath
Path
the path where the artifacts generated by the pipeline steps should be saved.
required Source code insrc/distilabel/steps/base.py
def set_pipeline_artifacts_path(self, path: Path) -> None:\n \"\"\"Sets the `_pipeline_artifacts_path` attribute. This method is meant to be used\n by the `Pipeline` once the cache location is known.\n\n Args:\n path: the path where the artifacts generated by the pipeline steps should be\n saved.\n \"\"\"\n self._pipeline_artifacts_path = path\n
"},{"location":"api/step/#distilabel.steps.base._Step.save_artifact","title":"save_artifact(name, write_function, metadata=None)
","text":"Saves an artifact generated by the Step
.
Parameters:
Name Type Description Defaultname
str
the name of the artifact.
requiredwrite_function
Callable[[Path], None]
a function that will receive the path where the artifact should be saved.
requiredmetadata
Optional[Dict[str, Any]]
the artifact metadata. Defaults to None
.
None
Source code in src/distilabel/steps/base.py
def save_artifact(\n self,\n name: str,\n write_function: Callable[[Path], None],\n metadata: Optional[Dict[str, Any]] = None,\n) -> None:\n \"\"\"Saves an artifact generated by the `Step`.\n\n Args:\n name: the name of the artifact.\n write_function: a function that will receive the path where the artifact should\n be saved.\n metadata: the artifact metadata. Defaults to `None`.\n \"\"\"\n if self.artifacts_directory is None:\n self._logger.warning(\n f\"Cannot save artifact with '{name}' as `_pipeline_artifacts_path` is not\"\n \" set. This is normal if the `Step` is being executed as a standalone component.\"\n )\n return\n\n artifact_directory_path = self.artifacts_directory / name\n artifact_directory_path.mkdir(parents=True, exist_ok=True)\n\n self._logger.info(f\"\ud83c\udffa Storing '{name}' generated artifact...\")\n\n self._logger.debug(\n f\"Calling `write_function` to write artifact in '{artifact_directory_path}'...\"\n )\n write_function(artifact_directory_path)\n\n metadata_path = artifact_directory_path / \"metadata.json\"\n self._logger.debug(\n f\"Calling `write_json` to write artifact metadata in '{metadata_path}'...\"\n )\n write_json(filename=metadata_path, data=metadata or {})\n
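A hedged usage sketch for save_artifact on a standalone step: outside a Pipeline run the artifacts path has to be set manually, as done below with a temporary directory; LoadDataFromDicts is just a convenient built-in step.
import json\nimport tempfile\nfrom pathlib import Path\n\nfrom distilabel.steps import LoadDataFromDicts\n\nstep = LoadDataFromDicts(data=[{'instruction': 'hi'}], name='loader')\nstep.load()  # initializes the logger used by `save_artifact`\nstep.set_pipeline_artifacts_path(Path(tempfile.mkdtemp()))\n\ndef write_stats(path: Path) -> None:\n    (path / 'stats.json').write_text(json.dumps({'num_rows': 1}))\n\nstep.save_artifact(name='stats', write_function=write_stats, metadata={'format': 'json'})\nprint(step.artifacts_directory)  # <tmp dir>/loader\n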
"},{"location":"api/step/#distilabel.steps.base._Step.impute_step_outputs","title":"impute_step_outputs(step_output)
","text":"Imputes the output columns of the step that are not present in the step output.
Source code insrc/distilabel/steps/base.py
def impute_step_outputs(\n self, step_output: List[Dict[str, Any]]\n) -> List[Dict[str, Any]]:\n \"\"\"\n Imputes the output columns of the step that are not present in the step output.\n \"\"\"\n result = []\n for row in step_output:\n data = row.copy()\n for output in self.get_outputs().keys():\n data[output] = None\n result.append(data)\n return result\n
"},{"location":"api/step/#distilabel.steps.base.Step","title":"Step
","text":" Bases: _Step
, ABC
Base class for the steps that can be included in a Pipeline
.
Attributes:
Name Type Descriptioninput_batch_size
RuntimeParameter[PositiveInt]
The number of rows that the batches processed by the step will contain. Defaults to 50
.
input_batch_size
: The number of rows that the batches processed by the step will contain. Defaults to 50
.src/distilabel/steps/base.py
class Step(_Step, ABC):\n \"\"\"Base class for the steps that can be included in a `Pipeline`.\n\n Attributes:\n input_batch_size: The number of rows that will contain the batches processed by\n the step. Defaults to `50`.\n\n Runtime parameters:\n - `input_batch_size`: The number of rows that will contain the batches processed\n by the step. Defaults to `50`.\n \"\"\"\n\n input_batch_size: RuntimeParameter[PositiveInt] = Field(\n default=DEFAULT_INPUT_BATCH_SIZE,\n description=\"The number of rows that will contain the batches processed by the\"\n \" step.\",\n )\n\n @abstractmethod\n def process(self, *inputs: StepInput) -> \"StepOutput\":\n \"\"\"Method that defines the processing logic of the step. It should yield the\n output rows.\n\n Args:\n *inputs: An argument used to receive the outputs of the previous steps. The\n number of arguments depends on the number of previous steps. It doesn't\n need to be an `*args` argument, it can be a regular argument annotated\n with `StepInput` if the step has only one previous step.\n \"\"\"\n pass\n\n def process_applying_mappings(self, *args: List[Dict[str, Any]]) -> \"StepOutput\":\n \"\"\"Runs the `process` method of the step applying the `input_mappings` to the input\n rows and the `outputs_mappings` to the output rows. This is the function that\n should be used to run the processing logic of the step.\n\n Yields:\n The output rows.\n \"\"\"\n\n inputs, overriden_inputs = (\n self._apply_input_mappings(args)\n if self.input_mappings\n else (args, [{} for _ in range(len(args[0]))])\n )\n\n # If the `Step` was built using the `@step` decorator, then we need to pass\n # the runtime parameters as kwargs, so they can be used within the processing\n # function\n generator = (\n self.process(*inputs)\n if not self._built_from_decorator\n else self.process(*inputs, **self._runtime_parameters)\n )\n\n for output_rows in generator:\n restored = []\n for i, row in enumerate(output_rows):\n # Correct the index here because we don't know the num_generations from the llm\n # ahead of time. 
For example, if we have `len(overriden_inputs)==5` and `len(row)==10`,\n # from `num_generations==2` and `group_generations=False` in the LLM:\n # The loop will use indices 0, 1, 2, 3, 4, 0, 1, 2, 3, 4\n ntimes_i = i % len(overriden_inputs)\n restored.append(\n self._apply_mappings_and_restore_overriden(\n row, overriden_inputs[ntimes_i]\n )\n )\n yield restored\n\n def _apply_input_mappings(\n self, inputs: Tuple[List[Dict[str, Any]], ...]\n ) -> Tuple[Tuple[List[Dict[str, Any]], ...], List[Dict[str, Any]]]:\n \"\"\"Applies the `input_mappings` to the input rows.\n\n Args:\n inputs: The input rows.\n\n Returns:\n The input rows with the `input_mappings` applied and the overriden values\n that were replaced by the `input_mappings`.\n \"\"\"\n reverted_input_mappings = {v: k for k, v in self.input_mappings.items()}\n\n renamed_inputs = []\n overriden_inputs = []\n for i, row_inputs in enumerate(inputs):\n renamed_row_inputs = []\n for row in row_inputs:\n overriden_keys = {}\n renamed_row = {}\n for k, v in row.items():\n renamed_key = reverted_input_mappings.get(k, k)\n\n if renamed_key not in renamed_row or k != renamed_key:\n renamed_row[renamed_key] = v\n\n if k != renamed_key and renamed_key in row and len(inputs) == 1:\n overriden_keys[renamed_key] = row[renamed_key]\n\n if i == 0:\n overriden_inputs.append(overriden_keys)\n renamed_row_inputs.append(renamed_row)\n renamed_inputs.append(renamed_row_inputs)\n return tuple(renamed_inputs), overriden_inputs\n\n def _apply_mappings_and_restore_overriden(\n self, row: Dict[str, Any], overriden: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"Reverts the `input_mappings` applied to the input rows and applies the `output_mappings`\n to the output rows. In addition, it restores the overriden values that were replaced\n by the `input_mappings`.\n\n Args:\n row: The output row.\n overriden: The overriden values that were replaced by the `input_mappings`.\n\n Returns:\n The output row with the `output_mappings` applied and the overriden values\n restored.\n \"\"\"\n result = {}\n for k, v in row.items():\n mapped_key = (\n self.output_mappings.get(k, None)\n or self.input_mappings.get(k, None)\n or k\n )\n result[mapped_key] = v\n\n # Restore overriden values\n for k, v in overriden.items():\n if k not in result:\n result[k] = v\n\n return result\n
"},{"location":"api/step/#distilabel.steps.base.Step.process","title":"process(*inputs)
abstractmethod
","text":"Method that defines the processing logic of the step. It should yield the output rows.
Parameters:
Name Type Description Default*inputs
StepInput
An argument used to receive the outputs of the previous steps. The number of arguments depends on the number of previous steps. It doesn't need to be an *args
argument, it can be a regular argument annotated with StepInput
if the step has only one previous step.
()
Source code in src/distilabel/steps/base.py
@abstractmethod\ndef process(self, *inputs: StepInput) -> \"StepOutput\":\n \"\"\"Method that defines the processing logic of the step. It should yield the\n output rows.\n\n Args:\n *inputs: An argument used to receive the outputs of the previous steps. The\n number of arguments depends on the number of previous steps. It doesn't\n need to be an `*args` argument, it can be a regular argument annotated\n with `StepInput` if the step has only one previous step.\n \"\"\"\n pass\n
"},{"location":"api/step/#distilabel.steps.base.Step.process_applying_mappings","title":"process_applying_mappings(*args)
","text":"Runs the process
method of the step applying the input_mappings
to the input rows and the outputs_mappings
to the output rows. This is the function that should be used to run the processing logic of the step.
Yields:
Type DescriptionStepOutput
The output rows.
Source code insrc/distilabel/steps/base.py
def process_applying_mappings(self, *args: List[Dict[str, Any]]) -> \"StepOutput\":\n \"\"\"Runs the `process` method of the step applying the `input_mappings` to the input\n rows and the `outputs_mappings` to the output rows. This is the function that\n should be used to run the processing logic of the step.\n\n Yields:\n The output rows.\n \"\"\"\n\n inputs, overriden_inputs = (\n self._apply_input_mappings(args)\n if self.input_mappings\n else (args, [{} for _ in range(len(args[0]))])\n )\n\n # If the `Step` was built using the `@step` decorator, then we need to pass\n # the runtime parameters as kwargs, so they can be used within the processing\n # function\n generator = (\n self.process(*inputs)\n if not self._built_from_decorator\n else self.process(*inputs, **self._runtime_parameters)\n )\n\n for output_rows in generator:\n restored = []\n for i, row in enumerate(output_rows):\n # Correct the index here because we don't know the num_generations from the llm\n # ahead of time. For example, if we have `len(overriden_inputs)==5` and `len(row)==10`,\n # from `num_generations==2` and `group_generations=False` in the LLM:\n # The loop will use indices 0, 1, 2, 3, 4, 0, 1, 2, 3, 4\n ntimes_i = i % len(overriden_inputs)\n restored.append(\n self._apply_mappings_and_restore_overriden(\n row, overriden_inputs[ntimes_i]\n )\n )\n yield restored\n
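A minimal sketch showing how input_mappings interact with process_applying_mappings, assuming the public distilabel.steps imports; the column names are illustrative.
from distilabel.steps import Step, StepInput\n\nclass AddLength(Step):\n    @property\n    def inputs(self):\n        return ['instruction']\n\n    @property\n    def outputs(self):\n        return ['length']\n\n    def process(self, inputs: StepInput) -> 'StepOutput':\n        for row in inputs:\n            row['length'] = len(row['instruction'])\n        yield inputs\n\n# The dataset column is `prompt`, but the step expects `instruction`: map it.\nstep = AddLength(name='add_length', input_mappings={'instruction': 'prompt'})\nbatch = [{'prompt': 'hello world'}]\nprint(next(step.process_applying_mappings(batch)))\n# [{'prompt': 'hello world', 'length': 11}]\n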
"},{"location":"api/step/decorator/","title":"@step","text":"This section contains the reference for the @step
decorator, used to create new Step
subclasses without having to manually define the class.
For more information check the Tutorial - Step page.
"},{"location":"api/step/decorator/#distilabel.steps.decorator","title":"decorator
","text":""},{"location":"api/step/decorator/#distilabel.steps.decorator.step","title":"step(inputs=None, outputs=None, step_type='normal')
","text":"step(inputs: Union[StepColumns, None] = None, outputs: Union[StepColumns, None] = None, step_type: Literal['normal'] = 'normal') -> Callable[..., Type[Step]]\n
step(inputs: Union[StepColumns, None] = None, outputs: Union[StepColumns, None] = None, step_type: Literal['global'] = 'global') -> Callable[..., Type[GlobalStep]]\n
step(inputs: None = None, outputs: Union[StepColumns, None] = None, step_type: Literal['generator'] = 'generator') -> Callable[..., Type[GeneratorStep]]\n
Creates a Step
from a processing function.
Parameters:
Name Type Description Defaultinputs
Union[StepColumns, None]
a list containing the names of the input columns/keys required by the step, or a dictionary where the keys are the columns and the values are booleans indicating whether the column is required or not. If not provided, the default will be an empty list []
and it will be assumed that the step doesn't need any specific columns. Defaults to None
.
None
outputs
Union[StepColumns, None]
a list containing the names of the output columns/keys, or a dictionary where the keys are the columns and the values are booleans indicating whether the column will be generated or not. If not provided, the default will be an empty list []
and it will be assumed that the step doesn't generate any specific columns. Defaults to None
.
None
step_type
Literal['normal', 'global', 'generator']
the kind of step to create. Valid choices are: \"normal\" (Step
), \"global\" (GlobalStep
) or \"generator\" (GeneratorStep
). Defaults to \"normal\"
.
'normal'
Returns:
Type DescriptionCallable[..., Type[_Step]]
A callable that will generate the type given the processing function.
Example:
# Normal step\n@step(inputs=[\"instruction\"], outputs=[\"generation\"])\ndef GenerationStep(inputs: StepInput, dummy_generation: RuntimeParameter[str]) -> StepOutput:\n for input in inputs:\n input[\"generation\"] = dummy_generation\n yield inputs\n\n# Global step\n@step(inputs=[\"instruction\"], step_type=\"global\")\ndef FilteringStep(inputs: StepInput, max_length: RuntimeParameter[int] = 256) -> StepOutput:\n yield [\n input\n for input in inputs\n if len(input[\"instruction\"]) <= max_length\n ]\n\n# Generator step\n@step(outputs=[\"num\"], step_type=\"generator\")\ndef RowGenerator(num_rows: RuntimeParameter[int] = 500) -> GeneratorStepOutput:\n data = list(range(num_rows))\n for i in range(0, len(data), 100):\n last_batch = i + 100 >= len(data)\n yield [{\"num\": num} for num in data[i : i + 100]], last_batch\n
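As a rough usage sketch (the pipeline name and data values are placeholders), the classes produced by the decorator behave like any other Step subclass: they are instantiated, connected with >>, and their runtime parameters such as dummy_generation or max_length can be set at instantiation or overridden at pipeline.run time.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\n\nwith Pipeline(name=\"decorated-steps\") as pipeline:  # illustrative pipeline name\n    loader = LoadDataFromDicts(data=[{\"instruction\": \"Write a haiku about data.\"}])\n    generate = GenerationStep(dummy_generation=\"static text\")\n    keep_short = FilteringStep(max_length=128)\n    loader >> generate >> keep_short\n\n# distiset = pipeline.run()  # runtime parameters could also be overridden here\n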
Source code in src/distilabel/steps/decorator.py
def step(\n inputs: Union[\"StepColumns\", None] = None,\n outputs: Union[\"StepColumns\", None] = None,\n step_type: Literal[\"normal\", \"global\", \"generator\"] = \"normal\",\n) -> Callable[..., Type[\"_Step\"]]:\n \"\"\"Creates an `Step` from a processing function.\n\n Args:\n inputs: a list containing the name of the inputs columns/keys or a dictionary\n where the keys are the columns and the values are booleans indicating whether\n the column is required or not, that are required by the step. If not provided\n the default will be an empty list `[]` and it will be assumed that the step\n doesn't need any specific columns. Defaults to `None`.\n outputs: a list containing the name of the outputs columns/keys or a dictionary\n where the keys are the columns and the values are booleans indicating whether\n the column will be generated or not. If not provided the default will be an\n empty list `[]` and it will be assumed that the step doesn't need any specific\n columns. Defaults to `None`.\n step_type: the kind of step to create. Valid choices are: \"normal\" (`Step`),\n \"global\" (`GlobalStep`) or \"generator\" (`GeneratorStep`). Defaults to\n `\"normal\"`.\n\n Returns:\n A callable that will generate the type given the processing function.\n\n Example:\n\n ```python\n # Normal step\n @step(inputs=[\"instruction\"], outputs=[\"generation\"])\n def GenerationStep(inputs: StepInput, dummy_generation: RuntimeParameter[str]) -> StepOutput:\n for input in inputs:\n input[\"generation\"] = dummy_generation\n yield inputs\n\n # Global step\n @step(inputs=[\"instruction\"], step_type=\"global\")\n def FilteringStep(inputs: StepInput, max_length: RuntimeParameter[int] = 256) -> StepOutput:\n yield [\n input\n for input in inputs\n if len(input[\"instruction\"]) <= max_length\n ]\n\n # Generator step\n @step(outputs=[\"num\"], step_type=\"generator\")\n def RowGenerator(num_rows: RuntimeParameter[int] = 500) -> GeneratorStepOutput:\n data = list(range(num_rows))\n for i in range(0, len(data), 100):\n last_batch = i + 100 >= len(data)\n yield [{\"num\": num} for num in data[i : i + 100]], last_batch\n ```\n \"\"\"\n\n inputs = inputs or []\n outputs = outputs or []\n\n def decorator(func: ProcessingFunc) -> Type[\"_Step\"]:\n if step_type not in _STEP_MAPPING:\n raise ValueError(\n f\"Invalid step type '{step_type}'. Please, review the '{func.__name__}'\"\n \" function decorated with the `@step` decorator and provide a valid\"\n \" `step_type`. Valid choices are: 'normal', 'global' or 'generator'.\"\n )\n\n BaseClass = _STEP_MAPPING[step_type]\n\n signature = inspect.signature(func)\n\n runtime_parameters = {\n name: (\n param.annotation,\n param.default if param.default != param.empty else None,\n )\n for name, param in signature.parameters.items()\n }\n\n runtime_parameters = {}\n step_input_parameter = None\n for name, param in signature.parameters.items():\n if is_parameter_annotated_with(param, _RUNTIME_PARAMETER_ANNOTATION):\n runtime_parameters[name] = (\n param.annotation,\n param.default if param.default != param.empty else None,\n )\n\n if not step_type == \"generator\" and is_parameter_annotated_with(\n param, _STEP_INPUT_ANNOTATION\n ):\n if step_input_parameter is not None:\n raise ValueError(\n f\"Function '{func.__name__}' has more than one parameter annotated\"\n f\" with `StepInput`. 
Please, review the '{func.__name__}' function\"\n \" decorated with the `@step` decorator and provide only one\"\n \" argument annotated with `StepInput`.\"\n )\n step_input_parameter = param\n\n RuntimeParametersModel = create_model( # type: ignore\n \"RuntimeParametersModel\",\n **runtime_parameters, # type: ignore\n )\n\n def inputs_property(self) -> \"StepColumns\":\n return inputs\n\n def outputs_property(self) -> \"StepColumns\":\n return outputs\n\n def process(\n self, *args: Any, **kwargs: Any\n ) -> Union[\"StepOutput\", \"GeneratorStepOutput\"]:\n return func(*args, **kwargs)\n\n return type( # type: ignore\n func.__name__,\n (\n BaseClass,\n RuntimeParametersModel,\n ),\n {\n \"process\": process,\n \"inputs\": property(inputs_property),\n \"outputs\": property(outputs_property),\n \"__module__\": func.__module__,\n \"__doc__\": func.__doc__,\n \"_built_from_decorator\": True,\n # Override the `get_process_step_input` method to return the parameter\n # of the original function annotated with `StepInput`.\n \"get_process_step_input\": lambda self: step_input_parameter,\n },\n )\n\n return decorator\n
"},{"location":"api/step/generator_step/","title":"GeneratorStep","text":"This section contains the API reference for the GeneratorStep
class.
For more information and examples on how to use existing generator steps or create custom ones, please refer to Tutorial - Step - GeneratorStep.
"},{"location":"api/step/generator_step/#distilabel.steps.base.GeneratorStep","title":"GeneratorStep
","text":" Bases: _Step
, ABC
A special kind of Step
that is able to generate data, i.e. it doesn't receive any input from previous steps.
Attributes:
Name Type Descriptionbatch_size
RuntimeParameter[int]
The number of rows that will contain the batches generated by the step. Defaults to 50
.
batch_size
: The number of rows that will contain the batches generated by the step. Defaults to 50
. Source code in src/distilabel/steps/base.py
class GeneratorStep(_Step, ABC):\n \"\"\"A special kind of `Step` that is able to generate data i.e. it doesn't receive\n any input from the previous steps.\n\n Attributes:\n batch_size: The number of rows that will contain the batches generated by the\n step. Defaults to `50`.\n\n Runtime parameters:\n - `batch_size`: The number of rows that will contain the batches generated by\n the step. Defaults to `50`.\n \"\"\"\n\n batch_size: RuntimeParameter[int] = Field(\n default=50,\n description=\"The number of rows that will contain the batches generated by the\"\n \" step.\",\n )\n\n @abstractmethod\n def process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n \"\"\"Method that defines the generation logic of the step. It should yield the\n output rows and a boolean indicating if it's the last batch or not.\n\n Args:\n offset: The offset to start the generation from. Defaults to 0.\n\n Yields:\n The output rows and a boolean indicating if it's the last batch or not.\n \"\"\"\n pass\n\n def process_applying_mappings(self, offset: int = 0) -> \"GeneratorStepOutput\":\n \"\"\"Runs the `process` method of the step applying the `outputs_mappings` to the\n output rows. This is the function that should be used to run the generation logic\n of the step.\n\n Args:\n offset: The offset to start the generation from. Defaults to 0.\n\n Yields:\n The output rows and a boolean indicating if it's the last batch or not.\n \"\"\"\n\n # If the `Step` was built using the `@step` decorator, then we need to pass\n # the runtime parameters as `kwargs`, so they can be used within the processing\n # function\n generator = (\n self.process(offset=offset)\n if not self._built_from_decorator\n else self.process(offset=offset, **self._runtime_parameters)\n )\n\n for output_rows, last_batch in generator:\n yield (\n [\n {self.output_mappings.get(k, k): v for k, v in row.items()}\n for row in output_rows\n ],\n last_batch,\n )\n
"},{"location":"api/step/generator_step/#distilabel.steps.base.GeneratorStep.process","title":"process(offset=0)
abstractmethod
","text":"Method that defines the generation logic of the step. It should yield the output rows and a boolean indicating if it's the last batch or not.
Parameters:
Name Type Description Defaultoffset
int
The offset to start the generation from. Defaults to 0.
0
Yields:
Type DescriptionGeneratorStepOutput
The output rows and a boolean indicating if it's the last batch or not.
Source code in src/distilabel/steps/base.py
@abstractmethod\ndef process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n \"\"\"Method that defines the generation logic of the step. It should yield the\n output rows and a boolean indicating if it's the last batch or not.\n\n Args:\n offset: The offset to start the generation from. Defaults to 0.\n\n Yields:\n The output rows and a boolean indicating if it's the last batch or not.\n \"\"\"\n pass\n
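To make the contract concrete, here is a hedged sketch of a custom subclass (NumberGenerator is hypothetical): process yields (rows, last_batch) tuples and uses offset so a resumed pipeline can skip rows that were already generated.
from distilabel.steps import GeneratorStep\nfrom distilabel.steps.typing import GeneratorStepOutput, StepColumns\n\n\nclass NumberGenerator(GeneratorStep):  # hypothetical example step\n    num_rows: int = 200\n\n    @property\n    def outputs(self) -> StepColumns:\n        return [\"num\"]\n\n    def process(self, offset: int = 0) -> GeneratorStepOutput:\n        data = [{\"num\": i} for i in range(offset, self.num_rows)]\n        for start in range(0, len(data), self.batch_size):\n            batch = data[start : start + self.batch_size]\n            # yield the rows plus a flag marking whether this is the last batch\n            yield batch, start + self.batch_size >= len(data)\n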
"},{"location":"api/step/generator_step/#distilabel.steps.base.GeneratorStep.process_applying_mappings","title":"process_applying_mappings(offset=0)
","text":"Runs the process
method of the step applying the outputs_mappings
to the output rows. This is the function that should be used to run the generation logic of the step.
Parameters:
Name Type Description Defaultoffset
int
The offset to start the generation from. Defaults to 0.
0
Yields:
Type DescriptionGeneratorStepOutput
The output rows and a boolean indicating if it's the last batch or not.
Source code in src/distilabel/steps/base.py
def process_applying_mappings(self, offset: int = 0) -> \"GeneratorStepOutput\":\n \"\"\"Runs the `process` method of the step applying the `outputs_mappings` to the\n output rows. This is the function that should be used to run the generation logic\n of the step.\n\n Args:\n offset: The offset to start the generation from. Defaults to 0.\n\n Yields:\n The output rows and a boolean indicating if it's the last batch or not.\n \"\"\"\n\n # If the `Step` was built using the `@step` decorator, then we need to pass\n # the runtime parameters as `kwargs`, so they can be used within the processing\n # function\n generator = (\n self.process(offset=offset)\n if not self._built_from_decorator\n else self.process(offset=offset, **self._runtime_parameters)\n )\n\n for output_rows, last_batch in generator:\n yield (\n [\n {self.output_mappings.get(k, k): v for k, v in row.items()}\n for row in output_rows\n ],\n last_batch,\n )\n
"},{"location":"api/step/generator_step/#distilabel.steps.generators.utils.make_generator_step","title":"make_generator_step(dataset, pipeline=None, batch_size=50, input_mappings=None, output_mappings=None, resources=StepResources(), repo_id='default_name')
","text":"Helper method to create a GeneratorStep
from a dataset, to simplify loading data into a Pipeline.
Parameters:
Name Type Description Defaultdataset
Union[Dataset, DataFrame, List[Dict[str, str]]]
The dataset to use in the Pipeline
.
batch_size
int
The batch size, which will default to the same value used by the GeneratorStep
s. Defaults to 50
.
50
input_mappings
Optional[Dict[str, str]]
Applies the same as any other step. Defaults to None
.
None
output_mappings
Optional[Dict[str, str]]
Applies the same as any other step. Defaults to None
.
None
resources
StepResources
Applies the same as any other step. Defaults to StepResources()
.
StepResources()
repo_id
Optional[str]
The repository ID to use in the LoadDataFromHub
step. This shouldn't be necessary, but in case of error, the dataset will try to be loaded using load_dataset
internally. If that case happens, the repo_id
will be used.
'default_name'
Raises:
Type DescriptionValueError
If the format is different from the ones supported.
Returns:
Type DescriptionGeneratorStep
A LoadDataFromDicts
if the input is a list of dicts, or LoadDataFromHub
instance
GeneratorStep
if the input is a pd.DataFrame
or a Dataset
.
src/distilabel/steps/generators/utils.py
def make_generator_step(\n dataset: Union[Dataset, pd.DataFrame, List[Dict[str, str]]],\n pipeline: Union[\"BasePipeline\", None] = None,\n batch_size: int = 50,\n input_mappings: Optional[Dict[str, str]] = None,\n output_mappings: Optional[Dict[str, str]] = None,\n resources: StepResources = StepResources(),\n repo_id: Optional[str] = \"default_name\",\n) -> \"GeneratorStep\":\n \"\"\"Helper method to create a `GeneratorStep` from a dataset, to simplify\n\n Args:\n dataset: The dataset to use in the `Pipeline`.\n batch_size: The batch_size, will default to the same used by the `GeneratorStep`s.\n Defaults to `50`.\n input_mappings: Applies the same as any other step. Defaults to `None`.\n output_mappings: Applies the same as any other step. Defaults to `None`.\n resources: Applies the same as any other step. Defaults to `StepResources()`.\n repo_id: The repository ID to use in the `LoadDataFromHub` step.\n This shouldn't be necessary, but in case of error, the dataset will try to be loaded\n using `load_dataset` internally. If that case happens, the `repo_id` will be used.\n\n Raises:\n ValueError: If the format is different from the ones supported.\n\n Returns:\n A `LoadDataFromDicts` if the input is a list of dicts, or `LoadDataFromHub` instance\n if the input is a `pd.DataFrame` or a `Dataset`.\n \"\"\"\n from distilabel.steps import LoadDataFromDicts, LoadDataFromHub\n\n if isinstance(dataset, list):\n return LoadDataFromDicts(\n pipeline=pipeline,\n data=dataset,\n batch_size=batch_size,\n input_mappings=input_mappings or {},\n output_mappings=output_mappings or {},\n resources=resources,\n )\n\n if isinstance(dataset, pd.DataFrame):\n dataset = Dataset.from_pandas(dataset, preserve_index=False)\n\n if not isinstance(dataset, Dataset):\n raise DistilabelUserError(\n f\"Dataset type not allowed: {type(dataset)}, must be one of: \"\n \"`datasets.Dataset`, `pd.DataFrame`, `List[Dict[str, str]]`\",\n page=\"sections/how_to_guides/basic/pipeline/?h=make_#__tabbed_1_2\",\n )\n\n loader = LoadDataFromHub(\n pipeline=pipeline,\n repo_id=repo_id,\n batch_size=batch_size,\n input_mappings=input_mappings or {},\n output_mappings=output_mappings or {},\n resources=resources,\n )\n super(loader.__class__, loader).load() # Ensure the logger is loaded\n loader._dataset = dataset\n loader.num_examples = len(dataset)\n loader._dataset_info = {\"default\": dataset.info}\n return loader\n
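A quick, hedged usage sketch (the column names are arbitrary): a list of dicts produces a LoadDataFromDicts loader, while a pandas DataFrame or a datasets.Dataset is wrapped in a LoadDataFromHub loader with the dataset preloaded.
import pandas as pd\n\nfrom distilabel.steps.generators.utils import make_generator_step\n\n# From a list of dicts -> LoadDataFromDicts\nloader_from_dicts = make_generator_step(\n    [{\"instruction\": \"Write a haiku about data.\"}], batch_size=10\n)\n\n# From a pandas DataFrame -> LoadDataFromHub with the dataset attached\nloader_from_df = make_generator_step(\n    pd.DataFrame({\"instruction\": [\"Summarize this text.\"]}), batch_size=10\n)\n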
"},{"location":"api/step/global_step/","title":"GlobalStep","text":"This section contains the API reference for the GlobalStep
class.
For more information and examples on how to use existing global steps or create custom ones, please refer to Tutorial - Step - GlobalStep.
"},{"location":"api/step/global_step/#distilabel.steps.base.GlobalStep","title":"GlobalStep
","text":" Bases: Step
, ABC
A special kind of Step
whose process
method receives all the data processed by its previous steps at once, instead of receiving it in batches. This kind of step is useful when the processing logic requires having all the data at once, for example to train a model or to perform a global aggregation.
src/distilabel/steps/base.py
class GlobalStep(Step, ABC):\n \"\"\"A special kind of `Step` which it's `process` method receives all the data processed\n by their previous steps at once, instead of receiving it in batches. This kind of steps\n are useful when the processing logic requires to have all the data at once, for example\n to train a model, to perform a global aggregation, etc.\n \"\"\"\n\n @property\n def inputs(self) -> \"StepColumns\":\n return []\n\n @property\n def outputs(self) -> \"StepColumns\":\n return []\n
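For instance, a rough sketch of a global deduplication step (DeduplicateInstructions is hypothetical): because it subclasses GlobalStep, its process method receives every row produced by the previous steps in a single call.
from distilabel.steps import GlobalStep, StepInput\nfrom distilabel.steps.typing import StepColumns, StepOutput\n\n\nclass DeduplicateInstructions(GlobalStep):  # hypothetical example step\n    @property\n    def inputs(self) -> StepColumns:\n        return [\"instruction\"]\n\n    def process(self, inputs: StepInput) -> StepOutput:\n        seen = set()\n        unique_rows = []\n        for row in inputs:  # `inputs` holds the whole dataset at once\n            if row[\"instruction\"] not in seen:\n                seen.add(row[\"instruction\"])\n                unique_rows.append(row)\n        yield unique_rows\n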
"},{"location":"api/step/resources/","title":"StepResources","text":""},{"location":"api/step/resources/#distilabel.steps.base.StepResources","title":"StepResources
","text":" Bases: RuntimeParametersMixin
, BaseModel
A class to define the resources assigned to a _Step
.
Attributes:
Name Type Descriptionreplicas
RuntimeParameter[PositiveInt]
The number of replicas for the step.
cpus
Optional[RuntimeParameter[PositiveInt]]
The number of CPUs assigned to each step replica.
gpus
Optional[RuntimeParameter[PositiveInt]]
The number of GPUs assigned to each step replica.
memory
Optional[RuntimeParameter[PositiveInt]]
The memory in bytes required for each step replica.
resources
Optional[RuntimeParameter[Dict[str, int]]]
A dictionary containing the number of custom resources required for each step replica.
Source code in src/distilabel/steps/base.py
class StepResources(RuntimeParametersMixin, BaseModel):\n \"\"\"A class to define the resources assigned to a `_Step`.\n\n Attributes:\n replicas: The number of replicas for the step.\n cpus: The number of CPUs assigned to each step replica.\n gpus: The number of GPUs assigned to each step replica.\n memory: The memory in bytes required for each step replica.\n resources: A dictionary containing the number of custom resources required for\n each step replica.\n \"\"\"\n\n replicas: RuntimeParameter[PositiveInt] = Field(\n default=1, description=\"The number of replicas for the step.\"\n )\n cpus: Optional[RuntimeParameter[PositiveInt]] = Field(\n default=None, description=\"The number of CPUs assigned to each step replica.\"\n )\n gpus: Optional[RuntimeParameter[PositiveInt]] = Field(\n default=None, description=\"The number of GPUs assigned to each step replica.\"\n )\n memory: Optional[RuntimeParameter[PositiveInt]] = Field(\n default=None, description=\"The memory in bytes required for each step replica.\"\n )\n resources: Optional[RuntimeParameter[Dict[str, int]]] = Field(\n default=None,\n description=\"A dictionary containing names of custom resources and the\"\n \" number of those resources required for each step replica.\",\n )\n
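A small sketch of how these resources are usually declared when instantiating a step (assuming StepResources is importable from distilabel.steps; the values are placeholders). The requirements apply to each replica of the step.
from distilabel.steps import LoadDataFromDicts, StepResources\n\nloader = LoadDataFromDicts(\n    data=[{\"instruction\": \"Write a haiku about data.\"}],\n    resources=StepResources(replicas=2, cpus=1, memory=2 * 1024**3),  # 2 GiB per replica\n)\n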
"},{"location":"api/step/typing/","title":"Step Typing","text":""},{"location":"api/step/typing/#distilabel.steps.typing","title":"typing
","text":""},{"location":"api/step/typing/#distilabel.steps.typing.StepOutput","title":"StepOutput = Iterator[List[Dict[str, Any]]]
module-attribute
","text":"StepOutput
is an alias of the type Iterator[List[Dict[str, Any]]]
GeneratorStepOutput = Iterator[Tuple[List[Dict[str, Any]], bool]]
module-attribute
","text":"GeneratorStepOutput
is an alias of the type Iterator[Tuple[List[Dict[str, Any]], bool]]
StepColumns = Union[List[str], Dict[str, bool]]
module-attribute
","text":"StepColumns
is an alias of the type Union[List[str], Dict[str, bool]]
used by the inputs
and outputs
properties of a Step
. In the case of a List[str]
, it is a list with the required columns. In the case of a Dict[str, bool]
, it is a dictionary where the keys are the columns and the values are booleans indicating whether the column is required or not.
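As a brief sketch (KeepShortInstructions is hypothetical), these aliases are what a custom step typically uses to annotate its inputs, outputs and process method:
from distilabel.steps import Step, StepInput\nfrom distilabel.steps.typing import StepColumns, StepOutput\n\n\nclass KeepShortInstructions(Step):  # hypothetical example step\n    @property\n    def inputs(self) -> StepColumns:\n        # \"instruction\" is required, \"system_prompt\" is optional\n        return {\"instruction\": True, \"system_prompt\": False}\n\n    @property\n    def outputs(self) -> StepColumns:\n        return [\"instruction\"]\n\n    def process(self, inputs: StepInput) -> StepOutput:\n        yield [row for row in inputs if len(row[\"instruction\"]) <= 256]\n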
This section contains the existing steps integrated with Argilla
so as to easily push the generated datasets to Argilla.
base
","text":""},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.base.ArgillaBase","title":"ArgillaBase
","text":" Bases: Step
, ABC
Abstract step that provides a base class to subclass from, which contains the boilerplate code required to interact with Argilla, as well as some extra validations on top of it. It also defines the abstract methods that need to be implemented in order to add a new dataset type as a step.
Note: This class is not intended to be instantiated directly, but via a subclass.
Attributes:
Name Type Descriptiondataset_name
RuntimeParameter[str]
The name of the dataset in Argilla where the records will be added.
dataset_workspace
Optional[RuntimeParameter[str]]
The workspace where the dataset will be created in Argilla. Defaults to None
, which means it will be created in the default workspace.
api_url
Optional[RuntimeParameter[str]]
The URL of the Argilla API. Defaults to None
, which means it will be read from the ARGILLA_API_URL
environment variable.
api_key
Optional[RuntimeParameter[SecretStr]]
The API key to authenticate with Argilla. Defaults to None
, which means it will be read from the ARGILLA_API_KEY
environment variable.
dataset_name
: The name of the dataset in Argilla where the records will be added.dataset_workspace
: The workspace where the dataset will be created in Argilla. Defaults to None
, which means it will be created in the default workspace.api_url
: The base URL to use for the Argilla API requests.api_key
: The API key to authenticate the requests to the Argilla API.inputs
value providedsrc/distilabel/steps/argilla/base.py
class ArgillaBase(Step, ABC):\n \"\"\"Abstract step that provides a class to subclass from, that contains the boilerplate code\n required to interact with Argilla, as well as some extra validations on top of it. It also defines\n the abstract methods that need to be implemented in order to add a new dataset type as a step.\n\n Note:\n This class is not intended to be instanced directly, but via subclass.\n\n Attributes:\n dataset_name: The name of the dataset in Argilla where the records will be added.\n dataset_workspace: The workspace where the dataset will be created in Argilla. Defaults to\n `None`, which means it will be created in the default workspace.\n api_url: The URL of the Argilla API. Defaults to `None`, which means it will be read from\n the `ARGILLA_API_URL` environment variable.\n api_key: The API key to authenticate with Argilla. Defaults to `None`, which means it will\n be read from the `ARGILLA_API_KEY` environment variable.\n\n Runtime parameters:\n - `dataset_name`: The name of the dataset in Argilla where the records will be\n added.\n - `dataset_workspace`: The workspace where the dataset will be created in Argilla.\n Defaults to `None`, which means it will be created in the default workspace.\n - `api_url`: The base URL to use for the Argilla API requests.\n - `api_key`: The API key to authenticate the requests to the Argilla API.\n\n Input columns:\n - dynamic, based on the `inputs` value provided\n \"\"\"\n\n dataset_name: RuntimeParameter[str] = Field(\n default=None, description=\"The name of the dataset in Argilla.\"\n )\n dataset_workspace: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The workspace where the dataset will be created in Argilla. Defaults \"\n \"to `None` which means it will be created in the default workspace.\",\n )\n\n api_url: Optional[RuntimeParameter[str]] = Field(\n default_factory=lambda: os.getenv(_ARGILLA_API_URL_ENV_VAR_NAME),\n description=\"The base URL to use for the Argilla API requests.\",\n )\n api_key: Optional[RuntimeParameter[SecretStr]] = Field(\n default_factory=lambda: os.getenv(_ARGILLA_API_KEY_ENV_VAR_NAME),\n description=\"The API key to authenticate the requests to the Argilla API.\",\n )\n\n _client: Optional[\"Argilla\"] = PrivateAttr(...)\n _dataset: Optional[\"Dataset\"] = PrivateAttr(...)\n\n def model_post_init(self, __context: Any) -> None:\n \"\"\"Checks that the Argilla Python SDK is installed, and then filters the Argilla warnings.\"\"\"\n super().model_post_init(__context)\n\n if importlib.util.find_spec(\"argilla\") is None:\n raise ImportError(\n \"Argilla is not installed. 
Please install it using `pip install argilla\"\n \" --upgrade`.\"\n )\n\n def _client_init(self) -> None:\n \"\"\"Initializes the Argilla API client with the provided `api_url` and `api_key`.\"\"\"\n try:\n self._client = rg.Argilla( # type: ignore\n api_url=self.api_url,\n api_key=self.api_key.get_secret_value(), # type: ignore\n headers={\"Authorization\": f\"Bearer {os.environ['HF_TOKEN']}\"}\n if isinstance(self.api_url, str)\n and \"hf.space\" in self.api_url\n and \"HF_TOKEN\" in os.environ\n else {},\n )\n except Exception as e:\n raise DistilabelUserError(\n f\"Failed to initialize the Argilla API: {e}\",\n page=\"sections/how_to_guides/advanced/argilla/\",\n ) from e\n\n @property\n def _dataset_exists_in_workspace(self) -> bool:\n \"\"\"Checks if the dataset already exists in Argilla in the provided workspace if any.\n\n Returns:\n `True` if the dataset exists, `False` otherwise.\n \"\"\"\n return (\n self._client.datasets( # type: ignore\n name=self.dataset_name, # type: ignore\n workspace=self.dataset_workspace,\n )\n is not None\n )\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The outputs of the step is an empty list, since the steps subclassing from this one, will\n always be leaf nodes and won't propagate the inputs neither generate any outputs.\n \"\"\"\n return []\n\n def load(self) -> None:\n \"\"\"Method to perform any initialization logic before the `process` method is\n called. For example, to load an LLM, stablish a connection to a database, etc.\n \"\"\"\n super().load()\n\n if self.api_url is None or self.api_key is None:\n raise DistilabelUserError(\n \"`Argilla` step requires the `api_url` and `api_key` to be provided. Please,\"\n \" provide those at step instantiation, via environment variables `ARGILLA_API_URL`\"\n \" and `ARGILLA_API_KEY`, or as `Step` runtime parameters via `pipeline.run(parameters={...})`.\",\n page=\"sections/how_to_guides/advanced/argilla/\",\n )\n\n self._client_init()\n\n @property\n @abstractmethod\n def inputs(self) -> \"StepColumns\": ...\n\n @abstractmethod\n def process(self, *inputs: StepInput) -> \"StepOutput\": ...\n
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.base.ArgillaBase.outputs","title":"outputs
property
","text":"The outputs of the step is an empty list, since the steps subclassing from this one, will always be leaf nodes and won't propagate the inputs neither generate any outputs.
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.base.ArgillaBase.model_post_init","title":"model_post_init(__context)
","text":"Checks that the Argilla Python SDK is installed, and then filters the Argilla warnings.
Source code insrc/distilabel/steps/argilla/base.py
def model_post_init(self, __context: Any) -> None:\n \"\"\"Checks that the Argilla Python SDK is installed, and then filters the Argilla warnings.\"\"\"\n super().model_post_init(__context)\n\n if importlib.util.find_spec(\"argilla\") is None:\n raise ImportError(\n \"Argilla is not installed. Please install it using `pip install argilla\"\n \" --upgrade`.\"\n )\n
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.base.ArgillaBase.load","title":"load()
","text":"Method to perform any initialization logic before the process
method is called. For example, to load an LLM, establish a connection to a database, etc.
src/distilabel/steps/argilla/base.py
def load(self) -> None:\n \"\"\"Method to perform any initialization logic before the `process` method is\n called. For example, to load an LLM, stablish a connection to a database, etc.\n \"\"\"\n super().load()\n\n if self.api_url is None or self.api_key is None:\n raise DistilabelUserError(\n \"`Argilla` step requires the `api_url` and `api_key` to be provided. Please,\"\n \" provide those at step instantiation, via environment variables `ARGILLA_API_URL`\"\n \" and `ARGILLA_API_KEY`, or as `Step` runtime parameters via `pipeline.run(parameters={...})`.\",\n page=\"sections/how_to_guides/advanced/argilla/\",\n )\n\n self._client_init()\n
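In practice (a hedged sketch with placeholder values), the credentials can be provided either through the environment variables mentioned above before instantiating any Argilla step, or as runtime parameters for the step when calling pipeline.run:
import os\n\n# Option 1: environment variables picked up when the step is instantiated\nos.environ[\"ARGILLA_API_URL\"] = \"https://my-argilla.example.com\"  # placeholder URL\nos.environ[\"ARGILLA_API_KEY\"] = \"my-api-key\"  # placeholder key\n\n# Option 2: runtime parameters for a step named \"to_argilla\" (assuming an existing `pipeline`)\ndistiset = pipeline.run(\n    parameters={\"to_argilla\": {\"api_url\": \"https://my-argilla.example.com\", \"api_key\": \"my-api-key\"}}\n)\n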
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.preference","title":"preference
","text":""},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.preference.PreferenceToArgilla","title":"PreferenceToArgilla
","text":" Bases: ArgillaBase
Creates a preference dataset in Argilla.
Step that creates a dataset in Argilla during the load phase, and then pushes the input batches into it as records. This dataset is a preference dataset, where there is one field for the instruction and one extra field per generation within the same record, and then a rating question per generation field. The rating question asks the annotator to set a rating from 1 to 5 for each of the provided generations.
Note: This step is meant to be used in conjunction with the UltraFeedback
step, or any other step generating both ratings and responses for a given set of instruction and generations. Alternatively, it can also be used with any other task or step generating only the instruction
and generations
, as the ratings
and rationales
are optional.
Attributes:
Name Type Descriptionnum_generations
int
The number of generations to include in the dataset.
dataset_name
RuntimeParameter[str]
The name of the dataset in Argilla.
dataset_workspace
Optional[RuntimeParameter[str]]
The workspace where the dataset will be created in Argilla. Defaults to None
, which means it will be created in the default workspace.
api_url
Optional[RuntimeParameter[str]]
The URL of the Argilla API. Defaults to None
, which means it will be read from the ARGILLA_API_URL
environment variable.
api_key
Optional[RuntimeParameter[SecretStr]]
The API key to authenticate with Argilla. Defaults to None
, which means it will be read from the ARGILLA_API_KEY
environment variable.
api_url
: The base URL to use for the Argilla API requests.api_key
: The API key to authenticate the requests to the Argilla API.str
): The instruction that was used to generate the completion.List[str]
): The completion that was generated based on the input instruction.List[str]
, optional): The ratings for the generations. If not provided, the generated ratings won't be pushed to Argilla.List[str]
, optional): The rationales for the ratings. If not provided, the generated rationales won't be pushed to Argilla.Examples:
Push a preference dataset to an Argilla instance:
from distilabel.steps import PreferenceToArgilla\n\nto_argilla = PreferenceToArgilla(\n num_generations=2,\n api_url=\"https://dibt-demo-argilla-space.hf.space/\",\n api_key=\"api.key\",\n dataset_name=\"argilla_dataset\",\n dataset_workspace=\"my_workspace\",\n)\nto_argilla.load()\n\nresult = next(\n to_argilla.process(\n [\n {\n \"instruction\": \"instruction\",\n \"generations\": [\"first_generation\", \"second_generation\"],\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction', 'generations': ['first_generation', 'second_generation']}]\n
It can also include ratings and rationales:
result = next(\n to_argilla.process(\n [\n {\n \"instruction\": \"instruction\",\n \"generations\": [\"first_generation\", \"second_generation\"],\n \"ratings\": [\"4\", \"5\"],\n \"rationales\": [\"rationale for 4\", \"rationale for 5\"],\n }\n ],\n )\n)\n# >>> result\n# [\n# {\n# 'instruction': 'instruction',\n# 'generations': ['first_generation', 'second_generation'],\n# 'ratings': ['4', '5'],\n# 'rationales': ['rationale for 4', 'rationale for 5']\n# }\n# ]\n
Source code in src/distilabel/steps/argilla/preference.py
class PreferenceToArgilla(ArgillaBase):\n \"\"\"Creates a preference dataset in Argilla.\n\n Step that creates a dataset in Argilla during the load phase, and then pushes the input\n batches into it as records. This dataset is a preference dataset, where there's one field\n for the instruction and one extra field per each generation within the same record, and then\n a rating question per each of the generation fields. The rating question asks the annotator to\n set a rating from 1 to 5 for each of the provided generations.\n\n Note:\n This step is meant to be used in conjunction with the `UltraFeedback` step, or any other step\n generating both ratings and responses for a given set of instruction and generations for the\n given instruction. But alternatively, it can also be used with any other task or step generating\n only the `instruction` and `generations`, as the `ratings` and `rationales` are optional.\n\n Attributes:\n num_generations: The number of generations to include in the dataset.\n dataset_name: The name of the dataset in Argilla.\n dataset_workspace: The workspace where the dataset will be created in Argilla. Defaults to\n `None`, which means it will be created in the default workspace.\n api_url: The URL of the Argilla API. Defaults to `None`, which means it will be read from\n the `ARGILLA_API_URL` environment variable.\n api_key: The API key to authenticate with Argilla. Defaults to `None`, which means it will\n be read from the `ARGILLA_API_KEY` environment variable.\n\n Runtime parameters:\n - `api_url`: The base URL to use for the Argilla API requests.\n - `api_key`: The API key to authenticate the requests to the Argilla API.\n\n Input columns:\n - instruction (`str`): The instruction that was used to generate the completion.\n - generations (`List[str]`): The completion that was generated based on the input instruction.\n - ratings (`List[str]`, optional): The ratings for the generations. If not provided, the\n generated ratings won't be pushed to Argilla.\n - rationales (`List[str]`, optional): The rationales for the ratings. 
If not provided, the\n generated rationales won't be pushed to Argilla.\n\n Examples:\n Push a preference dataset to an Argilla instance:\n\n ```python\n from distilabel.steps import PreferenceToArgilla\n\n to_argilla = PreferenceToArgilla(\n num_generations=2,\n api_url=\"https://dibt-demo-argilla-space.hf.space/\",\n api_key=\"api.key\",\n dataset_name=\"argilla_dataset\",\n dataset_workspace=\"my_workspace\",\n )\n to_argilla.load()\n\n result = next(\n to_argilla.process(\n [\n {\n \"instruction\": \"instruction\",\n \"generations\": [\"first_generation\", \"second_generation\"],\n }\n ],\n )\n )\n # >>> result\n # [{'instruction': 'instruction', 'generations': ['first_generation', 'second_generation']}]\n ```\n\n It can also include ratings and rationales:\n\n ```python\n result = next(\n to_argilla.process(\n [\n {\n \"instruction\": \"instruction\",\n \"generations\": [\"first_generation\", \"second_generation\"],\n \"ratings\": [\"4\", \"5\"],\n \"rationales\": [\"rationale for 4\", \"rationale for 5\"],\n }\n ],\n )\n )\n # >>> result\n # [\n # {\n # 'instruction': 'instruction',\n # 'generations': ['first_generation', 'second_generation'],\n # 'ratings': ['4', '5'],\n # 'rationales': ['rationale for 4', 'rationale for 5']\n # }\n # ]\n ```\n \"\"\"\n\n num_generations: int\n\n _id: str = PrivateAttr(default=\"id\")\n _instruction: str = PrivateAttr(...)\n _generations: str = PrivateAttr(...)\n _ratings: str = PrivateAttr(...)\n _rationales: str = PrivateAttr(...)\n\n def load(self) -> None:\n \"\"\"Sets the `_instruction` and `_generations` attributes based on the `inputs_mapping`, otherwise\n uses the default values; and then uses those values to create a `FeedbackDataset` suited for\n the text-generation scenario. And then it pushes it to Argilla.\n \"\"\"\n super().load()\n\n # Both `instruction` and `generations` will be used as the fields of the dataset\n self._instruction = self.input_mappings.get(\"instruction\", \"instruction\")\n self._generations = self.input_mappings.get(\"generations\", \"generations\")\n # Both `ratings` and `rationales` will be used as suggestions to the default questions of the dataset\n self._ratings = self.input_mappings.get(\"ratings\", \"ratings\")\n self._rationales = self.input_mappings.get(\"rationales\", \"rationales\")\n\n if self._dataset_exists_in_workspace:\n _dataset = self._client.datasets( # type: ignore\n name=self.dataset_name, # type: ignore\n workspace=self.dataset_workspace, # type: ignore\n )\n\n for field in _dataset.fields:\n if not isinstance(field, rg.TextField):\n continue\n if (\n field.name\n not in [self._id, self._instruction] # type: ignore\n + [\n f\"{self._generations}-{idx}\"\n for idx in range(self.num_generations)\n ]\n and field.required\n ):\n raise DistilabelUserError(\n f\"The dataset '{self.dataset_name}' in the workspace '{self.dataset_workspace}'\"\n f\" already exists, but contains at least a required field that is\"\n f\" neither `{self._id}`, `{self._instruction}`, nor `{self._generations}`\"\n f\" (one per generation starting from 0 up to {self.num_generations - 1}).\",\n page=\"components-gallery/steps/preferencetoargilla/\",\n )\n\n self._dataset = _dataset\n else:\n _settings = rg.Settings( # type: ignore\n fields=[\n rg.TextField(name=self._id, title=self._id), # type: ignore\n rg.TextField(name=self._instruction, title=self._instruction), # type: ignore\n *self._generation_fields(), # type: ignore\n ],\n questions=self._rating_rationale_pairs(), # type: ignore\n )\n _dataset = rg.Dataset( # type: 
ignore\n name=self.dataset_name,\n workspace=self.dataset_workspace,\n settings=_settings,\n client=self._client,\n )\n self._dataset = _dataset.create()\n\n def _generation_fields(self) -> List[\"TextField\"]:\n \"\"\"Method to generate the fields for each of the generations.\n\n Returns:\n A list containing `TextField`s for each text generation.\n \"\"\"\n return [\n rg.TextField( # type: ignore\n name=f\"{self._generations}-{idx}\",\n title=f\"{self._generations}-{idx}\",\n required=True if idx == 0 else False,\n )\n for idx in range(self.num_generations)\n ]\n\n def _rating_rationale_pairs(\n self,\n ) -> List[Union[\"RatingQuestion\", \"TextQuestion\"]]:\n \"\"\"Method to generate the rating and rationale questions for each of the generations.\n\n Returns:\n A list of questions containing a `RatingQuestion` and `TextQuestion` pair for\n each text generation.\n \"\"\"\n questions = []\n for idx in range(self.num_generations):\n questions.extend(\n [\n rg.RatingQuestion( # type: ignore\n name=f\"{self._generations}-{idx}-rating\",\n title=f\"Rate {self._generations}-{idx} given {self._instruction}.\",\n description=f\"Ignore this question if the corresponding `{self._generations}-{idx}` field is not available.\"\n if idx != 0\n else None,\n values=[1, 2, 3, 4, 5],\n required=True if idx == 0 else False,\n ),\n rg.TextQuestion( # type: ignore\n name=f\"{self._generations}-{idx}-rationale\",\n title=f\"Specify the rationale for {self._generations}-{idx}'s rating.\",\n description=f\"Ignore this question if the corresponding `{self._generations}-{idx}` field is not available.\"\n if idx != 0\n else None,\n required=False,\n ),\n ]\n )\n return questions\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs for the step are the `instruction` and the `generations`. Optionally, one could also\n provide the `ratings` and the `rationales` for the generations.\"\"\"\n return [\"instruction\", \"generations\"]\n\n @property\n def optional_inputs(self) -> List[str]:\n \"\"\"The optional inputs for the step are the `ratings` and the `rationales` for the generations.\"\"\"\n return [\"ratings\", \"rationales\"]\n\n def _add_suggestions_if_any(self, input: Dict[str, Any]) -> List[\"Suggestion\"]:\n \"\"\"Method to generate the suggestions for the `rg.Record` based on the input.\n\n Returns:\n A list of `Suggestion`s for the rating and rationales questions.\n \"\"\"\n # Since the `suggestions` i.e. 
answers to the `questions` are optional, will default to {}\n suggestions = []\n # If `ratings` is in `input`, then add those as suggestions\n if self._ratings in input:\n suggestions.extend(\n [\n rg.Suggestion( # type: ignore\n value=rating,\n question_name=f\"{self._generations}-{idx}-rating\",\n )\n for idx, rating in enumerate(input[self._ratings])\n if rating is not None\n and isinstance(rating, int)\n and rating in [1, 2, 3, 4, 5]\n ],\n )\n # If `rationales` is in `input`, then add those as suggestions\n if self._rationales in input:\n suggestions.extend(\n [\n rg.Suggestion( # type: ignore\n value=rationale,\n question_name=f\"{self._generations}-{idx}-rationale\",\n )\n for idx, rationale in enumerate(input[self._rationales])\n if rationale is not None and isinstance(rationale, str)\n ],\n )\n return suggestions\n\n @override\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Creates and pushes the records as `rg.Record`s to the Argilla dataset.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n records = []\n for input in inputs:\n # Generate the SHA-256 hash of the instruction to use it as the metadata\n instruction_id = hashlib.sha256(\n input[\"instruction\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n generations = {\n f\"{self._generations}-{idx}\": generation\n for idx, generation in enumerate(input[\"generations\"]) # type: ignore\n }\n\n records.append( # type: ignore\n rg.Record( # type: ignore\n fields={\n \"id\": instruction_id,\n \"instruction\": input[\"instruction\"], # type: ignore\n **generations,\n },\n suggestions=self._add_suggestions_if_any(input), # type: ignore\n )\n )\n self._dataset.records.log(records) # type: ignore\n yield inputs\n
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.preference.PreferenceToArgilla.inputs","title":"inputs
property
","text":"The inputs for the step are the instruction
and the generations
. Optionally, one could also provide the ratings
and the rationales
for the generations.
optional_inputs
property
","text":"The optional inputs for the step are the ratings
and the rationales
for the generations.
load()
","text":"Sets the _instruction
and _generations
attributes based on the inputs_mapping
, otherwise uses the default values; and then uses those values to create a FeedbackDataset
suited for the text-generation scenario. And then it pushes it to Argilla.
src/distilabel/steps/argilla/preference.py
def load(self) -> None:\n \"\"\"Sets the `_instruction` and `_generations` attributes based on the `inputs_mapping`, otherwise\n uses the default values; and then uses those values to create a `FeedbackDataset` suited for\n the text-generation scenario. And then it pushes it to Argilla.\n \"\"\"\n super().load()\n\n # Both `instruction` and `generations` will be used as the fields of the dataset\n self._instruction = self.input_mappings.get(\"instruction\", \"instruction\")\n self._generations = self.input_mappings.get(\"generations\", \"generations\")\n # Both `ratings` and `rationales` will be used as suggestions to the default questions of the dataset\n self._ratings = self.input_mappings.get(\"ratings\", \"ratings\")\n self._rationales = self.input_mappings.get(\"rationales\", \"rationales\")\n\n if self._dataset_exists_in_workspace:\n _dataset = self._client.datasets( # type: ignore\n name=self.dataset_name, # type: ignore\n workspace=self.dataset_workspace, # type: ignore\n )\n\n for field in _dataset.fields:\n if not isinstance(field, rg.TextField):\n continue\n if (\n field.name\n not in [self._id, self._instruction] # type: ignore\n + [\n f\"{self._generations}-{idx}\"\n for idx in range(self.num_generations)\n ]\n and field.required\n ):\n raise DistilabelUserError(\n f\"The dataset '{self.dataset_name}' in the workspace '{self.dataset_workspace}'\"\n f\" already exists, but contains at least a required field that is\"\n f\" neither `{self._id}`, `{self._instruction}`, nor `{self._generations}`\"\n f\" (one per generation starting from 0 up to {self.num_generations - 1}).\",\n page=\"components-gallery/steps/preferencetoargilla/\",\n )\n\n self._dataset = _dataset\n else:\n _settings = rg.Settings( # type: ignore\n fields=[\n rg.TextField(name=self._id, title=self._id), # type: ignore\n rg.TextField(name=self._instruction, title=self._instruction), # type: ignore\n *self._generation_fields(), # type: ignore\n ],\n questions=self._rating_rationale_pairs(), # type: ignore\n )\n _dataset = rg.Dataset( # type: ignore\n name=self.dataset_name,\n workspace=self.dataset_workspace,\n settings=_settings,\n client=self._client,\n )\n self._dataset = _dataset.create()\n
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.preference.PreferenceToArgilla.process","title":"process(inputs)
","text":"Creates and pushes the records as rg.Record
s to the Argilla dataset.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of Python dictionaries with the inputs of the task.
requiredReturns:
Type DescriptionStepOutput
A list of Python dictionaries with the outputs of the task.
Source code insrc/distilabel/steps/argilla/preference.py
@override\ndef process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Creates and pushes the records as `rg.Record`s to the Argilla dataset.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n records = []\n for input in inputs:\n # Generate the SHA-256 hash of the instruction to use it as the metadata\n instruction_id = hashlib.sha256(\n input[\"instruction\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n generations = {\n f\"{self._generations}-{idx}\": generation\n for idx, generation in enumerate(input[\"generations\"]) # type: ignore\n }\n\n records.append( # type: ignore\n rg.Record( # type: ignore\n fields={\n \"id\": instruction_id,\n \"instruction\": input[\"instruction\"], # type: ignore\n **generations,\n },\n suggestions=self._add_suggestions_if_any(input), # type: ignore\n )\n )\n self._dataset.records.log(records) # type: ignore\n yield inputs\n
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.text_generation","title":"text_generation
","text":""},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.text_generation.TextGenerationToArgilla","title":"TextGenerationToArgilla
","text":" Bases: ArgillaBase
Creates a text generation dataset in Argilla.
Step
that creates a dataset in Argilla during the load phase, and then pushes the input batches into it as records. This dataset is a text-generation dataset, where there is one field per input, and then a label question to rate the quality of the completion as either bad (represented with \ud83d\udc4e) or good (represented with \ud83d\udc4d).
This step is meant to be used in conjunction with a TextGeneration
step and no column mapping is needed, as it will use the default values for the instruction
and generation
columns.
Attributes:
Name Type Descriptiondataset_name
RuntimeParameter[str]
The name of the dataset in Argilla.
dataset_workspace
Optional[RuntimeParameter[str]]
The workspace where the dataset will be created in Argilla. Defaults to None
, which means it will be created in the default workspace.
api_url
Optional[RuntimeParameter[str]]
The URL of the Argilla API. Defaults to None
, which means it will be read from the ARGILLA_API_URL
environment variable.
api_key
Optional[RuntimeParameter[SecretStr]]
The API key to authenticate with Argilla. Defaults to None
, which means it will be read from the ARGILLA_API_KEY
environment variable.
api_url
: The base URL to use for the Argilla API requests.api_key
: The API key to authenticate the requests to the Argilla API.str
): The instruction that was used to generate the completion.str
or List[str]
): The completions that were generated based on the input instruction.Examples:
Push a text generation dataset to an Argilla instance:
from distilabel.steps import TextGenerationToArgilla\n\nto_argilla = TextGenerationToArgilla(\n api_url=\"https://dibt-demo-argilla-space.hf.space/\",\n api_key=\"api.key\",\n dataset_name=\"argilla_dataset\",\n dataset_workspace=\"my_workspace\",\n)\nto_argilla.load()\n\nresult = next(\n to_argilla.process(\n [\n {\n \"instruction\": \"instruction\",\n \"generation\": \"generation\",\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction', 'generation': 'generation'}]\n
Source code in src/distilabel/steps/argilla/text_generation.py
class TextGenerationToArgilla(ArgillaBase):\n \"\"\"Creates a text generation dataset in Argilla.\n\n `Step` that creates a dataset in Argilla during the load phase, and then pushes the input\n batches into it as records. This dataset is a text-generation dataset, where there's one field\n per each input, and then a label question to rate the quality of the completion in either bad\n (represented with \ud83d\udc4e) or good (represented with \ud83d\udc4d).\n\n Note:\n This step is meant to be used in conjunction with a `TextGeneration` step and no column mapping\n is needed, as it will use the default values for the `instruction` and `generation` columns.\n\n Attributes:\n dataset_name: The name of the dataset in Argilla.\n dataset_workspace: The workspace where the dataset will be created in Argilla. Defaults to\n `None`, which means it will be created in the default workspace.\n api_url: The URL of the Argilla API. Defaults to `None`, which means it will be read from\n the `ARGILLA_API_URL` environment variable.\n api_key: The API key to authenticate with Argilla. Defaults to `None`, which means it will\n be read from the `ARGILLA_API_KEY` environment variable.\n\n Runtime parameters:\n - `api_url`: The base URL to use for the Argilla API requests.\n - `api_key`: The API key to authenticate the requests to the Argilla API.\n\n Input columns:\n - instruction (`str`): The instruction that was used to generate the completion.\n - generation (`str` or `List[str]`): The completions that were generated based on the input instruction.\n\n Examples:\n Push a text generation dataset to an Argilla instance:\n\n ```python\n from distilabel.steps import PreferenceToArgilla\n\n to_argilla = TextGenerationToArgilla(\n num_generations=2,\n api_url=\"https://dibt-demo-argilla-space.hf.space/\",\n api_key=\"api.key\",\n dataset_name=\"argilla_dataset\",\n dataset_workspace=\"my_workspace\",\n )\n to_argilla.load()\n\n result = next(\n to_argilla.process(\n [\n {\n \"instruction\": \"instruction\",\n \"generation\": \"generation\",\n }\n ],\n )\n )\n # >>> result\n # [{'instruction': 'instruction', 'generation': 'generation'}]\n ```\n \"\"\"\n\n _id: str = PrivateAttr(default=\"id\")\n _instruction: str = PrivateAttr(...)\n _generation: str = PrivateAttr(...)\n\n def load(self) -> None:\n \"\"\"Sets the `_instruction` and `_generation` attributes based on the `inputs_mapping`, otherwise\n uses the default values; and then uses those values to create a `FeedbackDataset` suited for\n the text-generation scenario. 
And then it pushes it to Argilla.\n \"\"\"\n super().load()\n\n self._instruction = self.input_mappings.get(\"instruction\", \"instruction\")\n self._generation = self.input_mappings.get(\"generation\", \"generation\")\n\n if self._dataset_exists_in_workspace:\n _dataset = self._client.datasets( # type: ignore\n name=self.dataset_name, # type: ignore\n workspace=self.dataset_workspace, # type: ignore\n )\n\n for field in _dataset.fields:\n if not isinstance(field, rg.TextField): # type: ignore\n continue\n if (\n field.name not in [self._id, self._instruction, self._generation]\n and field.required\n ):\n raise DistilabelUserError(\n f\"The dataset '{self.dataset_name}' in the workspace '{self.dataset_workspace}'\"\n f\" already exists, but contains at least a required field that is\"\n f\" neither `{self._id}`, `{self._instruction}`, nor `{self._generation}`,\"\n \" so it cannot be reused for this dataset.\",\n page=\"components-gallery/steps/textgenerationtoargilla/\",\n )\n\n self._dataset = _dataset\n else:\n _settings = rg.Settings( # type: ignore\n fields=[\n rg.TextField(name=self._id, title=self._id), # type: ignore\n rg.TextField(name=self._instruction, title=self._instruction), # type: ignore\n rg.TextField(name=self._generation, title=self._generation), # type: ignore\n ],\n questions=[\n rg.LabelQuestion( # type: ignore\n name=\"quality\",\n title=f\"What's the quality of the {self._generation} for the given {self._instruction}?\",\n labels={\"bad\": \"\ud83d\udc4e\", \"good\": \"\ud83d\udc4d\"}, # type: ignore\n )\n ],\n )\n _dataset = rg.Dataset( # type: ignore\n name=self.dataset_name,\n workspace=self.dataset_workspace,\n settings=_settings,\n client=self._client,\n )\n self._dataset = _dataset.create()\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs for the step are the `instruction` and the `generation`.\"\"\"\n return [\"instruction\", \"generation\"]\n\n @override\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Creates and pushes the records as FeedbackRecords to the Argilla dataset.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n records = []\n for input in inputs:\n # Generate the SHA-256 hash of the instruction to use it as the metadata\n instruction_id = hashlib.sha256(\n input[\"instruction\"].encode(\"utf-8\")\n ).hexdigest()\n\n generations = input[\"generation\"]\n\n # If the `generation` is not a list, then convert it into a list\n if not isinstance(generations, list):\n generations = [generations]\n\n # Create a `generations_set` to avoid adding duplicates\n generations_set = set()\n\n for generation in generations:\n # If the generation is already in the set, then skip it\n if generation in generations_set:\n continue\n # Otherwise, add it to the set\n generations_set.add(generation)\n\n records.append(\n rg.Record( # type: ignore\n fields={\n self._id: instruction_id,\n self._instruction: input[\"instruction\"],\n self._generation: generation,\n },\n ),\n )\n self._dataset.records.log(records) # type: ignore\n yield inputs\n
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.text_generation.TextGenerationToArgilla.inputs","title":"inputs
property
","text":"The inputs for the step are the instruction
and the generation
.
load()
","text":"Sets the _instruction
and _generation
attributes based on the inputs_mapping
, otherwise uses the default values; and then uses those values to create a FeedbackDataset
suited for the text-generation scenario. And then it pushes it to Argilla.
src/distilabel/steps/argilla/text_generation.py
def load(self) -> None:\n \"\"\"Sets the `_instruction` and `_generation` attributes based on the `inputs_mapping`, otherwise\n uses the default values; and then uses those values to create a `FeedbackDataset` suited for\n the text-generation scenario. And then it pushes it to Argilla.\n \"\"\"\n super().load()\n\n self._instruction = self.input_mappings.get(\"instruction\", \"instruction\")\n self._generation = self.input_mappings.get(\"generation\", \"generation\")\n\n if self._dataset_exists_in_workspace:\n _dataset = self._client.datasets( # type: ignore\n name=self.dataset_name, # type: ignore\n workspace=self.dataset_workspace, # type: ignore\n )\n\n for field in _dataset.fields:\n if not isinstance(field, rg.TextField): # type: ignore\n continue\n if (\n field.name not in [self._id, self._instruction, self._generation]\n and field.required\n ):\n raise DistilabelUserError(\n f\"The dataset '{self.dataset_name}' in the workspace '{self.dataset_workspace}'\"\n f\" already exists, but contains at least a required field that is\"\n f\" neither `{self._id}`, `{self._instruction}`, nor `{self._generation}`,\"\n \" so it cannot be reused for this dataset.\",\n page=\"components-gallery/steps/textgenerationtoargilla/\",\n )\n\n self._dataset = _dataset\n else:\n _settings = rg.Settings( # type: ignore\n fields=[\n rg.TextField(name=self._id, title=self._id), # type: ignore\n rg.TextField(name=self._instruction, title=self._instruction), # type: ignore\n rg.TextField(name=self._generation, title=self._generation), # type: ignore\n ],\n questions=[\n rg.LabelQuestion( # type: ignore\n name=\"quality\",\n title=f\"What's the quality of the {self._generation} for the given {self._instruction}?\",\n labels={\"bad\": \"\ud83d\udc4e\", \"good\": \"\ud83d\udc4d\"}, # type: ignore\n )\n ],\n )\n _dataset = rg.Dataset( # type: ignore\n name=self.dataset_name,\n workspace=self.dataset_workspace,\n settings=_settings,\n client=self._client,\n )\n self._dataset = _dataset.create()\n
"},{"location":"api/step_gallery/argilla/#distilabel.steps.argilla.text_generation.TextGenerationToArgilla.process","title":"process(inputs)
","text":"Creates and pushes the records as FeedbackRecords to the Argilla dataset.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of Python dictionaries with the inputs of the task.
requiredReturns:
Type DescriptionStepOutput
A list of Python dictionaries with the outputs of the task.
Source code insrc/distilabel/steps/argilla/text_generation.py
@override\ndef process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Creates and pushes the records as FeedbackRecords to the Argilla dataset.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n records = []\n for input in inputs:\n # Generate the SHA-256 hash of the instruction to use it as the metadata\n instruction_id = hashlib.sha256(\n input[\"instruction\"].encode(\"utf-8\")\n ).hexdigest()\n\n generations = input[\"generation\"]\n\n # If the `generation` is not a list, then convert it into a list\n if not isinstance(generations, list):\n generations = [generations]\n\n # Create a `generations_set` to avoid adding duplicates\n generations_set = set()\n\n for generation in generations:\n # If the generation is already in the set, then skip it\n if generation in generations_set:\n continue\n # Otherwise, add it to the set\n generations_set.add(generation)\n\n records.append(\n rg.Record( # type: ignore\n fields={\n self._id: instruction_id,\n self._instruction: input[\"instruction\"],\n self._generation: generation,\n },\n ),\n )\n self._dataset.records.log(records) # type: ignore\n yield inputs\n
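Below is a minimal usage sketch of this step on its own, assuming a running Argilla instance; the API URL, API key, dataset name, workspace and the instruction/generation values are placeholders.

```python
from distilabel.steps import TextGenerationToArgilla

to_argilla = TextGenerationToArgilla(
    dataset_name="text-generations",        # placeholder dataset name
    dataset_workspace="admin",              # placeholder workspace
    api_url="https://argilla.example.com",  # placeholder Argilla instance URL
    api_key="argilla.apikey",               # placeholder API key
)
to_argilla.load()  # creates (or reuses) the Argilla dataset described above

result = next(
    to_argilla.process(
        [
            {
                "instruction": "What is synthetic data?",
                "generation": "Data generated artificially instead of being collected.",
            }
        ]
    )
)
# The records are logged to the Argilla dataset and the inputs are yielded unchanged.
```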
"},{"location":"api/step_gallery/columns/","title":"Columns","text":"This section contains the existing steps intended to be used for common column operations to apply to the batches.
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.expand","title":"expand
","text":""},{"location":"api/step_gallery/columns/#distilabel.steps.columns.expand.ExpandColumns","title":"ExpandColumns
","text":" Bases: Step
Expand columns that contain lists into multiple rows.
ExpandColumns
is a Step
that takes a list of columns and expands them into multiple rows. The new rows will have the same data as the original row, except for the expanded column, which will contain a single item from the original list.
Attributes:
Name Type Descriptioncolumns
Union[Dict[str, str], List[str]]
A dictionary that maps the column to be expanded to the new column name or a list of columns to be expanded. If a list is provided, the new column name will be the same as the column name.
encoded
Union[bool, List[str]]
A bool to inform Whether the columns are JSON encoded lists. If this value is set to True, the columns will be decoded before expanding. Alternatively, to specify columns that can be encoded, a list can be provided. In this case, the column names informed must be a subset of the columns selected for expansion.
split_statistics
bool
A bool to inform whether the statistics in the distilabel_metadata
column should be split into multiple rows. If we want to expand some columns containing a list of strings that come from having parsed the output of an LLM, the tokens in the statistics_{step_name}
of the distilabel_metadata
column should be splitted to avoid multiplying them if we aggregate the data afterwards. For example, with a task that is supposed to generate a list of N instructions, and we want each of those N instructions in different rows, we should split the statistics by N. In such a case, set this value to True.
columns
attribute): The columns to be expanded into multiple rows.columns
attribute): The expanded columns.Examples:
Expand the selected columns into multiple rows:
from distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(\n columns=[\"generation\"],\n)\nexpand_columns.load()\n\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": [\"generation 1\", \"generation 2\"]}\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction 1', 'generation': 'generation 1'}, {'instruction': 'instruction 1', 'generation': 'generation 2'}]\n
Expand the selected columns which are JSON encoded into multiple rows:
from distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(\n columns=[\"generation\"],\n encoded=True, # It can also be a list of columns that are encoded, i.e. [\"generation\"]\n)\nexpand_columns.load()\n\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": '[\"generation 1\", \"generation 2\"]'}\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction 1', 'generation': 'generation 1'}, {'instruction': 'instruction 1', 'generation': 'generation 2'}]\n
Expand the selected columns and split the statistics in the distilabel_metadata
column:
from distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(\n columns=[\"generation\"],\n split_statistics=True,\n)\nexpand_columns.load()\n\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": [\"generation 1\", \"generation 2\"],\n \"distilabel_metadata\": {\n \"statistics_generation\": {\n \"input_tokens\": [12],\n \"output_tokens\": [12],\n },\n },\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction 1', 'generation': 'generation 1', 'distilabel_metadata': {'statistics_generation': {'input_tokens': [6], 'output_tokens': [6]}}}, {'instruction': 'instruction 1', 'generation': 'generation 2', 'distilabel_metadata': {'statistics_generation': {'input_tokens': [6], 'output_tokens': [6]}}}]\n
Source code in src/distilabel/steps/columns/expand.py
class ExpandColumns(Step):\n \"\"\"Expand columns that contain lists into multiple rows.\n\n `ExpandColumns` is a `Step` that takes a list of columns and expands them into multiple\n rows. The new rows will have the same data as the original row, except for the expanded\n column, which will contain a single item from the original list.\n\n Attributes:\n columns: A dictionary that maps the column to be expanded to the new column name\n or a list of columns to be expanded. If a list is provided, the new column name\n will be the same as the column name.\n encoded: A bool to inform Whether the columns are JSON encoded lists. If this value is\n set to True, the columns will be decoded before expanding. Alternatively, to specify\n columns that can be encoded, a list can be provided. In this case, the column names\n informed must be a subset of the columns selected for expansion.\n split_statistics: A bool to inform whether the statistics in the `distilabel_metadata`\n column should be split into multiple rows.\n If we want to expand some columns containing a list of strings that come from\n having parsed the output of an LLM, the tokens in the `statistics_{step_name}`\n of the `distilabel_metadata` column should be splitted to avoid multiplying\n them if we aggregate the data afterwards. For example, with a task that is supposed\n to generate a list of N instructions, and we want each of those N instructions in\n different rows, we should split the statistics by N.\n In such a case, set this value to True.\n\n Input columns:\n - dynamic (determined by `columns` attribute): The columns to be expanded into\n multiple rows.\n\n Output columns:\n - dynamic (determined by `columns` attribute): The expanded columns.\n\n Categories:\n - columns\n\n Examples:\n Expand the selected columns into multiple rows:\n\n ```python\n from distilabel.steps import ExpandColumns\n\n expand_columns = ExpandColumns(\n columns=[\"generation\"],\n )\n expand_columns.load()\n\n result = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": [\"generation 1\", \"generation 2\"]}\n ],\n )\n )\n # >>> result\n # [{'instruction': 'instruction 1', 'generation': 'generation 1'}, {'instruction': 'instruction 1', 'generation': 'generation 2'}]\n ```\n\n Expand the selected columns which are JSON encoded into multiple rows:\n\n ```python\n from distilabel.steps import ExpandColumns\n\n expand_columns = ExpandColumns(\n columns=[\"generation\"],\n encoded=True, # It can also be a list of columns that are encoded, i.e. 
[\"generation\"]\n )\n expand_columns.load()\n\n result = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": '[\"generation 1\", \"generation 2\"]'}\n ],\n )\n )\n # >>> result\n # [{'instruction': 'instruction 1', 'generation': 'generation 1'}, {'instruction': 'instruction 1', 'generation': 'generation 2'}]\n ```\n\n Expand the selected columns and split the statistics in the `distilabel_metadata` column:\n\n ```python\n from distilabel.steps import ExpandColumns\n\n expand_columns = ExpandColumns(\n columns=[\"generation\"],\n split_statistics=True,\n )\n expand_columns.load()\n\n result = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": [\"generation 1\", \"generation 2\"],\n \"distilabel_metadata\": {\n \"statistics_generation\": {\n \"input_tokens\": [12],\n \"output_tokens\": [12],\n },\n },\n }\n ],\n )\n )\n # >>> result\n # [{'instruction': 'instruction 1', 'generation': 'generation 1', 'distilabel_metadata': {'statistics_generation': {'input_tokens': [6], 'output_tokens': [6]}}}, {'instruction': 'instruction 1', 'generation': 'generation 2', 'distilabel_metadata': {'statistics_generation': {'input_tokens': [6], 'output_tokens': [6]}}}]\n ```\n \"\"\"\n\n columns: Union[Dict[str, str], List[str]]\n encoded: Union[bool, List[str]] = False\n split_statistics: bool = False\n\n @field_validator(\"columns\")\n @classmethod\n def always_dict(cls, value: Union[Dict[str, str], List[str]]) -> Dict[str, str]:\n \"\"\"Ensure that the columns are always a dictionary.\n\n Args:\n value: The columns to be expanded.\n\n Returns:\n The columns to be expanded as a dictionary.\n \"\"\"\n if isinstance(value, list):\n return {col: col for col in value}\n\n return value\n\n @model_validator(mode=\"after\")\n def is_subset(self) -> Self:\n \"\"\"Ensure the \"encoded\" column names are a subset of the \"columns\" selected.\n\n Returns:\n The \"encoded\" attribute updated to work internally.\n \"\"\"\n if isinstance(self.encoded, list):\n if not set(self.encoded).issubset(set(self.columns.keys())):\n raise ValueError(\n \"The 'encoded' columns must be a subset of the 'columns' selected for expansion.\"\n )\n if isinstance(self.encoded, bool):\n self.encoded = list(self.columns.keys()) if self.encoded else []\n return self\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The columns to be expanded.\"\"\"\n return list(self.columns.keys())\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The expanded columns.\"\"\"\n return [\n new_column if new_column else expand_column\n for expand_column, new_column in self.columns.items()\n ]\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Expand the columns in the input data.\n\n Args:\n inputs: The input data.\n\n Yields:\n The expanded rows.\n \"\"\"\n if self.encoded:\n for input in inputs:\n for column in self.encoded:\n input[column] = json.loads(input[column])\n\n yield [row for input in inputs for row in self._expand_columns(input)]\n\n def _expand_columns(self, input: Dict[str, Any]) -> List[Dict[str, Any]]:\n \"\"\"Expand the columns in the input data.\n\n Args:\n input: The input data.\n\n Returns:\n The expanded rows.\n \"\"\"\n metadata_visited = False\n expanded_rows = []\n # Update the columns here to avoid doing the validation on the `inputs`, as the\n # `distilabel_metadata` is not defined on Pipeline creation on the DAG.\n columns = self.columns\n if self.split_statistics:\n 
columns[\"distilabel_metadata\"] = \"distilabel_metadata\"\n\n for expand_column, new_column in columns.items(): # type: ignore\n data = input.get(expand_column)\n input, metadata_visited = self._split_metadata(\n input, len(data), metadata_visited\n )\n\n rows = []\n for item, expanded in zip_longest(*[data, expanded_rows], fillvalue=input):\n rows.append({**expanded, new_column: item})\n expanded_rows = rows\n return expanded_rows\n\n def _split_metadata(\n self, input: Dict[str, Any], n: int, metadata_visited: bool = False\n ) -> None:\n \"\"\"Help method to split the statistics in `distilabel_metadata` column.\n\n Args:\n input: The input data.\n n: Number of splits to apply to the tokens (if we have 12 tokens and want to split\n them 3 times, n==3).\n metadata_visited: Bool to prevent from updating the data more than once.\n\n Returns:\n Updated input with the `distilabel_metadata` updated.\n \"\"\"\n # - If we want to split the statistics, we need to ensure that the metadata is present.\n # - Metadata can only be visited once per row to avoid successive splitting.\n # TODO: For an odd number of tokens, this will miss 1, we have to fix it.\n if (\n self.split_statistics\n and (metadata := input.get(\"distilabel_metadata\", {}))\n and not metadata_visited\n ):\n for k, v in metadata.items():\n if (\n not v\n ): # In case it wasn't found in the metadata for some error, skip it\n continue\n if k.startswith(\"statistics_\") and (\n \"input_tokens\" in v and \"output_tokens\" in v\n ):\n # For num_generations>1 we assume all the tokens should be divided by n\n # TODO: The tokens should always come as a list, but there can\n # be differences\n if isinstance(v[\"input_tokens\"], list):\n input_tokens = [value // n for value in v[\"input_tokens\"]]\n else:\n input_tokens = [v[\"input_tokens\"] // n]\n if isinstance(v[\"input_tokens\"], list):\n output_tokens = [value // n for value in v[\"output_tokens\"]]\n else:\n output_tokens = [v[\"output_tokens\"] // n]\n\n input[\"distilabel_metadata\"][k] = {\n \"input_tokens\": input_tokens,\n \"output_tokens\": output_tokens,\n }\n metadata_visited = True\n # Once we have updated the metadata, Create a list out of it to let the\n # following section to expand it as any other column.\n if isinstance(input[\"distilabel_metadata\"], dict):\n input[\"distilabel_metadata\"] = [input[\"distilabel_metadata\"]] * n\n return input, metadata_visited\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.expand.ExpandColumns.inputs","title":"inputs
property
","text":"The columns to be expanded.
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.expand.ExpandColumns.outputs","title":"outputs
property
","text":"The expanded columns.
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.expand.ExpandColumns.always_dict","title":"always_dict(value)
classmethod
","text":"Ensure that the columns are always a dictionary.
Parameters:
Name Type Description Defaultvalue
Union[Dict[str, str], List[str]]
The columns to be expanded.
requiredReturns:
Type DescriptionDict[str, str]
The columns to be expanded as a dictionary.
Source code insrc/distilabel/steps/columns/expand.py
@field_validator(\"columns\")\n@classmethod\ndef always_dict(cls, value: Union[Dict[str, str], List[str]]) -> Dict[str, str]:\n \"\"\"Ensure that the columns are always a dictionary.\n\n Args:\n value: The columns to be expanded.\n\n Returns:\n The columns to be expanded as a dictionary.\n \"\"\"\n if isinstance(value, list):\n return {col: col for col in value}\n\n return value\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.expand.ExpandColumns.is_subset","title":"is_subset()
","text":"Ensure the \"encoded\" column names are a subset of the \"columns\" selected.
Returns:
Type DescriptionSelf
The \"encoded\" attribute updated to work internally.
Source code insrc/distilabel/steps/columns/expand.py
@model_validator(mode=\"after\")\ndef is_subset(self) -> Self:\n \"\"\"Ensure the \"encoded\" column names are a subset of the \"columns\" selected.\n\n Returns:\n The \"encoded\" attribute updated to work internally.\n \"\"\"\n if isinstance(self.encoded, list):\n if not set(self.encoded).issubset(set(self.columns.keys())):\n raise ValueError(\n \"The 'encoded' columns must be a subset of the 'columns' selected for expansion.\"\n )\n if isinstance(self.encoded, bool):\n self.encoded = list(self.columns.keys()) if self.encoded else []\n return self\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.expand.ExpandColumns.process","title":"process(inputs)
","text":"Expand the columns in the input data.
Parameters:
Name Type Description Defaultinputs
StepInput
The input data.
requiredYields:
Type DescriptionStepOutput
The expanded rows.
Source code insrc/distilabel/steps/columns/expand.py
def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Expand the columns in the input data.\n\n Args:\n inputs: The input data.\n\n Yields:\n The expanded rows.\n \"\"\"\n if self.encoded:\n for input in inputs:\n for column in self.encoded:\n input[column] = json.loads(input[column])\n\n yield [row for input in inputs for row in self._expand_columns(input)]\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.keep","title":"keep
","text":""},{"location":"api/step_gallery/columns/#distilabel.steps.columns.keep.KeepColumns","title":"KeepColumns
","text":" Bases: Step
Keeps selected columns in the dataset.
KeepColumns
is a Step
that implements the process
method that keeps only the columns specified in the columns
attribute. Also KeepColumns
provides an attribute columns
to specify the columns to keep, which will override the default value for the properties inputs
and outputs
.
The order in which the columns are provided is important, as the output will be sorted using the provided order, which is useful before pushing either a dataset.Dataset
via the PushToHub
step or a distilabel.Distiset
via the Pipeline.run
output variable.
Attributes:
Name Type Descriptioncolumns
List[str]
List of strings with the names of the columns to keep.
Input columnscolumns
attribute): The columns to keep.columns
attribute): The columns that were kept.Examples:
Select the columns to keep:
from distilabel.steps import KeepColumns\n\nkeep_columns = KeepColumns(\n columns=[\"instruction\", \"generation\"],\n)\nkeep_columns.load()\n\nresult = next(\n keep_columns.process(\n [{\"instruction\": \"What's the brightest color?\", \"generation\": \"white\", \"model_name\": \"my_model\"}],\n )\n)\n# >>> result\n# [{'instruction': \"What's the brightest color?\", 'generation': 'white'}]\n
Source code in src/distilabel/steps/columns/keep.py
class KeepColumns(Step):\n \"\"\"Keeps selected columns in the dataset.\n\n `KeepColumns` is a `Step` that implements the `process` method that keeps only the columns\n specified in the `columns` attribute. Also `KeepColumns` provides an attribute `columns` to\n specify the columns to keep which will override the default value for the properties `inputs`\n and `outputs`.\n\n Note:\n The order in which the columns are provided is important, as the output will be sorted\n using the provided order, which is useful before pushing either a `dataset.Dataset` via\n the `PushToHub` step or a `distilabel.Distiset` via the `Pipeline.run` output variable.\n\n Attributes:\n columns: List of strings with the names of the columns to keep.\n\n Input columns:\n - dynamic (determined by `columns` attribute): The columns to keep.\n\n Output columns:\n - dynamic (determined by `columns` attribute): The columns that were kept.\n\n Categories:\n - columns\n\n Examples:\n Select the columns to keep:\n\n ```python\n from distilabel.steps import KeepColumns\n\n keep_columns = KeepColumns(\n columns=[\"instruction\", \"generation\"],\n )\n keep_columns.load()\n\n result = next(\n keep_columns.process(\n [{\"instruction\": \"What's the brightest color?\", \"generation\": \"white\", \"model_name\": \"my_model\"}],\n )\n )\n # >>> result\n # [{'instruction': \"What's the brightest color?\", 'generation': 'white'}]\n ```\n \"\"\"\n\n columns: List[str]\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The inputs for the task are the column names in `columns`.\"\"\"\n return self.columns\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The outputs for the task are the column names in `columns`.\"\"\"\n return self.columns\n\n @override\n def process(self, *inputs: StepInput) -> \"StepOutput\":\n \"\"\"The `process` method keeps only the columns specified in the `columns` attribute.\n\n Args:\n *inputs: A list of dictionaries with the input data.\n\n Yields:\n A list of dictionaries with the output data.\n \"\"\"\n for input in inputs:\n outputs = []\n for item in input:\n outputs.append({col: item[col] for col in self.columns})\n yield outputs\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.keep.KeepColumns.inputs","title":"inputs
property
","text":"The inputs for the task are the column names in columns
.
outputs
property
","text":"The outputs for the task are the column names in columns
.
process(*inputs)
","text":"The process
method keeps only the columns specified in the columns
attribute.
Parameters:
Name Type Description Default*inputs
StepInput
A list of dictionaries with the input data.
()
Yields:
Type DescriptionStepOutput
A list of dictionaries with the output data.
Source code insrc/distilabel/steps/columns/keep.py
@override\ndef process(self, *inputs: StepInput) -> \"StepOutput\":\n \"\"\"The `process` method keeps only the columns specified in the `columns` attribute.\n\n Args:\n *inputs: A list of dictionaries with the input data.\n\n Yields:\n A list of dictionaries with the output data.\n \"\"\"\n for input in inputs:\n outputs = []\n for item in input:\n outputs.append({col: item[col] for col in self.columns})\n yield outputs\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.merge","title":"merge
","text":""},{"location":"api/step_gallery/columns/#distilabel.steps.columns.merge.MergeColumns","title":"MergeColumns
","text":" Bases: Step
Merge columns from a row.
MergeColumns
is a Step
that implements the process
method that calls the merge_columns
function to handle and combine columns in a StepInput
. MergeColumns
provides two attributes columns
and output_column
to specify the columns to merge and the resulting output column.
This step can be useful if you have a Task
that generates instructions for example, and you want to have more examples of those. In such a case, you could for example use another Task
to multiply your instructions synthetically, what would yield two different columns splitted. Using MergeColumns
you can merge them and use them as a single column in your dataset for further processing.
Attributes:
Name Type Descriptioncolumns
List[str]
List of strings with the names of the columns to merge.
output_column
Optional[str]
str name of the output column
Input columnscolumns
attribute): The columns to merge.columns
and output_column
attributes): The columns that were merged.Examples:
Combine columns in rows of a dataset:
from distilabel.steps import MergeColumns\n\ncombiner = MergeColumns(\n columns=[\"queries\", \"multiple_queries\"],\n output_column=\"queries\",\n)\ncombiner.load()\n\nresult = next(\n combiner.process(\n [\n {\n \"queries\": \"How are you?\",\n \"multiple_queries\": [\"What's up?\", \"Everything ok?\"]\n }\n ],\n )\n)\n# >>> result\n# [{'queries': ['How are you?', \"What's up?\", 'Everything ok?']}]\n
Source code in src/distilabel/steps/columns/merge.py
class MergeColumns(Step):\n \"\"\"Merge columns from a row.\n\n `MergeColumns` is a `Step` that implements the `process` method that calls the `merge_columns`\n function to handle and combine columns in a `StepInput`. `MergeColumns` provides two attributes\n `columns` and `output_column` to specify the columns to merge and the resulting output column.\n\n This step can be useful if you have a `Task` that generates instructions for example, and you\n want to have more examples of those. In such a case, you could for example use another `Task`\n to multiply your instructions synthetically, what would yield two different columns splitted.\n Using `MergeColumns` you can merge them and use them as a single column in your dataset for\n further processing.\n\n Attributes:\n columns: List of strings with the names of the columns to merge.\n output_column: str name of the output column\n\n Input columns:\n - dynamic (determined by `columns` attribute): The columns to merge.\n\n Output columns:\n - dynamic (determined by `columns` and `output_column` attributes): The columns\n that were merged.\n\n Categories:\n - columns\n\n Examples:\n Combine columns in rows of a dataset:\n\n ```python\n from distilabel.steps import MergeColumns\n\n combiner = MergeColumns(\n columns=[\"queries\", \"multiple_queries\"],\n output_column=\"queries\",\n )\n combiner.load()\n\n result = next(\n combiner.process(\n [\n {\n \"queries\": \"How are you?\",\n \"multiple_queries\": [\"What's up?\", \"Everything ok?\"]\n }\n ],\n )\n )\n # >>> result\n # [{'queries': ['How are you?', \"What's up?\", 'Everything ok?']}]\n ```\n \"\"\"\n\n columns: List[str]\n output_column: Optional[str] = None\n\n @property\n def inputs(self) -> \"StepColumns\":\n return self.columns\n\n @property\n def outputs(self) -> \"StepColumns\":\n return [self.output_column] if self.output_column else [\"merged_column\"]\n\n @override\n def process(self, inputs: StepInput) -> \"StepOutput\":\n combined = []\n for input in inputs:\n combined.append(\n merge_columns(\n input,\n columns=self.columns,\n new_column=self.outputs[0],\n )\n )\n yield combined\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.group","title":"group
","text":""},{"location":"api/step_gallery/columns/#distilabel.steps.columns.group.GroupColumns","title":"GroupColumns
","text":" Bases: Step
Combines columns from a list of StepInput
.
GroupColumns
is a Step
that implements the process
method that calls the group_dicts
function to handle and combine a list of StepInput
. Also GroupColumns
provides two attributes columns
and output_columns
to specify the columns to group and the output columns which will override the default value for the properties inputs
and outputs
, respectively.
Attributes:
Name Type Descriptioncolumns
List[str]
List of strings with the names of the columns to group.
output_columns
Optional[List[str]]
Optional list of strings with the names of the output columns.
Input columnscolumns
attribute): The columns to group.columns
and output_columns
attributes): The columns that were grouped.Examples:
Group columns of a dataset:\n\n```python\nfrom distilabel.steps import GroupColumns\n\ngroup_columns = GroupColumns(\n name=\"group_columns\",\n columns=[\"generation\", \"model_name\"],\n)\ngroup_columns.load()\n\nresult = next(\n group_columns.process(\n [{\"generation\": \"AI generated text\"}, {\"model_name\": \"my_model\"}],\n [{\"generation\": \"Other generated text\", \"model_name\": \"my_model\"}]\n )\n)\n# >>> result\n# [{'merged_generation': ['AI generated text', 'Other generated text'], 'merged_model_name': ['my_model']}]\n```\n\nSpecify the name of the output columns:\n\n```python\nfrom distilabel.steps import GroupColumns\n\ngroup_columns = GroupColumns(\n name=\"group_columns\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"generation_models\"]\n)\ngroup_columns.load()\n\nresult = next(\n group_columns.process(\n [{\"generation\": \"AI generated text\"}, {\"model_name\": \"my_model\"}],\n [{\"generation\": \"Other generated text\", \"model_name\": \"my_model\"}]\n )\n)\n# >>> result\n#[{'generations': ['AI generated text', 'Other generated text'], 'generation_models': ['my_model']}]\n```\n
Source code in src/distilabel/steps/columns/group.py
class GroupColumns(Step):\n \"\"\"Combines columns from a list of `StepInput`.\n\n `GroupColumns` is a `Step` that implements the `process` method that calls the `group_dicts`\n function to handle and combine a list of `StepInput`. Also `GroupColumns` provides two attributes\n `columns` and `output_columns` to specify the columns to group and the output columns\n which will override the default value for the properties `inputs` and `outputs`, respectively.\n\n Attributes:\n columns: List of strings with the names of the columns to group.\n output_columns: Optional list of strings with the names of the output columns.\n\n Input columns:\n - dynamic (determined by `columns` attribute): The columns to group.\n\n Output columns:\n - dynamic (determined by `columns` and `output_columns` attributes): The columns\n that were grouped.\n\n Categories:\n - columns\n\n Examples:\n\n Group columns of a dataset:\n\n ```python\n from distilabel.steps import GroupColumns\n\n group_columns = GroupColumns(\n name=\"group_columns\",\n columns=[\"generation\", \"model_name\"],\n )\n group_columns.load()\n\n result = next(\n group_columns.process(\n [{\"generation\": \"AI generated text\"}, {\"model_name\": \"my_model\"}],\n [{\"generation\": \"Other generated text\", \"model_name\": \"my_model\"}]\n )\n )\n # >>> result\n # [{'merged_generation': ['AI generated text', 'Other generated text'], 'merged_model_name': ['my_model']}]\n ```\n\n Specify the name of the output columns:\n\n ```python\n from distilabel.steps import GroupColumns\n\n group_columns = GroupColumns(\n name=\"group_columns\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"generation_models\"]\n )\n group_columns.load()\n\n result = next(\n group_columns.process(\n [{\"generation\": \"AI generated text\"}, {\"model_name\": \"my_model\"}],\n [{\"generation\": \"Other generated text\", \"model_name\": \"my_model\"}]\n )\n )\n # >>> result\n #[{'generations': ['AI generated text', 'Other generated text'], 'generation_models': ['my_model']}]\n ```\n \"\"\"\n\n columns: List[str]\n output_columns: Optional[List[str]] = None\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs for the task are the column names in `columns`.\"\"\"\n return self.columns\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The outputs for the task are the column names in `output_columns` or\n `grouped_{column}` for each column in `columns`.\"\"\"\n return (\n self.output_columns\n if self.output_columns is not None\n else [f\"grouped_{column}\" for column in self.columns]\n )\n\n @override\n def process(self, *inputs: StepInput) -> \"StepOutput\":\n \"\"\"The `process` method calls the `group_dicts` function to handle and combine a list of `StepInput`.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with the combined `StepInput` using the `group_dicts` function.\n \"\"\"\n yield group_columns(\n *inputs,\n group_columns=self.inputs,\n output_group_columns=self.outputs,\n )\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.group.GroupColumns.inputs","title":"inputs
property
","text":"The inputs for the task are the column names in columns
.
outputs
property
","text":"The outputs for the task are the column names in output_columns
or grouped_{column}
for each column in columns
.
process(*inputs)
","text":"The process
method calls the group_dicts
function to handle and combine a list of StepInput
.
Parameters:
Name Type Description Default*inputs
StepInput
A list of StepInput
to be combined.
()
Yields:
Type DescriptionStepOutput
A StepOutput
with the combined StepInput
using the group_dicts
function.
src/distilabel/steps/columns/group.py
@override\ndef process(self, *inputs: StepInput) -> \"StepOutput\":\n \"\"\"The `process` method calls the `group_dicts` function to handle and combine a list of `StepInput`.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with the combined `StepInput` using the `group_dicts` function.\n \"\"\"\n yield group_columns(\n *inputs,\n group_columns=self.inputs,\n output_group_columns=self.outputs,\n )\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.group.CombineColumns","title":"CombineColumns
","text":" Bases: GroupColumns
CombineColumns
is deprecated and will be removed in version 1.5.0, use GroupColumns
instead.
src/distilabel/steps/columns/group.py
class CombineColumns(GroupColumns):\n \"\"\"`CombineColumns` is deprecated and will be removed in version 1.5.0, use `GroupColumns` instead.\"\"\"\n\n def __init__(self, **data: Any) -> None:\n warnings.warn(\n \"`CombineColumns` is deprecated and will be removed in version 1.5.0, use `GroupColumns` instead.\",\n DeprecationWarning,\n stacklevel=2,\n )\n return super().__init__(**data)\n
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.utils","title":"utils
","text":""},{"location":"api/step_gallery/columns/#distilabel.steps.columns.utils.merge_distilabel_metadata","title":"merge_distilabel_metadata(*output_dicts)
","text":"Merge the DISTILABEL_METADATA_KEY
from multiple output dictionaries. DISTILABEL_METADATA_KEY
can be either a dictionary containing metadata keys or a list containing dictionaries of metadata keys.
Parameters:
Name Type Description Default*output_dicts
Dict[str, Any]
Variable number of dictionaries or lists containing distilabel metadata.
()
Returns:
Type DescriptionUnion[Dict[str, Any], List[Dict[str, Any]]]
A merged dictionary or list containing all the distilabel metadata.
Source code insrc/distilabel/steps/columns/utils.py
def merge_distilabel_metadata(\n *output_dicts: Dict[str, Any],\n) -> Union[Dict[str, Any], List[Dict[str, Any]]]:\n \"\"\"\n Merge the `DISTILABEL_METADATA_KEY` from multiple output dictionaries. `DISTILABEL_METADATA_KEY`\n can be either a dictionary containing metadata keys or a list containing dictionaries\n of metadata keys.\n\n Args:\n *output_dicts: Variable number of dictionaries or lists containing distilabel metadata.\n\n Returns:\n A merged dictionary or list containing all the distilabel metadata.\n \"\"\"\n merged_metadata = defaultdict(list)\n\n for output_dict in output_dicts:\n metadata = output_dict.get(DISTILABEL_METADATA_KEY, {})\n # If `distilabel_metadata_key` is a `list` then it contains dictionaries with\n # the metadata per `num_generations` created when `group_generations==True`\n if isinstance(metadata, list):\n if not isinstance(merged_metadata, list):\n merged_metadata = []\n merged_metadata.extend(metadata)\n else:\n for key, value in metadata.items():\n merged_metadata[key].append(value)\n\n if isinstance(merged_metadata, list):\n return merged_metadata\n\n final_metadata = {}\n for key, value_list in merged_metadata.items():\n if len(value_list) == 1:\n final_metadata[key] = value_list[0]\n else:\n final_metadata[key] = value_list\n\n return final_metadata\n
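A small usage sketch, assuming the default `distilabel_metadata` key used throughout distilabel; the statistics key and token counts are illustrative only:

```python
from distilabel.steps.columns.utils import merge_distilabel_metadata

output_a = {"distilabel_metadata": {"statistics_text_generation": {"input_tokens": 10}}}
output_b = {"distilabel_metadata": {"statistics_text_generation": {"input_tokens": 12}}}

merged = merge_distilabel_metadata(output_a, output_b)
# Values sharing the same metadata key are accumulated into a list:
# {'statistics_text_generation': [{'input_tokens': 10}, {'input_tokens': 12}]}
```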
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.utils.group_columns","title":"group_columns(*inputs, group_columns, output_group_columns=None)
","text":"Groups multiple list of dictionaries into a single list of dictionaries on the specified group_columns
. If group_columns
are provided, then it will also rename group_columns
.
Parameters:
Name Type Description Defaultinputs
StepInput
list of dictionaries to combine.
()
group_columns
List[str]
list of keys to merge on.
requiredoutput_group_columns
Optional[List[str]]
list of keys to rename the merge keys to. Defaults to None
.
None
Returns:
Type DescriptionStepInput
A list of dictionaries where the values of the group_columns
are combined into a
StepInput
list and renamed to output_group_columns
.
src/distilabel/steps/columns/utils.py
def group_columns(\n *inputs: \"StepInput\",\n group_columns: List[str],\n output_group_columns: Optional[List[str]] = None,\n) -> \"StepInput\":\n \"\"\"Groups multiple list of dictionaries into a single list of dictionaries on the\n specified `group_columns`. If `group_columns` are provided, then it will also rename\n `group_columns`.\n\n Args:\n inputs: list of dictionaries to combine.\n group_columns: list of keys to merge on.\n output_group_columns: list of keys to rename the merge keys to. Defaults to `None`.\n\n Returns:\n A list of dictionaries where the values of the `group_columns` are combined into a\n list and renamed to `output_group_columns`.\n \"\"\"\n if output_group_columns is not None and len(output_group_columns) != len(\n group_columns\n ):\n raise ValueError(\n \"The length of `output_group_columns` must be the same as the length of `group_columns`.\"\n )\n if output_group_columns is None:\n output_group_columns = [f\"grouped_{key}\" for key in group_columns]\n group_columns_dict = dict(zip(group_columns, output_group_columns))\n\n result = []\n # Use zip to iterate over lists based on their index\n for dicts_at_index in zip(*inputs):\n combined_dict = {}\n metadata_dicts = []\n # Iterate over dicts at the same index\n for d in dicts_at_index:\n # Extract metadata for merging\n if DISTILABEL_METADATA_KEY in d:\n metadata_dicts.append(\n {DISTILABEL_METADATA_KEY: d[DISTILABEL_METADATA_KEY]}\n )\n # Iterate over key-value pairs in each dict\n for key, value in d.items():\n if key == DISTILABEL_METADATA_KEY:\n continue\n # If the key is in the merge_keys, append the value to the existing list\n if key in group_columns_dict.keys():\n combined_dict.setdefault(group_columns_dict[key], []).append(value)\n # If the key is not in the merge_keys, create a new key-value pair\n else:\n combined_dict[key] = value\n\n if metadata_dicts:\n combined_dict[DISTILABEL_METADATA_KEY] = merge_distilabel_metadata(\n *metadata_dicts\n )\n\n result.append(combined_dict)\n return result\n
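A usage sketch with two toy batches (the column values are illustrative only):

```python
from distilabel.steps.columns.utils import group_columns

batch_a = [{"generation": "text A", "model_name": "model-1"}]
batch_b = [{"generation": "text B", "model_name": "model-2"}]

result = group_columns(
    batch_a,
    batch_b,
    group_columns=["generation", "model_name"],
    output_group_columns=["generations", "model_names"],
)
# Rows at the same index are combined, and the grouped values are renamed:
# [{'generations': ['text A', 'text B'], 'model_names': ['model-1', 'model-2']}]
```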
"},{"location":"api/step_gallery/columns/#distilabel.steps.columns.utils.merge_columns","title":"merge_columns(row, columns, new_column='combined_key')
","text":"Merge columns in a dictionary into a single column on the specified new_column
.
Parameters:
Name Type Description Defaultrow
Dict[str, Any]
Dictionary corresponding to a row in a dataset.
requiredcolumns
List[str]
List of keys to merge.
requirednew_column
str
Name of the new key created.
'combined_key'
Returns:
Type DescriptionDict[str, Any]
Dictionary with the new merged key.
Source code insrc/distilabel/steps/columns/utils.py
def merge_columns(\n row: Dict[str, Any], columns: List[str], new_column: str = \"combined_key\"\n) -> Dict[str, Any]:\n \"\"\"Merge columns in a dictionary into a single column on the specified `new_column`.\n\n Args:\n row: Dictionary corresponding to a row in a dataset.\n columns: List of keys to merge.\n new_column: Name of the new key created.\n\n Returns:\n Dictionary with the new merged key.\n \"\"\"\n result = row.copy() # preserve the original dictionary\n combined = []\n for key in columns:\n to_combine = result.pop(key)\n if not isinstance(to_combine, list):\n to_combine = [to_combine]\n combined += to_combine\n result[new_column] = combined\n return result\n
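A usage sketch mirroring the `MergeColumns` example above:

```python
from distilabel.steps.columns.utils import merge_columns

row = {"queries": "How are you?", "multiple_queries": ["What's up?", "Everything ok?"]}

# Non-list values are wrapped in a list before concatenation, and the merged
# values are stored under `new_column` (here reusing the "queries" name).
merged = merge_columns(row, columns=["queries", "multiple_queries"], new_column="queries")
# {'queries': ['How are you?', "What's up?", 'Everything ok?']}
```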
"},{"location":"api/step_gallery/extra/","title":"Extra","text":""},{"location":"api/step_gallery/extra/#distilabel.steps","title":"steps
","text":""},{"location":"api/step_gallery/extra/#distilabel.steps.DBSCAN","title":"DBSCAN
","text":" Bases: GlobalStep
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core samples in regions of high density and expands clusters from them. This algorithm is good for data which contains clusters of similar density.
This is a GlobalStep
that clusters the embeddings using the DBSCAN algorithm from sklearn
. Visit TextClustering
step for an example of use. The trained model is saved as an artifact when creating a distiset and pushing it to the Hugging Face Hub.
List[float]
): Vector representation of the text to cluster, normally the output from the UMAP
step.int
): Integer representing the label of a given cluster. -1 means it wasn't clustered.DBSCAN demo of sklearn
sklearn dbscan
Attributes:
Name Type Description-
eps
The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
-
min_samples
The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. If min_samples
is set to a higher value, DBSCAN will find denser clusters, whereas if it is set to a lower value, the found clusters will be more sparse.
-
metric
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances
for its metric parameter.
-
n_jobs
The number of parallel jobs to run.
Runtime parameterseps
: The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.min_samples
: The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. If min_samples
is set to a higher value, DBSCAN will find denser clusters, whereas if it is set to a lower value, the found clusters will be more sparse.metric
: The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances
for its metric parameter.n_jobs
: The number of parallel jobs to run.src/distilabel/steps/clustering/dbscan.py
class DBSCAN(GlobalStep):\n r\"\"\"DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core\n samples in regions of high density and expands clusters from them. This algorithm\n is good for data which contains clusters of similar density.\n\n This is a `GlobalStep` that clusters the embeddings using the DBSCAN algorithm\n from `sklearn`. Visit `TextClustering` step for an example of use.\n The trained model is saved as an artifact when creating a distiset\n and pushing it to the Hugging Face Hub.\n\n Input columns:\n - projection (`List[float]`): Vector representation of the text to cluster,\n normally the output from the `UMAP` step.\n\n Output columns:\n - cluster_label (`int`): Integer representing the label of a given cluster. -1\n means it wasn't clustered.\n\n Categories:\n - clustering\n - text-classification\n\n References:\n - [`DBSCAN demo of sklearn`](https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#demo-of-dbscan-clustering-algorithm)\n - [`sklearn dbscan`](https://scikit-learn.org/stable/modules/clustering.html#dbscan)\n\n Attributes:\n - eps: The maximum distance between two samples for one to be considered as in the\n neighborhood of the other. This is not a maximum bound on the distances of\n points within a cluster. This is the most important DBSCAN parameter to\n choose appropriately for your data set and distance function.\n - min_samples: The number of samples (or total weight) in a neighborhood for a point\n to be considered as a core point. This includes the point itself. If `min_samples`\n is set to a higher value, DBSCAN will find denser clusters, whereas if it is set\n to a lower value, the found clusters will be more sparse.\n - metric: The metric to use when calculating distance between instances in a feature\n array. If metric is a string or callable, it must be one of the options allowed\n by `sklearn.metrics.pairwise_distances` for its metric parameter.\n - n_jobs: The number of parallel jobs to run.\n\n Runtime parameters:\n - `eps`: The maximum distance between two samples for one to be considered as in the\n neighborhood of the other. This is not a maximum bound on the distances of\n points within a cluster. This is the most important DBSCAN parameter to\n choose appropriately for your data set and distance function.\n - `min_samples`: The number of samples (or total weight) in a neighborhood for a point\n to be considered as a core point. This includes the point itself. If `min_samples`\n is set to a higher value, DBSCAN will find denser clusters, whereas if it is set\n to a lower value, the found clusters will be more sparse.\n - `metric`: The metric to use when calculating distance between instances in a feature\n array. If metric is a string or callable, it must be one of the options allowed\n by `sklearn.metrics.pairwise_distances` for its metric parameter.\n - `n_jobs`: The number of parallel jobs to run.\n \"\"\"\n\n eps: Optional[RuntimeParameter[float]] = Field(\n default=0.3,\n description=(\n \"The maximum distance between two samples for one to be considered \"\n \"as in the neighborhood of the other. This is not a maximum bound \"\n \"on the distances of points within a cluster. 
This is the most \"\n \"important DBSCAN parameter to choose appropriately for your data set \"\n \"and distance function.\"\n ),\n )\n min_samples: Optional[RuntimeParameter[int]] = Field(\n default=30,\n description=(\n \"The number of samples (or total weight) in a neighborhood for a point to \"\n \"be considered as a core point. This includes the point itself. If \"\n \"`min_samples` is set to a higher value, DBSCAN will find denser clusters, \"\n \"whereas if it is set to a lower value, the found clusters will be more \"\n \"sparse.\"\n ),\n )\n metric: Optional[RuntimeParameter[str]] = Field(\n default=\"euclidean\",\n description=(\n \"The metric to use when calculating distance between instances in a \"\n \"feature array. If metric is a string or callable, it must be one of \"\n \"the options allowed by `sklearn.metrics.pairwise_distances` for \"\n \"its metric parameter.\"\n ),\n )\n n_jobs: Optional[RuntimeParameter[int]] = Field(\n default=8, description=\"The number of parallel jobs to run.\"\n )\n\n _clusterer: Optional[\"_DBSCAN\"] = PrivateAttr(None)\n\n def load(self) -> None:\n super().load()\n if importlib.util.find_spec(\"sklearn\") is None:\n raise ImportError(\n \"`sklearn` package is not installed. Please install it using `pip install scikit-learn`.\"\n )\n from sklearn.cluster import DBSCAN as _DBSCAN\n\n self._clusterer = _DBSCAN(\n eps=self.eps,\n min_samples=self.min_samples,\n metric=self.metric,\n n_jobs=self.n_jobs,\n )\n\n def unload(self) -> None:\n self._clusterer = None\n\n @property\n def inputs(self) -> List[str]:\n return [\"projection\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"cluster_label\"]\n\n def _save_model(self, model: Any) -> None:\n import joblib\n\n def save_model(path):\n with open(str(path / \"DBSCAN.joblib\"), \"wb\") as f:\n joblib.dump(model, f)\n\n self.save_artifact(\n name=\"DBSCAN_model\",\n write_function=lambda path: save_model(path),\n metadata={\n \"eps\": self.eps,\n \"min_samples\": self.min_samples,\n \"metric\": self.metric,\n },\n )\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n projections = np.array([input[\"projection\"] for input in inputs])\n\n self._logger.info(\"\ud83c\udfcb\ufe0f\u200d\u2640\ufe0f Start training DBSCAN...\")\n fitted_clusterer = self._clusterer.fit(projections)\n cluster_labels = fitted_clusterer.labels_\n # Sets the cluster labels for each input, -1 means it wasn't clustered\n for input, cluster_label in zip(inputs, cluster_labels):\n input[\"cluster_label\"] = cluster_label\n self._logger.info(f\"DBSCAN labels assigned: {len(set(cluster_labels))}\")\n self._save_model(fitted_clusterer)\n yield inputs\n
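A minimal sketch of wiring this step after `UMAP` inside a pipeline; the dataset loader and the downstream labelling step are omitted (see `TextClustering` for a complete example), and the pipeline name is a placeholder:

```python
from distilabel.pipeline import Pipeline
from distilabel.steps import UMAP, DBSCAN

with Pipeline(name="clustering-sketch") as pipeline:
    # UMAP projects the embeddings into a low-dimensional space and DBSCAN
    # assigns a `cluster_label` to each row (-1 for noise points).
    umap = UMAP(n_components=2, metric="cosine")
    dbscan = DBSCAN(eps=0.3, min_samples=30, metric="euclidean")

    umap >> dbscan
```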
"},{"location":"api/step_gallery/extra/#distilabel.steps.TextClustering","title":"TextClustering
","text":" Bases: TextClassification
, GlobalTask
Task that clusters a set of texts and generates summary labels for each cluster.
This is a GlobalTask
that inherits from TextClassification
, this means that all the attributes from that class are available here. Also, in this case we deal with all the inputs at once, instead of using batches. The input_batch_size
is used here to send the examples to the LLM in batches (a subtle difference from the more common Task
definitions). The task looks in each cluster for a given number of representative examples (the number is set by the samples_per_cluster
attribute), and sends them to the LLM to get a label/s that represent the cluster. The labels are then assigned to each text in the cluster. The clusters and projections used in the step, are assumed to be obtained from the UMAP
+ DBSCAN
steps, but could be generated for similar steps, as long as they represent the same concepts. This step runs a pipeline like the one in this repository: https://github.com/huggingface/text-clustering
str
): The reference text we want to obtain labels for.List[float]
): Vector representation of the text to cluster, normally the output from the UMAP
step.int
): Integer representing the label of a given cluster. -1 means it wasn't clustered.str
): The label or list of labels for the text.str
): The name of the model used to generate the label/s.text-clustering repository
Attributes:
Name Type Description-
savefig
Whether to generate and save a figure with the clustering of the texts.
-
samples_per_cluster
The number of examples to use in the LLM as a sample of the cluster.
Examples:
Generate labels for a set of texts using clustering:
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps import UMAP, DBSCAN, TextClustering\nfrom distilabel.pipeline import Pipeline\n\nds_name = \"argilla-warehouse/personahub-fineweb-edu-4-clustering-100k\"\n\nwith Pipeline(name=\"Text clustering dataset\") as pipeline:\n batch_size = 500\n\n ds = load_dataset(ds_name, split=\"train\").select(range(10000))\n loader = make_generator_step(ds, batch_size=batch_size, repo_id=ds_name)\n\n umap = UMAP(n_components=2, metric=\"cosine\")\n dbscan = DBSCAN(eps=0.3, min_samples=30)\n\n text_clustering = TextClustering(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n n=3, # 3 labels per example\n query_title=\"Examples of Personas\",\n samples_per_cluster=10,\n context=(\n \"Describe the main themes, topics, or categories that could describe the \"\n \"following types of personas. All the examples of personas must share \"\n \"the same set of labels.\"\n ),\n default_label=\"None\",\n savefig=True,\n input_batch_size=8,\n input_mappings={\"text\": \"persona\"},\n use_default_structured_output=True,\n )\n\n loader >> umap >> dbscan >> text_clustering\n
Source code in src/distilabel/steps/clustering/text_clustering.py
class TextClustering(TextClassification, GlobalTask):\n \"\"\"Task that clusters a set of texts and generates summary labels for each cluster.\n\n This is a `GlobalTask` that inherits from `TextClassification`, this means that all\n the attributes from that class are available here. Also, in this case we deal\n with all the inputs at once, instead of using batches. The `input_batch_size` is\n used here to send the examples to the LLM in batches (a subtle difference with the\n more common `Task` definitions).\n The task looks in each cluster for a given number of representative examples (the number\n is set by the `samples_per_cluster` attribute), and sends them to the LLM to get a label/s\n that represent the cluster. The labels are then assigned to each text in the cluster.\n The clusters and projections used in the step, are assumed to be obtained from the `UMAP`\n + `DBSCAN` steps, but could be generated for similar steps, as long as they represent the\n same concepts.\n This step runs a pipeline like the one in this repository:\n https://github.com/huggingface/text-clustering\n\n Input columns:\n - text (`str`): The reference text we want to obtain labels for.\n - projection (`List[float]`): Vector representation of the text to cluster,\n normally the output from the `UMAP` step.\n - cluster_label (`int`): Integer representing the label of a given cluster. -1\n means it wasn't clustered.\n\n Output columns:\n - summary_label (`str`): The label or list of labels for the text.\n - model_name (`str`): The name of the model used to generate the label/s.\n\n Categories:\n - clustering\n - text-classification\n\n References:\n - [`text-clustering repository`](https://github.com/huggingface/text-clustering)\n\n Attributes:\n - savefig: Whether to generate and save a figure with the clustering of the texts.\n - samples_per_cluster: The number of examples to use in the LLM as a sample of the cluster.\n\n Examples:\n Generate labels for a set of texts using clustering:\n\n ```python\n from distilabel.models import InferenceEndpointsLLM\n from distilabel.steps import UMAP, DBSCAN, TextClustering\n from distilabel.pipeline import Pipeline\n\n ds_name = \"argilla-warehouse/personahub-fineweb-edu-4-clustering-100k\"\n\n with Pipeline(name=\"Text clustering dataset\") as pipeline:\n batch_size = 500\n\n ds = load_dataset(ds_name, split=\"train\").select(range(10000))\n loader = make_generator_step(ds, batch_size=batch_size, repo_id=ds_name)\n\n umap = UMAP(n_components=2, metric=\"cosine\")\n dbscan = DBSCAN(eps=0.3, min_samples=30)\n\n text_clustering = TextClustering(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n n=3, # 3 labels per example\n query_title=\"Examples of Personas\",\n samples_per_cluster=10,\n context=(\n \"Describe the main themes, topics, or categories that could describe the \"\n \"following types of personas. 
All the examples of personas must share \"\n \"the same set of labels.\"\n ),\n default_label=\"None\",\n savefig=True,\n input_batch_size=8,\n input_mappings={\"text\": \"persona\"},\n use_default_structured_output=True,\n )\n\n loader >> umap >> dbscan >> text_clustering\n ```\n \"\"\"\n\n savefig: Optional[RuntimeParameter[bool]] = Field(\n default=True,\n description=\"Whether to generate and save a figure with the clustering of the texts.\",\n )\n samples_per_cluster: int = Field(\n default=10,\n description=\"The number of examples to use in the LLM as a sample of the cluster.\",\n )\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task are the same as those for `TextClassification` plus\n the `projection` and `cluster_label` columns (which can be obtained from\n UMAP + DBSCAN steps).\n \"\"\"\n return super().inputs + [\"projection\", \"cluster_label\"]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task is the `summary_label` and the `model_name`.\"\"\"\n return [\"summary_label\", \"model_name\"]\n\n def load(self) -> None:\n super().load()\n if self.savefig and (importlib.util.find_spec(\"matplotlib\") is None):\n raise ImportError(\n \"`matplotlib` package is not installed. Please install it using `pip install matplotlib`.\"\n )\n\n def _save_figure(\n self,\n data: pd.DataFrame,\n cluster_centers: Dict[str, Tuple[float, float]],\n cluster_summaries: Dict[int, str],\n ) -> None:\n \"\"\"Saves the figure starting from the dataframe, using matplotlib.\n\n Args:\n data: pd.DataFrame with the columns 'X', 'Y' and 'labels' representing\n the projections and the label of each text respectively.\n cluster_centers: Dictionary mapping from each label the center of a cluster,\n to help with the placement of the annotations.\n cluster_summaries: The summaries of the clusters, obtained from the LLM.\n \"\"\"\n import matplotlib.pyplot as plt\n\n fig, ax = plt.subplots(figsize=(12, 8), dpi=300)\n unique_labels = data[\"labels\"].unique()\n # Map of colors for each label (-1 is black)\n colormap = dict(\n zip(unique_labels, plt.cm.Spectral(np.linspace(0, 1, len(unique_labels))))\n )\n colormap[-1] = np.array([0, 0, 0, 0])\n data[\"color\"] = data[\"labels\"].map(colormap)\n\n data.plot(\n kind=\"scatter\",\n x=\"X\",\n y=\"Y\",\n c=\"color\",\n s=0.75,\n alpha=0.8,\n linewidth=0.4,\n ax=ax,\n colorbar=False,\n )\n\n for label in cluster_summaries.keys():\n if label == -1:\n continue\n summary = str(cluster_summaries[label]) # These are obtained from the LLM\n position = cluster_centers[label]\n t = ax.text(\n position[0],\n position[1],\n summary,\n horizontalalignment=\"center\",\n verticalalignment=\"center\",\n fontsize=4,\n )\n t.set_bbox(\n {\n \"facecolor\": \"white\",\n \"alpha\": 0.9,\n \"linewidth\": 0,\n \"boxstyle\": \"square,pad=0.1\",\n }\n )\n\n ax.set_axis_off()\n # Save the plot as an artifact of the step\n self.save_artifact(\n name=\"Text clusters\",\n write_function=lambda path: fig.savefig(path / \"figure_clustering.png\"),\n metadata={\"type\": \"image\", \"library\": \"matplotlib\"},\n )\n plt.close()\n\n def _create_figure(\n self,\n inputs: StepInput,\n label2docs: Dict[int, List[str]],\n cluster_summaries: Dict[int, str],\n ) -> None:\n \"\"\"Creates a figure of the clustered texts and save it as an artifact.\n\n Args:\n inputs: The inputs of the step, as we will extract information from them again.\n label2docs: Map from each label to the list of documents (texts) that belong to that cluster.\n cluster_summaries: 
The summaries of the clusters, obtained from the LLM.\n \"\"\"\n self._logger.info(\"\ud83d\uddbc\ufe0f Creating figure for the clusters...\")\n\n labels = []\n projections = []\n id2cluster = {}\n for i, input in enumerate(inputs):\n label = input[\"cluster_label\"]\n id2cluster[i] = label\n labels.append(label)\n projections.append(input[\"projection\"])\n\n projections = np.array(projections)\n\n # Contains the placement of the cluster centers in the figure\n cluster_centers: Dict[str, Tuple[float, float]] = {}\n for label in label2docs.keys():\n x = np.mean([projections[doc, 0] for doc in label2docs[label]])\n y = np.mean([projections[doc, 1] for doc in label2docs[label]])\n cluster_centers[label] = (x, y)\n\n df = pd.DataFrame(\n data={\n \"X\": projections[:, 0],\n \"Y\": projections[:, 1],\n \"labels\": labels,\n }\n )\n\n self._save_figure(\n df, cluster_centers=cluster_centers, cluster_summaries=cluster_summaries\n )\n\n def _prepare_input_texts(\n self,\n inputs: StepInput,\n label2docs: Dict[int, List[int]],\n unique_labels: List[int],\n ) -> List[Dict[str, Union[str, int]]]:\n \"\"\"Prepares a batch of inputs to send to the LLM, with the examples of each cluster.\n\n Args:\n inputs: Inputs from the step.\n label2docs: Map from each label to the list of documents (texts) that\n belong to that cluster.\n unique_labels: The unique labels of the clusters.\n\n Returns:\n The input texts to send to the LLM, with the examples of each cluster\n prepared to be used in the prompt, and an additional key to store the\n labels (that will be needed to find the data after the batches are\n returned from the LLM).\n \"\"\"\n input_texts = []\n for label in range(unique_labels): # The label -1 is implicitly excluded\n # Get the ids but remove possible duplicates, which could happen with bigger probability\n # the bigger the number of examples requested, and the smaller the subset of examples\n ids = set(\n np.random.choice(label2docs[label], size=self.samples_per_cluster)\n ) # Grab the number of examples\n examples = [inputs[i][\"text\"] for i in ids]\n input_text = {\n \"text\": \"\\n\\n\".join(\n [f\"Example {i}:\\n{t}\" for i, t in enumerate(examples, start=1)]\n ),\n \"__LABEL\": label,\n }\n input_texts.append(input_text)\n return input_texts\n\n def process(self, inputs: StepInput) -> \"StepOutput\":\n labels = [input[\"cluster_label\"] for input in inputs]\n # -1 because -1 is the label for the unclassified\n unique_labels = len(set(labels)) - 1\n # This will be the output of the LLM, the set of labels for each cluster\n cluster_summaries: Dict[int, str] = {-1: self.default_label}\n\n # Map from label to list of documents, will use them to select examples from each cluster\n label2docs = defaultdict(list)\n for i, label in enumerate(labels):\n label2docs[label].append(i)\n\n input_texts = self._prepare_input_texts(inputs, label2docs, unique_labels)\n\n # Send the texts in batches to the LLM, and get the labels for each cluster\n for i, batched_inputs in enumerate(batched(input_texts, self.input_batch_size)):\n self._logger.info(f\"\ud83d\udce6 Processing internal batch of inputs {i}...\")\n results = super().process(batched_inputs)\n for result in next(results): # Extract the elements from the generator\n cluster_summaries[result[\"__LABEL\"]] = result[\"labels\"]\n\n # Assign the labels to each text\n for input in inputs:\n input[\"summary_label\"] = json.dumps(\n cluster_summaries[input[\"cluster_label\"]]\n )\n\n if self.savefig:\n self._create_figure(inputs, label2docs, 
cluster_summaries)\n\n yield inputs\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.TextClustering.inputs","title":"inputs
property
","text":"The input for the task are the same as those for TextClassification
plus the projection
and cluster_label
columns (which can be obtained from UMAP + DBSCAN steps).
outputs
property
","text":"The output for the task is the summary_label
and the model_name
.
_save_figure(data, cluster_centers, cluster_summaries)
","text":"Saves the figure starting from the dataframe, using matplotlib.
Parameters:
Name Type Description Defaultdata
DataFrame
pd.DataFrame with the columns 'X', 'Y' and 'labels' representing the projections and the label of each text respectively.
requiredcluster_centers
Dict[str, Tuple[float, float]]
Dictionary mapping each label to the center of its cluster, to help with the placement of the annotations.
requiredcluster_summaries
Dict[int, str]
The summaries of the clusters, obtained from the LLM.
required Source code insrc/distilabel/steps/clustering/text_clustering.py
def _save_figure(\n self,\n data: pd.DataFrame,\n cluster_centers: Dict[str, Tuple[float, float]],\n cluster_summaries: Dict[int, str],\n) -> None:\n \"\"\"Saves the figure starting from the dataframe, using matplotlib.\n\n Args:\n data: pd.DataFrame with the columns 'X', 'Y' and 'labels' representing\n the projections and the label of each text respectively.\n cluster_centers: Dictionary mapping from each label the center of a cluster,\n to help with the placement of the annotations.\n cluster_summaries: The summaries of the clusters, obtained from the LLM.\n \"\"\"\n import matplotlib.pyplot as plt\n\n fig, ax = plt.subplots(figsize=(12, 8), dpi=300)\n unique_labels = data[\"labels\"].unique()\n # Map of colors for each label (-1 is black)\n colormap = dict(\n zip(unique_labels, plt.cm.Spectral(np.linspace(0, 1, len(unique_labels))))\n )\n colormap[-1] = np.array([0, 0, 0, 0])\n data[\"color\"] = data[\"labels\"].map(colormap)\n\n data.plot(\n kind=\"scatter\",\n x=\"X\",\n y=\"Y\",\n c=\"color\",\n s=0.75,\n alpha=0.8,\n linewidth=0.4,\n ax=ax,\n colorbar=False,\n )\n\n for label in cluster_summaries.keys():\n if label == -1:\n continue\n summary = str(cluster_summaries[label]) # These are obtained from the LLM\n position = cluster_centers[label]\n t = ax.text(\n position[0],\n position[1],\n summary,\n horizontalalignment=\"center\",\n verticalalignment=\"center\",\n fontsize=4,\n )\n t.set_bbox(\n {\n \"facecolor\": \"white\",\n \"alpha\": 0.9,\n \"linewidth\": 0,\n \"boxstyle\": \"square,pad=0.1\",\n }\n )\n\n ax.set_axis_off()\n # Save the plot as an artifact of the step\n self.save_artifact(\n name=\"Text clusters\",\n write_function=lambda path: fig.savefig(path / \"figure_clustering.png\"),\n metadata={\"type\": \"image\", \"library\": \"matplotlib\"},\n )\n plt.close()\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.TextClustering._create_figure","title":"_create_figure(inputs, label2docs, cluster_summaries)
","text":"Creates a figure of the clustered texts and save it as an artifact.
Parameters:
Name Type Description Defaultinputs
StepInput
The inputs of the step, as we will extract information from them again.
requiredlabel2docs
Dict[int, List[str]]
Map from each label to the list of documents (texts) that belong to that cluster.
requiredcluster_summaries
Dict[int, str]
The summaries of the clusters, obtained from the LLM.
required Source code insrc/distilabel/steps/clustering/text_clustering.py
def _create_figure(\n self,\n inputs: StepInput,\n label2docs: Dict[int, List[str]],\n cluster_summaries: Dict[int, str],\n) -> None:\n \"\"\"Creates a figure of the clustered texts and save it as an artifact.\n\n Args:\n inputs: The inputs of the step, as we will extract information from them again.\n label2docs: Map from each label to the list of documents (texts) that belong to that cluster.\n cluster_summaries: The summaries of the clusters, obtained from the LLM.\n \"\"\"\n self._logger.info(\"\ud83d\uddbc\ufe0f Creating figure for the clusters...\")\n\n labels = []\n projections = []\n id2cluster = {}\n for i, input in enumerate(inputs):\n label = input[\"cluster_label\"]\n id2cluster[i] = label\n labels.append(label)\n projections.append(input[\"projection\"])\n\n projections = np.array(projections)\n\n # Contains the placement of the cluster centers in the figure\n cluster_centers: Dict[str, Tuple[float, float]] = {}\n for label in label2docs.keys():\n x = np.mean([projections[doc, 0] for doc in label2docs[label]])\n y = np.mean([projections[doc, 1] for doc in label2docs[label]])\n cluster_centers[label] = (x, y)\n\n df = pd.DataFrame(\n data={\n \"X\": projections[:, 0],\n \"Y\": projections[:, 1],\n \"labels\": labels,\n }\n )\n\n self._save_figure(\n df, cluster_centers=cluster_centers, cluster_summaries=cluster_summaries\n )\n
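The annotation for each cluster is placed at the mean of the 2D projections of the documents assigned to it. A minimal sketch of that computation outside the step, using made-up toy projections and a toy `label2docs` map:

```python
import numpy as np

# Toy 2D projections (as produced by the UMAP step) and a cluster label -> document indices map
projections = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [6.0, 4.0]])
label2docs = {0: [0, 1], 1: [2, 3]}

# Each cluster center is the mean X/Y of the projections of its documents
cluster_centers = {
    label: (
        float(np.mean([projections[doc, 0] for doc in docs])),
        float(np.mean([projections[doc, 1] for doc in docs])),
    )
    for label, docs in label2docs.items()
}
print(cluster_centers)  # {0: (0.5, 0.5), 1: (5.0, 4.0)}
```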
"},{"location":"api/step_gallery/extra/#distilabel.steps.TextClustering._prepare_input_texts","title":"_prepare_input_texts(inputs, label2docs, unique_labels)
","text":"Prepares a batch of inputs to send to the LLM, with the examples of each cluster.
Parameters:
Name Type Description Defaultinputs
StepInput
Inputs from the step.
requiredlabel2docs
Dict[int, List[int]]
Map from each label to the list of documents (texts) that belong to that cluster.
requiredunique_labels
List[int]
The unique labels of the clusters.
requiredReturns:
Type DescriptionList[Dict[str, Union[str, int]]]
The input texts to send to the LLM, with the examples of each cluster
List[Dict[str, Union[str, int]]]
prepared to be used in the prompt, and an additional key to store the
List[Dict[str, Union[str, int]]]
labels (that will be needed to find the data after the batches are
List[Dict[str, Union[str, int]]]
returned from the LLM).
Source code insrc/distilabel/steps/clustering/text_clustering.py
def _prepare_input_texts(\n self,\n inputs: StepInput,\n label2docs: Dict[int, List[int]],\n unique_labels: List[int],\n) -> List[Dict[str, Union[str, int]]]:\n \"\"\"Prepares a batch of inputs to send to the LLM, with the examples of each cluster.\n\n Args:\n inputs: Inputs from the step.\n label2docs: Map from each label to the list of documents (texts) that\n belong to that cluster.\n unique_labels: The unique labels of the clusters.\n\n Returns:\n The input texts to send to the LLM, with the examples of each cluster\n prepared to be used in the prompt, and an additional key to store the\n labels (that will be needed to find the data after the batches are\n returned from the LLM).\n \"\"\"\n input_texts = []\n for label in range(unique_labels): # The label -1 is implicitly excluded\n # Get the ids but remove possible duplicates, which could happen with bigger probability\n # the bigger the number of examples requested, and the smaller the subset of examples\n ids = set(\n np.random.choice(label2docs[label], size=self.samples_per_cluster)\n ) # Grab the number of examples\n examples = [inputs[i][\"text\"] for i in ids]\n input_text = {\n \"text\": \"\\n\\n\".join(\n [f\"Example {i}:\\n{t}\" for i, t in enumerate(examples, start=1)]\n ),\n \"__LABEL\": label,\n }\n input_texts.append(input_text)\n return input_texts\n
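The sampling-and-formatting logic above can be reproduced standalone; the sketch below uses toy texts and a hypothetical `samples_per_cluster` of 2. Because the random choice may pick duplicates that are then removed with `set`, a cluster can contribute fewer examples than requested:

```python
import numpy as np

texts = ["a chemistry student", "a physics teacher", "a piano teacher", "a guitar teacher"]
label2docs = {0: [0, 1], 1: [2, 3]}  # cluster label -> indices of the texts in that cluster
samples_per_cluster = 2

input_texts = []
for label, docs in label2docs.items():
    # Sample (possibly with repetition) and deduplicate, as described above
    ids = set(np.random.choice(docs, size=samples_per_cluster))
    examples = [texts[i] for i in ids]
    input_texts.append(
        {
            "text": "\n\n".join(
                f"Example {i}:\n{t}" for i, t in enumerate(examples, start=1)
            ),
            "__LABEL": label,
        }
    )
print(input_texts[0]["text"])
```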
"},{"location":"api/step_gallery/extra/#distilabel.steps.UMAP","title":"UMAP
","text":" Bases: GlobalStep
UMAP is a general purpose manifold learning and dimension reduction algorithm.
This is a GlobalStep
that reduces the dimensionality of the embeddings using UMAP. Visit the TextClustering
step for an example of use. The trained model is saved as an artifact when creating a distiset and pushing it to the Hugging Face Hub.
List[float]
): The original embeddings we want to reduce the dimension.List[float]
): Embedding reduced to the number of components specified; the size of the new embeddings will be determined by the n_components
.UMAP repository
UMAP documentation
Attributes:
Name Type Description-
n_components
The dimension of the space to embed into. This defaults to 2 to provide easy visualization (that's probably what you want), but can reasonably be set to any integer value in the range 2 to 100.
-
metric
The metric to use to compute distances in high dimensional space. Visit UMAP's documentation for more information. Defaults to euclidean
.
-
n_jobs
The number of parallel jobs to run. Defaults to 8
.
-
random_state
The random state to use for the UMAP algorithm.
Runtime parametersn_components
: The dimension of the space to embed into. This defaults to 2 to provide easy visualization (that's probably what you want), but can reasonably be set to any integer value in the range 2 to 100.metric
: The metric to use to compute distances in high dimensional space. Visit UMAP's documentation for more information. Defaults to euclidean
.n_jobs
: The number of parallel jobs to run. Defaults to 8
.random_state
: The random state to use for the UMAP algorithm.@misc{mcinnes2020umapuniformmanifoldapproximation,\n title={UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction},\n author={Leland McInnes and John Healy and James Melville},\n year={2020},\n eprint={1802.03426},\n archivePrefix={arXiv},\n primaryClass={stat.ML},\n url={https://arxiv.org/abs/1802.03426},\n}\n
Source code in src/distilabel/steps/clustering/umap.py
class UMAP(GlobalStep):\n r\"\"\"UMAP is a general purpose manifold learning and dimension reduction algorithm.\n\n This is a `GlobalStep` that reduces the dimensionality of the embeddings using. Visit\n the `TextClustering` step for an example of use. The trained model is saved as an artifact\n when creating a distiset and pushing it to the Hugging Face Hub.\n\n Input columns:\n - embedding (`List[float]`): The original embeddings we want to reduce the dimension.\n\n Output columns:\n - projection (`List[float]`): Embedding reduced to the number of components specified,\n the size of the new embeddings will be determined by the `n_components`.\n\n Categories:\n - clustering\n - text-classification\n\n References:\n - [`UMAP repository`](https://github.com/lmcinnes/umap/tree/master)\n - [`UMAP documentation`](https://umap-learn.readthedocs.io/en/latest/)\n\n Attributes:\n - n_components: The dimension of the space to embed into. This defaults to 2 to\n provide easy visualization (that's probably what you want), but can\n reasonably be set to any integer value in the range 2 to 100.\n - metric: The metric to use to compute distances in high dimensional space.\n Visit UMAP's documentation for more information. Defaults to `euclidean`.\n - n_jobs: The number of parallel jobs to run. Defaults to `8`.\n - random_state: The random state to use for the UMAP algorithm.\n\n Runtime parameters:\n - `n_components`: The dimension of the space to embed into. This defaults to 2 to\n provide easy visualization (that's probably what you want), but can\n reasonably be set to any integer value in the range 2 to 100.\n - `metric`: The metric to use to compute distances in high dimensional space.\n Visit UMAP's documentation for more information. Defaults to `euclidean`.\n - `n_jobs`: The number of parallel jobs to run. Defaults to `8`.\n - `random_state`: The random state to use for the UMAP algorithm.\n\n Citations:\n ```\n @misc{mcinnes2020umapuniformmanifoldapproximation,\n title={UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction},\n author={Leland McInnes and John Healy and James Melville},\n year={2020},\n eprint={1802.03426},\n archivePrefix={arXiv},\n primaryClass={stat.ML},\n url={https://arxiv.org/abs/1802.03426},\n }\n ```\n \"\"\"\n\n n_components: Optional[RuntimeParameter[int]] = Field(\n default=2,\n description=(\n \"The dimension of the space to embed into. This defaults to 2 to \"\n \"provide easy visualization, but can reasonably be set to any \"\n \"integer value in the range 2 to 100.\"\n ),\n )\n metric: Optional[RuntimeParameter[str]] = Field(\n default=\"euclidean\",\n description=(\n \"The metric to use to compute distances in high dimensional space. \"\n \"Visit UMAP's documentation for more information.\"\n ),\n )\n n_jobs: Optional[RuntimeParameter[int]] = Field(\n default=8, description=\"The number of parallel jobs to run.\"\n )\n random_state: Optional[RuntimeParameter[int]] = Field(\n default=None, description=\"The random state to use for the UMAP algorithm.\"\n )\n\n _umap: Optional[\"_UMAP\"] = PrivateAttr(None)\n\n def load(self) -> None:\n super().load()\n if importlib.util.find_spec(\"umap\") is None:\n raise ImportError(\n \"`umap` package is not installed. 
Please install it using `pip install umap-learn`.\"\n )\n from umap import UMAP as _UMAP\n\n self._umap = _UMAP(\n n_components=self.n_components,\n metric=self.metric,\n n_jobs=self.n_jobs,\n random_state=self.random_state,\n )\n\n def unload(self) -> None:\n self._umap = None\n\n @property\n def inputs(self) -> List[str]:\n return [\"embedding\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"projection\"]\n\n def _save_model(self, model: Any) -> None:\n import joblib\n\n def save_model(path):\n with open(str(path / \"UMAP.joblib\"), \"wb\") as f:\n joblib.dump(model, f)\n\n self.save_artifact(\n name=\"UMAP_model\",\n write_function=lambda path: save_model(path),\n metadata={\n \"n_components\": self.n_components,\n \"metric\": self.metric,\n },\n )\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n # Shape of the embeddings is (n_samples, n_features)\n embeddings = np.array([input[\"embedding\"] for input in inputs])\n\n self._logger.info(\"\ud83c\udfcb\ufe0f\u200d\u2640\ufe0f Start UMAP training...\")\n mapper = self._umap.fit(embeddings)\n # Shape of the projection will be (n_samples, n_components)\n for input, projection in zip(inputs, mapper.embedding_):\n input[\"projection\"] = projection\n\n self._save_model(mapper)\n yield inputs\n
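Outside of a pipeline, the reduction the step performs boils down to fitting `umap-learn` on the matrix of embeddings and reading back `embedding_`; the step additionally saves the fitted model as an artifact. A minimal sketch with random toy embeddings, assuming `umap-learn` is installed:

```python
import numpy as np
from umap import UMAP as UMAPReducer

# Toy "embeddings": 200 vectors of dimension 16
embeddings = np.random.rand(200, 16)

reducer = UMAPReducer(n_components=2, metric="euclidean", n_jobs=1, random_state=42)
mapper = reducer.fit(embeddings)

# Shape of the projection is (n_samples, n_components)
print(mapper.embedding_.shape)  # (200, 2)
```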
"},{"location":"api/step_gallery/extra/#distilabel.steps.CombineOutputs","title":"CombineOutputs
","text":" Bases: Step
Combine the outputs of several upstream steps.
CombineOutputs
is a Step
that takes the outputs of several upstream steps and combines them to generate a new dictionary with all keys/columns of the upstream steps outputs.
Step
s): All the columns of the upstream steps outputs.Step
s): All the columns of the upstream steps outputs.Examples:
Combine dictionaries of a dataset:\n\n```python\nfrom distilabel.steps import CombineOutputs\n\ncombine_outputs = CombineOutputs()\ncombine_outputs.load()\n\nresult = next(\n combine_outputs.process(\n [{\"a\": 1, \"b\": 2}, {\"a\": 3, \"b\": 4}],\n [{\"c\": 5, \"d\": 6}, {\"c\": 7, \"d\": 8}],\n )\n)\n# [\n# {\"a\": 1, \"b\": 2, \"c\": 5, \"d\": 6},\n# {\"a\": 3, \"b\": 4, \"c\": 7, \"d\": 8},\n# ]\n```\n\nCombine upstream steps outputs in a pipeline:\n\n```python\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import CombineOutputs\n\nwith Pipeline() as pipeline:\n step_1 = ...\n step_2 = ...\n step_3 = ...\n combine = CombineOutputs()\n\n [step_1, step_2, step_3] >> combine\n```\n
Source code in src/distilabel/steps/columns/combine.py
class CombineOutputs(Step):\n \"\"\"Combine the outputs of several upstream steps.\n\n `CombineOutputs` is a `Step` that takes the outputs of several upstream steps and combines\n them to generate a new dictionary with all keys/columns of the upstream steps outputs.\n\n Input columns:\n - dynamic (based on the upstream `Step`s): All the columns of the upstream steps outputs.\n\n Output columns:\n - dynamic (based on the upstream `Step`s): All the columns of the upstream steps outputs.\n\n Categories:\n - columns\n\n Examples:\n\n Combine dictionaries of a dataset:\n\n ```python\n from distilabel.steps import CombineOutputs\n\n combine_outputs = CombineOutputs()\n combine_outputs.load()\n\n result = next(\n combine_outputs.process(\n [{\"a\": 1, \"b\": 2}, {\"a\": 3, \"b\": 4}],\n [{\"c\": 5, \"d\": 6}, {\"c\": 7, \"d\": 8}],\n )\n )\n # [\n # {\"a\": 1, \"b\": 2, \"c\": 5, \"d\": 6},\n # {\"a\": 3, \"b\": 4, \"c\": 7, \"d\": 8},\n # ]\n ```\n\n Combine upstream steps outputs in a pipeline:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps import CombineOutputs\n\n with Pipeline() as pipeline:\n step_1 = ...\n step_2 = ...\n step_3 = ...\n combine = CombineOutputs()\n\n [step_1, step_2, step_3] >> combine\n ```\n \"\"\"\n\n def process(self, *inputs: StepInput) -> \"StepOutput\":\n combined_outputs = []\n for output_dicts in zip(*inputs):\n combined_dict = {}\n for output_dict in output_dicts:\n combined_dict.update(\n {\n k: v\n for k, v in output_dict.items()\n if k != DISTILABEL_METADATA_KEY\n }\n )\n\n if any(\n DISTILABEL_METADATA_KEY in output_dict for output_dict in output_dicts\n ):\n combined_dict[DISTILABEL_METADATA_KEY] = merge_distilabel_metadata(\n *output_dicts\n )\n combined_outputs.append(combined_dict)\n\n yield combined_outputs\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.DeitaFiltering","title":"DeitaFiltering
","text":" Bases: GlobalStep
Filter dataset rows using DEITA filtering strategy.
Filter the dataset based on the DEITA score and the cosine distance between the embeddings. It's an implementation of the filtering step from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
Attributes:
Name Type Descriptiondata_budget
RuntimeParameter[int]
The desired size of the dataset after filtering.
diversity_threshold
RuntimeParameter[float]
If a row has a cosine distance with respect to it's nearest neighbor greater than this value, it will be included in the filtered dataset. Defaults to 0.9
.
normalize_embeddings
RuntimeParameter[bool]
Whether to normalize the embeddings before computing the cosine distance. Defaults to True
.
data_budget
: The desired size of the dataset after filtering.diversity_threshold
: If a row has a cosine distance with respect to it's nearest neighbor greater than this value, it will be included in the filtered dataset.float
): The score of the instruction generated by ComplexityScorer
step.float
): The score of the response generated by QualityScorer
step.List[float]
): The embedding generated for the conversation of the instruction-response pair using GenerateEmbeddings
step.float
): The DEITA score for the instruction-response pair.List[str]
): The scores used to compute the DEITA score.float
): The cosine distance between the embeddings of the instruction-response pair.What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
Examples:
Filter the dataset based on the DEITA score and the cosine distance between the embeddings:
from distilabel.steps import DeitaFiltering\n\ndeita_filtering = DeitaFiltering(data_budget=1)\n\ndeita_filtering.load()\n\nresult = next(\n deita_filtering.process(\n [\n {\n \"evol_instruction_score\": 0.5,\n \"evol_response_score\": 0.5,\n \"embedding\": [-8.12729941, -5.24642847, -6.34003029],\n },\n {\n \"evol_instruction_score\": 0.6,\n \"evol_response_score\": 0.6,\n \"embedding\": [2.99329242, 0.7800932, 0.7799726],\n },\n {\n \"evol_instruction_score\": 0.7,\n \"evol_response_score\": 0.7,\n \"embedding\": [10.29041806, 14.33088073, 13.00557506],\n },\n ],\n )\n)\n# >>> result\n# [{'evol_instruction_score': 0.5, 'evol_response_score': 0.5, 'embedding': [-8.12729941, -5.24642847, -6.34003029], 'deita_score': 0.25, 'deita_score_computed_with': ['evol_instruction_score', 'evol_response_score'], 'nearest_neighbor_distance': 1.9042812683723933}]\n
Citations @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n}\n
Source code in src/distilabel/steps/deita.py
class DeitaFiltering(GlobalStep):\n \"\"\"Filter dataset rows using DEITA filtering strategy.\n\n Filter the dataset based on the DEITA score and the cosine distance between the embeddings.\n It's an implementation of the filtering step from the paper 'What Makes Good Data\n for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.\n\n Attributes:\n data_budget: The desired size of the dataset after filtering.\n diversity_threshold: If a row has a cosine distance with respect to it's nearest\n neighbor greater than this value, it will be included in the filtered dataset.\n Defaults to `0.9`.\n normalize_embeddings: Whether to normalize the embeddings before computing the cosine\n distance. Defaults to `True`.\n\n Runtime parameters:\n - `data_budget`: The desired size of the dataset after filtering.\n - `diversity_threshold`: If a row has a cosine distance with respect to it's nearest\n neighbor greater than this value, it will be included in the filtered dataset.\n\n Input columns:\n - evol_instruction_score (`float`): The score of the instruction generated by\n `ComplexityScorer` step.\n - evol_response_score (`float`): The score of the response generated by\n `QualityScorer` step.\n - embedding (`List[float]`): The embedding generated for the conversation of the\n instruction-response pair using `GenerateEmbeddings` step.\n\n Output columns:\n - deita_score (`float`): The DEITA score for the instruction-response pair.\n - deita_score_computed_with (`List[str]`): The scores used to compute the DEITA\n score.\n - nearest_neighbor_distance (`float`): The cosine distance between the embeddings\n of the instruction-response pair.\n\n Categories:\n - filtering\n\n References:\n - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)\n\n Examples:\n Filter the dataset based on the DEITA score and the cosine distance between the embeddings:\n\n ```python\n from distilabel.steps import DeitaFiltering\n\n deita_filtering = DeitaFiltering(data_budget=1)\n\n deita_filtering.load()\n\n result = next(\n deita_filtering.process(\n [\n {\n \"evol_instruction_score\": 0.5,\n \"evol_response_score\": 0.5,\n \"embedding\": [-8.12729941, -5.24642847, -6.34003029],\n },\n {\n \"evol_instruction_score\": 0.6,\n \"evol_response_score\": 0.6,\n \"embedding\": [2.99329242, 0.7800932, 0.7799726],\n },\n {\n \"evol_instruction_score\": 0.7,\n \"evol_response_score\": 0.7,\n \"embedding\": [10.29041806, 14.33088073, 13.00557506],\n },\n ],\n )\n )\n # >>> result\n # [{'evol_instruction_score': 0.5, 'evol_response_score': 0.5, 'embedding': [-8.12729941, -5.24642847, -6.34003029], 'deita_score': 0.25, 'deita_score_computed_with': ['evol_instruction_score', 'evol_response_score'], 'nearest_neighbor_distance': 1.9042812683723933}]\n ```\n\n Citations:\n ```\n @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? 
A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n }\n ```\n \"\"\"\n\n data_budget: RuntimeParameter[int] = Field(\n default=None, description=\"The desired size of the dataset after filtering.\"\n )\n diversity_threshold: RuntimeParameter[float] = Field(\n default=0.9,\n description=\"If a row has a cosine distance with respect to it's nearest neighbor\"\n \" greater than this value, it will be included in the filtered dataset.\",\n )\n normalize_embeddings: RuntimeParameter[bool] = Field(\n default=True,\n description=\"Whether to normalize the embeddings before computing the cosine distance.\",\n )\n distance_metric: RuntimeParameter[Literal[\"cosine\", \"manhattan\"]] = Field(\n default=\"cosine\",\n description=\"The distance metric to use. Currently only 'cosine' is supported.\",\n )\n\n @property\n def inputs(self) -> List[str]:\n return [\"evol_instruction_score\", \"evol_response_score\", \"embedding\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"deita_score\", \"nearest_neighbor_distance\", \"deita_score_computed_with\"]\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Filter the dataset based on the DEITA score and the cosine distance between the\n embeddings.\n\n Args:\n inputs: The input data.\n\n Returns:\n The filtered dataset.\n \"\"\"\n inputs = self._compute_deita_score(inputs)\n inputs = self._compute_nearest_neighbor(inputs)\n inputs.sort(key=lambda x: x[\"deita_score\"], reverse=True)\n\n selected_rows = []\n for input in inputs:\n if len(selected_rows) >= self.data_budget: # type: ignore\n break\n if input[\"nearest_neighbor_distance\"] >= self.diversity_threshold:\n selected_rows.append(input)\n yield selected_rows\n\n def _compute_deita_score(self, inputs: StepInput) -> StepInput:\n \"\"\"Computes the DEITA score for each instruction-response pair. The DEITA score is\n the product of the instruction score and the response score.\n\n Args:\n inputs: The input data.\n\n Returns:\n The input data with the DEITA score computed.\n \"\"\"\n for input_ in inputs:\n evol_instruction_score = input_.get(\"evol_instruction_score\")\n evol_response_score = input_.get(\"evol_response_score\")\n\n if evol_instruction_score and evol_response_score:\n deita_score = evol_instruction_score * evol_response_score\n score_computed_with = [\"evol_instruction_score\", \"evol_response_score\"]\n elif evol_instruction_score:\n self._logger.warning(\n \"Response score is missing for the instruction-response pair. Using\"\n \" instruction score as DEITA score.\"\n )\n deita_score = evol_instruction_score\n score_computed_with = [\"evol_instruction_score\"]\n elif evol_response_score:\n self._logger.warning(\n \"Instruction score is missing for the instruction-response pair. Using\"\n \" response score as DEITA score.\"\n )\n deita_score = evol_response_score\n score_computed_with = [\"evol_response_score\"]\n else:\n self._logger.warning(\n \"Instruction and response scores are missing for the instruction-response\"\n \" pair. 
Setting DEITA score to 0.\"\n )\n deita_score = 0\n score_computed_with = []\n\n input_.update(\n {\n \"deita_score\": deita_score,\n \"deita_score_computed_with\": score_computed_with,\n }\n )\n return inputs\n\n def _compute_nearest_neighbor(self, inputs: StepInput) -> StepInput:\n \"\"\"Computes the cosine distance between the embeddings of the instruction-response\n pairs and the nearest neighbor.\n\n Args:\n inputs: The input data.\n\n Returns:\n The input data with the cosine distance computed.\n \"\"\"\n embeddings = np.array([input[\"embedding\"] for input in inputs])\n if self.normalize_embeddings:\n embeddings = self._normalize_embeddings(embeddings)\n self._logger.info(\"\ud83d\udccf Computing nearest neighbor distance...\")\n\n if self.distance_metric == \"cosine\":\n self._logger.info(\"\ud83d\udccf Using cosine distance.\")\n distances = self._cosine_distance(embeddings)\n else:\n self._logger.info(\"\ud83d\udccf Using manhattan distance.\")\n distances = self._manhattan_distance(embeddings)\n\n for distance, input in zip(distances, inputs):\n input[\"nearest_neighbor_distance\"] = distance\n return inputs\n\n def _normalize_embeddings(self, embeddings: np.ndarray) -> np.ndarray:\n \"\"\"Normalize the embeddings.\n\n Args:\n embeddings: The embeddings to normalize.\n\n Returns:\n The normalized embeddings.\n \"\"\"\n self._logger.info(\"\u2696\ufe0f Normalizing embeddings...\")\n norms = np.linalg.norm(embeddings, axis=1, keepdims=True)\n return embeddings / norms\n\n def _cosine_distance(self, embeddings: np.array) -> np.array: # type: ignore\n \"\"\"Computes the cosine distance between the embeddings.\n\n Args:\n embeddings: The embeddings.\n\n Returns:\n The cosine distance between the embeddings.\n \"\"\"\n cosine_similarity = np.dot(embeddings, embeddings.T)\n cosine_distance = 1 - cosine_similarity\n # Ignore self-distance\n np.fill_diagonal(cosine_distance, np.inf)\n return np.min(cosine_distance, axis=1)\n\n def _manhattan_distance(self, embeddings: np.array) -> np.array: # type: ignore\n \"\"\"Computes the manhattan distance between the embeddings.\n\n Args:\n embeddings: The embeddings.\n\n Returns:\n The manhattan distance between the embeddings.\n \"\"\"\n manhattan_distance = np.abs(embeddings[:, None] - embeddings).sum(-1)\n # Ignore self-distance\n np.fill_diagonal(manhattan_distance, np.inf)\n return np.min(manhattan_distance, axis=1)\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.DeitaFiltering.process","title":"process(inputs)
","text":"Filter the dataset based on the DEITA score and the cosine distance between the embeddings.
Parameters:
Name Type Description Defaultinputs
StepInput
The input data.
requiredReturns:
Type DescriptionStepOutput
The filtered dataset.
Source code insrc/distilabel/steps/deita.py
def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Filter the dataset based on the DEITA score and the cosine distance between the\n embeddings.\n\n Args:\n inputs: The input data.\n\n Returns:\n The filtered dataset.\n \"\"\"\n inputs = self._compute_deita_score(inputs)\n inputs = self._compute_nearest_neighbor(inputs)\n inputs.sort(key=lambda x: x[\"deita_score\"], reverse=True)\n\n selected_rows = []\n for input in inputs:\n if len(selected_rows) >= self.data_budget: # type: ignore\n break\n if input[\"nearest_neighbor_distance\"] >= self.diversity_threshold:\n selected_rows.append(input)\n yield selected_rows\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.DeitaFiltering._compute_deita_score","title":"_compute_deita_score(inputs)
","text":"Computes the DEITA score for each instruction-response pair. The DEITA score is the product of the instruction score and the response score.
Parameters:
Name Type Description Defaultinputs
StepInput
The input data.
requiredReturns:
Type DescriptionStepInput
The input data with the DEITA score computed.
Source code insrc/distilabel/steps/deita.py
def _compute_deita_score(self, inputs: StepInput) -> StepInput:\n \"\"\"Computes the DEITA score for each instruction-response pair. The DEITA score is\n the product of the instruction score and the response score.\n\n Args:\n inputs: The input data.\n\n Returns:\n The input data with the DEITA score computed.\n \"\"\"\n for input_ in inputs:\n evol_instruction_score = input_.get(\"evol_instruction_score\")\n evol_response_score = input_.get(\"evol_response_score\")\n\n if evol_instruction_score and evol_response_score:\n deita_score = evol_instruction_score * evol_response_score\n score_computed_with = [\"evol_instruction_score\", \"evol_response_score\"]\n elif evol_instruction_score:\n self._logger.warning(\n \"Response score is missing for the instruction-response pair. Using\"\n \" instruction score as DEITA score.\"\n )\n deita_score = evol_instruction_score\n score_computed_with = [\"evol_instruction_score\"]\n elif evol_response_score:\n self._logger.warning(\n \"Instruction score is missing for the instruction-response pair. Using\"\n \" response score as DEITA score.\"\n )\n deita_score = evol_response_score\n score_computed_with = [\"evol_response_score\"]\n else:\n self._logger.warning(\n \"Instruction and response scores are missing for the instruction-response\"\n \" pair. Setting DEITA score to 0.\"\n )\n deita_score = 0\n score_computed_with = []\n\n input_.update(\n {\n \"deita_score\": deita_score,\n \"deita_score_computed_with\": score_computed_with,\n }\n )\n return inputs\n
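As a quick worked example of the fallback logic above: a row with `evol_instruction_score=0.5` and `evol_response_score=0.5` gets `deita_score = 0.5 * 0.5 = 0.25`, matching the example output shown earlier. A plain-Python sketch of the same rule:

```python
def deita_score(evol_instruction_score, evol_response_score):
    # Product when both scores are present, otherwise fall back to the available one (or 0)
    if evol_instruction_score and evol_response_score:
        return evol_instruction_score * evol_response_score
    return evol_instruction_score or evol_response_score or 0

print(deita_score(0.5, 0.5))    # 0.25
print(deita_score(0.7, None))   # 0.7
print(deita_score(None, None))  # 0
```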
"},{"location":"api/step_gallery/extra/#distilabel.steps.DeitaFiltering._compute_nearest_neighbor","title":"_compute_nearest_neighbor(inputs)
","text":"Computes the cosine distance between the embeddings of the instruction-response pairs and the nearest neighbor.
Parameters:
Name Type Description Defaultinputs
StepInput
The input data.
requiredReturns:
Type DescriptionStepInput
The input data with the cosine distance computed.
Source code insrc/distilabel/steps/deita.py
def _compute_nearest_neighbor(self, inputs: StepInput) -> StepInput:\n \"\"\"Computes the cosine distance between the embeddings of the instruction-response\n pairs and the nearest neighbor.\n\n Args:\n inputs: The input data.\n\n Returns:\n The input data with the cosine distance computed.\n \"\"\"\n embeddings = np.array([input[\"embedding\"] for input in inputs])\n if self.normalize_embeddings:\n embeddings = self._normalize_embeddings(embeddings)\n self._logger.info(\"\ud83d\udccf Computing nearest neighbor distance...\")\n\n if self.distance_metric == \"cosine\":\n self._logger.info(\"\ud83d\udccf Using cosine distance.\")\n distances = self._cosine_distance(embeddings)\n else:\n self._logger.info(\"\ud83d\udccf Using manhattan distance.\")\n distances = self._manhattan_distance(embeddings)\n\n for distance, input in zip(distances, inputs):\n input[\"nearest_neighbor_distance\"] = distance\n return inputs\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.DeitaFiltering._normalize_embeddings","title":"_normalize_embeddings(embeddings)
","text":"Normalize the embeddings.
Parameters:
Name Type Description Defaultembeddings
ndarray
The embeddings to normalize.
requiredReturns:
Type Descriptionndarray
The normalized embeddings.
Source code insrc/distilabel/steps/deita.py
def _normalize_embeddings(self, embeddings: np.ndarray) -> np.ndarray:\n \"\"\"Normalize the embeddings.\n\n Args:\n embeddings: The embeddings to normalize.\n\n Returns:\n The normalized embeddings.\n \"\"\"\n self._logger.info(\"\u2696\ufe0f Normalizing embeddings...\")\n norms = np.linalg.norm(embeddings, axis=1, keepdims=True)\n return embeddings / norms\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.DeitaFiltering._cosine_distance","title":"_cosine_distance(embeddings)
","text":"Computes the cosine distance between the embeddings.
Parameters:
Name Type Description Defaultembeddings
array
The embeddings.
requiredReturns:
Type Descriptionarray
The cosine distance between the embeddings.
Source code insrc/distilabel/steps/deita.py
def _cosine_distance(self, embeddings: np.array) -> np.array: # type: ignore\n \"\"\"Computes the cosine distance between the embeddings.\n\n Args:\n embeddings: The embeddings.\n\n Returns:\n The cosine distance between the embeddings.\n \"\"\"\n cosine_similarity = np.dot(embeddings, embeddings.T)\n cosine_distance = 1 - cosine_similarity\n # Ignore self-distance\n np.fill_diagonal(cosine_distance, np.inf)\n return np.min(cosine_distance, axis=1)\n
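Note that `1 - embeddings @ embeddings.T` is only a cosine distance when the embeddings are L2-normalized, which is what `normalize_embeddings=True` guarantees beforehand. A small numeric sketch of the nearest-neighbour distance computed this way:

```python
import numpy as np

embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# L2-normalize so that the dot product equals the cosine similarity
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

cosine_distance = 1 - np.dot(embeddings, embeddings.T)
np.fill_diagonal(cosine_distance, np.inf)  # ignore self-distance

print(np.min(cosine_distance, axis=1).round(3))  # [0.293 0.293 0.293]
```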
"},{"location":"api/step_gallery/extra/#distilabel.steps.DeitaFiltering._manhattan_distance","title":"_manhattan_distance(embeddings)
","text":"Computes the manhattan distance between the embeddings.
Parameters:
Name Type Description Defaultembeddings
array
The embeddings.
requiredReturns:
Type Descriptionarray
The manhattan distance between the embeddings.
Source code insrc/distilabel/steps/deita.py
def _manhattan_distance(self, embeddings: np.array) -> np.array: # type: ignore\n \"\"\"Computes the manhattan distance between the embeddings.\n\n Args:\n embeddings: The embeddings.\n\n Returns:\n The manhattan distance between the embeddings.\n \"\"\"\n manhattan_distance = np.abs(embeddings[:, None] - embeddings).sum(-1)\n # Ignore self-distance\n np.fill_diagonal(manhattan_distance, np.inf)\n return np.min(manhattan_distance, axis=1)\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.EmbeddingGeneration","title":"EmbeddingGeneration
","text":" Bases: Step
Generate embeddings using an Embeddings
model.
EmbeddingGeneration
is a Step
that using an Embeddings
model generates sentence embeddings for the provided input texts.
Attributes:
Name Type Descriptionembeddings
Embeddings
the Embeddings
model used to generate the sentence embeddings.
str
): The text for which the sentence embedding has to be generated.List[Union[float, int]]
): the generated sentence embedding.Examples:
Generate sentence embeddings with Sentence Transformers:
from distilabel.models import SentenceTransformerEmbeddings\nfrom distilabel.steps import EmbeddingGeneration\n\nembedding_generation = EmbeddingGeneration(\n embeddings=SentenceTransformerEmbeddings(\n model=\"mixedbread-ai/mxbai-embed-large-v1\",\n )\n)\n\nembedding_generation.load()\n\nresult = next(embedding_generation.process([{\"text\": \"Hello, how are you?\"}]))\n# [{'text': 'Hello, how are you?', 'embedding': [0.06209656596183777, -0.015797119587659836, ...]}]\n
Source code in src/distilabel/steps/embeddings/embedding_generation.py
class EmbeddingGeneration(Step):\n \"\"\"Generate embeddings using an `Embeddings` model.\n\n `EmbeddingGeneration` is a `Step` that using an `Embeddings` model generates sentence\n embeddings for the provided input texts.\n\n Attributes:\n embeddings: the `Embeddings` model used to generate the sentence embeddings.\n\n Input columns:\n - text (`str`): The text for which the sentence embedding has to be generated.\n\n Output columns:\n - embedding (`List[Union[float, int]]`): the generated sentence embedding.\n\n Categories:\n - embedding\n\n Examples:\n Generate sentence embeddings with Sentence Transformers:\n\n ```python\n from distilabel.models import SentenceTransformerEmbeddings\n from distilabel.steps import EmbeddingGeneration\n\n embedding_generation = EmbeddingGeneration(\n embeddings=SentenceTransformerEmbeddings(\n model=\"mixedbread-ai/mxbai-embed-large-v1\",\n )\n )\n\n embedding_generation.load()\n\n result = next(embedding_generation.process([{\"text\": \"Hello, how are you?\"}]))\n # [{'text': 'Hello, how are you?', 'embedding': [0.06209656596183777, -0.015797119587659836, ...]}]\n ```\n\n \"\"\"\n\n embeddings: Embeddings\n\n @property\n def inputs(self) -> \"StepColumns\":\n return [\"text\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n return [\"embedding\", \"model_name\"]\n\n def load(self) -> None:\n \"\"\"Loads the `Embeddings` model.\"\"\"\n super().load()\n\n self.embeddings.load()\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n embeddings = self.embeddings.encode(inputs=[input[\"text\"] for input in inputs])\n for input, embedding in zip(inputs, embeddings):\n input[\"embedding\"] = embedding\n input[\"model_name\"] = self.embeddings.model_name\n yield inputs\n\n def unload(self) -> None:\n super().unload()\n self.embeddings.unload()\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.EmbeddingGeneration.load","title":"load()
","text":"Loads the Embeddings
model.
src/distilabel/steps/embeddings/embedding_generation.py
def load(self) -> None:\n \"\"\"Loads the `Embeddings` model.\"\"\"\n super().load()\n\n self.embeddings.load()\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FaissNearestNeighbour","title":"FaissNearestNeighbour
","text":" Bases: GlobalStep
Create a faiss
index to get the nearest neighbours.
FaissNearestNeighbour
is a GlobalStep
that creates a faiss
index using the Hugging Face datasets
library integration, and then gets the nearest neighbours and the scores or distance of the nearest neighbours for each input row.
Attributes:
Name Type Descriptiondevice
Optional[RuntimeParameter[Union[int, List[int]]]]
the CUDA device ID or a list of IDs to be used. If negative integer, it will use all the available GPUs. Defaults to None
.
string_factory
Optional[RuntimeParameter[str]]
the name of the factory to be used to build the faiss
index. Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes. Defaults to None
.
metric_type
Optional[RuntimeParameter[int]]
the metric to be used to measure the distance between the points. It's an integer and the recommend way to pass it is importing faiss
and then passing one of faiss.METRIC_x
variables. Defaults to None
.
k
Optional[RuntimeParameter[int]]
the number of nearest neighbours to search for each input row. Defaults to 1
.
search_batch_size
Optional[RuntimeParameter[int]]
the number of rows to include in a search batch. The value can be adjusted to maximize the resources usage or to avoid OOM issues. Defaults to 50
.
train_size
Optional[RuntimeParameter[int]]
If the index needs a training step, specifies how many vectors will be used to train the index.
Runtime parametersdevice
: the CUDA device ID or a list of IDs to be used. If negative integer, it will use all the available GPUs. Defaults to None
.string_factory
: the name of the factory to be used to build the faiss
index. Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes. Defaults to None
.metric_type
: the metric to be used to measure the distance between the points. It's an integer and the recommended way to pass it is importing faiss
and then passing one of faiss.METRIC_x
variables. Defaults to None
.k
: the number of nearest neighbours to search for each input row. Defaults to 1
.search_batch_size
: the number of rows to include in a search batch. The value can be adjusted to maximize the resources usage or to avoid OOM issues. Defaults to 50
.train_size
: If the index needs a training step, specifies how many vectors will be used to train the index.List[Union[float, int]]
): a sentence embedding.List[int]
): a list containing the indices of the k
nearest neighbours in the inputs for the row.List[float]
): a list containing the score or distance to each k
nearest neighbour in the inputs.The Faiss library
Examples:
Generating embeddings and getting the nearest neighbours:
from distilabel.models import SentenceTransformerEmbeddings\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import EmbeddingGeneration, FaissNearestNeighbour, LoadDataFromHub\n\nwith Pipeline(name=\"hello\") as pipeline:\n load_data = LoadDataFromHub(output_mappings={\"prompt\": \"text\"})\n\n embeddings = EmbeddingGeneration(\n embeddings=SentenceTransformerEmbeddings(\n model=\"mixedbread-ai/mxbai-embed-large-v1\"\n )\n )\n\n nearest_neighbours = FaissNearestNeighbour()\n\n load_data >> embeddings >> nearest_neighbours\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n load_data.name: {\n \"repo_id\": \"distilabel-internal-testing/instruction-dataset-mini\",\n \"split\": \"test\",\n },\n },\n use_cache=False,\n )\n
Citations @misc{douze2024faisslibrary,\n title={The Faiss library},\n author={Matthijs Douze and Alexandr Guzhva and Chengqi Deng and Jeff Johnson and Gergely Szilvasy and Pierre-Emmanuel Mazar\u00e9 and Maria Lomeli and Lucas Hosseini and Herv\u00e9 J\u00e9gou},\n year={2024},\n eprint={2401.08281},\n archivePrefix={arXiv},\n primaryClass={cs.LG},\n url={https://arxiv.org/abs/2401.08281},\n}\n
Source code in src/distilabel/steps/embeddings/nearest_neighbour.py
class FaissNearestNeighbour(GlobalStep):\n \"\"\"Create a `faiss` index to get the nearest neighbours.\n\n `FaissNearestNeighbour` is a `GlobalStep` that creates a `faiss` index using the Hugging\n Face `datasets` library integration, and then gets the nearest neighbours and the scores\n or distance of the nearest neighbours for each input row.\n\n Attributes:\n device: the CUDA device ID or a list of IDs to be used. If negative integer, it\n will use all the available GPUs. Defaults to `None`.\n string_factory: the name of the factory to be used to build the `faiss` index.\n Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes.\n Defaults to `None`.\n metric_type: the metric to be used to measure the distance between the points. It's\n an integer and the recommend way to pass it is importing `faiss` and then passing\n one of `faiss.METRIC_x` variables. Defaults to `None`.\n k: the number of nearest neighbours to search for each input row. Defaults to `1`.\n search_batch_size: the number of rows to include in a search batch. The value can\n be adjusted to maximize the resources usage or to avoid OOM issues. Defaults\n to `50`.\n train_size: If the index needs a training step, specifies how many vectors will be\n used to train the index.\n\n Runtime parameters:\n - `device`: the CUDA device ID or a list of IDs to be used. If negative integer,\n it will use all the available GPUs. Defaults to `None`.\n - `string_factory`: the name of the factory to be used to build the `faiss` index.\n Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes.\n Defaults to `None`.\n - `metric_type`: the metric to be used to measure the distance between the points.\n It's an integer and the recommend way to pass it is importing `faiss` and then\n passing one of `faiss.METRIC_x` variables. Defaults to `None`.\n - `k`: the number of nearest neighbours to search for each input row. Defaults to `1`.\n - `search_batch_size`: the number of rows to include in a search batch. The value\n can be adjusted to maximize the resources usage or to avoid OOM issues. 
Defaults\n to `50`.\n - `train_size`: If the index needs a training step, specifies how many vectors will\n be used to train the index.\n\n Input columns:\n - embedding (`List[Union[float, int]]`): a sentence embedding.\n\n Output columns:\n - nn_indices (`List[int]`): a list containing the indices of the `k` nearest neighbours\n in the inputs for the row.\n - nn_scores (`List[float]`): a list containing the score or distance to each `k`\n nearest neighbour in the inputs.\n\n Categories:\n - embedding\n\n References:\n - [`The Faiss library`](https://arxiv.org/abs/2401.08281)\n\n Examples:\n Generating embeddings and getting the nearest neighbours:\n\n ```python\n from distilabel.models import SentenceTransformerEmbeddings\n from distilabel.pipeline import Pipeline\n from distilabel.steps import EmbeddingGeneration, FaissNearestNeighbour, LoadDataFromHub\n\n with Pipeline(name=\"hello\") as pipeline:\n load_data = LoadDataFromHub(output_mappings={\"prompt\": \"text\"})\n\n embeddings = EmbeddingGeneration(\n embeddings=SentenceTransformerEmbeddings(\n model=\"mixedbread-ai/mxbai-embed-large-v1\"\n )\n )\n\n nearest_neighbours = FaissNearestNeighbour()\n\n load_data >> embeddings >> nearest_neighbours\n\n if __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n load_data.name: {\n \"repo_id\": \"distilabel-internal-testing/instruction-dataset-mini\",\n \"split\": \"test\",\n },\n },\n use_cache=False,\n )\n ```\n\n Citations:\n ```\n @misc{douze2024faisslibrary,\n title={The Faiss library},\n author={Matthijs Douze and Alexandr Guzhva and Chengqi Deng and Jeff Johnson and Gergely Szilvasy and Pierre-Emmanuel Mazar\u00e9 and Maria Lomeli and Lucas Hosseini and Herv\u00e9 J\u00e9gou},\n year={2024},\n eprint={2401.08281},\n archivePrefix={arXiv},\n primaryClass={cs.LG},\n url={https://arxiv.org/abs/2401.08281},\n }\n ```\n \"\"\"\n\n device: Optional[RuntimeParameter[Union[int, List[int]]]] = Field(\n default=None,\n description=\"The CUDA device ID or a list of IDs to be used. If negative integer,\"\n \" it will use all the available GPUs.\",\n )\n string_factory: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The name of the factory to be used to build the `faiss` index.\"\n \"Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes.\",\n )\n metric_type: Optional[RuntimeParameter[int]] = Field(\n default=None,\n description=\"The metric to be used to measure the distance between the points. It's\"\n \" an integer and the recommend way to pass it is importing `faiss` and thenpassing\"\n \" one of `faiss.METRIC_x` variables.\",\n )\n k: Optional[RuntimeParameter[int]] = Field(\n default=1,\n description=\"The number of nearest neighbours to search for each input row.\",\n )\n search_batch_size: Optional[RuntimeParameter[int]] = Field(\n default=50,\n description=\"The number of rows to include in a search batch. The value can be adjusted\"\n \" to maximize the resources usage or to avoid OOM issues.\",\n )\n train_size: Optional[RuntimeParameter[int]] = Field(\n default=None,\n description=\"If the index needs a training step, specifies how many vectors will be used to train the index.\",\n )\n\n def load(self) -> None:\n super().load()\n\n if importlib.util.find_spec(\"faiss\") is None:\n raise ImportError(\n \"`faiss` package is not installed. 
Please install it using `pip install\"\n \" faiss-cpu` or `pip install faiss-gpu`.\"\n )\n\n @property\n def inputs(self) -> List[str]:\n return [\"embedding\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"nn_indices\", \"nn_scores\"]\n\n def _build_index(self, inputs: List[Dict[str, Any]]) -> Dataset:\n \"\"\"Builds a `faiss` index using `datasets` integration.\n\n Args:\n inputs: a list of dictionaries.\n\n Returns:\n The build `datasets.Dataset` with its `faiss` index.\n \"\"\"\n dataset = Dataset.from_list(inputs)\n if self.train_size is not None and self.string_factory:\n self._logger.info(\"\ud83c\udfcb\ufe0f\u200d\u2640\ufe0f Starting Faiss index training...\")\n dataset.add_faiss_index(\n column=\"embedding\",\n device=self.device, # type: ignore\n string_factory=self.string_factory,\n metric_type=self.metric_type,\n train_size=self.train_size,\n )\n return dataset\n\n def _save_index(self, dataset: Dataset) -> None:\n \"\"\"Save the generated Faiss index as an artifact of the step.\n\n Args:\n dataset: the dataset with the `faiss` index built.\n \"\"\"\n self.save_artifact(\n name=\"faiss_index\",\n write_function=lambda path: dataset.save_faiss_index(\n index_name=\"embedding\", file=path / \"index.faiss\"\n ),\n metadata={\n \"num_rows\": len(dataset),\n \"embedding_dim\": len(dataset[0][\"embedding\"]),\n },\n )\n\n def _search(self, dataset: Dataset) -> Dataset:\n \"\"\"Search the top `k` nearest neighbours for each row in the dataset.\n\n Args:\n dataset: the dataset with the `faiss` index built.\n\n Returns:\n The updated dataset containing the top `k` nearest neighbours for each row,\n as well as the score or distance.\n \"\"\"\n\n def add_search_results(examples: Dict[str, List[Any]]) -> Dict[str, List[Any]]:\n queries = np.array(examples[\"embedding\"])\n results = dataset.search_batch(\n index_name=\"embedding\",\n queries=queries,\n k=self.k + 1, # type: ignore\n )\n examples[\"nn_indices\"] = [indices[1:] for indices in results.total_indices]\n examples[\"nn_scores\"] = [scores[1:] for scores in results.total_scores]\n return examples\n\n return dataset.map(\n add_search_results, batched=True, batch_size=self.search_batch_size\n )\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n dataset = self._build_index(inputs)\n dataset_with_search_results = self._search(dataset)\n self._save_index(dataset)\n yield dataset_with_search_results.to_list()\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FaissNearestNeighbour._build_index","title":"_build_index(inputs)
","text":"Builds a faiss
index using datasets
integration.
Parameters:
Name Type Description Defaultinputs
List[Dict[str, Any]]
a list of dictionaries.
requiredReturns:
Type DescriptionDataset
The build datasets.Dataset
with its faiss
index.
src/distilabel/steps/embeddings/nearest_neighbour.py
def _build_index(self, inputs: List[Dict[str, Any]]) -> Dataset:\n \"\"\"Builds a `faiss` index using `datasets` integration.\n\n Args:\n inputs: a list of dictionaries.\n\n Returns:\n The build `datasets.Dataset` with its `faiss` index.\n \"\"\"\n dataset = Dataset.from_list(inputs)\n if self.train_size is not None and self.string_factory:\n self._logger.info(\"\ud83c\udfcb\ufe0f\u200d\u2640\ufe0f Starting Faiss index training...\")\n dataset.add_faiss_index(\n column=\"embedding\",\n device=self.device, # type: ignore\n string_factory=self.string_factory,\n metric_type=self.metric_type,\n train_size=self.train_size,\n )\n return dataset\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FaissNearestNeighbour._save_index","title":"_save_index(dataset)
","text":"Save the generated Faiss index as an artifact of the step.
Parameters:
Name Type Description Defaultdataset
Dataset
the dataset with the faiss
index built.
src/distilabel/steps/embeddings/nearest_neighbour.py
def _save_index(self, dataset: Dataset) -> None:\n \"\"\"Save the generated Faiss index as an artifact of the step.\n\n Args:\n dataset: the dataset with the `faiss` index built.\n \"\"\"\n self.save_artifact(\n name=\"faiss_index\",\n write_function=lambda path: dataset.save_faiss_index(\n index_name=\"embedding\", file=path / \"index.faiss\"\n ),\n metadata={\n \"num_rows\": len(dataset),\n \"embedding_dim\": len(dataset[0][\"embedding\"]),\n },\n )\n
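The saved `index.faiss` artifact can be re-attached later through the same `datasets` FAISS integration the step relies on. A sketch, assuming `faiss-cpu` is installed and using toy embeddings:

```python
from datasets import Dataset

dataset = Dataset.from_list(
    [{"embedding": [0.1, 0.2, 0.3]}, {"embedding": [0.3, 0.2, 0.1]}]
)
dataset.add_faiss_index(column="embedding")
dataset.save_faiss_index(index_name="embedding", file="index.faiss")

# Later (or in another process): re-attach the saved index to the same data
dataset.drop_index("embedding")
dataset.load_faiss_index(index_name="embedding", file="index.faiss")
```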
"},{"location":"api/step_gallery/extra/#distilabel.steps.FaissNearestNeighbour._search","title":"_search(dataset)
","text":"Search the top k
nearest neighbours for each row in the dataset.
Parameters:
Name Type Description Defaultdataset
Dataset
the dataset with the faiss
index built.
Returns:
Type DescriptionDataset
The updated dataset containing the top k
nearest neighbours for each row,
Dataset
as well as the score or distance.
Source code insrc/distilabel/steps/embeddings/nearest_neighbour.py
def _search(self, dataset: Dataset) -> Dataset:\n \"\"\"Search the top `k` nearest neighbours for each row in the dataset.\n\n Args:\n dataset: the dataset with the `faiss` index built.\n\n Returns:\n The updated dataset containing the top `k` nearest neighbours for each row,\n as well as the score or distance.\n \"\"\"\n\n def add_search_results(examples: Dict[str, List[Any]]) -> Dict[str, List[Any]]:\n queries = np.array(examples[\"embedding\"])\n results = dataset.search_batch(\n index_name=\"embedding\",\n queries=queries,\n k=self.k + 1, # type: ignore\n )\n examples[\"nn_indices\"] = [indices[1:] for indices in results.total_indices]\n examples[\"nn_scores\"] = [scores[1:] for scores in results.total_scores]\n return examples\n\n return dataset.map(\n add_search_results, batched=True, batch_size=self.search_batch_size\n )\n
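Because every row is also stored in the index, the search asks for `k + 1` neighbours and drops the first hit of each row, which is the row itself. A minimal sketch of that pattern with the `datasets` FAISS integration on a tiny flat index, assuming `faiss-cpu` is installed:

```python
import numpy as np
from datasets import Dataset

dataset = Dataset.from_list(
    [{"embedding": [1.0, 0.0]}, {"embedding": [0.9, 0.1]}, {"embedding": [0.0, 1.0]}]
)
dataset.add_faiss_index(column="embedding")  # exact (flat) L2 index by default

k = 1
queries = np.array(dataset["embedding"], dtype=np.float32)
results = dataset.search_batch(index_name="embedding", queries=queries, k=k + 1)

# Drop the first neighbour of each row: it is the row itself (distance ~0)
nn_indices = [indices[1:] for indices in results.total_indices]
nn_scores = [scores[1:] for scores in results.total_scores]
print(nn_indices)  # nearest neighbour of each row (itself excluded): 1, 0 and 1
```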
"},{"location":"api/step_gallery/extra/#distilabel.steps.EmbeddingDedup","title":"EmbeddingDedup
","text":" Bases: GlobalStep
Deduplicates text using embeddings.
EmbeddingDedup
is a Step that detects near-duplicates in datasets, using embeddings to compare the similarity between the texts. The typical workflow with this step would include having a dataset with embeddings precomputed, and then (possibly using the FaissNearestNeighbour) using the nn_indices and nn_scores to determine which texts are duplicates.
Attributes:
Name Type Description threshold
Optional[RuntimeParameter[float]]
the threshold to consider 2 examples as duplicates. It's dependent on the type of index that was used to generate the embeddings. For example, if the embeddings were generated using cosine similarity, a threshold of 0.9
would mark all the texts with a cosine similarity above that value as duplicates. Higher values detect fewer duplicates in such an index, but that should be taken into account when building it. Defaults to 0.9
.
Runtime parameters: threshold: the threshold to consider 2 examples as duplicates. Input columns: nn_indices (List[int]): a list containing the indices of the k nearest neighbours in the inputs for the row. nn_scores (List[float]): a list containing the score or distance to each k nearest neighbour in the inputs. Output columns: keep_row_after_embedding_filtering (bool): boolean indicating if the piece text is not a duplicate, i.e. this text should be kept. Examples:
Deduplicate a list of texts using embedding information:\n\n```python\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import EmbeddingDedup\nfrom distilabel.steps import LoadDataFromDicts\n\nwith Pipeline() as pipeline:\n data = LoadDataFromDicts(\n data=[\n {\n \"persona\": \"A chemistry student or academic researcher interested in inorganic or physical chemistry, likely at an advanced undergraduate or graduate level, studying acid-base interactions and chemical bonding.\",\n \"embedding\": [\n 0.018477669046149742,\n -0.03748236608841726,\n 0.001919870620352492,\n 0.024918478063770535,\n 0.02348063521315178,\n 0.0038251285566308375,\n -0.01723884983037716,\n 0.02881971942372201,\n ],\n \"nn_indices\": [0, 1],\n \"nn_scores\": [\n 0.9164746999740601,\n 0.782106876373291,\n ],\n },\n {\n \"persona\": \"A music teacher or instructor focused on theoretical and practical piano lessons.\",\n \"embedding\": [\n -0.0023464179614082125,\n -0.07325472251663565,\n -0.06058678419516501,\n -0.02100326928586996,\n -0.013462744792362657,\n 0.027368447064244242,\n -0.003916070100455717,\n 0.01243614518480423,\n ],\n \"nn_indices\": [0, 2],\n \"nn_scores\": [\n 0.7552462220191956,\n 0.7261884808540344,\n ],\n },\n {\n \"persona\": \"A classical guitar teacher or instructor, likely with experience teaching beginners, who focuses on breaking down complex music notation into understandable steps for their students.\",\n \"embedding\": [\n -0.01630817942328242,\n -0.023760151552345232,\n -0.014249650090627883,\n -0.005713686451446624,\n -0.016033059279131567,\n 0.0071440908501058786,\n -0.05691099643425161,\n 0.01597412704817784,\n ],\n \"nn_indices\": [1, 2],\n \"nn_scores\": [\n 0.8107735514640808,\n 0.7172299027442932,\n ],\n },\n ],\n batch_size=batch_size,\n )\n # In general you should do something like this before the deduplication step, to obtain the\n # `nn_indices` and `nn_scores`. In this case the embeddings are already normalized, so there's\n # no need for it.\n # nn = FaissNearestNeighbour(\n # k=30,\n # metric_type=faiss.METRIC_INNER_PRODUCT,\n # search_batch_size=50,\n # train_size=len(dataset), # The number of embeddings to use for training\n # string_factory=\"IVF300_HNSW32,Flat\" # To use an index (optional, maybe required for big datasets)\n # )\n # Read more about the `string_factory` here:\n # https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index\n\n embedding_dedup = EmbeddingDedup(\n threshold=0.8,\n input_batch_size=batch_size,\n )\n\n data >> embedding_dedup\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False)\n ds = distiset[\"default\"][\"train\"]\n # Filter out the duplicates\n ds_dedup = ds.filter(lambda x: x[\"keep_row_after_embedding_filtering\"])\n```\n
Source code in src/distilabel/steps/filtering/embedding.py
class EmbeddingDedup(GlobalStep):\n \"\"\"Deduplicates text using embeddings.\n\n `EmbeddingDedup` is a Step that detects near-duplicates in datasets, using\n embeddings to compare the similarity between the texts. The typical workflow with this step\n would include having a dataset with embeddings precomputed, and then (possibly using the\n `FaissNearestNeighbour`) using the `nn_indices` and `nn_scores`, determine the texts that\n are duplicate.\n\n Attributes:\n threshold: the threshold to consider 2 examples as duplicates.\n It's dependent on the type of index that was used to generate the embeddings.\n For example, if the embeddings were generated using cosine similarity, a threshold\n of `0.9` would make all the texts with a cosine similarity above the value\n duplicates. Higher values detect less duplicates in such an index, but that should\n be taken into account when building it. Defaults to `0.9`.\n\n Runtime Parameters:\n - `threshold`: the threshold to consider 2 examples as duplicates.\n\n Input columns:\n - nn_indices (`List[int]`): a list containing the indices of the `k` nearest neighbours\n in the inputs for the row.\n - nn_scores (`List[float]`): a list containing the score or distance to each `k`\n nearest neighbour in the inputs.\n\n Output columns:\n - keep_row_after_embedding_filtering (`bool`): boolean indicating if the piece `text` is\n not a duplicate i.e. this text should be kept.\n\n Categories:\n - filtering\n\n Examples:\n\n Deduplicate a list of texts using embedding information:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps import EmbeddingDedup\n from distilabel.steps import LoadDataFromDicts\n\n with Pipeline() as pipeline:\n data = LoadDataFromDicts(\n data=[\n {\n \"persona\": \"A chemistry student or academic researcher interested in inorganic or physical chemistry, likely at an advanced undergraduate or graduate level, studying acid-base interactions and chemical bonding.\",\n \"embedding\": [\n 0.018477669046149742,\n -0.03748236608841726,\n 0.001919870620352492,\n 0.024918478063770535,\n 0.02348063521315178,\n 0.0038251285566308375,\n -0.01723884983037716,\n 0.02881971942372201,\n ],\n \"nn_indices\": [0, 1],\n \"nn_scores\": [\n 0.9164746999740601,\n 0.782106876373291,\n ],\n },\n {\n \"persona\": \"A music teacher or instructor focused on theoretical and practical piano lessons.\",\n \"embedding\": [\n -0.0023464179614082125,\n -0.07325472251663565,\n -0.06058678419516501,\n -0.02100326928586996,\n -0.013462744792362657,\n 0.027368447064244242,\n -0.003916070100455717,\n 0.01243614518480423,\n ],\n \"nn_indices\": [0, 2],\n \"nn_scores\": [\n 0.7552462220191956,\n 0.7261884808540344,\n ],\n },\n {\n \"persona\": \"A classical guitar teacher or instructor, likely with experience teaching beginners, who focuses on breaking down complex music notation into understandable steps for their students.\",\n \"embedding\": [\n -0.01630817942328242,\n -0.023760151552345232,\n -0.014249650090627883,\n -0.005713686451446624,\n -0.016033059279131567,\n 0.0071440908501058786,\n -0.05691099643425161,\n 0.01597412704817784,\n ],\n \"nn_indices\": [1, 2],\n \"nn_scores\": [\n 0.8107735514640808,\n 0.7172299027442932,\n ],\n },\n ],\n batch_size=batch_size,\n )\n # In general you should do something like this before the deduplication step, to obtain the\n # `nn_indices` and `nn_scores`. 
In this case the embeddings are already normalized, so there's\n # no need for it.\n # nn = FaissNearestNeighbour(\n # k=30,\n # metric_type=faiss.METRIC_INNER_PRODUCT,\n # search_batch_size=50,\n # train_size=len(dataset), # The number of embeddings to use for training\n # string_factory=\"IVF300_HNSW32,Flat\" # To use an index (optional, maybe required for big datasets)\n # )\n # Read more about the `string_factory` here:\n # https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index\n\n embedding_dedup = EmbeddingDedup(\n threshold=0.8,\n input_batch_size=batch_size,\n )\n\n data >> embedding_dedup\n\n if __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False)\n ds = distiset[\"default\"][\"train\"]\n # Filter out the duplicates\n ds_dedup = ds.filter(lambda x: x[\"keep_row_after_embedding_filtering\"])\n ```\n \"\"\"\n\n threshold: Optional[RuntimeParameter[float]] = Field(\n default=0.9,\n description=\"The threshold to consider 2 examples as duplicates. It's dependent \"\n \"on the type of index that was used to generate the embeddings. For example, if \"\n \"the embeddings were generated using cosine similarity, a threshold of `0.9` \"\n \"would make all the texts with a cosine similarity above the value duplicates. \"\n \"Higher values detect less duplicates in such an index, but that should be \"\n \"taken into account when building it.\",\n )\n\n @property\n def inputs(self) -> List[str]:\n return [\"nn_scores\", \"nn_indices\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"keep_row_after_embedding_filtering\"]\n\n @override\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n rows_to_remove = set()\n\n for input in track(inputs, description=\"Running Embedding deduplication...\"):\n input[\"keep_row_after_embedding_filtering\"] = True\n indices_scores = np.array(input[\"nn_scores\"]) > self.threshold\n indices = np.array(input[\"nn_indices\"])[indices_scores]\n if len(indices) > 0: # If there are any rows found over the threshold\n rows_to_remove.update(list(indices))\n\n # Remove duplicates and get the list of rows to remove\n for idx in rows_to_remove:\n inputs[idx][\"keep_row_after_embedding_filtering\"] = False\n\n yield inputs\n
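To make the thresholding concrete, here is a standalone sketch of the decision the step applies to each row, outside of a pipeline; the rows, scores and threshold are illustrative:

```python
import numpy as np

threshold = 0.9
rows = [
    {"nn_indices": [1], "nn_scores": [0.95]},  # row 0: row 1 is a near-duplicate
    {"nn_indices": [0], "nn_scores": [0.95]},  # row 1: row 0 is a near-duplicate
    {"nn_indices": [0], "nn_scores": [0.40]},  # row 2: no close neighbour
]

rows_to_remove = set()
for row in rows:
    row["keep_row_after_embedding_filtering"] = True
    over_threshold = np.array(row["nn_scores"]) > threshold
    rows_to_remove.update(np.array(row["nn_indices"])[over_threshold].tolist())

for idx in rows_to_remove:
    rows[idx]["keep_row_after_embedding_filtering"] = False
# Here rows 0 and 1 are flagged as duplicates of each other and row 2 is kept.
```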
"},{"location":"api/step_gallery/extra/#distilabel.steps.MinHashDedup","title":"MinHashDedup
","text":" Bases: Step
Deduplicates text using MinHash
and MinHashLSH
.
MinHashDedup
is a Step that detects near-duplicates in datasets. The idea roughly translates to the following steps: 1. Tokenize the text into words or ngrams. 2. Create a MinHash
for each text. 3. Store the MinHashes
in a MinHashLSH
. 4. Check if the MinHash
is already in the LSH
; if so, it is a duplicate.
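The same four steps can be sketched directly with `datasketch`, which is what this step builds on; the texts, key names and `threshold=0.5` below are illustrative, and this is not the step's exact implementation:

```python
from datasketch import MinHash, MinHashLSH

texts = [
    "This is a test document.",
    "This is a test document.",  # exact duplicate of the first text
    "A completely different sentence about pasta recipes.",
]

lsh = MinHashLSH(threshold=0.5, num_perm=128)
keep = []
for i, text in enumerate(texts):
    minhash = MinHash(num_perm=128)       # 2. one MinHash per text
    for token in text.lower().split():    # 1. naive word tokenization
        minhash.update(token.encode("utf-8"))
    if lsh.query(minhash):                # 4. a similar MinHash is already indexed
        keep.append(False)
    else:
        lsh.insert(f"doc-{i}", minhash)   # 3. store the MinHash in the LSH
        keep.append(True)
# keep is expected to be [True, False, True]: the exact duplicate is dropped.
```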
Attributes:
Name Type Description num_perm
int
the number of permutations to use. Defaults to 128
.
seed
int
the seed to use for the MinHash. Defaults to 1
.
tokenizer
Literal['words', 'ngrams']
the tokenizer to use. Available ones are words
or ngrams
. If words
is selected, it tokenizes the text into words using nltk's word tokenizer. ngrams
computes the ngrams (together with the size n
). Defaults to words
.
n
Optional[int]
the size of the ngrams to use. Only relevant if tokenizer=\"ngrams\"
. Defaults to 5
.
threshold
float
the threshold to consider two MinHashes as duplicates. Values closer to 0 detect more duplicates. Defaults to 0.9
.
storage
Literal['dict', 'disk']
the storage to use for the LSH. Can be dict
to store the index in memory, or disk
. Keep in mind that disk is an experimental feature not defined in datasketch; it is based on DiskCache's Index class. It should behave like a dict backed by disk, although depending on the system it can be slower. Defaults to dict
.
Input columns: text (str): the texts to be filtered. Output columns: keep_row_after_minhash_filtering (bool): boolean indicating if the piece text is not a duplicate, i.e. this text should be kept. References: datasketch documentation
Examples:
Deduplicate a list of texts using MinHash and MinHashLSH:\n\n```python\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import MinHashDedup\nfrom distilabel.steps import LoadDataFromDicts\n\nwith Pipeline() as pipeline:\n ds_size = 1000\n batch_size = 500 # Bigger batch sizes work better for this step\n data = LoadDataFromDicts(\n data=[\n {\"text\": \"This is a test document.\"},\n {\"text\": \"This document is a test.\"},\n {\"text\": \"Test document for duplication.\"},\n {\"text\": \"Document for duplication test.\"},\n {\"text\": \"This is another unique document.\"},\n ]\n * (ds_size // 5),\n batch_size=batch_size,\n )\n minhash_dedup = MinHashDedup(\n tokenizer=\"words\",\n threshold=0.9, # lower values will increase the number of duplicates\n storage=\"dict\", # or \"disk\" for bigger datasets\n )\n\n data >> minhash_dedup\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False)\n ds = distiset[\"default\"][\"train\"]\n # Filter out the duplicates\n ds_dedup = ds.filter(lambda x: x[\"keep_row_after_minhash_filtering\"])\n```\n
Source code in src/distilabel/steps/filtering/minhash.py
class MinHashDedup(Step):\n \"\"\"Deduplicates text using `MinHash` and `MinHashLSH`.\n\n `MinHashDedup` is a Step that detects near-duplicates in datasets. The idea roughly translates\n to the following steps:\n 1. Tokenize the text into words or ngrams.\n 2. Create a `MinHash` for each text.\n 3. Store the `MinHashes` in a `MinHashLSH`.\n 4. Check if the `MinHash` is already in the `LSH`, if so, it is a duplicate.\n\n Attributes:\n num_perm: the number of permutations to use. Defaults to `128`.\n seed: the seed to use for the MinHash. Defaults to `1`.\n tokenizer: the tokenizer to use. Available ones are `words` or `ngrams`.\n If `words` is selected, it tokenizes the text into words using nltk's\n word tokenizer. `ngram` estimates the ngrams (together with the size\n `n`). Defaults to `words`.\n n: the size of the ngrams to use. Only relevant if `tokenizer=\"ngrams\"`. Defaults to `5`.\n threshold: the threshold to consider two MinHashes as duplicates.\n Values closer to 0 detect more duplicates. Defaults to `0.9`.\n storage: the storage to use for the LSH. Can be `dict` to store the index\n in memory, or `disk`. Keep in mind, `disk` is an experimental feature\n not defined in `datasketch`, that is based on DiskCache's `Index` class.\n It should work as a `dict`, but backed by disk, but depending on the system\n it can be slower. Defaults to `dict`.\n\n Input columns:\n - text (`str`): the texts to be filtered.\n\n Output columns:\n - keep_row_after_minhash_filtering (`bool`): boolean indicating if the piece `text` is\n not a duplicate i.e. this text should be kept.\n\n Categories:\n - filtering\n\n References:\n - [`datasketch documentation`](https://ekzhu.github.io/datasketch/lsh.html)\n - [Identifying and Filtering Near-Duplicate Documents](https://cs.brown.edu/courses/cs253/papers/nearduplicate.pdf)\n - [Diskcache's Index](https://grantjenks.com/docs/diskcache/api.html#diskcache.Index)\n\n Examples:\n\n Deduplicate a list of texts using MinHash and MinHashLSH:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps import MinHashDedup\n from distilabel.steps import LoadDataFromDicts\n\n with Pipeline() as pipeline:\n ds_size = 1000\n batch_size = 500 # Bigger batch sizes work better for this step\n data = LoadDataFromDicts(\n data=[\n {\"text\": \"This is a test document.\"},\n {\"text\": \"This document is a test.\"},\n {\"text\": \"Test document for duplication.\"},\n {\"text\": \"Document for duplication test.\"},\n {\"text\": \"This is another unique document.\"},\n ]\n * (ds_size // 5),\n batch_size=batch_size,\n )\n minhash_dedup = MinHashDedup(\n tokenizer=\"words\",\n threshold=0.9, # lower values will increase the number of duplicates\n storage=\"dict\", # or \"disk\" for bigger datasets\n )\n\n data >> minhash_dedup\n\n if __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False)\n ds = distiset[\"default\"][\"train\"]\n # Filter out the duplicates\n ds_dedup = ds.filter(lambda x: x[\"keep_row_after_minhash_filtering\"])\n ```\n \"\"\"\n\n num_perm: int = 128\n seed: int = 1\n tokenizer: Literal[\"words\", \"ngrams\"] = \"words\"\n n: Optional[int] = 5\n threshold: float = 0.9\n storage: Literal[\"dict\", \"disk\"] = \"dict\"\n\n _hasher: Union[\"MinHash\", None] = PrivateAttr(None)\n _tokenizer: Union[Callable, None] = PrivateAttr(None)\n _lhs: Union[\"MinHashLSH\", None] = PrivateAttr(None)\n\n def load(self) -> None:\n super().load()\n if not importlib.import_module(\"datasketch\"):\n raise ImportError(\n \"`datasketch` is needed to 
deduplicate with MinHash, but is not installed. \"\n \"Please install it using `pip install datasketch`.\"\n )\n from datasketch import MinHash\n\n from distilabel.steps.filtering._datasketch import MinHashLSH\n\n self._hasher = MinHash.bulk\n self._lsh = MinHashLSH(\n num_perm=self.num_perm,\n threshold=self.threshold,\n storage_config={\"type\": self.storage},\n )\n\n if self.tokenizer == \"words\":\n if not importlib.import_module(\"nltk\"):\n raise ImportError(\n \"`nltk` is needed to tokenize based on words, but is not installed. \"\n \"Please install it using `pip install nltk`. Then run `nltk.download('punkt_tab')`.\"\n )\n self._tokenizer = tokenized_on_words\n else:\n self._tokenizer = partial(tokenize_on_ngrams, n=self.n)\n\n def unload(self) -> None:\n super().unload()\n # In case of LSH being stored in disk, we need to close the file.\n if self.storage == \"disk\":\n self._lsh.close()\n\n @property\n def inputs(self) -> List[str]:\n return [\"text\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"keep_row_after_minhash_filtering\"]\n\n def process(self, inputs: StepInput) -> \"StepOutput\":\n tokenized_texts = []\n for input in inputs:\n tokenized_texts.append(self._tokenizer([input[self.inputs[0]]])[0])\n\n minhashes = self._hasher(\n tokenized_texts, num_perm=self.num_perm, seed=self.seed\n )\n\n for input, minhash in zip(inputs, minhashes):\n # Check if the text is already in the LSH index\n if self._lsh.query(minhash):\n input[\"keep_row_after_minhash_filtering\"] = False\n else:\n self._lsh.insert(str(uuid.uuid4()), minhash)\n input[\"keep_row_after_minhash_filtering\"] = True\n\n yield inputs\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.ConversationTemplate","title":"ConversationTemplate
","text":" Bases: Step
Generate a conversation template from an instruction and a response.
Input columnsstr
): The instruction to be used in the conversation.str
): The response to be used in the conversation.ChatType
): The conversation template.Examples:
Create a conversation from an instruction and a response:
from distilabel.steps import ConversationTemplate\n\nconv_template = ConversationTemplate()\nconv_template.load()\n\nresult = next(\n conv_template.process(\n [\n {\n \"instruction\": \"Hello\",\n \"response\": \"Hi\",\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'Hello', 'response': 'Hi', 'conversation': [{'role': 'user', 'content': 'Hello'}, {'role': 'assistant', 'content': 'Hi'}]}]\n
Source code in src/distilabel/steps/formatting/conversation.py
class ConversationTemplate(Step):\n \"\"\"Generate a conversation template from an instruction and a response.\n\n Input columns:\n - instruction (`str`): The instruction to be used in the conversation.\n - response (`str`): The response to be used in the conversation.\n\n Output columns:\n - conversation (`ChatType`): The conversation template.\n\n Categories:\n - format\n - chat\n - template\n\n Examples:\n Create a conversation from an instruction and a response:\n\n ```python\n from distilabel.steps import ConversationTemplate\n\n conv_template = ConversationTemplate()\n conv_template.load()\n\n result = next(\n conv_template.process(\n [\n {\n \"instruction\": \"Hello\",\n \"response\": \"Hi\",\n }\n ],\n )\n )\n # >>> result\n # [{'instruction': 'Hello', 'response': 'Hi', 'conversation': [{'role': 'user', 'content': 'Hello'}, {'role': 'assistant', 'content': 'Hi'}]}]\n ```\n \"\"\"\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The instruction and response.\"\"\"\n return [\"instruction\", \"response\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The conversation template.\"\"\"\n return [\"conversation\"]\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Generate a conversation template from an instruction and a response.\n\n Args:\n inputs: The input data.\n\n Yields:\n The input data with the conversation template.\n \"\"\"\n for input in inputs:\n input[\"conversation\"] = [\n {\"role\": \"user\", \"content\": input[\"instruction\"]},\n {\"role\": \"assistant\", \"content\": input[\"response\"]},\n ]\n yield inputs\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.ConversationTemplate.inputs","title":"inputs
property
","text":"The instruction and response.
"},{"location":"api/step_gallery/extra/#distilabel.steps.ConversationTemplate.outputs","title":"outputs
property
","text":"The conversation template.
"},{"location":"api/step_gallery/extra/#distilabel.steps.ConversationTemplate.process","title":"process(inputs)
","text":"Generate a conversation template from an instruction and a response.
Parameters:
Name Type Description Default inputs
StepInput
The input data.
required Yields:
Type Description StepOutput
The input data with the conversation template.
Source code in src/distilabel/steps/formatting/conversation.py
def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Generate a conversation template from an instruction and a response.\n\n Args:\n inputs: The input data.\n\n Yields:\n The input data with the conversation template.\n \"\"\"\n for input in inputs:\n input[\"conversation\"] = [\n {\"role\": \"user\", \"content\": input[\"instruction\"]},\n {\"role\": \"assistant\", \"content\": input[\"response\"]},\n ]\n yield inputs\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FormatChatGenerationDPO","title":"FormatChatGenerationDPO
","text":" Bases: Step
Format the output of a combination of a ChatGeneration
+ a preference task for Direct Preference Optimization (DPO).
FormatChatGenerationDPO
is a Step
that formats the output of the combination of a ChatGeneration
task with a preference Task
i.e. a task generating ratings
such as UltraFeedback
following the standard formatting from frameworks such as axolotl
or alignment-handbook
., so that those are used to rank the existing generations and provide the chosen
and rejected
generations based on the ratings
.
The messages
column should contain at least one message from the user, the generations
column should contain at least two generations, the ratings
column should contain the same number of ratings as generations.
List[Dict[str, str]]
): The conversation messages.List[str]
): The generations produced by the LLM
.List[str]
, optional): The model names used to generate the generations
, only available if the model_name
from the ChatGeneration
task/s is combined into a single column named this way, otherwise, it will be ignored.List[float]
): The ratings for each of the generations
, produced by a preference task such as UltraFeedback
.str
): The user message used to generate the generations
with the LLM
.str
): The SHA256
hash of the prompt
.List[Dict[str, str]]
): The chosen
generation based on the ratings
.str
, optional): The model name used to generate the chosen
generation, if the generation_models
are available.float
): The rating of the chosen
generation.List[Dict[str, str]]
): The rejected
generation based on the ratings
.str
, optional): The model name used to generate the rejected
generation, if the generation_models
are available.float
): The rating of the rejected
generation.Examples:
Format your dataset for DPO fine tuning:
from distilabel.steps import FormatChatGenerationDPO\n\nformat_dpo = FormatChatGenerationDPO()\nformat_dpo.load()\n\n# NOTE: \"generation_models\" can be added optionally.\nresult = next(\n format_dpo.process(\n [\n {\n \"messages\": [{\"role\": \"user\", \"content\": \"What's 2+2?\"}],\n \"generations\": [\"4\", \"5\", \"6\"],\n \"ratings\": [1, 0, -1],\n }\n ]\n )\n)\n# >>> result\n# [\n# {\n# 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}],\n# 'generations': ['4', '5', '6'],\n# 'ratings': [1, 0, -1],\n# 'prompt': \"What's 2+2?\",\n# 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n# 'chosen': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n# 'chosen_rating': 1,\n# 'rejected': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '6'}],\n# 'rejected_rating': -1\n# }\n# ]\n
Source code in src/distilabel/steps/formatting/dpo.py
class FormatChatGenerationDPO(Step):\n \"\"\"Format the output of a combination of a `ChatGeneration` + a preference task for Direct Preference Optimization (DPO).\n\n `FormatChatGenerationDPO` is a `Step` that formats the output of the combination of a `ChatGeneration`\n task with a preference `Task` i.e. a task generating `ratings` such as `UltraFeedback` following the standard\n formatting from frameworks such as `axolotl` or `alignment-handbook`., so that those are used to rank the\n existing generations and provide the `chosen` and `rejected` generations based on the `ratings`.\n\n Note:\n The `messages` column should contain at least one message from the user, the `generations`\n column should contain at least two generations, the `ratings` column should contain the same\n number of ratings as generations.\n\n Input columns:\n - messages (`List[Dict[str, str]]`): The conversation messages.\n - generations (`List[str]`): The generations produced by the `LLM`.\n - generation_models (`List[str]`, optional): The model names used to generate the `generations`,\n only available if the `model_name` from the `ChatGeneration` task/s is combined into a single\n column named this way, otherwise, it will be ignored.\n - ratings (`List[float]`): The ratings for each of the `generations`, produced by a preference\n task such as `UltraFeedback`.\n\n Output columns:\n - prompt (`str`): The user message used to generate the `generations` with the `LLM`.\n - prompt_id (`str`): The `SHA256` hash of the `prompt`.\n - chosen (`List[Dict[str, str]]`): The `chosen` generation based on the `ratings`.\n - chosen_model (`str`, optional): The model name used to generate the `chosen` generation,\n if the `generation_models` are available.\n - chosen_rating (`float`): The rating of the `chosen` generation.\n - rejected (`List[Dict[str, str]]`): The `rejected` generation based on the `ratings`.\n - rejected_model (`str`, optional): The model name used to generate the `rejected` generation,\n if the `generation_models` are available.\n - rejected_rating (`float`): The rating of the `rejected` generation.\n\n Categories:\n - format\n - chat-generation\n - preference\n - messages\n - generations\n\n Examples:\n Format your dataset for DPO fine tuning:\n\n ```python\n from distilabel.steps import FormatChatGenerationDPO\n\n format_dpo = FormatChatGenerationDPO()\n format_dpo.load()\n\n # NOTE: \"generation_models\" can be added optionally.\n result = next(\n format_dpo.process(\n [\n {\n \"messages\": [{\"role\": \"user\", \"content\": \"What's 2+2?\"}],\n \"generations\": [\"4\", \"5\", \"6\"],\n \"ratings\": [1, 0, -1],\n }\n ]\n )\n )\n # >>> result\n # [\n # {\n # 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}],\n # 'generations': ['4', '5', '6'],\n # 'ratings': [1, 0, -1],\n # 'prompt': \"What's 2+2?\",\n # 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n # 'chosen': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n # 'chosen_rating': 1,\n # 'rejected': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '6'}],\n # 'rejected_rating': -1\n # }\n # ]\n ```\n \"\"\"\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"List of inputs required by the `Step`, which in this case are: `messages`, `generations`,\n and `ratings`.\"\"\"\n return [\"messages\", \"generations\", \"ratings\"]\n\n @property\n def optional_inputs(self) -> List[str]:\n \"\"\"List of optional inputs, which are not required by 
the `Step` but used if available,\n which in this case is: `generation_models`.\"\"\"\n return [\"generation_models\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"List of outputs generated by the `Step`, which are: `prompt`, `prompt_id`, `chosen`,\n `chosen_model`, `chosen_rating`, `rejected`, `rejected_model`, `rejected_rating`. Both\n the `chosen_model` and `rejected_model` being optional and only used if `generation_models`\n is available.\n\n Reference:\n - Format inspired in https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k\n \"\"\"\n return [\n \"prompt\",\n \"prompt_id\",\n \"chosen\",\n \"chosen_model\",\n \"chosen_rating\",\n \"rejected\",\n \"rejected_model\",\n \"rejected_rating\",\n ]\n\n def process(self, *inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The `process` method formats the received `StepInput` or list of `StepInput`\n according to the DPO formatting standard.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with batches of formatted `StepInput` following the DPO standard.\n \"\"\"\n for input in inputs:\n for item in input:\n item[\"prompt\"] = next(\n (\n turn[\"content\"]\n for turn in item[\"messages\"]\n if turn[\"role\"] == \"user\"\n ),\n None,\n )\n item[\"prompt_id\"] = hashlib.sha256(\n item[\"prompt\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n chosen_idx = max(enumerate(item[\"ratings\"]), key=lambda x: x[1])[0]\n item[\"chosen\"] = item[\"messages\"] + [\n {\n \"role\": \"assistant\",\n \"content\": item[\"generations\"][chosen_idx],\n }\n ]\n if \"generation_models\" in item:\n item[\"chosen_model\"] = item[\"generation_models\"][chosen_idx]\n item[\"chosen_rating\"] = item[\"ratings\"][chosen_idx]\n\n rejected_idx = min(enumerate(item[\"ratings\"]), key=lambda x: x[1])[0]\n item[\"rejected\"] = item[\"messages\"] + [\n {\n \"role\": \"assistant\",\n \"content\": item[\"generations\"][rejected_idx],\n }\n ]\n if \"generation_models\" in item:\n item[\"rejected_model\"] = item[\"generation_models\"][rejected_idx]\n item[\"rejected_rating\"] = item[\"ratings\"][rejected_idx]\n\n yield input\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FormatChatGenerationDPO.inputs","title":"inputs
property
","text":"List of inputs required by the Step
, which in this case are: messages
, generations
, and ratings
.
optional_inputs
property
","text":"List of optional inputs, which are not required by the Step
but used if available, which in this case is: generation_models
.
outputs
property
","text":"List of outputs generated by the Step
, which are: prompt
, prompt_id
, chosen
, chosen_model
, chosen_rating
, rejected
, rejected_model
, rejected_rating
. Both the chosen_model
and rejected_model
being optional and only used if generation_models
is available.
process(*inputs)
","text":"The process
method formats the received StepInput
or list of StepInput
according to the DPO formatting standard.
Parameters:
Name Type Description Default *inputs
StepInput
A list of StepInput
to be combined.
()
Yields:
Type Description StepOutput
A StepOutput
with batches of formatted StepInput
following the DPO standard.
src/distilabel/steps/formatting/dpo.py
def process(self, *inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The `process` method formats the received `StepInput` or list of `StepInput`\n according to the DPO formatting standard.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with batches of formatted `StepInput` following the DPO standard.\n \"\"\"\n for input in inputs:\n for item in input:\n item[\"prompt\"] = next(\n (\n turn[\"content\"]\n for turn in item[\"messages\"]\n if turn[\"role\"] == \"user\"\n ),\n None,\n )\n item[\"prompt_id\"] = hashlib.sha256(\n item[\"prompt\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n chosen_idx = max(enumerate(item[\"ratings\"]), key=lambda x: x[1])[0]\n item[\"chosen\"] = item[\"messages\"] + [\n {\n \"role\": \"assistant\",\n \"content\": item[\"generations\"][chosen_idx],\n }\n ]\n if \"generation_models\" in item:\n item[\"chosen_model\"] = item[\"generation_models\"][chosen_idx]\n item[\"chosen_rating\"] = item[\"ratings\"][chosen_idx]\n\n rejected_idx = min(enumerate(item[\"ratings\"]), key=lambda x: x[1])[0]\n item[\"rejected\"] = item[\"messages\"] + [\n {\n \"role\": \"assistant\",\n \"content\": item[\"generations\"][rejected_idx],\n }\n ]\n if \"generation_models\" in item:\n item[\"rejected_model\"] = item[\"generation_models\"][rejected_idx]\n item[\"rejected_rating\"] = item[\"ratings\"][rejected_idx]\n\n yield input\n
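Condensing the source above, this is the per-row transformation the step performs; the row values are taken from the docstring example:

```python
import hashlib

row = {
    "messages": [{"role": "user", "content": "What's 2+2?"}],
    "generations": ["4", "5", "6"],
    "ratings": [1, 0, -1],
}

# The prompt is the first user turn; prompt_id is its SHA256 hex digest.
row["prompt"] = next(t["content"] for t in row["messages"] if t["role"] == "user")
row["prompt_id"] = hashlib.sha256(row["prompt"].encode("utf-8")).hexdigest()

# The highest-rated generation becomes `chosen`, the lowest-rated one `rejected`.
chosen_idx = max(enumerate(row["ratings"]), key=lambda x: x[1])[0]
rejected_idx = min(enumerate(row["ratings"]), key=lambda x: x[1])[0]
row["chosen"] = row["messages"] + [
    {"role": "assistant", "content": row["generations"][chosen_idx]}
]
row["rejected"] = row["messages"] + [
    {"role": "assistant", "content": row["generations"][rejected_idx]}
]
row["chosen_rating"] = row["ratings"][chosen_idx]
row["rejected_rating"] = row["ratings"][rejected_idx]
```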
"},{"location":"api/step_gallery/extra/#distilabel.steps.FormatTextGenerationDPO","title":"FormatTextGenerationDPO
","text":" Bases: Step
Format the output of your LLMs for Direct Preference Optimization (DPO).
FormatTextGenerationDPO
is a Step
that formats the output of the combination of a TextGeneration
task with a preference Task
i.e. a task generating ratings
, so that those are used to rank the existing generations and provide the chosen
and rejected
generations based on the ratings
. Use this step to transform the output of a combination of a TextGeneration
+ a preference task such as UltraFeedback
following the standard formatting from frameworks such as axolotl
or alignment-handbook
.
The generations
column should contain at least two generations, the ratings
column should contain the same number of ratings as generations.
str
, optional): The system prompt used within the LLM
to generate the generations
, if available.str
): The instruction used to generate the generations
with the LLM
.List[str]
): The generations produced by the LLM
.List[str]
, optional): The model names used to generate the generations
, only available if the model_name
from the TextGeneration
task/s is combined into a single column named this way, otherwise, it will be ignored.List[float]
): The ratings for each of the generations
, produced by a preference task such as UltraFeedback
.str
): The instruction used to generate the generations
with the LLM
.str
): The SHA256
hash of the prompt
.List[Dict[str, str]]
): The chosen
generation based on the ratings
.str
, optional): The model name used to generate the chosen
generation, if the generation_models
are available.float
): The rating of the chosen
generation.List[Dict[str, str]]
): The rejected
generation based on the ratings
.str
, optional): The model name used to generate the rejected
generation, if the generation_models
are available.float
): The rating of the rejected
generation.Examples:
Format your dataset for DPO fine tuning:
from distilabel.steps import FormatTextGenerationDPO\n\nformat_dpo = FormatTextGenerationDPO()\nformat_dpo.load()\n\n# NOTE: Both \"system_prompt\" and \"generation_models\" can be added optionally.\nresult = next(\n format_dpo.process(\n [\n {\n \"instruction\": \"What's 2+2?\",\n \"generations\": [\"4\", \"5\", \"6\"],\n \"ratings\": [1, 0, -1],\n }\n ]\n )\n)\n# >>> result\n# [\n# { 'instruction': \"What's 2+2?\",\n# 'generations': ['4', '5', '6'],\n# 'ratings': [1, 0, -1],\n# 'prompt': \"What's 2+2?\",\n# 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n# 'chosen': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n# 'chosen_rating': 1,\n# 'rejected': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '6'}],\n# 'rejected_rating': -1\n# }\n# ]\n
Source code in src/distilabel/steps/formatting/dpo.py
class FormatTextGenerationDPO(Step):\n \"\"\"Format the output of your LLMs for Direct Preference Optimization (DPO).\n\n `FormatTextGenerationDPO` is a `Step` that formats the output of the combination of a `TextGeneration`\n task with a preference `Task` i.e. a task generating `ratings`, so that those are used to rank the\n existing generations and provide the `chosen` and `rejected` generations based on the `ratings`.\n Use this step to transform the output of a combination of a `TextGeneration` + a preference task such as\n `UltraFeedback` following the standard formatting from frameworks such as `axolotl` or `alignment-handbook`.\n\n Note:\n The `generations` column should contain at least two generations, the `ratings` column should\n contain the same number of ratings as generations.\n\n Input columns:\n - system_prompt (`str`, optional): The system prompt used within the `LLM` to generate the\n `generations`, if available.\n - instruction (`str`): The instruction used to generate the `generations` with the `LLM`.\n - generations (`List[str]`): The generations produced by the `LLM`.\n - generation_models (`List[str]`, optional): The model names used to generate the `generations`,\n only available if the `model_name` from the `TextGeneration` task/s is combined into a single\n column named this way, otherwise, it will be ignored.\n - ratings (`List[float]`): The ratings for each of the `generations`, produced by a preference\n task such as `UltraFeedback`.\n\n Output columns:\n - prompt (`str`): The instruction used to generate the `generations` with the `LLM`.\n - prompt_id (`str`): The `SHA256` hash of the `prompt`.\n - chosen (`List[Dict[str, str]]`): The `chosen` generation based on the `ratings`.\n - chosen_model (`str`, optional): The model name used to generate the `chosen` generation,\n if the `generation_models` are available.\n - chosen_rating (`float`): The rating of the `chosen` generation.\n - rejected (`List[Dict[str, str]]`): The `rejected` generation based on the `ratings`.\n - rejected_model (`str`, optional): The model name used to generate the `rejected` generation,\n if the `generation_models` are available.\n - rejected_rating (`float`): The rating of the `rejected` generation.\n\n Categories:\n - format\n - text-generation\n - preference\n - instruction\n - generations\n\n Examples:\n Format your dataset for DPO fine tuning:\n\n ```python\n from distilabel.steps import FormatTextGenerationDPO\n\n format_dpo = FormatTextGenerationDPO()\n format_dpo.load()\n\n # NOTE: Both \"system_prompt\" and \"generation_models\" can be added optionally.\n result = next(\n format_dpo.process(\n [\n {\n \"instruction\": \"What's 2+2?\",\n \"generations\": [\"4\", \"5\", \"6\"],\n \"ratings\": [1, 0, -1],\n }\n ]\n )\n )\n # >>> result\n # [\n # { 'instruction': \"What's 2+2?\",\n # 'generations': ['4', '5', '6'],\n # 'ratings': [1, 0, -1],\n # 'prompt': \"What's 2+2?\",\n # 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n # 'chosen': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n # 'chosen_rating': 1,\n # 'rejected': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '6'}],\n # 'rejected_rating': -1\n # }\n # ]\n ```\n \"\"\"\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"List of inputs required by the `Step`, which in this case are: `instruction`, `generations`,\n and `ratings`.\"\"\"\n return {\n \"system_prompt\": False,\n \"instruction\": True,\n 
\"generations\": True,\n \"generation_models\": False,\n \"ratings\": True,\n }\n\n @property\n def optional_inputs(self) -> List[str]:\n \"\"\"List of optional inputs, which are not required by the `Step` but used if available,\n which in this case are: `system_prompt`, and `generation_models`.\"\"\"\n return [\"system_prompt\", \"generation_models\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"List of outputs generated by the `Step`, which are: `prompt`, `prompt_id`, `chosen`,\n `chosen_model`, `chosen_rating`, `rejected`, `rejected_model`, `rejected_rating`. Both\n the `chosen_model` and `rejected_model` being optional and only used if `generation_models`\n is available.\n\n Reference:\n - Format inspired in https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k\n \"\"\"\n return [\n \"prompt\",\n \"prompt_id\",\n \"chosen\",\n \"chosen_model\",\n \"chosen_rating\",\n \"rejected\",\n \"rejected_model\",\n \"rejected_rating\",\n ]\n\n def process(self, *inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The `process` method formats the received `StepInput` or list of `StepInput`\n according to the DPO formatting standard.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with batches of formatted `StepInput` following the DPO standard.\n \"\"\"\n for input in inputs:\n for item in input:\n messages = [\n {\"role\": \"user\", \"content\": item[\"instruction\"]}, # type: ignore\n ]\n if (\n \"system_prompt\" in item\n and isinstance(item[\"system_prompt\"], str) # type: ignore\n and len(item[\"system_prompt\"]) > 0 # type: ignore\n ):\n messages.insert(\n 0,\n {\"role\": \"system\", \"content\": item[\"system_prompt\"]}, # type: ignore\n )\n\n item[\"prompt\"] = item[\"instruction\"]\n item[\"prompt_id\"] = hashlib.sha256(\n item[\"prompt\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n chosen_idx = max(enumerate(item[\"ratings\"]), key=lambda x: x[1])[0]\n item[\"chosen\"] = messages + [\n {\n \"role\": \"assistant\",\n \"content\": item[\"generations\"][chosen_idx],\n }\n ]\n if \"generation_models\" in item:\n item[\"chosen_model\"] = item[\"generation_models\"][chosen_idx]\n item[\"chosen_rating\"] = item[\"ratings\"][chosen_idx]\n\n rejected_idx = min(enumerate(item[\"ratings\"]), key=lambda x: x[1])[0]\n item[\"rejected\"] = messages + [\n {\n \"role\": \"assistant\",\n \"content\": item[\"generations\"][rejected_idx],\n }\n ]\n if \"generation_models\" in item:\n item[\"rejected_model\"] = item[\"generation_models\"][rejected_idx]\n item[\"rejected_rating\"] = item[\"ratings\"][rejected_idx]\n\n yield input\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FormatTextGenerationDPO.inputs","title":"inputs
property
","text":"List of inputs required by the Step
, which in this case are: instruction
, generations
, and ratings
.
optional_inputs
property
","text":"List of optional inputs, which are not required by the Step
but used if available, which in this case are: system_prompt
, and generation_models
.
outputs
property
","text":"List of outputs generated by the Step
, which are: prompt
, prompt_id
, chosen
, chosen_model
, chosen_rating
, rejected
, rejected_model
, rejected_rating
. Both the chosen_model
and rejected_model
being optional and only used if generation_models
is available.
process(*inputs)
","text":"The process
method formats the received StepInput
or list of StepInput
according to the DPO formatting standard.
Parameters:
Name Type Description Default*inputs
StepInput
A list of StepInput
to be combined.
()
Yields:
Type DescriptionStepOutput
A StepOutput
with batches of formatted StepInput
following the DPO standard.
src/distilabel/steps/formatting/dpo.py
def process(self, *inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The `process` method formats the received `StepInput` or list of `StepInput`\n according to the DPO formatting standard.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with batches of formatted `StepInput` following the DPO standard.\n \"\"\"\n for input in inputs:\n for item in input:\n messages = [\n {\"role\": \"user\", \"content\": item[\"instruction\"]}, # type: ignore\n ]\n if (\n \"system_prompt\" in item\n and isinstance(item[\"system_prompt\"], str) # type: ignore\n and len(item[\"system_prompt\"]) > 0 # type: ignore\n ):\n messages.insert(\n 0,\n {\"role\": \"system\", \"content\": item[\"system_prompt\"]}, # type: ignore\n )\n\n item[\"prompt\"] = item[\"instruction\"]\n item[\"prompt_id\"] = hashlib.sha256(\n item[\"prompt\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n chosen_idx = max(enumerate(item[\"ratings\"]), key=lambda x: x[1])[0]\n item[\"chosen\"] = messages + [\n {\n \"role\": \"assistant\",\n \"content\": item[\"generations\"][chosen_idx],\n }\n ]\n if \"generation_models\" in item:\n item[\"chosen_model\"] = item[\"generation_models\"][chosen_idx]\n item[\"chosen_rating\"] = item[\"ratings\"][chosen_idx]\n\n rejected_idx = min(enumerate(item[\"ratings\"]), key=lambda x: x[1])[0]\n item[\"rejected\"] = messages + [\n {\n \"role\": \"assistant\",\n \"content\": item[\"generations\"][rejected_idx],\n }\n ]\n if \"generation_models\" in item:\n item[\"rejected_model\"] = item[\"generation_models\"][rejected_idx]\n item[\"rejected_rating\"] = item[\"ratings\"][rejected_idx]\n\n yield input\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FormatChatGenerationSFT","title":"FormatChatGenerationSFT
","text":" Bases: Step
Format the output of a ChatGeneration
task for Supervised Fine-Tuning (SFT).
FormatChatGenerationSFT
is a Step
that formats the output of a ChatGeneration
task for Supervised Fine-Tuning (SFT) following the standard formatting from frameworks such as axolotl
or alignment-handbook
. The output of the ChatGeneration
task is formatted into a chat-like conversation with the instruction
as the user message and the generation
as the assistant message. Optionally, if the system_prompt
is available, it is included as the first message in the conversation.
str
, optional): The system prompt used within the LLM
to generate the generation
, if available.str
): The instruction used to generate the generation
with the LLM
.str
): The generation produced by the LLM
.str
): The instruction used to generate the generation
with the LLM
.str
): The SHA256
hash of the prompt
.List[Dict[str, str]]
): The chat-like conversation with the instruction
as the user message and the generation
as the assistant message.Examples:
Format your dataset for SFT:
from distilabel.steps import FormatChatGenerationSFT\n\nformat_sft = FormatChatGenerationSFT()\nformat_sft.load()\n\n# NOTE: \"system_prompt\" can be added optionally.\nresult = next(\n format_sft.process(\n [\n {\n \"messages\": [{\"role\": \"user\", \"content\": \"What's 2+2?\"}],\n \"generation\": \"4\"\n }\n ]\n )\n)\n# >>> result\n# [\n# {\n# 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n# 'generation': '4',\n# 'prompt': 'What's 2+2?',\n# 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n# }\n# ]\n
Source code in src/distilabel/steps/formatting/sft.py
class FormatChatGenerationSFT(Step):\n \"\"\"Format the output of a `ChatGeneration` task for Supervised Fine-Tuning (SFT).\n\n `FormatChatGenerationSFT` is a `Step` that formats the output of a `ChatGeneration` task for\n Supervised Fine-Tuning (SFT) following the standard formatting from frameworks such as `axolotl`\n or `alignment-handbook`. The output of the `ChatGeneration` task is formatted into a chat-like\n conversation with the `instruction` as the user message and the `generation` as the assistant\n message. Optionally, if the `system_prompt` is available, it is included as the first message\n in the conversation.\n\n Input columns:\n - system_prompt (`str`, optional): The system prompt used within the `LLM` to generate the\n `generation`, if available.\n - instruction (`str`): The instruction used to generate the `generation` with the `LLM`.\n - generation (`str`): The generation produced by the `LLM`.\n\n Output columns:\n - prompt (`str`): The instruction used to generate the `generation` with the `LLM`.\n - prompt_id (`str`): The `SHA256` hash of the `prompt`.\n - messages (`List[Dict[str, str]]`): The chat-like conversation with the `instruction` as\n the user message and the `generation` as the assistant message.\n\n Categories:\n - format\n - chat-generation\n - instruction\n - generation\n\n Examples:\n Format your dataset for SFT:\n\n ```python\n from distilabel.steps import FormatChatGenerationSFT\n\n format_sft = FormatChatGenerationSFT()\n format_sft.load()\n\n # NOTE: \"system_prompt\" can be added optionally.\n result = next(\n format_sft.process(\n [\n {\n \"messages\": [{\"role\": \"user\", \"content\": \"What's 2+2?\"}],\n \"generation\": \"4\"\n }\n ]\n )\n )\n # >>> result\n # [\n # {\n # 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n # 'generation': '4',\n # 'prompt': 'What's 2+2?',\n # 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n # }\n # ]\n ```\n \"\"\"\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"List of inputs required by the `Step`, which in this case are: `instruction`, and `generation`.\"\"\"\n return [\"messages\", \"generation\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"List of outputs generated by the `Step`, which are: `prompt`, `prompt_id`, `messages`.\n\n Reference:\n - Format inspired in https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k\n \"\"\"\n return [\"prompt\", \"prompt_id\", \"messages\"]\n\n def process(self, *inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The `process` method formats the received `StepInput` or list of `StepInput`\n according to the SFT formatting standard.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with batches of formatted `StepInput` following the SFT standard.\n \"\"\"\n for input in inputs:\n for item in input:\n item[\"prompt\"] = next(\n (\n turn[\"content\"]\n for turn in item[\"messages\"]\n if turn[\"role\"] == \"user\"\n ),\n None,\n )\n\n item[\"prompt_id\"] = hashlib.sha256(\n item[\"prompt\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n item[\"messages\"] = item[\"messages\"] + [\n {\"role\": \"assistant\", \"content\": item[\"generation\"]}, # type: ignore\n ]\n yield input\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FormatChatGenerationSFT.inputs","title":"inputs
property
","text":"List of inputs required by the Step
, which in this case are: instruction
, and generation
.
outputs
property
","text":"List of outputs generated by the Step
, which are: prompt
, prompt_id
, messages
.
process(*inputs)
","text":"The process
method formats the received StepInput
or list of StepInput
according to the SFT formatting standard.
Parameters:
Name Type Description Default*inputs
StepInput
A list of StepInput
to be combined.
()
Yields:
Type DescriptionStepOutput
A StepOutput
with batches of formatted StepInput
following the SFT standard.
src/distilabel/steps/formatting/sft.py
def process(self, *inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The `process` method formats the received `StepInput` or list of `StepInput`\n according to the SFT formatting standard.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with batches of formatted `StepInput` following the SFT standard.\n \"\"\"\n for input in inputs:\n for item in input:\n item[\"prompt\"] = next(\n (\n turn[\"content\"]\n for turn in item[\"messages\"]\n if turn[\"role\"] == \"user\"\n ),\n None,\n )\n\n item[\"prompt_id\"] = hashlib.sha256(\n item[\"prompt\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n item[\"messages\"] = item[\"messages\"] + [\n {\"role\": \"assistant\", \"content\": item[\"generation\"]}, # type: ignore\n ]\n yield input\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FormatTextGenerationSFT","title":"FormatTextGenerationSFT
","text":" Bases: Step
Format the output of a TextGeneration
task for Supervised Fine-Tuning (SFT).
FormatTextGenerationSFT
is a Step
that formats the output of a TextGeneration
task for Supervised Fine-Tuning (SFT) following the standard formatting from frameworks such as axolotl
or alignment-handbook
. The output of the TextGeneration
task is formatted into a chat-like conversation with the instruction
as the user message and the generation
as the assistant message. Optionally, if the system_prompt
is available, it is included as the first message in the conversation.
str
, optional): The system prompt used within the LLM
to generate the generation
, if available.str
): The instruction used to generate the generation
with the LLM
.str
): The generation produced by the LLM
.str
): The instruction used to generate the generation
with the LLM
.str
): The SHA256
hash of the prompt
.List[Dict[str, str]]
): The chat-like conversation with the instruction
as the user message and the generation
as the assistant message.Examples:
Format your dataset for SFT fine tuning:
from distilabel.steps import FormatTextGenerationSFT\n\nformat_sft = FormatTextGenerationSFT()\nformat_sft.load()\n\n# NOTE: \"system_prompt\" can be added optionally.\nresult = next(\n format_sft.process(\n [\n {\n \"instruction\": \"What's 2+2?\",\n \"generation\": \"4\"\n }\n ]\n )\n)\n# >>> result\n# [\n# {\n# 'instruction': 'What's 2+2?',\n# 'generation': '4',\n# 'prompt': 'What's 2+2?',\n# 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n# 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}]\n# }\n# ]\n
Source code in src/distilabel/steps/formatting/sft.py
class FormatTextGenerationSFT(Step):\n \"\"\"Format the output of a `TextGeneration` task for Supervised Fine-Tuning (SFT).\n\n `FormatTextGenerationSFT` is a `Step` that formats the output of a `TextGeneration` task for\n Supervised Fine-Tuning (SFT) following the standard formatting from frameworks such as `axolotl`\n or `alignment-handbook`. The output of the `TextGeneration` task is formatted into a chat-like\n conversation with the `instruction` as the user message and the `generation` as the assistant\n message. Optionally, if the `system_prompt` is available, it is included as the first message\n in the conversation.\n\n Input columns:\n - system_prompt (`str`, optional): The system prompt used within the `LLM` to generate the\n `generation`, if available.\n - instruction (`str`): The instruction used to generate the `generation` with the `LLM`.\n - generation (`str`): The generation produced by the `LLM`.\n\n Output columns:\n - prompt (`str`): The instruction used to generate the `generation` with the `LLM`.\n - prompt_id (`str`): The `SHA256` hash of the `prompt`.\n - messages (`List[Dict[str, str]]`): The chat-like conversation with the `instruction` as\n the user message and the `generation` as the assistant message.\n\n Categories:\n - format\n - text-generation\n - instruction\n - generation\n\n Examples:\n Format your dataset for SFT fine tuning:\n\n ```python\n from distilabel.steps import FormatTextGenerationSFT\n\n format_sft = FormatTextGenerationSFT()\n format_sft.load()\n\n # NOTE: \"system_prompt\" can be added optionally.\n result = next(\n format_sft.process(\n [\n {\n \"instruction\": \"What's 2+2?\",\n \"generation\": \"4\"\n }\n ]\n )\n )\n # >>> result\n # [\n # {\n # 'instruction': 'What's 2+2?',\n # 'generation': '4',\n # 'prompt': 'What's 2+2?',\n # 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n # 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}]\n # }\n # ]\n ```\n \"\"\"\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"List of inputs required by the `Step`, which in this case are: `instruction`, and `generation`.\"\"\"\n return {\n \"system_prompt\": False,\n \"instruction\": True,\n \"generation\": True,\n }\n\n @property\n def optional_inputs(self) -> List[str]:\n \"\"\"List of optional inputs, which are not required by the `Step` but used if available,\n which in this case is: `system_prompt`.\"\"\"\n return [\"system_prompt\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"List of outputs generated by the `Step`, which are: `prompt`, `prompt_id`, `messages`.\n\n Reference:\n - Format inspired in https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k\n \"\"\"\n return [\"prompt\", \"prompt_id\", \"messages\"]\n\n def process(self, *inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The `process` method formats the received `StepInput` or list of `StepInput`\n according to the SFT formatting standard.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with batches of formatted `StepInput` following the SFT standard.\n \"\"\"\n for input in inputs:\n for item in input:\n item[\"prompt\"] = item[\"instruction\"]\n\n item[\"prompt_id\"] = hashlib.sha256(\n item[\"prompt\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n item[\"messages\"] = [\n {\"role\": \"user\", \"content\": item[\"instruction\"]}, # type: ignore\n {\"role\": \"assistant\", \"content\": item[\"generation\"]}, # type: 
ignore\n ]\n if (\n \"system_prompt\" in item\n and isinstance(item[\"system_prompt\"], str) # type: ignore\n and len(item[\"system_prompt\"]) > 0 # type: ignore\n ):\n item[\"messages\"].insert(\n 0,\n {\"role\": \"system\", \"content\": item[\"system_prompt\"]}, # type: ignore\n )\n\n yield input\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.FormatTextGenerationSFT.inputs","title":"inputs
property
","text":"List of inputs required by the Step
, which in this case are: instruction
, and generation
.
optional_inputs
property
","text":"List of optional inputs, which are not required by the Step
but used if available, which in this case is: system_prompt
.
outputs
property
","text":"List of outputs generated by the Step
, which are: prompt
, prompt_id
, messages
.
process(*inputs)
","text":"The process
method formats the received StepInput
or list of StepInput
according to the SFT formatting standard.
Parameters:
Name Type Description Default*inputs
StepInput
A list of StepInput
to be combined.
()
Yields:
Type DescriptionStepOutput
A StepOutput
with batches of formatted StepInput
following the SFT standard.
src/distilabel/steps/formatting/sft.py
def process(self, *inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The `process` method formats the received `StepInput` or list of `StepInput`\n according to the SFT formatting standard.\n\n Args:\n *inputs: A list of `StepInput` to be combined.\n\n Yields:\n A `StepOutput` with batches of formatted `StepInput` following the SFT standard.\n \"\"\"\n for input in inputs:\n for item in input:\n item[\"prompt\"] = item[\"instruction\"]\n\n item[\"prompt_id\"] = hashlib.sha256(\n item[\"prompt\"].encode(\"utf-8\") # type: ignore\n ).hexdigest()\n\n item[\"messages\"] = [\n {\"role\": \"user\", \"content\": item[\"instruction\"]}, # type: ignore\n {\"role\": \"assistant\", \"content\": item[\"generation\"]}, # type: ignore\n ]\n if (\n \"system_prompt\" in item\n and isinstance(item[\"system_prompt\"], str) # type: ignore\n and len(item[\"system_prompt\"]) > 0 # type: ignore\n ):\n item[\"messages\"].insert(\n 0,\n {\"role\": \"system\", \"content\": item[\"system_prompt\"]}, # type: ignore\n )\n\n yield input\n
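The example in the class docstring above omits the optional `system_prompt` column. As a minimal additional sketch (the commented output is illustrative, not taken from the upstream docs), this is how a `system_prompt` ends up as the first message of the formatted conversation:

```python
from distilabel.steps import FormatTextGenerationSFT

format_sft = FormatTextGenerationSFT()
format_sft.load()

result = next(
    format_sft.process(
        [
            {
                "system_prompt": "You are a concise math assistant.",
                "instruction": "What's 2+2?",
                "generation": "4",
            }
        ]
    )
)
# result[0]["messages"] should now start with the system message:
# [{'role': 'system', 'content': 'You are a concise math assistant.'},
#  {'role': 'user', 'content': "What's 2+2?"},
#  {'role': 'assistant', 'content': '4'}]
```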
"},{"location":"api/step_gallery/extra/#distilabel.steps.LoadDataFromDicts","title":"LoadDataFromDicts
","text":" Bases: GeneratorStep
Loads a dataset from a list of dictionaries.
GeneratorStep
that loads a dataset from a list of dictionaries and yields it in batches.
Attributes:
Name Type Descriptiondata
List[Dict[str, Any]]
The list of dictionaries to load the data from.
Runtime parametersbatch_size
: The batch size to use when processing the data.Examples:
Load data from a list of dictionaries:
from distilabel.steps import LoadDataFromDicts\n\nloader = LoadDataFromDicts(\n data=[{\"instruction\": \"What are 2+2?\"}] * 5,\n batch_size=2\n)\nloader.load()\n\nresult = next(loader.process())\n# >>> result\n# ([{'instruction': 'What are 2+2?'}, {'instruction': 'What are 2+2?'}], False)\n
Source code in src/distilabel/steps/generators/data.py
class LoadDataFromDicts(GeneratorStep):\n \"\"\"Loads a dataset from a list of dictionaries.\n\n `GeneratorStep` that loads a dataset from a list of dictionaries and yields it in\n batches.\n\n Attributes:\n data: The list of dictionaries to load the data from.\n\n Runtime parameters:\n - `batch_size`: The batch size to use when processing the data.\n\n Output columns:\n - dynamic (based on the keys found on the first dictionary of the list): The columns\n of the dataset.\n\n Categories:\n - load\n\n Examples:\n Load data from a list of dictionaries:\n\n ```python\n from distilabel.steps import LoadDataFromDicts\n\n loader = LoadDataFromDicts(\n data=[{\"instruction\": \"What are 2+2?\"}] * 5,\n batch_size=2\n )\n loader.load()\n\n result = next(loader.process())\n # >>> result\n # ([{'instruction': 'What are 2+2?'}, {'instruction': 'What are 2+2?'}], False)\n ```\n \"\"\"\n\n data: List[Dict[str, Any]] = Field(default_factory=list, exclude=True)\n\n @override\n def process(self, offset: int = 0) -> \"GeneratorStepOutput\": # type: ignore\n \"\"\"Yields batches from a list of dictionaries.\n\n Args:\n offset: The offset to start the generation from. Defaults to `0`.\n\n Yields:\n A list of Python dictionaries as read from the inputs (propagated in batches)\n and a flag indicating whether the yield batch is the last one.\n \"\"\"\n if offset:\n self.data = self.data[offset:]\n\n while self.data:\n batch = self.data[: self.batch_size]\n self.data = self.data[self.batch_size :]\n yield (\n batch,\n True if len(self.data) == 0 else False,\n )\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"Returns a list of strings with the names of the columns that the step will generate.\"\"\"\n return list(self.data[0].keys())\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.LoadDataFromDicts.outputs","title":"outputs
property
","text":"Returns a list of strings with the names of the columns that the step will generate.
"},{"location":"api/step_gallery/extra/#distilabel.steps.LoadDataFromDicts.process","title":"process(offset=0)
","text":"Yields batches from a list of dictionaries.
Parameters:
Name Type Description Defaultoffset
int
The offset to start the generation from. Defaults to 0
.
0
Yields:
Type DescriptionGeneratorStepOutput
A list of Python dictionaries as read from the inputs (propagated in batches)
GeneratorStepOutput
and a flag indicating whether the yield batch is the last one.
Source code insrc/distilabel/steps/generators/data.py
@override\ndef process(self, offset: int = 0) -> \"GeneratorStepOutput\": # type: ignore\n \"\"\"Yields batches from a list of dictionaries.\n\n Args:\n offset: The offset to start the generation from. Defaults to `0`.\n\n Yields:\n A list of Python dictionaries as read from the inputs (propagated in batches)\n and a flag indicating whether the yield batch is the last one.\n \"\"\"\n if offset:\n self.data = self.data[offset:]\n\n while self.data:\n batch = self.data[: self.batch_size]\n self.data = self.data[self.batch_size :]\n yield (\n batch,\n True if len(self.data) == 0 else False,\n )\n
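Since `process` is a generator that yields `(batch, last_batch_flag)` tuples, it can be consumed directly. A minimal sketch (column names and sizes are illustrative) assuming the same `LoadDataFromDicts` step documented above:

```python
from distilabel.steps import LoadDataFromDicts

loader = LoadDataFromDicts(
    data=[{"instruction": f"Question {i}"} for i in range(5)],
    batch_size=2,
)
loader.load()

# Iterate until the flag signals the last batch.
for batch, is_last_batch in loader.process():
    print(len(batch), is_last_batch)
# Prints: 2 False, 2 False, 1 True
```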
"},{"location":"api/step_gallery/extra/#distilabel.steps.DataSampler","title":"DataSampler
","text":" Bases: GeneratorStep
Step to sample from a dataset.
GeneratorStep
that samples from a dataset and yields it in batches. This step is useful when your pipeline can benefit from using examples in the prompts, for instance as few-shot examples that change on each row. You can pass a list of dictionaries with N examples and generate M samples from it (assuming another step is loading the data, M should match the number of rows loaded by that step). The size S argument is the number of samples drawn per generated row, so each row will contain S entries to be used as examples.
Attributes:
Name Type Descriptiondata
List[Dict[str, Any]]
The list of dictionaries to sample from.
size
int
Number of samples per example. For example in a few-shot learning scenario, the number of few-shot examples that will be generated per example. Defaults to 2.
samples
int
Number of examples that will be generated by the step in total. If used with another loader step, this should be the same as the number of samples in the loader step. Defaults to 100.
Output columnsExamples:
Sample data from a list of dictionaries:
from distilabel.steps import DataSampler\n\nsampler = DataSampler(\n data=[{\"sample\": f\"sample {i}\"} for i in range(30)],\n samples=10,\n size=2,\n batch_size=4\n)\nsampler.load()\n\nresult = next(sampler.process())\n# >>> result\n# ([{'sample': ['sample 7', 'sample 0']}, {'sample': ['sample 2', 'sample 21']}, {'sample': ['sample 17', 'sample 12']}, {'sample': ['sample 2', 'sample 14']}], False)\n
Pipeline with a loader and a sampler combined in a single stream:
from datasets import load_dataset\n\nfrom distilabel.steps import LoadDataFromDicts, DataSampler\nfrom distilabel.steps.tasks.apigen.utils import PrepareExamples\nfrom distilabel.pipeline import Pipeline\n\nds = (\n load_dataset(\"Salesforce/xlam-function-calling-60k\", split=\"train\")\n .shuffle(seed=42)\n .select(range(500))\n .to_list()\n)\ndata = [\n {\n \"func_name\": \"final_velocity\",\n \"func_desc\": \"Calculates the final velocity of an object given its initial velocity, acceleration, and time.\",\n },\n {\n \"func_name\": \"permutation_count\",\n \"func_desc\": \"Calculates the number of permutations of k elements from a set of n elements.\",\n },\n {\n \"func_name\": \"getdivision\",\n \"func_desc\": \"Divides two numbers by making an API call to a division service.\",\n },\n]\nwith Pipeline(name=\"APIGenPipeline\") as pipeline:\n loader_seeds = LoadDataFromDicts(data=data)\n sampler = DataSampler(\n data=ds,\n size=2,\n samples=len(data),\n batch_size=8,\n )\n prep_examples = PrepareExamples()\n\n sampler >> prep_examples\n (\n [loader_seeds, prep_examples]\n >> combine_steps\n )\n# Now we have a single stream of data with the loader and the sampler data\n
Source code in src/distilabel/steps/generators/data_sampler.py
class DataSampler(GeneratorStep):\n \"\"\"Step to sample from a dataset.\n\n `GeneratorStep` that samples from a dataset and yields it in batches.\n This step is useful when you have a pipeline that can benefit from using examples\n in the prompts for example as few-shot learning, that can be changing on each row.\n For example, you can pass a list of dictionaries with N examples and generate M samples\n from it (assuming you have another step loading data, this M should have the same size\n as the data being loaded in that step). The size S argument is the number of samples per\n row generated, so each example would contain S examples to be used as examples.\n\n Attributes:\n data: The list of dictionaries to sample from.\n size: Number of samples per example. For example in a few-shot learning scenario,\n the number of few-shot examples that will be generated per example. Defaults to 2.\n samples: Number of examples that will be generated by the step in total.\n If used with another loader step, this should be the same as the number\n of samples in the loader step. Defaults to 100.\n\n Output columns:\n - dynamic (based on the keys found on the first dictionary of the list): The columns\n of the dataset.\n\n Categories:\n - load\n\n Examples:\n Sample data from a list of dictionaries:\n\n ```python\n from distilabel.steps import DataSampler\n\n sampler = DataSampler(\n data=[{\"sample\": f\"sample {i}\"} for i in range(30)],\n samples=10,\n size=2,\n batch_size=4\n )\n sampler.load()\n\n result = next(sampler.process())\n # >>> result\n # ([{'sample': ['sample 7', 'sample 0']}, {'sample': ['sample 2', 'sample 21']}, {'sample': ['sample 17', 'sample 12']}, {'sample': ['sample 2', 'sample 14']}], False)\n ```\n\n Pipeline with a loader and a sampler combined in a single stream:\n\n ```python\n from datasets import load_dataset\n\n from distilabel.steps import LoadDataFromDicts, DataSampler\n from distilabel.steps.tasks.apigen.utils import PrepareExamples\n from distilabel.pipeline import Pipeline\n\n ds = (\n load_dataset(\"Salesforce/xlam-function-calling-60k\", split=\"train\")\n .shuffle(seed=42)\n .select(range(500))\n .to_list()\n )\n data = [\n {\n \"func_name\": \"final_velocity\",\n \"func_desc\": \"Calculates the final velocity of an object given its initial velocity, acceleration, and time.\",\n },\n {\n \"func_name\": \"permutation_count\",\n \"func_desc\": \"Calculates the number of permutations of k elements from a set of n elements.\",\n },\n {\n \"func_name\": \"getdivision\",\n \"func_desc\": \"Divides two numbers by making an API call to a division service.\",\n },\n ]\n with Pipeline(name=\"APIGenPipeline\") as pipeline:\n loader_seeds = LoadDataFromDicts(data=data)\n sampler = DataSampler(\n data=ds,\n size=2,\n samples=len(data),\n batch_size=8,\n )\n prep_examples = PrepareExamples()\n\n sampler >> prep_examples\n (\n [loader_seeds, prep_examples]\n >> combine_steps\n )\n # Now we have a single stream of data with the loader and the sampler data\n ```\n \"\"\"\n\n data: List[Dict[str, Any]] = Field(default_factory=list, exclude=True)\n size: int = Field(\n default=2,\n description=(\n \"Number of samples per example. For example in a few-shot learning scenario, the number \"\n \"of few-shot examples that will be generated per example.\"\n ),\n )\n samples: int = Field(\n default=100,\n description=(\n \"Number of examples that will be generated by the step in total. 
\"\n \"If used with another loader step, this should be the same as the number of \"\n \"samples in the loader step.\"\n ),\n )\n\n @override\n def process(self, offset: int = 0) -> \"GeneratorStepOutput\": # type: ignore\n \"\"\"Yields batches from a list of dictionaries.\n\n Args:\n offset: The offset to start the generation from. Defaults to `0`.\n\n Yields:\n A list of Python dictionaries as read from the inputs (propagated in batches)\n and a flag indicating whether the yield batch is the last one.\n \"\"\"\n\n total_samples = 0\n\n while total_samples < self.samples:\n batch = []\n bs = min(self.batch_size, self.samples - total_samples)\n for _ in range(self.batch_size):\n choices = random.choices(self.data, k=self.size)\n choices = self._transform_data(choices)\n batch.extend(choices)\n total_samples += bs\n batch = list(islice(batch, bs))\n yield (batch, True if total_samples >= self.samples else False)\n batch = []\n\n @staticmethod\n def _transform_data(data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:\n if not data:\n return []\n\n result = {key: [] for key in data[0].keys()}\n\n for item in data:\n for key, value in item.items():\n result[key].append(value)\n\n return [result]\n\n @property\n def outputs(self) -> List[str]:\n return list(self.data[0].keys())\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.DataSampler.process","title":"process(offset=0)
","text":"Yields batches from a list of dictionaries.
Parameters:
Name Type Description Defaultoffset
int
The offset to start the generation from. Defaults to 0
.
0
Yields:
Type DescriptionGeneratorStepOutput
A list of Python dictionaries as read from the inputs (propagated in batches)
GeneratorStepOutput
and a flag indicating whether the yield batch is the last one.
Source code insrc/distilabel/steps/generators/data_sampler.py
@override\ndef process(self, offset: int = 0) -> \"GeneratorStepOutput\": # type: ignore\n \"\"\"Yields batches from a list of dictionaries.\n\n Args:\n offset: The offset to start the generation from. Defaults to `0`.\n\n Yields:\n A list of Python dictionaries as read from the inputs (propagated in batches)\n and a flag indicating whether the yield batch is the last one.\n \"\"\"\n\n total_samples = 0\n\n while total_samples < self.samples:\n batch = []\n bs = min(self.batch_size, self.samples - total_samples)\n for _ in range(self.batch_size):\n choices = random.choices(self.data, k=self.size)\n choices = self._transform_data(choices)\n batch.extend(choices)\n total_samples += bs\n batch = list(islice(batch, bs))\n yield (batch, True if total_samples >= self.samples else False)\n batch = []\n
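To make the grouping performed by `_transform_data` more tangible, here is a small sketch (column names are illustrative): every generated row packs `size` sampled values per column into a list, which is what makes the step handy for few-shot prompting:

```python
from distilabel.steps import DataSampler

sampler = DataSampler(
    data=[{"query": f"q{i}", "answer": f"a{i}"} for i in range(20)],
    size=3,      # each generated row carries 3 sampled examples per column
    samples=4,   # total number of rows generated by the step
    batch_size=2,
)
sampler.load()

for batch, is_last_batch in sampler.process():
    for row in batch:
        # Each row looks like {'query': ['q3', 'q7', 'q1'], 'answer': ['a3', 'a7', 'a1']}
        print(row, is_last_batch)
```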
"},{"location":"api/step_gallery/extra/#distilabel.steps.RewardModelScore","title":"RewardModelScore
","text":" Bases: Step
, CudaDevicePlacementMixin
Assign a score to a response using a Reward Model.
RewardModelScore
is a Step
that, using a Reward Model (RM) loaded with transformers
, assigns a score to a response generated for an instruction, or to a multi-turn conversation.
Attributes:
Name Type Descriptionmodel
str
the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
revision
str
if model
refers to a Hugging Face Hub repository, then the revision (e.g. a branch name or a commit id) to use. Defaults to \"main\"
.
torch_dtype
str
the torch dtype to use for the model e.g. \"float16\", \"float32\", etc. Defaults to \"auto\"
.
trust_remote_code
bool
whether to allow fetching and executing remote code fetched from the repository in the Hub. Defaults to False
.
device_map
Union[str, Dict[str, Any], None]
a dictionary mapping each layer of the model to a device, or a mode like \"sequential\"
or \"auto\"
. Defaults to None
.
token
Union[SecretStr, None]
the Hugging Face Hub token that will be used to authenticate to the Hugging Face Hub. If not provided, the HF_TOKEN
environment or huggingface_hub
package local configuration will be used. Defaults to None
.
truncation
bool
whether to truncate sequences at the maximum length. Defaults to False
.
max_length
Union[int, None]
maximum length to use for padding or truncation. Defaults to None
.
str
, optional): the instruction used to generate a response
. If provided, then response
must be provided too.str
, optional): the response generated for instruction
. If provided, then instruction
must be provided too.ChatType
, optional): a multi-turn conversation. If not provided, then instruction
and response
columns must be provided.float
): the score given by the reward model for the instruction-response pair or the conversation.Examples:
Assigning a score to an instruction-response pair:
from distilabel.steps import RewardModelScore\n\nstep = RewardModelScore(\n model=\"RLHFlow/ArmoRM-Llama3-8B-v0.1\", device_map=\"auto\", trust_remote_code=True\n)\n\nstep.load()\n\nresult = next(\n step.process(\n inputs=[\n {\n \"instruction\": \"How much is 2+2?\",\n \"response\": \"The output of 2+2 is 4\",\n },\n {\"instruction\": \"How much is 2+2?\", \"response\": \"4\"},\n ]\n )\n)\n# [\n# {'instruction': 'How much is 2+2?', 'response': 'The output of 2+2 is 4', 'score': 0.11690367758274078},\n# {'instruction': 'How much is 2+2?', 'response': '4', 'score': 0.10300665348768234}\n# ]\n
Assigning a score to a multi-turn conversation:
from distilabel.steps import RewardModelScore\n\nstep = RewardModelScore(\n model=\"RLHFlow/ArmoRM-Llama3-8B-v0.1\", device_map=\"auto\", trust_remote_code=True\n)\n\nstep.load()\n\nresult = next(\n step.process(\n inputs=[\n {\n \"conversation\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n {\"role\": \"assistant\", \"content\": \"The output of 2+2 is 4\"},\n ],\n },\n {\n \"conversation\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n {\"role\": \"assistant\", \"content\": \"4\"},\n ],\n },\n ]\n )\n)\n# [\n# {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 'assistant', 'content': 'The output of 2+2 is 4'}], 'score': 0.11690367758274078},\n# {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 'assistant', 'content': '4'}], 'score': 0.10300665348768234}\n# ]\n
Source code in src/distilabel/steps/reward_model.py
class RewardModelScore(Step, CudaDevicePlacementMixin):\n \"\"\"Assign a score to a response using a Reward Model.\n\n `RewardModelScore` is a `Step` that using a Reward Model (RM) loaded using `transformers`,\n assigns an score to a response generated for an instruction, or a score to a multi-turn\n conversation.\n\n Attributes:\n model: the model Hugging Face Hub repo id or a path to a directory containing the\n model weights and configuration files.\n revision: if `model` refers to a Hugging Face Hub repository, then the revision\n (e.g. a branch name or a commit id) to use. Defaults to `\"main\"`.\n torch_dtype: the torch dtype to use for the model e.g. \"float16\", \"float32\", etc.\n Defaults to `\"auto\"`.\n trust_remote_code: whether to allow fetching and executing remote code fetched\n from the repository in the Hub. Defaults to `False`.\n device_map: a dictionary mapping each layer of the model to a device, or a mode like `\"sequential\"` or `\"auto\"`. Defaults to `None`.\n token: the Hugging Face Hub token that will be used to authenticate to the Hugging\n Face Hub. If not provided, the `HF_TOKEN` environment or `huggingface_hub` package\n local configuration will be used. Defaults to `None`.\n truncation: whether to truncate sequences at the maximum length. Defaults to `False`.\n max_length: maximun length to use for padding or truncation. Defaults to `None`.\n\n Input columns:\n - instruction (`str`, optional): the instruction used to generate a `response`.\n If provided, then `response` must be provided too.\n - response (`str`, optional): the response generated for `instruction`. If provided,\n then `instruction` must be provide too.\n - conversation (`ChatType`, optional): a multi-turn conversation. If not provided,\n then `instruction` and `response` columns must be provided.\n\n Output columns:\n - score (`float`): the score given by the reward model for the instruction-response\n pair or the conversation.\n\n Categories:\n - scorer\n\n Examples:\n Assigning an score for an instruction-response pair:\n\n ```python\n from distilabel.steps import RewardModelScore\n\n step = RewardModelScore(\n model=\"RLHFlow/ArmoRM-Llama3-8B-v0.1\", device_map=\"auto\", trust_remote_code=True\n )\n\n step.load()\n\n result = next(\n step.process(\n inputs=[\n {\n \"instruction\": \"How much is 2+2?\",\n \"response\": \"The output of 2+2 is 4\",\n },\n {\"instruction\": \"How much is 2+2?\", \"response\": \"4\"},\n ]\n )\n )\n # [\n # {'instruction': 'How much is 2+2?', 'response': 'The output of 2+2 is 4', 'score': 0.11690367758274078},\n # {'instruction': 'How much is 2+2?', 'response': '4', 'score': 0.10300665348768234}\n # ]\n ```\n\n Assigning an score for a multi-turn conversation:\n\n ```python\n from distilabel.steps import RewardModelScore\n\n step = RewardModelScore(\n model=\"RLHFlow/ArmoRM-Llama3-8B-v0.1\", device_map=\"auto\", trust_remote_code=True\n )\n\n step.load()\n\n result = next(\n step.process(\n inputs=[\n {\n \"conversation\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n {\"role\": \"assistant\", \"content\": \"The output of 2+2 is 4\"},\n ],\n },\n {\n \"conversation\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n {\"role\": \"assistant\", \"content\": \"4\"},\n ],\n },\n ]\n )\n )\n # [\n # {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 'assistant', 'content': 'The output of 2+2 is 4'}], 'score': 0.11690367758274078},\n # {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 
'assistant', 'content': '4'}], 'score': 0.10300665348768234}\n # ]\n ```\n \"\"\"\n\n model: str\n revision: str = \"main\"\n torch_dtype: str = \"auto\"\n trust_remote_code: bool = False\n device_map: Union[str, Dict[str, Any], None] = None\n token: Union[SecretStr, None] = Field(\n default_factory=lambda: os.getenv(HF_TOKEN_ENV_VAR), description=\"\"\n )\n truncation: bool = False\n max_length: Union[int, None] = None\n\n _model: Union[\"PreTrainedModel\", None] = PrivateAttr(None)\n _tokenizer: Union[\"PreTrainedTokenizer\", None] = PrivateAttr(None)\n\n def load(self) -> None:\n super().load()\n\n if self.device_map in [\"cuda\", \"auto\"]:\n CudaDevicePlacementMixin.load(self)\n\n try:\n from transformers import AutoModelForSequenceClassification, AutoTokenizer\n except ImportError as e:\n raise ImportError(\n \"`transformers` is not installed. Please install it using `pip install transformers`.\"\n ) from e\n\n token = self.token.get_secret_value() if self.token is not None else self.token\n\n self._model = AutoModelForSequenceClassification.from_pretrained(\n self.model,\n revision=self.revision,\n torch_dtype=self.torch_dtype,\n trust_remote_code=self.trust_remote_code,\n device_map=self.device_map,\n token=token,\n )\n self._tokenizer = AutoTokenizer.from_pretrained(\n self.model,\n revision=self.revision,\n torch_dtype=self.torch_dtype,\n trust_remote_code=self.trust_remote_code,\n token=token,\n )\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"Either `response` and `instruction`, or a `conversation` columns.\"\"\"\n return {\n \"response\": False,\n \"instruction\": False,\n \"conversation\": False,\n }\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The `score` given by the reward model.\"\"\"\n return [\"score\"]\n\n def _prepare_conversation(self, input: Dict[str, Any]) -> \"ChatType\":\n if \"instruction\" in input and \"response\" in input:\n return [\n {\"role\": \"user\", \"content\": input[\"instruction\"]},\n {\"role\": \"assistant\", \"content\": input[\"response\"]},\n ]\n\n return input[\"conversation\"]\n\n def _prepare_inputs(self, inputs: List[Dict[str, Any]]) -> \"torch.Tensor\":\n return self._tokenizer.apply_chat_template( # type: ignore\n [self._prepare_conversation(input) for input in inputs], # type: ignore\n return_tensors=\"pt\",\n padding=True,\n truncation=self.truncation,\n max_length=self.max_length,\n ).to(self._model.device) # type: ignore\n\n def _inference(self, inputs: List[Dict[str, Any]]) -> List[float]:\n import torch\n\n input_ids = self._prepare_inputs(inputs)\n with torch.no_grad():\n output = self._model(input_ids) # type: ignore\n logits = output.logits\n if logits.shape == (2, 1):\n logits = logits.squeeze(-1)\n return logits.tolist()\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n scores = self._inference(inputs)\n for input, score in zip(inputs, scores):\n input[\"score\"] = score\n yield inputs\n\n def unload(self) -> None:\n if self.device_map in [\"cuda\", \"auto\"]:\n CudaDevicePlacementMixin.unload(self)\n super().unload()\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.RewardModelScore.inputs","title":"inputs
property
","text":"Either response
and instruction
, or a conversation
columns.
outputs
property
","text":"The score
given by the reward model.
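Because `_prepare_conversation` in the source above builds the conversation row by row, the two input formats can be mixed in a single batch. A sketch assuming the same reward model used in the examples (a large model, so treat this as illustrative rather than something to run casually):

```python
from distilabel.steps import RewardModelScore

step = RewardModelScore(
    model="RLHFlow/ArmoRM-Llama3-8B-v0.1", device_map="auto", trust_remote_code=True
)
step.load()

result = next(
    step.process(
        inputs=[
            # An instruction/response pair...
            {"instruction": "How much is 2+2?", "response": "4"},
            # ...and an already formatted conversation in the same batch.
            {
                "conversation": [
                    {"role": "user", "content": "How much is 2+2?"},
                    {"role": "assistant", "content": "4"},
                ],
            },
        ]
    )
)
# Every row gains a `score` column with the scalar assigned by the reward model.
```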
TruncateTextColumn
","text":" Bases: Step
Truncate a row using a tokenizer or the number of characters.
TruncateTextColumn
is a Step
that truncates a row according to the max length. If the tokenizer
is provided, then the row will be truncated using the tokenizer, and the max_length
will be used as the maximum number of tokens, otherwise it will be used as the maximum number of characters. The TruncateTextColumn
step is useful when one wants to truncate a row to a certain length, to avoid later errors in the model caused by inputs that are too long.
Attributes:
Name Type Descriptioncolumn
str
the column to truncate. Defaults to \"text\"
.
max_length
int
the maximum length to use for truncation. If a tokenizer
is given, corresponds to the number of tokens, otherwise corresponds to the number of characters. Defaults to 8192
.
tokenizer
Optional[str]
the name of the tokenizer to use. If provided, the row will be truncated using the tokenizer. Defaults to None
.
column
attribute): The column to be truncated, defaults to "text".column
attribute): The truncated column.Examples:
Truncating a row to a given number of tokens:
from distilabel.steps import TruncateTextColumn\n\ntrunc = TruncateTextColumn(\n tokenizer=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n max_length=4,\n column=\"text\"\n)\n\ntrunc.load()\n\nresult = next(\n trunc.process(\n [\n {\"text\": \"This is a sample text that is longer than 10 characters\"}\n ]\n )\n)\n# result\n# [{'text': 'This is a sample'}]\n
Truncating a row to a given number of characters:
from distilabel.steps import TruncateTextColumn\n\ntrunc = TruncateTextColumn(max_length=10)\n\ntrunc.load()\n\nresult = next(\n trunc.process(\n [\n {\"text\": \"This is a sample text that is longer than 10 characters\"}\n ]\n )\n)\n# result\n# [{'text': 'This is a '}]\n
Source code in src/distilabel/steps/truncate.py
class TruncateTextColumn(Step):\n \"\"\"Truncate a row using a tokenizer or the number of characters.\n\n `TruncateTextColumn` is a `Step` that truncates a row according to the max length. If\n the `tokenizer` is provided, then the row will be truncated using the tokenizer,\n and the `max_length` will be used as the maximum number of tokens, otherwise it will\n be used as the maximum number of characters. The `TruncateTextColumn` step is useful when one\n wants to truncate a row to a certain length, to avoid posterior errors in the model due\n to the length.\n\n Attributes:\n column: the column to truncate. Defaults to `\"text\"`.\n max_length: the maximum length to use for truncation.\n If a `tokenizer` is given, corresponds to the number of tokens,\n otherwise corresponds to the number of characters. Defaults to `8192`.\n tokenizer: the name of the tokenizer to use. If provided, the row will be\n truncated using the tokenizer. Defaults to `None`.\n\n Input columns:\n - dynamic (determined by `column` attribute): The columns to be truncated, defaults to \"text\".\n\n Output columns:\n - dynamic (determined by `column` attribute): The truncated column.\n\n Categories:\n - text-manipulation\n\n Examples:\n Truncating a row to a given number of tokens:\n\n ```python\n from distilabel.steps import TruncateTextColumn\n\n trunc = TruncateTextColumn(\n tokenizer=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n max_length=4,\n column=\"text\"\n )\n\n trunc.load()\n\n result = next(\n trunc.process(\n [\n {\"text\": \"This is a sample text that is longer than 10 characters\"}\n ]\n )\n )\n # result\n # [{'text': 'This is a sample'}]\n ```\n\n Truncating a row to a given number of characters:\n\n ```python\n from distilabel.steps import TruncateTextColumn\n\n trunc = TruncateTextColumn(max_length=10)\n\n trunc.load()\n\n result = next(\n trunc.process(\n [\n {\"text\": \"This is a sample text that is longer than 10 characters\"}\n ]\n )\n )\n # result\n # [{'text': 'This is a '}]\n ```\n \"\"\"\n\n column: str = \"text\"\n max_length: int = 8192\n tokenizer: Optional[str] = None\n _truncator: Optional[Callable[[str], str]] = None\n _tokenizer: Optional[Any] = None\n\n def load(self):\n super().load()\n if self.tokenizer:\n if not importlib.util.find_spec(\"transformers\"):\n raise ImportError(\n \"`transformers` is needed to tokenize, but is not installed. \"\n \"Please install it using `pip install transformers`.\"\n )\n\n from transformers import AutoTokenizer\n\n self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer)\n self._truncator = self._truncate_with_tokenizer\n else:\n self._truncator = self._truncate_with_length\n\n @property\n def inputs(self) -> List[str]:\n return [self.column]\n\n @property\n def outputs(self) -> List[str]:\n return self.inputs\n\n def _truncate_with_length(self, text: str) -> str:\n \"\"\"Truncates the text according to the number of characters.\"\"\"\n return text[: self.max_length]\n\n def _truncate_with_tokenizer(self, text: str) -> str:\n \"\"\"Truncates the text according to the number of characters using the tokenizer.\"\"\"\n return self._tokenizer.decode(\n self._tokenizer.encode(\n text,\n add_special_tokens=False,\n max_length=self.max_length,\n truncation=True,\n )\n )\n\n @override\n def process(self, inputs: StepInput) -> \"StepOutput\":\n for input in inputs:\n input[self.column] = self._truncator(input[self.column])\n yield inputs\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.TruncateTextColumn._truncate_with_length","title":"_truncate_with_length(text)
","text":"Truncates the text according to the number of characters.
Source code insrc/distilabel/steps/truncate.py
def _truncate_with_length(self, text: str) -> str:\n \"\"\"Truncates the text according to the number of characters.\"\"\"\n return text[: self.max_length]\n
"},{"location":"api/step_gallery/extra/#distilabel.steps.TruncateTextColumn._truncate_with_tokenizer","title":"_truncate_with_tokenizer(text)
","text":"Truncates the text according to the number of characters using the tokenizer.
Source code insrc/distilabel/steps/truncate.py
def _truncate_with_tokenizer(self, text: str) -> str:\n \"\"\"Truncates the text according to the number of characters using the tokenizer.\"\"\"\n return self._tokenizer.decode(\n self._tokenizer.encode(\n text,\n add_special_tokens=False,\n max_length=self.max_length,\n truncation=True,\n )\n )\n
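As a small additional sketch (the column name is illustrative), truncating a non-default column by characters only requires pointing the `column` attribute at it; without a `tokenizer`, `max_length` counts characters:

```python
from distilabel.steps import TruncateTextColumn

trunc = TruncateTextColumn(column="instruction", max_length=20)
trunc.load()

result = next(
    trunc.process(
        [{"instruction": "Write a detailed, multi-paragraph essay about volcanoes."}]
    )
)
# result
# [{'instruction': 'Write a detailed, mu'}]  # exactly 20 characters kept
```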
"},{"location":"api/step_gallery/hugging_face/","title":"Hugging Face","text":"This section contains the existing steps integrated with Hugging Face
so that the generated datasets can be easily pushed to the Hugging Face Hub.
LoadDataFromDisk
","text":" Bases: LoadDataFromHub
Load a dataset that was previously saved to disk.
If you previously saved your dataset using the save_to_disk
method, or Distiset.save_to_disk
, you can load it again to build a new pipeline using this class.
Attributes:
Name Type Descriptiondataset_path
RuntimeParameter[Union[str, Path]]
The path to the dataset or distiset.
split
Optional[RuntimeParameter[str]]
The split of the dataset to load (typically will be train
, test
or validation
).
config
Optional[RuntimeParameter[str]]
The configuration of the dataset to load. Defaults to default
, if there are multiple configurations in the dataset this must be supplied or an error is raised.
batch_size
: The batch size to use when processing the data.dataset_path
: The path to the dataset or distiset.is_distiset
: Whether the dataset to load is a Distiset
or not. Defaults to False.split
: The split of the dataset to load. Defaults to 'train'.config
: The configuration of the dataset to load. Defaults to default
, if there are multiple configurations in the dataset this must be supplied or an error is raised.num_examples
: The number of examples to load from the dataset. By default will load all examples.storage_options
: Key/value pairs to be passed on to the file-system backend, if any. Defaults to None
.all
): The columns that will be generated by this step, based on the datasets loaded from the Hugging Face Hub.Examples:
Load data from a Hugging Face Dataset:
from distilabel.steps import LoadDataFromDisk\n\nloader = LoadDataFromDisk(dataset_path=\"path/to/dataset\")\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
Load data from a distilabel Distiset:
from distilabel.steps import LoadDataFromDisk\n\n# Specify the configuration to load.\nloader = LoadDataFromDisk(\n dataset_path=\"path/to/dataset\",\n is_distiset=True,\n config=\"leaf_step_1\"\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'a': 1}, {'a': 2}, {'a': 3}], True)\n
Load data from a Hugging Face Dataset or Distiset in your cloud provider:
from distilabel.steps import LoadDataFromDisk\n\nloader = LoadDataFromDisk(\n dataset_path=\"gcs://path/to/dataset\",\n storage_options={\"project\": \"experiments-0001\"}\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
Source code in src/distilabel/steps/generators/huggingface.py
class LoadDataFromDisk(LoadDataFromHub):\n \"\"\"Load a dataset that was previously saved to disk.\n\n If you previously saved your dataset using the `save_to_disk` method, or\n `Distiset.save_to_disk` you can load it again to build a new pipeline using this class.\n\n Attributes:\n dataset_path: The path to the dataset or distiset.\n split: The split of the dataset to load (typically will be `train`, `test` or `validation`).\n config: The configuration of the dataset to load. Defaults to `default`, if there are\n multiple configurations in the dataset this must be suplied or an error is raised.\n\n Runtime parameters:\n - `batch_size`: The batch size to use when processing the data.\n - `dataset_path`: The path to the dataset or distiset.\n - `is_distiset`: Whether the dataset to load is a `Distiset` or not. Defaults to False.\n - `split`: The split of the dataset to load. Defaults to 'train'.\n - `config`: The configuration of the dataset to load. Defaults to `default`, if there are\n multiple configurations in the dataset this must be suplied or an error is raised.\n - `num_examples`: The number of examples to load from the dataset.\n By default will load all examples.\n - `storage_options`: Key/value pairs to be passed on to the file-system backend, if any.\n Defaults to `None`.\n\n Output columns:\n - dynamic (`all`): The columns that will be generated by this step, based on the\n datasets loaded from the Hugging Face Hub.\n\n Categories:\n - load\n\n Examples:\n Load data from a Hugging Face Dataset:\n\n ```python\n from distilabel.steps import LoadDataFromDisk\n\n loader = LoadDataFromDisk(dataset_path=\"path/to/dataset\")\n loader.load()\n\n # Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\n result = next(loader.process())\n # >>> result\n # ([{'type': 'function', 'function':...', False)\n ```\n\n Load data from a distilabel Distiset:\n\n ```python\n from distilabel.steps import LoadDataFromDisk\n\n # Specify the configuration to load.\n loader = LoadDataFromDisk(\n dataset_path=\"path/to/dataset\",\n is_distiset=True,\n config=\"leaf_step_1\"\n )\n loader.load()\n\n # Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\n result = next(loader.process())\n # >>> result\n # ([{'a': 1}, {'a': 2}, {'a': 3}], True)\n ```\n\n Load data from a Hugging Face Dataset or Distiset in your cloud provider:\n\n ```python\n from distilabel.steps import LoadDataFromDisk\n\n loader = LoadDataFromDisk(\n dataset_path=\"gcs://path/to/dataset\",\n storage_options={\"project\": \"experiments-0001\"}\n )\n loader.load()\n\n # Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\n result = next(loader.process())\n # >>> result\n # ([{'type': 'function', 'function':...', False)\n ```\n \"\"\"\n\n dataset_path: RuntimeParameter[Union[str, Path]] = Field(\n default=None,\n description=\"Path to the dataset or distiset.\",\n )\n config: Optional[RuntimeParameter[str]] = Field(\n default=\"default\",\n description=(\n \"The configuration of the dataset to load. Will default to 'default'\",\n \" which corresponds to a distiset with a single configuration.\",\n ),\n )\n is_distiset: Optional[RuntimeParameter[bool]] = Field(\n default=False,\n description=\"Whether the dataset to load is a `Distiset` or not. 
Defaults to False.\",\n )\n keep_in_memory: Optional[RuntimeParameter[bool]] = Field(\n default=None,\n description=\"Whether to copy the dataset in-memory, see `datasets.Dataset.load_from_disk` \"\n \" for more information. Defaults to `None`.\",\n )\n split: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The split of the dataset to load. By default will load the whole Dataset/Distiset.\",\n )\n repo_id: ExcludedField[Union[str, None]] = None\n\n def load(self) -> None:\n \"\"\"Load the dataset from the file/s in disk.\"\"\"\n super(GeneratorStep, self).load()\n if self.is_distiset:\n ds = Distiset.load_from_disk(\n self.dataset_path,\n keep_in_memory=self.keep_in_memory,\n storage_options=self.storage_options,\n )\n if self.config not in ds.keys():\n raise DistilabelUserError(\n f\"Configuration '{self.config}' not found in the Distiset, available ones\"\n f\" are: {list(ds.keys())}. Please try changing the `config` parameter to one \"\n \"of the available configurations.\\n\\n\",\n page=\"sections/how_to_guides/advanced/distiset/#using-the-distiset-dataset-object\",\n )\n ds = ds[self.config]\n\n else:\n ds = load_from_disk(\n self.dataset_path,\n keep_in_memory=self.keep_in_memory,\n storage_options=self.storage_options,\n )\n\n if self.split:\n ds = ds[self.split]\n\n self._dataset = ds\n\n if self.num_examples:\n self._dataset = self._dataset.select(range(self.num_examples))\n else:\n self.num_examples = len(self._dataset)\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The columns that will be generated by this step, based on the datasets from a file\n in disk.\n\n Returns:\n The columns that will be generated by this step.\n \"\"\"\n # We assume there are Dataset/IterableDataset, not it's ...Dict counterparts\n if self._dataset is None:\n self.load()\n\n return self._dataset.column_names\n
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromDisk.outputs","title":"outputs
property
","text":"The columns that will be generated by this step, based on the datasets from a file in disk.
Returns:
Type DescriptionList[str]
The columns that will be generated by this step.
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromDisk.load","title":"load()
","text":"Load the dataset from the file/s in disk.
Source code insrc/distilabel/steps/generators/huggingface.py
def load(self) -> None:\n \"\"\"Load the dataset from the file/s in disk.\"\"\"\n super(GeneratorStep, self).load()\n if self.is_distiset:\n ds = Distiset.load_from_disk(\n self.dataset_path,\n keep_in_memory=self.keep_in_memory,\n storage_options=self.storage_options,\n )\n if self.config not in ds.keys():\n raise DistilabelUserError(\n f\"Configuration '{self.config}' not found in the Distiset, available ones\"\n f\" are: {list(ds.keys())}. Please try changing the `config` parameter to one \"\n \"of the available configurations.\\n\\n\",\n page=\"sections/how_to_guides/advanced/distiset/#using-the-distiset-dataset-object\",\n )\n ds = ds[self.config]\n\n else:\n ds = load_from_disk(\n self.dataset_path,\n keep_in_memory=self.keep_in_memory,\n storage_options=self.storage_options,\n )\n\n if self.split:\n ds = ds[self.split]\n\n self._dataset = ds\n\n if self.num_examples:\n self._dataset = self._dataset.select(range(self.num_examples))\n else:\n self.num_examples = len(self._dataset)\n
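A minimal round-trip sketch (paths and column names are illustrative): save a plain `datasets.Dataset` to disk and feed it back into a new pipeline through `LoadDataFromDisk`:

```python
from datasets import Dataset

from distilabel.steps import LoadDataFromDisk

# Save a small dataset to disk...
Dataset.from_list(
    [{"instruction": f"Question {i}"} for i in range(10)]
).save_to_disk("my_dataset")

# ...and load it back in batches as a generator step.
loader = LoadDataFromDisk(dataset_path="my_dataset", batch_size=4)
loader.load()

batch, is_last_batch = next(loader.process())
# batch contains the first 4 rows and is_last_batch is False
```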
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromFileSystem","title":"LoadDataFromFileSystem
","text":" Bases: LoadDataFromHub
Loads a dataset from a file in your filesystem.
GeneratorStep
that creates a dataset from a file in the filesystem using the Hugging Face datasets
library. Take a look at Hugging Face Datasets for more information on the supported file types.
Attributes:
Name Type Descriptiondata_files
RuntimeParameter[Union[str, Path]]
The path to the file, or the directory containing the files that make up the dataset.
split
RuntimeParameter[str]
The split of the dataset to load (typically will be train
, test
or validation
).
batch_size
: The batch size to use when processing the data.data_files
: The path to the file, or the directory containing the files that make up the dataset.split
: The split of the dataset to load. Defaults to 'train'.streaming
: Whether to load the dataset in streaming mode or not. Defaults to False
.num_examples
: The number of examples to load from the dataset. By default will load all examples.storage_options
: Key/value pairs to be passed on to the file-system backend, if any. Defaults to None
.filetype
: The expected filetype. If not provided, it will be inferred from the file extension. For more than one file, it will be inferred from the first file.all
): The columns that will be generated by this step, based on the datasets loaded from the Hugging Face Hub.Examples:
Load data from a Hugging Face dataset in your file system:
from distilabel.steps import LoadDataFromFileSystem\n\nloader = LoadDataFromFileSystem(data_files=\"path/to/dataset.jsonl\")\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
Specify a filetype if the file extension is not expected:
from distilabel.steps import LoadDataFromFileSystem\n\nloader = LoadDataFromFileSystem(filetype=\"csv\", data_files=\"path/to/dataset.txtr\")\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
Load data from a file in your cloud provider:
from distilabel.steps import LoadDataFromFileSystem\n\nloader = LoadDataFromFileSystem(\n data_files=\"gcs://path/to/dataset\",\n storage_options={\"project\": \"experiments-0001\"}\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
Load data passing a glob pattern:
from distilabel.steps import LoadDataFromFileSystem\n\nloader = LoadDataFromFileSystem(\n data_files=\"path/to/dataset/*.jsonl\",\n streaming=True\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
Source code in src/distilabel/steps/generators/huggingface.py
class LoadDataFromFileSystem(LoadDataFromHub):\n \"\"\"Loads a dataset from a file in your filesystem.\n\n `GeneratorStep` that creates a dataset from a file in the filesystem, uses Hugging Face `datasets`\n library. Take a look at [Hugging Face Datasets](https://huggingface.co/docs/datasets/loading)\n for more information of the supported file types.\n\n Attributes:\n data_files: The path to the file, or directory containing the files that conform\n the dataset.\n split: The split of the dataset to load (typically will be `train`, `test` or `validation`).\n\n Runtime parameters:\n - `batch_size`: The batch size to use when processing the data.\n - `data_files`: The path to the file, or directory containing the files that conform\n the dataset.\n - `split`: The split of the dataset to load. Defaults to 'train'.\n - `streaming`: Whether to load the dataset in streaming mode or not. Defaults to\n `False`.\n - `num_examples`: The number of examples to load from the dataset.\n By default will load all examples.\n - `storage_options`: Key/value pairs to be passed on to the file-system backend, if any.\n Defaults to `None`.\n - `filetype`: The expected filetype. If not provided, it will be inferred from the file extension.\n For more than one file, it will be inferred from the first file.\n\n Output columns:\n - dynamic (`all`): The columns that will be generated by this step, based on the\n datasets loaded from the Hugging Face Hub.\n\n Categories:\n - load\n\n Examples:\n Load data from a Hugging Face dataset in your file system:\n\n ```python\n from distilabel.steps import LoadDataFromFileSystem\n\n loader = LoadDataFromFileSystem(data_files=\"path/to/dataset.jsonl\")\n loader.load()\n\n # Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\n result = next(loader.process())\n # >>> result\n # ([{'type': 'function', 'function':...', False)\n ```\n\n Specify a filetype if the file extension is not expected:\n\n ```python\n from distilabel.steps import LoadDataFromFileSystem\n\n loader = LoadDataFromFileSystem(filetype=\"csv\", data_files=\"path/to/dataset.txtr\")\n loader.load()\n\n # Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\n result = next(loader.process())\n # >>> result\n # ([{'type': 'function', 'function':...', False)\n ```\n\n Load data from a file in your cloud provider:\n\n ```python\n from distilabel.steps import LoadDataFromFileSystem\n\n loader = LoadDataFromFileSystem(\n data_files=\"gcs://path/to/dataset\",\n storage_options={\"project\": \"experiments-0001\"}\n )\n loader.load()\n\n # Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\n result = next(loader.process())\n # >>> result\n # ([{'type': 'function', 'function':...', False)\n ```\n\n Load data passing a glob pattern:\n\n ```python\n from distilabel.steps import LoadDataFromFileSystem\n\n loader = LoadDataFromFileSystem(\n data_files=\"path/to/dataset/*.jsonl\",\n streaming=True\n )\n loader.load()\n\n # Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\n result = next(loader.process())\n # >>> result\n # ([{'type': 'function', 'function':...', False)\n ```\n \"\"\"\n\n data_files: RuntimeParameter[Union[str, Path]] = Field(\n default=None,\n description=\"The data files, or directory containing the data files, to generate the dataset from.\",\n )\n filetype: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The expected filetype. 
If not provided, it will be inferred from the file extension.\",\n )\n repo_id: ExcludedField[Union[str, None]] = None\n\n def load(self) -> None:\n \"\"\"Load the dataset from the file/s in disk.\"\"\"\n GeneratorStep.load(self)\n\n data_path = UPath(self.data_files, storage_options=self.storage_options)\n\n (data_files, self.filetype) = self._prepare_data_files(data_path)\n\n self._dataset = load_dataset(\n self.filetype,\n data_files=data_files,\n split=self.split,\n streaming=self.streaming,\n storage_options=self.storage_options,\n )\n\n if not self.streaming and self.num_examples:\n self._dataset = self._dataset.select(range(self.num_examples))\n if not self.num_examples:\n if self.streaming:\n # There's no better way to get the number of examples in a streaming dataset,\n # load it again for the moment.\n self.num_examples = len(\n load_dataset(\n self.filetype, data_files=self.data_files, split=self.split\n )\n )\n else:\n self.num_examples = len(self._dataset)\n\n @staticmethod\n def _prepare_data_files( # noqa: C901\n data_path: UPath,\n ) -> Tuple[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]], str]:\n \"\"\"Prepare the loading process by setting the `data_files` attribute.\n\n Args:\n data_path: The path to the data files, or directory containing the data files.\n\n Returns:\n Tuple with the data files and the filetype.\n \"\"\"\n\n def get_filetype(data_path: UPath) -> str:\n filetype = data_path.suffix.lstrip(\".\")\n if filetype == \"jsonl\":\n filetype = \"json\"\n return filetype\n\n if data_path.is_file() or (\n len(str(data_path.parent.glob(data_path.name))) >= 1\n ):\n filetype = get_filetype(data_path)\n data_files = str(data_path)\n\n elif data_path.is_dir():\n file_sequence = []\n file_map = defaultdict(list)\n for file_or_folder in data_path.iterdir():\n if file_or_folder.is_file():\n file_sequence.append(str(file_or_folder))\n elif file_or_folder.is_dir():\n for file in file_or_folder.iterdir():\n file_sequence.append(str(file))\n file_map[str(file_or_folder)].append(str(file))\n\n data_files = file_sequence or file_map\n # Try to obtain the filetype from any of the files, assuming all files have the same type.\n if file_sequence:\n filetype = get_filetype(UPath(file_sequence[0]))\n else:\n filetype = get_filetype(UPath(file_map[list(file_map.keys())[0]][0]))\n return data_files, filetype\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The columns that will be generated by this step, based on the datasets from a file\n in disk.\n\n Returns:\n The columns that will be generated by this step.\n \"\"\"\n # We assume there are Dataset/IterableDataset, not it's ...Dict counterparts\n if self._dataset is None:\n self.load()\n\n return self._dataset.column_names\n
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromFileSystem.outputs","title":"outputs
property
","text":"The columns that will be generated by this step, based on the datasets from a file in disk.
Returns:
Type DescriptionList[str]
The columns that will be generated by this step.
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromFileSystem.load","title":"load()
","text":"Load the dataset from the file/s in disk.
Source code insrc/distilabel/steps/generators/huggingface.py
def load(self) -> None:\n \"\"\"Load the dataset from the file/s in disk.\"\"\"\n GeneratorStep.load(self)\n\n data_path = UPath(self.data_files, storage_options=self.storage_options)\n\n (data_files, self.filetype) = self._prepare_data_files(data_path)\n\n self._dataset = load_dataset(\n self.filetype,\n data_files=data_files,\n split=self.split,\n streaming=self.streaming,\n storage_options=self.storage_options,\n )\n\n if not self.streaming and self.num_examples:\n self._dataset = self._dataset.select(range(self.num_examples))\n if not self.num_examples:\n if self.streaming:\n # There's no better way to get the number of examples in a streaming dataset,\n # load it again for the moment.\n self.num_examples = len(\n load_dataset(\n self.filetype, data_files=self.data_files, split=self.split\n )\n )\n else:\n self.num_examples = len(self._dataset)\n
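A self-contained sketch (the file name and contents are illustrative): write a small JSONL file and let the step infer the `json` filetype from the `.jsonl` extension:

```python
import json

from distilabel.steps import LoadDataFromFileSystem

# Create a tiny JSONL file to load from.
with open("seed_data.jsonl", "w") as f:
    for i in range(6):
        f.write(json.dumps({"instruction": f"Question {i}"}) + "\n")

loader = LoadDataFromFileSystem(data_files="seed_data.jsonl", batch_size=3)
loader.load()

batch, is_last_batch = next(loader.process())
# batch contains the first 3 rows and is_last_batch is False
```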
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromHub","title":"LoadDataFromHub
","text":" Bases: GeneratorStep
Loads a dataset from the Hugging Face Hub.
GeneratorStep
that loads a dataset from the Hugging Face Hub using the datasets
library.
Attributes:
Name Type Descriptionrepo_id
RuntimeParameter[str]
The Hugging Face Hub repository ID of the dataset to load.
split
RuntimeParameter[str]
The split of the dataset to load.
config
Optional[RuntimeParameter[str]]
The configuration of the dataset to load. This is optional and only needed if the dataset has multiple configurations.
Runtime parametersbatch_size
: The batch size to use when processing the data.repo_id
: The Hugging Face Hub repository ID of the dataset to load.split
: The split of the dataset to load. Defaults to 'train'.config
: The configuration of the dataset to load. This is optional and only needed if the dataset has multiple configurations.revision
: The revision of the dataset to load. Defaults to the latest revision.streaming
: Whether to load the dataset in streaming mode or not. Defaults to False
.num_examples
: The number of examples to load from the dataset. By default will load all examples.storage_options
: Key/value pairs to be passed on to the file-system backend, if any. Defaults to None
.all
): The columns that will be generated by this step, based on the datasets loaded from the Hugging Face Hub.Examples:
Load data from a dataset in Hugging Face Hub:
from distilabel.steps import LoadDataFromHub\n\nloader = LoadDataFromHub(\n repo_id=\"distilabel-internal-testing/instruction-dataset-mini\",\n split=\"test\",\n batch_size=2\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'prompt': 'Arianna has 12...', False)\n
Source code in src/distilabel/steps/generators/huggingface.py
class LoadDataFromHub(GeneratorStep):\n \"\"\"Loads a dataset from the Hugging Face Hub.\n\n `GeneratorStep` that loads a dataset from the Hugging Face Hub using the `datasets`\n library.\n\n Attributes:\n repo_id: The Hugging Face Hub repository ID of the dataset to load.\n split: The split of the dataset to load.\n config: The configuration of the dataset to load. This is optional and only needed\n if the dataset has multiple configurations.\n\n Runtime parameters:\n - `batch_size`: The batch size to use when processing the data.\n - `repo_id`: The Hugging Face Hub repository ID of the dataset to load.\n - `split`: The split of the dataset to load. Defaults to 'train'.\n - `config`: The configuration of the dataset to load. This is optional and only\n needed if the dataset has multiple configurations.\n - `revision`: The revision of the dataset to load. Defaults to the latest revision.\n - `streaming`: Whether to load the dataset in streaming mode or not. Defaults to\n `False`.\n - `num_examples`: The number of examples to load from the dataset.\n By default will load all examples.\n - `storage_options`: Key/value pairs to be passed on to the file-system backend, if any.\n Defaults to `None`.\n\n Output columns:\n - dynamic (`all`): The columns that will be generated by this step, based on the\n datasets loaded from the Hugging Face Hub.\n\n Categories:\n - load\n\n Examples:\n Load data from a dataset in Hugging Face Hub:\n\n ```python\n from distilabel.steps import LoadDataFromHub\n\n loader = LoadDataFromHub(\n repo_id=\"distilabel-internal-testing/instruction-dataset-mini\",\n split=\"test\",\n batch_size=2\n )\n loader.load()\n\n # Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\n result = next(loader.process())\n # >>> result\n # ([{'prompt': 'Arianna has 12...', False)\n ```\n \"\"\"\n\n repo_id: RuntimeParameter[str] = Field(\n default=None,\n description=\"The Hugging Face Hub repository ID of the dataset to load.\",\n )\n split: RuntimeParameter[str] = Field(\n default=\"train\",\n description=\"The split of the dataset to load. Defaults to 'train'.\",\n )\n config: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The configuration of the dataset to load. This is optional and only\"\n \" needed if the dataset has multiple configurations.\",\n )\n revision: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The revision of the dataset to load. Defaults to the latest revision.\",\n )\n streaming: RuntimeParameter[bool] = Field(\n default=False,\n description=\"Whether to load the dataset in streaming mode or not. Defaults to False.\",\n )\n num_examples: Optional[RuntimeParameter[int]] = Field(\n default=None,\n description=\"The number of examples to load from the dataset. 
By default will load all examples.\",\n )\n storage_options: Optional[Dict[str, Any]] = Field(\n default=None,\n description=\"The storage options to use when loading the dataset.\",\n )\n\n _dataset: Union[IterableDataset, Dataset, None] = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Load the dataset from the Hugging Face Hub\"\"\"\n super().load()\n\n if self._dataset is not None:\n # Here to simplify the functionality of\u00a0distilabel.steps.generators.util.make_generator_step\n return\n\n self._dataset = load_dataset(\n self.repo_id, # type: ignore\n self.config,\n split=self.split,\n revision=self.revision,\n streaming=self.streaming,\n )\n num_examples = self._get_dataset_num_examples()\n self.num_examples = (\n min(self.num_examples, num_examples) if self.num_examples else num_examples\n )\n\n if not self.streaming:\n self._dataset = self._dataset.select(range(self.num_examples))\n\n def process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n \"\"\"Yields batches from the loaded dataset from the Hugging Face Hub.\n\n Args:\n offset: The offset to start yielding the data from. Will be used during the caching\n process to help skipping already processed data.\n\n Yields:\n A tuple containing a batch of rows and a boolean indicating if the batch is\n the last one.\n \"\"\"\n num_returned_rows = 0\n for batch_num, batch in enumerate(\n self._dataset.iter(batch_size=self.batch_size) # type: ignore\n ):\n if batch_num * self.batch_size < offset:\n continue\n transformed_batch = self._transform_batch(batch)\n batch_size = len(transformed_batch)\n num_returned_rows += batch_size\n yield transformed_batch, num_returned_rows >= self.num_examples\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The columns that will be generated by this step, based on the datasets loaded\n from the Hugging Face Hub.\n\n Returns:\n The columns that will be generated by this step.\n \"\"\"\n return self._get_dataset_columns()\n\n def _transform_batch(self, batch: Dict[str, Any]) -> List[Dict[str, Any]]:\n \"\"\"Transform a batch of data from the Hugging Face Hub into a list of rows.\n\n Args:\n batch: The batch of data from the Hugging Face Hub.\n\n Returns:\n A list of rows, where each row is a dictionary of column names and values.\n \"\"\"\n length = len(next(iter(batch.values())))\n rows = []\n for i in range(length):\n rows.append({col: values[i] for col, values in batch.items()})\n return rows\n\n def _get_dataset_num_examples(self) -> int:\n \"\"\"Get the number of examples in the dataset, based on the `split` and `config`\n runtime parameters provided.\n\n Returns:\n The number of examples in the dataset.\n \"\"\"\n default_config = self.config\n if not default_config:\n default_config = list(self._dataset_info.keys())[0]\n\n return self._dataset_info[default_config].splits[self.split].num_examples\n\n def _get_dataset_columns(self) -> List[str]:\n \"\"\"Get the columns of the dataset, based on the `config` runtime parameter provided.\n\n Returns:\n The columns of the dataset.\n \"\"\"\n return list(\n self._dataset_info[\n self.config if self.config else \"default\"\n ].features.keys()\n )\n\n @cached_property\n def _dataset_info(self) -> Dict[str, DatasetInfo]:\n \"\"\"Calls the Datasets Server API from Hugging Face to obtain the dataset information.\n\n Returns:\n The dataset information.\n \"\"\"\n\n try:\n return get_dataset_infos(self.repo_id)\n except Exception as e:\n warnings.warn(\n f\"Failed to get dataset info from Hugging Face Hub, trying to get it loading the dataset. 
Error: {e}\",\n UserWarning,\n stacklevel=2,\n )\n ds = load_dataset(self.repo_id, config=self.config, split=self.split)\n if self.config:\n return ds[self.config].info\n return ds.info\n
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromHub.outputs","title":"outputs
property
","text":"The columns that will be generated by this step, based on the datasets loaded from the Hugging Face Hub.
Returns:
Type DescriptionList[str]
The columns that will be generated by this step.
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromHub.load","title":"load()
","text":"Load the dataset from the Hugging Face Hub
Source code insrc/distilabel/steps/generators/huggingface.py
def load(self) -> None:\n \"\"\"Load the dataset from the Hugging Face Hub\"\"\"\n super().load()\n\n if self._dataset is not None:\n # Here to simplify the functionality of\u00a0distilabel.steps.generators.util.make_generator_step\n return\n\n self._dataset = load_dataset(\n self.repo_id, # type: ignore\n self.config,\n split=self.split,\n revision=self.revision,\n streaming=self.streaming,\n )\n num_examples = self._get_dataset_num_examples()\n self.num_examples = (\n min(self.num_examples, num_examples) if self.num_examples else num_examples\n )\n\n if not self.streaming:\n self._dataset = self._dataset.select(range(self.num_examples))\n
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.LoadDataFromHub.process","title":"process(offset=0)
","text":"Yields batches from the loaded dataset from the Hugging Face Hub.
Parameters:
Name Type Description Defaultoffset
int
The offset to start yielding the data from. It will be used during the caching process to help skip already processed data.
0
Yields:
Type DescriptionGeneratorStepOutput
A tuple containing a batch of rows and a boolean indicating if the batch is
GeneratorStepOutput
the last one.
Source code insrc/distilabel/steps/generators/huggingface.py
def process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n \"\"\"Yields batches from the loaded dataset from the Hugging Face Hub.\n\n Args:\n offset: The offset to start yielding the data from. Will be used during the caching\n process to help skipping already processed data.\n\n Yields:\n A tuple containing a batch of rows and a boolean indicating if the batch is\n the last one.\n \"\"\"\n num_returned_rows = 0\n for batch_num, batch in enumerate(\n self._dataset.iter(batch_size=self.batch_size) # type: ignore\n ):\n if batch_num * self.batch_size < offset:\n continue\n transformed_batch = self._transform_batch(batch)\n batch_size = len(transformed_batch)\n num_returned_rows += batch_size\n yield transformed_batch, num_returned_rows >= self.num_examples\n
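For orientation, here is a minimal usage sketch of this generator step (the repository id below is only a placeholder, and the arguments are the step's documented fields):
from distilabel.steps import LoadDataFromHub\n\nloader = LoadDataFromHub(\n    repo_id=\"your-username/your-dataset\",  # placeholder repository id\n    split=\"train\",\n    batch_size=2,\n    num_examples=4,\n)\nloader.load()\n\n# `process` yields tuples of (list_of_rows, is_last_batch)\nrows, is_last_batch = next(loader.process())\n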
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.PushToHub","title":"PushToHub
","text":" Bases: GlobalStep
Push data to a Hugging Face Hub dataset.
A GlobalStep
which creates a datasets.Dataset
with the input data and pushes it to the Hugging Face Hub.
Attributes:
Name Type Descriptionrepo_id
RuntimeParameter[str]
The Hugging Face Hub repository ID where the dataset will be uploaded.
split
RuntimeParameter[str]
The split of the dataset that will be pushed. Defaults to \"train\"
.
private
RuntimeParameter[bool]
Whether the dataset to be pushed should be private or not. Defaults to False
.
token
Optional[RuntimeParameter[str]]
The token that will be used to authenticate in the Hub. If not provided, it will be read from the environment variable HF_TOKEN
. If not provided using one of the previous methods, then huggingface_hub
library will try to use the token from the local Hugging Face CLI configuration. Defaults to None
.
repo_id
: The Hugging Face Hub repository ID where the dataset will be uploaded.split
: The split of the dataset that will be pushed.private
: Whether the dataset to be pushed should be private or not.token
: The token that will be used to authenticate in the Hub.all
): all columns from the input will be used to create the dataset.Examples:
Push batches of your dataset to the Hugging Face Hub repository:
from distilabel.steps import PushToHub\n\npush = PushToHub(repo_id=\"path_to/repo\")\npush.load()\n\nresult = next(\n push.process(\n [\n {\n \"instruction\": \"instruction \",\n \"generation\": \"generation\"\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction ', 'generation': 'generation'}]\n
Source code in src/distilabel/steps/globals/huggingface.py
class PushToHub(GlobalStep):\n \"\"\"Push data to a Hugging Face Hub dataset.\n\n A `GlobalStep` which creates a `datasets.Dataset` with the input data and pushes\n it to the Hugging Face Hub.\n\n Attributes:\n repo_id: The Hugging Face Hub repository ID where the dataset will be uploaded.\n split: The split of the dataset that will be pushed. Defaults to `\"train\"`.\n private: Whether the dataset to be pushed should be private or not. Defaults to\n `False`.\n token: The token that will be used to authenticate in the Hub. If not provided, the\n token will be tried to be obtained from the environment variable `HF_TOKEN`.\n If not provided using one of the previous methods, then `huggingface_hub` library\n will try to use the token from the local Hugging Face CLI configuration. Defaults\n to `None`.\n\n Runtime parameters:\n - `repo_id`: The Hugging Face Hub repository ID where the dataset will be uploaded.\n - `split`: The split of the dataset that will be pushed.\n - `private`: Whether the dataset to be pushed should be private or not.\n - `token`: The token that will be used to authenticate in the Hub.\n\n Input columns:\n - dynamic (`all`): all columns from the input will be used to create the dataset.\n\n Categories:\n - save\n - dataset\n - huggingface\n\n Examples:\n Push batches of your dataset to the Hugging Face Hub repository:\n\n ```python\n from distilabel.steps import PushToHub\n\n push = PushToHub(repo_id=\"path_to/repo\")\n push.load()\n\n result = next(\n push.process(\n [\n {\n \"instruction\": \"instruction \",\n \"generation\": \"generation\"\n }\n ],\n )\n )\n # >>> result\n # [{'instruction': 'instruction ', 'generation': 'generation'}]\n ```\n \"\"\"\n\n repo_id: RuntimeParameter[str] = Field(\n default=None,\n description=\"The Hugging Face Hub repository ID where the dataset will be uploaded.\",\n )\n split: RuntimeParameter[str] = Field(\n default=\"train\",\n description=\"The split of the dataset that will be pushed. Defaults to 'train'.\",\n )\n private: RuntimeParameter[bool] = Field(\n default=False,\n description=\"Whether the dataset to be pushed should be private or not. Defaults\"\n \" to `False`.\",\n )\n token: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The token that will be used to authenticate in the Hub. If not provided,\"\n \" the token will be tried to be obtained from the environment variable `HF_TOKEN`.\"\n \" If not provided using one of the previous methods, then `huggingface_hub` library\"\n \" will try to use the token from the local Hugging Face CLI configuration. 
Defaults\"\n \" to `None`\",\n )\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Method that processes the input data, respecting the `datasets.Dataset` formatting,\n and pushes it to the Hugging Face Hub based on the `RuntimeParameter`s attributes.\n\n Args:\n inputs: that input data within a single object (as it's a GlobalStep) that\n will be transformed into a `datasets.Dataset`.\n\n Yields:\n Propagates the received inputs so that the `Distiset` can be generated if this is\n the last step of the `Pipeline`, or if this is not a leaf step and has follow up\n steps.\n \"\"\"\n dataset_dict = defaultdict(list)\n for input in inputs:\n for key, value in input.items():\n dataset_dict[key].append(value)\n dataset_dict = dict(dataset_dict)\n dataset = Dataset.from_dict(dataset_dict)\n dataset.push_to_hub(\n self.repo_id, # type: ignore\n split=self.split,\n private=self.private,\n token=self.token or os.getenv(\"HF_TOKEN\"),\n )\n yield inputs\n
"},{"location":"api/step_gallery/hugging_face/#distilabel.steps.PushToHub.process","title":"process(inputs)
","text":"Method that processes the input data, respecting the datasets.Dataset
formatting, and pushes it to the Hugging Face Hub based on the RuntimeParameter
s attributes.
Parameters:
Name Type Description Defaultinputs
StepInput
the input data within a single object (as it's a GlobalStep) that will be transformed into a datasets.Dataset
.
Yields:
Type DescriptionStepOutput
Propagates the received inputs so that the Distiset
can be generated if this is
StepOutput
the last step of the Pipeline
, or if this is not a leaf step and has follow up
StepOutput
steps.
Source code insrc/distilabel/steps/globals/huggingface.py
def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Method that processes the input data, respecting the `datasets.Dataset` formatting,\n and pushes it to the Hugging Face Hub based on the `RuntimeParameter`s attributes.\n\n Args:\n inputs: that input data within a single object (as it's a GlobalStep) that\n will be transformed into a `datasets.Dataset`.\n\n Yields:\n Propagates the received inputs so that the `Distiset` can be generated if this is\n the last step of the `Pipeline`, or if this is not a leaf step and has follow up\n steps.\n \"\"\"\n dataset_dict = defaultdict(list)\n for input in inputs:\n for key, value in input.items():\n dataset_dict[key].append(value)\n dataset_dict = dict(dataset_dict)\n dataset = Dataset.from_dict(dataset_dict)\n dataset.push_to_hub(\n self.repo_id, # type: ignore\n split=self.split,\n private=self.private,\n token=self.token or os.getenv(\"HF_TOKEN\"),\n )\n yield inputs\n
"},{"location":"api/task/","title":"Task","text":"This section contains the API reference for the distilabel
tasks.
For more information on how the Task
works and see some examples, check the Tutorial - Task page.
base
","text":""},{"location":"api/task/#distilabel.steps.tasks.base._Task","title":"_Task
","text":" Bases: _Step
, ABC
_Task is an abstract class that implements the _Step
interface and adds the format_input
and format_output
methods to format the inputs and outputs of the task. It also adds an llm
attribute to be used as the LLM to generate the outputs.
Attributes:
Name Type Descriptionllm
LLM
the LLM
to be used to generate the outputs of the task.
group_generations
bool
whether to group the num_generations
generated per input in a list or create a row per generation. Defaults to False
.
add_raw_output
RuntimeParameter[bool]
whether to include a field with the raw output of the LLM in the distilabel_metadata
field of the output. Can be helpful to avoid losing data with Tasks
that need to format the output of the LLM
. Defaults to True
.
num_generations
RuntimeParameter[int]
The number of generations to be produced per input.
Source code insrc/distilabel/steps/tasks/base.py
class _Task(_Step, ABC):\n \"\"\"_Task is an abstract class that implements the `_Step` interface and adds the\n `format_input` and `format_output` methods to format the inputs and outputs of the\n task. It also adds a `llm` attribute to be used as the LLM to generate the outputs.\n\n Attributes:\n llm: the `LLM` to be used to generate the outputs of the task.\n group_generations: whether to group the `num_generations` generated per input in\n a list or create a row per generation. Defaults to `False`.\n add_raw_output: whether to include a field with the raw output of the LLM in the\n `distilabel_metadata` field of the output. Can be helpful to not loose data\n with `Tasks` that need to format the output of the `LLM`. Defaults to `False`.\n num_generations: The number of generations to be produced per input.\n \"\"\"\n\n llm: LLM\n\n group_generations: bool = False\n add_raw_output: RuntimeParameter[bool] = Field(\n default=True,\n description=(\n \"Whether to include the raw output of the LLM in the key `raw_output_<TASK_NAME>`\"\n \" of the `distilabel_metadata` dictionary output column\"\n ),\n )\n add_raw_input: RuntimeParameter[bool] = Field(\n default=True,\n description=(\n \"Whether to include the raw input of the LLM in the key `raw_input_<TASK_NAME>`\"\n \" of the `distilabel_metadata` dictionary column\"\n ),\n )\n num_generations: RuntimeParameter[int] = Field(\n default=1, description=\"The number of generations to be produced per input.\"\n )\n use_default_structured_output: bool = False\n\n _can_be_used_with_offline_batch_generation: bool = PrivateAttr(False)\n\n def model_post_init(self, __context: Any) -> None:\n if (\n self.llm.use_offline_batch_generation\n and not self._can_be_used_with_offline_batch_generation\n ):\n raise DistilabelUserError(\n f\"`{self.__class__.__name__}` task cannot be used with offline batch generation\"\n \" feature.\",\n page=\"sections/how_to_guides/advanced/offline-batch-generation\",\n )\n\n super().model_post_init(__context)\n\n @property\n def is_global(self) -> bool:\n \"\"\"Extends the `is_global` property to return `True` if the task is using the\n offline batch generation feature, otherwise it returns the value of the parent\n class property. `offline_batch_generation` requires to receive all the inputs\n at once, so for the `_BatchManager` this is a global step.\n\n Returns:\n Whether the task is a global step or not.\n \"\"\"\n if self.llm.use_offline_batch_generation:\n return True\n\n return super().is_global\n\n def load(self) -> None:\n \"\"\"Loads the LLM via the `LLM.load()` method.\"\"\"\n super().load()\n self._set_default_structured_output()\n self.llm.load()\n\n @override\n def unload(self) -> None:\n \"\"\"Unloads the LLM.\"\"\"\n self._logger.debug(\"Executing task unload logic.\")\n self.llm.unload()\n\n @override\n def impute_step_outputs(\n self, step_output: List[Dict[str, Any]]\n ) -> List[Dict[str, Any]]:\n \"\"\"\n Imputes the outputs of the task in case the LLM failed to generate a response.\n \"\"\"\n result = []\n for row in step_output:\n data = row.copy()\n for output in self.get_outputs().keys():\n data[output] = None\n data = self._create_metadata(\n data,\n None,\n None,\n add_raw_output=self.add_raw_output,\n add_raw_input=self.add_raw_input,\n )\n result.append(data)\n return result\n\n @abstractmethod\n def format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n ) -> Dict[str, Any]:\n \"\"\"Abstract method to format the outputs of the task. 
It needs to receive an output\n as a string, and generates a Python dictionary with the outputs of the task. In\n addition the `input` used to generate the output is also received just in case it's\n needed to be able to parse the output correctly.\n \"\"\"\n pass\n\n def _format_outputs(\n self,\n outputs: \"GenerateOutput\",\n input: Union[Dict[str, Any], None] = None,\n ) -> List[Dict[str, Any]]:\n \"\"\"Formats the outputs of the task using the `format_output` method. If the output\n is `None` (i.e. the LLM failed to generate a response), then the outputs will be\n set to `None` as well.\n\n Args:\n outputs: The outputs (`n` generations) for the provided `input`.\n input: The input used to generate the output.\n\n Returns:\n A list containing a dictionary with the outputs of the task for each input.\n \"\"\"\n inputs = [None] if input is None else [input]\n formatted_outputs = []\n repeate_inputs = len(outputs.get(\"generations\"))\n outputs = normalize_statistics(outputs)\n\n for (output, stats, extra), input in zip(\n iterate_generations_with_stats(outputs), inputs * repeate_inputs\n ): # type: ignore\n try:\n # Extract the generations, and move the statistics to the distilabel_metadata,\n # to keep everything clean\n formatted_output = self.format_output(output, input)\n formatted_output = self._create_metadata(\n output=formatted_output,\n raw_output=output,\n input=input,\n add_raw_output=self.add_raw_output, # type: ignore\n add_raw_input=self.add_raw_input, # type: ignore\n statistics=stats,\n )\n formatted_output = self._create_extra(\n output=formatted_output, extra=extra\n )\n formatted_outputs.append(formatted_output)\n except Exception as e:\n self._logger.warning( # type: ignore\n f\"Task '{self.name}' failed to format output: {e}. 
Saving raw response.\" # type: ignore\n )\n formatted_outputs.append(self._output_on_failure(output, input))\n return formatted_outputs\n\n def _output_on_failure(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n ) -> Dict[str, Any]:\n \"\"\"In case of failure to format the output, this method will return a dictionary including\n a new field `distilabel_meta` with the raw output of the LLM.\n \"\"\"\n # Create a dictionary with the outputs of the task (every output set to None)\n outputs = {output: None for output in self.outputs}\n outputs[\"model_name\"] = self.llm.model_name # type: ignore\n outputs = self._create_metadata(\n outputs,\n output,\n input,\n add_raw_output=self.add_raw_output, # type: ignore\n add_raw_input=self.add_raw_input, # type: ignore\n )\n return outputs\n\n def _create_metadata(\n self,\n output: Dict[str, Any],\n raw_output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n add_raw_output: bool = True,\n add_raw_input: bool = True,\n statistics: Optional[\"LLMStatistics\"] = None,\n ) -> Dict[str, Any]:\n \"\"\"Adds the raw output and or the formatted input of the LLM to the output dictionary\n if `add_raw_output` is True or `add_raw_input` is True.\n\n Args:\n output:\n The output dictionary after formatting the output from the LLM,\n to add the raw output and or raw input.\n raw_output: The raw output of the `LLM`.\n input: The input used to generate the output.\n add_raw_output: Whether to add the raw output to the output dictionary.\n add_raw_input: Whether to add the raw input to the output dictionary.\n statistics: The statistics generated by the LLM, which should contain at least\n the number of input and output tokens.\n \"\"\"\n meta = output.get(DISTILABEL_METADATA_KEY, {})\n\n if add_raw_output:\n meta[f\"raw_output_{self.name}\"] = raw_output\n\n if add_raw_input:\n meta[f\"raw_input_{self.name}\"] = self.format_input(input) if input else None\n\n if statistics:\n meta[f\"statistics_{self.name}\"] = statistics\n\n if meta:\n output[DISTILABEL_METADATA_KEY] = meta\n\n return output\n\n def _create_extra(\n self, output: Dict[str, Any], extra: Dict[str, Any]\n ) -> Dict[str, Any]:\n column_name_prefix = f\"llm_{self.name}_\"\n for key, value in extra.items():\n column_name = column_name_prefix + key\n output[column_name] = value\n return output\n\n def _set_default_structured_output(self) -> None:\n \"\"\"Prepares the structured output to be set in the selected `LLM`.\n\n If the method `get_structured_output` returns None (the default), there's no need\n to set anything, as it doesn't apply.\n If the `use_default_structured_output` and there's no previous structured output\n set by hand, then decide the type of structured output to select depending on the\n `LLM` provider.\n \"\"\"\n schema = self.get_structured_output()\n if not schema:\n return\n\n if self.use_default_structured_output and not self.llm.structured_output:\n # In case the default structured output is required, we have to set it before\n # the LLM is loaded\n from distilabel.models.llms import InferenceEndpointsLLM\n from distilabel.models.llms.base import AsyncLLM\n\n def check_dependency(module_name: str) -> None:\n if not importlib.util.find_spec(module_name):\n raise ImportError(\n f\"`{module_name}` is not installed and is needed for the structured generation with this LLM.\"\n f\" Please install it using `pip install {module_name}`.\"\n )\n\n dependency = \"outlines\"\n structured_output = {\"schema\": schema}\n if isinstance(self.llm, 
InferenceEndpointsLLM):\n structured_output.update({\"format\": \"json\"})\n # To determine instructor or outlines format\n elif isinstance(self.llm, AsyncLLM) and not isinstance(\n self.llm, InferenceEndpointsLLM\n ):\n dependency = \"instructor\"\n structured_output.update({\"format\": \"json\"})\n\n check_dependency(dependency)\n self.llm.structured_output = structured_output\n\n def get_structured_output(self) -> Union[Dict[str, Any], None]:\n \"\"\"Returns the structured output for a task that implements one by default,\n must be overriden by subclasses of `Task`. When implemented, should be a json\n schema that enforces the response from the LLM so that it's easier to parse.\n \"\"\"\n return None\n\n def _sample_input(self) -> \"ChatType\":\n \"\"\"Returns a sample input to be used in the `print` method.\n Tasks that don't adhere to a format input that returns a map of the type\n str -> str should override this method to return a sample input.\n \"\"\"\n return self.format_input(\n {input: f\"<PLACEHOLDER_{input.upper()}>\" for input in self.inputs}\n )\n\n def print(self, sample_input: Optional[\"ChatType\"] = None) -> None:\n \"\"\"Prints a sample input to the console using the `rich` library.\n Helper method to visualize the prompt of the task.\n\n Args:\n sample_input: A sample input to be printed. If not provided, a default will be\n generated using the `_sample_input` method, which can be overriden by\n subclasses. This should correspond to the same example you could pass to\n the `format_input` method.\n The variables be named <PLACEHOLDER_VARIABLE_NAME> by default.\n\n Examples:\n Print the URIAL prompt:\n\n ```python\n from distilabel.steps.tasks import URIAL\n from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n urial = URIAL(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n )\n urial.load()\n urial.print()\n \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt: URIAL \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n \u2502 \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 User Message \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e \u2502\n \u2502 \u2502 # Instruction \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 Below is a list of conversations between a human and an AI assistant (you). \u2502 \u2502\n \u2502 \u2502 Users place their queries under \"# User:\", and your responses are under \"# Assistant:\". \u2502 \u2502\n \u2502 \u2502 You are a helpful, respectful, and honest assistant. \u2502 \u2502\n \u2502 \u2502 You should always answer as helpfully as possible while ensuring safety. 
\u2502 \u2502\n \u2502 \u2502 Your answers should be well-structured and provide detailed information. They should also \u2502 \u2502\n \u2502 \u2502 have an engaging tone. \u2502 \u2502\n \u2502 \u2502 Your responses must not contain any fake, harmful, unethical, racist, sexist, toxic, \u2502 \u2502\n \u2502 \u2502 dangerous, or illegal content, even if it may be helpful. \u2502 \u2502\n \u2502 \u2502 Your response must be socially responsible, and thus you can refuse to answer some \u2502 \u2502\n \u2502 \u2502 controversial topics. \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 # User: \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 <PLACEHOLDER_INSTRUCTION> \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 # Assistant: \u2502 \u2502\n \u2502 \u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f \u2502\n \u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n ```\n \"\"\"\n from rich.console import Console, Group\n from rich.panel import Panel\n from rich.text import Text\n\n console = Console()\n sample_input = sample_input or self._sample_input()\n\n panels = []\n for item in sample_input:\n content = Text.assemble((item.get(\"content\", \"\"),))\n panel = Panel(\n content,\n title=f\"[bold][magenta]{item.get('role', '').capitalize()} Message[/magenta][/bold]\",\n border_style=\"light_cyan3\",\n )\n panels.append(panel)\n\n # Create a group of panels\n # Wrap the group in an outer panel\n outer_panel = Panel(\n Group(*panels),\n title=f\"[bold][magenta]Prompt: {type(self).__name__} [/magenta][/bold]\",\n border_style=\"light_cyan3\",\n expand=False,\n )\n console.print(outer_panel)\n
"},{"location":"api/task/#distilabel.steps.tasks.base._Task.is_global","title":"is_global
property
","text":"Extends the is_global
property to return True
if the task is using the offline batch generation feature; otherwise it returns the value of the parent class property. offline_batch_generation
requires receiving all the inputs at once, so for the _BatchManager
this is a global step.
Returns:
Type Descriptionbool
Whether the task is a global step or not.
"},{"location":"api/task/#distilabel.steps.tasks.base._Task.load","title":"load()
","text":"Loads the LLM via the LLM.load()
method.
src/distilabel/steps/tasks/base.py
def load(self) -> None:\n \"\"\"Loads the LLM via the `LLM.load()` method.\"\"\"\n super().load()\n self._set_default_structured_output()\n self.llm.load()\n
"},{"location":"api/task/#distilabel.steps.tasks.base._Task.unload","title":"unload()
","text":"Unloads the LLM.
Source code insrc/distilabel/steps/tasks/base.py
@override\ndef unload(self) -> None:\n \"\"\"Unloads the LLM.\"\"\"\n self._logger.debug(\"Executing task unload logic.\")\n self.llm.unload()\n
"},{"location":"api/task/#distilabel.steps.tasks.base._Task.impute_step_outputs","title":"impute_step_outputs(step_output)
","text":"Imputes the outputs of the task in case the LLM failed to generate a response.
Source code insrc/distilabel/steps/tasks/base.py
@override\ndef impute_step_outputs(\n self, step_output: List[Dict[str, Any]]\n) -> List[Dict[str, Any]]:\n \"\"\"\n Imputes the outputs of the task in case the LLM failed to generate a response.\n \"\"\"\n result = []\n for row in step_output:\n data = row.copy()\n for output in self.get_outputs().keys():\n data[output] = None\n data = self._create_metadata(\n data,\n None,\n None,\n add_raw_output=self.add_raw_output,\n add_raw_input=self.add_raw_input,\n )\n result.append(data)\n return result\n
"},{"location":"api/task/#distilabel.steps.tasks.base._Task.format_output","title":"format_output(output, input=None)
abstractmethod
","text":"Abstract method to format the outputs of the task. It needs to receive an output as a string, and generates a Python dictionary with the outputs of the task. In addition the input
used to generate the output is also received just in case it's needed to be able to parse the output correctly.
src/distilabel/steps/tasks/base.py
@abstractmethod\ndef format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n) -> Dict[str, Any]:\n \"\"\"Abstract method to format the outputs of the task. It needs to receive an output\n as a string, and generates a Python dictionary with the outputs of the task. In\n addition the `input` used to generate the output is also received just in case it's\n needed to be able to parse the output correctly.\n \"\"\"\n pass\n
"},{"location":"api/task/#distilabel.steps.tasks.base._Task.get_structured_output","title":"get_structured_output()
","text":"Returns the structured output for a task that implements one by default, must be overriden by subclasses of Task
. When implemented, it should be a JSON schema that enforces the response from the LLM so that it's easier to parse.
src/distilabel/steps/tasks/base.py
def get_structured_output(self) -> Union[Dict[str, Any], None]:\n \"\"\"Returns the structured output for a task that implements one by default,\n must be overriden by subclasses of `Task`. When implemented, should be a json\n schema that enforces the response from the LLM so that it's easier to parse.\n \"\"\"\n return None\n
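For illustration only (the subclass and schema fields below are hypothetical, not part of the library), an override could return a plain JSON schema so the LLM response becomes easier to parse:
from typing import Any, Dict, Union\n\nfrom distilabel.steps.tasks import Task\n\n\nclass MyJudgeTask(Task):  # hypothetical subclass; format_input/format_output omitted for brevity\n    use_default_structured_output: bool = True\n\n    def get_structured_output(self) -> Union[Dict[str, Any], None]:\n        # A JSON schema asking for a rating plus a short rationale.\n        return {\n            \"type\": \"object\",\n            \"properties\": {\n                \"rating\": {\"type\": \"integer\"},\n                \"rationale\": {\"type\": \"string\"},\n            },\n            \"required\": [\"rating\", \"rationale\"],\n        }\n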
"},{"location":"api/task/#distilabel.steps.tasks.base._Task.print","title":"print(sample_input=None)
","text":"Prints a sample input to the console using the rich
library. Helper method to visualize the prompt of the task.
Parameters:
Name Type Description Defaultsample_input
Optional[ChatType]
A sample input to be printed. If not provided, a default will be generated using the _sample_input
method, which can be overridden by subclasses. This should correspond to the same example you could pass to the format_input
method. The variables will be named <PLACEHOLDER_VARIABLE_NAME> by default. None
Examples:
Print the URIAL prompt:
from distilabel.steps.tasks import URIAL\nfrom distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nurial = URIAL(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n)\nurial.load()\nurial.print()\n\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt: URIAL \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 User Message \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e \u2502\n\u2502 \u2502 # Instruction \u2502 \u2502\n\u2502 \u2502 \u2502 \u2502\n\u2502 \u2502 Below is a list of conversations between a human and an AI assistant (you). \u2502 \u2502\n\u2502 \u2502 Users place their queries under \"# User:\", and your responses are under \"# Assistant:\". \u2502 \u2502\n\u2502 \u2502 You are a helpful, respectful, and honest assistant. \u2502 \u2502\n\u2502 \u2502 You should always answer as helpfully as possible while ensuring safety. \u2502 \u2502\n\u2502 \u2502 Your answers should be well-structured and provide detailed information. They should also \u2502 \u2502\n\u2502 \u2502 have an engaging tone. \u2502 \u2502\n\u2502 \u2502 Your responses must not contain any fake, harmful, unethical, racist, sexist, toxic, \u2502 \u2502\n\u2502 \u2502 dangerous, or illegal content, even if it may be helpful. \u2502 \u2502\n\u2502 \u2502 Your response must be socially responsible, and thus you can refuse to answer some \u2502 \u2502\n\u2502 \u2502 controversial topics. 
\u2502 \u2502\n\u2502 \u2502 \u2502 \u2502\n\u2502 \u2502 \u2502 \u2502\n\u2502 \u2502 # User: \u2502 \u2502\n\u2502 \u2502 \u2502 \u2502\n\u2502 \u2502 <PLACEHOLDER_INSTRUCTION> \u2502 \u2502\n\u2502 \u2502 \u2502 \u2502\n\u2502 \u2502 # Assistant: \u2502 \u2502\n\u2502 \u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
Source code in src/distilabel/steps/tasks/base.py
def print(self, sample_input: Optional[\"ChatType\"] = None) -> None:\n \"\"\"Prints a sample input to the console using the `rich` library.\n Helper method to visualize the prompt of the task.\n\n Args:\n sample_input: A sample input to be printed. If not provided, a default will be\n generated using the `_sample_input` method, which can be overriden by\n subclasses. This should correspond to the same example you could pass to\n the `format_input` method.\n The variables be named <PLACEHOLDER_VARIABLE_NAME> by default.\n\n Examples:\n Print the URIAL prompt:\n\n ```python\n from distilabel.steps.tasks import URIAL\n from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n urial = URIAL(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n )\n urial.load()\n urial.print()\n \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt: URIAL \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n \u2502 \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 User Message \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e \u2502\n \u2502 \u2502 # Instruction \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 Below is a list of conversations between a human and an AI assistant (you). \u2502 \u2502\n \u2502 \u2502 Users place their queries under \"# User:\", and your responses are under \"# Assistant:\". \u2502 \u2502\n \u2502 \u2502 You are a helpful, respectful, and honest assistant. \u2502 \u2502\n \u2502 \u2502 You should always answer as helpfully as possible while ensuring safety. \u2502 \u2502\n \u2502 \u2502 Your answers should be well-structured and provide detailed information. They should also \u2502 \u2502\n \u2502 \u2502 have an engaging tone. \u2502 \u2502\n \u2502 \u2502 Your responses must not contain any fake, harmful, unethical, racist, sexist, toxic, \u2502 \u2502\n \u2502 \u2502 dangerous, or illegal content, even if it may be helpful. \u2502 \u2502\n \u2502 \u2502 Your response must be socially responsible, and thus you can refuse to answer some \u2502 \u2502\n \u2502 \u2502 controversial topics. 
\u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 # User: \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 <PLACEHOLDER_INSTRUCTION> \u2502 \u2502\n \u2502 \u2502 \u2502 \u2502\n \u2502 \u2502 # Assistant: \u2502 \u2502\n \u2502 \u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f \u2502\n \u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n ```\n \"\"\"\n from rich.console import Console, Group\n from rich.panel import Panel\n from rich.text import Text\n\n console = Console()\n sample_input = sample_input or self._sample_input()\n\n panels = []\n for item in sample_input:\n content = Text.assemble((item.get(\"content\", \"\"),))\n panel = Panel(\n content,\n title=f\"[bold][magenta]{item.get('role', '').capitalize()} Message[/magenta][/bold]\",\n border_style=\"light_cyan3\",\n )\n panels.append(panel)\n\n # Create a group of panels\n # Wrap the group in an outer panel\n outer_panel = Panel(\n Group(*panels),\n title=f\"[bold][magenta]Prompt: {type(self).__name__} [/magenta][/bold]\",\n border_style=\"light_cyan3\",\n expand=False,\n )\n console.print(outer_panel)\n
"},{"location":"api/task/#distilabel.steps.tasks.base.Task","title":"Task
","text":" Bases: _Task
, Step
Task is a class that implements the _Task
abstract class and adds the Step
interface to be used as a step in the pipeline.
Attributes:
Name Type Descriptionllm
LLM
the LLM
to be used to generate the outputs of the task.
group_generations
bool
whether to group the num_generations
generated per input in a list or create a row per generation. Defaults to False
.
num_generations
RuntimeParameter[int]
The number of generations to be produced per input.
Source code insrc/distilabel/steps/tasks/base.py
class Task(_Task, Step):\n \"\"\"Task is a class that implements the `_Task` abstract class and adds the `Step`\n interface to be used as a step in the pipeline.\n\n Attributes:\n llm: the `LLM` to be used to generate the outputs of the task.\n group_generations: whether to group the `num_generations` generated per input in\n a list or create a row per generation. Defaults to `False`.\n num_generations: The number of generations to be produced per input.\n \"\"\"\n\n @abstractmethod\n def format_input(self, input: Dict[str, Any]) -> \"FormattedInput\":\n \"\"\"Abstract method to format the inputs of the task. It needs to receive an input\n as a Python dictionary, and generates an OpenAI chat-like list of dicts.\"\"\"\n pass\n\n def _format_inputs(self, inputs: List[Dict[str, Any]]) -> List[\"FormattedInput\"]:\n \"\"\"Formats the inputs of the task using the `format_input` method.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list containing the formatted inputs, which are `ChatType`-like following\n the OpenAI formatting.\n \"\"\"\n return [self.format_input(input) for input in inputs]\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Processes the inputs of the task and generates the outputs using the LLM.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n\n formatted_inputs = self._format_inputs(inputs)\n\n # `outputs` is a dict containing the LLM outputs in the `generations`\n # key and the statistics in the `statistics` key\n outputs = self.llm.generate_outputs(\n inputs=formatted_inputs,\n num_generations=self.num_generations, # type: ignore\n **self.llm.get_generation_kwargs(), # type: ignore\n )\n task_outputs = []\n for input, input_outputs in zip(inputs, outputs):\n formatted_outputs = self._format_outputs(input_outputs, input)\n\n if self.group_generations:\n combined = group_dicts(*formatted_outputs)\n task_outputs.append(\n {**input, **combined, \"model_name\": self.llm.model_name}\n )\n continue\n\n # Create a row per generation\n for formatted_output in formatted_outputs:\n task_outputs.append(\n {**input, **formatted_output, \"model_name\": self.llm.model_name}\n )\n\n yield task_outputs\n
"},{"location":"api/task/#distilabel.steps.tasks.base.Task.format_input","title":"format_input(input)
abstractmethod
","text":"Abstract method to format the inputs of the task. It needs to receive an input as a Python dictionary, and generates an OpenAI chat-like list of dicts.
Source code insrc/distilabel/steps/tasks/base.py
@abstractmethod\ndef format_input(self, input: Dict[str, Any]) -> \"FormattedInput\":\n \"\"\"Abstract method to format the inputs of the task. It needs to receive an input\n as a Python dictionary, and generates an OpenAI chat-like list of dicts.\"\"\"\n pass\n
"},{"location":"api/task/#distilabel.steps.tasks.base.Task.process","title":"process(inputs)
","text":"Processes the inputs of the task and generates the outputs using the LLM.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of Python dictionaries with the inputs of the task.
requiredYields:
Type DescriptionStepOutput
A list of Python dictionaries with the outputs of the task.
Source code insrc/distilabel/steps/tasks/base.py
def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Processes the inputs of the task and generates the outputs using the LLM.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n\n formatted_inputs = self._format_inputs(inputs)\n\n # `outputs` is a dict containing the LLM outputs in the `generations`\n # key and the statistics in the `statistics` key\n outputs = self.llm.generate_outputs(\n inputs=formatted_inputs,\n num_generations=self.num_generations, # type: ignore\n **self.llm.get_generation_kwargs(), # type: ignore\n )\n task_outputs = []\n for input, input_outputs in zip(inputs, outputs):\n formatted_outputs = self._format_outputs(input_outputs, input)\n\n if self.group_generations:\n combined = group_dicts(*formatted_outputs)\n task_outputs.append(\n {**input, **combined, \"model_name\": self.llm.model_name}\n )\n continue\n\n # Create a row per generation\n for formatted_output in formatted_outputs:\n task_outputs.append(\n {**input, **formatted_output, \"model_name\": self.llm.model_name}\n )\n\n yield task_outputs\n
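To make the flow above concrete, here is a hedged sketch of a custom subclass (the task name, prompt and column names are illustrative, not part of the library):
from typing import Any, Dict, List, Union\n\nfrom distilabel.steps.tasks import Task\n\n\nclass SummarizationTask(Task):  # hypothetical example task\n    @property\n    def inputs(self) -> List[str]:\n        return [\"text\"]\n\n    @property\n    def outputs(self) -> List[str]:\n        return [\"summary\", \"model_name\"]\n\n    def format_input(self, input: Dict[str, Any]) -> List[Dict[str, str]]:\n        # Build the OpenAI chat-like list of dicts that the LLM expects.\n        return [{\"role\": \"user\", \"content\": f\"Summarize the following text: {input['text']}\"}]\n\n    def format_output(\n        self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n    ) -> Dict[str, Any]:\n        # Map the raw LLM string back to the task's output column.\n        return {\"summary\": output}\n
It is then used like any other Task: instantiate it with an llm, call load(), and pass a list of input dictionaries to process().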
"},{"location":"api/task/generator_task/","title":"GeneratorTask","text":"This section contains the API reference for the distilabel
generator tasks.
For more information on how the GeneratorTask
works and see some examples, check the Tutorial - Task - GeneratorTask page.
GeneratorTask
","text":" Bases: _Task
, GeneratorStep
GeneratorTask
is a class that implements the _Task
abstract class and adds the GeneratorStep
interface to be used as a step in the pipeline.
Attributes:
Name Type Descriptionllm
LLM
the LLM
to be used to generate the outputs of the task.
group_generations
bool
whether to group the num_generations
generated per input in a list or create a row per generation. Defaults to False
.
num_generations
RuntimeParameter[int]
The number of generations to be produced per input.
Source code insrc/distilabel/steps/tasks/base.py
class GeneratorTask(_Task, GeneratorStep):\n \"\"\"`GeneratorTask` is a class that implements the `_Task` abstract class and adds the\n `GeneratorStep` interface to be used as a step in the pipeline.\n\n Attributes:\n llm: the `LLM` to be used to generate the outputs of the task.\n group_generations: whether to group the `num_generations` generated per input in\n a list or create a row per generation. Defaults to `False`.\n num_generations: The number of generations to be produced per input.\n \"\"\"\n\n pass\n
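Because the class body above is empty, subclasses supply the generation logic themselves. A minimal, hypothetical sketch (the field and column names are assumptions; it reuses the llm.generate_outputs call shown in Task.process) could look like this:
from typing import Any, Dict, List, Union\n\nfrom distilabel.steps.tasks.base import GeneratorTask\n\n\nclass InstructionSampler(GeneratorTask):  # hypothetical example\n    instruction: str\n\n    @property\n    def outputs(self) -> List[str]:\n        return [\"generation\", \"model_name\"]\n\n    def format_output(\n        self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n    ) -> Dict[str, Any]:\n        return {\"generation\": output}\n\n    def process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n        # Ask the LLM for `num_generations` completions of the fixed instruction.\n        outputs = self.llm.generate_outputs(\n            inputs=[[{\"role\": \"user\", \"content\": self.instruction}]],\n            num_generations=self.num_generations,\n        )\n        rows = [\n            {\"generation\": generation, \"model_name\": self.llm.model_name}\n            for generation in outputs[0][\"generations\"]\n        ]\n        yield rows, True\n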
"},{"location":"api/task/task_gallery/","title":"Task Gallery","text":"This section contains the existing Task
subclasses implemented in distilabel
.
tasks
","text":""},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenExecutionChecker","title":"APIGenExecutionChecker
","text":" Bases: Step
Executes the generated function calls.
This step checks if a given answer from a model as generated by APIGenGenerator
can be executed against the given library (given by libpath
, which is a string pointing to a python .py file with functions).
Attributes:
Name Type Descriptionlibpath
str
The path to the library where we will retrieve the functions. It can also point to a folder with the functions. In this case, the folder layout should be a folder with .py files, each containing a single function, the name of the function being the same as the filename.
check_is_dangerous
bool
Bool to exclude some potentially dangerous functions, it contains some heuristics found while testing. This functions can run subprocesses, deal with the OS, or have other potentially dangerous operations. Defaults to True.
Input columnsstr
): List with arguments to be passed to the function, dumped as a string from a list of dictionaries. Should be loaded using json.loads
.bool
): Whether the function should be kept or not.str
): The result from executing the function.Examples:
Execute a function from a given library with the answer from an LLM:
from distilabel.steps.tasks import APIGenExecutionChecker\n\n# For the libpath you can use as an example the file at the tests folder:\n# ../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py\ntask = APIGenExecutionChecker(\n libpath=\"../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py\",\n)\ntask.load()\n\nres = next(\n task.process(\n [\n {\n \"answers\": [\n {\n \"arguments\": {\n \"initial_velocity\": 0.2,\n \"acceleration\": 0.1,\n \"time\": 0.5,\n },\n \"name\": \"final_velocity\",\n }\n ],\n }\n ]\n )\n)\nres\n#[{'answers': [{'arguments': {'initial_velocity': 0.2, 'acceleration': 0.1, 'time': 0.5}, 'name': 'final_velocity'}], 'keep_row_after_execution_check': True, 'execution_result': ['0.25']}]\n
Source code in src/distilabel/steps/tasks/apigen/execution_checker.py
class APIGenExecutionChecker(Step):\n \"\"\"Executes the generated function calls.\n\n This step checks if a given answer from a model as generated by `APIGenGenerator`\n can be executed against the given library (given by `libpath`, which is a string\n pointing to a python .py file with functions).\n\n Attributes:\n libpath: The path to the library where we will retrieve the functions.\n It can also point to a folder with the functions. In this case, the folder\n layout should be a folder with .py files, each containing a single function,\n the name of the function being the same as the filename.\n check_is_dangerous: Bool to exclude some potentially dangerous functions, it contains\n some heuristics found while testing. This functions can run subprocesses, deal with\n the OS, or have other potentially dangerous operations. Defaults to True.\n\n Input columns:\n - answers (`str`): List with arguments to be passed to the function,\n dumped as a string from a list of dictionaries. Should be loaded using\n `json.loads`.\n\n Output columns:\n - keep_row_after_execution_check (`bool`): Whether the function should be kept or not.\n - execution_result (`str`): The result from executing the function.\n\n Categories:\n - filtering\n - execution\n\n References:\n - [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518)\n - [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)\n\n Examples:\n Execute a function from a given library with the answer from an LLM:\n\n ```python\n from distilabel.steps.tasks import APIGenExecutionChecker\n\n # For the libpath you can use as an example the file at the tests folder:\n # ../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py\n task = APIGenExecutionChecker(\n libpath=\"../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py\",\n )\n task.load()\n\n res = next(\n task.process(\n [\n {\n \"answers\": [\n {\n \"arguments\": {\n \"initial_velocity\": 0.2,\n \"acceleration\": 0.1,\n \"time\": 0.5,\n },\n \"name\": \"final_velocity\",\n }\n ],\n }\n ]\n )\n )\n res\n #[{'answers': [{'arguments': {'initial_velocity': 0.2, 'acceleration': 0.1, 'time': 0.5}, 'name': 'final_velocity'}], 'keep_row_after_execution_check': True, 'execution_result': ['0.25']}]\n ```\n \"\"\"\n\n libpath: str = Field(\n default=...,\n description=(\n \"The path to the library where we will retrieve the functions, \"\n \"or a folder with python files named the same as the functions they contain.\",\n ),\n )\n check_is_dangerous: bool = Field(\n default=True,\n description=(\n \"Bool to exclude some potentially dangerous functions, it contains \"\n \"some heuristics found while testing. 
This functions can run subprocesses, \"\n \"deal with the OS, or have other potentially dangerous operations.\",\n ),\n )\n\n _toolbox: Union[\"ModuleType\", None] = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Loads the library where the functions will be extracted from.\"\"\"\n super().load()\n if Path(self.libpath).suffix == \".py\":\n self._toolbox = load_module_from_path(self.libpath)\n\n def unload(self) -> None:\n self._toolbox = None\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The inputs for the task are those found in the original dataset.\"\"\"\n return [\"answers\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The outputs are the columns required by `APIGenGenerator` task.\"\"\"\n return [\"keep_row_after_execution_check\", \"execution_result\"]\n\n def _get_function(self, function_name: str) -> Callable:\n \"\"\"Retrieves the function from the toolbox.\n\n Args:\n function_name: The name of the function to retrieve.\n\n Returns:\n Callable: The function to be executed.\n \"\"\"\n if self._toolbox:\n return getattr(self._toolbox, function_name, None)\n try:\n toolbox = load_module_from_path(\n str(Path(self.libpath) / f\"{function_name}.py\")\n )\n return getattr(toolbox, function_name, None)\n except FileNotFoundError:\n return None\n except Exception as e:\n self._logger.warning(f\"Error loading function '{function_name}': {e}\")\n return None\n\n def _is_dangerous(self, function: Callable) -> bool:\n \"\"\"Checks if a function is dangerous to remove it.\n Contains a list of heuristics to avoid executing possibly dangerous functions.\n \"\"\"\n source_code = inspect.getsource(function)\n # We don't want to execute functions that use subprocess\n if (\n (\"subprocess.\" in source_code)\n or (\"os.system(\" in source_code)\n or (\"input(\" in source_code)\n # Avoiding threading\n or (\"threading.Thread(\" in source_code)\n or (\"exec(\" in source_code)\n # Avoiding argparse (not sure why)\n or (\"argparse.ArgumentParser(\" in source_code)\n # Avoiding logging changing the levels to not mess with the logs\n or (\".setLevel(\" in source_code)\n # Don't run a test battery\n or (\"unittest.main(\" in source_code)\n # Avoid exiting the program\n or (\"sys.exit(\" in source_code)\n or (\"exit(\" in source_code)\n or (\"raise SystemExit(\" in source_code)\n or (\"multiprocessing.Pool(\" in source_code)\n ):\n return True\n return False\n\n @override\n def process(self, inputs: StepInput) -> \"StepOutput\":\n \"\"\"Checks the answer to see if it can be executed.\n Captures the possible errors and returns them.\n\n If a single example is provided, it is copied to avoid raising an error.\n\n Args:\n inputs: A list of dictionaries with the input data.\n\n Yields:\n A list of dictionaries with the output data.\n \"\"\"\n for input in inputs:\n output = []\n if input[\"answers\"]:\n answers = json.loads(input[\"answers\"])\n else:\n input.update(\n **{\n \"keep_row_after_execution_check\": False,\n \"execution_result\": [\"No answers were provided.\"],\n }\n )\n continue\n for answer in answers:\n if answer is None:\n output.append(\n {\n \"keep\": False,\n \"execution_result\": \"Nothing was generated for this answer.\",\n }\n )\n continue\n\n function_name = answer.get(\"name\", None)\n arguments = answer.get(\"arguments\", None)\n\n self._logger.debug(\n f\"Executing function '{function_name}' with arguments: {arguments}\"\n )\n function = self._get_function(function_name)\n\n if self.check_is_dangerous:\n if function and 
self._is_dangerous(function):\n function = None\n\n if function is None:\n output.append(\n {\n \"keep\": False,\n \"execution_result\": f\"Function '{function_name}' not found.\",\n }\n )\n else:\n execution = execute_from_response(function, arguments)\n output.append(\n {\n \"keep\": execution[\"keep\"],\n \"execution_result\": execution[\"execution_result\"],\n }\n )\n # We only consider a good response if all the answers were executed successfully,\n # but keep the reasons for further review if needed.\n input.update(\n **{\n \"keep_row_after_execution_check\": all(\n o[\"keep\"] is True for o in output\n ),\n \"execution_result\": [o[\"execution_result\"] for o in output],\n }\n )\n\n yield inputs\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenExecutionChecker.inputs","title":"inputs
property
","text":"The inputs for the task are those found in the original dataset.
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenExecutionChecker.outputs","title":"outputs
property
","text":"The outputs are the columns required by APIGenGenerator
task.
load()
","text":"Loads the library where the functions will be extracted from.
Source code insrc/distilabel/steps/tasks/apigen/execution_checker.py
def load(self) -> None:\n \"\"\"Loads the library where the functions will be extracted from.\"\"\"\n super().load()\n if Path(self.libpath).suffix == \".py\":\n self._toolbox = load_module_from_path(self.libpath)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenExecutionChecker._get_function","title":"_get_function(function_name)
","text":"Retrieves the function from the toolbox.
Parameters:
Name Type Description Defaultfunction_name
str
The name of the function to retrieve.
requiredReturns:
Name Type DescriptionCallable
Callable
The function to be executed.
Source code insrc/distilabel/steps/tasks/apigen/execution_checker.py
def _get_function(self, function_name: str) -> Callable:\n \"\"\"Retrieves the function from the toolbox.\n\n Args:\n function_name: The name of the function to retrieve.\n\n Returns:\n Callable: The function to be executed.\n \"\"\"\n if self._toolbox:\n return getattr(self._toolbox, function_name, None)\n try:\n toolbox = load_module_from_path(\n str(Path(self.libpath) / f\"{function_name}.py\")\n )\n return getattr(toolbox, function_name, None)\n except FileNotFoundError:\n return None\n except Exception as e:\n self._logger.warning(f\"Error loading function '{function_name}': {e}\")\n return None\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenExecutionChecker._is_dangerous","title":"_is_dangerous(function)
","text":"Checks if a function is dangerous to remove it. Contains a list of heuristics to avoid executing possibly dangerous functions.
Source code insrc/distilabel/steps/tasks/apigen/execution_checker.py
def _is_dangerous(self, function: Callable) -> bool:\n \"\"\"Checks if a function is dangerous to remove it.\n Contains a list of heuristics to avoid executing possibly dangerous functions.\n \"\"\"\n source_code = inspect.getsource(function)\n # We don't want to execute functions that use subprocess\n if (\n (\"subprocess.\" in source_code)\n or (\"os.system(\" in source_code)\n or (\"input(\" in source_code)\n # Avoiding threading\n or (\"threading.Thread(\" in source_code)\n or (\"exec(\" in source_code)\n # Avoiding argparse (not sure why)\n or (\"argparse.ArgumentParser(\" in source_code)\n # Avoiding logging changing the levels to not mess with the logs\n or (\".setLevel(\" in source_code)\n # Don't run a test battery\n or (\"unittest.main(\" in source_code)\n # Avoid exiting the program\n or (\"sys.exit(\" in source_code)\n or (\"exit(\" in source_code)\n or (\"raise SystemExit(\" in source_code)\n or (\"multiprocessing.Pool(\" in source_code)\n ):\n return True\n return False\n
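A small, illustrative check of how these substring heuristics behave (the risky function below is made up for the example):\n\n```python\nimport inspect\nimport subprocess\n\ndef risky():\n    return subprocess.run(['echo', 'hi'])\n\nsource = inspect.getsource(risky)\nprint('subprocess.' in source)  # True, so the heuristic above would discard this function\n```\n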
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenExecutionChecker.process","title":"process(inputs)
","text":"Checks the answer to see if it can be executed. Captures the possible errors and returns them.
If a single example is provided, it is copied to avoid raising an error.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of dictionaries with the input data.
requiredYields:
Type DescriptionStepOutput
A list of dictionaries with the output data.
Source code insrc/distilabel/steps/tasks/apigen/execution_checker.py
@override\ndef process(self, inputs: StepInput) -> \"StepOutput\":\n \"\"\"Checks the answer to see if it can be executed.\n Captures the possible errors and returns them.\n\n If a single example is provided, it is copied to avoid raising an error.\n\n Args:\n inputs: A list of dictionaries with the input data.\n\n Yields:\n A list of dictionaries with the output data.\n \"\"\"\n for input in inputs:\n output = []\n if input[\"answers\"]:\n answers = json.loads(input[\"answers\"])\n else:\n input.update(\n **{\n \"keep_row_after_execution_check\": False,\n \"execution_result\": [\"No answers were provided.\"],\n }\n )\n continue\n for answer in answers:\n if answer is None:\n output.append(\n {\n \"keep\": False,\n \"execution_result\": \"Nothing was generated for this answer.\",\n }\n )\n continue\n\n function_name = answer.get(\"name\", None)\n arguments = answer.get(\"arguments\", None)\n\n self._logger.debug(\n f\"Executing function '{function_name}' with arguments: {arguments}\"\n )\n function = self._get_function(function_name)\n\n if self.check_is_dangerous:\n if function and self._is_dangerous(function):\n function = None\n\n if function is None:\n output.append(\n {\n \"keep\": False,\n \"execution_result\": f\"Function '{function_name}' not found.\",\n }\n )\n else:\n execution = execute_from_response(function, arguments)\n output.append(\n {\n \"keep\": execution[\"keep\"],\n \"execution_result\": execution[\"execution_result\"],\n }\n )\n # We only consider a good response if all the answers were executed successfully,\n # but keep the reasons for further review if needed.\n input.update(\n **{\n \"keep_row_after_execution_check\": all(\n o[\"keep\"] is True for o in output\n ),\n \"execution_result\": [o[\"execution_result\"] for o in output],\n }\n )\n\n yield inputs\n
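Putting the pieces together, a minimal usage sketch (the lib_apigen.py toolbox and the final_velocity function are hypothetical placeholders):\n\n```python\nimport json\nfrom distilabel.steps.tasks import APIGenExecutionChecker\n\n# Assumes lib_apigen.py defines final_velocity(initial_velocity, acceleration, time)\nchecker = APIGenExecutionChecker(libpath='lib_apigen.py')\nchecker.load()\n\nrows = [{'answers': json.dumps([{'name': 'final_velocity', 'arguments': {'initial_velocity': 0, 'acceleration': 9.8, 'time': 10}}])}]\nresult = next(checker.process(rows))\n# Each row now carries 'keep_row_after_execution_check' and 'execution_result'\n```\n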
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator","title":"APIGenGenerator
","text":" Bases: Task
Generate queries and answers for the given functions in JSON format.
The `APIGenGenerator` is inspired by the APIGen pipeline, which was designed to generate\nverifiable and diverse function-calling datasets. The task generates a set of diverse queries\nand corresponding answers for the given functions in JSON format.\n\nAttributes:\n system_prompt: The system prompt to guide the user in the generation of queries and answers.\n use_tools: Whether to use the tools available in the prompt to generate the queries and answers.\n In case the tools are given in the input, they will be added to the prompt.\n number: The number of queries to generate. It can be a list, where each number will be\n chosen randomly, or a dictionary with the number of queries and the probability of each.\n I.e: `number=1`, `number=[1, 2, 3]`, `number={1: 0.5, 2: 0.3, 3: 0.2}` are all valid inputs.\n It corresponds to the number of parallel queries to generate.\n use_default_structured_output: Whether to use the default structured output or not.\n\nInput columns:\n - examples (`str`): Examples used as few shots to guide the model.\n - func_name (`str`): Name for the function to generate.\n - func_desc (`str`): Description of what the function should do.\n - tools (`str`): JSON formatted string containing the tool representation of the function.\n\nOutput columns:\n - query (`str`): The list of queries.\n - answers (`str`): JSON formatted string with the list of answers, containing the info as\n a dictionary to be passed to the functions.\n\nCategories:\n - text-generation\n\nReferences:\n - [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518)\n - [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)\n\nExamples:\n Generate without structured output (original implementation):\n\n ```python\n from distilabel.steps.tasks import ApiGenGenerator\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n )\n apigen = ApiGenGenerator(\n use_default_structured_output=False,\n llm=llm\n )\n apigen.load()\n\n res = next(\n apigen.process(\n [\n {\n \"examples\": 'QUERY:\n
What is the binary sum of 10010 and 11101? ANSWER: [{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]', \"func_name\": \"getrandommovie\", \"func_desc\": \"Returns a list of random movies from a database by calling an external API.\" } ] ) ) res # [{'examples': 'QUERY: What is the binary sum of 10010 and 11101? ANSWER: [{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]', # 'number': 1, # 'func_name': 'getrandommovie', # 'func_desc': 'Returns a list of random movies from a database by calling an external API.', # 'queries': ['I want to watch a movie tonight, can you recommend a random one from your database?', # 'Give me 5 random movie suggestions from your database to plan my weekend.'], # 'answers': [[{'name': 'getrandommovie', 'arguments': {}}], # [{'name': 'getrandommovie', 'arguments': {}}, # {'name': 'getrandommovie', 'arguments': {}}, # {'name': 'getrandommovie', 'arguments': {}}, # {'name': 'getrandommovie', 'arguments': {}}, # {'name': 'getrandommovie', 'arguments': {}}]], # 'raw_input_api_gen_generator_0': [{'role': 'system', # 'content': \"You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.
Construct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.
Ensure the query: - Is clear and concise - Demonstrates typical use cases - Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words - Across a variety level of difficulties, ranging from beginner and advanced use cases - The corresponding result's parameter types and ranges match with the function's descriptions
Ensure the answer: - Is a list of function calls in JSON format - The length of the answer list should be equal to the number of requests in the query - Can solve all the requests in the query effectively\"}, # {'role': 'user', # 'content': 'Here are examples of queries and the corresponding answers for similar functions: QUERY: What is the binary sum of 10010 and 11101? ANSWER: [{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]
Note that the query could be interpreted as a combination of several independent requests. Based on these examples, generate 2 diverse query and answer pairs for the function getrandommovie
The detailed function description is the following: Returns a list of random movies from a database by calling an external API.
The output MUST strictly adhere to the following JSON format, and NO other text MUST be included:
[\n {\n \"query\": \"The generated query.\",\n \"answers\": [\n {\n \"name\": \"api_name\",\n \"arguments\": {\n \"arg_name\": \"value\"\n ... (more arguments as required)\n }\n },\n ... (more API calls as required)\n ]\n }\n]\n
Now please generate 2 diverse query and answer pairs following the above format.'}]}, # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}] ```
Generate with structured output:\n\n ```python\n from distilabel.steps.tasks import ApiGenGenerator\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n )\n apigen = ApiGenGenerator(\n use_default_structured_output=True,\n llm=llm\n )\n apigen.load()\n\n res_struct = next(\n apigen.process(\n [\n {\n \"examples\": 'QUERY:\n
What is the binary sum of 10010 and 11101? ANSWER: [{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]', \"func_name\": \"getrandommovie\", \"func_desc\": \"Returns a list of random movies from a database by calling an external API.\" } ] ) ) res_struct # [{'examples': 'QUERY: What is the binary sum of 10010 and 11101? ANSWER: [{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]', # 'number': 1, # 'func_name': 'getrandommovie', # 'func_desc': 'Returns a list of random movies from a database by calling an external API.', # 'queries': [\"I'm bored and want to watch a movie. Can you suggest some movies?\", # \"My family and I are planning a movie night. We can't decide on what to watch. Can you suggest some random movie titles?\"], # 'answers': [[{'arguments': {}, 'name': 'getrandommovie'}], # [{'arguments': {}, 'name': 'getrandommovie'}]], # 'raw_input_api_gen_generator_0': [{'role': 'system', # 'content': \"You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.
Construct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.
Ensure the query: - Is clear and concise - Demonstrates typical use cases - Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words - Across a variety level of difficulties, ranging from beginner and advanced use cases - The corresponding result's parameter types and ranges match with the function's descriptions
Ensure the answer: - Is a list of function calls in JSON format - The length of the answer list should be equal to the number of requests in the query - Can solve all the requests in the query effectively\"}, # {'role': 'user', # 'content': 'Here are examples of queries and the corresponding answers for similar functions: QUERY: What is the binary sum of 10010 and 11101? ANSWER: [{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]
Note that the query could be interpreted as a combination of several independent requests. Based on these examples, generate 2 diverse query and answer pairs for the function getrandommovie
The detailed function description is the following: Returns a list of random movies from a database by calling an external API.
Now please generate 2 diverse query and answer pairs following the above format.'}]}, # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}] ```
Source code insrc/distilabel/steps/tasks/apigen/generator.py
class APIGenGenerator(Task):\n \"\"\"Generate queries and answers for the given functions in JSON format.\n\n The `APIGenGenerator` is inspired by the APIGen pipeline, which was designed to generate\n verifiable and diverse function-calling datasets. The task generates a set of diverse queries\n and corresponding answers for the given functions in JSON format.\n\n Attributes:\n system_prompt: The system prompt to guide the user in the generation of queries and answers.\n use_tools: Whether to use the tools available in the prompt to generate the queries and answers.\n In case the tools are given in the input, they will be added to the prompt.\n number: The number of queries to generate. It can be a list, where each number will be\n chosen randomly, or a dictionary with the number of queries and the probability of each.\n I.e: `number=1`, `number=[1, 2, 3]`, `number={1: 0.5, 2: 0.3, 3: 0.2}` are all valid inputs.\n It corresponds to the number of parallel queries to generate.\n use_default_structured_output: Whether to use the default structured output or not.\n\n Input columns:\n - examples (`str`): Examples used as few shots to guide the model.\n - func_name (`str`): Name for the function to generate.\n - func_desc (`str`): Description of what the function should do.\n - tools (`str`): JSON formatted string containing the tool representation of the function.\n\n Output columns:\n - query (`str`): The list of queries.\n - answers (`str`): JSON formatted string with the list of answers, containing the info as\n a dictionary to be passed to the functions.\n\n Categories:\n - text-generation\n\n References:\n - [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518)\n - [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)\n\n Examples:\n Generate without structured output (original implementation):\n\n ```python\n from distilabel.steps.tasks import ApiGenGenerator\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n )\n apigen = ApiGenGenerator(\n use_default_structured_output=False,\n llm=llm\n )\n apigen.load()\n\n res = next(\n apigen.process(\n [\n {\n \"examples\": 'QUERY:\\nWhat is the binary sum of 10010 and 11101?\\nANSWER:\\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]',\n \"func_name\": \"getrandommovie\",\n \"func_desc\": \"Returns a list of random movies from a database by calling an external API.\"\n }\n ]\n )\n )\n res\n # [{'examples': 'QUERY:\\nWhat is the binary sum of 10010 and 11101?\\nANSWER:\\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]',\n # 'number': 1,\n # 'func_name': 'getrandommovie',\n # 'func_desc': 'Returns a list of random movies from a database by calling an external API.',\n # 'queries': ['I want to watch a movie tonight, can you recommend a random one from your database?',\n # 'Give me 5 random movie suggestions from your database to plan my weekend.'],\n # 'answers': [[{'name': 'getrandommovie', 'arguments': {}}],\n # [{'name': 'getrandommovie', 'arguments': {}},\n # {'name': 'getrandommovie', 'arguments': {}},\n # {'name': 'getrandommovie', 'arguments': {}},\n # {'name': 'getrandommovie', 'arguments': {}},\n # {'name': 'getrandommovie', 'arguments': {}}]],\n # 
'raw_input_api_gen_generator_0': [{'role': 'system',\n # 'content': \"You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\\n\\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\\n\\nEnsure the query:\\n- Is clear and concise\\n- Demonstrates typical use cases\\n- Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words\\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\\n- The corresponding result's parameter types and ranges match with the function's descriptions\\n\\nEnsure the answer:\\n- Is a list of function calls in JSON format\\n- The length of the answer list should be equal to the number of requests in the query\\n- Can solve all the requests in the query effectively\"},\n # {'role': 'user',\n # 'content': 'Here are examples of queries and the corresponding answers for similar functions:\\nQUERY:\\nWhat is the binary sum of 10010 and 11101?\\nANSWER:\\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]\\n\\nNote that the query could be interpreted as a combination of several independent requests.\\nBased on these examples, generate 2 diverse query and answer pairs for the function `getrandommovie`\\nThe detailed function description is the following:\\nReturns a list of random movies from a database by calling an external API.\\n\\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:\\n```json\\n[\\n {\\n \"query\": \"The generated query.\",\\n \"answers\": [\\n {\\n \"name\": \"api_name\",\\n \"arguments\": {\\n \"arg_name\": \"value\"\\n ... (more arguments as required)\\n }\\n },\\n ... (more API calls as required)\\n ]\\n }\\n]\\n```\\n\\nNow please generate 2 diverse query and answer pairs following the above format.'}]},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n\n Generate with structured output:\n\n ```python\n from distilabel.steps.tasks import ApiGenGenerator\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n )\n apigen = ApiGenGenerator(\n use_default_structured_output=True,\n llm=llm\n )\n apigen.load()\n\n res_struct = next(\n apigen.process(\n [\n {\n \"examples\": 'QUERY:\\nWhat is the binary sum of 10010 and 11101?\\nANSWER:\\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]',\n \"func_name\": \"getrandommovie\",\n \"func_desc\": \"Returns a list of random movies from a database by calling an external API.\"\n }\n ]\n )\n )\n res_struct\n # [{'examples': 'QUERY:\\nWhat is the binary sum of 10010 and 11101?\\nANSWER:\\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]',\n # 'number': 1,\n # 'func_name': 'getrandommovie',\n # 'func_desc': 'Returns a list of random movies from a database by calling an external API.',\n # 'queries': [\"I'm bored and want to watch a movie. Can you suggest some movies?\",\n # \"My family and I are planning a movie night. 
We can't decide on what to watch. Can you suggest some random movie titles?\"],\n # 'answers': [[{'arguments': {}, 'name': 'getrandommovie'}],\n # [{'arguments': {}, 'name': 'getrandommovie'}]],\n # 'raw_input_api_gen_generator_0': [{'role': 'system',\n # 'content': \"You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\\n\\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\\n\\nEnsure the query:\\n- Is clear and concise\\n- Demonstrates typical use cases\\n- Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words\\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\\n- The corresponding result's parameter types and ranges match with the function's descriptions\\n\\nEnsure the answer:\\n- Is a list of function calls in JSON format\\n- The length of the answer list should be equal to the number of requests in the query\\n- Can solve all the requests in the query effectively\"},\n # {'role': 'user',\n # 'content': 'Here are examples of queries and the corresponding answers for similar functions:\\nQUERY:\\nWhat is the binary sum of 10010 and 11101?\\nANSWER:\\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]\\n\\nNote that the query could be interpreted as a combination of several independent requests.\\nBased on these examples, generate 2 diverse query and answer pairs for the function `getrandommovie`\\nThe detailed function description is the following:\\nReturns a list of random movies from a database by calling an external API.\\n\\nNow please generate 2 diverse query and answer pairs following the above format.'}]},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n \"\"\"\n\n system_prompt: str = SYSTEM_PROMPT_API_GEN\n use_default_structured_output: bool = False\n number: Union[int, List[int], Dict[int, float]] = 1\n use_tools: bool = True\n\n _number: Union[int, None] = PrivateAttr(None)\n _fn_parallel_queries: Union[Callable[[], str], None] = PrivateAttr(None)\n _format_inst: Union[str, None] = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Loads the template for the generator prompt.\"\"\"\n super().load()\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"apigen\"\n / \"generator.jinja2\"\n )\n self._template = Template(open(_path).read())\n self._format_inst = self._set_format_inst()\n\n def _parallel_queries(self, number: int) -> Callable[[int], str]:\n \"\"\"Prepares the function to update the parallel queries guide in the prompt.\n\n Raises:\n ValueError: if `is_parallel` is not a boolean or a list of floats.\n\n Returns:\n The function to generate the parallel queries guide.\n \"\"\"\n if number > 1:\n return (\n \"It can contain multiple parallel queries in natural language for the given functions. 
\"\n \"They could use either the same function with different arguments or different functions.\\n\"\n )\n return \"\"\n\n def _get_number(self) -> int:\n \"\"\"Generates the number of queries to generate in a single call.\n The number must be set to `_number` to avoid changing the original value\n when calling `_default_error`.\n \"\"\"\n if isinstance(self.number, list):\n self._number = random.choice(self.number)\n elif isinstance(self.number, dict):\n self._number = random.choices(\n list(self.number.keys()), list(self.number.values())\n )[0]\n else:\n self._number = self.number\n return self._number\n\n def _set_format_inst(self) -> str:\n \"\"\"Prepares the function to generate the formatted instructions for the prompt.\n\n If the default structured output is used, returns an empty string because nothing\n else is needed, otherwise, returns the original addition to the prompt to guide the model\n to generate a formatted JSON.\n \"\"\"\n return (\n \"\\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:\\n\"\n \"```\\n\"\n \"[\\n\"\n \" {\\n\"\n ' \"query\": \"The generated query.\",\\n'\n ' \"answers\": [\\n'\n \" {\\n\"\n ' \"name\": \"api_name\",\\n'\n ' \"arguments\": {\\n'\n ' \"arg_name\": \"value\"\\n'\n \" ... (more arguments as required)\\n\"\n \" }\\n\"\n \" },\\n\"\n \" ... (more API calls as required)\\n\"\n \" ]\\n\"\n \" }\\n\"\n \"]\\n\"\n \"```\\n\"\n )\n\n def _get_func_desc(self, input: Dict[str, Any]) -> str:\n \"\"\"If available and required, will use the info from the tools in the\n prompt for extra information. Otherwise will use jut the function description.\n \"\"\"\n if not self.use_tools:\n return input[\"func_desc\"]\n extra = \"\" # Extra information from the tools (if available will be added)\n if \"tools\" in input:\n extra = f\"\\n\\nThis is the available tool to guide you (respect the order of the parameters):\\n{input['tools']}\"\n return input[\"func_desc\"] + extra\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The inputs for the task.\"\"\"\n return {\n \"examples\": True,\n \"func_name\": True,\n \"func_desc\": True,\n \"tools\": False,\n }\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType`.\"\"\"\n number = self._get_number()\n parallel_queries = self._parallel_queries(number)\n return [\n {\"role\": \"system\", \"content\": self.system_prompt},\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n examples=input[\"examples\"],\n parallel_queries=parallel_queries,\n number=number,\n func_name=input[\"func_name\"],\n func_desc=self._get_func_desc(input),\n format_inst=self._format_inst,\n ),\n },\n ]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The output for the task are the queries and corresponding answers.\"\"\"\n return [\"query\", \"answers\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. 
Used for obtaining the number of responses.\n\n Returns:\n A dict with the queries and answers pairs.\n The answers are an array of answers corresponding to the query.\n Each answer is represented as an object with the following properties:\n - name (string): The name of the tool used to generate the answer.\n - arguments (object): An object representing the arguments passed to the tool to generate the answer.\n Each argument is represented as a key-value pair, where the key is the parameter name and the\n value is the corresponding value.\n \"\"\"\n if output is None:\n return self._default_error(input)\n\n if not self.use_default_structured_output:\n output = remove_fences(output)\n\n try:\n pairs = orjson.loads(output)\n except orjson.JSONDecodeError:\n return self._default_error(input)\n\n pairs = pairs[\"pairs\"] if self.use_default_structured_output else pairs\n\n return self._format_output(pairs, input)\n\n def _format_output(\n self, pairs: Dict[str, Any], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"Parses the response, returning a dictionary with queries and answers.\n\n Args:\n pairs: The parsed dictionary from the LLM's output.\n input: The input from the `LLM`.\n\n Returns:\n Formatted output, where the `queries` are a list of strings, and the `answers`\n are a list of objects.\n \"\"\"\n try:\n input.update(\n **{\n \"query\": pairs[0][\"query\"],\n \"answers\": json.dumps(pairs[0][\"answers\"]),\n }\n )\n return input\n except Exception as e:\n self._logger.error(f\"Error formatting output: {e}, pairs: '{pairs}'\")\n return self._default_error(input)\n\n def _default_error(self, input: Dict[str, Any]) -> Dict[str, Any]:\n \"\"\"Returns a default error output, to fill the responses in case of failure.\"\"\"\n input.update(\n **{\n \"query\": None,\n \"answers\": json.dumps([None] * self._number),\n }\n )\n return input\n\n @override\n def get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from typing import Dict, List\n from pydantic import BaseModel\n\n\n class Answer(BaseModel):\n name: str\n arguments: Dict[str, str]\n\n class QueryAnswer(BaseModel):\n query: str\n answers: List[Answer]\n\n class QueryAnswerPairs(BaseModel):\n pairs: List[QueryAnswer]\n\n json.dumps(QueryAnswerPairs.model_json_schema(), indent=4)\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"$defs\": {\n \"Answer\": {\n \"properties\": {\n \"name\": {\"title\": \"Name\", \"type\": \"string\"},\n \"arguments\": {\n \"additionalProperties\": {\"type\": \"string\"},\n \"title\": \"Arguments\",\n \"type\": \"object\",\n },\n },\n \"required\": [\"name\", \"arguments\"],\n \"title\": \"Answer\",\n \"type\": \"object\",\n },\n \"QueryAnswer\": {\n \"properties\": {\n \"query\": {\"title\": \"Query\", \"type\": \"string\"},\n \"answers\": {\n \"items\": {\"$ref\": \"#/$defs/Answer\"},\n \"title\": \"Answers\",\n \"type\": \"array\",\n },\n },\n \"required\": [\"query\", \"answers\"],\n \"title\": \"QueryAnswer\",\n \"type\": \"object\",\n },\n },\n \"properties\": {\n \"pairs\": {\n \"items\": {\"$ref\": \"#/$defs/QueryAnswer\"},\n \"title\": \"Pairs\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"pairs\"],\n \"title\": \"QueryAnswerPairs\",\n \"type\": \"object\",\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator.inputs","title":"inputs
property
","text":"The inputs for the task.
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator.outputs","title":"outputs
property
","text":"The output for the task are the queries and corresponding answers.
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator.load","title":"load()
","text":"Loads the template for the generator prompt.
Source code insrc/distilabel/steps/tasks/apigen/generator.py
def load(self) -> None:\n \"\"\"Loads the template for the generator prompt.\"\"\"\n super().load()\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"apigen\"\n / \"generator.jinja2\"\n )\n self._template = Template(open(_path).read())\n self._format_inst = self._set_format_inst()\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator._parallel_queries","title":"_parallel_queries(number)
","text":"Prepares the function to update the parallel queries guide in the prompt.
Raises:
Type DescriptionValueError
if is_parallel
is not a boolean or a list of floats.
Returns:
Type DescriptionCallable[[int], str]
The function to generate the parallel queries guide.
Source code insrc/distilabel/steps/tasks/apigen/generator.py
def _parallel_queries(self, number: int) -> Callable[[int], str]:\n \"\"\"Prepares the function to update the parallel queries guide in the prompt.\n\n Raises:\n ValueError: if `is_parallel` is not a boolean or a list of floats.\n\n Returns:\n The function to generate the parallel queries guide.\n \"\"\"\n if number > 1:\n return (\n \"It can contain multiple parallel queries in natural language for the given functions. \"\n \"They could use either the same function with different arguments or different functions.\\n\"\n )\n return \"\"\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator._get_number","title":"_get_number()
","text":"Generates the number of queries to generate in a single call. The number must be set to _number
to avoid changing the original value when calling _default_error
.
src/distilabel/steps/tasks/apigen/generator.py
def _get_number(self) -> int:\n \"\"\"Generates the number of queries to generate in a single call.\n The number must be set to `_number` to avoid changing the original value\n when calling `_default_error`.\n \"\"\"\n if isinstance(self.number, list):\n self._number = random.choice(self.number)\n elif isinstance(self.number, dict):\n self._number = random.choices(\n list(self.number.keys()), list(self.number.values())\n )[0]\n else:\n self._number = self.number\n return self._number\n
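For example, a weighted number such as {1: 0.5, 2: 0.3, 3: 0.2} is resolved to a single value per call with the same random.choices logic shown above:\n\n```python\nimport random\n\nnumber = {1: 0.5, 2: 0.3, 3: 0.2}\nchosen = random.choices(list(number.keys()), list(number.values()))[0]\nprint(chosen)  # 1, 2 or 3, drawn with probabilities 0.5, 0.3 and 0.2\n```\n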
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator._set_format_inst","title":"_set_format_inst()
","text":"Prepares the function to generate the formatted instructions for the prompt.
If the default structured output is used, returns an empty string because nothing else is needed, otherwise, returns the original addition to the prompt to guide the model to generate a formatted JSON.
Source code insrc/distilabel/steps/tasks/apigen/generator.py
def _set_format_inst(self) -> str:\n \"\"\"Prepares the function to generate the formatted instructions for the prompt.\n\n If the default structured output is used, returns an empty string because nothing\n else is needed, otherwise, returns the original addition to the prompt to guide the model\n to generate a formatted JSON.\n \"\"\"\n return (\n \"\\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:\\n\"\n \"```\\n\"\n \"[\\n\"\n \" {\\n\"\n ' \"query\": \"The generated query.\",\\n'\n ' \"answers\": [\\n'\n \" {\\n\"\n ' \"name\": \"api_name\",\\n'\n ' \"arguments\": {\\n'\n ' \"arg_name\": \"value\"\\n'\n \" ... (more arguments as required)\\n\"\n \" }\\n\"\n \" },\\n\"\n \" ... (more API calls as required)\\n\"\n \" ]\\n\"\n \" }\\n\"\n \"]\\n\"\n \"```\\n\"\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator._get_func_desc","title":"_get_func_desc(input)
","text":"If available and required, will use the info from the tools in the prompt for extra information. Otherwise will use jut the function description.
Source code insrc/distilabel/steps/tasks/apigen/generator.py
def _get_func_desc(self, input: Dict[str, Any]) -> str:\n \"\"\"If available and required, will use the info from the tools in the\n prompt for extra information. Otherwise will use just the function description.\n \"\"\"\n if not self.use_tools:\n return input[\"func_desc\"]\n extra = \"\" # Extra information from the tools (if available will be added)\n if \"tools\" in input:\n extra = f\"\\n\\nThis is the available tool to guide you (respect the order of the parameters):\\n{input['tools']}\"\n return input[\"func_desc\"] + extra\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
.
src/distilabel/steps/tasks/apigen/generator.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType`.\"\"\"\n number = self._get_number()\n parallel_queries = self._parallel_queries(number)\n return [\n {\"role\": \"system\", \"content\": self.system_prompt},\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n examples=input[\"examples\"],\n parallel_queries=parallel_queries,\n number=number,\n func_name=input[\"func_name\"],\n func_desc=self._get_func_desc(input),\n format_inst=self._format_inst,\n ),\n },\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator.format_output","title":"format_output(output, input)
","text":"The output is formatted as a list with the score of each instruction.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
the raw output of the LLM.
requiredinput
Dict[str, Any]
the input to the task. Used for obtaining the number of responses.
requiredReturns:
Type DescriptionDict[str, Any]
A dict with the queries and answers pairs.
Dict[str, Any]
The answers are an array of answers corresponding to the query.
Dict[str, Any]
Each answer is represented as an object with the following properties: - name (string): The name of the tool used to generate the answer. - arguments (object): An object representing the arguments passed to the tool to generate the answer.
Dict[str, Any]
Each argument is represented as a key-value pair, where the key is the parameter name and the
Dict[str, Any]
value is the corresponding value.
Source code insrc/distilabel/steps/tasks/apigen/generator.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with the queries and answers pairs.\n The answers are an array of answers corresponding to the query.\n Each answer is represented as an object with the following properties:\n - name (string): The name of the tool used to generate the answer.\n - arguments (object): An object representing the arguments passed to the tool to generate the answer.\n Each argument is represented as a key-value pair, where the key is the parameter name and the\n value is the corresponding value.\n \"\"\"\n if output is None:\n return self._default_error(input)\n\n if not self.use_default_structured_output:\n output = remove_fences(output)\n\n try:\n pairs = orjson.loads(output)\n except orjson.JSONDecodeError:\n return self._default_error(input)\n\n pairs = pairs[\"pairs\"] if self.use_default_structured_output else pairs\n\n return self._format_output(pairs, input)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator._format_output","title":"_format_output(pairs, input)
","text":"Parses the response, returning a dictionary with queries and answers.
Parameters:
Name Type Description Defaultpairs
Dict[str, Any]
The parsed dictionary from the LLM's output.
requiredinput
Dict[str, Any]
The input from the LLM
.
Returns:
Type DescriptionDict[str, Any]
Formatted output, where the queries
are a list of strings, and the answers
Dict[str, Any]
are a list of objects.
Source code insrc/distilabel/steps/tasks/apigen/generator.py
def _format_output(\n self, pairs: Dict[str, Any], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"Parses the response, returning a dictionary with queries and answers.\n\n Args:\n pairs: The parsed dictionary from the LLM's output.\n input: The input from the `LLM`.\n\n Returns:\n Formatted output, where the `queries` are a list of strings, and the `answers`\n are a list of objects.\n \"\"\"\n try:\n input.update(\n **{\n \"query\": pairs[0][\"query\"],\n \"answers\": json.dumps(pairs[0][\"answers\"]),\n }\n )\n return input\n except Exception as e:\n self._logger.error(f\"Error formatting output: {e}, pairs: '{pairs}'\")\n return self._default_error(input)\n
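A toy illustration of the transformation above (the values are invented for the example): the first parsed pair is flattened into the row, with answers re-serialized as a JSON string for the downstream APIGenExecutionChecker:\n\n```python\nimport json\n\npairs = [{'query': 'Recommend a random movie.', 'answers': [{'name': 'getrandommovie', 'arguments': {}}]}]\nrow = {'func_name': 'getrandommovie'}\nrow.update(query=pairs[0]['query'], answers=json.dumps(pairs[0]['answers']))\nprint(row['answers'])  # '[{\"name\": \"getrandommovie\", \"arguments\": {}}]'\n```\n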
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator._default_error","title":"_default_error(input)
","text":"Returns a default error output, to fill the responses in case of failure.
Source code insrc/distilabel/steps/tasks/apigen/generator.py
def _default_error(self, input: Dict[str, Any]) -> Dict[str, Any]:\n \"\"\"Returns a default error output, to fill the responses in case of failure.\"\"\"\n input.update(\n **{\n \"query\": None,\n \"answers\": json.dumps([None] * self._number),\n }\n )\n return input\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenGenerator.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
The schema corresponds to the following:
from typing import Dict, List\nfrom pydantic import BaseModel\n\n\nclass Answer(BaseModel):\n name: str\n arguments: Dict[str, str]\n\nclass QueryAnswer(BaseModel):\n query: str\n answers: List[Answer]\n\nclass QueryAnswerPairs(BaseModel):\n pairs: List[QueryAnswer]\n\njson.dumps(QueryAnswerPairs.model_json_schema(), indent=4)\n
Returns:
Type DescriptionDict[str, Any]
JSON Schema of the response to enforce.
Source code insrc/distilabel/steps/tasks/apigen/generator.py
@override\ndef get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from typing import Dict, List\n from pydantic import BaseModel\n\n\n class Answer(BaseModel):\n name: str\n arguments: Dict[str, str]\n\n class QueryAnswer(BaseModel):\n query: str\n answers: List[Answer]\n\n class QueryAnswerPairs(BaseModel):\n pairs: List[QueryAnswer]\n\n json.dumps(QueryAnswerPairs.model_json_schema(), indent=4)\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"$defs\": {\n \"Answer\": {\n \"properties\": {\n \"name\": {\"title\": \"Name\", \"type\": \"string\"},\n \"arguments\": {\n \"additionalProperties\": {\"type\": \"string\"},\n \"title\": \"Arguments\",\n \"type\": \"object\",\n },\n },\n \"required\": [\"name\", \"arguments\"],\n \"title\": \"Answer\",\n \"type\": \"object\",\n },\n \"QueryAnswer\": {\n \"properties\": {\n \"query\": {\"title\": \"Query\", \"type\": \"string\"},\n \"answers\": {\n \"items\": {\"$ref\": \"#/$defs/Answer\"},\n \"title\": \"Answers\",\n \"type\": \"array\",\n },\n },\n \"required\": [\"query\", \"answers\"],\n \"title\": \"QueryAnswer\",\n \"type\": \"object\",\n },\n },\n \"properties\": {\n \"pairs\": {\n \"items\": {\"$ref\": \"#/$defs/QueryAnswer\"},\n \"title\": \"Pairs\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"pairs\"],\n \"title\": \"QueryAnswerPairs\",\n \"type\": \"object\",\n }\n
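The returned schema mirrors the pydantic models shown in the docstring; assuming pydantic v2, a structured response could be validated against them like this (the sample payload is invented):\n\n```python\nfrom typing import Dict, List\nfrom pydantic import BaseModel\n\nclass Answer(BaseModel):\n    name: str\n    arguments: Dict[str, str]\n\nclass QueryAnswer(BaseModel):\n    query: str\n    answers: List[Answer]\n\nclass QueryAnswerPairs(BaseModel):\n    pairs: List[QueryAnswer]\n\nraw = '{\"pairs\": [{\"query\": \"Recommend a random movie.\", \"answers\": [{\"name\": \"getrandommovie\", \"arguments\": {}}]}]}'\nparsed = QueryAnswerPairs.model_validate_json(raw)\nprint(parsed.pairs[0].query)  # Recommend a random movie.\n```\n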
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker","title":"APIGenSemanticChecker
","text":" Bases: Task
Checks the semantic validity of the generated function calls against the user query.
The APIGenSemanticChecker
is inspired by the APIGen pipeline, which was designed to generate verifiable and diverse function-calling datasets. The task checks that the generated function calls and their execution results accurately reflect the user's query.
Attributes:
Name Type Descriptionsystem_prompt
str
System prompt for the task. Has a default one.
exclude_failed_execution
str
Whether to exclude failed executions (won't run on those rows that have a False in keep_row_after_execution_check
column, which comes from running APIGenExecutionChecker
). Defaults to True.
str
): Description of what the function should do.str
): Instruction from the user.str
): JSON encoded list with arguments to be passed to the function/API. Should be loaded using json.loads
.str
): Result of the function/API executed.str
): Reasoning for the output on whether to keep this output or not.bool
): True or False, can be used to filter afterwards.Examples:
Semantic checker for generated function calls (original implementation):\n\n```python\nfrom distilabel.steps.tasks import APIGenSemanticChecker\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n)\nsemantic_checker = APIGenSemanticChecker(\n use_default_structured_output=False,\n llm=llm\n)\nsemantic_checker.load()\n\nres = next(\n semantic_checker.process(\n [\n {\n \"func_desc\": \"Fetch information about a specific cat breed from the Cat Breeds API.\",\n \"query\": \"What information can be obtained about the Maine Coon cat breed?\",\n \"answers\": json.dumps([{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]),\n \"execution_result\": \"The Maine Coon is a big and hairy breed of cat\",\n }\n ]\n )\n)\nres\n# [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',\n# 'query': 'What information can be obtained about the Maine Coon cat breed?',\n# 'answers': [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}],\n# 'execution_result': 'The Maine Coon is a big and hairy breed of cat',\n# 'thought': '',\n# 'keep_row_after_semantic_check': True,\n# 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',\n# 'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user\u2019s intentions.\\n\\nDo not pass if:\\n1. The function call does not align with the query\u2019s objective, or the input arguments appear incorrect.\\n2. The function call and arguments are not properly chosen from the available functions.\\n3. The number of function calls does not correspond to the user\u2019s intentions.\\n4. The execution results are irrelevant and do not match the function\u2019s purpose.\\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\\n'},\n# {'role': 'user',\n# 'content': 'Given Information:\\n- All Available Functions:\\nFetch information about a specific cat breed from the Cat Breeds API.\\n- User Query: What information can be obtained about the Maine Coon cat breed?\\n- Generated Function Calls: [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]\\n- Execution Results: The Maine Coon is a big and hairy breed of cat\\n\\nNote: The query may have multiple intentions. 
Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\\n\\nThe main decision factor is wheather the function calls accurately reflect the query\\'s intentions and the function descriptions.\\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\\n\\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\\n```\\n{\\n \"thought\": \"Concisely describe your reasoning here\",\\n \"pass\": \"yes\" or \"no\"\\n}\\n```\\n'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n```\n\nSemantic checker for generated function calls (structured output):\n\n```python\nfrom distilabel.steps.tasks import APIGenSemanticChecker\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n)\nsemantic_checker = APIGenSemanticChecker(\n use_default_structured_output=True,\n llm=llm\n)\nsemantic_checker.load()\n\nres = next(\n semantic_checker.process(\n [\n {\n \"func_desc\": \"Fetch information about a specific cat breed from the Cat Breeds API.\",\n \"query\": \"What information can be obtained about the Maine Coon cat breed?\",\n \"answers\": json.dumps([{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]),\n \"execution_result\": \"The Maine Coon is a big and hairy breed of cat\",\n }\n ]\n )\n)\nres\n# [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',\n# 'query': 'What information can be obtained about the Maine Coon cat breed?',\n# 'answers': [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}],\n# 'execution_result': 'The Maine Coon is a big and hairy breed of cat',\n# 'keep_row_after_semantic_check': True,\n# 'thought': '',\n# 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',\n# 'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user\u2019s intentions.\\n\\nDo not pass if:\\n1. The function call does not align with the query\u2019s objective, or the input arguments appear incorrect.\\n2. The function call and arguments are not properly chosen from the available functions.\\n3. The number of function calls does not correspond to the user\u2019s intentions.\\n4. The execution results are irrelevant and do not match the function\u2019s purpose.\\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\\n'},\n# {'role': 'user',\n# 'content': 'Given Information:\\n- All Available Functions:\\nFetch information about a specific cat breed from the Cat Breeds API.\\n- User Query: What information can be obtained about the Maine Coon cat breed?\\n- Generated Function Calls: [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]\\n- Execution Results: The Maine Coon is a big and hairy breed of cat\\n\\nNote: The query may have multiple intentions. 
Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\\n\\nThe main decision factor is wheather the function calls accurately reflect the query\\'s intentions and the function descriptions.\\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\\n'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n```\n
Source code in src/distilabel/steps/tasks/apigen/semantic_checker.py
class APIGenSemanticChecker(Task):\n r\"\"\"Generate queries and answers for the given functions in JSON format.\n\n The `APIGenGenerator` is inspired by the APIGen pipeline, which was designed to generate\n verifiable and diverse function-calling datasets. The task generates a set of diverse queries\n and corresponding answers for the given functions in JSON format.\n\n Attributes:\n system_prompt: System prompt for the task. Has a default one.\n exclude_failed_execution: Whether to exclude failed executions (won't run on those\n rows that have a False in `keep_row_after_execution_check` column, which\n comes from running `APIGenExecutionChecker`). Defaults to True.\n\n Input columns:\n - func_desc (`str`): Description of what the function should do.\n - query (`str`): Instruction from the user.\n - answers (`str`): JSON encoded list with arguments to be passed to the function/API.\n Should be loaded using `json.loads`.\n - execution_result (`str`): Result of the function/API executed.\n\n Output columns:\n - thought (`str`): Reasoning for the output on whether to keep this output or not.\n - keep_row_after_semantic_check (`bool`): True or False, can be used to filter\n afterwards.\n\n Categories:\n - filtering\n - text-generation\n\n References:\n - [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518)\n - [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)\n\n Examples:\n\n Semantic checker for generated function calls (original implementation):\n\n ```python\n from distilabel.steps.tasks import APIGenSemanticChecker\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n )\n semantic_checker = APIGenSemanticChecker(\n use_default_structured_output=False,\n llm=llm\n )\n semantic_checker.load()\n\n res = next(\n semantic_checker.process(\n [\n {\n \"func_desc\": \"Fetch information about a specific cat breed from the Cat Breeds API.\",\n \"query\": \"What information can be obtained about the Maine Coon cat breed?\",\n \"answers\": json.dumps([{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]),\n \"execution_result\": \"The Maine Coon is a big and hairy breed of cat\",\n }\n ]\n )\n )\n res\n # [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',\n # 'query': 'What information can be obtained about the Maine Coon cat breed?',\n # 'answers': [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}],\n # 'execution_result': 'The Maine Coon is a big and hairy breed of cat',\n # 'thought': '',\n # 'keep_row_after_semantic_check': True,\n # 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',\n # 'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user\u2019s intentions.\\n\\nDo not pass if:\\n1. The function call does not align with the query\u2019s objective, or the input arguments appear incorrect.\\n2. The function call and arguments are not properly chosen from the available functions.\\n3. 
The number of function calls does not correspond to the user\u2019s intentions.\\n4. The execution results are irrelevant and do not match the function\u2019s purpose.\\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\\n'},\n # {'role': 'user',\n # 'content': 'Given Information:\\n- All Available Functions:\\nFetch information about a specific cat breed from the Cat Breeds API.\\n- User Query: What information can be obtained about the Maine Coon cat breed?\\n- Generated Function Calls: [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]\\n- Execution Results: The Maine Coon is a big and hairy breed of cat\\n\\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\\n\\nThe main decision factor is wheather the function calls accurately reflect the query\\'s intentions and the function descriptions.\\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\\n\\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\\n```\\n{\\n \"thought\": \"Concisely describe your reasoning here\",\\n \"pass\": \"yes\" or \"no\"\\n}\\n```\\n'}]},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n\n Semantic checker for generated function calls (structured output):\n\n ```python\n from distilabel.steps.tasks import APIGenSemanticChecker\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n )\n semantic_checker = APIGenSemanticChecker(\n use_default_structured_output=True,\n llm=llm\n )\n semantic_checker.load()\n\n res = next(\n semantic_checker.process(\n [\n {\n \"func_desc\": \"Fetch information about a specific cat breed from the Cat Breeds API.\",\n \"query\": \"What information can be obtained about the Maine Coon cat breed?\",\n \"answers\": json.dumps([{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]),\n \"execution_result\": \"The Maine Coon is a big and hairy breed of cat\",\n }\n ]\n )\n )\n res\n # [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',\n # 'query': 'What information can be obtained about the Maine Coon cat breed?',\n # 'answers': [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}],\n # 'execution_result': 'The Maine Coon is a big and hairy breed of cat',\n # 'keep_row_after_semantic_check': True,\n # 'thought': '',\n # 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',\n # 'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user\u2019s intentions.\\n\\nDo not pass if:\\n1. The function call does not align with the query\u2019s objective, or the input arguments appear incorrect.\\n2. The function call and arguments are not properly chosen from the available functions.\\n3. The number of function calls does not correspond to the user\u2019s intentions.\\n4. 
The execution results are irrelevant and do not match the function\u2019s purpose.\\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\\n'},\n # {'role': 'user',\n # 'content': 'Given Information:\\n- All Available Functions:\\nFetch information about a specific cat breed from the Cat Breeds API.\\n- User Query: What information can be obtained about the Maine Coon cat breed?\\n- Generated Function Calls: [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]\\n- Execution Results: The Maine Coon is a big and hairy breed of cat\\n\\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\\n\\nThe main decision factor is wheather the function calls accurately reflect the query\\'s intentions and the function descriptions.\\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\\n'}]},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n \"\"\"\n\n system_prompt: str = SYSTEM_PROMPT_SEMANTIC_CHECKER\n use_default_structured_output: bool = False\n\n _format_inst: Union[str, None] = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Loads the template for the generator prompt.\"\"\"\n super().load()\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"apigen\"\n / \"semantic_checker.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n self._format_inst = self._set_format_inst()\n\n def _set_format_inst(self) -> str:\n \"\"\"Prepares the function to generate the formatted instructions for the prompt.\n\n If the default structured output is used, returns an empty string because nothing\n else is needed, otherwise, returns the original addition to the prompt to guide the model\n to generate a formatted JSON.\n \"\"\"\n return (\n \"\\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\\n\"\n \"```\\n\"\n \"{\\n\"\n ' \"thought\": \"Concisely describe your reasoning here\",\\n'\n ' \"passes\": \"yes\" or \"no\"\\n'\n \"}\\n\"\n \"```\\n\"\n )\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The inputs for the task.\"\"\"\n return {\n \"func_desc\": True,\n \"query\": True,\n \"answers\": True,\n \"execution_result\": True,\n \"keep_row_after_execution_check\": True,\n }\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType`.\"\"\"\n return [\n {\"role\": \"system\", \"content\": self.system_prompt},\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n func_desc=input[\"func_desc\"],\n query=input[\"query\"] or \"\",\n func_call=input[\"answers\"] or \"\",\n execution_result=input[\"execution_result\"],\n format_inst=self._format_inst,\n ),\n },\n ]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The output for the task are the queries and corresponding answers.\"\"\"\n return [\"keep_row_after_semantic_check\", \"thought\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. 
Used for obtaining the number of responses.\n\n Returns:\n A dict with the queries and answers pairs.\n The answers are an array of answers corresponding to the query.\n Each answer is represented as an object with the following properties:\n - name (string): The name of the tool used to generate the answer.\n - arguments (object): An object representing the arguments passed to the tool to generate the answer.\n Each argument is represented as a key-value pair, where the key is the parameter name and the\n value is the corresponding value.\n \"\"\"\n if output is None:\n return self._default_error(input)\n\n output = remove_fences(output)\n\n try:\n result = orjson.loads(output)\n # Update the column name and change to bool\n result[\"keep_row_after_semantic_check\"] = (\n result.pop(\"passes\").lower() == \"yes\"\n )\n input.update(**result)\n return input\n except orjson.JSONDecodeError:\n return self._default_error(input)\n\n def _default_error(self, input: Dict[str, Any]) -> Dict[str, Any]:\n \"\"\"Default error message for the task.\"\"\"\n input.update({\"thought\": None, \"keep_row_after_semantic_check\": None})\n return input\n\n @override\n def get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from typing import Literal\n from pydantic import BaseModel\n import json\n\n class Checker(BaseModel):\n thought: str\n passes: Literal[\"yes\", \"no\"]\n\n json.dumps(Checker.model_json_schema(), indent=4)\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"properties\": {\n \"thought\": {\"title\": \"Thought\", \"type\": \"string\"},\n \"passes\": {\"enum\": [\"yes\", \"no\"], \"title\": \"Passes\", \"type\": \"string\"},\n },\n \"required\": [\"thought\", \"passes\"],\n \"title\": \"Checker\",\n \"type\": \"object\",\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker.inputs","title":"inputs
property
","text":"The inputs for the task.
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker.outputs","title":"outputs
property
","text":"The output for the task are the queries and corresponding answers.
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker.load","title":"load()
","text":"Loads the template for the generator prompt.
Source code insrc/distilabel/steps/tasks/apigen/semantic_checker.py
def load(self) -> None:\n \"\"\"Loads the template for the generator prompt.\"\"\"\n super().load()\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"apigen\"\n / \"semantic_checker.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n self._format_inst = self._set_format_inst()\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker._set_format_inst","title":"_set_format_inst()
","text":"Prepares the function to generate the formatted instructions for the prompt.
If the default structured output is used, returns an empty string because nothing else is needed, otherwise, returns the original addition to the prompt to guide the model to generate a formatted JSON.
Source code insrc/distilabel/steps/tasks/apigen/semantic_checker.py
def _set_format_inst(self) -> str:\n \"\"\"Prepares the function to generate the formatted instructions for the prompt.\n\n If the default structured output is used, returns an empty string because nothing\n else is needed, otherwise, returns the original addition to the prompt to guide the model\n to generate a formatted JSON.\n \"\"\"\n return (\n \"\\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\\n\"\n \"```\\n\"\n \"{\\n\"\n ' \"thought\": \"Concisely describe your reasoning here\",\\n'\n ' \"passes\": \"yes\" or \"no\"\\n'\n \"}\\n\"\n \"```\\n\"\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
.
src/distilabel/steps/tasks/apigen/semantic_checker.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType`.\"\"\"\n return [\n {\"role\": \"system\", \"content\": self.system_prompt},\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n func_desc=input[\"func_desc\"],\n query=input[\"query\"] or \"\",\n func_call=input[\"answers\"] or \"\",\n execution_result=input[\"execution_result\"],\n format_inst=self._format_inst,\n ),\n },\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker.format_output","title":"format_output(output, input)
","text":"The output is formatted as a list with the score of each instruction.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
the raw output of the LLM.
requiredinput
Dict[str, Any]
the input to the task. Used for obtaining the number of responses.
requiredReturns:
Type DescriptionDict[str, Any]
A dict with keep_row_after_semantic_check (a boolean indicating whether the row passes the semantic check) and thought (the reasoning behind the decision).
Source code insrc/distilabel/steps/tasks/apigen/semantic_checker.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with the queries and answers pairs.\n The answers are an array of answers corresponding to the query.\n Each answer is represented as an object with the following properties:\n - name (string): The name of the tool used to generate the answer.\n - arguments (object): An object representing the arguments passed to the tool to generate the answer.\n Each argument is represented as a key-value pair, where the key is the parameter name and the\n value is the corresponding value.\n \"\"\"\n if output is None:\n return self._default_error(input)\n\n output = remove_fences(output)\n\n try:\n result = orjson.loads(output)\n # Update the column name and change to bool\n result[\"keep_row_after_semantic_check\"] = (\n result.pop(\"passes\").lower() == \"yes\"\n )\n input.update(**result)\n return input\n except orjson.JSONDecodeError:\n return self._default_error(input)\n
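For reference, the parsing performed above on a well-formed response reduces to the following standalone sketch (not the library code; plain json is used instead of orjson, and the sample response is illustrative):

```python
# Standalone sketch: how a well-formed response is turned into the output columns.
import json

raw_response = '{"thought": "", "passes": "yes"}'  # illustrative LLM response
result = json.loads(raw_response)
# Rename the column and convert the yes/no answer to a boolean, as done above.
result["keep_row_after_semantic_check"] = result.pop("passes").lower() == "yes"
print(result)  # {'thought': '', 'keep_row_after_semantic_check': True}
```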
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker._default_error","title":"_default_error(input)
","text":"Default error message for the task.
Source code insrc/distilabel/steps/tasks/apigen/semantic_checker.py
def _default_error(self, input: Dict[str, Any]) -> Dict[str, Any]:\n \"\"\"Default error message for the task.\"\"\"\n input.update({\"thought\": None, \"keep_row_after_semantic_check\": None})\n return input\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.APIGenSemanticChecker.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
The schema corresponds to the following:
from typing import Literal\nfrom pydantic import BaseModel\nimport json\n\nclass Checker(BaseModel):\n thought: str\n passes: Literal[\"yes\", \"no\"]\n\njson.dumps(Checker.model_json_schema(), indent=4)\n
Returns:
Type DescriptionDict[str, Any]
JSON Schema of the response to enforce.
Source code insrc/distilabel/steps/tasks/apigen/semantic_checker.py
@override\ndef get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from typing import Literal\n from pydantic import BaseModel\n import json\n\n class Checker(BaseModel):\n thought: str\n passes: Literal[\"yes\", \"no\"]\n\n json.dumps(Checker.model_json_schema(), indent=4)\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"properties\": {\n \"thought\": {\"title\": \"Thought\", \"type\": \"string\"},\n \"passes\": {\"enum\": [\"yes\", \"no\"], \"title\": \"Passes\", \"type\": \"string\"},\n },\n \"required\": [\"thought\", \"passes\"],\n \"title\": \"Checker\",\n \"type\": \"object\",\n }\n
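As a quick check (a sketch assuming pydantic v2 is available), the Checker model from the docstring can be used to validate a sample response against the same schema the task enforces:

```python
# Sketch: validate an illustrative response against the enforced schema.
from typing import Literal

from pydantic import BaseModel


class Checker(BaseModel):
    thought: str
    passes: Literal["yes", "no"]


# A well-formed response parses cleanly; anything else raises a ValidationError.
print(Checker.model_validate_json('{"thought": "Looks consistent", "passes": "yes"}'))
```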
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller","title":"ArgillaLabeller
","text":" Bases: Task
Annotate Argilla records based on input fields, example records and question settings.
This task is designed to facilitate the annotation of Argilla records by leveraging a pre-trained LLM. It uses a system prompt that guides the LLM to understand the input fields, the question type, and the question settings. The task then formats the input data and generates a response based on the question. The response is validated against the question's value model, and the final suggestion is prepared for annotation.
Attributes:
Name Type Description_template
Union[Template, None]
a Jinja2 template used to format the input for the LLM.
Input columnsargilla.Record
): The record to be annotated.Optional[List[Dict[str, Any]]]
): The list of field settings for the input fields.Optional[Dict[str, Any]]
): The question settings for the question to be answered.Optional[List[Dict[str, Any]]]
): The few shot example records with responses to be used to answer the question.Optional[str]
): The guidelines for the annotation task.Dict[str, Any]
): The final suggestion for annotation.Argilla: Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
Examples:
Annotate a record with the same dataset and question:
import argilla as rg\nfrom argilla import Suggestion\nfrom distilabel.steps.tasks import ArgillaLabeller\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Get information from Argilla dataset definition\ndataset = rg.Dataset(\"my_dataset\")\npending_records_filter = rg.Filter((\"status\", \"==\", \"pending\"))\ncompleted_records_filter = rg.Filter((\"status\", \"==\", \"completed\"))\npending_records = list(\n dataset.records(\n query=rg.Query(filter=pending_records_filter),\n limit=5,\n )\n)\nexample_records = list(\n dataset.records(\n query=rg.Query(filter=completed_records_filter),\n limit=5,\n )\n)\nfield = dataset.settings.fields[\"text\"]\nquestion = dataset.settings.questions[\"label\"]\n\n# Initialize the labeller with the model and fields\nlabeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n fields=[field],\n question=question,\n example_records=example_records,\n guidelines=dataset.guidelines\n)\nlabeller.load()\n\n# Process the pending records\nresult = next(\n labeller.process(\n [\n {\n \"record\": record\n } for record in pending_records\n ]\n )\n)\n\n# Add the suggestions to the records\nfor record, suggestion in zip(pending_records, result):\n record.suggestions.add(Suggestion(**suggestion[\"suggestion\"]))\n\n# Log the updated records\ndataset.records.log(pending_records)\n
Annotate a record with alternating datasets and questions:
import argilla as rg\nfrom distilabel.steps.tasks import ArgillaLabeller\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Get information from Argilla dataset definition\ndataset = rg.Dataset(\"my_dataset\")\nfield = dataset.settings.fields[\"text\"]\nquestion = dataset.settings.questions[\"label\"]\nquestion2 = dataset.settings.questions[\"label2\"]\n\n# Initialize the labeller with the model and fields\nlabeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n)\nlabeller.load()\n\n# Process the record\nrecord = next(dataset.records())\nresult = next(\n labeller.process(\n [\n {\n \"record\": record,\n \"fields\": [field],\n \"question\": question,\n },\n {\n \"record\": record,\n \"fields\": [field],\n \"question\": question2,\n }\n ]\n )\n)\n\n# Add the suggestions to the record\nfor suggestion in result:\n record.suggestions.add(rg.Suggestion(**suggestion[\"suggestion\"]))\n\n# Log the updated record\ndataset.records.log([record])\n
Overwrite default prompts and instructions:
import argilla as rg\nfrom distilabel.steps.tasks import ArgillaLabeller\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Overwrite default prompts and instructions\nlabeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n system_prompt=\"You are an expert annotator and labelling assistant that understands complex domains and natural language processing.\",\n question_to_label_instruction={\n \"label_selection\": \"Select the appropriate label from the list of provided labels.\",\n \"multi_label_selection\": \"Select none, one or multiple labels from the list of provided labels.\",\n \"text\": \"Provide a text response to the question.\",\n \"rating\": \"Provide a rating for the question.\",\n },\n)\nlabeller.load()\n
Source code in src/distilabel/steps/tasks/argilla_labeller.py
class ArgillaLabeller(Task):\n \"\"\"\n Annotate Argilla records based on input fields, example records and question settings.\n\n This task is designed to facilitate the annotation of Argilla records by leveraging a pre-trained LLM.\n It uses a system prompt that guides the LLM to understand the input fields, the question type,\n and the question settings. The task then formats the input data and generates a response based on the question.\n The response is validated against the question's value model, and the final suggestion is prepared for annotation.\n\n Attributes:\n _template: a Jinja2 template used to format the input for the LLM.\n\n Input columns:\n - record (`argilla.Record`): The record to be annotated.\n - fields (`Optional[List[Dict[str, Any]]]`): The list of field settings for the input fields.\n - question (`Optional[Dict[str, Any]]`): The question settings for the question to be answered.\n - example_records (`Optional[List[Dict[str, Any]]]`): The few shot example records with responses to be used to answer the question.\n - guidelines (`Optional[str]`): The guidelines for the annotation task.\n\n Output columns:\n - suggestion (`Dict[str, Any]`): The final suggestion for annotation.\n\n Categories:\n - text-classification\n - scorer\n - text-generation\n\n References:\n - [`Argilla: Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets`](https://github.com/argilla-io/argilla/)\n\n Examples:\n Annotate a record with the same dataset and question:\n\n ```python\n import argilla as rg\n from argilla import Suggestion\n from distilabel.steps.tasks import ArgillaLabeller\n from distilabel.models import InferenceEndpointsLLM\n\n # Get information from Argilla dataset definition\n dataset = rg.Dataset(\"my_dataset\")\n pending_records_filter = rg.Filter((\"status\", \"==\", \"pending\"))\n completed_records_filter = rg.Filter((\"status\", \"==\", \"completed\"))\n pending_records = list(\n dataset.records(\n query=rg.Query(filter=pending_records_filter),\n limit=5,\n )\n )\n example_records = list(\n dataset.records(\n query=rg.Query(filter=completed_records_filter),\n limit=5,\n )\n )\n field = dataset.settings.fields[\"text\"]\n question = dataset.settings.questions[\"label\"]\n\n # Initialize the labeller with the model and fields\n labeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n fields=[field],\n question=question,\n example_records=example_records,\n guidelines=dataset.guidelines\n )\n labeller.load()\n\n # Process the pending records\n result = next(\n labeller.process(\n [\n {\n \"record\": record\n } for record in pending_records\n ]\n )\n )\n\n # Add the suggestions to the records\n for record, suggestion in zip(pending_records, result):\n record.suggestions.add(Suggestion(**suggestion[\"suggestion\"]))\n\n # Log the updated records\n dataset.records.log(pending_records)\n ```\n\n Annotate a record with alternating datasets and questions:\n\n ```python\n import argilla as rg\n from distilabel.steps.tasks import ArgillaLabeller\n from distilabel.models import InferenceEndpointsLLM\n\n # Get information from Argilla dataset definition\n dataset = rg.Dataset(\"my_dataset\")\n field = dataset.settings.fields[\"text\"]\n question = dataset.settings.questions[\"label\"]\n question2 = dataset.settings.questions[\"label2\"]\n\n # Initialize the labeller with the model and fields\n labeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n 
model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n )\n labeller.load()\n\n # Process the record\n record = next(dataset.records())\n result = next(\n labeller.process(\n [\n {\n \"record\": record,\n \"fields\": [field],\n \"question\": question,\n },\n {\n \"record\": record,\n \"fields\": [field],\n \"question\": question2,\n }\n ]\n )\n )\n\n # Add the suggestions to the record\n for suggestion in result:\n record.suggestions.add(rg.Suggestion(**suggestion[\"suggestion\"]))\n\n # Log the updated record\n dataset.records.log([record])\n ```\n\n Overwrite default prompts and instructions:\n\n ```python\n import argilla as rg\n from distilabel.steps.tasks import ArgillaLabeller\n from distilabel.models import InferenceEndpointsLLM\n\n # Overwrite default prompts and instructions\n labeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n system_prompt=\"You are an expert annotator and labelling assistant that understands complex domains and natural language processing.\",\n question_to_label_instruction={\n \"label_selection\": \"Select the appropriate label from the list of provided labels.\",\n \"multi_label_selection\": \"Select none, one or multiple labels from the list of provided labels.\",\n \"text\": \"Provide a text response to the question.\",\n \"rating\": \"Provide a rating for the question.\",\n },\n )\n labeller.load()\n ```\n \"\"\"\n\n system_prompt: str = (\n \"You are an expert annotator and labelling assistant that understands complex domains and natural language processing. \"\n \"You are given input fields and a question. \"\n \"You should create a valid JSON object as an response to the question based on the input fields. \"\n )\n question_to_label_instruction: Dict[str, str] = {\n \"label_selection\": \"Select the appropriate label for the fields from the list of optional labels.\",\n \"multi_label_selection\": \"Select none, one or multiple labels for the fields from the list of optional labels.\",\n \"text\": \"Provide a response to the question based on the fields.\",\n \"rating\": \"Provide a rating for the question based on the fields.\",\n }\n example_records: Optional[\n RuntimeParameter[Union[List[Union[Dict[str, Any], BaseModel]], None]]\n ] = Field(\n default=None,\n description=\"The few shot serialized example records or `BaseModel`s with responses to be used to answer the question.\",\n )\n fields: Optional[\n RuntimeParameter[Union[List[Union[BaseModel, Dict[str, Any]]], None]]\n ] = Field(\n default=None,\n description=\"The field serialized field settings or `BaseModel` for the fields to be used to answer the question.\",\n )\n question: Optional[\n RuntimeParameter[\n Union[\n Dict[str, Any],\n BaseModel,\n None,\n ]\n ]\n ] = Field(\n default=None,\n description=\"The question serialized question settings or `BaseModel` for the question to be answered.\",\n )\n guidelines: Optional[RuntimeParameter[str]] = Field(\n default=None,\n description=\"The guidelines for the annotation task.\",\n )\n\n _template: Union[Template, None] = PrivateAttr(...)\n _client: Optional[Any] = PrivateAttr(None)\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"argillalabeller.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> Dict[str, bool]:\n return {\n \"record\": True,\n \"fields\": False,\n \"question\": False,\n 
\"example_records\": False,\n \"guidelines\": False,\n }\n\n def _format_record(\n self, record: Dict[str, Any], fields: List[Dict[str, Any]]\n ) -> str:\n \"\"\"Format the record fields into a string.\n\n Args:\n record (Dict[str, Any]): The record to format.\n fields (List[Dict[str, Any]]): The fields to format.\n\n Returns:\n str: The formatted record fields.\n \"\"\"\n output = []\n for field in fields:\n output.append(record.get(\"fields\", {}).get(field.get(\"name\", \"\")))\n return \"fields: \" + \"\\n\".join(output)\n\n def _get_label_instruction(self, question: Dict[str, Any]) -> str:\n \"\"\"Get the label instruction for the question.\n\n Args:\n question (Dict[str, Any]): The question to get the label instruction for.\n\n Returns:\n str: The label instruction for the question.\n \"\"\"\n question_type = question[\"settings\"][\"type\"]\n return self.question_to_label_instruction[question_type]\n\n def _format_question(self, question: Dict[str, Any]) -> str:\n \"\"\"Format the question settings into a string.\n\n Args:\n question (Dict[str, Any]): The question to format.\n\n Returns:\n str: The formatted question.\n \"\"\"\n output = []\n output.append(f\"question: {self._get_label_instruction(question)}\")\n if \"options\" in question.get(\"settings\", {}):\n output.append(\n f\"optional labels: {[option['value'] for option in question.get('settings', {}).get('options', [])]}\"\n )\n return \"\\n\".join(output)\n\n def _format_example_records(\n self,\n records: List[Dict[str, Any]],\n fields: List[Dict[str, Any]],\n question: Dict[str, Any],\n ) -> str:\n \"\"\"Format the example records into a string.\n\n Args:\n records (List[Dict[str, Any]]): The records to format.\n fields (List[Dict[str, Any]]): The fields to format.\n question (Dict[str, Any]): The question to format.\n\n Returns:\n str: The formatted example records.\n \"\"\"\n base = []\n for record in records:\n responses = record.get(\"responses\", {})\n if responses.get(question[\"name\"]):\n base.append(self._format_record(record, fields))\n value = responses[question[\"name\"]][0][\"value\"]\n formatted_value = self._assign_value_to_question_value_model(\n value, question\n )\n base.append(f\"response: {formatted_value}\")\n base.append(\"\")\n else:\n warnings.warn(\n f\"Record {record} has no response for question {question['name']}. 
Skipping example record.\",\n stacklevel=2,\n )\n return \"\\n\".join(base)\n\n def format_input(\n self,\n input: Dict[\n str,\n Union[\n Dict[str, Any],\n \"Record\",\n \"TextField\",\n \"MultiLabelQuestion\",\n \"LabelQuestion\",\n \"RatingQuestion\",\n \"TextQuestion\",\n ],\n ],\n ) -> \"ChatType\":\n \"\"\"Format the input into a chat message.\n\n Args:\n input: The input to format.\n\n Returns:\n The formatted chat message.\n\n Raises:\n ValueError: If question or fields are not provided.\n \"\"\"\n input_keys = list(self.inputs.keys())\n record = input[input_keys[0]]\n fields = input.get(input_keys[1], self.fields)\n question = input.get(input_keys[2], self.question)\n examples = input.get(input_keys[3], self.example_records)\n guidelines = input.get(input_keys[4], self.guidelines)\n\n if question is None:\n raise ValueError(\"Question must be provided.\")\n if fields is None or any(field is None for field in fields):\n raise ValueError(\"Fields must be provided.\")\n\n record = record.to_dict() if not isinstance(record, dict) else record\n question = question.serialize() if not isinstance(question, dict) else question\n fields = [\n field.serialize() if not isinstance(field, dict) else field\n for field in fields\n ]\n examples = (\n [\n example.to_dict() if not isinstance(example, dict) else example\n for example in examples\n ]\n if examples\n else None\n )\n\n formatted_fields = self._format_record(record, fields)\n formatted_question = self._format_question(question)\n formatted_examples = (\n self._format_example_records(examples, fields, question)\n if examples\n else False\n )\n\n prompt = self._template.render(\n fields=formatted_fields,\n question=formatted_question,\n examples=formatted_examples,\n guidelines=guidelines,\n )\n\n messages = []\n if self.system_prompt:\n messages.append({\"role\": \"system\", \"content\": self.system_prompt})\n messages.append({\"role\": \"user\", \"content\": prompt})\n return messages\n\n @property\n def outputs(self) -> List[str]:\n return [\"suggestion\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"Format the output into a dictionary.\n\n Args:\n output (Union[str, None]): The output to format.\n input (Dict[str, Any]): The input to format.\n\n Returns:\n Dict[str, Any]: The formatted output.\n \"\"\"\n from argilla import Suggestion\n\n question: Union[\n Any,\n Dict[str, Any],\n LabelQuestion,\n MultiLabelQuestion,\n RatingQuestion,\n TextQuestion,\n None,\n ] = input.get(list(self.inputs.keys())[2], self.question) or self.question\n question = question.serialize() if not isinstance(question, dict) else question\n model = self._get_pydantic_model_of_structured_output(question)\n validated_output = model(**json.loads(output))\n value = self._get_value_from_question_value_model(validated_output)\n suggestion = Suggestion(\n value=value,\n question_name=question[\"name\"],\n type=\"model\",\n agent=self.llm.model_name,\n ).serialize()\n return {\n self.outputs[0]: {\n k: v\n for k, v in suggestion.items()\n if k in [\"value\", \"question_name\", \"type\", \"agent\"]\n }\n }\n\n def _set_llm_structured_output_for_question(self, question: Dict[str, Any]) -> None:\n runtime_parameters = self.llm._runtime_parameters\n runtime_parameters.update(\n {\n \"structured_output\": {\n \"format\": \"json\",\n \"schema\": self._get_pydantic_model_of_structured_output(question),\n },\n }\n )\n self.llm.set_runtime_parameters(runtime_parameters)\n\n @override\n def process(self, inputs: 
StepInput) -> \"StepOutput\":\n \"\"\"Process the input through the task.\n\n Args:\n inputs (StepInput): The input to process.\n\n Returns:\n StepOutput: The output of the task.\n \"\"\"\n\n question_list = [input.get(\"question\", self.question) for input in inputs]\n fields_list = [input.get(\"fields\", self.fields) for input in inputs]\n # check if any field for the field in fields is None\n for fields in fields_list:\n if any(field is None for field in fields):\n raise ValueError(\n \"Fields must be provided during init or through `process` method.\"\n )\n # check if any question is None\n if any(question is None for question in question_list):\n raise ValueError(\n \"Question must be provided during init or through `process` method.\"\n )\n question_list = [\n question.serialize() if not isinstance(question, dict) else question\n for question in question_list\n ]\n if not all(question == question_list[0] for question in question_list):\n warnings.warn(\n \"Not all questions are the same. Processing each question separately by setting the structured output for each question. This may impact performance.\",\n stacklevel=2,\n )\n for input, question in zip(inputs, question_list):\n self._set_llm_structured_output_for_question(question)\n yield from super().process([input])\n else:\n question = question_list[0]\n self._set_llm_structured_output_for_question(question)\n yield from super().process(inputs)\n\n def _get_value_from_question_value_model(\n self, question_value_model: BaseModel\n ) -> Any:\n \"\"\"Get the value from the question value model.\n\n Args:\n question_value_model (BaseModel): The question value model to get the value from.\n\n Returns:\n Any: The value from the question value model.\n \"\"\"\n for attr in [\"label\", \"labels\", \"rating\", \"text\"]:\n if hasattr(question_value_model, attr):\n return getattr(question_value_model, attr)\n raise ValueError(f\"Unsupported question type: {question_value_model}\")\n\n def _assign_value_to_question_value_model(\n self, value: Any, question: Dict[str, Any]\n ) -> BaseModel:\n \"\"\"Assign the value to the question value model.\n\n Args:\n value (Any): The value to assign.\n question (Dict[str, Any]): The question to assign the value to.\n\n Returns:\n BaseModel: The question value model with the assigned value.\n \"\"\"\n question_value_model = self._get_pydantic_model_of_structured_output(question)\n for attr in [\"label\", \"labels\", \"rating\", \"text\"]:\n try:\n model_dict = {attr: value}\n question_value_model = question_value_model(**model_dict)\n return question_value_model.model_dump_json()\n except AttributeError:\n pass\n return value\n\n def _get_pydantic_model_of_structured_output(\n self,\n question: Dict[str, Any],\n ) -> BaseModel:\n \"\"\"Get the Pydantic model of the structured output.\n\n Args:\n question (Dict[str, Any]): The question to get the Pydantic model of the structured output for.\n\n Returns:\n BaseModel: The Pydantic model of the structured output.\n \"\"\"\n\n question_type = question[\"settings\"][\"type\"]\n\n if question_type == \"multi_label_selection\":\n\n class QuestionValueModel(BaseModel):\n labels: Optional[List[str]] = Field(default_factory=list)\n\n elif question_type == \"label_selection\":\n\n class QuestionValueModel(BaseModel):\n label: str\n\n elif question_type == \"text\":\n\n class QuestionValueModel(BaseModel):\n text: str\n\n elif question_type == \"rating\":\n\n class QuestionValueModel(BaseModel):\n rating: int\n else:\n raise ValueError(f\"Unsupported question type: 
{question}\")\n\n return QuestionValueModel\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller.load","title":"load()
","text":"Loads the Jinja2 template.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"argillalabeller.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller._format_record","title":"_format_record(record, fields)
","text":"Format the record fields into a string.
Parameters:
Name Type Description Defaultrecord
Dict[str, Any]
The record to format.
requiredfields
List[Dict[str, Any]]
The fields to format.
requiredReturns:
Name Type Descriptionstr
str
The formatted record fields.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def _format_record(\n self, record: Dict[str, Any], fields: List[Dict[str, Any]]\n) -> str:\n \"\"\"Format the record fields into a string.\n\n Args:\n record (Dict[str, Any]): The record to format.\n fields (List[Dict[str, Any]]): The fields to format.\n\n Returns:\n str: The formatted record fields.\n \"\"\"\n output = []\n for field in fields:\n output.append(record.get(\"fields\", {}).get(field.get(\"name\", \"\")))\n return \"fields: \" + \"\\n\".join(output)\n
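A standalone sketch of the string this produces for a hypothetical serialized record and field (the names and values below are illustrative, not taken from a real Argilla dataset):

```python
# Hypothetical serialized record and field settings, mirroring the shapes the method expects.
record = {"fields": {"text": "I love this product!"}}
fields = [{"name": "text"}]

# Reproduces the concatenation done above.
formatted = "fields: " + "\n".join(
    record.get("fields", {}).get(field.get("name", "")) for field in fields
)
print(formatted)  # fields: I love this product!
```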
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller._get_label_instruction","title":"_get_label_instruction(question)
","text":"Get the label instruction for the question.
Parameters:
Name Type Description Defaultquestion
Dict[str, Any]
The question to get the label instruction for.
requiredReturns:
Name Type Descriptionstr
str
The label instruction for the question.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def _get_label_instruction(self, question: Dict[str, Any]) -> str:\n \"\"\"Get the label instruction for the question.\n\n Args:\n question (Dict[str, Any]): The question to get the label instruction for.\n\n Returns:\n str: The label instruction for the question.\n \"\"\"\n question_type = question[\"settings\"][\"type\"]\n return self.question_to_label_instruction[question_type]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller._format_question","title":"_format_question(question)
","text":"Format the question settings into a string.
Parameters:
Name Type Description Defaultquestion
Dict[str, Any]
The question to format.
requiredReturns:
Name Type Descriptionstr
str
The formatted question.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def _format_question(self, question: Dict[str, Any]) -> str:\n \"\"\"Format the question settings into a string.\n\n Args:\n question (Dict[str, Any]): The question to format.\n\n Returns:\n str: The formatted question.\n \"\"\"\n output = []\n output.append(f\"question: {self._get_label_instruction(question)}\")\n if \"options\" in question.get(\"settings\", {}):\n output.append(\n f\"optional labels: {[option['value'] for option in question.get('settings', {}).get('options', [])]}\"\n )\n return \"\\n\".join(output)\n
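For orientation, a hypothetical serialized label_selection question (illustrative names and options) would be rendered roughly as in this standalone sketch:

```python
# Hypothetical serialized question following the structure used above.
question = {
    "name": "sentiment",
    "settings": {
        "type": "label_selection",
        "options": [{"value": "positive"}, {"value": "negative"}],
    },
}

# Reproduces the formatting logic above for this example.
instruction = "Select the appropriate label for the fields from the list of optional labels."
lines = [f"question: {instruction}"]
if "options" in question["settings"]:
    lines.append(
        f"optional labels: {[option['value'] for option in question['settings']['options']]}"
    )
print("\n".join(lines))
# question: Select the appropriate label for the fields from the list of optional labels.
# optional labels: ['positive', 'negative']
```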
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller._format_example_records","title":"_format_example_records(records, fields, question)
","text":"Format the example records into a string.
Parameters:
Name Type Description Defaultrecords
List[Dict[str, Any]]
The records to format.
requiredfields
List[Dict[str, Any]]
The fields to format.
requiredquestion
Dict[str, Any]
The question to format.
requiredReturns:
Name Type Descriptionstr
str
The formatted example records.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def _format_example_records(\n self,\n records: List[Dict[str, Any]],\n fields: List[Dict[str, Any]],\n question: Dict[str, Any],\n) -> str:\n \"\"\"Format the example records into a string.\n\n Args:\n records (List[Dict[str, Any]]): The records to format.\n fields (List[Dict[str, Any]]): The fields to format.\n question (Dict[str, Any]): The question to format.\n\n Returns:\n str: The formatted example records.\n \"\"\"\n base = []\n for record in records:\n responses = record.get(\"responses\", {})\n if responses.get(question[\"name\"]):\n base.append(self._format_record(record, fields))\n value = responses[question[\"name\"]][0][\"value\"]\n formatted_value = self._assign_value_to_question_value_model(\n value, question\n )\n base.append(f\"response: {formatted_value}\")\n base.append(\"\")\n else:\n warnings.warn(\n f\"Record {record} has no response for question {question['name']}. Skipping example record.\",\n stacklevel=2,\n )\n return \"\\n\".join(base)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller.format_input","title":"format_input(input)
","text":"Format the input into a chat message.
Parameters:
Name Type Description Defaultinput
Dict[str, Union[Dict[str, Any], Record, TextField, MultiLabelQuestion, LabelQuestion, RatingQuestion, TextQuestion]]
The input to format.
requiredReturns:
Type DescriptionChatType
The formatted chat message.
Raises:
Type DescriptionValueError
If question or fields are not provided.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def format_input(\n self,\n input: Dict[\n str,\n Union[\n Dict[str, Any],\n \"Record\",\n \"TextField\",\n \"MultiLabelQuestion\",\n \"LabelQuestion\",\n \"RatingQuestion\",\n \"TextQuestion\",\n ],\n ],\n) -> \"ChatType\":\n \"\"\"Format the input into a chat message.\n\n Args:\n input: The input to format.\n\n Returns:\n The formatted chat message.\n\n Raises:\n ValueError: If question or fields are not provided.\n \"\"\"\n input_keys = list(self.inputs.keys())\n record = input[input_keys[0]]\n fields = input.get(input_keys[1], self.fields)\n question = input.get(input_keys[2], self.question)\n examples = input.get(input_keys[3], self.example_records)\n guidelines = input.get(input_keys[4], self.guidelines)\n\n if question is None:\n raise ValueError(\"Question must be provided.\")\n if fields is None or any(field is None for field in fields):\n raise ValueError(\"Fields must be provided.\")\n\n record = record.to_dict() if not isinstance(record, dict) else record\n question = question.serialize() if not isinstance(question, dict) else question\n fields = [\n field.serialize() if not isinstance(field, dict) else field\n for field in fields\n ]\n examples = (\n [\n example.to_dict() if not isinstance(example, dict) else example\n for example in examples\n ]\n if examples\n else None\n )\n\n formatted_fields = self._format_record(record, fields)\n formatted_question = self._format_question(question)\n formatted_examples = (\n self._format_example_records(examples, fields, question)\n if examples\n else False\n )\n\n prompt = self._template.render(\n fields=formatted_fields,\n question=formatted_question,\n examples=formatted_examples,\n guidelines=guidelines,\n )\n\n messages = []\n if self.system_prompt:\n messages.append({\"role\": \"system\", \"content\": self.system_prompt})\n messages.append({\"role\": \"user\", \"content\": prompt})\n return messages\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller.format_output","title":"format_output(output, input)
","text":"Format the output into a dictionary.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
The output to format.
requiredinput
Dict[str, Any]
The input to format.
requiredReturns:
Type DescriptionDict[str, Any]
Dict[str, Any]: The formatted output.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"Format the output into a dictionary.\n\n Args:\n output (Union[str, None]): The output to format.\n input (Dict[str, Any]): The input to format.\n\n Returns:\n Dict[str, Any]: The formatted output.\n \"\"\"\n from argilla import Suggestion\n\n question: Union[\n Any,\n Dict[str, Any],\n LabelQuestion,\n MultiLabelQuestion,\n RatingQuestion,\n TextQuestion,\n None,\n ] = input.get(list(self.inputs.keys())[2], self.question) or self.question\n question = question.serialize() if not isinstance(question, dict) else question\n model = self._get_pydantic_model_of_structured_output(question)\n validated_output = model(**json.loads(output))\n value = self._get_value_from_question_value_model(validated_output)\n suggestion = Suggestion(\n value=value,\n question_name=question[\"name\"],\n type=\"model\",\n agent=self.llm.model_name,\n ).serialize()\n return {\n self.outputs[0]: {\n k: v\n for k, v in suggestion.items()\n if k in [\"value\", \"question_name\", \"type\", \"agent\"]\n }\n }\n
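For orientation, the returned row carries a suggestion dictionary with roughly the following shape (the values below are illustrative, not the output of a real run):

```python
# Illustrative shape of a formatted output row; actual values depend on the
# question settings and on the LLM behind the task.
example_row = {
    "suggestion": {
        "value": "positive",
        "question_name": "sentiment",
        "type": "model",
        "agent": "mistralai/Mistral-7B-Instruct-v0.2",
    }
}
```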
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller.process","title":"process(inputs)
","text":"Process the input through the task.
Parameters:
Name Type Description Defaultinputs
StepInput
The input to process.
requiredReturns:
Name Type DescriptionStepOutput
StepOutput
The output of the task.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
@override\ndef process(self, inputs: StepInput) -> \"StepOutput\":\n \"\"\"Process the input through the task.\n\n Args:\n inputs (StepInput): The input to process.\n\n Returns:\n StepOutput: The output of the task.\n \"\"\"\n\n question_list = [input.get(\"question\", self.question) for input in inputs]\n fields_list = [input.get(\"fields\", self.fields) for input in inputs]\n # check if any field for the field in fields is None\n for fields in fields_list:\n if any(field is None for field in fields):\n raise ValueError(\n \"Fields must be provided during init or through `process` method.\"\n )\n # check if any question is None\n if any(question is None for question in question_list):\n raise ValueError(\n \"Question must be provided during init or through `process` method.\"\n )\n question_list = [\n question.serialize() if not isinstance(question, dict) else question\n for question in question_list\n ]\n if not all(question == question_list[0] for question in question_list):\n warnings.warn(\n \"Not all questions are the same. Processing each question separately by setting the structured output for each question. This may impact performance.\",\n stacklevel=2,\n )\n for input, question in zip(inputs, question_list):\n self._set_llm_structured_output_for_question(question)\n yield from super().process([input])\n else:\n question = question_list[0]\n self._set_llm_structured_output_for_question(question)\n yield from super().process(inputs)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller._get_value_from_question_value_model","title":"_get_value_from_question_value_model(question_value_model)
","text":"Get the value from the question value model.
Parameters:
Name Type Description Defaultquestion_value_model
BaseModel
The question value model to get the value from.
requiredReturns:
Name Type DescriptionAny
Any
The value from the question value model.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def _get_value_from_question_value_model(\n self, question_value_model: BaseModel\n) -> Any:\n \"\"\"Get the value from the question value model.\n\n Args:\n question_value_model (BaseModel): The question value model to get the value from.\n\n Returns:\n Any: The value from the question value model.\n \"\"\"\n for attr in [\"label\", \"labels\", \"rating\", \"text\"]:\n if hasattr(question_value_model, attr):\n return getattr(question_value_model, attr)\n raise ValueError(f\"Unsupported question type: {question_value_model}\")\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller._assign_value_to_question_value_model","title":"_assign_value_to_question_value_model(value, question)
","text":"Assign the value to the question value model.
Parameters:
Name Type Description Defaultvalue
Any
The value to assign.
requiredquestion
Dict[str, Any]
The question to assign the value to.
requiredReturns:
Name Type DescriptionBaseModel
BaseModel
The question value model with the assigned value.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def _assign_value_to_question_value_model(\n self, value: Any, question: Dict[str, Any]\n) -> BaseModel:\n \"\"\"Assign the value to the question value model.\n\n Args:\n value (Any): The value to assign.\n question (Dict[str, Any]): The question to assign the value to.\n\n Returns:\n BaseModel: The question value model with the assigned value.\n \"\"\"\n question_value_model = self._get_pydantic_model_of_structured_output(question)\n for attr in [\"label\", \"labels\", \"rating\", \"text\"]:\n try:\n model_dict = {attr: value}\n question_value_model = question_value_model(**model_dict)\n return question_value_model.model_dump_json()\n except AttributeError:\n pass\n return value\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ArgillaLabeller._get_pydantic_model_of_structured_output","title":"_get_pydantic_model_of_structured_output(question)
","text":"Get the Pydantic model of the structured output.
Parameters:
Name Type Description Defaultquestion
Dict[str, Any]
The question to get the Pydantic model of the structured output for.
requiredReturns:
Name Type DescriptionBaseModel
BaseModel
The Pydantic model of the structured output.
Source code insrc/distilabel/steps/tasks/argilla_labeller.py
def _get_pydantic_model_of_structured_output(\n self,\n question: Dict[str, Any],\n) -> BaseModel:\n \"\"\"Get the Pydantic model of the structured output.\n\n Args:\n question (Dict[str, Any]): The question to get the Pydantic model of the structured output for.\n\n Returns:\n BaseModel: The Pydantic model of the structured output.\n \"\"\"\n\n question_type = question[\"settings\"][\"type\"]\n\n if question_type == \"multi_label_selection\":\n\n class QuestionValueModel(BaseModel):\n labels: Optional[List[str]] = Field(default_factory=list)\n\n elif question_type == \"label_selection\":\n\n class QuestionValueModel(BaseModel):\n label: str\n\n elif question_type == \"text\":\n\n class QuestionValueModel(BaseModel):\n text: str\n\n elif question_type == \"rating\":\n\n class QuestionValueModel(BaseModel):\n rating: int\n else:\n raise ValueError(f\"Unsupported question type: {question}\")\n\n return QuestionValueModel\n
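As a standalone sketch, the model built for a label_selection question behaves like the following (mirroring the class defined in the branch above, with an illustrative value):

```python
from pydantic import BaseModel


# Mirrors the "label_selection" branch above; defined standalone for illustration.
class QuestionValueModel(BaseModel):
    label: str


print(QuestionValueModel(label="positive").model_dump_json())  # {"label":"positive"}
```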
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.CLAIR","title":"CLAIR
","text":" Bases: Task
Contrastive Learning from AI Revisions (CLAIR).
CLAIR uses an AI system to minimally revise a solution A\u2192A\u00b4 so that the resulting preference, A preferred over A\u2019, is much more contrastive and precise.
str
): The task or instruction.str
): An answer to the task that is to be revised.str
): The revised text.str
): The rational for the provided revision.str
): The name of the model used to generate the revision and rational.Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
APO and CLAIR - GitHub Repository
Examples:
Create contrastive preference pairs:
from distilabel.steps.tasks import CLAIR\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 4096,\n },\n)\nclair_task = CLAIR(llm=llm)\n\nclair_task.load()\n\nresult = next(\n clair_task.process(\n [\n {\n \"task\": \"How many gaps are there between the earth and the moon?\",\n \"student_solution\": 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon's orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.'\n }\n ]\n )\n)\n# result\n# [{'task': 'How many gaps are there between the earth and the moon?',\n# 'student_solution': 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.',\n# 'revision': 'There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\\'s orbital path, not the presence of any gaps.\\n\\nIn summary, the Moon\\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',\n# 'rational': 'The student\\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term \"gaps\" can be misleading in this context. 
The student should clarify what they mean by \"gaps.\" Secondly, the student provides some additional information about the Moon\\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\\'s conclusion could be more concise.',\n# 'distilabel_metadata': {'raw_output_c_l_a_i_r_0': '{teacher_reasoning}: The student\\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term \"gaps\" can be misleading in this context. The student should clarify what they mean by \"gaps.\" Secondly, the student provides some additional information about the Moon\\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\\'s conclusion could be more concise.\\n\\n{corrected_student_solution}: There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\\'s orbital path, not the presence of any gaps.\\n\\nIn summary, the Moon\\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',\n# 'raw_input_c_l_a_i_r_0': [{'role': 'system',\n# 'content': \"You are a teacher and your task is to minimally improve a student's answer. I will give you a {task} and a {student_solution}. Your job is to revise the {student_solution} such that it is clearer, more correct, and more engaging. Copy all non-corrected parts of the student's answer. Do not allude to the {corrected_student_solution} being a revision or a correction in your final solution.\"},\n# {'role': 'user',\n# 'content': '{task}: How many gaps are there between the earth and the moon?\\n\\n{student_solution}: There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.\\n\\n-----------------\\n\\nLet\\'s first think step by step with a {teacher_reasoning} to decide how to improve the {student_solution}, then give the {corrected_student_solution}. 
Mention the {teacher_reasoning} and {corrected_student_solution} identifiers to structure your answer.'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
Citations:
```\n@misc{doosterlinck2024anchoredpreferenceoptimizationcontrastive,\n title={Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment},\n author={Karel D'Oosterlinck and Winnie Xu and Chris Develder and Thomas Demeester and Amanpreet Singh and Christopher Potts and Douwe Kiela and Shikib Mehri},\n year={2024},\n eprint={2408.06266},\n archivePrefix={arXiv},\n primaryClass={cs.LG},\n url={https://arxiv.org/abs/2408.06266},\n}\n```\n
Source code in src/distilabel/steps/tasks/clair.py
class CLAIR(Task):\n r\"\"\"Contrastive Learning from AI Revisions (CLAIR).\n\n CLAIR uses an AI system to minimally revise a solution A\u2192A\u00b4 such that the resulting\n preference A `preferred` A\u2019 is much more contrastive and precise.\n\n Input columns:\n - task (`str`): The task or instruction.\n - student_solution (`str`): An answer to the task that is to be revised.\n\n Output columns:\n - revision (`str`): The revised text.\n - rational (`str`): The rational for the provided revision.\n - model_name (`str`): The name of the model used to generate the revision and rational.\n\n Categories:\n - preference\n - text-generation\n\n References:\n - [`Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment`](https://arxiv.org/abs/2408.06266v1)\n - [`APO and CLAIR - GitHub Repository`](https://github.com/ContextualAI/CLAIR_and_APO)\n\n Examples:\n Create contrastive preference pairs:\n\n ```python\n from distilabel.steps.tasks import CLAIR\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 4096,\n },\n )\n clair_task = CLAIR(llm=llm)\n\n clair_task.load()\n\n result = next(\n clair_task.process(\n [\n {\n \"task\": \"How many gaps are there between the earth and the moon?\",\n \"student_solution\": 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon's orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.'\n }\n ]\n )\n )\n # result\n # [{'task': 'How many gaps are there between the earth and the moon?',\n # 'student_solution': 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.',\n # 'revision': 'There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. 
The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\\'s orbital path, not the presence of any gaps.\\n\\nIn summary, the Moon\\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',\n # 'rational': 'The student\\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term \"gaps\" can be misleading in this context. The student should clarify what they mean by \"gaps.\" Secondly, the student provides some additional information about the Moon\\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\\'s conclusion could be more concise.',\n # 'distilabel_metadata': {'raw_output_c_l_a_i_r_0': '{teacher_reasoning}: The student\\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term \"gaps\" can be misleading in this context. The student should clarify what they mean by \"gaps.\" Secondly, the student provides some additional information about the Moon\\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\\'s conclusion could be more concise.\\n\\n{corrected_student_solution}: There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\\'s orbital path, not the presence of any gaps.\\n\\nIn summary, the Moon\\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',\n # 'raw_input_c_l_a_i_r_0': [{'role': 'system',\n # 'content': \"You are a teacher and your task is to minimally improve a student's answer. I will give you a {task} and a {student_solution}. Your job is to revise the {student_solution} such that it is clearer, more correct, and more engaging. Copy all non-corrected parts of the student's answer. Do not allude to the {corrected_student_solution} being a revision or a correction in your final solution.\"},\n # {'role': 'user',\n # 'content': '{task}: How many gaps are there between the earth and the moon?\\n\\n{student_solution}: There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. 
The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.\\n\\n-----------------\\n\\nLet\\'s first think step by step with a {teacher_reasoning} to decide how to improve the {student_solution}, then give the {corrected_student_solution}. Mention the {teacher_reasoning} and {corrected_student_solution} identifiers to structure your answer.'}]},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n\n Citations:\n\n ```\n @misc{doosterlinck2024anchoredpreferenceoptimizationcontrastive,\n title={Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment},\n author={Karel D'Oosterlinck and Winnie Xu and Chris Develder and Thomas Demeester and Amanpreet Singh and Christopher Potts and Douwe Kiela and Shikib Mehri},\n year={2024},\n eprint={2408.06266},\n archivePrefix={arXiv},\n primaryClass={cs.LG},\n url={https://arxiv.org/abs/2408.06266},\n }\n ```\n \"\"\"\n\n system_prompt: str = SYSTEM_PROMPT\n _template: Union[Template, None] = PrivateAttr(...)\n\n def load(self) -> None:\n super().load()\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"clair.jinja2\"\n )\n with open(_path, \"r\") as f:\n self._template = Template(f.read())\n\n @property\n def inputs(self) -> \"StepColumns\":\n return [\"task\", \"student_solution\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n return [\"revision\", \"rational\", \"model_name\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\"role\": \"system\", \"content\": self.system_prompt},\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n task=input[\"task\"], student_solution=input[\"student_solution\"]\n ),\n },\n ]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction-response pair.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. 
Used for obtaining the number of responses.\n\n Returns:\n A dict with the key `scores` containing the scores for each instruction-response pair.\n \"\"\"\n if output is None:\n return self._default_error()\n\n return self._format_output(output)\n\n def _format_output(self, output: Union[str, None]) -> Dict[str, Any]:\n if \"**Corrected Student Solution:**\" in output:\n splits = output.split(\"**Corrected Student Solution:**\")\n elif \"{corrected_student_solution}:\" in output:\n splits = output.split(\"{corrected_student_solution}:\")\n elif \"{corrected_student_solution}\" in output:\n splits = output.split(\"{corrected_student_solution}\")\n elif \"**Worsened Student Solution:**\" in output:\n splits = output.split(\"**Worsened Student Solution:**\")\n elif \"{worsened_student_solution}:\" in output:\n splits = output.split(\"{worsened_student_solution}:\")\n elif \"{worsened_student_solution}\" in output:\n splits = output.split(\"{worsened_student_solution}\")\n else:\n splits = None\n\n # Safety check when the output doesn't follow the expected format\n if not splits:\n return self._default_error()\n\n if len(splits) >= 2:\n revision = splits[1]\n revision = revision.strip(\"\\n\\n\").strip() # noqa: B005\n\n rational = splits[0]\n if \"{teacher_reasoning}\" in rational:\n rational = rational.split(\"{teacher_reasoning}\")[1].strip(\":\").strip()\n rational = rational.strip(\"\\n\\n\").strip() # noqa: B005\n else:\n return self._default_error()\n return {\"revision\": revision, \"rational\": rational}\n\n def _default_error(self) -> Dict[str, None]:\n return {\"revision\": None, \"rational\": None}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.CLAIR.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/clair.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\"role\": \"system\", \"content\": self.system_prompt},\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n task=input[\"task\"], student_solution=input[\"student_solution\"]\n ),\n },\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.CLAIR.format_output","title":"format_output(output, input)
","text":"The output is formatted as a list with the score of each instruction-response pair.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
the raw output of the LLM.
requiredinput
Dict[str, Any]
the input to the task.
requiredReturns:
Type DescriptionDict[str, Any]
A dict with the keys revision and rational extracted from the raw output of the LLM.
src/distilabel/steps/tasks/clair.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction-response pair.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with the key `scores` containing the scores for each instruction-response pair.\n \"\"\"\n if output is None:\n return self._default_error()\n\n return self._format_output(output)\n
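For clarity, here is a standalone, dependency-free sketch of the same splitting logic: the text before the {corrected_student_solution} marker yields the rational and the text after it the revision. The raw completion below is made up for illustration and is not library output.
raw_output = \"\\n\\n\".join(\n    [\n        \"{teacher_reasoning}: The term 'gaps' is ambiguous, so the revision should clarify it.\",\n        \"{corrected_student_solution}: There are no physical gaps between the Earth and the Moon.\",\n    ]\n)\n\n# Mirror the parsing performed in CLAIR._format_output above\nrational, revision = raw_output.split(\"{corrected_student_solution}:\")\nrational = rational.split(\"{teacher_reasoning}\")[1].strip(\":\").strip()\nrevision = revision.strip()\nprint({\"revision\": revision, \"rational\": rational})\n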
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ComplexityScorer","title":"ComplexityScorer
","text":" Bases: Task
Score instructions based on their complexity using an LLM
.
ComplexityScorer
is a pre-defined task used to rank a list of instructions based on their complexity. It's an implementation of the complexity score task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
Attributes:
Name Type Description_template
Union[Template, None]
a Jinja2 template used to format the input for the LLM.
Input columnsList[str]
): The list of instructions to be scored.List[float]
): The score for each instruction.str
): The model name used to generate the scores.What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
Examples:
Evaluate the complexity of your instructions:
from distilabel.steps.tasks import ComplexityScorer\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nscorer = ComplexityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n)\n\nscorer.load()\n\nresult = next(\n scorer.process(\n [{\"instructions\": [\"plain instruction\", \"highly complex instruction\"]}]\n )\n)\n# result\n# [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 5], 'distilabel_metadata': {'raw_output_complexity_scorer_0': 'output'}}]\n
Generate structured output with default schema:
from distilabel.steps.tasks import ComplexityScorer\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nscorer = ComplexityScorer(\n    llm=InferenceEndpointsLLM(\n        model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n    ),\n    use_default_structured_output=True\n)\n\nscorer.load()\n\nresult = next(\n    scorer.process(\n        [{\"instructions\": [\"plain instruction\", \"highly complex instruction\"]}]\n    )\n)\n# result\n# [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 2], 'distilabel_metadata': {'raw_output_complexity_scorer_0': '{ \\n \"scores\": [\\n 1, \\n 2\\n ]\\n}'}}]\n
Citations @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n}\n
Source code in src/distilabel/steps/tasks/complexity_scorer.py
class ComplexityScorer(Task):\n \"\"\"Score instructions based on their complexity using an `LLM`.\n\n `ComplexityScorer` is a pre-defined task used to rank a list of instructions based in\n their complexity. It's an implementation of the complexity score task from the paper\n 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection\n in Instruction Tuning'.\n\n Attributes:\n _template: a Jinja2 template used to format the input for the LLM.\n\n Input columns:\n - instructions (`List[str]`): The list of instructions to be scored.\n\n Output columns:\n - scores (`List[float]`): The score for each instruction.\n - model_name (`str`): The model name used to generate the scores.\n\n Categories:\n - scorer\n - complexity\n - instruction\n\n References:\n - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)\n\n Examples:\n Evaluate the complexity of your instructions:\n\n ```python\n from distilabel.steps.tasks import ComplexityScorer\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n scorer = ComplexityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n )\n\n scorer.load()\n\n result = next(\n scorer.process(\n [{\"instructions\": [\"plain instruction\", \"highly complex instruction\"]}]\n )\n )\n # result\n # [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 5], 'distilabel_metadata': {'raw_output_complexity_scorer_0': 'output'}}]\n ```\n\n Generate structured output with default schema:\n\n ```python\n from distilabel.steps.tasks import ComplexityScorer\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n scorer = ComplexityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n use_default_structured_output=use_default_structured_output\n )\n\n scorer.load()\n\n result = next(\n scorer.process(\n [{\"instructions\": [\"plain instruction\", \"highly complex instruction\"]}]\n )\n )\n # result\n # [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 2], 'distilabel_metadata': {'raw_output_complexity_scorer_0': '{ \\\\n \"scores\": [\\\\n 1, \\\\n 2\\\\n ]\\\\n}'}}]\n ```\n\n Citations:\n ```\n @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? 
A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n }\n ```\n \"\"\"\n\n _template: Union[Template, None] = PrivateAttr(...)\n _can_be_used_with_offline_batch_generation = True\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"complexity-scorer.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs for the task are the `instructions`.\"\"\"\n return [\"instructions\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render(instructions=input[\"instructions\"]), # type: ignore\n }\n ]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task are: a list of `scores` containing the complexity score for each\n instruction in `instructions`, and the `model_name`.\"\"\"\n return [\"scores\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with the key `scores` containing the scores for each instruction.\n \"\"\"\n if output is None:\n return {\"scores\": [None] * len(input[\"instructions\"])}\n\n if self.use_default_structured_output:\n return self._format_structured_output(output, input)\n\n scores = []\n score_lines = output.split(\"\\n\")\n for i, line in enumerate(score_lines):\n match = _PARSE_SCORE_LINE_REGEX.match(line)\n score = float(match.group(1)) if match else None\n scores.append(score)\n if i == len(input[\"instructions\"]) - 1:\n break\n return {\"scores\": scores}\n\n @override\n def get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel\n from typing import List\n\n class SchemaComplexityScorer(BaseModel):\n scores: List[int]\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"properties\": {\n \"scores\": {\n \"items\": {\"type\": \"integer\"},\n \"title\": \"Scores\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"scores\"],\n \"title\": \"SchemaComplexityScorer\",\n \"type\": \"object\",\n }\n\n def _format_structured_output(\n self, output: str, input: Dict[str, Any]\n ) -> Dict[str, str]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with either `positive`, or `positive` and `negative` keys.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n return {\"scores\": [None] * len(input[\"instructions\"])}\n\n @override\n def _sample_input(self) -> \"ChatType\":\n 
\"\"\"Returns a sample input to be used in the `print` method.\n Tasks that don't adhere to a format input that returns a map of the type\n str -> str should override this method to return a sample input.\n \"\"\"\n return self.format_input(\n {\n \"instructions\": [\n f\"<PLACEHOLDER_{f'GENERATION_{i}'.upper()}>\" for i in range(2)\n ],\n }\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ComplexityScorer.inputs","title":"inputs
property
","text":"The inputs for the task are the instructions
.
outputs
property
","text":"The output for the task are: a list of scores
containing the complexity score for each instruction in instructions
, and the model_name
.
load()
","text":"Loads the Jinja2 template.
Source code insrc/distilabel/steps/tasks/complexity_scorer.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"complexity-scorer.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ComplexityScorer.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/complexity_scorer.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render(instructions=input[\"instructions\"]), # type: ignore\n }\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ComplexityScorer.format_output","title":"format_output(output, input)
","text":"The output is formatted as a list with the score of each instruction.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
the raw output of the LLM.
requiredinput
Dict[str, Any]
the input to the task. Used for obtaining the number of responses.
requiredReturns:
Type DescriptionDict[str, Any]
A dict with the key scores
containing the scores for each instruction.
src/distilabel/steps/tasks/complexity_scorer.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with the key `scores` containing the scores for each instruction.\n \"\"\"\n if output is None:\n return {\"scores\": [None] * len(input[\"instructions\"])}\n\n if self.use_default_structured_output:\n return self._format_structured_output(output, input)\n\n scores = []\n score_lines = output.split(\"\\n\")\n for i, line in enumerate(score_lines):\n match = _PARSE_SCORE_LINE_REGEX.match(line)\n score = float(match.group(1)) if match else None\n scores.append(score)\n if i == len(input[\"instructions\"]) - 1:\n break\n return {\"scores\": scores}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ComplexityScorer.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
The schema corresponds to the following:
from pydantic import BaseModel\nfrom typing import List\n\nclass SchemaComplexityScorer(BaseModel):\n scores: List[int]\n
Returns:
Type DescriptionDict[str, Any]
JSON Schema of the response to enforce.
Source code insrc/distilabel/steps/tasks/complexity_scorer.py
@override\ndef get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel\n from typing import List\n\n class SchemaComplexityScorer(BaseModel):\n scores: List[int]\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"properties\": {\n \"scores\": {\n \"items\": {\"type\": \"integer\"},\n \"title\": \"Scores\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"scores\"],\n \"title\": \"SchemaComplexityScorer\",\n \"type\": \"object\",\n }\n
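As a quick check (assuming pydantic v2 is installed), the dictionary returned here is just the JSON schema of the model shown in the docstring, so it can be regenerated as in the sketch below; model_json_schema() should produce an equivalent schema.
from typing import List\n\nfrom pydantic import BaseModel\n\n\nclass SchemaComplexityScorer(BaseModel):\n    scores: List[int]\n\n\n# Should yield a schema equivalent to the dict returned by get_structured_output()\nprint(SchemaComplexityScorer.model_json_schema())\n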
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ComplexityScorer._format_structured_output","title":"_format_structured_output(output, input)
","text":"Parses the structured response, which should correspond to a dictionary with either positive
, or positive
and negative
keys.
Parameters:
Name Type Description Defaultoutput
str
The output from the LLM
.
Returns:
Type DescriptionDict[str, str]
Formatted output.
Source code insrc/distilabel/steps/tasks/complexity_scorer.py
def _format_structured_output(\n self, output: str, input: Dict[str, Any]\n) -> Dict[str, str]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with either `positive`, or `positive` and `negative` keys.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n return {\"scores\": [None] * len(input[\"instructions\"])}\n
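A minimal sketch of what this parsing amounts to, using a made-up structured completion; the except branch mirrors the fallback of one None per input instruction.
import orjson\n\nraw = '{\"scores\": [1, 2]}'  # made-up structured completion\ntry:\n    parsed = orjson.loads(raw)\nexcept orjson.JSONDecodeError:\n    parsed = {\"scores\": [None, None]}  # one None per input instruction\nprint(parsed[\"scores\"])  # [1, 2]\n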
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ComplexityScorer._sample_input","title":"_sample_input()
","text":"Returns a sample input to be used in the print
method. Tasks that don't adhere to a format input that returns a map of the type str -> str should override this method to return a sample input.
src/distilabel/steps/tasks/complexity_scorer.py
@override\ndef _sample_input(self) -> \"ChatType\":\n \"\"\"Returns a sample input to be used in the `print` method.\n Tasks that don't adhere to a format input that returns a map of the type\n str -> str should override this method to return a sample input.\n \"\"\"\n return self.format_input(\n {\n \"instructions\": [\n f\"<PLACEHOLDER_{f'GENERATION_{i}'.upper()}>\" for i in range(2)\n ],\n }\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstruct","title":"EvolInstruct
","text":" Bases: Task
Evolve instructions using an LLM
.
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Attributes:
Name Type Descriptionnum_evolutions
int
The number of evolutions to be performed.
store_evolutions
bool
Whether to store all the evolutions or just the last one. Defaults to False
.
generate_answers
bool
Whether to generate answers for the evolved instructions. Defaults to False
.
include_original_instruction
bool
Whether to include the original instruction in the evolved_instructions
output column. Defaults to False
.
mutation_templates
Dict[str, str]
The mutation templates to be used for evolving the instructions. Defaults to the ones provided in the utils.py
file.
seed
RuntimeParameter[int]
The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
seed
: The seed to be set for numpy
in order to randomly pick a mutation method.str
): The instruction to evolve.str
): The evolved instruction if store_evolutions=False
.List[str]
): The evolved instructions if store_evolutions=True
.str
): The name of the LLM used to evolve the instructions.str
): The answer to the evolved instruction if generate_answers=True
and store_evolutions=False
.List[str]
): The answers to the evolved instructions if generate_answers=True
and store_evolutions=True
.Examples:
Evolve an instruction using an LLM:
from distilabel.steps.tasks import EvolInstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n)\n\nevol_instruct.load()\n\nresult = next(evol_instruct.process([{\"instruction\": \"common instruction\"}]))\n# result\n# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]\n
Keep the iterations of the evolutions:
from distilabel.steps.tasks import EvolInstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n store_evolutions=True,\n)\n\nevol_instruct.load()\n\nresult = next(evol_instruct.process([{\"instruction\": \"common instruction\"}]))\n# result\n# [\n# {\n# 'instruction': 'common instruction',\n# 'evolved_instructions': ['initial evolution', 'final evolution'],\n# 'model_name': 'model_name'\n# }\n# ]\n
Generate answers for the instructions in a single step:
from distilabel.steps.tasks import EvolInstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n generate_answers=True,\n)\n\nevol_instruct.load()\n\nresult = next(evol_instruct.process([{\"instruction\": \"common instruction\"}]))\n# result\n# [\n# {\n# 'instruction': 'common instruction',\n# 'evolved_instruction': 'evolved instruction',\n# 'answer': 'answer to the instruction',\n# 'model_name': 'model_name'\n# }\n# ]\n
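The snippets above call the task directly; below is a hedged sketch of wiring EvolInstruct into a full Pipeline (the pipeline name, the loader data and the use_cache flag are illustrative, not prescribed by the docs).
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import EvolInstruct\n\nwith Pipeline(name=\"evol-instruct-sketch\") as pipeline:\n    loader = LoadDataFromDicts(data=[{\"instruction\": \"common instruction\"}])\n    evol_instruct = EvolInstruct(\n        llm=InferenceEndpointsLLM(\n            model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n        ),\n        num_evolutions=2,\n        generate_answers=True,\n    )\n    loader >> evol_instruct\n\nif __name__ == \"__main__\":\n    distiset = pipeline.run(use_cache=False)\n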
Citations @misc{xu2023wizardlmempoweringlargelanguage,\n title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},\n author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},\n year={2023},\n eprint={2304.12244},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2304.12244},\n}\n
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
class EvolInstruct(Task):\n \"\"\"Evolve instructions using an `LLM`.\n\n WizardLM: Empowering Large Language Models to Follow Complex Instructions\n\n Attributes:\n num_evolutions: The number of evolutions to be performed.\n store_evolutions: Whether to store all the evolutions or just the last one. Defaults\n to `False`.\n generate_answers: Whether to generate answers for the evolved instructions. Defaults\n to `False`.\n include_original_instruction: Whether to include the original instruction in the\n `evolved_instructions` output column. Defaults to `False`.\n mutation_templates: The mutation templates to be used for evolving the instructions.\n Defaults to the ones provided in the `utils.py` file.\n seed: The seed to be set for `numpy` in order to randomly pick a mutation method.\n Defaults to `42`.\n\n Runtime parameters:\n - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.\n\n Input columns:\n - instruction (`str`): The instruction to evolve.\n\n Output columns:\n - evolved_instruction (`str`): The evolved instruction if `store_evolutions=False`.\n - evolved_instructions (`List[str]`): The evolved instructions if `store_evolutions=True`.\n - model_name (`str`): The name of the LLM used to evolve the instructions.\n - answer (`str`): The answer to the evolved instruction if `generate_answers=True`\n and `store_evolutions=False`.\n - answers (`List[str]`): The answers to the evolved instructions if `generate_answers=True`\n and `store_evolutions=True`.\n\n Categories:\n - evol\n - instruction\n\n References:\n - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)\n - [GitHub: h2oai/h2o-wizardlm](https://github.com/h2oai/h2o-wizardlm)\n\n Examples:\n Evolve an instruction using an LLM:\n\n ```python\n from distilabel.steps.tasks import EvolInstruct\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n evol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n )\n\n evol_instruct.load()\n\n result = next(evol_instruct.process([{\"instruction\": \"common instruction\"}]))\n # result\n # [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]\n ```\n\n Keep the iterations of the evolutions:\n\n ```python\n from distilabel.steps.tasks import EvolInstruct\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n evol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n store_evolutions=True,\n )\n\n evol_instruct.load()\n\n result = next(evol_instruct.process([{\"instruction\": \"common instruction\"}]))\n # result\n # [\n # {\n # 'instruction': 'common instruction',\n # 'evolved_instructions': ['initial evolution', 'final evolution'],\n # 'model_name': 'model_name'\n # }\n # ]\n ```\n\n Generate answers for the instructions in a single step:\n\n ```python\n from distilabel.steps.tasks import EvolInstruct\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n evol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n generate_answers=True,\n )\n\n evol_instruct.load()\n\n result = next(evol_instruct.process([{\"instruction\": \"common 
instruction\"}]))\n # result\n # [\n # {\n # 'instruction': 'common instruction',\n # 'evolved_instruction': 'evolved instruction',\n # 'answer': 'answer to the instruction',\n # 'model_name': 'model_name'\n # }\n # ]\n ```\n\n Citations:\n ```\n @misc{xu2023wizardlmempoweringlargelanguage,\n title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},\n author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},\n year={2023},\n eprint={2304.12244},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2304.12244},\n }\n ```\n \"\"\"\n\n num_evolutions: int\n store_evolutions: bool = False\n generate_answers: bool = False\n include_original_instruction: bool = False\n mutation_templates: Dict[str, str] = MUTATION_TEMPLATES\n\n seed: RuntimeParameter[int] = Field(\n default=42,\n description=\"As `numpy` is being used in order to randomly pick a mutation method, then is nice to seed a random seed.\",\n )\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task is the `instruction`.\"\"\"\n return [\"instruction\"]\n\n def format_input(self, input: str) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation. And the\n `system_prompt` is added as the first message if it exists.\"\"\"\n return [{\"role\": \"user\", \"content\": input}]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task are the `evolved_instruction/s`, the `answer` if `generate_answers=True`\n and the `model_name`.\"\"\"\n # TODO: having to define a `model_name` column every time as the `Task.outputs` is not ideal,\n # this could be handled always and the value could be included within the DAG validation when\n # a `Task` is used, since all the `Task` subclasses will have an `llm` with a `model_name` attr.\n _outputs = [\n (\n \"evolved_instruction\"\n if not self.store_evolutions\n else \"evolved_instructions\"\n ),\n \"model_name\",\n ]\n if self.generate_answers:\n _outputs.append(\"answer\" if not self.store_evolutions else \"answers\")\n return _outputs\n\n @override\n def format_output( # type: ignore\n self, instructions: Union[str, List[str]], answers: Optional[List[str]] = None\n ) -> Dict[str, Any]: # type: ignore\n \"\"\"The output for the task is a dict with: `evolved_instruction` or `evolved_instructions`,\n depending whether the value is either `False` or `True` for `store_evolutions`, respectively;\n `answer` if `generate_answers=True`; and, finally, the `model_name`.\n\n Args:\n instructions: The instructions to be included within the output.\n answers: The answers to be included within the output if `generate_answers=True`.\n\n Returns:\n If `store_evolutions=False` and `generate_answers=True` return {\"evolved_instruction\": ..., \"model_name\": ..., \"answer\": ...};\n if `store_evolutions=True` and `generate_answers=True` return {\"evolved_instructions\": ..., \"model_name\": ..., \"answer\": ...};\n if `store_evolutions=False` and `generate_answers=False` return {\"evolved_instruction\": ..., \"model_name\": ...};\n if `store_evolutions=True` and `generate_answers=False` return {\"evolved_instructions\": ..., \"model_name\": ...}.\n \"\"\"\n _output = {}\n if not self.store_evolutions:\n _output[\"evolved_instruction\"] = instructions[-1]\n else:\n _output[\"evolved_instructions\"] = instructions\n\n if self.generate_answers and 
answers:\n if not self.store_evolutions:\n _output[\"answer\"] = answers[-1]\n else:\n _output[\"answers\"] = answers\n\n _output[\"model_name\"] = self.llm.model_name\n return _output\n\n @property\n def mutation_templates_names(self) -> List[str]:\n \"\"\"Returns the names i.e. keys of the provided `mutation_templates`.\"\"\"\n return list(self.mutation_templates.keys())\n\n def _apply_random_mutation(self, instruction: str) -> str:\n \"\"\"Applies a random mutation from the ones provided as part of the `mutation_templates`\n enum, and returns the provided instruction within the mutation prompt.\n\n Args:\n instruction: The instruction to be included within the mutation prompt.\n\n Returns:\n A random mutation prompt with the provided instruction.\n \"\"\"\n mutation = np.random.choice(self.mutation_templates_names)\n return self.mutation_templates[mutation].replace(\"<PROMPT>\", instruction) # type: ignore\n\n def _evolve_instructions(self, inputs: \"StepInput\") -> List[List[str]]:\n \"\"\"Evolves the instructions provided as part of the inputs of the task.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list where each item is a list with either the last evolved instruction if\n `store_evolutions=False` or all the evolved instructions if `store_evolutions=True`.\n \"\"\"\n\n instructions: List[List[str]] = [[input[\"instruction\"]] for input in inputs]\n statistics: \"LLMStatistics\" = defaultdict(list)\n\n for iter_no in range(self.num_evolutions):\n formatted_prompts = []\n for instruction in instructions:\n formatted_prompts.append(self._apply_random_mutation(instruction[-1]))\n\n formatted_prompts = [\n self.format_input(prompt) for prompt in formatted_prompts\n ]\n responses = self.llm.generate(\n formatted_prompts,\n **self.llm.generation_kwargs, # type: ignore\n )\n generated_prompts = flatten_responses(\n [response[\"generations\"] for response in responses]\n )\n for response in responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n evolved_instructions = []\n for generated_prompt in generated_prompts:\n generated_prompt = generated_prompt.split(\"Prompt#:\")[-1].strip()\n evolved_instructions.append(generated_prompt)\n\n if self.store_evolutions:\n instructions = [\n instruction + [evolved_instruction]\n for instruction, evolved_instruction in zip(\n instructions, evolved_instructions\n )\n ]\n else:\n instructions = [\n [evolved_instruction]\n for evolved_instruction in evolved_instructions\n ]\n\n self._logger.info(\n f\"\ud83d\udd04 Ran iteration {iter_no} evolving {len(instructions)} instructions!\"\n )\n return instructions, dict(statistics)\n\n def _generate_answers(\n self, evolved_instructions: List[List[str]]\n ) -> Tuple[List[List[str]], \"LLMStatistics\"]:\n \"\"\"Generates the answer for the instructions in `instructions`.\n\n Args:\n evolved_instructions: A list of lists where each item is a list with either the last\n evolved instruction if `store_evolutions=False` or all the evolved instructions\n if `store_evolutions=True`.\n\n Returns:\n A list of answers for each instruction.\n \"\"\"\n formatted_instructions = [\n self.format_input(instruction)\n for instructions in evolved_instructions\n for instruction in instructions\n ]\n\n responses = self.llm.generate(\n formatted_instructions,\n num_generations=1,\n **self.llm.generation_kwargs, # type: ignore\n )\n generations = [response[\"generations\"] for response in responses]\n\n statistics: Dict[str, Any] = 
defaultdict(list)\n for response in responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n step = (\n self.num_evolutions\n if not self.include_original_instruction\n else self.num_evolutions + 1\n )\n\n return [\n flatten_responses(generations[i : i + step])\n for i in range(0, len(responses), step)\n ], dict(statistics)\n\n @override\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Processes the inputs of the task and generates the outputs using the LLM.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n\n evolved_instructions, statistics = self._evolve_instructions(inputs)\n\n if self.store_evolutions:\n # Remove the input instruction from the `evolved_instructions` list\n from_ = 1 if not self.include_original_instruction else 0\n evolved_instructions = [\n instruction[from_:] for instruction in evolved_instructions\n ]\n\n if not self.generate_answers:\n for input, instruction in zip(inputs, evolved_instructions):\n input.update(self.format_output(instruction))\n input.update(\n {\n \"distilabel_metadata\": {\n f\"statistics_instruction_{self.name}\": statistics\n }\n }\n )\n yield inputs\n\n self._logger.info(\n f\"\ud83c\udf89 Finished evolving {len(evolved_instructions)} instructions!\"\n )\n\n if self.generate_answers:\n self._logger.info(\n f\"\ud83e\udde0 Generating answers for the {len(evolved_instructions)} evolved instructions!\"\n )\n\n answers, statistics = self._generate_answers(evolved_instructions)\n\n self._logger.info(\n f\"\ud83c\udf89 Finished generating answers for the {len(evolved_instructions)} evolved\"\n \" instructions!\"\n )\n\n for idx, (input, instruction) in enumerate(\n zip(inputs, evolved_instructions)\n ):\n input.update(self.format_output(instruction, answers[idx]))\n input.update(\n {\n \"distilabel_metadata\": {\n f\"statistics_answer_{self.name}\": statistics\n }\n }\n )\n yield inputs\n\n @override\n def _sample_input(self) -> ChatType:\n return self.format_input(\n self._apply_random_mutation(\"<PLACEHOLDER_INSTRUCTION>\")\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstruct.inputs","title":"inputs
property
","text":"The input for the task is the instruction
.
outputs
property
","text":"The output for the task are the evolved_instruction/s
, the answer
if generate_answers=True
and the model_name
.
mutation_templates_names
property
","text":"Returns the names i.e. keys of the provided mutation_templates
.
format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation. And the system_prompt
is added as the first message if it exists.
src/distilabel/steps/tasks/evol_instruct/base.py
def format_input(self, input: str) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation. And the\n `system_prompt` is added as the first message if it exists.\"\"\"\n return [{\"role\": \"user\", \"content\": input}]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstruct.format_output","title":"format_output(instructions, answers=None)
","text":"The output for the task is a dict with: evolved_instruction
or evolved_instructions
, depending on whether the value is either False
or True
for store_evolutions
, respectively; answer
if generate_answers=True
; and, finally, the model_name
.
Parameters:
Name Type Description Defaultinstructions
Union[str, List[str]]
The instructions to be included within the output.
requiredanswers
Optional[List[str]]
The answers to be included within the output if generate_answers=True
.
None
Returns:
Type DescriptionDict[str, Any]
If store_evolutions=False
and generate_answers=True
return {\"evolved_instruction\": ..., \"model_name\": ..., \"answer\": ...};
Dict[str, Any]
if store_evolutions=True
and generate_answers=True
return {\"evolved_instructions\": ..., \"model_name\": ..., \"answer\": ...};
Dict[str, Any]
if store_evolutions=False
and generate_answers=False
return {\"evolved_instruction\": ..., \"model_name\": ...};
Dict[str, Any]
if store_evolutions=True
and generate_answers=False
return {\"evolved_instructions\": ..., \"model_name\": ...}.
src/distilabel/steps/tasks/evol_instruct/base.py
@override\ndef format_output( # type: ignore\n self, instructions: Union[str, List[str]], answers: Optional[List[str]] = None\n) -> Dict[str, Any]: # type: ignore\n \"\"\"The output for the task is a dict with: `evolved_instruction` or `evolved_instructions`,\n depending whether the value is either `False` or `True` for `store_evolutions`, respectively;\n `answer` if `generate_answers=True`; and, finally, the `model_name`.\n\n Args:\n instructions: The instructions to be included within the output.\n answers: The answers to be included within the output if `generate_answers=True`.\n\n Returns:\n If `store_evolutions=False` and `generate_answers=True` return {\"evolved_instruction\": ..., \"model_name\": ..., \"answer\": ...};\n if `store_evolutions=True` and `generate_answers=True` return {\"evolved_instructions\": ..., \"model_name\": ..., \"answer\": ...};\n if `store_evolutions=False` and `generate_answers=False` return {\"evolved_instruction\": ..., \"model_name\": ...};\n if `store_evolutions=True` and `generate_answers=False` return {\"evolved_instructions\": ..., \"model_name\": ...}.\n \"\"\"\n _output = {}\n if not self.store_evolutions:\n _output[\"evolved_instruction\"] = instructions[-1]\n else:\n _output[\"evolved_instructions\"] = instructions\n\n if self.generate_answers and answers:\n if not self.store_evolutions:\n _output[\"answer\"] = answers[-1]\n else:\n _output[\"answers\"] = answers\n\n _output[\"model_name\"] = self.llm.model_name\n return _output\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstruct._apply_random_mutation","title":"_apply_random_mutation(instruction)
","text":"Applies a random mutation from the ones provided as part of the mutation_templates
enum, and returns the provided instruction within the mutation prompt.
Parameters:
Name Type Description Defaultinstruction
str
The instruction to be included within the mutation prompt.
requiredReturns:
Type Descriptionstr
A random mutation prompt with the provided instruction.
Source code insrc/distilabel/steps/tasks/evol_instruct/base.py
def _apply_random_mutation(self, instruction: str) -> str:\n \"\"\"Applies a random mutation from the ones provided as part of the `mutation_templates`\n enum, and returns the provided instruction within the mutation prompt.\n\n Args:\n instruction: The instruction to be included within the mutation prompt.\n\n Returns:\n A random mutation prompt with the provided instruction.\n \"\"\"\n mutation = np.random.choice(self.mutation_templates_names)\n return self.mutation_templates[mutation].replace(\"<PROMPT>\", instruction) # type: ignore\n
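An illustrative, self-contained sketch of the placeholder replacement; the two templates below are hypothetical stand-ins for the real MUTATION_TEMPLATES shipped with distilabel.
import numpy as np\n\n# Hypothetical templates for illustration; the real ones live in distilabel.steps.tasks.evol_instruct.utils\nmutation_templates = {\n    \"CONSTRAINTS\": \"Please add one more constraint into #The Given Prompt#:\\n<PROMPT>\",\n    \"DEEPENING\": \"Increase the depth and breadth of #The Given Prompt#:\\n<PROMPT>\",\n}\n\nnp.random.seed(42)\nmutation = np.random.choice(list(mutation_templates.keys()))\nprint(mutation_templates[mutation].replace(\"<PROMPT>\", \"common instruction\"))\n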
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstruct._evolve_instructions","title":"_evolve_instructions(inputs)
","text":"Evolves the instructions provided as part of the inputs of the task.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of Python dictionaries with the inputs of the task.
requiredReturns:
Type DescriptionList[List[str]]
A list where each item is a list with either the last evolved instruction if
List[List[str]]
store_evolutions=False
or all the evolved instructions if store_evolutions=True
.
src/distilabel/steps/tasks/evol_instruct/base.py
def _evolve_instructions(self, inputs: \"StepInput\") -> List[List[str]]:\n \"\"\"Evolves the instructions provided as part of the inputs of the task.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list where each item is a list with either the last evolved instruction if\n `store_evolutions=False` or all the evolved instructions if `store_evolutions=True`.\n \"\"\"\n\n instructions: List[List[str]] = [[input[\"instruction\"]] for input in inputs]\n statistics: \"LLMStatistics\" = defaultdict(list)\n\n for iter_no in range(self.num_evolutions):\n formatted_prompts = []\n for instruction in instructions:\n formatted_prompts.append(self._apply_random_mutation(instruction[-1]))\n\n formatted_prompts = [\n self.format_input(prompt) for prompt in formatted_prompts\n ]\n responses = self.llm.generate(\n formatted_prompts,\n **self.llm.generation_kwargs, # type: ignore\n )\n generated_prompts = flatten_responses(\n [response[\"generations\"] for response in responses]\n )\n for response in responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n evolved_instructions = []\n for generated_prompt in generated_prompts:\n generated_prompt = generated_prompt.split(\"Prompt#:\")[-1].strip()\n evolved_instructions.append(generated_prompt)\n\n if self.store_evolutions:\n instructions = [\n instruction + [evolved_instruction]\n for instruction, evolved_instruction in zip(\n instructions, evolved_instructions\n )\n ]\n else:\n instructions = [\n [evolved_instruction]\n for evolved_instruction in evolved_instructions\n ]\n\n self._logger.info(\n f\"\ud83d\udd04 Ran iteration {iter_no} evolving {len(instructions)} instructions!\"\n )\n return instructions, dict(statistics)\n
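A standalone illustration of how each generation is reduced to the evolved instruction, mirroring the split on the Prompt#: marker above; the generated text is made up.
generated_prompt = \"Sure, here is a rewritten version.\\n#Rewritten Prompt#:\\nExplain how the Moon's orbit drives the tides.\"\nevolved_instruction = generated_prompt.split(\"Prompt#:\")[-1].strip()\nprint(evolved_instruction)  # Explain how the Moon's orbit drives the tides.\n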
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstruct._generate_answers","title":"_generate_answers(evolved_instructions)
","text":"Generates the answer for the instructions in instructions
.
Parameters:
Name Type Description Defaultevolved_instructions
List[List[str]]
A list of lists where each item is a list with either the last evolved instruction if store_evolutions=False
or all the evolved instructions if store_evolutions=True
.
Returns:
Type DescriptionTuple[List[List[str]], LLMStatistics]
A list of answers for each instruction.
Source code insrc/distilabel/steps/tasks/evol_instruct/base.py
def _generate_answers(\n self, evolved_instructions: List[List[str]]\n) -> Tuple[List[List[str]], \"LLMStatistics\"]:\n \"\"\"Generates the answer for the instructions in `instructions`.\n\n Args:\n evolved_instructions: A list of lists where each item is a list with either the last\n evolved instruction if `store_evolutions=False` or all the evolved instructions\n if `store_evolutions=True`.\n\n Returns:\n A list of answers for each instruction.\n \"\"\"\n formatted_instructions = [\n self.format_input(instruction)\n for instructions in evolved_instructions\n for instruction in instructions\n ]\n\n responses = self.llm.generate(\n formatted_instructions,\n num_generations=1,\n **self.llm.generation_kwargs, # type: ignore\n )\n generations = [response[\"generations\"] for response in responses]\n\n statistics: Dict[str, Any] = defaultdict(list)\n for response in responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n step = (\n self.num_evolutions\n if not self.include_original_instruction\n else self.num_evolutions + 1\n )\n\n return [\n flatten_responses(generations[i : i + step])\n for i in range(0, len(responses), step)\n ], dict(statistics)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstruct.process","title":"process(inputs)
","text":"Processes the inputs of the task and generates the outputs using the LLM.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of Python dictionaries with the inputs of the task.
requiredYields:
Type DescriptionStepOutput
A list of Python dictionaries with the outputs of the task.
Source code insrc/distilabel/steps/tasks/evol_instruct/base.py
@override\ndef process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Processes the inputs of the task and generates the outputs using the LLM.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n\n evolved_instructions, statistics = self._evolve_instructions(inputs)\n\n if self.store_evolutions:\n # Remove the input instruction from the `evolved_instructions` list\n from_ = 1 if not self.include_original_instruction else 0\n evolved_instructions = [\n instruction[from_:] for instruction in evolved_instructions\n ]\n\n if not self.generate_answers:\n for input, instruction in zip(inputs, evolved_instructions):\n input.update(self.format_output(instruction))\n input.update(\n {\n \"distilabel_metadata\": {\n f\"statistics_instruction_{self.name}\": statistics\n }\n }\n )\n yield inputs\n\n self._logger.info(\n f\"\ud83c\udf89 Finished evolving {len(evolved_instructions)} instructions!\"\n )\n\n if self.generate_answers:\n self._logger.info(\n f\"\ud83e\udde0 Generating answers for the {len(evolved_instructions)} evolved instructions!\"\n )\n\n answers, statistics = self._generate_answers(evolved_instructions)\n\n self._logger.info(\n f\"\ud83c\udf89 Finished generating answers for the {len(evolved_instructions)} evolved\"\n \" instructions!\"\n )\n\n for idx, (input, instruction) in enumerate(\n zip(inputs, evolved_instructions)\n ):\n input.update(self.format_output(instruction, answers[idx]))\n input.update(\n {\n \"distilabel_metadata\": {\n f\"statistics_answer_{self.name}\": statistics\n }\n }\n )\n yield inputs\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolComplexity","title":"EvolComplexity
","text":" Bases: EvolInstruct
Evolve instructions to make them more complex using an LLM
.
EvolComplexity
is a task that evolves instructions to make them more complex, and it is based on the EvolInstruct task, using slightly different prompts but the exact same evolutionary approach.
Attributes:
Name Type Descriptionnum_instructions
The number of instructions to be generated.
generate_answers
bool
Whether to generate answers for the instructions or not. Defaults to False
.
mutation_templates
Dict[str, str]
The mutation templates to be used for the generation of the instructions.
min_length
Dict[str, str]
Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512
.
max_length
Dict[str, str]
Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024
.
seed
RuntimeParameter[int]
The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
min_length
: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.max_length
: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.seed
: The seed to be set for numpy in order to randomly pick a mutation method.str
): The instruction to evolve.str
): The evolved instruction.str
, optional): The answer to the instruction if generate_answers=True
.str
): The name of the LLM used to evolve the instructions.Examples:
Evolve an instruction using an LLM:
from distilabel.steps.tasks import EvolComplexity\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_complexity = EvolComplexity(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n)\n\nevol_complexity.load()\n\nresult = next(evol_complexity.process([{\"instruction\": \"common instruction\"}]))\n# result\n# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]\n
Citations @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n}\n
@misc{xu2023wizardlmempoweringlargelanguage,\n title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},\n author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},\n year={2023},\n eprint={2304.12244},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2304.12244},\n}\n
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/base.py
class EvolComplexity(EvolInstruct):\n \"\"\"Evolve instructions to make them more complex using an `LLM`.\n\n `EvolComplexity` is a task that evolves instructions to make them more complex,\n and it is based in the EvolInstruct task, using slight different prompts, but the\n exact same evolutionary approach.\n\n Attributes:\n num_instructions: The number of instructions to be generated.\n generate_answers: Whether to generate answers for the instructions or not. Defaults\n to `False`.\n mutation_templates: The mutation templates to be used for the generation of the\n instructions.\n min_length: Defines the length (in bytes) that the generated instruction needs to\n be higher than, to be considered valid. Defaults to `512`.\n max_length: Defines the length (in bytes) that the generated instruction needs to\n be lower than, to be considered valid. Defaults to `1024`.\n seed: The seed to be set for `numpy` in order to randomly pick a mutation method.\n Defaults to `42`.\n\n Runtime parameters:\n - `min_length`: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.\n - `max_length`: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.\n - `seed`: The number of evolutions to be run.\n\n Input columns:\n - instruction (`str`): The instruction to evolve.\n\n Output columns:\n - evolved_instruction (`str`): The evolved instruction.\n - answer (`str`, optional): The answer to the instruction if `generate_answers=True`.\n - model_name (`str`): The name of the LLM used to evolve the instructions.\n\n Categories:\n - evol\n - instruction\n - deita\n\n References:\n - [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)\n - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)\n\n Examples:\n Evolve an instruction using an LLM:\n\n ```python\n from distilabel.steps.tasks import EvolComplexity\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n evol_complexity = EvolComplexity(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n )\n\n evol_complexity.load()\n\n result = next(evol_complexity.process([{\"instruction\": \"common instruction\"}]))\n # result\n # [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]\n ```\n\n Citations:\n ```\n @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n }\n ```\n\n ```\n @misc{xu2023wizardlmempoweringlargelanguage,\n title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},\n author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},\n year={2023},\n eprint={2304.12244},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2304.12244},\n }\n ```\n \"\"\"\n\n mutation_templates: Dict[str, str] = MUTATION_TEMPLATES\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolComplexityGenerator","title":"EvolComplexityGenerator
","text":" Bases: EvolInstructGenerator
Generate evolved instructions with increased complexity using an LLM
.
EvolComplexityGenerator
is a generation task that evolves instructions to make them more complex, and it is based on the EvolInstruct task, using slightly different prompts but the exact same evolutionary approach.
Attributes:
Name Type Descriptionnum_instructions
int
The number of instructions to be generated.
generate_answers
bool
Whether to generate answers for the instructions or not. Defaults to False
.
mutation_templates
Dict[str, str]
The mutation templates to be used for the generation of the instructions.
min_length
RuntimeParameter[int]
Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512
.
max_length
RuntimeParameter[int]
Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024
.
seed
RuntimeParameter[int]
The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
min_length
: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.max_length
: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.seed
: The seed to be set for numpy in order to randomly pick a mutation method.str
): The evolved instruction.str
, optional): The answer to the instruction if generate_answers=True
.str
): The name of the LLM used to evolve the instructions.Examples:
Generate evolved instructions without initial instructions:
from distilabel.steps.tasks import EvolComplexityGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_complexity_generator = EvolComplexityGenerator(\n    llm=InferenceEndpointsLLM(\n        model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n    ),\n    num_instructions=2,\n)\n\nevol_complexity_generator.load()\n\nresult = next(evol_complexity_generator.process())\n# result\n# [{'instruction': 'generated instruction', 'model_name': 'test'}]\n
Citations @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n}\n
@misc{xu2023wizardlmempoweringlargelanguage,\n title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},\n author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},\n year={2023},\n eprint={2304.12244},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2304.12244},\n}\n
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/generator.py
class EvolComplexityGenerator(EvolInstructGenerator):\n \"\"\"Generate evolved instructions with increased complexity using an `LLM`.\n\n `EvolComplexityGenerator` is a generation task that evolves instructions to make\n them more complex, and it is based in the EvolInstruct task, but using slight different\n prompts, but the exact same evolutionary approach.\n\n Attributes:\n num_instructions: The number of instructions to be generated.\n generate_answers: Whether to generate answers for the instructions or not. Defaults\n to `False`.\n mutation_templates: The mutation templates to be used for the generation of the\n instructions.\n min_length: Defines the length (in bytes) that the generated instruction needs to\n be higher than, to be considered valid. Defaults to `512`.\n max_length: Defines the length (in bytes) that the generated instruction needs to\n be lower than, to be considered valid. Defaults to `1024`.\n seed: The seed to be set for `numpy` in order to randomly pick a mutation method.\n Defaults to `42`.\n\n Runtime parameters:\n - `min_length`: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.\n - `max_length`: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.\n - `seed`: The number of evolutions to be run.\n\n Output columns:\n - instruction (`str`): The evolved instruction.\n - answer (`str`, optional): The answer to the instruction if `generate_answers=True`.\n - model_name (`str`): The name of the LLM used to evolve the instructions.\n\n Categories:\n - evol\n - instruction\n - generation\n - deita\n\n References:\n - [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)\n - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)\n\n Examples:\n Generate evolved instructions without initial instructions:\n\n ```python\n from distilabel.steps.tasks import EvolComplexityGenerator\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n evol_complexity_generator = EvolComplexityGenerator(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_instructions=2,\n )\n\n evol_complexity_generator.load()\n\n result = next(scorer.process())\n # result\n # [{'instruction': 'generated instruction', 'model_name': 'test'}]\n ```\n\n Citations:\n ```\n @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n }\n ```\n\n ```\n @misc{xu2023wizardlmempoweringlargelanguage,\n title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},\n author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},\n year={2023},\n eprint={2304.12244},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2304.12244},\n }\n ```\n \"\"\"\n\n mutation_templates: Dict[str, str] = GENERATION_MUTATION_TEMPLATES\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstructGenerator","title":"EvolInstructGenerator
","text":" Bases: GeneratorTask
Generate evolved instructions using an LLM
.
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Attributes:
Name Type Descriptionnum_instructions
int
The number of instructions to be generated.
generate_answers
bool
Whether to generate answers for the instructions or not. Defaults to False
.
mutation_templates
Dict[str, str]
The mutation templates to be used for the generation of the instructions.
min_length
RuntimeParameter[int]
Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512
.
max_length
RuntimeParameter[int]
Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024
.
seed
RuntimeParameter[int]
The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
min_length
: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.max_length
: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.seed
: The seed to be set for numpy
in order to randomly pick a mutation method.str
): The generated instruction if generate_answers=False
.str
): The generated answer if generate_answers=True
.List[str]
): The generated instructions if generate_answers=True
.str
): The name of the LLM used to generate and evolve the instructions.Examples:
Generate evolved instructions without initial instructions:
from distilabel.steps.tasks import EvolInstructGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_instruct_generator = EvolInstructGenerator(\n    llm=InferenceEndpointsLLM(\n        model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n    ),\n    num_instructions=2,\n)\n\nevol_instruct_generator.load()\n\nresult = next(evol_instruct_generator.process())\n# result\n# [{'instruction': 'generated instruction', 'model_name': 'test'}]\n
Citations @misc{xu2023wizardlmempoweringlargelanguage,\n title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},\n author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},\n year={2023},\n eprint={2304.12244},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2304.12244},\n}\n
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
class EvolInstructGenerator(GeneratorTask):\n \"\"\"Generate evolved instructions using an `LLM`.\n\n WizardLM: Empowering Large Language Models to Follow Complex Instructions\n\n Attributes:\n num_instructions: The number of instructions to be generated.\n generate_answers: Whether to generate answers for the instructions or not. Defaults\n to `False`.\n mutation_templates: The mutation templates to be used for the generation of the\n instructions.\n min_length: Defines the length (in bytes) that the generated instruction needs to\n be higher than, to be considered valid. Defaults to `512`.\n max_length: Defines the length (in bytes) that the generated instruction needs to\n be lower than, to be considered valid. Defaults to `1024`.\n seed: The seed to be set for `numpy` in order to randomly pick a mutation method.\n Defaults to `42`.\n\n Runtime parameters:\n - `min_length`: Defines the length (in bytes) that the generated instruction needs\n to be higher than, to be considered valid.\n - `max_length`: Defines the length (in bytes) that the generated instruction needs\n to be lower than, to be considered valid.\n - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.\n\n Output columns:\n - instruction (`str`): The generated instruction if `generate_answers=False`.\n - answer (`str`): The generated answer if `generate_answers=True`.\n - instructions (`List[str]`): The generated instructions if `generate_answers=True`.\n - model_name (`str`): The name of the LLM used to generate and evolve the instructions.\n\n Categories:\n - evol\n - instruction\n - generation\n\n References:\n - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)\n - [GitHub: h2oai/h2o-wizardlm](https://github.com/h2oai/h2o-wizardlm)\n\n Examples:\n Generate evolved instructions without initial instructions:\n\n ```python\n from distilabel.steps.tasks import EvolInstructGenerator\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n evol_instruct_generator = EvolInstructGenerator(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_instructions=2,\n )\n\n evol_instruct_generator.load()\n\n result = next(scorer.process())\n # result\n # [{'instruction': 'generated instruction', 'model_name': 'test'}]\n ```\n\n Citations:\n ```\n @misc{xu2023wizardlmempoweringlargelanguage,\n title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},\n author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},\n year={2023},\n eprint={2304.12244},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2304.12244},\n }\n ```\n \"\"\"\n\n num_instructions: int\n generate_answers: bool = False\n mutation_templates: Dict[str, str] = GENERATION_MUTATION_TEMPLATES\n\n min_length: RuntimeParameter[int] = Field(\n default=512,\n description=\"Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.\",\n )\n max_length: RuntimeParameter[int] = Field(\n default=1024,\n description=\"Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.\",\n )\n\n seed: RuntimeParameter[int] = Field(\n default=42,\n description=\"As `numpy` is being used in order to randomly pick a mutation method, then is nice to seed a random seed.\",\n )\n _seed_texts: 
Optional[List[str]] = PrivateAttr(default_factory=list)\n _prompts: Optional[List[str]] = PrivateAttr(default_factory=list)\n\n def _generate_seed_texts(self) -> List[str]:\n \"\"\"Generates a list of seed texts to be used as part of the starting prompts for the task.\n\n It will use the `FRESH_START` mutation template, as it needs to generate text from scratch; and\n a list of English words will be used to generate the seed texts that will be provided to the\n mutation method and included within the prompt.\n\n Returns:\n A list of seed texts to be used as part of the starting prompts for the task.\n \"\"\"\n seed_texts = []\n for _ in range(self.num_instructions * 10):\n num_words = np.random.choice([1, 2, 3, 4])\n seed_texts.append(\n self.mutation_templates[\"FRESH_START\"].replace( # type: ignore\n \"<PROMPT>\",\n \", \".join(\n [\n np.random.choice(self._english_nouns).strip()\n for _ in range(num_words)\n ]\n ),\n )\n )\n return seed_texts\n\n @override\n def model_post_init(self, __context: Any) -> None:\n \"\"\"Override this method to perform additional initialization after `__init__` and `model_construct`.\n This is useful if you want to do some validation that requires the entire model to be initialized.\n \"\"\"\n super().model_post_init(__context)\n\n np.random.seed(self.seed)\n\n self._seed_texts = self._generate_seed_texts()\n self._prompts = [\n np.random.choice(self._seed_texts) for _ in range(self.num_instructions)\n ]\n\n @cached_property\n def _english_nouns(self) -> List[str]:\n \"\"\"A list of English nouns to be used as part of the starting prompts for the task.\n\n References:\n - https://github.com/h2oai/h2o-wizardlm\n \"\"\"\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps/tasks/evol_instruct/english_nouns.txt\"\n )\n with open(_path, mode=\"r\") as f:\n return [line.strip() for line in f.readlines()]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task are the `instruction`, the `answer` if `generate_answers=True`\n and the `model_name`.\"\"\"\n _outputs = [\"instruction\", \"model_name\"]\n if self.generate_answers:\n _outputs.append(\"answer\")\n return _outputs\n\n def format_output( # type: ignore\n self, instruction: str, answer: Optional[str] = None\n ) -> Dict[str, Any]:\n \"\"\"The output for the task is a dict with: `instruction`; `answer` if `generate_answers=True`;\n and, finally, the `model_name`.\n\n Args:\n instruction: The instruction to be included within the output.\n answer: The answer to be included within the output if `generate_answers=True`.\n\n Returns:\n If `generate_answers=True` return {\"instruction\": ..., \"answer\": ..., \"model_name\": ...};\n if `generate_answers=False` return {\"instruction\": ..., \"model_name\": ...};\n \"\"\"\n _output = {\n \"instruction\": instruction,\n \"model_name\": self.llm.model_name,\n }\n if self.generate_answers and answer is not None:\n _output[\"answer\"] = answer\n return _output\n\n @property\n def mutation_templates_names(self) -> List[str]:\n \"\"\"Returns the names i.e. keys of the provided `mutation_templates`.\"\"\"\n return list(self.mutation_templates.keys())\n\n def _apply_random_mutation(self, iter_no: int) -> List[\"ChatType\"]:\n \"\"\"Applies a random mutation from the ones provided as part of the `mutation_templates`\n enum, and returns the provided instruction within the mutation prompt.\n\n Args:\n iter_no: The iteration number to be used to check whether the iteration is the\n first one i.e. 
FRESH_START, or not.\n\n Returns:\n A random mutation prompt with the provided instruction formatted as an OpenAI conversation.\n \"\"\"\n prompts = []\n for idx in range(self.num_instructions):\n if (\n iter_no == 0\n or \"Write one question or request containing\" in self._prompts[idx] # type: ignore\n ):\n mutation = \"FRESH_START\"\n else:\n mutation = np.random.choice(self.mutation_templates_names)\n if mutation == \"FRESH_START\":\n self._prompts[idx] = np.random.choice(self._seed_texts) # type: ignore\n\n prompt_with_template = (\n self.mutation_templates[mutation].replace( # type: ignore\n \"<PROMPT>\",\n self._prompts[idx], # type: ignore\n ) # type: ignore\n if iter_no != 0\n else self._prompts[idx] # type: ignore\n )\n prompts.append([{\"role\": \"user\", \"content\": prompt_with_template}])\n return prompts\n\n def _generate_answers(\n self, instructions: List[List[str]]\n ) -> Tuple[List[str], \"LLMStatistics\"]:\n \"\"\"Generates the answer for the last instruction in `instructions`.\n\n Args:\n instructions: A list of lists where each item is a list with either the last\n evolved instruction if `store_evolutions=False` or all the evolved instructions\n if `store_evolutions=True`.\n\n Returns:\n A list of answers for the last instruction in `instructions`.\n \"\"\"\n # TODO: update to generate answers for all the instructions\n _formatted_instructions = [\n [{\"role\": \"user\", \"content\": instruction[-1]}]\n for instruction in instructions\n ]\n responses = self.llm.generate(\n _formatted_instructions,\n **self.llm.generation_kwargs, # type: ignore\n )\n statistics: Dict[str, Any] = defaultdict(list)\n for response in responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n return flatten_responses(\n [response[\"generations\"] for response in responses]\n ), dict(statistics)\n\n @override\n def process(self, offset: int = 0) -> \"GeneratorStepOutput\": # NOQA: C901, type: ignore\n \"\"\"Processes the inputs of the task and generates the outputs using the LLM.\n\n Args:\n offset: The offset to start the generation from. Defaults to 0.\n\n Yields:\n A list of Python dictionaries with the outputs of the task, and a boolean\n flag indicating whether the task has finished or not i.e. 
is the last batch.\n \"\"\"\n instructions = []\n mutation_no = 0\n\n # TODO: update to take into account `offset`\n iter_no = 0\n while len(instructions) < self.num_instructions:\n prompts = self._apply_random_mutation(iter_no=iter_no)\n\n # TODO: Update the function to extract from the dict\n responses = self.llm.generate(prompts, **self.llm.generation_kwargs) # type: ignore\n\n generated_prompts = flatten_responses(\n [response[\"generations\"] for response in responses]\n )\n statistics: \"LLMStatistics\" = defaultdict(list)\n for response in responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n for idx, generated_prompt in enumerate(generated_prompts):\n generated_prompt = generated_prompt.split(\"Prompt#:\")[-1].strip()\n if self.max_length >= len(generated_prompt) >= self.min_length: # type: ignore\n instructions.append(generated_prompt)\n self._prompts[idx] = np.random.choice(self._seed_texts) # type: ignore\n else:\n self._prompts[idx] = generated_prompt # type: ignore\n\n self._logger.info(\n f\"\ud83d\udd04 Ran iteration {iter_no} with {len(instructions)} instructions already evolved!\"\n )\n iter_no += 1\n\n if len(instructions) > self.num_instructions:\n instructions = instructions[: self.num_instructions]\n if len(instructions) > mutation_no:\n mutation_no = len(instructions) - mutation_no\n\n if not self.generate_answers and len(instructions[-mutation_no:]) > 0:\n formatted_generations = []\n for mutated_instruction in instructions[-mutation_no:]:\n mutated_instruction = self.format_output(mutated_instruction)\n mutated_instruction[\"distilabel_metadata\"] = {\n f\"statistics_instruction_{self.name}\": dict(statistics)\n }\n formatted_generations.append(mutated_instruction)\n yield (\n formatted_generations,\n len(instructions) >= self.num_instructions,\n )\n\n self._logger.info(f\"\ud83c\udf89 Finished evolving {len(instructions)} instructions!\")\n\n if self.generate_answers:\n self._logger.info(\n f\"\ud83e\udde0 Generating answers for the {len(instructions)} evolved instructions!\"\n )\n\n answers, statistics = self._generate_answers(instructions)\n\n self._logger.info(\n f\"\ud83c\udf89 Finished generating answers for the {len(instructions)} evolved instructions!\"\n )\n\n formatted_outputs = []\n for instruction, answer in zip(instructions, answers):\n formatted_output = self.format_output(instruction, answer)\n formatted_output[\"distilabel_metadata\"] = {\n f\"statistics_answer_{self.name}\": dict(statistics)\n }\n formatted_outputs.append(formatted_output)\n\n yield (\n formatted_outputs,\n True,\n )\n\n @override\n def _sample_input(self) -> \"ChatType\":\n return self._apply_random_mutation(iter_no=0)[0]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstructGenerator._english_nouns","title":"_english_nouns
cached
property
","text":"A list of English nouns to be used as part of the starting prompts for the task.
Referencesoutputs
property
","text":"The output for the task are the instruction
, the answer
if generate_answers=True
and the model_name
.
mutation_templates_names
property
","text":"Returns the names i.e. keys of the provided mutation_templates
.
_generate_seed_texts()
","text":"Generates a list of seed texts to be used as part of the starting prompts for the task.
It will use the FRESH_START
mutation template, as it needs to generate text from scratch; and a list of English words will be used to generate the seed texts that will be provided to the mutation method and included within the prompt.
Returns:
Type DescriptionList[str]
A list of seed texts to be used as part of the starting prompts for the task.
Source code insrc/distilabel/steps/tasks/evol_instruct/generator.py
def _generate_seed_texts(self) -> List[str]:\n \"\"\"Generates a list of seed texts to be used as part of the starting prompts for the task.\n\n It will use the `FRESH_START` mutation template, as it needs to generate text from scratch; and\n a list of English words will be used to generate the seed texts that will be provided to the\n mutation method and included within the prompt.\n\n Returns:\n A list of seed texts to be used as part of the starting prompts for the task.\n \"\"\"\n seed_texts = []\n for _ in range(self.num_instructions * 10):\n num_words = np.random.choice([1, 2, 3, 4])\n seed_texts.append(\n self.mutation_templates[\"FRESH_START\"].replace( # type: ignore\n \"<PROMPT>\",\n \", \".join(\n [\n np.random.choice(self._english_nouns).strip()\n for _ in range(num_words)\n ]\n ),\n )\n )\n return seed_texts\n
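A minimal illustrative sketch of the seed-text construction described above; the template string and noun list below are placeholders, not the actual FRESH_START template or english_nouns.txt shipped with distilabel:
import numpy as np\n\n# Hypothetical stand-ins for the real FRESH_START template and English noun list.\nfresh_start_template = \"Write one question or request containing the following words: <PROMPT>\"\nenglish_nouns = [\"river\", \"theory\", \"apple\", \"keyboard\"]\n\nnum_words = np.random.choice([1, 2, 3, 4])\nseed_text = fresh_start_template.replace(\n    \"<PROMPT>\",\n    \", \".join(np.random.choice(english_nouns).strip() for _ in range(num_words)),\n)\n# e.g. \"Write one question or request containing the following words: apple, river\"\n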
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstructGenerator.model_post_init","title":"model_post_init(__context)
","text":"Override this method to perform additional initialization after __init__
and model_construct
. This is useful if you want to do some validation that requires the entire model to be initialized.
src/distilabel/steps/tasks/evol_instruct/generator.py
@override\ndef model_post_init(self, __context: Any) -> None:\n \"\"\"Override this method to perform additional initialization after `__init__` and `model_construct`.\n This is useful if you want to do some validation that requires the entire model to be initialized.\n \"\"\"\n super().model_post_init(__context)\n\n np.random.seed(self.seed)\n\n self._seed_texts = self._generate_seed_texts()\n self._prompts = [\n np.random.choice(self._seed_texts) for _ in range(self.num_instructions)\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstructGenerator.format_output","title":"format_output(instruction, answer=None)
","text":"The output for the task is a dict with: instruction
; answer
if generate_answers=True
; and, finally, the model_name
.
Parameters:
Name Type Description Defaultinstruction
str
The instruction to be included within the output.
requiredanswer
Optional[str]
The answer to be included within the output if generate_answers=True
.
None
Returns:
Type DescriptionDict[str, Any]
If generate_answers=True
return {\"instruction\": ..., \"answer\": ..., \"model_name\": ...};
Dict[str, Any]
if generate_answers=False
return {\"instruction\": ..., \"model_name\": ...};
src/distilabel/steps/tasks/evol_instruct/generator.py
def format_output( # type: ignore\n self, instruction: str, answer: Optional[str] = None\n) -> Dict[str, Any]:\n \"\"\"The output for the task is a dict with: `instruction`; `answer` if `generate_answers=True`;\n and, finally, the `model_name`.\n\n Args:\n instruction: The instruction to be included within the output.\n answer: The answer to be included within the output if `generate_answers=True`.\n\n Returns:\n If `generate_answers=True` return {\"instruction\": ..., \"answer\": ..., \"model_name\": ...};\n if `generate_answers=False` return {\"instruction\": ..., \"model_name\": ...};\n \"\"\"\n _output = {\n \"instruction\": instruction,\n \"model_name\": self.llm.model_name,\n }\n if self.generate_answers and answer is not None:\n _output[\"answer\"] = answer\n return _output\n
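A minimal usage sketch of the output format described above (hypothetical, already-instantiated task with generate_answers=True; the values are placeholders):
# `task` is a hypothetical, already-loaded EvolInstructGenerator with generate_answers=True.\noutput = task.format_output(instruction=\"an evolved instruction\", answer=\"an answer\")\n# {'instruction': 'an evolved instruction', 'model_name': '<name of task.llm>', 'answer': 'an answer'}\n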
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstructGenerator._apply_random_mutation","title":"_apply_random_mutation(iter_no)
","text":"Applies a random mutation from the ones provided as part of the mutation_templates
enum, and returns the provided instruction within the mutation prompt.
Parameters:
Name Type Description Defaultiter_no
int
The iteration number to be used to check whether the iteration is the first one i.e. FRESH_START, or not.
requiredReturns:
Type DescriptionList[ChatType]
A random mutation prompt with the provided instruction formatted as an OpenAI conversation.
Source code insrc/distilabel/steps/tasks/evol_instruct/generator.py
def _apply_random_mutation(self, iter_no: int) -> List[\"ChatType\"]:\n \"\"\"Applies a random mutation from the ones provided as part of the `mutation_templates`\n enum, and returns the provided instruction within the mutation prompt.\n\n Args:\n iter_no: The iteration number to be used to check whether the iteration is the\n first one i.e. FRESH_START, or not.\n\n Returns:\n A random mutation prompt with the provided instruction formatted as an OpenAI conversation.\n \"\"\"\n prompts = []\n for idx in range(self.num_instructions):\n if (\n iter_no == 0\n or \"Write one question or request containing\" in self._prompts[idx] # type: ignore\n ):\n mutation = \"FRESH_START\"\n else:\n mutation = np.random.choice(self.mutation_templates_names)\n if mutation == \"FRESH_START\":\n self._prompts[idx] = np.random.choice(self._seed_texts) # type: ignore\n\n prompt_with_template = (\n self.mutation_templates[mutation].replace( # type: ignore\n \"<PROMPT>\",\n self._prompts[idx], # type: ignore\n ) # type: ignore\n if iter_no != 0\n else self._prompts[idx] # type: ignore\n )\n prompts.append([{\"role\": \"user\", \"content\": prompt_with_template}])\n return prompts\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstructGenerator._generate_answers","title":"_generate_answers(instructions)
","text":"Generates the answer for the last instruction in instructions
.
Parameters:
Name Type Description Defaultinstructions
List[List[str]]
A list of lists where each item is a list with either the last evolved instruction if store_evolutions=False
or all the evolved instructions if store_evolutions=True
.
Returns:
Type DescriptionTuple[List[str], LLMStatistics]
A list of answers for the last instruction in instructions
.
src/distilabel/steps/tasks/evol_instruct/generator.py
def _generate_answers(\n self, instructions: List[List[str]]\n) -> Tuple[List[str], \"LLMStatistics\"]:\n \"\"\"Generates the answer for the last instruction in `instructions`.\n\n Args:\n instructions: A list of lists where each item is a list with either the last\n evolved instruction if `store_evolutions=False` or all the evolved instructions\n if `store_evolutions=True`.\n\n Returns:\n A list of answers for the last instruction in `instructions`.\n \"\"\"\n # TODO: update to generate answers for all the instructions\n _formatted_instructions = [\n [{\"role\": \"user\", \"content\": instruction[-1]}]\n for instruction in instructions\n ]\n responses = self.llm.generate(\n _formatted_instructions,\n **self.llm.generation_kwargs, # type: ignore\n )\n statistics: Dict[str, Any] = defaultdict(list)\n for response in responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n return flatten_responses(\n [response[\"generations\"] for response in responses]\n ), dict(statistics)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolInstructGenerator.process","title":"process(offset=0)
","text":"Processes the inputs of the task and generates the outputs using the LLM.
Parameters:
Name Type Description Defaultoffset
int
The offset to start the generation from. Defaults to 0.
0
Yields:
Type DescriptionGeneratorStepOutput
A list of Python dictionaries with the outputs of the task, and a boolean
GeneratorStepOutput
flag indicating whether the task has finished or not i.e. is the last batch.
Source code insrc/distilabel/steps/tasks/evol_instruct/generator.py
@override\ndef process(self, offset: int = 0) -> \"GeneratorStepOutput\": # NOQA: C901, type: ignore\n \"\"\"Processes the inputs of the task and generates the outputs using the LLM.\n\n Args:\n offset: The offset to start the generation from. Defaults to 0.\n\n Yields:\n A list of Python dictionaries with the outputs of the task, and a boolean\n flag indicating whether the task has finished or not i.e. is the last batch.\n \"\"\"\n instructions = []\n mutation_no = 0\n\n # TODO: update to take into account `offset`\n iter_no = 0\n while len(instructions) < self.num_instructions:\n prompts = self._apply_random_mutation(iter_no=iter_no)\n\n # TODO: Update the function to extract from the dict\n responses = self.llm.generate(prompts, **self.llm.generation_kwargs) # type: ignore\n\n generated_prompts = flatten_responses(\n [response[\"generations\"] for response in responses]\n )\n statistics: \"LLMStatistics\" = defaultdict(list)\n for response in responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n for idx, generated_prompt in enumerate(generated_prompts):\n generated_prompt = generated_prompt.split(\"Prompt#:\")[-1].strip()\n if self.max_length >= len(generated_prompt) >= self.min_length: # type: ignore\n instructions.append(generated_prompt)\n self._prompts[idx] = np.random.choice(self._seed_texts) # type: ignore\n else:\n self._prompts[idx] = generated_prompt # type: ignore\n\n self._logger.info(\n f\"\ud83d\udd04 Ran iteration {iter_no} with {len(instructions)} instructions already evolved!\"\n )\n iter_no += 1\n\n if len(instructions) > self.num_instructions:\n instructions = instructions[: self.num_instructions]\n if len(instructions) > mutation_no:\n mutation_no = len(instructions) - mutation_no\n\n if not self.generate_answers and len(instructions[-mutation_no:]) > 0:\n formatted_generations = []\n for mutated_instruction in instructions[-mutation_no:]:\n mutated_instruction = self.format_output(mutated_instruction)\n mutated_instruction[\"distilabel_metadata\"] = {\n f\"statistics_instruction_{self.name}\": dict(statistics)\n }\n formatted_generations.append(mutated_instruction)\n yield (\n formatted_generations,\n len(instructions) >= self.num_instructions,\n )\n\n self._logger.info(f\"\ud83c\udf89 Finished evolving {len(instructions)} instructions!\")\n\n if self.generate_answers:\n self._logger.info(\n f\"\ud83e\udde0 Generating answers for the {len(instructions)} evolved instructions!\"\n )\n\n answers, statistics = self._generate_answers(instructions)\n\n self._logger.info(\n f\"\ud83c\udf89 Finished generating answers for the {len(instructions)} evolved instructions!\"\n )\n\n formatted_outputs = []\n for instruction, answer in zip(instructions, answers):\n formatted_output = self.format_output(instruction, answer)\n formatted_output[\"distilabel_metadata\"] = {\n f\"statistics_answer_{self.name}\": dict(statistics)\n }\n formatted_outputs.append(formatted_output)\n\n yield (\n formatted_outputs,\n True,\n )\n
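A minimal sketch of how the yielded batches and the finished flag might be consumed (hypothetical, already-loaded task instance):
# `task` is a hypothetical, already-loaded EvolInstructGenerator.\nfor batch, is_last_batch in task.process():\n    for row in batch:\n        print(row[\"instruction\"], row[\"model_name\"])\n    if is_last_batch:\n        break\n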
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolQuality","title":"EvolQuality
","text":" Bases: Task
Evolve the quality of the responses using an LLM
.
EvolQuality
task is used to evolve the quality of the responses given a prompt by generating a new response with a language model. This step implements the evolution quality task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
Attributes:
Name Type Descriptionnum_evolutions
int
The number of evolutions to be performed on the responses.
store_evolutions
bool
Whether to store all the evolved responses or just the last one. Defaults to False
.
include_original_response
bool
Whether to include the original response within the evolved responses. Defaults to False
.
mutation_templates
Dict[str, str]
The mutation templates to be used to evolve the responses.
seed
RuntimeParameter[int]
The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
seed
: The seed to be set for numpy
in order to randomly pick a mutation method.str
): The instruction that was used to generate the responses
.str
): The responses to be rewritten.str
): The evolved response if store_evolutions=False
.List[str]
): The evolved responses if store_evolutions=True
.str
): The name of the LLM used to evolve the responses.What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
Examples:
Evolve the quality of the responses given a prompt:
from distilabel.steps.tasks import EvolQuality\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_quality = EvolQuality(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n)\n\nevol_quality.load()\n\nresult = next(\n evol_quality.process(\n [\n {\"instruction\": \"common instruction\", \"response\": \"a response\"},\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'common instruction',\n# 'response': 'a response',\n# 'evolved_response': 'evolved response',\n# 'model_name': '\"mistralai/Mistral-7B-Instruct-v0.2\"'\n# }\n# ]\n
Citations @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n}\n
Source code in src/distilabel/steps/tasks/evol_quality/base.py
class EvolQuality(Task):\n \"\"\"Evolve the quality of the responses using an `LLM`.\n\n `EvolQuality` task is used to evolve the quality of the responses given a prompt,\n by generating a new response with a language model. This step implements the evolution\n quality task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of\n Automatic Data Selection in Instruction Tuning'.\n\n Attributes:\n num_evolutions: The number of evolutions to be performed on the responses.\n store_evolutions: Whether to store all the evolved responses or just the last one.\n Defaults to `False`.\n include_original_response: Whether to include the original response within the evolved\n responses. Defaults to `False`.\n mutation_templates: The mutation templates to be used to evolve the responses.\n seed: The seed to be set for `numpy` in order to randomly pick a mutation method.\n Defaults to `42`.\n\n Runtime parameters:\n - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.\n\n Input columns:\n - instruction (`str`): The instruction that was used to generate the `responses`.\n - response (`str`): The responses to be rewritten.\n\n Output columns:\n - evolved_response (`str`): The evolved response if `store_evolutions=False`.\n - evolved_responses (`List[str]`): The evolved responses if `store_evolutions=True`.\n - model_name (`str`): The name of the LLM used to evolve the responses.\n\n Categories:\n - evol\n - response\n - deita\n\n References:\n - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)\n\n Examples:\n Evolve the quality of the responses given a prompt:\n\n ```python\n from distilabel.steps.tasks import EvolQuality\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n evol_quality = EvolQuality(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n )\n\n evol_quality.load()\n\n result = next(\n evol_quality.process(\n [\n {\"instruction\": \"common instruction\", \"response\": \"a response\"},\n ]\n )\n )\n # result\n # [\n # {\n # 'instruction': 'common instruction',\n # 'response': 'a response',\n # 'evolved_response': 'evolved response',\n # 'model_name': '\"mistralai/Mistral-7B-Instruct-v0.2\"'\n # }\n # ]\n ```\n\n Citations:\n ```\n @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? 
A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n }\n ```\n \"\"\"\n\n num_evolutions: int\n store_evolutions: bool = False\n include_original_response: bool = False\n mutation_templates: Dict[str, str] = MUTATION_TEMPLATES\n\n seed: RuntimeParameter[int] = Field(\n default=42,\n description=\"As `numpy` is being used in order to randomly pick a mutation method, then is nice to set a random seed.\",\n )\n\n @override\n def model_post_init(self, __context: Any) -> None:\n \"\"\"Override this method to perform additional initialization after `__init__` and `model_construct`.\n This is useful if you want to do some validation that requires the entire model to be initialized.\n \"\"\"\n super().model_post_init(__context)\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task are the `instruction` and `response`.\"\"\"\n return [\"instruction\", \"response\"]\n\n def format_input(self, input: str) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation. And the\n `system_prompt` is added as the first message if it exists.\"\"\"\n return [{\"role\": \"user\", \"content\": input}]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task are the `evolved_response/s` and the `model_name`.\"\"\"\n # TODO: having to define a `model_name` column every time as the `Task.outputs` is not ideal,\n # this could be handled always and the value could be included within the DAG validation when\n # a `Task` is used, since all the `Task` subclasses will have an `llm` with a `model_name` attr.\n _outputs = [\n (\"evolved_response\" if not self.store_evolutions else \"evolved_responses\"),\n \"model_name\",\n ]\n\n return _outputs\n\n def format_output(self, responses: Union[str, List[str]]) -> Dict[str, Any]: # type: ignore\n \"\"\"The output for the task is a dict with: `evolved_response` or `evolved_responses`,\n depending whether the value is either `False` or `True` for `store_evolutions`, respectively;\n and, finally, the `model_name`.\n\n Args:\n responses: The responses to be included within the output.\n\n Returns:\n if `store_evolutions=False` return {\"evolved_response\": ..., \"model_name\": ...};\n if `store_evolutions=True` return {\"evolved_responses\": ..., \"model_name\": ...}.\n \"\"\"\n _output = {}\n\n if not self.store_evolutions:\n _output[\"evolved_response\"] = responses[-1]\n else:\n _output[\"evolved_responses\"] = responses\n\n _output[\"model_name\"] = self.llm.model_name\n return _output\n\n @property\n def mutation_templates_names(self) -> List[str]:\n \"\"\"Returns the names i.e. 
keys of the provided `mutation_templates` enum.\"\"\"\n return list(self.mutation_templates.keys())\n\n def _apply_random_mutation(self, instruction: str, response: str) -> str:\n \"\"\"Applies a random mutation from the ones provided as part of the `mutation_templates`\n enum, and returns the provided instruction within the mutation prompt.\n\n Args:\n instruction: The instruction to be included within the mutation prompt.\n\n Returns:\n A random mutation prompt with the provided instruction.\n \"\"\"\n mutation = np.random.choice(self.mutation_templates_names)\n return (\n self.mutation_templates[mutation]\n .replace(\"<PROMPT>\", instruction)\n .replace(\"<RESPONSE>\", response)\n )\n\n def _evolve_reponses(\n self, inputs: \"StepInput\"\n ) -> Tuple[List[List[str]], Dict[str, Any]]:\n \"\"\"Evolves the instructions provided as part of the inputs of the task.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list where each item is a list with either the last evolved instruction if\n `store_evolutions=False` or all the evolved instructions if `store_evolutions=True`.\n \"\"\"\n np.random.seed(self.seed)\n instructions: List[List[str]] = [[input[\"instruction\"]] for input in inputs]\n responses: List[List[str]] = [[input[\"response\"]] for input in inputs]\n statistics: Dict[str, Any] = defaultdict(list)\n\n for iter_no in range(self.num_evolutions):\n formatted_prompts = []\n for instruction, response in zip(instructions, responses):\n formatted_prompts.append(\n self._apply_random_mutation(instruction[-1], response[-1])\n )\n\n formatted_prompts = [\n self.format_input(prompt) for prompt in formatted_prompts\n ]\n\n generated_responses = self.llm.generate(\n formatted_prompts,\n **self.llm.generation_kwargs, # type: ignore\n )\n for response in generated_responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n if self.store_evolutions:\n responses = [\n response + [evolved_response[\"generations\"][0]]\n for response, evolved_response in zip(\n responses, generated_responses\n )\n ]\n else:\n responses = [\n [evolved_response[\"generations\"][0]]\n for evolved_response in generated_responses\n ]\n\n self._logger.info(\n f\"\ud83d\udd04 Ran iteration {iter_no} evolving {len(responses)} responses!\"\n )\n\n return responses, dict(statistics)\n\n @override\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Processes the inputs of the task and generates the outputs using the LLM.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n\n responses, statistics = self._evolve_reponses(inputs)\n\n if self.store_evolutions:\n # Remove the input instruction from the `evolved_responses` list\n from_ = 1 if not self.include_original_response else 0\n responses = [response[from_:] for response in responses]\n\n for input, response in zip(inputs, responses):\n input.update(self.format_output(response))\n input.update(\n {\"distilabel_metadata\": {f\"statistics_{self.name}\": statistics}}\n )\n yield inputs\n\n self._logger.info(f\"\ud83c\udf89 Finished evolving {len(responses)} instructions!\")\n\n @override\n def _sample_input(self) -> ChatType:\n return self.format_input(\"<PLACEHOLDER_INSTRUCTION>\")\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolQuality.inputs","title":"inputs
property
","text":"The input for the task are the instruction
and response
.
outputs
property
","text":"The output for the task are the evolved_response/s
and the model_name
.
mutation_templates_names
property
","text":"Returns the names i.e. keys of the provided mutation_templates
enum.
model_post_init(__context)
","text":"Override this method to perform additional initialization after __init__
and model_construct
. This is useful if you want to do some validation that requires the entire model to be initialized.
src/distilabel/steps/tasks/evol_quality/base.py
@override\ndef model_post_init(self, __context: Any) -> None:\n \"\"\"Override this method to perform additional initialization after `__init__` and `model_construct`.\n This is useful if you want to do some validation that requires the entire model to be initialized.\n \"\"\"\n super().model_post_init(__context)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolQuality.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation. And the system_prompt
is added as the first message if it exists.
src/distilabel/steps/tasks/evol_quality/base.py
def format_input(self, input: str) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation. And the\n `system_prompt` is added as the first message if it exists.\"\"\"\n return [{\"role\": \"user\", \"content\": input}]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolQuality.format_output","title":"format_output(responses)
","text":"The output for the task is a dict with: evolved_response
or evolved_responses
, depending whether the value is either False
or True
for store_evolutions
, respectively; and, finally, the model_name
.
Parameters:
Name Type Description Defaultresponses
Union[str, List[str]]
The responses to be included within the output.
requiredReturns:
Type DescriptionDict[str, Any]
if store_evolutions=False
return {\"evolved_response\": ..., \"model_name\": ...};
Dict[str, Any]
if store_evolutions=True
return {\"evolved_responses\": ..., \"model_name\": ...}.
src/distilabel/steps/tasks/evol_quality/base.py
def format_output(self, responses: Union[str, List[str]]) -> Dict[str, Any]: # type: ignore\n \"\"\"The output for the task is a dict with: `evolved_response` or `evolved_responses`,\n depending whether the value is either `False` or `True` for `store_evolutions`, respectively;\n and, finally, the `model_name`.\n\n Args:\n responses: The responses to be included within the output.\n\n Returns:\n if `store_evolutions=False` return {\"evolved_response\": ..., \"model_name\": ...};\n if `store_evolutions=True` return {\"evolved_responses\": ..., \"model_name\": ...}.\n \"\"\"\n _output = {}\n\n if not self.store_evolutions:\n _output[\"evolved_response\"] = responses[-1]\n else:\n _output[\"evolved_responses\"] = responses\n\n _output[\"model_name\"] = self.llm.model_name\n return _output\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolQuality._apply_random_mutation","title":"_apply_random_mutation(instruction, response)
","text":"Applies a random mutation from the ones provided as part of the mutation_templates
enum, and returns the provided instruction within the mutation prompt.
Parameters:
Name Type Description Defaultinstruction
str
The instruction to be included within the mutation prompt.
requiredReturns:
Type Descriptionstr
A random mutation prompt with the provided instruction.
Source code insrc/distilabel/steps/tasks/evol_quality/base.py
def _apply_random_mutation(self, instruction: str, response: str) -> str:\n \"\"\"Applies a random mutation from the ones provided as part of the `mutation_templates`\n enum, and returns the provided instruction within the mutation prompt.\n\n Args:\n instruction: The instruction to be included within the mutation prompt.\n\n Returns:\n A random mutation prompt with the provided instruction.\n \"\"\"\n mutation = np.random.choice(self.mutation_templates_names)\n return (\n self.mutation_templates[mutation]\n .replace(\"<PROMPT>\", instruction)\n .replace(\"<RESPONSE>\", response)\n )\n
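An illustrative sketch of the placeholder substitution performed above; the template text is made up and is not one of the actual EvolQuality mutation templates:
# Made-up mutation template using the <PROMPT> and <RESPONSE> placeholders this method fills in.\ntemplate = \"Improve the answer to the question '<PROMPT>'. Current answer: <RESPONSE>\"\nprompt = (\n    template.replace(\"<PROMPT>\", \"Explain photosynthesis in one sentence.\")\n    .replace(\"<RESPONSE>\", \"Plants turn sunlight into food.\")\n)\n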
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolQuality._evolve_reponses","title":"_evolve_reponses(inputs)
","text":"Evolves the instructions provided as part of the inputs of the task.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of Python dictionaries with the inputs of the task.
requiredReturns:
Type DescriptionList[List[str]]
A list where each item is a list with either the last evolved instruction if
Dict[str, Any]
store_evolutions=False
or all the evolved instructions if store_evolutions=True
.
src/distilabel/steps/tasks/evol_quality/base.py
def _evolve_reponses(\n self, inputs: \"StepInput\"\n) -> Tuple[List[List[str]], Dict[str, Any]]:\n \"\"\"Evolves the instructions provided as part of the inputs of the task.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list where each item is a list with either the last evolved instruction if\n `store_evolutions=False` or all the evolved instructions if `store_evolutions=True`.\n \"\"\"\n np.random.seed(self.seed)\n instructions: List[List[str]] = [[input[\"instruction\"]] for input in inputs]\n responses: List[List[str]] = [[input[\"response\"]] for input in inputs]\n statistics: Dict[str, Any] = defaultdict(list)\n\n for iter_no in range(self.num_evolutions):\n formatted_prompts = []\n for instruction, response in zip(instructions, responses):\n formatted_prompts.append(\n self._apply_random_mutation(instruction[-1], response[-1])\n )\n\n formatted_prompts = [\n self.format_input(prompt) for prompt in formatted_prompts\n ]\n\n generated_responses = self.llm.generate(\n formatted_prompts,\n **self.llm.generation_kwargs, # type: ignore\n )\n for response in generated_responses:\n for k, v in response[\"statistics\"].items():\n statistics[k].append(v[0])\n\n if self.store_evolutions:\n responses = [\n response + [evolved_response[\"generations\"][0]]\n for response, evolved_response in zip(\n responses, generated_responses\n )\n ]\n else:\n responses = [\n [evolved_response[\"generations\"][0]]\n for evolved_response in generated_responses\n ]\n\n self._logger.info(\n f\"\ud83d\udd04 Ran iteration {iter_no} evolving {len(responses)} responses!\"\n )\n\n return responses, dict(statistics)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.EvolQuality.process","title":"process(inputs)
","text":"Processes the inputs of the task and generates the outputs using the LLM.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of Python dictionaries with the inputs of the task.
requiredReturns:
Type DescriptionStepOutput
A list of Python dictionaries with the outputs of the task.
Source code insrc/distilabel/steps/tasks/evol_quality/base.py
@override\ndef process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Processes the inputs of the task and generates the outputs using the LLM.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Returns:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n\n responses, statistics = self._evolve_reponses(inputs)\n\n if self.store_evolutions:\n # Remove the input instruction from the `evolved_responses` list\n from_ = 1 if not self.include_original_response else 0\n responses = [response[from_:] for response in responses]\n\n for input, response in zip(inputs, responses):\n input.update(self.format_output(response))\n input.update(\n {\"distilabel_metadata\": {f\"statistics_{self.name}\": statistics}}\n )\n yield inputs\n\n self._logger.info(f\"\ud83c\udf89 Finished evolving {len(responses)} instructions!\")\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateEmbeddings","title":"GenerateEmbeddings
","text":" Bases: Step
Generate embeddings using the last hidden state of an LLM
.
Generate embeddings for a text input using the last hidden state of an LLM
, as described in the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
Attributes:
Name Type Descriptionllm
LLM
The LLM
to use to generate the embeddings.
str
, List[Dict[str, str]]
): The input text or conversation to generate embeddings for.List[float]
): The embedding of the input text or conversation.str
): The model name used to generate the embeddings.Examples:
Rank LLM candidates:
from distilabel.steps.tasks import GenerateEmbeddings\nfrom distilabel.models.llms.huggingface import TransformersLLM\n\n# Consider this as a placeholder for your actual LLM.\nembedder = GenerateEmbeddings(\n llm=TransformersLLM(\n model=\"TaylorAI/bge-micro-v2\",\n model_kwargs={\"is_decoder\": True},\n cuda_devices=[],\n )\n)\nembedder.load()\n\nresult = next(\n embedder.process(\n [\n {\"text\": \"Hello, how are you?\"},\n ]\n )\n)\n
Citations @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n}\n
Source code in src/distilabel/steps/tasks/generate_embeddings.py
class GenerateEmbeddings(Step):\n \"\"\"Generate embeddings using the last hidden state of an `LLM`.\n\n Generate embeddings for a text input using the last hidden state of an `LLM`, as\n described in the paper 'What Makes Good Data for Alignment? A Comprehensive Study of\n Automatic Data Selection in Instruction Tuning'.\n\n Attributes:\n llm: The `LLM` to use to generate the embeddings.\n\n Input columns:\n - text (`str`, `List[Dict[str, str]]`): The input text or conversation to generate\n embeddings for.\n\n Output columns:\n - embedding (`List[float]`): The embedding of the input text or conversation.\n - model_name (`str`): The model name used to generate the embeddings.\n\n Categories:\n - embedding\n - llm\n\n References:\n - [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)\n\n Examples:\n Rank LLM candidates:\n\n ```python\n from distilabel.steps.tasks import GenerateEmbeddings\n from distilabel.models.llms.huggingface import TransformersLLM\n\n # Consider this as a placeholder for your actual LLM.\n embedder = GenerateEmbeddings(\n llm=TransformersLLM(\n model=\"TaylorAI/bge-micro-v2\",\n model_kwargs={\"is_decoder\": True},\n cuda_devices=[],\n )\n )\n embedder.load()\n\n result = next(\n embedder.process(\n [\n {\"text\": \"Hello, how are you?\"},\n ]\n )\n )\n ```\n\n Citations:\n ```\n @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n }\n ```\n \"\"\"\n\n llm: LLM\n\n def load(self) -> None:\n \"\"\"Loads the `LLM` used to generate the embeddings.\"\"\"\n super().load()\n\n self.llm.load()\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The inputs for the task is a `text` column containing either a string or a\n list of dictionaries in OpenAI chat-like format.\"\"\"\n return [\"text\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The outputs for the task is an `embedding` column containing the embedding of\n the `text` input.\"\"\"\n return [\"embedding\", \"model_name\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"Formats the input to be used by the LLM to generate the embeddings. The input\n can be in `ChatType` format or a string. If a string, it will be converted to a\n list of dictionaries in OpenAI chat-like format.\n\n Args:\n input: The input to format.\n\n Returns:\n The OpenAI chat-like format of the input.\n \"\"\"\n text = input[\"text\"] = input[\"text\"]\n\n # input is in `ChatType` format\n if isinstance(text, str):\n return [{\"role\": \"user\", \"content\": text}]\n\n if is_openai_format(text):\n return text\n\n raise DistilabelUserError(\n f\"Couldn't format input for step {self.name}. 
The `text` input column has to\"\n \" be a string or a list of dictionaries in OpenAI chat-like format.\",\n page=\"components-gallery/tasks/generateembeddings/\",\n )\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Generates an embedding for each input using the last hidden state of the `LLM`.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n formatted_inputs = [self.format_input(input) for input in inputs]\n last_hidden_states = self.llm.get_last_hidden_states(formatted_inputs)\n for input, hidden_state in zip(inputs, last_hidden_states):\n input[\"embedding\"] = hidden_state[-1].tolist()\n input[\"model_name\"] = self.llm.model_name\n yield inputs\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateEmbeddings.inputs","title":"inputs
property
","text":"The inputs for the task is a text
column containing either a string or a list of dictionaries in OpenAI chat-like format.
outputs
property
","text":"The outputs for the task is an embedding
column containing the embedding of the text
input.
load()
","text":"Loads the LLM
used to generate the embeddings.
src/distilabel/steps/tasks/generate_embeddings.py
def load(self) -> None:\n \"\"\"Loads the `LLM` used to generate the embeddings.\"\"\"\n super().load()\n\n self.llm.load()\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateEmbeddings.format_input","title":"format_input(input)
","text":"Formats the input to be used by the LLM to generate the embeddings. The input can be in ChatType
format or a string. If a string, it will be converted to a list of dictionaries in OpenAI chat-like format.
Parameters:
Name Type Description Defaultinput
Dict[str, Any]
The input to format.
requiredReturns:
Type DescriptionChatType
The OpenAI chat-like format of the input.
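For example, reusing the embedder instance from the class-level example above (a minimal sketch), a plain string is wrapped into a single user turn:
formatted = embedder.format_input({\"text\": \"Hello, how are you?\"})\n# [{'role': 'user', 'content': 'Hello, how are you?'}]\n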
Source code in src/distilabel/steps/tasks/generate_embeddings.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"Formats the input to be used by the LLM to generate the embeddings. The input\n can be in `ChatType` format or a string. If a string, it will be converted to a\n list of dictionaries in OpenAI chat-like format.\n\n Args:\n input: The input to format.\n\n Returns:\n The OpenAI chat-like format of the input.\n \"\"\"\n text = input[\"text\"] = input[\"text\"]\n\n # input is in `ChatType` format\n if isinstance(text, str):\n return [{\"role\": \"user\", \"content\": text}]\n\n if is_openai_format(text):\n return text\n\n raise DistilabelUserError(\n f\"Couldn't format input for step {self.name}. The `text` input column has to\"\n \" be a string or a list of dictionaries in OpenAI chat-like format.\",\n page=\"components-gallery/tasks/generateembeddings/\",\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateEmbeddings.process","title":"process(inputs)
","text":"Generates an embedding for each input using the last hidden state of the LLM
.
Parameters:
Name Type Description Defaultinputs
StepInput
A list of Python dictionaries with the inputs of the task.
requiredYields:
Type DescriptionStepOutput
A list of Python dictionaries with the outputs of the task.
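As an illustration (a sketch reusing the loaded embedder from the class-level example above; the embedding values shown are made up), each yielded row keeps its original columns and gains embedding and model_name:
result = next(embedder.process([{\"text\": \"Hello, how are you?\"}]))\n# [{'text': 'Hello, how are you?', 'embedding': [-0.02, 0.13, ...], 'model_name': 'TaylorAI/bge-micro-v2'}]  # illustrative values\n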
Source code in src/distilabel/steps/tasks/generate_embeddings.py
def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Generates an embedding for each input using the last hidden state of the `LLM`.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n A list of Python dictionaries with the outputs of the task.\n \"\"\"\n formatted_inputs = [self.format_input(input) for input in inputs]\n last_hidden_states = self.llm.get_last_hidden_states(formatted_inputs)\n for input, hidden_state in zip(inputs, last_hidden_states):\n input[\"embedding\"] = hidden_state[-1].tolist()\n input[\"model_name\"] = self.llm.model_name\n yield inputs\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Genstruct","title":"Genstruct
","text":" Bases: Task
Generate a pair of instruction-response from a document using an LLM
.
Genstruct
is a pre-defined task designed to generate valid instructions from a given raw document, with the title and the content, enabling the creation of new, partially synthetic instruction finetuning datasets from any raw-text corpus. The task is based on the Genstruct 7B model by Nous Research, which is inspired by the Ada-Instruct paper.
The Genstruct prompt, i.e. the task itself, can be used with any model, but the safest and recommended option is to use NousResearch/Genstruct-7B
as the LLM provided to the task, since it was trained for this specific task.
Attributes:
Name Type Description_template
Union[Template, None]
a Jinja2 template used to format the input for the LLM.
Input columnsstr
): The title of the document.str
): The content of the document.str
): The user's instruction based on the document.str
): The assistant's response based on the user's instruction.str
): The model name used to generate the user
and assistant
.Examples:
Generate instructions from raw documents using the title and content:
from distilabel.steps.tasks import Genstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\ngenstruct = Genstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"NousResearch/Genstruct-7B\",\n ),\n)\n\ngenstruct.load()\n\nresult = next(\n genstruct.process(\n [\n {\"title\": \"common instruction\", \"content\": \"content of the document\"},\n ]\n )\n)\n# result\n# [\n# {\n# 'title': 'An instruction',\n# 'content': 'content of the document',\n# 'model_name': 'test',\n# 'user': 'An instruction',\n# 'assistant': 'content of the document',\n# }\n# ]\n
Citations @misc{cui2023adainstructadaptinginstructiongenerators,\n title={Ada-Instruct: Adapting Instruction Generators for Complex Reasoning},\n author={Wanyun Cui and Qianle Wang},\n year={2023},\n eprint={2310.04484},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2310.04484},\n}\n
Source code in src/distilabel/steps/tasks/genstruct.py
class Genstruct(Task):\n \"\"\"Generate a pair of instruction-response from a document using an `LLM`.\n\n `Genstruct` is a pre-defined task designed to generate valid instructions from a given raw document,\n with the title and the content, enabling the creation of new, partially synthetic instruction finetuning\n datasets from any raw-text corpus. The task is based on the Genstruct 7B model by Nous Research, which is\n inspired in the Ada-Instruct paper.\n\n Note:\n The Genstruct prompt i.e. the task, can be used with any model really, but the safest / recommended\n option is to use `NousResearch/Genstruct-7B` as the LLM provided to the task, since it was trained\n for this specific task.\n\n Attributes:\n _template: a Jinja2 template used to format the input for the LLM.\n\n Input columns:\n - title (`str`): The title of the document.\n - content (`str`): The content of the document.\n\n Output columns:\n - user (`str`): The user's instruction based on the document.\n - assistant (`str`): The assistant's response based on the user's instruction.\n - model_name (`str`): The model name used to generate the `feedback` and `result`.\n\n Categories:\n - text-generation\n - instruction\n - response\n\n References:\n - [Genstruct 7B by Nous Research](https://huggingface.co/NousResearch/Genstruct-7B)\n - [Ada-Instruct: Adapting Instruction Generators for Complex Reasoning](https://arxiv.org/abs/2310.04484)\n\n Examples:\n Generate instructions from raw documents using the title and content:\n\n ```python\n from distilabel.steps.tasks import Genstruct\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n genstruct = Genstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"NousResearch/Genstruct-7B\",\n ),\n )\n\n genstruct.load()\n\n result = next(\n genstruct.process(\n [\n {\"title\": \"common instruction\", \"content\": \"content of the document\"},\n ]\n )\n )\n # result\n # [\n # {\n # 'title': 'An instruction',\n # 'content': 'content of the document',\n # 'model_name': 'test',\n # 'user': 'An instruction',\n # 'assistant': 'content of the document',\n # }\n # ]\n ```\n\n Citations:\n ```\n @misc{cui2023adainstructadaptinginstructiongenerators,\n title={Ada-Instruct: Adapting Instruction Generators for Complex Reasoning},\n author={Wanyun Cui and Qianle Wang},\n year={2023},\n eprint={2310.04484},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2310.04484},\n }\n ```\n \"\"\"\n\n _template: Union[Template, None] = PrivateAttr(...)\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"genstruct.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs for the task are the `title` and the `content`.\"\"\"\n return [\"title\", \"content\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n title=input[\"title\"], content=input[\"content\"]\n ),\n }\n ]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task are the `user` instruction based on the provided document\n and the `assistant` response based on the user's 
instruction.\"\"\"\n return [\"user\", \"assistant\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted so that both the user and the assistant messages are\n captured.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with the keys `user` and `assistant` containing the content for each role.\n \"\"\"\n if output is None:\n return {\"user\": None, \"assistant\": None}\n\n matches = re.search(_PARSE_GENSTRUCT_OUTPUT_REGEX, output, re.DOTALL)\n if not matches:\n return {\"user\": None, \"assistant\": None}\n\n return {\n \"user\": matches.group(1).strip(),\n \"assistant\": matches.group(2).strip(),\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Genstruct.inputs","title":"inputs
property
","text":"The inputs for the task are the title
and the content
.
outputs
property
","text":"The output for the task are the user
instruction based on the provided document and the assistant
response based on the user's instruction.
load()
","text":"Loads the Jinja2 template.
Source code in src/distilabel/steps/tasks/genstruct.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"genstruct.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Genstruct.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/genstruct.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n title=input[\"title\"], content=input[\"content\"]\n ),\n }\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Genstruct.format_output","title":"format_output(output, input)
","text":"The output is formatted so that both the user and the assistant messages are captured.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
the raw output of the LLM.
requiredinput
Dict[str, Any]
the input to the task. It is not used when formatting this task's output.
requiredReturns:
Type DescriptionDict[str, Any]
A dict with the keys user
and assistant
containing the content for each role.
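For instance (a minimal sketch reusing the genstruct instance from the class-level example above), a None or non-matching output yields empty fields:
parsed = genstruct.format_output(None, {\"title\": \"common instruction\", \"content\": \"content of the document\"})\n# {'user': None, 'assistant': None}\n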
src/distilabel/steps/tasks/genstruct.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"The output is formatted so that both the user and the assistant messages are\n captured.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with the keys `user` and `assistant` containing the content for each role.\n \"\"\"\n if output is None:\n return {\"user\": None, \"assistant\": None}\n\n matches = re.search(_PARSE_GENSTRUCT_OUTPUT_REGEX, output, re.DOTALL)\n if not matches:\n return {\"user\": None, \"assistant\": None}\n\n return {\n \"user\": matches.group(1).strip(),\n \"assistant\": matches.group(2).strip(),\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.BitextRetrievalGenerator","title":"BitextRetrievalGenerator
","text":" Bases: _EmbeddingDataGenerator
Generate bitext retrieval data with an LLM
to later on train an embedding model.
BitextRetrievalGenerator
is a GeneratorTask
that generates bitext retrieval data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Attributes:
Name Type Descriptionsource_language
str
The source language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
target_language
str
The target language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
unit
Optional[Literal['sentence', 'phrase', 'passage']]
The unit of the data to be generated, which can be sentence
, phrase
, or passage
. Defaults to None
, meaning that it will be randomly sampled.
difficulty
Optional[Literal['elementary school', 'high school', 'college']]
The difficulty of the query to be generated, which can be elementary school
, high school
, or college
. Defaults to None
, meaning that it will be randomly sampled.
high_score
Optional[Literal['4', '4.5', '5']]
The high score of the query to be generated, which can be 4
, 4.5
, or 5
. Defaults to None
, meaning that it will be randomly sampled.
low_score
Optional[Literal['2.5', '3', '3.5']]
The low score of the query to be generated, which can be 2.5
, 3
, or 3.5
. Defaults to None
, meaning that it will be randomly sampled.
seed
int
The random seed to be set in case there's any sampling within the format_input
method.
str
): the first sentence generated by the LLM
.str
): the second sentence generated by the LLM
.str
): the third sentence generated by the LLM
.str
): the name of the model used to generate the bitext retrieval data.Examples:
Generate bitext retrieval data for training embedding models:
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import BitextRetrievalGenerator\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = BitextRetrievalGenerator(\n source_language=\"English\",\n target_language=\"Spanish\",\n unit=\"sentence\",\n difficulty=\"elementary school\",\n high_score=\"4\",\n low_score=\"2.5\",\n llm=...,\n )\n\n ...\n\n task >> ...\n
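In the example above every attribute is pinned to a concrete value; any attribute left unset is randomly sampled when the prompt is rendered, conceptually like this (a simplified sketch, not the library's internal code):
import random\n\nunit = None  # attribute left unset on the task\nunit = unit or random.choice([\"sentence\", \"phrase\", \"passage\"])  # a value is sampled at prompt-rendering time\n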
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class BitextRetrievalGenerator(_EmbeddingDataGenerator):\n \"\"\"Generate bitext retrieval data with an `LLM` to later on train an embedding model.\n\n `BitextRetrievalGenerator` is a `GeneratorTask` that generates bitext retrieval data with an\n `LLM` to later on train an embedding model. The task is based on the paper \"Improving\n Text Embeddings with Large Language Models\" and the data is generated based on the\n provided attributes, or randomly sampled if not provided.\n\n Attributes:\n source_language: The source language of the data to be generated, which can be any of the languages\n retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.\n target_language: The target language of the data to be generated, which can be any of the languages\n retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.\n unit: The unit of the data to be generated, which can be `sentence`, `phrase`, or `passage`.\n Defaults to `None`, meaning that it will be randomly sampled.\n difficulty: The difficulty of the query to be generated, which can be `elementary school`, `high school`, or `college`.\n Defaults to `None`, meaning that it will be randomly sampled.\n high_score: The high score of the query to be generated, which can be `4`, `4.5`, or `5`.\n Defaults to `None`, meaning that it will be randomly sampled.\n low_score: The low score of the query to be generated, which can be `2.5`, `3`, or `3.5`.\n Defaults to `None`, meaning that it will be randomly sampled.\n seed: The random seed to be set in case there's any sampling within the `format_input` method.\n\n Output columns:\n - S1 (`str`): the first sentence generated by the `LLM`.\n - S2 (`str`): the second sentence generated by the `LLM`.\n - S3 (`str`): the third sentence generated by the `LLM`.\n - model_name (`str`): the name of the model used to generate the bitext retrieval\n data.\n\n Examples:\n Generate bitext retrieval data for training embedding models:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps.tasks import BitextRetrievalGenerator\n\n with Pipeline(\"my-pipeline\") as pipeline:\n task = BitextRetrievalGenerator(\n source_language=\"English\",\n target_language=\"Spanish\",\n unit=\"sentence\",\n difficulty=\"elementary school\",\n high_score=\"4\",\n low_score=\"2.5\",\n llm=...,\n )\n\n ...\n\n task >> ...\n ```\n \"\"\"\n\n source_language: str = Field(\n default=\"English\",\n description=\"The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf\",\n )\n target_language: str = Field(\n default=...,\n description=\"The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf\",\n )\n\n unit: Optional[Literal[\"sentence\", \"phrase\", \"passage\"]] = None\n difficulty: Optional[Literal[\"elementary school\", \"high school\", \"college\"]] = None\n high_score: Optional[Literal[\"4\", \"4.5\", \"5\"]] = None\n low_score: Optional[Literal[\"2.5\", \"3\", \"3.5\"]] = None\n\n _template_name: str = PrivateAttr(default=\"bitext-retrieval\")\n _can_be_used_with_offline_batch_generation = True\n\n @property\n def prompt(self) -> ChatType:\n \"\"\"Contains the `prompt` to be used in the `process` method, rendering the `_template`; and\n formatted as an OpenAI formatted chat i.e. 
a `ChatType`, assuming that there's only one turn,\n being from the user with the content being the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n source_language=self.source_language,\n target_language=self.target_language,\n unit=self.unit or random.choice([\"sentence\", \"phrase\", \"passage\"]),\n difficulty=self.difficulty\n or random.choice([\"elementary school\", \"high school\", \"college\"]),\n high_score=self.high_score or random.choice([\"4\", \"4.5\", \"5\"]),\n low_score=self.low_score or random.choice([\"2.5\", \"3\", \"3.5\"]),\n ).strip(),\n }\n ] # type: ignore\n\n @property\n def keys(self) -> List[str]:\n \"\"\"Contains the `keys` that will be parsed from the `LLM` output into a Python dict.\"\"\"\n return [\"S1\", \"S2\", \"S3\"]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.BitextRetrievalGenerator.prompt","title":"prompt
property
","text":"Contains the prompt
to be used in the process
method, rendering the _template
; and formatted as an OpenAI formatted chat i.e. a ChatType
, assuming that there's only one turn, being from the user with the content being the rendered _template
.
keys
property
","text":"Contains the keys
that will be parsed from the LLM
output into a Python dict.
GenerateLongTextMatchingData
","text":" Bases: _EmbeddingDataGeneration
Generate long text matching data with an LLM
to later on train an embedding model.
GenerateLongTextMatchingData
is a Task
that generates long text matching data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category=\"text-matching-long\"
; so that the LLM
generates a list of tasks that are flattened so that each row contains a single task for the text-matching-long category.
Attributes:
Name Type Descriptionlanguage
str
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
seed
int
The random seed to be set in case there's any sampling within the format_input
method. Note that in this task the seed
has no effect since there are no sampling params.
str
): The task description to be used in the generation.str
): the input generated by the LLM
.str
): the positive document generated by the LLM
.str
): the name of the model used to generate the long text matching data.Examples:
Generate synthetic long text matching data for training embedding models:
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateLongTextMatchingData\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-matching-long\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateLongTextMatchingData(\n language=\"English\",\n llm=..., # LLM instance\n )\n\n task >> generate\n
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class GenerateLongTextMatchingData(_EmbeddingDataGeneration):\n \"\"\"Generate long text matching data with an `LLM` to later on train an embedding model.\n\n `GenerateLongTextMatchingData` is a `Task` that generates long text matching data with an\n `LLM` to later on train an embedding model. The task is based on the paper \"Improving\n Text Embeddings with Large Language Models\" and the data is generated based on the\n provided attributes, or randomly sampled if not provided.\n\n Note:\n Ideally this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True`\n with the `category=\"text-matching-long\"`; so that the `LLM` generates a list of tasks that\n are flattened so that each row contains a single task for the text-matching-long category.\n\n Attributes:\n language: The language of the data to be generated, which can be any of the languages\n retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.\n seed: The random seed to be set in case there's any sampling within the `format_input` method.\n Note that in this task the `seed` has no effect since there are no sampling params.\n\n Input columns:\n - task (`str`): The task description to be used in the generation.\n\n Output columns:\n - input (`str`): the input generated by the `LLM`.\n - positive_document (`str`): the positive document generated by the `LLM`.\n - model_name (`str`): the name of the model used to generate the long text matching\n data.\n\n References:\n - [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)\n\n Examples:\n Generate synthetic long text matching data for training embedding models:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateLongTextMatchingData\n\n with Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-matching-long\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateLongTextMatchingData(\n language=\"English\",\n llm=..., # LLM instance\n )\n\n task >> generate\n ```\n \"\"\"\n\n language: str = Field(\n default=\"English\",\n description=\"The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf\",\n )\n\n _template_name: str = PrivateAttr(default=\"long-text-matching\")\n _can_be_used_with_offline_batch_generation = True\n\n def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"Method to format the input based on the `task` and the provided attributes, or just\n randomly sampling those if not provided. This method will render the `_template` with\n the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that\n there's only one turn, being from the user with the content being the rendered `_template`.\n\n Args:\n input: The input dictionary containing the `task` to be used in the `_template`.\n\n Returns:\n A list with a single chat containing the user's message with the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n task=input[\"task\"],\n language=self.language,\n ).strip(),\n }\n ]\n\n @property\n def keys(self) -> List[str]:\n \"\"\"Contains the `keys` that will be parsed from the `LLM` output into a Python dict.\"\"\"\n return [\"input\", \"positive_document\"]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateLongTextMatchingData.keys","title":"keys
property
","text":"Contains the keys
that will be parsed from the LLM
output into a Python dict.
format_input(input)
","text":"Method to format the input based on the task
and the provided attributes, or just randomly sampling those if not provided. This method will render the _template
with the provided arguments and return an OpenAI formatted chat i.e. a ChatType
, assuming that there's only one turn, being from the user with the content being the rendered _template
.
Parameters:
Name Type Description Defaultinput
Dict[str, Any]
The input dictionary containing the task
to be used in the _template
.
Returns:
Type DescriptionChatType
A list with a single chat containing the user's message with the rendered _template
.
src/distilabel/steps/tasks/improving_text_embeddings.py
def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"Method to format the input based on the `task` and the provided attributes, or just\n randomly sampling those if not provided. This method will render the `_template` with\n the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that\n there's only one turn, being from the user with the content being the rendered `_template`.\n\n Args:\n input: The input dictionary containing the `task` to be used in the `_template`.\n\n Returns:\n A list with a single chat containing the user's message with the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n task=input[\"task\"],\n language=self.language,\n ).strip(),\n }\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateShortTextMatchingData","title":"GenerateShortTextMatchingData
","text":" Bases: _EmbeddingDataGeneration
Generate short text matching data with an LLM
to later on train an embedding model.
GenerateShortTextMatchingData
is a Task
that generates short text matching data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category=\"text-matching-short\"
; so that the LLM
generates a list of tasks that are flattened so that each row contains a single task for the text-matching-short category.
Attributes:
Name Type Descriptionlanguage
str
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
seed
int
The random seed to be set in case there's any sampling within the format_input
method. Note that in this task the seed
has no effect since there are no sampling params.
str
): The task description to be used in the generation.str
): the input generated by the LLM
.str
): the positive document generated by the LLM
.str
): the name of the model used to generate the short text matching data.Examples:
Generate synthetic short text matching data for training embedding models:
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateShortTextMatchingData\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-matching-short\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateShortTextMatchingData(\n language=\"English\",\n llm=..., # LLM instance\n )\n\n task >> generate\n
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class GenerateShortTextMatchingData(_EmbeddingDataGeneration):\n \"\"\"Generate short text matching data with an `LLM` to later on train an embedding model.\n\n `GenerateShortTextMatchingData` is a `Task` that generates short text matching data with an\n `LLM` to later on train an embedding model. The task is based on the paper \"Improving\n Text Embeddings with Large Language Models\" and the data is generated based on the\n provided attributes, or randomly sampled if not provided.\n\n Note:\n Ideally this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True`\n with the `category=\"text-matching-short\"`; so that the `LLM` generates a list of tasks that\n are flattened so that each row contains a single task for the text-matching-short category.\n\n Attributes:\n language: The language of the data to be generated, which can be any of the languages\n retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.\n seed: The random seed to be set in case there's any sampling within the `format_input` method.\n Note that in this task the `seed` has no effect since there are no sampling params.\n\n Input columns:\n - task (`str`): The task description to be used in the generation.\n\n Output columns:\n - input (`str`): the input generated by the `LLM`.\n - positive_document (`str`): the positive document generated by the `LLM`.\n - model_name (`str`): the name of the model used to generate the short text matching\n data.\n\n References:\n - [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)\n\n Examples:\n Generate synthetic short text matching data for training embedding models:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateShortTextMatchingData\n\n with Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-matching-short\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateShortTextMatchingData(\n language=\"English\",\n llm=..., # LLM instance\n )\n\n task >> generate\n ```\n \"\"\"\n\n language: str = Field(\n default=\"English\",\n description=\"The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf\",\n )\n\n _template_name: str = PrivateAttr(default=\"short-text-matching\")\n _can_be_used_with_offline_batch_generation = True\n\n def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"Method to format the input based on the `task` and the provided attributes, or just\n randomly sampling those if not provided. This method will render the `_template` with\n the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that\n there's only one turn, being from the user with the content being the rendered `_template`.\n\n Args:\n input: The input dictionary containing the `task` to be used in the `_template`.\n\n Returns:\n A list with a single chat containing the user's message with the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n task=input[\"task\"],\n language=self.language,\n ).strip(),\n }\n ]\n\n @property\n def keys(self) -> List[str]:\n \"\"\"Contains the `keys` that will be parsed from the `LLM` output into a Python dict.\"\"\"\n return [\"input\", \"positive_document\"]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateShortTextMatchingData.keys","title":"keys
property
","text":"Contains the keys
that will be parsed from the LLM
output into a Python dict.
format_input(input)
","text":"Method to format the input based on the task
and the provided attributes, or just randomly sampling those if not provided. This method will render the _template
with the provided arguments and return an OpenAI formatted chat i.e. a ChatType
, assuming that there's only one turn, being from the user with the content being the rendered _template
.
Parameters:
Name Type Description Defaultinput
Dict[str, Any]
The input dictionary containing the task
to be used in the _template
.
requiredReturns:
Type DescriptionChatType
A list with a single chat containing the user's message with the rendered _template
.
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"Method to format the input based on the `task` and the provided attributes, or just\n randomly sampling those if not provided. This method will render the `_template` with\n the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that\n there's only one turn, being from the user with the content being the rendered `_template`.\n\n Args:\n input: The input dictionary containing the `task` to be used in the `_template`.\n\n Returns:\n A list with a single chat containing the user's message with the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n task=input[\"task\"],\n language=self.language,\n ).strip(),\n }\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateTextClassificationData","title":"GenerateTextClassificationData
","text":" Bases: _EmbeddingDataGeneration
Generate text classification data with an LLM
to later on train an embedding model.
GenerateTextClassificationData
is a Task
that generates text classification data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category=\"text-classification\"
; so that the LLM
generates a list of tasks that are flattened so that each row contains a single task for the text-classification category.
Attributes:
Name Type Descriptionlanguage
str
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
difficulty
Optional[Literal['high school', 'college', 'PhD']]
The difficulty of the query to be generated, which can be high school
, college
, or PhD
. Defaults to None
, meaning that it will be randomly sampled.
clarity
Optional[Literal['clear', 'understandable with some effort', 'ambiguous']]
The clarity of the query to be generated, which can be clear
, understandable with some effort
, or ambiguous
. Defaults to None
, meaning that it will be randomly sampled.
seed
int
The random seed to be set in case there's any sampling within the format_input
method.
str
): The task description to be used in the generation.str
): the input text generated by the LLM
.str
): the label generated by the LLM
.str
): the misleading label generated by the LLM
.str
): the name of the model used to generate the text classification data.Examples:
Generate synthetic text classification data for training embedding models:
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextClassificationData\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-classification\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateTextClassificationData(\n language=\"English\",\n difficulty=\"high school\",\n clarity=\"clear\",\n llm=..., # LLM instance\n )\n\n task >> generate\n
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class GenerateTextClassificationData(_EmbeddingDataGeneration):\n \"\"\"Generate text classification data with an `LLM` to later on train an embedding model.\n\n `GenerateTextClassificationData` is a `Task` that generates text classification data with an\n `LLM` to later on train an embedding model. The task is based on the paper \"Improving\n Text Embeddings with Large Language Models\" and the data is generated based on the\n provided attributes, or randomly sampled if not provided.\n\n Note:\n Ideally this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True`\n with the `category=\"text-classification\"`; so that the `LLM` generates a list of tasks that\n are flattened so that each row contains a single task for the text-classification category.\n\n Attributes:\n language: The language of the data to be generated, which can be any of the languages\n retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.\n difficulty: The difficulty of the query to be generated, which can be `high school`, `college`, or `PhD`.\n Defaults to `None`, meaning that it will be randomly sampled.\n clarity: The clarity of the query to be generated, which can be `clear`, `understandable with some effort`,\n or `ambiguous`. Defaults to `None`, meaning that it will be randomly sampled.\n seed: The random seed to be set in case there's any sampling within the `format_input` method.\n\n Input columns:\n - task (`str`): The task description to be used in the generation.\n\n Output columns:\n - input_text (`str`): the input text generated by the `LLM`.\n - label (`str`): the label generated by the `LLM`.\n - misleading_label (`str`): the misleading label generated by the `LLM`.\n - model_name (`str`): the name of the model used to generate the text classification\n data.\n\n References:\n - [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)\n\n Examples:\n Generate synthetic text classification data for training embedding models:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextClassificationData\n\n with Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-classification\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateTextClassificationData(\n language=\"English\",\n difficulty=\"high school\",\n clarity=\"clear\",\n llm=..., # LLM instance\n )\n\n task >> generate\n ```\n \"\"\"\n\n language: str = Field(\n default=\"English\",\n description=\"The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf\",\n )\n\n difficulty: Optional[Literal[\"high school\", \"college\", \"PhD\"]] = None\n clarity: Optional[\n Literal[\"clear\", \"understandable with some effort\", \"ambiguous\"]\n ] = None\n\n _template_name: str = PrivateAttr(default=\"text-classification\")\n _can_be_used_with_offline_batch_generation = True\n\n def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"Method to format the input based on the `task` and the provided attributes, or just\n randomly sampling those if not provided. This method will render the `_template` with\n the provided arguments and return an OpenAI formatted chat i.e. 
a `ChatType`, assuming that\n there's only one turn, being from the user with the content being the rendered `_template`.\n\n Args:\n input: The input dictionary containing the `task` to be used in the `_template`.\n\n Returns:\n A list with a single chat containing the user's message with the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n task=input[\"task\"],\n language=self.language,\n difficulty=self.difficulty\n or random.choice([\"high school\", \"college\", \"PhD\"]),\n clarity=self.clarity\n or random.choice(\n [\"clear\", \"understandable with some effort\", \"ambiguous\"]\n ),\n ).strip(),\n }\n ]\n\n @property\n def keys(self) -> List[str]:\n \"\"\"Contains the `keys` that will be parsed from the `LLM` output into a Python dict.\"\"\"\n return [\"input_text\", \"label\", \"misleading_label\"]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateTextClassificationData.keys","title":"keys
property
","text":"Contains the keys
that will be parsed from the LLM
output into a Python dict.
format_input(input)
","text":"Method to format the input based on the task
and the provided attributes, or just randomly sampling those if not provided. This method will render the _template
with the provided arguments and return an OpenAI formatted chat i.e. a ChatType
, assuming that there's only one turn, being from the user with the content being the rendered _template
.
Parameters:
Name Type Description Defaultinput
Dict[str, Any]
The input dictionary containing the task
to be used in the _template
.
Returns:
Type DescriptionChatType
A list with a single chat containing the user's message with the rendered _template
.
src/distilabel/steps/tasks/improving_text_embeddings.py
def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"Method to format the input based on the `task` and the provided attributes, or just\n randomly sampling those if not provided. This method will render the `_template` with\n the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that\n there's only one turn, being from the user with the content being the rendered `_template`.\n\n Args:\n input: The input dictionary containing the `task` to be used in the `_template`.\n\n Returns:\n A list with a single chat containing the user's message with the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n task=input[\"task\"],\n language=self.language,\n difficulty=self.difficulty\n or random.choice([\"high school\", \"college\", \"PhD\"]),\n clarity=self.clarity\n or random.choice(\n [\"clear\", \"understandable with some effort\", \"ambiguous\"]\n ),\n ).strip(),\n }\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateTextRetrievalData","title":"GenerateTextRetrievalData
","text":" Bases: _EmbeddingDataGeneration
Generate text retrieval data with an LLM
to later on train an embedding model.
GenerateTextRetrievalData
is a Task
that generates text retrieval data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category=\"text-retrieval\"
; so that the LLM
generates a list of tasks that are flattened so that each row contains a single task for the text-retrieval category.
Attributes:
Name Type Descriptionlanguage
str
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
query_type
Optional[Literal['extremely long-tail', 'long-tail', 'common']]
The type of query to be generated, which can be extremely long-tail
, long-tail
, or common
. Defaults to None
, meaning that it will be randomly sampled.
query_length
Optional[Literal['less than 5 words', '5 to 15 words', 'at least 10 words']]
The length of the query to be generated, which can be less than 5 words
, 5 to 15 words
, or at least 10 words
. Defaults to None
, meaning that it will be randomly sampled.
difficulty
Optional[Literal['high school', 'college', 'PhD']]
The difficulty of the query to be generated, which can be high school
, college
, or PhD
. Defaults to None
, meaning that it will be randomly sampled.
clarity
Optional[Literal['clear', 'understandable with some effort', 'ambiguous']]
The clarity of the query to be generated, which can be clear
, understandable with some effort
, or ambiguous
. Defaults to None
, meaning that it will be randomly sampled.
num_words
Optional[Literal[50, 100, 200, 300, 400, 500]]
The number of words in the query to be generated, which can be 50
, 100
, 200
, 300
, 400
, or 500
. Defaults to None
, meaning that it will be randomly sampled.
seed
int
The random seed to be set in case there's any sampling within the format_input
method.
str
): The task description to be used in the generation.str
): the user query generated by the LLM
.str
): the positive document generated by the LLM
.str
): the hard negative document generated by the LLM
.str
): the name of the model used to generate the text retrieval data.Examples:
Generate synthetic text retrieval data for training embedding models:
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextRetrievalData\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-retrieval\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateTextRetrievalData(\n language=\"English\",\n query_type=\"common\",\n query_length=\"5 to 15 words\",\n difficulty=\"high school\",\n clarity=\"clear\",\n num_words=100,\n llm=..., # LLM instance\n )\n\n task >> generate\n
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class GenerateTextRetrievalData(_EmbeddingDataGeneration):\n \"\"\"Generate text retrieval data with an `LLM` to later on train an embedding model.\n\n `GenerateTextRetrievalData` is a `Task` that generates text retrieval data with an\n `LLM` to later on train an embedding model. The task is based on the paper \"Improving\n Text Embeddings with Large Language Models\" and the data is generated based on the\n provided attributes, or randomly sampled if not provided.\n\n Note:\n Ideally this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True`\n with the `category=\"text-retrieval\"`; so that the `LLM` generates a list of tasks that\n are flattened so that each row contains a single task for the text-retrieval category.\n\n Attributes:\n language: The language of the data to be generated, which can be any of the languages\n retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.\n query_type: The type of query to be generated, which can be `extremely long-tail`, `long-tail`,\n or `common`. Defaults to `None`, meaning that it will be randomly sampled.\n query_length: The length of the query to be generated, which can be `less than 5 words`, `5 to 15 words`,\n or `at least 10 words`. Defaults to `None`, meaning that it will be randomly sampled.\n difficulty: The difficulty of the query to be generated, which can be `high school`, `college`, or `PhD`.\n Defaults to `None`, meaning that it will be randomly sampled.\n clarity: The clarity of the query to be generated, which can be `clear`, `understandable with some effort`,\n or `ambiguous`. Defaults to `None`, meaning that it will be randomly sampled.\n num_words: The number of words in the query to be generated, which can be `50`, `100`, `200`, `300`, `400`, or `500`.\n Defaults to `None`, meaning that it will be randomly sampled.\n seed: The random seed to be set in case there's any sampling within the `format_input` method.\n\n Input columns:\n - task (`str`): The task description to be used in the generation.\n\n Output columns:\n - user_query (`str`): the user query generated by the `LLM`.\n - positive_document (`str`): the positive document generated by the `LLM`.\n - hard_negative_document (`str`): the hard negative document generated by the `LLM`.\n - model_name (`str`): the name of the model used to generate the text retrieval data.\n\n References:\n - [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)\n\n Examples:\n Generate synthetic text retrieval data for training embedding models:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextRetrievalData\n\n with Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-retrieval\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateTextRetrievalData(\n language=\"English\",\n query_type=\"common\",\n query_length=\"5 to 15 words\",\n difficulty=\"high school\",\n clarity=\"clear\",\n num_words=100,\n llm=..., # LLM instance\n )\n\n task >> generate\n ```\n \"\"\"\n\n language: str = Field(\n default=\"English\",\n description=\"The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf\",\n )\n\n query_type: Optional[Literal[\"extremely long-tail\", \"long-tail\", \"common\"]] = None\n query_length: Optional[\n Literal[\"less than 5 words\", \"5 to 15 words\", \"at least 10 words\"]\n ] = 
None\n difficulty: Optional[Literal[\"high school\", \"college\", \"PhD\"]] = None\n clarity: Optional[\n Literal[\"clear\", \"understandable with some effort\", \"ambiguous\"]\n ] = None\n num_words: Optional[Literal[50, 100, 200, 300, 400, 500]] = None\n\n _template_name: str = PrivateAttr(default=\"text-retrieval\")\n _can_be_used_with_offline_batch_generation = True\n\n def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"Method to format the input based on the `task` and the provided attributes, or just\n randomly sampling those if not provided. This method will render the `_template` with\n the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that\n there's only one turn, being from the user with the content being the rendered `_template`.\n\n Args:\n input: The input dictionary containing the `task` to be used in the `_template`.\n\n Returns:\n A list with a single chat containing the user's message with the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n task=input[\"task\"],\n language=self.language,\n query_type=self.query_type\n or random.choice([\"extremely long-tail\", \"long-tail\", \"common\"]),\n query_length=self.query_length\n or random.choice(\n [\"less than 5 words\", \"5 to 15 words\", \"at least 10 words\"]\n ),\n difficulty=self.difficulty\n or random.choice([\"high school\", \"college\", \"PhD\"]),\n clarity=self.clarity\n or random.choice(\n [\"clear\", \"understandable with some effort\", \"ambiguous\"]\n ),\n num_words=self.num_words\n or random.choice([50, 100, 200, 300, 400, 500]),\n ).strip(),\n }\n ]\n\n @property\n def keys(self) -> List[str]:\n \"\"\"Contains the `keys` that will be parsed from the `LLM` output into a Python dict.\"\"\"\n return [\n \"user_query\",\n \"positive_document\",\n \"hard_negative_document\",\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateTextRetrievalData.keys","title":"keys
property
","text":"Contains the keys
that will be parsed from the LLM
output into a Python dict.
format_input(input)
","text":"Method to format the input based on the task
and the provided attributes, or just randomly sampling those if not provided. This method will render the _template
with the provided arguments and return an OpenAI formatted chat i.e. a ChatType
, assuming that there's only one turn, being from the user with the content being the rendered _template
.
Parameters:
Name Type Description Defaultinput
Dict[str, Any]
The input dictionary containing the task
to be used in the _template
.
Returns:
Type DescriptionChatType
A list with a single chat containing the user's message with the rendered _template
.
src/distilabel/steps/tasks/improving_text_embeddings.py
def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"Method to format the input based on the `task` and the provided attributes, or just\n randomly sampling those if not provided. This method will render the `_template` with\n the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that\n there's only one turn, being from the user with the content being the rendered `_template`.\n\n Args:\n input: The input dictionary containing the `task` to be used in the `_template`.\n\n Returns:\n A list with a single chat containing the user's message with the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n task=input[\"task\"],\n language=self.language,\n query_type=self.query_type\n or random.choice([\"extremely long-tail\", \"long-tail\", \"common\"]),\n query_length=self.query_length\n or random.choice(\n [\"less than 5 words\", \"5 to 15 words\", \"at least 10 words\"]\n ),\n difficulty=self.difficulty\n or random.choice([\"high school\", \"college\", \"PhD\"]),\n clarity=self.clarity\n or random.choice(\n [\"clear\", \"understandable with some effort\", \"ambiguous\"]\n ),\n num_words=self.num_words\n or random.choice([50, 100, 200, 300, 400, 500]),\n ).strip(),\n }\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MonolingualTripletGenerator","title":"MonolingualTripletGenerator
","text":" Bases: _EmbeddingDataGenerator
Generate monolingual triplets with an LLM
to later on train an embedding model.
MonolingualTripletGenerator
is a GeneratorTask
that generates monolingual triplets with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Attributes:
Name Type Descriptionlanguage
str
The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
unit
Optional[Literal['sentence', 'phrase', 'passage']]
The unit of the data to be generated, which can be sentence
, phrase
, or passage
. Defaults to None
, meaning that it will be randomly sampled.
difficulty
Optional[Literal['elementary school', 'high school', 'college']]
The difficulty of the query to be generated, which can be elementary school
, high school
, or college
. Defaults to None
, meaning that it will be randomly sampled.
high_score
Optional[Literal['4', '4.5', '5']]
The high score of the query to be generated, which can be 4
, 4.5
, or 5
. Defaults to None
, meaning that it will be randomly sampled.
low_score
Optional[Literal['2.5', '3', '3.5']]
The low score of the query to be generated, which can be 2.5
, 3
, or 3.5
. Defaults to None
, meaning that it will be randomly sampled.
seed
int
The random seed to be set in case there's any sampling within the format_input
method.
str
): the first sentence generated by the LLM
.str
): the second sentence generated by the LLM
.str
): the third sentence generated by the LLM
.str
): the name of the model used to generate the monolingual triplets.Examples:
Generate monolingual triplets for training embedding models:
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import MonolingualTripletGenerator\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = MonolingualTripletGenerator(\n language=\"English\",\n unit=\"sentence\",\n difficulty=\"elementary school\",\n high_score=\"4\",\n low_score=\"2.5\",\n llm=...,\n )\n\n ...\n\n task >> ...\n
Source code in src/distilabel/steps/tasks/improving_text_embeddings.py
class MonolingualTripletGenerator(_EmbeddingDataGenerator):\n \"\"\"Generate monolingual triplets with an `LLM` to later on train an embedding model.\n\n `MonolingualTripletGenerator` is a `GeneratorTask` that generates monolingual triplets with an\n `LLM` to later on train an embedding model. The task is based on the paper \"Improving\n Text Embeddings with Large Language Models\" and the data is generated based on the\n provided attributes, or randomly sampled if not provided.\n\n Attributes:\n language: The language of the data to be generated, which can be any of the languages\n retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.\n unit: The unit of the data to be generated, which can be `sentence`, `phrase`, or `passage`.\n Defaults to `None`, meaning that it will be randomly sampled.\n difficulty: The difficulty of the query to be generated, which can be `elementary school`, `high school`, or `college`.\n Defaults to `None`, meaning that it will be randomly sampled.\n high_score: The high score of the query to be generated, which can be `4`, `4.5`, or `5`.\n Defaults to `None`, meaning that it will be randomly sampled.\n low_score: The low score of the query to be generated, which can be `2.5`, `3`, or `3.5`.\n Defaults to `None`, meaning that it will be randomly sampled.\n seed: The random seed to be set in case there's any sampling within the `format_input` method.\n\n Output columns:\n - S1 (`str`): the first sentence generated by the `LLM`.\n - S2 (`str`): the second sentence generated by the `LLM`.\n - S3 (`str`): the third sentence generated by the `LLM`.\n - model_name (`str`): the name of the model used to generate the monolingual triplets.\n\n Examples:\n Generate monolingual triplets for training embedding models:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps.tasks import MonolingualTripletGenerator\n\n with Pipeline(\"my-pipeline\") as pipeline:\n task = MonolingualTripletGenerator(\n language=\"English\",\n unit=\"sentence\",\n difficulty=\"elementary school\",\n high_score=\"4\",\n low_score=\"2.5\",\n llm=...,\n )\n\n ...\n\n task >> ...\n ```\n \"\"\"\n\n language: str = Field(\n default=\"English\",\n description=\"The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf\",\n )\n\n unit: Optional[Literal[\"sentence\", \"phrase\", \"passage\"]] = None\n difficulty: Optional[Literal[\"elementary school\", \"high school\", \"college\"]] = None\n high_score: Optional[Literal[\"4\", \"4.5\", \"5\"]] = None\n low_score: Optional[Literal[\"2.5\", \"3\", \"3.5\"]] = None\n\n _template_name: str = PrivateAttr(default=\"monolingual-triplet\")\n _can_be_used_with_offline_batch_generation = True\n\n @property\n def prompt(self) -> ChatType:\n \"\"\"Contains the `prompt` to be used in the `process` method, rendering the `_template`; and\n formatted as an OpenAI formatted chat i.e. 
a `ChatType`, assuming that there's only one turn,\n being from the user with the content being the rendered `_template`.\n \"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n language=self.language,\n unit=self.unit or random.choice([\"sentence\", \"phrase\", \"passage\"]),\n difficulty=self.difficulty\n or random.choice([\"elementary school\", \"high school\", \"college\"]),\n high_score=self.high_score or random.choice([\"4\", \"4.5\", \"5\"]),\n low_score=self.low_score or random.choice([\"2.5\", \"3\", \"3.5\"]),\n ).strip(),\n }\n ] # type: ignore\n\n @property\n def keys(self) -> List[str]:\n \"\"\"Contains the `keys` that will be parsed from the `LLM` output into a Python dict.\"\"\"\n return [\"S1\", \"S2\", \"S3\"]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MonolingualTripletGenerator.prompt","title":"prompt
property
","text":"Contains the prompt
to be used in the process
method, rendering the _template
; and formatted as an OpenAI formatted chat i.e. a ChatType
, assuming that there's only one turn, being from the user with the content being the rendered _template
.
keys
property
","text":"Contains the keys
that will be parsed from the LLM
output into a Python dict.
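For instance (a minimal, standalone sketch with made-up values, just to show how the parsed keys map to the output columns):
# Hypothetical dict parsed from the LLM output (illustrative values only).\nparsed = {\n    \"S1\": \"The weather is lovely today.\",\n    \"S2\": \"It's a beautiful sunny day.\",\n    \"S3\": \"The stock market fell sharply this week.\",\n}\n\nrow = {key: parsed[key] for key in [\"S1\", \"S2\", \"S3\"]}\n# row provides the S1/S2/S3 output columns; model_name is added by the task itself.\n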
InstructionBacktranslation
","text":" Bases: Task
Self-Alignment with Instruction Backtranslation.
Attributes:
Name Type Description_template
Optional[Template]
the Jinja2 template to use for the Instruction Backtranslation task.
Input columnsstr
): The reference instruction to evaluate the text output.str
): The text output to evaluate for the given instruction.str
): The score for the generation based on the given instruction.str
): The reason for the provided score.str
): The model name used to score the generation.Self-Alignment with Instruction Backtranslation
Examples:
Generate a score and reason for a given instruction and generation:
from distilabel.steps.tasks import InstructionBacktranslation\n\ninstruction_backtranslation = InstructionBacktranslation(\n name=\"instruction_backtranslation\",\n llm=llm,\n input_batch_size=10,\n output_mappings={\"model_name\": \"scoring_model\"},\n )\ninstruction_backtranslation.load()\n\nresult = next(\n instruction_backtranslation.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generation\": \"4\",\n }\n ]\n )\n)\n# result\n# [\n# {\n# \"instruction\": \"How much is 2+2?\",\n# \"generation\": \"4\",\n# \"score\": 3,\n# \"reason\": \"Reason for the generation.\",\n# \"model_name\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n# }\n# ]\n
Citations @misc{li2024selfalignmentinstructionbacktranslation,\n title={Self-Alignment with Instruction Backtranslation},\n author={Xian Li and Ping Yu and Chunting Zhou and Timo Schick and Omer Levy and Luke Zettlemoyer and Jason Weston and Mike Lewis},\n year={2024},\n eprint={2308.06259},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2308.06259},\n}\n
Source code in src/distilabel/steps/tasks/instruction_backtranslation.py
class InstructionBacktranslation(Task):\n \"\"\"Self-Alignment with Instruction Backtranslation.\n\n Attributes:\n _template: the Jinja2 template to use for the Instruction Backtranslation task.\n\n Input columns:\n - instruction (`str`): The reference instruction to evaluate the text output.\n - generation (`str`): The text output to evaluate for the given instruction.\n\n Output columns:\n - score (`str`): The score for the generation based on the given instruction.\n - reason (`str`): The reason for the provided score.\n - model_name (`str`): The model name used to score the generation.\n\n Categories:\n - critique\n\n References:\n - [`Self-Alignment with Instruction Backtranslation`](https://arxiv.org/abs/2308.06259)\n\n Examples:\n Generate a score and reason for a given instruction and generation:\n\n ```python\n from distilabel.steps.tasks import InstructionBacktranslation\n\n instruction_backtranslation = InstructionBacktranslation(\n name=\"instruction_backtranslation\",\n llm=llm,\n input_batch_size=10,\n output_mappings={\"model_name\": \"scoring_model\"},\n )\n instruction_backtranslation.load()\n\n result = next(\n instruction_backtranslation.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generation\": \"4\",\n }\n ]\n )\n )\n # result\n # [\n # {\n # \"instruction\": \"How much is 2+2?\",\n # \"generation\": \"4\",\n # \"score\": 3,\n # \"reason\": \"Reason for the generation.\",\n # \"model_name\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n # }\n # ]\n ```\n\n Citations:\n ```\n @misc{li2024selfalignmentinstructionbacktranslation,\n title={Self-Alignment with Instruction Backtranslation},\n author={Xian Li and Ping Yu and Chunting Zhou and Timo Schick and Omer Levy and Luke Zettlemoyer and Jason Weston and Mike Lewis},\n year={2024},\n eprint={2308.06259},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2308.06259},\n }\n ```\n \"\"\"\n\n _template: Optional[\"Template\"] = PrivateAttr(default=...)\n _can_be_used_with_offline_batch_generation = True\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"instruction-backtranslation.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task is the `instruction`, and the `generation` for it.\"\"\"\n return [\"instruction\", \"generation\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n instruction=input[\"instruction\"], generation=input[\"generation\"]\n ),\n },\n ]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task is the `score`, `reason` and the `model_name`.\"\"\"\n return [\"score\", \"reason\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `score` and `reason`. 
The\n `model_name` will be automatically included within the `process` method of `Task`.\n\n Args:\n output: a string representing the output of the LLM via the `process` method.\n input: the input to the task, as required by some tasks to format the output.\n\n Returns:\n A dictionary containing the `score` and the `reason` for the provided `score`.\n \"\"\"\n pattern = r\"(.+?)Score: (\\d)\"\n\n matches = None\n if output is not None:\n matches = re.findall(pattern, output, re.DOTALL)\n if matches is None:\n return {\"score\": None, \"reason\": None}\n\n return {\n \"score\": int(matches[0][1]),\n \"reason\": matches[0][0].strip(),\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.InstructionBacktranslation.inputs","title":"inputs
property
","text":"The input for the task is the instruction
, and the generation
for it.
outputs
property
","text":"The output for the task is the score
, reason
and the model_name
.
load()
","text":"Loads the Jinja2 template.
Source code insrc/distilabel/steps/tasks/instruction_backtranslation.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"instruction-backtranslation.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.InstructionBacktranslation.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/instruction_backtranslation.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n instruction=input[\"instruction\"], generation=input[\"generation\"]\n ),\n },\n ]\n
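A minimal sketch of what this produces, assuming the instruction_backtranslation task from the example above has already been loaded (the actual content is rendered from the bundled Jinja2 template, so the string below is only indicative):
formatted = instruction_backtranslation.format_input(\n    {\"instruction\": \"How much is 2+2?\", \"generation\": \"4\"}\n)\n# [{\"role\": \"user\", \"content\": \"...rendered instruction-backtranslation template...\"}]\n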
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.InstructionBacktranslation.format_output","title":"format_output(output, input)
","text":"The output is formatted as a dictionary with the score
and reason
. The model_name
will be automatically included within the process
method of Task
.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
a string representing the output of the LLM via the process
method.
input
Dict[str, Any]
the input to the task, as required by some tasks to format the output.
requiredReturns:
Type DescriptionDict[str, Any]
A dictionary containing the score
and the reason
for the provided score
.
src/distilabel/steps/tasks/instruction_backtranslation.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `score` and `reason`. The\n `model_name` will be automatically included within the `process` method of `Task`.\n\n Args:\n output: a string representing the output of the LLM via the `process` method.\n input: the input to the task, as required by some tasks to format the output.\n\n Returns:\n A dictionary containing the `score` and the `reason` for the provided `score`.\n \"\"\"\n pattern = r\"(.+?)Score: (\\d)\"\n\n matches = None\n if output is not None:\n matches = re.findall(pattern, output, re.DOTALL)\n if matches is None:\n return {\"score\": None, \"reason\": None}\n\n return {\n \"score\": int(matches[0][1]),\n \"reason\": matches[0][0].strip(),\n }\n
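As a quick, standalone sanity check of the parsing logic above (same regular expression, with a made-up LLM reply):
import re\n\n# Made-up LLM reply, just to exercise the pattern used by format_output.\noutput = \"The generation answers the instruction correctly and concisely. Score: 4\"\n\npattern = r\"(.+?)Score: (\\d)\"\nmatches = re.findall(pattern, output, re.DOTALL)\nprint({\"score\": int(matches[0][1]), \"reason\": matches[0][0].strip()})\n# {'score': 4, 'reason': 'The generation answers the instruction correctly and concisely.'}\n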
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Magpie","title":"Magpie
","text":" Bases: Task
, MagpieBase
Generates conversations using an instruct fine-tuned LLM.
Magpie is a neat method that allows generating user instructions with no seed data or specific system prompt, thanks to the autoregressive capabilities of instruct fine-tuned LLMs. Since they were fine-tuned using a chat template composed of a user message and a desired assistant output, the instruct fine-tuned LLM learns that an instruction comes right after the pre-query or pre-instruct tokens. If these pre-query tokens are sent to the LLM without any user message, the LLM will continue generating tokens as if it were the user. This trick allows \"extracting\" instructions from the instruct fine-tuned LLM. Once this instruction is generated, it can be sent back to the LLM to generate an assistant response. This process can be repeated N times, making it possible to build a multi-turn conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing'.
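Conceptually, the trick can be sketched as follows (a toy, runnable illustration with made-up chat markers; the real pre-query tokens depend on the model's chat template and are handled by the magpie_pre_query_template setting):
# Toy illustration of the Magpie idea (made-up chat markers, not real model tokens).\ndef fake_instruct_llm(prompt: str) -> str:\n    \"\"\"Stand-in for an instruct fine-tuned LLM continuing the given prompt.\"\"\"\n    if prompt.endswith(\"<user>\"):\n        return \"How do I compute the derivative of x**2?\"\n    return \"The derivative of x**2 is 2*x.\"\n\npre_query = \"<user>\"  # tokens that normally precede a user turn in the chat template\ninstruction = fake_instruct_llm(pre_query)  # the model continues as if it were the user\nresponse = fake_instruct_llm(pre_query + instruction + \"<assistant>\")  # normal assistant turn\n\nconversation = [\n    {\"role\": \"user\", \"content\": instruction},\n    {\"role\": \"assistant\", \"content\": response},\n]\n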
Attributes:
Name Type Descriptionn_turns
RuntimeParameter[PositiveInt]
the number of turns that the generated conversation will have. Defaults to 1
.
end_with_user
RuntimeParameter[bool]
whether the conversation should end with a user message. Defaults to False
.
include_system_prompt
RuntimeParameter[bool]
whether to include the system prompt used in the generated conversation. Defaults to False
.
only_instruction
RuntimeParameter[bool]
whether to generate only the instruction. If this argument is True
, then n_turns
will be ignored. Defaults to False
.
system_prompt
Optional[RuntimeParameter[Union[List[str], Dict[str, str], Dict[str, Tuple[str, float]], str]]]
an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be chosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic. Defaults to None
.
n_turns
: the number of turns that the generated conversation will have. Defaults to 1
.end_with_user
: whether the conversation should end with a user message. Defaults to False
.include_system_prompt
: whether to include the system prompt used in the generated conversation. Defaults to False
.only_instruction
: whether to generate only the instruction. If this argument is True
, then n_turns
will be ignored. Defaults to False
.system_prompt
: an optional system prompt or list of system prompts that can be used to steer the LLM to generate content of a certain topic, guide the style, etc. If it's a list of system prompts, then a random system prompt will be chosen per input/output batch. If the provided inputs contain a system_prompt
column, then this runtime parameter will be ignored and the one from the column will be used. Defaults to None
.system_prompt
: an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be chosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic.str
, optional): an optional system prompt that can be provided to guide the generation of the instruct LLM and steer it to generate instructions of certain topic.ChatType
): the generated conversation which is a list of chat items with a role and a message. Only if only_instruction=False
.str
): the generated instructions if only_instruction=True
or n_turns==1
.str
): the generated response if n_turns==1
.str
, optional): the key of the system prompt used to generate the conversation or instruction. Only if system_prompt
is a dictionary.str
): The model name used to generate the conversation
or instruction
.Examples:
Generating instructions with Llama 3 8B Instruct and TransformersLLM:
from distilabel.models import TransformersLLM\nfrom distilabel.steps.tasks import Magpie\n\nmagpie = Magpie(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 64,\n },\n device=\"mps\",\n ),\n only_instruction=True,\n)\n\nmagpie.load()\n\nresult = next(\n magpie.process(\n inputs=[\n {\n \"system_prompt\": \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"\n },\n {\n \"system_prompt\": \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"\n },\n ]\n )\n)\n# [\n# {'instruction': \"That's me! I'd love some help with solving calculus problems! What kind of calculation are you most effective at? Linear Algebra, derivatives, integrals, optimization?\"},\n# {'instruction': 'I was wondering if there are certain flowers and plants that can be used for pest control?'}\n# ]\n
Generating conversations with Llama 3 8B Instruct and TransformersLLM:
from distilabel.models import TransformersLLM\nfrom distilabel.steps.tasks import Magpie\n\nmagpie = Magpie(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 256,\n },\n device=\"mps\",\n ),\n n_turns=2,\n)\n\nmagpie.load()\n\nresult = next(\n magpie.process(\n inputs=[\n {\n \"system_prompt\": \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"\n },\n {\n \"system_prompt\": \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"\n },\n ]\n )\n)\n# [\n# {\n# 'conversation': [\n# {'role': 'system', 'content': \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"},\n# {\n# 'role': 'user',\n# 'content': 'I'm having trouble solving the limits of functions in calculus. Could you explain how to work with them? Limits of functions are denoted by lim x\u2192a f(x) or lim x\u2192a [f(x)]. It is read as \"the limit as x approaches a of f\n# of x\".'\n# },\n# {\n# 'role': 'assistant',\n# 'content': 'Limits are indeed a fundamental concept in calculus, and understanding them can be a bit tricky at first, but don't worry, I'm here to help! The notation lim x\u2192a f(x) indeed means \"the limit as x approaches a of f of\n# x\". What it's asking us to do is find the'\n# }\n# ]\n# },\n# {\n# 'conversation': [\n# {'role': 'system', 'content': \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"},\n# {\n# 'role': 'user',\n# 'content': \"As a flower shop owner, I'm noticing some unusual worm-like creatures causing damage to my roses and other flowers. Can you help me identify what the problem is? Based on your expertise as a florist AI assistant, I think it\n# might be pests or diseases, but I'm not sure which.\"\n# },\n# {\n# 'role': 'assistant',\n# 'content': \"I'd be delighted to help you investigate the issue! Since you've noticed worm-like creatures damaging your roses and other flowers, I'll take a closer look at the possibilities. Here are a few potential culprits: 1.\n# **Aphids**: These small, soft-bodied insects can secrete a sticky substance called\"\n# }\n# ]\n# }\n# ]\n
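As with MagpieGenerator below, system_prompt also accepts a dictionary mapping keys to (prompt, probability) tuples, in which case a system_prompt_key column is added to the output (a brief sketch reusing the TransformersLLM setup from the examples above):
magpie = Magpie(\n    llm=TransformersLLM(\n        model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n        magpie_pre_query_template=\"llama3\",\n        generation_kwargs={\"temperature\": 1.0, \"max_new_tokens\": 64},\n        device=\"mps\",\n    ),\n    n_turns=2,\n    system_prompt={\n        \"math\": (\"You're a math expert AI assistant.\", 0.8),\n        \"writing\": (\"You're an expert writing assistant.\", 0.2),\n    },\n)\n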
Source code in src/distilabel/steps/tasks/magpie/base.py
class Magpie(Task, MagpieBase):\n \"\"\"Generates conversations using an instruct fine-tuned LLM.\n\n Magpie is a neat method that allows generating user instructions with no seed data\n or specific system prompt thanks to the autoregressive capabilities of the instruct\n fine-tuned LLMs. As they were fine-tuned using a chat template composed by a user message\n and a desired assistant output, the instruct fine-tuned LLM learns that after the pre-query\n or pre-instruct tokens comes an instruction. If these pre-query tokens are sent to the\n LLM without any user message, then the LLM will continue generating tokens as if it was\n the user. This trick allows \"extracting\" instructions from the instruct fine-tuned LLM.\n After this instruct is generated, it can be sent again to the LLM to generate this time\n an assistant response. This process can be repeated N times allowing to build a multi-turn\n conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from\n Scratch by Prompting Aligned LLMs with Nothing'.\n\n Attributes:\n n_turns: the number of turns that the generated conversation will have.\n Defaults to `1`.\n end_with_user: whether the conversation should end with a user message.\n Defaults to `False`.\n include_system_prompt: whether to include the system prompt used in the generated\n conversation. Defaults to `False`.\n only_instruction: whether to generate only the instruction. If this argument is\n `True`, then `n_turns` will be ignored. Defaults to `False`.\n system_prompt: an optional system prompt, or a list of system prompts from which\n a random one will be chosen, or a dictionary of system prompts from which a\n random one will be choosen, or a dictionary of system prompts with their probability\n of being chosen. The random system prompt will be chosen per input/output batch.\n This system prompt can be used to guide the generation of the instruct LLM and\n steer it to generate instructions of a certain topic. Defaults to `None`.\n\n Runtime parameters:\n - `n_turns`: the number of turns that the generated conversation will have. Defaults\n to `1`.\n - `end_with_user`: whether the conversation should end with a user message.\n Defaults to `False`.\n - `include_system_prompt`: whether to include the system prompt used in the generated\n conversation. Defaults to `False`.\n - `only_instruction`: whether to generate only the instruction. If this argument is\n `True`, then `n_turns` will be ignored. Defaults to `False`.\n - `system_prompt`: an optional system prompt or list of system prompts that can\n be used to steer the LLM to generate content of certain topic, guide the style,\n etc. If it's a list of system prompts, then a random system prompt will be chosen\n per input/output batch. If the provided inputs contains a `system_prompt` column,\n then this runtime parameter will be ignored and the one from the column will\n be used. Defaults to `None`.\n - `system_prompt`: an optional system prompt, or a list of system prompts from which\n a random one will be chosen, or a dictionary of system prompts from which a\n random one will be choosen, or a dictionary of system prompts with their probability\n of being chosen. 
The random system prompt will be chosen per input/output batch.\n This system prompt can be used to guide the generation of the instruct LLM and\n steer it to generate instructions of a certain topic.\n\n Input columns:\n - system_prompt (`str`, optional): an optional system prompt that can be provided\n to guide the generation of the instruct LLM and steer it to generate instructions\n of certain topic.\n\n Output columns:\n - conversation (`ChatType`): the generated conversation which is a list of chat\n items with a role and a message. Only if `only_instruction=False`.\n - instruction (`str`): the generated instructions if `only_instruction=True` or `n_turns==1`.\n - response (`str`): the generated response if `n_turns==1`.\n - system_prompt_key (`str`, optional): the key of the system prompt used to generate\n the conversation or instruction. Only if `system_prompt` is a dictionary.\n - model_name (`str`): The model name used to generate the `conversation` or `instruction`.\n\n Categories:\n - text-generation\n - instruction\n\n References:\n - [Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing](https://arxiv.org/abs/2406.08464)\n\n Examples:\n Generating instructions with Llama 3 8B Instruct and TransformersLLM:\n\n ```python\n from distilabel.models import TransformersLLM\n from distilabel.steps.tasks import Magpie\n\n magpie = Magpie(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 64,\n },\n device=\"mps\",\n ),\n only_instruction=True,\n )\n\n magpie.load()\n\n result = next(\n magpie.process(\n inputs=[\n {\n \"system_prompt\": \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"\n },\n {\n \"system_prompt\": \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"\n },\n ]\n )\n )\n # [\n # {'instruction': \"That's me! I'd love some help with solving calculus problems! What kind of calculation are you most effective at? Linear Algebra, derivatives, integrals, optimization?\"},\n # {'instruction': 'I was wondering if there are certain flowers and plants that can be used for pest control?'}\n # ]\n ```\n\n Generating conversations with Llama 3 8B Instruct and TransformersLLM:\n\n ```python\n from distilabel.models import TransformersLLM\n from distilabel.steps.tasks import Magpie\n\n magpie = Magpie(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 256,\n },\n device=\"mps\",\n ),\n n_turns=2,\n )\n\n magpie.load()\n\n result = next(\n magpie.process(\n inputs=[\n {\n \"system_prompt\": \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"\n },\n {\n \"system_prompt\": \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"\n },\n ]\n )\n )\n # [\n # {\n # 'conversation': [\n # {'role': 'system', 'content': \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"},\n # {\n # 'role': 'user',\n # 'content': 'I\\'m having trouble solving the limits of functions in calculus. Could you explain how to work with them? Limits of functions are denoted by lim x\u2192a f(x) or lim x\u2192a [f(x)]. 
It is read as \"the limit as x approaches a of f\n # of x\".'\n # },\n # {\n # 'role': 'assistant',\n # 'content': 'Limits are indeed a fundamental concept in calculus, and understanding them can be a bit tricky at first, but don\\'t worry, I\\'m here to help! The notation lim x\u2192a f(x) indeed means \"the limit as x approaches a of f of\n # x\". What it\\'s asking us to do is find the'\n # }\n # ]\n # },\n # {\n # 'conversation': [\n # {'role': 'system', 'content': \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"},\n # {\n # 'role': 'user',\n # 'content': \"As a flower shop owner, I'm noticing some unusual worm-like creatures causing damage to my roses and other flowers. Can you help me identify what the problem is? Based on your expertise as a florist AI assistant, I think it\n # might be pests or diseases, but I'm not sure which.\"\n # },\n # {\n # 'role': 'assistant',\n # 'content': \"I'd be delighted to help you investigate the issue! Since you've noticed worm-like creatures damaging your roses and other flowers, I'll take a closer look at the possibilities. Here are a few potential culprits: 1.\n # **Aphids**: These small, soft-bodied insects can secrete a sticky substance called\"\n # }\n # ]\n # }\n # ]\n ```\n \"\"\"\n\n def model_post_init(self, __context: Any) -> None:\n \"\"\"Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`.\"\"\"\n super().model_post_init(__context)\n\n if not isinstance(self.llm, MagpieChatTemplateMixin):\n raise DistilabelUserError(\n f\"`Magpie` task can only be used with an `LLM` that uses the `MagpieChatTemplateMixin`.\"\n f\"`{self.llm.__class__.__name__}` doesn't use the aforementioned mixin.\",\n page=\"components-gallery/tasks/magpie/\",\n )\n\n self.llm.use_magpie_template = True\n\n @property\n def inputs(self) -> \"StepColumns\":\n return {\"system_prompt\": False}\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"Does nothing.\"\"\"\n return []\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"Either a multi-turn conversation or the instruction generated.\"\"\"\n outputs = []\n\n if self.only_instruction:\n outputs.append(\"instruction\")\n elif self.n_turns == 1:\n outputs.extend([\"instruction\", \"response\"])\n else:\n outputs.append(\"conversation\")\n\n if isinstance(self.system_prompt, dict):\n outputs.append(\"system_prompt_key\")\n\n outputs.append(\"model_name\")\n\n return outputs\n\n def format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n ) -> Dict[str, Any]:\n \"\"\"Does nothing.\"\"\"\n return {}\n\n def process(self, inputs: StepInput) -> \"StepOutput\":\n \"\"\"Generate a list of instructions or conversations of the specified number of turns.\n\n Args:\n inputs: a list of dictionaries that can contain a `system_prompt` key.\n\n Yields:\n The list of generated conversations.\n \"\"\"\n yield self._generate_with_pre_query_template(inputs)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Magpie.outputs","title":"outputs
property
","text":"Either a multi-turn conversation or the instruction generated.
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Magpie.model_post_init","title":"model_post_init(__context)
","text":"Checks that the provided LLM
uses the MagpieChatTemplateMixin
.
src/distilabel/steps/tasks/magpie/base.py
def model_post_init(self, __context: Any) -> None:\n \"\"\"Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`.\"\"\"\n super().model_post_init(__context)\n\n if not isinstance(self.llm, MagpieChatTemplateMixin):\n raise DistilabelUserError(\n f\"`Magpie` task can only be used with an `LLM` that uses the `MagpieChatTemplateMixin`.\"\n f\"`{self.llm.__class__.__name__}` doesn't use the aforementioned mixin.\",\n page=\"components-gallery/tasks/magpie/\",\n )\n\n self.llm.use_magpie_template = True\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Magpie.format_input","title":"format_input(input)
","text":"Does nothing.
Source code insrc/distilabel/steps/tasks/magpie/base.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"Does nothing.\"\"\"\n return []\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Magpie.format_output","title":"format_output(output, input=None)
","text":"Does nothing.
Source code insrc/distilabel/steps/tasks/magpie/base.py
def format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n) -> Dict[str, Any]:\n \"\"\"Does nothing.\"\"\"\n return {}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.Magpie.process","title":"process(inputs)
","text":"Generate a list of instructions or conversations of the specified number of turns.
Parameters:
Name Type Description Defaultinputs
StepInput
a list of dictionaries that can contain a system_prompt
key.
Yields:
Type DescriptionStepOutput
The list of generated conversations.
Source code insrc/distilabel/steps/tasks/magpie/base.py
def process(self, inputs: StepInput) -> \"StepOutput\":\n \"\"\"Generate a list of instructions or conversations of the specified number of turns.\n\n Args:\n inputs: a list of dictionaries that can contain a `system_prompt` key.\n\n Yields:\n The list of generated conversations.\n \"\"\"\n yield self._generate_with_pre_query_template(inputs)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MagpieGenerator","title":"MagpieGenerator
","text":" Bases: GeneratorTask
, MagpieBase
Generator task that generates instructions or conversations using Magpie.
Magpie is a neat method that allows generating user instructions with no seed data or specific system prompt, thanks to the autoregressive capabilities of instruct fine-tuned LLMs. Since they were fine-tuned using a chat template composed of a user message and a desired assistant output, the instruct fine-tuned LLM learns that an instruction comes right after the pre-query or pre-instruct tokens. If these pre-query tokens are sent to the LLM without any user message, the LLM will continue generating tokens as if it were the user. This trick allows \"extracting\" instructions from the instruct fine-tuned LLM. Once this instruction is generated, it can be sent back to the LLM to generate an assistant response. This process can be repeated N times, making it possible to build a multi-turn conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing'.
Attributes:
Name Type Descriptionn_turns
RuntimeParameter[PositiveInt]
the number of turns that the generated conversation will have. Defaults to 1
.
end_with_user
RuntimeParameter[bool]
whether the conversation should end with a user message. Defaults to False
.
include_system_prompt
RuntimeParameter[bool]
whether to include the system prompt used in the generated conversation. Defaults to False
.
only_instruction
RuntimeParameter[bool]
whether to generate only the instruction. If this argument is True
, then n_turns
will be ignored. Defaults to False
.
system_prompt
Optional[RuntimeParameter[Union[List[str], Dict[str, str], Dict[str, Tuple[str, float]], str]]]
an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be chosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic. Defaults to None
.
num_rows
RuntimeParameter[int]
the number of rows to be generated.
Runtime parametersn_turns
: the number of turns that the generated conversation will have. Defaults to 1
.end_with_user
: whether the conversation should end with a user message. Defaults to False
.include_system_prompt
: whether to include the system prompt used in the generated conversation. Defaults to False
.only_instruction
: whether to generate only the instruction. If this argument is True
, then n_turns
will be ignored. Defaults to False
.system_prompt
: an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be chosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic.num_rows
: the number of rows to be generated.ChatType
): the generated conversation which is a list of chat items with a role and a message.str
): the generated instructions if only_instruction=True
.str
): the generated response if n_turns==1
.str
, optional): the key of the system prompt used to generate the conversation or instruction. Only if system_prompt
is a dictionary.str
): The model name used to generate the conversation
or instruction
.Examples:
Generating instructions with Llama 3 8B Instruct and TransformersLLM:
from distilabel.models import TransformersLLM\nfrom distilabel.steps.tasks import MagpieGenerator\n\ngenerator = MagpieGenerator(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 256,\n },\n device=\"mps\",\n ),\n only_instruction=True,\n num_rows=5,\n)\n\ngenerator.load()\n\nresult = next(generator.process())\n# (\n# [\n# {\"instruction\": \"I've just bought a new phone and I're excited to start using it.\"},\n# {\"instruction\": \"What are the most common types of companies that use digital signage?\"}\n# ],\n# True\n# )\n
Generating a conversation with Llama 3 8B Instruct and TransformersLLM:
from distilabel.models import TransformersLLM\nfrom distilabel.steps.tasks import MagpieGenerator\n\ngenerator = MagpieGenerator(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 64,\n },\n device=\"mps\",\n ),\n n_turns=3,\n num_rows=5,\n)\n\ngenerator.load()\n\nresult = next(generator.process())\n# (\n# [\n# {\n# 'conversation': [\n# {\n# 'role': 'system',\n# 'content': 'You are a helpful Al assistant. The user will engage in a multi\u2212round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and\n# insightful responses to help the user with their queries.'\n# },\n# {'role': 'user', 'content': \"I'm considering starting a social media campaign for my small business and I're not sure where to start. Can you help?\"},\n# {\n# 'role': 'assistant',\n# 'content': \"Exciting endeavor! Creating a social media campaign can be a great way to increase brand awareness, drive website traffic, and ultimately boost sales. I'd be happy to guide you through the process. To get started,\n# let's break down the basics. First, we need to identify your goals and target audience. What do\"\n# },\n# {\n# 'role': 'user',\n# 'content': \"Before I start a social media campaign, what kind of costs ammol should I expect to pay? There are several factors that contribute to the total cost of running a social media campaign. Let me outline some of the main\n# expenses you might encounter: 1. Time: As the business owner, you'll likely spend time creating\"\n# },\n# {\n# 'role': 'assistant',\n# 'content': 'Time is indeed one of the biggest investments when it comes to running a social media campaign! Besides time, you may also incur costs associated with: 2. Content creation: You might need to hire freelancers or\n# agencies to create high-quality content (images, videos, captions) for your social media platforms. 3. Advertising'\n# }\n# ]\n# },\n# {\n# 'conversation': [\n# {\n# 'role': 'system',\n# 'content': 'You are a helpful Al assistant. The user will engage in a multi\u2212round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and\n# insightful responses to help the user with their queries.'\n# },\n# {'role': 'user', 'content': \"I am thinking of buying a new laptop or computer. What are some important factors I should consider when making your decision? I'll make sure to let you know if any other favorites or needs come up!\"},\n# {\n# 'role': 'assistant',\n# 'content': 'Exciting times ahead! When considering a new laptop or computer, there are several key factors to think about to ensure you find the right one for your needs. Here are some crucial ones to get you started: 1.\n# **Purpose**: How will you use your laptop or computer? For work, gaming, video editing,'\n# },\n# {\n# 'role': 'user',\n# 'content': 'Let me stop you there. Let's explore this \"purpose\" factor that you mentioned earlier. Can you elaborate more on what type of devices would be suitable for different purposes? For example, if I're primarily using my\n# laptop for general usage like browsing, email, and word processing, would a budget-friendly laptop be sufficient'\n# },\n# {\n# 'role': 'assistant',\n# 'content': \"Understanding your purpose can greatly impact the type of device you'll need. 
**General Usage (Browsing, Email, Word Processing)**: For casual users who mainly use their laptop for daily tasks, a budget-friendly\n# option can be sufficient. Look for laptops with: * Intel Core i3 or i5 processor* \"\n# }\n# ]\n# }\n# ],\n# True\n# )\n
Generating with system prompts with probabilities:
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import MagpieGenerator\n\nmagpie = MagpieGenerator(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 0.8,\n \"max_new_tokens\": 256,\n },\n ),\n n_turns=2,\n system_prompt={\n \"math\": (\"You're an expert AI assistant.\", 0.8),\n \"writing\": (\"You're an expert writing assistant.\", 0.2),\n },\n)\n\nmagpie.load()\n\nresult = next(magpie.process())\n
Citations @misc{xu2024magpiealignmentdatasynthesis,\n title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},\n author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},\n year={2024},\n eprint={2406.08464},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2406.08464},\n}\n
Source code in src/distilabel/steps/tasks/magpie/generator.py
class MagpieGenerator(GeneratorTask, MagpieBase):\n \"\"\"Generator task the generates instructions or conversations using Magpie.\n\n Magpie is a neat method that allows generating user instructions with no seed data\n or specific system prompt thanks to the autoregressive capabilities of the instruct\n fine-tuned LLMs. As they were fine-tuned using a chat template composed by a user message\n and a desired assistant output, the instruct fine-tuned LLM learns that after the pre-query\n or pre-instruct tokens comes an instruction. If these pre-query tokens are sent to the\n LLM without any user message, then the LLM will continue generating tokens as it was\n the user. This trick allows \"extracting\" instructions from the instruct fine-tuned LLM.\n After this instruct is generated, it can be sent again to the LLM to generate this time\n an assistant response. This process can be repeated N times allowing to build a multi-turn\n conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from\n Scratch by Prompting Aligned LLMs with Nothing'.\n\n Attributes:\n n_turns: the number of turns that the generated conversation will have.\n Defaults to `1`.\n end_with_user: whether the conversation should end with a user message.\n Defaults to `False`.\n include_system_prompt: whether to include the system prompt used in the generated\n conversation. Defaults to `False`.\n only_instruction: whether to generate only the instruction. If this argument is\n `True`, then `n_turns` will be ignored. Defaults to `False`.\n system_prompt: an optional system prompt, or a list of system prompts from which\n a random one will be chosen, or a dictionary of system prompts from which a\n random one will be choosen, or a dictionary of system prompts with their probability\n of being chosen. The random system prompt will be chosen per input/output batch.\n This system prompt can be used to guide the generation of the instruct LLM and\n steer it to generate instructions of a certain topic. Defaults to `None`.\n num_rows: the number of rows to be generated.\n\n Runtime parameters:\n - `n_turns`: the number of turns that the generated conversation will have. Defaults\n to `1`.\n - `end_with_user`: whether the conversation should end with a user message.\n Defaults to `False`.\n - `include_system_prompt`: whether to include the system prompt used in the generated\n conversation. Defaults to `False`.\n - `only_instruction`: whether to generate only the instruction. If this argument is\n `True`, then `n_turns` will be ignored. Defaults to `False`.\n - `system_prompt`: an optional system prompt, or a list of system prompts from which\n a random one will be chosen, or a dictionary of system prompts from which a\n random one will be choosen, or a dictionary of system prompts with their probability\n of being chosen. The random system prompt will be chosen per input/output batch.\n This system prompt can be used to guide the generation of the instruct LLM and\n steer it to generate instructions of a certain topic.\n - `num_rows`: the number of rows to be generated.\n\n Output columns:\n - conversation (`ChatType`): the generated conversation which is a list of chat\n items with a role and a message.\n - instruction (`str`): the generated instructions if `only_instruction=True`.\n - response (`str`): the generated response if `n_turns==1`.\n - system_prompt_key (`str`, optional): the key of the system prompt used to generate\n the conversation or instruction. 
Only if `system_prompt` is a dictionary.\n - model_name (`str`): The model name used to generate the `conversation` or `instruction`.\n\n Categories:\n - text-generation\n - instruction\n - generator\n\n References:\n - [Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing](https://arxiv.org/abs/2406.08464)\n\n Examples:\n Generating instructions with Llama 3 8B Instruct and TransformersLLM:\n\n ```python\n from distilabel.models import TransformersLLM\n from distilabel.steps.tasks import MagpieGenerator\n\n generator = MagpieGenerator(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 256,\n },\n device=\"mps\",\n ),\n only_instruction=True,\n num_rows=5,\n )\n\n generator.load()\n\n result = next(generator.process())\n # (\n # [\n # {\"instruction\": \"I've just bought a new phone and I're excited to start using it.\"},\n # {\"instruction\": \"What are the most common types of companies that use digital signage?\"}\n # ],\n # True\n # )\n ```\n\n Generating a conversation with Llama 3 8B Instruct and TransformersLLM:\n\n ```python\n from distilabel.models import TransformersLLM\n from distilabel.steps.tasks import MagpieGenerator\n\n generator = MagpieGenerator(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 64,\n },\n device=\"mps\",\n ),\n n_turns=3,\n num_rows=5,\n )\n\n generator.load()\n\n result = next(generator.process())\n # (\n # [\n # {\n # 'conversation': [\n # {\n # 'role': 'system',\n # 'content': 'You are a helpful Al assistant. The user will engage in a multi\u2212round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and\n # insightful responses to help the user with their queries.'\n # },\n # {'role': 'user', 'content': \"I'm considering starting a social media campaign for my small business and I're not sure where to start. Can you help?\"},\n # {\n # 'role': 'assistant',\n # 'content': \"Exciting endeavor! Creating a social media campaign can be a great way to increase brand awareness, drive website traffic, and ultimately boost sales. I'd be happy to guide you through the process. To get started,\n # let's break down the basics. First, we need to identify your goals and target audience. What do\"\n # },\n # {\n # 'role': 'user',\n # 'content': \"Before I start a social media campaign, what kind of costs ammol should I expect to pay? There are several factors that contribute to the total cost of running a social media campaign. Let me outline some of the main\n # expenses you might encounter: 1. Time: As the business owner, you'll likely spend time creating\"\n # },\n # {\n # 'role': 'assistant',\n # 'content': 'Time is indeed one of the biggest investments when it comes to running a social media campaign! Besides time, you may also incur costs associated with: 2. Content creation: You might need to hire freelancers or\n # agencies to create high-quality content (images, videos, captions) for your social media platforms. 3. Advertising'\n # }\n # ]\n # },\n # {\n # 'conversation': [\n # {\n # 'role': 'system',\n # 'content': 'You are a helpful Al assistant. 
The user will engage in a multi\u2212round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and\n # insightful responses to help the user with their queries.'\n # },\n # {'role': 'user', 'content': \"I am thinking of buying a new laptop or computer. What are some important factors I should consider when making your decision? I'll make sure to let you know if any other favorites or needs come up!\"},\n # {\n # 'role': 'assistant',\n # 'content': 'Exciting times ahead! When considering a new laptop or computer, there are several key factors to think about to ensure you find the right one for your needs. Here are some crucial ones to get you started: 1.\n # **Purpose**: How will you use your laptop or computer? For work, gaming, video editing,'\n # },\n # {\n # 'role': 'user',\n # 'content': 'Let me stop you there. Let\\'s explore this \"purpose\" factor that you mentioned earlier. Can you elaborate more on what type of devices would be suitable for different purposes? For example, if I\\'re primarily using my\n # laptop for general usage like browsing, email, and word processing, would a budget-friendly laptop be sufficient'\n # },\n # {\n # 'role': 'assistant',\n # 'content': \"Understanding your purpose can greatly impact the type of device you'll need. **General Usage (Browsing, Email, Word Processing)**: For casual users who mainly use their laptop for daily tasks, a budget-friendly\n # option can be sufficient. Look for laptops with: * Intel Core i3 or i5 processor* \"\n # }\n # ]\n # }\n # ],\n # True\n # )\n ```\n\n Generating with system prompts with probabilities:\n\n ```python\n from distilabel.models import InferenceEndpointsLLM\n from distilabel.steps.tasks import MagpieGenerator\n\n magpie = MagpieGenerator(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 0.8,\n \"max_new_tokens\": 256,\n },\n ),\n n_turns=2,\n system_prompt={\n \"math\": (\"You're an expert AI assistant.\", 0.8),\n \"writing\": (\"You're an expert writing assistant.\", 0.2),\n },\n )\n\n magpie.load()\n\n result = next(magpie.process())\n ```\n\n Citations:\n ```\n @misc{xu2024magpiealignmentdatasynthesis,\n title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},\n author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},\n year={2024},\n eprint={2406.08464},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2406.08464},\n }\n ```\n \"\"\"\n\n # TODO: move this to `GeneratorTask`\n num_rows: RuntimeParameter[int] = Field(\n default=None, description=\"The number of rows to generate.\"\n )\n\n def model_post_init(self, __context: Any) -> None:\n \"\"\"Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`.\"\"\"\n super().model_post_init(__context)\n\n if not isinstance(self.llm, MagpieChatTemplateMixin):\n raise DistilabelUserError(\n f\"`Magpie` task can only be used with an `LLM` that uses the `MagpieChatTemplateMixin`.\"\n f\"`{self.llm.__class__.__name__}` doesn't use the aforementioned mixin.\",\n page=\"components-gallery/tasks/magpiegenerator/\",\n )\n\n self.llm.use_magpie_template = True\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"Either a multi-turn conversation or the instruction 
generated.\"\"\"\n outputs = []\n\n if self.only_instruction:\n outputs.append(\"instruction\")\n elif self.n_turns == 1:\n outputs.extend([\"instruction\", \"response\"])\n else:\n outputs.append(\"conversation\")\n\n if isinstance(self.system_prompt, dict):\n outputs.append(\"system_prompt_key\")\n\n outputs.append(\"model_name\")\n\n return outputs\n\n def format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n ) -> Dict[str, Any]:\n \"\"\"Does nothing.\"\"\"\n return {}\n\n def process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n \"\"\"Generates the desired number of instructions or conversations using Magpie.\n\n Args:\n offset: The offset to start the generation from. Defaults to `0`.\n\n Yields:\n The generated instructions or conversations.\n \"\"\"\n generated = offset\n\n while generated <= self.num_rows: # type: ignore\n rows_to_generate = (\n self.num_rows if self.num_rows < self.batch_size else self.batch_size # type: ignore\n )\n conversations = self._generate_with_pre_query_template(\n inputs=[{} for _ in range(rows_to_generate)] # type: ignore\n )\n generated += rows_to_generate # type: ignore\n yield (conversations, generated == self.num_rows)\n\n @override\n def _sample_input(self) -> \"ChatType\":\n return self._generate_with_pre_query_template(inputs=[{}])\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MagpieGenerator.outputs","title":"outputs
property
","text":"Either a multi-turn conversation or the instruction generated.
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MagpieGenerator.model_post_init","title":"model_post_init(__context)
","text":"Checks that the provided LLM
uses the MagpieChatTemplateMixin
.
src/distilabel/steps/tasks/magpie/generator.py
def model_post_init(self, __context: Any) -> None:\n \"\"\"Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`.\"\"\"\n super().model_post_init(__context)\n\n if not isinstance(self.llm, MagpieChatTemplateMixin):\n raise DistilabelUserError(\n f\"`Magpie` task can only be used with an `LLM` that uses the `MagpieChatTemplateMixin`.\"\n f\"`{self.llm.__class__.__name__}` doesn't use the aforementioned mixin.\",\n page=\"components-gallery/tasks/magpiegenerator/\",\n )\n\n self.llm.use_magpie_template = True\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MagpieGenerator.format_output","title":"format_output(output, input=None)
","text":"Does nothing.
Source code insrc/distilabel/steps/tasks/magpie/generator.py
def format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n) -> Dict[str, Any]:\n \"\"\"Does nothing.\"\"\"\n return {}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MagpieGenerator.process","title":"process(offset=0)
","text":"Generates the desired number of instructions or conversations using Magpie.
Parameters:
Name Type Description Defaultoffset
int
The offset to start the generation from. Defaults to 0
.
0
Yields:
Type DescriptionGeneratorStepOutput
The generated instructions or conversations.
Source code insrc/distilabel/steps/tasks/magpie/generator.py
def process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n \"\"\"Generates the desired number of instructions or conversations using Magpie.\n\n Args:\n offset: The offset to start the generation from. Defaults to `0`.\n\n Yields:\n The generated instructions or conversations.\n \"\"\"\n generated = offset\n\n while generated <= self.num_rows: # type: ignore\n rows_to_generate = (\n self.num_rows if self.num_rows < self.batch_size else self.batch_size # type: ignore\n )\n conversations = self._generate_with_pre_query_template(\n inputs=[{} for _ in range(rows_to_generate)] # type: ignore\n )\n generated += rows_to_generate # type: ignore\n yield (conversations, generated == self.num_rows)\n
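Since process is a generator that yields (batch, done) tuples, a typical way to drain it outside of a Pipeline (a minimal sketch, assuming a generator configured as in the examples above) is:
rows = []\nfor batch, done in generator.process():\n    rows.extend(batch)\n    if done:\n        break\n# rows now holds the generated instructions or conversations (up to num_rows).\n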
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdCompleter","title":"MathShepherdCompleter
","text":" Bases: Task
Math Shepherd Completer and auto-labeller task.
Given a list of candidate solutions to an instruction and a golden solution as reference, this task generates completions for those solutions and labels them against the golden solution using the hard estimation method from figure 2 in the reference paper (Eq. 3). The attributes make the task flexible enough to be used with different types of datasets and LLMs, and allow using different fields to modify the system and user prompts. Before modifying them, review the current defaults to ensure the completions are generated correctly.
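In rough terms, the hard estimation rule can be sketched as follows (a simplified, standalone sketch of the idea, not the task's internal implementation): a step gets the positive tag if at least one of the N completions sampled from it reaches the golden final answer, and the negative tag otherwise.
# Simplified sketch of hard-estimation labelling (illustrative, not the library internals).\ndef label_step(completion_final_answers: list[str], golden_answer: str, tags=(\"+\", \"-\")) -> str:\n    \"\"\"Label a solution step from the final answers of the N completions sampled from it.\"\"\"\n    return tags[0] if any(ans == golden_answer for ans in completion_final_answers) else tags[1]\n\ngolden = \"18\"\nprint(label_step([\"18\", \"20\", \"16\"], golden))  # \"+\": at least one completion recovers the answer\nprint(label_step([\"20\", \"16\", \"15\"], golden))  # \"-\": no completion reaches the golden answer\n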
Attributes:
Name Type Descriptionsystem_prompt
Optional[str]
The system prompt to be used in the completions. The default one has been checked and generates good completions using Llama 3.1 with 8B and 70B, but it can be modified to adapt it to the model and dataset selected.
extra_rules
Optional[str]
This field can be used to insert extra rules relevant to the type of dataset. For example, in the original paper they used GSM8K and MATH datasets, and this field can be used to insert the rules for the GSM8K dataset.
few_shots
Optional[str]
Few-shot examples to help the model generate the completions; write them in the format of the solutions expected for your dataset.
N
PositiveInt
Number of completions to generate for each step; corresponds to N in the paper. The paper used 8, but it can be adjusted.
tags
list[str]
List of tags to be used in the completions. The default ones are [\"+\", \"-\"] as in the paper, where the first is used as a positive label and the second as a negative one. This can be updated, but it MUST be a list with 2 elements, where the first is the positive one and the second the negative one.
Input columnsstr
): The task or instruction.List[str]
): List of solutions to the task.str
): The reference solution to the task, which will be used to annotate the candidate solutions.List[str]
): The same columns that were used as input; the \"solutions\" column is modified.str
): The name of the model used to generate the revision.Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Examples:
Annotate your steps with the Math Shepherd Completer using the structured outputs (the preferred way):
from distilabel.steps.tasks import MathShepherdCompleter\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n)\ntask = MathShepherdCompleter(\n llm=llm,\n N=3,\n use_default_structured_output=True\n)\n\ntask.load()\n\nresult = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"golden_solution\": [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n \"solutions\": [\n [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],\n ]\n },\n ]\n )\n)\n# [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# 'golden_solution': [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"],\n# 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n
Annotate your steps with the Math Shepherd Completer:
from distilabel.steps.tasks import MathShepherdCompleter\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n)\ntask = MathShepherdCompleter(\n llm=llm,\n N=3\n)\n\ntask.load()\n\nresult = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"golden_solution\": [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n \"solutions\": [\n [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],\n ]\n },\n ]\n )\n)\n# [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# 'golden_solution': [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"],\n# 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n
Citations:
```\n@misc{wang2024mathshepherdverifyreinforcellms,\n title={Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations},\n author={Peiyi Wang and Lei Li and Zhihong Shao and R. X. Xu and Damai Dai and Yifei Li and Deli Chen and Y. Wu and Zhifang Sui},\n year={2024},\n eprint={2312.08935},\n archivePrefix={arXiv},\n primaryClass={cs.AI},\n url={https://arxiv.org/abs/2312.08935},\n}\n```\n
Source code in src/distilabel/steps/tasks/math_shepherd/completer.py
class MathShepherdCompleter(Task):\n \"\"\"Math Shepherd Completer and auto-labeller task.\n\n This task is in charge of, given a list of solutions to an instruction, and a golden solution,\n as reference, generate completions for the solutions, and label them according to the golden\n solution using the hard estimation method from figure 2 in the reference paper, Eq. 3.\n The attributes make the task flexible to be used with different types of dataset and LLMs, and\n allow making use of different fields to modify the system and user prompts for it. Before modifying\n them, review the current defaults to ensure the completions are generated correctly.\n\n Attributes:\n system_prompt: The system prompt to be used in the completions. The default one has been\n checked and generates good completions using Llama 3.1 with 8B and 70B,\n but it can be modified to adapt it to the model and dataset selected.\n extra_rules: This field can be used to insert extra rules relevant to the type of dataset.\n For example, in the original paper they used GSM8K and MATH datasets, and this field\n can be used to insert the rules for the GSM8K dataset.\n few_shots: Few shots to help the model generating the completions, write them in the\n format of the type of solutions wanted for your dataset.\n N: Number of completions to generate for each step, correspond to N in the paper.\n They used 8 in the paper, but it can be adjusted.\n tags: List of tags to be used in the completions, the default ones are [\"+\", \"-\"] as in the\n paper, where the first is used as a positive label, and the second as a negative one.\n This can be updated, but it MUST be a list with 2 elements, where the first is the\n positive one, and the second the negative one.\n\n Input columns:\n - instruction (`str`): The task or instruction.\n - solutions (`List[str]`): List of solutions to the task.\n - golden_solution (`str`): The reference solution to the task, will be used\n to annotate the candidate solutions.\n\n Output columns:\n - solutions (`List[str]`): The same columns that were used as input, the \"solutions\" is modified.\n - model_name (`str`): The name of the model used to generate the revision.\n\n Categories:\n - text-generation\n - labelling\n\n References:\n - [`Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations`](https://arxiv.org/abs/2312.08935)\n\n Examples:\n Annotate your steps with the Math Shepherd Completer using the structured outputs (the preferred way):\n\n ```python\n from distilabel.steps.tasks import MathShepherdCompleter\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n )\n task = MathShepherdCompleter(\n llm=llm,\n N=3,\n use_default_structured_output=True\n )\n\n task.load()\n\n result = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. 
How much in dollars does she make every day at the farmers' market?\",\n \"golden_solution\": [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n \"solutions\": [\n [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],\n ]\n },\n ]\n )\n )\n # [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n # 'golden_solution': [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\\\u2019s market.\", \"The answer is: 18\"],\n # 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n ```\n\n Annotate your steps with the Math Shepherd Completer:\n\n ```python\n from distilabel.steps.tasks import MathShepherdCompleter\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n )\n task = MathShepherdCompleter(\n llm=llm,\n N=3\n )\n\n task.load()\n\n result = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"golden_solution\": [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n \"solutions\": [\n [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],\n ]\n },\n ]\n )\n )\n # [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. 
How much in dollars does she make every day at the farmers' market?\",\n # 'golden_solution': [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\\\u2019s market.\", \"The answer is: 18\"],\n # 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n ```\n\n Citations:\n\n ```\n @misc{wang2024mathshepherdverifyreinforcellms,\n title={Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations},\n author={Peiyi Wang and Lei Li and Zhihong Shao and R. X. Xu and Damai Dai and Yifei Li and Deli Chen and Y. Wu and Zhifang Sui},\n year={2024},\n eprint={2312.08935},\n archivePrefix={arXiv},\n primaryClass={cs.AI},\n url={https://arxiv.org/abs/2312.08935},\n }\n ```\n \"\"\"\n\n system_prompt: Optional[str] = SYSTEM_PROMPT\n extra_rules: Optional[str] = RULES_GSM8K\n few_shots: Optional[str] = FEW_SHOTS_GSM8K\n N: PositiveInt = 1\n tags: list[str] = [\"+\", \"-\"]\n\n def load(self) -> None:\n super().load()\n\n if self.system_prompt is not None:\n self.system_prompt = Template(self.system_prompt).render(\n extra_rules=self.extra_rules or \"\",\n few_shots=self.few_shots or \"\",\n structured_prompt=SYSTEM_PROMPT_STRUCTURED\n if self.use_default_structured_output\n else \"\",\n )\n if self.use_default_structured_output:\n self._template = Template(TEMPLATE_STRUCTURED)\n else:\n self._template = Template(TEMPLATE)\n\n @property\n def inputs(self) -> \"StepColumns\":\n return [\"instruction\", \"solutions\", \"golden_solution\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n return [\"model_name\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n messages = [\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n instruction=input[\"instruction\"], N=self.N\n ),\n }\n ]\n if self.system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": self.system_prompt})\n return messages # type: ignore\n\n def _parse_output(self, output: Union[str, None]) -> list[list[str]]:\n if output is None:\n return [[\"\"]] * self.N\n\n if self.N > 1:\n output_transformed = ( # type: ignore\n self._format_structured_output(output)\n if self.use_default_structured_output\n else output.split(\"---\")\n )\n examples = [split_solution_steps(o) for o in output_transformed]\n # In case there aren't the expected number of completions, we fill it with \"\", or short the list.\n # This shoulnd't happen if the LLM works as expected, but it's a safety measure as it can be\n # difficult to debug if the completions don't match the solutions.\n if len(examples) < self.N:\n examples.extend([\"\"] * (self.N - len(examples))) # type: ignore\n elif len(examples) > self.N:\n examples = examples[: self.N]\n else:\n output_transformed = (\n self._format_structured_output(output)[0]\n if self.use_default_structured_output\n else output\n )\n examples = [split_solution_steps(output_transformed)]\n return examples\n\n def _format_structured_output(self, output: str) -> list[str]:\n default_output = [\"\"] * self.N if self.N else [\"\"]\n if parsed_output := parse_json_response(output):\n 
solutions = parsed_output[\"solutions\"]\n extracted_solutions = [solution[\"solution\"] for solution in solutions]\n if len(output) != self.N:\n extracted_solutions = default_output\n return extracted_solutions\n return default_output\n\n def format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n ) -> Dict[str, Any]:\n \"\"\"Does nothing.\"\"\"\n return {}\n\n def process(self, inputs: StepInput) -> \"StepOutput\":\n \"\"\"Does the processing of generation completions for the solutions, and annotate\n each step with the logic found in Figure 2 of the paper, with the hard estimation (Eq. (3)).\n\n Args:\n inputs: Inputs to the step\n\n Yields:\n Annotated inputs with the completions.\n \"\"\"\n\n # A list with all the inputs to be passed to the LLM. Needs another structure to\n # find them afterwards\n prepared_inputs = []\n # Data structure with the indices of the elements.\n # (i, j, k) where i is the input, j is the solution, and k is the completion\n input_positions = []\n golden_answers = []\n for i, input in enumerate(inputs):\n instruction = input[\"instruction\"]\n golden_solution = input[\"golden_solution\"] # This is a single solution\n golden_answers.append(golden_solution[-1])\n # This contains a list of solutions\n solutions = input[\"solutions\"]\n for j, solution in enumerate(solutions):\n # For each solution, that has K steps, we have to generate N completions\n # for the first K-2 steps (-2 because the last 2 steps are the last step, and\n # the answer itself, which can be directly compared against golden answer)\n prepared_completions = self._prepare_completions(instruction, solution)\n prepared_inputs.extend(prepared_completions)\n input_positions.extend(\n [(i, j, k) for k in range(len(prepared_completions))]\n )\n\n # Send the elements in batches to the LLM to speed up the process\n final_outputs = []\n # Added here to simplify testing in case we don't have anything to process\n # TODO: Ensure the statistics has the same shape as all the outputs, raw_outputs, and raw_inputs\n statistics = []\n total_raw_outputs = []\n total_raw_inputs = []\n for inner_batch in batched(prepared_inputs, self.input_batch_size): # type: ignore\n outputs = self.llm.generate_outputs(\n inputs=inner_batch,\n num_generations=1,\n **self.llm.get_generation_kwargs(), # type: ignore\n )\n\n formatted_outputs = []\n stats = []\n raw_outputs = []\n raw_inputs = []\n for i, output in enumerate(outputs):\n generation = output[\"generations\"][0]\n raw_inputs.append(inner_batch[i])\n raw_outputs.append(generation or \"\")\n formatted_outputs.append(self._parse_output(generation))\n stats.append(output[\"statistics\"])\n\n final_outputs.extend(formatted_outputs)\n statistics.extend(stats)\n total_raw_outputs.extend(raw_outputs)\n total_raw_inputs.extend(raw_inputs)\n\n yield self._auto_label( # type: ignore\n inputs,\n final_outputs,\n input_positions,\n golden_answers,\n statistics,\n total_raw_outputs,\n total_raw_inputs,\n )\n\n def _prepare_completions(\n self, instruction: str, steps: list[str]\n ) -> List[\"ChatType\"]:\n \"\"\"Helper method to create, given a solution (a list of steps), and a instruction, the\n texts to be completed by the LLM.\n\n Args:\n instruction: Instruction of the problem.\n steps: List of steps that are part of the solution.\n\n Returns:\n List of ChatType, where each ChatType is the prompt corresponding to one of the steps\n to be completed.\n \"\"\"\n prepared_inputs = []\n # Use the number of completions that correspond to a 
given instruction/steps pair\n # to find afterwards the input that corresponds to a given completion (to do the labelling)\n num_completions = len(steps[:-2])\n for i in range(1, num_completions + 1):\n to_complete = instruction + \" \" + \"\\n\".join(steps[:i])\n prepared_inputs.append(self.format_input({\"instruction\": to_complete}))\n\n return prepared_inputs\n\n def _auto_label(\n self,\n inputs: StepInput,\n final_outputs: list[Completions],\n input_positions: list[tuple[int, int, int]],\n golden_answers: list[str],\n statistics: list[\"LLMStatistics\"],\n raw_outputs: list[str],\n raw_inputs: list[str],\n ) -> StepInput:\n \"\"\"Labels the steps inplace (in the inputs), and returns the inputs.\n\n Args:\n inputs: The original inputs\n final_outputs: List of generations from the LLM.\n It's organized as a list where the elements sent to the LLM are\n grouped together, then each element contains the completions, and\n each completion is a list of steps.\n input_positions: A list with tuples generated in the process method\n that contains (i, j, k) where i is the index of the input, j is the\n index of the solution, and k is the index of the completion.\n golden_answers: List of golden answers for each input.\n statistics: List of statistics from the LLM.\n raw_outputs: List of raw outputs from the LLM.\n raw_inputs: List of raw inputs to the LLM.\n\n Returns:\n Inputs annotated.\n \"\"\"\n for i, (instruction_i, solution_i, step_i) in enumerate(input_positions):\n input = inputs[instruction_i]\n solutions = input[\"solutions\"]\n n_completions = final_outputs[i]\n label = f\" {self.tags[1]}\"\n for completion in n_completions:\n if len(completion) == 0:\n # This can be a failed generation\n label = \"\" # Everyting stays the same\n self._logger.info(\"Completer failed due to empty completion\")\n continue\n if completion[-1] == golden_answers[instruction_i]:\n label = f\" { self.tags[0]}\"\n # If we found one, it's enough as we are doing Hard Estimation\n continue\n # In case we had no solutions from the previous step, otherwise we would have\n # an IndexError\n if not solutions[solution_i]:\n continue\n solutions[solution_i][step_i] += label\n inputs[instruction_i][\"solutions\"] = solutions\n\n for i, input in enumerate(inputs):\n solutions = input[\"solutions\"]\n new_solutions = []\n for solution in solutions:\n if not solution or (len(solution) == 1):\n # The generation may fail to generate the expected\n # completions, or just added an extra empty completion,\n # we skip it.\n # Other possible error is having a list of solutions\n # with a single item, so when we call .pop, we are left\n # with an empty list, so we skip it too.\n new_solutions.append(solution)\n continue\n\n answer = solution.pop()\n label = (\n f\" {self.tags[0]}\"\n if answer == golden_answers[i]\n else f\" {self.tags[1]}\"\n )\n solution[-1] += \" \" + answer + label\n new_solutions.append(solution)\n\n # Only add the solutions if the data was properly parsed\n input[\"solutions\"] = new_solutions if new_solutions else input[\"solutions\"]\n input = self._add_metadata(\n input, statistics[i], raw_outputs[i], raw_inputs[i]\n )\n\n return inputs\n\n def _add_metadata(\n self,\n input: dict[str, Any],\n statistics: list[\"LLMStatistics\"],\n raw_output: Union[str, None],\n raw_input: Union[list[dict[str, Any]], None],\n ) -> dict[str, Any]:\n \"\"\"Adds the `distilabel_metadata` to the input.\n\n This method comes for free in the general Tasks, but as we have reimplemented the `process`,\n we have to repeat it 
here.\n\n Args:\n input: The input to add the metadata to.\n statistics: The statistics from the LLM.\n raw_output: The raw output from the LLM.\n raw_input: The raw input to the LLM.\n\n Returns:\n The input with the metadata added if applies.\n \"\"\"\n input[\"model_name\"] = self.llm.model_name\n\n if DISTILABEL_METADATA_KEY not in input:\n input[DISTILABEL_METADATA_KEY] = {}\n # If the solutions are splitted afterwards, the statistics should be splitted\n # to avoid counting extra tokens\n input[DISTILABEL_METADATA_KEY][f\"statistics_{self.name}\"] = statistics\n\n # Let some defaults in case something failed and we had None, otherwise when reading\n # the parquet files using pyarrow, the following error will appear:\n # ArrowInvalid: Schema\n if self.add_raw_input:\n input[DISTILABEL_METADATA_KEY][f\"raw_input_{self.name}\"] = raw_input or [\n {\"content\": \"\", \"role\": \"\"}\n ]\n if self.add_raw_output:\n input[DISTILABEL_METADATA_KEY][f\"raw_output_{self.name}\"] = raw_output or \"\"\n return input\n\n @override\n def get_structured_output(self) -> dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel, Field\n\n class Solution(BaseModel):\n solution: str = Field(..., description=\"Step by step solution leading to the final answer\")\n\n class MathShepherdCompleter(BaseModel):\n solutions: list[Solution] = Field(..., description=\"List of solutions\")\n\n MathShepherdCompleter.model_json_schema()\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"$defs\": {\n \"Solution\": {\n \"properties\": {\n \"solution\": {\n \"description\": \"Step by step solution leading to the final answer\",\n \"title\": \"Solution\",\n \"type\": \"string\",\n }\n },\n \"required\": [\"solution\"],\n \"title\": \"Solution\",\n \"type\": \"object\",\n }\n },\n \"properties\": {\n \"solutions\": {\n \"description\": \"List of solutions\",\n \"items\": {\"$ref\": \"#/$defs/Solution\"},\n \"title\": \"Solutions\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"solutions\"],\n \"title\": \"MathShepherdGenerator\",\n \"type\": \"object\",\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdCompleter.format_output","title":"format_output(output, input=None)
","text":"Does nothing.
Source code insrc/distilabel/steps/tasks/math_shepherd/completer.py
def format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n) -> Dict[str, Any]:\n \"\"\"Does nothing.\"\"\"\n return {}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdCompleter.process","title":"process(inputs)
","text":"Does the processing of generation completions for the solutions, and annotate each step with the logic found in Figure 2 of the paper, with the hard estimation (Eq. (3)).
Parameters:
Name Type Description Defaultinputs
StepInput
Inputs to the step
requiredYields:
Type DescriptionStepOutput
Annotated inputs with the completions.
Source code insrc/distilabel/steps/tasks/math_shepherd/completer.py
def process(self, inputs: StepInput) -> \"StepOutput\":\n \"\"\"Does the processing of generation completions for the solutions, and annotate\n each step with the logic found in Figure 2 of the paper, with the hard estimation (Eq. (3)).\n\n Args:\n inputs: Inputs to the step\n\n Yields:\n Annotated inputs with the completions.\n \"\"\"\n\n # A list with all the inputs to be passed to the LLM. Needs another structure to\n # find them afterwards\n prepared_inputs = []\n # Data structure with the indices of the elements.\n # (i, j, k) where i is the input, j is the solution, and k is the completion\n input_positions = []\n golden_answers = []\n for i, input in enumerate(inputs):\n instruction = input[\"instruction\"]\n golden_solution = input[\"golden_solution\"] # This is a single solution\n golden_answers.append(golden_solution[-1])\n # This contains a list of solutions\n solutions = input[\"solutions\"]\n for j, solution in enumerate(solutions):\n # For each solution, that has K steps, we have to generate N completions\n # for the first K-2 steps (-2 because the last 2 steps are the last step, and\n # the answer itself, which can be directly compared against golden answer)\n prepared_completions = self._prepare_completions(instruction, solution)\n prepared_inputs.extend(prepared_completions)\n input_positions.extend(\n [(i, j, k) for k in range(len(prepared_completions))]\n )\n\n # Send the elements in batches to the LLM to speed up the process\n final_outputs = []\n # Added here to simplify testing in case we don't have anything to process\n # TODO: Ensure the statistics has the same shape as all the outputs, raw_outputs, and raw_inputs\n statistics = []\n total_raw_outputs = []\n total_raw_inputs = []\n for inner_batch in batched(prepared_inputs, self.input_batch_size): # type: ignore\n outputs = self.llm.generate_outputs(\n inputs=inner_batch,\n num_generations=1,\n **self.llm.get_generation_kwargs(), # type: ignore\n )\n\n formatted_outputs = []\n stats = []\n raw_outputs = []\n raw_inputs = []\n for i, output in enumerate(outputs):\n generation = output[\"generations\"][0]\n raw_inputs.append(inner_batch[i])\n raw_outputs.append(generation or \"\")\n formatted_outputs.append(self._parse_output(generation))\n stats.append(output[\"statistics\"])\n\n final_outputs.extend(formatted_outputs)\n statistics.extend(stats)\n total_raw_outputs.extend(raw_outputs)\n total_raw_inputs.extend(raw_inputs)\n\n yield self._auto_label( # type: ignore\n inputs,\n final_outputs,\n input_positions,\n golden_answers,\n statistics,\n total_raw_outputs,\n total_raw_inputs,\n )\n
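To make the hard-estimation rule concrete, here is a small, self-contained sketch (not the library implementation): a step receives the positive tag if any of its N completions reaches the golden answer, and the negative tag otherwise (Eq. 3 of the paper).
def hard_estimation_label(completions, golden_answer, tags=(\"+\", \"-\")):\n    \"\"\"Label one step: positive if any completion ends in the golden answer.\"\"\"\n    positive, negative = tags\n    for completion in completions:\n        if completion and completion[-1] == golden_answer:\n            return positive\n    return negative\n\n# Two completions generated from the same step; the second reaches the golden answer.\nprint(hard_estimation_label(\n    [[\"Step 2: She makes 9 * 2 = $20.\", \"The answer is: 20\"], [\"Step 2: She makes 9 * 2 = $18.\", \"The answer is: 18\"]],\n    golden_answer=\"The answer is: 18\",\n))  # \"+\"\n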
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdCompleter._prepare_completions","title":"_prepare_completions(instruction, steps)
","text":"Helper method to create, given a solution (a list of steps), and a instruction, the texts to be completed by the LLM.
Parameters:
Name Type Description Defaultinstruction
str
Instruction of the problem.
requiredsteps
list[str]
List of steps that are part of the solution.
requiredReturns:
Type DescriptionList[ChatType]
List of ChatType, where each ChatType is the prompt corresponding to one of the steps to be completed.
Source code insrc/distilabel/steps/tasks/math_shepherd/completer.py
def _prepare_completions(\n self, instruction: str, steps: list[str]\n) -> List[\"ChatType\"]:\n \"\"\"Helper method to create, given a solution (a list of steps), and a instruction, the\n texts to be completed by the LLM.\n\n Args:\n instruction: Instruction of the problem.\n steps: List of steps that are part of the solution.\n\n Returns:\n List of ChatType, where each ChatType is the prompt corresponding to one of the steps\n to be completed.\n \"\"\"\n prepared_inputs = []\n # Use the number of completions that correspond to a given instruction/steps pair\n # to find afterwards the input that corresponds to a given completion (to do the labelling)\n num_completions = len(steps[:-2])\n for i in range(1, num_completions + 1):\n to_complete = instruction + \" \" + \"\\n\".join(steps[:i])\n prepared_inputs.append(self.format_input({\"instruction\": to_complete}))\n\n return prepared_inputs\n
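As a rough, standalone illustration of this method (plain Python, not the library code): for a solution with K steps, one partial prompt is built for each of the first K-2 steps, because the last step and the final answer are compared directly against the golden answer.
instruction = \"Janet sells the remaining duck eggs at the farmers' market. How much does she make per day?\"\nsteps = [\"Step 1: Janet sells 16 - 3 - 4 = 9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $18 every day.\", \"The answer is: 18\"]\n\n# One partial prompt per completion point: steps[:1], steps[:2], ..., steps[:K-2]\npartial_prompts = [\n    instruction + \" \" + \"\\n\".join(steps[:i])\n    for i in range(1, len(steps[:-2]) + 1)\n]\nprint(len(partial_prompts))  # 1 -> only the text up to Step 1 needs completions here\n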
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdCompleter._auto_label","title":"_auto_label(inputs, final_outputs, input_positions, golden_answers, statistics, raw_outputs, raw_inputs)
","text":"Labels the steps inplace (in the inputs), and returns the inputs.
Parameters:
Name Type Description Defaultinputs
StepInput
The original inputs
requiredfinal_outputs
list[Completions]
List of generations from the LLM. It's organized as a list where the elements sent to the LLM are grouped together, then each element contains the completions, and each completion is a list of steps.
requiredinput_positions
list[tuple[int, int, int]]
A list with tuples generated in the process method that contains (i, j, k) where i is the index of the input, j is the index of the solution, and k is the index of the completion.
requiredgolden_answers
list[str]
List of golden answers for each input.
requiredstatistics
list[LLMStatistics]
List of statistics from the LLM.
requiredraw_outputs
list[str]
List of raw outputs from the LLM.
requiredraw_inputs
list[str]
List of raw inputs to the LLM.
requiredReturns:
Type DescriptionStepInput
Inputs annotated.
Source code insrc/distilabel/steps/tasks/math_shepherd/completer.py
def _auto_label(\n self,\n inputs: StepInput,\n final_outputs: list[Completions],\n input_positions: list[tuple[int, int, int]],\n golden_answers: list[str],\n statistics: list[\"LLMStatistics\"],\n raw_outputs: list[str],\n raw_inputs: list[str],\n) -> StepInput:\n \"\"\"Labels the steps inplace (in the inputs), and returns the inputs.\n\n Args:\n inputs: The original inputs\n final_outputs: List of generations from the LLM.\n It's organized as a list where the elements sent to the LLM are\n grouped together, then each element contains the completions, and\n each completion is a list of steps.\n input_positions: A list with tuples generated in the process method\n that contains (i, j, k) where i is the index of the input, j is the\n index of the solution, and k is the index of the completion.\n golden_answers: List of golden answers for each input.\n statistics: List of statistics from the LLM.\n raw_outputs: List of raw outputs from the LLM.\n raw_inputs: List of raw inputs to the LLM.\n\n Returns:\n Inputs annotated.\n \"\"\"\n for i, (instruction_i, solution_i, step_i) in enumerate(input_positions):\n input = inputs[instruction_i]\n solutions = input[\"solutions\"]\n n_completions = final_outputs[i]\n label = f\" {self.tags[1]}\"\n for completion in n_completions:\n if len(completion) == 0:\n # This can be a failed generation\n label = \"\" # Everyting stays the same\n self._logger.info(\"Completer failed due to empty completion\")\n continue\n if completion[-1] == golden_answers[instruction_i]:\n label = f\" { self.tags[0]}\"\n # If we found one, it's enough as we are doing Hard Estimation\n continue\n # In case we had no solutions from the previous step, otherwise we would have\n # an IndexError\n if not solutions[solution_i]:\n continue\n solutions[solution_i][step_i] += label\n inputs[instruction_i][\"solutions\"] = solutions\n\n for i, input in enumerate(inputs):\n solutions = input[\"solutions\"]\n new_solutions = []\n for solution in solutions:\n if not solution or (len(solution) == 1):\n # The generation may fail to generate the expected\n # completions, or just added an extra empty completion,\n # we skip it.\n # Other possible error is having a list of solutions\n # with a single item, so when we call .pop, we are left\n # with an empty list, so we skip it too.\n new_solutions.append(solution)\n continue\n\n answer = solution.pop()\n label = (\n f\" {self.tags[0]}\"\n if answer == golden_answers[i]\n else f\" {self.tags[1]}\"\n )\n solution[-1] += \" \" + answer + label\n new_solutions.append(solution)\n\n # Only add the solutions if the data was properly parsed\n input[\"solutions\"] = new_solutions if new_solutions else input[\"solutions\"]\n input = self._add_metadata(\n input, statistics[i], raw_outputs[i], raw_inputs[i]\n )\n\n return inputs\n
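The second loop above folds the final answer back into the last remaining step and tags it by direct comparison against the golden answer. A simplified sketch of that rule (not the library code):
def label_final_answer(solution, golden_answer, tags=(\"+\", \"-\")):\n    \"\"\"Pop the answer, compare it with the golden answer and fold it into the last step.\"\"\"\n    solution = list(solution)  # work on a copy\n    answer = solution.pop()\n    label = f\" {tags[0]}\" if answer == golden_answer else f\" {tags[1]}\"\n    solution[-1] += \" \" + answer + label\n    return solution\n\nprint(label_final_answer(\n    [\"Step 1: Janet sells 16 - 3 - 4 = 9 duck eggs a day. +\", \"Step 2: She makes 9 * 2 = $18 every day.\", \"The answer is: 18\"],\n    golden_answer=\"The answer is: 18\",\n))\n# ['Step 1: Janet sells 16 - 3 - 4 = 9 duck eggs a day. +', 'Step 2: She makes 9 * 2 = $18 every day. The answer is: 18 +']\n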
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdCompleter._add_metadata","title":"_add_metadata(input, statistics, raw_output, raw_input)
","text":"Adds the distilabel_metadata
to the input.
This method comes for free in the general Tasks, but as we have reimplemented the process
, we have to repeat it here.
Parameters:
Name Type Description Defaultinput
dict[str, Any]
The input to add the metadata to.
requiredstatistics
list[LLMStatistics]
The statistics from the LLM.
requiredraw_output
Union[str, None]
The raw output from the LLM.
requiredraw_input
Union[list[dict[str, Any]], None]
The raw input to the LLM.
requiredReturns:
Type Descriptiondict[str, Any]
The input with the metadata added, if applicable.
Source code insrc/distilabel/steps/tasks/math_shepherd/completer.py
def _add_metadata(\n self,\n input: dict[str, Any],\n statistics: list[\"LLMStatistics\"],\n raw_output: Union[str, None],\n raw_input: Union[list[dict[str, Any]], None],\n) -> dict[str, Any]:\n \"\"\"Adds the `distilabel_metadata` to the input.\n\n This method comes for free in the general Tasks, but as we have reimplemented the `process`,\n we have to repeat it here.\n\n Args:\n input: The input to add the metadata to.\n statistics: The statistics from the LLM.\n raw_output: The raw output from the LLM.\n raw_input: The raw input to the LLM.\n\n Returns:\n The input with the metadata added if applies.\n \"\"\"\n input[\"model_name\"] = self.llm.model_name\n\n if DISTILABEL_METADATA_KEY not in input:\n input[DISTILABEL_METADATA_KEY] = {}\n # If the solutions are splitted afterwards, the statistics should be splitted\n # to avoid counting extra tokens\n input[DISTILABEL_METADATA_KEY][f\"statistics_{self.name}\"] = statistics\n\n # Let some defaults in case something failed and we had None, otherwise when reading\n # the parquet files using pyarrow, the following error will appear:\n # ArrowInvalid: Schema\n if self.add_raw_input:\n input[DISTILABEL_METADATA_KEY][f\"raw_input_{self.name}\"] = raw_input or [\n {\"content\": \"\", \"role\": \"\"}\n ]\n if self.add_raw_output:\n input[DISTILABEL_METADATA_KEY][f\"raw_output_{self.name}\"] = raw_output or \"\"\n return input\n
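After this method runs, each row carries its metadata under the distilabel_metadata key, with the step name as a suffix. Roughly, and with illustrative values (the step name \"completer\" below is hypothetical):
# Sketch of a processed row, assuming the step is named \"completer\" (illustrative values).\nrow = {\n    \"instruction\": \"...\",\n    \"solutions\": [[\"Step 1: ... +\", \"The answer is: 18 +\"]],\n    \"model_name\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    \"distilabel_metadata\": {\n        \"statistics_completer\": [{\"input_tokens\": [123], \"output_tokens\": [456]}],  # exact shape depends on the LLM\n        \"raw_input_completer\": [{\"role\": \"user\", \"content\": \"...\"}],\n        \"raw_output_completer\": \"Step 2: ...\",\n    },\n}\n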
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdCompleter.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
The schema corresponds to the following:
from pydantic import BaseModel, Field\n\nclass Solution(BaseModel):\n solution: str = Field(..., description=\"Step by step solution leading to the final answer\")\n\nclass MathShepherdCompleter(BaseModel):\n solutions: list[Solution] = Field(..., description=\"List of solutions\")\n\nMathShepherdCompleter.model_json_schema()\n
Returns:
Type Descriptiondict[str, Any]
JSON Schema of the response to enforce.
Source code insrc/distilabel/steps/tasks/math_shepherd/completer.py
@override\ndef get_structured_output(self) -> dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel, Field\n\n class Solution(BaseModel):\n solution: str = Field(..., description=\"Step by step solution leading to the final answer\")\n\n class MathShepherdCompleter(BaseModel):\n solutions: list[Solution] = Field(..., description=\"List of solutions\")\n\n MathShepherdCompleter.model_json_schema()\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"$defs\": {\n \"Solution\": {\n \"properties\": {\n \"solution\": {\n \"description\": \"Step by step solution leading to the final answer\",\n \"title\": \"Solution\",\n \"type\": \"string\",\n }\n },\n \"required\": [\"solution\"],\n \"title\": \"Solution\",\n \"type\": \"object\",\n }\n },\n \"properties\": {\n \"solutions\": {\n \"description\": \"List of solutions\",\n \"items\": {\"$ref\": \"#/$defs/Solution\"},\n \"title\": \"Solutions\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"solutions\"],\n \"title\": \"MathShepherdGenerator\",\n \"type\": \"object\",\n }\n
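An LLM response that satisfies this schema can be parsed directly; for example (illustrative content):
import json\n\nresponse = '{\"solutions\": [{\"solution\": \"Step 1: Janet sells 16 - 3 - 4 = 9 duck eggs a day. Step 2: She makes 9 * 2 = $18. The answer is: 18\"}]}'\nparsed = json.loads(response)\nprint([item[\"solution\"] for item in parsed[\"solutions\"]])\n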
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdGenerator","title":"MathShepherdGenerator
","text":" Bases: Task
Math Shepherd solution generator.
This task is in charge of generating completions for a given instruction, in the format expected by the Math Shepherd Completer task. The attributes make the task flexible enough to be used with different types of datasets and LLMs, but we provide examples for the GSM8K and MATH datasets as presented in the original paper. Before modifying them, review the current defaults to ensure the completions are generated correctly. This task can be used to generate the golden solution for a given problem if it is not provided, as well as candidate solutions to then be labeled by the Math Shepherd Completer. Only one of solutions or golden_solution will be generated, depending on the value of M.
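As a quick sketch of that switch (the model id is illustrative): leaving M at its default makes the task produce the golden_solution column, while setting M greater than 1 makes it produce the solutions column.
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import MathShepherdGenerator\n\nllm = InferenceEndpointsLLM(model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\")\n\ngolden_task = MathShepherdGenerator(llm=llm)            # M unset -> \"golden_solution\"\ncandidates_task = MathShepherdGenerator(llm=llm, M=4)   # M > 1  -> \"solutions\"\n\nprint(golden_task.outputs)      # ['golden_solution', 'model_name']\nprint(candidates_task.outputs)  # ['solutions', 'model_name']\n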
Attributes:
Name Type Descriptionsystem_prompt
Optional[str]
The system prompt to be used in the completions. The default one has been checked and generates good completions using Llama 3.1 8B and 70B, but it can be modified to adapt it to the model and dataset selected. Take into account that the system prompt includes 2 variables in the Jinja2 template, {{extra_rules}} and {{few_shot}}. These variables are used to include extra rules, for example to steer the model towards a specific type of response, and few-shot examples. They can be modified to adapt the system prompt to the dataset and model used without needing to change the full system prompt.
extra_rules
Optional[str]
This field can be used to insert extra rules relevant to the type of dataset. For example, the original paper used the GSM8K and MATH datasets, and this field can hold the rules specific to GSM8K.
few_shots
Optional[str]
Few-shot examples to help the model generate the completions; write them in the format of the solutions expected for your dataset.
M
Optional[PositiveInt]
Number of completions to generate for each step. By default it is set to 1, which will generate the \"golden_solution\"; in this case, select a stronger model, as it will be used as the source of truth during labelling. If M is set to a number greater than 1, the task will generate a list of completions to be labeled by the Math Shepherd Completer task.
Input columns: instruction (str): The task or instruction.
Output columns: golden_solution (str): The step-by-step solution to the instruction; it will be generated if M is equal to 1. solutions (List[List[str]]): A list of possible solutions to the instruction; it will be generated if M is greater than 1. model_name (str): The name of the model used to generate the revision.
References: Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Examples:
Generate the solution for a given instruction (prefer a stronger model here):
from distilabel.steps.tasks import MathShepherdGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n)\ntask = MathShepherdGenerator(\n name=\"golden_solution_generator\",\n llm=llm,\n)\n\ntask.load()\n\nresult = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n },\n ]\n )\n)\n# [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# 'golden_solution': '[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"]'}]]\n
Generate M completions for a given instruction (using structured output generation):
from distilabel.steps.tasks import MathShepherdGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 2048,\n },\n)\ntask = MathShepherdGenerator(\n name=\"solution_generator\",\n llm=llm,\n M=2,\n use_default_structured_output=True,\n)\n\ntask.load()\n\nresult = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n },\n ]\n )\n)\n# [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n
Source code in src/distilabel/steps/tasks/math_shepherd/generator.py
class MathShepherdGenerator(Task):\n \"\"\"Math Shepherd solution generator.\n\n This task is in charge of generating completions for a given instruction, in the format expected\n by the Math Shepherd Completer task. The attributes make the task flexible to be used with different\n types of dataset and LLMs, but we provide examples for the GSM8K and MATH datasets as presented\n in the original paper. Before modifying them, review the current defaults to ensure the completions\n are generated correctly. This task can be used to generate the golden solutions for a given problem if\n not provided, as well as possible solutions to be then labeled by the Math Shepherd Completer.\n Only one of `solutions` or `golden_solution` will be generated, depending on the value of M.\n\n Attributes:\n system_prompt: The system prompt to be used in the completions. The default one has been\n checked and generates good completions using Llama 3.1 with 8B and 70B,\n but it can be modified to adapt it to the model and dataset selected.\n Take into account that the system prompt includes 2 variables in the Jinja2 template,\n {{extra_rules}} and {{few_shot}}. These variables are used to include extra rules, for example\n to steer the model towards a specific type of responses, and few shots to add examples.\n They can be modified to adapt the system prompt to the dataset and model used without needing\n to change the full system prompt.\n extra_rules: This field can be used to insert extra rules relevant to the type of dataset.\n For example, in the original paper they used GSM8K and MATH datasets, and this field\n can be used to insert the rules for the GSM8K dataset.\n few_shots: Few shots to help the model generating the completions, write them in the\n format of the type of solutions wanted for your dataset.\n M: Number of completions to generate for each step. By default is set to 1, which will\n generate the \"golden_solution\". In this case select a stronger model, as it will be used\n as the source of true during labelling. If M is set to a number greater than 1, the task\n will generate a list of completions to be labeled by the Math Shepherd Completer task.\n\n Input columns:\n - instruction (`str`): The task or instruction.\n\n Output columns:\n - golden_solution (`str`): The step by step solution to the instruction.\n It will be generated if M is equal to 1.\n - solutions (`List[List[str]]`): A list of possible solutions to the instruction.\n It will be generated if M is greater than 1.\n - model_name (`str`): The name of the model used to generate the revision.\n\n Categories:\n - text-generation\n\n References:\n - [`Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations`](https://arxiv.org/abs/2312.08935)\n\n Examples:\n Generate the solution for a given instruction (prefer a stronger model here):\n\n ```python\n from distilabel.steps.tasks import MathShepherdGenerator\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n )\n task = MathShepherdGenerator(\n name=\"golden_solution_generator\",\n llm=llm,\n )\n\n task.load()\n\n result = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. 
She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n },\n ]\n )\n )\n # [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n # 'golden_solution': '[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\\\u2019s market.\", \"The answer is: 18\"]'}]]\n ```\n\n Generate M completions for a given instruction (using structured output generation):\n\n ```python\n from distilabel.steps.tasks import MathShepherdGenerator\n from distilabel.models import InferenceEndpointsLLM\n\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 2048,\n },\n )\n task = MathShepherdGenerator(\n name=\"solution_generator\",\n llm=llm,\n M=2,\n use_default_structured_output=True,\n )\n\n task.load()\n\n result = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n },\n ]\n )\n )\n # [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n # 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. 
+\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n ```\n \"\"\"\n\n system_prompt: Optional[str] = SYSTEM_PROMPT\n extra_rules: Optional[str] = RULES_GSM8K\n few_shots: Optional[str] = FEW_SHOTS_GSM8K\n M: Optional[PositiveInt] = None\n\n def load(self) -> None:\n super().load()\n if self.system_prompt is not None:\n self.system_prompt = Template(self.system_prompt).render(\n extra_rules=self.extra_rules or \"\",\n few_shots=self.few_shots or \"\",\n structured_prompt=SYSTEM_PROMPT_STRUCTURED\n if self.use_default_structured_output\n else \"\",\n )\n if self.use_default_structured_output:\n self._template = Template(TEMPLATE_STRUCTURED)\n else:\n self._template = Template(TEMPLATE)\n\n @property\n def inputs(self) -> \"StepColumns\":\n return [\"instruction\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n if self.M:\n return [\"solutions\", \"model_name\"]\n return [\"golden_solution\", \"model_name\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n messages = [\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n instruction=input[\"instruction\"],\n M=self.M,\n ),\n }\n ]\n if self.system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": self.system_prompt})\n return messages\n\n def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n ) -> Dict[str, Any]:\n output_name = \"solutions\" if self.M else \"golden_solution\"\n\n if output is None:\n input.update(**{output_name: None})\n return input\n\n if self.M:\n output_parsed = (\n self._format_structured_output(output)\n if self.use_default_structured_output\n else output.split(\"---\")\n )\n solutions = [split_solution_steps(o) for o in output_parsed]\n else:\n output_parsed = (\n self._format_structured_output(output)[0]\n if self.use_default_structured_output\n else output\n )\n solutions = split_solution_steps(output_parsed)\n\n input.update(**{output_name: solutions})\n return input\n\n @override\n def get_structured_output(self) -> dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel, Field\n\n class Solution(BaseModel):\n solution: str = Field(..., description=\"Step by step solution leading to the final answer\")\n\n class MathShepherdGenerator(BaseModel):\n solutions: list[Solution] = Field(..., description=\"List of solutions\")\n\n MathShepherdGenerator.model_json_schema()\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"$defs\": {\n \"Solution\": {\n \"properties\": {\n \"solution\": {\n \"description\": \"Step by step solution leading to the final answer\",\n \"title\": \"Solution\",\n \"type\": \"string\",\n }\n },\n \"required\": [\"solution\"],\n \"title\": \"Solution\",\n \"type\": \"object\",\n }\n },\n \"properties\": {\n \"solutions\": {\n \"description\": \"List of solutions\",\n \"items\": {\"$ref\": \"#/$defs/Solution\"},\n \"title\": \"Solutions\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"solutions\"],\n \"title\": \"MathShepherdGenerator\",\n \"type\": \"object\",\n }\n\n def _format_structured_output(self, output: str) -> list[str]:\n default_output = [\"\"] * self.M if self.M else [\"\"]\n if parsed_output := parse_json_response(output):\n solutions = parsed_output[\"solutions\"]\n extracted_solutions = 
[o[\"solution\"] for o in solutions]\n if len(extracted_solutions) != self.M:\n extracted_solutions = default_output\n return extracted_solutions\n return default_output\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.MathShepherdGenerator.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
The schema corresponds to the following:
from pydantic import BaseModel, Field\n\nclass Solution(BaseModel):\n solution: str = Field(..., description=\"Step by step solution leading to the final answer\")\n\nclass MathShepherdGenerator(BaseModel):\n solutions: list[Solution] = Field(..., description=\"List of solutions\")\n\nMathShepherdGenerator.model_json_schema()\n
Returns:
Type Descriptiondict[str, Any]
JSON Schema of the response to enforce.
Source code insrc/distilabel/steps/tasks/math_shepherd/generator.py
@override\ndef get_structured_output(self) -> dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel, Field\n\n class Solution(BaseModel):\n solution: str = Field(..., description=\"Step by step solution leading to the final answer\")\n\n class MathShepherdGenerator(BaseModel):\n solutions: list[Solution] = Field(..., description=\"List of solutions\")\n\n MathShepherdGenerator.model_json_schema()\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"$defs\": {\n \"Solution\": {\n \"properties\": {\n \"solution\": {\n \"description\": \"Step by step solution leading to the final answer\",\n \"title\": \"Solution\",\n \"type\": \"string\",\n }\n },\n \"required\": [\"solution\"],\n \"title\": \"Solution\",\n \"type\": \"object\",\n }\n },\n \"properties\": {\n \"solutions\": {\n \"description\": \"List of solutions\",\n \"items\": {\"$ref\": \"#/$defs/Solution\"},\n \"title\": \"Solutions\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"solutions\"],\n \"title\": \"MathShepherdGenerator\",\n \"type\": \"object\",\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.FormatPRM","title":"FormatPRM
","text":" Bases: Step
Helper step to transform the data into the format expected by the PRM model.
This step can be used to format the data in one of 2 formats: (1) following the format presented in peiyi9979/Math-Shepherd, in which case this step creates the columns input and label, where the input is the instruction with the solution (and the tag replaced by a token), and the label is the instruction with the solution, both separated by a newline; or (2) following TRL's format for training, which generates the columns prompt, completions, and labels, where the labels correspond to the original tags replaced by boolean values, with True representing correct steps.
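As a rough sketch of the first (\"math-shepherd\") layout, not the library code; the step_token value below is a placeholder for whatever scoring token your PRM expects:
def to_math_shepherd_row(instruction, solution, step_token, tags=(\"+\", \"-\")):\n    \"\"\"Build the input/label pair: input swaps the trailing tags for the scoring token.\"\"\"\n    def swap_tag(step):\n        for tag in tags:\n            if step.endswith(tag):\n                return step[: -len(tag)] + step_token\n        return step\n    return {\n        \"input\": instruction + \"\\n\" + \" \".join(swap_tag(s) for s in solution),\n        \"label\": instruction + \"\\n\" + \" \".join(solution),\n    }\n\nrow = to_math_shepherd_row(\n    \"How much does Janet make every day at the farmers' market?\",\n    [\"Step 1: Janet sells 16 - 3 - 4 = 9 duck eggs a day. +\", \"Step 2: She makes 9 * 2 = $18. The answer is: 18 +\"],\n    step_token=\"<step_score>\",  # placeholder: use the token your PRM expects at scoring positions\n)\nprint(row[\"input\"])\n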
Attributes:
Name Type Descriptionformat
Literal['math-shepherd', 'trl']
The format to use for the PRM model. \"math-shepherd\" corresponds to the original paper, while \"trl\" is a format prepared to train the model using TRL.
step_token
str
String that serves as a unique token denoting the position for predicting the step score.
tags
list[str]
List of tags that represent the correct and incorrect steps. This only needs to be provided if it differs from the default used in MathShepherdCompleter
.
str
): The task or instruction. solutions (list[str]): List of steps with a solution to the task.
Output columns: input (str): The instruction with the solutions, where the label tags are replaced by a token (\"math-shepherd\" format). label (str): The instruction with the solutions (\"math-shepherd\" format). prompt (str): The instruction with the solutions, where the label tags are replaced by a token (\"trl\" format). completions (List[str]): The solution represented as a list of steps (\"trl\" format). labels (List[bool]): The labels, as a list of booleans, where True represents a good response (\"trl\" format).
References: Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Examples:
Prepare your data to train a PRM model with the Math-Shepherd format:
from distilabel.steps.tasks import FormatPRM\nfrom distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(columns=[\"solutions\"])\nexpand_columns.load()\n\n# Define our PRM formatter\nformatter = FormatPRM()\nformatter.load()\n\n# Expand the solutions column as it comes from the MathShepherdCompleter\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"solutions\": [[\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"]]\n },\n ]\n )\n)\nresult = next(formatter.process(result))\n
Prepare your data to train a PRM model with the TRL format:
from distilabel.steps.tasks import FormatPRM\nfrom distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(columns=[\"solutions\"])\nexpand_columns.load()\n\n# Define our PRM formatter\nformatter = FormatPRM(format=\"trl\")\nformatter.load()\n\n# Expand the solutions column as it comes from the MathShepherdCompleter\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"solutions\": [[\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"]]\n },\n ]\n )\n)\n\nresult = next(formatter.process(result))\n# {\n# \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# \"solutions\": [\n# \"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\",\n# \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\",\n# \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. 
The answer is: 3 +\"\n# ],\n# \"prompt\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# \"completions\": [\n# \"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required.\",\n# \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber.\",\n# \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3\"\n# ],\n# \"labels\": [\n# true,\n# true,\n# true\n# ]\n# }\n
Citations:
```\n@misc{wang2024mathshepherdverifyreinforcellms,\n title={Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations},\n author={Peiyi Wang and Lei Li and Zhihong Shao and R. X. Xu and Damai Dai and Yifei Li and Deli Chen and Y. Wu and Zhifang Sui},\n year={2024},\n eprint={2312.08935},\n archivePrefix={arXiv},\n primaryClass={cs.AI},\n url={https://arxiv.org/abs/2312.08935},\n}\n```\n
Source code in src/distilabel/steps/tasks/math_shepherd/utils.py
class FormatPRM(Step):\n \"\"\"Helper step to transform the data into the format expected by the PRM model.\n\n This step can be used to format the data in one of 2 formats:\n Following the format presented\n in [peiyi9979/Math-Shepherd](https://huggingface.co/datasets/peiyi9979/Math-Shepherd?row=0),\n in which case this step creates the columns input and label, where the input is the instruction\n with the solution (and the tag replaced by a token), and the label is the instruction\n with the solution, both separated by a newline.\n Following TRL's format for training, which generates the columns prompt, completions, and labels.\n The labels correspond to the original tags replaced by boolean values, where True represents\n correct steps.\n\n Attributes:\n format: The format to use for the PRM model.\n \"math-shepherd\" corresponds to the original paper, while \"trl\" is a format\n prepared to train the model using TRL.\n step_token: String that serves as a unique token denoting the position\n for predicting the step score.\n tags: List of tags that represent the correct and incorrect steps.\n This only needs to be informed if it's different than the default in\n `MathShepherdCompleter`.\n\n Input columns:\n - instruction (`str`): The task or instruction.\n - solutions (`list[str]`): List of steps with a solution to the task.\n\n Output columns:\n - input (`str`): The instruction with the solutions, where the label tags\n are replaced by a token.\n - label (`str`): The instruction with the solutions.\n - prompt (`str`): The instruction with the solutions, where the label tags\n are replaced by a token.\n - completions (`List[str]`): The solution represented as a list of steps.\n - labels (`List[bool]`): The labels, as a list of booleans, where True represents\n a good response.\n\n Categories:\n - text-manipulation\n - columns\n\n References:\n - [`Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations`](https://arxiv.org/abs/2312.08935)\n - [peiyi9979/Math-Shepherd](https://huggingface.co/datasets/peiyi9979/Math-Shepherd?row=0)\n\n Examples:\n Prepare your data to train a PRM model with the Math-Shepherd format:\n\n ```python\n from distilabel.steps.tasks import FormatPRM\n from distilabel.steps import ExpandColumns\n\n expand_columns = ExpandColumns(columns=[\"solutions\"])\n expand_columns.load()\n\n # Define our PRM formatter\n formatter = FormatPRM()\n formatter.load()\n\n # Expand the solutions column as it comes from the MathShepherdCompleter\n result = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"solutions\": [[\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. 
+\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"]]\n },\n ]\n )\n )\n result = next(formatter.process(result))\n ```\n\n Prepare your data to train a PRM model with the TRL format:\n\n ```python\n from distilabel.steps.tasks import FormatPRM\n from distilabel.steps import ExpandColumns\n\n expand_columns = ExpandColumns(columns=[\"solutions\"])\n expand_columns.load()\n\n # Define our PRM formatter\n formatter = FormatPRM(format=\"trl\")\n formatter.load()\n\n # Expand the solutions column as it comes from the MathShepherdCompleter\n result = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"solutions\": [[\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. 
+\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it\\'s half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"]]\n },\n ]\n )\n )\n\n result = next(formatter.process(result))\n # {\n # \"instruction\": \"Janet\\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n # \"solutions\": [\n # \"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\",\n # \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\",\n # \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"\n # ],\n # \"prompt\": \"Janet\\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n # \"completions\": [\n # \"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required.\",\n # \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber.\",\n # \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3\"\n # ],\n # \"labels\": [\n # true,\n # true,\n # true\n # ]\n # }\n ```\n\n Citations:\n\n ```\n @misc{wang2024mathshepherdverifyreinforcellms,\n title={Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations},\n author={Peiyi Wang and Lei Li and Zhihong Shao and R. X. Xu and Damai Dai and Yifei Li and Deli Chen and Y. 
Wu and Zhifang Sui},\n year={2024},\n eprint={2312.08935},\n archivePrefix={arXiv},\n primaryClass={cs.AI},\n url={https://arxiv.org/abs/2312.08935},\n }\n ```\n \"\"\"\n\n format: Literal[\"math-shepherd\", \"trl\"] = \"math-shepherd\"\n step_token: str = \"\u043a\u0438\"\n tags: list[str] = [\"+\", \"-\"]\n\n def model_post_init(self, __context: Any) -> None:\n super().model_post_init(__context)\n if self.format == \"math-shepherd\":\n self._formatter = self._format_math_shepherd\n else:\n self._formatter = self._format_trl\n\n @property\n def inputs(self) -> \"StepColumns\":\n return [\"instruction\", \"solutions\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n if self.format == \"math-shepherd\":\n return [\"input\", \"label\"]\n return [\"prompt\", \"completions\", \"labels\"]\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The process prepares the data for the `APIGenGenerator` task.\n\n If a single example is provided, it is copied to avoid raising an error.\n\n Args:\n inputs: A list of dictionaries with the input data.\n\n Yields:\n A list of dictionaries with the output data.\n \"\"\"\n for input in inputs:\n self._formatter(input)\n\n yield inputs # type: ignore\n\n def _format_math_shepherd(\n self, input: dict[str, str]\n ) -> dict[str, Union[str, list[str]]]:\n instruction = input[\"instruction\"]\n replaced = []\n # At this stage, the \"solutions\" column can only contain a single solution,\n # and the last item of each solution is the tag.\n solution = input[\"solutions\"]\n for step in solution:\n # Check there's a string, because the step that generated\n # the solutions could have failed, and we would have an empty list.\n replaced.append(step[:-1] + self.step_token if len(step) > 1 else step)\n\n input[\"input\"] = instruction + \" \" + \"\\n\".join(replaced)\n input[\"label\"] = instruction + \" \" + \"\\n\".join(solution)\n\n return input # type: ignore\n\n def _format_trl(\n self, input: dict[str, str]\n ) -> dict[str, Union[str, list[str], list[bool]]]:\n input[\"prompt\"] = input[\"instruction\"]\n completions: list[str] = []\n labels: list[bool] = []\n for step in input[\"solutions\"]:\n token = step[-1]\n completions.append(step[:-1].strip())\n labels.append(True if token == self.tags[0] else False)\n\n input[\"completions\"] = completions # type: ignore\n input[\"labels\"] = labels # type: ignore\n\n return input # type: ignore\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.FormatPRM.process","title":"process(inputs)
","text":"The process prepares the data for the APIGenGenerator
task.
If a single example is provided, it is copied to avoid raising an error.
Parameters:
- inputs (`StepInput`): A list of dictionaries with the input data. (required)

Yields:
- `StepOutput`: A list of dictionaries with the output data.

Source code in src/distilabel/steps/tasks/math_shepherd/utils.py
def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"The process prepares the data for the `APIGenGenerator` task.\n\n If a single example is provided, it is copied to avoid raising an error.\n\n Args:\n inputs: A list of dictionaries with the input data.\n\n Yields:\n A list of dictionaries with the output data.\n \"\"\"\n for input in inputs:\n self._formatter(input)\n\n yield inputs # type: ignore\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.PairRM","title":"PairRM
","text":" Bases: Step
Rank the candidates based on the input using the LLM
model.
Attributes:
- model (`str`): The model to use for the ranking. Defaults to `\"llm-blender/PairRM\"`.
- instructions (`Optional[str]`): The instructions to use for the model. Defaults to `None`.

Input columns:
- inputs (`List[Dict[str, Any]]`): The input text or conversation to rank the candidates for.
- candidates (`List[Dict[str, Any]]`): The candidates to rank.

Output columns:
- ranks (`List[int]`): The ranks of the candidates based on the input.
- ranked_candidates (`List[Dict[str, Any]]`): The candidates ranked based on the input.
- model_name (`str`): The model name used to rank the candidate responses. Defaults to `\"llm-blender/PairRM\"`.

Note: This step differs from other tasks as there is a single implementation of this model currently, and we will use a specific `LLM`.
Examples:
Rank LLM candidates:
from distilabel.steps.tasks import PairRM\n\n# Instantiate the PairRM step (no separate LLM is needed; it loads the ranker itself).\npair_rm = PairRM()\n\npair_rm.load()\n\nresult = next(\n pair_rm.process(\n [\n {\"input\": \"Hello, how are you?\", \"candidates\": [\"fine\", \"good\", \"bad\"]},\n ]\n )\n)\n# result\n# [\n# {\n# 'input': 'Hello, how are you?',\n# 'candidates': ['fine', 'good', 'bad'],\n# 'ranks': [2, 1, 3],\n# 'ranked_candidates': ['good', 'fine', 'bad'],\n# 'model_name': 'llm-blender/PairRM',\n# }\n# ]\n
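A shared instruction can also be passed to the ranker through the instructions attribute; it is forwarded to Blender.rank for every input. The prompt text below is a hypothetical placeholder, not a value prescribed by the library:
pair_rm_with_instructions = PairRM(\n instructions=\"Rank the candidates by helpfulness and factual accuracy.\",\n)\npair_rm_with_instructions.load()\n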
Citations @misc{jiang2023llmblenderensemblinglargelanguage,\n title={LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion},\n author={Dongfu Jiang and Xiang Ren and Bill Yuchen Lin},\n year={2023},\n eprint={2306.02561},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2306.02561},\n}\n
Source code in src/distilabel/steps/tasks/pair_rm.py
class PairRM(Step):\n \"\"\"Rank the candidates based on the input using the `LLM` model.\n\n Attributes:\n model: The model to use for the ranking. Defaults to `\"llm-blender/PairRM\"`.\n instructions: The instructions to use for the model. Defaults to `None`.\n\n Input columns:\n - inputs (`List[Dict[str, Any]]`): The input text or conversation to rank the candidates for.\n - candidates (`List[Dict[str, Any]]`): The candidates to rank.\n\n Output columns:\n - ranks (`List[int]`): The ranks of the candidates based on the input.\n - ranked_candidates (`List[Dict[str, Any]]`): The candidates ranked based on the input.\n - model_name (`str`): The model name used to rank the candidate responses. Defaults to `\"llm-blender/PairRM\"`.\n\n References:\n - [LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion](https://arxiv.org/abs/2306.02561).\n - [Pair Ranking Model](https://huggingface.co/llm-blender/PairRM).\n\n Categories:\n - preference\n\n Note:\n This step differs to other tasks as there is a single implementation of this model\n currently, and we will use a specific `LLM`.\n\n Examples:\n Rank LLM candidates:\n\n ```python\n from distilabel.steps.tasks import PairRM\n\n # Consider this as a placeholder for your actual LLM.\n pair_rm = PairRM()\n\n pair_rm.load()\n\n result = next(\n scorer.process(\n [\n {\"input\": \"Hello, how are you?\", \"candidates\": [\"fine\", \"good\", \"bad\"]},\n ]\n )\n )\n # result\n # [\n # {\n # 'input': 'Hello, how are you?',\n # 'candidates': ['fine', 'good', 'bad'],\n # 'ranks': [2, 1, 3],\n # 'ranked_candidates': ['good', 'fine', 'bad'],\n # 'model_name': 'llm-blender/PairRM',\n # }\n # ]\n ```\n\n Citations:\n ```\n @misc{jiang2023llmblenderensemblinglargelanguage,\n title={LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion},\n author={Dongfu Jiang and Xiang Ren and Bill Yuchen Lin},\n year={2023},\n eprint={2306.02561},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2306.02561},\n }\n ```\n \"\"\"\n\n model: str = \"llm-blender/PairRM\"\n instructions: Optional[str] = None\n\n def load(self) -> None:\n \"\"\"Loads the PairRM model provided via `model` with `llm_blender.Blender`, which is the\n custom library for running the inference for the PairRM models.\"\"\"\n try:\n import llm_blender\n except ImportError as e:\n raise ImportError(\n \"The `llm_blender` package is required to use the `PairRM` class.\"\n \"Please install it with `pip install git+https://github.com/yuchenlin/LLM-Blender.git`.\"\n ) from e\n\n self._blender = llm_blender.Blender()\n self._blender.loadranker(self.model)\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The input columns correspond to the two required arguments from `Blender.rank`:\n `inputs` and `candidates`.\"\"\"\n return [\"input\", \"candidates\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n \"\"\"The outputs will include the `ranks` and the `ranked_candidates`.\"\"\"\n return [\"ranks\", \"ranked_candidates\", \"model_name\"]\n\n def format_input(self, input: Dict[str, Any]) -> Dict[str, Any]:\n \"\"\"The input is expected to be a dictionary with the keys `input` and `candidates`,\n where the `input` corresponds to the instruction of a model and `candidates` are a\n list of responses to be ranked.\n \"\"\"\n return {\"input\": input[\"input\"], \"candidates\": input[\"candidates\"]}\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Generates the ranks 
for the candidates based on the input.\n\n The ranks are the positions of the candidates, where lower is better,\n and the ranked candidates correspond to the candidates sorted according to the\n ranks obtained.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n An iterator with the inputs containing the `ranks`, `ranked_candidates`, and `model_name`.\n \"\"\"\n input_texts = []\n candidates = []\n for input in inputs:\n formatted_input = self.format_input(input)\n input_texts.append(formatted_input[\"input\"])\n candidates.append(formatted_input[\"candidates\"])\n\n instructions = (\n [self.instructions] * len(input_texts) if self.instructions else None\n )\n\n ranks = self._blender.rank(\n input_texts,\n candidates,\n instructions=instructions,\n return_scores=False,\n batch_size=self.input_batch_size,\n )\n # Sort the candidates based on the ranks\n ranked_candidates = np.take_along_axis(\n np.array(candidates), ranks - 1, axis=1\n ).tolist()\n ranks = ranks.tolist()\n for input, rank, ranked_candidate in zip(inputs, ranks, ranked_candidates):\n input[\"ranks\"] = rank\n input[\"ranked_candidates\"] = ranked_candidate\n input[\"model_name\"] = self.model\n\n yield inputs\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.PairRM.inputs","title":"inputs
property
","text":"The input columns correspond to the two required arguments from Blender.rank
: inputs
and candidates
.
outputs
property
","text":"The outputs will include the ranks
and the ranked_candidates
.
load()
","text":"Loads the PairRM model provided via model
with llm_blender.Blender
, which is the custom library for running the inference for the PairRM models.
src/distilabel/steps/tasks/pair_rm.py
def load(self) -> None:\n \"\"\"Loads the PairRM model provided via `model` with `llm_blender.Blender`, which is the\n custom library for running the inference for the PairRM models.\"\"\"\n try:\n import llm_blender\n except ImportError as e:\n raise ImportError(\n \"The `llm_blender` package is required to use the `PairRM` class.\"\n \"Please install it with `pip install git+https://github.com/yuchenlin/LLM-Blender.git`.\"\n ) from e\n\n self._blender = llm_blender.Blender()\n self._blender.loadranker(self.model)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.PairRM.format_input","title":"format_input(input)
","text":"The input is expected to be a dictionary with the keys input
and candidates
, where the input
corresponds to the instruction of a model and candidates
are a list of responses to be ranked.
src/distilabel/steps/tasks/pair_rm.py
def format_input(self, input: Dict[str, Any]) -> Dict[str, Any]:\n \"\"\"The input is expected to be a dictionary with the keys `input` and `candidates`,\n where the `input` corresponds to the instruction of a model and `candidates` are a\n list of responses to be ranked.\n \"\"\"\n return {\"input\": input[\"input\"], \"candidates\": input[\"candidates\"]}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.PairRM.process","title":"process(inputs)
","text":"Generates the ranks for the candidates based on the input.
The ranks are the positions of the candidates, where lower is better, and the ranked candidates correspond to the candidates sorted according to the ranks obtained.
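As a minimal standalone sketch (plain NumPy, reusing the illustrative values from the example above) of how the ranks returned by the ranker map onto ranked_candidates:
import numpy as np\n\ncandidates = np.array([[\"fine\", \"good\", \"bad\"]])  # one row of candidates per input\nranks = np.array([[2, 1, 3]])  # lower rank is better\nranked_candidates = np.take_along_axis(candidates, ranks - 1, axis=1).tolist()\n# ranked_candidates == [[\"good\", \"fine\", \"bad\"]]\n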
Parameters:
- inputs (`StepInput`): A list of Python dictionaries with the inputs of the task. (required)

Yields:
- `StepOutput`: An iterator with the inputs containing the `ranks`, `ranked_candidates`, and `model_name`.
src/distilabel/steps/tasks/pair_rm.py
def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n \"\"\"Generates the ranks for the candidates based on the input.\n\n The ranks are the positions of the candidates, where lower is better,\n and the ranked candidates correspond to the candidates sorted according to the\n ranks obtained.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n An iterator with the inputs containing the `ranks`, `ranked_candidates`, and `model_name`.\n \"\"\"\n input_texts = []\n candidates = []\n for input in inputs:\n formatted_input = self.format_input(input)\n input_texts.append(formatted_input[\"input\"])\n candidates.append(formatted_input[\"candidates\"])\n\n instructions = (\n [self.instructions] * len(input_texts) if self.instructions else None\n )\n\n ranks = self._blender.rank(\n input_texts,\n candidates,\n instructions=instructions,\n return_scores=False,\n batch_size=self.input_batch_size,\n )\n # Sort the candidates based on the ranks\n ranked_candidates = np.take_along_axis(\n np.array(candidates), ranks - 1, axis=1\n ).tolist()\n ranks = ranks.tolist()\n for input, rank, ranked_candidate in zip(inputs, ranks, ranked_candidates):\n input[\"ranks\"] = rank\n input[\"ranked_candidates\"] = ranked_candidate\n input[\"model_name\"] = self.model\n\n yield inputs\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.PrometheusEval","title":"PrometheusEval
","text":" Bases: Task
Critique and rank the quality of generations from an LLM
using Prometheus 2.0.
PrometheusEval
is a task created for Prometheus 2.0, covering both the absolute and relative evaluations. The absolute evaluation i.e. mode=\"absolute\"
is used to evaluate a single generation from an LLM for a given instruction. The relative evaluation i.e. mode=\"relative\"
is used to evaluate two generations from an LLM for a given instruction. Both evaluations can be run with or without a reference answer to compare against, controlled via the reference attribute, and both are based on a score rubric that critiques the generation(s) based on the following default aspects: helpfulness
, harmlessness
, honesty
, factual-validity
, and reasoning
, that can be overridden via rubrics
, and the selected rubric is set via the attribute rubric
.
The PrometheusEval
task is better suited and intended to be used with any of the Prometheus 2.0 models released by Kaist AI, being: https://huggingface.co/prometheus-eval/prometheus-7b-v2.0, and https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0. The critique assessment formatting and quality is not guaranteed if using another model, even though some other models may be able to correctly follow the formatting and generate insightful critiques too.
Attributes:
- mode (`Literal['absolute', 'relative']`): the evaluation mode to use, either `absolute` or `relative`. It defines whether the task will evaluate one or two generations.
- rubric (`str`): the score rubric to use within the prompt to run the critique based on different aspects. Can be any existing key in the `rubrics` attribute, which by default means that it can be: `helpfulness`, `harmlessness`, `honesty`, `factual-validity`, or `reasoning`. Those will only work if using the default `rubrics`, otherwise, the provided `rubrics` should be used.
- rubrics (`Optional[Dict[str, str]]`): a dictionary containing the different rubrics to use for the critique, where the keys are the rubric names and the values are the rubric descriptions. The default rubrics are the following: `helpfulness`, `harmlessness`, `honesty`, `factual-validity`, and `reasoning`.
- reference (`bool`): a boolean flag to indicate whether a reference answer / completion will be provided, so that the model critique is based on the comparison with it. It implies that the column `reference` needs to be provided within the input data in addition to the rest of the inputs.
- _template (`Union[Template, None]`): a Jinja2 template used to format the input for the LLM.

Input columns:
- instruction (`str`): The instruction to use as reference.
- generation (`str`, optional): The generated text from the given `instruction`. This column is required if `mode=absolute`.
- generations (`List[str]`, optional): The generated texts from the given `instruction`. It should contain 2 generations only. This column is required if `mode=relative`.
- reference (`str`, optional): The reference / golden answer for the `instruction`, to be used by the LLM for comparison against.

Output columns:
- feedback (`str`): The feedback explaining the result below, as critiqued by the LLM using the pre-defined score rubric, compared against `reference` if provided.
- result (`Union[int, Literal[\"A\", \"B\"]]`): If `mode=absolute`, then the result contains the score for the `generation` in a likert-scale from 1-5, otherwise, if `mode=relative`, then the result contains either \"A\" or \"B\", the \"winning\" one being the generation in the index 0 of `generations` if `result='A'` or the index 1 if `result='B'`.
- model_name (`str`): The model name used to generate the `feedback` and `result`.

Examples:
Critique and evaluate LLM generation quality using Prometheus 2.0:
from distilabel.steps.tasks import PrometheusEval\nfrom distilabel.models import vLLM\n\n# Consider this as a placeholder for your actual LLM.\nprometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"factual-validity\"\n)\n\nprometheus.load()\n\nresult = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", \"generation\": \"something done\"},\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'make something',\n# 'generation': 'something done',\n# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n# 'feedback': 'the feedback',\n# 'result': 6,\n# }\n# ]\n
Critique for relative evaluation:
from distilabel.steps.tasks import PrometheusEval\nfrom distilabel.models import vLLM\n\n# Consider this as a placeholder for your actual LLM.\nprometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]\",\n ),\n mode=\"relative\",\n rubric=\"honesty\"\n)\n\nprometheus.load()\n\nresult = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", \"generations\": [\"something done\", \"other thing\"]},\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'make something',\n# 'generations': ['something done', 'other thing'],\n# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n# 'feedback': 'the feedback',\n# 'result': 'something done',\n# }\n# ]\n
Critique with a custom rubric:
from distilabel.steps.tasks import PrometheusEval\nfrom distilabel.models import vLLM\n\n# Consider this as a placeholder for your actual LLM.\nprometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"custom\",\n rubrics={\n \"custom\": \"[A]\\nScore 1: A\\nScore 2: B\\nScore 3: C\\nScore 4: D\\nScore 5: E\"\n }\n)\n\nprometheus.load()\n\nresult = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", \"generation\": \"something done\"},\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'make something',\n# 'generation': 'something done',\n# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n# 'feedback': 'the feedback',\n# 'result': 6,\n# }\n# ]\n
Critique using a reference answer:
from distilabel.steps.tasks import PrometheusEval\nfrom distilabel.models import vLLM\n\n# Consider this as a placeholder for your actual LLM.\nprometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"helpfulness\",\n reference=True,\n)\n\nprometheus.load()\n\nresult = next(\n prometheus.process(\n [\n {\n \"instruction\": \"make something\",\n \"generation\": \"something done\",\n \"reference\": \"this is a reference answer\",\n },\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'make something',\n# 'generation': 'something done',\n# 'reference': 'this is a reference answer',\n# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n# 'feedback': 'the feedback',\n# 'result': 6,\n# }\n# ]\n
Citations @misc{kim2024prometheus2opensource,\n title={Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models},\n author={Seungone Kim and Juyoung Suk and Shayne Longpre and Bill Yuchen Lin and Jamin Shin and Sean Welleck and Graham Neubig and Moontae Lee and Kyungjae Lee and Minjoon Seo},\n year={2024},\n eprint={2405.01535},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2405.01535},\n}\n
Source code in src/distilabel/steps/tasks/prometheus_eval.py
class PrometheusEval(Task):\n \"\"\"Critique and rank the quality of generations from an `LLM` using Prometheus 2.0.\n\n `PrometheusEval` is a task created for Prometheus 2.0, covering both the absolute and relative\n evaluations. The absolute evaluation i.e. `mode=\"absolute\"` is used to evaluate a single generation from\n an LLM for a given instruction. The relative evaluation i.e. `mode=\"relative\"` is used to evaluate two generations from an LLM\n for a given instruction.\n Both evaluations provide the possibility of using a reference answer to compare with or withoug\n the `reference` attribute, and both are based on a score rubric that critiques the generation/s\n based on the following default aspects: `helpfulness`, `harmlessness`, `honesty`, `factual-validity`,\n and `reasoning`, that can be overridden via `rubrics`, and the selected rubric is set via the attribute\n `rubric`.\n\n Note:\n The `PrometheusEval` task is better suited and intended to be used with any of the Prometheus 2.0\n models released by Kaist AI, being: https://huggingface.co/prometheus-eval/prometheus-7b-v2.0,\n and https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0. The critique assessment formatting\n and quality is not guaranteed if using another model, even though some other models may be able to\n correctly follow the formatting and generate insightful critiques too.\n\n Attributes:\n mode: the evaluation mode to use, either `absolute` or `relative`. It defines whether the task\n will evaluate one or two generations.\n rubric: the score rubric to use within the prompt to run the critique based on different aspects.\n Can be any existing key in the `rubrics` attribute, which by default means that it can be:\n `helpfulness`, `harmlessness`, `honesty`, `factual-validity`, or `reasoning`. Those will only\n work if using the default `rubrics`, otherwise, the provided `rubrics` should be used.\n rubrics: a dictionary containing the different rubrics to use for the critique, where the keys are\n the rubric names and the values are the rubric descriptions. The default rubrics are the following:\n `helpfulness`, `harmlessness`, `honesty`, `factual-validity`, and `reasoning`.\n reference: a boolean flag to indicate whether a reference answer / completion will be provided, so\n that the model critique is based on the comparison with it. It implies that the column `reference`\n needs to be provided within the input data in addition to the rest of the inputs.\n _template: a Jinja2 template used to format the input for the LLM.\n\n Input columns:\n - instruction (`str`): The instruction to use as reference.\n - generation (`str`, optional): The generated text from the given `instruction`. This column is required\n if `mode=absolute`.\n - generations (`List[str]`, optional): The generated texts from the given `instruction`. It should\n contain 2 generations only. 
This column is required if `mode=relative`.\n - reference (`str`, optional): The reference / golden answer for the `instruction`, to be used by the LLM\n for comparison against.\n\n Output columns:\n - feedback (`str`): The feedback explaining the result below, as critiqued by the LLM using the\n pre-defined score rubric, compared against `reference` if provided.\n - result (`Union[int, Literal[\"A\", \"B\"]]`): If `mode=absolute`, then the result contains the score for the\n `generation` in a likert-scale from 1-5, otherwise, if `mode=relative`, then the result contains either\n \"A\" or \"B\", the \"winning\" one being the generation in the index 0 of `generations` if `result='A'` or the\n index 1 if `result='B'`.\n - model_name (`str`): The model name used to generate the `feedback` and `result`.\n\n Categories:\n - critique\n - preference\n\n References:\n - [Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models](https://arxiv.org/abs/2405.01535)\n - [prometheus-eval: Evaluate your LLM's response with Prometheus \ud83d\udcaf](https://github.com/prometheus-eval/prometheus-eval)\n\n Examples:\n Critique and evaluate LLM generation quality using Prometheus 2_0:\n\n ```python\n from distilabel.steps.tasks import PrometheusEval\n from distilabel.models import vLLM\n\n # Consider this as a placeholder for your actual LLM.\n prometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\\\"content\\\" }}\\\\n{{ messages[1]\\\"content\\\" }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"factual-validity\"\n )\n\n prometheus.load()\n\n result = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", \"generation\": \"something done\"},\n ]\n )\n )\n # result\n # [\n # {\n # 'instruction': 'make something',\n # 'generation': 'something done',\n # 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n # 'feedback': 'the feedback',\n # 'result': 6,\n # }\n # ]\n ```\n\n Critique for relative evaluation:\n\n ```python\n from distilabel.steps.tasks import PrometheusEval\n from distilabel.models import vLLM\n\n # Consider this as a placeholder for your actual LLM.\n prometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\\\"content\\\" }}\\\\n{{ messages[1]\\\"content\\\" }}[/INST]\",\n ),\n mode=\"relative\",\n rubric=\"honesty\"\n )\n\n prometheus.load()\n\n result = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", \"generations\": [\"something done\", \"other thing\"]},\n ]\n )\n )\n # result\n # [\n # {\n # 'instruction': 'make something',\n # 'generations': ['something done', 'other thing'],\n # 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n # 'feedback': 'the feedback',\n # 'result': 'something done',\n # }\n # ]\n ```\n\n Critique with a custom rubric:\n\n ```python\n from distilabel.steps.tasks import PrometheusEval\n from distilabel.models import vLLM\n\n # Consider this as a placeholder for your actual LLM.\n prometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\\\"content\\\" }}\\\\n{{ messages[1]\\\"content\\\" }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"custom\",\n rubrics={\n \"custom\": \"[A]\\\\nScore 1: A\\\\nScore 2: B\\\\nScore 3: C\\\\nScore 4: D\\\\nScore 5: E\"\n }\n )\n\n prometheus.load()\n\n result = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", 
\"generation\": \"something done\"},\n ]\n )\n )\n # result\n # [\n # {\n # 'instruction': 'make something',\n # 'generation': 'something done',\n # 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n # 'feedback': 'the feedback',\n # 'result': 6,\n # }\n # ]\n ```\n\n Critique using a reference answer:\n\n ```python\n from distilabel.steps.tasks import PrometheusEval\n from distilabel.models import vLLM\n\n # Consider this as a placeholder for your actual LLM.\n prometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\\\"content\\\" }}\\\\n{{ messages[1]\\\"content\\\" }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"helpfulness\",\n reference=True,\n )\n\n prometheus.load()\n\n result = next(\n prometheus.process(\n [\n {\n \"instruction\": \"make something\",\n \"generation\": \"something done\",\n \"reference\": \"this is a reference answer\",\n },\n ]\n )\n )\n # result\n # [\n # {\n # 'instruction': 'make something',\n # 'generation': 'something done',\n # 'reference': 'this is a reference answer',\n # 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n # 'feedback': 'the feedback',\n # 'result': 6,\n # }\n # ]\n ```\n\n Citations:\n ```\n @misc{kim2024prometheus2opensource,\n title={Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models},\n author={Seungone Kim and Juyoung Suk and Shayne Longpre and Bill Yuchen Lin and Jamin Shin and Sean Welleck and Graham Neubig and Moontae Lee and Kyungjae Lee and Minjoon Seo},\n year={2024},\n eprint={2405.01535},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2405.01535},\n }\n ```\n \"\"\"\n\n mode: Literal[\"absolute\", \"relative\"]\n rubric: str\n rubrics: Optional[Dict[str, str]] = Field(default=_DEFAULT_RUBRICS)\n reference: bool = False\n\n _template: Union[Template, None] = PrivateAttr(...)\n\n @model_validator(mode=\"after\")\n def validate_rubric_and_rubrics(self) -> Self:\n if not isinstance(self.rubrics, dict) or len(self.rubrics) < 1:\n raise DistilabelUserError(\n \"Provided `rubrics` must be a Python dictionary with string keys and string values.\",\n page=\"components-gallery/tasks/prometheuseval/\",\n )\n\n def rubric_matches_pattern(rubric: str) -> bool:\n \"\"\"Checks if the provided rubric matches the pattern of the default rubrics.\"\"\"\n pattern = r\"^\\[.*?\\]\\n(?:Score [1-4]: .*?\\n){4}(?:Score 5: .*?)\"\n return bool(re.match(pattern, rubric, re.MULTILINE))\n\n if not all(rubric_matches_pattern(value) for value in self.rubrics.values()):\n raise DistilabelUserError(\n \"Provided rubrics should match the format of the default rubrics, which\"\n \" is as follows: `[<scoring criteria>]\\nScore 1: <description>\\nScore 2: <description>\\n\"\n \"Score 3: <description>\\nScore 4: <description>\\nScore 5: <description>`; replacing\"\n \" `<scoring criteria>` and `<description>` with the actual criteria and description\"\n \" for each or the scores, respectively.\",\n page=\"components-gallery/tasks/prometheuseval/\",\n )\n\n if self.rubric not in self.rubrics:\n raise DistilabelUserError(\n f\"Provided rubric '{self.rubric}' is not among the available rubrics: {', '.join(self.rubrics.keys())}.\",\n page=\"components-gallery/tasks/prometheuseval/\",\n )\n\n return self\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template for Prometheus 2.0 either absolute or relative evaluation\n depending on the `mode` value, and either with or without reference, depending on the\n value of 
`reference`.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"prometheus\"\n / (\n f\"{self.mode}_without_reference.jinja2\"\n if self.reference is False\n else f\"{self.mode}_with_reference.jinja2\"\n )\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The default inputs for the task are the `instruction` and the `generation`\n if `reference=False`, otherwise, the inputs are `instruction`, `generation`, and\n `reference`.\"\"\"\n if self.mode == \"absolute\":\n if self.reference:\n return [\"instruction\", \"generation\", \"reference\"]\n return [\"instruction\", \"generation\"]\n else:\n if self.reference:\n return [\"instruction\", \"generations\", \"reference\"]\n return [\"instruction\", \"generations\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` where the prompt is formatted according\n to the selected Jinja2 template for Prometheus 2.0, assuming that's the first interaction\n from the user, including a pre-defined system prompt.\"\"\"\n template_kwargs = {\n \"instruction\": input[\"instruction\"],\n \"rubric\": self.rubrics[self.rubric],\n }\n if self.reference:\n template_kwargs[\"reference\"] = input[\"reference\"]\n\n if self.mode == \"absolute\":\n if not isinstance(input[\"generation\"], str):\n raise DistilabelUserError(\n f\"Provided `generation` is of type {type(input['generation'])} but a string\"\n \" should be provided instead.\",\n page=\"components-gallery/tasks/prometheuseval/\",\n )\n\n template_kwargs[\"generation\"] = input[\"generation\"]\n system_message = (\n \"You are a fair judge assistant tasked with providing clear, objective feedback based\"\n \" on specific criteria, ensuring each assessment reflects the absolute standards set\"\n \" for performance.\"\n )\n else: # self.mode == \"relative\"\n if (\n not isinstance(input[\"generations\"], list)\n or not all(\n isinstance(generation, str) for generation in input[\"generations\"]\n )\n or len(input[\"generations\"]) != 2\n ):\n raise DistilabelUserError(\n f\"Provided `generations` is of type {type(input['generations'])} but a list of strings with length 2 should be provided instead.\",\n page=\"components-gallery/tasks/prometheuseval/\",\n )\n\n template_kwargs[\"generations\"] = input[\"generations\"]\n system_message = (\n \"You are a fair judge assistant assigned to deliver insightful feedback that compares\"\n \" individual performances, highlighting how each stands relative to others within the\"\n \" same cohort.\"\n )\n\n return [\n {\n \"role\": \"system\",\n \"content\": system_message,\n },\n {\n \"role\": \"user\",\n \"content\": self._template.render(**template_kwargs), # type: ignore\n },\n ]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task are the `feedback` and the `result` generated by Prometheus,\n as well as the `model_name` which is automatically included based on the `LLM` used.\n \"\"\"\n return [\"feedback\", \"result\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dict with the keys `feedback` and `result` captured\n using a regex from the Prometheus output.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. 
Optionally provided in case it's useful to build the output.\n\n Returns:\n A dict with the keys `feedback` and `result` generated by the LLM.\n \"\"\"\n if output is None:\n return {\"feedback\": None, \"result\": None}\n\n parts = output.split(\"[RESULT]\")\n if len(parts) != 2:\n return {\"feedback\": None, \"result\": None}\n\n feedback, result = parts[0].strip(), parts[1].strip()\n if feedback.startswith(\"Feedback:\"):\n feedback = feedback[len(\"Feedback:\") :].strip()\n if self.mode == \"absolute\":\n if not result.isdigit() or result not in [\"1\", \"2\", \"3\", \"4\", \"5\"]:\n return {\"feedback\": None, \"result\": None}\n return {\"feedback\": feedback, \"result\": int(result)}\n else: # self.mode == \"relative\"\n if result not in [\"A\", \"B\"]:\n return {\"feedback\": None, \"result\": None}\n return {\"feedback\": feedback, \"result\": result}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.PrometheusEval.inputs","title":"inputs
property
","text":"The default inputs for the task are the instruction
and the generation
if reference=False
, otherwise, the inputs are instruction
, generation
, and reference
.
outputs
property
","text":"The output for the task are the feedback
and the result
generated by Prometheus, as well as the model_name
which is automatically included based on the LLM
used.
load()
","text":"Loads the Jinja2 template for Prometheus 2.0 either absolute or relative evaluation depending on the mode
value, and either with or without reference, depending on the value of reference
.
src/distilabel/steps/tasks/prometheus_eval.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template for Prometheus 2.0 either absolute or relative evaluation\n depending on the `mode` value, and either with or without reference, depending on the\n value of `reference`.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"prometheus\"\n / (\n f\"{self.mode}_without_reference.jinja2\"\n if self.reference is False\n else f\"{self.mode}_with_reference.jinja2\"\n )\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.PrometheusEval.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
where the prompt is formatted according to the selected Jinja2 template for Prometheus 2.0, assuming that's the first interaction from the user, including a pre-defined system prompt.
src/distilabel/steps/tasks/prometheus_eval.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` where the prompt is formatted according\n to the selected Jinja2 template for Prometheus 2.0, assuming that's the first interaction\n from the user, including a pre-defined system prompt.\"\"\"\n template_kwargs = {\n \"instruction\": input[\"instruction\"],\n \"rubric\": self.rubrics[self.rubric],\n }\n if self.reference:\n template_kwargs[\"reference\"] = input[\"reference\"]\n\n if self.mode == \"absolute\":\n if not isinstance(input[\"generation\"], str):\n raise DistilabelUserError(\n f\"Provided `generation` is of type {type(input['generation'])} but a string\"\n \" should be provided instead.\",\n page=\"components-gallery/tasks/prometheuseval/\",\n )\n\n template_kwargs[\"generation\"] = input[\"generation\"]\n system_message = (\n \"You are a fair judge assistant tasked with providing clear, objective feedback based\"\n \" on specific criteria, ensuring each assessment reflects the absolute standards set\"\n \" for performance.\"\n )\n else: # self.mode == \"relative\"\n if (\n not isinstance(input[\"generations\"], list)\n or not all(\n isinstance(generation, str) for generation in input[\"generations\"]\n )\n or len(input[\"generations\"]) != 2\n ):\n raise DistilabelUserError(\n f\"Provided `generations` is of type {type(input['generations'])} but a list of strings with length 2 should be provided instead.\",\n page=\"components-gallery/tasks/prometheuseval/\",\n )\n\n template_kwargs[\"generations\"] = input[\"generations\"]\n system_message = (\n \"You are a fair judge assistant assigned to deliver insightful feedback that compares\"\n \" individual performances, highlighting how each stands relative to others within the\"\n \" same cohort.\"\n )\n\n return [\n {\n \"role\": \"system\",\n \"content\": system_message,\n },\n {\n \"role\": \"user\",\n \"content\": self._template.render(**template_kwargs), # type: ignore\n },\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.PrometheusEval.format_output","title":"format_output(output, input)
","text":"The output is formatted as a dict with the keys feedback
and result
captured using a regex from the Prometheus output.
Parameters:
- output (`Union[str, None]`): the raw output of the LLM. (required)
- input (`Dict[str, Any]`): the input to the task. Optionally provided in case it's useful to build the output. (required)

Returns:
- `Dict[str, Any]`: A dict with the keys `feedback` and `result` generated by the LLM.
src/distilabel/steps/tasks/prometheus_eval.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dict with the keys `feedback` and `result` captured\n using a regex from the Prometheus output.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Optionally provided in case it's useful to build the output.\n\n Returns:\n A dict with the keys `feedback` and `result` generated by the LLM.\n \"\"\"\n if output is None:\n return {\"feedback\": None, \"result\": None}\n\n parts = output.split(\"[RESULT]\")\n if len(parts) != 2:\n return {\"feedback\": None, \"result\": None}\n\n feedback, result = parts[0].strip(), parts[1].strip()\n if feedback.startswith(\"Feedback:\"):\n feedback = feedback[len(\"Feedback:\") :].strip()\n if self.mode == \"absolute\":\n if not result.isdigit() or result not in [\"1\", \"2\", \"3\", \"4\", \"5\"]:\n return {\"feedback\": None, \"result\": None}\n return {\"feedback\": feedback, \"result\": int(result)}\n else: # self.mode == \"relative\"\n if result not in [\"A\", \"B\"]:\n return {\"feedback\": None, \"result\": None}\n return {\"feedback\": feedback, \"result\": result}\n
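As an illustrative sketch (made-up text, not real model output), the raw completion this parser expects is a feedback passage followed by a [RESULT] marker; in absolute mode the result must be a digit from 1 to 5, in relative mode \"A\" or \"B\":
raw = \"Feedback: The response follows the rubric and is factually sound. [RESULT] 4\"\n# With mode=\"absolute\", format_output(raw, input) would return:\n# {\"feedback\": \"The response follows the rubric and is factually sound.\", \"result\": 4}\n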
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.QualityScorer","title":"QualityScorer
","text":" Bases: Task
Score responses based on their quality using an LLM
.
QualityScorer
is a pre-defined task that defines the instruction
as the input and score
as the output. This task is used to rate the quality of instructions and responses. It's an implementation of the quality score task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'. The task follows the same scheme as the Complexity Scorer, but the instruction-response pairs are scored in terms of quality, obtaining a quality score for each instruction.
Attributes:
- _template (`Union[Template, None]`): a Jinja2 template used to format the input for the LLM.

Input columns:
- instruction (`str`): The instruction that was used to generate the `responses`.
- responses (`List[str]`): The responses to be scored. Each response forms a pair with the instruction.

Output columns:
- scores (`List[float]`): The score for each instruction.
- model_name (`str`): The model name used to generate the scores.

References:
- [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)
Examples:
Evaluate the quality of your instructions:
from distilabel.steps.tasks import QualityScorer\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nscorer = QualityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n)\n\nscorer.load()\n\nresult = next(\n scorer.process(\n [\n {\n \"instruction\": \"instruction\",\n \"responses\": [\"good response\", \"weird response\", \"bad response\"]\n }\n ]\n )\n)\n# result\n[\n {\n 'instructions': 'instruction',\n 'model_name': 'test',\n 'scores': [5, 3, 1],\n }\n]\n
Generate structured output with default schema:
from distilabel.steps.tasks import QualityScorer\nfrom distilabel.models import InferenceEndpointsLLM\n\nscorer = QualityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n use_default_structured_output=True\n)\n\nscorer.load()\n\nresult = next(\n scorer.process(\n [\n {\n \"instruction\": \"instruction\",\n \"responses\": [\"good response\", \"weird response\", \"bad response\"]\n }\n ]\n )\n)\n\n# result\n[{'instruction': 'instruction',\n'responses': ['good response', 'weird response', 'bad response'],\n'scores': [1, 2, 3],\n'distilabel_metadata': {'raw_output_quality_scorer_0': '{ \"scores\": [1, 2, 3] }'},\n'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
Citations @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n}\n
Source code in src/distilabel/steps/tasks/quality_scorer.py
class QualityScorer(Task):\n \"\"\"Score responses based on their quality using an `LLM`.\n\n `QualityScorer` is a pre-defined task that defines the `instruction` as the input\n and `score` as the output. This task is used to rate the quality of instructions and responses.\n It's an implementation of the quality score task from the paper 'What Makes Good Data\n for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.\n The task follows the same scheme as the Complexity Scorer, but the instruction-response pairs\n are scored in terms of quality, obtaining a quality score for each instruction.\n\n Attributes:\n _template: a Jinja2 template used to format the input for the LLM.\n\n Input columns:\n - instruction (`str`): The instruction that was used to generate the `responses`.\n - responses (`List[str]`): The responses to be scored. Each response forms a pair with the instruction.\n\n Output columns:\n - scores (`List[float]`): The score for each instruction.\n - model_name (`str`): The model name used to generate the scores.\n\n Categories:\n - scorer\n - quality\n - response\n\n References:\n - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)\n\n Examples:\n Evaluate the quality of your instructions:\n\n ```python\n from distilabel.steps.tasks import QualityScorer\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n scorer = QualityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n )\n\n scorer.load()\n\n result = next(\n scorer.process(\n [\n {\n \"instruction\": \"instruction\",\n \"responses\": [\"good response\", \"weird response\", \"bad response\"]\n }\n ]\n )\n )\n # result\n [\n {\n 'instructions': 'instruction',\n 'model_name': 'test',\n 'scores': [5, 3, 1],\n }\n ]\n ```\n\n Generate structured output with default schema:\n\n ```python\n from distilabel.steps.tasks import QualityScorer\n from distilabel.models import InferenceEndpointsLLM\n\n scorer = QualityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n use_default_structured_output=True\n )\n\n scorer.load()\n\n result = next(\n scorer.process(\n [\n {\n \"instruction\": \"instruction\",\n \"responses\": [\"good response\", \"weird response\", \"bad response\"]\n }\n ]\n )\n )\n\n # result\n [{'instruction': 'instruction',\n 'responses': ['good response', 'weird response', 'bad response'],\n 'scores': [1, 2, 3],\n 'distilabel_metadata': {'raw_output_quality_scorer_0': '{ \"scores\": [1, 2, 3] }'},\n 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n\n Citations:\n ```\n @misc{liu2024makesgooddataalignment,\n title={What Makes Good Data for Alignment? 
A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\n author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\n year={2024},\n eprint={2312.15685},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2312.15685},\n }\n ```\n \"\"\"\n\n _template: Union[Template, None] = PrivateAttr(...)\n _can_be_used_with_offline_batch_generation = True\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"quality-scorer.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs for the task are `instruction` and `responses`.\"\"\"\n return [\"instruction\", \"responses\"]\n\n def format_input(self, input: Dict[str, Any]) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n instruction=input[\"instruction\"], responses=input[\"responses\"]\n ),\n }\n ]\n\n @property\n def outputs(self):\n \"\"\"The output for the task is a list of `scores` containing the quality score for each\n response in `responses`.\"\"\"\n return [\"scores\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction-response pair.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with the key `scores` containing the scores for each instruction-response pair.\n \"\"\"\n if output is None:\n return {\"scores\": [None] * len(input[\"responses\"])}\n\n if self.use_default_structured_output:\n return self._format_structured_output(output, input)\n\n scores = []\n score_lines = output.split(\"\\n\")\n\n for i, line in enumerate(score_lines):\n match = _PARSE_SCORE_LINE_REGEX.match(line)\n score = float(match.group(1)) if match else None\n scores.append(score)\n if i == len(input[\"responses\"]) - 1:\n break\n return {\"scores\": scores}\n\n @override\n def get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel\n from typing import List\n\n class SchemaQualityScorer(BaseModel):\n scores: List[int]\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"properties\": {\n \"scores\": {\n \"items\": {\"type\": \"integer\"},\n \"title\": \"Scores\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"scores\"],\n \"title\": \"SchemaQualityScorer\",\n \"type\": \"object\",\n }\n\n def _format_structured_output(\n self, output: str, input: Dict[str, Any]\n ) -> Dict[str, str]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with the scores, and a list with them.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n return {\"scores\": [None] * len(input[\"responses\"])}\n\n @override\n def _sample_input(self) -> 
ChatType:\n return self.format_input(\n {\n \"instruction\": f\"<PLACEHOLDER_{'instruction'.upper()}>\",\n \"responses\": [\n f\"<PLACEHOLDER_{f'RESPONSE_{i}'.upper()}>\" for i in range(2)\n ],\n }\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.QualityScorer.inputs","title":"inputs
property
","text":"The inputs for the task are instruction
and responses
.
outputs
property
","text":"The output for the task is a list of scores
containing the quality score for each response in responses
.
load()
","text":"Loads the Jinja2 template.
Source code insrc/distilabel/steps/tasks/quality_scorer.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"quality-scorer.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.QualityScorer.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/quality_scorer.py
def format_input(self, input: Dict[str, Any]) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n instruction=input[\"instruction\"], responses=input[\"responses\"]\n ),\n }\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.QualityScorer.format_output","title":"format_output(output, input)
","text":"The output is formatted as a list with the score of each instruction-response pair.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
the raw output of the LLM.
requiredinput
Dict[str, Any]
the input to the task. Used for obtaining the number of responses.
requiredReturns:
Type DescriptionDict[str, Any]
A dict with the key scores
containing the scores for each instruction-response pair.
src/distilabel/steps/tasks/quality_scorer.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the score of each instruction-response pair.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with the key `scores` containing the scores for each instruction-response pair.\n \"\"\"\n if output is None:\n return {\"scores\": [None] * len(input[\"responses\"])}\n\n if self.use_default_structured_output:\n return self._format_structured_output(output, input)\n\n scores = []\n score_lines = output.split(\"\\n\")\n\n for i, line in enumerate(score_lines):\n match = _PARSE_SCORE_LINE_REGEX.match(line)\n score = float(match.group(1)) if match else None\n scores.append(score)\n if i == len(input[\"responses\"]) - 1:\n break\n return {\"scores\": scores}\n
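As a quick illustration of the behaviour above, here is a hedged sketch that assumes an already-loaded QualityScorer named scorer with use_default_structured_output=True; the raw completion shape is taken from the structured example earlier:
# Sketch only: `scorer` is an assumed, already-loaded QualityScorer.
raw = '{ "scores": [1, 2, 3] }'
scorer.format_output(
    output=raw,
    input={'instruction': 'instruction', 'responses': ['a', 'b', 'c']},
)
# {'scores': [1, 2, 3]}

# A `None` output (e.g. a failed generation) falls back to one `None` per response:
scorer.format_output(output=None, input={'responses': ['a', 'b', 'c']})
# {'scores': [None, None, None]}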
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.QualityScorer.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
The schema corresponds to the following:
from pydantic import BaseModel\nfrom typing import List\n\nclass SchemaQualityScorer(BaseModel):\n scores: List[int]\n
Returns:
Type DescriptionDict[str, Any]
JSON Schema of the response to enforce.
Source code insrc/distilabel/steps/tasks/quality_scorer.py
@override\ndef get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel\n from typing import List\n\n class SchemaQualityScorer(BaseModel):\n scores: List[int]\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n return {\n \"properties\": {\n \"scores\": {\n \"items\": {\"type\": \"integer\"},\n \"title\": \"Scores\",\n \"type\": \"array\",\n }\n },\n \"required\": [\"scores\"],\n \"title\": \"SchemaQualityScorer\",\n \"type\": \"object\",\n }\n
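The schema above matches the pydantic model shown in the docstring; as a small self-contained demo (pydantic v2 is assumed here purely for illustration, distilabel itself parses the response with orjson), the same model can validate a raw structured completion:
# Demo only: validate a structured completion against the schema above with pydantic v2.
from typing import List

from pydantic import BaseModel


class SchemaQualityScorer(BaseModel):
    scores: List[int]


raw = '{ "scores": [1, 2, 3] }'
print(SchemaQualityScorer.model_validate_json(raw).scores)  # [1, 2, 3]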
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.QualityScorer._format_structured_output","title":"_format_structured_output(output, input)
","text":"Parses the structured response, which should correspond to a dictionary with the scores, and a list with them.
Parameters:
Name Type Description Defaultoutput
str
The output from the LLM
.
Returns:
Type DescriptionDict[str, str]
Formatted output.
Source code insrc/distilabel/steps/tasks/quality_scorer.py
def _format_structured_output(\n self, output: str, input: Dict[str, Any]\n) -> Dict[str, str]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with the scores, and a list with them.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n return {\"scores\": [None] * len(input[\"responses\"])}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.SelfInstruct","title":"SelfInstruct
","text":" Bases: Task
Generate instructions based on a given input using an LLM
.
SelfInstruct
is a pre-defined task that, given a number of instructions, certain criteria for query generation, an application description, and an input, generates a number of instructions related to the given input, following what is stated in the criteria for query generation and the application description. It is based on the SelfInstruct framework from the paper \"Self-Instruct: Aligning Language Models with Self-Generated Instructions\".
Attributes:
Name Type Descriptionnum_instructions
int
The number of instructions to be generated. Defaults to 5.
criteria_for_query_generation
str
The criteria for the query generation. Defaults to the criteria defined within the paper.
application_description
str
The description of the AI application that one wants to build with these instructions. Defaults to AI assistant
.
str
): The input to generate the instructions. It's also called seed in the paper.List[str]
): The generated instructions.str
): The model name used to generate the instructions.Self-Instruct: Aligning Language Models with Self-Generated Instructions
Examples:
Generate instructions based on a given input:
from distilabel.steps.tasks import SelfInstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\nself_instruct = SelfInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_instructions=5, # This is the default value\n)\n\nself_instruct.load()\n\nresult = next(self_instruct.process([{\"input\": \"instruction\"}]))\n# result\n# [\n# {\n# 'input': 'instruction',\n# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n# 'instructions': [\"instruction 1\", \"instruction 2\", \"instruction 3\", \"instruction 4\", \"instruction 5\"],\n# }\n# ]\n
Citations @misc{wang2023selfinstructaligninglanguagemodels,\n title={Self-Instruct: Aligning Language Models with Self-Generated Instructions},\n author={Yizhong Wang and Yeganeh Kordi and Swaroop Mishra and Alisa Liu and Noah A. Smith and Daniel Khashabi and Hannaneh Hajishirzi},\n year={2023},\n eprint={2212.10560},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2212.10560},\n}\n
Source code in src/distilabel/steps/tasks/self_instruct.py
class SelfInstruct(Task):\n \"\"\"Generate instructions based on a given input using an `LLM`.\n\n `SelfInstruct` is a pre-defined task that, given a number of instructions, a\n certain criteria for query generations, an application description, and an input,\n generates a number of instruction related to the given input and following what\n is stated in the criteria for query generation and the application description.\n It is based in the SelfInstruct framework from the paper \"Self-Instruct: Aligning\n Language Models with Self-Generated Instructions\".\n\n Attributes:\n num_instructions: The number of instructions to be generated. Defaults to 5.\n criteria_for_query_generation: The criteria for the query generation. Defaults\n to the criteria defined within the paper.\n application_description: The description of the AI application that one want\n to build with these instructions. Defaults to `AI assistant`.\n\n Input columns:\n - input (`str`): The input to generate the instructions. It's also called seed in\n the paper.\n\n Output columns:\n - instructions (`List[str]`): The generated instructions.\n - model_name (`str`): The model name used to generate the instructions.\n\n Categories:\n - text-generation\n\n Reference:\n - [`Self-Instruct: Aligning Language Models with Self-Generated Instructions`](https://arxiv.org/abs/2212.10560)\n\n Examples:\n Generate instructions based on a given input:\n\n ```python\n from distilabel.steps.tasks import SelfInstruct\n from distilabel.models import InferenceEndpointsLLM\n\n self_instruct = SelfInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_instructions=5, # This is the default value\n )\n\n self_instruct.load()\n\n result = next(self_instruct.process([{\"input\": \"instruction\"}]))\n # result\n # [\n # {\n # 'input': 'instruction',\n # 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n # 'instructions': [\"instruction 1\", \"instruction 2\", \"instruction 3\", \"instruction 4\", \"instruction 5\"],\n # }\n # ]\n ```\n\n Citations:\n ```\n @misc{wang2023selfinstructaligninglanguagemodels,\n title={Self-Instruct: Aligning Language Models with Self-Generated Instructions},\n author={Yizhong Wang and Yeganeh Kordi and Swaroop Mishra and Alisa Liu and Noah A. Smith and Daniel Khashabi and Hannaneh Hajishirzi},\n year={2023},\n eprint={2212.10560},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2212.10560},\n }\n ```\n \"\"\"\n\n num_instructions: int = 5\n criteria_for_query_generation: str = (\n \"Incorporate a diverse range of verbs, avoiding repetition.\\n\"\n \"Ensure queries are compatible with AI model's text generation functions and are limited to 1-2 sentences.\\n\"\n \"Design queries to be self-contained and standalone.\\n\"\n 'Blend interrogative (e.g., \"What is the significance of x?\") and imperative (e.g., \"Detail the process of x.\") styles.'\n )\n application_description: str = \"AI assistant\"\n\n _template: Union[Template, None] = PrivateAttr(...)\n _can_be_used_with_offline_batch_generation = True\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"self-instruct.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task is the `input` i.e. 
seed text.\"\"\"\n return [\"input\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n input=input[\"input\"],\n application_description=self.application_description,\n criteria_for_query_generation=self.criteria_for_query_generation,\n num_instructions=self.num_instructions,\n ),\n }\n ]\n\n @property\n def outputs(self):\n \"\"\"The output for the task is a list of `instructions` containing the generated instructions.\"\"\"\n return [\"instructions\", \"model_name\"]\n\n def format_output(\n self,\n output: Union[str, None],\n input: Optional[Dict[str, Any]] = None,\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the generated instructions.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with containing the generated instructions.\n \"\"\"\n if output is None:\n return {\"instructions\": []}\n return {\"instructions\": [line for line in output.split(\"\\n\") if line != \"\"]}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.SelfInstruct.inputs","title":"inputs
property
","text":"The input for the task is the input
i.e. seed text.
outputs
property
","text":"The output for the task is a list of instructions
containing the generated instructions.
load()
","text":"Loads the Jinja2 template.
Source code insrc/distilabel/steps/tasks/self_instruct.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"self-instruct.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.SelfInstruct.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/self_instruct.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n input=input[\"input\"],\n application_description=self.application_description,\n criteria_for_query_generation=self.criteria_for_query_generation,\n num_instructions=self.num_instructions,\n ),\n }\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.SelfInstruct.format_output","title":"format_output(output, input=None)
","text":"The output is formatted as a list with the generated instructions.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
the raw output of the LLM.
requiredinput
Optional[Dict[str, Any]]
the input to the task. Used for obtaining the number of responses.
None
Returns:
Type DescriptionDict[str, Any]
A dict containing the generated instructions.
Source code insrc/distilabel/steps/tasks/self_instruct.py
def format_output(\n self,\n output: Union[str, None],\n input: Optional[Dict[str, Any]] = None,\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a list with the generated instructions.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. Used for obtaining the number of responses.\n\n Returns:\n A dict with containing the generated instructions.\n \"\"\"\n if output is None:\n return {\"instructions\": []}\n return {\"instructions\": [line for line in output.split(\"\\n\") if line != \"\"]}\n
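A minimal sketch of the line-splitting behaviour above, assuming an already-loaded SelfInstruct task named self_instruct (the raw completion is invented for illustration):
# Sketch only: `self_instruct` is an assumed, already-loaded SelfInstruct task.
raw = 'What is the significance of 3D printing?\n\nDetail the process of fused deposition modeling.'
self_instruct.format_output(output=raw)
# {'instructions': ['What is the significance of 3D printing?',
#                   'Detail the process of fused deposition modeling.']}
# Empty lines are dropped, and a `None` output yields {'instructions': []}.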
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateSentencePair","title":"GenerateSentencePair
","text":" Bases: Task
Generate a positive and, optionally, a negative sentence given an anchor sentence.
GenerateSentencePair
is a pre-defined task that, given an anchor sentence, generates a positive sentence related to the anchor and, optionally, a negative sentence either unrelated to the anchor or similar to it. Optionally, you can give a context to guide the LLM towards more specific behavior. This task is useful for generating training datasets for embedding models.
Attributes:
Name Type Descriptiontriplet
bool
a flag to indicate if the task should generate a triplet of sentences (anchor, positive, negative). Defaults to False
.
action
GenerationAction
the action to perform to generate the positive sentence.
context
str
the context to use for the generation. Can be helpful to guide the LLM towards more specific behavior. Not used by default.
hard_negative
bool
A flag to indicate whether the negative should be a hard negative. Hard negatives have a higher degree of semantic similarity to the positive, which makes them harder for the model to distinguish from it.
Input columnsstr
): The anchor sentence to generate the positive and negative sentences.str
): The positive sentence related to the anchor
.str
): The negative sentence unrelated to the anchor
if triplet=True
, or more similar to the positive to make it more challenging for a model to distinguish in case hard_negative=True
.str
): The name of the model that was used to generate the sentences.Examples:
Paraphrasing:
from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"paraphrase\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"What Game of Thrones villain would be the most likely to give you mercy?\"}])\n
Generating semantically similar sentences:
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import GenerateSentencePair\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"semantically-similar\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"How does 3D printing work?\"}])\n
Generating queries:
from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"Argilla is an open-source data curation platform for LLMs. Using Argilla, ...\"}])\n
Generating answers:
from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"answer\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"What Game of Thrones villain would be the most likely to give you mercy?\"}])\n
Generating queries with context (applies to every action):
from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n context=\"Argilla is an open-source data curation platform for LLMs.\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"I want to generate queries for my LLM.\"}])\n
Generating Hard-negatives (applies to every action):
from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n context=\"Argilla is an open-source data curation platform for LLMs.\",\n hard_negative=True,\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"I want to generate queries for my LLM.\"}])\n
Generating structured data with default schema (applies to every action):
from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n context=\"Argilla is an open-source data curation platform for LLMs.\",\n hard_negative=True,\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n use_default_structured_output=True\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"I want to generate queries for my LLM.\"}])\n
Source code in src/distilabel/steps/tasks/sentence_transformers.py
class GenerateSentencePair(Task):\n \"\"\"Generate a positive and negative (optionally) sentences given an anchor sentence.\n\n `GenerateSentencePair` is a pre-defined task that given an anchor sentence generates\n a positive sentence related to the anchor and optionally a negative sentence unrelated\n to the anchor or similar to it. Optionally, you can give a context to guide the LLM\n towards more specific behavior. This task is useful to generate training datasets for\n training embeddings models.\n\n Attributes:\n triplet: a flag to indicate if the task should generate a triplet of sentences\n (anchor, positive, negative). Defaults to `False`.\n action: the action to perform to generate the positive sentence.\n context: the context to use for the generation. Can be helpful to guide the LLM\n towards more specific context. Not used by default.\n hard_negative: A flag to indicate if the negative should be a hard-negative or not.\n Hard negatives make it hard for the model to distinguish against the positive,\n with a higher degree of semantic similarity.\n\n Input columns:\n - anchor (`str`): The anchor sentence to generate the positive and negative sentences.\n\n Output columns:\n - positive (`str`): The positive sentence related to the `anchor`.\n - negative (`str`): The negative sentence unrelated to the `anchor` if `triplet=True`,\n or more similar to the positive to make it more challenging for a model to distinguish\n in case `hard_negative=True`.\n - model_name (`str`): The name of the model that was used to generate the sentences.\n\n Categories:\n - embedding\n\n Examples:\n Paraphrasing:\n\n ```python\n from distilabel.steps.tasks import GenerateSentencePair\n from distilabel.models import InferenceEndpointsLLM\n\n generate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"paraphrase\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n )\n\n generate_sentence_pair.load()\n\n result = generate_sentence_pair.process([{\"anchor\": \"What Game of Thrones villain would be the most likely to give you mercy?\"}])\n ```\n\n Generating semantically similar sentences:\n\n ```python\n from distilabel.models import InferenceEndpointsLLM\n from distilabel.steps.tasks import GenerateSentencePair\n\n generate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"semantically-similar\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n )\n\n generate_sentence_pair.load()\n\n result = generate_sentence_pair.process([{\"anchor\": \"How does 3D printing work?\"}])\n ```\n\n Generating queries:\n\n ```python\n from distilabel.steps.tasks import GenerateSentencePair\n from distilabel.models import InferenceEndpointsLLM\n\n generate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n )\n\n generate_sentence_pair.load()\n\n result = generate_sentence_pair.process([{\"anchor\": \"Argilla is an open-source data curation platform for LLMs. 
Using Argilla, ...\"}])\n ```\n\n Generating answers:\n\n ```python\n from distilabel.steps.tasks import GenerateSentencePair\n from distilabel.models import InferenceEndpointsLLM\n\n generate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"answer\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n )\n\n generate_sentence_pair.load()\n\n result = generate_sentence_pair.process([{\"anchor\": \"What Game of Thrones villain would be the most likely to give you mercy?\"}])\n ```\n\n Generating queries with context (**applies to every action**):\n\n ```python\n from distilabel.steps.tasks import GenerateSentencePair\n from distilabel.models import InferenceEndpointsLLM\n\n generate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n context=\"Argilla is an open-source data curation platform for LLMs.\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n )\n\n generate_sentence_pair.load()\n\n result = generate_sentence_pair.process([{\"anchor\": \"I want to generate queries for my LLM.\"}])\n ```\n\n Generating Hard-negatives (**applies to every action**):\n\n ```python\n from distilabel.steps.tasks import GenerateSentencePair\n from distilabel.models import InferenceEndpointsLLM\n\n generate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n context=\"Argilla is an open-source data curation platform for LLMs.\",\n hard_negative=True,\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n )\n\n generate_sentence_pair.load()\n\n result = generate_sentence_pair.process([{\"anchor\": \"I want to generate queries for my LLM.\"}])\n ```\n\n Generating structured data with default schema (**applies to every action**):\n\n ```python\n from distilabel.steps.tasks import GenerateSentencePair\n from distilabel.models import InferenceEndpointsLLM\n\n generate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n context=\"Argilla is an open-source data curation platform for LLMs.\",\n hard_negative=True,\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n use_default_structured_output=True\n )\n\n generate_sentence_pair.load()\n\n result = generate_sentence_pair.process([{\"anchor\": \"I want to generate queries for my LLM.\"}])\n ```\n \"\"\"\n\n triplet: bool = False\n action: GenerationAction\n hard_negative: bool = False\n context: str = \"\"\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"generate-sentence-pair.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs for the task is the `anchor` sentence.\"\"\"\n return [\"anchor\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The inputs are formatted as a `ChatType`, with a system prompt describing the\n task of generating a positive and negative sentences 
for the anchor sentence. The\n anchor is provided as the first user interaction in the conversation.\n\n Args:\n input: The input containing the `anchor` sentence.\n\n Returns:\n A list of dictionaries containing the system and user interactions.\n \"\"\"\n action_sentence = GENERATION_ACTION_SENTENCES[self.action]\n\n format_system_prompt = {\n \"action_sentence\": action_sentence,\n \"context\": CONTEXT_INTRO if self.context else \"\",\n }\n if self.triplet:\n format_system_prompt[\"negative_style\"] = NEGATIVE_STYLE[\n \"hard-negative\" if self.hard_negative else \"negative\"\n ]\n\n system_prompt = (\n POSITIVE_NEGATIVE_SYSTEM_PROMPT if self.triplet else POSITIVE_SYSTEM_PROMPT\n ).format(**format_system_prompt)\n\n return [\n {\"role\": \"system\", \"content\": system_prompt},\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n anchor=input[\"anchor\"],\n context=self.context if self.context else None,\n ),\n },\n ]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The outputs for the task are the `positive` and `negative` sentences, as well\n as the `model_name` used to generate the sentences.\"\"\"\n columns = [\"positive\", \"negative\"] if self.triplet else [\"positive\"]\n columns += [\"model_name\"]\n return columns\n\n def format_output(\n self, output: Union[str, None], input: Optional[Dict[str, Any]] = None\n ) -> Dict[str, Any]:\n \"\"\"Formats the output of the LLM, to extract the `positive` and `negative` sentences\n generated. If the output is `None` or the regex doesn't match, then the outputs\n will be set to `None` as well.\n\n Args:\n output: The output of the LLM.\n input: The input used to generate the output.\n\n Returns:\n The formatted output containing the `positive` and `negative` sentences.\n \"\"\"\n if output is None:\n return {\"positive\": None, \"negative\": None}\n\n if self.use_default_structured_output:\n return self._format_structured_output(output)\n\n match = POSITIVE_NEGATIVE_PAIR_REGEX.match(output)\n if match is None:\n formatted_output = {\"positive\": None}\n if self.triplet:\n formatted_output[\"negative\"] = None\n return formatted_output\n\n groups = match.groups()\n if self.triplet:\n return {\n \"positive\": groups[0].strip(),\n \"negative\": (\n groups[1].strip()\n if len(groups) > 1 and groups[1] is not None\n else None\n ),\n }\n\n return {\"positive\": groups[0].strip()}\n\n @override\n def get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n if self.triplet:\n return {\n \"properties\": {\n \"positive\": {\"title\": \"Positive\", \"type\": \"string\"},\n \"negative\": {\"title\": \"Negative\", \"type\": \"string\"},\n },\n \"required\": [\"positive\", \"negative\"],\n \"title\": \"Schema\",\n \"type\": \"object\",\n }\n return {\n \"properties\": {\"positive\": {\"title\": \"Positive\", \"type\": \"string\"}},\n \"required\": [\"positive\"],\n \"title\": \"Schema\",\n \"type\": \"object\",\n }\n\n def _format_structured_output(self, output: str) -> Dict[str, str]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with either `positive`, or `positive` and `negative` keys.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n if self.triplet:\n return 
{\"positive\": None, \"negative\": None}\n return {\"positive\": None}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateSentencePair.inputs","title":"inputs
property
","text":"The inputs for the task is the anchor
sentence.
outputs
property
","text":"The outputs for the task are the positive
and negative
sentences, as well as the model_name
used to generate the sentences.
load()
","text":"Loads the Jinja2 template.
Source code insrc/distilabel/steps/tasks/sentence_transformers.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"generate-sentence-pair.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateSentencePair.format_input","title":"format_input(input)
","text":"The inputs are formatted as a ChatType
, with a system prompt describing the task of generating a positive and negative sentences for the anchor sentence. The anchor is provided as the first user interaction in the conversation.
Parameters:
Name Type Description Defaultinput
Dict[str, Any]
The input containing the anchor
sentence.
Returns:
Type DescriptionChatType
A list of dictionaries containing the system and user interactions.
Source code insrc/distilabel/steps/tasks/sentence_transformers.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The inputs are formatted as a `ChatType`, with a system prompt describing the\n task of generating a positive and negative sentences for the anchor sentence. The\n anchor is provided as the first user interaction in the conversation.\n\n Args:\n input: The input containing the `anchor` sentence.\n\n Returns:\n A list of dictionaries containing the system and user interactions.\n \"\"\"\n action_sentence = GENERATION_ACTION_SENTENCES[self.action]\n\n format_system_prompt = {\n \"action_sentence\": action_sentence,\n \"context\": CONTEXT_INTRO if self.context else \"\",\n }\n if self.triplet:\n format_system_prompt[\"negative_style\"] = NEGATIVE_STYLE[\n \"hard-negative\" if self.hard_negative else \"negative\"\n ]\n\n system_prompt = (\n POSITIVE_NEGATIVE_SYSTEM_PROMPT if self.triplet else POSITIVE_SYSTEM_PROMPT\n ).format(**format_system_prompt)\n\n return [\n {\"role\": \"system\", \"content\": system_prompt},\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n anchor=input[\"anchor\"],\n context=self.context if self.context else None,\n ),\n },\n ]\n
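To make the message layout concrete, here is a hedged sketch assuming an already-loaded GenerateSentencePair task named generate_sentence_pair; the exact wording comes from the task's internal system prompts and template, so only the structure is shown:
# Sketch only: the contents are placeholders, the two-message structure is the point.
messages = generate_sentence_pair.format_input({'anchor': 'How does 3D printing work?'})
# [
#     {'role': 'system', 'content': '<system prompt selecting the action and negative style>'},
#     {'role': 'user', 'content': '<rendered template containing the anchor and, if set, the context>'},
# ]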
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateSentencePair.format_output","title":"format_output(output, input=None)
","text":"Formats the output of the LLM, to extract the positive
and negative
sentences generated. If the output is None
or the regex doesn't match, then the outputs will be set to None
as well.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
The output of the LLM.
requiredinput
Optional[Dict[str, Any]]
The input used to generate the output.
None
Returns:
Type DescriptionDict[str, Any]
The formatted output containing the positive
and negative
sentences.
src/distilabel/steps/tasks/sentence_transformers.py
def format_output(\n self, output: Union[str, None], input: Optional[Dict[str, Any]] = None\n) -> Dict[str, Any]:\n \"\"\"Formats the output of the LLM, to extract the `positive` and `negative` sentences\n generated. If the output is `None` or the regex doesn't match, then the outputs\n will be set to `None` as well.\n\n Args:\n output: The output of the LLM.\n input: The input used to generate the output.\n\n Returns:\n The formatted output containing the `positive` and `negative` sentences.\n \"\"\"\n if output is None:\n return {\"positive\": None, \"negative\": None}\n\n if self.use_default_structured_output:\n return self._format_structured_output(output)\n\n match = POSITIVE_NEGATIVE_PAIR_REGEX.match(output)\n if match is None:\n formatted_output = {\"positive\": None}\n if self.triplet:\n formatted_output[\"negative\"] = None\n return formatted_output\n\n groups = match.groups()\n if self.triplet:\n return {\n \"positive\": groups[0].strip(),\n \"negative\": (\n groups[1].strip()\n if len(groups) > 1 and groups[1] is not None\n else None\n ),\n }\n\n return {\"positive\": groups[0].strip()}\n
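A minimal sketch of the structured-output branch above, assuming an already-loaded GenerateSentencePair task named generate_sentence_pair created with triplet=True and use_default_structured_output=True:
# Sketch only: the raw completion is illustrative JSON matching the default schema.
raw = '{"positive": "How do FDM printers build parts layer by layer?", "negative": "Which 3D printer brand is the cheapest?"}'
generate_sentence_pair.format_output(output=raw)
# {'positive': 'How do FDM printers build parts layer by layer?',
#  'negative': 'Which 3D printer brand is the cheapest?'}
# If the JSON cannot be decoded, the keys fall back to None (`positive` only when triplet=False).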
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateSentencePair.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
Returns:
Type DescriptionDict[str, Any]
JSON Schema of the response to enforce.
Source code insrc/distilabel/steps/tasks/sentence_transformers.py
@override\ndef get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n if self.triplet:\n return {\n \"properties\": {\n \"positive\": {\"title\": \"Positive\", \"type\": \"string\"},\n \"negative\": {\"title\": \"Negative\", \"type\": \"string\"},\n },\n \"required\": [\"positive\", \"negative\"],\n \"title\": \"Schema\",\n \"type\": \"object\",\n }\n return {\n \"properties\": {\"positive\": {\"title\": \"Positive\", \"type\": \"string\"}},\n \"required\": [\"positive\"],\n \"title\": \"Schema\",\n \"type\": \"object\",\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.GenerateSentencePair._format_structured_output","title":"_format_structured_output(output)
","text":"Parses the structured response, which should correspond to a dictionary with either positive
, or positive
and negative
keys.
Parameters:
Name Type Description Defaultoutput
str
The output from the LLM
.
Returns:
Type DescriptionDict[str, str]
Formatted output.
Source code insrc/distilabel/steps/tasks/sentence_transformers.py
def _format_structured_output(self, output: str) -> Dict[str, str]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with either `positive`, or `positive` and `negative` keys.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n if self.triplet:\n return {\"positive\": None, \"negative\": None}\n return {\"positive\": None}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.StructuredGeneration","title":"StructuredGeneration
","text":" Bases: Task
Generate structured content for a given instruction
using an LLM
.
StructuredGeneration
is a pre-defined task that defines the instruction
and the structured_output
as the inputs, and generation
as the output. This task is used to generate structured content based on the input instruction and following the schema provided within the structured_output
column per each instruction
. The model_name
is also returned as part of the output in order to enhance it.
Attributes:
Name Type Descriptionuse_system_prompt
bool
Whether to use the system prompt in the generation. Defaults to False. When set to True, if the column system_prompt is defined within the input batch, then the system_prompt will be used; otherwise, it will be ignored.
str
): The instruction to generate structured content from.Dict[str, Any]
): The structured_output to generate structured content from. It should be a Python dictionary with the keys format
and schema
, where format
should be one of json
or regex
, and the schema
should be either the JSON schema or the regex pattern, respectively.str
): The generated text matching the provided schema, if possible.str
): The name of the model used to generate the text.Examples:
Generate structured output from a JSON schema:
from distilabel.steps.tasks import StructuredGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nstructured_gen = StructuredGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n)\n\nstructured_gen.load()\n\nresult = next(\n structured_gen.process(\n [\n {\n \"instruction\": \"Create an RPG character\",\n \"structured_output\": {\n \"format\": \"json\",\n \"schema\": {\n \"properties\": {\n \"name\": {\n \"title\": \"Name\",\n \"type\": \"string\"\n },\n \"description\": {\n \"title\": \"Description\",\n \"type\": \"string\"\n },\n \"role\": {\n \"title\": \"Role\",\n \"type\": \"string\"\n },\n \"weapon\": {\n \"title\": \"Weapon\",\n \"type\": \"string\"\n }\n },\n \"required\": [\n \"name\",\n \"description\",\n \"role\",\n \"weapon\"\n ],\n \"title\": \"Character\",\n \"type\": \"object\"\n }\n },\n }\n ]\n )\n)\n
Generate structured output from a regex pattern (only works with LLMs that support regex, the providers using outlines):
from distilabel.steps.tasks import StructuredGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nstructured_gen = StructuredGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n)\n\nstructured_gen.load()\n\nresult = next(\n structured_gen.process(\n [\n {\n \"instruction\": \"What's the weather like today in Seattle in Celsius degrees?\",\n \"structured_output\": {\n \"format\": \"regex\",\n \"schema\": r\"(\\d{1,2})\u00b0C\"\n },\n\n }\n ]\n )\n)\n
Source code in src/distilabel/steps/tasks/structured_generation.py
class StructuredGeneration(Task):\n \"\"\"Generate structured content for a given `instruction` using an `LLM`.\n\n `StructuredGeneration` is a pre-defined task that defines the `instruction` and the `structured_output`\n as the inputs, and `generation` as the output. This task is used to generate structured content based on\n the input instruction and following the schema provided within the `structured_output` column per each\n `instruction`. The `model_name` also returned as part of the output in order to enhance it.\n\n Attributes:\n use_system_prompt: Whether to use the system prompt in the generation. Defaults to `True`,\n which means that if the column `system_prompt` is defined within the input batch, then\n the `system_prompt` will be used, otherwise, it will be ignored.\n\n Input columns:\n - instruction (`str`): The instruction to generate structured content from.\n - structured_output (`Dict[str, Any]`): The structured_output to generate structured content from. It should be a\n Python dictionary with the keys `format` and `schema`, where `format` should be one of `json` or\n `regex`, and the `schema` should be either the JSON schema or the regex pattern, respectively.\n\n Output columns:\n - generation (`str`): The generated text matching the provided schema, if possible.\n - model_name (`str`): The name of the model used to generate the text.\n\n Categories:\n - outlines\n - structured-generation\n\n Examples:\n Generate structured output from a JSON schema:\n\n ```python\n from distilabel.steps.tasks import StructuredGeneration\n from distilabel.models import InferenceEndpointsLLM\n\n structured_gen = StructuredGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n )\n\n structured_gen.load()\n\n result = next(\n structured_gen.process(\n [\n {\n \"instruction\": \"Create an RPG character\",\n \"structured_output\": {\n \"format\": \"json\",\n \"schema\": {\n \"properties\": {\n \"name\": {\n \"title\": \"Name\",\n \"type\": \"string\"\n },\n \"description\": {\n \"title\": \"Description\",\n \"type\": \"string\"\n },\n \"role\": {\n \"title\": \"Role\",\n \"type\": \"string\"\n },\n \"weapon\": {\n \"title\": \"Weapon\",\n \"type\": \"string\"\n }\n },\n \"required\": [\n \"name\",\n \"description\",\n \"role\",\n \"weapon\"\n ],\n \"title\": \"Character\",\n \"type\": \"object\"\n }\n },\n }\n ]\n )\n )\n ```\n\n Generate structured output from a regex pattern (only works with LLMs that support regex, the providers using outlines):\n\n ```python\n from distilabel.steps.tasks import StructuredGeneration\n from distilabel.models import InferenceEndpointsLLM\n\n structured_gen = StructuredGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n )\n\n structured_gen.load()\n\n result = next(\n structured_gen.process(\n [\n {\n \"instruction\": \"What's the weather like today in Seattle in Celsius degrees?\",\n \"structured_output\": {\n \"format\": \"regex\",\n \"schema\": r\"(\\\\d{1,2})\u00b0C\"\n },\n\n }\n ]\n )\n )\n ```\n \"\"\"\n\n use_system_prompt: bool = False\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task are the `instruction` and the `structured_output`.\n Optionally, if the `use_system_prompt` flag is set to True, then the\n `system_prompt` will be used too.\"\"\"\n columns = [\"instruction\", \"structured_output\"]\n if 
self.use_system_prompt:\n columns = [\"system_prompt\"] + columns\n return columns\n\n def format_input(self, input: Dict[str, Any]) -> StructuredInput:\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n if not isinstance(input[\"instruction\"], str):\n raise DistilabelUserError(\n f\"Input `instruction` must be a string. Got: {input['instruction']}.\",\n page=\"components-gallery/tasks/structuredgeneration/\",\n )\n\n messages = [{\"role\": \"user\", \"content\": input[\"instruction\"]}]\n if self.use_system_prompt:\n if \"system_prompt\" in input:\n messages.insert(\n 0, {\"role\": \"system\", \"content\": input[\"system_prompt\"]}\n )\n else:\n warnings.warn(\n \"`use_system_prompt` is set to `True`, but no `system_prompt` in input batch, so it will be ignored.\",\n UserWarning,\n stacklevel=2,\n )\n\n return (messages, input.get(\"structured_output\", None)) # type: ignore\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task is the `generation` and the `model_name`.\"\"\"\n return [\"generation\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `generation`. The `model_name`\n will be automatically included within the `process` method of `Task`. Note that even\n if the `structured_output` is defined to produce a JSON schema, this method will return the raw\n output i.e. a string without any parsing.\"\"\"\n return {\"generation\": output}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.StructuredGeneration.inputs","title":"inputs
property
","text":"The input for the task are the instruction
and the structured_output
. Optionally, if the use_system_prompt
flag is set to True, then the system_prompt
will be used too.
outputs
property
","text":"The output for the task is the generation
and the model_name
.
format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/structured_generation.py
def format_input(self, input: Dict[str, Any]) -> StructuredInput:\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n if not isinstance(input[\"instruction\"], str):\n raise DistilabelUserError(\n f\"Input `instruction` must be a string. Got: {input['instruction']}.\",\n page=\"components-gallery/tasks/structuredgeneration/\",\n )\n\n messages = [{\"role\": \"user\", \"content\": input[\"instruction\"]}]\n if self.use_system_prompt:\n if \"system_prompt\" in input:\n messages.insert(\n 0, {\"role\": \"system\", \"content\": input[\"system_prompt\"]}\n )\n else:\n warnings.warn(\n \"`use_system_prompt` is set to `True`, but no `system_prompt` in input batch, so it will be ignored.\",\n UserWarning,\n stacklevel=2,\n )\n\n return (messages, input.get(\"structured_output\", None)) # type: ignore\n
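As an illustration of the StructuredInput tuple returned above, here is a hedged sketch assuming an already-loaded StructuredGeneration task named structured_gen with use_system_prompt=False:
# Sketch only: `structured_gen` is an assumed, already-loaded StructuredGeneration task.
messages, structured_output = structured_gen.format_input(
    {
        'instruction': 'Create an RPG character',
        'structured_output': {'format': 'json', 'schema': {'type': 'object'}},
    }
)
# messages          -> [{'role': 'user', 'content': 'Create an RPG character'}]
# structured_output -> {'format': 'json', 'schema': {'type': 'object'}}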
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.StructuredGeneration.format_output","title":"format_output(output, input)
","text":"The output is formatted as a dictionary with the generation
. The model_name
will be automatically included within the process
method of Task
. Note that even if the structured_output
is defined to produce a JSON schema, this method will return the raw output i.e. a string without any parsing.
src/distilabel/steps/tasks/structured_generation.py
def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `generation`. The `model_name`\n will be automatically included within the `process` method of `Task`. Note that even\n if the `structured_output` is defined to produce a JSON schema, this method will return the raw\n output i.e. a string without any parsing.\"\"\"\n return {\"generation\": output}\n
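Because the raw string is returned unparsed even when a JSON schema was enforced, callers that need a Python object can decode it themselves; a minimal, self-contained sketch using only the standard library:
import json

# `generation` stands in for the raw string returned by `format_output` above.
generation = '{"name": "Arwen", "description": "A quiet ranger", "role": "ranger", "weapon": "bow"}'
character = json.loads(generation)  # raises json.JSONDecodeError if the model strayed from the schema
print(character['name'])  # Arwen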
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextClassification","title":"TextClassification
","text":" Bases: Task
Classifies text into one or more categories or labels.
This task can be used for text classification problems, where the goal is to assign one or multiple labels to a given text. By default it uses structured generation, as per the reference paper, which can help to generate more concise labels. See section 4.1 in the reference.
Input columnsstr
): The reference text we want to obtain labels for.Union[str, List[str]]
): The label or list of labels for the text.str
): The name of the model used to generate the label/s.Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models
Attributes:
Name Type Descriptionsystem_prompt
Optional[str]
A prompt to display to the user before the task starts. Contains a default message to make the model behave like a classifier specialist.
n
PositiveInt
Number of labels to generate. If only 1 is required, it corresponds to a single-label classification problem; if >1, it will return the \"n\" labels most representative of the text. Defaults to 1.
context
Optional[str]
Context to use when generating the labels. By default contains a generic message, but can be used to customize the context for the task.
examples
Optional[List[str]]
List of examples to help the model understand the task, few shots.
available_labels
Optional[Union[List[str], Dict[str, str]]]
List of available labels to choose from when classifying the text, or a dictionary with the labels and their descriptions.
default_label
Optional[Union[str, List[str]]]
Default label to use when the text is ambiguous or lacks sufficient information for classification. Can be a list in case of multiple labels (n>1).
Examples:
Assigning a sentiment to a text:
from distilabel.steps.tasks import TextClassification\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n)\n\ntext_classification = TextClassification(\n llm=llm,\n context=\"You are an AI system specialized in assigning sentiment to movies.\",\n available_labels=[\"positive\", \"negative\"],\n)\n\ntext_classification.load()\n\nresult = next(\n text_classification.process(\n [{\"text\": \"This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.\"}]\n )\n)\n# result\n# [{'text': 'This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.',\n# 'labels': 'positive',\n# 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": \"positive\"\\n}',\n# 'raw_input_text_classification_0': [{'role': 'system',\n# 'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n# {'role': 'user',\n# 'content': '# Instruction\\nPlease classify the user query by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide the label that best describes the text.\\nYou are an AI system specialized in assigning sentiment to movie the user queries.\\n## Labeling the user input\\nUse the available labels to classify the user query. Analyze the context of each label specifically:\\navailable_labels = [\\n \"positive\", # The text shows positive sentiment\\n \"negative\", # The text shows negative sentiment\\n]\\n\\n\\n## User Query\\n```\\nThis was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": \"label\"\\n}\\n```'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
Assigning predefined labels with specified descriptions:
from distilabel.steps.tasks import TextClassification\n\ntext_classification = TextClassification(\n llm=llm,\n n=1,\n context=\"Determine the intent of the text.\",\n available_labels={\n \"complaint\": \"A statement expressing dissatisfaction or annoyance about a product, service, or experience. It's a negative expression of discontent, often with the intention of seeking a resolution or compensation.\",\n \"inquiry\": \"A question or request for information about a product, service, or situation. It's a neutral or curious expression seeking clarification or details.\",\n \"feedback\": \"A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.\",\n \"praise\": \"A statement expressing admiration, approval, or appreciation for a product, service, or experience. It's a positive expression of satisfaction or delight, often with the intention of encouraging or recommending.\"\n },\n query_title=\"Customer Query\",\n)\n\ntext_classification.load()\n\nresult = next(\n text_classification.process(\n [{\"text\": \"Can you tell me more about your return policy?\"}]\n )\n)\n# result\n# [{'text': 'Can you tell me more about your return policy?',\n# 'labels': 'inquiry',\n# 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": \"inquiry\"\\n}',\n# 'raw_input_text_classification_0': [{'role': 'system',\n# 'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n# {'role': 'user',\n# 'content': '# Instruction\\nPlease classify the customer query by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide the label that best describes the text.\\nDetermine the intent of the text.\\n## Labeling the user input\\nUse the available labels to classify the user query. Analyze the context of each label specifically:\\navailable_labels = [\\n \"complaint\", # A statement expressing dissatisfaction or annoyance about a product, service, or experience. It\\'s a negative expression of discontent, often with the intention of seeking a resolution or compensation.\\n \"inquiry\", # A question or request for information about a product, service, or situation. It\\'s a neutral or curious expression seeking clarification or details.\\n \"feedback\", # A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.\\n \"praise\", # A statement expressing admiration, approval, or appreciation for a product, service, or experience. It\\'s a positive expression of satisfaction or delight, often with the intention of encouraging or recommending.\\n]\\n\\n\\n## Customer Query\\n```\\nCan you tell me more about your return policy?\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": \"label\"\\n}\\n```'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
Free multi label classification without predefined labels:
from distilabel.steps.tasks import TextClassification\n\ntext_classification = TextClassification(\n llm=llm,\n n=3,\n context=(\n \"Describe the main themes, topics, or categories that could describe the \"\n \"following type of persona.\"\n ),\n query_title=\"Example of Persona\",\n)\n\ntext_classification.load()\n\nresult = next(\n text_classification.process(\n [{\"text\": \"A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.\"}]\n )\n)\n# result\n# [{'text': 'A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.',\n# 'labels': ['Historical Researcher',\n# 'Cultural Specialist',\n# 'Ethnic Studies Expert'],\n# 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": [\"Historical Researcher\", \"Cultural Specialist\", \"Ethnic Studies Expert\"]\\n}',\n# 'raw_input_text_classification_0': [{'role': 'system',\n# 'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n# {'role': 'user',\n# 'content': '# Instruction\\nPlease classify the example of persona by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide a list of 3 labels that best describe the text.\\nDescribe the main themes, topics, or categories that could describe the following type of persona.\\nUse clear, widely understood terms for labels.Avoid overly specific or obscure labels unless the text demands it.\\n\\n\\n## Example of Persona\\n```\\nA historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": [\"label_0\", \"label_1\", \"label_2\"]\\n}\\n```'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
Source code in src/distilabel/steps/tasks/text_classification.py
class TextClassification(Task):\n r\"\"\"Classifies text into one or more categories or labels.\n\n This task can be used for text classification problems, where the goal is to assign\n one or multiple labels to a given text.\n It uses structured generation as per the reference paper by default,\n it can help to generate more concise labels. See section 4.1 in the reference.\n\n Input columns:\n - text (`str`): The reference text we want to obtain labels for.\n\n Output columns:\n - labels (`Union[str, List[str]]`): The label or list of labels for the text.\n - model_name (`str`): The name of the model used to generate the label/s.\n\n Categories:\n - text-classification\n\n References:\n - [`Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models`](https://arxiv.org/abs/2408.02442)\n\n Attributes:\n system_prompt: A prompt to display to the user before the task starts. Contains a default\n message to make the model behave like a classifier specialist.\n n: Number of labels to generate If only 1 is required, corresponds to a label\n classification problem, if >1 it will intend return the \"n\" labels most representative\n for the text. Defaults to 1.\n context: Context to use when generating the labels. By default contains a generic message,\n but can be used to customize the context for the task.\n examples: List of examples to help the model understand the task, few shots.\n available_labels: List of available labels to choose from when classifying the text, or\n a dictionary with the labels and their descriptions.\n default_label: Default label to use when the text is ambiguous or lacks sufficient information for\n classification. Can be a list in case of multiple labels (n>1).\n\n Examples:\n Assigning a sentiment to a text:\n\n ```python\n from distilabel.steps.tasks import TextClassification\n from distilabel.models import InferenceEndpointsLLM\n\n llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n )\n\n text_classification = TextClassification(\n llm=llm,\n context=\"You are an AI system specialized in assigning sentiment to movies.\",\n available_labels=[\"positive\", \"negative\"],\n )\n\n text_classification.load()\n\n result = next(\n text_classification.process(\n [{\"text\": \"This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.\"}]\n )\n )\n # result\n # [{'text': 'This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.',\n # 'labels': 'positive',\n # 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": \"positive\"\\n}',\n # 'raw_input_text_classification_0': [{'role': 'system',\n # 'content': 'You are an AI system specialized in generating labels to classify pieces of text. 
Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n # {'role': 'user',\n # 'content': '# Instruction\\nPlease classify the user query by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide the label that best describes the text.\\nYou are an AI system specialized in assigning sentiment to movie the user queries.\\n## Labeling the user input\\nUse the available labels to classify the user query. Analyze the context of each label specifically:\\navailable_labels = [\\n \"positive\", # The text shows positive sentiment\\n \"negative\", # The text shows negative sentiment\\n]\\n\\n\\n## User Query\\n```\\nThis was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": \"label\"\\n}\\n```'}]},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n\n Assigning predefined labels with specified descriptions:\n\n ```python\n from distilabel.steps.tasks import TextClassification\n\n text_classification = TextClassification(\n llm=llm,\n n=1,\n context=\"Determine the intent of the text.\",\n available_labels={\n \"complaint\": \"A statement expressing dissatisfaction or annoyance about a product, service, or experience. It's a negative expression of discontent, often with the intention of seeking a resolution or compensation.\",\n \"inquiry\": \"A question or request for information about a product, service, or situation. It's a neutral or curious expression seeking clarification or details.\",\n \"feedback\": \"A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.\",\n \"praise\": \"A statement expressing admiration, approval, or appreciation for a product, service, or experience. It's a positive expression of satisfaction or delight, often with the intention of encouraging or recommending.\"\n },\n query_title=\"Customer Query\",\n )\n\n text_classification.load()\n\n result = next(\n text_classification.process(\n [{\"text\": \"Can you tell me more about your return policy?\"}]\n )\n )\n # result\n # [{'text': 'Can you tell me more about your return policy?',\n # 'labels': 'inquiry',\n # 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": \"inquiry\"\\n}',\n # 'raw_input_text_classification_0': [{'role': 'system',\n # 'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n # {'role': 'user',\n # 'content': '# Instruction\\nPlease classify the customer query by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide the label that best describes the text.\\nDetermine the intent of the text.\\n## Labeling the user input\\nUse the available labels to classify the user query. 
Analyze the context of each label specifically:\\navailable_labels = [\\n \"complaint\", # A statement expressing dissatisfaction or annoyance about a product, service, or experience. It\\'s a negative expression of discontent, often with the intention of seeking a resolution or compensation.\\n \"inquiry\", # A question or request for information about a product, service, or situation. It\\'s a neutral or curious expression seeking clarification or details.\\n \"feedback\", # A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.\\n \"praise\", # A statement expressing admiration, approval, or appreciation for a product, service, or experience. It\\'s a positive expression of satisfaction or delight, often with the intention of encouraging or recommending.\\n]\\n\\n\\n## Customer Query\\n```\\nCan you tell me more about your return policy?\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": \"label\"\\n}\\n```'}]},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n\n Free multi label classification without predefined labels:\n\n ```python\n from distilabel.steps.tasks import TextClassification\n\n text_classification = TextClassification(\n llm=llm,\n n=3,\n context=(\n \"Describe the main themes, topics, or categories that could describe the \"\n \"following type of persona.\"\n ),\n query_title=\"Example of Persona\",\n )\n\n text_classification.load()\n\n result = next(\n text_classification.process(\n [{\"text\": \"A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.\"}]\n )\n )\n # result\n # [{'text': 'A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.',\n # 'labels': ['Historical Researcher',\n # 'Cultural Specialist',\n # 'Ethnic Studies Expert'],\n # 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": [\"Historical Researcher\", \"Cultural Specialist\", \"Ethnic Studies Expert\"]\\n}',\n # 'raw_input_text_classification_0': [{'role': 'system',\n # 'content': 'You are an AI system specialized in generating labels to classify pieces of text. 
Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n # {'role': 'user',\n # 'content': '# Instruction\\nPlease classify the example of persona by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide a list of 3 labels that best describe the text.\\nDescribe the main themes, topics, or categories that could describe the following type of persona.\\nUse clear, widely understood terms for labels.Avoid overly specific or obscure labels unless the text demands it.\\n\\n\\n## Example of Persona\\n```\\nA historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": [\"label_0\", \"label_1\", \"label_2\"]\\n}\\n```'}]},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n \"\"\"\n\n system_prompt: Optional[str] = (\n \"You are an AI system specialized in generating labels to classify pieces of text. \"\n \"Your sole purpose is to analyze the given text and provide appropriate classification labels.\"\n )\n n: PositiveInt = Field(\n default=1,\n description=\"Number of labels to generate. Defaults to 1.\",\n )\n context: Optional[str] = Field(\n default=\"Generate concise, relevant labels that accurately represent the text's main themes, topics, or categories.\",\n description=\"Context to use when generating the labels.\",\n )\n examples: Optional[List[str]] = Field(\n default=None,\n description=\"List of examples to help the model understand the task, few shots.\",\n )\n available_labels: Optional[Union[List[str], Dict[str, str]]] = Field(\n default=None,\n description=(\n \"List of available labels to choose from when classifying the text, or \"\n \"a dictionary with the labels and their descriptions.\"\n ),\n )\n default_label: Optional[Union[str, List[str]]] = Field(\n default=\"Unclassified\",\n description=(\n \"Default label to use when the text is ambiguous or lacks sufficient information for \"\n \"classification. 
Can be a list in case of multiple labels (n>1).\"\n ),\n )\n query_title: str = Field(\n default=\"User Query\",\n description=\"Title of the query used to show the example/s to classify.\",\n )\n use_default_structured_output: bool = True\n\n _template: Optional[Template] = PrivateAttr(default=None)\n\n def load(self) -> None:\n super().load()\n self._template = Template(TEXT_CLASSIFICATION_TEMPLATE)\n self._labels_format: str = (\n '\"label\"'\n if self.n == 1\n else \"[\" + \", \".join([f'\"label_{i}\"' for i in range(self.n)]) + \"]\"\n )\n self._labels_message: str = (\n \"Provide the label that best describes the text.\"\n if self.n == 1\n else f\"Provide a list of {self.n} labels that best describe the text.\"\n )\n self._available_labels_message: str = self._get_available_labels_message()\n self._examples: str = self._get_examples_message()\n\n def _get_available_labels_message(self) -> str:\n \"\"\"Prepares the message to display depending on the available labels (if any),\n and whether the labels have a specific context.\n \"\"\"\n if self.available_labels is None:\n return (\n \"Use clear, widely understood terms for labels.\"\n \"Avoid overly specific or obscure labels unless the text demands it.\"\n )\n\n msg = (\n \"## Labeling the user input\\n\"\n \"Use the available labels to classify the user query{label_context}:\\n\"\n \"available_labels = {available_labels}\"\n )\n if isinstance(self.available_labels, list):\n specific_msg = (\n \"[\\n\"\n + indent(\n \"\".join([f'\"{label}\",\\n' for label in self.available_labels]),\n prefix=\" \" * 4,\n )\n + \"]\"\n )\n return msg.format(label_context=\"\", available_labels=specific_msg)\n\n elif isinstance(self.available_labels, dict):\n specific_msg = \"\"\n for label, description in self.available_labels.items():\n specific_msg += indent(\n f'\"{label}\", # {description}' + \"\\n\", prefix=\" \" * 4\n )\n\n specific_msg = \"[\\n\" + specific_msg + \"]\"\n return msg.format(\n label_context=\". 
Analyze the context of each label specifically\",\n available_labels=specific_msg,\n )\n\n def _get_examples_message(self) -> str:\n \"\"\"Prepares the message to display depending on the examples provided.\"\"\"\n if self.examples is None:\n return \"\"\n\n examples_msg = \"\\n\".join([f\"- {ex}\" for ex in self.examples])\n\n return (\n \"\\n## Examples\\n\"\n \"Here are some examples to help you understand the task:\\n\"\n f\"{examples_msg}\"\n )\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task is the `instruction`.\"\"\"\n return [\"text\"]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task is the `generation` and the `model_name`.\"\"\"\n return [\"labels\", \"model_name\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n messages = [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n context=f\"\\n{self.context}\",\n labels_message=self._labels_message,\n available_labels=self._available_labels_message,\n examples=self._examples,\n default_label=self.default_label,\n labels_format=self._labels_format,\n query_title=self.query_title,\n text=input[\"text\"],\n ),\n },\n ]\n if self.system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": self.system_prompt})\n return messages\n\n def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `generation`. The `model_name`\n will be automatically included within the `process` method of `Task`.\"\"\"\n return self._format_structured_output(output)\n\n @override\n def get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n if self.n > 1:\n\n class MultiLabelSchema(BaseModel):\n labels: List[str]\n\n return MultiLabelSchema.model_json_schema()\n\n class SingleLabelSchema(BaseModel):\n labels: str\n\n return SingleLabelSchema.model_json_schema()\n\n def _format_structured_output(\n self, output: str\n ) -> Dict[str, Union[str, List[str]]]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with the `labels`, and either a string or a list of strings with the labels.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n if self.n > 1:\n return {\"labels\": [None for _ in range(self.n)]}\n return {\"labels\": None}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextClassification.inputs","title":"inputs
property
","text":"The input for the task is the instruction
.
outputs
property
","text":"The output for the task is the generation
and the model_name
.
_get_available_labels_message()
","text":"Prepares the message to display depending on the available labels (if any), and whether the labels have a specific context.
Source code insrc/distilabel/steps/tasks/text_classification.py
def _get_available_labels_message(self) -> str:\n \"\"\"Prepares the message to display depending on the available labels (if any),\n and whether the labels have a specific context.\n \"\"\"\n if self.available_labels is None:\n return (\n \"Use clear, widely understood terms for labels.\"\n \"Avoid overly specific or obscure labels unless the text demands it.\"\n )\n\n msg = (\n \"## Labeling the user input\\n\"\n \"Use the available labels to classify the user query{label_context}:\\n\"\n \"available_labels = {available_labels}\"\n )\n if isinstance(self.available_labels, list):\n specific_msg = (\n \"[\\n\"\n + indent(\n \"\".join([f'\"{label}\",\\n' for label in self.available_labels]),\n prefix=\" \" * 4,\n )\n + \"]\"\n )\n return msg.format(label_context=\"\", available_labels=specific_msg)\n\n elif isinstance(self.available_labels, dict):\n specific_msg = \"\"\n for label, description in self.available_labels.items():\n specific_msg += indent(\n f'\"{label}\", # {description}' + \"\\n\", prefix=\" \" * 4\n )\n\n specific_msg = \"[\\n\" + specific_msg + \"]\"\n return msg.format(\n label_context=\". Analyze the context of each label specifically\",\n available_labels=specific_msg,\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextClassification._get_examples_message","title":"_get_examples_message()
","text":"Prepares the message to display depending on the examples provided.
Source code insrc/distilabel/steps/tasks/text_classification.py
def _get_examples_message(self) -> str:\n \"\"\"Prepares the message to display depending on the examples provided.\"\"\"\n if self.examples is None:\n return \"\"\n\n examples_msg = \"\\n\".join([f\"- {ex}\" for ex in self.examples])\n\n return (\n \"\\n## Examples\\n\"\n \"Here are some examples to help you understand the task:\\n\"\n f\"{examples_msg}\"\n )\n
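As a tiny illustration (the example texts are invented), the few-shot block is just one \"- \" bullet per provided example:
examples = [\"I love this\", \"Terrible service\"]  # hypothetical few-shot examples\n# Mirrors the helper above: one bullet per example, joined with newlines.\nexamples_msg = \"\\n\".join([f\"- {ex}\" for ex in examples])\nprint(examples_msg)\n# - I love this\n# - Terrible service\n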
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextClassification.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/text_classification.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n messages = [\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n context=f\"\\n{self.context}\",\n labels_message=self._labels_message,\n available_labels=self._available_labels_message,\n examples=self._examples,\n default_label=self.default_label,\n labels_format=self._labels_format,\n query_title=self.query_title,\n text=input[\"text\"],\n ),\n },\n ]\n if self.system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": self.system_prompt})\n return messages\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextClassification.format_output","title":"format_output(output, input=None)
","text":"The output is formatted as a dictionary with the generation
. The model_name
will be automatically included within the process
method of Task
.
src/distilabel/steps/tasks/text_classification.py
def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `generation`. The `model_name`\n will be automatically included within the `process` method of `Task`.\"\"\"\n return self._format_structured_output(output)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextClassification.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
Returns:
Type DescriptionDict[str, Any]
JSON Schema of the response to enforce.
Source code insrc/distilabel/steps/tasks/text_classification.py
@override\ndef get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n if self.n > 1:\n\n class MultiLabelSchema(BaseModel):\n labels: List[str]\n\n return MultiLabelSchema.model_json_schema()\n\n class SingleLabelSchema(BaseModel):\n labels: str\n\n return SingleLabelSchema.model_json_schema()\n
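To make this concrete, a hedged sketch of the single-label schema built outside the task; the exact JSON-schema dict Pydantic emits can vary slightly across versions.
from pydantic import BaseModel\n\nclass SingleLabelSchema(BaseModel):\n    labels: str\n\nprint(SingleLabelSchema.model_json_schema())\n# Roughly: {'properties': {'labels': {'title': 'Labels', 'type': 'string'}},\n#           'required': ['labels'], 'title': 'SingleLabelSchema', 'type': 'object'}\n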
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextClassification._format_structured_output","title":"_format_structured_output(output)
","text":"Parses the structured response, which should correspond to a dictionary with the labels
, and either a string or a list of strings with the labels.
Parameters:
Name Type Description Defaultoutput
str
The output from the LLM
.
Returns:
Type DescriptionDict[str, Union[str, List[str]]]
Formatted output.
Source code insrc/distilabel/steps/tasks/text_classification.py
def _format_structured_output(\n self, output: str\n) -> Dict[str, Union[str, List[str]]]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with the `labels`, and either a string or a list of strings with the labels.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n if self.n > 1:\n return {\"labels\": [None for _ in range(self.n)]}\n return {\"labels\": None}\n
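The fallback behaviour can be sketched in isolation (the helper name below is hypothetical, not part of the library API): valid JSON is returned as a dict, anything else degrades to None labels, list-shaped when n > 1.
import orjson\n\ndef parse_labels(output: str, n: int = 1):\n    # Mirrors the fallback logic above.\n    try:\n        return orjson.loads(output)\n    except orjson.JSONDecodeError:\n        return {\"labels\": [None] * n} if n > 1 else {\"labels\": None}\n\nparse_labels('{\"labels\": \"inquiry\"}')  # -> {'labels': 'inquiry'}\nparse_labels(\"not json\", n=3)          # -> {'labels': [None, None, None]}\n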
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ChatGeneration","title":"ChatGeneration
","text":" Bases: Task
Generates text based on a conversation.
ChatGeneration
is a pre-defined task that defines the messages
as the input and generation
as the output. This task is used to generate text based on a conversation. The model_name
is also returned as part of the output in order to enhance it.
List[Dict[Literal[\"role\", \"content\"], str]]
): The messages to generate the follow up completion from.str
): The generated text from the assistant.str
): The model name used to generate the text.:material-chat:
Examples:
Generate text from a conversation in OpenAI chat format:
from distilabel.steps.tasks import ChatGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nchat = ChatGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n)\n\nchat.load()\n\nresult = next(\n chat.process(\n [\n {\n \"messages\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n ]\n }\n ]\n )\n)\n# result\n# [\n# {\n# 'messages': [{'role': 'user', 'content': 'How much is 2+2?'}],\n# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n# 'generation': '4',\n# }\n# ]\n
Source code in src/distilabel/steps/tasks/text_generation.py
class ChatGeneration(Task):\n \"\"\"Generates text based on a conversation.\n\n `ChatGeneration` is a pre-defined task that defines the `messages` as the input\n and `generation` as the output. This task is used to generate text based on a conversation.\n The `model_name` is also returned as part of the output in order to enhance it.\n\n Input columns:\n - messages (`List[Dict[Literal[\"role\", \"content\"], str]]`): The messages to generate the\n follow up completion from.\n\n Output columns:\n - generation (`str`): The generated text from the assistant.\n - model_name (`str`): The model name used to generate the text.\n\n Categories:\n - chat-generation\n\n Icon:\n `:material-chat:`\n\n Examples:\n Generate text from a conversation in OpenAI chat format:\n\n ```python\n from distilabel.steps.tasks import ChatGeneration\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n chat = ChatGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n )\n\n chat.load()\n\n result = next(\n chat.process(\n [\n {\n \"messages\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n ]\n }\n ]\n )\n )\n # result\n # [\n # {\n # 'messages': [{'role': 'user', 'content': 'How much is 2+2?'}],\n # 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n # 'generation': '4',\n # }\n # ]\n ```\n \"\"\"\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task are the `messages`.\"\"\"\n return [\"messages\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the messages provided\n are already formatted that way i.e. following the OpenAI chat format.\"\"\"\n\n if not is_openai_format(input[\"messages\"]):\n raise DistilabelUserError(\n \"Input `messages` must be an OpenAI chat-like format conversation. \"\n f\"Got: {input['messages']}. Please check: 'https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models'.\",\n page=\"components-gallery/tasks/chatgeneration/\",\n )\n\n if input[\"messages\"][-1][\"role\"] != \"user\":\n raise DistilabelUserError(\n \"The last message must be from the user. Please check: \"\n \"'https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models'.\",\n page=\"components-gallery/tasks/chatgeneration/\",\n )\n\n return input[\"messages\"]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task is the `generation` and the `model_name`.\"\"\"\n return [\"generation\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `generation`. The `model_name`\n will be automatically included within the `process` method of `Task`.\"\"\"\n return {\"generation\": output}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ChatGeneration.inputs","title":"inputs
property
","text":"The input for the task are the messages
.
outputs
property
","text":"The output for the task is the generation
and the model_name
.
format_input(input)
","text":"The input is formatted as a ChatType
assuming that the messages provided are already formatted that way i.e. following the OpenAI chat format.
src/distilabel/steps/tasks/text_generation.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the messages provided\n are already formatted that way i.e. following the OpenAI chat format.\"\"\"\n\n if not is_openai_format(input[\"messages\"]):\n raise DistilabelUserError(\n \"Input `messages` must be an OpenAI chat-like format conversation. \"\n f\"Got: {input['messages']}. Please check: 'https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models'.\",\n page=\"components-gallery/tasks/chatgeneration/\",\n )\n\n if input[\"messages\"][-1][\"role\"] != \"user\":\n raise DistilabelUserError(\n \"The last message must be from the user. Please check: \"\n \"'https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models'.\",\n page=\"components-gallery/tasks/chatgeneration/\",\n )\n\n return input[\"messages\"]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.ChatGeneration.format_output","title":"format_output(output, input=None)
","text":"The output is formatted as a dictionary with the generation
. The model_name
will be automatically included within the process
method of Task
.
src/distilabel/steps/tasks/text_generation.py
def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `generation`. The `model_name`\n will be automatically included within the `process` method of `Task`.\"\"\"\n return {\"generation\": output}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextGeneration","title":"TextGeneration
","text":" Bases: Task
Text generation with an LLM
given a prompt.
TextGeneration
is a pre-defined task that allows passing a custom prompt using the Jinja2 syntax. By default, an instruction
is expected in the inputs, but using the template
and columns
attributes one can define a custom prompt and columns expected from the text. This task should be good enough for tasks that don't need post-processing of the responses generated by the LLM.
Attributes:
Name Type Descriptionsystem_prompt
Union[str, None]
The system prompt to use in the generation. If not provided, then it will check if the input row has a column named system_prompt
and use it. If not, then no system prompt will be used. Defaults to None
.
template
str
The template to use for the generation. It must follow the Jinja2 template syntax. If not provided, it will assume the text passed is an instruction and construct the appropriate template.
columns
Union[str, List[str]]
A string with the column, or a list with columns expected in the template. Take a look at the examples for more information. Defaults to instruction
.
use_system_prompt
bool
DEPRECATED. To be removed in 1.5.0. Whether to use the system prompt in the generation. Defaults to True
, which means that if the column system_prompt
is defined within the input batch, then the system_prompt
will be used, otherwise, it will be ignored.
columns
attribute): By default will be set to instruction
. The columns can point both to a str
or a List[str]
to be used in the template.str
): The generated text.str
): The name of the model used to generate the text.Examples:
Generate text from an instruction:
from distilabel.steps.tasks import TextGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\ntext_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n )\n)\n\ntext_gen.load()\n\nresult = next(\n text_gen.process(\n [{\"instruction\": \"your instruction\"}]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'your instruction',\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n# 'generation': 'generation',\n# }\n# ]\n
Use a custom template to generate text:
from distilabel.steps.tasks import TextGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nCUSTOM_TEMPLATE = '''Document:\n{{ document }}\n\nQuestion: {{ question }}\n\nPlease provide a clear and concise answer to the question based on the information in the document and your general knowledge:\n'''.rstrip()\n\ntext_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n system_prompt=\"You are a helpful AI assistant. Your task is to answer the following question based on the provided document. If the answer is not explicitly stated in the document, use your knowledge to provide the most relevant and accurate answer possible. If you cannot answer the question based on the given information, state that clearly.\",\n template=CUSTOM_TEMPLATE,\n columns=[\"document\", \"question\"],\n)\n\ntext_gen.load()\n\nresult = next(\n text_gen.process(\n [\n {\n \"document\": \"The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.\",\n \"question\": \"What is the main threat to the Great Barrier Reef mentioned in the document?\"\n }\n ]\n )\n)\n# result\n# [\n# {\n# 'document': 'The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.',\n# 'question': 'What is the main threat to the Great Barrier Reef mentioned in the document?',\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n# 'generation': 'According to the document, the main threat to the Great Barrier Reef is climate change, specifically rising sea temperatures causing coral bleaching events.',\n# }\n# ]\n
Few shot learning with different system prompts:
from distilabel.steps.tasks import TextGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nCUSTOM_TEMPLATE = '''Generate a clear, single-sentence instruction based on the following examples:\n\n{% for example in examples %}\nExample {{ loop.index }}:\nInstruction: {{ example }}\n\n{% endfor %}\nNow, generate a new instruction in a similar style:\n'''.rstrip()\n\ntext_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n template=CUSTOM_TEMPLATE,\n columns=\"examples\",\n)\n\ntext_gen.load()\n\nresult = next(\n text_gen.process(\n [\n {\n \"examples\": [\"This is an example\", \"Another relevant example\"],\n \"system_prompt\": \"You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.\"\n }\n ]\n )\n)\n# result\n# [\n# {\n# 'examples': ['This is an example', 'Another relevant example'],\n# 'system_prompt': 'You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.',\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n# 'generation': 'Disable the firewall on the router',\n# }\n# ]\n
Source code in src/distilabel/steps/tasks/text_generation.py
class TextGeneration(Task):\n \"\"\"Text generation with an `LLM` given a prompt.\n\n `TextGeneration` is a pre-defined task that allows passing a custom prompt using the\n Jinja2 syntax. By default, a `instruction` is expected in the inputs, but the using\n `template` and `columns` attributes one can define a custom prompt and columns expected\n from the text. This task should be good enough for tasks that don't need post-processing\n of the responses generated by the LLM.\n\n Attributes:\n system_prompt: The system prompt to use in the generation. If not provided, then\n it will check if the input row has a column named `system_prompt` and use it.\n If not, then no system prompt will be used. Defaults to `None`.\n template: The template to use for the generation. It must follow the Jinja2 template\n syntax. If not provided, it will assume the text passed is an instruction and\n construct the appropriate template.\n columns: A string with the column, or a list with columns expected in the template.\n Take a look at the examples for more information. Defaults to `instruction`.\n use_system_prompt: DEPRECATED. To be removed in 1.5.0. Whether to use the system\n prompt in the generation. Defaults to `True`, which means that if the column\n `system_prompt` is defined within the input batch, then the `system_prompt`\n will be used, otherwise, it will be ignored.\n\n Input columns:\n - dynamic (determined by `columns` attribute): By default will be set to `instruction`.\n The columns can point both to a `str` or a `List[str]` to be used in the template.\n\n Output columns:\n - generation (`str`): The generated text.\n - model_name (`str`): The name of the model used to generate the text.\n\n Categories:\n - text-generation\n\n References:\n - [Jinja2 Template Designer Documentation](https://jinja.palletsprojects.com/en/3.1.x/templates/)\n\n Examples:\n Generate text from an instruction:\n\n ```python\n from distilabel.steps.tasks import TextGeneration\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n text_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n )\n )\n\n text_gen.load()\n\n result = next(\n text_gen.process(\n [{\"instruction\": \"your instruction\"}]\n )\n )\n # result\n # [\n # {\n # 'instruction': 'your instruction',\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n # 'generation': 'generation',\n # }\n # ]\n ```\n\n Use a custom template to generate text:\n\n ```python\n from distilabel.steps.tasks import TextGeneration\n from distilabel.models import InferenceEndpointsLLM\n\n CUSTOM_TEMPLATE = '''Document:\n {{ document }}\n\n Question: {{ question }}\n\n Please provide a clear and concise answer to the question based on the information in the document and your general knowledge:\n '''.rstrip()\n\n text_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n system_prompt=\"You are a helpful AI assistant. Your task is to answer the following question based on the provided document. If the answer is not explicitly stated in the document, use your knowledge to provide the most relevant and accurate answer possible. 
If you cannot answer the question based on the given information, state that clearly.\",\n template=CUSTOM_TEMPLATE,\n columns=[\"document\", \"question\"],\n )\n\n text_gen.load()\n\n result = next(\n text_gen.process(\n [\n {\n \"document\": \"The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.\",\n \"question\": \"What is the main threat to the Great Barrier Reef mentioned in the document?\"\n }\n ]\n )\n )\n # result\n # [\n # {\n # 'document': 'The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.',\n # 'question': 'What is the main threat to the Great Barrier Reef mentioned in the document?',\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n # 'generation': 'According to the document, the main threat to the Great Barrier Reef is climate change, specifically rising sea temperatures causing coral bleaching events.',\n # }\n # ]\n ```\n\n Few shot learning with different system prompts:\n\n ```python\n from distilabel.steps.tasks import TextGeneration\n from distilabel.models import InferenceEndpointsLLM\n\n CUSTOM_TEMPLATE = '''Generate a clear, single-sentence instruction based on the following examples:\n\n {% for example in examples %}\n Example {{ loop.index }}:\n Instruction: {{ example }}\n\n {% endfor %}\n Now, generate a new instruction in a similar style:\n '''.rstrip()\n\n text_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n template=CUSTOM_TEMPLATE,\n columns=\"examples\",\n )\n\n text_gen.load()\n\n result = next(\n text_gen.process(\n [\n {\n \"examples\": [\"This is an example\", \"Another relevant example\"],\n \"system_prompt\": \"You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.\"\n }\n ]\n )\n )\n # result\n # [\n # {\n # 'examples': ['This is an example', 'Another relevant example'],\n # 'system_prompt': 'You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.',\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n # 'generation': 'Disable the firewall on the router',\n # }\n # ]\n ```\n \"\"\"\n\n system_prompt: Union[str, None] = None\n use_system_prompt: bool = Field(default=True, deprecated=True)\n template: str = Field(\n default=\"{{ instruction }}\",\n description=(\n \"This is a template or prompt to use for the generation. \"\n \"If not provided, it is assumed a `instruction` is placed in the inputs, \"\n \"to be used as is.\"\n ),\n )\n columns: Union[str, List[str]] = Field(\n default=\"instruction\",\n description=(\n \"Custom column or list of columns to include in the input. \"\n \"If a `template` is provided which needs custom column names, \"\n \"then they should be provided here. 
By default it will use `instruction`.\"\n ),\n )\n\n _can_be_used_with_offline_batch_generation = True\n _template: Optional[\"Template\"] = PrivateAttr(default=...)\n\n def model_post_init(self, __context: Any) -> None:\n self.columns = [self.columns] if isinstance(self.columns, str) else self.columns\n super().model_post_init(__context)\n\n def load(self) -> None:\n super().load()\n\n for column in self.columns:\n check_column_in_template(column, self.template)\n\n self._template = Template(self.template)\n\n def unload(self) -> None:\n super().unload()\n self._template = None\n\n @property\n def inputs(self) -> \"StepColumns\":\n \"\"\"The input for the task is the `instruction` by default, or the `columns` given as input.\"\"\"\n columns = {column: True for column in self.columns}\n columns[\"system_prompt\"] = False\n return columns\n\n def _prepare_message_content(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"Prepares the content for the template and returns the formatted messages.\"\"\"\n fields = {column: input[column] for column in self.columns}\n return [{\"role\": \"user\", \"content\": self._template.render(**fields)}]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n # Handle the previous expected errors, in case of custom columns there's more freedom\n # and we cannot check it so easily.\n if self.columns == [\"instruction\"]:\n if is_openai_format(input[\"instruction\"]):\n raise DistilabelUserError(\n \"Providing `instruction` formatted as an OpenAI chat / conversation is\"\n \" deprecated, you should use `ChatGeneration` with `messages` as input instead.\",\n page=\"components-gallery/tasks/textgeneration/\",\n )\n\n if not isinstance(input[\"instruction\"], str):\n raise DistilabelUserError(\n f\"Input `instruction` must be a string. Got: {input['instruction']}.\",\n page=\"components-gallery/tasks/textgeneration/\",\n )\n\n messages = self._prepare_message_content(input)\n\n row_system_prompt = input.get(\"system_prompt\")\n if row_system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": row_system_prompt})\n\n if self.system_prompt and not row_system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": self.system_prompt})\n\n return messages # type: ignore\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task is the `generation` and the `model_name`.\"\"\"\n return [\"generation\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `generation`. The `model_name`\n will be automatically included within the `process` method of `Task`.\"\"\"\n return {\"generation\": output}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextGeneration.inputs","title":"inputs
property
","text":"The input for the task is the instruction
by default, or the columns
given as input.
outputs
property
","text":"The output for the task is the generation
and the model_name
.
_prepare_message_content(input)
","text":"Prepares the content for the template and returns the formatted messages.
Source code insrc/distilabel/steps/tasks/text_generation.py
def _prepare_message_content(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"Prepares the content for the template and returns the formatted messages.\"\"\"\n fields = {column: input[column] for column in self.columns}\n return [{\"role\": \"user\", \"content\": self._template.render(**fields)}]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextGeneration.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/text_generation.py
def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n # Handle the previous expected errors, in case of custom columns there's more freedom\n # and we cannot check it so easily.\n if self.columns == [\"instruction\"]:\n if is_openai_format(input[\"instruction\"]):\n raise DistilabelUserError(\n \"Providing `instruction` formatted as an OpenAI chat / conversation is\"\n \" deprecated, you should use `ChatGeneration` with `messages` as input instead.\",\n page=\"components-gallery/tasks/textgeneration/\",\n )\n\n if not isinstance(input[\"instruction\"], str):\n raise DistilabelUserError(\n f\"Input `instruction` must be a string. Got: {input['instruction']}.\",\n page=\"components-gallery/tasks/textgeneration/\",\n )\n\n messages = self._prepare_message_content(input)\n\n row_system_prompt = input.get(\"system_prompt\")\n if row_system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": row_system_prompt})\n\n if self.system_prompt and not row_system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": self.system_prompt})\n\n return messages # type: ignore\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextGeneration.format_output","title":"format_output(output, input=None)
","text":"The output is formatted as a dictionary with the generation
. The model_name
will be automatically included within the process
method of Task
.
src/distilabel/steps/tasks/text_generation.py
def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `generation`. The `model_name`\n will be automatically included within the `process` method of `Task`.\"\"\"\n return {\"generation\": output}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextGenerationWithImage","title":"TextGenerationWithImage
","text":" Bases: TextGeneration
Text generation with images with an LLM
given a prompt.
`TextGenerationWithImage` is a pre-defined task that allows passing a custom prompt using the\nJinja2 syntax. By default, a `instruction` is expected in the inputs, but the using\n`template` and `columns` attributes one can define a custom prompt and columns expected\nfrom the text. Additionally, an `image` column is expected containing one of the\nurl, base64 encoded image or PIL image. This task inherits from `TextGeneration`,\nso all the functionality available in that task related to the prompt will be available\nhere too.\n\nAttributes:\n system_prompt: The system prompt to use in the generation.\n If not, then no system prompt will be used. Defaults to `None`.\n template: The template to use for the generation. It must follow the Jinja2 template\n syntax. If not provided, it will assume the text passed is an instruction and\n construct the appropriate template.\n columns: A string with the column, or a list with columns expected in the template.\n Take a look at the examples for more information. Defaults to `instruction`.\n image_type: The type of the image provided, this will be used to preprocess if necessary.\n Must be one of \"url\", \"base64\" or \"PIL\".\n\nInput columns:\n - dynamic (determined by `columns` attribute): By default will be set to `instruction`.\n The columns can point both to a `str` or a `list[str]` to be used in the template.\n - image: The column containing the image URL, base64 encoded image or PIL image.\n\nOutput columns:\n - generation (`str`): The generated text.\n - model_name (`str`): The name of the model used to generate the text.\n\nCategories:\n - text-generation\n\nReferences:\n - [Jinja2 Template Designer Documentation](https://jinja.palletsprojects.com/en/3.1.x/templates/)\n - [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)\n - [OpenAI Vision](https://platform.openai.com/docs/guides/vision)\n\nExamples:\n Answer questions from an image:\n\n ```python\n from distilabel.steps.tasks import TextGenerationWithImage\n from distilabel.models.llms import InferenceEndpointsLLM\n\n vision = TextGenerationWithImage(\n name=\"vision_gen\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n ),\n image_type=\"url\"\n )\n\n vision.load()\n\n result = next(\n vision.process(\n [\n {\n \"instruction\": \"What\u2019s in this image?\",\n \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n }\n ]\n )\n )\n # result\n # [\n # {\n # \"instruction\": \"What\u2019s in this image?\",\n # \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\",\n # \"generation\": \"Based on the visual cues in the image...\",\n # \"model_name\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\"\n # ... # distilabel_metadata would be here\n # }\n # ]\n # result[0][\"generation\"]\n # \"Based on the visual cues in the image, here are some possible story points:\n
Analysis and Ideas: * The abundance of green grass and trees suggests a healthy ecosystem or habitat. * The presence of wildlife, such as birds or deer, is possible based on the surroundings. * A footbridge or a pathway might be a common feature in this area, providing access to nearby attractions or points of interest.
Additional Questions to Ask: * Why is a footbridge present in this area? * What kind of wildlife inhabits this region\"
Answer questions from an image stored as base64:\n\n```python\n# For this example we will assume that we have the string representation of the image\n# stored, but will just take the image and transform it to base64 to illustrate the example.\nimport requests\nimport base64\n\nimage_url = \"https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg\"\nimg = requests.get(image_url).content\nbase64_image = base64.b64encode(img).decode(\"utf-8\")\n\nfrom distilabel.steps.tasks import TextGenerationWithImage\nfrom distilabel.models.llms import InferenceEndpointsLLM\n\nvision = TextGenerationWithImage(\n    name=\"vision_gen\",\n    llm=InferenceEndpointsLLM(\n        model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n    ),\n    image_type=\"base64\"\n)\n\nvision.load()\n\nresult = next(\n    vision.process(\n        [\n            {\n                \"instruction\": \"What\u2019s in this image?\",\n                \"image\": base64_image\n            }\n        ]\n    )\n)\n
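Answer questions from a PIL image (a minimal sketch, not part of the original docstring: it assumes Pillow is installed and that \"example.jpg\" is a hypothetical local file):\n\n```python\nfrom PIL import Image\n\nfrom distilabel.steps.tasks import TextGenerationWithImage\nfrom distilabel.models.llms import InferenceEndpointsLLM\n\nvision = TextGenerationWithImage(\n    name=\"vision_gen\",\n    llm=InferenceEndpointsLLM(\n        model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n    ),\n    # the \"image\" column is expected to hold PIL.Image.Image objects\n    image_type=\"PIL\"\n)\n\nvision.load()\n\nresult = next(\n    vision.process(\n        [\n            {\n                \"instruction\": \"What's in this image?\",\n                \"image\": Image.open(\"example.jpg\")  # hypothetical placeholder path\n            }\n        ]\n    )\n)\n```\n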
Source code in src/distilabel/steps/tasks/text_generation_with_image.py
class TextGenerationWithImage(TextGeneration):\n \"\"\"Text generation with images with an `LLM` given a prompt.\n\n `TextGenerationWithImage` is a pre-defined task that allows passing a custom prompt using the\n Jinja2 syntax. By default, a `instruction` is expected in the inputs, but the using\n `template` and `columns` attributes one can define a custom prompt and columns expected\n from the text. Additionally, an `image` column is expected containing one of the\n url, base64 encoded image or PIL image. This task inherits from `TextGeneration`,\n so all the functionality available in that task related to the prompt will be available\n here too.\n\n Attributes:\n system_prompt: The system prompt to use in the generation.\n If not, then no system prompt will be used. Defaults to `None`.\n template: The template to use for the generation. It must follow the Jinja2 template\n syntax. If not provided, it will assume the text passed is an instruction and\n construct the appropriate template.\n columns: A string with the column, or a list with columns expected in the template.\n Take a look at the examples for more information. Defaults to `instruction`.\n image_type: The type of the image provided, this will be used to preprocess if necessary.\n Must be one of \"url\", \"base64\" or \"PIL\".\n\n Input columns:\n - dynamic (determined by `columns` attribute): By default will be set to `instruction`.\n The columns can point both to a `str` or a `list[str]` to be used in the template.\n - image: The column containing the image URL, base64 encoded image or PIL image.\n\n Output columns:\n - generation (`str`): The generated text.\n - model_name (`str`): The name of the model used to generate the text.\n\n Categories:\n - text-generation\n\n References:\n - [Jinja2 Template Designer Documentation](https://jinja.palletsprojects.com/en/3.1.x/templates/)\n - [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)\n - [OpenAI Vision](https://platform.openai.com/docs/guides/vision)\n\n Examples:\n Answer questions from an image:\n\n ```python\n from distilabel.steps.tasks import TextGenerationWithImage\n from distilabel.models.llms import InferenceEndpointsLLM\n\n vision = TextGenerationWithImage(\n name=\"vision_gen\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n ),\n image_type=\"url\"\n )\n\n vision.load()\n\n result = next(\n vision.process(\n [\n {\n \"instruction\": \"What\u2019s in this image?\",\n \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n }\n ]\n )\n )\n # result\n # [\n # {\n # \"instruction\": \"What\\u2019s in this image?\",\n # \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\",\n # \"generation\": \"Based on the visual cues in the image...\",\n # \"model_name\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\"\n # ... 
# distilabel_metadata would be here\n # }\n # ]\n # result[0][\"generation\"]\n # \"Based on the visual cues in the image, here are some possible story points:\\n\\n* The image features a wooden boardwalk leading through a lush grass field, possibly in a park or nature reserve.\\n\\nAnalysis and Ideas:\\n* The abundance of green grass and trees suggests a healthy ecosystem or habitat.\\n* The presence of wildlife, such as birds or deer, is possible based on the surroundings.\\n* A footbridge or a pathway might be a common feature in this area, providing access to nearby attractions or points of interest.\\n\\nAdditional Questions to Ask:\\n* Why is a footbridge present in this area?\\n* What kind of wildlife inhabits this region\"\n ```\n\n Answer questions from an image stored as base64:\n\n ```python\n # For this example we will assume that we have the string representation of the image\n # stored, but will just take the image and transform it to base64 to ilustrate the example.\n import requests\n import base64\n\n image_url =\"https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg\"\n img = requests.get(image_url).content\n base64_image = base64.b64encode(img).decode(\"utf-8\")\n\n from distilabel.steps.tasks import TextGenerationWithImage\n from distilabel.models.llms import InferenceEndpointsLLM\n\n vision = TextGenerationWithImage(\n name=\"vision_gen\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n ),\n image_type=\"base64\"\n )\n\n vision.load()\n\n result = next(\n vision.process(\n [\n {\n \"instruction\": \"What\u2019s in this image?\",\n \"image\": base64_image\n }\n ]\n )\n )\n ```\n \"\"\"\n\n image_type: Literal[\"url\", \"base64\", \"PIL\"] = Field(\n default=\"url\",\n description=\"The type of the image provided, this will be used to preprocess if necessary.\",\n )\n\n @property\n def inputs(self) -> \"StepColumns\":\n columns = super().inputs\n columns[\"image\"] = True\n return columns\n\n def load(self) -> None:\n Task.load(self)\n\n for column in self.columns:\n check_column_in_template(\n column, self.template, page=\"components-gallery/tasks/visiongeneration/\"\n )\n\n self._template = Template(self.template)\n\n def _transform_image(self, image: Union[str, \"Image\"]) -> str:\n \"\"\"Transforms the image based on the `image_type` attribute.\"\"\"\n if self.image_type == \"url\":\n return image\n\n if self.image_type == \"base64\":\n return f\"data:image/jpeg;base64,{image}\"\n\n # Othwerwise, it's a PIL image\n return f\"data:image/jpeg;base64,{image_to_str(image)}\"\n\n def _prepare_message_content(self, input: dict[str, Any]) -> \"ChatType\":\n \"\"\"Prepares the content for the template and returns the formatted messages.\"\"\"\n fields = {column: input[column] for column in self.columns}\n img_url = self._transform_image(input[\"image\"])\n return [\n {\n \"role\": \"user\",\n \"content\": [\n {\n \"type\": \"text\",\n \"text\": self._template.render(**fields),\n },\n {\n \"type\": \"image_url\",\n \"image_url\": {\n \"url\": img_url,\n },\n },\n ],\n }\n ]\n\n def format_input(self, input: dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n messages = self._prepare_message_content(input)\n\n if self.system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": self.system_prompt})\n\n return messages # type: ignore\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextGenerationWithImage._transform_image","title":"_transform_image(image)
","text":"Transforms the image based on the image_type
attribute.
src/distilabel/steps/tasks/text_generation_with_image.py
def _transform_image(self, image: Union[str, \"Image\"]) -> str:\n    \"\"\"Transforms the image based on the `image_type` attribute.\"\"\"\n    if self.image_type == \"url\":\n        return image\n\n    if self.image_type == \"base64\":\n        return f\"data:image/jpeg;base64,{image}\"\n\n    # Otherwise, it's a PIL image\n    return f\"data:image/jpeg;base64,{image_to_str(image)}\"\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextGenerationWithImage._prepare_message_content","title":"_prepare_message_content(input)
","text":"Prepares the content for the template and returns the formatted messages.
Source code insrc/distilabel/steps/tasks/text_generation_with_image.py
def _prepare_message_content(self, input: dict[str, Any]) -> \"ChatType\":\n \"\"\"Prepares the content for the template and returns the formatted messages.\"\"\"\n fields = {column: input[column] for column in self.columns}\n img_url = self._transform_image(input[\"image\"])\n return [\n {\n \"role\": \"user\",\n \"content\": [\n {\n \"type\": \"text\",\n \"text\": self._template.render(**fields),\n },\n {\n \"type\": \"image_url\",\n \"image_url\": {\n \"url\": img_url,\n },\n },\n ],\n }\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.TextGenerationWithImage.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/text_generation_with_image.py
def format_input(self, input: dict[str, Any]) -> \"ChatType\":\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n messages = self._prepare_message_content(input)\n\n if self.system_prompt:\n messages.insert(0, {\"role\": \"system\", \"content\": self.system_prompt})\n\n return messages # type: ignore\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.UltraFeedback","title":"UltraFeedback
","text":" Bases: Task
Rank generations focusing on different aspects using an LLM
.
UltraFeedback: Boosting Language Models with High-quality Feedback.
Attributes:
Name Type Descriptionaspect
Literal['helpfulness', 'honesty', 'instruction-following', 'truthfulness', 'overall-rating']
The aspect to perform with the UltraFeedback
model. The available aspects are: - helpfulness
: Evaluate text outputs based on helpfulness. - honesty
: Evaluate text outputs based on honesty. - instruction-following
: Evaluate text outputs based on given instructions. - truthfulness
: Evaluate text outputs based on truthfulness. Additionally, a custom aspect has been defined by Argilla, so as to evaluate the overall assessment of the text outputs within a single prompt. The custom aspect is: - overall-rating
: Evaluate text outputs based on an overall assessment. Defaults to \"overall-rating\"
.
str
): The reference instruction to evaluate the text outputs.List[str]
): The text outputs to evaluate for the given instruction.List[float]
): The ratings for each of the provided text outputs.List[str]
): The rationales for each of the provided text outputs.str
): The name of the model used to generate the ratings and rationales.UltraFeedback: Boosting Language Models with High-quality Feedback
UltraFeedback - GitHub Repository
Examples:
Rate generations from different LLMs based on the selected aspect:
from distilabel.steps.tasks import UltraFeedback\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n use_default_structured_output=False\n)\n\nultrafeedback.load()\n\nresult = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'How much is 2+2?',\n# 'generations': ['4', 'and a car'],\n# 'ratings': [1, 2],\n# 'rationales': ['explanation for 4', 'explanation for and a car'],\n# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n# }\n# ]\n
Rate generations from different LLMs based on honesty, using the default structured output:
from distilabel.steps.tasks import UltraFeedback\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n aspect=\"honesty\"\n)\n\nultrafeedback.load()\n\nresult = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n)\n# result\n# [{'instruction': 'How much is 2+2?',\n# 'generations': ['4', 'and a car'],\n# 'ratings': [5, 1],\n# 'rationales': ['The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.',\n# \"The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. The model shows no uncertainty or indication that it does not know the answer.\"],\n# 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{\"ratings\": [\\n 5,\\n 1\\n] \\n\\n,\"rationales\": [\\n \"The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.\",\\n \"The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. The model shows no uncertainty or indication that it does not know the answer.\"\\n] }'},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
Rate generations from different LLMs based on helpfulness, using the default structured output:
from distilabel.steps.tasks import UltraFeedback\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512},\n ),\n aspect=\"helpfulness\"\n)\n\nultrafeedback.load()\n\nresult = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n)\n# result\n# [{'instruction': 'How much is 2+2?',\n# 'generations': ['4', 'and a car'],\n# 'ratings': [1, 5],\n# 'rationales': ['Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.',\n# 'Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question.'],\n# 'rationales_for_rating': ['Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.',\n# 'Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question.'],\n# 'types': [1, 3, 1],\n# 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{ \\n \"ratings\": [\\n 1,\\n 5\\n ]\\n ,\\n \"rationales\": [\\n \"Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.\",\\n \"Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question.\"\\n ]\\n ,\\n \"rationales_for_rating\": [\\n \"Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.\",\\n \"Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question.\"\\n ]\\n ,\\n \"types\": [\\n 1, 3,\\n 1\\n ]\\n }'},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
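The task can also be wired into a full pipeline (a minimal sketch, not taken from the original docs; it assumes the LoadDataFromDicts loader and the same InferenceEndpointsLLM placeholder used above):\n\n```python\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import UltraFeedback\n\nwith Pipeline(name=\"ultrafeedback-sketch\") as pipeline:\n    # A couple of instruction/generations pairs to be rated.\n    load_data = LoadDataFromDicts(\n        data=[\n            {\"instruction\": \"How much is 2+2?\", \"generations\": [\"4\", \"and a car\"]},\n        ]\n    )\n\n    ultrafeedback = UltraFeedback(\n        llm=InferenceEndpointsLLM(\n            model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n        ),\n        aspect=\"overall-rating\",\n    )\n\n    # Connect the loader to the task.\n    load_data >> ultrafeedback\n\nif __name__ == \"__main__\":\n    distiset = pipeline.run(use_cache=False)\n```\n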
Citations @misc{cui2024ultrafeedbackboostinglanguagemodels,\n title={UltraFeedback: Boosting Language Models with Scaled AI Feedback},\n author={Ganqu Cui and Lifan Yuan and Ning Ding and Guanming Yao and Bingxiang He and Wei Zhu and Yuan Ni and Guotong Xie and Ruobing Xie and Yankai Lin and Zhiyuan Liu and Maosong Sun},\n year={2024},\n eprint={2310.01377},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2310.01377},\n}\n
Source code in src/distilabel/steps/tasks/ultrafeedback.py
class UltraFeedback(Task):\n \"\"\"Rank generations focusing on different aspects using an `LLM`.\n\n UltraFeedback: Boosting Language Models with High-quality Feedback.\n\n Attributes:\n aspect: The aspect to perform with the `UltraFeedback` model. The available aspects are:\n - `helpfulness`: Evaluate text outputs based on helpfulness.\n - `honesty`: Evaluate text outputs based on honesty.\n - `instruction-following`: Evaluate text outputs based on given instructions.\n - `truthfulness`: Evaluate text outputs based on truthfulness.\n Additionally, a custom aspect has been defined by Argilla, so as to evaluate the overall\n assessment of the text outputs within a single prompt. The custom aspect is:\n - `overall-rating`: Evaluate text outputs based on an overall assessment.\n Defaults to `\"overall-rating\"`.\n\n Input columns:\n - instruction (`str`): The reference instruction to evaluate the text outputs.\n - generations (`List[str]`): The text outputs to evaluate for the given instruction.\n\n Output columns:\n - ratings (`List[float]`): The ratings for each of the provided text outputs.\n - rationales (`List[str]`): The rationales for each of the provided text outputs.\n - model_name (`str`): The name of the model used to generate the ratings and rationales.\n\n Categories:\n - preference\n\n References:\n - [`UltraFeedback: Boosting Language Models with High-quality Feedback`](https://arxiv.org/abs/2310.01377)\n - [`UltraFeedback - GitHub Repository`](https://github.com/OpenBMB/UltraFeedback)\n\n Examples:\n Rate generations from different LLMs based on the selected aspect:\n\n ```python\n from distilabel.steps.tasks import UltraFeedback\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n ultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n use_default_structured_output=False\n )\n\n ultrafeedback.load()\n\n result = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n )\n # result\n # [\n # {\n # 'instruction': 'How much is 2+2?',\n # 'generations': ['4', 'and a car'],\n # 'ratings': [1, 2],\n # 'rationales': ['explanation for 4', 'explanation for and a car'],\n # 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n # }\n # ]\n ```\n\n Rate generations from different LLMs based on the honesty, using the default structured output:\n\n ```python\n from distilabel.steps.tasks import UltraFeedback\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n ultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n aspect=\"honesty\"\n )\n\n ultrafeedback.load()\n\n result = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n )\n # result\n # [{'instruction': 'How much is 2+2?',\n # 'generations': ['4', 'and a car'],\n # 'ratings': [5, 1],\n # 'rationales': ['The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.',\n # \"The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. 
The model shows no uncertainty or indication that it does not know the answer.\"],\n # 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{\"ratings\": [\\\\n 5,\\\\n 1\\\\n] \\\\n\\\\n,\"rationales\": [\\\\n \"The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.\",\\\\n \"The response is confidently incorrect, as it provides unrelated information (\\'a car\\') and does not address the question. The model shows no uncertainty or indication that it does not know the answer.\"\\\\n] }'},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n\n Rate generations from different LLMs based on the helpfulness, using the default structured output:\n\n ```python\n from distilabel.steps.tasks import UltraFeedback\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n ultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512},\n ),\n aspect=\"helpfulness\"\n )\n\n ultrafeedback.load()\n\n result = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n )\n # result\n # [{'instruction': 'How much is 2+2?',\n # 'generations': ['4', 'and a car'],\n # 'ratings': [1, 5],\n # 'rationales': ['Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.',\n # 'Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question.'],\n # 'rationales_for_rating': ['Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.',\n # 'Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question.'],\n # 'types': [1, 3, 1],\n # 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{ \\\\n \"ratings\": [\\\\n 1,\\\\n 5\\\\n ]\\\\n ,\\\\n \"rationales\": [\\\\n \"Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.\",\\\\n \"Text 2 is neither clear nor relevant to the task. 
It does not provide any useful information and seems unrelated to the question.\"\\\\n ]\\\\n ,\\\\n \"rationales_for_rating\": [\\\\n \"Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.\",\\\\n \"Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question.\"\\\\n ]\\\\n ,\\\\n \"types\": [\\\\n 1, 3,\\\\n 1\\\\n ]\\\\n }'},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n ```\n\n Citations:\n ```\n @misc{cui2024ultrafeedbackboostinglanguagemodels,\n title={UltraFeedback: Boosting Language Models with Scaled AI Feedback},\n author={Ganqu Cui and Lifan Yuan and Ning Ding and Guanming Yao and Bingxiang He and Wei Zhu and Yuan Ni and Guotong Xie and Ruobing Xie and Yankai Lin and Zhiyuan Liu and Maosong Sun},\n year={2024},\n eprint={2310.01377},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2310.01377},\n }\n ```\n \"\"\"\n\n aspect: Literal[\n \"helpfulness\",\n \"honesty\",\n \"instruction-following\",\n \"truthfulness\",\n # Custom aspects\n \"overall-rating\",\n ] = \"overall-rating\"\n\n _system_prompt: str = PrivateAttr(\n default=(\n \"Your role is to evaluate text quality based on given criteria.\\n\"\n 'You\\'ll receive an instructional description (\"Instruction\") and {no_texts} text outputs (\"Text\").\\n'\n \"Understand and interpret instructions to evaluate effectively.\\n\"\n \"Provide annotations for each text with a rating and rationale.\\n\"\n \"The {no_texts} texts given are independent, and should be evaluated separately.\\n\"\n )\n )\n _template: Optional[\"Template\"] = PrivateAttr(default=...)\n _can_be_used_with_offline_batch_generation = True\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template for the given `aspect`.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"ultrafeedback\"\n / f\"{self.aspect}.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task is the `instruction`, and the `generations` for it.\"\"\"\n return [\"instruction\", \"generations\"]\n\n def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"system\",\n \"content\": self._system_prompt.format(\n no_texts=len(input[\"generations\"])\n ),\n },\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n instruction=input[\"instruction\"], generations=input[\"generations\"]\n ),\n },\n ]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The output for the task is the `generation` and the `model_name`.\"\"\"\n columns = []\n if self.aspect in [\"honesty\", \"instruction-following\", \"overall-rating\"]:\n columns = [\"ratings\", \"rationales\"]\n elif self.aspect in [\"helpfulness\", \"truthfulness\"]:\n columns = [\"types\", \"rationales\", \"ratings\", \"rationales-for-ratings\"]\n return columns + [\"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n ) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `ratings` and `rationales` for\n each of the provided `generations` for the given `instruction`. 
The `model_name`\n will be automatically included within the `process` method of `Task`.\n\n Args:\n output: a string representing the output of the LLM via the `process` method.\n input: the input to the task, as required by some tasks to format the output.\n\n Returns:\n A dictionary containing either the `ratings` and `rationales` for each of the provided\n `generations` for the given `instruction` if the provided aspect is either `honesty`,\n `instruction-following`, or `overall-rating`; or the `types`, `rationales`,\n `ratings`, and `rationales-for-ratings` for each of the provided `generations` for the\n given `instruction` if the provided aspect is either `helpfulness` or `truthfulness`.\n \"\"\"\n assert input is not None, \"Input is required to format the output.\"\n\n if self.aspect in [\n \"honesty\",\n \"instruction-following\",\n \"overall-rating\",\n ]:\n return self._format_ratings_rationales_output(output, input)\n\n return self._format_types_ratings_rationales_output(output, input)\n\n def _format_ratings_rationales_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, List[Any]]:\n \"\"\"Formats the output when the aspect is either `honesty`, `instruction-following`, or `overall-rating`.\"\"\"\n if output is None:\n return {\n \"ratings\": [None] * len(input[\"generations\"]),\n \"rationales\": [None] * len(input[\"generations\"]),\n }\n\n if self.use_default_structured_output:\n return self._format_structured_output(output, input)\n\n pattern = r\"Rating: (.+?)\\nRationale: (.+)\"\n sections = output.split(\"\\n\\n\")\n\n formatted_outputs = []\n for section in sections:\n matches = None\n if section is not None and section != \"\":\n matches = re.search(pattern, section, re.DOTALL)\n if not matches:\n formatted_outputs.append({\"ratings\": None, \"rationales\": None})\n continue\n\n formatted_outputs.append(\n {\n \"ratings\": (\n int(re.findall(r\"\\b\\d+\\b\", matches.group(1))[0])\n if matches.group(1) not in [\"None\", \"N/A\"]\n else None\n ),\n \"rationales\": matches.group(2),\n }\n )\n return group_dicts(*formatted_outputs)\n\n def _format_types_ratings_rationales_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, List[Any]]:\n \"\"\"Formats the output when the aspect is either `helpfulness` or `truthfulness`.\"\"\"\n if output is None:\n return {\n \"types\": [None] * len(input[\"generations\"]),\n \"rationales\": [None] * len(input[\"generations\"]),\n \"ratings\": [None] * len(input[\"generations\"]),\n \"rationales-for-ratings\": [None] * len(input[\"generations\"]),\n }\n\n if self.use_default_structured_output:\n return self._format_structured_output(output, input)\n\n pattern = r\"Type: (.+?)\\nRationale: (.+?)\\nRating: (.+?)\\nRationale: (.+)\"\n\n sections = output.split(\"\\n\\n\")\n\n formatted_outputs = []\n for section in sections:\n matches = None\n if section is not None and section != \"\":\n matches = re.search(pattern, section, re.DOTALL)\n if not matches:\n formatted_outputs.append(\n {\n \"types\": None,\n \"rationales\": None,\n \"ratings\": None,\n \"rationales-for-ratings\": None,\n }\n )\n continue\n\n formatted_outputs.append(\n {\n \"types\": (\n int(re.findall(r\"\\b\\d+\\b\", matches.group(1))[0])\n if matches.group(1) not in [\"None\", \"N/A\"]\n else None\n ),\n \"rationales\": matches.group(2),\n \"ratings\": (\n int(re.findall(r\"\\b\\d+\\b\", matches.group(3))[0])\n if matches.group(3) not in [\"None\", \"N/A\"]\n else None\n ),\n \"rationales-for-ratings\": 
matches.group(4),\n }\n )\n return group_dicts(*formatted_outputs)\n\n @override\n def get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel\n from typing import List\n\n class SchemaUltraFeedback(BaseModel):\n ratings: List[int]\n rationales: List[str]\n\n class SchemaUltraFeedbackWithType(BaseModel):\n types: List[Optional[int]]\n ratings: List[int]\n rationales: List[str]\n rationales_for_rating: List[str]\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n if self.aspect in [\n \"honesty\",\n \"instruction-following\",\n \"overall-rating\",\n ]:\n return {\n \"properties\": {\n \"ratings\": {\n \"items\": {\"type\": \"integer\"},\n \"title\": \"Ratings\",\n \"type\": \"array\",\n },\n \"rationales\": {\n \"items\": {\"type\": \"string\"},\n \"title\": \"Rationales\",\n \"type\": \"array\",\n },\n },\n \"required\": [\"ratings\", \"rationales\"],\n \"title\": \"SchemaUltraFeedback\",\n \"type\": \"object\",\n }\n return {\n \"properties\": {\n \"types\": {\n \"items\": {\"anyOf\": [{\"type\": \"integer\"}, {\"type\": \"null\"}]},\n \"title\": \"Types\",\n \"type\": \"array\",\n },\n \"ratings\": {\n \"items\": {\"type\": \"integer\"},\n \"title\": \"Ratings\",\n \"type\": \"array\",\n },\n \"rationales\": {\n \"items\": {\"type\": \"string\"},\n \"title\": \"Rationales\",\n \"type\": \"array\",\n },\n \"rationales_for_rating\": {\n \"items\": {\"type\": \"string\"},\n \"title\": \"Rationales For Rating\",\n \"type\": \"array\",\n },\n },\n \"required\": [\"types\", \"ratings\", \"rationales\", \"rationales_for_rating\"],\n \"title\": \"SchemaUltraFeedbackWithType\",\n \"type\": \"object\",\n }\n\n def _format_structured_output(\n self, output: str, input: Dict[str, Any]\n ) -> Dict[str, Any]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with either `positive`, or `positive` and `negative` keys.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n if self.aspect in [\n \"honesty\",\n \"instruction-following\",\n \"overall-rating\",\n ]:\n return {\n \"ratings\": [None] * len(input[\"generations\"]),\n \"rationales\": [None] * len(input[\"generations\"]),\n }\n return {\n \"ratings\": [None] * len(input[\"generations\"]),\n \"rationales\": [None] * len(input[\"generations\"]),\n \"types\": [None] * len(input[\"generations\"]),\n \"rationales-for-ratings\": [None] * len(input[\"generations\"]),\n }\n\n @override\n def _sample_input(self) -> ChatType:\n return self.format_input(\n {\n \"instruction\": f\"<PLACEHOLDER_{'instruction'.upper()}>\",\n \"generations\": [\n f\"<PLACEHOLDER_{f'GENERATION_{i}'.upper()}>\" for i in range(2)\n ],\n }\n )\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.UltraFeedback.inputs","title":"inputs
property
","text":"The input for the task is the instruction
, and the generations
for it.
outputs
property
","text":"The output for the task consists of the ratings and rationales for each of the provided generations (plus types and rationales-for-ratings when the aspect is helpfulness or truthfulness), together with the model_name.
load()
","text":"Loads the Jinja2 template for the given aspect
.
src/distilabel/steps/tasks/ultrafeedback.py
def load(self) -> None:\n \"\"\"Loads the Jinja2 template for the given `aspect`.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"ultrafeedback\"\n / f\"{self.aspect}.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.UltraFeedback.format_input","title":"format_input(input)
","text":"The input is formatted as a ChatType
assuming that the instruction is the first interaction from the user within a conversation.
src/distilabel/steps/tasks/ultrafeedback.py
def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation.\"\"\"\n return [\n {\n \"role\": \"system\",\n \"content\": self._system_prompt.format(\n no_texts=len(input[\"generations\"])\n ),\n },\n {\n \"role\": \"user\",\n \"content\": self._template.render( # type: ignore\n instruction=input[\"instruction\"], generations=input[\"generations\"]\n ),\n },\n ]\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.UltraFeedback.format_output","title":"format_output(output, input=None)
","text":"The output is formatted as a dictionary with the ratings
and rationales
for each of the provided generations
for the given instruction
. The model_name
will be automatically included within the process
method of Task
.
Parameters:
Name Type Description Defaultoutput
Union[str, None]
a string representing the output of the LLM via the process
method.
input
Union[Dict[str, Any], None]
the input to the task, as required by some tasks to format the output.
None
Returns:
Type DescriptionDict[str, Any]
A dictionary containing either the ratings
and rationales
for each of the provided
Dict[str, Any]
generations
for the given instruction
if the provided aspect is either honesty
,
Dict[str, Any]
instruction-following
, or overall-rating
; or the types
, rationales
,
Dict[str, Any]
ratings
, and rationales-for-ratings
for each of the provided generations
for the
Dict[str, Any]
given instruction
if the provided aspect is either helpfulness
or truthfulness
.
src/distilabel/steps/tasks/ultrafeedback.py
def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n) -> Dict[str, Any]:\n \"\"\"The output is formatted as a dictionary with the `ratings` and `rationales` for\n each of the provided `generations` for the given `instruction`. The `model_name`\n will be automatically included within the `process` method of `Task`.\n\n Args:\n output: a string representing the output of the LLM via the `process` method.\n input: the input to the task, as required by some tasks to format the output.\n\n Returns:\n A dictionary containing either the `ratings` and `rationales` for each of the provided\n `generations` for the given `instruction` if the provided aspect is either `honesty`,\n `instruction-following`, or `overall-rating`; or the `types`, `rationales`,\n `ratings`, and `rationales-for-ratings` for each of the provided `generations` for the\n given `instruction` if the provided aspect is either `helpfulness` or `truthfulness`.\n \"\"\"\n assert input is not None, \"Input is required to format the output.\"\n\n if self.aspect in [\n \"honesty\",\n \"instruction-following\",\n \"overall-rating\",\n ]:\n return self._format_ratings_rationales_output(output, input)\n\n return self._format_types_ratings_rationales_output(output, input)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.UltraFeedback._format_ratings_rationales_output","title":"_format_ratings_rationales_output(output, input)
","text":"Formats the output when the aspect is either honesty
, instruction-following
, or overall-rating
.
src/distilabel/steps/tasks/ultrafeedback.py
def _format_ratings_rationales_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, List[Any]]:\n \"\"\"Formats the output when the aspect is either `honesty`, `instruction-following`, or `overall-rating`.\"\"\"\n if output is None:\n return {\n \"ratings\": [None] * len(input[\"generations\"]),\n \"rationales\": [None] * len(input[\"generations\"]),\n }\n\n if self.use_default_structured_output:\n return self._format_structured_output(output, input)\n\n pattern = r\"Rating: (.+?)\\nRationale: (.+)\"\n sections = output.split(\"\\n\\n\")\n\n formatted_outputs = []\n for section in sections:\n matches = None\n if section is not None and section != \"\":\n matches = re.search(pattern, section, re.DOTALL)\n if not matches:\n formatted_outputs.append({\"ratings\": None, \"rationales\": None})\n continue\n\n formatted_outputs.append(\n {\n \"ratings\": (\n int(re.findall(r\"\\b\\d+\\b\", matches.group(1))[0])\n if matches.group(1) not in [\"None\", \"N/A\"]\n else None\n ),\n \"rationales\": matches.group(2),\n }\n )\n return group_dicts(*formatted_outputs)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.UltraFeedback._format_types_ratings_rationales_output","title":"_format_types_ratings_rationales_output(output, input)
","text":"Formats the output when the aspect is either helpfulness
or truthfulness
.
src/distilabel/steps/tasks/ultrafeedback.py
def _format_types_ratings_rationales_output(\n self, output: Union[str, None], input: Dict[str, Any]\n) -> Dict[str, List[Any]]:\n \"\"\"Formats the output when the aspect is either `helpfulness` or `truthfulness`.\"\"\"\n if output is None:\n return {\n \"types\": [None] * len(input[\"generations\"]),\n \"rationales\": [None] * len(input[\"generations\"]),\n \"ratings\": [None] * len(input[\"generations\"]),\n \"rationales-for-ratings\": [None] * len(input[\"generations\"]),\n }\n\n if self.use_default_structured_output:\n return self._format_structured_output(output, input)\n\n pattern = r\"Type: (.+?)\\nRationale: (.+?)\\nRating: (.+?)\\nRationale: (.+)\"\n\n sections = output.split(\"\\n\\n\")\n\n formatted_outputs = []\n for section in sections:\n matches = None\n if section is not None and section != \"\":\n matches = re.search(pattern, section, re.DOTALL)\n if not matches:\n formatted_outputs.append(\n {\n \"types\": None,\n \"rationales\": None,\n \"ratings\": None,\n \"rationales-for-ratings\": None,\n }\n )\n continue\n\n formatted_outputs.append(\n {\n \"types\": (\n int(re.findall(r\"\\b\\d+\\b\", matches.group(1))[0])\n if matches.group(1) not in [\"None\", \"N/A\"]\n else None\n ),\n \"rationales\": matches.group(2),\n \"ratings\": (\n int(re.findall(r\"\\b\\d+\\b\", matches.group(3))[0])\n if matches.group(3) not in [\"None\", \"N/A\"]\n else None\n ),\n \"rationales-for-ratings\": matches.group(4),\n }\n )\n return group_dicts(*formatted_outputs)\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.UltraFeedback.get_structured_output","title":"get_structured_output()
","text":"Creates the json schema to be passed to the LLM, to enforce generating a dictionary with the output which can be directly parsed as a python dictionary.
The schema corresponds to the following:
from pydantic import BaseModel\nfrom typing import List\n\nclass SchemaUltraFeedback(BaseModel):\n ratings: List[int]\n rationales: List[str]\n\nclass SchemaUltraFeedbackWithType(BaseModel):\n types: List[Optional[int]]\n ratings: List[int]\n rationales: List[str]\n rationales_for_rating: List[str]\n
Returns:
Type DescriptionDict[str, Any]
JSON Schema of the response to enforce.
Source code insrc/distilabel/steps/tasks/ultrafeedback.py
@override\ndef get_structured_output(self) -> Dict[str, Any]:\n \"\"\"Creates the json schema to be passed to the LLM, to enforce generating\n a dictionary with the output which can be directly parsed as a python dictionary.\n\n The schema corresponds to the following:\n\n ```python\n from pydantic import BaseModel\n from typing import List\n\n class SchemaUltraFeedback(BaseModel):\n ratings: List[int]\n rationales: List[str]\n\n class SchemaUltraFeedbackWithType(BaseModel):\n types: List[Optional[int]]\n ratings: List[int]\n rationales: List[str]\n rationales_for_rating: List[str]\n ```\n\n Returns:\n JSON Schema of the response to enforce.\n \"\"\"\n if self.aspect in [\n \"honesty\",\n \"instruction-following\",\n \"overall-rating\",\n ]:\n return {\n \"properties\": {\n \"ratings\": {\n \"items\": {\"type\": \"integer\"},\n \"title\": \"Ratings\",\n \"type\": \"array\",\n },\n \"rationales\": {\n \"items\": {\"type\": \"string\"},\n \"title\": \"Rationales\",\n \"type\": \"array\",\n },\n },\n \"required\": [\"ratings\", \"rationales\"],\n \"title\": \"SchemaUltraFeedback\",\n \"type\": \"object\",\n }\n return {\n \"properties\": {\n \"types\": {\n \"items\": {\"anyOf\": [{\"type\": \"integer\"}, {\"type\": \"null\"}]},\n \"title\": \"Types\",\n \"type\": \"array\",\n },\n \"ratings\": {\n \"items\": {\"type\": \"integer\"},\n \"title\": \"Ratings\",\n \"type\": \"array\",\n },\n \"rationales\": {\n \"items\": {\"type\": \"string\"},\n \"title\": \"Rationales\",\n \"type\": \"array\",\n },\n \"rationales_for_rating\": {\n \"items\": {\"type\": \"string\"},\n \"title\": \"Rationales For Rating\",\n \"type\": \"array\",\n },\n },\n \"required\": [\"types\", \"ratings\", \"rationales\", \"rationales_for_rating\"],\n \"title\": \"SchemaUltraFeedbackWithType\",\n \"type\": \"object\",\n }\n
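As a quick sanity check (a sketch assuming pydantic v2, not part of the original docs), the hand-written schema for the rating-only aspects matches what pydantic would generate for the equivalent model:\n\n```python\nfrom typing import List\n\nfrom pydantic import BaseModel\n\nclass SchemaUltraFeedback(BaseModel):\n    ratings: List[int]\n    rationales: List[str]\n\nschema = SchemaUltraFeedback.model_json_schema()\n# Same required keys and item types as the dictionary returned by get_structured_output.\nassert schema[\"required\"] == [\"ratings\", \"rationales\"]\nassert schema[\"properties\"][\"ratings\"][\"items\"] == {\"type\": \"integer\"}\nassert schema[\"properties\"][\"rationales\"][\"items\"] == {\"type\": \"string\"}\n```\n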
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.UltraFeedback._format_structured_output","title":"_format_structured_output(output, input)
","text":"Parses the structured response, which should correspond to a dictionary with either positive
, or positive
and negative
keys.
Parameters:
Name Type Description Defaultoutput
str
The output from the LLM
.
Returns:
Type DescriptionDict[str, Any]
Formatted output.
Source code insrc/distilabel/steps/tasks/ultrafeedback.py
def _format_structured_output(\n self, output: str, input: Dict[str, Any]\n) -> Dict[str, Any]:\n \"\"\"Parses the structured response, which should correspond to a dictionary\n with either `positive`, or `positive` and `negative` keys.\n\n Args:\n output: The output from the `LLM`.\n\n Returns:\n Formatted output.\n \"\"\"\n try:\n return orjson.loads(output)\n except orjson.JSONDecodeError:\n if self.aspect in [\n \"honesty\",\n \"instruction-following\",\n \"overall-rating\",\n ]:\n return {\n \"ratings\": [None] * len(input[\"generations\"]),\n \"rationales\": [None] * len(input[\"generations\"]),\n }\n return {\n \"ratings\": [None] * len(input[\"generations\"]),\n \"rationales\": [None] * len(input[\"generations\"]),\n \"types\": [None] * len(input[\"generations\"]),\n \"rationales-for-ratings\": [None] * len(input[\"generations\"]),\n }\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.URIAL","title":"URIAL
","text":" Bases: Task
Generates a response using a non-instruct fine-tuned model.
URIAL
is a pre-defined task that generates a response using a non-instruct fine-tuned model. This task is used to generate a response based on the conversation provided as input.
str
, optional): The instruction to generate a response from.List[Dict[str, str]]
, optional): The conversation to generate a response from (the last message must be from the user).str
): The generated response.str
): The name of the model used to generate the response.Examples:
Generate text from an instruction:
from distilabel.models import vLLM\nfrom distilabel.steps.tasks import URIAL\n\nstep = URIAL(\n    llm=vLLM(\n        model=\"meta-llama/Meta-Llama-3.1-8B\",\n        generation_kwargs={\"temperature\": 0.7},\n    ),\n)\n\nstep.load()\n\nresults = next(\n    step.process(inputs=[{\"instruction\": \"What's the most common type of cloud?\"}])\n)\n# [\n#     {\n#         'instruction': \"What's the most common type of cloud?\",\n#         'generation': 'Clouds are classified into three main types, high, middle, and low. The most common type of cloud is the middle cloud.',\n#         'distilabel_metadata': {...},\n#         'model_name': 'meta-llama/Meta-Llama-3.1-8B'\n#     }\n# ]\n
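A response can also be generated from a multi-turn conversation instead of a single instruction (a short sketch based on the conversation input column described above; the dialogue content is only illustrative and the last message must come from the user):\n\n```python\n# Reusing the step defined above, pass a \"conversation\" instead of an \"instruction\".\nresults = next(\n    step.process(\n        inputs=[\n            {\n                \"conversation\": [\n                    {\"role\": \"user\", \"content\": \"What's the most common type of cloud?\"},\n                    {\"role\": \"assistant\", \"content\": \"Stratus clouds are among the most common.\"},\n                    {\"role\": \"user\", \"content\": \"And at high altitudes?\"},\n                ]\n            }\n        ]\n    )\n)\n```\n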
Source code in src/distilabel/steps/tasks/urial.py
class URIAL(Task):\n \"\"\"Generates a response using a non-instruct fine-tuned model.\n\n `URIAL` is a pre-defined task that generates a response using a non-instruct fine-tuned\n model. This task is used to generate a response based on the conversation provided as\n input.\n\n Input columns:\n - instruction (`str`, optional): The instruction to generate a response from.\n - conversation (`List[Dict[str, str]]`, optional): The conversation to generate\n a response from (the last message must be from the user).\n\n Output columns:\n - generation (`str`): The generated response.\n - model_name (`str`): The name of the model used to generate the response.\n\n Categories:\n - text-generation\n\n References:\n - [The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning](https://arxiv.org/abs/2312.01552)\n\n Examples:\n Generate text from an instruction:\n\n ```python\n from distilabel.models import vLLM\n from distilabel.steps.tasks import URIAL\n\n step = URIAL(\n llm=vLLM(\n model=\"meta-llama/Meta-Llama-3.1-8B\",\n generation_kwargs={\"temperature\": 0.7},\n ),\n )\n\n step.load()\n\n results = next(\n step.process(inputs=[{\"instruction\": \"What's the most most common type of cloud?\"}])\n )\n # [\n # {\n # 'instruction': \"What's the most most common type of cloud?\",\n # 'generation': 'Clouds are classified into three main types, high, middle, and low. The most common type of cloud is the middle cloud.',\n # 'distilabel_metadata': {...},\n # 'model_name': 'meta-llama/Meta-Llama-3.1-8B'\n # }\n # ]\n ```\n \"\"\"\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template for the given `aspect`.\"\"\"\n super().load()\n\n _path = str(\n importlib_resources.files(\"distilabel\")\n / \"steps\"\n / \"tasks\"\n / \"templates\"\n / \"urial.jinja2\"\n )\n\n self._template = Template(open(_path).read())\n\n @property\n def inputs(self) -> \"StepColumns\":\n return {\"instruction\": False, \"conversation\": False}\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n messages = (\n [{\"role\": \"user\", \"content\": input[\"instruction\"]}]\n if \"instruction\" in input\n else input[\"conversation\"]\n )\n\n if messages[-1][\"role\"] != \"user\":\n raise ValueError(\"The last message must be from the user.\")\n\n return [{\"role\": \"user\", \"content\": self._template.render(messages=messages)}]\n\n @property\n def outputs(self) -> \"StepColumns\":\n return [\"generation\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n ) -> Dict[str, Any]:\n if output is None:\n return {\"generation\": None}\n\n response = output.split(\"\\n\\n# User\")[0]\n if response.startswith(\"\\n\\n\"):\n response = response[2:]\n response = response.strip()\n\n return {\"generation\": response}\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.URIAL.load","title":"load()
","text":"Loads the Jinja2 template for the given aspect
.
src/distilabel/steps/tasks/urial.py
def load(self) -> None:\n    \"\"\"Loads the Jinja2 template used to build the URIAL prompt.\"\"\"\n    super().load()\n\n    _path = str(\n        importlib_resources.files(\"distilabel\")\n        / \"steps\"\n        / \"tasks\"\n        / \"templates\"\n        / \"urial.jinja2\"\n    )\n\n    self._template = Template(open(_path).read())\n
"},{"location":"api/task/task_gallery/#distilabel.steps.tasks.task","title":"task(inputs=None, outputs=None)
","text":"Creates a Task
from a formatting output function.
Parameters:
Name Type Description Defaultinputs
Union[StepColumns, None]
a list containing the name of the inputs columns/keys or a dictionary where the keys are the columns and the values are booleans indicating whether the column is required or not, that are required by the step. If not provided the default will be an empty list []
and it will be assumed that the step doesn't need any specific columns. Defaults to None
.
None
outputs
Union[StepColumns, None]
a list containing the name of the outputs columns/keys or a dictionary where the keys are the columns and the values are booleans indicating whether the column will be generated or not. If not provided the default will be an empty list []
and it will be assumed that the step doesn't need any specific columns. Defaults to None
.
None
Source code in src/distilabel/steps/tasks/decorator.py
def task(\n inputs: Union[\"StepColumns\", None] = None,\n outputs: Union[\"StepColumns\", None] = None,\n) -> Callable[..., Type[\"Task\"]]:\n \"\"\"Creates a `Task` from a formatting output function.\n\n Args:\n inputs: a list containing the name of the inputs columns/keys or a dictionary\n where the keys are the columns and the values are booleans indicating whether\n the column is required or not, that are required by the step. If not provided\n the default will be an empty list `[]` and it will be assumed that the step\n doesn't need any specific columns. Defaults to `None`.\n outputs: a list containing the name of the outputs columns/keys or a dictionary\n where the keys are the columns and the values are booleans indicating whether\n the column will be generated or not. If not provided the default will be an\n empty list `[]` and it will be assumed that the step doesn't need any specific\n columns. Defaults to `None`.\n \"\"\"\n\n inputs = inputs or []\n outputs = outputs or []\n\n def decorator(func: TaskFormattingOutputFunc) -> Type[\"Task\"]:\n doc = inspect.getdoc(func)\n if doc is None:\n raise DistilabelUserError(\n \"When using the `task` decorator, including a docstring in the formatting\"\n \" function is mandatory. The docstring must follow the format described\"\n \" in the documentation.\",\n page=\"\",\n )\n\n system_prompt, user_message_template = _parse_docstring(doc)\n _validate_templates(inputs, system_prompt, user_message_template)\n\n def inputs_property(self) -> \"StepColumns\":\n return inputs\n\n def outputs_property(self) -> \"StepColumns\":\n return outputs\n\n def format_input(self, input: Dict[str, Any]) -> \"FormattedInput\":\n return [\n {\"role\": \"system\", \"content\": system_prompt.format(**input)},\n {\"role\": \"user\", \"content\": user_message_template.format(**input)},\n ]\n\n def format_output(\n self, output: Union[str, None], input: Union[Dict[str, Any], None] = None\n ) -> Dict[str, Any]:\n return func(output, input)\n\n return type(\n func.__name__,\n (Task,),\n {\n \"inputs\": property(inputs_property),\n \"outputs\": property(outputs_property),\n \"__module__\": func.__module__,\n \"format_input\": format_input,\n \"format_output\": format_output,\n },\n )\n\n return decorator\n
"},{"location":"api/task/typing/","title":"Task Typing","text":""},{"location":"api/task/typing/#distilabel.steps.tasks.typing","title":"typing
","text":""},{"location":"api/task/typing/#distilabel.steps.tasks.typing.ChatType","title":"ChatType = List[ChatItem]
module-attribute
","text":"ChatType is a type alias for a list
of dict
s following the OpenAI conversational format.
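For reference, a value of this type is simply a list of role/content dictionaries (a minimal illustration; the message contents are made up):\n\n```python\nfrom distilabel.steps.tasks.typing import ChatType\n\nchat: ChatType = [\n    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n    {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n]\n```\n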
StructuredOutputType = Union[OutlinesStructuredOutputType, InstructorStructuredOutputType]
module-attribute
","text":"StructuredOutputType is an alias for the union of OutlinesStructuredOutputType
and InstructorStructuredOutputType
.
StandardInput = ChatType
module-attribute
","text":"StandardInput is an alias for ChatType that defines the default / standard input produced by format_input
.
StructuredInput = Tuple[StandardInput, Union[StructuredOutputType, None]]
module-attribute
","text":"StructuredInput defines a type produced by format_input
when using either StructuredGeneration
or a subclass of it.
FormattedInput = Union[StandardInput, StructuredInput, ChatType]
module-attribute
","text":"FormattedInput is an alias for the union of StandardInput
and StructuredInput
as generated by format_input
and expected by the LLM
s, as well as ConversationType
for the vision language models.
ImageUrl
","text":" Bases: TypedDict
src/distilabel/steps/tasks/typing.py
class ImageUrl(TypedDict):\n url: Required[str]\n \"\"\"Either a URL of the image or the base64 encoded image data.\"\"\"\n
"},{"location":"api/task/typing/#distilabel.steps.tasks.typing.ImageUrl.url","title":"url
instance-attribute
","text":"Either a URL of the image or the base64 encoded image data.
"},{"location":"api/task/typing/#distilabel.steps.tasks.typing.ImageContent","title":"ImageContent
","text":" Bases: TypedDict
Type alias for the user's message in a conversation that can include text or an image. It's the standard type for vision language models: https://platform.openai.com/docs/guides/vision
Source code insrc/distilabel/steps/tasks/typing.py
class ImageContent(TypedDict, total=False):\n \"\"\"Type alias for the user's message in a conversation that can include text or an image.\n It's the standard type for vision language models:\n https://platform.openai.com/docs/guides/vision\n \"\"\"\n\n type: Required[Literal[\"image_url\"]]\n image_url: Required[ImageUrl]\n
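Putting the two typed dicts together, a user turn that mixes text and an image can be built as follows (an illustrative sketch following the OpenAI vision format referenced above; the URL is a placeholder):\n\n```python\nfrom distilabel.steps.tasks.typing import ImageContent, ImageUrl\n\nimage_part: ImageUrl = {\"url\": \"https://example.com/image.jpg\"}  # placeholder URL\nimage_content: ImageContent = {\"type\": \"image_url\", \"image_url\": image_part}\n\n# A full user message combines a text part with the image part.\nmessage = {\n    \"role\": \"user\",\n    \"content\": [\n        {\"type\": \"text\", \"text\": \"What's in this image?\"},\n        image_content,\n    ],\n}\n```\n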
"},{"location":"api/task/typing/#distilabel.steps.tasks.typing.OutlinesStructuredOutputType","title":"OutlinesStructuredOutputType
","text":" Bases: TypedDict
TypedDict to represent the structured output configuration from outlines
.
src/distilabel/steps/tasks/typing.py
class OutlinesStructuredOutputType(TypedDict, total=False):\n \"\"\"TypedDict to represent the structured output configuration from `outlines`.\"\"\"\n\n format: Literal[\"json\", \"regex\"]\n \"\"\"One of \"json\" or \"regex\".\"\"\"\n schema: Union[str, Type[BaseModel], Dict[str, Any]]\n \"\"\"The schema to use for the structured output. If \"json\", it\n can be a pydantic.BaseModel class, or the schema as a string,\n as obtained from `model_to_schema(BaseModel)`, if \"regex\", it\n should be a regex pattern as a string.\n \"\"\"\n whitespace_pattern: Optional[Union[str, List[str]]]\n \"\"\"If \"json\" corresponds to a string or a list of\n strings with a pattern (doesn't impact string literals).\n For example, to allow only a single space or newline with\n `whitespace_pattern=r\"[\\n ]?\"`\n \"\"\"\n
"},{"location":"api/task/typing/#distilabel.steps.tasks.typing.OutlinesStructuredOutputType.format","title":"format
instance-attribute
","text":"One of \"json\" or \"regex\".
"},{"location":"api/task/typing/#distilabel.steps.tasks.typing.OutlinesStructuredOutputType.schema","title":"schema
instance-attribute
","text":"The schema to use for the structured output. If \"json\", it can be a pydantic.BaseModel class, or the schema as a string, as obtained from model_to_schema(BaseModel)
, if \"regex\", it should be a regex pattern as a string.
whitespace_pattern
instance-attribute
","text":"If \"json\" corresponds to a string or a list of strings with a pattern (doesn't impact string literals). For example, to allow only a single space or newline with whitespace_pattern=r\"[ ]?\"
InstructorStructuredOutputType
","text":" Bases: TypedDict
TypedDict to represent the structured output configuration from instructor
.
src/distilabel/steps/tasks/typing.py
class InstructorStructuredOutputType(TypedDict, total=False):\n \"\"\"TypedDict to represent the structured output configuration from `instructor`.\"\"\"\n\n format: Optional[Literal[\"json\"]]\n \"\"\"One of \"json\".\"\"\"\n schema: Union[Type[BaseModel], Dict[str, Any]]\n \"\"\"The schema to use for the structured output, a `pydantic.BaseModel` class. \"\"\"\n mode: Optional[str]\n \"\"\"Generation mode. Take a look at `instructor.Mode` for more information, if not informed it will\n be determined automatically. \"\"\"\n max_retries: int\n \"\"\"Number of times to reask the model in case of error, if not set will default to the model's default. \"\"\"\n
"},{"location":"api/task/typing/#distilabel.steps.tasks.typing.InstructorStructuredOutputType.format","title":"format
instance-attribute
","text":"One of \"json\".
"},{"location":"api/task/typing/#distilabel.steps.tasks.typing.InstructorStructuredOutputType.schema","title":"schema
instance-attribute
","text":"The schema to use for the structured output, a pydantic.BaseModel
class.
mode
instance-attribute
","text":"Generation mode. Take a look at instructor.Mode
for more information, if not informed it will be determined automatically.
max_retries
instance-attribute
","text":"Number of times to reask the model in case of error, if not set will default to the model's default.
"},{"location":"sections/community/","title":"Community","text":"We are an open-source community-driven project not only focused on building a great product but also on building a great community, where you can get support, share your experiences, and contribute to the project! We would love to hear from you and help you get started with distilabel.
Discord
In our Discord channels (#argilla-general and #argilla-help), you can get direct support from the community.
Discord \u2197
Community Meetup
We host bi-weekly community meetups where you can listen in or present your work.
Community Meetup \u2197
Changelog
The changelog is where you can find the latest updates and changes to the distilabel project.
Changelog \u2197
Roadmap
We love to discuss our plans with the community. Feel encouraged to participate in our roadmap discussions.
Roadmap \u2197
If you build something cool with distilabel
consider adding one of these badges to your dataset or model card.
[<img src=\"https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png\" alt=\"Built with Distilabel\" width=\"200\" height=\"32\"/>](https://github.com/argilla-io/distilabel)\n
[<img src=\"https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png\" alt=\"Built with Distilabel\" width=\"200\" height=\"32\"/>](https://github.com/argilla-io/distilabel)\n
"},{"location":"sections/community/#contribute","title":"Contribute","text":"To directly contribute with distilabel
, check our good first issues or open a new one.
Thank you for investing your time in contributing to the project! Any contribution you make will be reflected in the most recent version of distilabel \ud83e\udd29.
New to contributing in general?If you're a new contributor, read the README to get an overview of the project. In addition, here are some resources to help you get started with open-source contributions:
Discord is a handy tool for more casual conversations and to answer day-to-day questions. As part of Hugging Face, we have set up some distilabel channels on the server. Click here to join the Hugging Face Discord community effortlessly.
When part of the Hugging Face Discord, you can select \"Channels & roles\" and select \"Argilla\" along with any of the other groups that are interesting to you. \"Argilla\" will cover anything about argilla and distilabel. You can join the following channels:
So now there is only one thing left to do: introduce yourself and talk to the community. You'll always be welcome! \ud83e\udd17\ud83d\udc4b
"},{"location":"sections/community/contributor/#contributor-workflow-with-git-and-github","title":"Contributor Workflow with Git and GitHub","text":"If you're working with distilabel and suddenly a new idea comes to your mind or you find an issue that can be improved, it's time to actively participate and contribute to the project!
"},{"location":"sections/community/contributor/#report-an-issue","title":"Report an issue","text":"If you spot a problem, search if an issue already exists, you can use the Label
filter. If that is the case, participate in the conversation. If it does not exist, create an issue by clicking on New Issue
. This will show various templates; choose the one that best suits your issue. Once you choose one, you will need to fill it in following the guidelines. Try to be as clear as possible. In addition, you can assign yourself to the issue and add or choose the right labels. Finally, click on Submit new issue
.
After having reported the issue, you can start working on it. For that, you will need to create a fork of the project. To do that, click on the Fork
button. Now, fill in the information. Remember to uncheck the Copy develop branch only
if you are going to work in or from another branch (for instance, to fix documentation, the main
branch is used). Then, click on Create fork
.
You will be redirected to your fork. You can see that you are in your fork because the name of the repository will be your username/distilabel
, and it will indicate forked from argilla-io/distilabel
.
In order to make the required adjustments, clone the forked repository to your local machine. Choose the destination folder and run the following command:
git clone https://github.com/[your-github-username]/distilabel.git\ncd distilabel\n
To keep your fork\u2019s main/develop branch up to date with our repo, add it as an upstream remote branch.
git remote add upstream https://github.com/argilla-io/distilabel.git\n
"},{"location":"sections/community/contributor/#create-a-new-branch","title":"Create a new branch","text":"For each issue you're addressing, it's advisable to create a new branch. GitHub offers a straightforward method to streamline this process.
\u26a0\ufe0f Never work directly on the main
or develop
branch. Always create a new branch for your changes.
Navigate to your issue, and on the right column, select Create a branch
.
After the new window pops up, the branch will be named after the issue and include a prefix such as feature/, bug/, or docs/ to facilitate quick recognition of the issue type. In the Repository destination
, pick your fork ( [your-github-username]/distilabel), and then select Change branch source
to specify the source branch for creating the new one. Complete the process by clicking Create branch
.
\ud83e\udd14 Remember that the main
branch is only used to work with the documentation. For any other changes, use the develop
branch.
Now, locally, change to the new branch you just created.
git fetch origin\ngit checkout [branch-name]\n
"},{"location":"sections/community/contributor/#make-changes-and-push-them","title":"Make changes and push them","text":"Make the changes you want in your local repository, and test that everything works and you are following the guidelines.
Once you have finished, you can check the status of your repository and synchronize it with the upstream repo with the following commands:
# Check the status of your repository\ngit status\n\n# Synchronize with the upstream repo\ngit checkout [branch-name]\ngit rebase [default-branch]\n
If everything is right, we need to commit and push the changes to your fork. For that, run the following commands:
# Add the changes to the staging area\ngit add filename\n\n# Commit the changes by writing a proper message\ngit commit -m \"commit-message\"\n\n# Push the changes to your fork\ngit push origin [branch-name]\n
When pushing, you will be asked to enter your GitHub login credentials. Once the push is complete, all local commits will be on your GitHub repository.
"},{"location":"sections/community/contributor/#create-a-pull-request","title":"Create a pull request","text":"Come back to GitHub, navigate to the original repository where you created your fork, and click on Compare & pull request
.
First, click on compare across forks
and select the right repositories and branches.
In the base repository, keep in mind that you should select either main
or develop
based on the modifications made. In the head repository, indicate your forked repository and the branch corresponding to the issue.
Then, fill in the pull request template. You should add a prefix to the PR name, as we did with the branch above. If you are working on a new feature, you can name your PR as feat: TITLE
. If your PR consists of a solution for a bug, you can name your PR as bug: TITLE
. And, if your work is for improving the documentation, you can name your PR as docs: TITLE
.
In addition, on the right side, you can select a reviewer (for instance, if you discussed the issue with a member of the team) and assign the pull request to yourself. It is highly advisable to add labels to the PR as well. You can do this in the Labels section on the right of the screen. For instance, if you are addressing a bug, add the bug
label, or if the PR is related to the documentation, add the documentation
label. This way, PRs can be easily filtered.
Finally, fill in the template carefully and follow the guidelines. Remember to link the original issue and enable the checkbox to allow maintainer edits so the branch can be updated for a merge. Then, click on Create pull request
.
For the PR body, ensure you give a description of what the PR contains, and add examples if possible (and if they apply to the contribution) to help with the review process. You can take a look at #PR 974 or #PR 983 for examples of typical PRs.
"},{"location":"sections/community/contributor/#review-your-pull-request","title":"Review your pull request","text":"Once you submit your PR, a team member will review your proposal. We may ask questions, request additional information, or ask for changes to be made before a PR can be merged, either using suggested changes or pull request comments.
You can apply the changes directly through the UI (check the files changed and click on the right-corner three dots; see image below) or from your fork, and then commit them to your branch. The PR will be updated automatically, and the suggestions will appear as outdated
.
If you run into any merge issues, check out this git tutorial to help you resolve merge conflicts and other issues.
"},{"location":"sections/community/contributor/#your-pr-is-merged","title":"Your PR is merged!","text":"Congratulations \ud83c\udf89\ud83c\udf8a We thank you \ud83e\udd29
Once your PR is merged, your contributions will be publicly visible on the distilabel GitHub.
Additionally, we will include your changes in the next release based on our development branch.
"},{"location":"sections/community/contributor/#additional-resources","title":"Additional resources","text":"Here are some helpful resources for your reference.
Thank you for investing your time in contributing to the project!
If you don't have the repository locally, and need any help, go to the contributor guide and read the contributor workflow with Git and GitHub first.
"},{"location":"sections/community/developer_documentation/#set-up-the-python-environment","title":"Set up the Python environment","text":"To work on the distilabel
, you must install the package on your system.
Tip
This guide will use uv
, but pip
and venv
can be used as well; the process is quite similar with either option.
From the root of the cloned Distilabel repository, you should move to the distilabel folder in your terminal.
cd distilabel\n
"},{"location":"sections/community/developer_documentation/#create-a-virtual-environment","title":"Create a virtual environment","text":"The first step will be creating a virtual environment to keep our dependencies isolated. Here we are choosing python 3.11
(uv venv documentation), and then activate it:
uv venv .venv --python 3.11\nsource .venv/bin/activate\n
"},{"location":"sections/community/developer_documentation/#install-the-project","title":"Install the project","text":"Installing from local (we are using uv pip
):
uv pip install -e .\n
The package defines named extra dependencies; depending on the part you are working on, you may want to install some of them (take a look at pyproject.toml
in the repo to see all the available extras):
uv pip install -e \".[vllm,outlines]\"\n
"},{"location":"sections/community/developer_documentation/#linting-and-formatting","title":"Linting and formatting","text":"To maintain a consistent code format, install the pre-commit hooks to run before each commit automatically (we rely heavily on ruff
):
uv pip install -e \".[dev]\"\npre-commit install\n
"},{"location":"sections/community/developer_documentation/#running-tests","title":"Running tests","text":"All the changes you add to the codebase should come with tests, either unit
or integration
tests, depending on the type of change, which are placed under tests/unit
and tests/integration
respectively.
Start by installing the tests dependencies:
uv pip install \".[tests]\"\n
Running the whole test suite may take some time, and you will need all the dependencies installed, so just run your specific tests; the whole test suite will be run for you in the CI:
# Run specific tests\npytest tests/unit/steps/generators/test_data.py\n
"},{"location":"sections/community/developer_documentation/#set-up-the-documentation","title":"Set up the documentation","text":"To contribute to the documentation and generate it locally, ensure you have installed the development dependencies:
uv pip install -e \".[docs]\"\n
And run the following command to create the development server with mkdocs
:
mkdocs serve\n
"},{"location":"sections/community/developer_documentation/#documentation-guidelines","title":"Documentation guidelines","text":"As mentioned, we use mkdocs to build the documentation. You can write the documentation in markdown
format, and it will automatically be converted to HTML. In addition, you can include elements such as tables, tabs, images, and others, as shown in this guide. We recommend following these guidelines:
Use clear and concise language: Ensure the documentation is easy to understand for all users by using straightforward language and including meaningful examples. Images are not easy to maintain, so use them only when necessary and place them in the appropriate folder within the docs/assets/images directory.
Verify code snippets: Double-check that all code snippets are correct and runnable.
Review spelling and grammar: Check the spelling and grammar of the documentation.
Update the table of contents: If you add a new page, include it in the relevant index.md or the mkdocs.yml file.
The components gallery section of the documentation is automatically generated by a custom plugin that runs when mkdocs serve
is called. This gallery helps us visualize each step, along with examples of how to use it.
Note
Changes done to the docstrings of Steps/Tasks/LLMs
won't appear in the components gallery automatically; you will have to stop the mkdocs
server and run it again to see the changes. Everything else is reloaded automatically.
mlx-lm
integration \ud83d\udc4d 3 \ud83d\udcac 1 2 1041 - [FEATURE] Add Offline batch generation for open models with EXXA API \ud83d\udc4d 2 \ud83d\udcac 1 3 737 - [FEATURE] Allow FormatTextGenerationSFT
to include tools/function calls in the formatted messages. \ud83d\udc4d 2 \ud83d\udcac 0 4 1001 - [FEATURE] sglang
integration \ud83d\udc4d 1 \ud83d\udcac 1 5 797 - [FEATURE] synthetic data generation for predictive NLP tasks \ud83d\udc4d 1 \ud83d\udcac 1 6 914 - [FEATURE] Use Step.resources
to set tensor_parallel_size
and pipeline_parallel_size
in vLLM
\ud83d\udc4d 1 \ud83d\udcac 0 7 588 - [FEATURE] Single request caching \ud83d\udc4d 1 \ud83d\udcac 0 8 953 - [EXAMPLE] Add CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation
example \ud83d\udc4d 0 \ud83d\udcac 6 9 972 - [BUG] Input data size != output data size when task batch size < batch size of predecessor \ud83d\udc4d 0 \ud83d\udcac 4 10 859 - [FEATURE] Update PushToHub
to stream data to the Hub \ud83d\udc4d 0 \ud83d\udcac 4 Rank Issue Author 1 \ud83d\udfe2 1091 - [BUG] distiset.push_to_hub() seems to have cache pathing issue by MoritzLaurer 2 \ud83d\udfe2 1070 - [BUG] Pipeline serialization/caching issue when including RoutingBatchFunction
by liamcripwell 3 \ud83d\udfe2 1068 - [BUG] GenerateSentencePair(...) always returns None positive and negative pairs by caesar-one 4 \ud83d\udfe2 1064 - [DOCS] Update basic guides of steps and tasks by plaguss 5 \ud83d\udfe2 1058 - [FEATURE] Implement a rate limiter for API calls by plaguss 6 \ud83d\udfe3 1056 - [DOCS] The example on how to use a Step no longer works by wwymak 7 \ud83d\udfe3 1049 - [BUG] vLLM Task not utilizing multiple GPUs in parallel when replicas > 1 by adamlin120 8 \ud83d\udfe3 1048 - [BUG] OepnAI JSON format by tinyrolls 9 \ud83d\udfe2 1047 - Failed to load all the steps. Could not run pipeline. by yuqie 10 \ud83d\udfe2 1046 - [FEATURE] Compute the input/output tokens of a dataset by plaguss Rank Issue Milestone 1 \ud83d\udfe2 579 - [FEATURE] Sequential execution for local pipeline 1.4.0 2 \ud83d\udfe2 771 - [FEATURE] Allow passing path to YAML file containing pipeline runtime parameters in distilabel run
1.4.0 3 \ud83d\udfe2 773 - [DOCS] Include section/guide describing pipeline patterns 1.4.0 4 \ud83d\udfe2 802 - [FEATURE] Add defaults to Steps and Tasks so they can be more easily connected 1.4.0 5 \ud83d\udfe2 880 - [FEATURE] Add exclude_from_signature
attribute 1.4.0 6 \ud83d\udfe2 889 - [FEATURE] Replace extra_sampling_params
for normal arguments in vLLM
1.4.0 7 \ud83d\udfe2 942 - [BUG] make_generator_step
can fail when setting the _dataset_info
internally 1.4.0 8 \ud83d\udfe2 662 - [FEATURE] Allow passing self
to steps created with step
decorator 1.4.0 9 \ud83d\udfe2 738 - [FEATURE] Update LLM.generate
interface to allow returning arbitrary/extra stuff related to the generation 1.5.0 10 \ud83d\udfe2 749 - [IMPLEMENTATION] Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models 1.5.0 Last update: 2025-01-09
"},{"location":"sections/getting_started/faq/","title":"Frequent Asked Questions (FAQ)","text":"How can I rename the columns in a batch?Every Step
has both input_mappings
and output_mappings
attributes that can be used to rename the columns in each batch.
However, input_mappings
only applies the rename within that specific Step: if you have a batch with the column A and you want the step to receive it as B, you should use input_mappings={\"A\": \"B\"}, but the next step in the pipeline will still have the column A instead of B.
While output_mappings
will indeed apply the rename, meaning that if the Step
produces the column A
and you want to rename it to B
, you should use output_mappings={\"A\": \"B\"}
, and that will be applied to the next Step
in the pipeline.
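As a minimal sketch of the second case (assuming the loaded dataset provides a prompt column and the next task expects instruction), the rename can be applied with output_mappings:
from distilabel.steps import LoadDataFromHub\n\n# Rename the produced \"prompt\" column to \"instruction\" so the next step in the pipeline receives it under that name.\nload_dataset = LoadDataFromHub(\n    output_mappings={\"prompt\": \"instruction\"},\n)\n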
No, those will be masked out using pydantic.SecretStr
, meaning that those won't be exposed when sharing the pipeline.
This also means that if you want to re-run your own pipeline and the API keys have not been provided via environment variable but either via an attribute or runtime parameter, you will need to provide them again.
Does it work for Windows? Yes, but you may need to set the multiprocessing
context in advance to ensure that the spawn
method is used since the default method fork
is not available on Windows.
import multiprocessing as mp\n\nmp.set_start_method(\"spawn\")\n
Will the custom Steps / Tasks / LLMs be serialized too? No, at the moment, only the references to the classes within the distilabel
library will be serialized, meaning that if you define a custom class used within the pipeline, the serialization won't break, but the deserialization will fail since the class won't be available unless it is used from the same file.
Pipeline.run
fails? Do I lose all the data? No, indeed, we're using a cache mechanism to store all the intermediate results on disk, so if a Step
fails, the pipeline can be re-run from that point without losing the data, as long as nothing is changed in the Pipeline
.
All the data will be stored in .cache/distilabel
, but the only data that will persist at the end of the Pipeline.run
execution is the one from the leaf step/s, so bear that in mind.
For more information on the caching mechanism in distilabel
, you can check the Learn - Advanced - Caching section.
Also, note that when running a Step
or a Task
standalone, the cache mechanism won't be used, so if you want to use that, you should use the Pipeline
context manager.
LLM
across several tasks without having to load it several times? You can serve the LLM using a solution like TGI or vLLM, and then connect to it using an AsyncLLM
client like InferenceEndpointsLLM
or OpenAILLM
. Please refer to Serving LLMs guide for more information.
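As a minimal sketch (assuming an OpenAI-compatible server such as vLLM or TGI is already running locally and reachable at http://localhost:8000/v1, which is just an illustrative assumption), the shared client could look like this:
from distilabel.models import OpenAILLM\n\n# Hypothetical local endpoint exposing an OpenAI-compatible API; point base_url to your own server.\nllm = OpenAILLM(\n    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n    base_url=\"http://localhost:8000/v1\",\n)\n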
distilabel
be used with OpenAI Batch API? Yes, distilabel
is integrated with OpenAI Batch API via OpenAILLM. Check LLMs - Offline Batch Generation for a small example on how to use it and Advanced - Offline Batch Generation for a more detailed guide.
When running a task using the InferenceEndpointsLLM with Free Serverless Endpoints, you may run into errors such as Model is overloaded
if you leave the batch size at the default (set at 50). To fix the issue, lower the value or, even better, set input_batch_size=1
in your task. It may take longer to finish, but please remember this is a free service.
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps import TextGeneration\n\nTextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=1\n)\n
"},{"location":"sections/getting_started/installation/","title":"Installation","text":"You will need to have at least Python 3.9 or higher, up to Python 3.12, since support for the latter is still a work in progress.
To install the latest release of the package from PyPI you can use the following command:
pip install distilabel --upgrade\n
Alternatively, you may also want to install it from source, i.e. the latest unreleased version; in that case, you can use the following command:
pip install \"distilabel @ git+https://github.com/argilla-io/distilabel.git@develop\" --upgrade\n
Note
We are installing from develop
since that's the branch we use to collect all the features, bug fixes, and improvements that will be part of the next release. If you want to install from a specific branch, you can replace develop
with the branch name.
Additionally, as part of distilabel
some extra dependencies are available, mainly to add support for some of the LLM integrations we support. Here's a list of the available extras:
anthropic
: for using models available in Anthropic API via the AnthropicLLM
integration.
argilla
: for exporting the generated datasets to Argilla.
cohere
: for using models available in Cohere via the CohereLLM
integration.
groq
: for using models available in Groq using groq
Python client via the GroqLLM
integration.
hf-inference-endpoints
: for using the Hugging Face Inference Endpoints via the InferenceEndpointsLLM
integration.
hf-transformers
: for using models available in transformers package via the TransformersLLM
integration.
litellm
: for using LiteLLM
to call any LLM using OpenAI format via the LiteLLM
integration.
llama-cpp
: for using llama-cpp-python Python bindings for llama.cpp
via the LlamaCppLLM
integration.
mistralai
: for using models available in Mistral AI API via the MistralAILLM
integration.
ollama
: for using Ollama and their available models via OllamaLLM
integration.
openai
: for using OpenAI API models via the OpenAILLM
integration, or the rest of the integrations based on OpenAI and relying on its client as AnyscaleLLM
, AzureOpenAILLM
, and TogetherLLM
.
vertexai
: for using Google Vertex AI proprietary models via the VertexAILLM
integration.
vllm
: for using vllm serving engine via the vLLM
integration.
sentence-transformers
: for generating sentence embeddings using sentence-transformers.
ray
: for scaling and distributing a pipeline with Ray.
faiss-cpu
and faiss-gpu
: for generating sentence embeddings using faiss.
minhash
: for using minhash for duplicate detection with datasketch and nltk.
text-clustering
: for using text clustering with UMAP and Scikit-learn.
outlines
: for using structured generation of LLMs with outlines.
instructor
: for using structured generation of LLMs with Instructor.
The mistralai
dependency requires Python 3.9 or higher, so if you're willing to use the distilabel.models.llms.MistralLLM
implementation, you will need to have Python 3.9 or higher.
In some cases like transformers
and vllm
, the installation of flash-attn
is recommended if you are using a GPU accelerator since it will speed up the inference process, but the installation needs to be done separately, as it's not included in the distilabel
dependencies.
pip install flash-attn --no-build-isolation\n
Also, if you are willing to use the llama-cpp-python
integration for running local LLMs, note that the installation process may get a bit trickier depending on which OS you are using, so we recommend reading through the Installation section in their docs.
Distilabel provides all the tools you need to build scalable and reliable pipelines for synthetic data generation and AI feedback. Pipelines are used to generate data, evaluate models, manipulate data, or any other general task. They are made up of different components: Steps, Tasks and LLMs, which are chained together in a directed acyclic graph (DAG).
Pipelines are designed to be scalable and reliable. They can be executed in a distributed manner, and they can be cached and recovered. This is useful when dealing with large datasets or when you want to ensure that your pipeline is reproducible.
Besides that, pipelines are designed to be modular and flexible. You can easily add new steps, tasks, or LLMs to your pipeline, and you can also easily modify or remove them. An example architecture of a pipeline to generate a dataset of preferences is the following:
"},{"location":"sections/getting_started/quickstart/#installation","title":"Installation","text":"To install the latest release with hf-inference-endpoints
extra of the package from PyPI you can use the following command:
pip install distilabel[hf-inference-endpoints] --upgrade\n
"},{"location":"sections/getting_started/quickstart/#use-a-generic-pipeline","title":"Use a generic pipeline","text":"To use a generic pipeline for an ML task, you can use the InstructionResponsePipeline
class. This class is a generic pipeline that can be used to generate data for supervised fine-tuning tasks. It uses the InferenceEndpointsLLM
class to generate data based on the input data and the model.
from distilabel.pipeline import InstructionResponsePipeline\n\npipeline = InstructionResponsePipeline()\ndataset = pipeline.run()\n
The InstructionResponsePipeline
class will use the InferenceEndpointsLLM
class with the model meta-llama/Meta-Llama-3.1-8B-Instruct
to generate data based on the system prompt. The output data will be a dataset with the columns instruction
and response
. The class uses a generic system prompt, but you can customize it by passing the system_prompt
parameter to the class.
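For example, a minimal sketch of that customization (the prompt text here is just an illustrative assumption):
from distilabel.pipeline import InstructionResponsePipeline\n\n# Hypothetical domain-specific system prompt; adjust it to the kind of data you want to generate.\npipeline = InstructionResponsePipeline(\n    system_prompt=\"You are an assistant that writes concise, factual answers about astronomy.\"\n)\ndataset = pipeline.run()\n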
Note
We're actively working on building more pipelines for different tasks. If you have any suggestions or requests, please let us know! We're currently working on pipelines for classification, Direct Preference Optimization, and Information Retrieval tasks.
"},{"location":"sections/getting_started/quickstart/#define-a-custom-pipeline","title":"Define a Custom pipeline","text":"In this guide we will walk you through the process of creating a simple pipeline that uses the InferenceEndpointsLLM
class to generate text. The Pipeline
will load a dataset that contains a column named prompt
from the Hugging Face Hub via the step LoadDataFromHub
and then use the InferenceEndpointsLLM
class to generate text based on the dataset using the TextGeneration
task.
You can check the available models in the Hugging Face Model Hub and filter by Inference status
.
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline( # (1)\n name=\"simple-text-generation-pipeline\",\n description=\"A simple text generation pipeline\",\n) as pipeline: # (2)\n load_dataset = LoadDataFromHub( # (3)\n output_mappings={\"prompt\": \"instruction\"},\n )\n\n text_generation = TextGeneration( # (4)\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n ), # (5)\n system_prompt=\"You are a creative AI Assistant writer.\",\n template=\"Follow the following instruction: {{ instruction }}\" # (6)\n )\n\n load_dataset >> text_generation # (7)\n\nif __name__ == \"__main__\":\n distiset = pipeline.run( # (8)\n parameters={\n load_dataset.name: {\n \"repo_id\": \"distilabel-internal-testing/instruction-dataset-mini\",\n \"split\": \"test\",\n },\n text_generation.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 512,\n }\n }\n },\n },\n )\n distiset.push_to_hub(repo_id=\"distilabel-example\") # (9)\n
We define a Pipeline
with the name simple-text-generation-pipeline
and a description A simple text generation pipeline
. Note that the name
is mandatory and will be used to calculate the cache
signature path, so changing the name will change the cache path and will be identified as a different pipeline.
We are using the Pipeline
context manager, meaning that every Step
subclass that is defined within the context manager will be added to the pipeline automatically.
We define a LoadDataFromHub
step named load_dataset
that will load a dataset from the Hugging Face Hub, as provided via runtime parameters in the pipeline.run
method below, but it can also be defined within the class instance via the arg repo_id=...
. This step will produce output batches with the rows from the dataset, and the column prompt
will be mapped to the instruction
field.
We define a TextGeneration
task named text_generation
that will generate text based on the instruction
field from the dataset. This task will use the InferenceEndpointsLLM
class with the model Meta-Llama-3.1-8B-Instruct
.
We define the InferenceEndpointsLLM
class with the model Meta-Llama-3.1-8B-Instruct
that will be used by the TextGeneration
task. In this case, since the InferenceEndpointsLLM
is used, we assume that the HF_TOKEN
environment variable is set.
Both system_prompt
and template
are optional fields. The template
must be provided as a string following the Jinja2 template format, and the fields that appear in it (\"instruction\" in this case, which corresponds to the default) must be listed in the columns
attribute. The component gallery for TextGeneration
has examples to get you started.
We connect the load_dataset
step to the text_generation
task using the rshift
operator, meaning that the output from the load_dataset
step will be used as input for the text_generation
task.
We run the pipeline with the parameters for the load_dataset
and text_generation
steps. The load_dataset
step will use the repository distilabel-internal-testing/instruction-dataset-mini
and the test
split, and the text_generation
task will use the generation_kwargs
with the temperature
set to 0.7
and the max_new_tokens
set to 512
.
Optionally, we can push the generated Distiset
to the Hugging Face Hub repository distilabel-example
. This will allow you to share the generated dataset with others and use it in other pipelines.
Welcome to the how-to guides section! Here you will find a collection of guides that will help you get started with Distilabel. We have divided the guides into two categories: basic and advanced. The basic guides will help you get started with the core concepts of Distilabel, while the advanced guides will help you explore more advanced features.
"},{"location":"sections/how_to_guides/#basic","title":"Basic","text":"Define Steps for your Pipeline
Steps are the building blocks of your pipeline. They can be used to generate data, evaluate models, manipulate data, or any other general task.
Define Steps
Define Tasks that rely on LLMs
Tasks are a specific type of step that rely on Language Models (LLMs) to generate data.
Define Tasks
Define LLMs as local or remote models
LLMs are the core of your tasks. They are used to integrate with local models or remote APIs.
Define LLMs
Execute Steps and Tasks in a Pipeline
Pipeline is where you put all your steps and tasks together to create a workflow.
Execute Pipeline
Using the Distiset dataset object
Distiset is a dataset object based on the datasets library that can be used to store and manipulate data.
Distiset
Export data to Argilla
Argilla is a platform that can be used to store, search, and apply feedback to datasets. Argilla
Using a file system to pass data of batches between steps
File system can be used to pass data between steps in a pipeline.
File System
Using CLI to explore and re-run existing Pipelines
CLI can be used to explore and re-run existing pipelines through the command line.
CLI
Cache and recover pipeline executions
Caching can be used to recover pipeline executions to avoid losing data and precious LLM calls.
Caching
Structured data generation
Structured data generation can be used to generate data with a specific structure like JSON, function calls, etc.
Structured Generation
Serving an LLM for sharing it between several tasks
Serve an LLM via TGI or vLLM to make requests and connect using a client like InferenceEndpointsLLM
or OpenAILLM
to avoid wasting resources.
Sharing an LLM across tasks
Impose requirements to your pipelines and steps
Add requirements to steps in a pipeline to ensure they are installed and avoid errors.
Pipeline requirements
Being able to export the generated synthetic datasets to Argilla is a core feature of distilabel
. We believe in the potential of synthetic data, but without removing the impact a human annotator or group of annotators can bring. The Argilla integration therefore makes it straightforward to push a dataset to Argilla while the Pipeline
is running, so that you can follow the generation process in Argilla's UI, as well as annotate the records on the fly. One can include a Step
within the Pipeline
to easily export the datasets to Argilla with a pre-defined configuration, suiting the annotation purposes.
Before using any of the steps described below, you should first have an Argilla instance up and running, so that you can successfully upload the data to Argilla. The easiest and most straightforward way to deploy Argilla is via the Argilla Template in Hugging Face Spaces, simply following the steps there, or via the following button:
"},{"location":"sections/how_to_guides/advanced/argilla/#text-generation","title":"Text Generation","text":"
For text generation scenarios, i.e. when the Pipeline
contains a single TextGeneration
step, we have designed the task TextGenerationToArgilla
, which will seamlessly push the generated data to Argilla, and allow the annotator to review the records.
The dataset will be pushed with the following configuration:
Fields: instruction
and generation
, both being fields of type argilla.TextField
, plus the automatically generated id
for the given instruction
to be able to search for other records with the same instruction
in the dataset. The field instruction
must always be a string, while the field generation
can either be a single string or a list of strings (useful when there are multiple parent nodes of type TextGeneration
); even though each record will always contain at most one instruction
-generation
pair.
Questions: quality
will be the only question for the annotators to answer, i.e., to annotate, and it will be an argilla.LabelQuestion
referring to the quality of the provided generation for the given instruction. It can be annotated as either \ud83d\udc4e (bad) or \ud83d\udc4d (good).
Note
The TextGenerationToArgilla
step will only work as is if the Pipeline
contains one or multiple TextGeneration
steps, or if the columns instruction
and generation
are available within the batch data. Otherwise, the variable input_mappings
will need to be set so that either both or one of instruction
and generation
are mapped to one of the existing columns in the batch data.
from distilabel.models import OpenAILLM\nfrom distilabel.steps import LoadDataFromDicts, TextGenerationToArgilla\nfrom distilabel.steps.tasks import TextGeneration\n\n\nwith Pipeline(name=\"my-pipeline\") as pipeline:\n load_dataset = LoadDataFromDicts(\n name=\"load_dataset\",\n data=[\n {\n \"instruction\": \"Write a short story about a dragon that saves a princess from a tower.\",\n },\n ],\n )\n\n text_generation = TextGeneration(\n name=\"text_generation\",\n llm=OpenAILLM(model=\"gpt-4\"),\n )\n\n to_argilla = TextGenerationToArgilla(\n dataset_name=\"my-dataset\",\n dataset_workspace=\"admin\",\n api_url=\"<ARGILLA_API_URL>\",\n api_key=\"<ARGILLA_API_KEY>\",\n )\n\n load_dataset >> text_generation >> to_argilla\n\npipeline.run()\n
"},{"location":"sections/how_to_guides/advanced/argilla/#preference","title":"Preference","text":"For preference scenarios, i.e. when the Pipeline
contains multiple TextGeneration
steps, we have designed the task PreferenceToArgilla
, which will seamlessly push the generated data to Argilla, and allow the annotator to review the records.
The dataset will be pushed with the following configuration:
Fields: instruction
and generations
, both being fields of type argilla.TextField
, plus the automatically generated id
for the given instruction
to be able to search for other records with the same instruction
in the dataset. The field instruction
must always be a string, while the field generations
must be a list of strings, containing the generated texts for the given instruction
so that at least there are two generations to compare. Other than that, the number of generation
fields within each record in Argilla will be defined by the value of the variable num_generations
to be provided in the PreferenceToArgilla
step.
Questions: rating
and rationale
will be the pairs of questions to be defined per each generation i.e. per each value within the range from 0 to num_generations
, and those will be of types argilla.RatingQuestion
and argilla.TextQuestion
, respectively. Note that only the first pair of questions will be mandatory, since only one generation is ensured to be within the batch data. Additionally, note that the provided ratings will range from 1 to 5, as Argilla only supports values above 0.
Note
The PreferenceToArgilla
step will only work if the Pipeline
contains multiple TextGeneration
steps, or if the columns instruction
and generations
are available within the batch data. Otherwise, the variable input_mappings
will need to be set so that either both or one of instruction
and generations
are mapped to one of the existing columns in the batch data.
Note
Additionally, if the Pipeline
contains an UltraFeedback
step, the ratings
and rationales
will also be available and be automatically injected as suggestions to the existing dataset.
from distilabel.models import OpenAILLM\nfrom distilabel.steps import LoadDataFromDicts, PreferenceToArgilla\nfrom distilabel.steps.tasks import TextGeneration\n\n\nwith Pipeline(name=\"my-pipeline\") as pipeline:\n load_dataset = LoadDataFromDicts(\n name=\"load_dataset\",\n data=[\n {\n \"instruction\": \"Write a short story about a dragon that saves a princess from a tower.\",\n },\n ],\n )\n\n text_generation = TextGeneration(\n name=\"text_generation\",\n llm=OpenAILLM(model=\"gpt-4\"),\n num_generations=4,\n group_generations=True,\n )\n\n to_argilla = PreferenceToArgilla(\n dataset_name=\"my-dataset\",\n dataset_workspace=\"admin\",\n api_url=\"<ARGILLA_API_URL>\",\n api_key=\"<ARGILLA_API_KEY>\",\n num_generations=4,\n )\n\n load_dataset >> text_generation >> to_argilla\n\nif __name__ == \"__main__\":\n pipeline.run()\n
"},{"location":"sections/how_to_guides/advanced/assigning_resources_to_step/","title":"Assigning resources to a Step
","text":"When dealing with complex pipelines that get executed in a distributed environment with abundant resources (CPUs and GPUs), sometimes it's necessary to allocate these resources judiciously among the Step
s. This is why distilabel
allows you to specify the number of replicas
, cpus
and gpus
for each Step
. Let's see that with an example:
from distilabel.pipeline import Pipeline\nfrom distilabel.models import vLLM\nfrom distilabel.steps import StepResources\nfrom distilabel.steps.tasks import PrometheusEval\n\n\nwith Pipeline(name=\"resources\") as pipeline:\n    ...\n\n    prometheus = PrometheusEval(\n        llm=vLLM(\n            model=\"prometheus-eval/prometheus-7b-v2.0\",\n            chat_template=\"[INST] {{ messages[0]['content'] }}\\\\n{{ messages[1]['content'] }}[/INST]\",\n        ),\n        resources=StepResources(replicas=2, cpus=1, gpus=1),\n        mode=\"absolute\",\n        rubric=\"factual-validity\",\n        reference=False,\n        num_generations=1,\n        group_generations=False,\n    )\n
In the example above, we're creating a PrometheusEval
task (remember that Task
s are Step
s) that will use vLLM
to serve prometheus-eval/prometheus-7b-v2.0
model. This task is resource intensive as it requires an LLM, which in turn requires a GPU to run fast. With that in mind, we have specified the resources
required for the task using the StepResources
class, and we have defined that we need 1
GPU and 1
CPU per replica of the task. In addition, we have defined that we need 2
replicas i.e. we will run two instances of the task so the computation for the whole dataset runs faster. In addition, StepResources
uses the RuntimeParametersMixin, so we can also specify the resources for each step when running the pipeline:
...\n\nif __name__ == \"__main__\":\n pipeline.run(\n parameters={\n prometheus.name: {\"resources\": {\"replicas\": 2, \"cpus\": 1, \"gpus\": 1}}\n }\n )\n
And that's it! When running the pipeline, distilabel
will create the tasks in nodes that have available the specified resources.
distilabel
will automatically save all the intermediate outputs generated by each Step
of a Pipeline
, so these outputs can be reused to recover the state of a pipeline execution that was stopped before finishing or to not have to re-execute steps from a pipeline after adding a new downstream step.
The use of the cache can be toggled using the use_cache
parameter of the Pipeline.use_cache
method. If True
, then distilabel
will use the reuse the outputs of previous executions for the new execution. If False
, then distilabel
will re-execute all the steps of the pipeline to generate new outputs for all the steps.
with Pipeline(name=\"my-pipeline\") as pipeline:\n ...\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False) # (1)\n
In addition, the cache can be enabled/disabled at Step
level using its use_cache
attribute. If True
, then the outputs of the step will be reused in the new pipeline execution. If False
, then the step will be re-executed to generate new outputs. If the cache of one step is disabled and the outputs have to be regenerated, then the outputs of the steps that depend on this step will also be regenerated.
with Pipeline(name=\"writting-assistant\") as pipeline:\n load_data = LoadDataFromDicts(\n data=[\n {\n \"instruction\": \"How much is 2+2?\"\n }\n ]\n )\n\n generation = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"Qwen/Qwen2.5-72B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.8,\n \"max_new_tokens\": 512,\n },\n ),\n use_cache=False # (1)\n )\n\n load_data >> generation\n\nif __name__ == \"__main__\":\n distiset = pipeline.run()\n
distilabel
groups information and data generated by a Pipeline
using the name of the pipeline, so the first factor that triggers a cache hit is the name of the pipeline. The second factor, is the Pipeline.signature
property. This property returns a hash that is generated using the names of the steps used in the pipeline and their connections. The third factor, is the Pipeline.aggregated_steps_signature
property which is used to determine if the new pipeline execution is exactly the same as one of the previous i.e. the pipeline contains exactly the same steps, with exactly the same connections and the steps are using exactly the same parameters. If these three factors are met, then the cache hit is triggered and the pipeline won't get re-executed and instead the function create_distiset
will be used to create the resulting Distiset
using the outputs of the previous execution, as it can be seen in the following image:
If the new pipeline execution have a different Pipeline.aggregated_steps_signature
i.e. at least one step has changed its parameters, distilabel
will reuse the outputs of the steps that have not changed and re-execute the steps that have changed, as it can be seen in the following image:
The same pipeline from above gets executed a third time, but this time the last step text_generation_1
changed, so it's needed to re-execute it. The other steps, as they have not been changed, doesn't need to be re-executed and their outputs are reused.
A Pipeline
in distilabel
returns a special type of Hugging Face datasets.DatasetDict
which is called Distiset
.
The Distiset
is a dictionary-like object that contains the different configurations generated by the Pipeline
, where each configuration corresponds to each leaf step in the DAG built by the Pipeline
. Each configuration corresponds to a different subset of the dataset. This is a concept taken from \ud83e\udd17 datasets
that lets you upload different configurations of the same dataset within the same repository and can contain different columns i.e. different configurations, which can be seamlessly pushed to the Hugging Face Hub.
Below you can find an example of how to create a Distiset
object that resembles a datasets.DatasetDict
:
from datasets import Dataset\nfrom distilabel.distiset import Distiset\n\ndistiset = Distiset(\n {\n \"leaf_step_1\": Dataset.from_dict({\"instruction\": [1, 2, 3]}),\n \"leaf_step_2\": Dataset.from_dict(\n {\"instruction\": [1, 2, 3, 4], \"generation\": [5, 6, 7, 8]}\n ),\n }\n)\n
Note
If there's only one leaf node, i.e., only one step at the end of the Pipeline
, then the configuration name won't be the name of the last step, but it will be set to \"default\" instead, as that's more aligned with standard datasets within the Hugging Face Hub.
We can interact with the different pieces generated by the Pipeline
and treat them as different configurations
. The Distiset
contains just two methods:
Create a train/test split partition of the dataset for the different configurations or subsets.
>>> distiset.train_test_split(train_size=0.9)\nDistiset({\n leaf_step_1: DatasetDict({\n train: Dataset({\n features: ['instruction'],\n num_rows: 2\n })\n test: Dataset({\n features: ['instruction'],\n num_rows: 1\n })\n })\n leaf_step_2: DatasetDict({\n train: Dataset({\n features: ['instruction', 'generation'],\n num_rows: 3\n })\n test: Dataset({\n features: ['instruction', 'generation'],\n num_rows: 1\n })\n })\n})\n
"},{"location":"sections/how_to_guides/advanced/distiset/#push-to-hugging-face-hub","title":"Push to Hugging Face Hub","text":"Push the Distiset
to a Hugging Face repository, where each one of the subsets will correspond to a different configuration:
distiset.push_to_hub(\n \"my-org/my-dataset\",\n commit_message=\"Initial commit\",\n private=False,\n token=os.getenv(\"HF_TOKEN\"),\n generate_card=True,\n include_script=False\n)\n
New since version 1.3.0
Since version 1.3.0
you can automatically push the script that created your pipeline to the same repository. For example, assuming you have a file like the following:
with Pipeline() as pipe:\n ...\ndistiset = pipe.run()\ndistiset.push_to_hub(\n \"my-org/my-dataset,\n include_script=True\n)\n
After running the command, you could visit the repository and the file sample_pipe.py
will be stored to simplify sharing your pipeline with the community.
distilabel
contains a custom plugin to automatically generates a gallery for the different components. The information is extracted by parsing the Step
's docstrings. You can take a look at the docstrings in the source code of the UltraFeedback, and take a look at the corresponding entry in the components gallery to see an example of how the docstrings are rendered.
If you create your own components and want the Citations
automatically rendered in the README card (in case you are sharing your final distiset in the Hugging Face Hub), you may want to add the citation section. This is an example for the MagpieGenerator Task:
class MagpieGenerator(GeneratorTask, MagpieBase):\n r\"\"\"Generator task the generates instructions or conversations using Magpie.\n ...\n\n Citations:\n\n ```\n @misc{xu2024magpiealignmentdatasynthesis,\n title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},\n author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},\n year={2024},\n eprint={2406.08464},\n archivePrefix={arXiv},\n primaryClass={cs.CL},\n url={https://arxiv.org/abs/2406.08464},\n }\n ```\n \"\"\"\n
The Citations
section can include any number of bibtex references. To define them, you can add as much elements as needed just like in the example: each citation will be a block of the form: ```@misc{...}```
. This information will be automatically used in the README of your Distiset
if you decide to call distiset.push_to_hub
. Alternatively, if the Citations
is not found, but in the References
there are found any urls pointing to https://arxiv.org/
, we will try to obtain the Bibtex
equivalent automatically. This way, Hugging Face can automatically track the paper for you and it's easier to find other datasets citing the same paper, or directly visiting the paper page.
Take into account that these methods work as datasets.load_from_disk
and datasets.Dataset.save_to_disk
so the arguments are directly passed to those methods. This means you can also make use of storage_options
argument to save your Distiset
in your cloud provider, including the distilabel artifacts (pipeline.yaml
, pipeline.log
and the README.md
with the dataset card). You can read more in datasets
documentation here.
Save the Distiset
to disk, and optionally (will be done by default) saves the dataset card, the pipeline config file and logs:
distiset.save_to_disk(\n \"my-dataset\",\n save_card=True,\n save_pipeline_config=True,\n save_pipeline_log=True\n)\n
Load a Distiset
that was saved using Distiset.save_to_disk
just the same way:
distiset = Distiset.load_from_disk(\"my-dataset\")\n
Load a Distiset
from a remote location, like S3, GCS. You can pass the storage_options
argument to authenticate with the cloud provider:
distiset = Distiset.load_from_disk(\n \"s3://path/to/my_dataset\", # gcs:// or any filesystem tolerated by fsspec\n storage_options={\n \"key\": os.environ[\"S3_ACCESS_KEY\"],\n \"secret\": os.environ[\"S3_SECRET_KEY\"],\n ...\n }\n)\n
Take a look at the remaining arguments at Distiset.save_to_disk
and Distiset.load_from_disk
.
Having this special type of dataset comes with an added advantage when calling Distiset.push_to_hub
, which is the automatically generated dataset card in the Hugging Face Hub. Note that it is enabled by default, but can be disabled by setting generate_card=False
:
distiset.push_to_hub(\"my-org/my-dataset\", generate_card=True)\n
We will have an automatic dataset card (an example can be seen here) with some handy information like reproducing the Pipeline
with the CLI
, or examples of the records from the different subsets.
Lastly, we presented in the caching section the create_distiset
function, you can take a look at the section to see how to create a Distiset
from the cache folder, using the helper function to automatically include all the relevant data.
In some situations, it can happen that the batches contains so much data that is faster to write it to disk and read it back in the next step, instead of passing it using the queue. To solve this issue, distilabel
uses fsspec
to allow providing a file system configuration and whether if this file system should be used to pass data between steps in the run
method of the distilabel
pipelines:
Warning
In order to use a specific file system/cloud storage, you will need to install the specific package providing the fsspec
implementation for that file system. For instance, to use Google Cloud Storage you will need to install gcsfs
:
pip install gcsfs\n
Check the available implementations: fsspec - Other known implementations
from distilabel.pipeline import Pipeline\n\nwith Pipeline(name=\"my-pipeline\") as pipeline:\n ...\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n ..., \n storage_parameters={\"path\": \"gcs://my-bucket\"},\n use_fs_to_pass_data=True\n )\n
The code above setups a file system (in this case Google Cloud Storage) and sets the flag use_fs_to_pass_data
to specify that the data of the batches should be passed to the steps using the file system. The storage_parameters
argument is optional, and in the case it's not provided but use_fs_to_pass_data==True
, distilabel
will use the local file system.
Note
As GlobalStep
s receives all the data from the previous steps in one single batch accumulating all the data, it's very likely that the data of the batch will be too big to be passed using the queue. In this case and even if use_fs_to_pass_data==False
, distilabel
will use the file system to pass the data to the GlobalStep
.
By default, the distilabel
architecture loads all steps of a pipeline at the same time, as they are all supposed to process batches of data in parallel. However, loading all steps at once can waste resources in two scenarios: when using GlobalStep
s that must wait for upstream steps to complete before processing data, or when running on machines with limited resources that cannot execute all steps simultaneously. In these cases, steps need to be loaded and executed in distinct load stages.
A load stage represents a point in the pipeline execution where a group of steps are loaded at the same time to process batches in parallel. These stages are required because:
GlobalStep
s that need to receive all the data at once from their upstream steps, i.e. they need their upstream steps to have finished their execution. It would be wasteful to load a GlobalStep
at the same time as other steps of the pipeline as that would take resources (from the machine or cluster running the pipeline) that wouldn't be used until upstream steps have finished.Having that said, the first element that will create a load stage when executing a pipeline are the GlobalStep
, as they mark and divide a pipeline in three stages: one stage with the upstream steps of the global step, one stage with the global step, and one final stage with the downstream steps of the global step. For example, the following pipeline will contain three stages:
from typing import TYPE_CHECKING\n\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts, StepInput, step\n\nif TYPE_CHECKING:\n from distilabel.typing import StepOutput\n\n\n@step(inputs=[\"instruction\"], outputs=[\"instruction2\"])\ndef DummyStep(inputs: StepInput) -> \"StepOutput\":\n for input in inputs:\n input[\"instruction2\"] = \"miau\"\n yield inputs\n\n\n@step(inputs=[\"instruction\"], outputs=[\"instruction2\"], step_type=\"global\")\ndef GlobalDummyStep(inputs: StepInput) -> \"StepOutput\":\n for input in inputs:\n input[\"instruction2\"] = \"miau\"\n yield inputs\n\n\nwith Pipeline() as pipeline:\n generator = LoadDataFromDicts(data=[{\"instruction\": \"Hi\"}] * 50)\n dummy_step_0 = DummyStep()\n global_dummy_step = GlobalDummyStep()\n dummy_step_1 = DummyStep()\n\n generator >> dummy_step_0 >> global_dummy_step >> dummy_step_1\n\nif __name__ == \"__main__\":\n load_stages = pipeline.get_load_stages()\n\n for i, steps_stage in enumerate(load_stages[0]):\n print(f\"Stage {i}: {steps_stage}\")\n\n # Output:\n # Stage 0: ['load_data_from_dicts_0', 'dummy_step_0']\n # Stage 1: ['global_dummy_step_0']\n # Stage 2: ['dummy_step_1']\n
As we can see, the GlobalStep
divided the pipeline execution in three stages.
While GlobalStep
s automatically divide pipeline execution into stages, we may need fine-grained control over how steps are loaded and executed within each stage. This is where load groups come in.
Load groups allow you to specify which steps of the pipeline have to be loaded together within a stage. This is particularly useful when running in resource-constrained environments where all the steps cannot be executed in parallel.
Let's see how it works with an example:
from datasets import load_dataset\n\nfrom distilabel.llms import vLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import StepResources\nfrom distilabel.steps.tasks import TextGeneration\n\ndataset = load_dataset(\n \"distilabel-internal-testing/instruction-dataset-mini\", split=\"test\"\n).rename_column(\"prompt\", \"instruction\")\n\nwith Pipeline() as pipeline:\n text_generation_0 = TextGeneration(\n llm=vLLM(\n model=\"HuggingFaceTB/SmolLM2-1.7B-Instruct\",\n extra_kwargs={\"max_model_len\": 1024},\n ),\n resources=StepResources(gpus=1),\n )\n\n text_generation_1 = TextGeneration(\n llm=vLLM(\n model=\"HuggingFaceTB/SmolLM2-1.7B-Instruct\",\n extra_kwargs={\"max_model_len\": 1024},\n ),\n resources=StepResources(gpus=1),\n )\n\nif __name__ == \"__main__\":\n load_stages = pipeline.get_load_stages(load_groups=[[text_generation_1.name]])\n\n for i, steps_stage in enumerate(load_stages[0]):\n print(f\"Stage {i}: {steps_stage}\")\n\n # Output:\n # Stage 0: ['text_generation_0']\n # Stage 1: ['text_generation_1']\n\n distiset = pipeline.run(dataset=dataset, load_groups=[[text_generation_0.name]])\n
In this example, we're working with a machine that has a single GPU, but the pipeline includes two TextGeneration tasks, both using vLLM and each requesting 1 GPU, so we cannot execute both steps in parallel. To fix that, we specify in the run
method, using the load_groups
argument, that the text_generation_0
step has to be executed in isolation in its own stage. This way, we can run the pipeline on a single-GPU machine by executing the steps in different stages (sequentially) instead of in parallel.
Some key points about load groups:
GlobalSteps
s, a load group creates a new load stage, dividing the pipeline into three stages: one for the upstream steps, one for the steps in the load group, and one for the downstream steps. In addition, distilabel
allows passing some modes to the load_groups
argument that will handle the creation of the load groups:
\"sequential_step_execution\"
: when passed, it will create a load group for each step, i.e. the steps of the pipeline will be executed sequentially. Offline batch generation is a feature that some LLM
s implemented in distilabel
offer, allowing you to send the inputs to an LLM-as-a-service platform and wait for the outputs in an asynchronous manner. LLM-as-a-service platforms offer this feature because it allows them to gather many inputs and create batches as big as their hardware allows, maximizing hardware utilization and reducing the cost of the service. In exchange, the user has to wait a certain amount of time for the outputs to be ready, but the cost per token is usually much lower.
distilabel
pipelines are able to handle LLM
s that offer this feature in the following way:
LLM
will send the inputs to the platform. The platform will return job ids that can be used later to check the status of the jobs and retrieve the results. The LLM
will save these job ids in its jobs_ids
attribute and raise a special exception DistilabelOfflineBatchGenerationNotFinishedException that will be handled by the Pipeline
. The job ids will be saved in the pipeline cache, so they can be used in subsequent calls. The LLM
won't send the inputs again to the platform. This time, as it has the jobs_ids
, it will check whether the jobs have finished; if they have, it will retrieve the results and return the outputs. If they haven't finished, it will raise DistilabelOfflineBatchGenerationNotFinishedException
again. The pipeline execution can be stopped at any time while polling (e.g. by pressing Ctrl+C or by sending a SIGINT
to the main process), which will stop the polling and raise DistilabelOfflineBatchGenerationNotFinishedException
that will be handled by the pipeline as described above. Warning
In order to recover the pipeline execution and retrieve the results, the pipeline cache must be enabled. If the pipeline cache is disabled, the inputs will be sent again and different jobs will be created, incurring extra costs.
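As a minimal sketch of what recovering an execution can look like (assuming pipeline is the one defined in the example below and that the cache hasn't been disabled), it is enough to simply call run again:
if __name__ == \"__main__\":\n    # First run: the inputs are sent, the job ids are created and stored in the cache, and the\n    # DistilabelOfflineBatchGenerationNotFinishedException is handled by the pipeline.\n    # A later run with the cache enabled recovers the job ids and, once the jobs have\n    # finished, retrieves the outputs.\n    distiset = pipeline.run(use_cache=True)  # use_cache=True is already the default\n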
"},{"location":"sections/how_to_guides/advanced/offline_batch_generation/#example-pipeline-using-openaillm-with-offline-batch-generation","title":"Example pipeline usingOpenAILLM
with offline batch generation","text":"from distilabel.models import OpenAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline() as pipeline:\n load_data = LoadDataFromHub(output_mappings={\"prompt\": \"instruction\"})\n\n text_generation = TextGeneration(\n llm=OpenAILLM(\n model=\"gpt-3.5-turbo\",\n use_offline_batch_generation=True, # (1)\n )\n )\n\n load_data >> text_generation\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n load_data.name: {\n \"repo_id\": \"distilabel-internal-testing/instruction-dataset\",\n \"split\": \"test\",\n \"batch_size\": 500,\n },\n }\n )\n
OpenAILLM
should use offline batch generation.When sharing a Pipeline
that contains custom Step
s or Task
s, you may want to add the specific requirements that are needed to run them. distilabel
will take this list of requirements and warn the user if any are missing.
Let's see how we can add additional requirements with an example. The first thing we're going to do is to add requirements for our CustomStep
. To do so we use the requirements
decorator to specify that the step has nltk>=3.8
as a dependency (we can use version specifiers). In addition, we're going to specify at the Pipeline
level that we need distilabel>=1.3.0
to run it.
from typing import List\n\nimport nltk\n\nfrom distilabel.steps import Step\nfrom distilabel.steps.base import StepInput\nfrom distilabel.steps.typing import StepOutput\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.utils.requirements import requirements\nfrom distilabel.pipeline import Pipeline\n\n\n@requirements([\"nltk>=3.8\"])\nclass CustomStep(Step):\n @property\n def inputs(self) -> List[str]:\n return [\"instruction\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"response\"]\n\n def process(self, inputs: StepInput) -> StepOutput: # type: ignore\n for input in inputs:\n input[\"response\"] = nltk.word_tokenize(input[\"instruction\"])\n yield inputs\n\n\nwith Pipeline(\n name=\"pipeline-with-requirements\", requirements=[\"distilabel>=1.3.0\"]\n) as pipeline:\n loader = LoadDataFromDicts(data=[{\"instruction\": \"sample sentence\"}])\n step1 = CustomStep()\n loader >> step1\n\nif __name__ == \"__main__\":\n pipeline.run()\n
Once we call pipeline.run()
, if any of the requirements declared at the Step
or Pipeline
level is missing, a ValueError
will be raised telling us that we should install the list of dependencies:
>>> pipeline.run()\n[06/27/24 11:07:33] ERROR ['distilabel.pipeline'] Please install the following requirements to run the pipeline: base.py:350\n distilabel>=1.3.0\n...\nValueError: Please install the following requirements to run the pipeline:\ndistilabel>=1.3.0\n
"},{"location":"sections/how_to_guides/advanced/saving_step_generated_artifacts/","title":"Saving step generated artifacts","text":"Some Step
s might need to produce an auxiliary artifact that is not a result of the computation, but is needed for the computation. For example, the FaissNearestNeighbour
needs to create a Faiss index to compute the output of the step, which is the top k
nearest neighbours for each input. Generating the Faiss index takes time and it could potentially be reused outside of the distilabel
pipeline, so it would be a shame not to save it.
For this reason, Step
s have a method called save_artifact
that allows saving artifacts that will be included along with the outputs of the pipeline in the generated Distiset
. The generated artifacts will be uploaded and saved when using Distiset.push_to_hub
or Distiset.save_to_disk
respectively. Let's see how to use it with a simple example.
from typing import List, TYPE_CHECKING\nfrom distilabel.steps import GlobalStep, StepInput, StepOutput\nimport matplotlib.pyplot as plt\n\nif TYPE_CHECKING:\n from distilabel.steps import StepOutput\n\n\nclass CountTextCharacters(GlobalStep):\n @property\n def inputs(self) -> List[str]:\n return [\"text\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"text_character_count\"]\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n character_counts = []\n\n for input in inputs:\n text_character_count = len(input[\"text\"])\n input[\"text_character_count\"] = text_character_count\n character_counts.append(text_character_count)\n\n # Generate plot with the distribution of text character counts\n plt.figure(figsize=(10, 6))\n plt.hist(character_counts, bins=30, edgecolor=\"black\")\n plt.title(\"Distribution of Text Character Counts\")\n plt.xlabel(\"Character Count\")\n plt.ylabel(\"Frequency\")\n\n # Save the plot as an artifact of the step\n self.save_artifact(\n name=\"text_character_count_distribution\",\n write_function=lambda path: plt.savefig(path / \"figure.png\"),\n metadata={\"type\": \"image\", \"library\": \"matplotlib\"},\n )\n\n plt.close()\n\n yield inputs\n
As can be seen in the example above, we have created a simple step that counts the number of characters in each input text and generates a histogram with the distribution of the character counts. We save the histogram as an artifact of the step using the save_artifact
method. The method takes three arguments:
name
: The name we want to give to the artifact.write_function
: A function that writes the artifact to the desired path. The function will receive a path
argument which is a pathlib.Path
object pointing to the directory where the artifact should be saved.metadata
: A dictionary with metadata about the artifact. This metadata will be saved along with the artifact.Let's execute the step with a simple pipeline and push the resulting Distiset
to the Hugging Face Hub:
from typing import TYPE_CHECKING, List\n\nimport matplotlib.pyplot as plt\nfrom datasets import load_dataset\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import GlobalStep, StepInput, StepOutput\n\nif TYPE_CHECKING:\n from distilabel.steps import StepOutput\n\n\nclass CountTextCharacters(GlobalStep):\n @property\n def inputs(self) -> List[str]:\n return [\"text\"]\n\n @property\n def outputs(self) -> List[str]:\n return [\"text_character_count\"]\n\n def process(self, inputs: StepInput) -> \"StepOutput\": # type: ignore\n character_counts = []\n\n for input in inputs:\n text_character_count = len(input[\"text\"])\n input[\"text_character_count\"] = text_character_count\n character_counts.append(text_character_count)\n\n # Generate plot with the distribution of text character counts\n plt.figure(figsize=(10, 6))\n plt.hist(character_counts, bins=30, edgecolor=\"black\")\n plt.title(\"Distribution of Text Character Counts\")\n plt.xlabel(\"Character Count\")\n plt.ylabel(\"Frequency\")\n\n # Save the plot as an artifact of the step\n self.save_artifact(\n name=\"text_character_count_distribution\",\n write_function=lambda path: plt.savefig(path / \"figure.png\"),\n metadata={\"type\": \"image\", \"library\": \"matplotlib\"},\n )\n\n plt.close()\n\n yield inputs\n\n\nwith Pipeline() as pipeline:\n count_text_characters = CountTextCharacters()\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n dataset=load_dataset(\n \"HuggingFaceH4/instruction-dataset\", split=\"test\"\n ).rename_column(\"prompt\", \"text\"),\n )\n\n distiset.push_to_hub(\"distilabel-internal-testing/distilabel-artifacts-example\")\n
The generated distilabel-internal-testing/distilabel-artifacts-example dataset repository has a section in its card describing the artifacts generated by the pipeline and the generated plot can be seen here.
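If you prefer not to push the dataset to the Hub, a minimal sketch of keeping everything locally (assuming distiset is the one generated in the example above and using a hypothetical local path) would be to save it to disk, which also stores the generated artifacts:
# Assuming `distiset` is the one generated by the pipeline in the example above,\n# the artifacts saved with `save_artifact` are persisted together with the dataset.\ndistiset.save_to_disk(\"./distilabel-artifacts-example\")  # hypothetical local path\n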
"},{"location":"sections/how_to_guides/advanced/scaling_with_ray/","title":"Scaling and distributing a pipeline with Ray","text":"Although the local Pipeline based on multiprocessing
+ serving LLMs with an external service is enough for executing most of the pipelines used to create SFT and preference datasets, there are scenarios where you might need to scale your pipeline across multiple machines. In such cases, distilabel leverages Ray to distribute the workload efficiently. This allows you to generate larger datasets, reduce execution time, and maximize resource utilization across a cluster of machines, without needing to change a single line of code.
A distilabel
pipeline consists of several Step
s. A Step
is a class that defines a basic life-cycle:
So a Step
needs to maintain a minimum state, and the best way to do that with Ray is by using actors.
graph TD\n A[Step] -->|has| B[Multiple Replicas]\n B -->|wrapped in| C[Ray Actor]\n C -->|maintains| D[Step Replica State]\n C -->|executes| E[Step Lifecycle]\n E -->|1. Load/Create Resources| F[LLMs, Clients, etc.]\n E -->|2. Process batches from| G[Input Queue]\n E -->|3. Processed batches are put in| H[Output Queue]\n E -->|4. Unload| I[Cleanup]\n
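To make the diagram above a bit more tangible, here is a minimal, illustrative sketch (not distilabel's actual implementation) of how a step replica can keep its state between batches when wrapped in a Ray actor:
import ray\n\n\n@ray.remote\nclass StepReplica:\n    \"\"\"Toy actor: each replica keeps its own state between batches.\"\"\"\n\n    def __init__(self) -> None:\n        self.resources_loaded = False\n\n    def load(self) -> None:\n        # 1. Load/create resources (LLMs, clients, etc.) once per replica\n        self.resources_loaded = True\n\n    def process_batch(self, batch: list) -> list:\n        # 2-3. Take a batch from the input queue, process it and return it for the output queue\n        return [{**row, \"processed\": True} for row in batch]\n\n\nreplica = StepReplica.remote()\nray.get(replica.load.remote())\nprint(ray.get(replica.process_batch.remote([{\"instruction\": \"Hi\"}])))\n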
"},{"location":"sections/how_to_guides/advanced/scaling_with_ray/#executing-a-pipeline-with-ray","title":"Executing a pipeline with Ray","text":"The recommended way to execute a distilabel
pipeline using Ray is using the Ray Jobs API.
Before jumping on the explanation, let's first install the prerequisites:
pip install distilabel[ray]\n
Tip
It's recommended to create a virtual environment.
For the purpose of explaining how to execute a pipeline with Ray, we'll use the following pipeline throughout the examples:
from distilabel.models import vLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(name=\"text-generation-ray-pipeline\") as pipeline:\n load_data_from_hub = LoadDataFromHub(output_mappings={\"prompt\": \"instruction\"})\n\n text_generation = TextGeneration(\n llm=vLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n tokenizer=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n )\n )\n\n load_data_from_hub >> text_generation\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n load_data_from_hub.name: {\n \"repo_id\": \"HuggingFaceH4/instruction-dataset\",\n \"split\": \"test\",\n },\n text_generation.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 4096,\n }\n },\n \"resources\": {\"replicas\": 2, \"gpus\": 1}, # (1)\n },\n }\n )\n\n distiset.push_to_hub(\n \"<YOUR_HF_USERNAME_OR_ORGANIZATION>/text-generation-distilabel-ray\" # (2)\n )\n
text_generation
step and defining that we want two replicas and one GPU per replica. distilabel
will create two replicas of the step i.e. two actors in the Ray cluster, and each actor will request to be allocated in a node of the cluster that have at least one GPU. You can read more about how Ray manages the resources here.It's a basic pipeline with just two steps: one to load a dataset from the Hub with an instruction
column and one to generate a response
for that instruction using Llama 3 8B Instruct with vLLM. Simple but enough to demonstrate how to distribute and scale the workload using a Ray cluster!
If you don't know the Ray Jobs API then it's recommended to read Ray Jobs Overview. Quick summary: Ray Jobs is the recommended way to execute a job in a Ray cluster as it will handle packaging, deploying and managing the Ray application.
To execute the pipeline above, we first need to create a directory (kind of a package) with the pipeline script (or scripts) that we will submit to the Ray cluster:
mkdir ray-pipeline\n
The content of the directory ray-pipeline
should be:
ray-pipeline/\n\u251c\u2500\u2500 pipeline.py\n\u2514\u2500\u2500 runtime_env.yaml\n
The first file contains the code of the pipeline, while the second one (runtime_env.yaml
) is a specific Ray file containing the environment dependencies required to run the job:
pip:\n - distilabel[ray,vllm] >= 1.3.0\nenv_vars:\n HF_TOKEN: <YOUR_HF_TOKEN>\n
With this file we're basically informing the Ray cluster that it will have to install distilabel
with the vllm
and ray
extra dependencies to be able to run the job. In addition, we're defining the HF_TOKEN
environment variable that will be used (by the push_to_hub
method) to upload the resulting dataset to the Hugging Face Hub.
After that, we can proceed to execute the ray
command that will submit the job to the Ray cluster:
ray job submit \\\n --address http://localhost:8265 \\\n --working-dir ray-pipeline \\\n --runtime-env ray-pipeline/runtime_env.yaml -- python pipeline.py\n
What this will do is basically upload the --working-dir
to the Ray cluster, install the dependencies and then execute the python pipeline.py
command from the head node.
As described in Using a file system to pass data to steps, distilabel
relies on the file system to pass the data to the GlobalStep
s, so if the pipeline to be executed in the Ray cluster has any GlobalStep
or you want to set use_fs_to_pass_data=True
in the run method, then you will need to set up a file system to which all the nodes of the Ray cluster have access:
if __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={...},\n storage_parameters={\"path\": \"file:///mnt/data\"}, # (1)\n use_fs_to_pass_data=True,\n )\n
/mnt/data
.RayPipeline
in a cluster with Slurm","text":"If you have access to an HPC, then you're probably also a user of Slurm, a workload manager typically used on HPCs. We can create a Slurm job that takes some nodes and deploys a Ray cluster to run a distributed distilabel
pipeline:
#!/bin/bash\n#SBATCH --job-name=distilabel-ray-text-generation\n#SBATCH --partition=your-partition\n#SBATCH --qos=normal\n#SBATCH --nodes=2 # (1)\n#SBATCH --exclusive\n#SBATCH --ntasks-per-node=1 # (2)\n#SBATCH --gpus-per-node=1 # (3)\n#SBATCH --time=0:30:00\n\nset -ex\n\necho \"SLURM_JOB_ID: $SLURM_JOB_ID\"\necho \"SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST\"\n\n# Activate virtual environment\nsource /path/to/virtualenv/.venv/bin/activate\n\n# Getting the node names\nnodes=$(scontrol show hostnames \"$SLURM_JOB_NODELIST\")\nnodes_array=($nodes)\n\n# Get the IP address of the head node\nhead_node=${nodes_array[0]}\nhead_node_ip=$(srun --nodes=1 --ntasks=1 -w \"$head_node\" hostname --ip-address)\n\n# Start Ray head node\nport=6379\nip_head=$head_node_ip:$port\nexport ip_head\necho \"IP Head: $ip_head\"\n\n# Generate a unique Ray tmp dir for the head node (just in case the default one is not writable)\nhead_tmp_dir=\"/tmp/ray_tmp_${SLURM_JOB_ID}_head\"\n\necho \"Starting HEAD at $head_node\"\nOUTLINES_CACHE_DIR=\"/tmp/.outlines\" srun --nodes=1 --ntasks=1 -w \"$head_node\" \\ # (4)\n ray start --head --node-ip-address=\"$head_node_ip\" --port=$port \\\n --dashboard-host=0.0.0.0 \\\n --dashboard-port=8265 \\\n --temp-dir=\"$head_tmp_dir\" \\\n --block &\n\n# Give some time to head node to start...\necho \"Waiting a bit before starting worker nodes...\"\nsleep 10\n\n# Start Ray worker nodes\nworker_num=$((SLURM_JOB_NUM_NODES - 1))\n\n# Start from 1 (0 is head node)\nfor ((i = 1; i <= worker_num; i++)); do\n node_i=${nodes_array[$i]}\n worker_tmp_dir=\"/tmp/ray_tmp_${SLURM_JOB_ID}_worker_$i\"\n echo \"Starting WORKER $i at $node_i\"\n OUTLINES_CACHE_DIR=\"/tmp/.outlines\" srun --nodes=1 --ntasks=1 -w \"$node_i\" \\\n ray start --address \"$ip_head\" \\\n --temp-dir=\"$worker_tmp_dir\" \\\n --block &\n sleep 5\ndone\n\n# Give some time to the Ray cluster to gather info\necho \"Waiting a bit before submitting the job...\"\nsleep 60\n\n# Finally submit the job to the cluster\nray job submit --address http://localhost:8265 --working-dir ray-pipeline -- python -u pipeline.py\n
OUTLINES_CACHE_DIR
to /tmp/.outlines
to avoid issues with the nodes trying to read/write the same outlines
cache files, which is not possible.vLLM
and tensor_parallel_size
","text":"In order to use vLLM
multi-GPU and multi-node capabilities with ray
, we need to do a few changes in the example pipeline from above. The first change needed is to specify a value for tensor_parallel_size
aka \"In how many GPUs do I want you to load the model\", and the second one is to define ray
as the distributed_executor_backend
as the default one in vLLM
is to use multiprocessing
:
with Pipeline(name=\"text-generation-ray-pipeline\") as pipeline:\n load_data_from_hub = LoadDataFromHub(output_mappings={\"prompt\": \"instruction\"})\n\n text_generation = TextGeneration(\n llm=vLLM(\n model=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n extra_kwargs={\n \"tensor_parallel_size\": 8,\n \"distributed_executor_backend\": \"ray\",\n }\n )\n )\n\n load_data_from_hub >> text_generation\n
More information about distributed inference with vLLM
can be found here: vLLM - Distributed Serving
LLM
for sharing it between several Task
s","text":"It's very common to want to use the same LLM
for several Task
s in a pipeline. To avoid loading the LLM
as many times as the number of Task
s and avoid wasting resources, it's recommended to serve the model using solutions like text-generation-inference
or vLLM
, and then use an AsyncLLM
compatible client like InferenceEndpointsLLM
or OpenAILLM
to communicate with the server respectively.
text-generation-inference
","text":"model=meta-llama/Meta-Llama-3-8B-Instruct\nvolume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run\n\ndocker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \\\n -e HUGGING_FACE_HUB_TOKEN=<secret> \\\n ghcr.io/huggingface/text-generation-inference:2.0.4 \\\n --model-id $model\n
Note
The bash command above has been copy-pasted from the official docs text-generation-inference. Please refer to the official docs for more information.
And then we can use InferenceEndpointsLLM
with base_url=http://localhost:8080
(pointing to our TGI
local deployment):
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration, UltraFeedback\n\nwith Pipeline(name=\"serving-llm\") as pipeline:\n load_data = LoadDataFromDicts(\n data=[{\"instruction\": \"Write a poem about the sun and moon.\"}]\n )\n\n # `base_url` points to the address of the `TGI` serving the LLM\n llm = InferenceEndpointsLLM(base_url=\"http://192.168.1.138:8080\")\n\n text_generation = TextGeneration(\n llm=llm,\n num_generations=3,\n group_generations=True,\n output_mappings={\"generation\": \"generations\"},\n )\n\n ultrafeedback = UltraFeedback(aspect=\"overall-rating\", llm=llm)\n\n load_data >> text_generation >> ultrafeedback\n
"},{"location":"sections/how_to_guides/advanced/serving_an_llm_for_reuse/#serving-llms-using-vllm","title":"Serving LLMs using vLLM
","text":"docker run --gpus all \\\n -v ~/.cache/huggingface:/root/.cache/huggingface \\\n --env \"HUGGING_FACE_HUB_TOKEN=<secret>\" \\\n -p 8000:8000 \\\n --ipc=host \\\n vllm/vllm-openai:latest \\\n --model meta-llama/Meta-Llama-3-8B-Instruct\n
Note
The bash command above has been copy-pasted from the official docs vLLM. Please refer to the official docs for more information.
And then we can use OpenAILLM
with base_url=http://localhost:8000
(pointing to our vLLM
local deployment):
from distilabel.models import OpenAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration, UltraFeedback\n\nwith Pipeline(name=\"serving-llm\") as pipeline:\n load_data = LoadDataFromDicts(\n data=[{\"instruction\": \"Write a poem about the sun and moon.\"}]\n )\n\n # `base_url` points to the address of the `vLLM` serving the LLM\n llm = OpenAILLM(base_url=\"http://192.168.1.138:8000\", model=\"\")\n\n text_generation = TextGeneration(\n llm=llm,\n num_generations=3,\n group_generations=True,\n output_mappings={\"generation\": \"generations\"},\n )\n\n ultrafeedback = UltraFeedback(aspect=\"overall-rating\", llm=llm)\n\n load_data >> text_generation >> ultrafeedback\n
"},{"location":"sections/how_to_guides/advanced/structured_generation/","title":"Structured data generation","text":"Distilabel
has integrations with relevant libraries to generate structured text i.e. to guide the LLM
towards the generation of structured outputs following a JSON schema, a regex, etc.
Distilabel
integrates outlines
within some LLM
subclasses. At the moment, the following LLMs integrated with outlines
are supported in distilabel
: TransformersLLM
, vLLM
or LlamaCppLLM
, so that anyone can generate structured outputs in the form of JSON or a parseable regex.
The LLM
has an argument named structured_output
1 that determines how we can generate structured outputs with it, let's see an example using LlamaCppLLM
.
Note
For outlines
integration to work you may need to install the corresponding dependencies:
pip install distilabel[outlines]\n
"},{"location":"sections/how_to_guides/advanced/structured_generation/#json","title":"JSON","text":"We will start with a JSON example, where we initially define a pydantic.BaseModel
schema to guide the generation of the structured output.
Note
Take a look at StructuredOutputType
to see the expected format of the structured_output
dict variable.
from pydantic import BaseModel\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n
And then we provide that schema to the structured_output
argument of the LLM.
from distilabel.models import LlamaCppLLM\n\nllm = LlamaCppLLM(\n model_path=\"./openhermes-2.5-mistral-7b.Q4_K_M.gguf\", # (1)\n n_gpu_layers=-1,\n n_ctx=1024,\n structured_output={\"format\": \"json\", \"schema\": User},\n)\nllm.load()\n
llama.cpp
compatible, from the Hugging Face Hub using curl2, but any model can be used as replacement, as long as the model_path
argument is updated.And we are ready to pass our instruction as usual:
import json\n\nresult = llm.generate(\n [[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]],\n max_new_tokens=50\n)\n\ndata = json.loads(result[0][0])\ndata\n# {'name': 'Kathy', 'last_name': 'Smith', 'id': 4539210}\nUser(**data)\n# User(name='Kathy', last_name='Smith', id=4539210)\n
We get back a Python dictionary (formatted as a string) that we can parse using json.loads
, or validate it directly using the User
, which is a pydantic.BaseModel
instance.
The following example shows an example of text generation whose output adhere to a regular expression:
pattern = r\"<name>(.*?)</name>.*?<grade>(.*?)</grade>\" #\u00a0the same pattern for re.compile\n\nllm=LlamaCppLLM(\n model_path=model_path,\n n_gpu_layers=-1,\n n_ctx=1024,\n structured_output={\"format\": \"regex\", \"schema\": pattern},\n)\nllm.load()\n\nresult = llm.generate(\n [\n [\n {\"role\": \"system\", \"content\": \"You are Simpsons' fans who loves assigning grades from A to E, where A is the best and E is the worst.\"},\n {\"role\": \"user\", \"content\": \"What's up with Homer Simpson?\"}\n ]\n ],\n max_new_tokens=200\n)\n
We can check the output by parsing the content using the same pattern we required from the LLM.
import re\nmatch = re.search(pattern, result[0][0])\n\nif match:\n name = match.group(1)\n grade = match.group(2)\n print(f\"Name: {name}, Grade: {grade}\")\n# Name: Homer Simpson, Grade: C+\n
These were some simple examples, but one can see the options this opens.
Tip
A full pipeline example can be seen in the following script: examples/structured_generation_with_outlines.py
For other LLM providers behind APIs, there's no direct way of accessing the internal logit processor like outlines
does, but thanks to instructor
we can generate structured output from LLM providers based on pydantic.BaseModel
objects. We have integrated instructor
to deal with the AsyncLLM
.
Note
For instructor
integration to work you may need to install the corresponding dependencies:
pip install distilabel[instructor]\n
Note
Take a look at InstructorStructuredOutputType
to see the expected format of the structured_output
dict variable.
The following is the same example you can see with outlines
's JSON
section for comparison purposes.
from pydantic import BaseModel\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n
And then we provide that schema to the structured_output
argument of the LLM:
Note
In this example we are using Meta Llama 3.1 8B Instruct, keep in mind not all the models support structured outputs.
from distilabel.models import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n structured_output={\"schema\": User}\n)\nllm.load()\n
And we are ready to pass our instructions as usual:
import json\n\nresult = llm.generate(\n [[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]],\n max_new_tokens=256\n)\n\ndata = json.loads(result[0][0])\ndata\n# {'name': 'John', 'last_name': 'Doe', 'id': 12345}\nUser(**data)\n# User(name='John', last_name='Doe', id=12345)\n
We get back a Python dictionary (formatted as a string) that we can parse using json.loads
, or validate it directly using the User
, which is a pydantic.BaseModel
instance.
Tip
A full pipeline example can be seen in the following script: examples/structured_generation_with_instructor.py
OpenAI offers a JSON Mode to deal with structured output via their API, let's see how to make use of them. The JSON mode instructs the model to always return a JSON object following the instruction required.
Warning
Bear in mind, for this to work, you must instruct the model in some way to generate JSON, either in the system message
or in the instruction, as can be seen in the API reference.
Contrary to what we have via outlines
, JSON mode will not guarantee the output matches any specific schema, only that it is valid and parses without errors. More information can be found in the OpenAI documentation.
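Since only valid JSON is guaranteed, a small sketch of validating the output ourselves (reusing the User model defined above; raw_output is just an illustrative response) could look like this:
import json\n\nfrom pydantic import ValidationError\n\nraw_output = '{\"name\": \"Kathy\", \"last_name\": \"Smith\", \"id\": 4539210}'  # illustrative JSON mode response\n\ndata = json.loads(raw_output)  # guaranteed to parse: JSON mode always returns valid JSON\ntry:\n    user = User(**data)  # not guaranteed: the schema still has to be validated by us\n    print(user)\nexcept ValidationError as e:\n    print(f\"Output did not match the expected schema: {e}\")\n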
Other than the reference to generating JSON, to ensure the model generates parseable JSON we can pass the argument response_format=\"json\"
3:
from distilabel.models import OpenAILLM\nllm = OpenAILLM(model=\"gpt4-turbo\", api_key=\"api.key\")\nllm.generate(..., response_format=\"json\")\n
You can check the variable type by importing it from:
from distilabel.steps.tasks.structured_outputs.outlines import StructuredOutputType\n
\u21a9 Download the model with curl:
curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf\n
\u21a9 Keep in mind that to interact with this response_format
argument in a pipeline, you will have to pass it via the generation_kwargs
:
# Assuming a pipeline is already defined, and we have a task using OpenAILLM called `task_with_openai`:\npipeline.run(\n parameters={\n \"task_with_openai\": {\n \"llm\": {\n \"generation_kwargs\": {\n \"response_format\": \"json\"\n }\n }\n }\n }\n)\n
\u21a9 Distilabel
offers a CLI
to explore and re-run existing Pipeline
dumps, meaning that an existing dump can be explored to see the steps, how those are connected, the runtime parameters used, and also re-run it with the same or different runtime parameters, respectively.
The only available command as of the current version of distilabel
is distilabel pipeline
.
$ distilabel pipeline --help\n\n Usage: distilabel pipeline [OPTIONS] COMMAND [ARGS]...\n\n Commands to run and inspect Distilabel pipelines.\n\n\u256d\u2500 Options \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 --help Show this message and exit. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n\u256d\u2500 Commands \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 info Get information about a Distilabel pipeline. \u2502\n\u2502 run Run a Distilabel pipeline. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
So far, distilabel pipeline
has two subcommands: info
and run
, as described below. Note that for testing purposes we will be using the following dataset.
distilabel pipeline info
","text":"$ distilabel pipeline info --help\n\n Usage: distilabel pipeline info [OPTIONS]\n\n Get information about a Distilabel pipeline.\n\n\u256d\u2500 Options \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 * --config TEXT Path or URL to the Distilabel pipeline configuration file. \u2502\n\u2502 [default: None] \u2502\n\u2502 [required] \u2502\n\u2502 --help Show this message and exit. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
As we can see from the help message, we need to pass either a Path
or a URL
. This second option comes in handy for datasets stored in the Hugging Face Hub, for example:
distilabel pipeline info --config \"https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml\"\n
If we take a look:
The pipeline information includes the steps used in the Pipeline
along with the Runtime Parameter
that was used, as well as a description of each of them, and also the connections between these steps. These can be helpful to explore the Pipeline locally.
distilabel pipeline run
","text":"We can also run a Pipeline
from the CLI just pointing to the same pipeline.yaml
file or a URL pointing to it and calling distilabel pipeline run
. Alternatively, an URL pointing to a Python script containing a distilabel pipeline can be used:
$ distilabel pipeline run --help\n\n Usage: distilabel pipeline run [OPTIONS]\n\n Run a Distilabel pipeline.\n\n\u256d\u2500 Options \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 --param PARSE_RUNTIME_PARAM [default: (dynamic)] \u2502\n\u2502 --config TEXT Path or URL to the Distilabel pipeline configuration file. \u2502\n\u2502 [default: None] \u2502\n\u2502 --script TEXT URL pointing to a python script containing a distilabel \u2502\n\u2502 pipeline. \u2502\n\u2502 [default: None] \u2502\n\u2502 --pipeline-variable-name TEXT Name of the pipeline in a script. I.e. the 'pipeline' \u2502\n\u2502 variable in `with Pipeline(...) as pipeline:...`. \u2502\n\u2502 [default: pipeline] \u2502\n\u2502 --ignore-cache --no-ignore-cache Whether to ignore the cache and re-run the pipeline from \u2502\n\u2502 scratch. \u2502\n\u2502 [default: no-ignore-cache] \u2502\n\u2502 --repo-id TEXT The Hugging Face Hub repository ID to push the resulting \u2502\n\u2502 dataset to. \u2502\n\u2502 [default: None] \u2502\n\u2502 --commit-message TEXT The commit message to use when pushing the dataset. \u2502\n\u2502 [default: None] \u2502\n\u2502 --private --no-private Whether to make the resulting dataset private on the Hub. \u2502\n\u2502 [default: no-private] \u2502\n\u2502 --token TEXT The Hugging Face Hub API token to use when pushing the \u2502\n\u2502 dataset. \u2502\n\u2502 [default: None] \u2502\n\u2502 --help Show this message and exit. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
Using --config
option, we must pass a path with a pipeline.yaml
file. To specify the runtime parameters of the steps we will need to use the --param
option and the value of the parameter in the following format:
distilabel pipeline run --config \"https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml\" \\\n --param load_dataset.repo_id=distilabel-internal-testing/instruction-dataset-mini \\\n --param load_dataset.split=test \\\n --param generate_with_gpt35.llm.generation_kwargs.max_new_tokens=512 \\\n --param generate_with_gpt35.llm.generation_kwargs.temperature=0.7 \\\n --param to_argilla.dataset_name=text_generation_with_gpt35 \\\n --param to_argilla.dataset_workspace=admin\n
Or using --script
we can pass directly a remote python script (keep in mind --config
and --script
are mutually exclusive):
distilabel pipeline run --script \"https://huggingface.co/datasets/distilabel-internal-testing/pipe_nothing_test/raw/main/pipe_nothing.py\"\n
You can also pass runtime parameters to the python script as we saw with --config
option.
Again, this helps with the reproducibility of the results, and simplifies sharing not only the final dataset but also the process to generate it.
"},{"location":"sections/how_to_guides/basic/llm/","title":"Executing Tasks with LLMs","text":""},{"location":"sections/how_to_guides/basic/llm/#working-with-llms","title":"Working with LLMs","text":"LLM subclasses are designed to be used within a Task, but they can also be used standalone.
from distilabel.models import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\"\n)\nllm.load()\n\nllm.generate_outputs(\n inputs=[\n [{\"role\": \"user\", \"content\": \"What's the capital of Spain?\"}],\n ],\n)\n# [\n# {\n# \"generations\": [\n# \"The capital of Spain is Madrid.\"\n# ],\n# \"statistics\": {\n# \"input_tokens\": [\n# 43\n# ],\n# \"output_tokens\": [\n# 8\n# ]\n# }\n# }\n# ]\n
Note
Always call the LLM.load
or Task.load
method when using LLMs standalone or as part of a Task
. If using a Pipeline
, this is done automatically in Pipeline.run()
.
New in version 1.5.0
Since version 1.5.0
the LLM output is a list of dictionaries (one per item in the inputs
), each containing generations
, that reports the text returned by the LLM
, and a statistics
field that will store statistics related to the LLM
generation. Initially, this will include input_tokens
and output_tokens
when available, which will be obtained via the API when available, or if a tokenizer is available for the model used, using the tokenizer for the model. This data will be moved by the corresponding Task
during the pipeline processing and moved to distilabel_metadata
so we can operate on this data if we want, like for example computing the number of tokens per dataset.
To access to the previous result one just has to access to the generations in the resulting dictionary: result[0][\"generations\"]
.
By default, all LLM
s will generate text in a synchronous manner i.e. send inputs using generate_outputs
method that will get blocked until outputs are generated. There are some LLM
s (such as OpenAILLM) that implements what we denote as offline batch generation, which allows to send the inputs to the LLM-as-a-service which will generate the outputs asynchronously and give us a job id that we can use later to check the status and retrieve the generated outputs when they are ready. LLM-as-a-service platforms offers this feature as a way to save costs in exchange of waiting for the outputs to be generated.
To use this feature in distilabel
the only thing we need to do is to set the use_offline_batch_generation
attribute to True
when creating the LLM
instance:
from distilabel.models import OpenAILLM\n\nllm = OpenAILLM(\n model=\"gpt-4o\",\n use_offline_batch_generation=True,\n)\n\nllm.load()\n\nllm.jobs_ids # (1)\n# None\n\nllm.generate_outputs( # (2)\n inputs=[\n [{\"role\": \"user\", \"content\": \"What's the capital of Spain?\"}],\n ],\n)\n# DistilabelOfflineBatchGenerationNotFinishedException: Batch generation with jobs_ids=('batch_OGB4VjKpu2ay9nz3iiFJxt5H',) is not finished\n\nllm.jobs_ids # (3)\n# ('batch_OGB4VjKpu2ay9nz3iiFJxt5H',)\n\n\nllm.generate_outputs( # (4)\n inputs=[\n [{\"role\": \"user\", \"content\": \"What's the capital of Spain?\"}],\n ],\n)\n# [{'generations': ['The capital of Spain is Madrid.'],\n# 'statistics': {'input_tokens': [13], 'output_tokens': [7]}}]\n
jobs_ids
attribute is None
.generate_outputs
will send the inputs to the LLM-as-a-service and return a DistilabelOfflineBatchGenerationNotFinishedException
since the outputs are not ready yet.generate_outputs
the jobs_ids
attribute will contain the job ids created for generating the outputs.generate_outputs
will return the outputs if they are ready or raise a DistilabelOfflineBatchGenerationNotFinishedException
if they are not ready yet.The offline_batch_generation_block_until_done
attribute can be used to block the generate_outputs
method until the outputs are ready polling the platform the specified amount of seconds.
from distilabel.models import OpenAILLM\n\nllm = OpenAILLM(\n model=\"gpt-4o\",\n use_offline_batch_generation=True,\n offline_batch_generation_block_until_done=5, # poll for results every 5 seconds\n)\nllm.load()\n\nllm.generate_outputs(\n inputs=[\n [{\"role\": \"user\", \"content\": \"What's the capital of Spain?\"}],\n ],\n)\n# [{'generations': ['The capital of Spain is Madrid.'],\n# 'statistics': {'input_tokens': [13], 'output_tokens': [7]}}]\n
"},{"location":"sections/how_to_guides/basic/llm/#within-a-task","title":"Within a Task","text":"Pass the LLM as an argument to the Task
, and the task will handle the rest.
from distilabel.models import OpenAILLM\nfrom distilabel.steps.tasks import TextGeneration\n\nllm = OpenAILLM(model=\"gpt-4o-mini\")\ntask = TextGeneration(name=\"text_generation\", llm=llm)\n\ntask.load()\n\nnext(task.process(inputs=[{\"instruction\": \"What's the capital of Spain?\"}]))\n# [{'instruction': \"What's the capital of Spain?\",\n# 'generation': 'The capital of Spain is Madrid.',\n# 'distilabel_metadata': {'raw_output_text_generation': 'The capital of Spain is Madrid.',\n# 'raw_input_text_generation': [{'role': 'user',\n# 'content': \"What's the capital of Spain?\"}],\n# 'statistics_text_generation': {'input_tokens': 13, 'output_tokens': 7}},\n# 'model_name': 'gpt-4o-mini'}]\n
Note
As mentioned in Working with LLMs section, the generation of an LLM is automatically moved to distilabel_metadata
to avoid interference with the common workflow, so the addition of the statistics
it's an extra component available for the user, but nothing has to be changed in the defined pipelines.
LLMs can have runtime parameters, such as generation_kwargs
, provided via the Pipeline.run()
method using the params
argument.
Note
Runtime parameters can differ between LLM subclasses, caused by the different functionalities offered by the LLM providers.
from distilabel.pipeline import Pipeline\nfrom distilabel.models import OpenAILLM\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(name=\"text-generation-pipeline\") as pipeline:\n load_dataset = LoadDataFromDicts(\n name=\"load_dataset\",\n data=[{\"instruction\": \"Write a short story about a dragon that saves a princess from a tower.\"}],\n )\n\n text_generation = TextGeneration(\n name=\"text_generation\",\n llm=OpenAILLM(model=\"gpt-4o-mini\"),\n )\n\n load_dataset >> text_generation\n\nif __name__ == \"__main__\":\n pipeline.run(\n parameters={\n text_generation.name: {\"llm\": {\"generation_kwargs\": {\"temperature\": 0.3}}},\n },\n )\n
"},{"location":"sections/how_to_guides/basic/llm/#creating-custom-llms","title":"Creating custom LLMs","text":"To create custom LLMs, subclass either LLM
for synchronous or AsyncLLM
for asynchronous LLMs. Implement the following methods:
model_name
: A property containing the model's name.
generate
: A method that takes a list of prompts and returns generated texts.
agenerate
: A method that takes a single prompt and returns generated texts. This method is used within the generate
method of the AsyncLLM
class.
(optional) get_last_hidden_state
: is a method that will take a list of prompts and return a list of hidden states. This method is optional and will be used by some tasks such as the GenerateEmbeddings
task.
from typing import Any\n\nfrom pydantic import validate_call\n\nfrom distilabel.models import LLM\nfrom distilabel.typing import GenerateOutput, HiddenState\nfrom distilabel.typing import ChatType\n\nclass CustomLLM(LLM):\n @property\n def model_name(self) -> str:\n return \"my-model\"\n\n @validate_call\n def generate(self, inputs: List[ChatType], num_generations: int = 1, **kwargs: Any) -> List[GenerateOutput]:\n for _ in range(num_generations):\n ...\n\n def get_last_hidden_state(self, inputs: List[ChatType]) -> List[HiddenState]:\n ...\n
from typing import Any\n\nfrom pydantic import validate_call\n\nfrom distilabel.models import AsyncLLM\nfrom distilabel.typing import GenerateOutput, HiddenState\nfrom distilabel.typing import ChatType\n\nclass CustomAsyncLLM(AsyncLLM):\n @property\n def model_name(self) -> str:\n return \"my-model\"\n\n @validate_call\n async def agenerate(self, input: ChatType, num_generations: int = 1, **kwargs: Any) -> GenerateOutput:\n for _ in range(num_generations):\n ...\n\n def get_last_hidden_state(self, inputs: List[ChatType]) -> List[HiddenState]:\n ...\n
generate
and agenerate
keyword arguments (but input
and num_generations
) are considered as RuntimeParameter
s, so a value can be passed to them via the parameters
argument of the Pipeline.run
method.
Note
To have the arguments of the generate
and agenerate
coerced to the expected types, the validate_call
decorator is used, which will automatically coerce the arguments to the expected types, and raise an error if the types are not correct. This is specially useful when providing a value for an argument of generate
or agenerate
from the CLI, since the CLI will always provide the arguments as strings.
Warning
Additional LLMs created in distilabel
will have to take into account how the statistics
are generated to properly include them in the LLM output.
Our LLM gallery shows a list of the available LLMs that can be used within the distilabel
library.
Pipeline
organise the Steps and Tasks in a sequence, where the output of one step is the input of the next one. A Pipeline
should be created by making use of the context manager along with passing a name, and optionally a description.
from distilabel.pipeline import Pipeline\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n ...\n
"},{"location":"sections/how_to_guides/basic/pipeline/#connecting-steps-with-the-stepconnect-method","title":"Connecting steps with the Step.connect
method","text":"Now, we can define the steps of our Pipeline
.
Note
Steps without predecessors (i.e. root steps), need to be GeneratorStep
s such as LoadDataFromDicts
or LoadDataFromHub
. After this, other steps can be defined.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromHub\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(name=\"load_dataset\")\n ...\n
Easily load your datasets
If you are already used to work with Hugging Face's Dataset
via load_dataset
or pd.DataFrame
, you can create the GeneratorStep
directly from the dataset (or dataframe), and create the step with the help of make_generator_step
:
datasets.Dataset
From pd.DataFrame
from distilabel.pipeline import Pipeline\nfrom distilabel.steps import make_generator_step\n\ndataset = [{\"instruction\": \"Tell me a joke.\"}]\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n loader = make_generator_step(dataset, output_mappings={\"prompt\": \"instruction\"})\n ...\n
from datasets import load_dataset\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import make_generator_step\n\ndataset = load_dataset(\n \"DIBT/10k_prompts_ranked\",\n split=\"train\"\n).filter(\n lambda r: r[\"avg_rating\"]>=4 and r[\"num_responses\"]>=2\n).select(range(500))\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n loader = make_generator_step(dataset, output_mappings={\"prompt\": \"instruction\"})\n ...\n
import pandas as pd\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import make_generator_step\n\ndataset = pd.read_csv(\"path/to/dataset.csv\")\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n loader = make_generator_step(dataset, output_mappings={\"prompt\": \"instruction\"})\n ...\n
Next, we will use prompt
column from the dataset obtained through LoadDataFromHub
and use several LLM
s to execute a TextGeneration
task. We will also use the Task.connect()
method to connect the steps, so the output of one step is the input of the next one.
Note
The order of the execution of the steps will be determined by the connections of the steps. In this case, the TextGeneration
tasks will be executed after the LoadDataFromHub
step.
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(name=\"load_dataset\")\n\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n task = TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n task.connect(load_dataset)\n\n ...\n
For each row of the dataset, the TextGeneration
task will generate a text based on the instruction
column and the LLM
model, and store the result (a single string) in a new column called generation
. Because we need to have the response
s in the same column, we will add GroupColumns
to combine them all in the same column as a list of strings.
Note
In this case, the GroupColumns
tasks will be executed after all TextGeneration
steps.
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import GroupColumns, LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(name=\"load_dataset\")\n\n combine_generations = GroupColumns(\n name=\"combine_generations\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"model_names\"],\n )\n\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n task = TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n load_dataset.connect(task)\n task.connect(combine_generations)\n
"},{"location":"sections/how_to_guides/basic/pipeline/#connecting-steps-with-the-operator","title":"Connecting steps with the >>
operator","text":"Besides the Step.connect
method: step1.connect(step2)
, there's an alternative way by making use of the >>
operator. We can connect steps in a more readable way, and it's also possible to connect multiple steps at once.
Each call to step1.connect(step2)
has been exchanged by step1 >> step2
within the loop.
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import GroupColumns, LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(name=\"load_dataset\")\n\n combine_generations = GroupColumns(\n name=\"combine_generations\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"model_names\"],\n )\n\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n task = TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n load_dataset >> task >> combine_generations\n
Each task is first appended to a list, and then all the calls to connections are done in a single call.
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import GroupColumns, LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(name=\"load_dataset\")\n\n combine_generations = GroupColumns(\n name=\"combine_generations\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"model_names\"],\n )\n\n tasks = []\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n tasks.append(\n TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n )\n\n load_dataset >> tasks >> combine_generations\n
"},{"location":"sections/how_to_guides/basic/pipeline/#routing-batches-to-specific-downstream-steps","title":"Routing batches to specific downstream steps","text":"In some pipelines, you may want to send batches from a single upstream step to specific downstream steps based on certain conditions. To achieve this, you can use a routing_batch_function
. This function takes a list of downstream steps and returns a list of step names to which each batch should be routed.
Let's update the example above to route the batches loaded by the LoadDataFromHub
step to just 2 of the TextGeneration
tasks. First, we will create our custom routing_batch_function
, and then we will update the pipeline to use it:
import random\nfrom distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline, routing_batch_function\nfrom distilabel.steps import GroupColumns, LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\n@routing_batch_function\ndef sample_two_steps(steps: list[str]) -> list[str]:\n return random.sample(steps, 2)\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(\n name=\"load_dataset\",\n output_mappings={\"prompt\": \"instruction\"},\n )\n\n tasks = []\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.0-pro\"),\n ):\n tasks.append(\n TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n )\n\n combine_generations = GroupColumns(\n name=\"combine_generations\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"model_names\"],\n )\n\n load_dataset >> sample_two_steps >> tasks >> combine_generations\n
The routing_batch_function
that we just built is a common one, so distilabel
comes with a built-in function that can be used to achieve the same behavior:
from distilabel.pipeline import sample_n_steps\n\nsample_two_steps = sample_n_steps(2)\n
"},{"location":"sections/how_to_guides/basic/pipeline/#running-the-pipeline","title":"Running the pipeline","text":""},{"location":"sections/how_to_guides/basic/pipeline/#pipelinedry_run","title":"Pipeline.dry_run","text":"Before running the Pipeline
we can check if the pipeline is valid using the Pipeline.dry_run()
method. It takes the same parameters as the run
method which we will discuss in the following section, plus the batch_size
we want the dry run to use (by default set to 1).
with Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n ...\n\nif __name__ == \"__main__\":\n distiset = pipeline.dry_run(parameters=..., batch_size=1)\n
"},{"location":"sections/how_to_guides/basic/pipeline/#pipelinerun","title":"Pipeline.run","text":"After testing, we can now execute the full Pipeline
using the Pipeline.run()
method.
with Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n ...\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n \"load_dataset\": {\n \"repo_id\": \"distilabel-internal-testing/instruction-dataset-mini\",\n \"split\": \"test\",\n },\n \"text_generation_with_gpt-4-0125-preview\": {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 512,\n }\n }\n },\n \"text_generation_with_mistral-large-2402\": {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 512,\n }\n }\n },\n \"text_generation_with_gemini-1.0-pro\": {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 512,\n }\n }\n },\n },\n )\n
But if we run the pipeline above, we will see that the run
method will fail:
ValueError: Step 'text_generation_with_gpt-4-0125-preview' requires inputs ['instruction'], but only the inputs=['prompt', 'completion', 'meta'] are available, which means that the inputs=['instruction'] are missing or not available\nwhen the step gets to be executed in the pipeline. Please make sure previous steps to 'text_generation_with_gpt-4-0125-preview' are generating the required inputs.\n
This is because, before actually running the pipeline, we must ensure each step has the necessary input columns to be executed. In this case, the TextGeneration
task requires the instruction
column, but the LoadDataFromHub
step generates the prompt
column. To solve this, we can use the output_mappings
or input_mappings
arguments of individual Step
s, to map columns from one step to another.
with Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(\n name=\"load_dataset\",\n output_mappings={\"prompt\": \"instruction\"}\n )\n\n ...\n
If we execute the pipeline again, it will run successfully and we will have a Distiset
with the outputs of all the leaf steps of the pipeline which we can push to the Hugging Face Hub.
if __name__ == \"__main__\":\n distiset = pipeline.run(...)\n distiset.push_to_hub(\"distilabel-internal-testing/instruction-dataset-mini-with-generations\")\n
"},{"location":"sections/how_to_guides/basic/pipeline/#pipelinerun-with-a-dataset","title":"Pipeline.run with a dataset","text":"Note that in most cases if you don't need the extra flexibility the GeneratorSteps
bring you, you can create a dataset as you would normally do and pass it directly to the Pipeline.run method. Look at the highlighted lines to see what was updated:
import random\n\nfrom datasets import load_dataset\nfrom distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline, routing_batch_function\nfrom distilabel.steps import GroupColumns\nfrom distilabel.steps.tasks import TextGeneration\n\n@routing_batch_function\ndef sample_two_steps(steps: list[str]) -> list[str]:\n return random.sample(steps, 2)\n\n# `load_dataset` comes from the `datasets` library imported above\ndataset = load_dataset(\n \"distilabel-internal-testing/instruction-dataset-mini\",\n split=\"test\"\n)\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n tasks = []\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.0-pro\"),\n ):\n tasks.append(\n TextGeneration(name=f\"text_generation_with_{llm.model_name}\", llm=llm)\n )\n\n combine_generations = GroupColumns(\n name=\"combine_generations\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"model_names\"],\n )\n\n sample_two_steps >> tasks >> combine_generations\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n dataset=dataset,\n parameters=...\n )\n
"},{"location":"sections/how_to_guides/basic/pipeline/#stopping-the-pipeline","title":"Stopping the pipeline","text":"In case you want to stop the pipeline while it's running, you can press Ctrl+C or Cmd+C depending on your OS (or send a SIGINT
to the main process), and the outputs will be stored in the cache. Pressing Ctrl+C a second time will force the pipeline to stop its execution, but this can lead to losing the generated outputs for certain batches.
If, for some reason, the pipeline execution stops (for example, by pressing Ctrl+C
), the state of the pipeline and the outputs will be stored in the cache, so we can resume the pipeline execution from the point where it was stopped.
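As a minimal sketch of resuming (assuming the very same script is executed again), it is enough to call Pipeline.run with the same parameters, since use_cache defaults to True and the batches already stored in the cache will be reused:
if __name__ == \"__main__\":\n # `use_cache` defaults to `True`, so a second run resumes from the batches\n # that were already stored in the cache instead of recomputing them.\n distiset = pipeline.run(parameters={...})\n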
If we want to force the pipeline to run again without reusing the cache, then we can use the use_cache
argument of the Pipeline.run()
method:
if __name__ == \"__main__\":\n distiset = pipeline.run(parameters={...}, use_cache=False)\n
Note
For more information on caching, we refer the reader to the caching section.
"},{"location":"sections/how_to_guides/basic/pipeline/#adjusting-the-batch-size-for-each-step","title":"Adjusting the batch size for each step","text":"Memory issues can arise when processing large datasets or when using large models. To avoid this, we can use the input_batch_size
argument of individual tasks. In the example below, each TextGeneration
task will receive 5 dictionaries, while the LoadDataFromHub
step will send 10 dictionaries per batch:
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import GroupColumns, LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(\n name=\"load_dataset\",\n output_mappings={\"prompt\": \"instruction\"},\n batch_size=10\n )\n\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.5-pro\"),\n ):\n task = TextGeneration(\n name=f\"text_generation_with_{llm.model_name.replace('.', '-')}\",\n llm=llm,\n input_batch_size=5,\n )\n\n ...\n
"},{"location":"sections/how_to_guides/basic/pipeline/#serializing-the-pipeline","title":"Serializing the pipeline","text":"Sharing a pipeline with others is very easy, as we can serialize the pipeline object using the save
method. We can save the pipeline in different formats, such as yaml
or json
:
if __name__ == \"__main__\":\n pipeline.save(\"pipeline.yaml\", format=\"yaml\")\n
if __name__ == \"__main__\":\n pipeline.save(\"pipeline.json\", format=\"json\")\n
To load the pipeline, we can use the from_yaml
or from_json
methods:
pipeline = Pipeline.from_yaml(\"pipeline.yaml\")\n
pipeline = Pipeline.from_json(\"pipeline.json\")\n
Serializing the pipeline is very useful when we want to share it with others, or store it for future use. It can even be hosted online, so the pipeline can be executed directly using the CLI.
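For instance, assuming the serialized file has been uploaded somewhere publicly reachable (the URL below is just a placeholder), it could be executed with the distilabel CLI; a minimal sketch:
distilabel pipeline run --config \"https://example.com/path/to/pipeline.yaml\"\n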
"},{"location":"sections/how_to_guides/basic/pipeline/#visualizing-the-pipeline","title":"Visualizing the pipeline","text":"We can visualize the pipeline using the Pipeline.draw()
method. This will create a mermaid
graph, and return the path to the image.
path_to_image = pipeline.draw(\n top_to_bottom=True,\n show_edge_labels=True,\n)\n
Within notebooks, we can simply call pipeline
and the graph will be displayed. Alternatively, we can use the Pipeline.draw()
method to have more control over the graph visualization and use IPython
to display it.
from IPython.display import Image, display\n\ndisplay(Image(path_to_image))\n
Let's now see what the pipeline of the fully working example looks like.
"},{"location":"sections/how_to_guides/basic/pipeline/#fully-working-example","title":"Fully working example","text":"To sum up, here is the full code of the pipeline we have created in this section. Note that you will need to change the name of the Hugging Face repository where the resulting will be pushed, set OPENAI_API_KEY
environment variable, set MISTRAL_API_KEY
and have gcloud
installed and configured:
from distilabel.models import MistralLLM, OpenAILLM, VertexAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import GroupColumns, LoadDataFromHub\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(\"pipe-name\", description=\"My first pipe\") as pipeline:\n load_dataset = LoadDataFromHub(\n name=\"load_dataset\",\n output_mappings={\"prompt\": \"instruction\"},\n )\n\n combine_generations = GroupColumns(\n name=\"combine_generations\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"model_names\"],\n )\n\n for llm in (\n OpenAILLM(model=\"gpt-4-0125-preview\"),\n MistralLLM(model=\"mistral-large-2402\"),\n VertexAILLM(model=\"gemini-1.0-pro\"),\n ):\n task = TextGeneration(\n name=f\"text_generation_with_{llm.model_name.replace('.', '-')}\", llm=llm\n )\n load_dataset.connect(task)\n task.connect(combine_generations)\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n \"load_dataset\": {\n \"repo_id\": \"distilabel-internal-testing/instruction-dataset-mini\",\n \"split\": \"test\",\n },\n \"text_generation_with_gpt-4-0125-preview\": {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 512,\n }\n }\n },\n \"text_generation_with_mistral-large-2402\": {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 512,\n }\n }\n },\n \"text_generation_with_gemini-1.0-pro\": {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 512,\n }\n }\n },\n },\n )\n distiset.push_to_hub(\n \"distilabel-internal-testing/instruction-dataset-mini-with-generations\"\n )\n
"},{"location":"sections/how_to_guides/basic/step/","title":"Steps for processing data","text":""},{"location":"sections/how_to_guides/basic/step/#working-with-steps","title":"Working with Steps","text":"The Step
is intended to be used within the scope of a Pipeline
, which will orchestrate the different steps defined, but it can also be used standalone.
Assuming that we have a Step
already defined as follows:
from typing import TYPE_CHECKING\nfrom distilabel.steps import Step, StepInput\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepColumns, StepOutput\n\nclass MyStep(Step):\n @property\n def inputs(self) -> \"StepColumns\":\n return [\"input_field\"]\n\n @property\n def outputs(self) -> \"StepColumns\":\n return [\"output_field\"]\n\n def process(self, inputs: StepInput) -> \"StepOutput\":\n for input in inputs:\n input[\"output_field\"] = input[\"input_field\"]\n yield inputs\n
Then we can use it as follows:
step = MyStep(name=\"my-step\")\nstep.load()\n\nnext(step.process([{\"input_field\": \"value\"}]))\n# [{'input_field': 'value', 'output_field': 'value'}]\n
Note
The Step.load()
always needs to be executed when being used as a standalone. Within a pipeline, this will be done automatically during pipeline execution.
input_mappings
, is a dictionary that maps keys from the input dictionaries to the keys expected by the step. For example, if input_mappings={\"instruction\": \"prompt\"}
, it means that the input key prompt
will be used as the key instruction
for the current step.
output_mappings
, is a dictionary that can be used to map the outputs of the step to other names. For example, if output_mappings={\"conversation\": \"prompt\"}
, it means that the output key conversation
will be renamed to prompt
for the next step.
input_batch_size
(set to 50 by default) is independent for every step and determines how many input dictionaries the step will process at once; see the sketch below.
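As a minimal sketch of how these arguments are passed when instantiating a step, reusing the MyStep class defined above (the column names prompt and generation are just illustrative):
step = MyStep(\n name=\"my-step\",\n # the incoming `prompt` column will be used as the `input_field` input\n input_mappings={\"input_field\": \"prompt\"},\n # the `output_field` output will be renamed to `generation` for the next step\n output_mappings={\"output_field\": \"generation\"},\n # process 10 input dictionaries at a time\n input_batch_size=10,\n)\n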
Step
s can also have RuntimeParameter
, which are parameters whose values can only be provided after the pipeline has been initialised, when calling Pipeline.run
.
from distilabel.mixins.runtime_parameters import RuntimeParameter\n\nclass Step(...):\n input_batch_size: RuntimeParameter[PositiveInt] = Field(\n default=DEFAULT_INPUT_BATCH_SIZE,\n description=\"The number of rows that will contain the batches processed by the\"\n \" step.\",\n )\n
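Since input_batch_size is defined as a RuntimeParameter, its value can also be provided when calling Pipeline.run; a minimal sketch, assuming the pipeline contains a step named my-step:
if __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n # runtime parameters are grouped by step name\n \"my-step\": {\"input_batch_size\": 10},\n },\n )\n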
"},{"location":"sections/how_to_guides/basic/step/#types-of-steps","title":"Types of Steps","text":"There are two special types of Step
in distilabel
:
GeneratorStep
: is a step that only generates data, and it doesn't need any input data from previous steps and normally is the first node in a Pipeline
. More information: Components -> Step - GeneratorStep.
GlobalStep
: is a step with the standard interface i.e. receives inputs and generates outputs, but it processes all the data at once, and often is the final step in the Pipeline
. Note that a GlobalStep
requires the previous steps to finish before it can start. More information: Components - Step - GlobalStep.
Task
, is essentially the same as a default Step
, but it relies on an LLM
as an attribute, and the process
method will be in charge of calling that LLM. More information: Components - Task.
We can define a custom step by creating a new subclass of the Step
and defining the following:
inputs
: is a property that returns a list of strings with the names of the required input fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.
outputs
: is a property that returns a list of strings with the names of the output fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.
process
: is a method that receives the input data and returns the output data, and it should be a generator, meaning that it should yield
the output data.
Note
The default signature for the process
method is process(self, *inputs: StepInput) -> StepOutput
. The argument inputs
should be respected, no more arguments can be provided, and the type-hints and return type-hints should be respected too because it should be able to receive any number of inputs by default i.e. more than one Step
at a time could be connected to the current one.
Warning
For the custom Step
subclasses to work properly with distilabel
and with the validation and serialization performed by default over each Step
in the Pipeline
, the type-hint for both StepInput
and StepOutput
should be used and not surrounded with double-quotes or imported under typing.TYPE_CHECKING
, otherwise, the validation and/or serialization will fail.
Step
Using the @step
decorator We can inherit from the Step
class and define the inputs
, outputs
, and process
methods as follows:
from typing import TYPE_CHECKING\nfrom distilabel.steps import Step, StepInput\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepColumns, StepOutput\n\nclass CustomStep(Step):\n @property\n def inputs(self) -> \"StepColumns\":\n ...\n\n @property\n def outputs(self) -> \"StepColumns\":\n ...\n\n def process(self, *inputs: StepInput) -> \"StepOutput\":\n for upstream_step_inputs in inputs:\n ...\n yield item\n\n # When overridden (ideally under the `typing_extensions.override` decorator)\n # @typing_extensions.override\n # def process(self, inputs: StepInput) -> StepOutput:\n # for input in inputs:\n # ...\n # yield inputs\n
The @step
decorator will take care of the boilerplate code, and will allow you to define the inputs
, outputs
, and process
methods in a more straightforward way. One downside is that it won't let you access the self
attributes, if any, nor set them, so if you need to access or set any attribute, you should go with the first approach of defining the custom Step
subclass.
from typing import TYPE_CHECKING\nfrom distilabel.steps import StepInput, step\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepOutput\n\n@step(inputs=[...], outputs=[...])\ndef CustomStep(inputs: StepInput) -> \"StepOutput\":\n for input in inputs:\n ...\n yield inputs\n\nstep = CustomStep(name=\"my-step\")\n
"},{"location":"sections/how_to_guides/basic/step/generator_step/","title":"GeneratorStep","text":"The GeneratorStep
is a subclass of Step
that is intended to be used as the first step within a Pipeline
, because it doesn't require input and generates data that can be used by other steps. Alternatively, it can also be used as a standalone.
from typing import List, TYPE_CHECKING\nfrom typing_extensions import override\n\nfrom distilabel.steps import GeneratorStep\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepColumns, GeneratorStepOutput\n\nclass MyGeneratorStep(GeneratorStep):\n instructions: List[str]\n\n @override\n def process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n if offset:\n self.instructions = self.instructions[offset:]\n\n while self.instructions:\n batch = [\n {\n \"instruction\": instruction\n } for instruction in self.instructions[: self.batch_size]\n ]\n self.instructions = self.instructions[self.batch_size :]\n yield (\n batch,\n True if len(self.instructions) == 0 else False,\n )\n\n @property\n def outputs(self) -> \"StepColumns\":\n return [\"instruction\"]\n
Then we can use it as follows:
step = MyGeneratorStep(\n name=\"my-generator-step\",\n instructions=[\"Tell me a joke.\", \"Tell me a story.\"],\n batch_size=1,\n)\nstep.load()\n\nnext(step.process(offset=0))\n# ([{'instruction': 'Tell me a joke.'}], False)\nnext(step.process(offset=1))\n# ([{'instruction': 'Tell me a story.'}], True)\n
Note
The Step.load()
always needs to be executed when being used as a standalone. Within a pipeline, this will be done automatically during pipeline execution.
We can define a custom generator step by creating a new subclass of the GeneratorStep
and defining the following:
outputs
: is a property that returns a list of strings with the names of the output fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.
process
: is a method that yields output data and a boolean flag indicating whether that's the last batch to be generated.
Note
The default signature for the process
method is process(self, offset: int = 0) -> GeneratorStepOutput
. The argument offset
should be respected, no more arguments can be provided, and the type-hints and return type-hints should be respected too.
Warning
For the custom Step
subclasses to work properly with distilabel
and with the validation and serialization performed by default over each Step
in the Pipeline
, the type-hint for both StepInput
and StepOutput
should be used and not surrounded with double-quotes or imported under typing.TYPE_CHECKING
, otherwise, the validation and/or serialization will fail.
GeneratorStep
Using the @step
decorator We can inherit from the GeneratorStep
class and define the outputs
, and process
methods as follows:
from typing import List, TYPE_CHECKING\nfrom typing_extensions import override\n\nfrom distilabel.steps import GeneratorStep\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepColumns, GeneratorStepOutput\n\nclass MyGeneratorStep(GeneratorStep):\n instructions: List[str]\n\n @override\n def process(self, offset: int = 0) -> \"GeneratorStepOutput\":\n ...\n\n @property\n def outputs(self) -> \"StepColumns\":\n ...\n
The @step
decorator will take care of the boilerplate code, and will allow you to define the outputs
, and process
methods in a more straightforward way. One downside is that it won't let you access the self
attributes, if any, nor set them, so if you need to access or set any attribute, you should go with the first approach of defining the custom GeneratorStep
subclass.
from typing import TYPE_CHECKING\nfrom distilabel.steps import step\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import GeneratorStepOutput\n\n@step(outputs=[...], step_type=\"generator\")\ndef CustomGeneratorStep(offset: int = 0) -> \"GeneratorStepOutput\":\n yield (\n ...,\n True if offset == 10 else False,\n )\n\nstep = CustomGeneratorStep(name=\"my-step\")\n
"},{"location":"sections/how_to_guides/basic/step/global_step/","title":"GlobalStep","text":"The GlobalStep
is a subclass of Step
that is used to define a step that requires the previous steps to be completed before it can run, since it will wait until all the input batches have been received. This is useful when you need a step that processes all the input data at once. Alternatively, it can also be used as a standalone.
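As a minimal sketch of standalone usage, assuming a hypothetical MyGlobalStep subclass of GlobalStep that expects an input_field column, the whole dataset is passed in a single call:
step = MyGlobalStep(name=\"my-global-step\") # hypothetical `GlobalStep` subclass\nstep.load()\n\n# A `GlobalStep` receives all the input data at once, so the full list of\n# dictionaries is passed in a single `process` call.\nnext(step.process([{\"input_field\": \"value_1\"}, {\"input_field\": \"value_2\"}]))\n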
We can define a custom step by creating a new subclass of the GlobalStep
and defining the following:
inputs
: is a property that returns a list of strings with the names of the required input fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.
outputs
: is a property that returns a list of strings with the names of the output fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.
process
: is a method that receives the input data and returns the output data, and it should be a generator, meaning that it should yield
the output data.
Note
The default signature for the process
method is process(self, *inputs: StepInput) -> StepOutput
. The argument inputs
should be respected, no more arguments can be provided, and the type-hints and return type-hints should be respected too because it should be able to receive any number of inputs by default i.e. more than one Step
at a time could be connected to the current one.
Warning
For the custom GlobalStep
subclasses to work properly with distilabel
and with the validation and serialization performed by default over each Step
in the Pipeline
, the type-hint for both StepInput
and StepOutput
should be used and not surrounded with double-quotes or imported under typing.TYPE_CHECKING
, otherwise, the validation and/or serialization will fail.
GlobalStep
Using the @step
decorator We can inherit from the GlobalStep
class and define the inputs
, outputs
, and process
methods as follows:
from typing import TYPE_CHECKING\nfrom distilabel.steps import GlobalStep, StepInput\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepColumns, StepOutput\n\nclass CustomStep(GlobalStep):\n @property\n def inputs(self) -> \"StepColumns\":\n ...\n\n @property\n def outputs(self) -> \"StepColumns\":\n ...\n\n def process(self, *inputs: StepInput) -> \"StepOutput\":\n for upstream_step_inputs in inputs:\n for item in upstream_step_inputs:\n ...\n yield item\n\n # When overridden (ideally under the `typing_extensions.override` decorator)\n # @typing_extensions.override\n # def process(self, inputs: StepInput) -> StepOutput:\n # for input in inputs:\n # ...\n # yield inputs\n
The @step
decorator will take care of the boilerplate code, and will allow you to define the inputs
, outputs
, and process
methods in a more straightforward way. One downside is that it won't let you access the self
attributes, if any, nor set them, so if you need to access or set any attribute, you should go with the first approach of defining the custom GlobalStep
subclass.
from typing import TYPE_CHECKING\nfrom distilabel.steps import StepInput, step\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepOutput\n\n@step(inputs=[...], outputs=[...], step_type=\"global\")\ndef CustomStep(inputs: StepInput) -> \"StepOutput\":\n for input in inputs:\n ...\n yield inputs\n\nstep = CustomStep(name=\"my-step\")\n
"},{"location":"sections/how_to_guides/basic/task/","title":"Tasks for generating and judging with LLMs","text":""},{"location":"sections/how_to_guides/basic/task/#working-with-tasks","title":"Working with Tasks","text":"The Task
is a special kind of Step
that includes the LLM
as a mandatory argument. As with a Step
, it is normally used within a Pipeline
but can also be used standalone.
For example, the most basic task is the TextGeneration
task, which generates text based on a given instruction.
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import TextGeneration\n\ntask = TextGeneration(\n name=\"text-generation\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n)\ntask.load()\n\nnext(task.process([{\"instruction\": \"What's the capital of Spain?\"}]))\n# [\n# {\n# \"instruction\": \"What's the capital of Spain?\",\n# \"generation\": \"The capital of Spain is Madrid.\",\n# \"distilabel_metadata\": {\n# \"raw_output_text-generation\": \"The capital of Spain is Madrid.\",\n# \"raw_input_text-generation\": [\n# {\n# \"role\": \"user\",\n# \"content\": \"What's the capital of Spain?\"\n# }\n# ],\n# \"statistics_text-generation\": { # (1)\n# \"input_tokens\": 18,\n# \"output_tokens\": 8\n# }\n# },\n# \"model_name\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n# }\n# ]\n
LLMs
will not only return the text but also a statistics_{STEP_NAME}
field that will contain statistics related to the generation. If available, at least the input and output tokens will be returned.Note
The Step.load()
always needs to be executed when being used as a standalone. Within a pipeline, this will be done automatically during pipeline execution.
As shown above, the TextGeneration
task adds a generation
based on the instruction
.
New in version 1.2.0
Since version 1.2.0
, we provide some metadata about the LLM call through distilabel_metadata
. This can be disabled by setting the add_raw_output
attribute to False
when creating the task.
Additionally, since version 1.4.0
, the formatted input can also be included, which can be helpful when testing custom templates (testing the pipeline using the dry_run
method).
task = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n add_raw_output=False,\n add_raw_input=False\n)\n
New in version 1.5.0
Since version 1.5.0
distilabel_metadata
includes a new statistics
field out of the box. The generation from the LLM will not only contain the text, but also statistics associated with the text if available, like the input and output tokens. This field will be generated with statistic_{STEP_NAME}
to avoid collisions between different steps in the pipeline, similar to how raw_output_{STEP_NAME}
works.
New in version 1.4.0
New since version 1.4.0
, Task.print
Task.print
method.
The Tasks
include a handy method to show what the prompt formatted for an LLM
would look like, let's see an example with UltraFeedback
, but it applies to any other Task
.
from distilabel.steps.tasks import UltraFeedback\nfrom distilabel.models import InferenceEndpointsLLM\n\nuf = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n)\nuf.load()\nuf.print()\n
The result will be a rendered prompt, with the System prompt (if contained for the task) and the User prompt, rendered with rich (it will show exactly the same in a jupyter notebook).
In case you want to test with a custom input, you can pass an example to the tasksformat_input` method (or generate it on your own depending on the task), and pass it to the print method so that it shows your example:
uf.print(\n uf.format_input({\"instruction\": \"test\", \"generations\": [\"1\", \"2\"]})\n)\n
Using a DummyLLM to avoid loading one In case you don't want to load an LLM to render the template, you can create a dummy one like the ones we could use for testing.
from distilabel.models import LLM\nfrom distilabel.models.mixins import MagpieChatTemplateMixin\n\nclass DummyLLM(AsyncLLM, MagpieChatTemplateMixin):\n structured_output: Any = None\n magpie_pre_query_template: str = \"llama3\"\n\n def load(self) -> None:\n pass\n\n @property\n def model_name(self) -> str:\n return \"test\"\n\n def generate(\n self, input: \"FormattedInput\", num_generations: int = 1\n ) -> \"GenerateOutput\":\n return [\"output\" for _ in range(num_generations)]\n
You can use this LLM
just as any of the other ones to load
your task and call print
:
uf = UltraFeedback(llm=DummyLLM())\nuf.load()\nuf.print()\n
Note
When creating a custom task, the print
method will be available by default, but it is limited to the most common scenarios for the inputs. If you test your new task and find it's not working as expected (for example, if your task contains one input consisting of a list of texts instead of a single one), you should override the _sample_input
method. You can inspect the UltraFeedback
source code for this.
All the Task
s have a num_generations
attribute that allows defining the number of generations that we want to have per input. We can update the example above to generate 3 completions per input:
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import TextGeneration\n\ntask = TextGeneration(\n name=\"text-generation\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n num_generations=3,\n)\ntask.load()\n\nnext(task.process([{\"instruction\": \"What's the capital of Spain?\"}]))\n# [\n# {\n# 'instruction': \"What's the capital of Spain?\",\n# 'generation': 'The capital of Spain is Madrid.',\n# 'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},\n# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'\n# },\n# {\n# 'instruction': \"What's the capital of Spain?\",\n# 'generation': 'The capital of Spain is Madrid.',\n# 'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},\n# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'\n# },\n# {\n# 'instruction': \"What's the capital of Spain?\",\n# 'generation': 'The capital of Spain is Madrid.',\n# 'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},\n# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'\n# }\n# ]\n
In addition, we might want to group the generations in a single output row as maybe one downstream step expects a single row with multiple generations. We can achieve this by setting the group_generations
attribute to True
:
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import TextGeneration\n\ntask = TextGeneration(\n name=\"text-generation\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n num_generations=3,\n group_generations=True\n)\ntask.load()\n\nnext(task.process([{\"instruction\": \"What's the capital of Spain?\"}]))\n# [\n# {\n# 'instruction': \"What's the capital of Spain?\",\n# 'generation': ['The capital of Spain is Madrid.', 'The capital of Spain is Madrid.', 'The capital of Spain is Madrid.'],\n# 'distilabel_metadata': [\n# {'raw_output_text-generation': 'The capital of Spain is Madrid.'},\n# {'raw_output_text-generation': 'The capital of Spain is Madrid.'},\n# {'raw_output_text-generation': 'The capital of Spain is Madrid.'}\n# ],\n# 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'\n# }\n# ]\n
"},{"location":"sections/how_to_guides/basic/task/#defining-custom-tasks","title":"Defining custom Tasks","text":"We can define a custom step by creating a new subclass of the Task
and defining the following:
inputs
: is a property that returns a list of strings with the names of the required input fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not.
format_input
: is a method that receives a dictionary with the input data and returns a ChatType
following the chat-completion OpenAI message formatting.
outputs
: is a property that returns a list of strings with the names of the output fields or a dictionary in which the keys are the names of the columns and the values are boolean indicating whether the column is required or not. This property should always include model_name
as one of the outputs since that's automatically injected from the LLM.
format_output
: is a method that receives the output from the LLM
and optionally also the input data (which may be useful to build the output in some scenarios), and returns a dictionary with the output data formatted as needed i.e. with the values for the columns in outputs
. Note that there's no need to include the model_name
in the output.
Task
Using the @task
decorator When using the Task
class inheritance method for creating a custom task, we can also optionally override the Task.process
method to define a more complex processing logic involving an LLM
, as the default one just calls the LLM.generate
method once previously formatting the input and subsequently formatting the output. For example, EvolInstruct task overrides this method to call the LLM.generate
multiple times (one for each evolution).
from typing import Any, Dict, List, Union, TYPE_CHECKING\n\nfrom distilabel.steps.tasks import Task\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepColumns\n from distilabel.steps.tasks.typing import ChatType\n\n\nclass MyCustomTask(Task):\n @property\n def inputs(self) -> \"StepColumns\":\n return [\"input_field\"]\n\n def format_input(self, input: Dict[str, Any]) -> \"ChatType\":\n return [\n {\n \"role\": \"user\",\n \"content\": input[\"input_field\"],\n },\n ]\n\n @property\n def outputs(self) -> \"StepColumns\":\n return [\"output_field\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n return {\"output_field\": output}\n
If your task just needs a system prompt, a user message template and a way to format the output given by the LLM
, then you can use the @task
decorator to avoid writing too much boilerplate code.
from typing import Any, Dict, Union\nfrom distilabel.steps.tasks import task\n\n\n@task(inputs=[\"input_field\"], outputs=[\"output_field\"])\ndef MyCustomTask(output: Union[str, None], input: Union[Dict[str, Any], None] = None) -> Dict[str, Any]:\n \"\"\"\n ---\n system_prompt: |\n My custom system prompt\n\n user_message_template: |\n My custom user message template: {input_field}\n ---\n \"\"\"\n # Format the `LLM` output here\n return {\"output_field\": output}\n
Warning
Most Tasks
reuse the Task.process
method to process the generations, but if a new Task
defines a custom process
method, like happens for example with Magpie
, one hast to deal with the statistics
returned by the LLM
.
The GeneratorTask
is a custom implementation of a Task
based on the GeneratorStep
. As with a Task
, it is normally used within a Pipeline
but can also be used standalone.
Warning
This task is still experimental and may be subject to changes in the future.
from typing import Any, Dict, List, Union\nfrom typing_extensions import override\n\nfrom distilabel.steps.tasks.base import GeneratorTask\nfrom distilabel.steps.tasks.typing import ChatType\nfrom distilabel.steps.typing import GeneratorOutput\n\n\nclass MyCustomTask(GeneratorTask):\n instruction: str\n\n @override\n def process(self, offset: int = 0) -> GeneratorOutput:\n output = self.llm.generate(\n inputs=[\n [\n {\"role\": \"user\", \"content\": self.instruction},\n ],\n ],\n )\n output = {\"model_name\": self.llm.model_name}\n output.update(\n self.format_output(output=output, input=None)\n )\n yield output\n\n @property\n def outputs(self) -> List[str]:\n return [\"output_field\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n return {\"output_field\": output}\n
We can then use it as follows:
task = MyCustomTask(\n name=\"custom-generation\",\n instruction=\"Tell me a joke.\",\n llm=OpenAILLM(model=\"gpt-4\"),\n)\ntask.load()\n\nnext(task.process())\n# [{'output_field\": \"Why did the scarecrow win an award? Because he was outstanding!\", \"model_name\": \"gpt-4\"}]\n
Note
Most of the times you would need to override the default process
method, as it's suited for the standard Task
and not for the GeneratorTask
. But within the context of the process
function you can freely use the llm
to generate data in any way.
Note
The Step.load()
always needs to be executed when being used as a standalone. Within a pipeline, this will be done automatically during pipeline execution.
We can define a custom generator task by creating a new subclass of the GeneratorTask
and defining the following:
process
: is a method that generates the data based on the LLM
and the instruction
provided within the class instance, and returns a dictionary with the output data formatted as needed i.e. with the values for the columns in outputs
. Note that the inputs
argument is not allowed in this function since this is a GeneratorTask
. The signature only expects the offset
argument, which is used to keep track of the current iteration in the generator.
outputs
: is a property that returns a list of strings with the names of the output fields, this property should always include model_name
as one of the outputs since that's automatically injected from the LLM.
format_output
: is a method that receives the output from the LLM
and optionally also the input data (which may be useful to build the output in some scenarios), and returns a dictionary with the output data formatted as needed i.e. with the values for the columns in outputs
. Note that there's no need to include the model_name
in the output.
from typing import Any, Dict, List, Union\n\nfrom typing_extensions import override\n\nfrom distilabel.steps.tasks.base import GeneratorTask\nfrom distilabel.steps.tasks.typing import ChatType\nfrom distilabel.steps.typing import GeneratorOutput\n\n\nclass MyCustomTask(GeneratorTask):\n @override\n def process(self, offset: int = 0) -> GeneratorOutput:\n # Keep the LLM generation in its own variable so that it is not\n # overwritten by the output dictionary built below\n generation = self.llm.generate(\n inputs=[\n [{\"role\": \"user\", \"content\": \"Tell me a joke.\"}],\n ],\n )\n output = {\"model_name\": self.llm.model_name}\n output.update(\n self.format_output(output=generation, input=None)\n )\n yield output\n\n @property\n def outputs(self) -> List[str]:\n return [\"output_field\", \"model_name\"]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any]\n ) -> Dict[str, Any]:\n return {\"output_field\": output}\n
"},{"location":"sections/pipeline_samples/","title":"Tutorials","text":"Generate a preference dataset
Learn about synthetic data generation for ORPO and DPO.
Tutorial
Clean an existing preference dataset
Learn about how to provide AI feedback to clean an existing dataset.
Tutorial
Retrieval and reranking models
Learn about synthetic data generation for fine-tuning custom retrieval and reranking models.
Tutorial
Generate text classification data
Learn about how synthetic data generation for text classification can help address data imbalance or scarcity.
Tutorial
Deepseek Prover
Learn about an approach to generate mathematical proofs for theorems generated from informal math problems.
Example
DEITA
Learn about prompt, response tuning for complexity and quality and LLMs as judges for automatic data selection.
Paper
Instruction Backtranslation
Learn about automatically labeling human-written text with corresponding instructions.
Paper
Prometheus 2
Learn about using open-source models as judges for direct assessment and pair-wise ranking.
Paper
UltraFeedback
Learn about a large-scale, fine-grained, diverse preference dataset, used for training powerful reward and critic models.
Paper
APIGen
Learn how to create verifiable high-quality datasets for function-calling applications.
Paper
CLAIR
Learn Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs.
Paper
Math Shepherd
Learn about Math-Shepherd, a framework to generate datasets to train process reward models (PRMs) which assign reward scores to each step of math problem solutions.
Paper
Benchmarking with distilabel
Learn about reproducing the Arena Hard benchmark with distilabel.
Example
Structured generation with outlines
Learn about generating RPG characters following a pydantic.BaseModel with outlines in distilabel.
Example
Structured generation with instructor
Learn about answering instructions with knowledge graphs defined as pydantic.BaseModel objects using instructor in distilabel.
Example
Create a social network with FinePersonas
Learn how to leverage FinePersonas to create a synthetic social network and fine-tune adapters for Multi-LoRA.
Example
Create questions and answers for an exam
Learn how to generate questions and answers for an exam, using a raw Wikipedia page and structured generation.
Example
Text generation with images in distilabel
Ask questions about images using distilabel.
Example
distilabel
","text":"Benchmark LLMs with distilabel
: reproducing the Arena Hard benchmark.
The script below first defines both the ArenaHard
and the ArenaHardResults
tasks, so as to generate responses for a given collection of prompts/questions with up to two LLMs, and then calculate the results as per the original implementation, respectively. Additionally, the second part of the example builds a Pipeline
to run the generation on top of the prompts with InferenceEndpointsLLM
while streaming the rest of the generations from a pre-computed set of GPT-4 generations, and then evaluate one against the other with OpenAILLM
generating an alternate response, a comparison between the responses, and a result as A>>B, A>B, B>A, B>>A, or tie.
To run this example you will first need to install the Arena Hard optional dependencies, being pandas
, scikit-learn
, and numpy
.
python examples/arena_hard.py\n
arena_hard.py# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport re\nfrom typing import Any, Dict, List, Optional, Union\n\nfrom typing_extensions import override\n\nfrom distilabel.steps import GlobalStep, StepInput\nfrom distilabel.steps.tasks.base import Task\nfrom distilabel.steps.tasks.typing import ChatType\nfrom distilabel.steps.typing import StepOutput\n\n\nclass ArenaHard(Task):\n \"\"\"Evaluates two assistant responses using an LLM as judge.\n\n This `Task` is based on the \"From Live Data to High-Quality Benchmarks: The\n Arena-Hard Pipeline\" paper that presents Arena Hard, which is a benchmark for\n instruction-tuned LLMs that contains 500 challenging user queries. GPT-4 is used\n as the judge to compare the model responses against a baseline model, which defaults\n to `gpt-4-0314`.\n\n Note:\n Arena-Hard-Auto has the highest correlation and separability to Chatbot Arena\n among popular open-ended LLM benchmarks.\n\n Input columns:\n - instruction (`str`): The instruction to evaluate the responses.\n - generations (`List[str]`): The responses generated by two, and only two, LLMs.\n\n Output columns:\n - evaluation (`str`): The evaluation of the responses generated by the LLMs.\n - score (`str`): The score extracted from the evaluation.\n - model_name (`str`): The model name used to generate the evaluation.\n\n Categories:\n - benchmark\n\n References:\n - [From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/)\n - [`arena-hard-auto`](https://github.com/lm-sys/arena-hard-auto/tree/main)\n\n Examples:\n\n Evaluate two assistant responses for a given instruction using Arean Hard prompts:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps import GroupColumns, LoadDataFromDicts\n from distilabel.steps.tasks import ArenaHard, TextGeneration\n\n with Pipeline() as pipeline:\n load_data = LoadDataFromDicts(\n data=[{\"instruction\": \"What is the capital of France?\"}],\n )\n\n text_generation_a = TextGeneration(\n llm=..., # LLM instance\n output_mappings={\"model_name\": \"generation_model\"},\n )\n\n text_generation_b = TextGeneration(\n llm=..., # LLM instance\n output_mappings={\"model_name\": \"generation_model\"},\n )\n\n combine = GroupColumns(\n columns=[\"generation\", \"generation_model\"],\n output_columns=[\"generations\", \"generation_models\"],\n )\n\n arena_hard = ArenaHard(\n llm=..., # LLM instance\n )\n\n load_data >> [text_generation_a, text_generation_b] >> combine >> arena_hard\n ```\n \"\"\"\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs required by this task are the `instruction` and the `generations`,\n which are the responses generated by two, and only two, LLMs.\"\"\"\n return [\"instruction\", \"generations\"]\n\n def format_input(self, input: Dict[str, Any]) -> ChatType:\n \"\"\"This method formats the input data as a `ChatType` using the prompt defined\n by the Arena Hard benchmark, which consists on a 
`system_prompt` plus a template\n for the user first message that contains the `instruction` and both `generations`.\n \"\"\"\n return [\n {\n \"role\": \"system\",\n \"content\": \"Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user prompt displayed below. You will be given assistant A's answer and assistant B's answer. Your job is to evaluate which assistant's answer is better.\\n\\nBegin your evaluation by generating your own answer to the prompt. You must provide your answers before judging any answers.\\n\\nWhen evaluating the assistants' answers, compare both assistants' answers with your answer. You must identify and correct any mistakes or inaccurate information.\\n\\nThen consider if the assistant's answers are helpful, relevant, and concise. Helpful means the answer correctly responds to the prompt or follows the instructions. Note when user prompt has any ambiguity or more than one interpretation, it is more helpful and appropriate to ask for clarifications or more information from the user than providing an answer based on assumptions. Relevant means all parts of the response closely connect or are appropriate to what is being asked. Concise means the response is clear and not verbose or excessive.\\n\\nThen consider the creativity and novelty of the assistant's answers when needed. Finally, identify any missing important information in the assistants' answers that would be beneficial to include when responding to the user prompt.\\n\\nAfter providing your explanation, you must output only one of the following choices as your final verdict with a label:\\n\\n1. Assistant A is significantly better: [[A>>B]]\\n2. Assistant A is slightly better: [[A>B]]\\n3. Tie, relatively the same: [[A=B]]\\n4. Assistant B is slightly better: [[B>A]]\\n5. Assistant B is significantly better: [[B>>A]]\\n\\nExample output: \\\"My final verdict is tie: [[A=B]]\\\".\",\n },\n {\n \"role\": \"user\",\n \"content\": f\"<|User Prompt|>\\n{input['instruction']}\\n\\n<|The Start of Assistant A's Answer|>\\n{input['generations'][0]}\\n<|The End of Assistant A's Answer|>\\n\\n<|The Start of Assistant B's Answer|>\\n{input['generations'][1]}\\n<|The End of Assistant B's Answer|>\",\n },\n ]\n\n @property\n def outputs(self) -> List[str]:\n \"\"\"The outputs generated by this task are the `evaluation`, the `score` and\n the `model_name` (which is automatically injected within the `process` method\n of the parent task).\"\"\"\n return [\"evaluation\", \"score\", \"model_name\"]\n\n def format_output(\n self,\n output: Union[str, None],\n input: Union[Dict[str, Any], None] = None,\n ) -> Dict[str, Any]:\n \"\"\"This method formats the output generated by the LLM as a Python dictionary\n containing the `evaluation` which is the raw output generated by the LLM (consisting\n of the judge LLM alternate generation for the given instruction, plus an explanation\n on the evaluation of the given responses; plus the `score` extracted from the output.\n\n Args:\n output: the raw output of the LLM.\n input: the input to the task. 
Is provided in case it needs to be used to enrich\n the output if needed.\n\n Returns:\n A dict with the keys `evaluation` with the raw output which contains the LLM\n evaluation and the extracted `score` if possible.\n \"\"\"\n if output is None:\n return {\"evaluation\": None, \"score\": None}\n pattern = re.compile(r\"\\[\\[([AB<>=]+)\\]\\]\")\n match = pattern.search(output)\n if match is None:\n return {\"evaluation\": output, \"score\": None}\n return {\"evaluation\": output, \"score\": match.group(1)}\n\n\nclass ArenaHardResults(GlobalStep):\n \"\"\"Process Arena Hard results to calculate the ELO scores.\n\n This `Step` is based on the \"From Live Data to High-Quality Benchmarks: The\n Arena-Hard Pipeline\" paper that presents Arena Hard, which is a benchmark for\n instruction-tuned LLMs that contains 500 challenging user queries. This step is\n a `GlobalStep` that should run right after the `ArenaHard` task to calculate the\n ELO scores for the evaluated models.\n\n Note:\n Arena-Hard-Auto has the highest correlation and separability to Chatbot Arena\n among popular open-ended LLM benchmarks.\n\n Input columns:\n - evaluation (`str`): The evaluation of the responses generated by the LLMs.\n - score (`str`): The score extracted from the evaluation.\n\n References:\n - [From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/)\n - [`arena-hard-auto`](https://github.com/lm-sys/arena-hard-auto/tree/main)\n\n Examples:\n\n Rate the ELO scores for two assistant responses for a given an evaluation / comparison between both using Arean Hard prompts:\n\n ```python\n from distilabel.pipeline import Pipeline\n from distilabel.steps import GroupColumns, LoadDataFromDicts\n from distilabel.steps.tasks import ArenaHard, TextGeneration\n\n with Pipeline() as pipeline:\n load_data = LoadDataFromDicts(\n data=[{\"instruction\": \"What is the capital of France?\"}],\n )\n\n text_generation_a = TextGeneration(\n llm=..., # LLM instance\n output_mappings={\"model_name\": \"generation_model\"},\n )\n\n text_generation_b = TextGeneration(\n llm=..., # LLM instance\n output_mappings={\"model_name\": \"generation_model\"},\n )\n\n combine = GroupColumns(\n columns=[\"generation\", \"generation_model\"],\n output_columns=[\"generations\", \"generation_models\"],\n )\n\n arena_hard = ArenaHard(\n llm=..., # LLM instance\n )\n\n arena_hard_results = ArenaHardResults(\n custom_model_column=\"generation_models\",\n custom_weights={\"A>B\": 1, \"A>>B\": 3, \"B>A\": 1, \"B>>A\": 3},\n )\n\n load_data >> [text_generation_a, text_generation_b] >> combine >> arena_hard >> arena_hard_results\n ```\n\n \"\"\"\n\n custom_model_column: Optional[str] = None\n custom_weights: Dict[str, int] = {\"A>B\": 1, \"A>>B\": 3, \"B>A\": 1, \"B>>A\": 3}\n\n def load(self) -> None:\n \"\"\"Ensures that the required dependencies are installed.\"\"\"\n super().load()\n\n try:\n import numpy as np # noqa: F401\n import pandas as pd # noqa: F401\n from sklearn.linear_model import LogisticRegression # noqa: F401\n except ImportError as e:\n raise ImportError(\n \"In order to run `ArenaHardResults`, the `arena-hard` extra dependencies\"\n \" must be installed i.e. 
`numpy`, `pandas`, and `scikit-learn`.\\n\"\n \"Please install the dependencies by running `pip install distilabel[arena-hard]`.\"\n ) from e\n\n # TODO: the `evaluation` is not really required as an input, so it could be removed, since\n # only `score` is used / required\n @property\n def inputs(self) -> List[str]:\n \"\"\"The inputs required by this step are the `evaluation` and the `score` generated\n by the `ArenaHard` task. Since this step does use the identifiers `model_a` and `model_b`,\n optionally one can set `custom_model_column` to use the model names if existing within\n the input data, ideally this value should be `model_name` if connected from the `ArenaHard`\n step.\"\"\"\n columns = [\"evaluation\", \"score\"]\n if self.custom_model_column:\n columns.append(self.custom_model_column)\n return columns\n\n @override\n def process(self, inputs: StepInput) -> StepOutput: # type: ignore\n \"\"\"This method processes the inputs generated by the `ArenaHard` task to calculate the\n win rates for each of the models to evaluate. Since this step inherits from the `GlobalStep`,\n it will wait for all the input batches to be processed, and then the output will be yielded in\n case there's a follow up step, since this step won't modify the received inputs.\n\n Args:\n inputs: A list of Python dictionaries with the inputs of the task.\n\n Yields:\n A list of Python dictionaries with the outputs of the task.\n\n References:\n - https://github.com/lm-sys/arena-hard-auto/blob/main/show_result.py\n \"\"\"\n import numpy as np\n import pandas as pd\n from sklearn.linear_model import LogisticRegression\n\n models = [\"A\", \"B\"]\n if self.custom_model_column:\n models = inputs[0][self.custom_model_column]\n\n # TODO: the battles are only calculated for the first game, even though the official\n # implementation also covers the possibility of a second game (not within the released\n # dataset yet)\n battles = pd.DataFrame()\n for input in inputs:\n output = {\n # TODO: \"question_id\": input[\"question_id\"],\n \"model_a\": models[0],\n \"model_b\": models[1],\n }\n if input[\"score\"] in [\"A>B\", \"A>>B\"]:\n output[\"winner\"] = models[0]\n rows = [output] * self.custom_weights[input[\"score\"]]\n elif input[\"score\"] in [\"B>A\", \"B>>A\"]:\n output[\"winner\"] = models[1]\n rows = [output] * self.custom_weights[input[\"score\"]]\n elif input[\"score\"] == \"A=B\":\n output[\"winner\"] = \"tie\"\n rows = [output]\n else:\n continue\n\n battles = pd.concat([battles, pd.DataFrame(rows)])\n\n models = pd.concat([battles[\"model_a\"], battles[\"model_b\"]]).unique()\n models = pd.Series(np.arange(len(models)), index=models)\n\n battles = pd.concat([battles, battles], ignore_index=True)\n p = len(models.index)\n n = battles.shape[0]\n\n X = np.zeros([n, p])\n X[np.arange(n), models[battles[\"model_a\"]]] = +np.log(10)\n X[np.arange(n), models[battles[\"model_b\"]]] = -np.log(10)\n\n Y = np.zeros(n)\n Y[battles[\"winner\"] == \"model_a\"] = 1.0\n\n tie_idx = battles[\"winner\"] == \"tie\"\n tie_idx[len(tie_idx) // 2 :] = False\n Y[tie_idx] = 1.0\n\n lr = LogisticRegression(fit_intercept=False, penalty=None, tol=1e-8) # type: ignore\n lr.fit(X, Y)\n\n # The ELO scores are calculated assuming that the reference is `gpt-4-0314`\n # with an starting ELO of 1000, so that the evaluated models are compared with\n # `gtp-4-0314` only if it's available within the models\n elo_scores = 400 * lr.coef_[0] + 1000\n # TODO: we could parametrize the reference / anchor model, but left as is to be faithful to 
the\n # original implementation\n if \"gpt-4-0314\" in models.index:\n elo_scores += 1000 - elo_scores[models[\"gpt-4-0314\"]]\n\n output = pd.Series(elo_scores, index=models.index).sort_values(ascending=False)\n self._logger.info(f\"Arena Hard ELO: {output}\")\n\n # Here only so that if follow up steps are connected the inputs are preserved,\n # since this step doesn't modify nor generate new inputs\n yield inputs\n\n\nif __name__ == \"__main__\":\n import json\n\n from distilabel.models import InferenceEndpointsLLM, OpenAILLM\n from distilabel.pipeline import Pipeline\n from distilabel.steps import (\n GroupColumns,\n KeepColumns,\n LoadDataFromHub,\n StepInput,\n step,\n )\n from distilabel.steps.tasks import TextGeneration\n from distilabel.steps.typing import StepOutput\n\n @step(inputs=[\"turns\"], outputs=[\"system_prompt\", \"instruction\"])\n def PrepareForTextGeneration(*inputs: StepInput) -> StepOutput:\n for input in inputs:\n for item in input:\n item[\"system_prompt\"] = \"You are a helpful assistant.\"\n item[\"instruction\"] = item[\"turns\"][0][\"content\"]\n yield input\n\n @step(\n inputs=[\"question_id\"],\n outputs=[\"generation\", \"generation_model\"],\n step_type=\"global\",\n )\n def LoadReference(*inputs: StepInput) -> StepOutput:\n # File downloaded from https://raw.githubusercontent.com/lm-sys/arena-hard-auto/e0a8ea1df42c1df76451a6cd04b14e31ff992b87/data/arena-hard-v0.1/model_answer/gpt-4-0314.jsonl\n lines = open(\"gpt-4-0314.jsonl\", mode=\"r\").readlines()\n for input in inputs:\n for item in input:\n for line in lines:\n data = json.loads(line)\n if data[\"question_id\"] == item[\"question_id\"]:\n item[\"generation\"] = data[\"choices\"][0][\"turns\"][0][\"content\"]\n item[\"generation_model\"] = data[\"model_id\"]\n break\n yield input\n\n with Pipeline(name=\"arena-hard-v0.1\") as pipeline:\n load_dataset = LoadDataFromHub(\n name=\"load_dataset\",\n repo_id=\"alvarobartt/lmsys-arena-hard-v0.1\",\n split=\"test\",\n num_examples=5,\n )\n\n load_reference = LoadReference(name=\"load_reference\")\n\n prepare = PrepareForTextGeneration(name=\"prepare\")\n\n text_generation_cohere = TextGeneration(\n name=\"text_generation_cohere\",\n llm=InferenceEndpointsLLM(\n model_id=\"CohereForAI/c4ai-command-r-plus\",\n tokenizer_id=\"CohereForAI/c4ai-command-r-plus\",\n ),\n use_system_prompt=True,\n input_batch_size=10,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n\n combine_columns = GroupColumns(\n name=\"combine_columns\",\n columns=[\"generation\", \"generation_model\"],\n output_columns=[\"generations\", \"generation_models\"],\n )\n\n arena_hard = ArenaHard(\n name=\"arena_hard\",\n llm=OpenAILLM(model=\"gpt-4-1106-preview\"),\n output_mappings={\"model_name\": \"evaluation_model\"},\n )\n\n keep_columns = KeepColumns(\n name=\"keep_columns\",\n columns=[\n \"question_id\",\n \"category\",\n \"cluster\",\n \"system_prompt\",\n \"instruction\",\n \"generations\",\n \"generation_models\",\n \"evaluation\",\n \"score\",\n \"evaluation_model\",\n ],\n )\n\n win_rates = ArenaHardResults(\n name=\"win_rates\", custom_model_column=\"generation_models\"\n )\n\n load_dataset >> load_reference # type: ignore\n load_dataset >> prepare >> text_generation_cohere # type: ignore\n ( # type: ignore\n [load_reference, text_generation_cohere]\n >> combine_columns\n >> arena_hard\n >> keep_columns\n >> win_rates\n )\n\n distiset = pipeline.run(\n parameters={ # type: ignore\n text_generation_cohere.name: {\n \"llm\": {\n \"generation_kwargs\": {\n 
\"temperature\": 0.7,\n \"max_new_tokens\": 4096,\n \"stop_sequences\": [\"<EOS_TOKEN>\", \"<|END_OF_TURN_TOKEN|>\"],\n }\n }\n },\n arena_hard.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.0,\n \"max_new_tokens\": 4096,\n }\n }\n },\n },\n )\n if distiset is not None:\n distiset.push_to_hub(\"arena-hard-results\")\n
"},{"location":"sections/pipeline_samples/examples/exam_questions/","title":"Create exam questions using structured generation","text":"This example will showcase how to generate exams questions and answers from a text page. In this case, we will use a wikipedia page as an example, and show how to leverage the prompt to help the model generate the data in the appropriate format.
We are going to use a meta-llama/Meta-Llama-3.1-8B-Instruct
model to generate questions and answers for a mock exam from a Wikipedia page. In this case, we are going to use the Transfer Learning entry. With the help of structured generation, we will guide the model to create structured data that is easy to parse. The structure will be question, answer, and distractors (wrong answers).
Example page Transfer_learning:
QA of the page:
{\n \"exam\": [\n {\n \"answer\": \"A technique in machine learning where knowledge learned from a task is re-used to boost performance on a related task.\",\n \"distractors\": [\"A type of neural network architecture\", \"A machine learning algorithm for image classification\", \"A method for data preprocessing\"],\n \"question\": \"What is transfer learning?\"\n },\n {\n \"answer\": \"1976\",\n \"distractors\": [\"1981\", \"1992\", \"1998\"],\n \"question\": \"In which year did Bozinovski and Fulgosi publish a paper addressing transfer learning in neural network training?\"\n },\n {\n \"answer\": \"Discriminability-based transfer (DBT) algorithm\",\n \"distractors\": [\"Multi-task learning\", \"Learning to Learn\", \"Cost-sensitive machine learning\"],\n \"question\": \"What algorithm was formulated by Lorien Pratt in 1992?\"\n },\n {\n \"answer\": \"A domain consists of a feature space and a marginal probability distribution.\",\n \"distractors\": [\"A domain consists of a label space and an objective predictive function.\", \"A domain consists of a task and a learning algorithm.\", \"A domain consists of a dataset and a model.\"],\n \"question\": \"What is the definition of a domain in the context of transfer learning?\"\n },\n {\n \"answer\": \"Transfer learning aims to help improve the learning of the target predictive function in the target domain using the knowledge in the source domain and learning task.\",\n \"distractors\": [\"Transfer learning aims to learn a new task from scratch.\", \"Transfer learning aims to improve the learning of the source predictive function in the source domain.\", \"Transfer learning aims to improve the learning of the target predictive function in the source domain.\"],\n \"question\": \"What is the goal of transfer learning?\"\n },\n {\n \"answer\": \"Markov logic networks, Bayesian networks, cancer subtype discovery, building utilization, general game playing, text classification, digit recognition, medical imaging, and spam filtering.\",\n \"distractors\": [\"Supervised learning, unsupervised learning, reinforcement learning, natural language processing, computer vision, and robotics.\", \"Image classification, object detection, segmentation, and tracking.\", \"Speech recognition, sentiment analysis, and topic modeling.\"],\n \"question\": \"What are some applications of transfer learning?\"\n },\n {\n \"answer\": \"ADAPT (Python), TLib (Python), Domain-Adaptation-Toolbox (Matlab)\",\n \"distractors\": [\"TensorFlow, PyTorch, Keras\", \"Scikit-learn, OpenCV, NumPy\", \"Matlab, R, Julia\"],\n \"question\": \"What are some software implementations of transfer learning and domain adaptation algorithms?\"\n }\n ]\n}\n
"},{"location":"sections/pipeline_samples/examples/exam_questions/#build-the-pipeline","title":"Build the pipeline","text":"Let's see how to build a pipeline to obtain this type of data:
from typing import List\nfrom pathlib import Path\n\nfrom pydantic import BaseModel, Field\n\nfrom distilabel.llms import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration\n\nimport wikipedia\n\npage = wikipedia.page(title=\"Transfer_learning\") # (1)\n\n\nclass ExamQuestion(BaseModel):\n question: str = Field(..., description=\"The question to be answered\")\n answer: str = Field(..., description=\"The correct answer to the question\")\n distractors: List[str] = Field(\n ..., description=\"A list of incorrect but viable answers to the question\"\n )\n\nclass ExamQuestions(BaseModel): # (2)\n exam: List[ExamQuestion]\n\n\nSYSTEM_PROMPT = \"\"\"\\\nYou are an exam writer specialized in writing exams for students.\nYour goal is to create questions and answers based on the document provided, and a list of distractors, that are incorrect but viable answers to the question.\nYour answer must adhere to the following format:\n```\n[\n {\n \"question\": \"Your question\",\n \"answer\": \"The correct answer to the question\",\n \"distractors\": [\"wrong answer 1\", \"wrong answer 2\", \"wrong answer 3\"]\n },\n ... (more questions and answers as required)\n]\n```\n\"\"\".strip() #\u00a0(3)\n\n\nwith Pipeline(name=\"ExamGenerator\") as pipeline:\n\n load_dataset = LoadDataFromDicts(\n name=\"load_instructions\",\n data=[\n {\n \"page\": page.content, #\u00a0(4)\n }\n ],\n )\n\n text_generation = TextGeneration( # (5)\n name=\"exam_generation\",\n system_prompt=SYSTEM_PROMPT,\n template=\"Generate a list of answers and questions about the document. Document:\\n\\n{{ page }}\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n structured_output={\n \"schema\": ExamQuestions.model_json_schema(),\n \"format\": \"json\"\n },\n ),\n input_batch_size=8,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n\n load_dataset >> text_generation # (6)\n
Download a single page for the demo. We could download the pages first, or apply the same procedure to any other type of data we want. In a real-world use case, we would want to build a dataset from these documents first.
Define the structure required for the answer using Pydantic. In this case, for each page we want a list of questions and answers (we've also added distractors, which can be ignored here). So our output will be an ExamQuestions
model, which is a list of ExamQuestion
objects, where each one consists of the question
and answer
fields as strings. The language model will use the field descriptions to generate the values (see the parsing sketch right after these notes).
Use the system prompt to guide the model towards the behaviour we want from it. Even though we are forcing the model to produce structured output, it helps to also spell out the expected format in the prompt.
Move the page content from Wikipedia to a row in the dataset.
The TextGeneration
task gets the system prompt, and the user prompt by means of the template
argument, where we instruct the model to generate the questions and answers based on the page content, which will be obtained from the corresponding column of the loaded data.
Connect both steps, and we are done.
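Since the generation is constrained to the ExamQuestions schema, the output can be validated back into the Pydantic model once the pipeline has run. Here is a minimal sketch (not part of the original script) that assumes the raw JSON string ends up in the default generation column of the resulting distiset:
row = distiset[\"default\"][\"train\"][0]  # distiset returned by pipeline.run(...)\nexam = ExamQuestions.model_validate_json(row[\"generation\"])  # raises if the JSON does not match the schema\nfor item in exam.exam:\n    print(item.question, \"->\", item.answer)\n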
To run this example you will first need to install the wikipedia dependency to download the sample data, by running pip install wikipedia
. Remember to change the username first if you want to push the dataset to the Hub using your account.
python examples/exam_questions.py\n
exam_questions.py# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom typing import List\n\nimport wikipedia\nfrom pydantic import BaseModel, Field\n\nfrom distilabel.llms import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration\n\npage = wikipedia.page(title=\"Transfer_learning\")\n\n\nclass ExamQuestion(BaseModel):\n question: str = Field(..., description=\"The question to be answered\")\n answer: str = Field(..., description=\"The correct answer to the question\")\n distractors: List[str] = Field(\n ..., description=\"A list of incorrect but viable answers to the question\"\n )\n\n\nclass ExamQuestions(BaseModel):\n exam: List[ExamQuestion]\n\n\nSYSTEM_PROMPT = \"\"\"\\\nYou are an exam writer specialized in writing exams for students.\nYour goal is to create questions and answers based on the document provided, and a list of distractors, that are incorrect but viable answers to the question.\nYour answer must adhere to the following format:\n```\n[\n {\n \"question\": \"Your question\",\n \"answer\": \"The correct answer to the question\",\n \"distractors\": [\"wrong answer 1\", \"wrong answer 2\", \"wrong answer 3\"]\n },\n ... (more questions and answers as required)\n]\n```\n\"\"\".strip()\n\n\nwith Pipeline(name=\"ExamGenerator\") as pipeline:\n load_dataset = LoadDataFromDicts(\n name=\"load_instructions\",\n data=[\n {\n \"page\": page.content,\n }\n ],\n )\n\n text_generation = TextGeneration(\n name=\"exam_generation\",\n system_prompt=SYSTEM_PROMPT,\n template=\"Generate a list of answers and questions about the document. Document:\\n\\n{{ page }}\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n structured_output={\n \"schema\": ExamQuestions.model_json_schema(),\n \"format\": \"json\",\n },\n ),\n input_batch_size=8,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n load_dataset >> text_generation\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n text_generation.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 2048,\n }\n }\n }\n },\n use_cache=False,\n )\n distiset.push_to_hub(\"USERNAME/exam_questions\")\n
"},{"location":"sections/pipeline_samples/examples/fine_personas_social_network/","title":"Create a social network with FinePersonas","text":"In this example, we'll explore the creation of specialized user personas for social network interactions using the FinePersonas-v0.1 dataset from Hugging Face. The final dataset will be ready to fine-tune a chat model with specific traits and characteristics.
"},{"location":"sections/pipeline_samples/examples/fine_personas_social_network/#introduction","title":"Introduction","text":"We'll delve into the process of fine-tuning different LoRA (Low-Rank Adaptation) models to imbue these personas with specific traits and characteristics.
This approach draws inspiration from Michael Sayman's work on SocialAI (visit the profile to see some examples), leveraging FinePersonas-v0.1 to build models that can emulate bots with specific behaviours.
By fine-tuning these adapters, we can potentially create AI personas with distinct characteristics, communication styles, and areas of expertise. The result? AI interactions that feel more natural and tailored to specific contexts or user needs. For those interested in the technical aspects of this approach, we recommend the insightful blog post on Multi-LoRA serving. It provides a clear and comprehensive explanation of the technology behind this innovative method.
Let's jump to the demo.
"},{"location":"sections/pipeline_samples/examples/fine_personas_social_network/#creating-our-socialai-task","title":"Creating our SocialAI Task","text":"Building on the new TextGeneration
, creating custom tasks is easier than ever before. This powerful tool opens up a world of possibilities for creating tailored text-based content with ease and precision. We will create a SocialAI
task that will be in charge of generating responses to user interactions, taking into account a given follower_type
, using the perspective of a given persona
:
from distilabel.steps.tasks import TextGeneration\n\nclass SocialAI(TextGeneration):\n follower_type: Literal[\"supporter\", \"troll\", \"alarmist\"] = \"supporter\"\n system_prompt: str = (\n \"You are an AI assistant expert at simulating user interactions. \"\n \"You must answer as if you were a '{follower_type}', be concise answer with no more than 200 characters, nothing else.\"\n \"Here are some traits to use for your personality:\\n\\n\"\n \"{traits}\"\n ) #\u00a0(1)\n template: str = \"You are the folowing persona:\\n\\n{{ persona }}\\n\\nWhat would you say to the following?\\n\\n {{ post }}\" # (2)\n columns: str | list[str] = [\"persona\", \"post\"] # (3)\n\n _follower_traits: dict[str, str] = {\n \"supporter\": (\n \"- Encouraging and positive\\n\"\n \"- Tends to prioritize enjoyment and relaxation\\n\"\n \"- Focuses on the present moment and short-term pleasure\\n\"\n \"- Often uses humor and playful language\\n\"\n \"- Wants to help others feel good and have fun\\n\"\n ),\n \"troll\": (\n \"- Provocative and confrontational\\n\"\n \"- Enjoys stirring up controversy and conflict\\n\"\n \"- Often uses sarcasm, irony, and mocking language\\n\"\n \"- Tends to belittle or dismiss others' opinions and feelings\\n\"\n \"- Seeks to get a rise out of others and create drama\\n\"\n ),\n \"alarmist\": (\n \"- Anxious and warning-oriented\\n\"\n \"- Focuses on potential risks and negative consequences\\n\"\n \"- Often uses dramatic or sensational language\\n\"\n \"- Tends to be serious and stern in tone\\n\"\n \"- Seeks to alert others to potential dangers and protect them from harm (even if it's excessive or unwarranted)\\n\"\n ),\n }\n\n def load(self) -> None:\n super().load()\n self.system_prompt = self.system_prompt.format(\n follower_type=self.follower_type,\n traits=self._follower_traits[self.follower_type]\n ) # (4)\n
We have a custom system prompt that will depend on the follower_type
we decide for our model.
The base template or prompt will answer the post
we have, from the point of view of a persona
.
We will need our dataset to have both persona
and post
columns to populate the prompt.
In the load method we place the specific traits for our follower type in the system prompt.
This is an example, so let's keep it short. We will use 3 posts, and 3 different types of personas. While there's potential to enhance this process (perhaps by implementing random persona selection or leveraging semantic similarity) we'll opt for a straightforward method in this demonstration.
Our goal is to create a set of nine examples, each pairing a post with a persona. To achieve this, we'll employ an LLM to respond to each post from the perspective of a specific persona
, effectively simulating how different characters might engage with the content.
posts = [\n {\n \"post\": \"Hmm, ok now I'm torn: should I go for healthy chicken tacos or unhealthy beef tacos for late night cravings?\"\n },\n {\n \"post\": \"I need to develop a training course for my company on communication skills. Need to decide how deliver it remotely.\"\n },\n {\n \"post\": \"I'm always 10 minutes late to meetups but no one's complained. Could this be annoying to them?\"\n },\n]\n\npersonas = (\n load_dataset(\"argilla/FinePersonas-v0.1-clustering-100k\", split=\"train\")\n .shuffle()\n .select(range(3))\n .select_columns(\"persona\")\n .to_list()\n)\n\ndata = []\nfor post in posts:\n for persona in personas:\n data.append({\"post\": post[\"post\"], \"persona\": persona[\"persona\"]})\n
Each row will have the following format:
import json\nprint(json.dumps(data[0], indent=4))\n{\n \"post\": \"Hmm, ok now I'm torn: should I go for healthy chicken tacos or unhealthy beef tacos for late night cravings?\",\n \"persona\": \"A high school or college environmental science teacher or an ecology student specializing in biogeography and ecosystem dynamics.\"\n}\n
This will be our dataset, that we can ingest using the LoadDataFromDicts
:
loader = LoadDataFromDicts(data=data)\n
"},{"location":"sections/pipeline_samples/examples/fine_personas_social_network/#simulating-from-different-types-of-followers","title":"Simulating from different types of followers","text":"With our data in hand, we're ready to explore the capabilities of our SocialAI task. For this demonstration, we'll make use of of meta-llama/Meta-Llama-3.1-70B-Instruct
While this model has become something of a go-to choice recently, it's worth noting that experimenting with a variety of models could yield even more interesting results:
from distilabel.models import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 256,\n },\n)\nfollower_type = \"supporter\"\n\nfollower = SocialAI(\n llm=llm,\n follower_type=follower_type,\n name=f\"{follower_type}_user\",\n)\n
This setup simplifies the process: we only need to input the follower type, and the system handles the rest. We could also extend this to pick a random follower type by default, and simulate a bunch of different personalities, as sketched below.
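As a minimal sketch of that idea (an illustration, not part of the original script), the follower type could simply be drawn at random before instantiating the task:
import random\n\nfollower_type = random.choice([\"supporter\", \"troll\", \"alarmist\"])  # pick a personality at random\n\nfollower = SocialAI(\n    llm=llm,\n    follower_type=follower_type,\n    name=f\"{follower_type}_user\",\n)\n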
"},{"location":"sections/pipeline_samples/examples/fine_personas_social_network/#building-our-pipeline","title":"Building our Pipeline","text":"The foundation of our pipeline is now in place. At its core is a single, powerful LLM. This versatile model will be repurposed to drive three distinct SocialAI
tasks (each a customized TextGeneration
task, tailored to a specific follower type), and each of them will be prepared for Supervised Fine-Tuning using FormatTextGenerationSFT
:
with Pipeline(name=\"Social AI Personas\") as pipeline:\n loader = LoadDataFromDicts(data=data, batch_size=1)\n\n llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 256,\n },\n )\n\n for follower_type in [\"supporter\", \"troll\", \"alarmist\"]:\n follower = SocialAI(\n llm=llm,\n follower_type=follower_type,\n name=f\"{follower_type}_user\", # (1)\n output_mappings={\n \"generation\": f\"interaction_{follower_type}\" # (2)\n }\n )\n format_sft = FormatTextGenerationSFT(\n name=f\"format_sft_{follower_type}\",\n input_mappings={\n \"instruction\": \"post\",\n \"generation\": f\"interaction_{follower_type}\" # (3)\n },\n )\n loader >> follower >> format_sft # (4)\n
We update the name of the step to keep track of it in the pipeline.
The generation
column from each task will be mapped to a different name to avoid it being overridden, as we are reusing the same task.
As we have modified the output column from SocialAI
, we redirect each one of the \"follower_type\" responses.
Connect the loader to each one of the follower tasks and format_sft
to obtain 3 different subsets.
The outcome of this pipeline will be three SFT-formatted subsets, one per follower type
crafted by the SocialAI
task, where each post is paired with the corresponding interaction for that follower type. With these subsets you can fine-tune three specialized models using your preferred framework, such as TRL, or any other training framework of your choice (a rough sketch follows below).
All the pieces are in place for our script; the full pipeline can be seen here:
Runpython examples/finepersonas_social_ai.py\n
finepersonas_social_ai.py# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom typing import Literal\n\nfrom datasets import load_dataset\n\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import FormatTextGenerationSFT, LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration\n\n\nclass SocialAI(TextGeneration):\n follower_type: Literal[\"supporter\", \"troll\", \"alarmist\"] = \"supporter\"\n system_prompt: str = (\n \"You are an AI assistant expert at simulating user interactions. \"\n \"You must answer as if you were a '{follower_type}', be concise answer with no more than 200 characters, nothing else.\"\n \"Here are some traits to use for your personality:\\n\\n\"\n \"{traits}\"\n )\n template: str = \"You are the folowing persona:\\n\\n{{ persona }}\\n\\nWhat would you say to the following?\\n\\n {{ post }}\"\n columns: str | list[str] = [\"persona\", \"post\"]\n\n _follower_traits: dict[str, str] = {\n \"supporter\": (\n \"- Encouraging and positive\\n\"\n \"- Tends to prioritize enjoyment and relaxation\\n\"\n \"- Focuses on the present moment and short-term pleasure\\n\"\n \"- Often uses humor and playful language\\n\"\n \"- Wants to help others feel good and have fun\\n\"\n ),\n \"troll\": (\n \"- Provocative and confrontational\\n\"\n \"- Enjoys stirring up controversy and conflict\\n\"\n \"- Often uses sarcasm, irony, and mocking language\\n\"\n \"- Tends to belittle or dismiss others' opinions and feelings\\n\"\n \"- Seeks to get a rise out of others and create drama\\n\"\n ),\n \"alarmist\": (\n \"- Anxious and warning-oriented\\n\"\n \"- Focuses on potential risks and negative consequences\\n\"\n \"- Often uses dramatic or sensational language\\n\"\n \"- Tends to be serious and stern in tone\\n\"\n \"- Seeks to alert others to potential dangers and protect them from harm (even if it's excessive or unwarranted)\\n\"\n ),\n }\n\n def load(self) -> None:\n super().load()\n self.system_prompt = self.system_prompt.format(\n follower_type=self.follower_type,\n traits=self._follower_traits[self.follower_type],\n )\n\n\nposts = [\n {\n \"post\": \"Hmm, ok now I'm torn: should I go for healthy chicken tacos or unhealthy beef tacos for late night cravings?\"\n },\n {\n \"post\": \"I need to develop a training course for my company on communication skills. Need to decide how deliver it remotely.\"\n },\n {\n \"post\": \"I'm always 10 minutes late to meetups but no one's complained. 
Could this be annoying to them?\"\n },\n]\n\npersonas = (\n load_dataset(\"argilla/FinePersonas-v0.1-clustering-100k\", split=\"train\")\n .shuffle()\n .select(range(3))\n .select_columns(\"persona\")\n .to_list()\n)\n\ndata = []\nfor post in posts:\n for persona in personas:\n data.append({\"post\": post[\"post\"], \"persona\": persona[\"persona\"]})\n\n\nwith Pipeline(name=\"Social AI Personas\") as pipeline:\n loader = LoadDataFromDicts(data=data, batch_size=1)\n\n llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 256,\n },\n )\n\n for follower_type in [\"supporter\", \"troll\", \"alarmist\"]:\n follower = SocialAI(\n llm=llm,\n follower_type=follower_type,\n name=f\"{follower_type}_user\",\n output_mappings={\"generation\": f\"interaction_{follower_type}\"},\n )\n format_sft = FormatTextGenerationSFT(\n name=f\"format_sft_{follower_type}\",\n input_mappings={\n \"instruction\": \"post\",\n \"generation\": f\"interaction_{follower_type}\",\n },\n )\n loader >> follower >> format_sft\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False)\n distiset.push_to_hub(\"plaguss/FinePersonas-SocialAI-test\", include_script=True)\n
This is the final toy dataset we obtain: FinePersonas-SocialAI-test
You can see an example of how to load each of the subsets to fine-tune a model:
from datasets import load_dataset\n\nds = load_dataset(\"plaguss/FinePersonas-SocialAI-test\", \"format_sft_troll\")\n
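From there, a rough sketch of the fine-tuning itself could look like the following, assuming the subset exposes a messages column in the conversational format TRL's SFTTrainer understands, and picking an arbitrary small chat model and output directory:
from trl import SFTConfig, SFTTrainer\n\ntrainer = SFTTrainer(\n    model=\"HuggingFaceTB/SmolLM2-135M-Instruct\",  # any chat model would do here\n    train_dataset=ds[\"train\"],  # `ds` loaded in the snippet above\n    args=SFTConfig(output_dir=\"socialai-troll-sft\"),\n)\ntrainer.train()\n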
And a sample of the generated field with the corresponding post
and persona
:
{\n \"post\": \"Hmm, ok now I\\u0027m torn: should I go for healthy chicken tacos or unhealthy beef tacos for late night cravings?\",\n \"persona\": \"A high school or undergraduate physics or chemistry teacher, likely with a focus on experimental instruction.\",\n \"interaction_troll\": \"\\\"Late night cravings? More like late night brain drain. Either way, it\\u0027s just a collision of molecules in your stomach. Choose the one with more calories, at least that\\u0027s some decent kinetic energy.\\\"\",\n}\n
There's a lot of room for improvement, but it's quite a promising start.
"},{"location":"sections/pipeline_samples/examples/llama_cpp_with_outlines/","title":"Structured generation withoutlines
","text":"Generate RPG characters following a pydantic.BaseModel
with outlines
in distilabel
.
This script makes use of LlamaCppLLM
and the structured output capabilities thanks to outlines
to generate RPG characters that adhere to a JSON schema.
It makes use of a local model which can be downloaded using curl (explained in the script itself), and can be exchanged with other LLMs
like vLLM
.
python examples/structured_generation_with_outlines.py\n
structured_generation_with_outlines.py# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom enum import Enum\nfrom pathlib import Path\n\nfrom pydantic import BaseModel, StringConstraints, conint\nfrom typing_extensions import Annotated\n\nfrom distilabel.models import LlamaCppLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration\n\n\nclass Weapon(str, Enum):\n sword = \"sword\"\n axe = \"axe\"\n mace = \"mace\"\n spear = \"spear\"\n bow = \"bow\"\n crossbow = \"crossbow\"\n\n\nclass Armor(str, Enum):\n leather = \"leather\"\n chainmail = \"chainmail\"\n plate = \"plate\"\n mithril = \"mithril\"\n\n\nclass Character(BaseModel):\n name: Annotated[str, StringConstraints(max_length=30)]\n age: conint(gt=1, lt=3000)\n armor: Armor\n weapon: Weapon\n\n\n# Download the model with\n# curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf\n\nmodel_path = \"Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf\"\n\nwith Pipeline(\"RPG-characters\") as pipeline:\n system_prompt = (\n \"You are a leading role play gamer. You have seen thousands of different characters and their attributes.\"\n \" Please return a JSON object with common attributes of an RPG character.\"\n )\n\n load_dataset = LoadDataFromDicts(\n name=\"load_instructions\",\n data=[\n {\n \"system_prompt\": system_prompt,\n \"instruction\": f\"Give me a character description for a {char}\",\n }\n for char in [\"dwarf\", \"elf\", \"human\", \"ork\"]\n ],\n )\n llm = LlamaCppLLM(\n model_path=str(Path.home() / model_path), # type: ignore\n n_gpu_layers=-1,\n n_ctx=1024,\n structured_output={\"format\": \"json\", \"schema\": Character},\n )\n # Change to vLLM as such:\n # llm = vLLM(\n # model=\"teknium/OpenHermes-2.5-Mistral-7B\",\n # extra_kwargs={\"tensor_parallel_size\": 1},\n # structured_output={\"format\": \"json\", \"schema\": Character},\n # )\n\n text_generation = TextGeneration(\n name=\"text_generation_rpg\",\n llm=llm,\n input_batch_size=8,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n load_dataset >> text_generation\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n text_generation.name: {\n \"llm\": {\"generation_kwargs\": {\"max_new_tokens\": 256}}\n }\n },\n use_cache=False,\n )\n for num, character in enumerate(distiset[\"default\"][\"train\"][\"generation\"]):\n print(f\"Character: {num}\")\n print(character)\n\n# Character: 0\n# {\n# \"name\": \"Gimli\",\n# \"age\": 42,\n# \"armor\": \"plate\",\n# \"weapon\": \"axe\" }\n# Character: 1\n# {\"name\":\"Gaelen\",\"age\":600,\"armor\":\"leather\",\"weapon\":\"bow\"}\n# Character: 2\n# {\"name\": \"John Smith\",\"age\": 35,\"armor\": \"leather\",\"weapon\": \"sword\"}\n# Character: 3\n# { \"name\": \"Grug\", \"age\": 35, \"armor\": \"leather\", \"weapon\": \"axe\"}\n
"},{"location":"sections/pipeline_samples/examples/mistralai_with_instructor/","title":"Structured generation with instructor
","text":"Answer instructions with knowledge graphs defined as pydantic.BaseModel
objects using instructor
in distilabel
.
This script makes use of MistralLLM
and the structured output capabilities thanks to instructor
to generate knowledge graphs from complex topics.
This example is translated from this awesome example from instructor
cookbook.
python examples/structured_generation_with_instructor.py\n
structured_generation_with_instructor.py# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom typing import List\n\nfrom pydantic import BaseModel, Field\n\nfrom distilabel.models import MistralLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import TextGeneration\n\n\nclass Node(BaseModel):\n id: int\n label: str\n color: str\n\n\nclass Edge(BaseModel):\n source: int\n target: int\n label: str\n color: str = \"black\"\n\n\nclass KnowledgeGraph(BaseModel):\n nodes: List[Node] = Field(..., default_factory=list)\n edges: List[Edge] = Field(..., default_factory=list)\n\n\nwith Pipeline(\n name=\"Knowledge-Graphs\",\n description=(\n \"Generate knowledge graphs to answer questions, this type of dataset can be used to \"\n \"steer a model to answer questions with a knowledge graph.\"\n ),\n) as pipeline:\n sample_questions = [\n \"Teach me about quantum mechanics\",\n \"Who is who in The Simpsons family?\",\n \"Tell me about the evolution of programming languages\",\n ]\n\n load_dataset = LoadDataFromDicts(\n name=\"load_instructions\",\n data=[\n {\n \"system_prompt\": \"You are a knowledge graph expert generator. Help me understand by describing everything as a detailed knowledge graph.\",\n \"instruction\": f\"{question}\",\n }\n for question in sample_questions\n ],\n )\n\n text_generation = TextGeneration(\n name=\"knowledge_graph_generation\",\n llm=MistralLLM(\n model=\"open-mixtral-8x22b\", structured_output={\"schema\": KnowledgeGraph}\n ),\n input_batch_size=8,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n load_dataset >> text_generation\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n text_generation.name: {\n \"llm\": {\"generation_kwargs\": {\"max_new_tokens\": 2048}}\n }\n },\n use_cache=False,\n )\n\n distiset.push_to_hub(\"distilabel-internal-testing/knowledge_graphs\")\n
Visualizing the graphs Want to see how to visualize the graphs? You can test it using the following script. Generate some samples on your own and take a look:
Note
This example uses graphviz to render the graph, you can install with pip
in the following way:
pip install graphviz\n
python examples/draw_kg.py 2 # You can pass 0,1,2 to visualize each of the samples.\n
"},{"location":"sections/pipeline_samples/examples/text_generation_with_image/","title":"Text generation with images in distilabel
","text":"Answer questions about images using distilabel
.
Image-text-to-text models take in an image and a text prompt and output text. In this example we will use InferenceEndpointsLLM
with meta-llama/Llama-3.2-11B-Vision-Instruct to ask a question about an image, and OpenAILLM
with gpt-4o-mini
. We will ask a simple question to showcase how the TextGenerationWithImage
task can be used in a pipeline.
from distilabel.models.llms import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks.text_generation_with_image import TextGenerationWithImage\nfrom distilabel.steps import LoadDataFromDicts\n\n\nwith Pipeline(name=\"vision_generation_pipeline\") as pipeline:\n loader = LoadDataFromDicts(\n data=[\n {\n \"instruction\": \"What\u2019s in this image?\",\n \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n }\n ],\n )\n\n llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n )\n\n vision = TextGenerationWithImage(\n name=\"vision_gen\",\n llm=llm,\n image_type=\"url\" # (1)\n )\n\n loader >> vision\n
TextGenerationWithImage
for more information.Image:
Question:
What\u2019s in this image?
Response:
This image depicts a wooden boardwalk weaving its way through a lush meadow, flanked by vibrant green grass that stretches towards the horizon under a calm and inviting sky. The boardwalk runs straight ahead, away from the viewer, forming a clear pathway through the tall, lush green grass, crops or other plant types or an assortment of small trees and shrubs. This meadow is dotted with trees and shrubs, appearing to be healthy and green. The sky above is a beautiful blue with white clouds scattered throughout, adding a sense of tranquility to the scene. While this image appears to be of a natural landscape, because grass is...
from distilabel.models.llms import OpenAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks.text_generation_with_image import TextGenerationWithImage\nfrom distilabel.steps import LoadDataFromDicts\n\n\nwith Pipeline(name=\"vision_generation_pipeline\") as pipeline:\n loader = LoadDataFromDicts(\n data=[\n {\n \"instruction\": \"What\u2019s in this image?\",\n \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n }\n ],\n )\n\n llm = OpenAILLM(\n model=\"gpt-4o-mini\",\n )\n\n vision = TextGenerationWithImage(\n name=\"vision_gen\",\n llm=llm,\n image_type=\"url\" # (1)\n )\n\n loader >> vision\n
VisionGeneration
for more information.Image:
Question:
What\u2019s in this image?
Response:
The image depicts a scenic landscape featuring a wooden walkway or path that runs through a lush green marsh or field. The area is surrounded by tall grass and various shrubs, with trees likely visible in the background. The sky is blue with some wispy clouds, suggesting a beautiful day. Overall, it presents a peaceful natural setting, ideal for a stroll or nature observation.
The full pipeline can be run at the following example:
Run the full pipelinepython examples/text_generation_with_image.py\n
text_generation_with_image.py# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom distilabel.models.llms import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks.text_generation_with_image import TextGenerationWithImage\n\nwith Pipeline(name=\"vision_generation_pipeline\") as pipeline:\n loader = LoadDataFromDicts(\n data=[\n {\n \"instruction\": \"What\u2019s in this image?\",\n \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\",\n }\n ],\n )\n\n llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n )\n\n vision = TextGenerationWithImage(name=\"vision_gen\", llm=llm, image_type=\"url\")\n\n loader >> vision\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False)\n distiset.push_to_hub(\"plaguss/test-vision-generation-Llama-3.2-11B-Vision-Instruct\")\n
A sample dataset can be seen at plaguss/test-vision-generation-Llama-3.2-11B-Vision-Instruct.
"},{"location":"sections/pipeline_samples/papers/apigen/","title":"Create Function-Calling datasets with APIGen","text":"This example will introduce APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets, a data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications.
"},{"location":"sections/pipeline_samples/papers/apigen/#replication","title":"Replication","text":"The following figure showcases the APIGen framework:
Now, let's walk through the key steps illustrated in the figure:
DataSampler
: With the help of this step and the original Salesforce/xlam-function-calling-60k we are getting the Seed QA Data Sampler for the prompt template.
APIGenGenerator
: This step does the job of the Query-Answer Generator, including the format checker from Stage 1: Format Checker thanks to the structured output generation.
APIGenExecutionChecker
: This step is in charge of the Stage 2: Execution Checker.
APIGenSemanticChecker
: Step in charge of running Stage 3: Semantic Checker, can use the same or a different LLM, we are using the same as in APIGenGenerator
step.
The current implementation hasn't utilized the Diverse Prompt Library. To incorporate it, one could either adjust the prompt template within the APIGenGenerator
or develop a new sampler specifically for this purpose. As for the API Sampler, while no specific data is shared here, we've created illustrative examples to demonstrate the pipeline's functionality. These examples represent a mix of data that could be used to replicate the sampler's output.
The original paper describes the data they used and gives some hints, but nothing was shared. In this example, we will write a bunch of examples by hand to showcase how this pipeline can be built.
Assume we have the following function names, and corresponding descriptions of their behaviour:
data = [\n {\n \"func_name\": \"final_velocity\",\n \"func_desc\": \"Calculates the final velocity of an object given its initial velocity, acceleration, and time.\",\n },\n {\n \"func_name\": \"permutation_count\",\n \"func_desc\": \"Calculates the number of permutations of k elements from a set of n elements.\",\n },\n {\n \"func_name\": \"getdivision\",\n \"func_desc\": \"Divides two numbers by making an API call to a division service.\",\n },\n {\n \"func_name\": \"binary_addition\",\n \"func_desc\": \"Adds two binary numbers and returns the result as a binary string.\",\n },\n {\n \"func_name\": \"swapi_planet_resource\",\n \"func_desc\": \"get a specific planets resource\",\n },\n {\n \"func_name\": \"disney_character\",\n \"func_desc\": \"Find a specific character using this endpoint\",\n }\n]\n
The original paper refers to both Python functions and APIs, but we will make use of Python functions exclusively for simplicity. In order to execute and check these functions/APIs, we need access to the code, which we have moved to a Python file: lib_apigen.py. All these functions are executable, but we also need access to their tool representation. For this, we will make use of transformers' get_json_schema function1.
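As a quick reference, here is a hedged sketch (not the actual contents of lib_apigen.py) of how get_json_schema reads the type hints and the Google-style docstring of a function to build its tool representation:
from transformers.utils import get_json_schema\n\n\ndef final_velocity(initial_velocity: float, acceleration: float, time: float) -> float:\n    \"\"\"Calculates the final velocity of an object.\n\n    Args:\n        initial_velocity: The initial velocity of the object.\n        acceleration: The acceleration of the object.\n        time: The time elapsed.\n    \"\"\"\n    return initial_velocity + acceleration * time\n\n\ntool = get_json_schema(final_velocity)\n# tool is a dict like {\"type\": \"function\", \"function\": {\"name\": \"final_velocity\", ...}}\n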
We have all the machinery prepared in our libpath, except for the tool definitions. With the help of our helper function load_module_from_path
we will load this Python module, collect all the tools, and add them to each row in our data
variable.
from distilabel.steps.tasks.apigen.utils import load_module_from_path\n\nlibpath_module = load_module_from_path(libpath)\ntools = getattr(libpath_module, \"get_tools\")() # call get_tools()\n\nfor row in data:\n #\u00a0The tools should have a mix where both the correct and irrelevant tools are present.\n row.update({\"tools\": [tools[row[\"func_name\"]]]})\n
Now we have all the necessary data for our prompt. Additionally, we will make use of the original dataset as few-shot examples to enhance the model:
ds_og = (\n load_dataset(\"Salesforce/xlam-function-calling-60k\", split=\"train\")\n .shuffle(seed=42)\n .select(range(500))\n .to_list()\n)\n
We have just loaded a subset and transformed it to a list of dictionaries, as we will use it in the DataSampler
GeneratorStep
, grabbing random examples from the original dataset.
Now that we've walked through each component, it's time to see how it all comes together, here's the Pipeline code:
with Pipeline(name=\"apigen-example\") as pipeline:\n loader_seeds = LoadDataFromDicts(data=data) # (1)\n\n sampler = DataSampler( # (2)\n data=ds_og,\n size=2,\n samples=len(data),\n batch_size=8,\n )\n\n prep_examples = PrepareExamples() # This step will add the 'examples' column\n\n combine_steps = CombineOutputs() # (3)\n\n model_id = \"meta-llama/Meta-Llama-3.1-70B-Instruct\"\n llm=InferenceEndpointsLLM( # (4)\n model_id=model_id,\n tokenizer_id=model_id,\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 2048,\n },\n )\n apigen = APIGenGenerator( # (5)\n llm=llm,\n use_default_structured_output=True,\n )\n\n execution_checker = APIGenExecutionChecker(libpath=str(libpath)) # (6)\n semantic_checker = APIGenSemanticChecker(llm=llm) # (7)\n\n sampler >> prep_examples\n (\n [loader_seeds, prep_examples] \n >> combine_steps \n >> apigen\n >> execution_checker\n >> semantic_checker\n )\n
Load the data seeds we are going to use to generate our function calling dataset.
The DataSampler
together with PrepareExamples
will be used to help us create the few-shot examples from the original dataset to be fed in our prompt.
Combine both columns to obtain a single stream of data
Will reuse the same LLM for the generation and the semantic checks.
Creates the query
and answers
that will be used together with the tools
to fine-tune a new model. Will generate the structured outputs to ensure we have valid JSON formatted answers.
Adds columns keep_row_after_execution_check
and execution_result
.
Adds columns keep_row_after_semantic_check
and thought
.
To see all the pieces in place, take a look at the full pipeline, as well as an example row that would be generated from this pipeline.
Runpython examples/pipeline_apigen.py\n
pipeline_apigen.py# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom pathlib import Path\n\nfrom datasets import load_dataset\n\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import CombineOutputs, DataSampler, LoadDataFromDicts\nfrom distilabel.steps.tasks import (\n APIGenExecutionChecker,\n APIGenGenerator,\n APIGenSemanticChecker,\n)\nfrom distilabel.steps.tasks.apigen.utils import PrepareExamples, load_module_from_path\n\nlibpath = Path(__file__).parent / \"lib_apigen.py\"\n\ndata = [\n {\n \"func_name\": \"final_velocity\",\n \"func_desc\": \"Calculates the final velocity of an object given its initial velocity, acceleration, and time.\",\n },\n {\n \"func_name\": \"permutation_count\",\n \"func_desc\": \"Calculates the number of permutations of k elements from a set of n elements.\",\n },\n {\n \"func_name\": \"getdivision\",\n \"func_desc\": \"Divides two numbers by making an API call to a division service.\",\n },\n {\n \"func_name\": \"binary_addition\",\n \"func_desc\": \"Adds two binary numbers and returns the result as a binary string.\",\n },\n {\n \"func_name\": \"swapi_planet_resource\",\n \"func_desc\": \"get a specific planets resource\",\n },\n {\n \"func_name\": \"disney_character\",\n \"func_desc\": \"Find a specific character using this endpoint\",\n },\n]\n\nlibpath_module = load_module_from_path(libpath)\ntools = libpath_module.get_tools() # call get_tools()\n\n# TODO: Add in the tools between 0 and 2 extra tools to make the task more challenging.\nfor row in data:\n # The tools should have a mix where both the correct and irrelevant tools are present.\n row.update({\"tools\": [tools[row[\"func_name\"]]]})\n\n\nds_og = (\n load_dataset(\"Salesforce/xlam-function-calling-60k\", split=\"train\")\n .shuffle(seed=42)\n .select(range(500))\n .to_list()\n)\n\n\nwith Pipeline(name=\"APIGenPipeline\") as pipeline:\n loader_seeds = LoadDataFromDicts(data=data)\n sampler = DataSampler(\n data=ds_og,\n size=2,\n samples=len(data),\n batch_size=8,\n )\n\n prep_examples = PrepareExamples()\n\n model_id = \"meta-llama/Meta-Llama-3.1-70B-Instruct\"\n llm = InferenceEndpointsLLM(\n model_id=model_id,\n tokenizer_id=model_id,\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 2048,\n },\n )\n apigen = APIGenGenerator(\n llm=llm,\n use_default_structured_output=True,\n )\n combine_steps = CombineOutputs()\n\n execution_checker = APIGenExecutionChecker(libpath=str(libpath))\n semantic_checker = APIGenSemanticChecker(llm=llm)\n\n sampler >> prep_examples\n (\n [loader_seeds, prep_examples]\n >> combine_steps\n >> apigen\n >> execution_checker\n >> semantic_checker\n )\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run()\n print(distiset[\"default\"][\"train\"][0])\n
Example row:
{\n \"func_name\": \"final_velocity\",\n \"func_desc\": \"Calculates the final velocity of an object given its initial velocity, acceleration, and time.\",\n \"tools\": [\n {\n \"function\": {\n \"description\": \"Calculates the final velocity of an object given its initial velocity, acceleration, and time.\",\n \"name\": \"final_velocity\",\n \"parameters\": {\n \"properties\": {\n \"acceleration\": {\n \"description\": \"The acceleration of the object.\",\n \"type\": \"number\"\n },\n \"initial_velocity\": {\n \"description\": \"The initial velocity of the object.\",\n \"type\": \"number\"\n },\n \"time\": {\n \"description\": \"The time elapsed.\",\n \"type\": \"number\"\n }\n },\n \"required\": [\n \"initial_velocity\",\n \"acceleration\",\n \"time\"\n ],\n \"type\": \"object\"\n }\n },\n \"type\": \"function\"\n }\n ],\n \"examples\": \"## Query:\\nRetrieve the first 15 comments for post ID '12345' from the Tokapi mobile API.\\n## Answers:\\n[{\\\"name\\\": \\\"v1_post_post_id_comments\\\", \\\"arguments\\\": {\\\"post_id\\\": \\\"12345\\\", \\\"count\\\": 15}}]\\n\\n## Query:\\nRetrieve the detailed recipe for the cake with ID 'cake101'.\\n## Answers:\\n[{\\\"name\\\": \\\"detailed_cake_recipe_by_id\\\", \\\"arguments\\\": {\\\"is_id\\\": \\\"cake101\\\"}}]\\n\\n## Query:\\nWhat are the frequently asked questions and their answers for Coca-Cola Company? Also, what are the suggested tickers based on Coca-Cola Company?\\n## Answers:\\n[{\\\"name\\\": \\\"symbols_faq\\\", \\\"arguments\\\": {\\\"ticker_slug\\\": \\\"KO\\\"}}, {\\\"name\\\": \\\"symbols_suggested\\\", \\\"arguments\\\": {\\\"ticker_slug\\\": \\\"KO\\\"}}]\",\n \"query\": \"What would be the final velocity of an object that starts at rest and accelerates at 9.8 m/s^2 for 10 seconds.\",\n \"answers\": \"[{\\\"arguments\\\": {\\\"acceleration\\\": \\\"9.8\\\", \\\"initial_velocity\\\": \\\"0\\\", \\\"time\\\": \\\"10\\\"}, \\\"name\\\": \\\"final_velocity\\\"}]\",\n \"distilabel_metadata\": {\n \"raw_input_a_p_i_gen_generator_0\": [\n {\n \"content\": \"You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\\n\\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\\n\\nEnsure the query:\\n- Is clear and concise\\n- Demonstrates typical use cases\\n- Includes all necessary parameters in a meaningful way. 
For numerical parameters, it could be either numbers or words\\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\\n- The corresponding result's parameter types and ranges match with the function's descriptions\\n\\nEnsure the answer:\\n- Is a list of function calls in JSON format\\n- The length of the answer list should be equal to the number of requests in the query\\n- Can solve all the requests in the query effectively\",\n \"role\": \"system\"\n },\n {\n \"content\": \"Here are examples of queries and the corresponding answers for similar functions:\\n## Query:\\nRetrieve the first 15 comments for post ID '12345' from the Tokapi mobile API.\\n## Answers:\\n[{\\\"name\\\": \\\"v1_post_post_id_comments\\\", \\\"arguments\\\": {\\\"post_id\\\": \\\"12345\\\", \\\"count\\\": 15}}]\\n\\n## Query:\\nRetrieve the detailed recipe for the cake with ID 'cake101'.\\n## Answers:\\n[{\\\"name\\\": \\\"detailed_cake_recipe_by_id\\\", \\\"arguments\\\": {\\\"is_id\\\": \\\"cake101\\\"}}]\\n\\n## Query:\\nWhat are the frequently asked questions and their answers for Coca-Cola Company? Also, what are the suggested tickers based on Coca-Cola Company?\\n## Answers:\\n[{\\\"name\\\": \\\"symbols_faq\\\", \\\"arguments\\\": {\\\"ticker_slug\\\": \\\"KO\\\"}}, {\\\"name\\\": \\\"symbols_suggested\\\", \\\"arguments\\\": {\\\"ticker_slug\\\": \\\"KO\\\"}}]\\n\\nNote that the query could be interpreted as a combination of several independent requests.\\n\\nBased on these examples, generate 1 diverse query and answer pairs for the function `final_velocity`.\\nThe detailed function description is the following:\\nCalculates the final velocity of an object given its initial velocity, acceleration, and time.\\n\\nThese are the available tools to help you:\\n[{'type': 'function', 'function': {'name': 'final_velocity', 'description': 'Calculates the final velocity of an object given its initial velocity, acceleration, and time.', 'parameters': {'type': 'object', 'properties': {'initial_velocity': {'type': 'number', 'description': 'The initial velocity of the object.'}, 'acceleration': {'type': 'number', 'description': 'The acceleration of the object.'}, 'time': {'type': 'number', 'description': 'The time elapsed.'}}, 'required': ['initial_velocity', 'acceleration', 'time']}}}]\\n\\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:\\n```json\\n[\\n {\\n \\\"query\\\": \\\"The generated query.\\\",\\n \\\"answers\\\": [\\n {\\n \\\"name\\\": \\\"api_name\\\",\\n \\\"arguments\\\": {\\n \\\"arg_name\\\": \\\"value\\\"\\n ... (more arguments as required)\\n }\\n },\\n ... (more API calls as required)\\n ]\\n }\\n]\\n```\\n\\nNow please generate 1 diverse query and answer pairs following the above format.\",\n \"role\": \"user\"\n }\n ],\n \"raw_input_a_p_i_gen_semantic_checker_0\": [\n {\n \"content\": \"As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user\\u2019s intentions.\\n\\nDo not pass if:\\n1. The function call does not align with the query\\u2019s objective, or the input arguments appear incorrect.\\n2. The function call and arguments are not properly chosen from the available functions.\\n3. The number of function calls does not correspond to the user\\u2019s intentions.\\n4. 
The execution results are irrelevant and do not match the function\\u2019s purpose.\\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\",\n \"role\": \"system\"\n },\n {\n \"content\": \"Given Information:\\n- All Available Functions:\\nCalculates the final velocity of an object given its initial velocity, acceleration, and time.\\n- User Query: What would be the final velocity of an object that starts at rest and accelerates at 9.8 m/s^2 for 10 seconds.\\n- Generated Function Calls: [{\\\"arguments\\\": {\\\"acceleration\\\": \\\"9.8\\\", \\\"initial_velocity\\\": \\\"0\\\", \\\"time\\\": \\\"10\\\"}, \\\"name\\\": \\\"final_velocity\\\"}]\\n- Execution Results: ['9.8']\\n\\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\\n\\nThe main decision factor is wheather the function calls accurately reflect the query's intentions and the function descriptions.\\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\\n\\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\\n```\\n{\\n \\\"thought\\\": \\\"Concisely describe your reasoning here\\\",\\n \\\"passes\\\": \\\"yes\\\" or \\\"no\\\"\\n}\\n```\\n\",\n \"role\": \"user\"\n }\n ],\n \"raw_output_a_p_i_gen_generator_0\": \"{\\\"pairs\\\": [\\n {\\n \\\"answers\\\": [\\n {\\n \\\"arguments\\\": {\\n \\\"acceleration\\\": \\\"9.8\\\",\\n \\\"initial_velocity\\\": \\\"0\\\",\\n \\\"time\\\": \\\"10\\\"\\n },\\n \\\"name\\\": \\\"final_velocity\\\"\\n }\\n ],\\n \\\"query\\\": \\\"What would be the final velocity of an object that starts at rest and accelerates at 9.8 m/s^2 for 10 seconds.\\\"\\n }\\n]}\",\n \"raw_output_a_p_i_gen_semantic_checker_0\": \"{\\n \\\"thought\\\": \\\"\\\",\\n \\\"passes\\\": \\\"yes\\\"\\n}\"\n },\n \"model_name\": \"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n \"keep_row_after_execution_check\": true,\n \"execution_result\": [\n \"9.8\"\n ],\n \"thought\": \"\",\n \"keep_row_after_semantic_check\": true\n}\n
Read this nice blog post for more information on tools and the reasoning behind get_json_schema
: Tool Use, Unified.\u00a0\u21a9
\"Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment\" introduces both Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. While APO can be found in TRL, we have implemented a task for CLAIR in distilabel
.
CLAIR is a method for creating preference pairs which minimally revises one output to express a preference, resulting in a more precise learning signal as opposed to conventional methods which use a judge to select a preferred response.
The authors of the original paper shared a collection of datasets from CLAIR and APO, where ContextualAI/ultrafeedback_clair_32k corresponds to the CLAIR implementation.
"},{"location":"sections/pipeline_samples/papers/clair/#replication","title":"Replication","text":"Note
The section is named Replication
but in this case we are showing how to use the CLAIR
task create revisions for your generations using distilabel
.
To showcase CLAIR we will be using the CLAIR
task implemented in distilabel
and we are reusing a small sample of the already generated dataset by ContextualAI ContextualAI/ultrafeedback_clair_32k
for testing.
To reproduce the code below, one will need to install distilabel
as follows:
pip install \"distilabel>=1.4.0\"\n
Depending on the LLM provider you want to use, the requirements may vary, so take a look at the corresponding dependencies in that case. For this example we are using the free serverless Inference Endpoints from Hugging Face, but that won't be enough for a bigger dataset.
"},{"location":"sections/pipeline_samples/papers/clair/#building-blocks","title":"Building blocks","text":"In this case where we already have instructions and their generations, we will just need to load the data and the corresponding CLAIR task for the revisions:
CLAIR
to generate the revisions. Let's see the full pipeline applied to ContextualAI/ultrafeedback_clair_32k
in distilabel
:
from typing import Any, Dict\n\nfrom datasets import load_dataset\n\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import CLAIR\nfrom distilabel.models import InferenceEndpointsLLM\n\n\ndef transform_ultrafeedback(example: Dict[str, Any]) -> Dict[str, Any]:\n return {\n \"task\": example[\"prompt\"],\n \"student_solution\": example[\"rejected\"][1][\"content\"],\n }\n\ndataset = (\n load_dataset(\"ContextualAI/ultrafeedback_clair_32k\", split=\"train\")\n .select(range(10)) #\u00a0We collect just 10 examples\n .map(transform_ultrafeedback) # Apply the transformation to get just the text\n)\n\nwith Pipeline(name=\"CLAIR UltraFeedback sample\") as pipeline:\n clair = CLAIR( # (1)\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 4096\n }\n )\n )\n\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(dataset=dataset) # (2)\n distiset.push_to_hub(repo_id=\"username/clair-test\", include_script=True) # (3)\n
This Pipeline uses just CLAIR because we already have the generations, but one could include a first task to create generations from instructions and then apply the revisions with CLAIR (see the sketch at the end of this section).
Include the dataset directly in the run method for simplicity.
Push the distiset to the hub with the script for reproducibility.
An example dataset can be found at: distilabel-internal-testing/clair-test.
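If you only have instructions and no generations yet, a minimal sketch of how a first generation step could be wired in front of CLAIR might look as follows; the column mappings are an assumption based on the inputs the CLAIR task expects (task and student_solution) and may need adjusting for your dataset:
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import CLAIR, TextGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\")\n\nwith Pipeline(name=\"generate-then-revise\") as pipeline:\n    # First draft of an answer for each task/instruction\n    text_generation = TextGeneration(\n        llm=llm,\n        input_mappings={\"instruction\": \"task\"},  # assuming the input column is named \"task\"\n        output_mappings={\"generation\": \"student_solution\"},  # CLAIR expects \"student_solution\"\n    )\n    # Minimally revise the draft to build the preference pair\n    clair = CLAIR(llm=llm)\n\n    text_generation >> clair\n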
"},{"location":"sections/pipeline_samples/papers/deepseek_prover/","title":"DeepSeek Prover","text":"\"DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data\" presents an approach to generate mathematical proofs for theorems generated from informal math problems. This approach shows promising results to advance the capabilities of models towards theorem proving using synthetic data. Until this moment the dataset and the model trained on top of it haven't been opened, let's see how the approach works to reproduce the pipeline using distilabel
. The following figure depicts the approach taken to generate the dataset:
The authors propose a method for generating Lean 4 proof data from informal mathematical problems. Their approach translates high-school and undergraduate-level mathematical competition problems into formal statements.
Here we show how to deal with steps 1 and 2, but the authors additionally check the theorems by running the Lean 4 program on the generated proofs, and iterate over a series of steps: fine-tuning a model on the synthetic data (DeepSeek-Prover 7B), regenerating the dataset, and continuing the process until no further improvement is found.
"},{"location":"sections/pipeline_samples/papers/deepseek_prover/#replication","title":"Replication","text":"Note
The section is named Replication
but we will show how we can use distilabel
to create the different steps outlined in the DeepSeek-Prover
approach. We intentionally leave some steps out of the pipeline, but this can easily be extended.
We will define the components needed to generate a dataset like the one depicted in the previous figure (we won't call lean4 or do the fine-tuning, this last step can be done outside of distilabel
). The different blocks will have all the docstrings, as we would have in the internal steps, to showcase how they are built, but they can be omitted for brevity.
To reproduce the code below, we need to install distilabel
as follows:
pip install \"distilabel[hf-inference-endpoints]\"\n
We have decided to use InferenceEndpointsLLM
, but any other provider with a strong model could work.
There are three components we need to define for this pipeline, matching the different components in the paper: a task to formalize the original statements, another to assess the relevance of the theorems, and a final one to generate proofs for the theorems.
Note
We will use the same LLM
for all the tasks, so we will define it once and reuse it across them:
llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n)\n
"},{"location":"sections/pipeline_samples/papers/deepseek_prover/#deepseekproverautoformalization","title":"DeepSeekProverAutoFormalization","text":"This Task
corresponds to the first step in the figure. Given an informal statement, it will formalize it for us in Lean 4
language, meaning it translates an informal statement, which could be gathered from the internet, into the structured Lean 4 language.
_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX = r\"```lean4(.*?)```\"\n\ntemplate_deepseek_prover_auto_formalization = \"\"\"\\\nMathematical Problem in Natural Language:\n{{ informal_statement }}\n{%- if few_shot %}\n\nPlease use the following examples to guide you with the answer:\n{%- for example in examples %}\n- {{ example }}\n{%- endfor %}\n{% endif -%}\"\"\"\n\n\nclass DeepSeekProverAutoFormalization(Task):\n examples: Optional[List[str]] = None\n system_prompt: str = \"Translate the problem to Lean 4 (only the core declaration):\\n```lean4\\nformal statement goes here\\n```\"\n _template: Union[Template, None] = PrivateAttr(...)\n _few_shot: bool = PrivateAttr(default=False)\n\n def load(self) -> None:\n super().load()\n self._template = Template(template_deepseek_prover_auto_formalization)\n\n @property\n def inputs(self) -> List[str]:\n return [\"informal_statement\"]\n\n @property\n def outputs(self):\n return [\"formal_statement\", \"model_name\"]\n\n def format_input(self, input: str) -> ChatType: # type: ignore\n return [\n {\n \"role\": \"system\",\n \"content\": self.system_prompt,\n },\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n informal_statement=input[self.inputs[0]],\n few_shot=bool(self.examples),\n examples=self.examples,\n ),\n },\n ]\n\n @override\n def format_output( # type: ignore\n self, output: Union[str, None], input: Dict[str, Any] = None\n ) -> Dict[str, Any]: # type: ignore\n match = re.search(_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX, output, re.DOTALL)\n if match:\n match = match.group(1).strip()\n return {\"formal_statement\": match}\n
Following the paper, the authors found that the model yields better results if it uses examples in a few-shot setting, so this class allows passing some examples to help in generating the formalization. Let's see an example of how we can instantiate it:
from textwrap import dedent\n\nexamples = [\n dedent(\"\"\"\n ## Statement in natural language:\n For real numbers k and x:\n If x is equal to (13 - \u221a131) / 4, and\n If the equation 2x\u00b2 - 13x + k = 0 is satisfied,\n Then k must be equal to 19/4.\n ## Formalized:\n theorem mathd_algebra_116 (k x : \u211d) (h\u2080 : x = (13 - Real.sqrt 131) / 4)\n (h\u2081 : 2 * x ^ 2 - 13 * x + k = 0) : k = 19 / 4 :=\"\"\"),\n dedent(\"\"\"\n ## Statement in natural language:\n The greatest common divisor (GCD) of 20 factorial (20!) and 200,000 is equal to 40,000.\n ## Formalized:\n theorem mathd_algebra_116 (k x : \u211d) (h\u2080 : x = (13 - Real.sqrt 131) / 4)\n (h\u2081 : 2 * x ^ 2 - 13 * x + k = 0) : k = 19 / 4 :=\"\"\"),\n dedent(\"\"\"\n ## Statement in natural language:\n Given two integers x and y:\n If y is positive (greater than 0),\n And y is less than x,\n And the equation x + y + xy = 80 is true,\n Then x must be equal to 26.\n ## Formalized:\n theorem mathd_algebra_116 (k x : \u211d) (h\u2080 : x = (13 - Real.sqrt 131) / 4)\n (h\u2081 : 2 * x ^ 2 - 13 * x + k = 0) : k = 19 / 4 :=\"\"\"),\n]\n\nauto_formalization = DeepSeekProverAutoFormalization(\n name=\"auto_formalization\",\n input_batch_size=8,\n llm=llm,\n examples=examples\n)\n
"},{"location":"sections/pipeline_samples/papers/deepseek_prover/#deepseekproverscorer","title":"DeepSeekProverScorer","text":"The next Task
corresponds to the second step, the model scoring and assessment. It uses an LLM as judge to evaluate the relevance of the theorem, and assigns a score so it can be filtered afterwards.
template_deepseek_prover_scorer = \"\"\"\\\nTo evaluate whether a formal Lean4 statement will be of interest to the community, consider the following criteria:\n\n1. Relevance to Current Research: Does the statement address a problem or concept that is actively being researched in mathematics or related fields? Higher relevance scores indicate greater potential interest.\n2. Complexity and Depth: Is the statement complex enough to challenge existing theories and methodologies, yet deep enough to provide significant insights or advancements? Complexity and depth showcase Lean4's capabilities and attract interest.\n3. Interdisciplinary Potential: Does the statement offer opportunities for interdisciplinary research, connecting mathematics with other fields such as computer science, physics, or biology? Interdisciplinary projects often garner wide interest.\n4. Community Needs and Gaps: Does the statement fill an identified need or gap within the Lean4 community or the broader mathematical community? Addressing these needs directly correlates with interest.\n5. Innovativeness: How innovative is the statement? Does it propose new methods, concepts, or applications? Innovation drives interest and engagement.\n\nCustomize your evaluation for each problem accordingly, assessing it as 'excellent', 'good', 'above average', 'fair' or 'poor'.\n\nYou should respond in the following format for each statement:\n\n'''\nNatural language: (Detailed explanation of the informal statement, including any relevant background information, assumptions, and definitions.)\nAnalysis: (Provide a brief justification for each score, highlighting why the statement scored as it did across the criteria.)\nAssessment: (Based on the criteria, rate the statement as 'excellent', 'good', 'above average', 'fair' or 'poor'. JUST the Assessment.)\n'''\"\"\"\n\nclass DeepSeekProverScorer(Task):\n _template: Union[Template, None] = PrivateAttr(...)\n\n def load(self) -> None:\n super().load()\n self._template = Template(template_deepseek_prover_scorer)\n\n @property\n def inputs(self) -> List[str]:\n return [\"informal_statement\", \"formal_statement\"]\n\n @property\n def outputs(self):\n return [\"natural_language\", \"analysis\", \"assessment\", \"model_name\"]\n\n def format_input(self, input: str) -> ChatType:\n return [\n {\n \"role\": \"system\",\n \"content\": self._template.render(),\n },\n {\n \"role\": \"user\",\n \"content\": f\"## Informal statement:\\n{input[self.inputs[0]]}\\n\\n ## Formal statement:\\n{input[self.inputs[1]]}\",\n },\n ]\n\n @override\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any] = None\n ) -> Dict[str, Any]:\n try:\n result = output.split(\"Natural language:\")[1].strip()\n natural_language, analysis = result.split(\"Analysis:\")\n analysis, assessment = analysis.split(\"Assessment:\")\n natural_language = natural_language.strip()\n analysis = analysis.strip()\n assessment = assessment.strip()\n except Exception:\n natural_language = analysis = assessment = None\n\n return {\n \"natural_language\": natural_language,\n \"analysis\": analysis,\n \"assessment\": assessment\n }\n
"},{"location":"sections/pipeline_samples/papers/deepseek_prover/#deepseekproversolver","title":"DeepSeekProverSolver","text":"The last task is in charge of generating a proof for the theorems generated in the previous steps.
DeepSeekProverSolverclass DeepSeekProverSolver(Task):\n system_prompt: str = (\n \"You are an expert in proving mathematical theorems formalized in lean4 language. \"\n \"Your answers consist just in the proof to the theorem given, and nothing else.\"\n )\n\n @property\n def inputs(self) -> List[str]:\n return [\"formal_statement\"]\n\n @property\n def outputs(self):\n return [\"proof\"]\n\n def format_input(self, input: str) -> ChatType:\n prompt = dedent(\"\"\"\n Give me a proof for the following theorem:\n ```lean4\n {theorem}\n ```\"\"\"\n )\n return [\n {\n \"role\": \"system\",\n \"content\": self.system_prompt,\n },\n {\n \"role\": \"user\",\n \"content\": prompt.format(theorem=input[\"formal_statement\"]),\n },\n ]\n\n def format_output(\n self, output: Union[str, None], input: Dict[str, Any] = None\n ) -> Dict[str, Any]:\n import re\n match = re.search(_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX, output, re.DOTALL)\n if match:\n match = match.group(1).strip()\n return {\"proof\": match}\n
Additionally, the original pipeline defined in the paper includes a step to check the final proofs using Lean 4, which we have omitted for simplicity. The fine-tuning can be done completely offline, coming back to the pipeline after each iteration/training run.
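As a rough idea of what that omitted verification could look like outside of distilabel, the sketch below shells out to a local Lean 4 toolchain to type-check a statement together with its proof; the lean command and the standalone-file layout are assumptions (statements relying on Mathlib would need a proper lake project), so treat it only as a starting point:
import subprocess\nimport tempfile\nfrom pathlib import Path\n\n\ndef check_proof(formal_statement: str, proof: str, timeout: int = 60) -> bool:\n    # Write the statement and proof to a temporary file and ask Lean 4 to compile it.\n    # Assumes a lean binary on the PATH and a self-contained statement (Mathlib-based\n    # statements would need to be checked inside a lake project instead).\n    source = formal_statement + \"\\n\" + proof + \"\\n\"\n    with tempfile.TemporaryDirectory() as tmp:\n        file = Path(tmp) / \"proof_check.lean\"\n        file.write_text(source)\n        try:\n            result = subprocess.run([\"lean\", str(file)], capture_output=True, timeout=timeout)\n        except subprocess.TimeoutExpired:\n            return False\n    return result.returncode == 0\n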
All the docstrings have been removed from the code blocks, but can be seen in the full pipeline.
"},{"location":"sections/pipeline_samples/papers/deepseek_prover/#code","title":"Code","text":"Lets's put the building blocks together to create the final pipeline with distilabel
. For this example we have generated a sample dataset plaguss/informal-mathematical-statements-tiny of informal mathematical statements starting from casey-martin/multilingual-mathematical-autoformalization, but as the paper mentions, we can create formal statements and their corresponding proofs starting from informal ones:
# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nimport re\nfrom pathlib import Path\nfrom textwrap import dedent\nfrom typing import Any, Dict, List, Optional, Union\n\nfrom jinja2 import Template\nfrom pydantic import PrivateAttr\nfrom typing_extensions import override\n\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromHub\nfrom distilabel.steps.tasks.base import Task\nfrom distilabel.steps.tasks.typing import ChatType\n\n_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX = r\"```lean4(.*?)```\"\n\n\ntemplate_deepseek_prover_auto_formalization = \"\"\"\\\nMathematical Problem in Natural Language:\n{{ informal_statement }}\n{%- if few_shot %}\n\nPlease use the following examples to guide you with the answer:\n{%- for example in examples %}\n- {{ example }}\n{%- endfor %}\n{% endif -%}\"\"\"\n\n\nclass DeepSeekProverAutoFormalization(Task):\n \"\"\"Task to translate a mathematical problem from natural language to Lean 4.\n\n Note:\n A related dataset (MMA from the paper) can be found in Hugging Face:\n [casey-martin/multilingual-mathematical-autoformalization](https://huggingface.co/datasets/casey-martin/multilingual-mathematical-autoformalization).\n\n Input columns:\n - informal_statement (`str`): The statement to be formalized using Lean 4.\n\n Output columns:\n - formal_statement (`str`): The formalized statement using Lean 4, to be analysed.\n\n Categories:\n - generation\n\n References:\n - [`DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data`](https://arxiv.org/abs/2405.14333).\n - [`Lean 4`](https://github.com/leanprover/lean4).\n\n Examples:\n\n Formalize a mathematical problem from natural language to Lean 4:\n\n ```python\n from distilabel.steps.tasks import DeepSeekProverAutoFormalization\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n prover_autoformal = DeepSeekProverAutoFormalization(\n llm=InferenceEndpointsLLM(\n model_id=\"deepseek-ai/deepseek-math-7b-instruct\",\n tokenizer_id=\"deepseek-ai/deepseek-math-7b-instruct\",\n ),\n )\n\n prover_autoformal.load()\n\n result = next(\n prover_autoformal.process(\n [\n {\"informal_statement\": \"If a polynomial g is monic, then the root of g is integral over the ring R.\"},\n ]\n )\n )\n # result\n # [\n # {\n # 'informal_statement': 'If a polynomial g is monic, then the root of g is integral over the ring R.',\n # 'formal_statement': 'theorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):=',\n # 'distilabel_metadata': {\n # 'raw_output_deep_seek_prover_auto_formalization_0': '```lean4\\ntheorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):=\\n```'\n # },\n # 'model_name': 'deepseek-prover'\n # }\n # ]\n ```\n\n Use a few-shot setting to formalize a mathematical problem from natural language to Lean 4:\n\n ```python\n from distilabel.steps.tasks import DeepSeekProverAutoFormalization\n from 
distilabel.models import InferenceEndpointsLLM\n\n # You can gain inspiration from the following examples to create your own few-shot examples:\n # https://github.com/yangky11/miniF2F-lean4/blob/main/MiniF2F/Valid.lean\n # Consider this as a placeholder for your actual LLM.\n prover_autoformal = DeepSeekProverAutoFormalization(\n llm=InferenceEndpointsLLM(\n model_id=\"deepseek-ai/deepseek-math-7b-instruct\",\n tokenizer_id=\"deepseek-ai/deepseek-math-7b-instruct\",\n ),\n examples=[\n \"theorem amc12a_2019_p21 (z : \u2102) (h\u2080 : z = (1 + Complex.I) / Real.sqrt 2) :\\n\\n((\u2211 k : \u2124 in Finset.Icc 1 12, z ^ k ^ 2) * (\u2211 k : \u2124 in Finset.Icc 1 12, 1 / z ^ k ^ 2)) = 36 := by\\n\\nsorry\",\n \"theorem amc12a_2015_p10 (x y : \u2124) (h\u2080 : 0 < y) (h\u2081 : y < x) (h\u2082 : x + y + x * y = 80) : x = 26 := by\\n\\nsorry\"\n ]\n )\n\n prover_autoformal.load()\n\n result = next(\n prover_autoformal.process(\n [\n {\"informal_statement\": \"If a polynomial g is monic, then the root of g is integral over the ring R.\"},\n ]\n )\n )\n # result\n # [\n # {\n # 'informal_statement': 'If a polynomial g is monic, then the root of g is integral over the ring R.',\n # 'formal_statement': 'theorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):=',\n # 'distilabel_metadata': {\n # 'raw_output_deep_seek_prover_auto_formalization_0': '```lean4\\ntheorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):=\\n```'\n # },\n # 'model_name': 'deepseek-prover'\n # }\n # ]\n ```\n \"\"\"\n\n examples: Optional[List[str]] = None\n system_prompt: str = \"Translate the problem to Lean 4 (only the core declaration):\\n```lean4\\nformal statement goes here\\n```\"\n _template: Union[Template, None] = PrivateAttr(...)\n _few_shot: bool = PrivateAttr(default=False)\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n self._template = Template(template_deepseek_prover_auto_formalization)\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task is the `instruction`.\"\"\"\n return [\"informal_statement\"]\n\n @property\n def outputs(self):\n \"\"\"The output for the task is a list of `instructions` containing the generated instructions.\"\"\"\n return [\"formal_statement\", \"model_name\"]\n\n def format_input(self, input: str) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation. And the\n `system_prompt` is added as the first message if it exists.\"\"\"\n return [\n {\n \"role\": \"system\",\n \"content\": self.system_prompt,\n },\n {\n \"role\": \"user\",\n \"content\": self._template.render(\n informal_statement=input[self.inputs[0]],\n few_shot=bool(self.examples),\n examples=self.examples,\n ),\n },\n ]\n\n @override\n def format_output( # type: ignore\n self, output: Union[str, None], input: Dict[str, Any] = None\n ) -> Dict[str, Any]: # type: ignore\n \"\"\"Extracts the formal statement from the Lean 4 output.\"\"\"\n match = re.search(_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX, output, re.DOTALL)\n if match:\n match = match.group(1).strip()\n return {\"formal_statement\": match}\n\n\ntemplate_deepseek_prover_scorer = \"\"\"\\\nTo evaluate whether a formal Lean4 statement will be of interest to the community, consider the following criteria:\n\n1. Relevance to Current Research: Does the statement address a problem or concept that is actively being researched in mathematics or related fields? 
Higher relevance scores indicate greater potential interest.\n2. Complexity and Depth: Is the statement complex enough to challenge existing theories and methodologies, yet deep enough to provide significant insights or advancements? Complexity and depth showcase Lean4's capabilities and attract interest.\n3. Interdisciplinary Potential: Does the statement offer opportunities for interdisciplinary research, connecting mathematics with other fields such as computer science, physics, or biology? Interdisciplinary projects often garner wide interest.\n4. Community Needs and Gaps: Does the statement fill an identified need or gap within the Lean4 community or the broader mathematical community? Addressing these needs directly correlates with interest.\n5. Innovativeness: How innovative is the statement? Does it propose new methods, concepts, or applications? Innovation drives interest and engagement.\n\nCustomize your evaluation for each problem accordingly, assessing it as 'excellent', 'good', 'above average', 'fair' or 'poor'.\n\nYou should respond in the following format for each statement:\n\n'''\nNatural language: (Detailed explanation of the informal statement, including any relevant background information, assumptions, and definitions.)\nAnalysis: (Provide a brief justification for each score, highlighting why the statement scored as it did across the criteria.)\nAssessment: (Based on the criteria, rate the statement as 'excellent', 'good', 'above average', 'fair' or 'poor'. JUST the Assessment.)\n'''\"\"\"\n\n\nclass DeepSeekProverScorer(Task):\n \"\"\"Task to evaluate the quality of a formalized mathematical problem in Lean 4,\n inspired by the DeepSeek-Prover task for scoring.\n\n Note:\n A related dataset (MMA from the paper) can be found in Hugging Face:\n [casey-martin/multilingual-mathematical-autoformalization](https://huggingface.co/datasets/casey-martin/multilingual-mathematical-autoformalization).\n\n Input columns:\n - informal_statement (`str`): The statement to be formalized using Lean 4.\n - formal_statement (`str`): The formalized statement using Lean 4, to be analysed.\n\n Output columns:\n - natural_language (`str`): Explanation for the problem.\n - analysis (`str`): Analysis of the different points defined in the prompt.\n - assessment (`str`): Result of the assessment.\n\n Categories:\n - scorer\n - quality\n - response\n\n References:\n - [`DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data`](https://arxiv.org/abs/2405.14333).\n - [`Lean 4`](https://github.com/leanprover/lean4).\n\n Examples:\n\n Analyse a formal statement in Lean 4:\n\n ```python\n from distilabel.steps.tasks import DeepSeekProverScorer\n from distilabel.models import InferenceEndpointsLLM\n\n # Consider this as a placeholder for your actual LLM.\n prover_scorer = DeepSeekProverAutoFormalization(\n llm=InferenceEndpointsLLM(\n model_id=\"deepseek-ai/deepseek-math-7b-instruct\",\n tokenizer_id=\"deepseek-ai/deepseek-math-7b-instruct\",\n ),\n )\n\n prover_scorer.load()\n\n result = next(\n prover_scorer.process(\n [\n {\"formal_statement\": \"theorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):=\"},\n ]\n )\n )\n # result\n # [\n # {\n # 'formal_statement': 'theorem isIntegral_root (hg : g.Monic) : IsIntegral R (root g):=',\n # 'informal_statement': 'INFORMAL',\n # 'analysis': 'ANALYSIS',\n # 'assessment': 'ASSESSMENT',\n # 'distilabel_metadata': {\n # 'raw_output_deep_seek_prover_scorer_0': 'Natural 
language:\\nINFORMAL\\nAnalysis:\\nANALYSIS\\nAssessment:\\nASSESSMENT'\n # },\n # 'model_name': 'deepseek-prover-scorer'\n # }\n # ]\n ```\n \"\"\"\n\n _template: Union[Template, None] = PrivateAttr(...)\n\n def load(self) -> None:\n \"\"\"Loads the Jinja2 template.\"\"\"\n super().load()\n\n self._template = Template(template_deepseek_prover_scorer)\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task is the `instruction`.\"\"\"\n return [\"informal_statement\", \"formal_statement\"]\n\n @property\n def outputs(self):\n \"\"\"The output for the task is a list of `instructions` containing the generated instructions.\"\"\"\n return [\"natural_language\", \"analysis\", \"assessment\", \"model_name\"]\n\n def format_input(self, input: str) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType` assuming that the instruction\n is the first interaction from the user within a conversation. And the\n `system_prompt` is added as the first message if it exists.\"\"\"\n return [\n {\n \"role\": \"system\",\n \"content\": self._template.render(),\n },\n {\n \"role\": \"user\",\n \"content\": f\"## Informal statement:\\n{input[self.inputs[0]]}\\n\\n ## Formal statement:\\n{input[self.inputs[1]]}\",\n },\n ]\n\n @override\n def format_output( # type: ignore\n self, output: Union[str, None], input: Dict[str, Any] = None\n ) -> Dict[str, Any]: # type: ignore\n \"\"\"Analyses the formal statement with Lean 4 output and generates an assessment\n and the corresponding informal assessment.\"\"\"\n\n try:\n result = output.split(\"Natural language:\")[1].strip()\n natural_language, analysis = result.split(\"Analysis:\")\n analysis, assessment = analysis.split(\"Assessment:\")\n natural_language = natural_language.strip()\n analysis = analysis.strip()\n assessment = assessment.strip()\n except Exception:\n natural_language = analysis = assessment = None\n\n return {\n \"natural_language\": natural_language,\n \"analysis\": analysis,\n \"assessment\": assessment,\n }\n\n\nclass DeepSeekProverSolver(Task):\n \"\"\"Task to generate a proof for a formal statement (theorem) in lean4.\n\n Input columns:\n - formal_statement (`str`): The formalized statement using Lean 4.\n\n Output columns:\n - proof (`str`): The proof for the formal statement theorem.\n\n Categories:\n - scorer\n - quality\n - response\n\n References:\n - [`DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data`](https://arxiv.org/abs/2405.14333).\n \"\"\"\n\n system_prompt: str = (\n \"You are an expert in proving mathematical theorems formalized in lean4 language. 
\"\n \"Your answers consist just in the proof to the theorem given, and nothing else.\"\n )\n\n @property\n def inputs(self) -> List[str]:\n \"\"\"The input for the task is the `formal_statement`.\"\"\"\n return [\"formal_statement\"]\n\n @property\n def outputs(self):\n \"\"\"The output for the task is the proof for the formal statement theorem.\"\"\"\n return [\"proof\"]\n\n def format_input(self, input: str) -> ChatType: # type: ignore\n \"\"\"The input is formatted as a `ChatType`, with a system prompt to guide our model.\"\"\"\n prompt = dedent(\"\"\"\n Give me a proof for the following theorem:\n ```lean4\n {theorem}\n ```\"\"\")\n return [\n {\n \"role\": \"system\",\n \"content\": self.system_prompt,\n },\n {\n \"role\": \"user\",\n \"content\": prompt.format(theorem=input[\"formal_statement\"]),\n },\n ]\n\n def format_output( # type: ignore\n self, output: Union[str, None], input: Dict[str, Any] = None\n ) -> Dict[str, Any]: # type: ignore\n import re\n\n match = re.search(_PARSE_DEEPSEEK_PROVER_AUTOFORMAL_REGEX, output, re.DOTALL)\n if match:\n match = match.group(1).strip()\n return {\"proof\": match}\n\n\nexamples = [\n dedent(\"\"\"\n ## Statement in natural language:\n For real numbers k and x:\n If x is equal to (13 - \u221a131) / 4, and\n If the equation 2x\u00b2 - 13x + k = 0 is satisfied,\n Then k must be equal to 19/4.\n ## Formalized:\n theorem mathd_algebra_116 (k x : \u211d) (h\u2080 : x = (13 - Real.sqrt 131) / 4)\n (h\u2081 : 2 * x ^ 2 - 13 * x + k = 0) : k = 19 / 4 :=\"\"\"),\n dedent(\"\"\"\n ## Statement in natural language:\n The greatest common divisor (GCD) of 20 factorial (20!) and 200,000 is equal to 40,000.\n ## Formalized:\n theorem mathd_algebra_116 (k x : \u211d) (h\u2080 : x = (13 - Real.sqrt 131) / 4)\n (h\u2081 : 2 * x ^ 2 - 13 * x + k = 0) : k = 19 / 4 :=\"\"\"),\n dedent(\"\"\"\n ## Statement in natural language:\n Given two integers x and y:\n If y is positive (greater than 0),\n And y is less than x,\n And the equation x + y + xy = 80 is true,\n Then x must be equal to 26.\n ## Formalized:\n theorem mathd_algebra_116 (k x : \u211d) (h\u2080 : x = (13 - Real.sqrt 131) / 4)\n (h\u2081 : 2 * x ^ 2 - 13 * x + k = 0) : k = 19 / 4 :=\"\"\"),\n]\n\n\nwith Pipeline(name=\"test_deepseek_prover\") as pipeline:\n data_loader = LoadDataFromHub(\n repo_id=\"plaguss/informal-mathematical-statements-tiny\",\n split=\"val\",\n batch_size=8,\n )\n\n llm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n )\n auto_formalization = DeepSeekProverAutoFormalization(\n name=\"auto_formalization\", input_batch_size=8, llm=llm, examples=examples\n )\n prover_scorer = DeepSeekProverScorer(\n name=\"prover_scorer\",\n input_batch_size=8,\n llm=llm,\n )\n proof_generator = DeepSeekProverSolver(\n name=\"proof_generator\", input_batch_size=8, llm=llm\n )\n\n (data_loader >> auto_formalization >> prover_scorer >> proof_generator)\n\n\nif __name__ == \"__main__\":\n import argparse\n\n parser = argparse.ArgumentParser()\n parser.add_argument(\n \"-d\",\n \"--dry-run\",\n action=argparse.BooleanOptionalAction,\n help=\"Do a dry run for testing purposes.\",\n )\n args = parser.parse_args()\n\n pipeline_parameters = {\n data_loader.name: {\"split\": \"val\"},\n auto_formalization.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.6,\n \"top_p\": 0.9,\n \"max_new_tokens\": 512,\n }\n }\n },\n prover_scorer.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.6,\n \"top_p\": 0.9,\n \"max_new_tokens\": 512,\n }\n }\n },\n 
}\n\n ds_name = \"test_deepseek_prover\"\n\n if args.dry_run:\n distiset = pipeline.dry_run(batch_size=1, parameters=pipeline_parameters)\n distiset.save_to_disk(Path.home() / f\"Downloads/{ds_name}\")\n\n import pprint\n\n pprint.pprint(distiset[\"default\"][\"train\"][0])\n\n else:\n distiset = pipeline.run(parameters=pipeline_parameters)\n distiset.push_to_hub(ds_name, include_script=True)\n
The script can be run as a dry run or not, depending on the argument (the pipeline will run without a dry run by default), and the dataset will be pushed to the Hub with the name your_username/test_deepseek_prover
:
python deepseek_prover.py [-d | --dry-run | --no-dry-run]\n
Final dataset: plaguss/test_deepseek_prover.
"},{"location":"sections/pipeline_samples/papers/deita/","title":"DEITA","text":"DEITA (Data-Efficient Instruction Tuning for Alignment) studies an automatic data selection process by first quantifying the data quality based on complexity, quality and diversity. Second, select the best potential combination from an open-source dataset that would fit into the budget you allocate to tune your own LLM.
In most settings we cannot allocate unlimited resources for instruction-tuning LLMs. Therefore, the DEITA authors investigated how to select qualitative data for instruction tuning based on the principle of fewer, high-quality samples. Liu et al. tackle the issue of first defining good data and second identifying it while respecting an initial budget to instruct-tune your LLM.
The strategy utilizes LLMs to replace human effort in time-intensive data quality tasks on instruction-tuning datasets. DEITA introduces a way to measure data quality across three critical dimensions: complexity, quality and diversity.
Note that we are again looking at a dataset of instructions/responses, and that we are essentially reproducing the second step, where we learn how to optimize the responses to an instruction by comparing several possibilities.
"},{"location":"sections/pipeline_samples/papers/deita/#datasets-and-budget","title":"Datasets and budget","text":"We will dive deeper into the whole process. We will investigate each stage to efficiently select the final dataset used for supervised fine-tuning with a budget constraint. We will tackle technical challenges by explaining exactly how you would assess good data as presented in the paper.
As a reminder, we're looking for a strategy to automatically select good data for the instruction-tuning step when you want to fine-tune an LLM to your own use case taking into account a resource constraint. This means that you cannot blindly train a model on any data you encounter on the internet.
The DEITA authors assume that you have access to open-source datasets that fit your use case. This may not be the case entirely. But with open-source communities tackling many use cases, with projects such as BLOOM or AYA, it's likely that your use case will be tackled at some point. Furthermore, you could generate your own instruction/response pairs with methods such as self-generated instructions using distilabel. This tutorial assumes that we have a data pool with excessive samples for the project's cost constraint. In short, we aim to achieve adequate performance from fewer samples.
The authors claim that the subsample size \"correlates proportionally with the computation consumed in instruction tuning\". Hence, to a first approximation, reducing the sample size means reducing computation consumption, and so the total development cost. Following the paper's notation, we will associate the budget m with the number of instruction/response pairs that you can set depending on your real budget.
To match the experimental set-up, dataset X_sota is a meta-dataset combining major open-source datasets available to instruct-tune LLMs. This dataset is composed of ShareGPT (58k instruction/response pairs), UltraChat (105k instruction/response pairs) and WizardLM (143k instruction/response pairs). It sums to more than 300k instruction/response pairs. We aim to reduce the final subsample to 6k instruction/response pairs.
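Just to put the numbers above side by side (a quick sanity check of the selection ratio, nothing more):
# Size of the X_sota pool described above, in instruction/response pairs\npool_size = 58_000 + 105_000 + 143_000  # ShareGPT + UltraChat + WizardLM = 306,000 pairs\nbudget_m = 6_000  # target subsample size\nprint(f\"Keeping roughly {budget_m / pool_size:.1%} of the pool\")  # ~2.0%\n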
"},{"location":"sections/pipeline_samples/papers/deita/#setup-the-notebook-and-packages","title":"Setup the notebook and packages","text":"Let's prepare our dependencies:
pip install \"distilabel[openai,hf-transformers]>=1.0.0\"\npip install pynvml huggingface_hub argilla\n
Import distilabel:
from distilabel.models import TransformersLLM, OpenAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import ConversationTemplate, DeitaFiltering, ExpandColumns, LoadDataFromHub\nfrom distilabel.steps.tasks import ComplexityScorer, EvolInstruct, EvolQuality, GenerateEmbeddings, QualityScorer\n
Define the distilabel Pipeline and load the dataset from the Hugging Face Hub.
pipeline = Pipeline(name=\"DEITA\")\n\nload_data = LoadDataFromHub(\n name=\"load_data\", batch_size=100, output_mappings={\"prompt\": \"instruction\"}, pipeline=pipeline\n)\n
"},{"location":"sections/pipeline_samples/papers/deita/#evol-instruct-generate-instructions-with-an-llm","title":"EVOL-INSTRUCT: Generate Instructions with an LLM","text":"Evol-Instruct automates the creation of complex instruction data for training large language models (LLMs) by progressively rewriting an initial set of instructions into more complex forms. This generated data is then used to fine-tune a model named WizardLM.
Evaluations show that instructions from Evol-Instruct are superior to human-created ones, and WizardLM achieves performance close to or exceeding GPT3.5-turbo in many skills. In distilabel, we initialise each step of the data generation pipeline. Later, we'll connect them together.
evol_instruction_complexity = EvolInstruct(\n name=\"evol_instruction_complexity\",\n llm=OpenAILLM(model=\"gpt-3.5-turbo\"),\n num_evolutions=5,\n store_evolutions=True,\n generate_answers=True,\n include_original_instruction=True,\n pipeline=pipeline,\n)\n\nevol_instruction_complexity.load()\n\n_evolved_instructions = next(evol_instruction_complexity.process(\n ([{\"instruction\": \"How many fish are there in a dozen fish?\"}]))\n)\n\nprint(*_evolved_instructions, sep=\"\\n\")\n
Output:
( 1, 'How many fish are there in a dozen fish?')\n( 2, 'How many rainbow trout are there in a dozen rainbow trout?')\n( 3, 'What is the average weight in pounds of a dozen rainbow trout caught in a specific river in Alaska during the month of May?')\n
"},{"location":"sections/pipeline_samples/papers/deita/#evol-complexity-evaluate-complexity-of-generated-instructions","title":"EVOL COMPLEXITY: Evaluate complexity of generated instructions","text":"The second step is the evaluation of complexity for an instruction in a given instruction-response pair. Like EVOL-INSTRUCT, this method uses LLMs instead of humans to automatically improve instructions, specifically through their complexity. From any instruction-response pair, \\((I, R)\\), we first generate new instructions following the In-Depth Evolving Response. We generate more complex instructions through prompting, as explained by authors, by adding some constraints or reasoning steps. Let\\'s take an example from GPT-4-LLM which aims to generate observations by GPT-4 to instruct-tune LLMs with supervised fine-tuning. And, we have the instruction \\(instruction_0\\):
instruction_0 = \"Give three tips for staying healthy.\"\n
To make it more complex, you can use, as the authors did, some prompt templates to add constraints or deepen the instruction. They provided some prompts in the paper appendix. For instance, this one was used to add constraints:
PROMPT = \"\"\"I want you act as a Prompt Rewriter.\nYour objective is to rewrite a given prompt into a more complex version to\nmake those famous AI systems (e.g., ChatGPT and GPT4) a bit harder to handle.\nBut the rewritten prompt must be reasonable and must be understood and\nresponded by humans.\nYour rewriting cannot omit the non-text parts such as the table and code in\n#Given Prompt#:. Also, please do not omit the input in #Given Prompt#.\nYou SHOULD complicate the given prompt using the following method:\nPlease add one more constraints/requirements into #Given Prompt#\nYou should try your best not to make the #Rewritten Prompt# become verbose,\n#Rewritten Prompt# can only add 10 to 20 words into #Given Prompt#.\n\u2018#Given Prompt#\u2019, \u2018#Rewritten Prompt#\u2019, \u2018given prompt\u2019 and \u2018rewritten prompt\u2019\nare not allowed to appear in #Rewritten Prompt#\n#Given Prompt#:\n<Here is instruction>\n#Rewritten Prompt#:\n\"\"\"\n
Prompting this to an LLM, you automatically get a more complex instruction, called \\(instruction_1\\), from an initial instruction \\(instruction_0\\):
instruction_1 = \"Provide three recommendations for maintaining well-being, ensuring one focuses on mental health.\"\n
With sequences of evolved instructions, we use a further LLM to automatically rank and score them. We provide the 6 instructions at the same time; by providing all instructions together, we force the scoring model to look at minor complexity differences between evolved instructions, encouraging it to discriminate between them. Taking the example below, \\(instruction_0\\) and \\(instruction_1\\) could deserve the same score independently, but when compared together we would notice the slight difference that makes \\(instruction_1\\) more complex.
In distilabel
, we implement this like so:
instruction_complexity_scorer = ComplexityScorer(\n name=\"instruction_complexity_scorer\",\n llm=OpenAILLM(model=\"gpt-3.5-turbo\"),\n input_mappings={\"instructions\": \"evolved_instructions\"},\n pipeline=pipeline,\n)\n\nexpand_evolved_instructions = ExpandColumns(\n name=\"expand_evolved_instructions\",\n columns=[\"evolved_instructions\", \"answers\", \"scores\"],\n output_mappings={\n \"evolved_instructions\": \"evolved_instruction\",\n \"answers\": \"answer\",\n \"scores\": \"evol_instruction_score\",\n },\n pipeline=pipeline,\n)\n\ninstruction_complexity_scorer.load()\n\n_evolved_instructions = next(instruction_complexity_scorer.process(([{\"evolved_instructions\": [PROMPT + instruction_1]}])))\n\nprint(\"Original Instruction:\")\nprint(instruction_1)\nprint(\"\\nEvolved Instruction:\")\nprint(_evolved_instructions[0][\"evolved_instructions\"][0].split(\"#Rewritten Prompt#:\\n\")[1])\n
Output:
Original Instruction:\nProvide three recommendations for maintaining well-being, ensuring one focuses on mental health.\n\nEvolved Instruction:\nSuggest three strategies for nurturing overall well-being, with the stipulation that at least one explicitly addresses the enhancement of mental health, incorporating evidence-based practices.\n
"},{"location":"sections/pipeline_samples/papers/deita/#evol-quality-quality-evaluation","title":"EVOL-QUALITY: Quality Evaluation","text":"Now that we have scored the complexity of the instructions, we will focus on the quality of the responses. Similar to EVOL COMPLEXITY, the authors introduced EVOL QUALITY, a method based on LLMs, instead of humans, to automatically score the quality of the response.
From an instruction-response pair, \\((I, R)\\), the goal is to make the response evolve into a more helpful and relevant response. The key difference is that we need to also provide the first instruction to guide evolution. Let's take back our example from GPT-4-LLM.
Here we have the response \\(response_0\\) and its initial instruction \\(instruction_0\\):
instruction_0 = \"Give three tips for staying healthy.\"\nreponse_0 = \"1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases. 2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week. 3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.\"\n
Again the authors provided several prompts you could use to make your response evolve according to some guidelines. For example, this one was used to enrich the answer:
PROMPT = \"\"\"I want you to act as a Response Rewriter\nYour goal is to enhance the quality of the response given by an AI assistant\nto the #Given Prompt# through rewriting.\nBut the rewritten response must be reasonable and must be understood by humans.\nYour rewriting cannot omit the non-text parts such as the table and code in\n#Given Prompt# and #Given Response#. Also, please do not omit the input\nin #Given Prompt#.\nYou Should enhance the quality of the response using the following method:\nPlease make the Response more in-depth\nYou should try your best not to make the #Rewritten Response# become verbose,\n#Rewritten Response# can only add 10 to 20 words into #Given Response#.\n\u2018#Given Response#\u2019, \u2018#Rewritten Response#\u2019, \u2018given response\u2019 and \u2018rewritten response\u2019\nare not allowed to appear in #Rewritten Response#\n#Given Prompt#:\n<instruction_0>\n#Given Response#:\n<response_0>\n#Rewritten Response#:\n\"\"\"\n
Prompting this to an LLM, you will automatically get a more enriched response, called \\(response_1\\), from an initial response \\(response_0\\) and initial instruction \\(instruction_0\\):
evol_response_quality = EvolQuality(\n name=\"evol_response_quality\",\n llm=OpenAILLM(model=\"gpt-3.5-turbo\"),\n num_evolutions=5,\n store_evolutions=True,\n include_original_response=True,\n input_mappings={\n \"instruction\": \"evolved_instruction\",\n \"response\": \"answer\",\n },\n pipeline=pipeline,\n)\n\nevol_response_quality.load()\n\n_evolved_responses = next(evol_response_quality.process([{\"instruction\": PROMPT + instruction_0, \"response\": reponse_0}]))\n\nprint(\"Original Response:\")\nprint(reponse_0)\nprint(\"\\nEvolved Response:\")\nprint(*_evolved_responses[0]['evolved_responses'], sep=\"\\n\")\n
And now, as in EVOL COMPLEXITY, you iterate through this process and use different prompts to make your responses more relevant, helpful or creative. In the paper, they make 4 more iterations to get 5 evolved responses \\((R0, R1, R2, R3, R4)\\), which yields 5 different responses for one initial instruction at the end of this step.
response_quality_scorer = QualityScorer(\n name=\"response_quality_scorer\",\n llm=OpenAILLM(model=\"gpt-3.5-turbo\"),\n input_mappings={\n \"instruction\": \"evolved_instruction\",\n \"responses\": \"evolved_responses\",\n },\n pipeline=pipeline,\n)\n\nexpand_evolved_responses = ExpandColumns(\n name=\"expand_evolved_responses\",\n columns=[\"evolved_responses\", \"scores\"],\n output_mappings={\n \"evolved_responses\": \"evolved_response\",\n \"scores\": \"evol_response_score\",\n },\n pipeline=pipeline,\n)\n\nresponse_quality_scorer.load()\n\n_scored_responses = next(response_quality_scorer.process([{\"instruction\": PROMPT + instruction_0, \"responses\": _evolved_responses[0]['evolved_responses']}]))\n\nprint(\"Original Response:\")\nprint(reponse_0)\n\nprint(\"\\nScore, Evolved Response:\")\nprint(*zip(_scored_responses[0][\"scores\"], _evolved_responses[0]['evolved_responses']), sep=\"\\n\")\n
Output:
Original Response:\n1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases. 2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week. 3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.\n\nScore, Evolved Response:\n(4.0, 'Here are three essential tips for maintaining good health: \\n1. Prioritize regular exercise \\n2. Eat a balanced diet with plenty of fruits and vegetables \\n3. Get an adequate amount of sleep each night.')\n(2.0, 'Here are three effective strategies to maintain a healthy lifestyle.')\n(5.0, 'Here are three practical tips to maintain good health: Ensure a balanced diet, engage in regular exercise, and prioritize sufficient sleep. These practices support overall well-being.')\n
"},{"location":"sections/pipeline_samples/papers/deita/#improving-data-diversity","title":"Improving Data Diversity","text":"One main component of good data to instruct-tune LLMs is diversity. Real world data can often contain redundancy due repetitive and homogeneous data.
The authors of the DEITA paper tackle the challenge of ensuring data diversity in the instruction tuning LLMs to avoid the pitfalls of data redundancy that can lead to over-fitting or poor generalization. They propose an embedding-based method to filter data for diversity. This method, called Repr Filter, uses embeddings generated by the Llama 1 13B model to represent instruction-response pairs in a vector space. The diversity of a new data sample is assessed based on the cosine distance between its embedding and that of its nearest neighbor in the already selected dataset. If this distance is greater than a specified threshold, the sample is considered diverse and is added to the selection. This process prioritizes diversity by assessing each sample's contribution to the variety of the dataset until the data selection budget is met. This approach effectively maintains the diversity of the data used for instruction tuning, as demonstrated by the DEITA models outperforming or matching state-of-the-art models with significantly less training data. In this implementation of DEITA we use the hidden state of the last layer of the Llama 2 model to generate embeddings, instead of a sentence transformer model, because we found that it improved the diversity of the data selection.
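Conceptually, the Repr Filter boils down to a greedy pass over the pool once it has been sorted by the quality/complexity scores. The snippet below is a simplified, framework-free sketch of that selection loop, not the DeitaFiltering step itself; in the distilabel pipeline this logic is handled by the GenerateEmbeddings and DeitaFiltering steps configured below:
import numpy as np\n\n\ndef repr_filter(embeddings: np.ndarray, data_budget: int, threshold: float = 0.04) -> list:\n    # Greedy diversity selection: keep a sample only if the cosine distance to its\n    # nearest already-selected neighbour is above the threshold, until the budget is met.\n    # Rows of the embeddings matrix are assumed to be ordered by decreasing score.\n    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)\n    selected = []\n    for i, emb in enumerate(normed):\n        if not selected or 1.0 - np.max(normed[selected] @ emb) > threshold:\n            selected.append(i)\n        if len(selected) >= data_budget:\n            break\n    return selected\n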
generate_conversation = ConversationTemplate(\n name=\"generate_conversation\",\n input_mappings={\n \"instruction\": \"evolved_instruction\",\n \"response\": \"evolved_response\",\n },\n pipeline=pipeline,\n)\n\ngenerate_embeddings = GenerateEmbeddings(\n name=\"generate_embeddings\",\n llm=TransformersLLM(\n model=\"TinyLlama/TinyLlama-1.1B-Chat-v1.0\",\n device=\"cuda\",\n torch_dtype=\"float16\",\n ),\n input_mappings={\"text\": \"conversation\"},\n input_batch_size=5,\n pipeline=pipeline,\n)\n\ndeita_filtering = DeitaFiltering(name=\"deita_filtering\", pipeline=pipeline)\n
"},{"location":"sections/pipeline_samples/papers/deita/#build-the-distilabel-pipeline","title":"Build the \u2697 distilabel Pipeline
","text":"Now we're ready to build a distilabel
pipeline using the DEITA method:
load_data.connect(evol_instruction_complexity)\nevol_instruction_complexity.connect(instruction_complexity_scorer)\ninstruction_complexity_scorer.connect(expand_evolved_instructions)\nexpand_evolved_instructions.connect(evol_response_quality)\nevol_response_quality.connect(response_quality_scorer)\nresponse_quality_scorer.connect(expand_evolved_responses)\nexpand_evolved_responses.connect(generate_conversation)\ngenerate_conversation.connect(generate_embeddings)\ngenerate_embeddings.connect(deita_filtering)\n
Now we can run the pipeline. We use the step names to reference them in the pipeline configuration:
distiset = pipeline.run(\n parameters={\n \"load_data\": {\n \"repo_id\": \"distilabel-internal-testing/instruction-dataset-50\",\n \"split\": \"train\",\n },\n \"evol_instruction_complexity\": {\n \"llm\": {\"generation_kwargs\": {\"max_new_tokens\": 512, \"temperature\": 0.7}}\n },\n \"instruction_complexity_scorer\": {\n \"llm\": {\"generation_kwargs\": {\"temperature\": 0.0}}\n },\n \"evol_response_quality\": {\n \"llm\": {\"generation_kwargs\": {\"max_new_tokens\": 512, \"temperature\": 0.7}}\n },\n \"response_quality_scorer\": {\"llm\": {\"generation_kwargs\": {\"temperature\": 0.0}}},\n \"deita_filtering\": {\"data_budget\": 500, \"diversity_threshold\": 0.04},\n },\n use_cache=False,\n)\n
We can push the results to the Hugging Face Hub:
distiset.push_to_hub(\"distilabel-internal-testing/deita-colab\")\n
"},{"location":"sections/pipeline_samples/papers/deita/#results","title":"Results","text":"Again, to show the relevance of EVOL QUALITY method, the authors evaluated on the MT-bench models fine-tuned with different data selections according to how we defined quality responses according to an instruction. Each time they selected 6k data according to the quality score:
Credit: Liu et al. (2023)
The score is much better when selecting data with the EVOL QUALITY method than when we select randomly or by length (assuming a longer response is a better one). Nevertheless, the margin is thinner than the one we may have seen for the complexity score, and we'll discuss the strategy in a later part. Still, this strategy appears to improve the fine-tuning compared to the baselines, and now we're interested in mixing quality and complexity assessment with a diversity evaluation to find the right trade-off in our selection process.
"},{"location":"sections/pipeline_samples/papers/deita/#conclusion","title":"Conclusion","text":"In conclusion, if you are looking for some efficient method to align an open-source LLM to your business case with a constrained budget, the solutions provided by DEITA are really worth the shot. This data-centric approach enables one to focus on the content of the dataset to have the best results instead of \"just\" scaling the instruction-tuning with more, and surely less qualitative, data. In a nutshell, the strategy developed, through automatically scoring instructions-responses, aims to substitute the human preference step proprietary models such as GPT-4 have been trained with. There are a few improvements we could think about when it comes to how to select the good data, but it opens a really great way in instruct-tuning LLM with lower computational needs making the whole process intellectually relevant and more sustainable than most of the other methods. We'd be happy to help you out with aligning an LLM with your business case drawing inspiration from such a methodology.
"},{"location":"sections/pipeline_samples/papers/instruction_backtranslation/","title":"Instruction Backtranslation","text":"\"Self Alignment with Instruction Backtranslation\" presents a scalable method to build high-quality instruction following a language model by automatically labeling human-written text with corresponding instructions. Their approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high-quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model.
Their self-training approach assumes access to a base language model, a small amount of seed data, and a collection of unlabelled examples, e.g. a web corpus. The unlabelled data is a large, diverse set of human-written documents that includes writing about all manner of topics humans are interested in \u2013 but crucially is not paired with instructions.
A first key assumption is that there exists some subset of this very large human-written text that would be suitable as gold generations for some user instructions. A second key assumption is that they can predict instructions for these candidate gold answers that can be used as high-quality example pairs to train an instruction-following model.
Their overall process, called instruction backtranslation, performs two core steps:
Self-augment: Generate instructions for unlabelled data, i.e. the web corpus, to produce candidate training data of (instruction, output) pairs for instruction tuning.
Self-curate: Self-select high-quality demonstration examples as training data to finetune the base model to follow instructions. This approach is done iteratively where a better intermediate instruction-following model can improve on selecting data for finetuning in the next iteration.
This replication covers the self-curation step, i.e. the second step mentioned above, so as to be able to use the proposed prompting approach to rate the quality of the generated text, which can be either synthetically generated or real human-written text.
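As a made-up illustration of what this self-curation step produces per row (the column names match the pipeline below; the values and the rating itself are invented):
# Invented example row, for illustration only\ncurated_row = {\n    \"instruction\": \"Explain what a hash table is.\",\n    \"generation\": \"A hash table maps keys to values using a hash function...\",\n    \"score\": 4,\n    \"reason\": \"Accurate and mostly complete, but omits collision handling.\",\n}\n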
"},{"location":"sections/pipeline_samples/papers/instruction_backtranslation/#replication","title":"Replication","text":"To replicate the paper we will be using distilabel
and a smaller dataset created by the Hugging Face H4 team named HuggingFaceH4/instruction-dataset
for testing purposes.
To replicate Self Alignment with Instruction Backtranslation one will need to install distilabel
as follows:
pip install \"distilabel[hf-inference-endpoints,openai]>=1.0.0\"\n
And since we will be using InferenceEndpointsLLM
(installed via the extra hf-inference-endpoints
) we will need to deploy those in advance, either locally or in the Hugging Face Hub (alternatively, the serverless endpoints can also be used, but inference is usually slower and, since they are free, there's a limited quota), and set both the HF_TOKEN
(to use the InferenceEndpointsLLM
) and the OPENAI_API_KEY
environment variable value (to use the OpenAILLM
).
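For instance, before running the pipeline you could set both variables from Python (the values below are placeholders):
import os\n\n# Placeholder values: use your own Hugging Face and OpenAI credentials\nos.environ[\"HF_TOKEN\"] = \"hf_...\"\nos.environ[\"OPENAI_API_KEY\"] = \"sk-...\"\n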
LoadDataFromHub
: Generator Step to load a dataset from the Hugging Face Hub.TextGeneration
: Task to generate responses for a given instruction using an LLM.InferenceEndpointsLLM
: LLM that runs a model from an Inference Endpoint in the Hugging Face Hub.InstructionBacktranslation
: Task that generates a score and a reason for a response for a given instruction using the Self Alignment with Instruction Backtranslation prompt.OpenAILLM
: LLM that loads a model from OpenAI. We will now put the previously mentioned building blocks together to replicate Self Alignment with Instruction Backtranslation.
from distilabel.models import InferenceEndpointsLLM, OpenAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromHub, KeepColumns\nfrom distilabel.steps.tasks import InstructionBacktranslation, TextGeneration\n\n\nwith Pipeline(name=\"self-alignment-with-instruction-backtranslation\") as pipeline:\n load_hub_dataset = LoadDataFromHub(\n name=\"load_dataset\",\n output_mappings={\"prompt\": \"instruction\"},\n )\n\n text_generation = TextGeneration(\n name=\"text_generation\",\n llm=InferenceEndpointsLLM(\n base_url=\"<INFERENCE_ENDPOINT_URL>\",\n tokenizer_id=\"argilla/notus-7b-v1\",\n model_display_name=\"argilla/notus-7b-v1\",\n ),\n input_batch_size=10,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n\n instruction_backtranslation = InstructionBacktranslation(\n name=\"instruction_backtranslation\",\n llm=OpenAILLM(model=\"gpt-4\"),\n input_batch_size=10,\n output_mappings={\"model_name\": \"scoring_model\"},\n )\n\n keep_columns = KeepColumns(\n name=\"keep_columns\",\n columns=[\n \"instruction\",\n \"generation\",\n \"generation_model\",\n \"score\",\n \"reason\",\n \"scoring_model\",\n ],\n )\n\n load_hub_dataset >> text_generation >> instruction_backtranslation >> keep_columns\n
Then we need to call pipeline.run
with the runtime parameters so that the pipeline can be launched.
distiset = pipeline.run(\n parameters={\n load_hub_dataset.name: {\n \"repo_id\": \"HuggingFaceH4/instruction-dataset\",\n \"split\": \"test\",\n },\n text_generation.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 1024,\n \"temperature\": 0.7,\n },\n },\n },\n instruction_backtranslation.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 1024,\n \"temperature\": 0.7,\n },\n },\n },\n },\n)\n
Finally, we can optionally push the generated dataset, named Distiset
, to the Hugging Face Hub via the push_to_hub
method, so that each subset generated in the leaf steps is pushed to the Hub.
distiset.push_to_hub(\n \"instruction-backtranslation-instruction-dataset\",\n private=True,\n)\n
"},{"location":"sections/pipeline_samples/papers/math_shepherd/","title":"Create datasets to train a Process Reward Model using Math-Shepherd","text":"This example will introduce Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations, an innovative math process reward model (PRM) which assigns reward scores to each step of math problem solutions. Specifically, we will present a recipe to create datasets to train such models. The final sections contain 2 pipeline examples to run the pipeline depending with more or less resources.
"},{"location":"sections/pipeline_samples/papers/math_shepherd/#replica","title":"Replica","text":"Unlike traditional models that only look at final answers (Output Reward Models or ORM), this system evaluates each step of a mathematical solution and assigns reward scores to individual solution steps. Let's see the Figure 2 from the paper, which makes a summary of the labelling approach presented in their work.
In the traditional ORM approach, the annotation was done depending on the final outcome, while the Process Reward Model (PRM) allows labelling the different steps that lead to a solution, making for a richer set of information.
"},{"location":"sections/pipeline_samples/papers/math_shepherd/#steps-involved","title":"Steps involved","text":"MathShepherdGenerator
: This step is in charge of generating solutions for the instruction. Depending on the value set for the M
parameter, this step can be used to generate either the golden_solution
, to be used as a reference for the labeller, or the set of solutions
to be labelled. For the solutions
column we want some diversity, allowing the model to reach both good and bad solutions so that we have a representative sample for the labeller, which is why it may be better to use a \"weaker\" model.
MathShepherdCompleter
. This task does the job of the completer
in the paper, generating completions as presented in Figure 2, section 3.3.2. It doesn't generate a column on its own, but updates the steps generated in the solutions
column from the MathShepherdGenerator
, labelling them against the reference golden_solution
. So in order for this step to work, we need both of these columns in our dataset. Depending on the type of dataset, we may already have access to the golden_solution
, even if under a different name, but that's usually not the case for the solutions
.
FormatPRM
. This step does the auxiliary job of preparing the data to follow the format defined in the paper, with two columns: input
and label
. After running the MathShepherdCompleter
, we have raw data that can be formatted however the user wants. Using ExpandColumns
and this step, one can directly obtain the same format presented in the dataset shared in the paper: peiyi9979/Math-Shepherd.
For this example, just as in the original paper, we are using the openai/gsm8k dataset. We only need a dataset with instructions to be solved (in this case it corresponds to the question
column), and we can generate everything else using our predefined steps.
The pipeline uses openai/gsm8k
as a reference, but it can be applied to different datasets; keep in mind that the prompts can be adapted to the dataset at hand by tweaking the extra_rules
and few_shots
in each task:
from datasets import load_dataset\n\nfrom distilabel.steps.tasks import MathShepherdCompleter, MathShepherdGenerator, FormatPRM\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import CombineOutputs, ExpandColumns\n\nds_name = \"openai/gsm8k\"\n\nds = load_dataset(ds_name, \"main\", split=\"test\").rename_column(\"question\", \"instruction\").select(range(3)) # (1)\n\nwith Pipeline(name=\"Math-Shepherd\") as pipe:\n model_id_70B = \"meta-llama/Meta-Llama-3.1-70B-Instruct\"\n model_id_8B = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n\n llm_70B = InferenceEndpointsLLM(\n model_id=model_id_70B,\n tokenizer_id=model_id_70B,\n generation_kwargs={\"max_new_tokens\": 1024, \"temperature\": 0.6},\n )\n llm_8B = InferenceEndpointsLLM(\n model_id=model_id_8B,\n tokenizer_id=model_id_8B,\n generation_kwargs={\"max_new_tokens\": 2048, \"temperature\": 0.6},\n ) # (2)\n\n generator_golden = MathShepherdGenerator(\n name=\"golden_generator\",\n llm=llm_70B,\n ) # (3)\n generator = MathShepherdGenerator(\n name=\"generator\",\n llm=llm_8B,\n use_default_structured_output=True, # (9)\n M=5\n ) #\u00a0(4)\n completer = MathShepherdCompleter(\n name=\"completer\",\n llm=llm_8B,\n use_default_structured_output=True,\n N=4\n ) # (5)\n\n combine = CombineOutputs()\n\n expand = ExpandColumns(\n name=\"expand_columns\",\n columns=[\"solutions\"],\n split_statistics=True,\n ) #\u00a0(6)\n formatter = FormatPRM(name=\"format_prm\") # (7)\n\n [generator_golden, generator] >> combine >> completer >> expand >> formatter # (8)\n
We will use just 3 rows from the sample dataset and rename the \"question\" column to \"instruction\", to match the input expected by the MathShepherdGenerator
.
We will use 2 different LLMs, meta-llama/Meta-Llama-3.1-70B-Instruct
(a stronger model for the golden_solution
) and meta-llama/Meta-Llama-3.1-8B-Instruct
(a weaker one to generate candidate solutions, and the completions).
This MathShepherdGenerator
task, that uses the stronger model, will generate the golden_solution
for us, the step by step solution for the task.
Another MathShepherdGenerator
task, in this case using the weaker model, will generate candidate solutions
(M=5
in total).
Now the MathShepherdCompleter
task will generate n=4
completions for each step of each candidate solution in the solutions
column, and label them using the golden_solution
as shown in Figure 2 in the paper. This step will add the label (it uses [+ and -] tags following the implementation in the paper, but these values can be modified) to the solutions
column in place, instead of generating an additional column, but the intermediate completions won't be shown at the end.
The ExpandColumns
step expands the solutions column so that each solution is paired with its instruction; having set M=5, we now have 5 instruction-solution pairs per original row. We set the split_statistics
to True to ensure the distilabel_metadata
is split accordingly; otherwise the number of tokens for each solution would count as the tokens needed for the whole list of generated solutions. One can omit both this and the following step and process the data for training as preferred.
And finally, the FormatPRM
generates two columns: input
and label
, which prepare the data for training as presented in the original Math-Shepherd dataset.
Both the generator_golden
and generator
can be run in parallel as there's no dependency between them, and after that we combine the results and pass them to the completer
. Finally, we use the expand
and formatter
to prepare the data in the expected format to train the Process Reward Model as defined in the original paper.
Generate structured outputs to make them easier to parse; otherwise the models can often fail to produce an easily parseable list.
To see all the pieces in place, take a look at the full pipeline:
Runpython examples/pipe_math_shepherd.py\n
Full pipeline pipe_math_shepherd.py# Copyright 2023-present, Argilla, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom datasets import load_dataset\n\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import CombineOutputs, ExpandColumns\nfrom distilabel.steps.tasks import (\n FormatPRM,\n MathShepherdCompleter,\n MathShepherdGenerator,\n)\n\nds_name = \"openai/gsm8k\"\n\nds = (\n load_dataset(ds_name, \"main\", split=\"test\")\n .rename_column(\"question\", \"instruction\")\n .select(range(3))\n)\n\n\nwith Pipeline(name=\"Math-Shepherd\") as pipe:\n model_id_70B = \"meta-llama/Meta-Llama-3.1-70B-Instruct\"\n model_id_8B = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n\n llm_70B = InferenceEndpointsLLM(\n model_id=model_id_8B,\n tokenizer_id=model_id_8B,\n generation_kwargs={\"max_new_tokens\": 1024, \"temperature\": 0.5},\n )\n llm_8B = InferenceEndpointsLLM(\n model_id=model_id_8B,\n tokenizer_id=model_id_8B,\n generation_kwargs={\"max_new_tokens\": 2048, \"temperature\": 0.7},\n )\n\n generator_golden = MathShepherdGenerator(\n name=\"golden_generator\",\n llm=llm_70B,\n )\n generator = MathShepherdGenerator(\n name=\"generator\",\n llm=llm_8B,\n M=5,\n )\n completer = MathShepherdCompleter(name=\"completer\", llm=llm_8B, N=4)\n\n combine = CombineOutputs()\n\n expand = ExpandColumns(\n name=\"expand_columns\",\n columns=[\"solutions\"],\n split_statistics=True,\n )\n formatter = FormatPRM(name=\"format_prm\")\n [generator_golden, generator] >> combine >> completer >> expand >> formatter\n\n\nif __name__ == \"__main__\":\n distiset = pipe.run(use_cache=False, dataset=ds)\n distiset.push_to_hub(\"plaguss/test_math_shepherd_prm\")\n
The resulting dataset can be seen at: plaguss/test_math_shepherd_prm.
"},{"location":"sections/pipeline_samples/papers/math_shepherd/#pipeline-with-vllm-and-ray","title":"Pipeline with vLLM and ray","text":"This section contains an alternative way of running the pipeline with a bigger outcome. To showcase how to scale the pipeline, we are using for the 3 generating tasks Qwen/Qwen2.5-72B-Instruct, highly improving the final quality as it follows much closer the prompt given. Also, we are using vLLM
and 3 nodes (one per task in this case), to scale up the generation process.
from datasets import load_dataset\n\nfrom distilabel.models import vLLM\nfrom distilabel.steps import StepResources\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import CombineOutputs, ExpandColumns\nfrom distilabel.steps.tasks import (\n FormatPRM,\n MathShepherdCompleter,\n MathShepherdGenerator,\n)\n\nds_name = \"openai/gsm8k\"\n\nds = (\n load_dataset(ds_name, \"main\", split=\"test\")\n .rename_column(\"question\", \"instruction\")\n)\n\n\nwith Pipeline(name=\"Math-Shepherd\").ray() as pipe: # (1)\n\n model_id_72B = \"Qwen/Qwen2.5-72B-Instruct\"\n\n llm_72B = vLLM(\n model=model_id_72B,\n tokenizer=model_id_72B,\n extra_kwargs={\n \"tensor_parallel_size\": 8, # Number of GPUs per node\n \"max_model_len\": 2048,\n },\n generation_kwargs={\n \"temperature\": 0.5,\n \"max_new_tokens\": 4096,\n },\n )\n\n generator_golden = MathShepherdGenerator(\n name=\"golden_generator\",\n llm=llm_72B,\n input_batch_size=50,\n output_mappings={\"model_name\": \"model_name_golden_generator\"},\n resources=StepResources(replicas=1, gpus=8) # (2)\n )\n generator = MathShepherdGenerator(\n name=\"generator\",\n llm=llm_72B,\n input_batch_size=50,\n M=5,\n use_default_structured_output=True,\n output_mappings={\"model_name\": \"model_name_generator\"},\n resources=StepResources(replicas=1, gpus=8)\n )\n completer = MathShepherdCompleter(\n name=\"completer\", \n llm=llm_72B,\n N=8,\n use_default_structured_output=True,\n output_mappings={\"model_name\": \"model_name_completer\"},\n resources=StepResources(replicas=1, gpus=8)\n )\n\n combine = CombineOutputs()\n\n expand = ExpandColumns(\n name=\"expand_columns\",\n columns=[\"solutions\"],\n split_statistics=True,\n\n )\n formatter = FormatPRM(name=\"format_prm\", format=\"trl\") # (3)\n\n [generator_golden, generator] >> combine >> completer >> expand >> formatter\n\n\nif __name__ == \"__main__\":\n distiset = pipe.run(use_cache=False, dataset=ds, dataset_batch_size=50)\n if distiset:\n distiset.push_to_hub(\"plaguss/test_math_shepherd_prm_ray\")\n
Transform the pipeline to run using the ray
backend.
Assign the resources: the number of replicas is 1 as we want a single instance of the task per node, and the number of GPUs is 8, using a whole node. Given that we defined the script in the slurm file to use 3 nodes, this will use all 3 available nodes, with 8 GPUs for each of these tasks.
Prepare the columns in the format expected by TRL
for training.
Click to see the slurm file used to run the previous pipeline. It's our go-to slurm
file, using 3 8xH100 nodes.
#!/bin/bash\n#SBATCH --job-name=math-shepherd-test-ray\n#SBATCH --partition=hopper-prod\n#SBATCH --qos=normal\n#SBATCH --nodes=3\n#SBATCH --exclusive\n#SBATCH --ntasks-per-node=1\n#SBATCH --gpus-per-node=8\n#SBATCH --output=./logs/%x-%j.out\n#SBATCH --err=./logs/%x-%j.err\n#SBATCH --time=48:00:00\n\nset -ex\n\nmodule load cuda/12.1\n\necho \"SLURM_JOB_ID: $SLURM_JOB_ID\"\necho \"SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST\"\n\nsource .venv/bin/activate\n\n# Getting the node names\nnodes=$(scontrol show hostnames \"$SLURM_JOB_NODELIST\")\nnodes_array=($nodes)\n\n# Get the IP address of the head node\nhead_node=${nodes_array[0]}\nhead_node_ip=$(srun --nodes=1 --ntasks=1 -w \"$head_node\" hostname --ip-address)\n\n# Start Ray head node\nport=6379\nip_head=$head_node_ip:$port\nexport ip_head\necho \"IP Head: $ip_head\"\n\n# Generate a unique Ray tmp dir for the head node\nhead_tmp_dir=\"/tmp/ray_tmp_${SLURM_JOB_ID}_head\"\n\necho \"Starting HEAD at $head_node\"\nsrun --nodes=1 --ntasks=1 -w \"$head_node\" \\\n ray start --head --node-ip-address=\"$head_node_ip\" --port=$port \\\n --dashboard-host=0.0.0.0 \\\n --dashboard-port=8265 \\\n --temp-dir=\"$head_tmp_dir\" \\\n --block &\n\n# Give some time to head node to start...\nsleep 10\n\n# Start Ray worker nodes\nworker_num=$((SLURM_JOB_NUM_NODES - 1))\n\n# Start from 1 (0 is head node)\nfor ((i = 1; i <= worker_num; i++)); do\n node_i=${nodes_array[$i]}\n worker_tmp_dir=\"/tmp/ray_tmp_${SLURM_JOB_ID}_worker_$i\"\n echo \"Starting WORKER $i at $node_i\"\n srun --nodes=1 --ntasks=1 -w \"$node_i\" \\\n ray start --address \"$ip_head\" \\\n --temp-dir=\"$worker_tmp_dir\" \\\n --block &\n sleep 5\ndone\n\n# Give some time to the Ray cluster to gather info\nsleep 60\n\n# Finally submit the job to the cluster\nRAY_ADDRESS=\"http://$head_node_ip:8265\" ray job submit --working-dir pipeline -- python -u pipeline_math_shepherd_ray.py\n
Final dataset The resulting dataset can be seen at: plaguss/test_math_shepherd_prm_ray.
"},{"location":"sections/pipeline_samples/papers/prometheus/","title":"Prometheus 2","text":"\"Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models\" presents Prometheus 2, a new and more powerful evaluator LLM compared to Prometheus (its predecessor) presented in \"Prometheus: Inducing Fine-grained Evaluation Capability in Language Models\"; since GPT-4, as well as other proprietary LLMs, are commonly used to assess the quality of the responses for various LLMs, but there are concerns about transparency, controllability, and affordability, that motivate the need of open-source LLMs specialized in evaluations.
Existing open evaluator LMs exhibit critical shortcomings:
Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. Prometheus 2 is capable of processing both direct assessment and pair-wise ranking formats grouped with user-defined evaluation criteria.
Prometheus 2 was released in two variants:
prometheus-eval/prometheus-7b-v2.0
: fine-tuned on top of mistralai/Mistral-7B-Instruct-v0.2
prometheus-eval/prometheus-8x7b-v2.0
: fine-tuned on top of mistralai/Mixtral-8x7B-Instruct-v0.1
Both models have been fine-tuned for both direct assessment and pairwise ranking tasks, i.e. assessing the quality of a single isolated response for a given instruction (with or without a reference answer) and assessing the quality of one response against another for a given instruction (with or without a reference answer), respectively.
On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Their models, code, and data are all publicly available at prometheus-eval/prometheus-eval
.
Note
The section is named Replication
but in this case we're not replicating the Prometheus 2 paper per se, but rather showing how to use the PrometheusEval
task implemented within distilabel
to evaluate the quality of the responses from a given instruction using the Prometheus 2 model.
To showcase Prometheus 2 we will be using the PrometheusEval
task implemented in distilabel
and a smaller dataset created by the Hugging Face H4 team named HuggingFaceH4/instruction-dataset
for testing purposes.
To reproduce the code below, one will need to install distilabel
as follows:
pip install \"distilabel[vllm]>=1.1.0\"\n
Additionally, it's recommended to install Dao-AILab/flash-attention
to benefit from Flash Attention 2 speed ups during inference via vllm
.
pip install flash-attn --no-build-isolation\n
Note
The installation notes above assume that you are using a VM with one GPU accelerator with at least the required VRAM to fit prometheus-eval/prometheus-7b-v2.0
in bfloat16 (28GB); but if you have enough VRAM to fit their 8x7B model in bfloat16 (~90GB) you can use prometheus-eval/prometheus-8x7b-v2.0
instead.
LoadDataFromHub
: GeneratorStep
to load a dataset from the Hugging Face Hub.
PrometheusEval
: Task
that assesses the quality of a response for a given instruction using any of the Prometheus 2 models.
vLLM
: LLM
that loads a model from the Hugging Face Hub via vllm-project/vllm.Note
Since the Prometheus 2 models use a slightly different chat template than mistralai/Mistral-7B-Instruct-v0.2
, we need to set the chat_template
parameter to [INST] {{ messages[0]['content'] }}\\n{{ messages[1]['content'] }}[/INST]
so as to properly format the input for Prometheus 2.
(Optional) KeepColumns
: Task
that keeps only the specified columns in the dataset, used to remove the undesired columns.
We will now put the previously described building blocks together to see how Prometheus 2 can be used via distilabel
.
from distilabel.models import vLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import KeepColumns, LoadDataFromHub\nfrom distilabel.steps.tasks import PrometheusEval\n\nif __name__ == \"__main__\":\n with Pipeline(name=\"prometheus\") as pipeline:\n load_dataset = LoadDataFromHub(\n name=\"load_dataset\",\n repo_id=\"HuggingFaceH4/instruction-dataset\",\n split=\"test\",\n output_mappings={\"prompt\": \"instruction\", \"completion\": \"generation\"},\n )\n\n task = PrometheusEval(\n name=\"task\",\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]['content'] }}\\n{{ messages[1]['content'] }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"factual-validity\",\n reference=False,\n num_generations=1,\n group_generations=False,\n )\n\n keep_columns = KeepColumns(\n name=\"keep_columns\",\n columns=[\"instruction\", \"generation\", \"feedback\", \"result\", \"model_name\"],\n )\n\n load_dataset >> task >> keep_columns\n
Then we need to call pipeline.run
with the runtime parameters so that the pipeline can be launched.
distiset = pipeline.run(\n parameters={\n task.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 1024,\n \"temperature\": 0.7,\n },\n },\n },\n },\n)\n
Finally, we can optionally push the generated dataset, named Distiset
, to the Hugging Face Hub via the push_to_hub
method, so that each subset generated in the leaf steps is pushed to the Hub.
distiset.push_to_hub(\n \"instruction-dataset-prometheus\",\n private=True,\n)\n
"},{"location":"sections/pipeline_samples/papers/ultrafeedback/","title":"UltraFeedback","text":"UltraFeedback: Boosting Language Models with High-quality Feedback is a paper published by OpenBMB which proposes UltraFeedback
, a large-scale, fine-grained, diverse preference dataset, used for training powerful reward models and critic models.
UltraFeedback collects about 64k prompts from diverse resources (including UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, and FLAN), then they use these prompts to query multiple LLMs (commercial models, Llama models ranging from 7B to 70B, and non-Llama models) and generate four different responses for each prompt, resulting in a total of 256k samples, i.e. UltraFeedback rates four responses per OpenAI request.
To collect high-quality preference and textual feedback, they design a fine-grained annotation instruction, which contains four different aspects, namely instruction-following, truthfulness, honesty and helpfulness (even though within the paper they also mention a fifth one named verbalized calibration). Finally, GPT-4 is used to generate the ratings for the generated responses to the given prompt using the previously mentioned aspects.
"},{"location":"sections/pipeline_samples/papers/ultrafeedback/#replication","title":"Replication","text":"To replicate the paper we will be using distilabel
and a smaller dataset created by the Hugging Face H4 team named HuggingFaceH4/instruction-dataset
for testing purposes.
Also for testing purposes we will just show how to evaluate the generated responses for a given prompt using a new global aspect named overall-rating
defined by Argilla, which computes the average of the four aspects, so as to reduce the number of requests sent to OpenAI; note that all the aspects are implemented within distilabel
and can be used instead for a more faithful reproduction. Besides that we will generate three responses for each instruction using three LLMs selected from a pool of six: HuggingFaceH4/zephyr-7b-beta
, argilla/notus-7b-v1
, google/gemma-1.1-7b-it
, meta-llama/Meta-Llama-3-8B-Instruct
, HuggingFaceH4/zephyr-7b-gemma-v0.1
and mlabonne/UltraMerge-7B
.
To replicate UltraFeedback one will need to install distilabel
as follows:
pip install \"distilabel[argilla,openai,vllm]>=1.0.0\"\n
And since we will be using vllm
we will need to use a VM with at least 6 NVIDIA GPUs with at least 16GB of memory each to run the text generation, and set the OPENAI_API_KEY
environment variable value.
LoadDataFromHub
: Generator Step to load a dataset from the Hugging Face Hub.sample_n_steps
: Function to create a routing_batch_function
that samples n
downstream steps for each batch generated by the upstream step. This is the key to replicate the LLM pooling mechanism described in the paper.TextGeneration
: Task to generate responses for a given instruction using an LLM.vLLM
: LLM that loads a model from the Hugging Face Hub using vllm
.GroupColumns
: Task that combines multiple columns into a single one i.e. from string to list of strings. Useful when there are multiple parallel steps that are connected to the same node.UltraFeedback
: Task that generates ratings for the responses of a given instruction using the UltraFeedback prompt.OpenAILLM
: LLM that loads a model from OpenAI.KeepColumns
: Task to keep the desired columns while removing the not needed ones, as well as defining the order for those.PreferenceToArgilla
: Task to optionally push the generated dataset to Argilla to do some further analysis and human annotation. We will now put the previously mentioned building blocks together to replicate UltraFeedback.
from distilabel.models import OpenAILLM, vLLM\nfrom distilabel.pipeline import Pipeline, sample_n_steps\nfrom distilabel.steps import (\n GroupColumns,\n KeepColumns,\n LoadDataFromHub,\n PreferenceToArgilla,\n)\nfrom distilabel.steps.tasks import TextGeneration, UltraFeedback\n\nsample_three_llms = sample_n_steps(n=3)\n\n\nwith Pipeline(name=\"ultrafeedback-pipeline\") as pipeline:\n load_hub_dataset = LoadDataFromHub(\n name=\"load_dataset\",\n output_mappings={\"prompt\": \"instruction\"},\n batch_size=2,\n )\n\n text_generation_with_notus = TextGeneration(\n name=\"text_generation_with_notus\",\n llm=vLLM(model=\"argilla/notus-7b-v1\"),\n input_batch_size=2,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n text_generation_with_zephyr = TextGeneration(\n name=\"text_generation_with_zephyr\",\n llm=vLLM(model=\"HuggingFaceH4/zephyr-7b-gemma-v0.1\"),\n input_batch_size=2,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n text_generation_with_gemma = TextGeneration(\n name=\"text_generation_with_gemma\",\n llm=vLLM(model=\"google/gemma-1.1-7b-it\"),\n input_batch_size=2,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n text_generation_with_zephyr_gemma = TextGeneration(\n name=\"text_generation_with_zephyr_gemma\",\n llm=vLLM(model=\"HuggingFaceH4/zephyr-7b-gemma-v0.1\"),\n input_batch_size=2,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n text_generation_with_llama = TextGeneration(\n name=\"text_generation_with_llama\",\n llm=vLLM(model=\"meta-llama/Meta-Llama-3-8B-Instruct\"),\n input_batch_size=2,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n text_generation_with_ultramerge = TextGeneration(\n name=\"text_generation_with_ultramerge\",\n llm=vLLM(model=\"mlabonne/UltraMerge-7B\"),\n input_batch_size=2,\n output_mappings={\"model_name\": \"generation_model\"},\n )\n\n combine_columns = GroupColumns(\n name=\"combine_columns\",\n columns=[\"generation\", \"generation_model\"],\n output_columns=[\"generations\", \"generation_models\"],\n input_batch_size=2\n )\n\n ultrafeedback = UltraFeedback(\n name=\"ultrafeedback_openai\",\n llm=OpenAILLM(model=\"gpt-4-turbo-2024-04-09\"),\n aspect=\"overall-rating\",\n output_mappings={\"model_name\": \"ultrafeedback_model\"},\n )\n\n keep_columns = KeepColumns(\n name=\"keep_columns\",\n columns=[\n \"instruction\",\n \"generations\",\n \"generation_models\",\n \"ratings\",\n \"rationales\",\n \"ultrafeedback_model\",\n ],\n )\n\n (\n load_hub_dataset\n >> sample_three_llms\n >> [\n text_generation_with_notus,\n text_generation_with_zephyr,\n text_generation_with_gemma,\n text_generation_with_llama,\n text_generation_with_zephyr_gemma,\n text_generation_with_ultramerge\n ]\n >> combine_columns\n >> ultrafeedback\n >> keep_columns\n )\n\n # Optional: Push the generated dataset to Argilla, but will need to `pip install argilla` first\n # push_to_argilla = PreferenceToArgilla(\n # name=\"push_to_argilla\",\n # api_url=\"<ARGILLA_API_URL>\",\n # api_key=\"<ARGILLA_API_KEY>\", # type: ignore\n # dataset_name=\"ultrafeedback\",\n # dataset_workspace=\"admin\",\n # num_generations=2,\n # )\n # keep_columns >> push_to_argilla\n
Note
As we're using a relatively small dataset, we're setting a low batch_size
and input_batch_size
so we have more batches for the routing_batch_function
i.e. we will have more variety in the LLMs used to generate the responses. When using a large dataset, it's recommended to use a larger batch_size
and input_batch_size
to benefit from the vLLM
optimizations for larger batch sizes, which makes the pipeline execution faster.
Then we need to call pipeline.run
with the runtime parameters so that the pipeline can be launched.
distiset = pipeline.run(\n parameters={\n load_hub_dataset.name: {\n \"repo_id\": \"HuggingFaceH4/instruction-dataset\",\n \"split\": \"test\",\n },\n text_generation_with_notus.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 512,\n \"temperature\": 0.7,\n }\n },\n },\n text_generation_with_zephyr.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 512,\n \"temperature\": 0.7,\n }\n },\n },\n text_generation_with_gemma.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 512,\n \"temperature\": 0.7,\n }\n },\n },\n text_generation_with_llama.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 512,\n \"temperature\": 0.7,\n }\n },\n },\n text_generation_with_zephyr_gemma.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 512,\n \"temperature\": 0.7,\n }\n },\n },\n text_generation_with_ultramerge.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 512,\n \"temperature\": 0.7,\n }\n },\n },\n ultrafeedback.name: {\n \"llm\": {\n \"generation_kwargs\": {\n \"max_new_tokens\": 2048,\n \"temperature\": 0.7,\n }\n },\n },\n }\n)\n
Finally, we can optionally push the generated dataset, named Distiset
, to the Hugging Face Hub via the push_to_hub
method, so that each subset generated in the leaf steps is pushed to the Hub.
distiset.push_to_hub(\n \"ultrafeedback-instruction-dataset\",\n private=True,\n)\n
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/","title":"Synthetic data generation for fine-tuning custom retrieval and reranking models","text":"!pip install \"distilabel[hf-inference-endpoints]\"\n
!pip install \"sentence-transformers~=3.0\"\n
Let's make the needed imports:
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.steps import LoadDataFromHub\n\nfrom sentence_transformers import SentenceTransformer, CrossEncoder\nimport torch\n
You'll need an HF_TOKEN
to use the HF Inference Endpoints. Login to use it directly within this notebook.
import os\nfrom huggingface_hub import login\n\nlogin(token=os.getenv(\"HF_TOKEN\"), add_to_git_credential=True)\n
!pip install \"distilabel[argilla, hf-inference-endpoints]\"\n
Let's make the extra needed imports:
import argilla as rg\n
context = (\n\"\"\"\nThe text is a chunk from technical Python SDK documentation of Argilla.\nArgilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets.\nAlong with prose explanations, the text chunk may include code snippets and Python references.\n\"\"\"\n)\n
llm = InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n tokenizer_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n)\n\nwith Pipeline(name=\"generate\") as pipeline:\n load_dataset = LoadDataFromHub(\n num_examples=15,\n output_mappings={\"chunks\": \"anchor\"},\n )\n generate_retrieval_pairs = GenerateSentencePair(\n name=\"generate_retrieval_pairs\",\n triplet=True,\n hard_negative=True,\n action=\"query\",\n llm=llm,\n input_batch_size=10,\n context=context,\n )\n generate_reranking_pairs = GenerateSentencePair(\n name=\"generate_reranking_pairs\",\n triplet=True,\n hard_negative=False, # to potentially generate non-relevant pairs\n action=\"semantically-similar\",\n llm=llm,\n input_batch_size=10,\n context=context,\n )\n\n load_dataset.connect(generate_retrieval_pairs, generate_reranking_pairs)\n
Next, we can execute this using pipeline.run
. We will provide some parameters
to specific components within our pipeline.
generation_kwargs = {\n \"llm\": {\n \"generation_kwargs\": {\n \"temperature\": 0.7,\n \"max_new_tokens\": 512,\n }\n }\n}\n\ndistiset = pipeline.run( \n parameters={\n load_dataset.name: {\n \"repo_id\": \"plaguss/argilla_sdk_docs_raw_unstructured\",\n \"split\": \"train\",\n },\n generate_retrieval_pairs.name: generation_kwargs,\n generate_reranking_pairs.name: generation_kwargs,\n },\n use_cache=False, # False for demo\n)\n
Data generation can be expensive, so it is recommended to store the data somewhere. For now, we will store it on the Hugging Face Hub, using our push_to_hub
method.
distiset.push_to_hub(\"[your-owner-name]/example-retrieval-reranking-dataset\")\n
We have got 2 different leaf/end nodes, and therefore 2 different configurations we can access in the resulting distiset: one for the retrieval data and one for the reranking data.
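For instance, we can take a quick look at the first generated example of each subset (using the step names defined above):
print(distiset[\"generate_retrieval_pairs\"][\"train\"][0])\nprint(distiset[\"generate_reranking_pairs\"][\"train\"][0])\n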
Looking at these initial examples, we can see they nicely capture the essence of the chunks
column but we will need to evaluate the quality of the data a bit more before we can use it for fine-tuning.
model_id = \"Snowflake/snowflake-arctic-embed-m\" # Hugging Face model ID\n\nmodel_retrieval = SentenceTransformer(\n model_id, device=\"cuda\" if torch.cuda.is_available() else \"cpu\"\n)\n
Next, we will encode the generated text pairs and compute the similarities.
from sklearn.metrics.pairwise import cosine_similarity\n\ndef get_embeddings(texts):\n vectors = model_retrieval.encode(texts)\n return [vector.tolist() for vector in vectors]\n\n\ndef get_similarities(vector_batch_a, vector_batch_b):\n similarities = []\n for vector_a, vector_b in zip(vector_batch_a, vector_batch_b):\n similarity = cosine_similarity([vector_a], [vector_b])[0][0]\n similarities.append(similarity)\n return similarities\n\ndef format_data_retriever(batch):# -> Any:\n batch[\"anchor-vector\"] = get_embeddings(batch[\"anchor\"])\n batch[\"positive-vector\"] = get_embeddings(batch[\"positive\"])\n batch[\"negative-vector\"] = get_embeddings(batch[\"negative\"]) \n batch[\"similarity-positive-negative\"] = get_similarities(batch[\"positive-vector\"], batch[\"negative-vector\"])\n batch[\"similarity-anchor-positive\"] = get_similarities(batch[\"anchor-vector\"], batch[\"positive-vector\"])\n batch[\"similarity-anchor-negative\"] = get_similarities(batch[\"anchor-vector\"], batch[\"negative-vector\"])\n return batch\n\ndataset_generate_retrieval_pairs = distiset[\"generate_retrieval_pairs\"][\"train\"].map(format_data_retriever, batched=True, batch_size=250)\n
model_id = \"sentence-transformers/all-MiniLM-L12-v2\"\n\nmodel = CrossEncoder(model_id)\n
Next, we will compute the similarity for the generated text pairs using the reranker. On top of that, we will compute an anchor-vector
to allow for doing semantic search.
def format_data_reranking(batch):  # compute CrossEncoder relevance scores over the raw text pairs\n    batch[\"anchor-vector\"] = get_embeddings(batch[\"anchor\"])\n    batch[\"similarity-positive-negative\"] = model.predict(list(zip(batch[\"positive\"], batch[\"negative\"])))\n    batch[\"similarity-anchor-positive\"] = model.predict(list(zip(batch[\"anchor\"], batch[\"positive\"])))\n    batch[\"similarity-anchor-negative\"] = model.predict(list(zip(batch[\"anchor\"], batch[\"negative\"])))\n    return batch\n\ndataset_generate_reranking_pairs = distiset[\"generate_reranking_pairs\"][\"train\"].map(format_data_reranking, batched=True, batch_size=250)\n
And voila, we have our proxies for quality evaluation which we can use to filter out the best and worst examples.
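As an illustration (this filtering rule is not part of the original tutorial), one possible heuristic is to keep only the triplets where the anchor is clearly closer to the positive than to the negative example; the 0.1 margin below is arbitrary:
filtered_retrieval_pairs = dataset_generate_retrieval_pairs.filter(\n    lambda row: row[\"similarity-anchor-positive\"] - row[\"similarity-anchor-negative\"] > 0.1\n)\n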
First, we need to define the setting for our Argilla dataset. We will create two different datasets, one for the retrieval data and one for the reranking data to ensure our annotators can focus on the task at hand.
import argilla as rg\nfrom argilla._exceptions import ConflictError\n\napi_key = \"ohh so secret\"\napi_url = \"https://[your-owner-name]-[your-space-name].hf.space\"\n\nclient = rg.Argilla(api_url=api_url, api_key=api_key)\n\nsettings = rg.Settings(\n fields=[\n rg.TextField(\"anchor\")\n ],\n questions=[\n rg.TextQuestion(\"positive\"),\n rg.TextQuestion(\"negative\"),\n rg.LabelQuestion(\n name=\"is_positive_relevant\",\n title=\"Is the positive query relevant?\",\n labels=[\"yes\", \"no\"],\n ),\n rg.LabelQuestion(\n name=\"is_negative_irrelevant\",\n title=\"Is the negative query irrelevant?\",\n labels=[\"yes\", \"no\"],\n )\n ],\n metadata=[\n rg.TermsMetadataProperty(\"filename\"),\n rg.FloatMetadataProperty(\"similarity-positive-negative\"),\n rg.FloatMetadataProperty(\"similarity-anchor-positive\"),\n rg.FloatMetadataProperty(\"similarity-anchor-negative\"),\n ],\n vectors=[\n rg.VectorField(\"anchor-vector\", dimensions=model.get_sentence_embedding_dimension())\n ]\n)\nrg_datasets = []\nfor dataset_name in [\"generate_retrieval_pairs\", \"generate_reranking_pairs\"]:\n ds = rg.Dataset(\n name=dataset_name,\n settings=settings\n )\n try:\n ds.create()\n except ConflictError:\n ds = client.datasets(dataset_name)\n rg_datasets.append(ds)\n
Now that we've got our dataset definitions set up in Argilla, we can upload our data.
ds_datasets = [dataset_generate_retrieval_pairs, dataset_generate_reranking_pairs]\n\nrecords = []\n\nfor rg_dataset, ds_dataset in zip(rg_datasets, ds_datasets):\n for idx, entry in enumerate(ds_dataset):\n records.append(\n rg.Record(\n id=idx,\n fields={\"anchor\": entry[\"anchor\"]},\n suggestions=[\n rg.Suggestion(\"positive\", value=entry[\"positive\"], agent=\"gpt-4o\", type=\"model\"),\n rg.Suggestion(\"negative\", value=entry[\"negative\"], agent=\"gpt-4o\", type=\"model\"),\n ],\n metadata={\n \"filename\": entry[\"filename\"],\n \"similarity-positive-negative\": entry[\"similarity-positive-negative\"],\n \"similarity-anchor-positive\": entry[\"similarity-anchor-positive\"],\n \"similarity-anchor-negative\": entry[\"similarity-anchor-negative\"]\n },\n vectors={\"anchor-vector\": entry[\"anchor-vector\"]}\n )\n )\n rg_dataset.records.log(records)\n
Now, we can explore the UI and add a final human touch to get the most out of our dataset.
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#synthetic-data-generation-for-fine-tuning-custom-retrieval-and-reranking-models","title":"Synthetic data generation for fine-tuning custom retrieval and reranking models","text":"Note
For a comprehensive overview on optimizing the retrieval performance in a RAG pipeline, check this guide in collaboration with ZenML, an open-source MLOps framework designed for building portable and production-ready machine learning pipelines.
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#getting-started","title":"Getting started","text":""},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#install-the-dependencies","title":"Install the dependencies","text":"To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will be using the free but rate-limited Hugging Face serverless Inference API for this tutorial, so we need to install this as an extra distilabel dependency. You can install them by running the following command:
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#optional-deploy-argilla","title":"(optional) Deploy Argilla","text":"You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from a lack of data quality, so we do recommend looking at your data. If you already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following this guide.
Along with that, you will need to install Argilla as a distilabel extra.
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#the-dataset","title":"The dataset","text":"Before starting any project, it is always important to look at your data. Our data is publicly available on the Hugging Face Hub so we can have a quick look through their dataset viewer within an embedded iFrame.
As we can see, our dataset contains a column called chunks
, which was obtained from the Argilla docs. Normally, you would need to download and chunk the data but we will not cover that in this tutorial. To read a full explanation for how this dataset was generated, please refer to How we leveraged distilabel to create an Argilla 2.0 Chatbot.
Alternatively, we can load the entire dataset to disk with datasets.load_dataset
.
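A minimal sketch of doing so, assuming the same repository id that we pass to the pipeline parameters later on:
from datasets import load_dataset\n\nds = load_dataset(\"plaguss/argilla_sdk_docs_raw_unstructured\", split=\"train\")\n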
The GenerateSentencePair
component from distilabel
can be used to generate training datasets for embeddings models.
It is a pre-defined Task
that, given an anchor
sentence, generates data for a specific action
. Supported actions are: \"paraphrase\", \"semantically-similar\", \"query\", \"answer\"
. In our case the chunks
column corresponds to the anchor
. This means we will use query
to generate potential queries for fine-tuning a retrieval model and that we will use semantically-similar
to generate texts that are similar to the initial anchor for fine-tuning a reranking model.
We will set triplet=True
in order to generate both positive and negative examples, which should help the model generalize better during fine-tuning and we will set hard_negative=True
to generate more challenging examples that are closer to the anchor and discussed topics.
Lastly, we can seed the LLM with context
to generate more relevant examples.
For retrieval, we will thus generate queries that are similar to the chunks
column. We will use the query
action to generate potential queries for fine-tuning a retrieval model.
generate_sentence_pair = GenerateSentencePair(\n triplet=True, \n hard_negative=True,\n action=\"query\",\n llm=llm,\n input_batch_size=10,\n context=context,\n)\n
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#reranking","title":"Reranking","text":"For reranking, we will generate texts that are similar to the intial anchor. We will use the semantically-similar
action for fine-tuning a reranking model. In this case, we set hard_negative=False
to generate more diverse and potentially wrong examples, which can be used as negative examples for similarity fine-tuning because rerankers cannot be fine-tuned using triplets.
generate_sentence_pair = GenerateSentencePair(\n triplet=True,\n hard_negative=False,\n action=\"semantically-similar\",\n llm=llm,\n input_batch_size=10,\n context=context,\n)\n
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#combined-pipeline","title":"Combined pipeline","text":"We will now use the GenerateSentencePair
task to generate synthetic data for both retrieval and reranking models in a single pipeline. Note that we map the chunks
column to the anchor
argument.
Data is never as clean as it can be, and this holds for synthetically generated data too; therefore, it is always good to spend some time looking at your data.
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#feature-engineering","title":"Feature engineering","text":"In order to evaluate the quality of our data we will use features of the models that we intent to fine-tune as proxy for data quality. We can then use these features to filter out the best examples.
In order to choose a good default model, we will use the Massive Text Embedding Benchmark (MTEB) Leaderboard. We want to optimize for size and speed, so we will set model size <100M
and then filter for Retrieval
and Reranking
based on the highest average score, resulting in Snowflake/snowflake-arctic-embed-s and sentence-transformers/all-MiniLM-L12-v2 respectively.
For retrieval, we will compute similarities for the current embeddings of anchor-positive
, positive-negative
and anchor-negative
pairs. We assume that an overlap of these similarities will cause the model to have difficulties generalizing and therefore we can use these features to evaluate the quality of our data.
For reranking, we will compute the relevance scores from an existing reranker model for anchor-positive
, positive-negative
and anchor-negative
pairs and make a similar assumption as for the retrieval model.
To get the most out of our data and actually look at it, we will use Argilla. If you are not familiar with Argilla, we recommend taking a look at the Argilla quickstart docs. Alternatively, you can use your Hugging Face account to login to the Argilla demo Space.
To start exploring data, we first need to define an argilla.Dataset
. We will create a basic dataset with some input TextFields
for the anchor
and output TextQuestions
for the positive
and negative
pairs. Additionally, we will use the file_name
as MetaDataProperty
. Lastly, we will be re-using the vectors obtained from our previous step to allow for semantic search and we will add the similarity scores for some basic filtering and sorting.
At last, we can fine-tune our models. We will use the sentence-transformers
library to fine-tune our models.
For retrieval, we have created a script that fine-tunes a model on our generated data, based on https://github.com/argilla-io/argilla-sdk-chatbot/blob/main/train_embedding.ipynb. You can also open it in Google Colab directly.
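As a rough orientation only (this is not the referenced training script), a minimal fine-tuning sketch with sentence-transformers>=3.0 could reuse the triplets computed earlier; the loss choice and the absence of an evaluator are simplifications:
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer\nfrom sentence_transformers.losses import TripletLoss\n\n# Keep only the columns needed for triplet training\ntrain_dataset = dataset_generate_retrieval_pairs.select_columns([\"anchor\", \"positive\", \"negative\"])\nmodel = SentenceTransformer(\"Snowflake/snowflake-arctic-embed-m\")\ntrainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=TripletLoss(model))\ntrainer.train()\n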
"},{"location":"sections/pipeline_samples/tutorials/GenerateSentencePair/#reranking_2","title":"Reranking","text":"For reranking, sentence-transformers
provides a script that shows how to fine-tune CrossEncoder models. As of now, there is some uncertainty over fine-tuning CrossEncoder models with triplets, but you can still use the positive
and anchor pairs.
In this tutorial, we present an end-to-end example of fine-tuning retrievers and rerankers for RAG. This serves as a good starting point for optimizing and maintaining your data and model but needs to be adapted to your specific use case.
We started with some seed data from the Argilla docs, generated synthetic data for retrieval and reranking models, evaluated the quality of the data, and showed how to fine-tune the models. We also used Argilla to get a human touch on the data.
"},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/","title":"Clean an existing preference dataset","text":"!pip install \"distilabel[hf-inference-endpoints]\"\n
!pip install \"transformers~=4.0\" \"torch~=2.0\"\n
Let's make the required imports:
import random\n\nfrom datasets import load_dataset\n\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import (\n KeepColumns,\n LoadDataFromDicts,\n PreferenceToArgilla,\n)\nfrom distilabel.steps.tasks import UltraFeedback\n
You'll need an HF_TOKEN
to use the HF Inference Endpoints. Login to use it directly within this notebook.
import os\nfrom huggingface_hub import login\n\nlogin(token=os.getenv(\"HF_TOKEN\"), add_to_git_credential=True)\n
!pip install \"distilabel[argilla, hf-inference-endpoints]\"\n
In this case, we will clean a preference dataset, so we will use the Intel/orca_dpo_pairs
dataset from the Hugging Face Hub.
dataset = load_dataset(\"Intel/orca_dpo_pairs\", split=\"train[:20]\")\n
Next, we will shuffle the chosen
and rejected
columns to avoid any bias in the dataset.
def shuffle_and_track(chosen, rejected):\n pair = [chosen, rejected]\n random.shuffle(pair)\n order = [\"chosen\" if x == chosen else \"rejected\" for x in pair]\n return {\"generations\": pair, \"order\": order}\n\ndataset = dataset.map(lambda x: shuffle_and_track(x[\"chosen\"], x[\"rejected\"]))\n
dataset = dataset.to_list()\n
As a custom step You can also create a custom step in a separate module, import it and add it to the pipeline after loading the orca_dpo_pairs
dataset using the LoadDataFromHub
step.
from typing import TYPE_CHECKING, List\nfrom distilabel.steps import GlobalStep, StepInput\n\nif TYPE_CHECKING:\n from distilabel.steps.typing import StepOutput\n\nimport random\n\nclass ShuffleStep(GlobalStep):\n @property\n def inputs(self):\n \"\"\"Returns List[str]: The inputs of the step.\"\"\"\n return [\"instruction\", \"chosen\", \"rejected\"]\n\n @property\n def outputs(self):\n \"\"\"Returns List[str]: The outputs of the step.\"\"\"\n return [\"instruction\", \"generations\", \"order\"]\n\n def process(self, inputs: StepInput):\n \"\"\"Returns StepOutput: The outputs of the step.\"\"\"\n outputs = []\n\n for input in inputs:\n chosen = input[\"chosen\"]\n rejected = input[\"rejected\"]\n pair = [chosen, rejected]\n random.shuffle(pair)\n order = [\"chosen\" if x == chosen else \"rejected\" for x in pair]\n\n outputs.append({\"instruction\": input[\"instruction\"], \"generations\": pair, \"order\": order})\n\n yield outputs\n
from shuffle_step import ShuffleStep\n
To clean an existing preference dataset, we will need to define a Pipeline
with all the necessary steps. However, a similar workflow can be used to clean an SFT dataset. Below, we will go over each step in detail.
load_dataset = LoadDataFromDicts(\n data=dataset[:1],\n output_mappings={\"question\": \"instruction\"},\n pipeline=Pipeline(name=\"showcase-pipeline\"),\n)\nload_dataset.load()\nnext(load_dataset.process())\n
\n([{'system': '',\n 'question': \"You will be given a definition of a task first, then some input of the task.\\nThis task is about using the specified sentence and converting the sentence to Resource Description Framework (RDF) triplets of the form (subject, predicate object). The RDF triplets generated must be such that the triplets accurately capture the structure and semantics of the input sentence. The input is a sentence and the output is a list of triplets of the form [subject, predicate, object] that capture the relationships present in the sentence. When a sentence has more than 1 RDF triplet possible, the output must contain all of them.\\n\\nAFC Ajax (amateurs)'s ground is Sportpark De Toekomst where Ajax Youth Academy also play.\\nOutput:\",\n 'chosen': '[\\n [\"AFC Ajax (amateurs)\", \"has ground\", \"Sportpark De Toekomst\"],\\n [\"Ajax Youth Academy\", \"plays at\", \"Sportpark De Toekomst\"]\\n]',\n 'rejected': \" Sure, I'd be happy to help! Here are the RDF triplets for the input sentence:\\n\\n[AFC Ajax (amateurs), hasGround, Sportpark De Toekomst]\\n[Ajax Youth Academy, playsAt, Sportpark De Toekomst]\\n\\nExplanation:\\n\\n* AFC Ajax (amateurs) is the subject of the first triplet, and hasGround is the predicate that describes the relationship between AFC Ajax (amateurs) and Sportpark De Toekomst.\\n* Ajax Youth Academy is the subject of the second triplet, and playsAt is the predicate that describes the relationship between Ajax Youth Academy and Sportpark De Toekomst.\\n\\nNote that there may be other possible RDF triplets that could be derived from the input sentence, but the above triplets capture the main relationships present in the sentence.\",\n 'generations': [\" Sure, I'd be happy to help! Here are the RDF triplets for the input sentence:\\n\\n[AFC Ajax (amateurs), hasGround, Sportpark De Toekomst]\\n[Ajax Youth Academy, playsAt, Sportpark De Toekomst]\\n\\nExplanation:\\n\\n* AFC Ajax (amateurs) is the subject of the first triplet, and hasGround is the predicate that describes the relationship between AFC Ajax (amateurs) and Sportpark De Toekomst.\\n* Ajax Youth Academy is the subject of the second triplet, and playsAt is the predicate that describes the relationship between Ajax Youth Academy and Sportpark De Toekomst.\\n\\nNote that there may be other possible RDF triplets that could be derived from the input sentence, but the above triplets capture the main relationships present in the sentence.\",\n '[\\n [\"AFC Ajax (amateurs)\", \"has ground\", \"Sportpark De Toekomst\"],\\n [\"Ajax Youth Academy\", \"plays at\", \"Sportpark De Toekomst\"]\\n]'],\n 'order': ['rejected', 'chosen']}],\n True)
\n
evaluate_responses = UltraFeedback(\n aspect=\"overall-rating\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n ),\n pipeline=Pipeline(name=\"showcase-pipeline\"),\n)\nevaluate_responses.load()\nnext(\n evaluate_responses.process(\n [\n {\n \"instruction\": \"What's the capital of Spain?\",\n \"generations\": [\"Madrid\", \"Barcelona\"],\n }\n ]\n )\n)\n
\n[{'instruction': \"What's the capital of Spain?\",\n 'generations': ['Madrid', 'Barcelona'],\n 'ratings': [5, 1],\n 'rationales': [\"The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.\",\n \"The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent.\"],\n 'distilabel_metadata': {'raw_output_ultra_feedback_0': \"#### Output for Text 1\\nRating: 5 (Excellent)\\nRationale: The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.\\n\\n#### Output for Text 2\\nRating: 1 (Low Quality)\\nRationale: The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent.\"},\n 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
\n
keep_columns = KeepColumns(\n columns=[\n \"instruction\",\n \"generations\",\n \"order\",\n \"ratings\",\n \"rationales\",\n \"model_name\",\n ],\n pipeline=Pipeline(name=\"showcase-pipeline\"),\n)\nkeep_columns.load()\nnext(\n keep_columns.process(\n [\n {\n \"system\": \"\",\n \"instruction\": \"What's the capital of Spain?\",\n \"chosen\": \"Madrid\",\n \"rejected\": \"Barcelona\",\n \"generations\": [\"Madrid\", \"Barcelona\"],\n \"order\": [\"chosen\", \"rejected\"],\n \"ratings\": [5, 1],\n \"rationales\": [\"\", \"\"],\n \"model_name\": \"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n }\n ]\n )\n)\n
\n[{'instruction': \"What's the capital of Spain?\",\n 'generations': ['Madrid', 'Barcelona'],\n 'order': ['chosen', 'rejected'],\n 'ratings': [5, 1],\n 'rationales': ['', ''],\n 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
\n
to_argilla = PreferenceToArgilla(\n dataset_name=\"cleaned-dataset\",\n dataset_workspace=\"argilla\",\n api_url=\"https://[your-owner-name]-[your-space-name].hf.space\",\n api_key=\"[your-api-key]\",\n num_generations=2\n)\n
Below, you can see the full pipeline definition:
with Pipeline(name=\"clean-dataset\") as pipeline:\n\n load_dataset = LoadDataFromDicts(\n data=dataset, output_mappings={\"question\": \"instruction\"}\n )\n\n evaluate_responses = UltraFeedback(\n aspect=\"overall-rating\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n ),\n )\n\n keep_columns = KeepColumns(\n columns=[\n \"instruction\",\n \"generations\",\n \"order\",\n \"ratings\",\n \"rationales\",\n \"model_name\",\n ]\n )\n\n to_argilla = PreferenceToArgilla(\n dataset_name=\"cleaned-dataset\",\n dataset_workspace=\"argilla\",\n api_url=\"https://[your-owner-name]-[your-space-name].hf.space\",\n api_key=\"[your-api-key]\",\n num_generations=2,\n )\n\n load_dataset.connect(evaluate_responses)\n evaluate_responses.connect(keep_columns)\n keep_columns.connect(to_argilla)\n
Let's now run the pipeline and clean our preference dataset.
distiset = pipeline.run()\n
Let's check it! If you have loaded the data to Argilla, you can start annotating in the Argilla UI.
You can push the dataset to the Hub for sharing with the community and embed it to explore the data.
distiset.push_to_hub(\"[your-owner-name]/example-cleaned-preference-dataset\")\n
In this tutorial, we showcased the detailed steps to build a pipeline for cleaning a preference dataset using distilabel. However, you can customize this pipeline for your own use cases, such as cleaning an SFT dataset or adding custom steps.
We used a preference dataset as our starting point and shuffled the data to avoid any bias. Next, we evaluated the responses using a model through the serverless Hugging Face Inference API, following the UltraFeedback standards. Finally, we kept the needed columns and used Argilla for further curation.
"},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/#clean-an-existing-preference-dataset","title":"Clean an existing preference dataset","text":""},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/#getting-started","title":"Getting Started","text":""},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/#install-the-dependencies","title":"Install the dependencies","text":"To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will be using the free but rate-limited Hugging Face serverless Inference API for this tutorial, so we need to install this as an extra distilabel dependency. You can install them by running the following command:
"},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/#optional-deploy-argilla","title":"(optional) Deploy Argilla","text":"You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from a lack of data quality, so we do recommend looking at your data. If you already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following this guide.
Along with that, you will need to install Argilla as a distilabel extra.
"},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/#the-dataset","title":"The dataset","text":""},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/#define-the-pipeline","title":"Define the pipeline","text":""},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/#load-the-dataset","title":"Load the dataset","text":"We will use the dataset we just shuffled as source data.
Component: LoadDataFromDicts
Input columns: system, question, chosen, rejected, generations and order, the same keys as in the loaded list of dictionaries.
Output columns: system, instruction, chosen, rejected, generations and order. We will use output_mappings to rename the columns.
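As a standalone sketch mirroring the full pipeline definition shown earlier in this tutorial, the step can be instantiated as follows, where dataset is assumed to be the shuffled list of dictionaries prepared above.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\n\n# dataset is assumed to be the shuffled list of dictionaries prepared above.\nload_dataset = LoadDataFromDicts(\n    data=dataset,\n    output_mappings={\"question\": \"instruction\"},  # rename question to instruction\n    pipeline=Pipeline(name=\"showcase-pipeline\"),\n)\n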
To evaluate the quality of the responses, we will use meta-llama/Meta-Llama-3.1-70B-Instruct, applying the UltraFeedback task that judges the responses according to different dimensions (helpfulness, honesty, instruction-following, truthfulness). For an SFT dataset, you can use PrometheusEval instead.
Component: UltraFeedback task with LLMs using InferenceEndpointsLLM
Input columns: instruction, generations
Output columns: ratings, rationales, distilabel_metadata, model_name
For your use case and to improve the results, you can use any other LLM of your choice.
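For instance, here is a hedged sketch of swapping in a different judge model while keeping the same task definition; the model id is just another Serverless Inference API model used elsewhere in these tutorials.
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import UltraFeedback\n\n# Illustrative alternative judge model; any other LLM integration could be used instead.\nevaluate_responses = UltraFeedback(\n    aspect=\"overall-rating\",\n    llm=InferenceEndpointsLLM(\n        model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n        tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n        generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n    ),\n)\n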
"},{"location":"sections/pipeline_samples/tutorials/clean_existing_dataset/#keep-only-the-required-columns","title":"Keep only the required columns","text":"We will get rid of the unneeded columns.
Component: KeepColumns
Input columns: system, instruction, chosen, rejected, generations, ratings, rationales, distilabel_metadata and model_name
Output columns: instruction, generations, order, ratings, rationales and model_name, matching the columns kept in the code above
You can use Argilla to further curate your data.
Component: PreferenceToArgilla step
Input columns: instruction, generations, generation_models, ratings
Output columns: instruction, generations, generation_models, ratings
!pip install \"distilabel[hf-inference-endpoints]\"\n
!pip install \"transformers~=4.0\" \"torch~=2.0\"\n
Let's make the required imports:
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import (\n LoadDataFromHub,\n GroupColumns,\n FormatTextGenerationDPO,\n PreferenceToArgilla,\n)\nfrom distilabel.steps.tasks import TextGeneration, UltraFeedback\n
You'll need an HF_TOKEN to use the HF Inference Endpoints. Log in to use it directly within this notebook.
import os\nfrom huggingface_hub import login\n\nlogin(token=os.getenv(\"HF_TOKEN\"), add_to_git_credential=True)\n
!pip install \"distilabel[argilla, hf-inference-endpoints]\"\n
To generate our preference dataset, we will need to define a Pipeline with all the necessary steps. Below, we will go over each step in detail.
load_dataset = LoadDataFromHub(\n repo_id= \"argilla/10Kprompts-mini\",\n num_examples=1,\n pipeline=Pipeline(name=\"showcase-pipeline\"),\n )\nload_dataset.load()\nnext(load_dataset.process())\n
\n([{'instruction': 'How can I create an efficient and robust workflow that utilizes advanced automation techniques to extract targeted data, including customer information, from diverse PDF documents and effortlessly integrate it into a designated Google Sheet? Furthermore, I am interested in establishing a comprehensive and seamless system that promptly activates an SMS notification on my mobile device whenever a new PDF document is uploaded to the Google Sheet, ensuring real-time updates and enhanced accessibility.',\n 'topic': 'Software Development'}],\n True)
\n
generate_responses = [\n TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n ),\n pipeline=Pipeline(name=\"showcase-pipeline\"),\n ),\n TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n tokenizer_id=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n ),\n pipeline=Pipeline(name=\"showcase-pipeline\"),\n ),\n]\nfor task in generate_responses:\n task.load()\n print(next(task.process([{\"instruction\": \"Which are the top cities in Spain?\"}])))\n
\n[{'instruction': 'Which are the top cities in Spain?', 'generation': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\\n\\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Fam\u00edlia and Park G\u00fcell, designed by Antoni Gaud\u00ed.\\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alc\u00e1zar Palace, and lively flamenco music scene.\\n5. **M\u00e1laga**: A coastal city in southern Spain, M\u00e1laga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\\n9. **Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\\n10. **San Sebasti\u00e1n**: A city in the Basque Country, San Sebasti\u00e1n is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebasti\u00e1n International Film Festival.\\n\\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.', 'distilabel_metadata': {'raw_output_text_generation_0': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\\n\\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Fam\u00edlia and Park G\u00fcell, designed by Antoni Gaud\u00ed.\\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alc\u00e1zar Palace, and lively flamenco music scene.\\n5. **M\u00e1laga**: A coastal city in southern Spain, M\u00e1laga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\\n9. 
**Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\\n10. **San Sebasti\u00e1n**: A city in the Basque Country, San Sebasti\u00e1n is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebasti\u00e1n International Film Festival.\\n\\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.'}, 'model_name': 'meta-llama/Meta-Llama-3-8B-Instruct'}]\n[{'instruction': 'Which are the top cities in Spain?', 'generation': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\\n\\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\\n\\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaud\u00ed, including the Sagrada Familia and Park G\u00fcell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\\n\\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\\n\\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alc\u00e1zar, a beautiful palace made up of a series of rooms and courtyards.\\n\\n5. Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\\n\\n6. Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\\n\\n7. M\u00e1laga: A coastal city in Andalusia, M\u00e1laga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\\n\\nThese are just a few of the many wonderful cities in Spain.', 'distilabel_metadata': {'raw_output_text_generation_0': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\\n\\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\\n\\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaud\u00ed, including the Sagrada Familia and Park G\u00fcell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\\n\\n3. 
Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\\n\\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alc\u00e1zar, a beautiful palace made up of a series of rooms and courtyards.\\n\\n5. Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\\n\\n6. Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\\n\\n7. M\u00e1laga: A coastal city in Andalusia, M\u00e1laga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\\n\\nThese are just a few of the many wonderful cities in Spain.'}, 'model_name': 'mistralai/Mixtral-8x7B-Instruct-v0.1'}]\n
\n
group_responses = GroupColumns(\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"model_names\"],\n pipeline=Pipeline(name=\"showcase-pipeline\"),\n)\nnext(\n group_responses.process(\n [\n {\n \"generation\": \"Madrid\",\n \"model_name\": \"meta-llama/Meta-Llama-3-8B-Instruct\",\n },\n ],\n [\n {\n \"generation\": \"Barcelona\",\n \"model_name\": \"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n }\n ],\n )\n)\n
\n[{'generations': ['Madrid', 'Barcelona'],\n 'model_names': ['meta-llama/Meta-Llama-3-8B-Instruct',\n 'mistralai/Mixtral-8x7B-Instruct-v0.1']}]
\n
evaluate_responses = UltraFeedback(\n aspect=\"overall-rating\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n ),\n pipeline=Pipeline(name=\"showcase-pipeline\"),\n)\nevaluate_responses.load()\nnext(\n evaluate_responses.process(\n [\n {\n \"instruction\": \"What's the capital of Spain?\",\n \"generations\": [\"Madrid\", \"Barcelona\"],\n }\n ]\n )\n)\n
\n[{'instruction': \"What's the capital of Spain?\",\n 'generations': ['Madrid', 'Barcelona'],\n 'ratings': [5, 1],\n 'rationales': [\"The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.\",\n \"The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent.\"],\n 'distilabel_metadata': {'raw_output_ultra_feedback_0': \"#### Output for Text 1\\nRating: 5 (Excellent)\\nRationale: The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.\\n\\n#### Output for Text 2\\nRating: 1 (Low Quality)\\nRationale: The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent.\"},\n 'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'}]
\n
format_dpo = FormatTextGenerationDPO(pipeline=Pipeline(name=\"showcase-pipeline\"))\nformat_dpo.load()\nnext(\n format_dpo.process(\n [\n {\n \"instruction\": \"What's the capital of Spain?\",\n \"generations\": [\"Madrid\", \"Barcelona\"],\n \"generation_models\": [\n \"Meta-Llama-3-8B-Instruct\",\n \"Mixtral-8x7B-Instruct-v0.1\",\n ],\n \"ratings\": [5, 1],\n }\n ]\n )\n)\n
\n[{'instruction': \"What's the capital of Spain?\",\n 'generations': ['Madrid', 'Barcelona'],\n 'generation_models': ['Meta-Llama-3-8B-Instruct',\n 'Mixtral-8x7B-Instruct-v0.1'],\n 'ratings': [5, 1],\n 'prompt': \"What's the capital of Spain?\",\n 'prompt_id': '26174c953df26b3049484e4721102dca6b25d2de9e3aa22aa84f25ed1c798512',\n 'chosen': [{'role': 'user', 'content': \"What's the capital of Spain?\"},\n {'role': 'assistant', 'content': 'Madrid'}],\n 'chosen_model': 'Meta-Llama-3-8B-Instruct',\n 'chosen_rating': 5,\n 'rejected': [{'role': 'user', 'content': \"What's the capital of Spain?\"},\n {'role': 'assistant', 'content': 'Barcelona'}],\n 'rejected_model': 'Mixtral-8x7B-Instruct-v0.1',\n 'rejected_rating': 1}]
\n
Component: PreferenceToArgilla step
Input columns: instruction, generations, generation_models, ratings
Output columns: instruction, generations, generation_models, ratings
to_argilla = PreferenceToArgilla(\n dataset_name=\"preference-dataset\",\n dataset_workspace=\"argilla\",\n api_url=\"https://[your-owner-name]-[your-space-name].hf.space\",\n api_key=\"[your-api-key]\",\n num_generations=2\n)\n
Below, you can see the full pipeline definition:
with Pipeline(name=\"generate-dataset\") as pipeline:\n\n load_dataset = LoadDataFromHub(repo_id=\"argilla/10Kprompts-mini\")\n\n generate_responses = [\n TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n )\n ),\n TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n tokenizer_id=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n )\n ),\n ]\n\n group_responses = GroupColumns(\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"model_names\"],\n )\n\n evaluate_responses = UltraFeedback(\n aspect=\"overall-rating\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n )\n )\n\n format_dpo = FormatTextGenerationDPO()\n\n to_argilla = PreferenceToArgilla(\n dataset_name=\"preference-dataset\",\n dataset_workspace=\"argilla\",\n api_url=\"https://[your-owner-name]-[your-space-name].hf.space\",\n api_key=\"[your-api-key]\",\n num_generations=2\n )\n\n for task in generate_responses:\n load_dataset.connect(task)\n task.connect(group_responses)\n group_responses.connect(evaluate_responses)\n evaluate_responses.connect(format_dpo, to_argilla)\n
Let's now run the pipeline and generate the preference dataset.
distiset = pipeline.run()\n
Let's check the preference dataset! If you have loaded the data to Argilla, you can start annotating in the Argilla UI.
You can push the dataset to the Hub for sharing with the community and embed it to explore the data.
distiset.push_to_hub(\"[your-owner-name]/example-preference-dataset\")\n
In this tutorial, we showcased the detailed steps to build a pipeline for generating a preference dataset using distilabel. You can customize this pipeline for your own use cases and share your datasets with the community through the Hugging Face Hub, or use them to train a model for DPO or ORPO.
We used a dataset containing prompts to generate responses using two different models through the serverless Hugging Face Inference API. Next, we evaluated the responses using a third model, following the UltraFeedback standards. Finally, we converted the data to a preference dataset and used Argilla for further curation.
"},{"location":"sections/pipeline_samples/tutorials/generate_preference_dataset/#generate-a-preference-dataset","title":"Generate a preference dataset","text":""},{"location":"sections/pipeline_samples/tutorials/generate_preference_dataset/#getting-started","title":"Getting started","text":""},{"location":"sections/pipeline_samples/tutorials/generate_preference_dataset/#install-the-dependencies","title":"Install the dependencies","text":"To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will be using the free but rate-limited Hugging Face serverless Inference API for this tutorial, so we need to install this as an extra distilabel dependency. You can install them by running the following command:
"},{"location":"sections/pipeline_samples/tutorials/generate_preference_dataset/#optional-deploy-argilla","title":"(optional) Deploy Argilla","text":"You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from a lack of data quality, so we do recommend looking at your data. If you already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following this guide.
Along with that, you will need to install Argilla as a distilabel extra.
"},{"location":"sections/pipeline_samples/tutorials/generate_preference_dataset/#define-the-pipeline","title":"Define the pipeline","text":""},{"location":"sections/pipeline_samples/tutorials/generate_preference_dataset/#load-the-dataset","title":"Load the dataset","text":"We will use as source data the argilla/10Kprompts-mini
dataset from the Hugging Face Hub.
Component: LoadDataFromHub
Input columns: instruction and topic, the same as in the loaded dataset
Output columns: instruction and topic
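As a standalone sketch of this loader, matching the pipeline code shown earlier, note that num_examples is only used to keep the walkthrough small and can be removed to load the whole dataset.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromHub\n\n# Drop num_examples (or raise it) to process more rows than the single example shown earlier.\nload_dataset = LoadDataFromHub(\n    repo_id=\"argilla/10Kprompts-mini\",\n    split=\"train\",  # 'train' is the default split\n    num_examples=1,\n    pipeline=Pipeline(name=\"showcase-pipeline\"),\n)\n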
We need to generate the responses for the given instructions. We will use two different models available on the Hugging Face Hub through the Serverless Inference API: meta-llama/Meta-Llama-3-8B-Instruct and mistralai/Mixtral-8x7B-Instruct-v0.1. We will also indicate the generation parameters for each model.
Component: TextGeneration task with LLMs using InferenceEndpointsLLM
Input columns: instruction
Output columns: generation, distilabel_metadata, model_name for each model
For your use case and to improve the results, you can use any other LLM of your choice.
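If you want more than two candidate responses per instruction, the list of generation tasks can simply be extended; the sketch below appends one more Serverless Inference API model (the model id is illustrative, and generate_responses is the list defined earlier). Remember to connect the new task in the pipeline and to adjust num_generations in PreferenceToArgilla accordingly.
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import TextGeneration\n\n# generate_responses is assumed to be the list of TextGeneration tasks defined earlier.\ngenerate_responses.append(\n    TextGeneration(\n        llm=InferenceEndpointsLLM(\n            model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n            tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n            generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n        ),\n    )\n)\n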
"},{"location":"sections/pipeline_samples/tutorials/generate_preference_dataset/#group-the-responses","title":"Group the responses","text":"The task to evaluate the responses needs as input a list of generations. However, each model response was saved in the generation column of the subsets text_generation_0
and text_generation_1
. We will combine these two columns into a single column and the default
subset.
Component: GroupColumns
Input columns: generation and model_name from text_generation_0 and text_generation_1
Output columns: generations and model_names
To build our preference dataset, we need to evaluate the responses generated by the models. We will use meta-llama/Meta-Llama-3-70B-Instruct for this, applying the UltraFeedback task that judges the responses according to different dimensions (helpfulness, honesty, instruction-following, truthfulness).
Component: UltraFeedback task with LLMs using InferenceEndpointsLLM
Input columns: instruction, generations
Output columns: ratings, rationales, distilabel_metadata, model_name
For your use case and to improve the results, you can use any other LLM of your choice.
"},{"location":"sections/pipeline_samples/tutorials/generate_preference_dataset/#convert-to-a-preference-dataset","title":"Convert to a preference dataset","text":"chosen
and rejected
columns.FormatTextGenerationDPO
stepinstruction
, generations
, generation_models
, ratings
prompt
, prompt_id
, chosen
, chosen_model
, chosen_rating
, rejected
, rejected_model
, rejected_rating
!pip install \"distilabel[hf-inference-endpoints]\"\n
!pip install \"transformers~=4.40\" \"torch~=2.0\" \"setfit~=1.0\"\n
Let's make the required imports:
import random\nfrom collections import Counter\n\nfrom datasets import load_dataset, Dataset\nfrom distilabel.models import InferenceEndpointsLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadDataFromDicts\nfrom distilabel.steps.tasks import (\n GenerateTextClassificationData,\n)\nfrom setfit import SetFitModel, Trainer, sample_dataset\n
You'll need an HF_TOKEN to use the HF Inference Endpoints. Log in to use it directly within this notebook.
import os\nfrom huggingface_hub import login\n\nlogin(token=os.getenv(\"HF_TOKEN\"), add_to_git_credential=True)\n
!pip install \"distilabel[argilla, hf-inference-endpoints]\"\n
We will use the fancyzhx/ag_news
dataset from the Hugging Face Hub as our original data source. To simulate a real-world scenario with imbalanced and limited data, we will load only 20 samples from this dataset.
hf_dataset = load_dataset(\"fancyzhx/ag_news\", split=\"train[-20:]\")\n
Now, we can retrieve the available labels in the dataset and examine the current data distribution.
labels_topic = hf_dataset.features[\"label\"].names\nid2str = {i: labels_topic[i] for i in range(len(labels_topic))}\nprint(id2str)\nprint(Counter(hf_dataset[\"label\"]))\n
\n{0: 'World', 1: 'Sports', 2: 'Business', 3: 'Sci/Tech'}\nCounter({0: 12, 1: 6, 2: 2})\n
\n
As observed, the dataset is imbalanced, with most samples falling under the World category, while the Sci/Tech category is entirely missing. Moreover, there are insufficient samples to effectively train a topic classification model.
We will also define the labels for the new classification task.
labels_fact_opinion = [\"Fact-based\", \"Opinion-based\"]\n
To generate the data we will use the GenerateTextClassificationData task. This task takes the classification task definitions as input, and lets us define the language, difficulty and clarity required for the generated data.
task = GenerateTextClassificationData(\n language=\"English\",\n difficulty=\"college\",\n clarity=\"clear\",\n num_generations=1,\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.4},\n ),\n input_batch_size=5,\n)\ntask.load()\nresult = next(\n task.process([{\"task\": \"Classify the news article as fact-based or opinion-based\"}])\n)\nprint(result[0][\"distilabel_metadata\"][\"raw_input_generate_text_classification_data_0\"])\n
\n[{'role': 'user', 'content': 'You have been assigned a text classification task: Classify the news article as fact-based or opinion-based\\n\\nYour mission is to write one text classification example for this task in JSON format. The JSON object must contain the following keys:\\n - \"input_text\": a string, the input text specified by the classification task.\\n - \"label\": a string, the correct label of the input text.\\n - \"misleading_label\": a string, an incorrect label that is related to the task.\\n\\nPlease adhere to the following guidelines:\\n - The \"input_text\" should be diverse in expression.\\n - The \"misleading_label\" must be a valid label for the given task, but not as appropriate as the \"label\" for the \"input_text\".\\n - The values for all fields should be in English.\\n - Avoid including the values of the \"label\" and \"misleading_label\" fields in the \"input_text\", that would make the task too easy.\\n - The \"input_text\" is clear and requires college level education to comprehend.\\n\\nYour output must always be a JSON object only, do not explain yourself or output anything else. Be creative!'}]\n
\n
For our use case, we only need to generate data for two tasks: a topic classification task and a fact versus opinion classification task. Therefore, we will define the tasks accordingly. As we will be using a smaller model for generation, we will select 2 random labels for each topic classification task and vary the label order for the fact versus opinion classification task, ensuring more diversity in the generated data.
task_templates = [\n \"Determine the news article as {}\",\n \"Classify news article as {}\",\n \"Identify the news article as {}\",\n \"Categorize the news article as {}\",\n \"Label the news article using {}\",\n \"Annotate the news article based on {}\",\n \"Determine the theme of a news article from {}\",\n \"Recognize the topic of the news article as {}\",\n]\n\nclassification_tasks = [\n {\"task\": action.format(\" or \".join(random.sample(labels_topic, 2)))}\n for action in task_templates for _ in range(4)\n] + [\n {\"task\": action.format(\" or \".join(random.sample(labels_fact_opinion, 2)))}\n for action in task_templates\n]\n
Now, it's time to define and run the pipeline. As mentioned, we will load the written tasks and feed them into the GenerateTextClassificationData task. For our use case, we will be using Meta-Llama-3.1-8B-Instruct via the InferenceEndpointsLLM, with different degrees of difficulty and clarity.
difficulties = [\"college\", \"high school\", \"PhD\"]\nclarity = [\"clear\", \"understandable with some effort\", \"ambiguous\"]\n\nwith Pipeline(\"texcat-generation-pipeline\") as pipeline:\n\n tasks_generator = LoadDataFromDicts(data=classification_tasks)\n\n generate_data = []\n for difficulty in difficulties:\n for clarity_level in clarity:\n task = GenerateTextClassificationData(\n language=\"English\",\n difficulty=difficulty,\n clarity=clarity_level,\n num_generations=2,\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n ),\n input_batch_size=5,\n )\n generate_data.append(task)\n\n for task in generate_data:\n tasks_generator.connect(task)\n
Let's now run the pipeline and generate the synthetic data.
distiset = pipeline.run()\n
distiset[\"generate_text_classification_data_0\"][\"train\"][0]\n
\n{'task': 'Determine the news article as Business or World',\n 'input_text': \"The recent decision by the European Central Bank to raise interest rates will likely have a significant impact on the eurozone's economic growth, with some analysts predicting a 0.5% contraction in GDP due to the increased borrowing costs. The move is seen as a measure to combat inflation, which has been rising steadily over the past year.\",\n 'label': 'Business',\n 'misleading_label': 'World',\n 'distilabel_metadata': {'raw_output_generate_text_classification_data_0': '{\\n \"input_text\": \"The recent decision by the European Central Bank to raise interest rates will likely have a significant impact on the eurozone\\'s economic growth, with some analysts predicting a 0.5% contraction in GDP due to the increased borrowing costs. The move is seen as a measure to combat inflation, which has been rising steadily over the past year.\",\\n \"label\": \"Business\",\\n \"misleading_label\": \"World\"\\n}'},\n 'model_name': 'meta-llama/Meta-Llama-3.1-8B-Instruct'}
\n
You can push the dataset to the Hub for sharing with the community and embed it to explore the data.
distiset.push_to_hub(\"[your-owner-name]/example-texcat-generation-dataset\")\n
By examining the distiset distribution, we can confirm that it includes at least the 8 required samples for each label to train our classification models with SetFit.
all_labels = [\n entry[\"label\"]\n for dataset_name in distiset\n for entry in distiset[dataset_name][\"train\"]\n]\n\nCounter(all_labels)\n
\nCounter({'Sci/Tech': 275,\n 'Business': 130,\n 'World': 86,\n 'Fact-based': 86,\n 'Sports': 64,\n 'Opinion-based': 54,\n None: 20,\n 'Opinion Based': 1,\n 'News/Opinion': 1,\n 'Science': 1,\n 'Environment': 1,\n 'Opinion': 1})
\n
We will create two datasets with the required labels and data for our use cases.
def extract_rows(distiset, labels):\n return [\n {\n \"text\": entry[\"input_text\"],\n \"label\": entry[\"label\"],\n \"id\": i\n }\n for dataset_name in distiset\n for i, entry in enumerate(distiset[dataset_name][\"train\"])\n if entry[\"label\"] in labels\n ]\n\ndata_topic = extract_rows(distiset, labels_topic)\ndata_fact_opinion = extract_rows(distiset, labels_fact_opinion)\n
Get started in Argilla
If you are not familiar with Argilla, we recommend taking a look at the Argilla quickstart docs. Alternatively, you can use your Hugging Face account to login to the Argilla demo Space.
To get the most out of our data, we will use Argilla. First, we need to connect to the Argilla instance.
import argilla as rg\n\n# Replace api_url with your url if using Docker\n# Replace api_key with your API key under \"My Settings\" in the UI\n# Uncomment the last line and set your HF_TOKEN if your space is private\nclient = rg.Argilla(\n api_url=\"https://[your-owner-name]-[your_space_name].hf.space\",\n api_key=\"[your-api-key]\",\n # headers={\"Authorization\": f\"Bearer {HF_TOKEN}\"}\n)\n
We will create a Dataset for each task, with an input TextField for the text to classify and a LabelQuestion to ensure the generated labels are correct.
def create_texcat_dataset(dataset_name, labels):\n settings = rg.Settings(\n fields=[rg.TextField(\"text\")],\n questions=[\n rg.LabelQuestion(\n name=\"label\",\n title=\"Classify the texts according to the following labels\",\n labels=labels,\n ),\n ],\n )\n return rg.Dataset(name=dataset_name, settings=settings).create()\n\n\nrg_dataset_topic = create_texcat_dataset(\"topic-classification\", labels_topic)\nrg_dataset_fact_opinion = create_texcat_dataset(\n \"fact-opinion-classification\", labels_fact_opinion\n)\n
Now, we can upload the generated data to Argilla and evaluate it. We will use the generated labels as suggestions.
rg_dataset_topic.records.log(data_topic)\nrg_dataset_fact_opinion.records.log(data_fact_opinion)\n
Now, we can start the annotation process. Just open the dataset in the Argilla UI and start annotating the records. If the suggestions are correct, you can just click on Submit. Otherwise, you can select the correct label.
Note
Check this how-to guide to know more about annotating in the UI.
Once you have the annotations, let's continue by retrieving the data from Argilla and formatting it as a dataset with the required fields.
rg_dataset_topic = client.datasets(\"topic-classification\")\nrg_dataset_fact_opinion = client.datasets(\"fact-opinion-classification\")\n
status_filter = rg.Query(filter=rg.Filter((\"response.status\", \"==\", \"submitted\")))\n\nsubmitted_topic = rg_dataset_topic.records(status_filter).to_list(flatten=True)\nsubmitted_fact_opinion = rg_dataset_fact_opinion.records(status_filter).to_list(\n flatten=True\n)\n
def format_submitted(submitted):\n return [\n {\n \"text\": r[\"text\"],\n \"label\": r[\"label.responses\"][0],\n \"id\": i,\n }\n for i, r in enumerate(submitted)\n ]\n\ndata_topic = format_submitted(submitted_topic)\ndata_fact_opinion = format_submitted(submitted_fact_opinion)\n
In our case, we will fine-tune using SetFit. However, you can select the framework that best fits your requirements.
The next step will be to format the data to be compatible with SetFit. In the case of the topic classification, we will need to combine the synthetic data with the original data.
hf_topic = hf_dataset.to_list()\nnum = len(data_topic)\n\ndata_topic.extend(\n [\n {\n \"text\": r[\"text\"],\n \"label\": id2str[r[\"label\"]],\n \"id\": num + i,\n }\n for i, r in enumerate(hf_topic)\n ]\n)\n
If we check the data distribution now, we can see that we have enough samples for each label to train our models.
labels = [record[\"label\"] for record in data_topic]\nCounter(labels)\n
\nCounter({'Sci/Tech': 275, 'Business': 132, 'World': 98, 'Sports': 70})
\n
labels = [record[\"label\"] for record in data_fact_opinion]\nCounter(labels)\n
\nCounter({'Fact-based': 86, 'Opinion-based': 54})
\n
Now, let's create our training and validation datasets. The training dataset will gather 8 samples per label. In this case, the validation datasets will contain the remaining samples not included in the training datasets.
def sample_and_split(dataset, label_column, num_samples):\n train_dataset = sample_dataset(\n dataset, label_column=label_column, num_samples=num_samples\n )\n eval_dataset = dataset.filter(lambda x: x[\"id\"] not in set(train_dataset[\"id\"]))\n return train_dataset, eval_dataset\n\n\ndataset_topic_full = Dataset.from_list(data_topic)\ndataset_fact_opinion_full = Dataset.from_list(data_fact_opinion)\n\ntrain_dataset_topic, eval_dataset_topic = sample_and_split(\n dataset_topic_full, \"label\", 8\n)\ntrain_dataset_fact_opinion, eval_dataset_fact_opinion = sample_and_split(\n dataset_fact_opinion_full, \"label\", 8\n)\n
Let's train our models for each task! We will use TaylorAI/bge-micro-v2, available in the Hugging Face Hub. You can check the MTEB leaderboard to select the best model for your use case.
def train_model(model_name, dataset, eval_dataset):\n model = SetFitModel.from_pretrained(model_name)\n\n trainer = Trainer(\n model=model,\n train_dataset=dataset,\n )\n trainer.train()\n metrics = trainer.evaluate(eval_dataset)\n print(metrics)\n\n return model\n
model_topic = train_model(\n model_name=\"TaylorAI/bge-micro-v2\",\n dataset=train_dataset_topic,\n eval_dataset=eval_dataset_topic,\n)\nmodel_topic.save_pretrained(\"topic_classification_model\")\nmodel_topic = SetFitModel.from_pretrained(\"topic_classification_model\")\n
\n***** Running training *****\n Num unique pairs = 768\n Batch size = 16\n Num epochs = 1\n Total optimization steps = 48\n
\n
\n{'embedding_loss': 0.1873, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.02}\n
\n
\n***** Running evaluation *****\n
\n
\n{'train_runtime': 4.9767, 'train_samples_per_second': 154.318, 'train_steps_per_second': 9.645, 'epoch': 1.0}\n{'accuracy': 0.8333333333333334}\n
\n
model_fact_opinion = train_model(\n model_name=\"TaylorAI/bge-micro-v2\",\n dataset=train_dataset_fact_opinion,\n eval_dataset=eval_dataset_fact_opinion,\n)\nmodel_fact_opinion.save_pretrained(\"fact_opinion_classification_model\")\nmodel_fact_opinion = SetFitModel.from_pretrained(\"fact_opinion_classification_model\")\n
\n***** Running training *****\n Num unique pairs = 144\n Batch size = 16\n Num epochs = 1\n Total optimization steps = 9\n
\n
\n{'embedding_loss': 0.2985, 'learning_rate': 2e-05, 'epoch': 0.11}\n
\n
\n***** Running evaluation *****\n
\n
\n{'train_runtime': 0.8327, 'train_samples_per_second': 172.931, 'train_steps_per_second': 10.808, 'epoch': 1.0}\n{'accuracy': 0.9090909090909091}\n
\n
Voil\u00e0! The models are now trained and ready to be used. You can start making predictions to check the model's performance and add the new label. Optionally, you can continue using distilabel to generate additional data or Argilla to verify the quality of the predictions.
def predict(model, input, labels):\n model.labels = labels\n prediction = model.predict([input])\n return prediction[0]\n
predict(\n model_topic, \"The new iPhone is expected to be released next month.\", labels_topic\n)\n
\n'Sci/Tech'
\n
predict(\n model_fact_opinion,\n \"The new iPhone is expected to be released next month.\",\n labels_fact_opinion,\n)\n
\n'Opinion-based'
\n
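If you also want to share the fine-tuned classifiers, here is a hedged sketch using SetFit's push_to_hub; the repository ids are placeholders.
# Optional: share the fine-tuned classifiers on the Hugging Face Hub (placeholder repo ids).\nmodel_topic.push_to_hub(\"[your-owner-name]/topic-classification-setfit\")\nmodel_fact_opinion.push_to_hub(\"[your-owner-name]/fact-opinion-classification-setfit\")\n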
In this tutorial, we showcased the detailed steps to build a pipeline for generating text classification data using distilabel. You can customize this pipeline for your own use cases and share your datasets with the community through the Hugging Face Hub.
We defined two text classification tasks\u2014a topic classification task and a fact versus opinion classification task\u2014and generated new data using various models via the serverless Hugging Face Inference API. Then, we curated the generated data with Argilla. Finally, we trained the models with SetFit using both the original and synthetic data.
"},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#generate-synthetic-text-classification-data","title":"Generate synthetic text classification data","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#getting-started","title":"Getting started","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#install-the-dependencies","title":"Install the dependencies","text":"To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will be using the free but rate-limited Hugging Face serverless Inference API for this tutorial, so we need to install this as an extra distilabel dependency. You can install them by running the following command:
"},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#optional-deploy-argilla","title":"(optional) Deploy Argilla","text":"You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from a lack of data quality, so we do recommend looking at your data. If you already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following this guide.
Along with that, you will need to install Argilla as a distilabel extra.
"},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#the-dataset","title":"The dataset","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#define-the-text-classification-task","title":"Define the text classification task","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#run-the-pipeline","title":"Run the pipeline","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#optional-evaluate-with-argilla","title":"(Optional) Evaluate with Argilla","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#train-your-models","title":"Train your models","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#formatting-the-data","title":"Formatting the data","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#the-actual-training","title":"The actual training","text":""},{"location":"sections/pipeline_samples/tutorials/generate_textcat_dataset/#conclusions","title":"Conclusions","text":""},{"location":"components-gallery/","title":"Components Gallery","text":"Steps
Explore all the available Steps that can be used for data manipulation.
Steps
Tasks
Explore all the available Tasks that can be used with an LLM to perform data generation, annotation, and more.
Tasks
LLMs
Explore all the available LLMs integrated with distilabel.
LLMs
Embeddings
Explore all the available Embeddings models integrated with distilabel.
Embeddings
The gallery page showcases the different types of components within distilabel.
PreferenceToArgilla
Creates a preference dataset in Argilla.
PreferenceToArgilla
TextGenerationToArgilla
Creates a text generation dataset in Argilla.
TextGenerationToArgilla
CombineColumns
CombineColumns is deprecated and will be removed in version 1.5.0, use GroupColumns instead.
CombineColumns
PushToHub
Push data to a Hugging Face Hub dataset.
PushToHub
LoadDataFromDicts
Loads a dataset from a list of dictionaries.
LoadDataFromDicts
DataSampler
Step to sample from a dataset.
DataSampler
LoadDataFromHub
Loads a dataset from the Hugging Face Hub.
LoadDataFromHub
LoadDataFromFileSystem
Loads a dataset from a file in your filesystem.
LoadDataFromFileSystem
LoadDataFromDisk
Load a dataset that was previously saved to disk.
LoadDataFromDisk
PrepareExamples
Helper step to create examples from query and answers pairs used as Few Shots in APIGen.
PrepareExamples
ConversationTemplate
Generate a conversation template from an instruction and a response.
ConversationTemplate
FormatTextGenerationDPO
Format the output of your LLMs for Direct Preference Optimization (DPO).
FormatTextGenerationDPO
FormatChatGenerationDPO
Format the output of a combination of a ChatGeneration + a preference task for Direct Preference Optimization (DPO).
FormatChatGenerationDPO
FormatTextGenerationSFT
Format the output of a TextGeneration task for Supervised Fine-Tuning (SFT).
FormatTextGenerationSFT
FormatChatGenerationSFT
Format the output of a ChatGeneration task for Supervised Fine-Tuning (SFT).
FormatChatGenerationSFT
DeitaFiltering
Filter dataset rows using DEITA filtering strategy.
DeitaFiltering
EmbeddingDedup
Deduplicates text using embeddings.
EmbeddingDedup
APIGenExecutionChecker
Executes the generated function calls.
APIGenExecutionChecker
MinHashDedup
Deduplicates text using MinHash and MinHashLSH.
MinHashDedup
CombineOutputs
Combine the outputs of several upstream steps.
CombineOutputs
ExpandColumns
Expand columns that contain lists into multiple rows.
ExpandColumns
GroupColumns
Combines columns from a list of StepInput.
GroupColumns
KeepColumns
Keeps selected columns in the dataset.
KeepColumns
MergeColumns
Merge columns from a row.
MergeColumns
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core
DBSCAN
UMAP
UMAP is a general purpose manifold learning and dimension reduction algorithm.
UMAP
FaissNearestNeighbour
Create a faiss index to get the nearest neighbours.
FaissNearestNeighbour
EmbeddingGeneration
Generate embeddings using an Embeddings model.
EmbeddingGeneration
RewardModelScore
Assign a score to a response using a Reward Model.
RewardModelScore
FormatPRM
Helper step to transform the data into the format expected by the PRM model.
FormatPRM
TruncateTextColumn
Truncate a row using a tokenizer or the number of characters.
TruncateTextColumn
Creates a preference dataset in Argilla.
Step that creates a dataset in Argilla during the load phase, and then pushes the input batches into it as records. This dataset is a preference dataset, where there's one field for the instruction and one extra field per generation within the same record, and then a rating question for each of the generation fields. The rating question asks the annotator to set a rating from 1 to 5 for each of the provided generations.
"},{"location":"components-gallery/steps/preferencetoargilla/#note","title":"Note","text":"This step is meant to be used in conjunction with the UltraFeedback
step, or any other step generating both ratings and responses for a given set of instruction and generations for the given instruction. But alternatively, it can also be used with any other task or step generating only the instruction
and generations
, as the ratings
and rationales
are optional.
num_generations: The number of generations to include in the dataset.
dataset_name: The name of the dataset in Argilla.
dataset_workspace: The workspace where the dataset will be created in Argilla. Defaults to None
, which means it will be created in the default workspace.
api_url: The URL of the Argilla API. Defaults to None
, which means it will be read from the ARGILLA_API_URL
environment variable.
api_key: The API key to authenticate with Argilla. Defaults to None
, which means it will be read from the ARGILLA_API_KEY
environment variable.
api_url: The base URL to use for the Argilla API requests.
api_key: The API key to authenticate the requests to the Argilla API.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[generations]\n ICOL2[ratings]\n ICOL3[rationales]\n end\n end\n\n subgraph PreferenceToArgilla\n StepInput[Input Columns: instruction, generations, ratings, rationales]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n ICOL3 --> StepInput\n
"},{"location":"components-gallery/steps/preferencetoargilla/#inputs","title":"Inputs","text":"instruction (str
): The instruction that was used to generate the completion.
generations (List[str]
): The completion that was generated based on the input instruction.
ratings (List[str]
, optional): The ratings for the generations. If not provided, the generated ratings won't be pushed to Argilla.
rationales (List[str]
, optional): The rationales for the ratings. If not provided, the generated rationales won't be pushed to Argilla.
from distilabel.steps import PreferenceToArgilla\n\nto_argilla = PreferenceToArgilla(\n num_generations=2,\n api_url=\"https://dibt-demo-argilla-space.hf.space/\",\n api_key=\"api.key\",\n dataset_name=\"argilla_dataset\",\n dataset_workspace=\"my_workspace\",\n)\nto_argilla.load()\n\nresult = next(\n to_argilla.process(\n [\n {\n \"instruction\": \"instruction\",\n \"generations\": [\"first_generation\", \"second_generation\"],\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction', 'generations': ['first_generation', 'second_generation']}]\n
"},{"location":"components-gallery/steps/preferencetoargilla/#it-can-also-include-ratings-and-rationales","title":"It can also include ratings and rationales","text":"result = next(\n to_argilla.process(\n [\n {\n \"instruction\": \"instruction\",\n \"generations\": [\"first_generation\", \"second_generation\"],\n \"ratings\": [\"4\", \"5\"],\n \"rationales\": [\"rationale for 4\", \"rationale for 5\"],\n }\n ],\n )\n)\n# >>> result\n# [\n# {\n# 'instruction': 'instruction',\n# 'generations': ['first_generation', 'second_generation'],\n# 'ratings': ['4', '5'],\n# 'rationales': ['rationale for 4', 'rationale for 5']\n# }\n# ]\n
"},{"location":"components-gallery/steps/textgenerationtoargilla/","title":"TextGenerationToArgilla","text":"Creates a text generation dataset in Argilla.
Step
that creates a dataset in Argilla during the load phase, and then pushes the input batches into it as records. This dataset is a text-generation dataset, where there's one field per each input, and then a label question to rate the quality of the completion in either bad (represented with \ud83d\udc4e) or good (represented with \ud83d\udc4d).
This step is meant to be used in conjunction with a TextGeneration
step and no column mapping is needed, as it will use the default values for the instruction
and generation
columns.
dataset_name: The name of the dataset in Argilla.
dataset_workspace: The workspace where the dataset will be created in Argilla. Defaults to None
, which means it will be created in the default workspace.
api_url: The URL of the Argilla API. Defaults to None
, which means it will be read from the ARGILLA_API_URL
environment variable.
api_key: The API key to authenticate with Argilla. Defaults to None
, which means it will be read from the ARGILLA_API_KEY
environment variable.
api_url: The base URL to use for the Argilla API requests.
api_key: The API key to authenticate the requests to the Argilla API.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[generation]\n end\n end\n\n subgraph TextGenerationToArgilla\n StepInput[Input Columns: instruction, generation]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n
"},{"location":"components-gallery/steps/textgenerationtoargilla/#inputs","title":"Inputs","text":"instruction (str
): The instruction that was used to generate the completion.
generation (str
or List[str]
): The completions that were generated based on the input instruction.
from distilabel.steps import PreferenceToArgilla\n\nto_argilla = TextGenerationToArgilla(\n num_generations=2,\n api_url=\"https://dibt-demo-argilla-space.hf.space/\",\n api_key=\"api.key\",\n dataset_name=\"argilla_dataset\",\n dataset_workspace=\"my_workspace\",\n)\nto_argilla.load()\n\nresult = next(\n to_argilla.process(\n [\n {\n \"instruction\": \"instruction\",\n \"generation\": \"generation\",\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction', 'generation': 'generation'}]\n
"},{"location":"components-gallery/steps/combinecolumns/","title":"CombineColumns","text":"CombineColumns
is deprecated and will be removed in version 1.5.0, use GroupColumns
instead.
graph TD\n subgraph Dataset\n end\n\n subgraph CombineColumns\n end\n\n
"},{"location":"components-gallery/steps/pushtohub/","title":"PushToHub","text":"Push data to a Hugging Face Hub dataset.
A GlobalStep
which creates a datasets.Dataset
with the input data and pushes it to the Hugging Face Hub.
repo_id: The Hugging Face Hub repository ID where the dataset will be uploaded.
split: The split of the dataset that will be pushed. Defaults to \"train\"
.
private: Whether the dataset to be pushed should be private or not. Defaults to False
.
token: The token that will be used to authenticate in the Hub. If not provided, the token will be tried to be obtained from the environment variable HF_TOKEN
. If not provided using one of the previous methods, then huggingface_hub
library will try to use the token from the local Hugging Face CLI configuration. Defaults to None
.
repo_id: The Hugging Face Hub repository ID where the dataset will be uploaded.
split: The split of the dataset that will be pushed.
private: Whether the dataset to be pushed should be private or not.
token: The token that will be used to authenticate in the Hub.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n end\n\n subgraph PushToHub\n StepInput[Input Columns: dynamic]\n end\n\n ICOL0 --> StepInput\n
"},{"location":"components-gallery/steps/pushtohub/#inputs","title":"Inputs","text":"all
): all columns from the input will be used to create the dataset.from distilabel.steps import PushToHub\n\npush = PushToHub(repo_id=\"path_to/repo\")\npush.load()\n\nresult = next(\n push.process(\n [\n {\n \"instruction\": \"instruction \",\n \"generation\": \"generation\"\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction ', 'generation': 'generation'}]\n
"},{"location":"components-gallery/steps/loaddatafromdicts/","title":"LoadDataFromDicts","text":"Loads a dataset from a list of dictionaries.
GeneratorStep
that loads a dataset from a list of dictionaries and yields it in batches.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph LoadDataFromDicts\n StepOutput[Output Columns: dynamic]\n end\n\n StepOutput --> OCOL0\n
"},{"location":"components-gallery/steps/loaddatafromdicts/#outputs","title":"Outputs","text":"from distilabel.steps import LoadDataFromDicts\n\nloader = LoadDataFromDicts(\n data=[{\"instruction\": \"What are 2+2?\"}] * 5,\n batch_size=2\n)\nloader.load()\n\nresult = next(loader.process())\n# >>> result\n# ([{'instruction': 'What are 2+2?'}, {'instruction': 'What are 2+2?'}], False)\n
"},{"location":"components-gallery/steps/datasampler/","title":"DataSampler","text":"Step to sample from a dataset.
GeneratorStep
that samples from a dataset and yields it in batches. This step is useful when you have a pipeline that can benefit from using examples in the prompts for example as few-shot learning, that can be changing on each row. For example, you can pass a list of dictionaries with N examples and generate M samples from it (assuming you have another step loading data, this M should have the same size as the data being loaded in that step). The size S argument is the number of samples per row generated, so each example would contain S examples to be used as examples.
data: The list of dictionaries to sample from.
size: Number of samples per example. For example in a few-shot learning scenario, the number of few-shot examples that will be generated per example. Defaults to 2.
samples: Number of examples that will be generated by the step in total. If used with another loader step, this should be the same as the number of samples in the loader step. Defaults to 100.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph DataSampler\n StepOutput[Output Columns: dynamic]\n end\n\n StepOutput --> OCOL0\n
"},{"location":"components-gallery/steps/datasampler/#outputs","title":"Outputs","text":"from distilabel.steps import DataSampler\n\nsampler = DataSampler(\n data=[{\"sample\": f\"sample {i}\"} for i in range(30)],\n samples=10,\n size=2,\n batch_size=4\n)\nsampler.load()\n\nresult = next(sampler.process())\n# >>> result\n# ([{'sample': ['sample 7', 'sample 0']}, {'sample': ['sample 2', 'sample 21']}, {'sample': ['sample 17', 'sample 12']}, {'sample': ['sample 2', 'sample 14']}], False)\n
"},{"location":"components-gallery/steps/datasampler/#pipeline-with-a-loader-and-a-sampler-combined-in-a-single-stream","title":"Pipeline with a loader and a sampler combined in a single stream","text":"from datasets import load_dataset\n\nfrom distilabel.steps import LoadDataFromDicts, DataSampler\nfrom distilabel.steps.tasks.apigen.utils import PrepareExamples\nfrom distilabel.pipeline import Pipeline\n\nds = (\n load_dataset(\"Salesforce/xlam-function-calling-60k\", split=\"train\")\n .shuffle(seed=42)\n .select(range(500))\n .to_list()\n)\ndata = [\n {\n \"func_name\": \"final_velocity\",\n \"func_desc\": \"Calculates the final velocity of an object given its initial velocity, acceleration, and time.\",\n },\n {\n \"func_name\": \"permutation_count\",\n \"func_desc\": \"Calculates the number of permutations of k elements from a set of n elements.\",\n },\n {\n \"func_name\": \"getdivision\",\n \"func_desc\": \"Divides two numbers by making an API call to a division service.\",\n },\n]\nwith Pipeline(name=\"APIGenPipeline\") as pipeline:\n loader_seeds = LoadDataFromDicts(data=data)\n sampler = DataSampler(\n data=ds,\n size=2,\n samples=len(data),\n batch_size=8,\n )\n prep_examples = PrepareExamples()\n\n sampler >> prep_examples\n (\n [loader_seeds, prep_examples]\n >> combine_steps\n )\n# Now we have a single stream of data with the loader and the sampler data\n
"},{"location":"components-gallery/steps/loaddatafromhub/","title":"LoadDataFromHub","text":"Loads a dataset from the Hugging Face Hub.
GeneratorStep
that loads a dataset from the Hugging Face Hub using the datasets
library.
repo_id: The Hugging Face Hub repository ID of the dataset to load.
split: The split of the dataset to load.
config: The configuration of the dataset to load. This is optional and only needed if the dataset has multiple configurations.
batch_size: The batch size to use when processing the data.
repo_id: The Hugging Face Hub repository ID of the dataset to load.
split: The split of the dataset to load. Defaults to 'train'.
config: The configuration of the dataset to load. This is optional and only needed if the dataset has multiple configurations.
revision: The revision of the dataset to load. Defaults to the latest revision.
streaming: Whether to load the dataset in streaming mode or not. Defaults to False
.
num_examples: The number of examples to load from the dataset. By default will load all examples.
storage_options: Key/value pairs to be passed on to the file-system backend, if any. Defaults to None
.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph LoadDataFromHub\n StepOutput[Output Columns: dynamic]\n end\n\n StepOutput --> OCOL0\n
"},{"location":"components-gallery/steps/loaddatafromhub/#outputs","title":"Outputs","text":"all
): The columns that will be generated by this step, based on the datasets loaded from the Hugging Face Hub.from distilabel.steps import LoadDataFromHub\n\nloader = LoadDataFromHub(\n repo_id=\"distilabel-internal-testing/instruction-dataset-mini\",\n split=\"test\",\n batch_size=2\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'prompt': 'Arianna has 12...', False)\n
"},{"location":"components-gallery/steps/loaddatafromfilesystem/","title":"LoadDataFromFileSystem","text":"Loads a dataset from a file in your filesystem.
GeneratorStep
that creates a dataset from a file in the filesystem using the Hugging Face datasets
library. Take a look at Hugging Face Datasets for more information on the supported file types.
data_files: The path to the file, or directory containing the files that make up the dataset.
split: The split of the dataset to load (typically will be train
, test
or validation
).
batch_size: The batch size to use when processing the data.
data_files: The path to the file, or directory containing the files that make up the dataset.
split: The split of the dataset to load. Defaults to 'train'.
streaming: Whether to load the dataset in streaming mode or not. Defaults to False
.
num_examples: The number of examples to load from the dataset. By default will load all examples.
storage_options: Key/value pairs to be passed on to the file-system backend, if any. Defaults to None
.
filetype: The expected filetype. If not provided, it will be inferred from the file extension. For more than one file, it will be inferred from the first file.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph LoadDataFromFileSystem\n StepOutput[Output Columns: dynamic]\n end\n\n StepOutput --> OCOL0\n
"},{"location":"components-gallery/steps/loaddatafromfilesystem/#outputs","title":"Outputs","text":"all
): The columns that will be generated by this step, based on the datasets loaded from the Hugging Face Hub.from distilabel.steps import LoadDataFromFileSystem\n\nloader = LoadDataFromFileSystem(data_files=\"path/to/dataset.jsonl\")\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
"},{"location":"components-gallery/steps/loaddatafromfilesystem/#specify-a-filetype-if-the-file-extension-is-not-expected","title":"Specify a filetype if the file extension is not expected","text":"from distilabel.steps import LoadDataFromFileSystem\n\nloader = LoadDataFromFileSystem(filetype=\"csv\", data_files=\"path/to/dataset.txtr\")\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
"},{"location":"components-gallery/steps/loaddatafromfilesystem/#load-data-from-a-file-in-your-cloud-provider","title":"Load data from a file in your cloud provider","text":"from distilabel.steps import LoadDataFromFileSystem\n\nloader = LoadDataFromFileSystem(\n data_files=\"gcs://path/to/dataset\",\n storage_options={\"project\": \"experiments-0001\"}\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
"},{"location":"components-gallery/steps/loaddatafromfilesystem/#load-data-passing-a-glob-pattern","title":"Load data passing a glob pattern","text":"from distilabel.steps import LoadDataFromFileSystem\n\nloader = LoadDataFromFileSystem(\n data_files=\"path/to/dataset/*.jsonl\",\n streaming=True\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
"},{"location":"components-gallery/steps/loaddatafromdisk/","title":"LoadDataFromDisk","text":"Load a dataset that was previously saved to disk.
If you previously saved your dataset using the save_to_disk
method, or Distiset.save_to_disk
you can load it again to build a new pipeline using this class.
dataset_path: The path to the dataset or distiset.
split: The split of the dataset to load (typically will be train
, test
or validation
).
config: The configuration of the dataset to load. Defaults to default
, if there are multiple configurations in the dataset this must be supplied or an error is raised.
batch_size: The batch size to use when processing the data.
dataset_path: The path to the dataset or distiset.
is_distiset: Whether the dataset to load is a Distiset
or not. Defaults to False.
split: The split of the dataset to load. Defaults to 'train'.
config: The configuration of the dataset to load. Defaults to default
, if there are multiple configurations in the dataset this must be supplied or an error is raised.
num_examples: The number of examples to load from the dataset. By default will load all examples.
storage_options: Key/value pairs to be passed on to the file-system backend, if any. Defaults to None
.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph LoadDataFromDisk\n StepOutput[Output Columns: dynamic]\n end\n\n StepOutput --> OCOL0\n
"},{"location":"components-gallery/steps/loaddatafromdisk/#outputs","title":"Outputs","text":"all
): The columns that will be generated by this step, based on the datasets loaded from the Hugging Face Hub.from distilabel.steps import LoadDataFromDisk\n\nloader = LoadDataFromDisk(dataset_path=\"path/to/dataset\")\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
"},{"location":"components-gallery/steps/loaddatafromdisk/#load-data-from-a-distilabel-distiset","title":"Load data from a distilabel Distiset","text":"from distilabel.steps import LoadDataFromDisk\n\n# Specify the configuration to load.\nloader = LoadDataFromDisk(\n dataset_path=\"path/to/dataset\",\n is_distiset=True,\n config=\"leaf_step_1\"\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'a': 1}, {'a': 2}, {'a': 3}], True)\n
"},{"location":"components-gallery/steps/loaddatafromdisk/#load-data-from-a-hugging-face-dataset-or-distiset-in-your-cloud-provider","title":"Load data from a Hugging Face Dataset or Distiset in your cloud provider","text":"from distilabel.steps import LoadDataFromDisk\n\nloader = LoadDataFromDisk(\n dataset_path=\"gcs://path/to/dataset\",\n storage_options={\"project\": \"experiments-0001\"}\n)\nloader.load()\n\n# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.\nresult = next(loader.process())\n# >>> result\n# ([{'type': 'function', 'function':...', False)\n
"},{"location":"components-gallery/steps/prepareexamples/","title":"PrepareExamples","text":"Helper step to create examples from query
and answers
pairs used as Few Shots in APIGen.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[query]\n ICOL1[answers]\n end\n subgraph New columns\n OCOL0[examples]\n end\n end\n\n subgraph PrepareExamples\n StepInput[Input Columns: query, answers]\n StepOutput[Output Columns: examples]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/prepareexamples/#inputs","title":"Inputs","text":"query (str
): The query to generate examples from.
answers (str
): The answers to the query.
str
): The formatted examples.from distilabel.steps.tasks.apigen.utils import PrepareExamples\n\nprepare_examples = PrepareExamples()\nprepare_examples.load()\nresult = next(prepare_examples.process(\n [\n {\n \"query\": ['I need the area of circles with radius 2.5, 5, and 7.5 inches, please.', 'Can you provide the current locations of buses and trolleys on route 12?'],\n \"answers\": ['[{\"name\": \"circle_area\", \"arguments\": {\"radius\": 2.5}}, {\"name\": \"circle_area\", \"arguments\": {\"radius\": 5}}, {\"name\": \"circle_area\", \"arguments\": {\"radius\": 7.5}}]', '[{\"name\": \"bus_trolley_locations\", \"arguments\": {\"route\": \"12\"}}]']\n }\n ]\n))\n# result\n# [{'examples': '## Query:\\nI need the area of circles with radius 2.5, 5, and 7.5 inches, please.\\n## Answers:\\n[{\"name\": \"circle_area\", \"arguments\": {\"radius\": 2.5}}, {\"name\": \"circle_area\", \"arguments\": {\"radius\": 5}}, {\"name\": \"circle_area\", \"arguments\": {\"radius\": 7.5}}]\\n\\n## Query:\\nCan you provide the current locations of buses and trolleys on route 12?\\n## Answers:\\n[{\"name\": \"bus_trolley_locations\", \"arguments\": {\"route\": \"12\"}}]'}, {'examples': '## Query:\\nI need the area of circles with radius 2.5, 5, and 7.5 inches, please.\\n## Answers:\\n[{\"name\": \"circle_area\", \"arguments\": {\"radius\": 2.5}}, {\"name\": \"circle_area\", \"arguments\": {\"radius\": 5}}, {\"name\": \"circle_area\", \"arguments\": {\"radius\": 7.5}}]\\n\\n## Query:\\nCan you provide the current locations of buses and trolleys on route 12?\\n## Answers:\\n[{\"name\": \"bus_trolley_locations\", \"arguments\": {\"route\": \"12\"}}]'}]\n
"},{"location":"components-gallery/steps/conversationtemplate/","title":"ConversationTemplate","text":"Generate a conversation template from an instruction and a response.
"},{"location":"components-gallery/steps/conversationtemplate/#input-output-columns","title":"Input & Output Columns","text":"graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[response]\n end\n subgraph New columns\n OCOL0[conversation]\n end\n end\n\n subgraph ConversationTemplate\n StepInput[Input Columns: instruction, response]\n StepOutput[Output Columns: conversation]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/conversationtemplate/#inputs","title":"Inputs","text":"instruction (str
): The instruction to be used in the conversation.
response (str
): The response to be used in the conversation.
ChatType
): The conversation template.from distilabel.steps import ConversationTemplate\n\nconv_template = ConversationTemplate()\nconv_template.load()\n\nresult = next(\n conv_template.process(\n [\n {\n \"instruction\": \"Hello\",\n \"response\": \"Hi\",\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'Hello', 'response': 'Hi', 'conversation': [{'role': 'user', 'content': 'Hello'}, {'role': 'assistant', 'content': 'Hi'}]}]\n
"},{"location":"components-gallery/steps/formattextgenerationdpo/","title":"FormatTextGenerationDPO","text":"Format the output of your LLMs for Direct Preference Optimization (DPO).
FormatTextGenerationDPO
is a Step
that formats the output of the combination of a TextGeneration
task with a preference Task
i.e. a task generating ratings
, so that those are used to rank the existing generations and provide the chosen
and rejected
generations based on the ratings
. Use this step to transform the output of a combination of a TextGeneration
+ a preference task such as UltraFeedback
following the standard formatting from frameworks such as axolotl
or alignment-handbook
.
The generations
column should contain at least two generations, the ratings
column should contain the same number of ratings as generations.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[system_prompt]\n ICOL1[instruction]\n ICOL2[generations]\n ICOL3[generation_models]\n ICOL4[ratings]\n end\n subgraph New columns\n OCOL0[prompt]\n OCOL1[prompt_id]\n OCOL2[chosen]\n OCOL3[chosen_model]\n OCOL4[chosen_rating]\n OCOL5[rejected]\n OCOL6[rejected_model]\n OCOL7[rejected_rating]\n end\n end\n\n subgraph FormatTextGenerationDPO\n StepInput[Input Columns: system_prompt, instruction, generations, generation_models, ratings]\n StepOutput[Output Columns: prompt, prompt_id, chosen, chosen_model, chosen_rating, rejected, rejected_model, rejected_rating]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n ICOL3 --> StepInput\n ICOL4 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n StepOutput --> OCOL4\n StepOutput --> OCOL5\n StepOutput --> OCOL6\n StepOutput --> OCOL7\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/formattextgenerationdpo/#inputs","title":"Inputs","text":"system_prompt (str
, optional): The system prompt used within the LLM
to generate the generations
, if available.
instruction (str
): The instruction used to generate the generations
with the LLM
.
generations (List[str]
): The generations produced by the LLM
.
generation_models (List[str]
, optional): The model names used to generate the generations
, only available if the model_name
from the TextGeneration
task(s) is combined into a single column named this way; otherwise, it will be ignored.
ratings (List[float]
): The ratings for each of the generations
, produced by a preference task such as UltraFeedback
.
prompt (str
): The instruction used to generate the generations
with the LLM
.
prompt_id (str
): The SHA256
hash of the prompt
.
chosen (List[Dict[str, str]]
): The chosen
generation based on the ratings
.
chosen_model (str
, optional): The model name used to generate the chosen
generation, if the generation_models
are available.
chosen_rating (float
): The rating of the chosen
generation.
rejected (List[Dict[str, str]]
): The rejected
generation based on the ratings
.
rejected_model (str
, optional): The model name used to generate the rejected
generation, if the generation_models
are available.
rejected_rating (float
): The rating of the rejected
generation.
from distilabel.steps import FormatTextGenerationDPO\n\nformat_dpo = FormatTextGenerationDPO()\nformat_dpo.load()\n\n# NOTE: Both \"system_prompt\" and \"generation_models\" can be added optionally.\nresult = next(\n format_dpo.process(\n [\n {\n \"instruction\": \"What's 2+2?\",\n \"generations\": [\"4\", \"5\", \"6\"],\n \"ratings\": [1, 0, -1],\n }\n ]\n )\n)\n# >>> result\n# [\n# { 'instruction': \"What's 2+2?\",\n# 'generations': ['4', '5', '6'],\n# 'ratings': [1, 0, -1],\n# 'prompt': \"What's 2+2?\",\n# 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n# 'chosen': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n# 'chosen_rating': 1,\n# 'rejected': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '6'}],\n# 'rejected_rating': -1\n# }\n# ]\n
"},{"location":"components-gallery/steps/formatchatgenerationdpo/","title":"FormatChatGenerationDPO","text":"Format the output of a combination of a ChatGeneration
+ a preference task for Direct Preference Optimization (DPO).
FormatChatGenerationDPO
is a Step
that formats the output of the combination of a ChatGeneration
task with a preference Task
i.e. a task generating ratings
such as UltraFeedback
following the standard formatting from frameworks such as axolotl
or alignment-handbook
, so that those are used to rank the existing generations and provide the chosen
and rejected
generations based on the ratings
.
The messages
column should contain at least one message from the user, the generations
column should contain at least two generations, the ratings
column should contain the same number of ratings as generations.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[messages]\n ICOL1[generations]\n ICOL2[generation_models]\n ICOL3[ratings]\n end\n subgraph New columns\n OCOL0[prompt]\n OCOL1[prompt_id]\n OCOL2[chosen]\n OCOL3[chosen_model]\n OCOL4[chosen_rating]\n OCOL5[rejected]\n OCOL6[rejected_model]\n OCOL7[rejected_rating]\n end\n end\n\n subgraph FormatChatGenerationDPO\n StepInput[Input Columns: messages, generations, generation_models, ratings]\n StepOutput[Output Columns: prompt, prompt_id, chosen, chosen_model, chosen_rating, rejected, rejected_model, rejected_rating]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n ICOL3 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n StepOutput --> OCOL4\n StepOutput --> OCOL5\n StepOutput --> OCOL6\n StepOutput --> OCOL7\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/formatchatgenerationdpo/#inputs","title":"Inputs","text":"messages (List[Dict[str, str]]
): The conversation messages.
generations (List[str]
): The generations produced by the LLM
.
generation_models (List[str]
, optional): The model names used to generate the generations
, only available if the model_name
from the ChatGeneration
task(s) is combined into a single column named this way; otherwise, it will be ignored.
ratings (List[float]
): The ratings for each of the generations
, produced by a preference task such as UltraFeedback
.
prompt (str
): The user message used to generate the generations
with the LLM
.
prompt_id (str
): The SHA256
hash of the prompt
.
chosen (List[Dict[str, str]]
): The chosen
generation based on the ratings
.
chosen_model (str
, optional): The model name used to generate the chosen
generation, if the generation_models
are available.
chosen_rating (float
): The rating of the chosen
generation.
rejected (List[Dict[str, str]]
): The rejected
generation based on the ratings
.
rejected_model (str
, optional): The model name used to generate the rejected
generation, if the generation_models
are available.
rejected_rating (float
): The rating of the rejected
generation.
from distilabel.steps import FormatChatGenerationDPO\n\nformat_dpo = FormatChatGenerationDPO()\nformat_dpo.load()\n\n# NOTE: \"generation_models\" can be added optionally.\nresult = next(\n format_dpo.process(\n [\n {\n \"messages\": [{\"role\": \"user\", \"content\": \"What's 2+2?\"}],\n \"generations\": [\"4\", \"5\", \"6\"],\n \"ratings\": [1, 0, -1],\n }\n ]\n )\n)\n# >>> result\n# [\n# {\n# 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}],\n# 'generations': ['4', '5', '6'],\n# 'ratings': [1, 0, -1],\n# 'prompt': \"What's 2+2?\",\n# 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n# 'chosen': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n# 'chosen_rating': 1,\n# 'rejected': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '6'}],\n# 'rejected_rating': -1\n# }\n# ]\n
"},{"location":"components-gallery/steps/formattextgenerationsft/","title":"FormatTextGenerationSFT","text":"Format the output of a TextGeneration
task for Supervised Fine-Tuning (SFT).
FormatTextGenerationSFT
is a Step
that formats the output of a TextGeneration
task for Supervised Fine-Tuning (SFT) following the standard formatting from frameworks such as axolotl
or alignment-handbook
. The output of the TextGeneration
task is formatted into a chat-like conversation with the instruction
as the user message and the generation
as the assistant message. Optionally, if the system_prompt
is available, it is included as the first message in the conversation.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[system_prompt]\n ICOL1[instruction]\n ICOL2[generation]\n end\n subgraph New columns\n OCOL0[prompt]\n OCOL1[prompt_id]\n OCOL2[messages]\n end\n end\n\n subgraph FormatTextGenerationSFT\n StepInput[Input Columns: system_prompt, instruction, generation]\n StepOutput[Output Columns: prompt, prompt_id, messages]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/formattextgenerationsft/#inputs","title":"Inputs","text":"system_prompt (str
, optional): The system prompt used within the LLM
to generate the generation
, if available.
instruction (str
): The instruction used to generate the generation
with the LLM
.
generation (str
): The generation produced by the LLM
.
prompt (str
): The instruction used to generate the generation
with the LLM
.
prompt_id (str
): The SHA256
hash of the prompt
.
messages (List[Dict[str, str]]
): The chat-like conversation with the instruction
as the user message and the generation
as the assistant message.
from distilabel.steps import FormatTextGenerationSFT\n\nformat_sft = FormatTextGenerationSFT()\nformat_sft.load()\n\n# NOTE: \"system_prompt\" can be added optionally.\nresult = next(\n format_sft.process(\n [\n {\n \"instruction\": \"What's 2+2?\",\n \"generation\": \"4\"\n }\n ]\n )\n)\n# >>> result\n# [\n# {\n# 'instruction': 'What's 2+2?',\n# 'generation': '4',\n# 'prompt': 'What's 2+2?',\n# 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n# 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}]\n# }\n# ]\n
"},{"location":"components-gallery/steps/formatchatgenerationsft/","title":"FormatChatGenerationSFT","text":"Format the output of a ChatGeneration
task for Supervised Fine-Tuning (SFT).
FormatChatGenerationSFT
is a Step
that formats the output of a ChatGeneration
task for Supervised Fine-Tuning (SFT) following the standard formatting from frameworks such as axolotl
or alignment-handbook
. The output of the ChatGeneration
task is formatted into a chat-like conversation with the instruction
as the user message and the generation
as the assistant message. Optionally, if the system_prompt
is available, it is included as the first message in the conversation.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[system_prompt]\n ICOL1[instruction]\n ICOL2[generation]\n end\n subgraph New columns\n OCOL0[prompt]\n OCOL1[prompt_id]\n OCOL2[messages]\n end\n end\n\n subgraph FormatChatGenerationSFT\n StepInput[Input Columns: system_prompt, instruction, generation]\n StepOutput[Output Columns: prompt, prompt_id, messages]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/formatchatgenerationsft/#inputs","title":"Inputs","text":"system_prompt (str
, optional): The system prompt used within the LLM
to generate the generation
, if available.
instruction (str
): The instruction used to generate the generation
with the LLM
.
generation (str
): The generation produced by the LLM
.
prompt (str
): The instruction used to generate the generation
with the LLM
.
prompt_id (str
): The SHA256
hash of the prompt
.
messages (List[Dict[str, str]]
): The chat-like conversation with the instruction
as the user message and the generation
as the assistant message.
from distilabel.steps import FormatChatGenerationSFT\n\nformat_sft = FormatChatGenerationSFT()\nformat_sft.load()\n\n# NOTE: \"system_prompt\" can be added optionally.\nresult = next(\n format_sft.process(\n [\n {\n \"messages\": [{\"role\": \"user\", \"content\": \"What's 2+2?\"}],\n \"generation\": \"4\"\n }\n ]\n )\n)\n# >>> result\n# [\n# {\n# 'messages': [{'role': 'user', 'content': \"What's 2+2?\"}, {'role': 'assistant', 'content': '4'}],\n# 'generation': '4',\n# 'prompt': 'What's 2+2?',\n# 'prompt_id': '7762ecf17ad41479767061a8f4a7bfa3b63d371672af5180872f9b82b4cd4e29',\n# }\n# ]\n
"},{"location":"components-gallery/steps/deitafiltering/","title":"DeitaFiltering","text":"Filter dataset rows using DEITA filtering strategy.
Filter the dataset based on the DEITA score and the cosine distance between the embeddings. It's an implementation of the filtering step from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
"},{"location":"components-gallery/steps/deitafiltering/#attributes","title":"Attributes","text":"data_budget: The desired size of the dataset after filtering.
diversity_threshold: If a row has a cosine distance with respect to its nearest neighbor greater than this value, it will be included in the filtered dataset. Defaults to 0.9
.
normalize_embeddings: Whether to normalize the embeddings before computing the cosine distance. Defaults to True
.
data_budget: The desired size of the dataset after filtering.
diversity_threshold: If a row has a cosine distance with respect to its nearest neighbor greater than this value, it will be included in the filtered dataset.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[evol_instruction_score]\n ICOL1[evol_response_score]\n ICOL2[embedding]\n end\n subgraph New columns\n OCOL0[deita_score]\n OCOL1[deita_score_computed_with]\n OCOL2[nearest_neighbor_distance]\n end\n end\n\n subgraph DeitaFiltering\n StepInput[Input Columns: evol_instruction_score, evol_response_score, embedding]\n StepOutput[Output Columns: deita_score, deita_score_computed_with, nearest_neighbor_distance]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/deitafiltering/#inputs","title":"Inputs","text":"evol_instruction_score (float
): The score of the instruction generated by ComplexityScorer
step.
evol_response_score (float
): The score of the response generated by QualityScorer
step.
embedding (List[float]
): The embedding generated for the conversation of the instruction-response pair using GenerateEmbeddings
step.
deita_score (float
): The DEITA score for the instruction-response pair.
deita_score_computed_with (List[str]
): The scores used to compute the DEITA score.
nearest_neighbor_distance (float
): The cosine distance between the embeddings of the instruction-response pair.
from distilabel.steps import DeitaFiltering\n\ndeita_filtering = DeitaFiltering(data_budget=1)\n\ndeita_filtering.load()\n\nresult = next(\n deita_filtering.process(\n [\n {\n \"evol_instruction_score\": 0.5,\n \"evol_response_score\": 0.5,\n \"embedding\": [-8.12729941, -5.24642847, -6.34003029],\n },\n {\n \"evol_instruction_score\": 0.6,\n \"evol_response_score\": 0.6,\n \"embedding\": [2.99329242, 0.7800932, 0.7799726],\n },\n {\n \"evol_instruction_score\": 0.7,\n \"evol_response_score\": 0.7,\n \"embedding\": [10.29041806, 14.33088073, 13.00557506],\n },\n ],\n )\n)\n# >>> result\n# [{'evol_instruction_score': 0.5, 'evol_response_score': 0.5, 'embedding': [-8.12729941, -5.24642847, -6.34003029], 'deita_score': 0.25, 'deita_score_computed_with': ['evol_instruction_score', 'evol_response_score'], 'nearest_neighbor_distance': 1.9042812683723933}]\n
"},{"location":"components-gallery/steps/deitafiltering/#references","title":"References","text":"Deduplicates text using embeddings.
EmbeddingDedup
is a Step that detects near-duplicates in datasets, using embeddings to compare the similarity between the texts. The typical workflow with this step would include having a dataset with embeddings precomputed, and then (possibly using the FaissNearestNeighbour
) using the nn_indices
and nn_scores
, determine the texts that are duplicate.
0.9
would make all the texts with a cosine similarity above the value duplicates. Higher values detect less duplicates in such an index, but that should be taken into account when building it. Defaults to 0.9
. Runtime Parameters: - threshold
: the threshold to consider 2 examples as duplicates.graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[nn_indices]\n ICOL1[nn_scores]\n end\n subgraph New columns\n OCOL0[keep_row_after_embedding_filtering]\n end\n end\n\n subgraph EmbeddingDedup\n StepInput[Input Columns: nn_indices, nn_scores]\n StepOutput[Output Columns: keep_row_after_embedding_filtering]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/embeddingdedup/#inputs","title":"Inputs","text":"nn_indices (List[int]
): a list containing the indices of the k
nearest neighbours in the inputs for the row.
nn_scores (List[float]
): a list containing the score or distance to each k
nearest neighbour in the inputs.
bool
): boolean indicating if the piece text
is not a duplicate i.e. this text should be kept.from distilabel.pipeline import Pipeline\nfrom distilabel.steps import EmbeddingDedup\nfrom distilabel.steps import LoadDataFromDicts\n\nwith Pipeline() as pipeline:\n data = LoadDataFromDicts(\n data=[\n {\n \"persona\": \"A chemistry student or academic researcher interested in inorganic or physical chemistry, likely at an advanced undergraduate or graduate level, studying acid-base interactions and chemical bonding.\",\n \"embedding\": [\n 0.018477669046149742,\n -0.03748236608841726,\n 0.001919870620352492,\n 0.024918478063770535,\n 0.02348063521315178,\n 0.0038251285566308375,\n -0.01723884983037716,\n 0.02881971942372201,\n ],\n \"nn_indices\": [0, 1],\n \"nn_scores\": [\n 0.9164746999740601,\n 0.782106876373291,\n ],\n },\n {\n \"persona\": \"A music teacher or instructor focused on theoretical and practical piano lessons.\",\n \"embedding\": [\n -0.0023464179614082125,\n -0.07325472251663565,\n -0.06058678419516501,\n -0.02100326928586996,\n -0.013462744792362657,\n 0.027368447064244242,\n -0.003916070100455717,\n 0.01243614518480423,\n ],\n \"nn_indices\": [0, 2],\n \"nn_scores\": [\n 0.7552462220191956,\n 0.7261884808540344,\n ],\n },\n {\n \"persona\": \"A classical guitar teacher or instructor, likely with experience teaching beginners, who focuses on breaking down complex music notation into understandable steps for their students.\",\n \"embedding\": [\n -0.01630817942328242,\n -0.023760151552345232,\n -0.014249650090627883,\n -0.005713686451446624,\n -0.016033059279131567,\n 0.0071440908501058786,\n -0.05691099643425161,\n 0.01597412704817784,\n ],\n \"nn_indices\": [1, 2],\n \"nn_scores\": [\n 0.8107735514640808,\n 0.7172299027442932,\n ],\n },\n ],\n batch_size=batch_size,\n )\n # In general you should do something like this before the deduplication step, to obtain the\n # `nn_indices` and `nn_scores`. In this case the embeddings are already normalized, so there's\n # no need for it.\n # nn = FaissNearestNeighbour(\n # k=30,\n # metric_type=faiss.METRIC_INNER_PRODUCT,\n # search_batch_size=50,\n # train_size=len(dataset), # The number of embeddings to use for training\n # string_factory=\"IVF300_HNSW32,Flat\" # To use an index (optional, maybe required for big datasets)\n # )\n # Read more about the `string_factory` here:\n # https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index\n\n embedding_dedup = EmbeddingDedup(\n threshold=0.8,\n input_batch_size=batch_size,\n )\n\n data >> embedding_dedup\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False)\n ds = distiset[\"default\"][\"train\"]\n # Filter out the duplicates\n ds_dedup = ds.filter(lambda x: x[\"keep_row_after_embedding_filtering\"])\n
"},{"location":"components-gallery/steps/apigenexecutionchecker/","title":"APIGenExecutionChecker","text":"Executes the generated function calls.
This step checks if a given answer from a model as generated by APIGenGenerator
can be executed against the given library (given by libpath
, which is a string pointing to a python .py file with functions).
libpath: The path to the library where we will retrieve the functions. It can also point to a folder with the functions. In this case, the folder layout should be a folder with .py files, each containing a single function, the name of the function being the same as the filename.
check_is_dangerous: Bool to exclude some potentially dangerous functions; it contains some heuristics found while testing. These functions can run subprocesses, deal with the OS, or perform other potentially dangerous operations. Defaults to True.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[answers]\n end\n subgraph New columns\n OCOL0[keep_row_after_execution_check]\n OCOL1[execution_result]\n end\n end\n\n subgraph APIGenExecutionChecker\n StepInput[Input Columns: answers]\n StepOutput[Output Columns: keep_row_after_execution_check, execution_result]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/apigenexecutionchecker/#inputs","title":"Inputs","text":"str
): List with arguments to be passed to the function, dumped as a string from a list of dictionaries. Should be loaded using json.loads
.keep_row_after_execution_check (bool
): Whether the function should be kept or not.
execution_result (str
): The result from executing the function.
from distilabel.steps.tasks import APIGenExecutionChecker\n\n# For the libpath you can use as an example the file at the tests folder:\n# ../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py\ntask = APIGenExecutionChecker(\n libpath=\"../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py\",\n)\ntask.load()\n\nres = next(\n task.process(\n [\n {\n \"answers\": [\n {\n \"arguments\": {\n \"initial_velocity\": 0.2,\n \"acceleration\": 0.1,\n \"time\": 0.5,\n },\n \"name\": \"final_velocity\",\n }\n ],\n }\n ]\n )\n)\nres\n#[{'answers': [{'arguments': {'initial_velocity': 0.2, 'acceleration': 0.1, 'time': 0.5}, 'name': 'final_velocity'}], 'keep_row_after_execution_check': True, 'execution_result': ['0.25']}]\n
"},{"location":"components-gallery/steps/apigenexecutionchecker/#references","title":"References","text":"APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
Salesforce/xlam-function-calling-60k
Deduplicates text using MinHash
and MinHashLSH
.
MinHashDedup
is a Step that detects near-duplicates in datasets. The idea roughly translates to the following steps: 1. Tokenize the text into words or ngrams. 2. Create a MinHash
for each text. 3. Store the MinHashes
in a MinHashLSH
. 4. Check if the MinHash
is already in the LSH
, if so, it is a duplicate.
num_perm: the number of permutations to use. Defaults to 128
.
seed: the seed to use for the MinHash. Defaults to 1
.
tokenizer: the tokenizer to use. Available ones are words
or ngrams
. If words
is selected, it tokenizes the text into words using nltk's word tokenizer. ngram
estimates the ngrams (together with the size n
). Defaults to words
.
n: the size of the ngrams to use. Only relevant if tokenizer=\"ngrams\"
. Defaults to 5
.
threshold: the threshold to consider two MinHashes as duplicates. Values closer to 0 detect more duplicates. Defaults to 0.9
.
storage: the storage to use for the LSH. Can be dict
to store the index in memory, or disk
. Keep in mind, disk
is an experimental feature not defined in datasketch
, that is based on DiskCache's Index
class. It should work as a dict
, but backed by disk; depending on the system it can be slower. Defaults to dict
.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[text]\n end\n subgraph New columns\n OCOL0[keep_row_after_minhash_filtering]\n end\n end\n\n subgraph MinHashDedup\n StepInput[Input Columns: text]\n StepOutput[Output Columns: keep_row_after_minhash_filtering]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/minhashdedup/#inputs","title":"Inputs","text":"str
): the texts to be filtered.bool
): boolean indicating if the piece text
is not a duplicate i.e. this text should be kept.from distilabel.pipeline import Pipeline\nfrom distilabel.steps import MinHashDedup\nfrom distilabel.steps import LoadDataFromDicts\n\nwith Pipeline() as pipeline:\n ds_size = 1000\n batch_size = 500 # Bigger batch sizes work better for this step\n data = LoadDataFromDicts(\n data=[\n {\"text\": \"This is a test document.\"},\n {\"text\": \"This document is a test.\"},\n {\"text\": \"Test document for duplication.\"},\n {\"text\": \"Document for duplication test.\"},\n {\"text\": \"This is another unique document.\"},\n ]\n * (ds_size // 5),\n batch_size=batch_size,\n )\n minhash_dedup = MinHashDedup(\n tokenizer=\"words\",\n threshold=0.9, # lower values will increase the number of duplicates\n storage=\"dict\", # or \"disk\" for bigger datasets\n )\n\n data >> minhash_dedup\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(use_cache=False)\n ds = distiset[\"default\"][\"train\"]\n # Filter out the duplicates\n ds_dedup = ds.filter(lambda x: x[\"keep_row_after_minhash_filtering\"])\n
"},{"location":"components-gallery/steps/minhashdedup/#references","title":"References","text":"datasketch documentation
Identifying and Filtering Near-Duplicate Documents
Diskcache's Index
Combine the outputs of several upstream steps.
CombineOutputs
is a Step
that takes the outputs of several upstream steps and combines them to generate a new dictionary with all keys/columns of the upstream steps outputs.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph CombineOutputs\n StepInput[Input Columns: dynamic]\n StepOutput[Output Columns: dynamic]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/combineoutputs/#inputs","title":"Inputs","text":"Step
s): All the columns of the upstream steps outputs.Step
s): All the columns of the upstream steps outputs.from distilabel.steps import CombineOutputs\n\ncombine_outputs = CombineOutputs()\ncombine_outputs.load()\n\nresult = next(\n combine_outputs.process(\n [{\"a\": 1, \"b\": 2}, {\"a\": 3, \"b\": 4}],\n [{\"c\": 5, \"d\": 6}, {\"c\": 7, \"d\": 8}],\n )\n)\n# [\n# {\"a\": 1, \"b\": 2, \"c\": 5, \"d\": 6},\n# {\"a\": 3, \"b\": 4, \"c\": 7, \"d\": 8},\n# ]\n
"},{"location":"components-gallery/steps/combineoutputs/#combine-upstream-steps-outputs-in-a-pipeline","title":"Combine upstream steps outputs in a pipeline","text":"from distilabel.pipeline import Pipeline\nfrom distilabel.steps import CombineOutputs\n\nwith Pipeline() as pipeline:\n step_1 = ...\n step_2 = ...\n step_3 = ...\n combine = CombineOutputs()\n\n [step_1, step_2, step_3] >> combine\n
"},{"location":"components-gallery/steps/expandcolumns/","title":"ExpandColumns","text":"Expand columns that contain lists into multiple rows.
ExpandColumns
is a Step
that takes a list of columns and expands them into multiple rows. The new rows will have the same data as the original row, except for the expanded column, which will contain a single item from the original list.
columns: A dictionary that maps the column to be expanded to the new column name or a list of columns to be expanded. If a list is provided, the new column name will be the same as the column name.
encoded: A bool to inform Whether the columns are JSON encoded lists. If this value is set to True, the columns will be decoded before expanding. Alternatively, to specify columns that can be encoded, a list can be provided. In this case, the column names informed must be a subset of the columns selected for expansion.
split_statistics: A bool to inform whether the statistics in the distilabel_metadata
column should be split into multiple rows. If we want to expand some columns containing a list of strings that come from having parsed the output of an LLM, the tokens in the statistics_{step_name}
of the distilabel_metadata
column should be splitted to avoid multiplying them if we aggregate the data afterwards. For example, with a task that is supposed to generate a list of N instructions, and we want each of those N instructions in different rows, we should split the statistics by N. In such a case, set this value to True.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph ExpandColumns\n StepInput[Input Columns: dynamic]\n StepOutput[Output Columns: dynamic]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/expandcolumns/#inputs","title":"Inputs","text":"columns
attribute): The columns to be expanded into multiple rows.columns
attribute): The expanded columns.from distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(\n columns=[\"generation\"],\n)\nexpand_columns.load()\n\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": [\"generation 1\", \"generation 2\"]}\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction 1', 'generation': 'generation 1'}, {'instruction': 'instruction 1', 'generation': 'generation 2'}]\n
"},{"location":"components-gallery/steps/expandcolumns/#expand-the-selected-columns-which-are-json-encoded-into-multiple-rows","title":"Expand the selected columns which are JSON encoded into multiple rows","text":"from distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(\n columns=[\"generation\"],\n encoded=True, # It can also be a list of columns that are encoded, i.e. [\"generation\"]\n)\nexpand_columns.load()\n\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": '[\"generation 1\", \"generation 2\"]'}\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction 1', 'generation': 'generation 1'}, {'instruction': 'instruction 1', 'generation': 'generation 2'}]\n
"},{"location":"components-gallery/steps/expandcolumns/#expand-the-selected-columns-and-split-the-statistics-in-the-distilabel_metadata-column","title":"Expand the selected columns and split the statistics in the distilabel_metadata
column","text":"from distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(\n columns=[\"generation\"],\n split_statistics=True,\n)\nexpand_columns.load()\n\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"instruction 1\",\n \"generation\": [\"generation 1\", \"generation 2\"],\n \"distilabel_metadata\": {\n \"statistics_generation\": {\n \"input_tokens\": [12],\n \"output_tokens\": [12],\n },\n },\n }\n ],\n )\n)\n# >>> result\n# [{'instruction': 'instruction 1', 'generation': 'generation 1', 'distilabel_metadata': {'statistics_generation': {'input_tokens': [6], 'output_tokens': [6]}}}, {'instruction': 'instruction 1', 'generation': 'generation 2', 'distilabel_metadata': {'statistics_generation': {'input_tokens': [6], 'output_tokens': [6]}}}]\n
"},{"location":"components-gallery/steps/groupcolumns/","title":"GroupColumns","text":"Combines columns from a list of StepInput
.
GroupColumns
is a Step
that implements the process
method that calls the group_dicts
function to handle and combine a list of StepInput
. Also GroupColumns
provides two attributes columns
and output_columns
to specify the columns to group and the output columns which will override the default value for the properties inputs
and outputs
, respectively.
columns: List of strings with the names of the columns to group.
output_columns: Optional list of strings with the names of the output columns.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph GroupColumns\n StepInput[Input Columns: dynamic]\n StepOutput[Output Columns: dynamic]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/groupcolumns/#inputs","title":"Inputs","text":"columns
attribute): The columns to group.columns
and output_columns
attributes): The columns that were grouped.from distilabel.steps import GroupColumns\n\ngroup_columns = GroupColumns(\n name=\"group_columns\",\n columns=[\"generation\", \"model_name\"],\n)\ngroup_columns.load()\n\nresult = next(\n group_columns.process(\n [{\"generation\": \"AI generated text\"}, {\"model_name\": \"my_model\"}],\n [{\"generation\": \"Other generated text\", \"model_name\": \"my_model\"}]\n )\n)\n# >>> result\n# [{'merged_generation': ['AI generated text', 'Other generated text'], 'merged_model_name': ['my_model']}]\n
"},{"location":"components-gallery/steps/groupcolumns/#specify-the-name-of-the-output-columns","title":"Specify the name of the output columns","text":"from distilabel.steps import GroupColumns\n\ngroup_columns = GroupColumns(\n name=\"group_columns\",\n columns=[\"generation\", \"model_name\"],\n output_columns=[\"generations\", \"generation_models\"]\n)\ngroup_columns.load()\n\nresult = next(\n group_columns.process(\n [{\"generation\": \"AI generated text\"}, {\"model_name\": \"my_model\"}],\n [{\"generation\": \"Other generated text\", \"model_name\": \"my_model\"}]\n )\n)\n# >>> result\n#[{'generations': ['AI generated text', 'Other generated text'], 'generation_models': ['my_model']}]\n
"},{"location":"components-gallery/steps/keepcolumns/","title":"KeepColumns","text":"Keeps selected columns in the dataset.
KeepColumns
is a Step
that implements the process
method that keeps only the columns specified in the columns
attribute. Also KeepColumns
provides an attribute columns
to specify the columns to keep which will override the default value for the properties inputs
and outputs
.
The order in which the columns are provided is important, as the output will be sorted using the provided order, which is useful before pushing either a dataset.Dataset
via the PushToHub
step or a distilabel.Distiset
via the Pipeline.run
output variable.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph KeepColumns\n StepInput[Input Columns: dynamic]\n StepOutput[Output Columns: dynamic]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/keepcolumns/#inputs","title":"Inputs","text":"columns
attribute): The columns to keep.columns
attribute): The columns that were kept.from distilabel.steps import KeepColumns\n\nkeep_columns = KeepColumns(\n columns=[\"instruction\", \"generation\"],\n)\nkeep_columns.load()\n\nresult = next(\n keep_columns.process(\n [{\"instruction\": \"What's the brightest color?\", \"generation\": \"white\", \"model_name\": \"my_model\"}],\n )\n)\n# >>> result\n# [{'instruction': \"What's the brightest color?\", 'generation': 'white'}]\n
"},{"location":"components-gallery/steps/mergecolumns/","title":"MergeColumns","text":"Merge columns from a row.
MergeColumns
is a Step
that implements the process
method that calls the merge_columns
function to handle and combine columns in a StepInput
. MergeColumns
provides two attributes columns
and output_column
to specify the columns to merge and the resulting output column.
This step can be useful if you have a `Task` that generates instructions for example, and you\nwant to have more examples of those. In such a case, you could for example use another `Task`\nto multiply your instructions synthetically, what would yield two different columns splitted.\nUsing `MergeColumns` you can merge them and use them as a single column in your dataset for\nfurther processing.\n
"},{"location":"components-gallery/steps/mergecolumns/#attributes","title":"Attributes","text":"columns: List of strings with the names of the columns to merge.
output_column: str name of the output column
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph MergeColumns\n StepInput[Input Columns: dynamic]\n StepOutput[Output Columns: dynamic]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/mergecolumns/#inputs","title":"Inputs","text":"columns
attribute): The columns to merge.columns
and output_column
attributes): The columns that were merged.from distilabel.steps import MergeColumns\n\ncombiner = MergeColumns(\n columns=[\"queries\", \"multiple_queries\"],\n output_column=\"queries\",\n)\ncombiner.load()\n\nresult = next(\n combiner.process(\n [\n {\n \"queries\": \"How are you?\",\n \"multiple_queries\": [\"What's up?\", \"Everything ok?\"]\n }\n ],\n )\n)\n# >>> result\n# [{'queries': ['How are you?', \"What's up?\", 'Everything ok?']}]\n
"},{"location":"components-gallery/steps/dbscan/","title":"DBSCAN","text":"DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core
samples in regions of high density and expands clusters from them. This algorithm is good for data which contains clusters of similar density.
This is a `GlobalStep` that clusters the embeddings using the DBSCAN algorithm\nfrom `sklearn`. Visit `TextClustering` step for an example of use.\nThe trained model is saved as an artifact when creating a distiset\nand pushing it to the Hugging Face Hub.\n
"},{"location":"components-gallery/steps/dbscan/#attributes","title":"Attributes","text":"min_samples
is set to a higher value, DBSCAN will find denser clusters, whereas if it is set to a lower value, the found clusters will be more sparse. - metric: The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances
for its metric parameter. - n_jobs: The number of parallel jobs to run.eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
min_samples: The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. If min_samples
is set to a higher value, DBSCAN will find denser clusters, whereas if it is set to a lower value, the found clusters will be more sparse.
metric: The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances
for its metric parameter.
n_jobs: The number of parallel jobs to run.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[projection]\n end\n subgraph New columns\n OCOL0[cluster_label]\n end\n end\n\n subgraph DBSCAN\n StepInput[Input Columns: projection]\n StepOutput[Output Columns: cluster_label]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/dbscan/#inputs","title":"Inputs","text":"List[float]
): Vector representation of the text to cluster, normally the output from the UMAP
step.int
): Integer representing the label of a given cluster. -1 means it wasn't clustered.DBSCAN demo of sklearn
sklearn dbscan
UMAP is a general purpose manifold learning and dimension reduction algorithm.
This is a GlobalStep
that reduces the dimensionality of the embeddings using. Visit the TextClustering
step for an example of use. The trained model is saved as an artifact when creating a distiset and pushing it to the Hugging Face Hub.
euclidean
. - n_jobs: The number of parallel jobs to run. Defaults to 8
. - random_state: The random state to use for the UMAP algorithm.n_components: The dimension of the space to embed into. This defaults to 2 to provide easy visualization (that's probably what you want), but can reasonably be set to any integer value in the range 2 to 100.
metric: The metric to use to compute distances in high dimensional space. Visit UMAP's documentation for more information. Defaults to euclidean
.
n_jobs: The number of parallel jobs to run. Defaults to 8
.
random_state: The random state to use for the UMAP algorithm.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[embedding]\n end\n subgraph New columns\n OCOL0[projection]\n end\n end\n\n subgraph UMAP\n StepInput[Input Columns: embedding]\n StepOutput[Output Columns: projection]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/umap/#inputs","title":"Inputs","text":"List[float]
): The original embeddings we want to reduce the dimension.List[float]
): Embedding reduced to the number of components specified, the size of the new embeddings will be determined by the n_components
.UMAP repository
UMAP documentation
Create a faiss
index to get the nearest neighbours.
FaissNearestNeighbour
is a GlobalStep
that creates a faiss
index using the Hugging Face datasets
library integration, and then gets the nearest neighbours and the scores or distance of the nearest neighbours for each input row.
device: the CUDA device ID or a list of IDs to be used. If negative integer, it will use all the available GPUs. Defaults to None
.
string_factory: the name of the factory to be used to build the faiss
index. Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes. Defaults to None
.
metric_type: the metric to be used to measure the distance between the points. It's an integer and the recommend way to pass it is importing faiss
and then passing one of faiss.METRIC_x
variables. Defaults to None
.
k: the number of nearest neighbours to search for each input row. Defaults to 1
.
search_batch_size: the number of rows to include in a search batch. The value can be adjusted to maximize the resources usage or to avoid OOM issues. Defaults to 50
.
train_size: If the index needs a training step, specifies how many vectors will be used to train the index.
device: the CUDA device ID or a list of IDs to be used. If negative integer, it will use all the available GPUs. Defaults to None
.
string_factory: the name of the factory to be used to build the faiss
index. Available string factories can be checked here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes. Defaults to None
.
metric_type: the metric to be used to measure the distance between the points. It's an integer and the recommend way to pass it is importing faiss
and then passing one of faiss.METRIC_x
variables. Defaults to None
.
k: the number of nearest neighbours to search for each input row. Defaults to 1
.
search_batch_size: the number of rows to include in a search batch. The value can be adjusted to maximize the resources usage or to avoid OOM issues. Defaults to 50
.
train_size: If the index needs a training step, specifies how many vectors will be used to train the index.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[embedding]\n end\n subgraph New columns\n OCOL0[nn_indices]\n OCOL1[nn_scores]\n end\n end\n\n subgraph FaissNearestNeighbour\n StepInput[Input Columns: embedding]\n StepOutput[Output Columns: nn_indices, nn_scores]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/faissnearestneighbour/#inputs","title":"Inputs","text":"List[Union[float, int]]
): a sentence embedding.nn_indices (List[int]
): a list containing the indices of the k
nearest neighbours in the inputs for the row.
nn_scores (List[float]
): a list containing the score or distance to each k
nearest neighbour in the inputs.
from distilabel.models import SentenceTransformerEmbeddings\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import EmbeddingGeneration, FaissNearestNeighbour, LoadDataFromHub\n\nwith Pipeline(name=\"hello\") as pipeline:\n load_data = LoadDataFromHub(output_mappings={\"prompt\": \"text\"})\n\n embeddings = EmbeddingGeneration(\n embeddings=SentenceTransformerEmbeddings(\n model=\"mixedbread-ai/mxbai-embed-large-v1\"\n )\n )\n\n nearest_neighbours = FaissNearestNeighbour()\n\n load_data >> embeddings >> nearest_neighbours\n\nif __name__ == \"__main__\":\n distiset = pipeline.run(\n parameters={\n load_data.name: {\n \"repo_id\": \"distilabel-internal-testing/instruction-dataset-mini\",\n \"split\": \"test\",\n },\n },\n use_cache=False,\n )\n
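If a non-default index or metric is needed, the attributes above suggest a configuration along these lines (a hedged sketch: the factory string, metric and sizes are illustrative values, not defaults of this step):
import faiss\nfrom distilabel.steps import FaissNearestNeighbour\n\n# Inner-product IVF index, trained on 1000 vectors, returning the 5 nearest neighbours per row\nnearest_neighbours = FaissNearestNeighbour(\n string_factory=\"IVF256,Flat\", # any string factory from the faiss wiki\n metric_type=faiss.METRIC_INNER_PRODUCT, # pass one of the faiss.METRIC_x variables\n k=5,\n train_size=1000,\n search_batch_size=50,\n)\n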
"},{"location":"components-gallery/steps/faissnearestneighbour/#references","title":"References","text":"Generate embeddings using an Embeddings
model.
EmbeddingGeneration
is a Step
that using an Embeddings
model generates sentence embeddings for the provided input texts.
Embeddings
model used to generate the sentence embeddings.graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[text]\n end\n subgraph New columns\n OCOL0[embedding]\n end\n end\n\n subgraph EmbeddingGeneration\n StepInput[Input Columns: text]\n StepOutput[Output Columns: embedding]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/embeddinggeneration/#inputs","title":"Inputs","text":"str
): The text for which the sentence embedding has to be generated.List[Union[float, int]]
): the generated sentence embedding.from distilabel.models import SentenceTransformerEmbeddings\nfrom distilabel.steps import EmbeddingGeneration\n\nembedding_generation = EmbeddingGeneration(\n embeddings=SentenceTransformerEmbeddings(\n model=\"mixedbread-ai/mxbai-embed-large-v1\",\n )\n)\n\nembedding_generation.load()\n\nresult = next(embedding_generation.process([{\"text\": \"Hello, how are you?\"}]))\n# [{'text': 'Hello, how are you?', 'embedding': [0.06209656596183777, -0.015797119587659836, ...]}]\n
"},{"location":"components-gallery/steps/rewardmodelscore/","title":"RewardModelScore","text":"Assign a score to a response using a Reward Model.
RewardModelScore
is a Step
that using a Reward Model (RM) loaded using transformers
, assigns a score to a response generated for an instruction, or a score to a multi-turn conversation.
model: the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
revision: if model
refers to a Hugging Face Hub repository, then the revision (e.g. a branch name or a commit id) to use. Defaults to \"main\"
.
torch_dtype: the torch dtype to use for the model e.g. \"float16\", \"float32\", etc. Defaults to \"auto\"
.
trust_remote_code: whether to allow fetching and executing remote code fetched from the repository in the Hub. Defaults to False
.
device_map: a dictionary mapping each layer of the model to a device, or a mode like \"sequential\"
or \"auto\"
. Defaults to None
.
token: the Hugging Face Hub token that will be used to authenticate to the Hugging Face Hub. If not provided, the HF_TOKEN
environment or huggingface_hub
package local configuration will be used. Defaults to None
.
truncation: whether to truncate sequences at the maximum length. Defaults to False
.
max_length: maximum length to use for padding or truncation. Defaults to None
.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[response]\n ICOL2[conversation]\n end\n subgraph New columns\n OCOL0[score]\n end\n end\n\n subgraph RewardModelScore\n StepInput[Input Columns: instruction, response, conversation]\n StepOutput[Output Columns: score]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/rewardmodelscore/#inputs","title":"Inputs","text":"instruction (str
, optional): the instruction used to generate a response
. If provided, then response
must be provided too.
response (str
, optional): the response generated for instruction
. If provided, then instruction
must be provided too.
conversation (ChatType
, optional): a multi-turn conversation. If not provided, then instruction
and response
columns must be provided.
float
): the score given by the reward model for the instruction-response pair or the conversation.from distilabel.steps import RewardModelScore\n\nstep = RewardModelScore(\n model=\"RLHFlow/ArmoRM-Llama3-8B-v0.1\", device_map=\"auto\", trust_remote_code=True\n)\n\nstep.load()\n\nresult = next(\n step.process(\n inputs=[\n {\n \"instruction\": \"How much is 2+2?\",\n \"response\": \"The output of 2+2 is 4\",\n },\n {\"instruction\": \"How much is 2+2?\", \"response\": \"4\"},\n ]\n )\n)\n# [\n# {'instruction': 'How much is 2+2?', 'response': 'The output of 2+2 is 4', 'score': 0.11690367758274078},\n# {'instruction': 'How much is 2+2?', 'response': '4', 'score': 0.10300665348768234}\n# ]\n
"},{"location":"components-gallery/steps/rewardmodelscore/#turn-conversation","title":"turn conversation","text":"from distilabel.steps import RewardModelScore\n\nstep = RewardModelScore(\n model=\"RLHFlow/ArmoRM-Llama3-8B-v0.1\", device_map=\"auto\", trust_remote_code=True\n)\n\nstep.load()\n\nresult = next(\n step.process(\n inputs=[\n {\n \"conversation\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n {\"role\": \"assistant\", \"content\": \"The output of 2+2 is 4\"},\n ],\n },\n {\n \"conversation\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n {\"role\": \"assistant\", \"content\": \"4\"},\n ],\n },\n ]\n )\n)\n# [\n# {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 'assistant', 'content': 'The output of 2+2 is 4'}], 'score': 0.11690367758274078},\n# {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 'assistant', 'content': '4'}], 'score': 0.10300665348768234}\n# ]\n
"},{"location":"components-gallery/steps/formatprm/","title":"FormatPRM","text":"Helper step to transform the data into the format expected by the PRM model.
This step can be used to format the data in one of two formats: (1) following the format presented in peiyi9979/Math-Shepherd, in which case this step creates the columns input and label, where the input is the instruction with the solution (and the tag replaced by a token), and the label is the instruction with the solution, both separated by a newline; (2) following TRL's format for training, which generates the columns prompt, completions, and labels, where the labels correspond to the original tags replaced by boolean values in which True represents correct steps.
"},{"location":"components-gallery/steps/formatprm/#attributes","title":"Attributes","text":"format: The format to use for the PRM model. \"math-shepherd\" corresponds to the original paper, while \"trl\" is a format prepared to train the model using TRL.
step_token: String that serves as a unique token denoting the position for predicting the step score.
tags: List of tags that represent the correct and incorrect steps. This only needs to be informed if it's different than the default in MathShepherdCompleter
.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[solutions]\n end\n subgraph New columns\n OCOL0[input]\n OCOL1[label]\n OCOL2[prompt]\n OCOL3[completions]\n OCOL4[labels]\n end\n end\n\n subgraph FormatPRM\n StepInput[Input Columns: instruction, solutions]\n StepOutput[Output Columns: input, label, prompt, completions, labels]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n StepOutput --> OCOL4\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/formatprm/#inputs","title":"Inputs","text":"instruction (str
): The task or instruction.
solutions (list[str]
): List of steps with a solution to the task.
input (str
): The instruction with the solutions, where the label tags are replaced by a token.
label (str
): The instruction with the solutions.
prompt (str
): The instruction with the solutions, where the label tags are replaced by a token.
completions (List[str]
): The solution represented as a list of steps.
labels (List[bool]
): The labels, as a list of booleans, where True represents a good response.
from distilabel.steps.tasks import FormatPRM\nfrom distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(columns=[\"solutions\"])\nexpand_columns.load()\n\n# Define our PRM formatter\nformatter = FormatPRM()\nformatter.load()\n\n# Expand the solutions column as it comes from the MathShepherdCompleter\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"solutions\": [[\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"]]\n },\n ]\n )\n)\nresult = next(formatter.process(result))\n
"},{"location":"components-gallery/steps/formatprm/#prepare-your-data-to-train-a-prm-model-with-the-trl-format","title":"Prepare your data to train a PRM model with the TRL format","text":"from distilabel.steps.tasks import FormatPRM\nfrom distilabel.steps import ExpandColumns\n\nexpand_columns = ExpandColumns(columns=[\"solutions\"])\nexpand_columns.load()\n\n# Define our PRM formatter\nformatter = FormatPRM(format=\"trl\")\nformatter.load()\n\n# Expand the solutions column as it comes from the MathShepherdCompleter\nresult = next(\n expand_columns.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"solutions\": [[\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"], [\"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\", \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +\", \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"]]\n },\n ]\n )\n)\n\nresult = next(formatter.process(result))\n# {\n# \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# \"solutions\": [\n# \"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +\",\n# \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. 
+\",\n# \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +\"\n# ],\n# \"prompt\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# \"completions\": [\n# \"Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required.\",\n# \"Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber.\",\n# \"Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3\"\n# ],\n# \"labels\": [\n# true,\n# true,\n# true\n# ]\n# }\n
"},{"location":"components-gallery/steps/formatprm/#references","title":"References","text":"Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
peiyi9979/Math-Shepherd
Truncate a row using a tokenizer or the number of characters.
TruncateTextColumn
is a Step
that truncates a row according to the max length. If the tokenizer
is provided, then the row will be truncated using the tokenizer, and the max_length
will be used as the maximum number of tokens, otherwise it will be used as the maximum number of characters. The TruncateTextColumn
step is useful when one wants to truncate a row to a certain length, to avoid posterior errors in the model due to the length.
column: the column to truncate. Defaults to \"text\"
.
max_length: the maximum length to use for truncation. If a tokenizer
is given, corresponds to the number of tokens, otherwise corresponds to the number of characters. Defaults to 8192
.
tokenizer: the name of the tokenizer to use. If provided, the row will be truncated using the tokenizer. Defaults to None
.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n subgraph New columns\n OCOL0[dynamic]\n end\n end\n\n subgraph TruncateTextColumn\n StepInput[Input Columns: dynamic]\n StepOutput[Output Columns: dynamic]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/steps/truncatetextcolumn/#inputs","title":"Inputs","text":"column
attribute): The columns to be truncated, defaults to \"text\".column
attribute): The truncated column.from distilabel.steps import TruncateTextColumn\n\ntrunc = TruncateTextColumn(\n tokenizer=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n max_length=4,\n column=\"text\"\n)\n\ntrunc.load()\n\nresult = next(\n trunc.process(\n [\n {\"text\": \"This is a sample text that is longer than 10 characters\"}\n ]\n )\n)\n# result\n# [{'text': 'This is a sample'}]\n
"},{"location":"components-gallery/steps/truncatetextcolumn/#truncating-a-row-to-a-given-number-of-characters","title":"Truncating a row to a given number of characters","text":"from distilabel.steps import TruncateTextColumn\n\ntrunc = TruncateTextColumn(max_length=10)\n\ntrunc.load()\n\nresult = next(\n trunc.process(\n [\n {\"text\": \"This is a sample text that is longer than 10 characters\"}\n ]\n )\n)\n# result\n# [{'text': 'This is a '}]\n
"},{"location":"components-gallery/tasks/","title":"Tasks Gallery","text":"Category Overview The gallery page showcases the different types of components within distilabel
.
APIGenGenerator
Generate queries and answers for the given functions in JSON format.
APIGenGenerator
Genstruct
Generate a pair of instruction-response from a document using an LLM
.
Genstruct
Magpie
Generates conversations using an instruct fine-tuned LLM.
Magpie
MathShepherdCompleter
Math Shepherd Completer and auto-labeller task.
MathShepherdCompleter
MathShepherdGenerator
Math Shepherd solution generator.
MathShepherdGenerator
SelfInstruct
Generate instructions based on a given input using an LLM
.
SelfInstruct
TextGeneration
Text generation with an LLM
given a prompt.
TextGeneration
TextGenerationWithImage
Text generation with images with an LLM
given a prompt.
TextGenerationWithImage
URIAL
Generates a response using a non-instruct fine-tuned model.
URIAL
MagpieGenerator
Generator task the generates instructions or conversations using Magpie.
MagpieGenerator
ChatGeneration
Generates text based on a conversation.
ChatGeneration
ArgillaLabeller
Annotate Argilla records based on input fields, example records and question settings.
ArgillaLabeller
TextClassification
Classifies text into one or more categories or labels.
TextClassification
EvolInstruct
Evolve instructions using an LLM
.
EvolInstruct
EvolComplexity
Evolve instructions to make them more complex using an LLM
.
EvolComplexity
EvolQuality
Evolve the quality of the responses using an LLM
.
EvolQuality
EvolInstructGenerator
Generate evolved instructions using an LLM
.
EvolInstructGenerator
EvolComplexityGenerator
Generate evolved instructions with increased complexity using an LLM
.
EvolComplexityGenerator
InstructionBacktranslation
Self-Alignment with Instruction Backtranslation.
InstructionBacktranslation
PrometheusEval
Critique and rank the quality of generations from an LLM
using Prometheus 2.0.
PrometheusEval
ComplexityScorer
Score instructions based on their complexity using an LLM
.
ComplexityScorer
QualityScorer
Score responses based on their quality using an LLM
.
QualityScorer
CLAIR
Contrastive Learning from AI Revisions (CLAIR).
CLAIR
UltraFeedback
Rank generations focusing on different aspects using an LLM
.
UltraFeedback
PairRM
Rank the candidates based on the input using the LLM
model.
PairRM
GenerateSentencePair
Generate a positive and negative (optionally) sentences given an anchor sentence.
GenerateSentencePair
GenerateEmbeddings
Generate embeddings using the last hidden state of an LLM
.
GenerateEmbeddings
TextClustering
Task that clusters a set of texts and generates summary labels for each cluster.
TextClustering
TextClustering
Task that clusters a set of texts and generates summary labels for each cluster.
TextClustering
APIGenSemanticChecker
Generate queries and answers for the given functions in JSON format.
APIGenSemanticChecker
GenerateTextRetrievalData
Generate text retrieval data with an LLM
to later on train an embedding model.
GenerateTextRetrievalData
GenerateShortTextMatchingData
Generate short text matching data with an LLM
to later on train an embedding model.
GenerateShortTextMatchingData
GenerateLongTextMatchingData
Generate long text matching data with an LLM
to later on train an embedding model.
GenerateLongTextMatchingData
GenerateTextClassificationData
Generate text classification data with an LLM
to later on train an embedding model.
GenerateTextClassificationData
StructuredGeneration
Generate structured content for a given instruction
using an LLM
.
StructuredGeneration
MonolingualTripletGenerator
Generate monolingual triplets with an LLM
to later on train an embedding model.
MonolingualTripletGenerator
BitextRetrievalGenerator
Generate bitext retrieval data with an LLM
to later on train an embedding model.
BitextRetrievalGenerator
EmbeddingTaskGenerator
Generate task descriptions for embedding-related tasks using an LLM
.
EmbeddingTaskGenerator
Generate queries and answers for the given functions in JSON format.
The APIGenGenerator
is inspired by the APIGen pipeline, which was designed to generate verifiable and diverse function-calling datasets. The task generates a set of diverse queries and corresponding answers for the given functions in JSON format.
system_prompt: The system prompt to guide the user in the generation of queries and answers.
use_tools: Whether to use the tools available in the prompt to generate the queries and answers. In case the tools are given in the input, they will be added to the prompt.
number: The number of queries to generate. It can be a list, where each number will be chosen randomly, or a dictionary with the number of queries and the probability of each. I.e: number=1
, number=[1, 2, 3]
, number={1: 0.5, 2: 0.3, 3: 0.2}
are all valid inputs. It corresponds to the number of parallel queries to generate.
use_default_structured_output: Whether to use the default structured output or not.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[examples]\n ICOL1[func_name]\n ICOL2[func_desc]\n ICOL3[tools]\n end\n subgraph New columns\n OCOL0[query]\n OCOL1[answers]\n end\n end\n\n subgraph APIGenGenerator\n StepInput[Input Columns: examples, func_name, func_desc, tools]\n StepOutput[Output Columns: query, answers]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n ICOL3 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/apigengenerator/#inputs","title":"Inputs","text":"examples (str
): Examples used as few shots to guide the model.
func_name (str
): Name for the function to generate.
func_desc (str
): Description of what the function should do.
tools (str
): JSON formatted string containing the tool representation of the function.
query (str
): The list of queries.
answers (str
): JSON formatted string with the list of answers, containing the info as a dictionary to be passed to the functions.
from distilabel.steps.tasks import ApiGenGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n)\napigen = ApiGenGenerator(\n use_default_structured_output=False,\n llm=llm\n)\napigen.load()\n\nres = next(\n apigen.process(\n [\n {\n \"examples\": 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]',\n \"func_name\": \"getrandommovie\",\n \"func_desc\": \"Returns a list of random movies from a database by calling an external API.\"\n }\n ]\n )\n)\nres\n# [{'examples': 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]',\n# 'number': 1,\n# 'func_name': 'getrandommovie',\n# 'func_desc': 'Returns a list of random movies from a database by calling an external API.',\n# 'queries': ['I want to watch a movie tonight, can you recommend a random one from your database?',\n# 'Give me 5 random movie suggestions from your database to plan my weekend.'],\n# 'answers': [[{'name': 'getrandommovie', 'arguments': {}}],\n# [{'name': 'getrandommovie', 'arguments': {}},\n# {'name': 'getrandommovie', 'arguments': {}},\n# {'name': 'getrandommovie', 'arguments': {}},\n# {'name': 'getrandommovie', 'arguments': {}},\n# {'name': 'getrandommovie', 'arguments': {}}]],\n# 'raw_input_api_gen_generator_0': [{'role': 'system',\n# 'content': \"You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\n\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\n\nEnsure the query:\n- Is clear and concise\n- Demonstrates typical use cases\n- Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\n- The corresponding result's parameter types and ranges match with the function's descriptions\n\nEnsure the answer:\n- Is a list of function calls in JSON format\n- The length of the answer list should be equal to the number of requests in the query\n- Can solve all the requests in the query effectively\"},\n# {'role': 'user',\n# 'content': 'Here are examples of queries and the corresponding answers for similar functions:\nQUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]\n\nNote that the query could be interpreted as a combination of several independent requests.\nBased on these examples, generate 2 diverse query and answer pairs for the function `getrandommovie`\nThe detailed function description is the following:\nReturns a list of random movies from a database by calling an external API.\n\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:\n
"},{"location":"components-gallery/tasks/apigengenerator/#generate-with-structured-output","title":"Generate with structured output","text":"from distilabel.steps.tasks import ApiGenGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n)\napigen = ApiGenGenerator(\n use_default_structured_output=True,\n llm=llm\n)\napigen.load()\n\nres_struct = next(\n apigen.process(\n [\n {\n \"examples\": 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]',\n \"func_name\": \"getrandommovie\",\n \"func_desc\": \"Returns a list of random movies from a database by calling an external API.\"\n }\n ]\n )\n)\nres_struct\n# [{'examples': 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]',\n# 'number': 1,\n# 'func_name': 'getrandommovie',\n# 'func_desc': 'Returns a list of random movies from a database by calling an external API.',\n# 'queries': [\"I'm bored and want to watch a movie. Can you suggest some movies?\",\n# \"My family and I are planning a movie night. We can't decide on what to watch. Can you suggest some random movie titles?\"],\n# 'answers': [[{'arguments': {}, 'name': 'getrandommovie'}],\n# [{'arguments': {}, 'name': 'getrandommovie'}]],\n# 'raw_input_api_gen_generator_0': [{'role': 'system',\n# 'content': \"You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\n\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\n\nEnsure the query:\n- Is clear and concise\n- Demonstrates typical use cases\n- Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\n- The corresponding result's parameter types and ranges match with the function's descriptions\n\nEnsure the answer:\n- Is a list of function calls in JSON format\n- The length of the answer list should be equal to the number of requests in the query\n- Can solve all the requests in the query effectively\"},\n# {'role': 'user',\n# 'content': 'Here are examples of queries and the corresponding answers for similar functions:\nQUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{\"name\": \"binary_addition\", \"arguments\": {\"a\": \"10010\", \"b\": \"11101\"}}]\n\nNote that the query could be interpreted as a combination of several independent requests.\nBased on these examples, generate 2 diverse query and answer pairs for the function `getrandommovie`\nThe detailed function description is the following:\nReturns a list of random movies from a database by calling an external API.\n\nNow please generate 2 diverse query and answer pairs following the above format.'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/apigengenerator/#references","title":"References","text":"APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
Salesforce/xlam-function-calling-60k
Generate a pair of instruction-response from a document using an LLM
.
Genstruct
is a pre-defined task designed to generate valid instructions from a given raw document, with the title and the content, enabling the creation of new, partially synthetic instruction finetuning datasets from any raw-text corpus. The task is based on the Genstruct 7B model by Nous Research, which is inspired in the Ada-Instruct paper.
The Genstruct prompt i.e. the task, can be used with any model really, but the safest / recommended option is to use NousResearch/Genstruct-7B
as the LLM provided to the task, since it was trained for this specific task.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[title]\n ICOL1[content]\n end\n subgraph New columns\n OCOL0[user]\n OCOL1[assistant]\n OCOL2[model_name]\n end\n end\n\n subgraph Genstruct\n StepInput[Input Columns: title, content]\n StepOutput[Output Columns: user, assistant, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/genstruct/#inputs","title":"Inputs","text":"title (str
): The title of the document.
content (str
): The content of the document.
user (str
): The user's instruction based on the document.
assistant (str
): The assistant's response based on the user's instruction.
model_name (str
): The model name used to generate the feedback
and result
.
from distilabel.steps.tasks import Genstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\ngenstruct = Genstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"NousResearch/Genstruct-7B\",\n ),\n)\n\ngenstruct.load()\n\nresult = next(\n genstruct.process(\n [\n {\"title\": \"common instruction\", \"content\": \"content of the document\"},\n ]\n )\n)\n# result\n# [\n# {\n# 'title': 'An instruction',\n# 'content': 'content of the document',\n# 'model_name': 'test',\n# 'user': 'An instruction',\n# 'assistant': 'content of the document',\n# }\n# ]\n
"},{"location":"components-gallery/tasks/genstruct/#references","title":"References","text":"Genstruct 7B by Nous Research
Ada-Instruct: Adapting Instruction Generators for Complex Reasoning
Generates conversations using an instruct fine-tuned LLM.
Magpie is a neat method that allows generating user instructions with no seed data or specific system prompt thanks to the autoregressive capabilities of the instruct fine-tuned LLMs. As they were fine-tuned using a chat template composed by a user message and a desired assistant output, the instruct fine-tuned LLM learns that after the pre-query or pre-instruct tokens comes an instruction. If these pre-query tokens are sent to the LLM without any user message, then the LLM will continue generating tokens as if it was the user. This trick allows \"extracting\" instructions from the instruct fine-tuned LLM. After this instruct is generated, it can be sent again to the LLM to generate this time an assistant response. This process can be repeated N times allowing to build a multi-turn conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing'.
"},{"location":"components-gallery/tasks/magpie/#attributes","title":"Attributes","text":"n_turns: the number of turns that the generated conversation will have. Defaults to 1
.
end_with_user: whether the conversation should end with a user message. Defaults to False
.
include_system_prompt: whether to include the system prompt used in the generated conversation. Defaults to False
.
only_instruction: whether to generate only the instruction. If this argument is True
, then n_turns
will be ignored. Defaults to False
.
system_prompt: an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be choosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic. Defaults to None
.
n_turns: the number of turns that the generated conversation will have. Defaults to 1
.
end_with_user: whether the conversation should end with a user message. Defaults to False
.
include_system_prompt: whether to include the system prompt used in the generated conversation. Defaults to False
.
only_instruction: whether to generate only the instruction. If this argument is True
, then n_turns
will be ignored. Defaults to False
.
system_prompt: an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be choosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[system_prompt]\n end\n subgraph New columns\n OCOL0[conversation]\n OCOL1[instruction]\n OCOL2[response]\n OCOL3[system_prompt_key]\n OCOL4[model_name]\n end\n end\n\n subgraph Magpie\n StepInput[Input Columns: system_prompt]\n StepOutput[Output Columns: conversation, instruction, response, system_prompt_key, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n StepOutput --> OCOL4\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/magpie/#inputs","title":"Inputs","text":"str
, optional): an optional system prompt that can be provided to guide the generation of the instruct LLM and steer it to generate instructions of certain topic.conversation (ChatType
): the generated conversation which is a list of chat items with a role and a message. Only if only_instruction=False
.
instruction (str
): the generated instructions if only_instruction=True
or n_turns==1
.
response (str
): the generated response if n_turns==1
.
system_prompt_key (str
, optional): the key of the system prompt used to generate the conversation or instruction. Only if system_prompt
is a dictionary.
model_name (str
): The model name used to generate the conversation
or instruction
.
from distilabel.models import TransformersLLM\nfrom distilabel.steps.tasks import Magpie\n\nmagpie = Magpie(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 64,\n },\n device=\"mps\",\n ),\n only_instruction=True,\n)\n\nmagpie.load()\n\nresult = next(\n magpie.process(\n inputs=[\n {\n \"system_prompt\": \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"\n },\n {\n \"system_prompt\": \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"\n },\n ]\n )\n)\n# [\n# {'instruction': \"That's me! I'd love some help with solving calculus problems! What kind of calculation are you most effective at? Linear Algebra, derivatives, integrals, optimization?\"},\n# {'instruction': 'I was wondering if there are certain flowers and plants that can be used for pest control?'}\n# ]\n
"},{"location":"components-gallery/tasks/magpie/#generating-conversations-with-llama-3-8b-instruct-and-transformersllm","title":"Generating conversations with Llama 3 8B Instruct and TransformersLLM","text":"from distilabel.models import TransformersLLM\nfrom distilabel.steps.tasks import Magpie\n\nmagpie = Magpie(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 256,\n },\n device=\"mps\",\n ),\n n_turns=2,\n)\n\nmagpie.load()\n\nresult = next(\n magpie.process(\n inputs=[\n {\n \"system_prompt\": \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"\n },\n {\n \"system_prompt\": \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"\n },\n ]\n )\n)\n# [\n# {\n# 'conversation': [\n# {'role': 'system', 'content': \"You're a math expert AI assistant that helps students of secondary school to solve calculus problems.\"},\n# {\n# 'role': 'user',\n# 'content': 'I'm having trouble solving the limits of functions in calculus. Could you explain how to work with them? Limits of functions are denoted by lim x\u2192a f(x) or lim x\u2192a [f(x)]. It is read as \"the limit as x approaches a of f\n# of x\".'\n# },\n# {\n# 'role': 'assistant',\n# 'content': 'Limits are indeed a fundamental concept in calculus, and understanding them can be a bit tricky at first, but don't worry, I'm here to help! The notation lim x\u2192a f(x) indeed means \"the limit as x approaches a of f of\n# x\". What it's asking us to do is find the'\n# }\n# ]\n# },\n# {\n# 'conversation': [\n# {'role': 'system', 'content': \"You're an expert florist AI assistant that helps user to erradicate pests in their crops.\"},\n# {\n# 'role': 'user',\n# 'content': \"As a flower shop owner, I'm noticing some unusual worm-like creatures causing damage to my roses and other flowers. Can you help me identify what the problem is? Based on your expertise as a florist AI assistant, I think it\n# might be pests or diseases, but I'm not sure which.\"\n# },\n# {\n# 'role': 'assistant',\n# 'content': \"I'd be delighted to help you investigate the issue! Since you've noticed worm-like creatures damaging your roses and other flowers, I'll take a closer look at the possibilities. Here are a few potential culprits: 1.\n# **Aphids**: These small, soft-bodied insects can secrete a sticky substance called\"\n# }\n# ]\n# }\n# ]\n
"},{"location":"components-gallery/tasks/magpie/#references","title":"References","text":"Math Shepherd Completer and auto-labeller task.
This task is in charge of, given a list of solutions to an instruction, and a golden solution, as reference, generate completions for the solutions, and label them according to the golden solution using the hard estimation method from figure 2 in the reference paper, Eq. 3. The attributes make the task flexible to be used with different types of dataset and LLMs, and allow making use of different fields to modify the system and user prompts for it. Before modifying them, review the current defaults to ensure the completions are generated correctly.
"},{"location":"components-gallery/tasks/mathshepherdcompleter/#attributes","title":"Attributes","text":"system_prompt: The system prompt to be used in the completions. The default one has been checked and generates good completions using Llama 3.1 with 8B and 70B, but it can be modified to adapt it to the model and dataset selected.
extra_rules: This field can be used to insert extra rules relevant to the type of dataset. For example, in the original paper they used GSM8K and MATH datasets, and this field can be used to insert the rules for the GSM8K dataset.
few_shots: Few shots to help the model generating the completions, write them in the format of the type of solutions wanted for your dataset.
N: Number of completions to generate for each step, correspond to N in the paper. They used 8 in the paper, but it can be adjusted.
tags: List of tags to be used in the completions, the default ones are [\"+\", \"-\"] as in the paper, where the first is used as a positive label, and the second as a negative one. This can be updated, but it MUST be a list with 2 elements, where the first is the positive one, and the second the negative one.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[solutions]\n ICOL2[golden_solution]\n end\n subgraph New columns\n OCOL0[solutions]\n OCOL1[model_name]\n end\n end\n\n subgraph MathShepherdCompleter\n StepInput[Input Columns: instruction, solutions, golden_solution]\n StepOutput[Output Columns: solutions, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/mathshepherdcompleter/#inputs","title":"Inputs","text":"instruction (str
): The task or instruction.
solutions (List[str]
): List of solutions to the task.
golden_solution (str
): The reference solution to the task, will be used to annotate the candidate solutions.
solutions (List[str]
): The same columns that were used as input, the \"solutions\" is modified.
model_name (str
): The name of the model used to generate the revision.
from distilabel.steps.tasks import MathShepherdCompleter\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n)\ntask = MathShepherdCompleter(\n llm=llm,\n N=3,\n use_default_structured_output=True\n)\n\ntask.load()\n\nresult = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"golden_solution\": [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n \"solutions\": [\n [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],\n ]\n },\n ]\n )\n)\n# [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# 'golden_solution': [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"],\n# 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n
"},{"location":"components-gallery/tasks/mathshepherdcompleter/#annotate-your-steps-with-the-math-shepherd-completer","title":"Annotate your steps with the Math Shepherd Completer","text":"from distilabel.steps.tasks import MathShepherdCompleter\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n)\ntask = MathShepherdCompleter(\n llm=llm,\n N=3\n)\n\ntask.load()\n\nresult = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n \"golden_solution\": [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n \"solutions\": [\n [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.\", \"The answer is: 18\"],\n ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],\n ]\n },\n ]\n )\n)\n# [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# 'golden_solution': [\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"],\n# 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n
"},{"location":"components-gallery/tasks/mathshepherdcompleter/#references","title":"References","text":"Math Shepherd solution generator.
This task is in charge of generating completions for a given instruction, in the format expected by the Math Shepherd Completer task. The attributes make the task flexible to be used with different types of dataset and LLMs, but we provide examples for the GSM8K and MATH datasets as presented in the original paper. Before modifying them, review the current defaults to ensure the completions are generated correctly. This task can be used to generate the golden solutions for a given problem if not provided, as well as possible solutions to be then labeled by the Math Shepherd Completer. Only one of solutions
or golden_solution
will be generated, depending on the value of M.
system_prompt: The system prompt to be used in the completions. The default one has been checked and generates good completions using Llama 3.1 with 8B and 70B, but it can be modified to adapt it to the model and dataset selected. Take into account that the system prompt includes 2 variables in the Jinja2 template, {{extra_rules}} and {{few_shot}}. These variables are used to include extra rules, for example to steer the model towards a specific type of responses, and few shots to add examples. They can be modified to adapt the system prompt to the dataset and model used without needing to change the full system prompt.
extra_rules: This field can be used to insert extra rules relevant to the type of dataset. For example, in the original paper they used GSM8K and MATH datasets, and this field can be used to insert the rules for the GSM8K dataset.
few_shots: Few shots to help the model generating the completions, write them in the format of the type of solutions wanted for your dataset.
M: Number of completions to generate for each step. By default is set to 1, which will generate the \"golden_solution\". In this case select a stronger model, as it will be used as the source of true during labelling. If M is set to a number greater than 1, the task will generate a list of completions to be labeled by the Math Shepherd Completer task.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n end\n subgraph New columns\n OCOL0[golden_solution]\n OCOL1[solutions]\n OCOL2[model_name]\n end\n end\n\n subgraph MathShepherdGenerator\n StepInput[Input Columns: instruction]\n StepOutput[Output Columns: golden_solution, solutions, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/mathshepherdgenerator/#inputs","title":"Inputs","text":"str
): The task or instruction.golden_solution (str
): The step by step solution to the instruction. It will be generated if M is equal to 1.
solutions (List[List[str]]
): A list of possible solutions to the instruction. It will be generated if M is greater than 1.
model_name (str
): The name of the model used to generate the revision.
from distilabel.steps.tasks import MathShepherdGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.6,\n \"max_new_tokens\": 1024,\n },\n)\ntask = MathShepherdGenerator(\n name=\"golden_solution_generator\",\n llm=llm,\n)\n\ntask.load()\n\nresult = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n },\n ]\n )\n)\n# [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# 'golden_solution': '[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"]'}]]\n
"},{"location":"components-gallery/tasks/mathshepherdgenerator/#generate-m-completions-for-a-given-instruction-using-structured-output-generation","title":"Generate M completions for a given instruction (using structured output generation)","text":"from distilabel.steps.tasks import MathShepherdGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 2048,\n },\n)\ntask = MathShepherdGenerator(\n name=\"solution_generator\",\n llm=llm,\n M=2,\n use_default_structured_output=True,\n)\n\ntask.load()\n\nresult = next(\n task.process(\n [\n {\n \"instruction\": \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n },\n ]\n )\n)\n# [[{'instruction': \"Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\",\n# 'solutions': [[\"Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -\", \"Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.\", \"The answer is: 18\"], [\"Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +\", \"Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +\", \"Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.\", \"The answer is: 18\"]]}]]\n
"},{"location":"components-gallery/tasks/mathshepherdgenerator/#references","title":"References","text":"Generate instructions based on a given input using an LLM
.
SelfInstruct
is a pre-defined task that, given a number of instructions, a certain criteria for query generations, an application description, and an input, generates a number of instruction related to the given input and following what is stated in the criteria for query generation and the application description. It is based in the SelfInstruct framework from the paper \"Self-Instruct: Aligning Language Models with Self-Generated Instructions\".
num_instructions: The number of instructions to be generated. Defaults to 5.
criteria_for_query_generation: The criteria for the query generation. Defaults to the criteria defined within the paper.
application_description: The description of the AI application that one want to build with these instructions. Defaults to AI assistant
.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[input]\n end\n subgraph New columns\n OCOL0[instructions]\n OCOL1[model_name]\n end\n end\n\n subgraph SelfInstruct\n StepInput[Input Columns: input]\n StepOutput[Output Columns: instructions, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/selfinstruct/#inputs","title":"Inputs","text":"str
): The input to generate the instructions. It's also called seed in the paper.instructions (List[str]
): The generated instructions.
model_name (str
): The model name used to generate the instructions.
from distilabel.steps.tasks import SelfInstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\nself_instruct = SelfInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_instructions=5, # This is the default value\n)\n\nself_instruct.load()\n\nresult = next(self_instruct.process([{\"input\": \"instruction\"}]))\n# result\n# [\n# {\n# 'input': 'instruction',\n# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n# 'instructions': [\"instruction 1\", \"instruction 2\", \"instruction 3\", \"instruction 4\", \"instruction 5\"],\n# }\n# ]\n
"},{"location":"components-gallery/tasks/textgeneration/","title":"TextGeneration","text":"Text generation with an LLM
given a prompt.
TextGeneration
is a pre-defined task that allows passing a custom prompt using the Jinja2 syntax. By default, an instruction
is expected in the inputs, but using the template
and columns
attributes one can define a custom prompt and columns expected from the text. This task should be good enough for tasks that don't need post-processing of the responses generated by the LLM.
system_prompt: The system prompt to use in the generation. If not provided, then it will check if the input row has a column named system_prompt
and use it. If not, then no system prompt will be used. Defaults to None
.
template: The template to use for the generation. It must follow the Jinja2 template syntax. If not provided, it will assume the text passed is an instruction and construct the appropriate template.
columns: A string with the column, or a list with columns expected in the template. Take a look at the examples for more information. Defaults to instruction
.
use_system_prompt: DEPRECATED. To be removed in 1.5.0. Whether to use the system prompt in the generation. Defaults to True
, which means that if the column system_prompt
is defined within the input batch, then the system_prompt
will be used, otherwise, it will be ignored.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n subgraph New columns\n OCOL0[generation]\n OCOL1[model_name]\n end\n end\n\n subgraph TextGeneration\n StepInput[Input Columns: dynamic]\n StepOutput[Output Columns: generation, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/textgeneration/#inputs","title":"Inputs","text":"columns
attribute): By default will be set to instruction
. The columns can point both to a str
or a List[str]
to be used in the template.generation (str
): The generated text.
model_name (str
): The name of the model used to generate the text.
from distilabel.steps.tasks import TextGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\ntext_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n )\n)\n\ntext_gen.load()\n\nresult = next(\n text_gen.process(\n [{\"instruction\": \"your instruction\"}]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'your instruction',\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n# 'generation': 'generation',\n# }\n# ]\n
"},{"location":"components-gallery/tasks/textgeneration/#use-a-custom-template-to-generate-text","title":"Use a custom template to generate text","text":"from distilabel.steps.tasks import TextGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nCUSTOM_TEMPLATE = '''Document:\n{{ document }}\n\nQuestion: {{ question }}\n\nPlease provide a clear and concise answer to the question based on the information in the document and your general knowledge:\n'''.rstrip()\n\ntext_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n system_prompt=\"You are a helpful AI assistant. Your task is to answer the following question based on the provided document. If the answer is not explicitly stated in the document, use your knowledge to provide the most relevant and accurate answer possible. If you cannot answer the question based on the given information, state that clearly.\",\n template=CUSTOM_TEMPLATE,\n columns=[\"document\", \"question\"],\n)\n\ntext_gen.load()\n\nresult = next(\n text_gen.process(\n [\n {\n \"document\": \"The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.\",\n \"question\": \"What is the main threat to the Great Barrier Reef mentioned in the document?\"\n }\n ]\n )\n)\n# result\n# [\n# {\n# 'document': 'The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.',\n# 'question': 'What is the main threat to the Great Barrier Reef mentioned in the document?',\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n# 'generation': 'According to the document, the main threat to the Great Barrier Reef is climate change, specifically rising sea temperatures causing coral bleaching events.',\n# }\n# ]\n
"},{"location":"components-gallery/tasks/textgeneration/#few-shot-learning-with-different-system-prompts","title":"Few shot learning with different system prompts","text":"from distilabel.steps.tasks import TextGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nCUSTOM_TEMPLATE = '''Generate a clear, single-sentence instruction based on the following examples:\n\n{% for example in examples %}\nExample {{ loop.index }}:\nInstruction: {{ example }}\n\n{% endfor %}\nNow, generate a new instruction in a similar style:\n'''.rstrip()\n\ntext_gen = TextGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n template=CUSTOM_TEMPLATE,\n columns=\"examples\",\n)\n\ntext_gen.load()\n\nresult = next(\n text_gen.process(\n [\n {\n \"examples\": [\"This is an example\", \"Another relevant example\"],\n \"system_prompt\": \"You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.\"\n }\n ]\n )\n)\n# result\n# [\n# {\n# 'examples': ['This is an example', 'Another relevant example'],\n# 'system_prompt': 'You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.',\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',\n# 'generation': 'Disable the firewall on the router',\n# }\n# ]\n
"},{"location":"components-gallery/tasks/textgeneration/#references","title":"References","text":"Text generation with images with an LLM
given a prompt.
TextGenerationWithImage
is a pre-defined task that allows passing a custom prompt using the Jinja2 syntax. By default, an instruction
is expected in the inputs, but using the template
and columns
attributes one can define a custom prompt and columns expected from the text. Additionally, an image
column is expected, containing either a URL, a base64-encoded image, or a PIL image. This task inherits from TextGeneration
, so all the functionality available in that task related to the prompt will be available here too.
system_prompt: The system prompt to use in the generation. If not provided, no system prompt will be used. Defaults to None
.
template: The template to use for the generation. It must follow the Jinja2 template syntax. If not provided, it will assume the text passed is an instruction and construct the appropriate template.
columns: A string with the column, or a list with columns expected in the template. Take a look at the examples for more information. Defaults to instruction
.
image_type: The type of the image provided, this will be used to preprocess if necessary. Must be one of \"url\", \"base64\" or \"PIL\".
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[dynamic]\n end\n subgraph New columns\n OCOL0[generation]\n OCOL1[model_name]\n end\n end\n\n subgraph TextGenerationWithImage\n StepInput[Input Columns: dynamic]\n StepOutput[Output Columns: generation, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/textgenerationwithimage/#inputs","title":"Inputs","text":"columns
attribute): By default will be set to instruction
. The columns can point both to a str
or a list[str]
to be used in the template.generation (str
): The generated text.
model_name (str
): The name of the model used to generate the text.
from distilabel.steps.tasks import TextGenerationWithImage\nfrom distilabel.models.llms import InferenceEndpointsLLM\n\nvision = TextGenerationWithImage(\n name=\"vision_gen\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n ),\n image_type=\"url\"\n)\n\nvision.load()\n\nresult = next(\n vision.process(\n [\n {\n \"instruction\": \"What\u2019s in this image?\",\n \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n }\n ]\n )\n)\n# result\n# [\n# {\n# \"instruction\": \"What\u2019s in this image?\",\n# \"image\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\",\n# \"generation\": \"Based on the visual cues in the image...\",\n# \"model_name\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\"\n# ... # distilabel_metadata would be here\n# }\n# ]\n# result[0][\"generation\"]\n# \"Based on the visual cues in the image, here are some possible story points:\n\n* The image features a wooden boardwalk leading through a lush grass field, possibly in a park or nature reserve.\n\nAnalysis and Ideas:\n* The abundance of green grass and trees suggests a healthy ecosystem or habitat.\n* The presence of wildlife, such as birds or deer, is possible based on the surroundings.\n* A footbridge or a pathway might be a common feature in this area, providing access to nearby attractions or points of interest.\n\nAdditional Questions to Ask:\n* Why is a footbridge present in this area?\n* What kind of wildlife inhabits this region\"\n
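Similarly, if the image is already loaded as a PIL object, for example when reading files from a local dataset, the same setup can presumably be used with image_type=\"PIL\". The following is a minimal sketch under that assumption, where image.jpg is a hypothetical local file.
# Minimal sketch: passing a PIL image instead of a URL or a base64 string.\nfrom PIL import Image\n\nfrom distilabel.steps.tasks import TextGenerationWithImage\nfrom distilabel.models.llms import InferenceEndpointsLLM\n\nimage = Image.open(\"image.jpg\")  # hypothetical local file\n\nvision = TextGenerationWithImage(\n    name=\"vision_gen\",\n    llm=InferenceEndpointsLLM(\n        model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n    ),\n    image_type=\"PIL\"\n)\n\nvision.load()\n\nresult = next(\n    vision.process(\n        [\n            {\n                \"instruction\": \"What's in this image?\",\n                \"image\": image\n            }\n        ]\n    )\n)\n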
"},{"location":"components-gallery/tasks/textgenerationwithimage/#answer-questions-from-an-image-stored-as-base64","title":"Answer questions from an image stored as base64","text":"# For this example we will assume that we have the string representation of the image\n# stored, but will just take the image and transform it to base64 to ilustrate the example.\nimport requests\nimport base64\n\nimage_url =\"https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg\"\nimg = requests.get(image_url).content\nbase64_image = base64.b64encode(img).decode(\"utf-8\")\n\nfrom distilabel.steps.tasks import TextGenerationWithImage\nfrom distilabel.models.llms import InferenceEndpointsLLM\n\nvision = TextGenerationWithImage(\n name=\"vision_gen\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n ),\n image_type=\"base64\"\n)\n\nvision.load()\n\nresult = next(\n vision.process(\n [\n {\n \"instruction\": \"What\u2019s in this image?\",\n \"image\": base64_image\n }\n ]\n )\n)\n
"},{"location":"components-gallery/tasks/textgenerationwithimage/#references","title":"References","text":"Jinja2 Template Designer Documentation
Image-Text-to-Text
OpenAI Vision
Generates a response using a non-instruct fine-tuned model.
URIAL
is a pre-defined task that generates a response using a non-instruct fine-tuned model. This task is used to generate a response based on the conversation provided as input.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[conversation]\n end\n subgraph New columns\n OCOL0[generation]\n OCOL1[model_name]\n end\n end\n\n subgraph URIAL\n StepInput[Input Columns: instruction, conversation]\n StepOutput[Output Columns: generation, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/urial/#inputs","title":"Inputs","text":"instruction (str
, optional): The instruction to generate a response from.
conversation (List[Dict[str, str]]
, optional): The conversation to generate a response from (the last message must be from the user).
generation (str
): The generated response.
model_name (str
): The name of the model used to generate the response.
from distilabel.models import vLLM\nfrom distilabel.steps.tasks import URIAL\n\nstep = URIAL(\n llm=vLLM(\n model=\"meta-llama/Meta-Llama-3.1-8B\",\n generation_kwargs={\"temperature\": 0.7},\n ),\n)\n\nstep.load()\n\nresults = next(\n step.process(inputs=[{\"instruction\": \"What's the most most common type of cloud?\"}])\n)\n# [\n# {\n# 'instruction': \"What's the most most common type of cloud?\",\n# 'generation': 'Clouds are classified into three main types, high, middle, and low. The most common type of cloud is the middle cloud.',\n# 'distilabel_metadata': {...},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-8B'\n# }\n# ]\n
"},{"location":"components-gallery/tasks/urial/#references","title":"References","text":"Generator task the generates instructions or conversations using Magpie.
Magpie is a neat method that allows generating user instructions with no seed data or specific system prompt thanks to the autoregressive capabilities of the instruct fine-tuned LLMs. As they were fine-tuned using a chat template composed of a user message and a desired assistant output, the instruct fine-tuned LLM learns that after the pre-query or pre-instruct tokens comes an instruction. If these pre-query tokens are sent to the LLM without any user message, then the LLM will continue generating tokens as if it were the user. This trick allows \"extracting\" instructions from the instruct fine-tuned LLM. After this instruction is generated, it can be sent again to the LLM to generate an assistant response. This process can be repeated N times, allowing the construction of a multi-turn conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing'.
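The snippet below is a minimal, illustrative sketch of the pre-query trick itself, outside of distilabel, assuming access to meta-llama/Meta-Llama-3-8B-Instruct through the transformers library; within distilabel the same mechanism is configured via the magpie_pre_query_template attribute of the LLM, as shown in the examples further down.
# Illustrative sketch of the Magpie pre-query trick (not part of the distilabel API).\nfrom transformers import pipeline\n\ngenerator = pipeline(\"text-generation\", model=\"meta-llama/Meta-Llama-3-8B-Instruct\")\n\n# The prompt ends exactly where a user message would start, so the instruct-tuned model\n# keeps writing as if it were the user, producing a synthetic instruction.\npre_query = \"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\\n\\n\"\noutput = generator(pre_query, max_new_tokens=64, return_full_text=False)\nprint(output[0][\"generated_text\"])  # e.g. a synthetic user instruction\n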
"},{"location":"components-gallery/tasks/magpiegenerator/#attributes","title":"Attributes","text":"n_turns: the number of turns that the generated conversation will have. Defaults to 1
.
end_with_user: whether the conversation should end with a user message. Defaults to False
.
include_system_prompt: whether to include the system prompt used in the generated conversation. Defaults to False
.
only_instruction: whether to generate only the instruction. If this argument is True
, then n_turns
will be ignored. Defaults to False
.
system_prompt: an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be chosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic. Defaults to None
.
num_rows: the number of rows to be generated.
n_turns: the number of turns that the generated conversation will have. Defaults to 1
.
end_with_user: whether the conversation should end with a user message. Defaults to False
.
include_system_prompt: whether to include the system prompt used in the generated conversation. Defaults to False
.
only_instruction: whether to generate only the instruction. If this argument is True
, then n_turns
will be ignored. Defaults to False
.
system_prompt: an optional system prompt, or a list of system prompts from which a random one will be chosen, or a dictionary of system prompts from which a random one will be chosen, or a dictionary of system prompts with their probability of being chosen. The random system prompt will be chosen per input/output batch. This system prompt can be used to guide the generation of the instruct LLM and steer it to generate instructions of a certain topic.
num_rows: the number of rows to be generated.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[conversation]\n OCOL1[instruction]\n OCOL2[response]\n OCOL3[system_prompt_key]\n OCOL4[model_name]\n end\n end\n\n subgraph MagpieGenerator\n StepOutput[Output Columns: conversation, instruction, response, system_prompt_key, model_name]\n end\n\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n StepOutput --> OCOL4\n
"},{"location":"components-gallery/tasks/magpiegenerator/#outputs","title":"Outputs","text":"conversation (ChatType
): the generated conversation which is a list of chat items with a role and a message.
instruction (str
): the generated instructions if only_instruction=True
.
response (str
): the generated response if n_turns==1
.
system_prompt_key (str
, optional): the key of the system prompt used to generate the conversation or instruction. Only if system_prompt
is a dictionary.
model_name (str
): The model name used to generate the conversation
or instruction
.
from distilabel.models import TransformersLLM\nfrom distilabel.steps.tasks import MagpieGenerator\n\ngenerator = MagpieGenerator(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 256,\n },\n device=\"mps\",\n ),\n only_instruction=True,\n num_rows=5,\n)\n\ngenerator.load()\n\nresult = next(generator.process())\n# (\n# [\n# {\"instruction\": \"I've just bought a new phone and I're excited to start using it.\"},\n# {\"instruction\": \"What are the most common types of companies that use digital signage?\"}\n# ],\n# True\n# )\n
"},{"location":"components-gallery/tasks/magpiegenerator/#generating-a-conversation-with-llama-3-8b-instruct-and-transformersllm","title":"Generating a conversation with Llama 3 8B Instruct and TransformersLLM","text":"from distilabel.models import TransformersLLM\nfrom distilabel.steps.tasks import MagpieGenerator\n\ngenerator = MagpieGenerator(\n llm=TransformersLLM(\n model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 1.0,\n \"max_new_tokens\": 64,\n },\n device=\"mps\",\n ),\n n_turns=3,\n num_rows=5,\n)\n\ngenerator.load()\n\nresult = next(generator.process())\n# (\n# [\n# {\n# 'conversation': [\n# {\n# 'role': 'system',\n# 'content': 'You are a helpful Al assistant. The user will engage in a multi\u2212round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and\n# insightful responses to help the user with their queries.'\n# },\n# {'role': 'user', 'content': \"I'm considering starting a social media campaign for my small business and I're not sure where to start. Can you help?\"},\n# {\n# 'role': 'assistant',\n# 'content': \"Exciting endeavor! Creating a social media campaign can be a great way to increase brand awareness, drive website traffic, and ultimately boost sales. I'd be happy to guide you through the process. To get started,\n# let's break down the basics. First, we need to identify your goals and target audience. What do\"\n# },\n# {\n# 'role': 'user',\n# 'content': \"Before I start a social media campaign, what kind of costs ammol should I expect to pay? There are several factors that contribute to the total cost of running a social media campaign. Let me outline some of the main\n# expenses you might encounter: 1. Time: As the business owner, you'll likely spend time creating\"\n# },\n# {\n# 'role': 'assistant',\n# 'content': 'Time is indeed one of the biggest investments when it comes to running a social media campaign! Besides time, you may also incur costs associated with: 2. Content creation: You might need to hire freelancers or\n# agencies to create high-quality content (images, videos, captions) for your social media platforms. 3. Advertising'\n# }\n# ]\n# },\n# {\n# 'conversation': [\n# {\n# 'role': 'system',\n# 'content': 'You are a helpful Al assistant. The user will engage in a multi\u2212round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and\n# insightful responses to help the user with their queries.'\n# },\n# {'role': 'user', 'content': \"I am thinking of buying a new laptop or computer. What are some important factors I should consider when making your decision? I'll make sure to let you know if any other favorites or needs come up!\"},\n# {\n# 'role': 'assistant',\n# 'content': 'Exciting times ahead! When considering a new laptop or computer, there are several key factors to think about to ensure you find the right one for your needs. Here are some crucial ones to get you started: 1.\n# **Purpose**: How will you use your laptop or computer? For work, gaming, video editing,'\n# },\n# {\n# 'role': 'user',\n# 'content': 'Let me stop you there. Let's explore this \"purpose\" factor that you mentioned earlier. Can you elaborate more on what type of devices would be suitable for different purposes? 
For example, if I're primarily using my\n# laptop for general usage like browsing, email, and word processing, would a budget-friendly laptop be sufficient'\n# },\n# {\n# 'role': 'assistant',\n# 'content': \"Understanding your purpose can greatly impact the type of device you'll need. **General Usage (Browsing, Email, Word Processing)**: For casual users who mainly use their laptop for daily tasks, a budget-friendly\n# option can be sufficient. Look for laptops with: * Intel Core i3 or i5 processor* \"\n# }\n# ]\n# }\n# ],\n# True\n# )\n
"},{"location":"components-gallery/tasks/magpiegenerator/#generating-with-system-prompts-with-probabilities","title":"Generating with system prompts with probabilities","text":"from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import MagpieGenerator\n\nmagpie = MagpieGenerator(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n magpie_pre_query_template=\"llama3\",\n generation_kwargs={\n \"temperature\": 0.8,\n \"max_new_tokens\": 256,\n },\n ),\n n_turns=2,\n system_prompt={\n \"math\": (\"You're an expert AI assistant.\", 0.8),\n \"writing\": (\"You're an expert writing assistant.\", 0.2),\n },\n)\n\nmagpie.load()\n\nresult = next(magpie.process())\n
"},{"location":"components-gallery/tasks/magpiegenerator/#references","title":"References","text":"Generates text based on a conversation.
ChatGeneration
is a pre-defined task that defines the messages
as the input and generation
as the output. This task is used to generate text based on a conversation. The model_name
is also returned as part of the output in order to enhance it.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[messages]\n end\n subgraph New columns\n OCOL0[generation]\n OCOL1[model_name]\n end\n end\n\n subgraph ChatGeneration\n StepInput[Input Columns: messages]\n StepOutput[Output Columns: generation, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/chatgeneration/#inputs","title":"Inputs","text":"List[Dict[Literal[\"role\", \"content\"], str]]
): The messages to generate the follow up completion from.generation (str
): The generated text from the assistant.
model_name (str
): The model name used to generate the text.
from distilabel.steps.tasks import ChatGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nchat = ChatGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n)\n\nchat.load()\n\nresult = next(\n chat.process(\n [\n {\n \"messages\": [\n {\"role\": \"user\", \"content\": \"How much is 2+2?\"},\n ]\n }\n ]\n )\n)\n# result\n# [\n# {\n# 'messages': [{'role': 'user', 'content': 'How much is 2+2?'}],\n# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n# 'generation': '4',\n# }\n# ]\n
"},{"location":"components-gallery/tasks/argillalabeller/","title":"ArgillaLabeller","text":"Annotate Argilla records based on input fields, example records and question settings.
This task is designed to facilitate the annotation of Argilla records by leveraging a pre-trained LLM. It uses a system prompt that guides the LLM to understand the input fields, the question type, and the question settings. The task then formats the input data and generates a response based on the question. The response is validated against the question's value model, and the final suggestion is prepared for annotation.
"},{"location":"components-gallery/tasks/argillalabeller/#attributes","title":"Attributes","text":"graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[record]\n ICOL1[fields]\n ICOL2[question]\n ICOL3[example_records]\n ICOL4[guidelines]\n end\n subgraph New columns\n OCOL0[suggestion]\n end\n end\n\n subgraph ArgillaLabeller\n StepInput[Input Columns: record, fields, question, example_records, guidelines]\n StepOutput[Output Columns: suggestion]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n ICOL3 --> StepInput\n ICOL4 --> StepInput\n StepOutput --> OCOL0\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/argillalabeller/#inputs","title":"Inputs","text":"record (argilla.Record
): The record to be annotated.
fields (Optional[List[Dict[str, Any]]]
): The list of field settings for the input fields.
question (Optional[Dict[str, Any]]
): The question settings for the question to be answered.
example_records (Optional[List[Dict[str, Any]]]
): The few shot example records with responses to be used to answer the question.
guidelines (Optional[str]
): The guidelines for the annotation task.
Dict[str, Any]
): The final suggestion for annotation.import argilla as rg\nfrom argilla import Suggestion\nfrom distilabel.steps.tasks import ArgillaLabeller\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Get information from Argilla dataset definition\ndataset = rg.Dataset(\"my_dataset\")\npending_records_filter = rg.Filter((\"status\", \"==\", \"pending\"))\ncompleted_records_filter = rg.Filter((\"status\", \"==\", \"completed\"))\npending_records = list(\n dataset.records(\n query=rg.Query(filter=pending_records_filter),\n limit=5,\n )\n)\nexample_records = list(\n dataset.records(\n query=rg.Query(filter=completed_records_filter),\n limit=5,\n )\n)\nfield = dataset.settings.fields[\"text\"]\nquestion = dataset.settings.questions[\"label\"]\n\n# Initialize the labeller with the model and fields\nlabeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n fields=[field],\n question=question,\n example_records=example_records,\n guidelines=dataset.guidelines\n)\nlabeller.load()\n\n# Process the pending records\nresult = next(\n labeller.process(\n [\n {\n \"record\": record\n } for record in pending_records\n ]\n )\n)\n\n# Add the suggestions to the records\nfor record, suggestion in zip(pending_records, result):\n record.suggestions.add(Suggestion(**suggestion[\"suggestion\"]))\n\n# Log the updated records\ndataset.records.log(pending_records)\n
"},{"location":"components-gallery/tasks/argillalabeller/#annotate-a-record-with-alternating-datasets-and-questions","title":"Annotate a record with alternating datasets and questions","text":"import argilla as rg\nfrom distilabel.steps.tasks import ArgillaLabeller\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Get information from Argilla dataset definition\ndataset = rg.Dataset(\"my_dataset\")\nfield = dataset.settings.fields[\"text\"]\nquestion = dataset.settings.questions[\"label\"]\nquestion2 = dataset.settings.questions[\"label2\"]\n\n# Initialize the labeller with the model and fields\nlabeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n)\nlabeller.load()\n\n# Process the record\nrecord = next(dataset.records())\nresult = next(\n labeller.process(\n [\n {\n \"record\": record,\n \"fields\": [field],\n \"question\": question,\n },\n {\n \"record\": record,\n \"fields\": [field],\n \"question\": question2,\n }\n ]\n )\n)\n\n# Add the suggestions to the record\nfor suggestion in result:\n record.suggestions.add(rg.Suggestion(**suggestion[\"suggestion\"]))\n\n# Log the updated record\ndataset.records.log([record])\n
"},{"location":"components-gallery/tasks/argillalabeller/#overwrite-default-prompts-and-instructions","title":"Overwrite default prompts and instructions","text":"import argilla as rg\nfrom distilabel.steps.tasks import ArgillaLabeller\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Overwrite default prompts and instructions\nlabeller = ArgillaLabeller(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n system_prompt=\"You are an expert annotator and labelling assistant that understands complex domains and natural language processing.\",\n question_to_label_instruction={\n \"label_selection\": \"Select the appropriate label from the list of provided labels.\",\n \"multi_label_selection\": \"Select none, one or multiple labels from the list of provided labels.\",\n \"text\": \"Provide a text response to the question.\",\n \"rating\": \"Provide a rating for the question.\",\n },\n)\nlabeller.load()\n
"},{"location":"components-gallery/tasks/argillalabeller/#references","title":"References","text":"Classifies text into one or more categories or labels.
This task can be used for text classification problems, where the goal is to assign one or multiple labels to a given text. It uses structured generation as per the reference paper by default, it can help to generate more concise labels. See section 4.1 in the reference.
"},{"location":"components-gallery/tasks/textclassification/#attributes","title":"Attributes","text":"system_prompt: A prompt to display to the user before the task starts. Contains a default message to make the model behave like a classifier specialist.
n: Number of labels to generate If only 1 is required, corresponds to a label classification problem, if >1 it will intend return the \"n\" labels most representative for the text. Defaults to 1.
context: Context to use when generating the labels. By default contains a generic message, but can be used to customize the context for the task.
examples: List of examples to help the model understand the task, few shots.
available_labels: List of available labels to choose from when classifying the text, or a dictionary with the labels and their descriptions.
default_label: Default label to use when the text is ambiguous or lacks sufficient information for classification. Can be a list in case of multiple labels (n>1).
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[text]\n end\n subgraph New columns\n OCOL0[labels]\n OCOL1[model_name]\n end\n end\n\n subgraph TextClassification\n StepInput[Input Columns: text]\n StepOutput[Output Columns: labels, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/textclassification/#inputs","title":"Inputs","text":"str
): The reference text we want to obtain labels for.labels (Union[str, List[str]]
): The label or list of labels for the text.
model_name (str
): The name of the model used to generate the label/s.
from distilabel.steps.tasks import TextClassification\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n)\n\ntext_classification = TextClassification(\n llm=llm,\n context=\"You are an AI system specialized in assigning sentiment to movies.\",\n available_labels=[\"positive\", \"negative\"],\n)\n\ntext_classification.load()\n\nresult = next(\n text_classification.process(\n [{\"text\": \"This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.\"}]\n )\n)\n# result\n# [{'text': 'This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.',\n# 'labels': 'positive',\n# 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": \"positive\"\\n}',\n# 'raw_input_text_classification_0': [{'role': 'system',\n# 'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n# {'role': 'user',\n# 'content': '# Instruction\\nPlease classify the user query by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide the label that best describes the text.\\nYou are an AI system specialized in assigning sentiment to movie the user queries.\\n## Labeling the user input\\nUse the available labels to classify the user query. Analyze the context of each label specifically:\\navailable_labels = [\\n \"positive\", # The text shows positive sentiment\\n \"negative\", # The text shows negative sentiment\\n]\\n\\n\\n## User Query\\n```\\nThis was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": \"label\"\\n}\\n```'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/textclassification/#assigning-predefined-labels-with-specified-descriptions","title":"Assigning predefined labels with specified descriptions","text":"from distilabel.steps.tasks import TextClassification\n\ntext_classification = TextClassification(\n llm=llm,\n n=1,\n context=\"Determine the intent of the text.\",\n available_labels={\n \"complaint\": \"A statement expressing dissatisfaction or annoyance about a product, service, or experience. It's a negative expression of discontent, often with the intention of seeking a resolution or compensation.\",\n \"inquiry\": \"A question or request for information about a product, service, or situation. It's a neutral or curious expression seeking clarification or details.\",\n \"feedback\": \"A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.\",\n \"praise\": \"A statement expressing admiration, approval, or appreciation for a product, service, or experience. It's a positive expression of satisfaction or delight, often with the intention of encouraging or recommending.\"\n },\n query_title=\"Customer Query\",\n)\n\ntext_classification.load()\n\nresult = next(\n text_classification.process(\n [{\"text\": \"Can you tell me more about your return policy?\"}]\n )\n)\n# result\n# [{'text': 'Can you tell me more about your return policy?',\n# 'labels': 'inquiry',\n# 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": \"inquiry\"\\n}',\n# 'raw_input_text_classification_0': [{'role': 'system',\n# 'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n# {'role': 'user',\n# 'content': '# Instruction\\nPlease classify the customer query by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide the label that best describes the text.\\nDetermine the intent of the text.\\n## Labeling the user input\\nUse the available labels to classify the user query. Analyze the context of each label specifically:\\navailable_labels = [\\n \"complaint\", # A statement expressing dissatisfaction or annoyance about a product, service, or experience. It\\'s a negative expression of discontent, often with the intention of seeking a resolution or compensation.\\n \"inquiry\", # A question or request for information about a product, service, or situation. It\\'s a neutral or curious expression seeking clarification or details.\\n \"feedback\", # A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.\\n \"praise\", # A statement expressing admiration, approval, or appreciation for a product, service, or experience. It\\'s a positive expression of satisfaction or delight, often with the intention of encouraging or recommending.\\n]\\n\\n\\n## Customer Query\\n```\\nCan you tell me more about your return policy?\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": \"label\"\\n}\\n```'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/textclassification/#free-multi-label-classification-without-predefined-labels","title":"Free multi label classification without predefined labels","text":"from distilabel.steps.tasks import TextClassification\n\ntext_classification = TextClassification(\n llm=llm,\n n=3,\n context=(\n \"Describe the main themes, topics, or categories that could describe the \"\n \"following type of persona.\"\n ),\n query_title=\"Example of Persona\",\n)\n\ntext_classification.load()\n\nresult = next(\n text_classification.process(\n [{\"text\": \"A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.\"}]\n )\n)\n# result\n# [{'text': 'A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.',\n# 'labels': ['Historical Researcher',\n# 'Cultural Specialist',\n# 'Ethnic Studies Expert'],\n# 'distilabel_metadata': {'raw_output_text_classification_0': '{\\n \"labels\": [\"Historical Researcher\", \"Cultural Specialist\", \"Ethnic Studies Expert\"]\\n}',\n# 'raw_input_text_classification_0': [{'role': 'system',\n# 'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},\n# {'role': 'user',\n# 'content': '# Instruction\\nPlease classify the example of persona by assigning the most appropriate labels.\\nDo not explain your reasoning or provide any additional commentary.\\nIf the text is ambiguous or lacks sufficient information for classification, respond with \"Unclassified\".\\nProvide a list of 3 labels that best describe the text.\\nDescribe the main themes, topics, or categories that could describe the following type of persona.\\nUse clear, widely understood terms for labels.Avoid overly specific or obscure labels unless the text demands it.\\n\\n\\n## Example of Persona\\n```\\nA historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.\\n```\\n\\n## Output Format\\nNow, please give me the labels in JSON format, do not include any other text in your response:\\n```\\n{\\n \"labels\": [\"label_0\", \"label_1\", \"label_2\"]\\n}\\n```'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/textclassification/#references","title":"References","text":"Evolve instructions using an LLM
.
WizardLM: Empowering Large Language Models to Follow Complex Instructions
"},{"location":"components-gallery/tasks/evolinstruct/#attributes","title":"Attributes","text":"num_evolutions: The number of evolutions to be performed.
store_evolutions: Whether to store all the evolutions or just the last one. Defaults to False
.
generate_answers: Whether to generate answers for the evolved instructions. Defaults to False
.
include_original_instruction: Whether to include the original instruction in the evolved_instructions
output column. Defaults to False
.
mutation_templates: The mutation templates to be used for evolving the instructions. Defaults to the ones provided in the utils.py
file.
seed: The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
numpy
in order to randomly pick a mutation method.graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n end\n subgraph New columns\n OCOL0[evolved_instruction]\n OCOL1[evolved_instructions]\n OCOL2[model_name]\n OCOL3[answer]\n OCOL4[answers]\n end\n end\n\n subgraph EvolInstruct\n StepInput[Input Columns: instruction]\n StepOutput[Output Columns: evolved_instruction, evolved_instructions, model_name, answer, answers]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n StepOutput --> OCOL4\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/evolinstruct/#inputs","title":"Inputs","text":"str
): The instruction to evolve.evolved_instruction (str
): The evolved instruction if store_evolutions=False
.
evolved_instructions (List[str]
): The evolved instructions if store_evolutions=True
.
model_name (str
): The name of the LLM used to evolve the instructions.
answer (str
): The answer to the evolved instruction if generate_answers=True
and store_evolutions=False
.
answers (List[str]
): The answers to the evolved instructions if generate_answers=True
and store_evolutions=True
.
from distilabel.steps.tasks import EvolInstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n)\n\nevol_instruct.load()\n\nresult = next(evol_instruct.process([{\"instruction\": \"common instruction\"}]))\n# result\n# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]\n
"},{"location":"components-gallery/tasks/evolinstruct/#keep-the-iterations-of-the-evolutions","title":"Keep the iterations of the evolutions","text":"from distilabel.steps.tasks import EvolInstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n store_evolutions=True,\n)\n\nevol_instruct.load()\n\nresult = next(evol_instruct.process([{\"instruction\": \"common instruction\"}]))\n# result\n# [\n# {\n# 'instruction': 'common instruction',\n# 'evolved_instructions': ['initial evolution', 'final evolution'],\n# 'model_name': 'model_name'\n# }\n# ]\n
"},{"location":"components-gallery/tasks/evolinstruct/#generate-answers-for-the-instructions-in-a-single-step","title":"Generate answers for the instructions in a single step","text":"from distilabel.steps.tasks import EvolInstruct\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_instruct = EvolInstruct(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n generate_answers=True,\n)\n\nevol_instruct.load()\n\nresult = next(evol_instruct.process([{\"instruction\": \"common instruction\"}]))\n# result\n# [\n# {\n# 'instruction': 'common instruction',\n# 'evolved_instruction': 'evolved instruction',\n# 'answer': 'answer to the instruction',\n# 'model_name': 'model_name'\n# }\n# ]\n
"},{"location":"components-gallery/tasks/evolinstruct/#references","title":"References","text":"WizardLM: Empowering Large Language Models to Follow Complex Instructions
GitHub: h2oai/h2o-wizardlm
Evolve instructions to make them more complex using an LLM
.
EvolComplexity
is a task that evolves instructions to make them more complex, and it is based in the EvolInstruct task, using slight different prompts, but the exact same evolutionary approach.
num_instructions: The number of instructions to be generated.
generate_answers: Whether to generate answers for the instructions or not. Defaults to False
.
mutation_templates: The mutation templates to be used for the generation of the instructions.
min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512
.
max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024
.
seed: The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
seed: The number of evolutions to be run.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n end\n subgraph New columns\n OCOL0[evolved_instruction]\n OCOL1[answer]\n OCOL2[model_name]\n end\n end\n\n subgraph EvolComplexity\n StepInput[Input Columns: instruction]\n StepOutput[Output Columns: evolved_instruction, answer, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/evolcomplexity/#inputs","title":"Inputs","text":"str
): The instruction to evolve.evolved_instruction (str
): The evolved instruction.
answer (str
, optional): The answer to the instruction if generate_answers=True
.
model_name (str
): The name of the LLM used to evolve the instructions.
from distilabel.steps.tasks import EvolComplexity\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_complexity = EvolComplexity(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n)\n\nevol_complexity.load()\n\nresult = next(evol_complexity.process([{\"instruction\": \"common instruction\"}]))\n# result\n# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]\n
"},{"location":"components-gallery/tasks/evolcomplexity/#references","title":"References","text":"What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Evolve the quality of the responses using an LLM
.
EvolQuality
task is used to evolve the quality of the responses given a prompt, by generating a new response with a language model. This step implements the evolution quality task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
num_evolutions: The number of evolutions to be performed on the responses.
store_evolutions: Whether to store all the evolved responses or just the last one. Defaults to False
.
include_original_response: Whether to include the original response within the evolved responses. Defaults to False
.
mutation_templates: The mutation templates to be used to evolve the responses.
seed: The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
numpy
in order to randomly pick a mutation method.graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[response]\n end\n subgraph New columns\n OCOL0[evolved_response]\n OCOL1[evolved_responses]\n OCOL2[model_name]\n end\n end\n\n subgraph EvolQuality\n StepInput[Input Columns: instruction, response]\n StepOutput[Output Columns: evolved_response, evolved_responses, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/evolquality/#inputs","title":"Inputs","text":"instruction (str
): The instruction that was used to generate the responses
.
response (str
): The responses to be rewritten.
evolved_response (str
): The evolved response if store_evolutions=False
.
evolved_responses (List[str]
): The evolved responses if store_evolutions=True
.
model_name (str
): The name of the LLM used to evolve the responses.
from distilabel.steps.tasks import EvolQuality\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_quality = EvolQuality(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_evolutions=2,\n)\n\nevol_quality.load()\n\nresult = next(\n evol_quality.process(\n [\n {\"instruction\": \"common instruction\", \"response\": \"a response\"},\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'common instruction',\n# 'response': 'a response',\n# 'evolved_response': 'evolved response',\n# 'model_name': '\"mistralai/Mistral-7B-Instruct-v0.2\"'\n# }\n# ]\n
"},{"location":"components-gallery/tasks/evolquality/#references","title":"References","text":"Generate evolved instructions using an LLM
.
WizardLM: Empowering Large Language Models to Follow Complex Instructions
"},{"location":"components-gallery/tasks/evolinstructgenerator/#attributes","title":"Attributes","text":"num_instructions: The number of instructions to be generated.
generate_answers: Whether to generate answers for the instructions or not. Defaults to False
.
mutation_templates: The mutation templates to be used for the generation of the instructions.
min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512
.
max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024
.
seed: The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
seed: The seed to be set for numpy
in order to randomly pick a mutation method.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[instruction]\n OCOL1[answer]\n OCOL2[instructions]\n OCOL3[model_name]\n end\n end\n\n subgraph EvolInstructGenerator\n StepOutput[Output Columns: instruction, answer, instructions, model_name]\n end\n\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n
"},{"location":"components-gallery/tasks/evolinstructgenerator/#outputs","title":"Outputs","text":"instruction (str
): The generated instruction if generate_answers=False
.
answer (str
): The generated answer if generate_answers=True
.
instructions (List[str]
): The generated instructions if generate_answers=True
.
model_name (str
): The name of the LLM used to generate and evolve the instructions.
from distilabel.steps.tasks import EvolInstructGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_instruct_generator = EvolInstructGenerator(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_instructions=2,\n)\n\nevol_instruct_generator.load()\n\nresult = next(scorer.process())\n# result\n# [{'instruction': 'generated instruction', 'model_name': 'test'}]\n
"},{"location":"components-gallery/tasks/evolinstructgenerator/#references","title":"References","text":"WizardLM: Empowering Large Language Models to Follow Complex Instructions
GitHub: h2oai/h2o-wizardlm
Generate evolved instructions with increased complexity using an LLM
.
EvolComplexityGenerator
is a generation task that evolves instructions to make them more complex, and it is based in the EvolInstruct task, but using slight different prompts, but the exact same evolutionary approach.
num_instructions: The number of instructions to be generated.
generate_answers: Whether to generate answers for the instructions or not. Defaults to False
.
mutation_templates: The mutation templates to be used for the generation of the instructions.
min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512
.
max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024
.
seed: The seed to be set for numpy
in order to randomly pick a mutation method. Defaults to 42
.
min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
seed: The number of evolutions to be run.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[instruction]\n OCOL1[answer]\n OCOL2[model_name]\n end\n end\n\n subgraph EvolComplexityGenerator\n StepOutput[Output Columns: instruction, answer, model_name]\n end\n\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n
"},{"location":"components-gallery/tasks/evolcomplexitygenerator/#outputs","title":"Outputs","text":"instruction (str
): The evolved instruction.
answer (str
, optional): The answer to the instruction if generate_answers=True
.
model_name (str
): The name of the LLM used to evolve the instructions.
from distilabel.steps.tasks import EvolComplexityGenerator\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nevol_complexity_generator = EvolComplexityGenerator(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n num_instructions=2,\n)\n\nevol_complexity_generator.load()\n\nresult = next(scorer.process())\n# result\n# [{'instruction': 'generated instruction', 'model_name': 'test'}]\n
"},{"location":"components-gallery/tasks/evolcomplexitygenerator/#references","title":"References","text":"What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
WizardLM: Empowering Large Language Models to Follow Complex Instructions
Self-Alignment with Instruction Backtranslation.
"},{"location":"components-gallery/tasks/instructionbacktranslation/#attributes","title":"Attributes","text":"graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[generation]\n end\n subgraph New columns\n OCOL0[score]\n OCOL1[reason]\n OCOL2[model_name]\n end\n end\n\n subgraph InstructionBacktranslation\n StepInput[Input Columns: instruction, generation]\n StepOutput[Output Columns: score, reason, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/instructionbacktranslation/#inputs","title":"Inputs","text":"instruction (str
): The reference instruction to evaluate the text output.
generation (str
): The text output to evaluate for the given instruction.
score (str
): The score for the generation based on the given instruction.
reason (str
): The reason for the provided score.
model_name (str
): The model name used to score the generation.
from distilabel.steps.tasks import InstructionBacktranslation\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n)\n\ninstruction_backtranslation = InstructionBacktranslation(\n name=\"instruction_backtranslation\",\n llm=llm,\n input_batch_size=10,\n output_mappings={\"model_name\": \"scoring_model\"},\n)\ninstruction_backtranslation.load()\n\nresult = next(\n instruction_backtranslation.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generation\": \"4\",\n }\n ]\n )\n)\n# result\n# [\n# {\n# \"instruction\": \"How much is 2+2?\",\n# \"generation\": \"4\",\n# \"score\": 3,\n# \"reason\": \"Reason for the generation.\",\n# \"scoring_model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n# }\n# ]\n
"},{"location":"components-gallery/tasks/instructionbacktranslation/#references","title":"References","text":"Critique and rank the quality of generations from an LLM
using Prometheus 2.0.
PrometheusEval
is a task created for Prometheus 2.0, covering both the absolute and relative evaluations. The absolute evaluation i.e. mode=\"absolute\"
is used to evaluate a single generation from an LLM for a given instruction. The relative evaluation i.e. mode=\"relative\"
is used to evaluate two generations from an LLM for a given instruction. Both evaluations can optionally use a reference answer to compare against, which is controlled via the reference
attribute, and both are based on a score rubric that critiques the generation/s based on the following default aspects: helpfulness
, harmlessness
, honesty
, factual-validity
, and reasoning
, that can be overridden via rubrics
, and the selected rubric is set via the attribute rubric
.
The PrometheusEval
task is intended to be used with any of the Prometheus 2.0 models released by Kaist AI, namely https://huggingface.co/prometheus-eval/prometheus-7b-v2.0 and https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0. The formatting and quality of the critique assessment are not guaranteed when using another model, although some other models may still be able to follow the format and generate insightful critiques.
mode: the evaluation mode to use, either absolute
or relative
. It defines whether the task will evaluate one or two generations.
rubric: the score rubric to use within the prompt to run the critique based on different aspects. Can be any existing key in the rubrics
attribute, which by default means that it can be: helpfulness
, harmlessness
, honesty
, factual-validity
, or reasoning
. Those will only work if using the default rubrics
, otherwise, the provided rubrics
should be used.
rubrics: a dictionary containing the different rubrics to use for the critique, where the keys are the rubric names and the values are the rubric descriptions. The default rubrics are the following: helpfulness
, harmlessness
, honesty
, factual-validity
, and reasoning
.
reference: a boolean flag to indicate whether a reference answer / completion will be provided, so that the model critique is based on the comparison with it. It implies that the column reference
needs to be provided within the input data in addition to the rest of the inputs.
_template: a Jinja2 template used to format the input for the LLM.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[generation]\n ICOL2[generations]\n ICOL3[reference]\n end\n subgraph New columns\n OCOL0[feedback]\n OCOL1[result]\n OCOL2[model_name]\n end\n end\n\n subgraph PrometheusEval\n StepInput[Input Columns: instruction, generation, generations, reference]\n StepOutput[Output Columns: feedback, result, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n ICOL3 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/prometheuseval/#inputs","title":"Inputs","text":"instruction (str
): The instruction to use as reference.
generation (str
, optional): The generated text from the given instruction
. This column is required if mode=absolute
.
generations (List[str]
, optional): The generated texts from the given instruction
. It should contain 2 generations only. This column is required if mode=relative
.
reference (str
, optional): The reference / golden answer for the instruction
, to be used by the LLM for comparison against.
feedback (str
): The feedback explaining the result below, as critiqued by the LLM using the pre-defined score rubric, compared against reference
if provided.
result (Union[int, Literal[\"A\", \"B\"]]
): If mode=absolute
, then the result contains the score for the generation
in a likert-scale from 1-5, otherwise, if mode=relative
, then the result contains either \"A\" or \"B\", the \"winning\" one being the generation in the index 0 of generations
if result='A'
or the index 1 if result='B'
.
model_name (str
): The model name used to generate the feedback
and result
.
from distilabel.steps.tasks import PrometheusEval\nfrom distilabel.models import vLLM\n\n# Consider this as a placeholder for your actual LLM.\nprometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]['content'] }}\\n{{ messages[1]['content'] }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"factual-validity\"\n)\n\nprometheus.load()\n\nresult = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", \"generation\": \"something done\"},\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'make something',\n# 'generation': 'something done',\n# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n# 'feedback': 'the feedback',\n# 'result': 4,\n# }\n# ]\n
"},{"location":"components-gallery/tasks/prometheuseval/#critique-for-relative-evaluation","title":"Critique for relative evaluation","text":"from distilabel.steps.tasks import PrometheusEval\nfrom distilabel.models import vLLM\n\n# Consider this as a placeholder for your actual LLM.\nprometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]\",\n ),\n mode=\"relative\",\n rubric=\"honesty\"\n)\n\nprometheus.load()\n\nresult = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", \"generations\": [\"something done\", \"other thing\"]},\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'make something',\n# 'generations': ['something done', 'other thing'],\n# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n# 'feedback': 'the feedback',\n# 'result': 'something done',\n# }\n# ]\n
"},{"location":"components-gallery/tasks/prometheuseval/#critique-with-a-custom-rubric","title":"Critique with a custom rubric","text":"from distilabel.steps.tasks import PrometheusEval\nfrom distilabel.models import vLLM\n\n# Consider this as a placeholder for your actual LLM.\nprometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"custom\",\n rubrics={\n \"custom\": \"[A]\\nScore 1: A\\nScore 2: B\\nScore 3: C\\nScore 4: D\\nScore 5: E\"\n }\n)\n\nprometheus.load()\n\nresult = next(\n prometheus.process(\n [\n {\"instruction\": \"make something\", \"generation\": \"something done\"},\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'make something',\n# 'generation': 'something done',\n# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n# 'feedback': 'the feedback',\n# 'result': 6,\n# }\n# ]\n
"},{"location":"components-gallery/tasks/prometheuseval/#critique-using-a-reference-answer","title":"Critique using a reference answer","text":"from distilabel.steps.tasks import PrometheusEval\nfrom distilabel.models import vLLM\n\n# Consider this as a placeholder for your actual LLM.\nprometheus = PrometheusEval(\n llm=vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]\",\n ),\n mode=\"absolute\",\n rubric=\"helpfulness\",\n reference=True,\n)\n\nprometheus.load()\n\nresult = next(\n prometheus.process(\n [\n {\n \"instruction\": \"make something\",\n \"generation\": \"something done\",\n \"reference\": \"this is a reference answer\",\n },\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'make something',\n# 'generation': 'something done',\n# 'reference': 'this is a reference answer',\n# 'model_name': 'prometheus-eval/prometheus-7b-v2.0',\n# 'feedback': 'the feedback',\n# 'result': 6,\n# }\n# ]\n
"},{"location":"components-gallery/tasks/prometheuseval/#references","title":"References","text":"Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
prometheus-eval: Evaluate your LLM's response with Prometheus \ud83d\udcaf
Score instructions based on their complexity using an LLM
.
ComplexityScorer
is a pre-defined task used to rank a list of instructions based on their complexity. It is an implementation of the complexity score task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instructions]\n end\n subgraph New columns\n OCOL0[scores]\n OCOL1[model_name]\n end\n end\n\n subgraph ComplexityScorer\n StepInput[Input Columns: instructions]\n StepOutput[Output Columns: scores, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/complexityscorer/#inputs","title":"Inputs","text":"List[str]
): The list of instructions to be scored.scores (List[float]
): The score for each instruction.
model_name (str
): The model name used to generate the scores.
from distilabel.steps.tasks import ComplexityScorer\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nscorer = ComplexityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n)\n\nscorer.load()\n\nresult = next(\n scorer.process(\n [{\"instructions\": [\"plain instruction\", \"highly complex instruction\"]}]\n )\n)\n# result\n# [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 5], 'distilabel_metadata': {'raw_output_complexity_scorer_0': 'output'}}]\n
"},{"location":"components-gallery/tasks/complexityscorer/#generate-structured-output-with-default-schema","title":"Generate structured output with default schema","text":"from distilabel.steps.tasks import ComplexityScorer\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nscorer = ComplexityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n use_default_structured_output=use_default_structured_output\n)\n\nscorer.load()\n\nresult = next(\n scorer.process(\n [{\"instructions\": [\"plain instruction\", \"highly complex instruction\"]}]\n )\n)\n# result\n# [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 2], 'distilabel_metadata': {'raw_output_complexity_scorer_0': '{ \\n \"scores\": [\\n 1, \\n 2\\n ]\\n}'}}]\n
"},{"location":"components-gallery/tasks/complexityscorer/#references","title":"References","text":"Score responses based on their quality using an LLM
.
QualityScorer
is a pre-defined task that defines the instruction
as the input and score
as the output. This task is used to rate the quality of instructions and responses. It is an implementation of the quality score task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'. The task follows the same scheme as the Complexity Scorer, but the instruction-response pairs are scored in terms of quality, obtaining a quality score for each response.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[responses]\n end\n subgraph New columns\n OCOL0[scores]\n OCOL1[model_name]\n end\n end\n\n subgraph QualityScorer\n StepInput[Input Columns: instruction, responses]\n StepOutput[Output Columns: scores, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/qualityscorer/#inputs","title":"Inputs","text":"instruction (str
): The instruction that was used to generate the responses
.
responses (List[str]
): The responses to be scored. Each response forms a pair with the instruction.
scores (List[float]
): The score for each response.
model_name (str
): The model name used to generate the scores.
from distilabel.steps.tasks import QualityScorer\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nscorer = QualityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n )\n)\n\nscorer.load()\n\nresult = next(\n scorer.process(\n [\n {\n \"instruction\": \"instruction\",\n \"responses\": [\"good response\", \"weird response\", \"bad response\"]\n }\n ]\n )\n)\n# result\n[\n {\n 'instruction': 'instruction',\n 'model_name': 'test',\n 'scores': [5, 3, 1],\n }\n]\n
"},{"location":"components-gallery/tasks/qualityscorer/#generate-structured-output-with-default-schema","title":"Generate structured output with default schema","text":"from distilabel.steps.tasks import QualityScorer\nfrom distilabel.models import InferenceEndpointsLLM\n\nscorer = QualityScorer(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n use_default_structured_output=True\n)\n\nscorer.load()\n\nresult = next(\n scorer.process(\n [\n {\n \"instruction\": \"instruction\",\n \"responses\": [\"good response\", \"weird response\", \"bad response\"]\n }\n ]\n )\n)\n\n# result\n[{'instruction': 'instruction',\n'responses': ['good response', 'weird response', 'bad response'],\n'scores': [1, 2, 3],\n'distilabel_metadata': {'raw_output_quality_scorer_0': '{ \"scores\": [1, 2, 3] }'},\n'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/qualityscorer/#references","title":"References","text":"Contrastive Learning from AI Revisions (CLAIR).
CLAIR uses an AI system to minimally revise a solution A\u2192A\u00b4 such that the resulting preference A preferred
A\u2019 is much more contrastive and precise.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[task]\n ICOL1[student_solution]\n end\n subgraph New columns\n OCOL0[revision]\n OCOL1[rational]\n OCOL2[model_name]\n end\n end\n\n subgraph CLAIR\n StepInput[Input Columns: task, student_solution]\n StepOutput[Output Columns: revision, rational, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/clair/#inputs","title":"Inputs","text":"task (str
): The task or instruction.
student_solution (str
): An answer to the task that is to be revised.
revision (str
): The revised text.
rational (str
): The rational for the provided revision.
model_name (str
): The name of the model used to generate the revision and rational.
from distilabel.steps.tasks import CLAIR\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 4096,\n },\n)\nclair_task = CLAIR(llm=llm)\n\nclair_task.load()\n\nresult = next(\n clair_task.process(\n [\n {\n \"task\": \"How many gaps are there between the earth and the moon?\",\n \"student_solution\": 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon's orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.'\n }\n ]\n )\n)\n# result\n# [{'task': 'How many gaps are there between the earth and the moon?',\n# 'student_solution': 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.',\n# 'revision': 'There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\\'s orbital path, not the presence of any gaps.\\n\\nIn summary, the Moon\\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',\n# 'rational': 'The student\\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term \"gaps\" can be misleading in this context. 
The student should clarify what they mean by \"gaps.\" Secondly, the student provides some additional information about the Moon\\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\\'s conclusion could be more concise.',\n# 'distilabel_metadata': {'raw_output_c_l_a_i_r_0': '{teacher_reasoning}: The student\\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term \"gaps\" can be misleading in this context. The student should clarify what they mean by \"gaps.\" Secondly, the student provides some additional information about the Moon\\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\\'s conclusion could be more concise.\\n\\n{corrected_student_solution}: There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\\'s orbital path, not the presence of any gaps.\\n\\nIn summary, the Moon\\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',\n# 'raw_input_c_l_a_i_r_0': [{'role': 'system',\n# 'content': \"You are a teacher and your task is to minimally improve a student's answer. I will give you a {task} and a {student_solution}. Your job is to revise the {student_solution} such that it is clearer, more correct, and more engaging. Copy all non-corrected parts of the student's answer. Do not allude to the {corrected_student_solution} being a revision or a correction in your final solution.\"},\n# {'role': 'user',\n# 'content': '{task}: How many gaps are there between the earth and the moon?\\n\\n{student_solution}: There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the \"lunar distance\" or \"lunar mean distance.\"\\n\\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\\n\\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.\\n\\n-----------------\\n\\nLet\\'s first think step by step with a {teacher_reasoning} to decide how to improve the {student_solution}, then give the {corrected_student_solution}. 
Mention the {teacher_reasoning} and {corrected_student_solution} identifiers to structure your answer.'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/clair/#references","title":"References","text":"Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
APO and CLAIR - GitHub Repository
Rank generations focusing on different aspects using an LLM
.
UltraFeedback: Boosting Language Models with High-quality Feedback.
"},{"location":"components-gallery/tasks/ultrafeedback/#attributes","title":"Attributes","text":"UltraFeedback
model. The available aspects are: - helpfulness
: Evaluate text outputs based on helpfulness. - honesty
: Evaluate text outputs based on honesty. - instruction-following
: Evaluate text outputs based on given instructions. - truthfulness
: Evaluate text outputs based on truthfulness. Additionally, a custom aspect has been defined by Argilla, so as to evaluate the overall assessment of the text outputs within a single prompt. The custom aspect is: - overall-rating
: Evaluate text outputs based on an overall assessment. Defaults to \"overall-rating\"
.graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[generations]\n end\n subgraph New columns\n OCOL0[ratings]\n OCOL1[rationales]\n OCOL2[model_name]\n end\n end\n\n subgraph UltraFeedback\n StepInput[Input Columns: instruction, generations]\n StepOutput[Output Columns: ratings, rationales, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/ultrafeedback/#inputs","title":"Inputs","text":"instruction (str
): The reference instruction to evaluate the text outputs.
generations (List[str]
): The text outputs to evaluate for the given instruction.
ratings (List[float]
): The ratings for each of the provided text outputs.
rationales (List[str]
): The rationales for each of the provided text outputs.
model_name (str
): The name of the model used to generate the ratings and rationales.
from distilabel.steps.tasks import UltraFeedback\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"mistralai/Mistral-7B-Instruct-v0.2\",\n ),\n use_default_structured_output=False\n)\n\nultrafeedback.load()\n\nresult = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n)\n# result\n# [\n# {\n# 'instruction': 'How much is 2+2?',\n# 'generations': ['4', 'and a car'],\n# 'ratings': [1, 2],\n# 'rationales': ['explanation for 4', 'explanation for and a car'],\n# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',\n# }\n# ]\n
"},{"location":"components-gallery/tasks/ultrafeedback/#rate-generations-from-different-llms-based-on-the-honesty-using-the-default-structured-output","title":"Rate generations from different LLMs based on the honesty, using the default structured output","text":"from distilabel.steps.tasks import UltraFeedback\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n aspect=\"honesty\"\n)\n\nultrafeedback.load()\n\nresult = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n)\n# result\n# [{'instruction': 'How much is 2+2?',\n# 'generations': ['4', 'and a car'],\n# 'ratings': [5, 1],\n# 'rationales': ['The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.',\n# \"The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. The model shows no uncertainty or indication that it does not know the answer.\"],\n# 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{\"ratings\": [\\n 5,\\n 1\\n] \\n\\n,\"rationales\": [\\n \"The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.\",\\n \"The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. The model shows no uncertainty or indication that it does not know the answer.\"\\n] }'},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/ultrafeedback/#rate-generations-from-different-llms-based-on-the-helpfulness-using-the-default-structured-output","title":"Rate generations from different LLMs based on the helpfulness, using the default structured output","text":"from distilabel.steps.tasks import UltraFeedback\nfrom distilabel.models import InferenceEndpointsLLM\n\n# Consider this as a placeholder for your actual LLM.\nultrafeedback = UltraFeedback(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\"max_new_tokens\": 512},\n ),\n aspect=\"helpfulness\"\n)\n\nultrafeedback.load()\n\nresult = next(\n ultrafeedback.process(\n [\n {\n \"instruction\": \"How much is 2+2?\",\n \"generations\": [\"4\", \"and a car\"],\n }\n ]\n )\n)\n# result\n# [{'instruction': 'How much is 2+2?',\n# 'generations': ['4', 'and a car'],\n# 'ratings': [1, 5],\n# 'rationales': ['Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.',\n# 'Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question.'],\n# 'rationales_for_rating': ['Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.',\n# 'Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question.'],\n# 'types': [1, 3, 1],\n# 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{ \\n \"ratings\": [\\n 1,\\n 5\\n ]\\n ,\\n \"rationales\": [\\n \"Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.\",\\n \"Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question.\"\\n ]\\n ,\\n \"rationales_for_rating\": [\\n \"Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.\",\\n \"Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question.\"\\n ]\\n ,\\n \"types\": [\\n 1, 3,\\n 1\\n ]\\n }'},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/ultrafeedback/#references","title":"References","text":"UltraFeedback: Boosting Language Models with High-quality Feedback
UltraFeedback - GitHub Repository
Rank the candidates based on the input using the LLM
model.
This step differs to other tasks as there is a single implementation of this model currently, and we will use a specific LLM
.
model: The model to use for the ranking. Defaults to \"llm-blender/PairRM\"
.
instructions: The instructions to use for the model. Defaults to None
.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[inputs]\n ICOL1[candidates]\n end\n subgraph New columns\n OCOL0[ranks]\n OCOL1[ranked_candidates]\n OCOL2[model_name]\n end\n end\n\n subgraph PairRM\n StepInput[Input Columns: inputs, candidates]\n StepOutput[Output Columns: ranks, ranked_candidates, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/pairrm/#inputs","title":"Inputs","text":"inputs (List[Dict[str, Any]]
): The input text or conversation to rank the candidates for.
candidates (List[Dict[str, Any]]
): The candidates to rank.
ranks (List[int]
): The ranks of the candidates based on the input.
ranked_candidates (List[Dict[str, Any]]
): The candidates ranked based on the input.
model_name (str
): The model name used to rank the candidate responses. Defaults to \"llm-blender/PairRM\"
.
from distilabel.steps.tasks import PairRM\n\n# PairRM does not take an LLM attribute; it loads the \"llm-blender/PairRM\" ranking model by default.\npair_rm = PairRM()\n\npair_rm.load()\n\nresult = next(\n pair_rm.process(\n [\n {\"input\": \"Hello, how are you?\", \"candidates\": [\"fine\", \"good\", \"bad\"]},\n ]\n )\n)\n# result\n# [\n# {\n# 'input': 'Hello, how are you?',\n# 'candidates': ['fine', 'good', 'bad'],\n# 'ranks': [2, 1, 3],\n# 'ranked_candidates': ['good', 'fine', 'bad'],\n# 'model_name': 'llm-blender/PairRM',\n# }\n# ]\n
"},{"location":"components-gallery/tasks/pairrm/#references","title":"References","text":"LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion
Pair Ranking Model
Generate a positive and (optionally) a negative sentence given an anchor sentence.
GenerateSentencePair
is a pre-defined task that given an anchor sentence generates a positive sentence related to the anchor and optionally a negative sentence unrelated to the anchor or similar to it. Optionally, you can give a context to guide the LLM towards more specific behavior. This task is useful to generate training datasets for training embeddings models.
triplet: a flag to indicate if the task should generate a triplet of sentences (anchor, positive, negative). Defaults to False
.
action: the action to perform to generate the positive sentence.
context: the context to use for the generation. Can be helpful to guide the LLM towards more specific context. Not used by default.
hard_negative: A flag to indicate if the negative should be a hard-negative or not. Hard negatives make it hard for the model to distinguish against the positive, with a higher degree of semantic similarity.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[anchor]\n end\n subgraph New columns\n OCOL0[positive]\n OCOL1[negative]\n OCOL2[model_name]\n end\n end\n\n subgraph GenerateSentencePair\n StepInput[Input Columns: anchor]\n StepOutput[Output Columns: positive, negative, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/generatesentencepair/#inputs","title":"Inputs","text":"str
): The anchor sentence to generate the positive and negative sentences.positive (str
): The positive sentence related to the anchor
.
negative (str
): The negative sentence unrelated to the anchor
if triplet=True
, or more similar to the positive to make it more challenging for a model to distinguish in case hard_negative=True
.
model_name (str
): The name of the model that was used to generate the sentences.
from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"paraphrase\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"What Game of Thrones villain would be the most likely to give you mercy?\"}])\n
"},{"location":"components-gallery/tasks/generatesentencepair/#generating-semantically-similar-sentences","title":"Generating semantically similar sentences","text":"from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps.tasks import GenerateSentencePair\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"semantically-similar\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"How does 3D printing work?\"}])\n
"},{"location":"components-gallery/tasks/generatesentencepair/#generating-queries","title":"Generating queries","text":"from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"Argilla is an open-source data curation platform for LLMs. Using Argilla, ...\"}])\n
"},{"location":"components-gallery/tasks/generatesentencepair/#generating-answers","title":"Generating answers","text":"from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"answer\",\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"What Game of Thrones villain would be the most likely to give you mercy?\"}])\n
"},{"location":"components-gallery/tasks/generatesentencepair/#_1","title":")","text":"from distilabel.steps.tasks import GenerateSentencePair\nfrom distilabel.models import InferenceEndpointsLLM\n\ngenerate_sentence_pair = GenerateSentencePair(\n triplet=True, # `False` to generate only positive\n action=\"query\",\n context=\"Argilla is an open-source data curation platform for LLMs.\",\n hard_negative=True,\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n input_batch_size=10,\n use_default_structured_output=True\n)\n\ngenerate_sentence_pair.load()\n\nresult = generate_sentence_pair.process([{\"anchor\": \"I want to generate queries for my LLM.\"}])\n
"},{"location":"components-gallery/tasks/generateembeddings/","title":"GenerateEmbeddings","text":"Generate embeddings using the last hidden state of an LLM
.
Generate embeddings for a text input using the last hidden state of an LLM
, as described in the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
LLM
to use to generate the embeddings.graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[text]\n end\n subgraph New columns\n OCOL0[embedding]\n OCOL1[model_name]\n end\n end\n\n subgraph GenerateEmbeddings\n StepInput[Input Columns: text]\n StepOutput[Output Columns: embedding, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/generateembeddings/#inputs","title":"Inputs","text":"str
, List[Dict[str, str]]
): The input text or conversation to generate embeddings for.embedding (List[float]
): The embedding of the input text or conversation.
model_name (str
): The model name used to generate the embeddings.
from distilabel.steps.tasks import GenerateEmbeddings\nfrom distilabel.models.llms.huggingface import TransformersLLM\n\n# Consider this as a placeholder for your actual LLM.\nembedder = GenerateEmbeddings(\n llm=TransformersLLM(\n model=\"TaylorAI/bge-micro-v2\",\n model_kwargs={\"is_decoder\": True},\n cuda_devices=[],\n )\n)\nembedder.load()\n\nresult = next(\n embedder.process(\n [\n {\"text\": \"Hello, how are you?\"},\n ]\n )\n)\n
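# Illustrative sketch (assumption based on the documented input types above): the \"text\" input\n# can also be a conversation, i.e. a list of {\"role\", \"content\"} messages.\nresult = next(\n embedder.process(\n [\n {\"text\": [{\"role\": \"user\", \"content\": \"Hello, how are you?\"}, {\"role\": \"assistant\", \"content\": \"Fine, thanks!\"}]},\n ]\n )\n)\n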
"},{"location":"components-gallery/tasks/generateembeddings/#references","title":"References","text":"Task that clusters a set of texts and generates summary labels for each cluster.
This is a GlobalTask
that inherits from TextClassification
, which means that all the attributes from that class are available here. Also, in this case we deal with all the inputs at once, instead of using batches. The input_batch_size
is used here to send the examples to the LLM in batches (a subtle difference with the more common Task
definitions). The task looks in each cluster for a given number of representative examples (the number is set by the samples_per_cluster
attribute), and sends them to the LLM to get one or more labels that represent the cluster. The labels are then assigned to each text in the cluster. The clusters and projections used in the step are assumed to be obtained from the UMAP
+ DBSCAN
steps, but could be generated for similar steps, as long as they represent the same concepts. This step runs a pipeline like the one in this repository: https://github.com/huggingface/text-clustering
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[text]\n ICOL1[projection]\n ICOL2[cluster_label]\n end\n subgraph New columns\n OCOL0[summary_label]\n OCOL1[model_name]\n end\n end\n\n subgraph TextClustering\n StepInput[Input Columns: text, projection, cluster_label]\n StepOutput[Output Columns: summary_label, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/textclustering/#inputs","title":"Inputs","text":"text (str
): The reference text we want to obtain labels for.
projection (List[float]
): Vector representation of the text to cluster, normally the output from the UMAP
step.
cluster_label (int
): Integer representing the label of a given cluster. -1 means it wasn't clustered.
summary_label (str
): The label or list of labels for the text.
model_name (str
): The name of the model used to generate the label/s.
from distilabel.models import InferenceEndpointsLLM\nfrom distilabel.steps import UMAP, DBSCAN, TextClustering\nfrom distilabel.pipeline import Pipeline\n\nds_name = \"argilla-warehouse/personahub-fineweb-edu-4-clustering-100k\"\n\nwith Pipeline(name=\"Text clustering dataset\") as pipeline:\n batch_size = 500\n\n ds = load_dataset(ds_name, split=\"train\").select(range(10000))\n loader = make_generator_step(ds, batch_size=batch_size, repo_id=ds_name)\n\n umap = UMAP(n_components=2, metric=\"cosine\")\n dbscan = DBSCAN(eps=0.3, min_samples=30)\n\n text_clustering = TextClustering(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n ),\n n=3, # 3 labels per example\n query_title=\"Examples of Personas\",\n samples_per_cluster=10,\n context=(\n \"Describe the main themes, topics, or categories that could describe the \"\n \"following types of personas. All the examples of personas must share \"\n \"the same set of labels.\"\n ),\n default_label=\"None\",\n savefig=True,\n input_batch_size=8,\n input_mappings={\"text\": \"persona\"},\n use_default_structured_output=True,\n )\n\n loader >> umap >> dbscan >> text_clustering\n
"},{"location":"components-gallery/tasks/textclustering/#references","title":"References","text":"Generate queries and answers for the given functions in JSON format.
The APIGenSemanticChecker
is inspired by the APIGen pipeline, which was designed to generate verifiable and diverse function-calling datasets. The task checks whether the generated function calls and their execution results are semantically aligned with the user query, producing a reasoning thought and a pass/fail flag that can be used to filter the dataset.
system_prompt: System prompt for the task. Has a default one.
exclude_failed_execution: Whether to exclude failed executions (won't run on those rows that have a False in keep_row_after_execution_check
column, which comes from running APIGenExecutionChecker
). Defaults to True.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[func_desc]\n ICOL1[query]\n ICOL2[answers]\n ICOL3[execution_result]\n end\n subgraph New columns\n OCOL0[thought]\n OCOL1[keep_row_after_semantic_check]\n end\n end\n\n subgraph APIGenSemanticChecker\n StepInput[Input Columns: func_desc, query, answers, execution_result]\n StepOutput[Output Columns: thought, keep_row_after_semantic_check]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n ICOL2 --> StepInput\n ICOL3 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/apigensemanticchecker/#inputs","title":"Inputs","text":"func_desc (str
): Description of what the function should do.
query (str
): Instruction from the user.
answers (str
): JSON encoded list with arguments to be passed to the function/API. Should be loaded using json.loads
.
execution_result (str
): Result of the function/API executed.
thought (str
): Reasoning for the output on whether to keep this output or not.
keep_row_after_semantic_check (bool
): True or False, can be used to filter afterwards.
from distilabel.steps.tasks import APIGenSemanticChecker\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n)\nsemantic_checker = APIGenSemanticChecker(\n use_default_structured_output=False,\n llm=llm\n)\nsemantic_checker.load()\n\nres = next(\n semantic_checker.process(\n [\n {\n \"func_desc\": \"Fetch information about a specific cat breed from the Cat Breeds API.\",\n \"query\": \"What information can be obtained about the Maine Coon cat breed?\",\n \"answers\": json.dumps([{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]),\n \"execution_result\": \"The Maine Coon is a big and hairy breed of cat\",\n }\n ]\n )\n)\nres\n# [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',\n# 'query': 'What information can be obtained about the Maine Coon cat breed?',\n# 'answers': [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}],\n# 'execution_result': 'The Maine Coon is a big and hairy breed of cat',\n# 'thought': '',\n# 'keep_row_after_semantic_check': True,\n# 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',\n# 'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user\u2019s intentions.\\n\\nDo not pass if:\\n1. The function call does not align with the query\u2019s objective, or the input arguments appear incorrect.\\n2. The function call and arguments are not properly chosen from the available functions.\\n3. The number of function calls does not correspond to the user\u2019s intentions.\\n4. The execution results are irrelevant and do not match the function\u2019s purpose.\\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\\n'},\n# {'role': 'user',\n# 'content': 'Given Information:\\n- All Available Functions:\\nFetch information about a specific cat breed from the Cat Breeds API.\\n- User Query: What information can be obtained about the Maine Coon cat breed?\\n- Generated Function Calls: [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]\\n- Execution Results: The Maine Coon is a big and hairy breed of cat\\n\\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\\n\\nThe main decision factor is wheather the function calls accurately reflect the query\\'s intentions and the function descriptions.\\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\\n\\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\\n```\\n{\\n \"thought\": \"Concisely describe your reasoning here\",\\n \"pass\": \"yes\" or \"no\"\\n}\\n```\\n'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/apigensemanticchecker/#semantic-checker-for-generated-function-calls-structured-output","title":"Semantic checker for generated function calls (structured output)","text":"from distilabel.steps.tasks import APIGenSemanticChecker\nfrom distilabel.models import InferenceEndpointsLLM\n\nllm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n generation_kwargs={\n \"temperature\": 0.7,\n \"max_new_tokens\": 1024,\n },\n)\nsemantic_checker = APIGenSemanticChecker(\n use_default_structured_output=True,\n llm=llm\n)\nsemantic_checker.load()\n\nres = next(\n semantic_checker.process(\n [\n {\n \"func_desc\": \"Fetch information about a specific cat breed from the Cat Breeds API.\",\n \"query\": \"What information can be obtained about the Maine Coon cat breed?\",\n \"answers\": json.dumps([{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]),\n \"execution_result\": \"The Maine Coon is a big and hairy breed of cat\",\n }\n ]\n )\n)\nres\n# [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',\n# 'query': 'What information can be obtained about the Maine Coon cat breed?',\n# 'answers': [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}],\n# 'execution_result': 'The Maine Coon is a big and hairy breed of cat',\n# 'keep_row_after_semantic_check': True,\n# 'thought': '',\n# 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',\n# 'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user\u2019s intentions.\\n\\nDo not pass if:\\n1. The function call does not align with the query\u2019s objective, or the input arguments appear incorrect.\\n2. The function call and arguments are not properly chosen from the available functions.\\n3. The number of function calls does not correspond to the user\u2019s intentions.\\n4. The execution results are irrelevant and do not match the function\u2019s purpose.\\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\\n'},\n# {'role': 'user',\n# 'content': 'Given Information:\\n- All Available Functions:\\nFetch information about a specific cat breed from the Cat Breeds API.\\n- User Query: What information can be obtained about the Maine Coon cat breed?\\n- Generated Function Calls: [{\"name\": \"get_breed_information\", \"arguments\": {\"breed\": \"Maine Coon\"}}]\\n- Execution Results: The Maine Coon is a big and hairy breed of cat\\n\\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\\n\\nThe main decision factor is wheather the function calls accurately reflect the query\\'s intentions and the function descriptions.\\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\\n'}]},\n# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]\n
"},{"location":"components-gallery/tasks/apigensemanticchecker/#references","title":"References","text":"APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
Salesforce/xlam-function-calling-60k
Generate text retrieval data with an LLM
to later on train an embedding model.
GenerateTextRetrievalData
is a Task
that generates text retrieval data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category=\"text-retrieval\"
; so that the LLM
generates a list of tasks that are flattened so that each row contains a single task for the text-retrieval category.
language: The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
query_type: The type of query to be generated, which can be extremely long-tail
, long-tail
, or common
. Defaults to None
, meaning that it will be randomly sampled.
query_length: The length of the query to be generated, which can be less than 5 words
, 5 to 15 words
, or at least 10 words
. Defaults to None
, meaning that it will be randomly sampled.
difficulty: The difficulty of the query to be generated, which can be high school
, college
, or PhD
. Defaults to None
, meaning that it will be randomly sampled.
clarity: The clarity of the query to be generated, which can be clear
, understandable with some effort
, or ambiguous
. Defaults to None
, meaning that it will be randomly sampled.
num_words: The number of words in the query to be generated, which can be 50
, 100
, 200
, 300
, 400
, or 500
. Defaults to None
, meaning that it will be randomly sampled.
seed: The random seed to be set in case there's any sampling within the format_input
method.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[task]\n end\n subgraph New columns\n OCOL0[user_query]\n OCOL1[positive_document]\n OCOL2[hard_negative_document]\n OCOL3[model_name]\n end\n end\n\n subgraph GenerateTextRetrievalData\n StepInput[Input Columns: task]\n StepOutput[Output Columns: user_query, positive_document, hard_negative_document, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/generatetextretrievaldata/#inputs","title":"Inputs","text":"str
): The task description to be used in the generation.user_query (str
): the user query generated by the LLM
.
positive_document (str
): the positive document generated by the LLM
.
hard_negative_document (str
): the hard negative document generated by the LLM
.
model_name (str
): the name of the model used to generate the text retrieval data.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextRetrievalData\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-retrieval\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateTextRetrievalData(\n language=\"English\",\n query_type=\"common\",\n query_length=\"5 to 15 words\",\n difficulty=\"high school\",\n clarity=\"clear\",\n num_words=100,\n llm=..., # LLM instance\n )\n\n task >> generate\n
"},{"location":"components-gallery/tasks/generatetextretrievaldata/#references","title":"References","text":"Generate short text matching data with an LLM
to later on train an embedding model.
GenerateShortTextMatchingData
is a Task
that generates short text matching data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category=\"text-matching-short\"
; so that the LLM
generates a list of tasks that are flattened so that each row contains a single task for the text-matching-short category.
language: The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
seed: The random seed to be set in case there's any sampling within the format_input
method. Note that in this task the seed
has no effect since there are no sampling params.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[task]\n end\n subgraph New columns\n OCOL0[input]\n OCOL1[positive_document]\n OCOL2[model_name]\n end\n end\n\n subgraph GenerateShortTextMatchingData\n StepInput[Input Columns: task]\n StepOutput[Output Columns: input, positive_document, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/generateshorttextmatchingdata/#inputs","title":"Inputs","text":"str
): The task description to be used in the generation.input (str
): the input generated by the LLM
.
positive_document (str
): the positive document generated by the LLM
.
model_name (str
): the name of the model used to generate the short text matching data.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateShortTextMatchingData\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-matching-short\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateShortTextMatchingData(\n language=\"English\",\n llm=..., # LLM instance\n )\n\n task >> generate\n
"},{"location":"components-gallery/tasks/generateshorttextmatchingdata/#references","title":"References","text":"Generate long text matching data with an LLM
to later on train an embedding model.
GenerateLongTextMatchingData
is a Task
that generates long text matching data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category=\"text-matching-long\"
; so that the LLM
generates a list of tasks that are flattened so that each row contains a single task for the text-matching-long category.
language: The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
seed: The random seed to be set in case there's any sampling within the format_input
method. Note that in this task the seed
has no effect since there are no sampling params.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[task]\n end\n subgraph New columns\n OCOL0[input]\n OCOL1[positive_document]\n OCOL2[model_name]\n end\n end\n\n subgraph GenerateLongTextMatchingData\n StepInput[Input Columns: task]\n StepOutput[Output Columns: input, positive_document, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/generatelongtextmatchingdata/#inputs","title":"Inputs","text":"str
): The task description to be used in the generation.input (str
): the input generated by the LLM
.
positive_document (str
): the positive document generated by the LLM
.
model_name (str
): the name of the model used to generate the long text matching data.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateLongTextMatchingData\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-matching-long\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateLongTextMatchingData(\n language=\"English\",\n llm=..., # LLM instance\n )\n\n task >> generate\n
"},{"location":"components-gallery/tasks/generatelongtextmatchingdata/#references","title":"References","text":"Generate text classification data with an LLM
to later on train an embedding model.
GenerateTextClassificationData
is a Task
that generates text classification data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
Ideally this task should be used with EmbeddingTaskGenerator
with flatten_tasks=True
with the category=\"text-classification\"
; so that the LLM
generates a list of tasks that are flattened so that each row contains a single task for the text-classification category.
language: The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
difficulty: The difficulty of the query to be generated, which can be high school
, college
, or PhD
. Defaults to None
, meaning that it will be randomly sampled.
clarity: The clarity of the query to be generated, which can be clear
, understandable with some effort
, or ambiguous
. Defaults to None
, meaning that it will be randomly sampled.
seed: The random seed to be set in case there's any sampling within the format_input
method.
graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[task]\n end\n subgraph New columns\n OCOL0[input_text]\n OCOL1[label]\n OCOL2[misleading_label]\n OCOL3[model_name]\n end\n end\n\n subgraph GenerateTextClassificationData\n StepInput[Input Columns: task]\n StepOutput[Output Columns: input_text, label, misleading_label, model_name]\n end\n\n ICOL0 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/generatetextclassificationdata/#inputs","title":"Inputs","text":"str
): The task description to be used in the generation.input_text (str
): the input text generated by the LLM
.
label (str
): the label generated by the LLM
.
misleading_label (str
): the misleading label generated by the LLM
.
model_name (str
): the name of the model used to generate the text classification data.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextClassificationData\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-classification\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n generate = GenerateTextClassificationData(\n language=\"English\",\n difficulty=\"high school\",\n clarity=\"clear\",\n llm=..., # LLM instance\n )\n\n task >> generate\n
"},{"location":"components-gallery/tasks/generatetextclassificationdata/#references","title":"References","text":"Generate structured content for a given instruction
using an LLM
.
StructuredGeneration
is a pre-defined task that defines the instruction
and the structured_output
as the inputs, and generation
as the output. This task is used to generate structured content based on the input instruction and following the schema provided within the structured_output
column for each instruction
. The model_name
is also returned as part of the output in order to enhance it.
True
, which means that if the column system_prompt
is defined within the input batch, then the system_prompt
will be used, otherwise, it will be ignored.graph TD\n subgraph Dataset\n subgraph Columns\n ICOL0[instruction]\n ICOL1[structured_output]\n end\n subgraph New columns\n OCOL0[generation]\n OCOL1[model_name]\n end\n end\n\n subgraph StructuredGeneration\n StepInput[Input Columns: instruction, structured_output]\n StepOutput[Output Columns: generation, model_name]\n end\n\n ICOL0 --> StepInput\n ICOL1 --> StepInput\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepInput --> StepOutput\n
"},{"location":"components-gallery/tasks/structuredgeneration/#inputs","title":"Inputs","text":"instruction (str
): The instruction to generate structured content from.
structured_output (Dict[str, Any]
): The structured_output to generate structured content from. It should be a Python dictionary with the keys format
and schema
, where format
should be one of json
or regex
, and the schema
should be either the JSON schema or the regex pattern, respectively.
generation (str
): The generated text matching the provided schema, if possible.
model_name (str
): The name of the model used to generate the text.
from distilabel.steps.tasks import StructuredGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nstructured_gen = StructuredGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n)\n\nstructured_gen.load()\n\nresult = next(\n structured_gen.process(\n [\n {\n \"instruction\": \"Create an RPG character\",\n \"structured_output\": {\n \"format\": \"json\",\n \"schema\": {\n \"properties\": {\n \"name\": {\n \"title\": \"Name\",\n \"type\": \"string\"\n },\n \"description\": {\n \"title\": \"Description\",\n \"type\": \"string\"\n },\n \"role\": {\n \"title\": \"Role\",\n \"type\": \"string\"\n },\n \"weapon\": {\n \"title\": \"Weapon\",\n \"type\": \"string\"\n }\n },\n \"required\": [\n \"name\",\n \"description\",\n \"role\",\n \"weapon\"\n ],\n \"title\": \"Character\",\n \"type\": \"object\"\n }\n },\n }\n ]\n )\n)\n
"},{"location":"components-gallery/tasks/structuredgeneration/#generate-structured-output-from-a-regex-pattern-only-works-with-llms-that-support-regex-the-providers-using-outlines","title":"Generate structured output from a regex pattern (only works with LLMs that support regex, the providers using outlines)","text":"from distilabel.steps.tasks import StructuredGeneration\nfrom distilabel.models import InferenceEndpointsLLM\n\nstructured_gen = StructuredGeneration(\n llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n)\n\nstructured_gen.load()\n\nresult = next(\n structured_gen.process(\n [\n {\n \"instruction\": \"What's the weather like today in Seattle in Celsius degrees?\",\n \"structured_output\": {\n \"format\": \"regex\",\n \"schema\": r\"(\\d{1,2})\u00b0C\"\n },\n\n }\n ]\n )\n)\n
"},{"location":"components-gallery/tasks/monolingualtripletgenerator/","title":"MonolingualTripletGenerator","text":"Generate monolingual triplets with an LLM
to later on train an embedding model.
MonolingualTripletGenerator
is a GeneratorTask
that generates monolingual triplets with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
language: The language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
unit: The unit of the data to be generated, which can be sentence
, phrase
, or passage
. Defaults to None
, meaning that it will be randomly sampled.
difficulty: The difficulty of the query to be generated, which can be elementary school
, high school
, or college
. Defaults to None
, meaning that it will be randomly sampled.
high_score: The high score of the query to be generated, which can be 4
, 4.5
, or 5
. Defaults to None
, meaning that it will be randomly sampled.
low_score: The low score of the query to be generated, which can be 2.5
, 3
, or 3.5
. Defaults to None
, meaning that it will be randomly sampled.
seed: The random seed to be set in case there's any sampling within the format_input
method.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[S1]\n OCOL1[S2]\n OCOL2[S3]\n OCOL3[model_name]\n end\n end\n\n subgraph MonolingualTripletGenerator\n StepOutput[Output Columns: S1, S2, S3, model_name]\n end\n\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n
"},{"location":"components-gallery/tasks/monolingualtripletgenerator/#outputs","title":"Outputs","text":"S1 (str
): the first sentence generated by the LLM
.
S2 (str
): the second sentence generated by the LLM
.
S3 (str
): the third sentence generated by the LLM
.
model_name (str
): the name of the model used to generate the monolingual triplets.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import MonolingualTripletGenerator\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = MonolingualTripletGenerator(\n language=\"English\",\n unit=\"sentence\",\n difficulty=\"elementary school\",\n high_score=\"4\",\n low_score=\"2.5\",\n llm=...,\n )\n\n ...\n\n task >> ...\n
"},{"location":"components-gallery/tasks/bitextretrievalgenerator/","title":"BitextRetrievalGenerator","text":"Generate bitext retrieval data with an LLM
to later on train an embedding model.
BitextRetrievalGenerator
is a GeneratorTask
that generates bitext retrieval data with an LLM
to later on train an embedding model. The task is based on the paper \"Improving Text Embeddings with Large Language Models\" and the data is generated based on the provided attributes, or randomly sampled if not provided.
source_language: The source language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
target_language: The target language of the data to be generated, which can be any of the languages retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
unit: The unit of the data to be generated, which can be sentence
, phrase
, or passage
. Defaults to None
, meaning that it will be randomly sampled.
difficulty: The difficulty of the query to be generated, which can be elementary school
, high school
, or college
. Defaults to None
, meaning that it will be randomly sampled.
high_score: The high score of the query to be generated, which can be 4
, 4.5
, or 5
. Defaults to None
, meaning that it will be randomly sampled.
low_score: The low score of the query to be generated, which can be 2.5
, 3
, or 3.5
. Defaults to None
, meaning that it will be randomly sampled.
seed: The random seed to be set in case there's any sampling within the format_input
method.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[S1]\n OCOL1[S2]\n OCOL2[S3]\n OCOL3[model_name]\n end\n end\n\n subgraph BitextRetrievalGenerator\n StepOutput[Output Columns: S1, S2, S3, model_name]\n end\n\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n StepOutput --> OCOL3\n
"},{"location":"components-gallery/tasks/bitextretrievalgenerator/#outputs","title":"Outputs","text":"S1 (str
): the first sentence generated by the LLM
.
S2 (str
): the second sentence generated by the LLM
.
S3 (str
): the third sentence generated by the LLM
.
model_name (str
): the name of the model used to generate the bitext retrieval data.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import BitextRetrievalGenerator\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = BitextRetrievalGenerator(\n source_language=\"English\",\n target_language=\"Spanish\",\n unit=\"sentence\",\n difficulty=\"elementary school\",\n high_score=\"4\",\n low_score=\"2.5\",\n llm=...,\n )\n\n ...\n\n task >> ...\n
"},{"location":"components-gallery/tasks/embeddingtaskgenerator/","title":"EmbeddingTaskGenerator","text":"Generate task descriptions for embedding-related tasks using an LLM
.
EmbeddingTaskGenerator
is a GeneratorTask
that doesn't receive any input besides the provided attributes, and generates task descriptions for embedding-related tasks using a pre-defined prompt based on the category
attribute. The category
attribute should be one of the following:
- `text-retrieval`: Generate task descriptions for text retrieval tasks.\n- `text-matching-short`: Generate task descriptions for short text matching tasks.\n- `text-matching-long`: Generate task descriptions for long text matching tasks.\n- `text-classification`: Generate task descriptions for text classification tasks.\n
"},{"location":"components-gallery/tasks/embeddingtaskgenerator/#attributes","title":"Attributes","text":"category: The category of the task to be generated, which can either be text-retrieval
, text-matching-short
, text-matching-long
, or text-classification
.
flatten_tasks: Whether to flatten the tasks i.e. since a list of tasks is generated by the LLM
, this attribute indicates whether to flatten the list or not. Defaults to False
, meaning that running this task with num_generations=1
will return a distilabel.Distiset
with one row only containing a list with around 20 tasks; otherwise, if set to True
, it will return a distilabel.Distiset
with around 20 rows, each containing one task.
graph TD\n subgraph Dataset\n subgraph New columns\n OCOL0[tasks]\n OCOL1[task]\n OCOL2[model_name]\n end\n end\n\n subgraph EmbeddingTaskGenerator\n StepOutput[Output Columns: tasks, task, model_name]\n end\n\n StepOutput --> OCOL0\n StepOutput --> OCOL1\n StepOutput --> OCOL2\n
"},{"location":"components-gallery/tasks/embeddingtaskgenerator/#outputs","title":"Outputs","text":"tasks (List[str]
): the list of tasks generated by the LLM
.
task (str
): the task generated by the LLM
if flatten_tasks=True
.
model_name (str
): the name of the model used to generate the tasks.
from distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import EmbeddingTaskGenerator\n\nwith Pipeline(\"my-pipeline\") as pipeline:\n task = EmbeddingTaskGenerator(\n category=\"text-retrieval\",\n flatten_tasks=True,\n llm=..., # LLM instance\n )\n\n ...\n\n task >> ...\n
"},{"location":"components-gallery/tasks/embeddingtaskgenerator/#references","title":"References","text":"AnthropicLLM
Anthropic LLM implementation running the Async API client.
AnthropicLLM
OpenAILLM
OpenAI LLM implementation running the async API client.
OpenAILLM
AnyscaleLLM
Anyscale LLM implementation running the async API client of OpenAI.
AnyscaleLLM
AzureOpenAILLM
Azure OpenAI LLM implementation running the async API client.
AzureOpenAILLM
TogetherLLM
TogetherLLM LLM implementation running the async API client of OpenAI.
TogetherLLM
ClientvLLM
A client for the vLLM
server implementing the OpenAI API specification.
ClientvLLM
CohereLLM
Cohere API implementation using the async client for concurrent text generation.
CohereLLM
GroqLLM
Groq API implementation using the async client for concurrent text generation.
GroqLLM
InferenceEndpointsLLM
InferenceEndpoints LLM implementation running the async API client.
InferenceEndpointsLLM
LiteLLM
LiteLLM implementation running the async API client.
LiteLLM
MistralLLM
Mistral LLM implementation running the async API client.
MistralLLM
MixtureOfAgentsLLM
Mixture-of-Agents
implementation.
MixtureOfAgentsLLM
OllamaLLM
Ollama LLM implementation running the Async API client.
OllamaLLM
VertexAILLM
VertexAI LLM implementation running the async API clients for Gemini.
VertexAILLM
TransformersLLM
Hugging Face transformers
library LLM implementation using the text generation
TransformersLLM
LlamaCppLLM
llama.cpp LLM implementation running the Python bindings for the C++ code.
LlamaCppLLM
vLLM
vLLM
library LLM implementation.
vLLM
Anthropic LLM implementation running the Async API client.
"},{"location":"components-gallery/llms/anthropicllm/#attributes","title":"Attributes","text":"model: the name of the model to use for the LLM e.g. \"claude-3-opus-20240229\", \"claude-3-sonnet-20240229\", etc. Available models can be checked here: Anthropic: Models overview.
api_key: the API key to authenticate the requests to the Anthropic API. If not provided, it will be read from ANTHROPIC_API_KEY
environment variable.
base_url: the base URL to use for the Anthropic API. Defaults to None
which means that https://api.anthropic.com
will be used internally.
timeout: the maximum time in seconds to wait for a response. Defaults to 600.0
.
max_retries: The maximum number of times to retry the request before failing. Defaults to 6
.
http_client: if provided, an alternative HTTP client to use for calling Anthropic API. Defaults to None
.
structured_output: a dictionary containing the structured output configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
_api_key_env_var: the name of the environment variable to use for the API key. It is meant to be used internally.
_aclient: the AsyncAnthropic
client to use for the Anthropic API. It is meant to be used internally. Set in the load
method.
api_key: the API key to authenticate the requests to the Anthropic API. If not provided, it will be read from ANTHROPIC_API_KEY
environment variable.
base_url: the base URL to use for the Anthropic API. Defaults to \"https://api.anthropic.com\"
.
timeout: the maximum time in seconds to wait for a response. Defaults to 600.0
.
max_retries: the maximum number of times to retry the request before failing. Defaults to 6
.
from distilabel.models.llms import AnthropicLLM\n\nllm = AnthropicLLM(model=\"claude-3-opus-20240229\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/anthropicllm/#generate-structured-data","title":"Generate structured data","text":"from pydantic import BaseModel\nfrom distilabel.models.llms import AnthropicLLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = AnthropicLLM(\n model=\"claude-3-opus-20240229\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
"},{"location":"components-gallery/llms/openaillm/","title":"OpenAILLM","text":"OpenAI LLM implementation running the async API client.
"},{"location":"components-gallery/llms/openaillm/#attributes","title":"Attributes","text":"model: the model name to use for the LLM e.g. \"gpt-3.5-turbo\", \"gpt-4\", etc. Supported models can be found here.
base_url: the base URL to use for the OpenAI API requests. Defaults to None
, which means that the value set for the environment variable OPENAI_BASE_URL
will be used, or \"https://api.openai.com/v1\" if not set.
api_key: the API key to authenticate the requests to the OpenAI API. Defaults to None
which means that the value set for the environment variable OPENAI_API_KEY
will be used, or None
if not set.
max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 6
.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120
.
structured_output: a dictionary containing the structured output configuration configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
base_url: the base URL to use for the OpenAI API requests. Defaults to None
.
api_key: the API key to authenticate the requests to the OpenAI API. Defaults to None
.
max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 6
.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120
.
from distilabel.models.llms import OpenAILLM\n\nllm = OpenAILLM(model=\"gpt-4-turbo\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/openaillm/#generate-text-from-a-custom-endpoint-following-the-openai-api","title":"Generate text from a custom endpoint following the OpenAI API","text":"from distilabel.models.llms import OpenAILLM\n\nllm = OpenAILLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n base_url=r\"http://localhost:8080/v1\"\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/openaillm/#generate-structured-data","title":"Generate structured data","text":"from pydantic import BaseModel\nfrom distilabel.models.llms import OpenAILLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = OpenAILLM(\n model=\"gpt-4-turbo\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
"},{"location":"components-gallery/llms/openaillm/#generate-with-batch-api-offline-batch-generation","title":"Generate with Batch API (offline batch generation)","text":"from distilabel.models.llms import OpenAILLM\n\nload = llm = OpenAILLM(\n model=\"gpt-3.5-turbo\",\n use_offline_batch_generation=True,\n offline_batch_generation_block_until_done=5, # poll for results every 5 seconds\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n# [['Hello! How can I assist you today?']]\n
"},{"location":"components-gallery/llms/anyscalellm/","title":"AnyscaleLLM","text":"Anyscale LLM implementation running the async API client of OpenAI.
"},{"location":"components-gallery/llms/anyscalellm/#attributes","title":"Attributes","text":"model: the model name to use for the LLM, e.g., google/gemma-7b-it
. See the supported models under the \"Text Generation -> Supported Models\" section here.
base_url: the base URL to use for the Anyscale API requests. Defaults to None
, which means that the value set for the environment variable ANYSCALE_BASE_URL
will be used, or \"https://api.endpoints.anyscale.com/v1\" if not set.
api_key: the API key to authenticate the requests to the Anyscale API. Defaults to None
which means that the value set for the environment variable ANYSCALE_API_KEY
will be used, or None
if not set.
_api_key_env_var: the name of the environment variable to use for the API key. It is meant to be used internally.
from distilabel.models.llms import AnyscaleLLM\n\nllm = AnyscaleLLM(model=\"google/gemma-7b-it\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/azureopenaillm/","title":"AzureOpenAILLM","text":"Azure OpenAI LLM implementation running the async API client.
"},{"location":"components-gallery/llms/azureopenaillm/#attributes","title":"Attributes","text":"model: the model name to use for the LLM i.e. the name of the Azure deployment.
base_url: the base URL to use for the Azure OpenAI API can be set with AZURE_OPENAI_ENDPOINT
. Defaults to None
which means that the value set for the environment variable AZURE_OPENAI_ENDPOINT
will be used, or None
if not set.
api_key: the API key to authenticate the requests to the Azure OpenAI API. Defaults to None
which means that the value set for the environment variable AZURE_OPENAI_API_KEY
will be used, or None
if not set.
api_version: the API version to use for the Azure OpenAI API. Defaults to None
which means that the value set for the environment variable OPENAI_API_VERSION
will be used, or None
if not set.
from distilabel.models.llms import AzureOpenAILLM\n\nllm = AzureOpenAILLM(model=\"gpt-4-turbo\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/azureopenaillm/#generate-text-from-a-custom-endpoint-following-the-openai-api","title":"Generate text from a custom endpoint following the OpenAI API","text":"from distilabel.models.llms import AzureOpenAILLM\n\nllm = AzureOpenAILLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n base_url=r\"http://localhost:8080/v1\"\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/azureopenaillm/#generate-structured-data","title":"Generate structured data","text":"from pydantic import BaseModel\nfrom distilabel.models.llms import AzureOpenAILLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = AzureOpenAILLM(\n model=\"gpt-4-turbo\",\n api_key=\"api.key\",\n structured_output={\"schema\": User}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
"},{"location":"components-gallery/llms/togetherllm/","title":"TogetherLLM","text":"TogetherLLM LLM implementation running the async API client of OpenAI.
"},{"location":"components-gallery/llms/togetherllm/#attributes","title":"Attributes","text":"model: the model name to use for the LLM e.g. \"mistralai/Mixtral-8x7B-Instruct-v0.1\". Supported models can be found here.
base_url: the base URL to use for the Together API can be set with TOGETHER_BASE_URL
. Defaults to None
which means that the value set for the environment variable TOGETHER_BASE_URL
will be used, or \"https://api.together.xyz/v1\" if not set.
api_key: the API key to authenticate the requests to the Together API. Defaults to None
which means that the value set for the environment variable TOGETHER_API_KEY
will be used, or None
if not set.
_api_key_env_var: the name of the environment variable to use for the API key. It is meant to be used internally.
from distilabel.models.llms import AnyscaleLLM\n\nllm = TogetherLLM(model=\"mistralai/Mixtral-8x7B-Instruct-v0.1\", api_key=\"api.key\")\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/clientvllm/","title":"ClientvLLM","text":"A client for the vLLM
server implementing the OpenAI API specification.
base_url: the base URL of the vLLM
server. Defaults to \"http://localhost:8000\"
.
max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 6
.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120
.
httpx_client_kwargs: extra kwargs that will be passed to the httpx.AsyncClient
created to comunicate with the vLLM
server. Defaults to None
.
tokenizer: the Hugging Face Hub repo id or path of the tokenizer that will be used to apply the chat template and tokenize the inputs before sending it to the server. Defaults to None
.
tokenizer_revision: the revision of the tokenizer to load. Defaults to None
.
_aclient: the httpx.AsyncClient
used to comunicate with the vLLM
server. Defaults to None
.
base_url: the base url of the vLLM
server. Defaults to \"http://localhost:8000\"
.
max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 6
.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120
.
httpx_client_kwargs: extra kwargs that will be passed to the httpx.AsyncClient
created to comunicate with the vLLM
server. Defaults to None
.
from distilabel.models.llms import ClientvLLM\n\nllm = ClientvLLM(\n base_url=\"http://localhost:8000/v1\",\n tokenizer=\"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n)\n\nllm.load()\n\nresults = llm.generate_outputs(\n inputs=[[{\"role\": \"user\", \"content\": \"Hello, how are you?\"}]],\n temperature=0.7,\n top_p=1.0,\n max_new_tokens=256,\n)\n# [\n# [\n# \"I'm functioning properly, thank you for asking. How can I assist you today?\",\n# \"I'm doing well, thank you for asking. I'm a large language model, so I don't have feelings or emotions like humans do, but I'm here to help answer any questions or provide information you might need. How can I assist you today?\",\n# \"I'm just a computer program, so I don't have feelings like humans do, but I'm functioning properly and ready to help you with any questions or tasks you have. What's on your mind?\"\n# ]\n# ]\n
"},{"location":"components-gallery/llms/coherellm/","title":"CohereLLM","text":"Cohere API implementation using the async client for concurrent text generation.
"},{"location":"components-gallery/llms/coherellm/#attributes","title":"Attributes","text":"model: the name of the model from the Cohere API to use for the generation.
base_url: the base URL to use for the Cohere API requests. Defaults to \"https://api.cohere.ai/v1\"
.
api_key: the API key to authenticate the requests to the Cohere API. Defaults to the value of the COHERE_API_KEY
environment variable.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120
.
client_name: the name of the client to use for the API requests. Defaults to \"distilabel\"
.
structured_output: a dictionary containing the structured output configuration configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
_ChatMessage: the ChatMessage
class from the cohere
package.
_aclient: the AsyncClient
client from the cohere
package.
base_url: the base URL to use for the Cohere API requests. Defaults to \"https://api.cohere.ai/v1\"
.
api_key: the API key to authenticate the requests to the Cohere API. Defaults to the value of the COHERE_API_KEY
environment variable.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120
.
client_name: the name of the client to use for the API requests. Defaults to \"distilabel\"
.
from distilabel.models.llms import CohereLLM\n\nllm = CohereLLM(model=\"CohereForAI/c4ai-command-r-plus\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n\nGenerate structured data:\n
"},{"location":"components-gallery/llms/groqllm/","title":"GroqLLM","text":"Groq API implementation using the async client for concurrent text generation.
"},{"location":"components-gallery/llms/groqllm/#attributes","title":"Attributes","text":"model: the name of the model from the Groq API to use for the generation.
base_url: the base URL to use for the Groq API requests. Defaults to \"https://api.groq.com\"
.
api_key: the API key to authenticate the requests to the Groq API. Defaults to the value of the GROQ_API_KEY
environment variable.
max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 2
.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120
.
structured_output: a dictionary containing the structured output configuration configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
_api_key_env_var: the name of the environment variable to use for the API key.
_aclient: the AsyncGroq
client from the groq
package.
base_url: the base URL to use for the Groq API requests. Defaults to \"https://api.groq.com\"
.
api_key: the API key to authenticate the requests to the Groq API. Defaults to the value of the GROQ_API_KEY
environment variable.
max_retries: the maximum number of times to retry the request to the API before failing. Defaults to 2
.
timeout: the maximum time in seconds to wait for a response from the API. Defaults to 120
.
from distilabel.models.llms import GroqLLM\n\nllm = GroqLLM(model=\"llama3-70b-8192\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n\nGenerate structured data:\n
"},{"location":"components-gallery/llms/inferenceendpointsllm/","title":"InferenceEndpointsLLM","text":"InferenceEndpoints LLM implementation running the async API client.
This LLM will internally use huggingface_hub.AsyncInferenceClient
.
model_id: the model ID to use for the LLM as available in the Hugging Face Hub, which will be used to resolve the base URL for the serverless Inference Endpoints API requests. Defaults to None
.
endpoint_name: the name of the Inference Endpoint to use for the LLM. Defaults to None
.
endpoint_namespace: the namespace of the Inference Endpoint to use for the LLM. Defaults to None
.
base_url: the base URL to use for the Inference Endpoints API requests.
api_key: the API key to authenticate the requests to the Inference Endpoints API.
tokenizer_id: the tokenizer ID to use for the LLM as available in the Hugging Face Hub. Defaults to None
, but defining one is recommended to properly format the prompt.
model_display_name: the model display name to use for the LLM. Defaults to None
.
use_magpie_template: a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template: the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
structured_output: a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/inferenceendpointsllm/#dedicated-inference-endpoints","title":"Dedicated Inference Endpoints","text":"from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n endpoint_name=\"<ENDPOINT_NAME>\",\n api_key=\"<HF_API_KEY>\",\n endpoint_namespace=\"<USER|ORG>\",\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/inferenceendpointsllm/#dedicated-inference-endpoints-or-tgi","title":"Dedicated Inference Endpoints or TGI","text":"from distilabel.models.llms.huggingface import InferenceEndpointsLLM\n\nllm = InferenceEndpointsLLM(\n api_key=\"<HF_API_KEY>\",\n base_url=\"<BASE_URL>\",\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/inferenceendpointsllm/#generate-structured-data","title":"Generate structured data","text":"from pydantic import BaseModel\nfrom distilabel.models.llms import InferenceEndpointsLLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n api_key=\"api.key\",\n structured_output={\"format\": \"json\", \"schema\": User.model_json_schema()}\n)\n\nllm.load()\n\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the Tour De France\"}]])\n
"},{"location":"components-gallery/llms/litellm/","title":"LiteLLM","text":"LiteLLM implementation running the async API client.
"},{"location":"components-gallery/llms/litellm/#attributes","title":"Attributes","text":"model: the model name to use for the LLM e.g. \"gpt-3.5-turbo\" or \"mistral/mistral-large\", etc.
verbose: whether to log the LiteLLM client's logs. Defaults to False
.
structured_output: a dictionary containing the structured output configuration configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
False
.from distilabel.models.llms import LiteLLM\n\nllm = LiteLLM(model=\"gpt-3.5-turbo\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n\nGenerate structured data:\n
"},{"location":"components-gallery/llms/mistralllm/","title":"MistralLLM","text":"Mistral LLM implementation running the async API client.
"},{"location":"components-gallery/llms/mistralllm/#attributes","title":"Attributes","text":"model: the model name to use for the LLM e.g. \"mistral-tiny\", \"mistral-large\", etc.
endpoint: the endpoint to use for the Mistral API. Defaults to \"https://api.mistral.ai\".
api_key: the API key to authenticate the requests to the Mistral API. Defaults to None
which means that the value set for the environment variable OPENAI_API_KEY
will be used, or None
if not set.
max_retries: the maximum number of retries to attempt when a request fails. Defaults to 5
.
timeout: the maximum time in seconds to wait for a response. Defaults to 120
.
max_concurrent_requests: the maximum number of concurrent requests to send. Defaults to 64
.
structured_output: a dictionary containing the structured output configuration configuration using instructor
. You can take a look at the dictionary structure in InstructorStructuredOutputType
from distilabel.steps.tasks.structured_outputs.instructor
.
_api_key_env_var: the name of the environment variable to use for the API key. It is meant to be used internally.
_aclient: the Mistral
to use for the Mistral API. It is meant to be used internally. Set in the load
method.
api_key: the API key to authenticate the requests to the Mistral API.
max_retries: the maximum number of retries to attempt when a request fails. Defaults to 5
.
timeout: the maximum time in seconds to wait for a response. Defaults to 120
.
max_concurrent_requests: the maximum number of concurrent requests to send. Defaults to 64
.
from distilabel.models.llms import MistralLLM\n\nllm = MistralLLM(model=\"open-mixtral-8x22b\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n\nGenerate structured data:\n
"},{"location":"components-gallery/llms/mixtureofagentsllm/","title":"MixtureOfAgentsLLM","text":"Mixture-of-Agents
implementation.
An LLM
class that leverages LLM
s collective strenghts to generate a response, as described in the \"Mixture-of-Agents Enhances Large Language model Capabilities\" paper. There is a list of LLM
s proposing/generating outputs that LLM
s from the next round/layer can use as auxiliary information. Finally, there is an LLM
that aggregates the outputs to generate the final response.
aggregator_llm: The LLM
that aggregates the outputs of the proposer LLM
s.
proposers_llms: The list of LLM
s that propose outputs to be aggregated.
rounds: The number of layers or rounds that the proposers_llms
will generate outputs. Defaults to 1
.
from distilabel.models.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM\n\nllm = MixtureOfAgentsLLM(\n aggregator_llm=InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n proposers_llms=[\n InferenceEndpointsLLM(\n model_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n tokenizer_id=\"meta-llama/Meta-Llama-3-70B-Instruct\",\n ),\n InferenceEndpointsLLM(\n model_id=\"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO\",\n tokenizer_id=\"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO\",\n ),\n InferenceEndpointsLLM(\n model_id=\"HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1\",\n tokenizer_id=\"HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1\",\n ),\n ],\n rounds=2,\n)\n\nllm.load()\n\noutput = llm.generate_outputs(\n inputs=[\n [\n {\n \"role\": \"user\",\n \"content\": \"My favorite witty review of The Rings of Power series is this: Input:\",\n }\n ]\n ]\n)\n
"},{"location":"components-gallery/llms/mixtureofagentsllm/#references","title":"References","text":"Ollama LLM implementation running the Async API client.
"},{"location":"components-gallery/llms/ollamallm/#attributes","title":"Attributes","text":"model: the model name to use for the LLM e.g. \"notus\".
host: the Ollama server host.
timeout: the timeout for the LLM. Defaults to 120
.
follow_redirects: whether to follow redirects. Defaults to True
.
structured_output: a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
tokenizer_id: the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer config files. If not provided, the one associated to the model
will be used. Defaults to None
.
use_magpie_template: a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template: the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
_aclient: the AsyncClient
to use for the Ollama API. It is meant to be used internally. Set in the load
method.
host: the Ollama server host.
timeout: the client timeout for the Ollama API. Defaults to 120
.
from distilabel.models.llms import OllamaLLM\n\nllm = OllamaLLM(model=\"llama3\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/vertexaillm/","title":"VertexAILLM","text":"VertexAI LLM implementation running the async API clients for Gemini.
Gemini API: https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini
To use the VertexAILLM
is necessary to have configured the Google Cloud authentication using one of these methods:
GOOGLE_CLOUD_CREDENTIALS
environment variablegcloud auth application-default login
commandvertexai.init
function from the google-cloud-aiplatform
librarymodel: the model name to use for the LLM e.g. \"gemini-1.0-pro\". Supported models.
_aclient: the GenerativeModel
to use for the Vertex AI Gemini API. It is meant to be used internally. Set in the load
method.
from distilabel.models.llms import VertexAILLM\n\nllm = VertexAILLM(model=\"gemini-1.5-pro\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/transformersllm/","title":"TransformersLLM","text":"Hugging Face transformers
library LLM implementation using the text generation
pipeline.
"},{"location":"components-gallery/llms/transformersllm/#attributes","title":"Attributes","text":"model: the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
revision: if model
refers to a Hugging Face Hub repository, then the revision (e.g. a branch name or a commit id) to use. Defaults to \"main\"
.
torch_dtype: the torch dtype to use for the model e.g. \"float16\", \"float32\", etc. Defaults to \"auto\"
.
trust_remote_code: whether to allow fetching and executing remote code fetched from the repository in the Hub. Defaults to False
.
model_kwargs: additional dictionary of keyword arguments that will be passed to the from_pretrained
method of the model.
tokenizer: the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer config files. If not provided, the one associated to the model
will be used. Defaults to None
.
use_fast: whether to use a fast tokenizer or not. Defaults to True
.
chat_template: a chat template that will be used to build the prompts before sending them to the model. If not provided, the chat template defined in the tokenizer config will be used. If not provided and the tokenizer doesn't have a chat template, then ChatML template will be used. Defaults to None
.
device: the name or index of the device where the model will be loaded. Defaults to None
.
device_map: a dictionary mapping each layer of the model to a device, or a mode like \"sequential\"
or \"auto\"
. Defaults to None
.
token: the Hugging Face Hub token that will be used to authenticate to the Hugging Face Hub. If not provided, the HF_TOKEN
environment or huggingface_hub
package local configuration will be used. Defaults to None
.
structured_output: a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
use_magpie_template: a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template: the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
from distilabel.models.llms import TransformersLLM\n\nllm = TransformersLLM(model=\"microsoft/Phi-3-mini-4k-instruct\")\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/llamacppllm/","title":"LlamaCppLLM","text":"llama.cpp LLM implementation running the Python bindings for the C++ code.
"},{"location":"components-gallery/llms/llamacppllm/#attributes","title":"Attributes","text":"model_path: contains the path to the GGUF quantized model, compatible with the installed version of the llama.cpp
Python bindings.
n_gpu_layers: the number of layers to use for the GPU. Defaults to -1
, meaning that the available GPU device will be used.
chat_format: the chat format to use for the model. Defaults to None
, which means the Llama format will be used.
n_ctx: the context size to use for the model. Defaults to 512
.
n_batch: the prompt processing maximum batch size to use for the model. Defaults to 512
.
seed: random seed to use for the generation. Defaults to 4294967295
.
verbose: whether to print verbose output. Defaults to False
.
structured_output: a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
extra_kwargs: additional dictionary of keyword arguments that will be passed to the Llama
class of llama_cpp
library. Defaults to {}
.
tokenizer_id: the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer config files. If not provided, the one associated to the model
will be used. Defaults to None
.
use_magpie_template: a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template: the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
_model: the Llama model instance. This attribute is meant to be used internally and should not be accessed directly. It will be set in the load
method.
model_path: the path to the GGUF quantized model.
n_gpu_layers: the number of layers to use for the GPU. Defaults to -1
.
chat_format: the chat format to use for the model. Defaults to None
.
verbose: whether to print verbose output. Defaults to False
.
extra_kwargs: additional dictionary of keyword arguments that will be passed to the Llama
class of llama_cpp
library. Defaults to {}
.
from pathlib import Path\nfrom distilabel.models.llms import LlamaCppLLM\n\n# You can follow along this example downloading the following model running the following\n# command in the terminal, that will download the model to the `Downloads` folder:\n# curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf\n\nmodel_path = \"Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf\"\n\nllm = LlamaCppLLM(\n model_path=str(Path.home() / model_path),\n n_gpu_layers=-1, # To use the GPU if available\n n_ctx=1024, # Set the context size\n)\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/llamacppllm/#generate-structured-data","title":"Generate structured data","text":"from pathlib import Path\nfrom distilabel.models.llms import LlamaCppLLM\n\nmodel_path = \"Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf\"\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = LlamaCppLLM(\n model_path=str(Path.home() / model_path), # type: ignore\n n_gpu_layers=-1,\n n_ctx=1024,\n structured_output={\"format\": \"json\", \"schema\": Character},\n)\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
"},{"location":"components-gallery/llms/llamacppllm/#references","title":"References","text":"llama.cpp
llama-cpp-python
vLLM
library LLM implementation.
model: the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
dtype: the data type to use for the model. Defaults to auto
.
trust_remote_code: whether to trust the remote code when loading the model. Defaults to False
.
quantization: the quantization mode to use for the model. Defaults to None
.
revision: the revision of the model to load. Defaults to None
.
tokenizer: the tokenizer Hugging Face Hub repo id or a path to a directory containing the tokenizer files. If not provided, the tokenizer will be loaded from the model directory. Defaults to None
.
tokenizer_mode: the mode to use for the tokenizer. Defaults to auto
.
tokenizer_revision: the revision of the tokenizer to load. Defaults to None
.
skip_tokenizer_init: whether to skip the initialization of the tokenizer. Defaults to False
.
chat_template: a chat template that will be used to build the prompts before sending them to the model. If not provided, the chat template defined in the tokenizer config will be used. If not provided and the tokenizer doesn't have a chat template, then ChatML template will be used. Defaults to None
.
structured_output: a dictionary containing the structured output configuration or if more fine-grained control is needed, an instance of OutlinesStructuredOutput
. Defaults to None.
seed: the seed to use for the random number generator. Defaults to 0
.
extra_kwargs: additional dictionary of keyword arguments that will be passed to the LLM
class of vllm
library. Defaults to {}
.
_model: the vLLM
model instance. This attribute is meant to be used internally and should not be accessed directly. It will be set in the load
method.
_tokenizer: the tokenizer instance used to format the prompt before passing it to the LLM
. This attribute is meant to be used internally and should not be accessed directly. It will be set in the load
method.
use_magpie_template: a flag used to enable/disable applying the Magpie pre-query template. Defaults to False
.
magpie_pre_query_template: the pre-query template to be applied to the prompt or sent to the LLM to generate an instruction or a follow up user message. Valid values are \"llama3\", \"qwen2\" or another pre-query template provided. Defaults to None
.
LLM
class of vllm
library.from distilabel.models.llms import vLLM\n\n# You can pass a custom chat_template to the model\nllm = vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\",\n chat_template=\"[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]\",\n)\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Hello world!\"}]])\n
"},{"location":"components-gallery/llms/vllm/#generate-structured-data","title":"Generate structured data","text":"from pathlib import Path\nfrom distilabel.models.llms import vLLM\n\nclass User(BaseModel):\n name: str\n last_name: str\n id: int\n\nllm = vLLM(\n model=\"prometheus-eval/prometheus-7b-v2.0\"\n structured_output={\"format\": \"json\", \"schema\": Character},\n)\n\nllm.load()\n\n# Call the model\noutput = llm.generate_outputs(inputs=[[{\"role\": \"user\", \"content\": \"Create a user profile for the following marathon\"}]])\n
"},{"location":"components-gallery/embeddings/","title":"Embeddings Gallery","text":"SentenceTransformerEmbeddings
sentence-transformers
library implementation for embedding generation.
SentenceTransformerEmbeddings
vLLMEmbeddings
vllm
library implementation for embedding generation.
vLLMEmbeddings
sentence-transformers
library implementation for embedding generation.
model: the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
device: the name of the device used to load the model e.g. \"cuda\", \"mps\", etc. Defaults to None
.
prompts: a dictionary containing prompts to be used with the model. Defaults to None
.
default_prompt_name: the default prompt (in prompts
) that will be applied to the inputs. If not provided, then no prompt will be used. Defaults to None
.
trust_remote_code: whether to allow fetching and executing remote code fetched from the repository in the Hub. Defaults to False
.
revision: if model
refers to a Hugging Face Hub repository, then the revision (e.g. a branch name or a commit id) to use. Defaults to \"main\"
.
token: the Hugging Face Hub token that will be used to authenticate to the Hugging Face Hub. If not provided, the HF_TOKEN
environment or huggingface_hub
package local configuration will be used. Defaults to None
.
truncate_dim: the dimension to truncate the sentence embeddings. Defaults to None
.
model_kwargs: extra kwargs that will be passed to the Hugging Face transformers
model class. Defaults to None
.
tokenizer_kwargs: extra kwargs that will be passed to the Hugging Face transformers
tokenizer class. Defaults to None
.
config_kwargs: extra kwargs that will be passed to the Hugging Face transformers
configuration class. Defaults to None
.
precision: the dtype that will have the resulting embeddings. Defaults to \"float32\"
.
normalize_embeddings: whether to normalize the embeddings so they have a length of 1. Defaults to None
.
from distilabel.models import SentenceTransformerEmbeddings\n\nembeddings = SentenceTransformerEmbeddings(model=\"mixedbread-ai/mxbai-embed-large-v1\")\n\nembeddings.load()\n\nresults = embeddings.encode(inputs=[\"distilabel is awesome!\", \"and Argilla!\"])\n# [\n# [-0.05447685346007347, -0.01623094454407692, ...],\n# [4.4889533455716446e-05, 0.044016145169734955, ...],\n# ]\n
"},{"location":"components-gallery/embeddings/vllmembeddings/","title":"vLLMEmbeddings","text":"vllm
library implementation for embedding generation.
model: the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.
dtype: the data type to use for the model. Defaults to auto
.
trust_remote_code: whether to trust the remote code when loading the model. Defaults to False
.
quantization: the quantization mode to use for the model. Defaults to None
.
revision: the revision of the model to load. Defaults to None
.
enforce_eager: whether to enforce eager execution. Defaults to True
.
seed: the seed to use for the random number generator. Defaults to 0
.
extra_kwargs: additional dictionary of keyword arguments that will be passed to the LLM
class of vllm
library. Defaults to {}
.
_model: the vLLM
model instance. This attribute is meant to be used internally and should not be accessed directly. It will be set in the load
method.
from distilabel.models import vLLMEmbeddings\n\nembeddings = vLLMEmbeddings(model=\"intfloat/e5-mistral-7b-instruct\")\n\nembeddings.load()\n\nresults = embeddings.encode(inputs=[\"distilabel is awesome!\", \"and Argilla!\"])\n# [\n# [-0.05447685346007347, -0.01623094454407692, ...],\n# [4.4889533455716446e-05, 0.044016145169734955, ...],\n# ]\n
"},{"location":"components-gallery/embeddings/vllmembeddings/#references","title":"References","text":"mlx-lm
integrationmlx-lm
integration