docs: 287 docs shorten titles tutorials and 277 update core example (#289)

* docs: shorten doc titles
docs: add core example

* docs: update example on index page

* docs: resolved bug in Google Colab

* docs: removed plugs in tutorials

* docs: updated some formatting
davidberenstein1957 authored Jan 24, 2024
1 parent 2ec5900 commit 40c0b45
Showing 5 changed files with 107 additions and 142 deletions.
4 changes: 2 additions & 2 deletions docs/index.md
@@ -40,9 +40,9 @@ will create a `labeller` LLM using `OpenAILLM` with the `UltraFeedback` task for
!!! note
To run the script successfully, ensure you have assigned your OpenAI API key to the `OPENAI_API_KEY` environment variable.

For a more complete example, check out our awesome notebook on Google Colab:
For a more complete example, check out our awesome [tutorials](./tutorials/pipeline-notus-instructions-preferences-legal/) or the example below:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1rO1-OlLFPBC0KPuXQOeMpZOeajiwNoMy?usp=sharing)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/argilla-io/distilabel/blob/main/docs/tutorials/pipeline-notus-instructions-preferences-legal.ipynb) [![Open Source in Github](https://img.shields.io/badge/github-view%20source-black.svg)](https://github.com/argilla-io/distilabel/blob/main/docs/tutorials/pipeline-notus-instructions-preferences-legal.ipynb)

## Navigation

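The updated index page points readers to "the example below", which is collapsed in this diff. As a rough sketch of what a `labeller` pipeline built on `OpenAILLM` and the `UltraFeedback` task can look like (the dataset name, column layout, and generation parameters here are assumptions for illustration, not the exact snippet shipped with the docs):

```python
import os

from datasets import load_dataset
from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import UltraFeedbackTask

# The docs expect the key in this environment variable; the literal value is a placeholder.
os.environ.setdefault("OPENAI_API_KEY", "sk-...")

# Hypothetical input: a dataset that already has an "input" column and candidate "generations".
dataset = load_dataset("your-hf-user/your-preference-dataset", split="train")

# Labeller LLM that rates the existing generations with the UltraFeedback rubric.
labeller = OpenAILLM(
    task=UltraFeedbackTask.for_instruction_following(),
    max_new_tokens=256,
)

pipeline = Pipeline(labeller=labeller)
labelled_dataset = pipeline.generate(dataset)
```

The task variant (`for_instruction_following`) and generation arguments are one possible configuration; the linked tutorials show the full, tested example.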
135 changes: 53 additions & 82 deletions docs/tutorials/clean-preference-dataset-judgelm-gpt.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Clean a Preference Dataset with `JudgeLMTask` and `GPT4-turbo`\n",
"# 🧼 Clean an existing preference dataset\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/argilla-io/distilabel/blob/main/docs/tutorials/clean-preference-dataset-judgelm-gpt.ipynb) [![Open Source in Github](https://img.shields.io/badge/github-view%20source-black.svg)](https://github.com/argilla-io/distilabel/blob/main/docs/tutorials/clean-preference-dataset-judgelm-gpt.ipynb)"
]
@@ -50,35 +50,6 @@
"## Getting Started"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Running Argilla\n",
"\n",
"For this tutorial, you can use Argilla to visualize and annotate the dataset cleaned by distilabel. There are two main options for deploying and running Argilla:\n",
"\n",
"**Deploy Argilla on Hugging Face Spaces:** If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:\n",
"\n",
"[![deploy on spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-lg.svg)](https://huggingface.co/new-space?template=argilla/argilla-template-space)\n",
"\n",
"For details about configuring your deployment, check the [official Hugging Face Hub guide](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla).\n",
"\n",
"**Launch Argilla using Argilla's quickstart Docker image**: This is the recommended option if you want [Argilla running on your local machine](../../getting_started/quickstart.ipynb). Note that this option will only let you run the tutorial locally and not with an external notebook service.\n",
"\n",
"For more information on deployment options, please check the Deployment section of the documentation.\n",
"\n",
"<div class=\"alert alert-info\">\n",
"\n",
"Tip\n",
"\n",
"This tutorial is a Jupyter Notebook. There are two options to run it:\n",
"\n",
"- Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.\n",
"- Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.\n",
"</div>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -99,7 +70,7 @@
"metadata": {},
"outputs": [],
"source": [
"%pip install \"distilabel[openai,argilla]\" --upgrade"
"%pip install \"distilabel[openai]\" --upgrade"
]
},
{
@@ -125,7 +96,6 @@
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.metrics.pairwise import cosine_similarity\n",
"\n",
"import argilla as rg\n",
"from datasets import load_dataset\n",
"from distilabel.llm import OpenAILLM\n",
"from distilabel.pipeline import Pipeline\n",
@@ -137,56 +107,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to init the Argilla client with the `URL` and `API_KEY`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Replace api_url with the url to your HF Spaces URL if using Spaces\n",
"# Replace api_key if you configured a custom API key\n",
"rg.init(\n",
" api_url=\"http://localhost:6900\",\n",
" api_key=\"owner.apikey\",\n",
" workspace=\"admin\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you’re running a private Hugging Face Space, you will also need to set the [HF_TOKEN](https://huggingface.co/settings/tokens) as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# # Set the HF_TOKEN environment variable\n",
"# import os\n",
"# os.environ['HF_TOKEN'] = \"your-hf-token\"\n",
"### Environment variables\n",
"\n",
"# # Replace api_url with the url to your HF Spaces URL\n",
"# # Replace api_key if you configured a custom API key\n",
"# # Replace workspace with the name of your workspace\n",
"# rg.init(\n",
"# api_url=\"https://[your-owner-name]-[your_space_name].hf.space\",\n",
"# api_key=\"owner.apikey\",\n",
"# workspace=\"admin\",\n",
"# extra_headers={\"Authorization\": f\"Bearer {os.environ['HF_TOKEN']}\"},\n",
"# )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we will also need to provide a HF_TOKEN and the OPENAI_API_KEY to run the distilabel pipeline."
]
},
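A minimal sketch of how those two secrets could be supplied from inside the notebook (prompting interactively instead of hard-coding them is just one option; the variable names come from the tutorial text):

```python
import os
from getpass import getpass

# Prompt for the tokens only if they are not already set in the environment.
os.environ["HF_TOKEN"] = os.environ.get("HF_TOKEN") or getpass("Hugging Face token: ")
os.environ["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY") or getpass("OpenAI API key: ")
```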
@@ -505,7 +427,56 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, if you want to further filter and curate the dataset, you can push the dataset to [Argilla](https://github.com/argilla-io/argilla) as follows:"
"## Human Feedback with Argilla\n",
"\n",
"You can use the AI Feedback created by distilabel directly but we hae ve seen that enhancing it with human feedback will improve the quality of your LLM. We provide a `to_argilla` method which creates a dataset for Argilla along with out-of-the-box tailored metadata filters and semantic search to allow you to provide human feedback as quickly and engaging as possible. You can check [the Argilla docs](https://docs.argilla.io/en/latest/getting_started/quickstart_installation.html) to get it up and running."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, install it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"distilabel[argilla]\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to init the Argilla client with the URL and API_KEY:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import argilla as rg\n",
"\n",
"# Replace api_url with the url to your HF Spaces URL if using Spaces\n",
"# Replace api_key if you configured a custom API key\n",
"rg.init(\n",
" api_url=\"http://localhost:6900\",\n",
" api_key=\"owner.apikey\",\n",
" workspace=\"admin\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can now push the dataset to Argilla as follows:"
]
},
{
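The cell that performs the push is collapsed in this diff. Assuming the labelled output of the pipeline is stored in a variable named `dataset` (an assumption for illustration), the conversion and push might look roughly like:

```python
# Convert the distilabel dataset into an Argilla FeedbackDataset and push it for review.
# The dataset name and workspace below are placeholders, not values from the notebook.
rg_dataset = dataset.to_argilla()
rg_dataset.push_to_argilla(name="cleaned-preference-dataset", workspace="admin")
```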
1 change: 1 addition & 0 deletions docs/tutorials/create-a-math-preference-dataset.ipynb

Large diffs are not rendered by default.

104 changes: 48 additions & 56 deletions docs/tutorials/pipeline-notus-instructions-preferences-legal.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# 🤗 Use Notus on inference endpoints to create a legal preference dataset\n",
"# ⚖️ Create a legal preference dataset\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/argilla-io/distilabel/blob/main/docs/tutorials/pipeline-notus-instructions-preferences-legal.ipynb) [![Open Source in Github](https://img.shields.io/badge/github-view%20source-black.svg)](https://github.com/argilla-io/distilabel/blob/main/docs/tutorials/pipeline-notus-instructions-preferences-legal.ipynb)\n",
"\n",
@@ -40,36 +40,7 @@
"metadata": {},
"outputs": [],
"source": [
"%pip install argilla distilabel \"farm-haystack[preprocessing]\" pip install \"distilabel[hf-inference-endpoints]\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Running Argilla\n",
"\n",
"For this tutorial, you can use Argilla to visualize and annotate the different datasets created by distilabel. There are two main options for deploying and running Argilla:\n",
"\n",
"**Deploy Argilla on Hugging Face Spaces:** If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:\n",
"\n",
"[![deploy on spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-lg.svg)](https://huggingface.co/new-space?template=argilla/argilla-template-space)\n",
"\n",
"For details about configuring your deployment, check the [official Hugging Face Hub guide](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla).\n",
"\n",
"**Launch Argilla using Argilla's quickstart Docker image**: This is the recommended option if you want [Argilla running on your local machine](../../getting_started/quickstart.ipynb). Note that this option will only let you run the tutorial locally and not with an external notebook service.\n",
"\n",
"For more information on deployment options, please check the Deployment section of the documentation.\n",
"\n",
"<div class=\"alert alert-info\">\n",
"\n",
"Tip\n",
"\n",
"This tutorial is a Jupyter Notebook. There are two options to run it:\n",
"\n",
"- Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.\n",
"- Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.\n",
"</div>\n"
"%pip install distilabel \"farm-haystack[preprocessing]\" pip install \"distilabel[hf-inference-endpoints]\""
]
},
{
@@ -90,8 +61,6 @@
"import os\n",
"from typing import Dict\n",
"\n",
"import argilla as rg\n",
"\n",
"from distilabel.llm import InferenceEndpointsLLM\n",
"from distilabel.pipeline import Pipeline, pipeline\n",
"from distilabel.tasks import TextGenerationTask, SelfInstructTask, Prompt\n",
@@ -104,28 +73,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to init the Argilla client with the URL and API_KEY:\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Replace api_url with the url to your HF Spaces URL if using Spaces\n",
"# Replace api_key if you configured a custom API key\n",
"rg.init(\n",
" api_url=\"https://ignacioct-argilla.hf.space\",\n",
" api_key=\"owner.apikey\",\n",
" workspace=\"admin\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Environment variables\n",
"\n",
"Additionally, we need to provide our HuggingFace and OpenAI accest token. To later instatiate an `InferenceEndpointsLLM` object, we need to pass as parameters the HF Inference Endpoint name and the HF namespace. One very convenient way to do so is also through environment variables.\n"
]
},
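A sketch of the environment variables this cell refers to; only `HF_TOKEN` and `OPENAI_API_KEY` are names required by the libraries, while the endpoint-name and namespace variable names below are assumptions for illustration:

```python
import os

os.environ["HF_TOKEN"] = "hf_..."                       # Hugging Face access token (placeholder)
os.environ["OPENAI_API_KEY"] = "sk-..."                 # OpenAI API key (placeholder)
os.environ["HF_ENDPOINT_NAME"] = "my-notus-endpoint"    # assumed variable name
os.environ["HF_NAMESPACE"] = "my-hf-username"           # assumed variable name
```

The last two values would then be read back with `os.environ[...]` when instantiating the `InferenceEndpointsLLM`.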
@@ -801,7 +750,50 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Upload the preference dataset to Argilla to annotate.\n",
"## Human Feedback with Argilla\n",
"\n",
"You can use the AI Feedback created by distilabel directly but we hae ve seen that enhancing it with human feedback will improve the quality of your LLM. We provide a `to_argilla` method which creates a dataset for Argilla along with out-of-the-box tailored metadata filters and semantic search to allow you to provide human feedback as quickly and engaging as possible. You can check [the Argilla docs](https://docs.argilla.io/en/latest/getting_started/quickstart_installation.html) to get it up and running.\n",
"\n",
"First, install it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"distilabel[argilla]\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to init the Argilla client with the URL and API_KEY:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import argilla as rg\n",
"\n",
"# Replace api_url with the url to your HF Spaces URL if using Spaces\n",
"# Replace api_key if you configured a custom API key\n",
"rg.init(\n",
" api_url=\"http://localhost:6900\",\n",
" api_key=\"owner.apikey\",\n",
" workspace=\"admin\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Once our preference dataset has been correctly generated, the Argilla UI is the best tool at our disposal to visualize it and annotate it. As for the instruction dataset, we just have to convert it to an Argilla Feedback Dataset, and push it to Argilla.\n"
]
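As in the previous notebook, the conversion-and-push cell is collapsed here. Under the assumption that the generated preference dataset lives in a variable called `preference_dataset`, the step might look roughly like:

```python
# Convert the preference dataset to an Argilla FeedbackDataset and push it for annotation.
# The variable, name, and workspace are illustrative assumptions.
rg_dataset = preference_dataset.to_argilla()
rg_dataset.push_to_argilla(name="legal-preference-dataset", workspace="admin")
```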
5 changes: 3 additions & 2 deletions mkdocs.yml
@@ -84,8 +84,9 @@ nav:
- Getting started: index.md
- Concepts: concepts.md
- Tutorials:
- Use Notus on inference endpoints to create a legal preference dataset: tutorials/pipeline-notus-instructions-preferences-legal.ipynb
- Clean a Preference Dataset with the JudgeLMTask and GPT4-turbo: tutorials/clean-preference-dataset-judgelm-gpt.ipynb
- ⚖️ Create a legal preference dataset: tutorials/pipeline-notus-instructions-preferences-legal.ipynb
- 🧼 Clean an existing preference dataset: tutorials/clean-preference-dataset-judgelm-gpt.ipynb
- 🧮 Create a mathematical preference dataset : tutorials/create-a-math-preference-dataset.ipynb
- Technical References:
- Concept Guides:
- technical-reference/index.md
