update deployment with API providers #20

Merged: 16 commits, Jan 10, 2025
18 changes: 11 additions & 7 deletions README.md
@@ -28,7 +28,7 @@ hf_oauth_scopes:

## Introduction

- Synthetic Data Generator is a tool that allows you to create high-quality datasets for training and fine-tuning language models. It leverages the power of distilabel and LLMs to generate synthetic data tailored to your specific needs. [The announcement blog](https://huggingface.co/blog/synthetic-data-generator) goes over a practical example of how to use it but you can also wathh the [video](https://www.youtube.com/watch?v=nXjVtnGeEss) to see it in action.
+ Synthetic Data Generator is a tool that allows you to create high-quality datasets for training and fine-tuning language models. It leverages the power of distilabel and LLMs to generate synthetic data tailored to your specific needs. [The announcement blog](https://huggingface.co/blog/synthetic-data-generator) goes over a practical example of how to use it but you can also watch the [video](https://www.youtube.com/watch?v=nXjVtnGeEss) to see it in action.
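As a concrete starting point, a minimal launch sketch pieced together from the example scripts added in this PR (the token value is a placeholder; `HF_TOKEN` is described below):

```python
import os

from synthetic_dataset_generator import launch

os.environ["HF_TOKEN"] = "hf_..."  # your Hugging Face token

launch()  # starts the Gradio app
```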

Supported Tasks:

@@ -76,21 +76,25 @@ launch()

- `HF_TOKEN`: Your [Hugging Face token](https://huggingface.co/settings/tokens/new?ownUserPermissions=repo.content.read&ownUserPermissions=repo.write&globalPermissions=inference.serverless.write&tokenType=fineGrained) to push your datasets to the Hugging Face Hub and generate free completions from Hugging Face Inference Endpoints. You can find some configuration examples in the [examples](examples/) folder.

- Optionally, you can set the following environment variables to customize the generation process.
+ You can set the following environment variables to customize the generation process.

- `MAX_NUM_TOKENS`: The maximum number of tokens to generate, defaults to `2048`.
- `MAX_NUM_ROWS`: The maximum number of rows to generate, defaults to `1000`.
- `DEFAULT_BATCH_SIZE`: The default batch size to use for generating the dataset, defaults to `5`.
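For example, a short sketch that tightens those limits before launching (the values are illustrative, not recommendations):

```python
import os

from synthetic_dataset_generator import launch

os.environ["MAX_NUM_TOKENS"] = "512"  # shorter completions than the 2048 default
os.environ["MAX_NUM_ROWS"] = "100"  # a small trial run instead of 1000 rows
os.environ["DEFAULT_BATCH_SIZE"] = "2"  # smaller batches for slower endpoints

launch()
```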

- Optionally, you can use different models and APIs. For providers outside of Hugging Face, we provide an integration through [LiteLLM](https://docs.litellm.ai/docs/providers).
+ Optionally, you can use different API providers and models.

- - `BASE_URL`: The base URL for any OpenAI compatible API, e.g. `https://api.openai.com/v1/`, `http://127.0.0.1:11434/v1/`.
- - `MODEL`: The model to use for generating the dataset, e.g. `meta-llama/Meta-Llama-3.1-8B-Instruct`, `openai/gpt-4o`, `ollama/llama3.1`.
+ - `MODEL`: The model to use for generating the dataset, e.g. `meta-llama/Meta-Llama-3.1-8B-Instruct`, `gpt-4o`, `llama3.1`.
  - `API_KEY`: The API key to use for the generation API, e.g. `hf_...`, `sk-...`. If not provided, it will default to the provided `HF_TOKEN` environment variable.
+ - `OPENAI_BASE_URL`: The base URL for any OpenAI compatible API, e.g. `https://api.openai.com/v1/`.
+ - `OLLAMA_BASE_URL`: The base URL for any Ollama compatible API, e.g. `http://127.0.0.1:11434/`.
+ - `HUGGINGFACE_BASE_URL`: The base URL for any Hugging Face compatible API, e.g. TGI server or Dedicated Inference Endpoints. If you want to use serverless inference, only set the `MODEL`.
+ - `VLLM_BASE_URL`: The base URL for any VLLM compatible API, e.g. `http://localhost:8000/`.

- SFT and Chat Data generation is only supported with Hugging Face Inference Endpoints , and you can set the following environment variables use it with models other than Llama3 and Qwen2.
+ SFT and Chat Data generation is not supported with OpenAI Endpoints. Additionally, you need to configure it per model family based on their prompt templates using the right `TOKENIZER_ID` and `MAGPIE_PRE_QUERY_TEMPLATE` environment variables.

- - `MAGPIE_PRE_QUERY_TEMPLATE`: Enforce setting the pre-query template for Magpie, which is only supported with Hugging Face Inference Endpoints. Llama3 and Qwen2 are supported out of the box and will use `"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"` and `"<|im_start|>user\n"` respectively. For other models, you can pass a custom pre-query template string.
+ - `TOKENIZER_ID`: The tokenizer ID to use for the magpie pipeline, e.g. `meta-llama/Meta-Llama-3.1-8B-Instruct`.
+ - `MAGPIE_PRE_QUERY_TEMPLATE`: Enforce setting the pre-query template for Magpie, which is only supported with Hugging Face Inference Endpoints. `llama3` and `qwen2` are supported out of the box and will use `"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"` and `"<|im_start|>user\n"`, respectively. For other models, you can pass a custom pre-query template string.
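As a sketch of the custom-template case (the model id and pre-query string below are hypothetical placeholders; check your model's chat template for the real prefix):

```python
import os

from synthetic_dataset_generator import launch

# Hypothetical model family whose chat template opens user turns with "<|user|>\n"
os.environ["HF_TOKEN"] = "hf_..."  # required for Hugging Face Inference Endpoints
os.environ["MODEL"] = "my-org/my-chat-model"  # hypothetical instruct model id
os.environ["TOKENIZER_ID"] = "my-org/my-chat-model"  # hypothetical tokenizer id
os.environ["MAGPIE_PRE_QUERY_TEMPLATE"] = "<|user|>\n"  # custom pre-query string

launch()
```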

Optionally, you can also push your datasets to Argilla for further curation by setting the following environment variables:

@@ -9,7 +9,10 @@
from synthetic_dataset_generator import launch

# Follow https://docs.argilla.io/latest/getting_started/quickstart/ to get your Argilla API key and URL
os.environ["ARGILLA_API_URL"] = "https://[your-owner-name]-[your_space_name].hf.space"
os.environ["ARGILLA_API_KEY"] = "my_api_key"
os.environ["HF_TOKEN"] = "hf_..."
os.environ["ARGILLA_API_URL"] = (
"https://[your-owner-name]-[your_space_name].hf.space" # argilla base url
)
os.environ["ARGILLA_API_KEY"] = "my_api_key" # argilla api key

launch()
14 changes: 0 additions & 14 deletions examples/enforce_mapgie_template copy.py

This file was deleted.

2 changes: 1 addition & 1 deletion examples/fine-tune-modernbert-classifier.ipynb
@@ -530,7 +530,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
"version": "3.11.11"
}
},
"nbformat": 4,
19 changes: 19 additions & 0 deletions examples/hf-dedicated-or-tgi-deployment.py
@@ -0,0 +1,19 @@
# /// script
# requires-python = ">=3.11,<3.12"
# dependencies = [
# "synthetic-dataset-generator",
# ]
# ///
import os

from synthetic_dataset_generator import launch

os.environ["HF_TOKEN"] = "hf_..." # push the data to huggingface
os.environ["HUGGINGFACE_BASE_URL"] = "http://127.0.0.1:3000/" # dedicated endpoint/TGI
os.environ["MAGPIE_PRE_QUERY_TEMPLATE"] = "llama3" # magpie template
os.environ["TOKENIZER_ID"] = (
"meta-llama/Llama-3.1-8B-Instruct" # tokenizer for model hosted on endpoint
)
os.environ["MODEL"] = None # model is linked to endpoint

launch()
15 changes: 15 additions & 0 deletions examples/hf-serverless-deployment.py
@@ -0,0 +1,15 @@
# /// script
# requires-python = ">=3.11,<3.12"
# dependencies = [
# "synthetic-dataset-generator",
# ]
# ///
import os

from synthetic_dataset_generator import launch

os.environ["HF_TOKEN"] = "hf_..." # push the data to huggingface
os.environ["MODEL"] = "meta-llama/Llama-3.1-8B-Instruct" # use instruct model
os.environ["MAGPIE_PRE_QUERY_TEMPLATE"] = "llama3" # use the template for the model

launch()
22 changes: 22 additions & 0 deletions examples/ollama-deployment.py
@@ -0,0 +1,22 @@
# /// script
# requires-python = ">=3.11,<3.12"
# dependencies = [
# "synthetic-dataset-generator",
# ]
# ///
# ollama serve
# ollama run qwen2.5:32b-instruct-q5_K_S
import os

from synthetic_dataset_generator import launch

os.environ["HF_TOKEN"] = "hf_..." # push the data to huggingface
os.environ["OLLAMA_BASE_URL"] = "http://127.0.0.1:11434/" # ollama base url
os.environ["MODEL"] = "qwen2.5:32b-instruct-q5_K_S" # model id
os.environ["TOKENIZER_ID"] = "Qwen/Qwen2.5-32B-Instruct" # tokenizer id
os.environ["MAGPIE_PRE_QUERY_TEMPLATE"] = "qwen2"
os.environ["MAX_NUM_ROWS"] = "10000"
os.environ["DEFAULT_BATCH_SIZE"] = "2"
os.environ["MAX_NUM_TOKENS"] = "1024"

launch()
15 changes: 0 additions & 15 deletions examples/ollama_local.py

This file was deleted.

18 changes: 18 additions & 0 deletions examples/openai-deployment.py
@@ -0,0 +1,18 @@
# /// script
# requires-python = ">=3.11,<3.12"
# dependencies = [
# "synthetic-dataset-generator",
# ]
# ///

import os

from synthetic_dataset_generator import launch

os.environ["HF_TOKEN"] = "hf_..." # push the data to huggingface
os.environ["OPENAI_BASE_URL"] = "https://api.openai.com/v1/" # openai base url
os.environ["API_KEY"] = os.getenv("OPENAI_API_KEY") # openai api key
os.environ["MODEL"] = "gpt-4o" # model id
os.environ["MAGPIE_PRE_QUERY_TEMPLATE"] = None # chat data not supported with OpenAI

launch()
16 changes: 0 additions & 16 deletions examples/openai_local.py

This file was deleted.

21 changes: 21 additions & 0 deletions examples/vllm-deployment.py
@@ -0,0 +1,21 @@
# /// script
# requires-python = ">=3.11,<3.12"
# dependencies = [
# "synthetic-dataset-generator",
# ]
# ///
# vllm serve Qwen/Qwen2.5-1.5B-Instruct
import os

from synthetic_dataset_generator import launch

os.environ["HF_TOKEN"] = "hf_..." # push the data to huggingface
os.environ["VLLM_BASE_URL"] = "http://127.0.0.1:8000/" # vllm base url
os.environ["MODEL"] = "Qwen/Qwen2.5-1.5B-Instruct" # model id
os.environ["TOKENIZER_ID"] = "Qwen/Qwen2.5-1.5B-Instruct" # tokenizer id
os.environ["MAGPIE_PRE_QUERY_TEMPLATE"] = "qwen2"
os.environ["MAX_NUM_ROWS"] = "10000"
os.environ["DEFAULT_BATCH_SIZE"] = "2"
os.environ["MAX_NUM_TOKENS"] = "1024"

launch()
67 changes: 51 additions & 16 deletions pdm.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -18,7 +18,7 @@ readme = "README.md"
license = {text = "Apache 2"}

dependencies = [
"distilabel[hf-inference-endpoints,argilla,outlines,instructor]>=1.4.1,<2.0.0",
"distilabel[argilla,hf-inference-endpoints,hf-transformers,instructor,llama-cpp,ollama,openai,outlines,vllm] @ git+https://github.com/argilla-io/distilabel.git@develop",
"gradio[oauth]>=5.4.0,<6.0.0",
"transformers>=4.44.2,<5.0.0",
"sentence-transformers>=3.2.0,<4.0.0",
10 changes: 10 additions & 0 deletions src/synthetic_dataset_generator/_distiset.py
@@ -81,6 +81,15 @@ def _get_card(
dataset[0] if not isinstance(dataset, dict) else dataset["train"][0]
)

+ keys = list(sample_records.keys())
+ if len(keys) == 2 and (
+     ("label" in keys and "text" in keys)
+     or ("labels" in keys and "text" in keys)
+ ):
+     task_categories = ["text-classification"]
+ elif "prompt" in keys or "messages" in keys:
+     task_categories = ["text-generation", "text2text-generation"]
+ else:
+     task_categories = []

readme_metadata = {}
if repo_id and token:
readme_metadata = self._extract_readme_metadata(repo_id, token)
@@ -90,6 +99,7 @@
      "size_categories": size_categories_parser(
          max(len(dataset) for dataset in self.values())
      ),
+     "task_categories": task_categories,
"tags": [
"synthetic",
"distilabel",
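For readability, the same column-based inference distilled into a standalone sketch (`infer_task_categories` is a hypothetical helper name, mirroring the logic in the hunk above):

```python
def infer_task_categories(sample: dict) -> list[str]:
    """Guess Hub task categories from the columns of one sample record."""
    keys = set(sample)
    if keys in ({"text", "label"}, {"text", "labels"}):
        return ["text-classification"]
    if "prompt" in keys or "messages" in keys:
        return ["text-generation", "text2text-generation"]
    return []
```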
9 changes: 7 additions & 2 deletions src/synthetic_dataset_generator/apps/base.py
@@ -77,10 +77,15 @@ def validate_push_to_hub(org_name, repo_name):
return repo_id


- def combine_datasets(repo_id: str, dataset: Dataset) -> Dataset:
+ def combine_datasets(
+     repo_id: str, dataset: Dataset, oauth_token: Union[OAuthToken, None]
+ ) -> Dataset:
      try:
          new_dataset = load_dataset(
-             repo_id, split="train", download_mode="force_redownload"
+             repo_id,
+             split="train",
+             download_mode="force_redownload",
+             token=oauth_token.token if oauth_token else None,
          )
          return concatenate_datasets([dataset, new_dataset])
      except Exception: