docs: 730 docs add an index to the guide overview (#731)
* Add index page to how-to guides

* Apply suggestions from code review

Co-authored-by: burtenshaw <[email protected]>

---------

Co-authored-by: burtenshaw <[email protected]>
davidberenstein1957 and burtenshaw authored Jun 13, 2024
1 parent 9d63f4a commit 806fd57
Showing 6 changed files with 116 additions and 93 deletions.
89 changes: 1 addition & 88 deletions docs/index.md
@@ -40,8 +40,6 @@

If you just want to get started, we recommend you check the [documentation](http://distilabel.argilla.io/). Curious, and want to know more? Keep reading!

<!-- ![overview](https://github.com/argilla-io/distilabel/assets/36760800/360110da-809d-4e24-a29b-1a1a8bc4f9b7) -->

## Why use Distilabel?

Whether you are working on **a predictive model** that computes semantic similarity or the next **generative model** that is going to beat the LLM benchmarks, our framework ensures that the **hard data work pays off**. Distilabel is the missing piece that helps you **synthesize data** and provide **AI feedback**.
@@ -64,89 +62,4 @@

- The [1M OpenHermesPreference](https://huggingface.co/datasets/argilla/OpenHermesPreferences) is a dataset of ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use Distilabel to **synthesize data on an immense scale**.
- Our [distilabeled Intel Orca DPO dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) and the [improved OpenHermes model](https://huggingface.co/argilla/distilabeled-OpenHermes-2.5-Mistral-7B) show how we **improve model performance by filtering out 50%** of the original dataset through **AI feedback**.
- The [haiku DPO data](https://github.com/davanstrien/haiku-dpo) outlines how anyone can create a **dataset for a specific task**, applying **the latest research papers** to improve its quality.

## 👨🏽‍💻 Installation

```sh
pip install distilabel --upgrade
```

Requires Python 3.8+

In addition, the following extras are available:

- `anthropic`: for using models available in the [Anthropic API](https://www.anthropic.com/api) via the `AnthropicLLM` integration.
- `cohere`: for using models available in [Cohere](https://cohere.ai/) via the `CohereLLM` integration.
- `argilla`: for exporting the generated datasets to [Argilla](https://argilla.io/).
- `groq`: for using models available in [Groq](https://groq.com/) via the [`groq`](https://github.com/groq/groq-python) Python client and the `GroqLLM` integration.
- `hf-inference-endpoints`: for using the [Hugging Face Inference Endpoints](https://huggingface.co/inference-endpoints) via the `InferenceEndpointsLLM` integration.
- `hf-transformers`: for using models available in the [transformers](https://github.com/huggingface/transformers) package via the `TransformersLLM` integration.
- `litellm`: for using [`LiteLLM`](https://github.com/BerriAI/litellm) to call any LLM using the OpenAI format via the `LiteLLM` integration.
- `llama-cpp`: for using the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) Python bindings for `llama.cpp` via the `LlamaCppLLM` integration.
- `mistralai`: for using models available in the [Mistral AI API](https://mistral.ai/news/la-plateforme/) via the `MistralAILLM` integration.
- `ollama`: for using [Ollama](https://ollama.com/) and its available models via the `OllamaLLM` integration.
- `openai`: for using [OpenAI API](https://openai.com/blog/openai-api) models via the `OpenAILLM` integration, as well as the other integrations that rely on the OpenAI client, such as `AnyscaleLLM`, `AzureOpenAILLM`, and `TogetherLLM`.
- `vertexai`: for using [Google Vertex AI](https://cloud.google.com/vertex-ai) proprietary models via the `VertexAILLM` integration.
- `vllm`: for using the [vllm](https://github.com/vllm-project/vllm) serving engine via the `vLLM` integration.

### Example

To run the following example you must install `distilabel` with the `openai` extra:

```sh
pip install "distilabel[openai]" --upgrade
```

Then run:

```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:
    # Load prompts from the Hub and rename the column to the task's expected input
    load_dataset = LoadDataFromHub(output_mappings={"prompt": "instruction"})

    # Generate a completion for each instruction using an OpenAI model
    generate_with_openai = TextGeneration(llm=OpenAILLM(model="gpt-3.5-turbo"))

    # Connect the steps so the loaded rows flow into the generation task
    load_dataset.connect(generate_with_openai)

if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            generate_with_openai.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
```
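
The `run` call returns a `Distiset` holding the generated data. The snippet below is a minimal sketch of what you might do with it next; the `"default"` configuration name, the `"train"` split, and the target repository id are illustrative assumptions, so check the Distiset guide for the exact layout of your run.

```python
# Minimal sketch (config/split names and repo id are assumptions -- adjust to your run):
# a Distiset behaves like a dict mapping configurations to dataset splits.
print(distiset)  # overview of the configurations and splits that were produced

rows = distiset["default"]["train"]  # assumed: a single leaf step under the default config
print(rows[0]["instruction"], rows[0]["generation"])  # inspect one generated pair

# Optionally publish the synthetic dataset to the Hugging Face Hub
distiset.push_to_hub("my-username/instruction-dataset-mini-with-generations")
```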

## Badges

If you build something cool with `distilabel`, consider adding one of these badges to your dataset or model card.

[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

## Contribute

To contribute directly to `distilabel`, check our [good first issues](https://github.com/argilla-io/distilabel/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) or [open a new one](https://github.com/argilla-io/distilabel/issues/new/choose).
18 changes: 17 additions & 1 deletion docs/sections/community/index.md
@@ -41,4 +41,20 @@

[:octicons-arrow-right-24: Roadmap ↗](https://github.com/orgs/argilla-io/projects/15)

</div>

## Badges

If you build something cool with `distilabel`, consider adding one of these badges to your dataset or model card.

[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

## Contribute

To contribute directly to `distilabel`, check our [good first issues](https://github.com/argilla-io/distilabel/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) or [open a new one](https://github.com/argilla-io/distilabel/issues/new/choose).
2 changes: 1 addition & 1 deletion docs/sections/how_to_guides/basic/llm/index.md
@@ -1,4 +1,4 @@
# Define LLMs as local models or remote APIs
# Define LLMs as local or remote models

## Working with LLMs

2 changes: 1 addition & 1 deletion docs/sections/how_to_guides/basic/task/index.md
@@ -1,4 +1,4 @@
# Define Tasks as Steps that rely on LLMs
# Define Tasks that rely on LLMs

## Working with Tasks

93 changes: 93 additions & 0 deletions docs/sections/how_to_guides/index.md
@@ -0,0 +1,93 @@
# How-to guides

Welcome to the how-to guides section! Here you will find a collection of guides to help you work with Distilabel, divided into two categories: basic and advanced. The basic guides cover the core concepts of Distilabel, while the advanced guides explore more specialized features.

## Basic

<div class="grid cards" markdown>

- __Define Steps for your Pipeline__

---

Steps are the building blocks of your pipeline. They can be used to generate data, evaluate models, manipulate data, or perform any other general task.

[:octicons-arrow-right-24: Define Steps](basic/step/index.md)

- __Define Tasks that rely on LLMs__

---

Tasks are a specific type of step that relies on Large Language Models (LLMs) to generate data.

[:octicons-arrow-right-24: Define Tasks](basic/task/index.md)

- __Define LLMs as local or remote models__

---

LLMs are the core of your tasks. They are how you integrate local models or remote APIs into your pipeline.

[:octicons-arrow-right-24: Define LLMs](basic/llm/index.md)

- __Execute Steps and Tasks in a Pipeline__

---

A Pipeline is where you put all your steps and tasks together to create a workflow, as shown in the sketch below.

[:octicons-arrow-right-24: Execute Pipeline](basic/pipeline/index.md)

</div>
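
To give a feel for how these basic building blocks fit together, the snippet below is a condensed sketch of a minimal pipeline: a step loads prompts, a task wraps an LLM, and the pipeline wires them together. The model name is illustrative, and each guide above digs into one of these pieces in more depth.

```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="how-to-guides-sketch") as pipeline:
    # Step: load rows from a Hub dataset, mapping "prompt" to the task's "instruction" input
    load_dataset = LoadDataFromHub(output_mappings={"prompt": "instruction"})

    # Task: a step whose work is delegated to an LLM (here an illustrative OpenAI model)
    generate = TextGeneration(llm=OpenAILLM(model="gpt-3.5-turbo"))

    # Pipeline: connect the steps so the loaded rows flow into the generation task
    load_dataset.connect(generate)
```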

## Advanced

<div class="grid cards" markdown>

- __Using the Distiset dataset object__

---

Distiset is a dataset object based on the Hugging Face `datasets` library that can be used to store and manipulate data.

[:octicons-arrow-right-24: Distiset](advanced/distiset.md)

- __Export data to Argilla__

---

Argilla is a platform that can be used to store, search, and apply feedback to datasets.

[:octicons-arrow-right-24: Argilla](advanced/argilla.md)

- __Using a file system to pass data of batches between steps__

---

A file system can be used to pass data between steps in a pipeline.

[:octicons-arrow-right-24: File System](advanced/fs_to_pass_data.md)

- __Using CLI to explore and re-run existing Pipelines__

---

The CLI can be used to explore and re-run existing pipelines from the command line.

[:octicons-arrow-right-24: CLI](advanced/cli/index.md)

- __Cache and recover pipeline executions__

---

Caching can be used to recover pipeline executions, avoiding the loss of data and precious LLM calls (see the sketch below).

[:octicons-arrow-right-24: Caching](advanced/caching.md)

- __Structured data generation__

---

Structured data generation can be used to generate data with a specific structure like JSON, function calls, etc.

[:octicons-arrow-right-24: Structured Generation](advanced/structured_generation.md)

</div>
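
As a small taste of the advanced guides, the sketch below re-runs the pipeline from the basic sketch above with caching enabled and persists the resulting `Distiset`. The `use_cache` flag and the `save_to_disk` call are assumptions drawn from the caching and Distiset guides; check those guides for the exact APIs.

```python
# Continuing from the basic sketch above (same `pipeline`, `load_dataset`, `generate`).
# `use_cache` and `save_to_disk` are assumptions -- see the caching and Distiset guides.
distiset = pipeline.run(
    parameters={
        load_dataset.name: {
            "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
            "split": "test",
        },
    },
    use_cache=True,  # reuse previously computed batches instead of repeating LLM calls
)

distiset.save_to_disk("my-synthetic-dataset")  # write the generated data to a local folder
```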
5 changes: 3 additions & 2 deletions mkdocs.yml
@@ -142,15 +142,16 @@ nav:
- Quickstart: "sections/getting_started/quickstart.md"
- FAQ: "sections/getting_started/faq.md"
- How-to guides:
- "sections/how_to_guides/index.md"
- Basic:
- Define Steps for your Pipeline:
- "sections/how_to_guides/basic/step/index.md"
- GeneratorStep: "sections/how_to_guides/basic/step/generator_step.md"
- GlobalStep: "sections/how_to_guides/basic/step/global_step.md"
- Define Tasks as Steps that rely on LLMs:
- Define Tasks that rely on LLMs:
- "sections/how_to_guides/basic/task/index.md"
- GeneratorTask: "sections/how_to_guides/basic/task/generator_task.md"
- Define LLMs as local models or remote APIs: "sections/how_to_guides/basic/llm/index.md"
- Define LLMs as local or remote models: "sections/how_to_guides/basic/llm/index.md"
- Execute Steps and Tasks in a Pipeline: "sections/how_to_guides/basic/pipeline/index.md"
- Advanced:
- Using the Distiset dataset object: "sections/how_to_guides/advanced/distiset.md"
