diff --git a/docs/index.md b/docs/index.md index 2d7f6c0895..de4bc67be8 100644 --- a/docs/index.md +++ b/docs/index.md @@ -40,8 +40,6 @@ Distilabel is the **framework for synthetic data and AI feedback for AI engineer If you just want to get started, we recommend you check the [documentation](http://distilabel.argilla.io/). Curious, and want to know more? Keep reading! - - ## Why use Distilabel? Whether you are working on **a predictive model** that computes semantic similarity or the next **generative model** that is going to beat the LLM benchmarks. Our framework ensures that the **hard data work pays off**. Distilabel is the missing piece that helps you **synthesize data** and provide **AI feedback**. @@ -64,89 +62,4 @@ Distilabel is a tool that can be used to **synthesize data and provide AI feedba - The [1M OpenHermesPreference](https://huggingface.co/datasets/argilla/OpenHermesPreferences) is a dataset of ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use Distilabel to **synthesize data on an immense scale**. - Our [distilabeled Intel Orca DPO dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) and the [improved OpenHermes model](https://huggingface.co/argilla/distilabeled-OpenHermes-2.5-Mistral-7B), show how we **improve model performance by filtering out 50%** of the original dataset through **AI feedback**. -- The [haiku DPO data](https://github.com/davanstrien/haiku-dpo) outlines how anyone can create a **dataset for a specific task** and **the latest research papers** to improve the quality of the dataset. - -## 👨🏽💻 Installation - -```sh -pip install distilabel --upgrade -``` - -Requires Python 3.8+ - -In addition, the following extras are available: - -- `anthropic`: for using models available in [Anthropic API](https://www.anthropic.com/api) via the `AnthropicLLM` integration. -- `cohere`: for using models available in [Cohere](https://cohere.ai/) via the `CohereLLM` integration. 
-- `argilla`: for exporting the generated datasets to [Argilla](https://argilla.io/). -- `groq`: for using models available in [Groq](https://groq.com/) using [`groq`](https://github.com/groq/groq-python) Python client via the `GroqLLM` integration. -- `hf-inference-endpoints`: for using the [Hugging Face Inference Endpoints](https://huggingface.co/inference-endpoints) via the `InferenceEndpointsLLM` integration. -- `hf-transformers`: for using models available in [transformers](https://github.com/huggingface/transformers) package via the `TransformersLLM` integration. -- `litellm`: for using [`LiteLLM`](https://github.com/BerriAI/litellm) to call any LLM using OpenAI format via the `LiteLLM` integration. -- `llama-cpp`: for using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) Python bindings for `llama.cpp` via the `LlamaCppLLM` integration. -- `mistralai`: for using models available in [Mistral AI API](https://mistral.ai/news/la-plateforme/) via the `MistralAILLM` integration. -- `ollama`: for using [Ollama](https://ollama.com/) and their available models via `OllamaLLM` integration. -- `openai`: for using [OpenAI API](https://openai.com/blog/openai-api) models via the `OpenAILLM` integration, or the rest of the integrations based on OpenAI and relying on its client as `AnyscaleLLM`, `AzureOpenAILLM`, and `TogetherLLM`. -- `vertexai`: for using [Google Vertex AI](https://cloud.google.com/vertex-ai) proprietary models via the `VertexAILLM` integration. -- `vllm`: for using [vllm](https://github.com/vllm-project/vllm) serving engine via the `vLLM` integration. 
- -### Example - -To run the following example you must install `distilabel` with both `openai` extra: - -```sh -pip install "distilabel[openai]" --upgrade -``` - -Then run: - -```python -from distilabel.llms import OpenAILLM -from distilabel.pipeline import Pipeline -from distilabel.steps import LoadDataFromHub -from distilabel.steps.tasks import TextGeneration - -with Pipeline( - name="simple-text-generation-pipeline", - description="A simple text generation pipeline", -) as pipeline: - load_dataset = LoadDataFromHub(output_mappings={"prompt": "instruction"}) - - generate_with_openai = TextGeneration(llm=OpenAILLM(model="gpt-3.5-turbo")) - - load_dataset.connect(generate_with_openai) - -if __name__ == "__main__": - distiset = pipeline.run( - parameters={ - load_dataset.name: { - "repo_id": "distilabel-internal-testing/instruction-dataset-mini", - "split": "test", - }, - generate_with_openai.name: { - "llm": { - "generation_kwargs": { - "temperature": 0.7, - "max_new_tokens": 512, - } - } - }, - }, - ) -``` - -## Badges - -If you build something cool with `distilabel` consider adding one of these badges to your dataset or model card. - - [](https://github.com/argilla-io/distilabel) - -[](https://github.com/argilla-io/distilabel) - - [](https://github.com/argilla-io/distilabel) - -[](https://github.com/argilla-io/distilabel) - -## Contribute - -To directly contribute with `distilabel`, check our [good first issues](https://github.com/argilla-io/distilabel/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) or [open a new one](https://github.com/argilla-io/distilabel/issues/new/choose). +- The [haiku DPO data](https://github.com/davanstrien/haiku-dpo) outlines how anyone can create a **dataset for a specific task**, using **the latest research papers** to improve the quality of the dataset. 
\ No newline at end of file diff --git a/docs/sections/community/index.md b/docs/sections/community/index.md index e7bef08111..ed7f6cdd42 100644 --- a/docs/sections/community/index.md +++ b/docs/sections/community/index.md @@ -41,4 +41,20 @@ We are an open-source community-driven project not only focused on building a gr [:octicons-arrow-right-24: Roadmap ↗](https://github.com/orgs/argilla-io/projects/15) - \ No newline at end of file + + +## Badges + +If you build something cool with `distilabel`, consider adding one of these badges to your dataset or model card. + + [](https://github.com/argilla-io/distilabel) + -[](https://github.com/argilla-io/distilabel) + + [](https://github.com/argilla-io/distilabel) + -[](https://github.com/argilla-io/distilabel) + +## Contribute + +To contribute directly to `distilabel`, check our [good first issues](https://github.com/argilla-io/distilabel/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) or [open a new one](https://github.com/argilla-io/distilabel/issues/new/choose). 
\ No newline at end of file diff --git a/docs/sections/how_to_guides/basic/llm/index.md b/docs/sections/how_to_guides/basic/llm/index.md index 944e3f4fce..4bd5f9de2b 100644 --- a/docs/sections/how_to_guides/basic/llm/index.md +++ b/docs/sections/how_to_guides/basic/llm/index.md @@ -1,4 +1,4 @@ -# Define LLMs as local models or remote APIs +# Define LLMs as local or remote models ## Working with LLMs diff --git a/docs/sections/how_to_guides/basic/task/index.md b/docs/sections/how_to_guides/basic/task/index.md index a184357af7..54c04483dc 100644 --- a/docs/sections/how_to_guides/basic/task/index.md +++ b/docs/sections/how_to_guides/basic/task/index.md @@ -1,4 +1,4 @@ -# Define Tasks as Steps that rely on LLMs +# Define Tasks that rely on LLMs ## Working with Tasks diff --git a/docs/sections/how_to_guides/index.md b/docs/sections/how_to_guides/index.md new file mode 100644 index 0000000000..3d6cb3e82d --- /dev/null +++ b/docs/sections/how_to_guides/index.md @@ -0,0 +1,93 @@ +# How-to guides + +Welcome to the how-to guides section! Here you will find a collection of guides that will help you get started with Distilabel, divided into two categories: basic and advanced. The basic guides cover the core concepts of Distilabel, while the advanced guides explore its more specialized features. + +## Basic + +