docs: API reference review (#932)
* add headers labels API

* inherited members to false and ensure references from source

* ensure examples are rendered in components_gallery and API reference

* Remove space after Examples: in docstrings

* ensure citations are rendered

* note in CombineColumns and add missing steps

* fix \n usage in docstrings tasks, better \\n

* automatic LLM references for better maintenance

* fix distiset examples

* small fixes in galleries

* add embedding section and gallery

* add contributor docs

* fix typos and links

* Use HF Inference API instead of OpenAI in quickstart and README

* update extra steps

* add available models reference

* fix faiss-gpu dependency

* update extras

* add colab button and align welcome page as argilla
sdiazlor authored Aug 29, 2024
1 parent bb14e8b commit 88615c7
Showing 95 changed files with 411 additions and 239 deletions.
18 changes: 12 additions & 6 deletions README.md
@@ -94,16 +94,16 @@ In addition, the following extras are available:

 ### Example

-To run the following example you must install `distilabel` with both `openai` extra:
+To run the following example you must install `distilabel` with the `hf-inference-endpoints` extra:

 ```sh
-pip install "distilabel[openai]" --upgrade
+pip install "distilabel[hf-inference-endpoints]" --upgrade
 ```

 Then run:

 ```python
-from distilabel.llms import OpenAILLM
+from distilabel.llms import InferenceEndpointsLLM
 from distilabel.pipeline import Pipeline
 from distilabel.steps import LoadDataFromHub
 from distilabel.steps.tasks import TextGeneration
@@ -114,9 +114,14 @@ with Pipeline(
 ) as pipeline:
     load_dataset = LoadDataFromHub(output_mappings={"prompt": "instruction"})

-    generate_with_openai = TextGeneration(llm=OpenAILLM(model="gpt-3.5-turbo"))
+    text_generation = TextGeneration(
+        llm=InferenceEndpointsLLM(
+            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
+            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
+        ),
+    )

-    load_dataset >> generate_with_openai
+    load_dataset >> text_generation

 if __name__ == "__main__":
     distiset = pipeline.run(
@@ -125,7 +130,7 @@ if __name__ == "__main__":
                 "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                 "split": "test",
             },
-            generate_with_openai.name: {
+            text_generation.name: {
                 "llm": {
                     "generation_kwargs": {
                         "temperature": 0.7,
@@ -135,6 +140,7 @@ if __name__ == "__main__":
             },
         },
     )
+    distiset.push_to_hub(repo_id="distilabel-example")
 ```

 ## Badges
8 changes: 8 additions & 0 deletions docs/api/embedding/embedding_gallery.md
@@ -0,0 +1,8 @@
# Embedding Gallery

This section contains the existing [`Embeddings`][distilabel.embeddings] subclasses implemented in `distilabel`.

::: distilabel.embeddings
    options:
      filters:
        - "!^Embeddings$"
7 changes: 7 additions & 0 deletions docs/api/embedding/index.md
@@ -0,0 +1,7 @@
# Embedding

This section contains the API reference for the `distilabel` embeddings.

For more information on how [`Embeddings`][distilabel.embeddings] work, and to see some examples, refer to the reference below.

::: distilabel.embeddings.base
3 changes: 0 additions & 3 deletions docs/api/llm/anthropic.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/anyscale.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/azure.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/cohere.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/groq.md

This file was deleted.

6 changes: 0 additions & 6 deletions docs/api/llm/huggingface.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/litellm.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/llamacpp.md

This file was deleted.

10 changes: 10 additions & 0 deletions docs/api/llm/llm_gallery.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
# LLM Gallery

This section contains the existing [`LLM`][distilabel.llms] subclasses implemented in `distilabel`.

::: distilabel.llms
    options:
      filters:
        - "!^LLM$"
        - "!^AsyncLLM$"
        - "!typing"
3 changes: 0 additions & 3 deletions docs/api/llm/mistral.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/ollama.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/openai.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/together.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/vertexai.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/vllm.md

This file was deleted.

3 changes: 3 additions & 0 deletions docs/api/step/typing.md
@@ -0,0 +1,3 @@
# Step Typing

::: distilabel.steps.typing
13 changes: 9 additions & 4 deletions docs/api/step_gallery/extra.md
@@ -1,6 +1,11 @@
 # Extra

-::: distilabel.steps.generators.data
-::: distilabel.steps.deita
-::: distilabel.steps.formatting
-::: distilabel.steps.typing
+::: distilabel.steps
+    options:
+      filters:
+        - "!Argilla"
+        - "!Columns"
+        - "!From(Disk|FileSystem)"
+        - "!Hub"
+        - "![Ss]tep"
+        - "!typing"
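
As a rough illustration of what the `filters` list in the directive above does, here is a minimal Python sketch. It assumes a simplified model in which every filter is a regular expression prefixed with `!`, and a member is hidden as soon as one of those patterns is found anywhere in its name. The helper `is_listed` and the candidate names are hypothetical and are not part of mkdocstrings or distilabel; the real plugin also supports inclusion filters, which this sketch does not model.

```python
import re

# Exclusion filters copied from the `distilabel.steps` directive above.
FILTERS = [
    "!Argilla",
    "!Columns",
    "!From(Disk|FileSystem)",
    "!Hub",
    "![Ss]tep",
    "!typing",
]


def is_listed(name: str, filters: list[str]) -> bool:
    """Return True if `name` survives every exclusion filter.

    Simplified model (an assumption, not the plugin's code): each filter is a
    regex prefixed with `!`, and a name is hidden when any pattern is found
    anywhere inside it.
    """
    for rule in filters:
        if not rule.startswith("!"):
            raise ValueError(f"only exclusion filters are modelled here: {rule!r}")
        if re.search(rule[1:], name):
            return False
    return True


# Illustrative member names only, not an inventory of the real module.
candidates = [
    "LoadDataFromHub",
    "PreferenceToArgilla",
    "GeneratorStep",
    "DeitaFiltering",
    "typing",
]
visible = [name for name in candidates if is_listed(name, FILTERS)]
print(visible)
```

Under this model, only names that match none of the patterns stay on the "Extra" page, which is why steps already documented in other galleries (Argilla, columns, Hub and so on) do not appear twice.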
1 change: 1 addition & 0 deletions docs/api/step_gallery/hugging_face.md
@@ -5,3 +5,4 @@ This section contains the existing steps integrated with `Hugging Face` so as to
 ::: distilabel.steps.LoadDataFromDisk
 ::: distilabel.steps.LoadDataFromFileSystem
 ::: distilabel.steps.LoadDataFromHub
+::: distilabel.steps.PushToHub
File renamed without changes.
26 changes: 22 additions & 4 deletions docs/index.md
@@ -38,21 +38,39 @@ hide:

 Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

-If you just want to get started, we recommend you check the [documentation](http://distilabel.argilla.io/). Curious, and want to know more? Keep reading!
+<div class="grid cards" markdown>
+
+- __Get started in 5 minutes!__
+
+    ---
+
+    Install distilabel with `pip` and run your first `Pipeline` to generate and evaluate synthetic data.
+
+    [:octicons-arrow-right-24: Quickstart](./sections/getting_started/quickstart.md)
+
+- __How-to guides__
+
+    ---
+
+    Get familiar with the basics of distilabel. Learn how to define `steps`, `tasks` and `llms` and run your `Pipeline`.
+
+    [:octicons-arrow-right-24: Learn more](./sections/how_to_guides/index.md)
+
+</div>

 ## Why use distilabel?

 Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects including traditional predictive NLP (classification, extraction, etc.), or generative and large language model scenarios (instruction following, dialogue generation, judging etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.

-### Improve your AI output quality through data quality
+<p style="font-size:20px">Improve your AI output quality through data quality</p>

 Compute is expensive and output quality is important. We help you **focus on data quality**, which tackles the root cause of both of these problems at once. Distilabel helps you to synthesize and judge data to let you spend your valuable time **achieving and keeping high-quality standards for your synthetic data**.

-### Take control of your data and models
+<p style="font-size:20px">Take control of your data and models</p>

 **Ownership of data for fine-tuning your own LLMs** is not easy but distilabel can help you to get started. We integrate **AI feedback from any LLM provider out there** using one unified API.

-### Improve efficiency by quickly iterating on the right research and LLMs
+<p style="font-size:20px">Improve efficiency by quickly iterating on the right data and models</p>

 Synthesize and judge data with **latest research papers** while ensuring **flexibility, scalability and fault tolerance**. So you can focus on improving your data and training your models.

159 changes: 159 additions & 0 deletions docs/sections/community/contributor.md
@@ -0,0 +1,159 @@
---
description: This is a step-by-step guide to help you contribute to the distilabel project. We are excited to have you on board! 🚀
hide:
- footer
---

Thank you for investing your time in contributing to the project! Any contribution you make will be reflected in the most recent version of distilabel 🤩.

??? Question "New to contributing in general?"
    If you're a new contributor, read the [README](https://github.com/argilla-io/distilabel/blob/develop/README.md) to get an overview of the project. In addition, here are some resources to help you get started with open-source contributions:

    * **Discord**: You are welcome to join the [distilabel Discord community](http://hf.co/join/discord), where you can keep in touch with other users, contributors and the distilabel team. In the following [section](#first-contact-in-discord), you can find more information on how to get started in Discord.
    * **Git**: This is a very useful tool to keep track of the changes in your files. Using the command-line interface (CLI), you can make your contributions easily. For that, you need to have it [installed and updated](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) on your computer.
    * **GitHub**: It is a platform and cloud-based service that uses git and allows developers to collaborate on projects. To contribute to distilabel, you'll need to create an account. Check the [Contributor Workflow with Git and GitHub](#contributor-workflow-with-git-and-github) for more info.
    * **Developer Documentation**: To collaborate, you'll need to set up an efficient environment. Check the [Installation](../getting_started/installation.md) guide to know how to do it.

## First Contact in Discord

Discord is a handy tool for more casual conversations and to answer day-to-day questions. As part of Hugging Face, we have set up some distilabel channels on the server. Click [here](http://hf.co/join/discord) to join the Hugging Face Discord community effortlessly.

Once you are in the Hugging Face Discord, open "Channels & roles" and select "Argilla", along with any other groups that interest you. "Argilla" covers everything about Argilla and distilabel. You can join the following channels:

* **#argilla-distilabel-announcements**: 📣 Stay up-to-date.
* **#argilla-distilabel-general**: 💬 For general discussions.
* **#argilla-distilabel-help**: 🙋‍♀️ Need assistance? We're always here to help. Select the appropriate label (argilla or distilabel) for your issue and post it.

So now there is only one thing left to do: introduce yourself and talk to the community. You'll always be welcome! 🤗👋


## Contributor Workflow with Git and GitHub

If you're working with distilabel and suddenly a new idea comes to your mind or you find an issue that can be improved, it's time to actively participate and contribute to the project!

### Report an issue

If you spot a problem, [search if an issue already exists](https://github.com/argilla-io/distilabel/issues?q=is%3Aissue); you can use the `Label` filter to narrow the results. If one exists, participate in the conversation. If it does not, create an issue by clicking on `New Issue`. This will show various templates; choose the one that best suits your issue. Once you choose one, fill it in following the guidelines and try to be as clear as possible. In addition, you can assign yourself to the issue and add or choose the right labels. Finally, click on `Submit new issue`.


### Work with a fork

#### Fork the distilabel repository

After reporting the issue, you can start working on it. For that, you will need to create a fork of the project. To do that, click on the `Fork` button and fill in the information. Remember to uncheck the `Copy develop branch only` option if you are going to work in or from another branch (for instance, the `main` branch is used to fix documentation). Then, click on `Create fork`.

You will be redirected to your fork. You can see that you are in your fork because the name of the repository will be your `username/distilabel`, and it will indicate `forked from argilla-io/distilabel`.


#### Clone your forked repository

In order to make the required adjustments, clone the forked repository to your local machine. Choose the destination folder and run the following command:

```sh
git clone https://github.com/[your-github-username]/distilabel.git
cd distilabel
```

To keep your fork’s `main` and `develop` branches up to date with our repo, add our repo as an upstream remote.

```sh
git remote add upstream https://github.com/argilla-io/distilabel.git
```


### Create a new branch

For each issue you're addressing, it's advisable to create a new branch. GitHub offers a straightforward method to streamline this process.

> ⚠️ Never work directly on the `main` or `develop` branch. Always create a new branch for your changes.

Navigate to your issue, and on the right column, select `Create a branch`.

![Create a branch](../../assets/images/sections/community/create-branch.PNG)

After the new window pops up, the branch will be named after the issue and include a prefix such as `feature/`, `bug/`, or `docs/` to facilitate quick recognition of the issue type. In the `Repository destination`, pick your fork (`[your-github-username]/distilabel`), and then select `Change branch source` to specify the source branch for creating the new one. Complete the process by clicking `Create branch`.

> 🤔 Remember that the `main` branch is only used to work with the documentation. For any other changes, use the `develop` branch.

Now, locally, change to the new branch you just created.

```sh
git fetch origin
git checkout [branch-name]
```

### Make changes and push them

Make the changes you want in your local repository, and test that everything works and you are following the guidelines.

Once you have finished, you can check the status of your repository and synchronize it with the upstream repo using the following commands:

```sh
# Check the status of your repository
git status

# Synchronize with the upstream repo: fetch its latest commits,
# then rebase your branch on top of the upstream default branch
git checkout [branch-name]
git fetch upstream
git rebase upstream/[default-branch]
```

If everything is right, we need to commit and push the changes to your fork. For that, run the following commands:

```sh
# Add the changes to the staging area
git add filename

# Commit the changes by writing a proper message
git commit -m "commit-message"

# Push the changes to your fork
git push origin [branch-name]
```

When pushing, you will be asked to enter your GitHub login credentials. Once the push is complete, all local commits will be on your GitHub repository.


### Create a pull request

Go back to GitHub, navigate to the original repository from which you created your fork, and click on `Compare & pull request`.

![compare-and-pr](../../assets/images/sections/community/compare-pull-request.PNG)

First, click on `compare across forks` and select the right repositories and branches.

> In the base repository, keep in mind that you should select either `main` or `develop` based on the modifications made. In the head repository, indicate your forked repository and the branch corresponding to the issue.

Then, fill in the pull request template. You should add a prefix to the PR name, as we did with the branch above. If you are working on a new feature, you can name your PR as `feat: TITLE`. If your PR consists of a solution for a bug, you can name your PR as `bug: TITLE`. And, if your work is for improving the documentation, you can name your PR as `docs: TITLE`.

In addition, on the right side, you can select a reviewer (for instance, if you discussed the issue with a member of the team) and assign the pull request to yourself. It is highly advisable to add labels to the PR as well. You can do this in the labels section, also on the right side of the screen. For instance, if you are addressing a bug, add the `bug` label, or if the PR is related to the documentation, add the `documentation` label. This way, PRs can be easily filtered.

Finally, fill in the template carefully and follow the guidelines. Remember to link the original issue and enable the checkbox to allow maintainer edits so the branch can be updated for a merge. Then, click on `Create pull request`.


### Review your pull request

Once you submit your PR, a team member will review your proposal. We may ask questions, request additional information, or ask for changes to be made before a PR can be merged, either using [suggested changes](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/incorporating-feedback-in-your-pull-request) or pull request comments.

You can apply the changes directly through the UI (check the files changed and click on the three dots in the top-right corner; see image below) or from your fork, and then commit them to your branch. The PR will be updated automatically, and the suggestions will appear as `outdated`.

![edit-file-from-UI](../../assets/images/sections/community/edit-file.PNG)

> If you run into any merge issues, check out this [git tutorial](https://github.com/skills/resolve-merge-conflicts) to help you resolve merge conflicts and other issues.

### Your PR is merged!

Congratulations 🎉🎊 We thank you 🤩

Once your PR is merged, your contributions will be publicly visible on the [distilabel GitHub](https://github.com/argilla-io/distilabel#contributors).

Additionally, we will include your changes in the next release based on our [development branch](https://github.com/argilla-io/distilabel/tree/develop).

## Additional resources

Here are some helpful resources for your reference.

* [Configuring Discord](https://support.discord.com/hc/en-us/categories/115000217151), a guide to learning how to get started with Discord.
* [Pro Git](https://git-scm.com/book/en/v2), a book to learn Git.
* [Git in VSCode](https://code.visualstudio.com/docs/sourcecontrol/overview), a guide to learning how to easily use Git in VSCode.
* [GitHub Skills](https://skills.github.com/), an interactive course for learning GitHub.