1.4.0 #1024

Merged
merged 84 commits into from
Oct 8, 2024
Commits
ecbe16b
Bump version to `1.4.0`
gabrielmbmb Aug 6, 2024
1a39e01
Merge branch 'main' into develop
gabrielmbmb Aug 7, 2024
2ded30f
Make `ClientvLLM.model_name` a `cached_property` (#862)
gabrielmbmb Aug 8, 2024
314b759
Pass dataset to dry_run method (#863)
plaguss Aug 8, 2024
5e5e7c3
Add default structured output for `GenerateSentencePair` task (#868)
plaguss Aug 9, 2024
7702e24
Complexity scorer default structured output (#870)
plaguss Aug 9, 2024
aa616a1
Quality scorer default structured output (#873)
plaguss Aug 9, 2024
c006ddc
Ultrafeedback default structured output (#876)
plaguss Aug 9, 2024
bbe04fd
Remove use of `default_chat_template` (#888)
gabrielmbmb Aug 12, 2024
1198d24
Temporary (using `pip`) fix for installing `llama-cpp-python` in CI (…
gabrielmbmb Aug 13, 2024
8916ff2
Fix unit tests after release of `transformers==4.44.0` (#891)
gabrielmbmb Aug 13, 2024
75baf64
Fix default structured output (#892)
plaguss Aug 13, 2024
7ff4d20
Send as many batches as possible to input queues (#895)
gabrielmbmb Aug 13, 2024
04d0bf0
Exclude `repo_id` from `LoadDataFromFileSystem` (#898)
plaguss Aug 13, 2024
f382f1c
Fix loader to read from a glob pattern (#877)
plaguss Aug 14, 2024
c8df5a9
Add `save_artifact` method to `_Step` (#871)
gabrielmbmb Aug 14, 2024
3d772c5
Add new `add_raw_input` argument to `_Task` so we can automatically i…
plaguss Aug 14, 2024
4740063
New `TruncateTextColumn` to truncate the length of texts using the nu…
plaguss Aug 14, 2024
4093699
Update `inputs` and `outputs` interface to allow returning dict indic…
gabrielmbmb Aug 15, 2024
ed874ba
Update mistrallm (#904)
plaguss Aug 15, 2024
10fff29
Deepseek prover (#907)
plaguss Aug 15, 2024
974f0db
Update `RewardModelScore.inputs` to define optional input columns (#908)
gabrielmbmb Aug 15, 2024
3264563
Add tutorial - generate data for training embeddings and reranking mo…
davidberenstein1957 Aug 19, 2024
ebe7e25
Fix load data from disk (#910)
plaguss Aug 19, 2024
516909e
docs: minor fixes (#913)
davidberenstein1957 Aug 19, 2024
2a3906d
Add `URIAL` task (#921)
gabrielmbmb Aug 22, 2024
a796a75
Add `vLLMEmbeddings` (#920)
plaguss Aug 22, 2024
46d55ed
docs: add tutorials preference and clean (#917)
sdiazlor Aug 22, 2024
6576d1a
Fix `StructuredGeneration` examples and internal check (#912)
plaguss Aug 22, 2024
fc5d070
Generate deterministic pipeline name when it's not given (#878)
plaguss Aug 22, 2024
22db32c
Add custom errors (#911)
plaguss Aug 22, 2024
def7060
Merge branch 'main' into develop
gabrielmbmb Aug 23, 2024
af3515a
Docs/tutorials fix (#922)
sdiazlor Aug 26, 2024
d010f79
Add `revision` runtime parameter to `LoadDataFromHub` (#928)
gabrielmbmb Aug 26, 2024
2ce44f0
Add Plausible as replacement for GA (#929)
davidberenstein1957 Aug 26, 2024
bb14e8b
Add minhash related steps to deduplicate texts (#931)
plaguss Aug 28, 2024
88615c7
docs: API reference review (#932)
sdiazlor Aug 29, 2024
4b3c9c0
Refactor of MinHash to work with a single class and fix the shelve ba…
plaguss Sep 2, 2024
4556135
Update `make_generator_step` to set pipeline to step and add edge to …
gabrielmbmb Sep 2, 2024
d5f2ae3
Add `CombineOutputs` step (#939)
gabrielmbmb Sep 2, 2024
a2a8e86
update regex (#940)
sdiazlor Sep 2, 2024
28485d0
Offline batch generation (#923)
gabrielmbmb Sep 2, 2024
c8f4d61
Fix applying input mapping when mapping overrides another column (#938)
gabrielmbmb Sep 2, 2024
56b4036
Fix all replicas had the same `_llm_identifier` for `CudaDevicePlacem…
gabrielmbmb Sep 2, 2024
ebd2bb7
Fix empty load stage when two `GlobalStep`s are chained (#945)
gabrielmbmb Sep 3, 2024
973e0fa
Update `TextGeneration` to deprecate `use_system_prompt` and add (#950)
gabrielmbmb Sep 6, 2024
de2bed0
Add step to deduplicate records based on embeddings (#946)
plaguss Sep 6, 2024
eef8961
Updated `setup_logging` to use UTF-8 encoding in `FileHandler` (#952)
dameikle Sep 9, 2024
8e9cc8d
Add more generation parameters to `vLLM` (#955)
gabrielmbmb Sep 10, 2024
f207fab
Fix `Magpie` generating different columns depending on `LLM` output (…
gabrielmbmb Sep 10, 2024
6e2c9b1
Docs/962 docs create a smoother transition from index installation qu…
davidberenstein1957 Sep 11, 2024
ccea49a
Add `logging_handlers` argument (#969)
gabrielmbmb Sep 12, 2024
28ecbc4
[DOCS] Add tips in the docs to avoid overloading Free Serverless Endp…
plaguss Sep 13, 2024
f0067b8
Add `TextClassification`, `UMAP`, `DBSCAN` and `TextClustering` tasks…
plaguss Sep 16, 2024
af08b59
[FEATURE] Simplify customizing the `TextGeneration` task with custom …
plaguss Sep 16, 2024
e1253a6
Update `system_prompt` attribute for adding probabilities in `MagpieB…
gabrielmbmb Sep 16, 2024
75e34e1
Send as many `None`s as replicas in the step (#982)
gabrielmbmb Sep 16, 2024
b2d8eb5
docs: 960 docs add a glossary concept section (#970)
davidberenstein1957 Sep 16, 2024
e67864e
Fix missing `system_prompt_key` column in `Magpie` tasks (#983)
gabrielmbmb Sep 17, 2024
370e5b5
docs: update component gallery (#987)
davidberenstein1957 Sep 18, 2024
33b58bf
docs: update install overview in readme
davidberenstein1957 Sep 20, 2024
a2ab68d
docs: update installation overview
davidberenstein1957 Sep 20, 2024
f997cfd
Fix missing batch when last batch arrive early (#989)
zye1996 Sep 20, 2024
ad231ab
Fine personas socialai tutorial (#992)
plaguss Sep 20, 2024
c7deafa
feat: add basic draw implementation to pipline (#966)
davidberenstein1957 Sep 20, 2024
d7e61b5
Fix schema inference structured generation (#994)
davidberenstein1957 Sep 23, 2024
a178109
[DOCS] Add developer documentation section in the docs (#999)
plaguss Sep 25, 2024
a49242d
Fix `vllm` installation in CI (#1009)
gabrielmbmb Sep 30, 2024
3244c05
Fix writing `distilabel_metadata` column when `LLM` error (#1003)
zye1996 Sep 30, 2024
3fd680c
Add example of custom text generation step in quickstart (#984)
plaguss Sep 30, 2024
a46489e
feat: 985 feature argillalabeller task (#986)
davidberenstein1957 Oct 3, 2024
b4c13ba
fix: validate fields and questions during process
davidberenstein1957 Oct 3, 2024
1eb0524
fix: validation of fields and records passed
davidberenstein1957 Oct 3, 2024
7b5cbb0
fix: suggestion serialisation argilla labeller
davidberenstein1957 Oct 3, 2024
4848dd2
Fix`llvmlite` install with `uv` (#1018)
gabrielmbmb Oct 7, 2024
d5c0484
tests: validate passing questions and field within format_input too (…
davidberenstein1957 Oct 7, 2024
4b8903b
Fix impute when `output_mapping` is not empty (#1015)
zye1996 Oct 7, 2024
4b056ff
Add Tasks to replicate `APIGen` (#925)
plaguss Oct 7, 2024
87683f0
Pretty print (#934)
plaguss Oct 7, 2024
e027f99
Add `CLAIR` task (#926)
plaguss Oct 7, 2024
ebab004
Add cache at `Step` level (#766)
plaguss Oct 7, 2024
4cbcb90
Fix `IndexError` when overriding inputs and `group_generations=False`…
plaguss Oct 8, 2024
d99011c
Update `Pipeline cache` docs (#1023)
gabrielmbmb Oct 8, 2024
6ef15f4
Fix cross-reference
gabrielmbmb Oct 8, 2024
3 changes: 3 additions & 0 deletions .github/workflows/docs.yml
@@ -42,6 +42,9 @@ jobs:
if: steps.cache.outputs.cache-hit != 'true'
run: pip install -e .[docs]

- name: Check no warnings
run: mkdocs build --strict

- name: Set git credentials
run: |
git config --global user.name "${{ github.actor }}"
33 changes: 27 additions & 6 deletions README.md
@@ -78,6 +78,8 @@ Requires Python 3.9+

In addition, the following extras are available:

### LLMs

- `anthropic`: for using models available in [Anthropic API](https://www.anthropic.com/api) via the `AnthropicLLM` integration.
- `cohere`: for using models available in [Cohere](https://cohere.ai/) via the `CohereLLM` integration.
- `argilla`: for exporting the generated datasets to [Argilla](https://argilla.io/).
@@ -91,19 +93,32 @@ In addition, the following extras are available:
- `openai`: for using [OpenAI API](https://openai.com/blog/openai-api) models via the `OpenAILLM` integration, or the rest of the integrations based on OpenAI and relying on its client as `AnyscaleLLM`, `AzureOpenAILLM`, and `TogetherLLM`.
- `vertexai`: for using [Google Vertex AI](https://cloud.google.com/vertex-ai) proprietary models via the `VertexAILLM` integration.
- `vllm`: for using [vllm](https://github.com/vllm-project/vllm) serving engine via the `vLLM` integration.
- `sentence-transformers`: for generating sentence embeddings using [sentence-transformers](https://github.com/UKPLab/sentence-transformers).

### Structured generation

- `outlines`: for using structured generation of LLMs with [outlines](https://github.com/outlines-dev/outlines).
- `instructor`: for using structured generation of LLMs with [Instructor](https://github.com/jxnl/instructor/).

### Data processing

- `ray`: for scaling and distributing a pipeline with [Ray](https://github.com/ray-project/ray).
- `faiss-cpu` and `faiss-gpu`: for generating sentence embeddings using [faiss](https://github.com/facebookresearch/faiss).
- `text-clustering`: for using text clustering with [UMAP](https://github.com/lmcinnes/umap) and [Scikit-learn](https://github.com/scikit-learn/scikit-learn).
- `minhash`: for using minhash for duplicate detection with [datasketch](https://github.com/datasketch/datasketch) and [nltk](https://github.com/nltk/nltk).

### Example

To run the following example you must install `distilabel` with both `openai` extra:
To run the following example you must install `distilabel` with the `hf-inference-endpoints` extra:

```sh
pip install "distilabel[openai]" --upgrade
pip install "distilabel[hf-inference-endpoints]" --upgrade
```

Then run:

```python
from distilabel.llms import OpenAILLM
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration
@@ -114,9 +129,14 @@ with Pipeline(
) as pipeline:
load_dataset = LoadDataFromHub(output_mappings={"prompt": "instruction"})

generate_with_openai = TextGeneration(llm=OpenAILLM(model="gpt-3.5-turbo"))
text_generation = TextGeneration(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
),
)

load_dataset >> generate_with_openai
load_dataset >> text_generation

if __name__ == "__main__":
distiset = pipeline.run(
@@ -125,7 +145,7 @@ if __name__ == "__main__":
"repo_id": "distilabel-internal-testing/instruction-dataset-mini",
"split": "test",
},
generate_with_openai.name: {
text_generation.name: {
"llm": {
"generation_kwargs": {
"temperature": 0.7,
@@ -135,6 +155,7 @@ if __name__ == "__main__":
},
},
)
distiset.push_to_hub(repo_id="distilabel-example")
```

## Badges
8 changes: 8 additions & 0 deletions docs/api/embedding/embedding_gallery.md
@@ -0,0 +1,8 @@
# Embedding Gallery

This section contains the existing [`Embeddings`][distilabel.embeddings] subclasses implemented in `distilabel`.

::: distilabel.embeddings
options:
filters:
- "!^Embeddings$"
7 changes: 7 additions & 0 deletions docs/api/embedding/index.md
@@ -0,0 +1,7 @@
# Embedding

This section contains the API reference for the `distilabel` embeddings.

For more information on how the [`Embeddings`][distilabel.steps.tasks.Task] works and see some examples.

::: distilabel.embeddings.base
8 changes: 8 additions & 0 deletions docs/api/errors.md
@@ -0,0 +1,8 @@
# Errors

This section contains the `distilabel` custom errors. Unlike [exceptions](exceptions.md), errors in `distilabel` are used to handle unexpected situations that can't be anticipated and that can't be handled in a controlled way.

:::distilabel.errors.DistilabelError
:::distilabel.errors.DistilabelUserError
:::distilabel.errors.DistilabelTypeError
:::distilabel.errors.DistilabelNotImplementedError
7 changes: 7 additions & 0 deletions docs/api/exceptions.md
@@ -0,0 +1,7 @@
# Exceptions

This section contains the `distilabel` custom exceptions. Unlike [errors](errors.md), exceptions in `distilabel` are used to handle specific situations that can be anticipated and that can be handled in a controlled way internally by the library.

:::distilabel.exceptions.DistilabelException
:::distilabel.exceptions.DistilabelGenerationException
:::distilabel.exceptions.DistilabelOfflineBatchGenerationNotFinishedException
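
The distinction drawn in `errors.md` and `exceptions.md` can be illustrated with a minimal sketch. The classes below are local stand-ins that only borrow the names listed in those two files; they are not imported from `distilabel`. The `generate_with_fallback` helper, its retry count and the fake generator are hypothetical and exist only to show the idea: an exception is anticipated and absorbed internally, while an error is surfaced to the caller.

```python
# Stand-in classes: the names mirror the ones documented above, but these are
# NOT the real distilabel classes (assumption, for illustration only).
class DistilabelException(Exception):
    """Anticipated situation that library code can handle internally."""


class DistilabelGenerationException(DistilabelException):
    """Raised by a (hypothetical) generation call that may be retried."""


class DistilabelError(Exception):
    """Unexpected situation that is surfaced to the caller."""


class DistilabelUserError(DistilabelError):
    """Raised when the (hypothetical) caller passes invalid input."""


def generate_with_fallback(generate, prompt, retries=2):
    """Hypothetical helper: absorb exceptions, let errors propagate."""
    if not prompt:
        # Not recoverable here: the caller has to fix the input.
        raise DistilabelUserError("prompt must not be empty")
    for _ in range(retries):
        try:
            return generate(prompt)
        except DistilabelGenerationException:
            # Anticipated failure: handled in a controlled way, then retried.
            continue
    return None  # every attempt failed, so a placeholder result is returned


def flaky_generate_factory(failures):
    """Build a fake generator that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def generate(prompt):
        if state["left"] > 0:
            state["left"] -= 1
            raise DistilabelGenerationException("temporary failure")
        return prompt.upper()

    return generate
```

In this sketch a caller never sees `DistilabelGenerationException`, whereas `DistilabelUserError` always reaches them. That matches the split described in the two pages, but not necessarily how the library implements it.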
3 changes: 0 additions & 3 deletions docs/api/llm/anthropic.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/anyscale.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/azure.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/cohere.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/groq.md

This file was deleted.

6 changes: 0 additions & 6 deletions docs/api/llm/huggingface.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/litellm.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/llamacpp.md

This file was deleted.

10 changes: 10 additions & 0 deletions docs/api/llm/llm_gallery.md
@@ -0,0 +1,10 @@
# LLM Gallery

This section contains the existing [`LLM`][distilabel.llms] subclasses implemented in `distilabel`.

::: distilabel.llms
options:
filters:
- "!^LLM$"
- "!^AsyncLLM$"
- "!typing"
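
To make the effect of the `filters` entries in this hunk easier to follow, here is a rough sketch of how negated regex filters could select member names. It approximates the documented behaviour of the mkdocstrings Python handler and is not its actual implementation. It assumes that each pattern is searched in the name, that a leading `!` turns a match into an exclusion, and that the last matching filter wins. The `keep_object` function and the `members` list are illustrative names only.

```python
import re


def keep_object(name, filters):
    """Approximate how `!`-negated regex filters select member names.

    Assumption: each pattern is searched in the name, a leading `!` turns a
    match into an exclusion, and the last matching filter wins.
    """
    keep = None
    rules = set()  # records whether include and/or exclude rules are present
    for pattern in filters:
        exclude = pattern.startswith("!")
        if exclude:
            pattern = pattern[1:]
        rules.add(exclude)
        if re.search(pattern, name):
            keep = not exclude
    if keep is None:
        # No filter matched: reject only when every filter is an include rule.
        return rules != {False}
    return keep


# Filters copied from the `llm_gallery.md` hunk above.
LLM_GALLERY_FILTERS = ["!^LLM$", "!^AsyncLLM$", "!typing"]

# Example member names; the integrations are ones named in the README diff.
members = ["LLM", "AsyncLLM", "OpenAILLM", "AnthropicLLM", "vLLM", "typing"]
kept = [name for name in members if keep_object(name, LLM_GALLERY_FILTERS)]
print(kept)
```

Because `^LLM$` and `^AsyncLLM$` are anchored, only the two base classes are dropped and subclasses such as `OpenAILLM` are kept. The unanchored `typing` pattern removes any member whose name contains that string.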
3 changes: 0 additions & 3 deletions docs/api/llm/mistral.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/ollama.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/openai.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/together.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/vertexai.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/vllm.md

This file was deleted.

4 changes: 4 additions & 0 deletions docs/api/pipeline/step_wrapper.md
@@ -0,0 +1,4 @@
# Step Wrapper

::: distilabel.pipeline.step_wrapper._StepWrapper
::: distilabel.pipeline.step_wrapper._StepWrapperException
3 changes: 0 additions & 3 deletions docs/api/pipeline/utils.md

This file was deleted.

3 changes: 3 additions & 0 deletions docs/api/step/typing.md
@@ -0,0 +1,3 @@
# Step Typing

::: distilabel.steps.typing
1 change: 1 addition & 0 deletions docs/api/step_gallery/columns.md
@@ -6,3 +6,4 @@ This section contains the existing steps intended to be used for common column o
::: distilabel.steps.columns.keep
::: distilabel.steps.columns.merge
::: distilabel.steps.columns.group
::: distilabel.steps.columns.utils
13 changes: 9 additions & 4 deletions docs/api/step_gallery/extra.md
@@ -1,6 +1,11 @@
# Extra

::: distilabel.steps.generators.data
::: distilabel.steps.deita
::: distilabel.steps.formatting
::: distilabel.steps.typing
::: distilabel.steps
options:
filters:
- "!Argilla"
- "!Columns"
- "!From(Disk|FileSystem)"
- "!Hub"
- "![Ss]tep"
- "!typing"
1 change: 1 addition & 0 deletions docs/api/step_gallery/hugging_face.md
@@ -5,3 +5,4 @@ This section contains the existing steps integrated with `Hugging Face` so as to
::: distilabel.steps.LoadDataFromDisk
::: distilabel.steps.LoadDataFromFileSystem
::: distilabel.steps.LoadDataFromHub
::: distilabel.steps.PushToHub
File renamed without changes.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file added docs/assets/pipelines/arena-hard.png
Binary file added docs/assets/pipelines/clair.png
Binary file added docs/assets/pipelines/clean-dataset.png
Binary file added docs/assets/pipelines/deepseek.png
Binary file added docs/assets/pipelines/deita.png
Binary file added docs/assets/pipelines/knowledge_graphs.png
Binary file added docs/assets/pipelines/prometheus.png
Binary file added docs/assets/pipelines/sentence-transformer.png
Binary file added docs/assets/pipelines/ultrafeedback.png
Binary file added docs/assets/tutorials-assets/overview-apigen.jpg
26 changes: 22 additions & 4 deletions docs/index.md
@@ -38,21 +38,39 @@ hide:

Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

If you just want to get started, we recommend you check the [documentation](http://distilabel.argilla.io/). Curious, and want to know more? Keep reading!
<div class="grid cards" markdown>

- __Get started in 5 minutes!__

---

Install distilabel with `pip` and run your first `Pipeline` to generate and evaluate synthetic data.

[:octicons-arrow-right-24: Quickstart](./sections/getting_started/quickstart.md)

- __How-to guides__

---

Get familiar with the basics of distilabel. Learn how to define `steps`, `tasks` and `llms` and run your `Pipeline`.

[:octicons-arrow-right-24: Learn more](./sections/how_to_guides/index.md)

</div>

## Why use distilabel?

Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects including traditional predictive NLP (classification, extraction, etc.), or generative and large language model scenarios (instruction following, dialogue generation, judging etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.

### Improve your AI output quality through data quality
<p style="font-size:20px">Improve your AI output quality through data quality</p>

Compute is expensive and output quality is important. We help you **focus on data quality**, which tackles the root cause of both of these problems at once. Distilabel helps you to synthesize and judge data to let you spend your valuable time **achieving and keeping high-quality standards for your synthetic data**.

### Take control of your data and models
<p style="font-size:20px">Take control of your data and models</p>

**Ownership of data for fine-tuning your own LLMs** is not easy but distilabel can help you to get started. We integrate **AI feedback from any LLM provider out there** using one unified API.

### Improve efficiency by quickly iterating on the right research and LLMs
<p style="font-size:20px">Improve efficiency by quickly iterating on the right data and models</p>

Synthesize and judge data with **latest research papers** while ensuring **flexibility, scalability and fault tolerance**. So you can focus on improving your data and training your models.
