`1.4.0` #1024

gabrielmbmb · 2024-10-08T14:06:33Z

No description provided.

* Add default structured output for GenerateSentencePair task * Move default behavior to base class * Add docstrings to the methods and move json schemas to the class method * Add tests for default structured outputs in sentence transformers task * Add control for parsing errors on JSON data * Refactor code per code review, to simplify just creating the default schemas * Add extra check to avoid setting the structured output if the method wasn't overridden

* Add default structured output for GenerateSentencePair task * Move default behavior to base class * Add docstrings to the methods and move json schemas to the class method * Add tests for default structured outputs in sentence transformers task * Add control for parsing errors on JSON data * Add default structured output for ComplexityScorer task * Refactor code per code review, to simplify just creating the default schemas * Add extra check to avoid setting the structured output if the method wasn't overridden * Refactor get_structured_output to return just the schema * Add reference for the JSON schema

* Add default structured output for GenerateSentencePair task * Move default behavior to base class * Add docstrings to the methods and move json schemas to the class method * Add tests for default structured outputs in sentence transformers task * Add control for parsing errors on JSON data * Add default structured output for ComplexityScorer task * Add default structured output for QualityScorer task * Add example to the docstrings * Refactor code per code review, to simplify just creating the default schemas * Add extra check to avoid setting the structured output if the method wasn't overridden * Refactor get_structured_output to return just the schema * Add reference for the JSON schema * Refactor get_structured_output to return just the schema

* Add default structured output for GenerateSentencePair task * Move default behavior to base class * Add docstrings to the methods and move json schemas to the class method * Add tests for default structured outputs in sentence transformers task * Add control for parsing errors on JSON data * Add default structured output for ComplexityScorer task * Add default structured output for QualityScorer task * Add example to the docstrings * Refactor code per code review, to simplify just creating the default schemas * Add extra check to avoid setting the structured output if the method wasn't overridden * Refactor get_structured_output to return just the schema * Add reference for the JSON schema * Refactor get_structured_output to return just the schema * Add default structured output for UltraFeedback task
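To illustrate the "control for parsing errors on JSON data" bullet in these commits, here is a minimal sketch of tolerant parsing of a structured LLM output. The function `parse_structured_output` and the key names are hypothetical, chosen for illustration; this is not the code from the PR.

```python
import json
from typing import Dict, List, Optional


def parse_structured_output(raw: Optional[str], keys: List[str]) -> Dict[str, Optional[str]]:
    """Return the requested keys from a JSON string, or None values on any parsing problem."""
    empty = {key: None for key in keys}
    if raw is None:
        return empty
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Malformed JSON from the model: fall back instead of raising.
        return empty
    if not isinstance(data, dict):
        return empty
    return {key: data.get(key) for key in keys}


good = parse_structured_output('{"positive": "a", "negative": "b"}', ["positive", "negative"])
bad = parse_structured_output("not json at all", ["positive", "negative"])
```

Returning `None` values rather than raising keeps one malformed generation from failing a whole batch, which seems to be the intent of the bullet.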

* Add `save_artifact` method * Upload pipeline generated artifacts * Fix log file was being saved in different cache * Update `save_to_disk` to also save artifacts * Render artifacts in card * Update unit tests * Add missing unit tests * Update src/distilabel/distiset.py Co-authored-by: Agus <[email protected]> * Add section about saving artifacts * Add correct `edit_uri` --------- Co-authored-by: Agus <[email protected]>

…nclude the formatted input (#903) * Add attribute to include raw formatted input to distilabel_metadata field * Update tests to take into account add_raw_input attribute of tasks * Add reference to add_raw_input in the documentation * Update tests to control for the add_raw_input of the _Task

…mber of tokens or characters (#902) * Add new category for text manipulation and sort the dict alphabetically * Redirect import * Add new TruncateRow step to truncate the text using the number of characters or tokens * Add tests for TruncateRow * Update tokenizer name to avoid errors accessing the repo in CI * Update src/distilabel/steps/__init__.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Update src/distilabel/steps/truncate.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Refactor tokenizer_name to tokenizer for consistency * Update test for the tokenizer refactor * Refactor TruncateRow to TruncateTextColumn --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>
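The idea of truncating by characters or by tokens can be sketched as follows. This is an illustration only: `truncate_text` is a hypothetical helper, and a plain whitespace split stands in for a real tokenizer. It is not the `TruncateTextColumn` implementation.

```python
from typing import Callable, List, Optional


def truncate_text(
    text: str,
    max_length: int,
    tokenizer: Optional[Callable[[str], List[str]]] = None,
) -> str:
    """Truncate by characters when no tokenizer is given, otherwise by tokens."""
    if tokenizer is None:
        return text[:max_length]
    tokens = tokenizer(text)
    # Assumption: tokens can be re-joined with spaces (true for whitespace splitting only).
    return " ".join(tokens[:max_length])


by_chars = truncate_text("hello world foo", 5)
by_tokens = truncate_text("hello world foo", 2, tokenizer=str.split)
```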

…ating optionality (#883) * Use `CudaDevicePlacementMixin` in `RewardModelScore` step * Add `StepColumns` type * Update inputs and outputs validation * Update type hints * Update inputs checking * Add unit test for checking inputs/outputs with dict * Update type hints * Update `inputs` and `outputs` return * Add missing inputs and outputs in docstring * Update docs

* Add deepseek prover autoformalization task * Add task for the scorer as a jinja template to make it easy to maintain * Add deepseek prover scorer task * Add tests for the scorer task * Redirect import * Create a folder for the deepseek-prover templates * Make generator task more general including few shot examples * Remove the few shot argument as we can determine by just checking for examples * Remove deepseek-prover from the core as they are not that relevant for general pipelines * Add deepseek prover pipeline * Add entry for the paper implementation * Remove tests * Remove import * Remove redirected import

…dels (#893) * Add initial outline tutorial * Add section on data quality evaluation * Add conclusion * Update pipeline_samples structure for adding tutorials in a similar way as Argilla docs * Update new structure tutorials * Update title * Update to use Free serverless Inference API * Process comments from code review * Remove sections from header * Updated formatting examples * Add grid arrow on new line * update phrasing * update phrasing

* Fix repo_id in load and make config argument optional if possible * Add tests for LoadFromDisk * Update src/distilabel/steps/generators/huggingface.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Make error more informative --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* add tutorials * clean dataset tutorial * generate preference dataset tutorial * modify sentence pairs tutorial * add to index * add missing component * fix: first feedback * fix: add headers * fix: process for steps * fix: typo and note * add torch * fix typo

* feat: add basic draw implementation to pipeline * refactor: cleanup some code * feat: add functionality to draw TD or LR * refactor: remove step name from vis * refactor: default to LR generation * Add dag with mapping * feat: add edge labels * Remove images * feat: add support for leaf node to argilla and distilabel * refactor: order of functions * test: Add tests * fix: replace logger warning for `warnings.warn` to avoid non-initialized logger * fix: avoid potentially getting raised errors during `get_outputs` call relying on dynamic calls * docs: Add visualizing pipelines section * feat: Add a try-except around pipeline visualization in Notebook to ensure it will never be a blocking action * feat: add a show method to the pipelines for visualizing in notebooks * docs: add more context on pipeline.show * Apply suggestions from code review Co-authored-by: Agus <[email protected]> * Update src/distilabel/steps/generators/huggingface.py * feat: remove show to simplify flow * refactor: mermaid URL at top as constant * feat: improve flow for passing by info to a potential next step * docs: update docstring --------- Co-authored-by: Agus <[email protected]>

* fix: converting ModelMetaClass to model_json_schema * fix: allow for adding optional literal format json to instructor to make methods more interchangeable * docs: emphasize usability with any framework * fix: first check if structured_output has been defined * Update docs/sections/how_to_guides/advanced/structured_generation.md Co-authored-by: Agus <[email protected]> --------- Co-authored-by: Agus <[email protected]>

* feat: add initial version of argilla labeller task * fix: arguments in runtime parameters * feat: add field descriptions * feat: Update record formatting logic during structured generation * feat: update workflows * refactor: work based off server payloads * fix: resolve serialization example records * fix: only convert examples when provided * fix: set to basically zero * fix: add temperature fix * fix: revert changes * fix: example records with formatted responses * fix: set max new tokens manually * fix: some fixes in formatting * refactor: some code quality improvements * feat: improv * refactor: remove unused code * fix: wrong prompt template * fix: remove print statement * fix: added pydantic runtimeparameter definition * fix: creating new characters per line examples * fix: add nuance on example in prompt template * feat: Add guidelines to prompt template * fix: remove pdb trace * fix: avoid using records without correct responses * feat: add ability to forward different questions * test: add tests for argilla labeller * fix: wrong docstring * fix: wrong docstring * refactor: rename suggestions -> suggestion * docs: update examples * tests: remove span question * docs: update the examples * Apply suggestions from code review Co-authored-by: Gabriel Martín Blázquez <[email protected]> * refactor: apply suggestions code review * fix: type hinting Record import * fix: tests * tests: fix failing tests --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Add apigen task module * Add tests for apigen * Fix default name for dataset info when requesting the number of examples * checkpoint * Add tests for apigen generator * Create jinja template, split methods and add docstrings * Update string format * Simplify function setting and move it to load method * Add tests for semantic checker * Add prompt template for semantic checker * Redirect import for semantic checker * Fix docstrings for output columns * Add semantic checker task from apigen * Add notes for execution checker * Remove extra jump of line * Add first version of data sampler, step helper for apigen * Add tests for data sampler * Add integration test to check the sampler can be mixed with another generator step * Draft tests for new execution checker * Move helper functions * Draft for execution checker functionality * Add first version of execution checker and tests * Add tests for utils module of apigen * Remove unnecessary step for transformation and rename files for clarity * Fix import * Change function results name to show the original results from the execution * Remove print when the url for a reference doesn't contain https://arxiv * first working version * Fix tests including previous columns * Go back to previous name for dummy llm * Change dummy llm names on tests * Read the answers from the model parsed instead of dumped string * Add option to include the tools if available for few shot * Allow extra checks for the parameter types and tests for those * Add docs for the execution checker * Add new icon for execution * Fix return type for outputs column * Fix docstrings * Redirect imports to top level * Update docstrings to render on components gallery * Improve docstrings for fields in the data sampler * Remove unnecessary data from docstrings and remove TODO * Add missing data variable in example * Update src/distilabel/steps/tasks/apigen/execution_checker.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Refactor to return formatted json string instead of dict to simplify work with arrow * Draft tutorial to replicate paper * Allow number to be a dict with values and probabilities * Update pipeline run call * Add functionality to load functions from a folder with .py files * Fix comment for arg * Add example implementation * Add dependency for vllm * Fix dependency name * Add setuptools-scm in the script with the dependencies to install it prior to vllm * Another attempt with system * Add tests to take into account casting methods * Avoid casting and update prompt to ensure argument order is respected * Inform error type on generator * Add extra checks and safeguards for failed answer generation * Ensure the error is of the expected type * Fix unstructured generation * Remove json fences and fix semantic checker * Control case of functions without arguments * Add additional checks to run the execution checker * Remove additional dependency * Try fixing CI error with dependencies * Install dependency for the system * Undo fix attempt * Try fixing llvmlite dependency issue * Remove additional dependency as it breaks other tests --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Add integration test to showcase the prompts * Add a base print method so the Tasks can pretty print their prompts easily * Update base method to allow automatic pretty printing * Add optional argument instead of the default, update method name and return type * Fix type hint * Add example in docstrings * Add section in docs for the print method

* Add signature method for Serializable objects * Update signature to only keep track of the step names and not its internal info * Refactor hash generation * Add dummy batch manager from dag * Update batch manager cache tests to start batch manager from a DAG * Draft of integration tests for new caching * Checkpoint draft * Add cache directory location * Add use_cache argument to Step for future use * Change output names to keep track of them while debugging * Make use of use_cache at the step level * Add docstrings for internal batch manager arguments * Remove path from add_batch method * Move step caching to get_batch method in batch manager step * Read batches from cached dir * Set every step cache to False if the pipeline has the cache as False * Comment for the batch manager * Move back to caching from add_step * Checkpoint current status * Add use_cache on step * If there's previous data saved, concatenate the content of the parquet files * Only read the distiset from cache if all the steps are the same, otherwise overwrite * Add changes to make loading a new and modified step feasible * Set use cache to True by default * Move logic of registering the batches to BasePipeline._register_batch to do it before calling _manage_batch_flows * Avoid reading parquet file from cache when any of the steps has use_cache=False * Add is_convergence method to DAG and cleanup batch_manager * Add integration tests for the new caching mechanism * Update unit tests related to register_batch * Fix signature serialization case of void list * Add use_cache to argilla tests * Fix tests related to use_cache * Fix tests * Remove undefined object input * Add `_invalidate_steps_cache_if_required` method * Initial work for loading batches from `batch_manager_data` directory * Draft cache updates * Update pipeline signature * Add signature mixin from other PR * Moved pipeline cache to executions folder with different data per pipeline * Testing new updates to read from cache * Checkpoint with loading working while adding new steps * Point of control * Fix not all the batches were being saved * Sort batches after loaded * Fix `load_from_cache` to load batches from `steps_data` directory correctly * Update test * Add `step_has_finished` method * Update invalidate cache function * Update integration caching tests * Refactor to extract logic to methods * Refactor to remove `cached_data_dir` * Update stages message * Refactor `invalidate_cache_for` method * Fix `_BatchManager` unit tests * Update to not serialize `exclude_from_signature` attribute * Fix pipeline unit tests * Remove write buffer data if `use_cache=False` * Fix offline batch generation attributes were not being ignored by signature * Fix print test * Fix routing batch function --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>
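The cache rules listed in these bullets (a signature built from step names, pipeline-level `use_cache=False` disabling every step, and reuse only when all steps match) could be sketched roughly as below. `StepInfo`, `signature` and `can_reuse_cached_output` are hypothetical names for illustration; the real batch manager logic in the PR is considerably more involved.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class StepInfo:
    """Hypothetical minimal view of a step: only what the signature needs."""
    name: str
    use_cache: bool = True


def signature(steps: List[StepInfo]) -> str:
    # Only step names feed the hash, mirroring the "keep track of the step names" bullet.
    names = ",".join(step.name for step in steps)
    return hashlib.sha256(names.encode("utf-8")).hexdigest()


def can_reuse_cached_output(
    cached_signature: str, steps: List[StepInfo], pipeline_use_cache: bool = True
) -> bool:
    if not pipeline_use_cache:
        return False  # pipeline-level switch disables caching for every step
    if any(not step.use_cache for step in steps):
        return False  # one uncached step is enough to skip the cached result
    return signature(steps) == cached_signature


steps = [StepInfo("load_data"), StepInfo("text_generation")]
cached = signature(steps)
```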

review-notebook-app · 2024-10-08T14:06:39Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

github-actions · 2024-10-08T14:09:39Z

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1024/

codspeed-hq · 2024-10-08T14:13:52Z

CodSpeed Performance Report

Merging #1024 will not alter performance

Comparing develop (f1f7d77) with develop (925d259)

Summary

✅ 1 untouched benchmarks

gabrielmbmb and others added 30 commits August 6, 2024 14:26

Bump version to 1.4.0

ecbe16b

Merge branch 'main' into develop

1a39e01

Make ClientvLLM.model_name a cached_property (#862)

2ded30f

* Update `ClientvLLM.model_name` to `cached_property` * Fix unit test
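For context on the pattern, here is a minimal sketch of how `functools.cached_property` computes a value once per instance. `FakeClient` and `_fetch_model_name` are hypothetical stand-ins, not the actual `ClientvLLM` code.

```python
from functools import cached_property


class FakeClient:
    """Hypothetical client whose model name is expensive to look up."""

    def __init__(self) -> None:
        self.calls = 0

    def _fetch_model_name(self) -> str:
        # Stand-in for a network request to the serving endpoint.
        self.calls += 1
        return "example-model"

    @cached_property
    def model_name(self) -> str:
        # Evaluated on first access, then stored on the instance.
        return self._fetch_model_name()


client = FakeClient()
first = client.model_name
second = client.model_name
```

After the first access the value lives in the instance `__dict__`, so repeated reads do not trigger another lookup.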

Pass dataset to dry_run method (#863)

314b759

Remove use of default_chat_template (#888)

bbe04fd

Temporary (using pip) fix for installing llama-cpp-python in CI (#886)

1198d24

Fix unit tests after release of transformers==4.44.0 (#891)

8916ff2

* Update unit tests so they work with `transformers>=4.44.0` * fix more unit tests

Fix default structured output (#892)

75baf64

* Add check for dependencies for structured outputs and change default value of structured outputs * Update tests with serialized default structured output --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>

Send as many batches as possible to input queues (#895)

7ff4d20

* Update `_manage_batch_flow` to send as many batches as can be built * Fix load stages * Fix unit test * Fix `argilla` unit test after release `2.0.1` * Can fail

Exclude repo_id from LoadDataFromFileSystem (#898)

04d0bf0

* Exclude repo_id from LoadDataFromFileSystem generator class and update tests * Update code to be compatible with python 3.9

Fix loader to read from a glob pattern (#877)

f382f1c

* Fix loader to read from a glob pattern * Fix to read from general UPath instead of Path * Update tests to use glob patterns * Refactor to simplify check for glob pattern
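A rough sketch of the "check for glob pattern" idea, using only the standard library. The commit itself mentions `UPath`; this example swaps in `glob` and `pathlib` for simplicity, and `has_glob_pattern` / `resolve_data_files` are hypothetical helper names.

```python
import glob
import tempfile
from pathlib import Path
from typing import List


def has_glob_pattern(path: str) -> bool:
    """Treat the path as a pattern if it contains any glob metacharacter."""
    return any(char in path for char in "*?[")


def resolve_data_files(path: str) -> List[str]:
    # Expand patterns; pass plain paths through untouched.
    if has_glob_pattern(path):
        return sorted(glob.glob(path, recursive=True))
    return [path]


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jsonl", "b.jsonl", "notes.txt"):
        Path(tmp, name).write_text("{}\n")
    matched = [Path(p).name for p in resolve_data_files(str(Path(tmp, "*.jsonl")))]
    single = resolve_data_files(str(Path(tmp, "a.jsonl")))
```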

Update mistrallm (#904)

ed874ba

* Update mistralai client to version 1.*.* * Update tests for new mistral client

Update RewardModelScore.inputs to define optional input columns (#908)

974f0db

docs: minor fixes (#913)

516909e

* Fix minor error deepseek prover * Fix minor type generate sentence pairs

Add URIAL task (#921)

2a3906d

* Initial work for `URIAL` * Update template * Fix checking last message * Add `format_output` logic * Refine `format_output` and add docstring * Add `References` * Add `URIAL` unit tests

Add vLLMEmbeddings (#920)

a796a75

* Add vLLMEmbeddings to work with multiple GPUs * Add mocked tests

Fix StructuredGeneration examples and internal check (#912)

6576d1a

* Fix error with instructor schema input * Fix examples of structured generation * Try inferring the type of format in case the user forgets informing about it
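One way the "try inferring the type of format" bullet could work is sketched below. It assumes the two candidate formats are `"json"` and `"regex"`, which this page does not confirm, and `infer_format` is a hypothetical helper rather than the function added in #912.

```python
import json
from typing import Any


def infer_format(schema: Any) -> str:
    """Guess whether a structured-output schema is a JSON schema or a regex."""
    if isinstance(schema, dict):
        return "json"
    if isinstance(schema, str):
        try:
            parsed = json.loads(schema)
        except json.JSONDecodeError:
            # Not valid JSON, so treat the string as a regular expression.
            return "regex"
        return "json" if isinstance(parsed, dict) else "regex"
    raise TypeError(f"Unsupported schema type: {type(schema).__name__}")


from_dict = infer_format({"type": "object"})
from_json_string = infer_format('{"type": "object"}')
from_regex = infer_format(r"\d{3}-\d{4}")
```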

Generate deterministic pipeline name when it's not given (#878)

fc5d070

* Generate deterministic pipeline name when it's not given * Use the names of the steps to generate the default pipeline name * Update test with the steps names * Add suggestion from code review
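As an illustration of deriving a deterministic default name from step names, here is a minimal sketch. The hashing scheme, the `pipeline_` prefix and the function name `default_pipeline_name` are assumptions made for this example; the scheme actually used in #878 is not shown on this page.

```python
import hashlib
from typing import Sequence


def default_pipeline_name(step_names: Sequence[str]) -> str:
    """Build a stable name from the ordered step names (hypothetical scheme)."""
    digest = hashlib.sha1("-".join(step_names).encode("utf-8")).hexdigest()
    return f"pipeline_{digest[:8]}"


name_a = default_pipeline_name(["load_data", "text_generation"])
name_b = default_pipeline_name(["load_data", "text_generation"])
name_c = default_pipeline_name(["load_data", "ultrafeedback"])
```

Because the name depends only on the step names, rerunning the same pipeline yields the same name, unlike a random or timestamp-based default.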

davidberenstein1957 and others added 23 commits September 20, 2024 10:03

docs: update install overview in readme

33b58bf

docs: update installation overview

a2ab68d

Fix missing batch when last batch arrive early (#989)

f997cfd

Fine personas socialai tutorial (#992)

ad231ab

* Remove pdm things * Draft of socialai example * Add example/post for socialai/fine personas * Simplify title per code review

[DOCS] Add developer documentation section in the docs (#999)

a178109

* Add new section with developer docs * Fix name of link * Add help for PR body

Fix vllm installation in CI (#1009)

a49242d

Fix writing distilabel_metadata column when LLM error (#1003)

3244c05

* fix metadata writeout when llm error * linter reformat --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>

Add example of custom text generation step in quickstart (#984)

3fd680c

fix: validate fields and questions during process

b4c13ba

fix: validation of fields and records passed

1eb0524

fix: suggestion serialisation argilla labeller

7b5cbb0

Fix llvmlite install with uv (#1018)

4848dd2

* Add `numba >= 0.54.0` * Use `numpy < 2.0.0` * Install vLLM first * remove llm blender install

tests: validate passing questions and field within format_input too (#1017)

d5c0484

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

Fix impute when output_mapping is not empty (#1015)

4b8903b

Add CLAIR task (#926)

e027f99

* Redirect import of CLAIR * Add jinja2 template for CLAIR * Add CLAIR task * Add tests for CLAIR task * Update example in docstrings * Add tutorial to reproduce CLAIR * Show new tutorial in the gallery and fix rendering issue in docstrings

Fix IndexError when overriding inputs and group_generations=False (#1022)

4cbcb90

* Fix processing num_generations when applying input mappings in steps process * Add unit test * Update comment --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>

Update Pipeline cache docs (#1023)

d99011c

* Update link * Update cache section * Add step to fail if warnings * Fix dependency name

Fix cross-reference

6ef15f4

gabrielmbmb force-pushed the develop branch from 98bd95c to 6ef15f4 on October 8, 2024 14:18

plaguss approved these changes Oct 8, 2024

gabrielmbmb merged commit c0d798a into main Oct 8, 2024
12 checks passed
