1.4.0
#1024
Merged
* Update `ClientvLLM.model_name` to `cached_property` * Fix unit test
* Add default structured output for GenerateSentencePair task * Move default behavior to base class * Add docstrings to the methods and move json schemas to the class method * Add tests for default structured outputs in sentence transformers task * Add control for parsing errors on JSON data * Refactor code per code review, to simplify just creating the default schemas * Add extra check to avoid setting the structured output if the method wasn't overridden
* Add default structured output for GenerateSentencePair task * Move default behavior to base class * Add docstrings to the methods and move json schemas to the class method * Add tests for default structured outputs in sentence transformers task * Add control for parsing errors on JSON data * Add default structured output for ComplexityScorer task * Refactor code per code review, to simplify just creating the default schemas * Add extra check to avoid setting the structured output if the method wasn't overridden * Refactor get_structured_output to return just the schema * Add reference for the JSON schema
* Add default structured output for GenerateSentencePair task * Move default behavior to base class * Add docstrings to the methods and move json schemas to the class method * Add tests for default structured outputs in sentence transformers task * Add control for parsing errors on JSON data * Add default structured output for ComplexityScorer task * Add default structured output for QualityScorer task * Add example to the docstrings * Refactor code per code review, to simplify just creating the default schemas * Add extra check to avoid setting the structured output if the method wasn't overridden * Refactor get_structured_output to return just the schema * Add reference for the JSON schema * Refactor get_structured_output to return just the schema
* Add default structured output for GenerateSentencePair task * Move default behavior to base class * Add docstrings to the methods and move json schemas to the class method * Add tests for default structured outputs in sentence transformers task * Add control for parsing errors on JSON data * Add default structured output for ComplexityScorer task * Add default structured output for QualityScorer task * Add example to the docstrings * Refactor code per code review, to simplify just creating the default schemas * Add extra check to avoid setting the structured output if the method wasn't overridden * Refactor get_structured_output to return just the schema * Add reference for the JSON schema * Refactor get_structured_output to return just the schema * Add default structured output for UltraFeedback task
* Update unit tests so they work with `transformers>=4.44.0` * fix more unit tests
* Add check for dependencies for structured outputs and change default value of structured outputs * Update tests with serialized default structured output --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* Update `_manage_batch_flow` to send as many batches as can be built * Fix load stages * Fix unit test * Fix `argilla` unit test after release `2.0.1` * Can fail
* Fix loader to read from a glob pattern * Fix to read from general UPath instead of Path * Update tests to use glob patterns * Refactor to simplify check for glob pattern
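As a rough illustration of the "check for glob pattern" idea in the commit above, here is a minimal sketch using only the standard library. The function names `is_glob_pattern` and `split_glob` are hypothetical and are not taken from the distilabel code base, which (per the commit) works with `UPath` rather than plain strings.

```python
from pathlib import PurePosixPath
from typing import Tuple

# Characters that give a path component glob semantics.
GLOB_CHARS = frozenset("*?[")


def is_glob_pattern(path: str) -> bool:
    """Return True if the path contains any glob metacharacter."""
    return any(ch in GLOB_CHARS for ch in path)


def split_glob(path: str) -> Tuple[str, str]:
    """Split a path into (static base directory, glob pattern).

    If the path has no glob component, the pattern is an empty string.
    """
    parts = PurePosixPath(path).parts
    for i, part in enumerate(parts):
        if is_glob_pattern(part):
            base = str(PurePosixPath(*parts[:i])) if i else "."
            return base, str(PurePosixPath(*parts[i:]))
    return path, ""
```

Splitting off the static base lets a loader list that directory once and apply the pattern to its contents, instead of treating the whole string as a literal file name.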
* Add `save_artifact` method * Upload pipeline generated artifacts * Fix log file was being saved in different cache * Update `save_to_disk` to also save artifacts * Render artifacts in card * Update unit tests * Add missing unit tests * Update src/distilabel/distiset.py Co-authored-by: Agus <[email protected]> * Add section about saving artifacts * Add correct `edit_uri` --------- Co-authored-by: Agus <[email protected]>
…nclude the formatted input (#903) * Add attribute to include raw formatted input to distilabel_metadata field * Update tests to take into account add_raw_input attribute of tasks * Add reference to add_raw_input in the documentation * Update tests to control for the add_raw_input of the _Task
…mber of tokens or characters (#902) * Add new category for text manipulation and sort the dict alphabetically * Redirect import * Add new TruncateRow step to truncate the text using the number of characters or tokens * Add tests for TruncateRow * Update tokenizer name to avoid errors accessing the repo in CI * Update src/distilabel/steps/__init__.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Update src/distilabel/steps/truncate.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Refactor tokenizer_name to tokenizer for consistency * Update test for the tokenizer refactor * Refactor TruncateRow to TruncateTextColumn --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>
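The commit above describes truncating text by number of characters or by number of tokens. The sketch below shows the general idea only; `truncate_text` and its `tokenize` argument are hypothetical names, and a whitespace splitter stands in for a real tokenizer. It is not the `TruncateTextColumn` implementation.

```python
from typing import Callable, List, Optional


def truncate_text(
    text: str,
    max_length: int,
    tokenize: Optional[Callable[[str], List[str]]] = None,
) -> str:
    """Truncate `text` to `max_length` characters, or tokens if `tokenize` is given."""
    if tokenize is None:
        # Character mode: a plain slice.
        return text[:max_length]
    # Token mode: keep the first `max_length` tokens.
    # Assumption: tokens can be rejoined with single spaces, which only
    # holds for a whitespace tokenizer like the one used below.
    tokens = tokenize(text)
    return " ".join(tokens[:max_length])


print(truncate_text("hello world", 5))
print(truncate_text("a b c d", 2, tokenize=str.split))
```

With a subword tokenizer the last step would instead decode the truncated token ids back to text, since joining with spaces would not reproduce the original string.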
…ating optionality (#883) * Use `CudaDevicePlacementMixin` in `RewardModelScore` step * Add `StepColumns` type * Update inputs and outputs validation * Update type hints * Update inputs checking * Add unit test for checking inputs/outputs with dict * Update type hints * Update `inputs` and `outputs` return * Add missing inputs and outputs in docstring * Update docs
* Update mistralai client to version 1.*.* * Update tests for new mistral client
* Add deepseek prover autoformalization task * Add task for the scorer as a jinja template to make it easy to maintain * Add deepseek prover scorer task * Add tests for the scorer task * Redirect import * Create a folder for the deepseek-prover templates * Make generator task more general including few shot examples * Remove the few shot argument as we can determine by just checking for examples * Remove deepseek-prover from the core as they are not that relevant for general pipelines * Add deepseek prover pipeline * Add entry for the paper implementation * Remove tests * Remove import * Remove redirected import
…dels (#893) * Add initial outline tutorial * Add section on data quality evaluation * Add conclusion * Update pipeline_samples structure for adding tutorials in a similar way as Argilla docs * Update new structure tutorials * Update title * Update to use Free serverless Inference API * Process comments from code review * Remove sections from header * Updated formatting examples * Add grid arrow on new line * update phrasing * update phrasing
* Fix repo_id in load and make config argument optional if possible * Add tests for LoadFromDisk * Update src/distilabel/steps/generators/huggingface.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Make error more informative --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* Fix minor error deepseek prover * Fix minor type generate sentence pairs
* add tutorials * clean dataset tutorial * generate preference dataset tutorial * modify sentence pairs tutorial * add to index * add missing component * fix: first feedback * fix: add headers * fix: process for steps * fix: typo and note * add torch * fix typo
* Fix error with instructor schema input * Fix examples of structured generation * Try inferring the type of format in case the user forgets informing about it
* Generate deterministic pipeline name when it's not given * Use the names of the steps to generate the default pipeline name * Update test with the steps names * Add suggestion from code review
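One way to derive a deterministic default name from the step names, as the commit above describes, is to hash them. This is a sketch under assumptions: the function name `default_pipeline_name`, the `pipeline_` prefix, the choice of SHA-1, and sorting the names first are all illustrative and not confirmed by the source.

```python
import hashlib
from typing import Iterable


def default_pipeline_name(step_names: Iterable[str]) -> str:
    """Build a stable pipeline name from the names of its steps."""
    # Sorting (an assumption) makes the name independent of insertion order.
    joined = ",".join(sorted(step_names))
    digest = hashlib.sha1(joined.encode("utf-8")).hexdigest()
    return f"pipeline_{digest[:8]}"


name = default_pipeline_name(["load_data", "text_generation"])
print(name)
```

Because the name depends only on the step names, rerunning the same script yields the same name, which lets a cache keyed on the pipeline name be found again.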
* Remove pdm things * Draft of socialai example * Add example/post for socialai/fine personas * Simplify title per code review
* feat: add basic draw implementation to pipeline * refactor: cleanup some code * feat: add functionality to draw TD or LR * refactor: remove step name from vis * refactor: default to LR generation * Add dag with mapping * feat: add edge labels * Remove images * feat: add support for leaf node to argilla and distilabel * refactor: order of functions * test: Add tests * fix: replace logger warning for `warning.warn` to avoid non-initialized logger * fix: avoid potentially getting raised errors during `get_outputs` call relying on dynamic calls * docs: Add visualizing pipelines section * feat: Add a try-except around pipeline visualization in Notebook to ensure it will never be a blocking action * feat: add a show method to the pipelines for visualizing in notebooks * docs: add more context on pipeline.show * Apply suggestions from code review Co-authored-by: Agus <[email protected]> * Update src/distilabel/steps/generators/huggingface.py * feat: remove show to simplify flow * refactor: mermaid URL at top as constant * feat: improve flow for passing by info to a potential next step * docs: update docstring --------- Co-authored-by: Agus <[email protected]>
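The drawing commit above mentions Mermaid, a TD or LR orientation, and edge labels. A minimal sketch of emitting a Mermaid flowchart definition from a list of edges could look like the following; the function `to_mermaid` and its edge tuple format are hypothetical and do not reflect how distilabel builds or renders its graph.

```python
from typing import Iterable, Optional, Tuple

Edge = Tuple[str, str, Optional[str]]  # (source step, target step, optional label)


def to_mermaid(edges: Iterable[Edge], orientation: str = "LR") -> str:
    """Render edges as a Mermaid flowchart definition (top-down or left-right)."""
    if orientation not in ("TD", "LR"):
        raise ValueError("orientation must be 'TD' or 'LR'")
    lines = [f"flowchart {orientation}"]
    for src, dst, label in edges:
        # Mermaid writes a labelled edge as `A -->|label| B`.
        arrow = f"-->|{label}|" if label else "-->"
        lines.append(f"    {src} {arrow} {dst}")
    return "\n".join(lines)


print(to_mermaid([("load_data", "generate", "instruction"), ("generate", "to_argilla", None)]))
```

The resulting text can be pasted into any Mermaid renderer; the default of `LR` mirrors the "default to LR generation" bullet.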
* fix: converting ModelMetaClass to model_json_schema * fix: allow for adding optional literal format json to instructor to make methods more inter-changable * docs: emphasize usability with any framework * fix: first check if structured_output has been defined * Update docs/sections/how_to_guides/advanced/structured_generation.md Co-authored-by: Agus <[email protected]> --------- Co-authored-by: Agus <[email protected]>
* Add new section with developer docs * Fix name of link * Add help for PR body
* fix metadata writeout when llm error * linter reformat --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* feat: add initial version of argilla labeller task * fix: arguments in runtime parameters * feat: add field descriptions * feat: Update record formatting logic during structured generation * feat: update workflows * refactor: work based off server payloads * fix: resolve serialization example records * fix: only convert examples when provided * fix: set to basically zero * fix: add temperature fix * fix: revert changes * fix: example records with formatted responses * fix: set max new tokens manually * fix: some fixes in formatting * refactor: some code quality improvements * feat: improv * refactor: remove unused code * fix: wrong prompt template * fix: remove print statement * fix: added pydantic runtimeparameter definition * fix: creating new characters per line examples * fix: add nuance on example in prompt template * feat: Add guidelines to prompt template * fix: remove pdb trace * fix: avoid using records without correct responses * feat: add ability to forward different questions * test: add tests for argilla labeller * fix: wrong docstring * fix: wrong docstring * refactor: rename suggestions -> suggestion * docs: update examples * tests: remove span question * docs: update the examples * Apply suggestions from code review Co-authored-by: Gabriel Martín Blázquez <[email protected]> * refactor: apply suggestions code review * fix: type hinting Record import * fix: tests * tests: fix failing tests --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>
…1017) Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* Add apigen task module * Add tests for apigen * Fix default name for dataset info when requesting the number of examples * checkpoint * Add tests for apigen generator * Create jinja template, split methods and add docstrings * Update string format * Simplify function setting and move it to load method * Add tests for semantic checker * Add prompt template for semantic checker * Redirect import for semantic checker * Fix docstrings for output columns * Add semantic checker task from apigen * Add notes for execution checker * Remove extra jump of line * Add first version of data sampler, step helper for apigen * Add tests for data sampler * Add integration test to check the sampler can be mixed with another generator step * Draft tests for new execution checker * Move helper functions * Draft for execution checker functionality * Add first version of execution checker and tests * Add tests for utils module of apigen * Remove unnecessary step for transformation and rename files for clarity * Fix import * Change function results name to show the original results from the execution * Remove print when the url for a reference doesn't contain https://arxiv * first working version * Fix tests including previous columns * Go back to previous name for dummy llm * Change dummy llm names on tests * Read the answers from the model parsed instead of dumped string * Add option to include the tools if available for few shot * Allow extra checks for the parameter types and tests for those * Add docs for the execution checker * Add new icon for execution * Fix return type for outputs column * Fix docstrings * Redirect imports to top level * Update docstrings to render on components gallery * Improve docstrings for fields in the data sampler * Remove unnecessary data from docstrings and remove TODO * Add missing data variable in example * Update src/distilabel/steps/tasks/apigen/execution_checker.py Co-authored-by: Gabriel Martín Blázquez <[email protected]> * Refactor to return formatted json string instead of dict to simplify work with arrow * Draft tutorial to replicate paper * Allow number to be a dict with values and probabilities * Update pipeline run call * Add functionality to load functions from a folder with .py files * Fix comment for arg * Add example implementation * Add dependency for vllm * Fix dependency name * Add setuptools-scm in the script with the dependencies to install it prior to vllm * Another attempt with system * Add tests to take into account casting methods * Avoid casting and update prompt to ensure argument order is respected * Inform error type on generator * Add extra checks and safeguards for failed answer generation * Ensure the error is of the expected type * Fix unstructured generation * Remove json fences and fix semantic checker * Control case of functions without arguments * Add additional checks to run the execution checker * Remove additional dependency * Try fixing CI error with dependencies * Install dependency for the system * Undo fix attempt * Try fixing llvmlite dependency issue * Remove additional dependency as it breaks other tests --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* Add integration test to showcase the prompts * Add a base print method so the Tasks can pretty print their prompts easily * Update base method to allow automatic pretty printing * Add optional argument instead of the default, update method name and return type * Fix type hint * Add example in docstrings * Add section in docs for the print method
* Add signature method for Serializable objects * Update signature to only keep track of the step names and not its internal info * Refactor hash generation * Add dummy batch manager from dag * Update batch manager cache tests to start batch manager from a DAG * Draft of integration tests for new caching * Checkpoint draft * Add cache directory location * Add use_cache argument to Step for future use * Change output names to keep track of them while debugging * Make use of use_cache at the step level * Add docstrings for internal batch manager arguments * Remove path from add_batch method * Move step caching to get_batch method in batch manager step * Read batches from cached dir * Set every step cache to False if the pipeline has the cache as False * Comment for the batch manager * Move back to caching from add_step * Checkpoint current status * Add use_cache on step * If there's previous data saved, concatenate the content of the parquet files * Only read the distiset from cache if all the steps are the same, otherwise overwrite * Add changes to make loading a new and modified step feasible * Set use cache to True by default * Move logic of registering the batches to BasePipeline._register_batch to do it before calling _manage_batch_flows * Avoid reading parquet file from cache when any of the steps has use_cache=False * Add is_convergence method to DAG and cleanup batch_manager * Add integration tests for the new caching mechanism * Update unit tests related to register_batch * Fix signature serialization case of void list * Add use_cache to argilla tests * Fix tests related to use_cache * Fix tests * Remove undefined object input * Add `_invalidate_steps_cache_if_required` method * Initial work for loading batches from `batch_manager_data` directory * Draft cache updates * Update pipeline signature * Add signature mixin from other PR * Moved pipeline cache to executions folder with different data per pipeline * Testing new updates to read from cache * Checkpoint with loading working while adding new steps * Point of control * Fix not all the batches were being saved * Sort batches after loaded * Fix `load_from_cache` to load batches from `steps_data` directory correctly * Update test * Add `step_has_finished` method * Update invalidate cache function * Update integration caching tests * Refactor to extract logic to methods * Refactor to remove `cached_data_dir` * Update stages message * Refactor `invalidate_cache_for` method * Fix `_BatchManager` unit tests * Update to not serialize `exclude_from_signature` attribute * Fix pipeline unit tests * Remove write buffer data if `use_cache=False` * Fix offline batch generation attributes were not being ignored by signature * Fix print test * Fix routing batch function --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>
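The caching commit above revolves around computing a signature per step and invalidating cached data when a step changes. The sketch below illustrates that general technique only. The names `step_signature` and `steps_to_rerun`, the JSON-plus-SHA-1 hashing, and the assumption of a linear step order are all hypothetical; the actual pipeline is a DAG and its signature logic may differ.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, List


def step_signature(name: str, params: Dict[str, Any], exclude: Iterable[str] = ()) -> str:
    """Hash a step's name and parameters, skipping attributes listed in `exclude`."""
    excluded = set(exclude)
    payload = {k: v for k, v in params.items() if k not in excluded}
    # sort_keys makes the serialization, and therefore the hash, order-independent.
    blob = json.dumps({"name": name, "params": payload}, sort_keys=True)
    return hashlib.sha1(blob.encode("utf-8")).hexdigest()


def steps_to_rerun(cached: Dict[str, str], current: Dict[str, str], order: List[str]) -> List[str]:
    """Return the first step whose signature changed and every step after it.

    Assumes `order` is a linear execution order, so a change in one step
    invalidates everything downstream of it.
    """
    for i, name in enumerate(order):
        if cached.get(name) != current[name]:
            return order[i:]
    return []
```

Excluding attributes such as a cache flag from the hash mirrors the `exclude_from_signature` bullet: toggling them should not, on its own, make cached results look stale.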
…#1022) * Fix processing num_generations when applying input mappings in steps process * Add unit test * Update comment --------- Co-authored-by: Gabriel Martín Blázquez <[email protected]>
Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1024/
CodSpeed Performance Report: Merging #1024 will not alter performance.
plaguss
approved these changes
Oct 8, 2024
No description provided.