Extend Argilla integration TextGeneration, Preference, and more #472

Merged
alvarobartt merged 19 commits into core-refactor from more-argilla
Mar 27, 2024

@alvarobartt alvarobartt commented Mar 25, 2024

Description

This PR adds the PreferenceToArgilla step, which pushes a dataset to Argilla for preference annotation, i.e. a rating and a rationale for each of the generated responses to a given instruction. The step works with datasets that already contain ratings and rationales, injecting them as suggestions within each record, and also with datasets that have neither and are meant to be annotated by hand from scratch. It also integrates directly with UltraFeedback, so uploading UltraFeedback-generated and annotated data to Argilla is straightforward.
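
To make the two modes concrete, here is a minimal plain-Python sketch of how a record might be assembled with or without suggestions. It is not the code from this PR and does not call the Argilla client; the function name `build_preference_record`, the field names, and the suggestion keys are all hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, List, Optional


def build_preference_record(
    instruction: str,
    generations: List[str],
    ratings: Optional[List[int]] = None,
    rationales: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: one record per instruction, suggestions optional."""
    # Every record carries the instruction and each generated response.
    fields: Dict[str, Any] = {"instruction": instruction}
    for idx, generation in enumerate(generations):
        fields[f"generation-{idx}"] = generation

    # Suggestions are only filled in when ratings/rationales already exist.
    suggestions: Dict[str, Any] = {}
    for idx, rating in enumerate(ratings or []):
        suggestions[f"generation-{idx}-rating"] = rating
    for idx, rationale in enumerate(rationales or []):
        suggestions[f"generation-{idx}-rationale"] = rationale

    return {"fields": fields, "suggestions": suggestions}


# Already annotated data (e.g. produced by an UltraFeedback-style task).
with_suggestions = build_preference_record(
    instruction="Say hi",
    generations=["Hi!", "Hello there."],
    ratings=[4, 5],
    rationales=["Short but fine.", "Polite and complete."],
)

# Data with no ratings or rationales, to be annotated by hand.
without_suggestions = build_preference_record(
    instruction="Say hi",
    generations=["Hi!", "Hello there."],
)
```

In the second call the record still contains every field, but its suggestions are empty, which corresponds to the "no suggestions" case described above.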

Some minor tweaks and improvements have also been applied to PromptCompletionToArgilla, which is now renamed to TextGenerationToArgilla.

Besides that, this PR also adds the KeepColumns step, which is useful for sorting and filtering columns before pipeline.run ends. I found it very convenient when running diverse pipelines that produce unsorted outputs and more columns than the ones I wanted; KeepColumns keeps only the columns listed in the columns arg and respects the order in which they are provided.
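
The filtering and ordering behaviour described above can be sketched in plain Python as follows. This is an illustration of the idea only, not the step's actual implementation; the `keep_columns` function and the sample column names are assumptions.

```python
from typing import Any, Dict, List


def keep_columns(rows: List[Dict[str, Any]], columns: List[str]) -> List[Dict[str, Any]]:
    """Keep only `columns` from each row, in the order given by `columns`."""
    # Building a new dict key by key makes the output follow `columns` order,
    # regardless of the order in which the keys appear in the input row.
    return [{column: row[column] for column in columns} for row in rows]


# A row with extra, unsorted columns, as a pipeline might produce.
rows = [
    {
        "model_name": "some-model",
        "generation": "Hi!",
        "raw_output": "Hi!</s>",
        "instruction": "Say hi",
    },
]

filtered = keep_columns(rows, columns=["instruction", "generation"])
```

Here `filtered` holds a single row with only `instruction` and `generation`, in that order, even though the input row listed `generation` first.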

Additionally, the docstrings have been fixed and aligned for both CombineColumns and KeepColumns, and unit tests have been added for both steps. Finally, this PR also runs codespell to fix some typos in the codebase.

Example

from distilabel.integrations.argilla import PreferenceToArgilla

if __name__ == "__main__":
    ...
    push_to_argilla = PreferenceToArgilla(
        name="push_to_argilla",
        api_url="<ARGILLA_URL>",
        api_key="<ARGILLA_API_KEY>",
        dataset_name="push_to_argilla",
        dataset_workspace="admin",
        num_generations=2,
    )
    ...connect(push_to_argilla)

    dataset = pipeline.run(...)

@alvarobartt alvarobartt added this to the 1.0.0 milestone Mar 25, 2024
@alvarobartt alvarobartt self-assigned this Mar 25, 2024
@alvarobartt alvarobartt marked this pull request as ready for review March 26, 2024 12:36

@gabrielmbmb gabrielmbmb left a comment

LGTM!

src/distilabel/integrations/argilla/preference.py (3 review threads, outdated and resolved)
@alvarobartt alvarobartt merged commit f56655f into core-refactor Mar 27, 2024
0 of 4 checks passed
@alvarobartt alvarobartt deleted the more-argilla branch March 27, 2024 10:29

3 participants