
Refactor and improve docs #134

Merged
40 commits merged on Dec 19, 2023
Changes from 27 commits
Commits
7078e81
chore: add requirement to run mkdocs serve
Nov 29, 2023
27847ee
docs: add draft for learning section
Nov 30, 2023
3006955
refactor: guides as a part of the learn section
Nov 30, 2023
81ddb99
chore: add support to render notebooks
plaguss Nov 30, 2023
b269caf
docs: copy tutorial from gabri
plaguss Nov 30, 2023
6045f13
docs: update learn section
plaguss Dec 1, 2023
8826315
docs: let navigable section and new api reference
plaguss Dec 1, 2023
c4ea466
docs: add small overview of api reference
plaguss Dec 1, 2023
4cde7ac
Merge branch 'main' into docs/update-docs
plaguss Dec 1, 2023
1703837
Merge branch 'main' into docs/update-docs
plaguss Dec 7, 2023
eb54298
Update docs/api/pipeline.md
plaguss Dec 7, 2023
fe76cfc
Merge branch 'docs/update-docs' of https://github.com/argilla-io/dist…
plaguss Dec 7, 2023
fd1dcd6
wip
plaguss Dec 7, 2023
d727cfb
docs: rewrite for the llm concept guides
plaguss Dec 7, 2023
f05cb5d
chore: renamed files
plaguss Dec 7, 2023
7510475
refactor: updated user-guides content
plaguss Dec 7, 2023
0b5baf5
chore: allow adding footnotes and docs layout
plaguss Dec 7, 2023
553617a
docs: initial version of concept guides for llms and tasks
plaguss Dec 11, 2023
fcc9a94
docs: initial version for pipeline and llms
plaguss Dec 12, 2023
49da487
refactor: move wrong tutorial and place banner
plaguss Dec 12, 2023
4c2c4ce
Merge branch 'main' into docs/update-docs
plaguss Dec 12, 2023
7fd9d8b
docs: apply suggestions from code review
plaguss Dec 14, 2023
1a0d142
Merge branch 'docs/update-docs' of https://github.com/argilla-io/dist…
plaguss Dec 14, 2023
43a04be
docs: add suggestions from code review and move code snippets to its …
plaguss Dec 14, 2023
f047950
Merge branch 'main' into docs/update-docs
plaguss Dec 14, 2023
046416a
docs: removed noqa and avoid checking the docs with ruff
plaguss Dec 15, 2023
1e2ad18
chore: add new alembig image
plaguss Dec 15, 2023
8089bd2
Merge branch 'main' of https://github.com/argilla-io/distilabel into …
plaguss Dec 18, 2023
2c82ea8
Remove commented line
plaguss Dec 19, 2023
e25a1b4
Update logo.svg new name
plaguss Dec 19, 2023
cce4335
Merge branch 'docs/update-docs' of https://github.com/argilla-io/dist…
plaguss Dec 19, 2023
eaf5db7
Rename to HuggingFace
plaguss Dec 19, 2023
b441c06
Rename to HuggingFace
plaguss Dec 19, 2023
9096994
Rename to HuggingFace
plaguss Dec 19, 2023
43fe493
Point doc reference to main branch
plaguss Dec 19, 2023
0e0bac5
Fixed reference
plaguss Dec 19, 2023
4b0ca60
Remove comment
plaguss Dec 19, 2023
850c3f4
Merge branch 'docs/update-docs' of https://github.com/argilla-io/dist…
plaguss Dec 19, 2023
dfd9952
Add grid with different sections
plaguss Dec 19, 2023
0b33704
Update colours
plaguss Dec 19, 2023
42 changes: 42 additions & 0 deletions docs/assets/alembic.svg
3 changes: 0 additions & 3 deletions docs/guides.md

This file was deleted.

3 changes: 3 additions & 0 deletions docs/learn/index.md
@@ -0,0 +1,3 @@
# Learn

This section is the guide to using `distilabel`. It contains tutorials and guides that cover the technical aspects of working with `distilabel`.
6 changes: 6 additions & 0 deletions docs/learn/tutorials/index.md
@@ -0,0 +1,6 @@
# Tutorials

!!! warning "🚧 Work in Progress"
This page is a work in progress.

This section will guide you step by step through creating different types of datasets with `distilabel`.
6 changes: 6 additions & 0 deletions docs/learn/user-guides/index.md
@@ -0,0 +1,6 @@
# User guides

!!! warning "🚧 Work in Progress"
This page is a work in progress.

This section explains the main components of `distilabel`.
@@ -0,0 +1,24 @@
import os

from distilabel.llm import InferenceEndpointsLLM
from distilabel.tasks import TextGenerationTask

# Use the environment variables when set, otherwise fall back to the example endpoint
endpoint_name = os.getenv("HF_INFERENCE_ENDPOINT_NAME", "aws-notus-7b-v1-4052")
endpoint_namespace = os.getenv("HF_NAMESPACE", "argilla")
token = os.getenv("HF_TOKEN") # hf_...

llm = InferenceEndpointsLLM(
endpoint_name=endpoint_name,
endpoint_namespace=endpoint_namespace,
token=token,
task=TextGenerationTask(),
max_new_tokens=512,
prompt_format="notus",
)
result = llm.generate([{"input": "What are critique LLMs?"}])
# print(result[0][0]["parsed_output"]["generations"])
# Critique LLMs (Long Land Moore Machines) are artificial intelligence models designed specifically for analyzing and evaluating the quality or worth of a particular subject or object. These models can be trained on a large dataset of reviews, ratings, or commentary related to a product, service, artwork, or any other topic of interest.
# The training data can include both positive and negative feedback, helping the LLM to understand the nuanced aspects of quality and value. The model uses natural language processing (NLP) techniques to extract meaningful insights, including sentiment analysis, entity recognition, and text classification.
# Once the model is trained, it can be used to analyze new input data and provide a critical assessment based on its learned understanding of quality and value. For example, a critique LLM for movies could evaluate a new film and generate a detailed review highlighting its strengths, weaknesses, and overall rating.
# Critique LLMs are becoming increasingly useful in various industries, such as e-commerce, education, and entertainment, where they can provide objective and reliable feedback to help guide decision-making processes. They can also aid in content optimization by highlighting areas of improvement or recommending strategies for enhancing user engagement.
# In summary, critique LLMs are powerful tools for analyzing and evaluating the quality or worth of different subjects or objects, helping individuals and organizations make informed decisions with confidence.
19 changes: 19 additions & 0 deletions docs/snippets/technical-reference/llm/llamacpp_generate.py
@@ -0,0 +1,19 @@
from distilabel.llm import LlamaCppLLM
from distilabel.tasks import TextGenerationTask
from llama_cpp import Llama

# Instantiate the LLM with a local llama.cpp model:
llm = LlamaCppLLM(
model=Llama(model_path="./notus-7b-v1.q4_k_m.gguf", n_gpu_layers=-1),
task=TextGenerationTask(),
max_new_tokens=128,
temperature=0.3,
prompt_format="notus",
)

result_llamacpp = llm.generate([{"input": "What is the capital of Spain?"}])
# >>> print(result_llamacpp[0][0]["parsed_output"]["generations"])
# The capital of Spain is Madrid. It is located in the center of the country and
# is known for its vibrant culture, beautiful architecture, and delicious food.
# Madrid is home to many famous landmarks such as the Prado Museum, Retiro Park,
# and the Royal Palace of Madrid. I hope this information helps!
18 changes: 18 additions & 0 deletions docs/snippets/technical-reference/llm/openai_generate.py
@@ -0,0 +1,18 @@
import os

from distilabel.llm import OpenAILLM
from distilabel.tasks import OpenAITextGenerationTask

openaillm = OpenAILLM(
model="gpt-3.5-turbo",
task=OpenAITextGenerationTask(),
max_new_tokens=256,
num_threads=2,
openai_api_key=os.environ.get("OPENAI_API_KEY"),
temperature=0.3,
)
result_openai = openaillm.generate([{"input": "What is OpenAI?"}])
# >>> result_openai
# [<Future at 0x2970ea560 state=running>]
# >>> result_openai[0].result()[0][0]["parsed_output"]["generations"]
# 'OpenAI is an artificial intelligence research organization that aims to ensure that artificial general intelligence (AGI) benefits all of humanity. AGI refers to highly autonomous systems that outperform humans at most economically valuable work. OpenAI conducts research, develops AI technologies, and promotes the responsible and safe use of AI. They also work on projects to make AI more accessible and beneficial to society. OpenAI is committed to transparency, cooperation, and avoiding uses of AI that could harm humanity or concentrate power in the wrong hands.'
17 changes: 17 additions & 0 deletions docs/snippets/technical-reference/llm/transformers_generate.py
@@ -0,0 +1,17 @@
from distilabel.llm import TransformersLLM
from distilabel.tasks import TextGenerationTask
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the models from huggingface hub:
tokenizer = AutoTokenizer.from_pretrained("argilla/notus-7b-v1")
model = AutoModelForCausalLM.from_pretrained("argilla/notus-7b-v1")

# Instantiate our LLM with them:
llm = TransformersLLM(
model=model,
tokenizer=tokenizer,
task=TextGenerationTask(),
max_new_tokens=128,
temperature=0.3,
prompt_format="notus",
)
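
# As a follow-up, a minimal sketch of generating text with the `TransformersLLM`
# instantiated above, mirroring the `generate` call shown in the llama.cpp and
# OpenAI snippets (the prompt text is only illustrative):
result = llm.generate([{"input": "What is the capital of Spain?"}])
# The parsed generations could then be inspected with:
# print(result[0][0]["parsed_output"]["generations"])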
20 changes: 20 additions & 0 deletions docs/snippets/technical-reference/pipeline/argilla.py
@@ -0,0 +1,20 @@
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argilla as rg

rg.init(api_key="<YOUR_ARGILLA_API_KEY>", api_url="<YOUR_ARGILLA_API_URL>")

# `pipe_dataset` is the dataset returned by `pipe.generate(...)` in the previous pipeline snippet
rg_dataset = pipe_dataset.to_argilla()
rg_dataset.push_to_argilla(name="preference-dataset", workspace="admin")
37 changes: 37 additions & 0 deletions docs/snippets/technical-reference/pipeline/pipe_1.py
@@ -0,0 +1,37 @@
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

from distilabel.llm import InferenceEndpointsLLM
from distilabel.pipeline import pipeline
from distilabel.tasks import TextGenerationTask

# Endpoint details, as defined in the Inference Endpoints snippet above
endpoint_name = os.getenv("HF_INFERENCE_ENDPOINT_NAME", "aws-notus-7b-v1-4052")
endpoint_namespace = os.getenv("HF_NAMESPACE", "argilla")
token = os.getenv("HF_TOKEN")  # hf_...

pipe = pipeline(
"preference",
"text-quality",
generator=InferenceEndpointsLLM(
endpoint_name=endpoint_name,
endpoint_namespace=endpoint_namespace,
token=token,
task=TextGenerationTask(),
max_new_tokens=512,
do_sample=True,
prompt_format="notus",
),
max_new_tokens=256,
num_threads=2,
openai_api_key=os.getenv("OPENAI_API_KEY"),
temperature=0.0,
)
29 changes: 29 additions & 0 deletions docs/snippets/technical-reference/pipeline/pipe_2.py
@@ -0,0 +1,29 @@
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from datasets import load_dataset

instruction_dataset = (
load_dataset("HuggingFaceH4/instruction-dataset", split="test[:3]")
.remove_columns(["completion", "meta"])
.rename_column("prompt", "input")
)

# `pipe` is the preference pipeline created in the previous snippet (pipe_1.py)
pipe_dataset = pipe.generate(
instruction_dataset,
num_generations=2,
batch_size=1,
enable_checkpoints=True,
display_progress_bar=True,
)
41 changes: 41 additions & 0 deletions docs/snippets/technical-reference/pipeline/pipe_3.py
@@ -0,0 +1,41 @@
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# `pipe_dataset` is the dataset returned by `pipe.generate(...)` in the previous snippet
print(pipe_dataset["input"][-1])
# Create a 3 turn conversation between a customer and a grocery store clerk - that is, 3 per person. Then tell me what they talked about.

print(pipe_dataset["generations"][-1][-1])
# Customer: Hi there, I'm looking for some fresh berries. Do you have any raspberries or blueberries in stock?

# Grocery Store Clerk: Yes, we have both raspberries and blueberries in stock today. Would you like me to grab some for you or can you find them yourself?

# Customer: I'd like your help getting some berries. Can you also suggest which variety is sweeter? Raspberries or blueberries?

# Grocery Store Clerk: Raspberries and blueberries both have distinct flavors. Raspberries are more tart and a little sweeter whereas blueberries tend to be a little sweeter and have a milder taste. It ultimately depends on your personal preference. Let me grab some of each for you to try at home and see which one you like better.

# Customer: That sounds like a great plan. How often do you receive deliveries? Do you have some new varieties of berries arriving soon?

# Grocery Store Clerk: We receive deliveries twice a week, on Wednesdays and Sundays. We also have a rotation of different varieties of berries throughout the season, so keep an eye out for new arrivals. Thanks for shopping with us, can I help you with anything else today?

# Customer: No, that's all for now. I'm always happy to support your local store.

# turn 1: berries, fresh produce availability, customer preference
# turn 2: product recommendations based on taste and personal preference, availability
# turn 3: store acknowledgment, shopping gratitude, loyalty and repeat business expectation.

print(pipe_dataset["rating"][-1][-1])
# 5.0

print(pipe_dataset["rationale"][-1][-1])
# The text accurately follows the given instructions and provides a conversation between a customer and a grocery store clerk. The information provided is correct, informative, and aligned with the user's intent. There are no hallucinations or misleading details.
20 changes: 20 additions & 0 deletions docs/snippets/technical-reference/pipeline/pipeline_generator_1.py
@@ -0,0 +1,20 @@
import os

from distilabel.llm import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import TextGenerationTask

# Use the environment variables when set, otherwise fall back to the example endpoint
endpoint_name = os.getenv("HF_INFERENCE_ENDPOINT_NAME", "aws-notus-7b-v1-4052")
endpoint_namespace = os.getenv("HF_NAMESPACE", "argilla")

pipe_generation = Pipeline(
generator=InferenceEndpointsLLM(
endpoint_name=endpoint_name,  # The name given to the deployed model
endpoint_namespace=endpoint_namespace,  # This usually corresponds to the organization, in this case "argilla"
token=os.getenv("HF_TOKEN"), # hf_...
task=TextGenerationTask(),
max_new_tokens=512,
do_sample=True,
prompt_format="notus",
),
)
20 changes: 20 additions & 0 deletions docs/snippets/technical-reference/pipeline/pipeline_generator_2.py
@@ -0,0 +1,20 @@
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from datasets import Dataset

dataset = Dataset.from_dict(
{"input": ["Create an easy dinner recipe with few ingredients"]}
)
# `pipe_generation` is the Pipeline defined in the previous snippet (pipeline_generator_1.py)
dataset_generated = pipe_generation.generate(dataset, num_generations=2)
53 changes: 53 additions & 0 deletions docs/snippets/technical-reference/pipeline/pipeline_generator_3.py
@@ -0,0 +1,53 @@
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# `dataset_generated` comes from the previous snippet (pipeline_generator_2.py)
print(dataset_generated)
# Dataset({
# features: ['input', 'generation_model', 'generation_prompt', 'raw_generation_responses', 'generations'],
# num_rows: 1
# })

print(dataset_generated[0]["generations"][0])
# Here's a simple and delicious dinner recipe with only a few ingredients:

# Garlic Butter Chicken with Roasted Vegetables

# Ingredients:
# - 4 boneless, skinless chicken breasts
# - 4 tablespoons butter
# - 4 cloves garlic, minced
# - 1 teaspoon dried oregano
# - 1/2 teaspoon salt
# - 1/4 teaspoon black pepper
# - 1 zucchini, sliced
# - 1 red bell pepper, sliced
# - 1 cup cherry tomatoes

# Instructions:

# 1. Preheat oven to 400°F (200°C).

# 2. Melt butter in a small saucepan over low heat. Add minced garlic and heat until fragrant, about 1-2 minutes.

# 3. Place chicken breasts in a baking dish and brush garlic butter over each one.

# 4. Sprinkle oregano, salt, and black pepper over the chicken.

# 5. In a separate baking dish, add sliced zucchini, red bell pepper, and cherry tomatoes. Brush with remaining garlic butter.

# 6. Roast the chicken and vegetables in the preheated oven for 25-30 minutes or until cooked through and the vegetables are tender and lightly browned.

# 7. Transfer the chicken to plates and serve with the roasted vegetables alongside. Enjoy!

# This recipe requires simple ingredients and is easy to prepare, making it perfect for a quick, satisfying dinner. The garlic butter adds maximum flavor, while the roasted vegetables complement the chicken beautifully, providing additional nutrition and texture. With minimal effort, you can have a delicious and balanced meal on the table in no time.
16 changes: 16 additions & 0 deletions docs/snippets/technical-reference/pipeline/pipeline_labeller_1.py
@@ -0,0 +1,16 @@
import os

from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import UltraFeedbackTask

pipe_labeller = Pipeline(
labeller=OpenAILLM(
model="gpt-4",
task=UltraFeedbackTask.for_instruction_following(),
max_new_tokens=256,
num_threads=8,
openai_api_key=os.getenv("OPENAI_API_KEY"),
temperature=0.3,
),
)
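
# For context, a minimal sketch of feeding data to this labeller-only pipeline.
# The toy dataset, its column names ("input" and "generations") and the `generate`
# call are assumptions inferred from the other pipeline snippets, not part of this PR.
from datasets import Dataset

toy_dataset = Dataset.from_dict(
    {
        "input": ["Describe the game of chess in one sentence"],
        "generations": [
            [
                "Chess is a two-player strategy board game played on 64 squares.",
                "A game about moving pieces around.",
            ]
        ],
    }
)
# The labeller rates each generation, adding rating and rationale columns
labelled_dataset = pipe_labeller.generate(toy_dataset)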