
Improve TLM documentation #216

Merged: 13 commits, Apr 16, 2024
36 changes: 23 additions & 13 deletions cleanlab_studio/studio/studio.py
@@ -391,28 +391,38 @@ def TLM(
timeout: Optional[float] = None,
verbose: Optional[bool] = None,
) -> trustworthy_language_model.TLM:
"""Gets a configured instance of Trustworthy Language Model (TLM).
"""Instantiates a configured Trustworthy Language Model (TLM) instance.

The returned TLM object can then be used as a drop-in replacement for an LLM, for estimating trustworthiness scores for LLM prompt/response pairs, and more. See the documentation for the [TLM](../trustworthy_language_model#class-TLM) class for more on what you can do with TLM.
The TLM object can be used as a drop-in replacement for an LLM, or for estimating trustworthiness scores for arbitrary text prompt/response pairs, and more (see the [TLM documentation](../trustworthy_language_model#class-TLM)).

For advanced use cases, TLM supports a number of configuration options. The documentation below summarizes the options, and the [TLM tutorial](/tutorials/tlm) explains the tradeoffs in more detail.
For advanced use, TLM offers configuration options. The documentation below summarizes these options, and more details are explained in the [TLM tutorial](/tutorials/tlm).

Args:
quality_preset (TLMQualityPreset): quality preset to use for TLM queries, which will determine the quality of the output responses and trustworthiness scores.
Supported presets include "best", "high", "medium", "low", "base".
The "best" and "high" presets will improve the LLM responses themselves, with "best" also returning the most reliable trustworthiness scores.
The "medium" and "low" presets will return standard LLM responses along with associated confidence scores,
with "medium" producing more reliable trustworthiness scores than low.
The "base" preset will not return any confidence score, just a standard LLM output response, this option is similar to using your favorite LLM API.
Higher presets have increased runtime and cost.
quality_preset (TLMQualityPreset): An optional preset to control the quality of TLM responses and trustworthiness scores vs. runtimes/costs.
TLMQualityPreset is a string specifying one of the supported presets: "best", "high", "medium", "low", "base".

The "best" and "high" presets improve the LLM responses themselves,
with "best" returning more reliable trustworthiness scores than "high".
The "medium" and "low" presets return standard LLM responses along with associated trustworthiness scores,
with "medium" producing more reliable trustworthiness scores than "low".
The "base" preset will not return any trustworthiness score, just a standard LLM response, and is similar to directly using your favorite LLM API.

Higher presets have increased runtime and cost (and may internally consume more tokens).
Reduce your preset if you see token-limit errors.
Details about each preset are in the documentation for [TLMOptions](../trustworthy_language_model#class-tlmoptions).
Avoid using "best" or "high" presets if you primarily want to get trustworthiness scores, and are less concerned with improving LLM responses.
These presets have higher runtime/cost and are optimized to return more accurate LLM outputs, but not necessarily more reliable trustworthiness scores.

options (TLMOptions, optional): a typed dict of advanced configuration options.
Options that can be passed in include "model", "max_tokens", "num_candidate_responses", "num_consistency_samples", "use_self_reflection".
Available options (keys in this dict) include: "model", "max_tokens", "num_candidate_responses", "num_consistency_samples", "use_self_reflection".
For more details about the options, see the documentation for [TLMOptions](../trustworthy_language_model#class-tlmoptions).
If specified, these override any settings from the choice of `quality_preset`.

timeout (float, optional): timeout (in seconds) to apply to each method call. If a result is not produced within the timeout, a TimeoutError will be raised. Defaults to None, which does not apply a timeout.
timeout (float, optional): timeout (in seconds) to apply to each method call.
If a result is not produced within the timeout, a TimeoutError will be raised. Defaults to None, which does not apply a timeout.

verbose (bool, optional): whether to run in verbose mode, i.e., whether to show a tqdm progress bar when TLM is prompted with batches of data. If None, this will be determined automatically based on whether the code is running in an interactive environment such as a notebook.
verbose (bool, optional): whether to print outputs during execution, i.e., whether to show a progress bar when TLM is prompted with batches of data.
If None, this will be determined automatically based on whether the code is running in an interactive environment such as a Jupyter notebook.

Returns:
TLM: the [Trustworthy Language Model](../trustworthy_language_model#class-tlm) object
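As a concrete illustration of the arguments documented above, here is a minimal usage sketch (the API key, preset, and timeout values are placeholders, not recommendations):

```python
from cleanlab_studio import Studio

studio = Studio("<YOUR_API_KEY>")  # placeholder API key

# Default configuration ("medium" quality preset).
tlm = studio.TLM()

# A cheaper preset with a per-call timeout and a progress bar for batched prompts.
tlm_fast = studio.TLM(quality_preset="low", timeout=60, verbose=True)
```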
55 changes: 41 additions & 14 deletions cleanlab_studio/studio/trustworthy_language_model.py
@@ -36,7 +36,10 @@
class TLM:
"""Represents a Trustworthy Language Model (TLM) instance, bound to a Cleanlab Studio account.

TLM should be configured and instantiated using the [`Studio.TLM()`](../studio/#method-tlm) method. Then, using the TLM object, you can [`prompt()`](#method-prompt) the language model, etc.
**The TLM object is not meant to be constructed directly.** Instead, use the [`Studio.TLM()`](../studio/#method-tlm)
method to configure and instantiate a TLM object.
After you've instantiated the TLM object using [`Studio.TLM()`](../studio/#method-tlm), you can use the instance methods below,
such as [`prompt()`](#method-prompt) and [`get_trustworthiness_score()`](#method-get_trustworthiness_score).
"""

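For illustration, a short sketch of the instance methods referenced above, assuming `tlm` was obtained from `Studio.TLM()`; the prompt text and the "response"/"trustworthiness_score" result keys are shown as assumptions about the returned dict:

```python
# tlm = studio.TLM()  # obtained via Studio.TLM(), not constructed directly

output = tlm.prompt("What is the boiling point of water in Celsius?")
print(output["response"])               # the LLM answer
print(output["trustworthiness_score"])  # score between 0 and 1

# Score an existing prompt/response pair instead of generating a new answer.
score = tlm.get_trustworthiness_score(
    "What is the boiling point of water in Celsius?", "100 degrees Celsius"
)
```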
def __init__(
@@ -48,9 +48,10 @@ def __init__(
timeout: Optional[float] = None,
verbose: Optional[bool] = None,
) -> None:
"""Initializes a Trustworthy Language Model.
"""Use `Studio.TLM()` instead of this method to initialize a TLM.

**Objects of this class are not meant to be constructed directly.** Instead, use [`Studio.TLM()`](../studio/#method-tlm), whose documentation also explains the different configuration options."""
lazydocs: ignore
"""
self._api_key = api_key

if quality_preset not in _VALID_TLM_QUALITY_PRESETS:
@@ -129,7 +133,7 @@ async def _batch_get_trustworthiness_score(
responses: Sequence[str],
capture_exceptions: bool = False,
) -> Union[List[float], List[Optional[float]]]:
"""Run batch of TLM get confidence score.
"""Run batch of TLM get trustworthiness score.

capture_exceptions behavior:
- If true, the list will contain None in place of the response for any errors or timeouts when processing some inputs.
@@ -140,19 +144,19 @@
- If false, a single timeout is applied to the entire batch (i.e. all queries will fail if the timeout is reached)

Args:
prompts (Sequence[str]): list of prompts to run get confidence score for
responses (Sequence[str]): list of responses to run get confidence score for
prompts (Sequence[str]): list of prompts to run get trustworthiness score for
responses (Sequence[str]): list of responses to run get trustworthiness score for
capture_exceptions (bool): whether to return None in place of the response for any errors or timeouts when processing some inputs

Returns:
Union[List[float], List[Optional[float]]]: TLM confidence score for each prompt (in supplied order)
Union[List[float], List[Optional[float]]]: TLM trustworthiness score for each prompt (in supplied order)
"""
if capture_exceptions:
per_query_timeout, per_batch_timeout = self._timeout, None
else:
per_query_timeout, per_batch_timeout = None, self._timeout
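The branch above selects which timeout applies. To make the documented capture_exceptions/timeout semantics concrete, here is a simplified standalone sketch using plain asyncio (an illustration of the intended behavior only, not the library's actual implementation):

```python
import asyncio
from typing import Awaitable, List, Optional, Sequence


async def with_per_query_timeout(
    coros: Sequence[Awaitable[float]], timeout: Optional[float]
) -> List[Optional[float]]:
    # capture_exceptions=True: each query gets its own timeout, and any
    # error or timeout becomes None so the remaining results are kept.
    async def guarded(coro: Awaitable[float]) -> Optional[float]:
        try:
            return await asyncio.wait_for(coro, timeout)
        except Exception:
            return None

    return list(await asyncio.gather(*(guarded(c) for c in coros)))


async def with_batch_timeout(
    coros: Sequence[Awaitable[float]], timeout: Optional[float]
) -> List[float]:
    # capture_exceptions=False: a single timeout covers the whole batch,
    # so every query fails together if the deadline is exceeded.
    return list(await asyncio.wait_for(asyncio.gather(*coros), timeout))
```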

# run batch of TLM get confidence score
# run batch of TLM get trustworthiness score
tlm_responses = await self._batch_async(
[
self._get_trustworthiness_score_async(
@@ -180,7 +184,7 @@ async def _batch_async(
"""Runs batch of TLM queries.

Args:
tlm_coroutines (List[Coroutine[None, None, Union[TLMResponse, float, None]]]): list of query coroutines to run, returning TLM responses or confidence scores (or None if capture_exceptions is True)
tlm_coroutines (List[Coroutine[None, None, Union[TLMResponse, float, None]]]): list of query coroutines to run, returning TLM responses or trustworthiness scores (or None if capture_exceptions is True)
batch_timeout (Optional[float], optional): timeout (in seconds) to run all queries, defaults to None (no timeout)

Returns:
@@ -266,8 +270,8 @@ def try_prompt(
The list returned will have the same length as the input list. If there are any
failures (errors or timeouts) processing some inputs, the list will contain None in place of the response.

If there are any failures (errors or timeouts) processing some inputs, the list returned will have
the same length as the input list. In case of failure, the list will contain None in place of the response.
This is the recommended way to get TLM responses and trustworthiness scores for big datasets,
where some individual responses within the dataset may fail, as it will ensure partial results are not lost.
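As an illustrative sketch of this usage pattern (assuming `tlm` was created via `Studio.TLM()`; the prompts and the "response"/"trustworthiness_score" keys are illustrative assumptions):

```python
prompts = [
    "Summarize the plot of Hamlet in one sentence.",
    "What is 17 * 23?",
]

results = tlm.try_prompt(prompts)  # same length as `prompts`

for prompt, result in zip(prompts, results):
    if result is None:
        print(f"Failed or timed out: {prompt!r}")
    else:
        print(result["response"], result["trustworthiness_score"])
```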

Args:
prompt (Sequence[str]): list of multiple prompts for the TLM
@@ -411,10 +415,14 @@ def try_get_trustworthiness_score(
response: Sequence[str],
) -> List[Optional[float]]:
"""Gets trustworthiness score for prompt-response pairs.

The list returned will have the same length as the input list. If there are any
failures (errors or timeouts) processing some inputs, the list will contain None
in place of the response.

This is the recommended way to get TLM trustworthiness scores for big datasets,
where some individual responses within the dataset may fail, as it will ensure partial results are not lost.
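A corresponding sketch for scoring existing prompt/response pairs in bulk (again assuming a `tlm` object from `Studio.TLM()`; the example pairs are illustrative):

```python
prompts = ["What is the capital of France?", "What is 2 + 2?"]
responses = ["Paris", "5"]

scores = tlm.try_get_trustworthiness_score(prompts, responses)

for prompt, response, score in zip(prompts, responses, scores):
    if score is None:
        print(f"Failed or timed out: {prompt!r}")
    else:
        print(f"{prompt!r} -> {response!r}: trustworthiness {score:.2f}")
```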

Args:
prompt (Sequence[str]): list of prompts for the TLM to evaluate
response (Sequence[str]): list of responses corresponding to the input prompts
@@ -495,7 +503,7 @@ async def _get_trustworthiness_score_async(
"""
if self._quality_preset == "base":
raise ValidationError(
"Cannot get confidence score with `base` quality_preset -- choose a higher preset."
"Cannot get trustworthiness score with `base` quality_preset -- choose a higher preset."
)

try:
@@ -543,22 +551,41 @@ class TLMOptions(TypedDict):
(see the arguments in the TLM [initialization method](../studio#method-tlm) to learn more about the various quality presets),
but specifying custom values here will override any default values from the quality preset.

For all options described below, higher/more expensive settings will lead to longer runtimes and may consume more tokens internally.
The high token cost may prevent you from running long prompts (or prompts with long responses) in your account,
unless your token limits are increased. If you are running into issues with token limits, try using lower/less expensive settings
so that you can run longer prompts.

The default values for the various quality presets (specified when instantiating [`Studio.TLM`](../studio/#method-tlm)) are as below:
- **best:** `num_candidate_responses` = 6, `num_consistency_samples` = 8, `use_self_reflection` = True, this quality preset will return improved LLM responses
- **high:** `num_candidate_responses` = 6, `num_consistency_samples` = 8, `use_self_reflection` = True, this quality preset will return improved LLM responses
- **medium:** `num_candidate_responses` = 1, `num_consistency_samples` = 4, `use_self_reflection` = True
- **low:** `num_candidate_responses` = 1, `num_consistency_samples` = 4, `use_self_reflection` = True
- **base:** `num_candidate_responses` = 1, `num_consistency_samples` = 0, `use_self_reflection` = False, this quality preset is equivalent to a regular LLM call

By default, the TLM is set to the "medium" quality preset. The default `model` used is "gpt-3.5-turbo-16k", and `max_tokens` is 512 for all quality presets.
You can set custom values for these arguments regardless of the quality preset specified.
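For example, a few of the defaults above can be overridden by passing a TLMOptions dict when instantiating the TLM (the values below are placeholders, not recommendations):

```python
from cleanlab_studio import Studio

studio = Studio("<YOUR_API_KEY>")  # placeholder API key

# Start from the "medium" preset, but use a stronger model and more
# consistency samples; unspecified keys keep the preset defaults.
options = {
    "model": "gpt-4",
    "max_tokens": 256,
    "num_consistency_samples": 8,
    "use_self_reflection": True,
}
tlm = studio.TLM(quality_preset="medium", options=options)
```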

Args:
model (str, default = "gpt-3.5-turbo-16k"): underlying LLM to use (better models will yield better results).
Models currently supported include "gpt-3.5-turbo-16k", "gpt-4".

max_tokens (int, default = 512): the maximum number of tokens to generate in the TLM response.
This number will impact the maximum number of tokens you will see in the output response, and also the number of tokens
that can be generated for internal calls (to estimate the trustworthiness score).
Higher values here produce better (more reliable) TLM responses and trustworthiness scores, but at higher costs/runtimes.
If you are experiencing token limits while using the TLM (especially on higher quality presets), consider lowering this number.
The minimum value for this parameter is 64, and the maximum is 512.

num_candidate_responses (int, default = 1): this controls how many candidate responses are internally generated.
TLM scores the trustworthiness of each candidate response, and then returns the most trustworthy one.
Higher values here can produce better (more accurate) responses from the TLM, but at higher costs/runtimes.
The minimum value for this parameter is 1, and the maximum is 20.

num_consistency_samples (int, default = 5): this controls how many samples are internally generated to evaluate the LLM-response-consistency.
num_consistency_samples (int, default = 4): this controls how many samples are internally generated to evaluate the LLM-response-consistency.
This consistency evaluation forms a big part of the returned trustworthiness_score, and is particularly useful for evaluating strange input prompts or prompts that are too open-ended
to receive a clearly defined 'good' response.
Higher values here produce better (more reliable) TLM confidence scores, but at higher costs/runtimes.
Higher values here produce better (more reliable) TLM trustworthiness scores, but at higher costs/runtimes.
The minimum value for this parameter is 0, and the maximum is 20.

use_self_reflection (bool, default = `True`): this controls whether self-reflection is used to have the LLM reflect upon the response it is