[RFC][Break BC] Custom instantiate API to simplify our config system #406
@@ -0,0 +1,14 @@
.. _config:

==================
 torchtune.config
==================

.. currentmodule:: torchtune.config

.. autosummary::
    :toctree: generated/
    :nosignatures:

    instantiate
    parse
@@ -4,16 +4,15 @@
Configs Deep-Dive
=================

This tutorial will guide you through writing configs for running recipes.

.. grid:: 2

    .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn

       * How to write a YAML config and run a recipe with it
       * How to use :code:`instantiate` and :code:`parse` APIs
       * How to effectively use configs and CLI overrides for running recipes

    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
@@ -27,137 +26,193 @@ Where do parameters live?

There are two primary entry points for you to configure parameters: **configs** and
**CLI overrides**. Configs are YAML files that define all the
parameters needed to run a recipe within a single location. They are the single
source of truth for reproducing a run. The config parameters can be overridden on the
command-line using :code:`tune` for quick changes and experimentation without
modifying the config.

Writing configs
---------------
Configs serve as the primary entry point for running recipes in TorchTune. They are
expected to be YAML files and they simply list out values for parameters you want to define
for a particular run.

.. code-block:: yaml

    seed: null
    shuffle: True
    device: cuda
    dtype: fp32
    enable_fsdp: True
    ...
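
Configs like the one above are plain YAML, and the :code:`DictConfig` type in the :code:`instantiate` signature later in this tutorial suggests they are parsed with OmegaConf. As a rough, standalone sketch (the filename here is hypothetical), you could load and inspect one directly:

.. code-block:: python

    from omegaconf import OmegaConf

    # Hypothetical path to the YAML config shown above; in a recipe the parsed
    # config is normally handed to you rather than loaded by hand
    cfg = OmegaConf.load("my_finetune_config.yaml")

    print(cfg.device)       # cuda
    print(cfg.enable_fsdp)  # True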

Configuring components using :code:`instantiate`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Many fields will require specifying TorchTune objects with associated keyword
arguments as parameters. Models, datasets, optimizers, and loss functions are
common examples of this. You can easily do this using the :code:`_component_`
subfield. In :code:`_component_`, you need to specify the dotpath of the object
you wish to instantiate in the recipe. The dotpath is the exact path you would use
to import the object normally in a Python file. For example, to specify the
:class:`~torchtune.datasets._alpaca.AlpacaDataset` in your config with custom
arguments:

.. code-block:: yaml

    dataset:
      _component_: torchtune.datasets.AlpacaDataset
      train_on_input: False

Here, we are changing the default value for :code:`train_on_input` from :code:`True`
to :code:`False`.

Once you've specified the :code:`_component_` in your config, you can create an
instance of the specified object in your recipe's setup like so:

.. code-block:: python

    from torchtune import config

    # Access the dataset field and create the object instance
    dataset = config.instantiate(cfg.dataset)

This will automatically use any keyword arguments specified in the fields under
:code:`dataset`.

As written, the preceding example will actually throw an error. If you look at the
constructor for :class:`~torchtune.datasets._alpaca.AlpacaDataset`, you'll notice
that we're missing a required positional argument, the tokenizer. Since this is
another configurable TorchTune object, let's understand how to handle this by
taking a look at the :func:`~torchtune.config._instantiate.instantiate` API.

.. code-block:: python

    def instantiate(
        config: DictConfig,
        *args: Tuple[Any, ...],
        **kwargs: Dict[str, Any],
    )

:func:`~torchtune.config._instantiate.instantiate` also accepts positional and keyword
arguments and automatically uses them together with the config when creating the object.
This means we can not only pass in the tokenizer, but also add any additional keyword
arguments not specified in the config if we'd like:

.. code-block:: yaml

    # Tokenizer is needed for the dataset, configure it first
    tokenizer:
      _component_: torchtune.models.llama2.llama2_tokenizer
      path: /tmp/tokenizer.model

    dataset:
      _component_: torchtune.datasets.AlpacaDataset
      train_on_input: True

.. code-block:: python

    # Note the API of the tokenizer we specified - we need to pass in a path
    def llama2_tokenizer(path: str) -> Tokenizer: ...

    # Note the API of the dataset we specified - we need to pass in a tokenizer
    # and any optional keyword arguments
    class AlpacaDataset(Dataset):
        def __init__(
            self,
            tokenizer: Tokenizer,
            train_on_input: bool = True,
            use_clean: bool = False,
            **kwargs,
        ) -> None: ...

.. code-block:: python

    from torchtune import config

    # Since we've already specified the path in the config, we don't need to
    # pass it in
    tokenizer = config.instantiate(cfg.tokenizer)

    # We pass in the instantiated tokenizer as the first required argument, then
    # we change an optional keyword argument
    dataset = config.instantiate(
        cfg.dataset,
        tokenizer,
        use_clean=True,
    )

Note that additional keyword arguments will overwrite any duplicated keys in the
config.
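
To make that precedence concrete, here is a minimal sketch: the config above sets :code:`train_on_input: True`, so passing the same key as a keyword argument replaces the config value.

.. code-block:: python

    from torchtune import config

    tokenizer = config.instantiate(cfg.tokenizer)

    # train_on_input is True in the config, but the keyword argument below takes
    # precedence, so the dataset is built with train_on_input=False
    dataset = config.instantiate(cfg.dataset, tokenizer, train_on_input=False)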

Referencing other config fields with interpolations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sometimes you need to use the same value more than once for multiple fields. You
can use *interpolations* to reference another field, and :func:`~torchtune.config._instantiate.instantiate`
will automatically resolve it for you.

.. code-block:: yaml

    output_dir: /tmp/alpaca-llama2-finetune
    metric_logger:
      _component_: torchtune.utils.metric_logging.DiskLogger
      log_dir: ${output_dir}
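
The :code:`${...}` syntax is OmegaConf-style interpolation, which the :code:`DictConfig` type in the :code:`instantiate` signature suggests is the underlying config library. As a standalone sketch of how the reference resolves:

.. code-block:: python

    from omegaconf import OmegaConf

    cfg = OmegaConf.create(
        {
            "output_dir": "/tmp/alpaca-llama2-finetune",
            "metric_logger": {"log_dir": "${output_dir}"},
        }
    )

    # The interpolation resolves when the value is accessed
    print(cfg.metric_logger.log_dir)  # /tmp/alpaca-llama2-finetune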

Best practices for writing configs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Let's discuss some guidelines for writing configs to get the most out of them.

Airtight configs
""""""""""""""""
While it may be tempting to put as much as you can in the config to give you
maximum flexibility in switching parameters for your experiments, we encourage
you to only include fields in the config that will be used or instantiated in the
recipe. This ensures full clarity on the options a recipe was run with and will
make it significantly easier to debug.

.. code-block:: yaml

    # don't do this
    alpaca_dataset:
      _component_: torchtune.datasets.AlpacaDataset
      train_on_input: True
    slimorca_dataset:
      ...

    # do this
    dataset:
      # change this in config or override when needed
      _component_: torchtune.datasets.AlpacaDataset
      train_on_input: True

Use public APIs only
""""""""""""""""""""

Review comment: Isn't this section duplicative of the note above about AlpacaDataset in the "Configuring components using instantiate" section?

Reply: Yep, I don't mind being redundant to emphasize the point, as most users will just skim through this. But if it's too much overlap in wording, I can remove the note.

If a component you wish to specify in a config is located in a private file, use
the public dotpath in your config. These components are typically exposed in their
parent module's :code:`__init__.py` file. This way, you can guarantee the stability
of the API you are using in your config. There should be no underscores in your
component dotpath.

.. code-block:: yaml

    # don't do this
    dataset:
      _component_: torchtune.datasets._alpaca.AlpacaDataset
      train_on_input: True

    # do this
    dataset:
      _component_: torchtune.datasets.AlpacaDataset
      train_on_input: True
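
As a quick sanity check, and assuming the class is indeed re-exported from the package's :code:`__init__.py` as described, both dotpaths should resolve to the same object; prefer the public one in your configs:

.. code-block:: python

    # Public path (use this in configs) and private path (avoid) should point to
    # the same class if it is re-exported in torchtune/datasets/__init__.py
    from torchtune.datasets import AlpacaDataset
    from torchtune.datasets._alpaca import AlpacaDataset as _PrivateAlpacaDataset

    assert AlpacaDataset is _PrivateAlpacaDataset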

Command-line overrides
----------------------
Configs are the primary location to collect all your parameters for running a recipe,
but sometimes you may want to quickly try different values without having to update
the config itself. To enable quick experimentation, you can specify override values
for parameters in your config via the :code:`tune` command. These should be specified
with the flag :code:`--override k1=v1 k2=v2 ...`

For example, to run the :code:`full_finetune` recipe with custom model and tokenizer
directories and using GPUs, you can provide overrides:

.. code-block:: bash

    tune full_finetune --config alpaca_llama2_full_finetune --override model_directory=/home/my_model_checkpoint tokenizer_directory=/home/my_tokenizer_checkpoint device=cuda

Review comment: Would it make sense to add a warning here about accessing private files vs files exposed in the init?

Reply: I think it's discussed below in "Use public APIs only".