[RFC][Break BC] Custom instantiate API to simplify our config system #406
Conversation
Thanks for putting this up. It's crazy how much cleaner and more extensible this is compared to other alternatives and our current config system. A few thoughts:
Sorry to be that guy (always :( ), but that's a much more dangerous problem than one might think, and a big non-starter IMO.
A package that doesn't get released and doesn't keep up with Python versions will break as Python evolves over time - and yes, Python does make BC-breaking changes even in the 3.X series. See e.g. pytorch/vision#8119 as an example where, if the code-base isn't updated, users wouldn't be able to use Python 3.12. I'm also having a bunch of release-related issues on Surprise as dependencies evolve (NicolasHug/Surprise#473, NicolasHug/Surprise#468, NicolasHug/Surprise#470), and yes, that's because I'm being naughty and not maintaining that package often enough.
One other question I have: are there any gotchas with
@NicolasHug given that Hydra is a Meta-owned project, if indeed this is the right solution for our lib, we can figure out how to bring this back into active development. For this RFC (and the MVP), I'd love to see if Hydra is the right solution. The context section makes a strong case for this. Edit: "bring this back into active development" -> contribute to this project if this is the right solution for users and us.
Then make it clear in the RFC :)
Fair point, @RdoubleA can you add this? Overall, Hydra has over 2M downloads and >8K stars, so I also want to make sure we don't come across as taking a dependency on a random project. Edit: Seems like failures are being actively fixed by the community [example]. Based on this info, @RdoubleA, I think we should soften our tone from "this project is not maintained" to "this project is not being actively developed".
Will be going through for a thorough review, but this comment does not give me confidence. What I hear here is that we have to rely on our own powers of convincing to make sure that this other Meta project is maintained for our own interests. What I think is much more likely is that we become the TorchTune + Hydra team, which I would be surprised if any of us want.
FYI:
We were in the process of building our own "Hydra" which seems like an order of magnitude worse. |
While I agree our solution needs more thought and no simple solution has yet been presented, it's based on argparse + plain YAML, which are both very popular, widely accepted, and easily understood. Also, while the project had a recent "fix" pushed, the latest release is EXACTLY 1 year ago: https://github.com/facebookresearch/hydra/releases/tag/v1.3.2
That's not a great reason for continuing down that path if they don't satisfy our requirements :)
This is from the best practices section. For the config noobs out there (me), would the adoption of Hydra change anything in these guidelines?
@NicolasHug good question - nope we still have the separation we did before. The config is only visible to the recipe builder/setup methods and nothing more. Concretely, no building block from the lib should ever take in a config object and that continues to be the case |
Thanks everyone for the discussion so far. I've updated the context with some more details to clarify a few key points.
@NicolasHug This is a great point and something we definitely need to keep in mind if we add Hydra as a dependency. I'd like to open up the discussion of what "maintaining" Hydra would look like if we have to contribute to it directly. What are the barebones items we need to address to keep the lights on? If it's too much, can we garner support internally to see more active development?
@joecummings I agree being the TorchTune + Hydra team is not ideal, but I want to assess what this will look like, because we can communicate clear expectations that we are not taking on active development of Hydra, just the bare minimum to keep stable APIs running. I don't think this will be as heavy as we think it is.
Hydra is based on argparse + plain YAML, so that is not going away. In fact, I agree with @kartikayk that we are converging towards building our own Hydra, which DOES bring on the full burden of maintaining it, fixing bugs, and supporting new features, while being less polished and less battle-tested. This is worse imo.
but sometimes you may want to quickly try different values without having to update
the config itself. To enable quick experimentation, you can specify override values
to parameters in your config via the :code:`tune` command. These should be specified
with the flag :code:`--override k1=v1 k2=v2 ...`
So we still need to stick with `--override`? That's a bit clunky, why can't we have kwargs specified normally?
`--override` is more explicit and will show up in the help message. But I don't feel strongly about this, we can revert it back to the conventional way in a follow-up.
model: str = ""
model_checkpoint: str = ""
dataset: | ||
_component_: torchtune.datasets.AlpacaDataset |
Would it make sense to add a warning here about accessing private files vs files exposed in the init?
I think it's discussed below in "Use public APIs only"
recipe.cleanup()

Linking recipes and configs with :code:`tune`
I think we want to remove this right??? We simply won't support linking like this.
Only things available like this will be the built-in recipes that WE control.
from recipes.params.full_finetune import FullFinetuneParams
from recipes.params.lora_finetune import LoRAFinetuneParams
from torchtune.utils.argparse import TuneArgumentParser

ROOT_DIR: str = os.path.join(os.path.abspath(__file__), "../../../configs")
nit: We need to standardize our use of Pathlib vs os.path. I prefer Pathlib as it's more modern.
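For reference, a minimal sketch of what the pathlib version might look like (assuming the intent is the configs directory three levels up from this file):

```python
from pathlib import Path

# Equivalent of os.path.join(os.path.abspath(__file__), "../../../configs"),
# but resolved instead of carrying the literal ".." segments.
ROOT_DIR: str = str(Path(__file__).resolve().parents[2] / "configs")
```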
Yeah this looks disgusting lol
made an issue #434
docs/source/examples/configs.rst
_component_: torchtune.datasets.AlpacaDataset
train_on_input: True

Order in config is order of instantiation
So I like the idea here, but tbh I don't think we should add it as guidance for users. Imo in the documentation we should call out that there can be some implicit ordering of config instantiation in the recipe, and we should also document it in our configs (as you've done here), but idk that we should actually tell users "hey make the ordering in your config line up with the order of instantiation in your recipe" as it is just an extra burden
But these are just suggestions, right? We're not requiring or burdening them to do this. Otherwise, do we need a best practices section, or should we just leave it implicit through our example configs?
Given this is a best practice, I think it's good to call it out? We spoke a bit about this yesterday, but I think it would be nice if folks think about it while constructing their configs. Unless @ebsmothers you think this will lead to unintuitive config structure (eg: tokenizer being grouped with dataset instead of model, which doesn't sound too bad to me).
.. code-block:: python
Use public APIs only |
Isn't this section duplicative of the note above about AlpacaDataset in the "Configuring components using instantiate" section?
Yep, I don't mind being redundant to emphasize the point, as most users will just skim through this. But if it's too much overlap in wording, I can remove the note.
docs/source/examples/configs.rst
tokenizer = config.instantiate(cfg.tokenizer)
dataset = config.instantiate( |
Imo this example can be made clearer by explicitly saying "here is the tokenizer API, here is the dataset API" prior to just calling config.instantiate. Basically just give people a clear mental model of how they can resolve dependencies in initialization via instantiate
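For example, a rough sketch of what that could look like in the docs (the field names and the tokenizer kwarg here are assumptions based on the diff above):

```python
from torchtune import config

# cfg is the DictConfig parsed from the yaml file
tokenizer = config.instantiate(cfg.tokenizer)  # build the dependency first
dataset = config.instantiate(cfg.dataset, tokenizer=tokenizer)  # then inject it
```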
The following configs can be used to run this recipe:
>>> tune ls | ||
RECIPE CONFIG | ||
full_finetune alpaca_llama2_full_finetune |
Nit: add one more sentence like "see the config file for fields required to run this recipe"
This is a beast of a change, huge kudos to you for getting through all of this. Left a few more comments but gonna accept now so that you're unblocked. Please make sure that (a) all tests are passing, (b) all recipes run successfully, and (c) all remaining comments are addressed. Modulo that, this looks landable to me.
"full_finetune", | ||
"lora_finetune", | ||
"alpaca_generate", | ||
] |
Oh wait, one more thing.. why not add `alpaca_llama2_generate.yaml` to `_CONFIG_LISTS`?
EDIT: This PR has been updated to use a custom-built instantiate API instead of Hydra. Much of the discussion below pertains to the original proposal of using Hydra, and we have decided to pursue alternatives due to two core reasons: the maintenance burden of depending on Hydra, and the unexpected user behaviors it could introduce.
The updated code changes and context now reflect the new proposal of using a custom instantiate. The original Hydra-related context is moved to the bottom.
Changelist
Apologies in advance for the giant PR, but folks have been wanting to see the full extent of the changes so I've kept it all together. I'll summarize the main highlights here:
- `torchtune/config` contains the files for the updated config system. Only `instantiate` and `parse` are exposed as public APIs
- Getter methods are swapped out for `config.instantiate`. Side note, would like to enforce a convention to use `config.instantiate` instead of `instantiate` to be explicit that this is creating an object specified from a config, similar to `models.get_model`
- `ALL_MODELS`, `ALL_DATASETS`, etc. are removed and callsites refactored
- `TuneArgumentParser` can be replaced with a `config.parse` decorator
- Removed the `recipe/params/` directory
- `alpaca_generate` did not have a config previously, had to quickly add one (see also Refactor alpaca_generate recipe with new config system #302)

Motivation
As we evolve our proposed config system (#230, #308), we quickly came across several key pain points that were difficult to address, such as getter methods that require registries of allowed objects (hardcoded dicts or reliance on `__all__`). It became more clear that we were building many layers of abstractions to solve these problems that may become difficult to maintain and debug. The closest analogue is Axolotl's config system, which allows for clean, user-friendly configs but under the hood requires a lot of heavy lifting (see normalize_config and validate_config) and indirection from the user, which is not only challenging to debug but goes against our core principle of extensibility.
Here, I build a custom API that lives in `torchtune/config` and is inspired by Hydra's core functionality of instantiate and a parser decorator. Because this will be owned by TorchTune, we have full control over what users are able to do with the config, and maintaining a third-party dependency is no longer needed.

New API
- `instantiate`: Takes in a `DictConfig` object (parsed from the yaml file by OmegaConf) and expects a `_path_` field that specifies the dotpath of the exact object. An instance of this object is created and returned using any additional positional args and keyword arguments provided either in the `DictConfig` object or directly in `instantiate`.
- `parse`: Decorator that can be used to wrap a recipe main function that handles creating `TuneArgumentParser` and parsing in args from a config file and command-line. This is passed into the recipe as a `DictConfig` object.

Problems that instantiate solves
No getter methods
We can now `instantiate` objects directly from the config, and any public API in torchtune is fair game. Actually, anything that is importable in the user's environment is fair game, so we can support external objects, like BandB optimizers. You can see in the recipe that the only getters that are used are from utils, which are not relying on preexisting mappings. We can basically swap out these getter methods 1-to-1 with `config.instantiate`.
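As a rough sketch (assuming the `_component_` key shown in the example configs above; the `torch.nn.Linear` target is purely illustrative):

```python
from omegaconf import OmegaConf
from torchtune import config

cfg = OmegaConf.create("""
model:
  _component_: torch.nn.Linear   # any importable dotpath works, not just torchtune
  in_features: 16
  out_features: 8
""")

# No getter method or registry lookup: the dotpath is imported and called with
# the remaining config fields (plus any extra kwargs) as constructor arguments.
model = config.instantiate(cfg.model)
```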
No registry
No getter methods means no need for a registry of objects you can instantiate from the config. That means no need for hardcoded dicts, relying on `__all__`, or separate registry decorators, as extensively discussed in #307.

Configs are now the source of truth and are more shareable
Before we had this weird split between configs and the params dataclass. `instantiate` requires the config to be passed in as a `DictConfig` object directly, bypassing params. In fact, params no longer serves any purpose. The validation logic that it previously housed can be relocated to the recipe. This makes no difference since the onus is on the user to validate flag params whether it is in the dataclass or in the recipe. We can remove this layer of abstraction, and configs now contain all details of a run (minus CLI overrides).

Easily add parsing to a recipe
The `config.parse` decorator handles the argparsing for you, and expects almost the exact same syntax that we had designed: `--config-name my_config.yaml key=value` for overrides. You can see in the `recipe_main` that the logic is cleaner as a result.
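A minimal recipe sketch of how this could look (the config fields and recipe body are illustrative, not the actual full_finetune recipe):

```python
from omegaconf import DictConfig
from torchtune import config

@config.parse
def recipe_main(cfg: DictConfig) -> None:
    # cfg merges the yaml file passed on the command line with any key=value
    # overrides, so the recipe body only ever deals with a DictConfig.
    model = config.instantiate(cfg.model)
    ...

if __name__ == "__main__":
    recipe_main()
```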
Coerces some organization in the configs

Currently, there is nothing enforcing the organization of parameters in the config besides comments and the goodwill of the users. With `instantiate`, parameters/kwargs related to the model, optimizer, dataset, etc. will have to be colocated together, which imposes some organization.

Ease of specifying kwargs
Any further customizations of TorchTune objects were usually specified at the top level of the config and passed into the getters with kwargs. This poses two problems: 1) what if an object has a kwarg unique to itself and not present in other similar objects, like `project` in the wandb logger, which other loggers don't have? You need to add a field to params just to account for these unique flags; 2) packaging all instance arguments in **kwargs is challenging to debug and more opaque than having explicit keyword arguments.
`instantiate` lets us specify instance arguments at the config level and instantiate the object with a single API.
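For instance, a hypothetical config block (the logger dotpath and field names are assumptions for illustration):

```python
from omegaconf import OmegaConf
from torchtune import config

cfg = OmegaConf.create("""
metric_logger:
  _component_: torchtune.utils.metric_logging.WandBLogger  # assumed dotpath
  project: my_finetune_runs   # unique to the wandb logger; other loggers simply omit it
""")

# The unique kwarg lives under the component it belongs to rather than being a
# top-level param funneled through **kwargs.
logger = config.instantiate(cfg.metric_logger)
```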
Problems that instantiate introduces

One new thing to learn
Something like `instantiate` is not commonly known to users as much as something like argparse, so there is a bit of a learning curve. But this can be addressed with thorough documentation and tutorials, and I want to point out that we are in effect removing two layers of abstraction that users will no longer have to learn (recipe dataclasses, getter methods).

Instantiating private APIs
Users can technically instantiate an API from a private module from the config. We need to do one of two things: add a guard in `get_object_from_path` that disallows private modules, or clearly document that only public APIs should be referenced in configs.
Config UX is a bit awkward
The use of `_path_` and full module paths in the yaml files is a bit intimidating. But this is subjective and in my opinion not too significant of a downgrade.

Verdict
This approach makes it intuitive and seamless to go from the config to all the Python objects needed in the recipe, without requiring many layers of logic, while enabling all the features we need from a robust config system. We no longer have to worry about maintaining a third-party library or the unexpected user behaviors that Hydra would've introduced.
Test plan
Successful unit test:
pytest tests
Successful recipe tests:
recipes/tests/run_test.sh
Ran recipes manually with success:
tune --nnodes 1 --nproc_per_node 1 full_finetune --config alpaca_llama2_full_finetune --override max_steps_per_epoch=2 epochs=1 enable_fsdp=False enable_activation_checkpointing=False batch_size=1 gradient_accumulation_steps=1
tune --nnodes 1 --nproc_per_node 8 lora_finetune --config alpaca_llama2_lora_finetune --override max_steps_per_epoch=2 epochs=1
Old context
We initially dismissed Hydra as a non-starter because it is not actively developed, but I'd like to challenge that premise and discuss how using Hydra APIs solves all these pain points.
Hydra and what APIs we are using
Hydra is a Meta-owned OSS library (2M downloads and >8K stars) that provides utilities to create YAML-based configs, call a script with a certain config + overrides from the command line, and enables direct instantiation of Python objects. It's a well-established library that is currently used in production systems. The main challenge is that it is community supported and not actively developed, so taking on this dependency carries some inherent risk. Let's first discuss a bit what we will use from the library and we can assess the risk.
We are primarily using two APIs: `hydra.utils.instantiate` and the `hydra.main` decorator. Turns out, `hydra.main` is simply argparse under the hood that takes in a yaml config file and CLI overrides, exactly what we were rebuilding. We can essentially swap out `TuneArgumentParser` with `hydra.main` one-to-one. `hydra.utils.instantiate` directly creates the python object from the config without any layers of abstraction in between, which is exactly what we were trying to build registries for. Both of these APIs are core, stable functionality that is unlikely to see breakages. We are NOT using the full functionality of Hydra nor its experimental features. That being said, we should still understand what our plan is to manage this risk.
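For concreteness, a sketch of how those two APIs are typically used (the config path/name and the `_target_` under `cfg.model` are illustrative, not torchtune code):

```python
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="configs", config_name="alpaca_llama2_full_finetune", version_base=None)
def recipe_main(cfg: DictConfig) -> None:
    # hydra.main parses the yaml plus any key=value CLI overrides into cfg,
    # much like the TuneArgumentParser flow it would replace.
    model = hydra.utils.instantiate(cfg.model)  # builds the object at cfg.model._target_

if __name__ == "__main__":
    recipe_main()
```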