Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC][Break BC] Custom instantiate API to simplify our config system #406

Merged
merged 13 commits into from
Mar 1, 2024

Conversation

RdoubleA
Copy link
Contributor

@RdoubleA RdoubleA commented Feb 23, 2024

EDIT: This PR has been updated to use a custom built instantiate API instead of Hydra. Much of the discussion below pertains to the original proposal of using Hydra, and we have decided to pursue alternatives due to two core reasons:

  • Maintaining and debugging parts of Hydra is a heavy responsibility for TorchTune devs that is not scalable
  • Hydra contains numerous features that can let user "meta-program" via configs that are outside the scope of the TorchTune library. We do not need nor want to maintain this functionality.

The updated code changes and context now reflect the new proposal of using a custom instantiate. The original Hydra-related context is moved to the bottom.

Changelist

Apologies in advance for the giant PR, but folks have been wanting to see the full extent of the changes so I've kept it all together. I'll summarize the main highlights here:

  • torchtune/config contains the files for the updated config system. Only instantiate and parse are exposed as public APIs
  • Getter methods are no longer needed, swapped out with config.instantiate. Side note, would like to enforce a convention to use config.instantiate instead of instantiate to be explicit that this is creating an object specified from a config, similar to models.get_model
  • ALL_MODELS, ALL_DATASETS, etc are removed and callsites refactored
  • TuneArgumentParser can be replaced with a config.parse decorator
  • Configs will specify TorchTune objects with the actual path instead of a string
  • Delete the recipe/params/ directory
  • All recipes and their tests were refactored. Since alpaca_generate did not have a config previously, had to quickly add one (see also Refactor alpaca_generate recipe with new config system #302)

Motivation

As we evolve our proposed config system (#230, #308) we quickly came across several key pain points that were difficult to address.

  • In Remove hardcoded dicts for getters and collect in central location #307, there were concerns about the ease and maintainability of config instantiation, i.e., mapping config strings to the exact TorchTune object that is needed via getter methods and hardcoded dictionaries/registry/use of __all__.
  • The difficulty of specifying/adding keyword arguments for a particular class (project name for WandBLogger, for example) without having to update the params dataclass and create defaults.
  • Splitting between configs and recipe dataclasses hinders shareability, a major value prop of yaml-based config systems
    It became more clear that we were building many layers of abstractions to solve these problems that may become difficult to maintain and debug. The closest analogue is Axolotl's config system, which allows for clean, user-friendly configs but under the hood requires a lot of heavy lifting (see normalize_config and validate_config) and indirection from the user, which is not only challenging to debug but goes against our core principle of extensibility.

Here, I build custom API that lives in torchtune/config that is inspired by Hydra's core functionality of instantiate and a parser decorator. Because this will be owned by TorchTune, we have full control over what users are able to do with the config and maintaining a third-party dependency is no longer needed.

New API

  • instantiate: Takes in a DictConfig object (parsed from yaml file by OmegaConf) and expects a _path_ field that specifies the dotpath of the exact object. An instance of this object is created and returned using any additional positional args and keyword arguments provided either in the DictConfig object or directly in instantiate.
  • parse: Decorator that can be used to wrap a recipe main function that handles creating TuneArgumentParser and parsing in args from a config file and command-line. This is passed into the recipe as a DictConfig object.

Problems that instantiate solves

No getter methods

We can now instantiate objects directly from the config, and any public API in torchtune is fair game. Actually, anything that is importable in the user's environment is fair game, so we can support external objects, like BandB optimizers. You can see in the recipe that the only getters that are used are from utils, which are not relying on preexisting mappings. We can basically swap out these getter methods 1-to-1 with config.instantiate

No registry

No getter methods means no need for a registry of objects you can instantiate from the config. That means no need for hardcoded dicts, relying on __all__, or separate registry decorators, as extensively discussed in #307

Configs are now the source of truth and are more shareable

Before we had this weird split between configs and the params dataclass. instantiate requires the config to be passed in as a DictConfig object directly, bypassing params. In fact, params no longer serves any purpose. The validation logic that it previously housed can be relocated to the recipe. This makes no difference since the onus is on the user to validate flag params whether it is in the dataclass or in the recipe. We can remove this layer of abstraction and configs now contain all details of a run (minus CLI overrides)

Easily add parsing to a recipe

The config.parse decorator handles the argparsing for you, and expects almost the exact same syntax that we had designed: --config-name my_config.yaml key=value for overrides. You can see in the recipe_main that the logic is cleaner as a result.

Coerces some organization in the configs

Currently, there is nothing enforcing the organization of parameters in the config besides comments and the goodwill of the users. With instantiate, parameters/kwargs related to the model, optimizer, dataset, etc will have to be colocated together, which imposes some organization.

Ease of specifying kwargs

Any further customizations of TorchTune objects were usually specified at the top level of the config and passed into the getters with kwargs. This poses two problems: 1) what if an object has a kwarg unique to itself and not present in other similar objects, like project in wandb logger not present in other loggers. You need to add a field to params just to account for these unique flags; 2) packaging all instance arguments in **kwargs is challenging to debug and more opaque than having explicit keyword arguments. instantiate lets us specify instance arguments at the config level and instantiate the object with a single API.

Problems that instantiate introduces

One new thing to learn

Something like instantiate is not commonly known to users as much as something like argparse, so there is a bit of a learning curve. But this can be addressed with thorough documentation and tutorials, and I want to point out that we are in effect removing two layers of abstraction that users will no longer have to learn (recipe dataclasses, getter methods)

Instantiating private APIs

Users can technically instantiate an API from a private module from the config. We need to do one of two things:

  • Restrict users from doing this by adding a check in get_object_from_path
  • Log a stern warning that users should use the public APIs and we can't guarantee stability

Config UX is a bit awkward

The use of _path_ and full module paths in the yaml files are a bit intimidating. But this is subjective and in my opinion is not too significant of a downgrade.

Verdict

This approach creates intuitiveness and seamlessness of going from config to all the python objects needed in the recipe without requiring many layers of logic while enabling all the features we need from a robust config system. We no longer have to worry about maintaining a third party library or unexpected user behaviors that Hydra would've introduced.

Test plan

Successful unit test: pytest tests
Successful recipe tests: recipes/tests/run_test.sh
Ran recipes manually with success:
tune --nnodes 1 --nproc_per_node 1 full_finetune --config alpaca_llama2_full_finetune --override max_steps_per_epoch=2 epochs=1 enable_fsdp=False enable_activation_checkpointing=False batch_size=1 gradient_accumulation_steps=1

tune --nnodes 1 --nproc_per_node 8 lora_finetune --config alpaca_llama2_lora_finetune --override max_steps_per_epoch=2 epochs=1

Old context

We've initially dismissed Hydra as a non-starter due to the fact that it is not actively developed, but I'd like to challenge that premise and have the discussion of how using Hydra APIs solves all these pain points.

Hydra and what APIs we are using

Hydra is a Meta-owned OSS library (2M downloads and >8K stars) that provides utilities to create YAML-based configs, call a script with a certain config + overrides from command-line, and enables direct instantiation of Python objects. It's a well-established library that is currently used in production systems. The main challenge is that it is community supported and not actively developed on, so taking on this dependency carries some inherent risk. Let's first discuss a bit what we will use from the library and we can assess the risk.

We are primarily using two APIs: hydra.utils.instantiate and the hydra.main decorator. Turns out, hydra.main is simply argparse under the hood that takes in a yaml config file and CLI overrides, exactly what we were rebuilding. We can essentially swap out TuneArgumentParser with hydra.main one-to-one. hydra.utils.instantiate directly creates the python object from the config without any layers of abstraction in between, which is exactly what we were trying to build registries for. Both of these APIs are core, stable functionality that is unlikely to see breakages. We are NOT using the full functionality of Hydra nor its experimental features. That being said, we should still understand what our plan is to manage this risk.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 23, 2024
Copy link

netlify bot commented Feb 23, 2024

Deploy Preview for torchtune-preview ready!

Name Link
🔨 Latest commit ec4e68f
🔍 Latest deploy log https://app.netlify.com/sites/torchtune-preview/deploys/65e11bab7ceefc00089253c4
😎 Deploy Preview https://deploy-preview-406--torchtune-preview.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@kartikayk
Copy link
Contributor

Thanks for putting this up. It's crazy how much cleaner and more extensible this is compared to other alternatives and our current config system. A few thoughts:

  • Config Manipulation.As you pointed out offline, existing libraries/frameworks do a lot of config manipulation behind the scenes (updating config values based on backend for example). IMO this sort of manipulation while a nice user experience leads to debugging nightmare. I think WYSIWYG (what you see is what you get) should be a core principle for our config system. Hydra makes this easy by removing layers of abstractions.
  • Documenting Configs.Getting rid of the dataclass is nice for a variety of reasons, but it does put more onus on documenting configs correctly. For example, the current recipe will only work with model: torchtune.models.llama2_7b, otherwise this will break. How should we document this? This cannot be in a tutorial, and I don't want to depend on error validation to catch this - we should be more proactive. Should we add this documentation in the "default config" for each recipe? Or open to better ideas.
  • Breaking BC. We should make it clear that the next steps of this RFC will be to update existing recipes and getting rid of our current config system. As an aside, this will break backwards compatibility for all users. And we need to figure out how we do this moving forward, esp as we start opening this repo up to users. As a first step I would also start adding [RFC] and [Break BC] to the title of PRs such as this one. But we need to have a serious discussion around this with the team.
  • Recipe Tests. One big open question here is around recipe tests. How would this work? Right now we pass a dict of params for testing [example]. Can I still pass dicts to set this up? Or how will that work?
  • Config tests. We should test that the params are being correctly passed around.

@NicolasHug
Copy link
Member

Hydra is no longer maintained

Sorry to be that guy (always :( ) but that's a much more dangerous problem that one might think, and a big non-starter IMO.

I will note that if we pin the version to stable and do not use any newly added features, the chance of a breakage in their core APIs (instantiate, main) is low.

A package that doesn't get released and doesn't keep up with Python versions will break as Python version evolve over time - and yes Python does make BC-breaking changes even in the 3.X series. See e.g. pytorch/vision#8119 as an example where, if the code-base isn't updated, users wouldn't be able to use Python 3.12.

I'm also having a bunch of release-related issues on surprise as dependencies evolve NicolasHug/Surprise#473 NicolasHug/Surprise#468 NicolasHug/Surprise#470 and yes that's because I'm being naughty and not maintaining that package often enough.

@kartikayk
Copy link
Contributor

kartikayk commented Feb 23, 2024

One other question I have: are there any gotchas with instantiate?

@kartikayk
Copy link
Contributor

kartikayk commented Feb 23, 2024

@NicolasHug given that Hydra is a Meta-owned project, if indeed this is the right solution for our lib, we can figure out how to bring this back into active development. For this RFC (and the MVP), I'd love to see if Hydra is the right solution. The context section makes a strong case for this.

Edit: "bring this back into active development" -> contribute to this project if this is the right solution for users and us.

@NicolasHug
Copy link
Member

NicolasHug commented Feb 23, 2024

Then make it clear inthe RFC :)
Because now you're not just adding a new dependency, you're adding a new project that you need to maintain.

@kartikayk
Copy link
Contributor

kartikayk commented Feb 23, 2024

Fair point, @RdoubleA can you add this?

Overall, Hydra has over 2M downloads and >8K stars so also want to make sure we don't come across as taking a dependency on a random project.

Edit: Seems like failures are being actively fixed by the community [example]. based on this info, @RdoubleA, I think we should loosen our tenor on "this project is not maintained" to "this project is not being actively developed".

@joecummings
Copy link
Contributor

@NicolasHug given that Hydra is a Meta-owned project, if indeed this is the right solution for our lib, we can figure out how to bring this back into active development. For this RFC (and the MVP), I'd love to see if Hydra is the right solution. The context section makes a strong case for this.

Will be going through for a thorough-review, but this comment does not give me confidence. What I hear here is that we have to rely on our own powers of convincing to make sure that this other Meta project is maintained for our own interests. What I think is much more likely is that we become the TorchTune + Hydra team, which I would be surprised if any of us want.

@kartikayk
Copy link
Contributor

kartikayk commented Feb 23, 2024

@joecummings

FYI:

Overall, Hydra has over 2M downloads and >8K stars so also want to make sure we don't come across as taking a dependency on a random project. Seems like failures are being actively fixed by the community [https://github.com/facebookresearch/hydra/pull/2842].

What I think is much more likely is that we become the TorchTune + Hydra team, which I would be surprised if any of us want.

We were in the process of building our own "Hydra" which seems like an order of magnitude worse.

@joecummings
Copy link
Contributor

@kartikayk

We were in the process of building our own "Hydra" which seems like an order of magnitude worse.

While I agree our solution needs more thought and no simple solution has yet been presented, it's based on argparse + plain YAML, which are both very popular, widely accepted, and easily understood.

Also, while the project had a recent "fix" pushed, the latest release is EXACTLY 1 year ago: https://github.com/facebookresearch/hydra/releases/tag/v1.3.2

@kartikayk
Copy link
Contributor

@joecummings

it's based on argparse + plain YAML, which are both very popular, widely accepted, and easily understood.

Thats not a great reason for continuing down that path if they don't satisfy our requirements :)

@NicolasHug
Copy link
Member

Intrusive Configs. Config objects should not intrude into the class implementation. Configs should interact with these classes through cleanly defined builder functions which convert the config into flat parameters needed to instantiate an object.

This is from the best practices section. For the config noobs out there (me), would the adoption of Hydra change anything to these guidelines?

@kartikayk
Copy link
Contributor

@NicolasHug good question - nope we still have the separation we did before. The config is only visible to the recipe builder/setup methods and nothing more. Concretely, no building block from the lib should ever take in a config object and that continues to be the case

@RdoubleA RdoubleA changed the title [WIP] Hail Hydra [WIP][RFC][Break BC] Hail Hydra Feb 23, 2024
@RdoubleA RdoubleA marked this pull request as ready for review February 23, 2024 19:29
@RdoubleA
Copy link
Contributor Author

Thanks everyone for the discussion so far. I've updated the context with some more details to clarify a few key points.

A package that doesn't get released and doesn't keep up with Python versions will break as Python version evolve over time - and yes Python does make BC-breaking changes even in the 3.X series.

@NicolasHug This is a great point and something we definitely need to keep in mind if we add Hydra as a dependency. I'd like to open up the discussion of what "maintaining" Hydra would look like if we have to contribute to it directly. What are the barebones items we need to address to keep the lights on? If it's too much, can we garner support internally to see more active development?

more likely is that we become the TorchTune + Hydra team, which I would be surprised if any of us want

@joecummings I agree being the TorchTune + Hydra team is not ideal, but I want to assess what this will look like, because we can communicate clear expectations that we are not taking on active development of Hydra, just the bare minimum to keep stable APIs running. I don't think this will be as heavy as we think it is.

it's based on argparse + plain YAML, which are both very popular, widely accepted, and easily understood

Hydra is based off of argparse + plain YAML, this is not going away. In fact, I agree with @kartikayk that we are converging towards building our own Hydra, which DOES bring on the full burden of maintaining it, fixing bugs, supporting new features, while being less polished and less battle-tested. This is worse imo.

@RdoubleA RdoubleA changed the title [WIP][RFC][Break BC] Hail Hydra [WIP][RFC][Break BC] Using Hydra's instantiate API to simplify our config system Feb 23, 2024
@joecummings joecummings added the rfc Request for comments label Feb 23, 2024
@RdoubleA RdoubleA changed the title [WIP][RFC][Break BC] Using Hydra's instantiate API to simplify our config system [RFC][Break BC] Using Hydra's instantiate API to simplify our config system Feb 23, 2024
@RdoubleA RdoubleA changed the title [RFC][Break BC] Using Hydra's instantiate API to simplify our config system [WIP][RFC][Break BC] Using Hydra's instantiate API to simplify our config system Feb 23, 2024
@RdoubleA RdoubleA force-pushed the rafiayub/hail_hydra branch from 26dadec to 3ab9a99 Compare February 23, 2024 20:40
@RdoubleA RdoubleA changed the title [WIP][RFC][Break BC] Using Hydra's instantiate API to simplify our config system [RFC][Break BC] Using Hydra's instantiate API to simplify our config system Feb 23, 2024
but sometimes you may want to quickly try different values without having to update
the config itself. To enable quick experimentation, you can specify override values
to parameters in your config via the :code:`tune` command. These should be specified
with the flag :code:`--override k1=v1 k2=v2 ...`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we still need to stick with --override? That's a bit clunky, why can't we have kwargs specified normally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--override is more explicit and will show up in the help message. but I don't feel strongly about this, we can revert it back to the conventional way in a follow-up

model: str = ""
model_checkpoint: str = ""
dataset:
_component_: torchtune.datasets.AlpacaDataset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to add a warning here about accessing private files vs files exposed in the init?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's discussed below in "Use public APIs only"

recipe.cleanup()


Linking recipes and configs with :code:`tune`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to remove this right??? We simply won't support linking like this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only things available like this will be the built-in recipes that WE control.


from recipes.params.full_finetune import FullFinetuneParams
from recipes.params.lora_finetune import LoRAFinetuneParams
from torchtune.utils.argparse import TuneArgumentParser

ROOT_DIR: str = os.path.join(os.path.abspath(__file__), "../../../configs")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We need to standardize our use of Pathlib vs os.path. I prefer Pathlib as it's more modern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this looks disgusting lol

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made an issue #434

_component_: torchtune.datasets.AlpacaDataset
train_on_input: True

Order in config is order of instantiation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I like the idea here, but tbh I don't think we should add it as guidance for users. Imo in the documentation we should call out that there can be some implicit ordering of config instantiation in the recipe, and we should also document it in our configs (as you've done here), but idk that we should actually tell users "hey make the ordering in your config line up with the order of instantiation in your recipe" as it is just an extra burden

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But these are just suggestions right? we're not requiring or burdening them to do this. Otherwise, do we need a best practices section, or just leave it implicit through our example configs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this is a best practice, I think it's good to call it out? We spoke a bit about this yesterday, but I think it would be nice if folks think about while constructing their configs. Unless @ebsmothers you think this will lead to unintuitive config structure (eg: tokenizer being grouped with dataset instead of model, which doesn;t sound too bad to me).


.. code-block:: python
Use public APIs only
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this section duplicative of the note above about AlpacaDataset in the "Configuring components using instantiate" section?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, I don't mind being redundant to emphasize the point, as most users will just skim through this. but if it's too much overlap in wording, I can remove the note

Comment on lines 113 to 135
tokenizer = config.instantiate(cfg.tokenizer)
dataset = config.instantiate(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imo this example can be made clearer by explicitly saying "here is the tokenizer API, here is the dataset API" prior to just calling config.instantiate. Basically just give people a clear mental model of how they can resolve dependencies in initialization via instantiate

The following configs can be used to run this recipe:
>>> tune ls
RECIPE CONFIG
full_finetune alpaca_llama2_full_finetune
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: add one more sentence like "see the config file for fields required to run this recipe"

Copy link
Contributor

@ebsmothers ebsmothers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a beast of a change, huge kudos to you for getting through all of this. Left a few more comments but gonna accept now so that you're unblocked. Please make sure that (a) all tests are passing, (b) all recipes run successfully, and (c) all remaining comments are addressed. Modulo that, this looks landable to me.

"full_finetune",
"lora_finetune",
"alpaca_generate",
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait one more thing.. why not add alpaca_llama2_generate.yaml to _CONFIG_LISTS?

@RdoubleA RdoubleA force-pushed the rafiayub/hail_hydra branch from d916aae to ec4e68f Compare March 1, 2024 00:04
@RdoubleA RdoubleA merged commit 760d721 into main Mar 1, 2024
17 checks passed
@RdoubleA RdoubleA deleted the rafiayub/hail_hydra branch March 1, 2024 00:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. rfc Request for comments
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants