🎯 Goal (What & Why)
We want to rework the model configurations to better support future development. The main goals are:
- Support block-modular models (Support block-modular architecture #242)
- Replace and generalize repetitive, ad-hoc parameters, such as lr scales and initialization parameters, with standardized and automated configuration parameters.
- Make more components dynamic to make it easier to experiment with new model configurations, ex. mixers, mlps, or even entire blocks or sequences of blocks.
- Help with the integration of more complex models, ex. multi-modal (vision) models.
- Generalize peft/lora beyond transformers.
🚀 Execution Plan
Blocks:
Proposal:
- Standard blocks and their configs will be replaced by a block interface with a dynamic `BlockConfig`, allowing easy swapping between all sorts of layers.
- Standard transformer and transformer-like blocks that differ only by their mixer, ex. Transformer and SSM blocks, will be unified into `TransformerBlock` with type `transformer`. For now, this will be the default and only type of block. A transformer block will be defined through three fully dynamic sub-layers: mixer, mlp and normalization (rough sketch below). `hidden_size`, `full_precision_residual` and other variables that need to remain consistent between blocks will not be part of the block config, and will instead be configured in block sequences (see below) and passed as arguments to block instantiation.
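As a rough, non-authoritative sketch of the intended shape (plain dataclasses and hypothetical names standing in for the actual config machinery):

```python
# Illustrative sketch only; every name here is hypothetical.
import dataclasses


@dataclasses.dataclass
class MixerConfig:
    """Dynamic mixer config, ex. attention, mamba, ..."""
    type: str = "attention"


@dataclasses.dataclass
class MLPConfig:
    """Dynamic mlp config, ex. dense mlp, MoE, ..."""
    type: str = "mlp"


@dataclasses.dataclass
class NormalizationConfig:
    """Dynamic normalization config, ex. layer_norm, rms_norm, ..."""
    type: str = "rms_norm"


@dataclasses.dataclass
class BlockConfig:
    """Dynamic block interface; `transformer` is the default and only type for now."""
    type: str = "transformer"


@dataclasses.dataclass
class TransformerBlockConfig(BlockConfig):
    # A transformer block is defined through three fully dynamic sub-layers.
    mixer: MixerConfig = dataclasses.field(default_factory=MixerConfig)
    mlp: MLPConfig = dataclasses.field(default_factory=MLPConfig)
    normalization: NormalizationConfig = dataclasses.field(default_factory=NormalizationConfig)
    # Note: hidden_size, full_precision_residual, etc. are deliberately absent;
    # they belong to the block sequence and are passed at block instantiation.
```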
Open questions:
- Should we allow for a different config for the two normalization sub-layers?
- Is `TransformerBlock`/`transformer` appropriate, or do we want a more inclusive name?
- How should non-standard layers, ex. embeddings and the head, relate to that interface?
Block sequences:
Proposal:
- We will introduce a concept of "block sequence", representing a logical sequence of blocks. We already have `Sequential` layers, so we will introduce a matching dynamic construct `BlockSequenceConfig` on the config side.
- We will introduce `RepeatBlockConfig` as an implementation of the block sequence interface, repeating a pattern of blocks for `num_layers`, with constant parameters like `hidden_size` (rough sketch below).
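A minimal sketch of how the block-sequence side could look (again, hypothetical names, plain dataclasses in place of the real config classes):

```python
# Illustrative sketch only; names and structure are hypothetical.
import dataclasses


@dataclasses.dataclass
class BlockConfig:
    """Stand-in for the dynamic block config sketched above."""
    type: str = "transformer"


@dataclasses.dataclass
class BlockSequenceConfig:
    """Dynamic config matching the existing `Sequential` layers."""
    type: str = "block_sequence"
    # Variables that must stay consistent across blocks live here, not in BlockConfig.
    hidden_size: int = 4096
    full_precision_residual: bool = False


@dataclasses.dataclass
class RepeatBlockConfig(BlockSequenceConfig):
    """Repeat a pattern of blocks `num_layers` times with constant parameters."""
    type: str = "repeat"
    block: BlockConfig = dataclasses.field(default_factory=BlockConfig)
    num_layers: int = 32

    def block_configs(self) -> list[BlockConfig]:
        # Purely illustrative: the real construct would build the layer modules,
        # passing hidden_size & co. as instantiation arguments.
        return [self.block] * self.num_layers
```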
Open questions:
- Names are WIP.
- Allow for nested block sequences, ex. a block in a sequence that is itself a sequence? (Useful for lm and vision use cases below)
- Should the language model itself be a block sequence? Ex. we could have a fully generalizable structure (ex. for vision) of the form:
```
base_model [BlockSequenceConfig]:
  type: lm (LM -> LMLike -> BlockSequenceConfig)
  embeddings [EmbeddingsConfig]:
    type: lm_embeddings (LmEmbeddingsConfig -> EmbeddingsConfig -> BlockSequenceConfig)
    ...
  transformer [BlockSequenceConfig]:
    type: repeat
    ...
  head:
    type: lm_head (LmHeadConfig -> HeadConfig -> BlockSequenceConfig)
```
- How should we handle vision? It could be something like:
```
base_model [BlockSequenceConfig]:
  type: multi_modal
  models:
    - vision:
        type: vision (Vision -> LMLike -> BlockSequenceConfig)
        ...
    - lm:
        type: lm (LM -> LMLike -> BlockSequenceConfig)
        ...
```
Linear:
Proposal:
- Linear layers will have their own config. Parameters so far are `weight_initialization`, `bias_initialization`, `bias`, `lr_scale`, `apply_peft` (see the sections below for details).
- Defaults will be customizable by the parent config, since having a fixed default doesn't make sense. Custom defaults will be set through the `default` non-init field of `LinearConfig` and `LinearWeightConfig`, in the parent config's `_validate` (rough sketch below).
- For layers that are the concatenation of logically distinct layers (ex. key_value, gate_and_up, moe mlp weights, ssm inner projection), there will be a separate configuration for each sub-layer.
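For concreteness, a hedged sketch of what `LinearConfig` and the parent-driven defaults could look like (hypothetical names and a simplified stand-in for the `default` non-init field / `_validate` mechanism):

```python
# Illustrative sketch only; all names are hypothetical.
import dataclasses
import typing


@dataclasses.dataclass
class InitializationConfig:
    """Dynamic initialization config (see the Initialization section below)."""
    type: str = "normal"
    std: float = 0.02


@dataclasses.dataclass
class LinearConfig:
    # Unset (None) fields fall back to defaults chosen by the parent config.
    weight_initialization: typing.Optional[InitializationConfig] = None
    bias_initialization: typing.Optional[InitializationConfig] = None
    bias: typing.Optional[bool] = None
    lr_scale: typing.Optional[float] = None
    apply_peft: typing.Optional[bool] = None

    def apply_defaults(self, *, bias: bool, weight_initialization: InitializationConfig) -> None:
        # Simplified version of what the parent config's `_validate` would do
        # through the `default` non-init field: fill in whatever the user left unset.
        if self.bias is None:
            self.bias = bias
        if self.weight_initialization is None:
            self.weight_initialization = weight_initialization


# Ex. an attention parent config could default its projections to bias=False
# (values here are arbitrary).
query = LinearConfig()
query.apply_defaults(bias=False, weight_initialization=InitializationConfig("normal", 0.02))
```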
Open questions:
- Should we make linear layers dynamic?
- Should we have a separate config class for layers that shouldn't have a bias option? (ex. MoE router)
- Is there a better way to achieve customizable defaults?
- Do we want to allow for convenience parameters to limit repetition? Ex. keep `add_linear_biases` or `init_method_std` to set all linear layers at once? This would make things less verbose, but could be more difficult to understand and less self-contained (ex. the linear configuration may appear to depend not only on `LinearConfig`, but also on arbitrary parameters in the parent config). Possible compromises:
  - Have "init-only" shortcut fields that are converted explicitly in `_from_dict`, so that actual configs end up with explicit `LinearConfig` fields, i.e. all traces of the shortcut are gone when printing or saving the config. (Ex. `add_linear_biases` replaced with an explicit `bias = False` in all `LinearConfig`s.) A rough sketch of this conversion follows the list.
  - Rely on Hydra.
- Concatenation of logically distinct layers should be manageable, but some details are still TBD. We already have examples for `lr_scale` (MoE) and `apply_peft` (key and value) that we can rely on, we can handle initialization by working with global tensors, and we shouldn't need a non-constant `bias`.
- For MoE, configuring each expert separately could be tedious, so maybe we want to make it optional.
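A rough sketch of the "init-only shortcut" conversion (hypothetical helper and field names; the real hook would live in `_from_dict`):

```python
# Illustrative sketch only: the shortcut is expanded into explicit per-layer
# fields while parsing, so it never appears in the validated or saved config.
import copy
import typing


def expand_linear_shortcuts(config_dict: dict[str, typing.Any]) -> dict[str, typing.Any]:
    config_dict = copy.deepcopy(config_dict)
    add_linear_biases = config_dict.pop("add_linear_biases", None)
    if add_linear_biases is not None:
        # Hypothetical set of linear sub-configs in the parent layer.
        for name in ("query", "key", "value", "dense"):
            config_dict.setdefault(name, {}).setdefault("bias", add_linear_biases)
    return config_dict


# {"add_linear_biases": False} -> an explicit `bias: False` in every linear sub-config.
print(expand_linear_shortcuts({"add_linear_biases": False, "query": {"lr_scale": 0.5}}))
```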
Other layers and weights:
Proposal:
- Normalization weights and biases will be managed through `NormalizationConfig`, with similar parameters to `LinearConfig` but dynamic and without a custom default (a fixed one works fine, and a custom default doesn't work well with a dynamic class).
- Other parameters (embeddings, lm output, isolated parameters) will be configured through a generic `WeightConfig`.
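A minimal sketch of the split (hypothetical names, simplified):

```python
# Illustrative sketch only; names and defaults are hypothetical.
import dataclasses
import typing


@dataclasses.dataclass
class InitializationConfig:
    """Dynamic initialization config (see the next section)."""
    type: str = "fill"
    value: float = 1.0


@dataclasses.dataclass
class WeightConfig:
    """Generic config for isolated parameters (embeddings, lm output, ...)."""
    initialization: InitializationConfig = dataclasses.field(default_factory=InitializationConfig)
    lr_scale: typing.Optional[float] = None


@dataclasses.dataclass
class NormalizationConfig:
    """Dynamic (ex. layer_norm, rms_norm); similar parameters to LinearConfig,
    but with fixed defaults: fill weights with ones, biases with zeros."""
    type: str = "rms_norm"
    weight_initialization: InitializationConfig = dataclasses.field(
        default_factory=lambda: InitializationConfig("fill", 1.0)
    )
    bias_initialization: InitializationConfig = dataclasses.field(
        default_factory=lambda: InitializationConfig("fill", 0.0)
    )
    lr_scale: typing.Optional[float] = None
```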
Open questions:
- We could make `WeightConfig` the only way to configure parameters, using composition for linear, normalization, etc. This would bring consistency at the cost of convenience (more verbose configs).
- Should we make more custom layer configurations, ex. embeddings, lm output, conv1d, etc.? Embeddings and the LM output are technically linear(-like) layers, so they could use a linear config instead (without bias). As for the layer itself, making a construct could be difficult for technical and legacy reasons (turning it into a submodule would change weight names, so it wouldn't be backward compatible).
Initialization:
Proposal:
- All parameters will be associated with a fully dynamic `InitializationConfig`, allowing for any conceivable initialization.
- For linear and custom parameters, the default will be set in the parent layer, while normalization will use fixed defaults (fill with ones/zeros).
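A sketch of what "fully dynamic" could mean in practice (hypothetical subclasses; torch used only for illustration):

```python
# Illustrative sketch only; subclass names and fields are hypothetical.
import dataclasses

import torch


@dataclasses.dataclass
class InitializationConfig:
    """Dynamic base: any conceivable initialization can be added as a subtype."""
    type: str = "none"

    def initialize(self, tensor: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError()


@dataclasses.dataclass
class NormalInitializationConfig(InitializationConfig):
    type: str = "normal"
    mean: float = 0.0
    std: float = 0.02

    def initialize(self, tensor: torch.Tensor) -> torch.Tensor:
        return torch.nn.init.normal_(tensor, self.mean, self.std)


@dataclasses.dataclass
class FillInitializationConfig(InitializationConfig):
    """Fixed default for normalization: fill weights with 1, biases with 0."""
    type: str = "fill"
    value: float = 1.0

    def initialize(self, tensor: torch.Tensor) -> torch.Tensor:
        return torch.nn.init.constant_(tensor, self.value)
```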
Open questions:
- How do we generalize default initializations? (Ex. SSMs have many isolated weights not fitting in the above categories).
LR scales:
Proposal:
- All parameters will be associated with their own customizable learning rate scale. They may be shared in some cases (ex. linear/normalization weight and bias).
- Layers and blocks may also define customizable lr scales, ex. to allow freezing an entire block. When more than one lr scale applies to a given parameter, the effect is multiplicative.
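Ex. a parameter with lr scale 0.5 inside a block with lr scale 0.25 would train at 0.125 times the base learning rate, and a block scale of 0 freezes the whole block. A tiny sketch of the combination rule (hypothetical helper):

```python
import math
import typing


def combined_lr_scale(scales: typing.Sequence[typing.Optional[float]]) -> float:
    # Unset (None) scales count as 1; every applicable scale multiplies in.
    return math.prod(scale for scale in scales if scale is not None)


assert combined_lr_scale([0.5, 0.25, None]) == 0.125  # parameter scale * block scale
assert combined_lr_scale([None, 0.0]) == 0.0          # block scale 0 -> frozen block
```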
Peft:
Proposal:
- Instead of specializing the Peft config to each model, we will use a single model-agnostic Peft config and configure Peft behavior in individual layers.
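A hedged sketch of the model-agnostic direction (hypothetical names; real LoRA handling, weight merging, etc. omitted):

```python
# Illustrative sketch only: one model-agnostic peft config decides *how* peft is
# applied, while each layer's `apply_peft` flag decides *whether* it applies.
import dataclasses

import torch


class LoRALinear(torch.nn.Module):
    """Minimal low-rank adapter wrapper, for illustration only."""

    def __init__(self, linear: torch.nn.Linear, rank: int, alpha: float):
        super().__init__()
        self.linear = linear
        self.lora_a = torch.nn.Parameter(torch.zeros(rank, linear.in_features))
        self.lora_b = torch.nn.Parameter(torch.zeros(linear.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, input_: torch.Tensor) -> torch.Tensor:
        return self.linear(input_) + self.scaling * (input_ @ self.lora_a.t() @ self.lora_b.t())


@dataclasses.dataclass
class PeftConfig:
    type: str = "lora"
    rank: int = 8
    alpha: float = 16.0

    def apply(self, layer: torch.nn.Linear, apply_peft: bool) -> torch.nn.Module:
        # Knows nothing about transformers specifically: any layer that opts in
        # (ex. through LinearConfig.apply_peft) gets wrapped the same way.
        if not apply_peft or self.type != "lora":
            return layer
        return LoRALinear(layer, self.rank, self.alpha)
```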
Open questions:
- So far `LinearConfig.apply_peft` is the only configuration parameter, which is enough for simple lora. Do we need more? Ex. other types of peft, or application to `LinearWeightConfig` or other layers.
- Where should the peft config live? Do we want it at the top level of the model config, or deeper in the config (ex. in `BlockConfig`)?