
Proposal: new model configuration mechanism #356

@jlamypoirier

Description

🎯 Goal (What & Why)

We want to rework the model configurations to help us going forward. The main goals are:

  • Support block-modular models (Support block-modular architecture #242)
  • Replace and generalize repetitive, ad-hoc parameters, such as lr scales and initialization parameters, with standardized and automated configuration parameters.
  • Make more components dynamic so it's easier to experiment with new model configurations, ex. mixers, mlps, or even entire blocks or sequences of blocks.
  • Help with the integration of more complex models, ex. multi-modal (vision) models.
  • Generalize peft/lora beyond transformers.

🚀 Execution Plan

Blocks:

Proposal:

  • Standard blocks and their configs will be replaced by a block interface with a dynamic BlockConfig, allowing easy swap between all sorts of layers.
  • Standard transformer and transformer-like blocks that differ only by their mixer, ex. Transformer and SSM blocks, will be unified into TransformerBlock with type transformer. For now, this will be the default and only type of block. A transformer block will be defined through three fully dynamic sub-layers: mixer, mlp and normalization (see the sketch after this list).
  • hidden_size, full_precision_residual and other variables that need to remain consistent between blocks will not be part of the block config, and will instead be configured in block sequences (see below) and be passed as arguments to block instantiation.
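
A minimal sketch of the intended shape, using plain dataclasses rather than the actual config machinery; apart from BlockConfig and TransformerBlock mentioned above, the names, fields and defaults here are placeholders:

from dataclasses import dataclass, field


@dataclass
class MixerConfig:
    # Dynamic mixer config (attention, SSM, ...), selected by `type`.
    type: str = "attention"


@dataclass
class MLPConfig:
    type: str = "mlp"


@dataclass
class NormalizationConfig:
    type: str = "layer_norm"


@dataclass
class BlockConfig:
    # Base of the dynamic block interface; `type` selects the implementation.
    type: str = "transformer"


@dataclass
class TransformerBlockConfig(BlockConfig):
    # `type: transformer`: a block defined by three fully dynamic sub-layers.
    mixer: MixerConfig = field(default_factory=MixerConfig)
    mlp: MLPConfig = field(default_factory=MLPConfig)
    normalization: NormalizationConfig = field(default_factory=NormalizationConfig)

    def get_block(self, hidden_size: int, full_precision_residual: bool = False):
        # Variables that must stay consistent across blocks are not stored
        # in the config; the enclosing block sequence passes them in here.
        ...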

Open questions:

  • Should we allow for a different config for the two normalization sub-layers?
  • Is TransformerBlock / transformer appropriate, or do we want a more inclusive name?
  • How should non-standard layers, ex. embeddings and the head, relate to that interface?

Block sequences:

Proposal:

  • We will introduce the concept of a "block sequence", representing a logical sequence of blocks. We already have Sequential layers, so we will introduce a matching dynamic construct, BlockSequenceConfig, on the config side.
  • We will introduce RepeatBlockConfig as an implementation of the block sequence interface, repeating a pattern of blocks num_layers times, with shared parameters such as hidden size (see the sketch after this list).
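
Continuing the dataclass sketch from the Blocks section (names still tentative), RepeatBlockConfig could look roughly like this:

from dataclasses import dataclass, field


@dataclass
class BlockSequenceConfig:
    # Config-side counterpart of a Sequential layer; `type` selects the implementation.
    type: str = "repeat"


@dataclass
class RepeatBlockConfig(BlockSequenceConfig):
    # Repeat a pattern of blocks num_layers times, holding the parameters that
    # must stay consistent across blocks (hidden_size, residual precision, ...).
    blocks: list[TransformerBlockConfig] = field(
        default_factory=lambda: [TransformerBlockConfig()]
    )
    num_layers: int = 12
    hidden_size: int = 1024
    full_precision_residual: bool = False

    def get_blocks(self):
        for index in range(self.num_layers):
            block_config = self.blocks[index % len(self.blocks)]
            yield block_config.get_block(
                hidden_size=self.hidden_size,
                full_precision_residual=self.full_precision_residual,
            )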

Open questions:

  • Names are WIP.
  • Allow for nested block sequences, ex. a block in a sequence that is itself a sequence? (Useful for lm and vision use cases below)
  • Should the language model itself be a block sequence? Ex. we could have a fully generalizable structure (ex. for vision, see below) of the form:
    base_model [BlockSequenceConfig]:
      type: lm (LM -> LMLike ->  BlockSequenceConfig)
      embeddings [EmbeddingsConfig]:
        type: lm_embeddings (LmEmbeddingsConfig -> EmbeddingsConfig -> BlockSequenceConfig)
        ...
      transformer [BlockSequenceConfig]:
        type: repeat
        ...
      head:
        type: lm_head (LmHeadConfig -> HeadConfig -> BlockSequenceConfig)
    
  • How should we handle vision? It could be something like:
base_model [BlockSequenceConfig]:
  type: multi_modal
  models:
    - vision:
        type: vision (Vision -> LMLike  ->  BlockSequenceConfig)
        ...
    - lm:
        type: lm (LM -> LMLike ->  BlockSequenceConfig)
        ...

Linear:

Proposal:

  • Linear layers will have their own config. The parameters so far are weight_initialization, bias_initialization, bias, lr_scale and apply_peft (see the sections below for details, and the sketch after this list).
  • Defaults will be customizable by the parent config, since a fixed default doesn't make sense. Custom defaults will be set through the default non-init field of LinearConfig and LinearWeightConfig, in the parent config's _validate.
  • For layers that are the concatenation of logically distinct layers (ex. key_value, gate_and_up, moe mlp weights, ssm inner projection), there will be a separate configuration for each sub-layer.
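
A rough sketch of LinearConfig and the parent-side defaulting, again as plain dataclasses; None stands for "unset, to be filled in by the parent", and AttentionConfig is only an illustration of the parent config:

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InitializationConfig:
    # Placeholder; see the Initialization section below for the dynamic version.
    type: str = "normal"
    std: float = 0.02


@dataclass
class LinearWeightConfig:
    initialization: Optional[InitializationConfig] = None  # None -> parent default
    lr_scale: Optional[float] = None


@dataclass
class LinearConfig:
    weight_initialization: Optional[InitializationConfig] = None
    bias_initialization: Optional[InitializationConfig] = None
    bias: Optional[bool] = None  # None -> the parent decides
    lr_scale: Optional[float] = None
    apply_peft: bool = False


@dataclass
class AttentionConfig:
    # Logically distinct (possibly concatenated) sub-layers each get their own config.
    query: LinearConfig = field(default_factory=LinearConfig)
    key: LinearConfig = field(default_factory=LinearConfig)
    value: LinearConfig = field(default_factory=LinearConfig)
    dense: LinearConfig = field(default_factory=LinearConfig)

    def _validate(self):
        # Customizable defaults: the parent fills in any field left unset.
        for linear in (self.query, self.key, self.value, self.dense):
            if linear.bias is None:
                linear.bias = True
            if linear.weight_initialization is None:
                linear.weight_initialization = InitializationConfig("normal", 0.02)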

Open questions:

  • Should we make linear layers dynamic?
  • Should we have a separate config class for layers that shouldn't have a bias option? (ex. MoE router)
  • Is there a better way to achieve customizable defaults?
  • Do we want to allow for convenience parameters to limit repetition? Ex. keep add_linear_biases or init_method_std to set all linear layers at once? This would make things less verbose, but could be harder to understand and less self-contained (ex. a linear layer's configuration may appear to depend not only on LinearConfig, but also on arbitrary parameters in the parent config). Possible compromises:
    • Have "init-only" shortcut fields that are converted explicitly in _from_dict, so that the actual configs end up with explicit LinearConfig fields, i.e. all traces of the shortcut are gone when printing or saving the config. (Ex. add_linear_biases replaced with an explicit bias = False in all LinearConfigs; see the sketch after this list.)
    • Rely on Hydra.
  • Concatenation of logically distinct layers should be manageable, but some details are still TBD. We already have examples for lr_scale (MoE) and apply_peft (key and value) that we can rely on; we can handle initialization by working with global tensors, and we shouldn't need a non-constant bias.
  • For MoE, configuring each expert separately could be tedious, so maybe we want to make it optional.
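
For the "init-only shortcut" compromise above, a minimal illustration of what _from_dict could do (the layer names and the add_linear_biases handling are hypothetical):

def expand_shortcuts(raw: dict) -> dict:
    # Expand the add_linear_biases shortcut into explicit per-layer `bias`
    # fields and drop it, so the resulting config contains only explicit
    # LinearConfig fields and no trace of the shortcut when printed or saved.
    config = dict(raw)
    add_biases = config.pop("add_linear_biases", None)
    if add_biases is not None:
        for name in ("query", "key", "value", "dense"):
            layer = dict(config.get(name, {}))
            layer.setdefault("bias", add_biases)
            config[name] = layer
    return config


# {"add_linear_biases": False} -> {"query": {"bias": False}, "key": {"bias": False}, ...}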

Other layers and weights

Proposal:

  • Normalization weights and biases will be managed through NormalizationConfig, with parameters similar to LinearConfig's, but dynamic and without a custom default (a fixed one works fine, and a custom default doesn't work well with a dynamic class).
  • Other parameters (embeddings, lm output, isolated parameters) will be configured through a generic WeightConfig (see the sketch after this list).
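
Sketching these alongside the LinearConfig draft above (the fill_* initialization types are assumptions):

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WeightConfig:
    # Generic config for isolated parameters (embeddings, lm output, ...).
    initialization: Optional[InitializationConfig] = None
    lr_scale: Optional[float] = None


@dataclass
class NormalizationConfig:
    # Dynamic (type-selected), but with fixed initialization defaults,
    # unlike LinearConfig whose defaults come from the parent config.
    type: str = "layer_norm"
    weight_initialization: InitializationConfig = field(
        default_factory=lambda: InitializationConfig(type="fill_ones")
    )
    bias_initialization: InitializationConfig = field(
        default_factory=lambda: InitializationConfig(type="fill_zeros")
    )
    lr_scale: Optional[float] = None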

Open questions:

  • We could make WeightConfig the only way to configure parameters, using composition for linear, normalization, etc. This would bring consistency at the cost of convenience (more verbose configs).
  • Should we make more custom layer configurations, ex. embeddings, lm output, conv1d, etc.? Embedding and LM output are technically linear(-like) layers, so they could use a linear config instead (without bias). As for the layer itself, making a dedicated construct could be difficult for technical and legacy reasons (turning it into a submodule would change weight names, so it wouldn't be backward compatible).

Initialization

Proposal:

  • All parameters will be associated with a fully dynamic InitializationConfig allowing for any conceivable initialization.
  • For linear and custom parameters, the default will be set in the parent layer, while normalization will use fixed defaults (fill with ones/zeros); see the sketch after this list.
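
A sketch of how the fully dynamic InitializationConfig could be organized (the concrete subtypes and their fields are placeholders):

from dataclasses import dataclass

import torch


@dataclass
class InitializationConfig:
    # Base of the dynamic initialization interface; `type` selects the subclass.
    type: str = "normal"

    def apply(self, tensor: torch.Tensor) -> None:
        raise NotImplementedError


@dataclass
class NormalInitializationConfig(InitializationConfig):
    type: str = "normal"
    mean: float = 0.0
    std: float = 0.02

    def apply(self, tensor: torch.Tensor) -> None:
        torch.nn.init.normal_(tensor, mean=self.mean, std=self.std)


@dataclass
class FillInitializationConfig(InitializationConfig):
    # Fixed default for normalization weights and biases (fill with ones/zeros).
    type: str = "fill"
    value: float = 1.0

    def apply(self, tensor: torch.Tensor) -> None:
        torch.nn.init.constant_(tensor, self.value)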

Open questions:

  • How do we generalize default initializations? (Ex. SSMs have many isolated weights that don't fit in the above categories.)

LR scales

Proposal:

  • All parameters will be associated with their own customizable learning rate scale. These may be shared in some cases (ex. a linear/normalization weight and its bias).
  • Layers and blocks may also define customizable lr scales, ex. to allow freezing an entire block. When more than one lr scale applies to a given parameter, the effect is multiplicative (see the sketch below).
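
A small sketch of the multiplicative combination (the helper is hypothetical; None means "not set"):

from typing import Optional


def resolve_lr_scale(*scales: Optional[float]) -> float:
    # Combine every lr scale that applies to a parameter (block, layer and
    # parameter level); unset scales are skipped, the rest multiply together.
    result = 1.0
    for scale in scales:
        if scale is not None:
            result *= scale
    return result


# Freezing an entire block: a block-level lr_scale of 0 zeroes everything inside it.
assert resolve_lr_scale(0.0, 0.5, None) == 0.0
assert resolve_lr_scale(None, 0.5, 2.0) == 1.0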

Peft

Proposal:

  • Instead of specializing the Peft config to each model, we will use a single, model-agnostic Peft config and configure the Peft behavior in the individual layers (see the sketch below).
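
As an illustration of the split, continuing the LinearConfig sketch above (the PeftConfig fields are placeholders): the single Peft config knows nothing about the model, and each layer's apply_peft flag decides whether it gets wrapped.

from dataclasses import dataclass
from typing import Optional


@dataclass
class PeftConfig:
    # Single, model-agnostic Peft config; `type` would select the method.
    type: str = "lora"
    rank: int = 8
    alpha: float = 16.0


def should_apply_peft(linear_config: "LinearConfig", peft_config: Optional[PeftConfig]) -> bool:
    # The model-agnostic Peft config decides *how*; the layer's own config
    # (LinearConfig.apply_peft) decides *where*.
    return peft_config is not None and linear_config.apply_peft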

Open questions:

  • So far LinearConfig.apply_peft is the only configuration parameter, which is enough for simple lora. Do we need more? Ex. other types of peft, application to LinearWeightConfig or other layers.
  • Where should the Peft config live? Do we want it at the top level of the model config, or deeper in the config (ex. in BlockConfig)?
