Performance experiments over AdamW #28
-
Experiments with learning rates of 1e-5, 2e-5, and 3e-5: performance was worse in each case. Total batch size 192 (~1.6 million tokens per batch), decoupled weight decay of 0.1 for all runs, with a cosine warmup scheduler.
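For context, here is a minimal sketch of roughly how one of these runs can be set up with the `lion_pytorch` package (the model, betas, and step counts below are placeholders, not the exact values used):

```python
import math

import torch
from lion_pytorch import Lion

# Stand-ins for the real PaLM model and schedule lengths.
model = torch.nn.Linear(512, 512)
warmup_steps, total_steps = 1_000, 100_000

# Lion with one of the learning rates listed above and decoupled weight decay 0.1.
optimizer = Lion(model.parameters(), lr=1e-5, betas=(0.9, 0.99), weight_decay=0.1)

def cosine_with_warmup(step):
    # Linear warmup to the base learning rate, then cosine decay toward zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=cosine_with_warmup)
```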
-
Optimizer setup for PaLM:

```python
import torch
from lion_pytorch import Lion

# NOTE: `LayerNorm` below refers to the custom LayerNorm used in the PaLM
# implementation (it exposes a learnable `gamma` parameter) and is assumed to be
# importable from the model code alongside this function.


def decoupled_optimizer(
    model, learning_rate, weight_decay, beta_1, beta_2, use_lion=True,
):
    # Map every named parameter so it can be looked up by name later.
    param_dict = {}
    for param_name, param in model.named_parameters():
        param_dict[param_name] = param

    # Collect the names of parameters that should NOT receive weight decay:
    # the token embedding weight and the LayerNorm gamma parameters.
    no_decay = []
    for module_name, module in model.named_modules():
        for module_type in [LayerNorm, torch.nn.Embedding]:
            if isinstance(module, module_type):
                if module_name == "token_emb":
                    # Embedding layer: exempt its weight from weight decay.
                    no_decay.append(f"{module_name}.weight")
                else:
                    # LayerNorm: exempt its gamma parameter from weight decay.
                    no_decay.append(f"{module_name}.gamma")
                # Stop checking other module types once a match is found.
                break

    # Collect the names of the Linear layer weights, which SHOULD receive weight decay.
    decay = []
    for module_name, module in model.named_modules():
        for module_type in [torch.nn.Linear]:
            if isinstance(module, module_type):
                decay.append(f"{module_name}.weight")
                break

    # Gather the actual parameter tensors for each group. The output projection
    # ('to_logits.weight') is excluded from the weight-decay group.
    decay_param = []
    for param in decay:
        if param != "to_logits.weight":
            decay_param.append(param_dict[param])

    no_decay_param = []
    for param in no_decay:
        no_decay_param.append(param_dict[param])

    # Two parameter groups: one with decoupled weight decay, one without.
    grouped_params = [
        {"params": decay_param, "weight_decay": weight_decay},
        {"params": no_decay_param, "weight_decay": 0.0},
    ]

    if use_lion:
        optimizer = Lion(
            grouped_params,
            lr=learning_rate,
            betas=(beta_1, beta_2),
        )
    else:
        # Fallback added so the function always returns an optimizer; the original
        # snippet only handled the Lion branch.
        optimizer = torch.optim.AdamW(
            grouped_params,
            lr=learning_rate,
            betas=(beta_1, beta_2),
        )

    return optimizer
```
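Example call (a hedged sketch; `palm_model` and the hyperparameter values are illustrative placeholders, not the settings from the runs above):

```python
# Illustrative only: `palm_model` stands in for the actual PaLM model instance.
optimizer = decoupled_optimizer(
    palm_model,
    learning_rate=1e-4,  # example value
    weight_decay=0.1,
    beta_1=0.9,
    beta_2=0.99,
    use_lion=True,
)
```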
-
Experiments with learning rates of 1e-6 and 3e-6: performance was worse in each case. Total batch size 192 (~1.6 million tokens per batch), decoupled weight decay of 0.1 for all runs, with a cosine warmup scheduler.
-
Hi @xiangning-chen, I appreciate your great research. I am testing numerous language models of varying scales based on Phil's PaLM model. I was wondering if you could provide any input on incorporating Lion into natural language experiments. Thank you, Enrico
-
Hi Phil,
I have been testing different Lion hyperparameters with PaLM at the 1B scale (total batch size 192, ~1.6 million tokens per batch), using a decoupled weight decay of 0.1 for all runs and a linear warmup scheduler. So far the best configuration gave about a 0.2 loss improvement over AdamW; memory consumption was ~4% lower, and there was a speed improvement of about 0.14 per iteration, lowering the iteration time from 1.65 to 1.51.
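For reference, a minimal sketch of the linear warmup schedule I mean (placeholder values; the real training script may decay differently after warmup):

```python
import torch

warmup_steps = 1_000                           # placeholder value
params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for the model parameters
optimizer = torch.optim.SGD(params, lr=1e-4)   # any optimizer works to demo the schedule

def linear_warmup(step):
    # Ramp linearly up to the base learning rate over `warmup_steps` steps,
    # then hold it there (assumed behavior after the warmup phase).
    return min(1.0, step / max(1, warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_warmup)
```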
Wandb logs:
https://wandb.ai/a_man_chooses/palm/reports/loss-23-05-09-21-43-11---Vmlldzo0MzE0MTcy
https://wandb.ai/a_man_chooses/palm/reports/loss-23-05-09-21-48-41---Vmlldzo0MzE0MjAz
I am going to test at the 2B scale next and report the results, and I will also try adjusting the learning rate and betas further. I was wondering if you had noticed a significant difference in performance as you increased the size of the model?
Thank you,
Enrico