
Dynamic Early Exit

This repository implements a dynamic early exit strategy that aims to improve the computational efficiency of large language models (LLMs) while maintaining prediction quality. The framework extends the LayerSkip methodology with novel heuristics, including Repeated Tokens, Cosine Similarity, Token Confidence Convergence, Entropy-Based Threshold, and Max Probability, to detect when token predictions have stabilized across layers.
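As a rough illustration of the idea (a minimal sketch, not this repository's exact implementation; the function and argument names below are assumptions), a cosine-similarity criterion compares the last-token hidden state at successive layers and exits once it has stabilized:

import torch
import torch.nn.functional as F

def should_exit_early(prev_hidden, curr_hidden, lm_head, threshold=0.99):
    # prev_hidden / curr_hidden: hidden states of shape (batch, seq_len, hidden_dim)
    # from two consecutive transformer layers; lm_head projects to vocabulary logits.
    sim = F.cosine_similarity(prev_hidden[:, -1], curr_hidden[:, -1], dim=-1)
    if sim.mean().item() < threshold:
        return False, None  # prediction not yet stable; keep running layers
    # Prediction considered stable: decode the next token from the current layer.
    logits = lm_head(curr_hidden[:, -1])
    return True, torch.argmax(logits, dim=-1)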

This repository is built on top of the repository accompanying LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding.

Files added and/or changed from original repository:

  • self_speculation/early_exit_utils.py
  • self_speculation/llama_model_utils.py
  • self_speculation/generator_base.py
  • self_speculation/autoregressive_generator.py

Authors:

Getting Started

  • Clone repo:
$ git clone https://github.com/Amir-Voloshin/DynamicEarlyExit.git
$ cd DynamicEarlyExit
  • Setup environment:
$ conda create --name dynamic_early_exit python=3.10
$ conda activate dynamic_early_exit

$ pip install -r requirements.txt

To access each model:

  1. Visit the model's corresponding link above and make sure you are logged in to your Hugging Face account.
  2. Fill in and submit the request form. Approval may take a while; you will receive an email notification once access to the model is granted.
  3. Follow the steps here to obtain a user access token.
  4. In the command line, run huggingface-cli login and provide the token obtained in Step 3 when prompted (alternatively, you can log in programmatically, as in the sketch below).
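If you prefer to authenticate from Python instead of the command line, the huggingface_hub login helper can be used (the token value below is a placeholder):

from huggingface_hub import login

# Paste the user access token obtained in Step 3; the value shown is a placeholder.
login(token="hf_xxxxxxxxxxxxxxxxxxxx")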

Once you have completed these steps, the commands below for running the LayerSkip checkpoints should work.

Generate

To run a model in interactive mode using regular autoregressive decoding:

$ torchrun generate.py --model facebook/layerskip-llama3.2-1B \
    --sample True \
    --max_steps 512

To perform dynamic early exit, specify the --criteria argument. The options are: "cosine_similarity", "token_repeat", "entropy_based", "max_probability", or "convergence".

$ torchrun generate.py --model facebook/layerskip-llama3.2-1B \
    --sample True \
    --max_steps 512 \
    --generation_strategy autoregressive \
    --criteria "cosine_similarity"

Tips:

  • You may change --model to any HuggingFace model
  • By default we enable sampling. You may change the sampling behaviour using the --sample, --temperature, --top_p, and --top_k arguments.
  • You may run python generate.py --help for details on different command-line arguments.

Benchmark

To benchmark on a dataset:

$ torchrun benchmark.py --model facebook/layerskip-llama3.2-1B \
    --dataset cnn_dm_summarization \
    --num_samples 100 \
    --generation_strategy autoregressive \
    --output_dir ./logs

Tips:

  • You can specify different tasks by modifying the --dataset argument:
    • cnn_dm_summarization: CNN/DM Summarization
    • xsum_summarization: XSUM Summarization
    • cnn_dm_lm: CNN/DM Language Modeling (given the first few words of an article, generate the remaining article)
    • human_eval: HumanEval Coding
  • By default, the tasks run as 0-shot. You can change to any specified n-shot by specifying the --n_shot argument.
  • By default we enable sampling, while the results reported in the paper were obtained with greedy decoding (no sampling). You may change the sampling behaviour using the --sample, --temperature, --top_p, and --top_k arguments.
  • You may run python benchmark.py --help for details on different command-line arguments.

Using Docker

See DOCKER.md to set up the project using Docker.
