Merge branch 'main' into dev_add_tp
Luodian authored Mar 11, 2024
2 parents bda1a45 + cbb3874 commit 2a5dd77
Showing 13 changed files with 123 additions and 68 deletions.
118 changes: 90 additions & 28 deletions README.md
@@ -1,54 +1,80 @@
<p align="center" width="100%">
<img src="https://i.postimg.cc/g0QRgMVv/WX20240228-113337-2x.png" width="100%" height="80%">
<img src="https://i.postimg.cc/g0QRgMVv/WX20240228-113337-2x.png" width="100%" height="70%">
</p>

# Large-scale Multi-modality Models Evaluation Suite
# The Evaluation Suite of Large Multimodal Models

> Accelerating the development of large-scale multi-modality models (LMMs) with `lmms-eval`
> Accelerating the development of large multimodal models (LMMs) with `lmms-eval`
🏠 [Homepage](https://lmms-lab.github.io/) | 📚 [Documentation](docs/README.md) | 🤗 [Huggingface Datasets](https://huggingface.co/lmms-lab)
🏠 [Homepage](https://lmms-lab.github.io/) | 🎉 [Blog](https://lmms-lab.github.io/lmms-eval-blog/lmms-eval-0.1/) | 📚 [Documentation](docs/README.md) | 🤗 [Huggingface Datasets](https://huggingface.co/lmms-lab)

In an era where people pursue AGI (Artificial General Intelligence) with a zeal akin to the 1960s moon-landing missions,
evaluating the core of AGI, the large language models (LLMs) and large multimodal models (LMMs) with unprecedented capabilities that can understand, learn, and interact across a broad range of human tasks, has become a pivotal challenge.

To surmount this, a broad spectrum of evaluation datasets has been proposed and used to assess model capabilities across various dimensions, creating a comprehensive capability chart that reveals the true performance of models. However, evaluating models has become quite hard, since there are countless evaluation benchmarks and datasets organized in various ways, scattered across the internet, sleeping in somebody's Google Drive, Dropbox, or a website hosted by a school or research lab.

In the field of language models, a valuable precedent has been set by the work of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). It offers integrated data and model interfaces, enabling rapid evaluation of language models, serves as the backend framework for the [open-llm-leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), and has gradually become the underlying ecosystem of the foundation-model era.

However, although many new evaluation datasets have been proposed recently, efficient evaluation pipelines for LMMs are still in their infancy, and there is no unified evaluation framework that can be used to evaluate LMMs across a wide range of datasets. To address this challenge, we introduce **lmms-eval**, an evaluation framework meticulously crafted for consistent and efficient evaluation of LMMs.

We humbly absorbed the exquisite and efficient design of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). Building upon its foundation, we implemented our `lmms-eval` framework with performance optimizations specifically for LMMs.

## Necessity of lmms-eval

We believe our effort provides an efficient interface for the detailed comparison of publicly available models, to discern their strengths and weaknesses. It is also useful for research institutions and production-oriented companies to accelerate the development of large multimodal models. With `lmms-eval`, we have significantly accelerated the lifecycle of model iteration. Inside the LLaVA team, the use of `lmms-eval` has largely improved the efficiency of the model development cycle, as we are able to evaluate hundreds of weekly-trained checkpoints on 20-30 datasets, identify their strengths and weaknesses, and then make targeted improvements.

# Announcement

## v0.1.0 Released

The first version of `lmms-eval` is released. We are working on providing a one-command evaluation API for accelerating the development of LMMs.
The first version of `lmms-eval` is released. We are working on providing a one-command evaluation suite for accelerating the development of LMMs.

> In [LLaVA Next](https://llava-vl.github.io/blog/2024-01-30-llava-next/) development, we internally utilize this suite to evaluate multiple model versions on various datasets. It significantly accelerates the model development cycle thanks to its easy integration and fast evaluation speed.
> In [LLaVA Next](https://llava-vl.github.io/blog/2024-01-30-llava-next/) development, we internally utilize this API to evaluate the model's performance across various model versions and datasets. It significantly accelerates the model development cycle thanks to its easy integration and fast evaluation speed. The main features include:
The main features include:

<p align="center" width="100%">
<img src="https://i.postimg.cc/sgzNmJx7/teaser.png" width="100%" height="80%">
</p>

### One-command evaluation, with detailed logs and samples.
You can evaluate the models on multiple datasets with a single command. No model or data preparation is needed: just one command line and a few minutes, and you get the results. And not just a single score, but also detailed logs and samples, including the model args, input question, model response, and ground-truth answer.

```python
# Evaluating LLaVA on multiple datasets
accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.5-7b" --tasks mme,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme_mmbenchen --output_path ./logs/ #
```
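
For orientation, here is a minimal, hedged sketch of how you might poke at the logged output afterwards. It assumes the logs land as JSON files under `./logs/`; the exact directory layout, filenames, and fields depend on the run and task, so this is just a generic way to inspect whatever was written:

```python
import glob
import json

# Peek at whatever JSON files the run produced before committing to a schema.
for path in glob.glob("./logs/**/*.json", recursive=True):
    with open(path) as f:
        data = json.load(f)
    print(path, type(data).__name__, str(data)[:200])  # path, top-level type, preview
```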

### Accelerator support and Task grouping.
We support using `accelerate` to wrap the model for distributed evaluation, supporting multi-GPU and tensor parallelism. With **Task Grouping**, all instances from all tasks are grouped and evaluated in parallel, which significantly improves evaluation throughput.
We support using `accelerate` to wrap the model for distributed evaluation, supporting multi-GPU and tensor parallelism. With **Task Grouping**, all instances from all tasks are grouped and evaluated in parallel, which significantly improves evaluation throughput. After evaluation, all instances are sent to the post-processing module for metric calculations and potential GPT-4 evaluation queries.
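
To make the idea concrete, below is a conceptual sketch of task grouping. It is not the actual lmms-eval internals, just an illustration of flattening requests from all tasks, sharding them across ranks, and routing responses back to their tasks for scoring:

```python
# Conceptual sketch of task grouping (not the actual lmms-eval implementation).
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Request:
    task_name: str
    doc_id: int
    prompt: str


def run_grouped(
    tasks: Dict[str, List[Request]],
    generate_batch: Callable[[List[str]], List[str]],
    rank: int,
    world_size: int,
) -> Dict[str, List[Tuple[int, str]]]:
    # 1. Flatten: gather requests from all tasks into one global queue.
    all_requests = [req for reqs in tasks.values() for req in reqs]
    # 2. Shard: each process takes an interleaved slice of the queue.
    my_requests = all_requests[rank::world_size]
    # 3. Generate: one batched inference pass over the local shard.
    responses = generate_batch([r.prompt for r in my_requests])
    # 4. Scatter back: responses are regrouped by task for post-processing/metrics.
    results: Dict[str, List[Tuple[int, str]]] = {name: [] for name in tasks}
    for req, resp in zip(my_requests, responses):
        results[req.task_name].append((req.doc_id, resp))
    return results  # in practice, results are gathered across ranks before scoring
```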

Below are the total runtimes on different datasets using 4 x A100 40G GPUs.
|Dataset (#num)|LLaVA-v1.5-7b|LLaVA-v1.5-13b|
|-------|-------------|--------------|
|mme (2374) | 2 mins 43 seconds | 3 mins 27 seconds |
|gqa (12578) | 10 mins 43 seconds | 14 mins 23 seconds |
|scienceqa_img (2017) | 1 min 58 seconds | 2 mins 52 seconds |
|ai2d (3088) | 3 mins 17 seconds | 4 mins 12 seconds |
|coco2017_cap_val (5000) | 14 mins 13 seconds | 19 mins 58 seconds |

### Prepared HF datasets.

| Dataset (#num) | LLaVA-v1.5-7b | LLaVA-v1.5-13b |
| :---------------------- | :----------------- | :----------------- |
| mme (2374) | 2 mins 43 seconds | 3 mins 27 seconds |
| gqa (12578) | 10 mins 43 seconds | 14 mins 23 seconds |
| scienceqa_img (2017) | 1 min 58 seconds | 2 mins 52 seconds |
| ai2d (3088) | 3 mins 17 seconds | 4 mins 12 seconds |
| coco2017_cap_val (5000) | 14 mins 13 seconds | 19 mins 58 seconds |

### All-In-One HF dataset hubs.

We are hosting more than 40 (and counting) datasets on [huggingface/lmms-lab](https://huggingface.co/lmms-lab). We carefully converted these datasets from their original sources and included all variants, versions, and splits, so they can now be accessed directly without any data-preprocessing burden. They also make it easy to visualize the data and get a sense of the distribution of evaluation tasks.
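
For example, once converted, a dataset can be pulled straight from the hub with the standard Hugging Face `datasets` API. The repository name and split below are illustrative; browse [huggingface/lmms-lab](https://huggingface.co/lmms-lab) for the datasets that actually exist and the splits they expose:

```python
from datasets import load_dataset

# Repository name and split are illustrative placeholders.
dataset = load_dataset("lmms-lab/MME", split="test")
print(dataset)             # inspect the features exposed by this dataset
print(dataset[0].keys())   # peek at one example
```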

<p align="center" width="100%">
<img src="https://i.postimg.cc/8PXFW9sk/WX20240228-123110-2x.png" width="100%" height="80%">
<img src="https://i.postimg.cc/8PXFW9sk/WX20240228-123110_2x.png" width="100%" height="80%">
</p>

### Detailed YAML task configuration
Including prompt pre-processing, output post-processing, answer extraction, model-specific args, and more.
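
To give a sense of what such a task configuration covers, here is an illustrative rendering of the kinds of fields involved, written as a Python dict for brevity. The field names and values are indicative only, not the authoritative YAML schema, which lives in the [documentation](docs/README.md):

```python
# Illustrative only; see docs/README.md for the real task YAML schema.
task_config = {
    "task": "my_task",                          # hypothetical task name
    "dataset_path": "lmms-lab/SomeDataset",     # hypothetical HF dataset backing the task
    "doc_to_visual": "utils.doc_to_visual",     # image extraction from each doc
    "doc_to_text": "utils.doc_to_text",         # prompt pre-processing / templating
    "doc_to_target": "utils.doc_to_target",     # ground-truth answer extraction
    "generation_kwargs": {"max_new_tokens": 16, "temperature": 0},  # model-specific args
    "metric_list": [{"metric": "exact_match"}],  # output post-processing & scoring
}
```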

### Reproducible results (for LLaVA series models) and Logging Utilities.
We provide a set of pre-defined configurations & environments for llava-1.5, which can be directly used to reproduce the results in the paper.
### Detailed Logging Utilities

You can refer to the [repr_scripts.sh](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/dev/readme/miscs/repr_scripts.sh) we provide to see how to build and set up the environments to reproduce the results from the paper. However, this environment is not recommended when you evaluate your own model or other models, since it only installs the packages necessary to run LLaVA and uses an older PyTorch version that may result in lower speed.
We provide detailed logging utilities to help you understand the evaluation process and results. The logs include the model args, generation parameters, input question, model response, and ground-truth answer. You can also record every detail and visualize it in Weights & Biases runs.

With `lmms-eval`, all evaluation details are recorded, including logged samples and results, and report tables are generated both in the terminal output and in Weights & Biases runs/tables.
<p align="center" width="100%">
<img src="assets/img/wandb_table.png" width="100%" height="80%">
</p>
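
As a generic illustration (not lmms-eval's built-in Weights & Biases integration), per-sample logs of this shape can be pushed to a W&B table for interactive inspection; the project name, column names, and the toy row below are placeholders:

```python
import wandb

run = wandb.init(project="lmms-eval-demo")  # placeholder project name
table = wandb.Table(columns=["task", "doc_id", "question", "response", "ground_truth"])
table.add_data("mme", 0, "Is there a dog in the image?", "Yes", "Yes")  # toy row
run.log({"samples": table})
run.finish()
```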

> Development will continue on the main branch, and we encourage you to give us feedback on desired features and further improvements to the library, or to ask questions, either in issues or PRs on GitHub.
<p align="center" width="100%">
<img src="https://i.postimg.cc/W1c1vBDJ/Wechat-IMG1993.png" width="100%" height="80%">
</p>

# Installation

@@ -71,7 +97,7 @@ cd LLaVA
pip install -e .
```

You can check the [environment install script](miscs/repr_scripts.sh) and [torch environment info](miscs/repr_torch_envs.txt) to reproduce LLaVA-1.5's paper results. We found that differences in torch/CUDA versions can cause small variations in the results, so we provide a [results check](miscs/llava_result_check.md) across different environments.
You can check the [environment install script](miscs/repr_scripts.sh) and [torch environment info](miscs/repr_torch_envs.txt) to **reproduce LLaVA-1.5's paper results**. We found that differences in torch/CUDA versions can cause small variations in the results, so we provide a [results check](miscs/llava_result_check.md) across different environments.

If you want to test on caption datasets such as `coco`, `refcoco`, and `nocaps`, you will need `java==1.8.0` for the pycocoeval API to work. If you don't have it, you can install it with conda
```
@@ -87,9 +113,29 @@ accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pr
# Evaluating LLaVA on multiple datasets
accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.5-7b" --tasks mme,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme_mmbenchen --output_path ./logs/ #

# For other LLaVA variants. Note that `conv_template` is an arg of the init function of llava in `lmms_eval/models/llava.py`
accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.6-mistral-7b,conv_template=mistral_instruct" --tasks mme,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme_mmbenchen --output_path ./logs/ #
accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.6-34b,conv_template=mistral_direct" --tasks mme,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme_mmbenchen --output_path ./logs/ #

# From a predefined configuration, supporting evaluation of multiple models and datasets
accelerate launch --num_processes=8 -m lmms_eval --config example_eval.yaml
```

# Model Results

As demonstrated by the extensive table below, we aim to provide detailed information so that readers can understand the datasets included in lmms-eval and their specifics (we remain grateful for any corrections readers may offer during our evaluation process).

We provide a Google Sheet for the detailed results of the LLaVA series models on different datasets. You can access the sheet [here](https://docs.google.com/spreadsheets/d/1a5ImfdKATDI8T7Cwh6eH-bEsnQFzanFraFUgcS9KHWc/edit?usp=sharing). It's a live sheet, and we are updating it with new results.

<p align="center" width="100%">
<img src="https://i.postimg.cc/jdw497NS/WX20240307-162526-2x.png" width="100%" height="80%">
</p>

We also provide the raw data exported from Weights & Biases for the detailed results of the LLaVA series models on different datasets. You can access the raw data [here](https://docs.google.com/spreadsheets/d/1AvaEmuG4csSmXaHjgu4ei1KBMmNNW8wflOD_kkTDdv8/edit?usp=sharing).

> Development will continue on the main branch, and we encourage you to give us feedback on desired features and further improvements to the library, or to ask questions, either in issues or PRs on GitHub.

## Supported models

- GPT4V (API, only generation-based evaluation)
@@ -124,7 +170,7 @@ accelerate launch --num_processes=8 -m lmms_eval --config example_eval.yaml
- Infographic VQA (info_vqa)
- Infographic VQA Validation (info_vqa_val)
- Infographic VQA Test (info_vqa_test)
- LLaVA-Bench (llava_bench_wild)
- LLaVA-Bench (llava_in_the_wild)
- LLaVA-Bench-COCO (llava_bench_coco)
- MathVista (mathvista)
- MathVista Validation (mathvista_testmini)
@@ -210,7 +256,23 @@ Please refer to our [documentation](docs/README.md).
lmms_eval is a fork of [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness). We recommend reading through the [docs of lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs) for relevant information.

Below are the changes we made to the original API:

- Context building now only passes in the doc index (`idx`); the image and doc are processed during the model's response phase. This is because the datasets now contain many images, and we cannot store them in the doc the way the original lm-eval-harness does; otherwise, CPU memory would explode.
- Instance.args (lmms_eval/api/instance.py) now contains a list of images to be input to the LMM (see the sketch after this list).
- lm-eval-harness supports all HF language models through a single model class. This is currently not possible for LMMs, because the input/output formats of LMMs on HF are not yet unified. Therefore, we have to create a new class for each LMM. This is not ideal, and we will try to unify them in the future.
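
As a simplified, hypothetical sketch of the second point above (field names are illustrative and do not mirror `lmms_eval/api/instance.py` exactly), the idea is that each request carries its visuals alongside the text arguments rather than inside the doc:

```python
# Simplified, hypothetical sketch; not the actual lmms_eval/api/instance.py definition.
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class Instance:
    request_type: str              # e.g. "generate_until" or "loglikelihood"
    args: Tuple[Any, ...] = ()     # the prompt plus a list of images for the LMM
    doc_id: int = 0


def make_request(prompt: str, images: List[Any], doc_id: int) -> Instance:
    # Visuals are resolved from the doc at response time and passed alongside the
    # text, instead of being stored inside the doc itself (which would blow up RAM).
    return Instance(request_type="generate_until", args=(prompt, images), doc_id=doc_id)
```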

We also thank:
- [Xiang Yue](https://xiangyue9607.github.io/), [Jingkang Yang](https://jingkang50.github.io/), [Dong Guo](https://www.linkedin.com/in/dongguoset/) and [Sheng Shen](https://sincerass.github.io/) for early discussion and testing.

## Citations

```bibtex
@misc{lmms_eval2024,
title={LMMs-Eval: Accelerating the Development of Large Multimodal Models},
url={https://github.com/EvolvingLMMs-Lab/lmms-eval},
author={Bo Li*, Peiyuan Zhang*, Kaicheng Zhang*, Fanyi Pu*, Xinrun Du, Yuhao Dong, Haotian Liu, Yuanhan Zhang, Ge Zhang, Chunyuan Li and Ziwei Liu},
publisher = {Zenodo},
version = {v0.1.0},
month={March},
year={2024}
}
```
1 change: 0 additions & 1 deletion lmms_eval/api/metrics.py
@@ -166,7 +166,6 @@ def perplexity_fn(items): # This is a passthrough function
return items



def levenshtein_distance(s1, s2):
if len(s1) > len(s2):
s1, s2 = s2, s1
1 change: 0 additions & 1 deletion lmms_eval/api/model.py
@@ -54,7 +54,6 @@ def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
"""
pass


# TODO: Add an optional max length
@abc.abstractmethod
def generate_until(self, requests) -> List[str]:
4 changes: 3 additions & 1 deletion lmms_eval/api/samplers.py
@@ -37,7 +37,9 @@ def get_context(self, doc, num_fewshot):
+ (
str(self.doc_to_target(doc)[0])
if type(self.doc_to_target(doc)) is list
else self.doc_to_target(doc) if (self.config.doc_to_choice is None or type(self.doc_to_target(doc)) is str) else str(self.doc_to_choice(doc)[self.doc_to_target(doc)])
else self.doc_to_target(doc)
if (self.config.doc_to_choice is None or type(self.doc_to_target(doc)) is str)
else str(self.doc_to_choice(doc)[self.doc_to_target(doc)])
)
for doc in selected_docs
]
2 changes: 1 addition & 1 deletion lmms_eval/api/task.py
@@ -687,7 +687,7 @@ def download(self, dataset_kwargs=None) -> None:
download_mode=datasets.DownloadMode.REUSE_DATASET_IF_EXISTS,
**dataset_kwargs if dataset_kwargs is not None else {},
)
self.dataset_no_image = datasets.load_dataset(
self.dataset_no_image = datasets.load_dataset(
path=self.DATASET_PATH,
name=self.DATASET_NAME,
download_mode=datasets.DownloadMode.REUSE_DATASET_IF_EXISTS,
32 changes: 19 additions & 13 deletions lmms_eval/models/__init__.py
@@ -1,15 +1,21 @@
import os

try:
# enabling faster model download
from .llava import Llava
from .qwen_vl import Qwen_VL
from .fuyu import Fuyu
from .gpt4v import GPT4V
from .instructblip import InstructBLIP
from .minicpm_v import MiniCPM_V
import hf_transfer

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
except ImportError:
pass
AVAILABLE_MODELS = {
"llava": "Llava",
"qwen_vl": "Qwen_VL",
"fuyu": "Fuyu",
"gpt4v": "GPT4V",
"instructblip": "InstructBLIP",
"minicpm_v": "MiniCPM_V",
}

for model_name, model_class in AVAILABLE_MODELS.items():
try:
exec(f"from .{model_name} import {model_class}")
except ImportError:
pass


import hf_transfer

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
2 changes: 0 additions & 2 deletions lmms_eval/models/fuyu.py
@@ -253,8 +253,6 @@ def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
pbar.close()
return res



def tok_encode(self, string: str, left_truncate_len=None, add_special_tokens=None) -> List[int]:
""" """
add_special_tokens = False if add_special_tokens is None else add_special_tokens
2 changes: 0 additions & 2 deletions lmms_eval/models/gpt4v.py
@@ -127,5 +127,3 @@ def generate_until(self, requests) -> List[str]:
def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
# TODO
assert False, "GPT4V not support"


