Merge pull request #108 from EvolvingLMMs-Lab/internal_main_dev
[Upgrade to v0.2] Embracing Video Evaluations with LMMs-Eval
Luodian authored Jun 12, 2024
2 parents d99a24a + 05dc8e8 commit fea3806
Showing 368 changed files with 14,960 additions and 1,343 deletions.
Empty file modified .github/issue_template.md
100644 → 100755
Empty file.
Empty file modified .github/pull_request_template.md
100644 → 100755
Empty file.
Empty file modified .github/workflows/black.yml
100644 → 100755
Empty file.
8 changes: 8 additions & 0 deletions .gitignore
100644 → 100755
@@ -29,3 +29,11 @@ ckpt
pretrained/
LLaVA/
*logs
temp/
InternVL/
logs/
data/
llava-video/
Video-MME/
VATEX/
lmms_eval/tasks/vatex/__pycache__/utils.cpython-310.pyc
Empty file modified .pre-commit-config.yaml
100644 → 100755
Empty file.
400 changes: 167 additions & 233 deletions README.md
100644 → 100755

Large diffs are not rendered by default.

Empty file modified docs/README.md
100644 → 100755
Empty file.
Empty file modified docs/commands.md
100644 → 100755
Empty file.
122 changes: 122 additions & 0 deletions docs/current_tasks.md
@@ -0,0 +1,122 @@
# Current Tasks

> The name in parentheses is the task name used by lmms_eval; it is also the name used to specify the dataset in a configuration file.
> This list is maintained manually. You can run `lmms_eval task --list` to print all supported tasks and their task names; a usage sketch follows the list below.
- AI2D (ai2d)
- ChartQA (chartqa)
- CMMMU (cmmmu)
- CMMMU Validation (cmmmu_val)
- CMMMU Test (cmmmu_test)
- COCO Caption (coco_cap)
- COCO 2014 Caption (coco2014_cap)
- COCO 2014 Caption Validation (coco2014_cap_val)
- COCO 2014 Caption Test (coco2014_cap_test)
- COCO 2017 Caption (coco2017_cap)
- COCO 2017 Caption MiniVal (coco2017_cap_val)
- COCO 2017 Caption MiniTest (coco2017_cap_test)
- [ConBench](https://github.com/foundation-multimodal-models/ConBench) (conbench)
- DOCVQA (docvqa)
- DOCVQA Validation (docvqa_val)
- DOCVQA Test (docvqa_test)
- Ferret (ferret)
- Ferret Test (ferret_test)
- Flickr30K (flickr30k)
- GQA (gqa)
- HallusionBenchmark (hallusion_bench_image)
- Infographic VQA (info_vqa)
- Infographic VQA Validation (info_vqa_val)
- Infographic VQA Test (info_vqa_test)
- LLaVA-Bench (llava_in_the_wild)
- LLaVA-Bench-COCO (llava_bench_coco)
- MathVerse (mathverse)
- MathVerse Text Dominant (mathverse_testmini_text_dominant)
- MathVerse Text Only (mathverse_testmini_text_only)
- MathVerse Text Lite (mathverse_testmini_text_lite)
- MathVerse Vision Dominant (mathverse_testmini_vision_dominant)
- MathVerse Vision Intensive (mathverse_testmini_vision_intensive)
- MathVerse Vision Only (mathverse_testmini_vision_only)
- MathVista (mathvista)
- MathVista Validation (mathvista_testmini)
- MathVista Test (mathvista_test)
- MMBench (mmbench)
- MMBench English (mmbench_en)
- MMBench English Dev (mmbench_en_dev)
- MMBench English Test (mmbench_en_test)
- MMBench Chinese (mmbench_cn)
- MMBench Chinese Dev (mmbench_cn_dev)
- MMBench Chinese Test (mmbench_cn_test)
- MME (mme)
- MMMU (mmmu)
- MMMU Validation (mmmu_val)
- MMMU Test (mmmu_test)
- MMUPD (mmupd)
- MMUPD Base (mmupd_base)
- MMAAD Base (mmaad_base)
- MMIASD Base (mmiasd_base)
- MMIVQD Base (mmivqd_base)
- MMUPD Option (mmupd_option)
- MMAAD Option (mmaad_option)
- MMIASD Option (mmiasd_option)
- MMIVQD Option (mmivqd_option)
- MMUPD Instruction (mmupd_instruction)
- MMAAD Instruction (mmaad_instruction)
- MMIASD Instruction (mmiasd_instruction)
- MMIVQD Instruction (mmivqd_instruction)
- MMVet (mmvet)
- Multi-DocVQA (multidocvqa)
- Multi-DocVQA Validation (multidocvqa_val)
- Multi-DocVQA Test (multidocvqa_test)
- NoCaps (nocaps)
- NoCaps Validation (nocaps_val)
- NoCaps Test (nocaps_test)
- OKVQA (ok_vqa)
- OKVQA Validation 2014 (ok_vqa_val2014)
- POPE (pope)
- RefCOCO (refcoco)
- refcoco_seg_test
- refcoco_seg_val
- refcoco_seg_testA
- refcoco_seg_testB
- refcoco_bbox_test
- refcoco_bbox_val
- refcoco_bbox_testA
- refcoco_bbox_testB
- RefCOCO+ (refcoco+)
- refcoco+_seg
- refcoco+_seg_val
- refcoco+_seg_testA
- refcoco+_seg_testB
- refcoco+_bbox
- refcoco+_bbox_val
- refcoco+_bbox_testA
- refcoco+_bbox_testB
- RefCOCOg (refcocog)
- refcocog_seg_test
- refcocog_seg_val
- refcocog_bbox_test
- refcocog_bbox_val
- ScienceQA (scienceqa_full)
- ScienceQA Full (scienceqa)
- ScienceQA IMG (scienceqa_img)
- ScreenSpot (screenspot)
- ScreenSpot REC / Grounding (screenspot_rec)
- ScreenSpot REG / Instruction Generation (screenspot_reg)
- SeedBench (seedbench)
- SeedBench 2 (seedbench_2)
- ST-VQA (stvqa)
- TextCaps (textcaps)
- TextCaps Validation (textcaps_val)
- TextCaps Test (textcaps_test)
- TextVQA (textvqa)
- TextVQA Validation (textvqa_val)
- TextVQA Test (textvqa_test)
- VizWizVQA (vizwiz_vqa)
- VizWizVQA Validation (vizwiz_vqa_val)
- VizWizVQA Test (vizwiz_vqa_test)
- VQAv2 (vqav2)
- VQAv2 Validation (vqav2_val)
- VQAv2 Test (vqav2_test)
- WebSRC (websrc)
- WebSRC Validation (websrc_val)
- WebSRC Test (websrc_test)
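
A task name from the list above is what gets passed to the evaluator on the command line. As a hedged illustration only: the `--model`, `--tasks`, and `pretrained=...` values below follow lm-eval-style conventions and are assumptions, while `--model_args`, `--output_path`, and `--log_samples` are flags handled in `lmms_eval/__main__.py` later in this diff.

```python
# Hedged usage sketch, not taken verbatim from this commit: launch an evaluation
# over two task names from the list above via the package's CLI entry point.
import subprocess

subprocess.run(
    [
        "python", "-m", "lmms_eval",
        "--model", "llava",                                     # assumed model key
        "--model_args", "pretrained=liuhaotian/llava-v1.5-7b",  # assumed checkpoint
        "--tasks", "mme,mmmu_val",                              # task names from the list above
        "--output_path", "./logs/",
        "--log_samples",
    ],
    check=True,
)
```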
Empty file modified docs/model_guide.md
100644 → 100755
Empty file.
2 changes: 1 addition & 1 deletion docs/task_guide.md
100644 → 100755
@@ -27,7 +27,7 @@ doc_to_target: "answer"
generation_kwargs:
max_new_tokens: 16
temperature: 0
top_p: 0
top_p: 1.0
num_beams: 1
do_sample: false
# The return value of process_results will be used by metrics
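
The only functional change in this hunk is `top_p: 0` becoming `top_p: 1.0`. With `do_sample: false` and `num_beams: 1` decoding is greedy, so the sampling knobs are effectively inert, and `1.0` is the neutral no-truncation value. A minimal sketch of how the block parses, assuming PyYAML is available (the snippet is copied from the example above; nothing else is implied about the loader):

```python
# Minimal sketch: the generation_kwargs block above parses into a plain dict,
# which is the shape a model backend would eventually receive.
import yaml

snippet = """
generation_kwargs:
  max_new_tokens: 16
  temperature: 0
  top_p: 1.0
  num_beams: 1
  do_sample: false
"""

gen_kwargs = yaml.safe_load(snippet)["generation_kwargs"]
print(gen_kwargs)
# {'max_new_tokens': 16, 'temperature': 0, 'top_p': 1.0, 'num_beams': 1, 'do_sample': False}
```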
15 changes: 0 additions & 15 deletions example_eval.yaml

This file was deleted.

Empty file modified lmms_eval/__init__.py
100644 → 100755
Empty file.
24 changes: 20 additions & 4 deletions lmms_eval/__main__.py
100644 → 100755
@@ -106,9 +106,16 @@ def parse_eval_args() -> argparse.Namespace:
parser.add_argument(
"--log_samples_suffix",
type=str,
default="",
default="model_outputs",
help="Specify a suffix for the log_samples file name.",
)
parser.add_argument(
"--predict_only",
"-x",
action="store_true",
default=False,
help="Use with --log_samples. Only model outputs will be saved and metrics will not be evaluated.",
)
parser.add_argument(
"--show_config",
action="store_true",
@@ -228,6 +235,10 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None:

initialize_tasks(args.verbosity)

if args.predict_only:
args.log_samples = True
if (args.log_samples or args.predict_only) and not args.output_path:
raise ValueError("Specify --output_path if providing --log_samples or --predict_only")
if args.limit:
eval_logger.warning(" --limit SHOULD ONLY BE USED FOR TESTING." "REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.")
if args.include_path is not None:
@@ -274,6 +285,10 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None:
# set datetime before evaluation
datetime_str = utils.get_datetime_str(timezone=args.timezone)
if args.output_path:
if args.log_samples_suffix and len(args.log_samples_suffix) > 15:
eval_logger.warning("The suffix for log_samples is too long. It is recommended to keep it under 15 characters.")
args.log_samples_suffix = args.log_samples_suffix[:5] + "..." + args.log_samples_suffix[-5:]

hash_input = f"{args.model_args}".encode("utf-8")
hash_output = hashlib.sha256(hash_input).hexdigest()[:6]
path = Path(args.output_path)
@@ -296,6 +311,7 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None:
log_samples=args.log_samples,
gen_kwargs=args.gen_kwargs,
cli_args=args,
predict_only=args.predict_only,
)

if results is not None:
@@ -318,9 +334,9 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None:
for task_name, config in results["configs"].items():
filename = args.output_path.joinpath(f"{task_name}.json")
# Structure the data with 'args' and 'logs' keys
data_to_dump = {"args": vars(args), "model_configs": config, "logs": sorted(samples[task_name], key=lambda x: x["doc_id"])} # Convert Namespace to dict
samples_dumped = json.dumps(data_to_dump, indent=4, default=_handle_non_serializable)
filename.open("w").write(samples_dumped)
data_to_dump = {"args": vars(args), "model_configs": config, "logs": sorted(samples[task_name], key=lambda x: x["doc_id"]), "time": datetime_str}
samples_dumped = json.dumps(data_to_dump, indent=4, default=_handle_non_serializable, ensure_ascii=False)
filename.open("w", encoding="utf-8").write(samples_dumped)
eval_logger.info(f"Saved samples to {filename}")

return results, samples
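
After this change, each per-task samples file is written as UTF-8 without ASCII escaping and carries the keys "args", "model_configs", "logs", and "time". A minimal sketch of reading one back (the path is a placeholder; the key names mirror `data_to_dump` above):

```python
# Hedged sketch for consuming a per-task samples JSON written by __main__.py above.
# "./logs/mme.json" is a placeholder path; the keys mirror data_to_dump.
import json
from pathlib import Path

samples_file = Path("./logs/mme.json")      # placeholder output location
data = json.loads(samples_file.read_text(encoding="utf-8"))

print(data["time"])                         # datetime string recorded before evaluation
print(data["args"]["model_args"])           # CLI namespace serialized via vars(args)
for log in data["logs"][:3]:                # per-document records, sorted by doc_id
    print(log["doc_id"])
```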
Empty file modified lmms_eval/api/__init__.py
100644 → 100755
Empty file.
Empty file modified lmms_eval/api/filter.py
100644 → 100755
Empty file.
Empty file modified lmms_eval/api/instance.py
100644 → 100755
Empty file.
15 changes: 15 additions & 0 deletions lmms_eval/api/metrics.py
100644 → 100755
@@ -16,6 +16,11 @@


# Register Aggregations First
@register_aggregation("bypass")
def bypass_agg(arr):
return 999


@register_aggregation("mean")
def mean(arr):
return sum(arr) / len(arr)
@@ -226,6 +231,16 @@ def mean_stderr(arr):
return sample_stddev(arr) / math.sqrt(len(arr))


@register_metric(
metric="bypass",
higher_is_better=True,
output_type=["loglikelihood", "multiple_choice", "generate_until"],
aggregation="bypass",
)
def bypass(items):
return items


@register_metric(
metric="mcc",
higher_is_better=True,
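
The new "bypass" pair is the metric-side counterpart of `--predict_only`: the metric hands items through untouched and the aggregation reports a constant sentinel rather than a real score. A standalone illustration of that behaviour (plain functions copied out of the registry decorators above; the example inputs are made up):

```python
# Standalone illustration of the bypass metric/aggregation added above, outside the
# register_* decorators, showing what a "score" looks like when metrics are skipped.
def bypass(items):
    # per-sample "metric": return predictions unchanged, nothing is judged
    return items

def bypass_agg(arr):
    # aggregation: a constant sentinel, signalling that no metric was computed
    return 999

per_sample = [bypass(pred) for pred in ["a cat", "two dogs"]]
print(per_sample)               # ['a cat', 'two dogs']
print(bypass_agg(per_sample))   # 999
```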
Empty file modified lmms_eval/api/model.py
100644 → 100755
Empty file.
18 changes: 18 additions & 0 deletions lmms_eval/api/registry.py
100644 → 100755
@@ -1,6 +1,8 @@
from lmms_eval.api.model import lmms

from typing import Callable, Dict
import logging
import evaluate as hf_evaluate

eval_logger = logging.getLogger("lmms-eval")

@@ -104,6 +106,22 @@ def decorate(fn):
return decorate


def get_metric(name: str, hf_evaluate_metric=False) -> Callable:
if not hf_evaluate_metric:
if name in METRIC_REGISTRY:
return METRIC_REGISTRY[name]
else:
eval_logger.warning(f"Could not find registered metric '{name}' in lm-eval, searching in HF Evaluate library...")

try:
metric_object = hf_evaluate.load(name)
return metric_object.compute
except Exception:
eval_logger.error(
f"{name} not found in the evaluate library! Please check https://huggingface.co/evaluate-metric",
)


def register_aggregation(name):
def decorate(fn):
assert name not in AGGREGATION_REGISTRY, f"aggregation named '{name}' conflicts with existing registered aggregation!"
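
`get_metric` consults the local `METRIC_REGISTRY` first and only then falls back to the Hugging Face `evaluate` hub. A hedged usage sketch: it assumes the `evaluate` package is installed with network access on first use, and the metric name "exact_match" is chosen for illustration rather than taken from this diff.

```python
# Hedged usage sketch for get_metric above; "exact_match" is loaded from the
# Hugging Face evaluate hub because hf_evaluate_metric=True skips the local registry.
from lmms_eval.api.registry import get_metric

compute = get_metric("exact_match", hf_evaluate_metric=True)
result = compute(predictions=["42", "blue"], references=["42", "red"])
print(result)   # e.g. {'exact_match': 0.5}
```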
Empty file modified lmms_eval/api/samplers.py
100644 → 100755
Empty file.
