LLMeBench supports advanced benchmarking use cases. In this tutorial, we provide example commands for such cases, starting from the following general command:
python -m llmebench --filter '*benchmarking_asset*' <benchmark-dir> <results-dir>
As can be seen in the previous command, the framework performs a wildcard search over the benchmarking assets directory to identify the asset(s) to run, as specified by '*benchmarking_asset*'. This is possible because we roughly maintain the following structure and file naming scheme in the benchmarking assets directory:
language_code/task_category/task/Dataset_Model_LearningSetup.py
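For example, an Arabic zero-shot sentiment asset for GPT4 would follow a path like the one below (here "ArSAS" is a hypothetical dataset name used purely for illustration):

ar/sentiment_emotion_others/sentiment/ArSAS_GPT4_ZeroShot.py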
The framework currently uses two-letter language codes. It is possible to run all assets implemented for a single language using the command:
python -m llmebench --filter '*language_code/*' <benchmark-dir> <results-dir>
language_code
: Example values: "ar"(--> Arabic), "en"(--> English), "fr"(--> French), etc.
We currently release assets under eight task categories as listed here. Running assets for one category can be done as follows:
python -m llmebench --filter '*task_category/*' <benchmark-dir> <results-dir>
task_category
: Example values: "MT"(for Machine Translation), "semantics", "sentiment_emotion_others", etc.
Running the above command will run assets from all models, languages, subtasks, and learning setups for the given task_category.
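For example, to run every asset under the sentiment_emotion_others category:

python -m llmebench --filter '*sentiment_emotion_others/*' <benchmark-dir> <results-dir>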
As with task categories, we also maintain consistent task names across languages, learning setups, etc. To run assets for a single task:
python -m llmebench --filter '*task/*' <benchmark-dir> <results-dir>
task
: Example values: "sentiment", "SNS", "NLI", "news_categorization", etc.
It is possible to benchmark a single model using the following command:
python -m llmebench --filter '*model*' <benchmark-dir> <results-dir>
model
: Example values: "GPT35", "GPT4", "BLOOMZ", etc.
The framework currently supports both zero-shot and few-shot learning setups. To run all zero-shot assets:
python -m llmebench --filter '*ZeroShot*' <benchmark-dir> <results-dir>
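Since the filter is a wildcard match over the full asset path, patterns can also be combined. For instance, the following sketch (assuming such assets exist) would run only Arabic GPT4 zero-shot assets:

python -m llmebench --filter '*ar/*GPT4_ZeroShot*' <benchmark-dir> <results-dir>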
To run all few-shot assets:
python -m llmebench --filter '*FewShot*' --n_shots <n> <benchmark-dir> <results-dir>
--n_shots <n>
: For benchmarking few-shot assets, this flag must be provided, with <n> set to a value greater than 0.
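For example, to run all few-shot assets with 3 shots (3 is just an illustrative value):

python -m llmebench --filter '*FewShot*' --n_shots 3 <benchmark-dir> <results-dir>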