LLMeBench supports advanced benchmarking use cases. In this tutorial, we provide example commands for such cases, starting from the following general command:
python -m llmebench --filter '*benchmarking_asset*' <benchmark-dir> <results-dir>
As can be seen in the previous command, the framework performs a wildcard search over the benchmarking assets directory to identify the asset(s) to run, as specified by '*benchmarking_asset*'. This is possible because we roughly maintain the following structure and file naming scheme in the benchmarking assets directory:
language_code/task_category/task/Dataset_Model_LearningSetup.py
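For example, an Arabic zero-shot sentiment asset for GPT4 would follow a path like the one below (here "ArSAS" is a hypothetical dataset name used purely for illustration):

ar/sentiment_emotion_others/sentiment/ArSAS_GPT4_ZeroShot.py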
The framework currently uses two-letter language codes. It is possible to run all assets implemented for a single language using the command:
python -m llmebench --filter '*language_code/*' <benchmark-dir> <results-dir>
language_code
: Example values: "ar"(--> Arabic), "en"(--> English), "fr"(--> French), etc.
We currently release assets under eight task categories as listed here. Running assets for one category can be done as follows:
python -m llmebench --filter '*task_category/*' <benchmark-dir> <results-dir>
task_category
: Example values: "MT"(for Machine Translation), "semantics", "sentiment_emotion_others", etc.
Running the above command will run assets from all models, languages, subtasks, and learning setups for the given task_category.
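For example, to run every asset under the sentiment_emotion_others category:

python -m llmebench --filter '*sentiment_emotion_others/*' <benchmark-dir> <results-dir>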
As with task categories, we also maintain consistent task names across languages, learning setups, etc. To run assets for a single task:
python -m llmebench --filter '*task/*' <benchmark-dir> <results-dir>
task
: Example values: "sentiment", "SNS", "NLI", "news_categorization", etc.
It is possible to benchmark a single model using the following command:
python -m llmebench --filter '*model*' <benchmark-dir> <results-dir>
model
: Example values: "GPT35", "GPT4", "BLOOMZ", etc.
The framework currently supports both zero-shot and few-shot learning setups. To run all zero-shot assets:
python -m llmebench --filter '*ZeroShot*' <benchmark-dir> <results-dir>
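Since the filter is a wildcard match over the full asset path, patterns can also be combined. For instance, the following sketch (assuming such assets exist) would run only Arabic GPT4 zero-shot assets:

python -m llmebench --filter '*ar/*GPT4_ZeroShot*' <benchmark-dir> <results-dir>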
To run all few-shot assets:
python -m llmebench --filter '*FewShot*' --n_shots <n> <benchmark-dir> <results-dir>
--n_shots <n>
: For benchmarking few-shot assets, this flag must be provided, with <n> set to a value greater than 0.
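For example, to run all few-shot assets with 3 shots (3 is just an illustrative value):

python -m llmebench --filter '*FewShot*' --n_shots 3 <benchmark-dir> <results-dir>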