Benchmark / Load-testing Suite by Fireworks.ai

LLM benchmarking

The load test is designed to simulate continuous production load and to minimize the effect of model generation behavior on the measurements:

  • variation in generation parameters
  • a continuous request stream with varying distributions and load levels
  • forced generation of an exact number of output tokens (for most providers; see the sketch after this list)
  • a specified load-test duration
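
A minimal sketch of one way to force an exact output length against an OpenAI-compatible endpoint: cap max_tokens and disable early stopping. The base URL, model name, and the ignore_eos flag (an extension supported by e.g. vLLM) are assumptions for illustration, not a description of this suite's internals.

```python
import requests

BASE_URL = "http://localhost:8000/v1"  # hypothetical OpenAI-compatible endpoint
NUM_TOKENS = 128                       # exact number of output tokens to force

resp = requests.post(
    f"{BASE_URL}/completions",
    json={
        "model": "my-model",           # hypothetical model id
        "prompt": "Hello",
        "max_tokens": NUM_TOKENS,      # upper bound on generation length
        "ignore_eos": True,            # keep generating past EOS (vLLM extension)
    },
    timeout=60,
)
resp.raise_for_status()
# If the server honors both knobs, this should equal NUM_TOKENS.
print(resp.json()["usage"]["completion_tokens"])
```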

Supported providers and API flavors:

  • OpenAI API compatible endpoints (see the sketch after this list):
    • Fireworks.ai public or private deployments
    • vLLM
    • Anyscale Endpoints
    • OpenAI
  • Text Generation Inference (TGI) / HuggingFace Endpoints
  • Together.ai
  • NVIDIA Triton server:
    • Legacy HTTP endpoints (no streaming)
    • LLM-focused endpoints (with or without streaming)
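
Since several of the providers above speak the same OpenAI-compatible protocol, a single client can be pointed at any of them by swapping the base URL. A minimal sketch follows, with placeholder URLs and model ids; consult each provider's docs for the exact values.

```python
from openai import OpenAI

# Placeholder endpoints and model ids; substitute your own deployment details.
ENDPOINTS = {
    "fireworks": ("https://api.fireworks.ai/inference/v1", "accounts/fireworks/models/llama-v3p1-8b-instruct"),
    "vllm": ("http://localhost:8000/v1", "my-model"),
    "openai": ("https://api.openai.com/v1", "gpt-3.5-turbo-instruct"),
}

base_url, model = ENDPOINTS["vllm"]
client = OpenAI(base_url=base_url, api_key="EMPTY")  # key requirements vary by provider

completion = client.completions.create(model=model, prompt="Hello", max_tokens=16)
print(completion.choices[0].text)
```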

Captured metrics:

  • Overall latency
  • Number of generated tokens
  • Sustained request throughput (QPS)
  • Time to first token (TTFT) for streaming
  • Per-token latency for streaming (see the sketch after this list)
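
For streaming runs, TTFT and per-token latency can be derived by timestamping each streamed chunk. A minimal sketch, assuming an OpenAI-compatible streaming endpoint (placeholder URL and model), is below; note that a chunk is not always exactly one token, which is one reason the suite forces exact output token counts.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder

start = time.perf_counter()
chunk_times = []

stream = client.completions.create(
    model="my-model", prompt="Hello", max_tokens=64, stream=True
)
for _chunk in stream:
    chunk_times.append(time.perf_counter())  # timestamp each streamed chunk

if chunk_times:
    ttft = chunk_times[0] - start  # latency until the first streamed chunk
    per_token = (chunk_times[-1] - chunk_times[0]) / max(len(chunk_times) - 1, 1)
    print(f"TTFT={ttft:.3f}s  per-token={per_token * 1000:.1f}ms")
```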

The metrics summary can be exported to CSV, making it easy to script over multiple configurations. The CSV file can then be imported into Google Sheets, Excel, or Jupyter for further analysis.
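
For example, an exported CSV can be pulled straight into pandas for comparison across runs; the file name and column names below are hypothetical, not the suite's actual schema.

```python
import pandas as pd

df = pd.read_csv("results.csv")  # hypothetical export from the load test
# Compare latency and throughput across the scripted configurations.
print(df.groupby("provider")[["latency_s", "tokens_per_s"]].describe())
```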

See the llm_bench folder for detailed usage.