@finbarrtimbers commented Sep 7, 2025

[Three dashboard screenshots, captured 2025-09-08]
  1. Fixed Average Batch Generation Time Bug - Removed duplicate reporting and added validation to prevent extreme
    outlier values from corrupting averages
  2. Static Tokens/sec Display - The tokens-per-second value now stays unchanged until a new batch is processed,
    which gives more stable metrics
  3. Improved Number Formatting - Tokens/s values now use comma separators (3,650) and switch to k/M
    suffixes (36.5k, 3.65M) once they exceed the relevant thresholds
  4. Training Progress & ETA - Added current training step tracking with estimated time remaining calculations based
    on training speed
  5. MFU + MBU Calculations - Implemented Model FLOPs Utilization and Memory Bandwidth Utilization tracking with
    proper averaging
  6. Memory Usage Metrics - Added comprehensive memory tracking including:
    - Total GPU memory usage across all actors
    - Average KV cache size per actor
    - Peak memory usage tracking
    - Integration with vLLM engine memory statistics
  7. Dashboard UI Enhancements - Added new organized sections for:
    - 🎯 Training Progress (step counter + ETA)
    - 🧠 Model Utilization (MFU/MBU metrics)
    - 💾 Memory Usage (GPU memory + KV cache)
    - Enhanced styling and visual organization
  8. vLLM Integration - Extended LLMRayActor with comprehensive metrics collection from vLLM engines including GPU
    memory stats and compute utilization estimates
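
For reviewers, here is a rough sketch of the arithmetic behind items 1, 3, 4, and 5. It is not the code in this PR: the function names, the 10,000 / 1,000,000 suffix thresholds, the `max_value` cutoff, and the 2-FLOPs-per-parameter-per-token approximation for MFU are all assumptions chosen for illustration.

```python
import math
from typing import Optional, Sequence


def robust_mean(samples: Sequence[float], max_value: float) -> Optional[float]:
    """Average batch times, skipping non-finite, non-positive, or extreme values.

    `max_value` is a hypothetical cutoff; the PR does not state its rule.
    """
    valid = [s for s in samples if math.isfinite(s) and 0 < s <= max_value]
    return sum(valid) / len(valid) if valid else None


def format_rate(value: float) -> str:
    """Format tokens/s: commas below 10k, then k and M suffixes (assumed thresholds)."""
    if value >= 1_000_000:
        return f"{value / 1_000_000:.2f}M"
    if value >= 10_000:
        return f"{value / 1_000:.1f}k"
    return f"{value:,.0f}"


def estimate_eta_seconds(
    current_step: int, total_steps: int, elapsed_seconds: float
) -> Optional[float]:
    """Estimate remaining time from the average seconds per completed step."""
    if current_step <= 0 or elapsed_seconds <= 0:
        return None  # no speed estimate is available yet
    remaining_steps = max(total_steps - current_step, 0)
    return remaining_steps * (elapsed_seconds / current_step)


def estimate_mfu(tokens_per_sec: float, num_params: float, peak_flops: float) -> float:
    """Model FLOPs Utilization as achieved FLOP/s over peak FLOP/s.

    Assumes roughly 2 FLOPs per parameter per generated token (forward pass only).
    """
    return (2.0 * num_params * tokens_per_sec) / peak_flops


print(format_rate(3650))       # 3,650
print(format_rate(36500))      # 36.5k
print(format_rate(3_650_000))  # 3.65M
```

MBU follows the same shape as `estimate_mfu`, with bytes moved per second over peak memory bandwidth in place of FLOP/s. Returning `None` from `robust_mean` and `estimate_eta_seconds` lets the dashboard show a placeholder instead of a misleading number before any valid data exists.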

@finbarrtimbers changed the title from "Updated benchmark with a bunch of improvements." to "Updated dashboard with a bunch of improvements." on Sep 8, 2025