Updated dashboard with a bunch of improvements. #994

finbarrtimbers · 2025-09-07T12:59:13Z

Fixed Average Batch Generation Time Bug - Removed duplicate reporting and added validation to prevent extreme
outlier values from corrupting averages
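
As a rough illustration of the outlier validation mentioned above, here is a minimal sketch of a running average that drops out-of-range samples. The class name `BoundedRunningAverage` and the bounds are hypothetical and are not taken from this PR.

```python
from typing import Optional


class BoundedRunningAverage:
    """Running mean that ignores samples outside [min_value, max_value].

    Hypothetical helper: the bounds are illustrative defaults, not values
    from the dashboard code.
    """

    def __init__(self, min_value: float = 0.0, max_value: float = 3600.0) -> None:
        self.min_value = min_value
        self.max_value = max_value
        self.total = 0.0
        self.count = 0

    def add(self, sample: float) -> bool:
        """Record a sample; return False if it was rejected as an outlier."""
        # NaN fails both comparisons, so it is rejected as well.
        if not (self.min_value <= sample <= self.max_value):
            return False
        self.total += sample
        self.count += 1
        return True

    @property
    def mean(self) -> Optional[float]:
        """Mean of accepted samples, or None if nothing has been accepted."""
        return self.total / self.count if self.count else None
```

Rejecting a sample rather than clamping it keeps a single corrupt timing from shifting the average at all.
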
Static Tokens/sec Display - The tokens-per-second value now stays fixed until a new batch is processed, which gives
a more stable reading
Improved Number Formatting - Enhanced formatting for tokens/s with proper comma separators (3,650) and k/M
suffixes (36.5k, 3.65M) when numbers exceed thresholds
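
A minimal sketch of formatting along these lines, reproducing the three examples above. The function names and the cutoffs (10,000 for `k`, roughly one million for `M`) are assumptions; the PR description does not state the actual thresholds.

```python
def _three_digits(x: float) -> str:
    """Format x (expected in [1, 1000)) with about three significant digits."""
    if x >= 100:
        return f"{x:.0f}"
    if x >= 10:
        return f"{x:.1f}"
    return f"{x:.2f}"


def format_rate(value: float) -> str:
    """Format a tokens/s value with comma separators or a k/M suffix.

    Hypothetical thresholds: plain comma-separated integers below 10,000,
    a "k" suffix up to just under one million, and an "M" suffix above that.
    """
    # 999,500 and up would round to "1000k", so those values switch to "M".
    if value >= 999_500:
        return _three_digits(value / 1_000_000) + "M"
    if value >= 10_000:
        return _three_digits(value / 1_000) + "k"
    return f"{value:,.0f}"
```
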
Training Progress & ETA - Added current training step tracking with estimated time remaining calculations based
on training speed
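
One simple way to derive an ETA from training speed is to multiply the remaining steps by the average time per completed step. The sketch below assumes that approach; `estimate_eta_seconds` and `format_eta` are hypothetical names, and the dashboard may compute speed differently (for example, over a recent window).

```python
from datetime import timedelta
from typing import Optional


def estimate_eta_seconds(
    current_step: int, total_steps: int, elapsed_seconds: float
) -> Optional[float]:
    """Estimate seconds remaining from the average time per completed step.

    Returns None until at least one step has finished, since no speed
    estimate exists before that.
    """
    if current_step <= 0 or elapsed_seconds <= 0:
        return None
    seconds_per_step = elapsed_seconds / current_step
    remaining_steps = max(total_steps - current_step, 0)
    return remaining_steps * seconds_per_step


def format_eta(eta_seconds: Optional[float]) -> str:
    """Render an ETA as H:MM:SS, or "unknown" when no estimate exists."""
    if eta_seconds is None:
        return "unknown"
    return str(timedelta(seconds=round(eta_seconds)))
```
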
MFU + MBU Calculations - Implemented Model FLOPs Utilization and Memory Bandwidth Utilization tracking with
proper averaging
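
For reference, MFU and MBU are both ratios of achieved to peak hardware throughput. The sketch below assumes the common approximation of 2 FLOPs per parameter per generated token for a forward pass (attention FLOPs ignored); the function names and parameters are hypothetical and are not the PR's implementation.

```python
def model_flops_utilization(
    tokens_per_second: float, num_params: float, peak_flops_per_second: float
) -> float:
    """MFU = achieved FLOP/s divided by the hardware's peak FLOP/s.

    Assumption: a forward pass costs about 2 FLOPs per parameter per token.
    """
    if peak_flops_per_second <= 0:
        raise ValueError("peak_flops_per_second must be positive")
    achieved_flops_per_second = 2.0 * num_params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


def memory_bandwidth_utilization(
    bytes_moved_per_second: float, peak_bytes_per_second: float
) -> float:
    """MBU = achieved memory traffic divided by peak memory bandwidth."""
    if peak_bytes_per_second <= 0:
        raise ValueError("peak_bytes_per_second must be positive")
    return bytes_moved_per_second / peak_bytes_per_second
```

For example, a 1B-parameter model generating 1,000 tokens/s on hardware with a 10 TFLOP/s peak would come out at an MFU of about 0.2 under this approximation.
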
Memory Usage Metrics - Added comprehensive memory tracking including:
- Total GPU memory usage across all actors
- Average KV cache size per actor
- Peak memory usage tracking
- Integration with vLLM engine memory statistics
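
The aggregation described in this list could look roughly like the following. `ActorMemory` and `MemoryTracker` are hypothetical names, and the per-actor numbers are assumed to have been collected elsewhere; no vLLM API is called here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ActorMemory:
    """Per-actor memory snapshot (hypothetical structure)."""

    gpu_bytes_used: int
    kv_cache_bytes: int


class MemoryTracker:
    """Aggregates per-actor snapshots and remembers the peak total usage."""

    def __init__(self) -> None:
        self.peak_total_bytes = 0

    def update(self, actors: List[ActorMemory]) -> Dict[str, float]:
        """Return the total, average KV cache size, and peak so far."""
        total = sum(a.gpu_bytes_used for a in actors)
        avg_kv = sum(a.kv_cache_bytes for a in actors) / len(actors) if actors else 0.0
        # The peak only ever grows across successive updates.
        self.peak_total_bytes = max(self.peak_total_bytes, total)
        return {
            "total_gpu_bytes": total,
            "avg_kv_cache_bytes": avg_kv,
            "peak_total_gpu_bytes": self.peak_total_bytes,
        }
```
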
Dashboard UI Enhancements - Added new organized sections for:
- 🎯 Training Progress (step counter + ETA)
- 🧠 Model Utilization (MFU/MBU metrics)
- 💾 Memory Usage (GPU memory + KV cache)
- Enhanced styling and visual organization
vLLM Integration - Extended LLMRayActor with comprehensive metrics collection from vLLM engines including GPU
memory stats and compute utilization estimates

finbarrtimbers added 3 commits September 7, 2025 06:58

Updated benchmark with a bunch of improvements.

9707f87

Added more functionality. now we track actor status

a237d4a

Adds more metrics, actor management

68f5a57

finbarrtimbers changed the title ~~Updated benchmark with a bunch of improvements.~~ Updated dashboard with a bunch of improvements. Sep 8, 2025

finbarrtimbers added 4 commits September 17, 2025 11:49

many changes to dashboard

7a2d006

Merge branch 'main' into dashboard-improvements

75da640

fixed errors

b02290c

set host networking

692ec67
