How to Use
Choose a Scenario:
- Steady Load: Constant request rate for baseline performance analysis
- Spike Load: Sudden burst of requests to test capacity
- Mixed Workload: Variable token lengths simulating real-world traffic
- Custom: Configure your own parameters
Key Metrics:
- Throughput: Requests processed per second
- Queue Length: Pending requests waiting for GPU time
- GPU Utilization: How efficiently the cluster uses available compute
- Latency: Time from request arrival to completion (P50/P95/P99)
Understanding the Visualization:
- GPU Grid: Each cell represents a GPU instance (gray = idle, blue = processing)
- Request Queue: Visualizes pending requests color-coded by age
- Metrics Charts: Real-time performance data over the simulation timeline
Key Concepts
Batching
GPUs process requests in batches for efficiency. The visualizer shows how:
- Requests are grouped together when possible
- Larger batches improve throughput but may increase latency
- Batch size affects GPU utilization
Load Patterns
Different traffic patterns stress the system differently:
- Steady: Predictable load, easier to optimize
- Bursty: Requires queue management and capacity planning
- Mixed: Realistic workloads with varying request sizes
Watch how changing parameters affects:
- Throughput vs. latency trade-offs
- Queue buildup during peak load
- Recovery time after load spikes
- Impact of token length variance
Configuration Options
Request Rate: Control how many requests arrive per second
GPUs: Scale the cluster size to see capacity impact
Batch Size: Affects how many requests process together
Processing Time: Base time per token (varies by model/hardware)
Token Distribution: Single value or range for realistic variance
Real-World Applications
This simulation helps understand:
- Capacity Planning: How many GPUs needed for expected load
- Auto-scaling: When to add/remove instances
- SLA Management: Meeting latency requirements under load
- Cost Optimization: Balancing performance vs. infrastructure cost
Technical Implementation
Built with React and Recharts, this visualizer uses:
- Event-driven simulation engine
- Real-time batching algorithms
- Statistical latency tracking
- Responsive data visualization
View the simulation running at different speeds to analyze behavior over extended periods. The "Speed" control lets you compress hours of operation into minutes.