
"The new capabilities center on two integrated components: the Dynamo Planner Profiler and the SLO-based Dynamo Planner. These tools work together to solve the "rate matching" challenge in disaggregated serving. The teams use this term when they split inference workloads. They separate prefill operations, which process the input context, from decode operations that generate output tokens. These tasks run on different GPU pools. Without the right tools, teams spend a lot of time determining the optimal GPU allocation for these phases."
"The Dynamo Planner Profiler is a pre-deployment simulation tool. It automates the search for the best configurations. Developers can skip manually testing various parallelization strategies and GPU counts, saving hours of GPU utilization. Instead, they define their needs in a DynamoGraphDeploymentRequest (DGDR) manifest. The profiler runs an automated sweep of the configuration space. It tests different tensor parallelism sizes for both prefill and decode stages. This helps find settings that boost throughput while staying within latency limits."
"The profiler includes an AI Configurator mode that can simulate performance in approximately 20 to 30 seconds based on pre-measured performance data. This capability allows teams to rapidly iterate on configurations before allocating physical GPU resources. The output gives a tuned setup to boost what teams call " Goodput." This is the highest possible throughput while staying within set limits for T"
Microsoft and NVIDIA provide automated resource planning and SLO-based dynamic scaling for NVIDIA Dynamo inference on Azure Kubernetes Service (AKS). Two integrated components, the Dynamo Planner Profiler and the SLO-based Dynamo Planner, address rate matching between prefill and decode phases that run on separate GPU pools. The profiler performs pre-deployment simulations driven by a DynamoGraphDeploymentRequest (DGDR) manifest, sweeping tensor parallelism and GPU count options to find configurations that maximize throughput within latency constraints. An AI Configurator mode can simulate performance in about 20–30 seconds using pre-measured data, enabling rapid iteration before allocating physical GPUs and optimizing Goodput.
Read at InfoQ