KubeCon NA 2025 - Erica Hughberg and Alexa Griffith on Tools for the Age of GenAI
Briefly

"The new requirements for GenAI-based applications include dynamic, model-based routing, token-level rate limiting, secure and centralized credential management, and observability, resilience, and failover for AI. Existing tools are not sufficient to support these use cases: they lack AI-native logic and offer only simple rate limiting and request-based routing. Kubernetes platforms and tools like KServe, vLLM, Envoy, and llm-d can be used to implement these new requirements."
"Envoy AI Gateway helps manage traffic at the edge and provides unified access from application clients to GenAI services like Inference Service or Model Context Protocol (MCP) Server. Its design is based on a two-tier gateway pattern: the Tier One Gateway, referred to as the AI Gateway, functions as a centralized entry point and is responsible for authentication, top-level routing, a unified LLM API, and token-based rate limiting. It can also act as an MCP proxy."
"And the Tier Two Gateway, referred to as the Reference Gateway, manages the ingress traffic to the AI models hosted on a Kubernetes cluster and is also responsible for fine-grained access control to the models. Envoy AI Gateway supports different AI providers like OpenAI, Azure OpenAI, Google Gemini, Vertex AI, AWS Bedrock, and Anthropic."
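A unified LLM API means application clients speak one schema (typically an OpenAI-compatible chat-completions request) to the gateway, which routes by model to the right backend provider and injects the real provider credentials itself. The sketch below illustrates this client-side shape; the gateway URL, model names, and API key are hypothetical placeholders, not values from the talk.

```python
import json

# Hypothetical gateway endpoint for illustration; the real value
# depends on how your Envoy AI Gateway deployment is exposed.
GATEWAY_URL = "https://ai-gateway.example.com/v1/chat/completions"

def build_chat_request(model: str, prompt: str, api_key: str):
    """Build one OpenAI-compatible chat-completions request.

    The client always sends this same shape; the gateway's top-level
    routing decides (by the `model` field) whether the request goes to
    OpenAI, Azure OpenAI, Bedrock, Gemini, etc.
    """
    headers = {
        # A gateway-issued key: provider credentials stay centralized
        # in the gateway, never in the application.
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,  # e.g. a name the gateway maps to a backend
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return headers, body

headers, body = build_chat_request("gpt-4o", "Hello", "my-gateway-key")
```

The point of the pattern is that swapping providers is a gateway routing change, not an application code change.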
Generative AI workloads introduce new traffic patterns and infrastructure demands that require AI-native platform capabilities. Applications need dynamic model-based routing, token-level rate limiting, centralized secure credential management, and observability, resilience, and failover mechanisms. Existing tools lack AI-native logic, fine-grained rate limiting, and token-aware routing. Kubernetes platforms combined with projects like KServe, vLLM, Envoy, and llm-d can implement these requirements. Envoy AI Gateway implements a two-tier gateway with a centralized AI Gateway for authentication, top-level routing, unified LLM API, and token-based rate limiting, and a Reference Gateway that manages ingress and fine-grained model access. OpenTelemetry, Prometheus, and Grafana support monitoring and observability.
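Token-based rate limiting differs from request-based limiting in that the budget is debited by the number of LLM tokens a call consumes, so one large completion can exhaust as much quota as many small requests. The following is a minimal illustrative sketch of that idea with a fixed one-minute window; it is not Envoy AI Gateway's implementation.

```python
import time

class TokenBudgetLimiter:
    """Illustrative token-level rate limiter (fixed-window sketch).

    Unlike a request counter, the remaining budget is reduced by the
    tokens each LLM call consumed, which is the property token-based
    rate limiting adds for GenAI traffic.
    """

    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.remaining = tokens_per_minute
        self.window_start = time.monotonic()

    def allow(self, tokens_used: int) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 60:
            # New one-minute window: refill the token budget.
            self.remaining = self.capacity
            self.window_start = now
        if tokens_used <= self.remaining:
            self.remaining -= tokens_used
            return True
        return False

limiter = TokenBudgetLimiter(tokens_per_minute=1000)
print(limiter.allow(800))  # True: 800 tokens fit in the 1000-token budget
print(limiter.allow(300))  # False: only 200 tokens remain this window
```

A production gateway would additionally distinguish input from output tokens and share counters across replicas, but the debit-by-tokens principle is the same.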
Read at InfoQ