How SREs and GenAI Work Together to Decrease eBay's Downtime: An Architect's Insights at KubeCon EU
Briefly

In his KubeCon EU keynote, Vijay Samuel of eBay discussed how machine learning (ML) and large language models (LLMs) can improve incident response for SRE teams. eBay's infrastructure has grown to more than 4,000 microservices, and the volume of data they generate complicates incident management. Samuel introduced Groot, a system that attaches root causes to alerts and drastically reduces incident detection time. Despite these advances, his team ran into problems with LLMs, particularly in keeping answers accurate over the course of an interaction, which showed that precise prompts are needed for reliable outcomes.
Samuel emphasized that machine learning and large language models benefit incident response at eBay but are not foolproof solutions.
eBay's platform has grown to more than 4,000 microservices that generate substantial data, which makes incident management complex; Samuel addressed that complexity with machine learning.
Samuel's team developed Groot to improve incident triage: it attaches root causes to alerts, which significantly reduces detection time and shows the growing role of ML in operations.
Interactions with large language models can be unpredictable; Samuel found that clear, specific prompts yield more reliable results than broad requests.
Read at InfoQ