Datadog Employs LLMs to Assist with Writing Incident Postmortems
Briefly

Datadog combines large language models (LLMs) with structured incident metadata and Slack messages to streamline the creation of incident postmortems. The team had to ensure high-quality output while adapting LLMs beyond typical dialog use cases, and spent over 100 hours refining the model instructions and report structure. They evaluated multiple models, including GPT-3.5 and GPT-4, to balance cost, speed, and accuracy. By running generation tasks in parallel, they cut report generation time from 12 minutes to under 1 minute. To address trust concerns, AI-generated content is clearly marked, and sensitive data is omitted from the model inputs to protect privacy.
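As a rough illustration of the input side of such a pipeline, the Python sketch below assembles a prompt from structured incident metadata and redacted Slack messages, and labels the resulting draft as AI-generated. The field names, the redaction rule, and the helper functions are hypothetical stand-ins, not Datadog's actual schema or tooling.

```python
import re
from dataclasses import dataclass

# Hypothetical structured metadata for an incident; the fields are
# illustrative, not Datadog's actual schema.
@dataclass
class IncidentMetadata:
    incident_id: str
    severity: str
    services: list[str]
    started_at: str
    resolved_at: str

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(text: str) -> str:
    """Strip obviously sensitive tokens (here, just email addresses)
    before the text is sent to the model."""
    return EMAIL_RE.sub("[REDACTED]", text)

def build_prompt(meta: IncidentMetadata, slack_messages: list[str]) -> str:
    """Combine structured metadata with redacted Slack messages into a
    single prompt asking the model to draft one postmortem section."""
    timeline = "\n".join(f"- {redact(m)}" for m in slack_messages)
    return (
        "You are drafting the 'What happened' section of an incident postmortem.\n"
        f"Incident {meta.incident_id} (severity {meta.severity}) affected "
        f"{', '.join(meta.services)} from {meta.started_at} to {meta.resolved_at}.\n"
        "Relevant Slack discussion:\n"
        f"{timeline}\n"
        "Write a concise, factual summary. Do not invent details."
    )

def mark_ai_generated(section_text: str) -> str:
    """Label model output so reviewers can tell it apart from human-written text."""
    return f"[AI-generated draft - please review]\n{section_text}"

if __name__ == "__main__":
    meta = IncidentMetadata("INC-1234", "SEV-2", ["web-api"],
                            "2023-05-01T10:00Z", "2023-05-01T11:30Z")
    msgs = ["alice@example.com: error rate spiking on web-api",
            "rolled back deploy 42, recovering"]
    print(build_prompt(meta, msgs))
    print(mark_ai_generated("(model output would go here)"))
```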
Datadog's approach to incident postmortem creation combines structured incident data and Slack messages as LLM inputs to improve report quality and efficiency.
The team invested over 100 hours refining their model instructions and report structure to ensure accurate, high-quality postmortem drafts.
Exploring model variants like GPT-3.5 and GPT-4 revealed significant trade-offs; GPT-4 was more accurate but also slower and costlier than GPT-3.5.
By running LLM tasks in parallel and choosing the model variant based on content complexity, Datadog reduced postmortem report generation from 12 minutes to under a minute (see the sketch below).
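The parallelization and model-routing idea can be sketched with standard Python concurrency primitives. The section names, complexity labels, and the generate_section stub below are assumptions for illustration, not Datadog's implementation; the stub sleeps instead of calling a real LLM API.

```python
import concurrent.futures
import time

# Hypothetical postmortem sections and their assumed complexity.
SECTIONS = {
    "summary": "simple",
    "timeline": "simple",
    "root_cause": "complex",
    "action_items": "complex",
}

def pick_model(complexity: str) -> str:
    """Route complex sections to a slower, more accurate model and simple
    ones to a cheaper, faster model (model names are illustrative)."""
    return "gpt-4" if complexity == "complex" else "gpt-3.5-turbo"

def generate_section(name: str, complexity: str) -> str:
    """Stand-in for an LLM call: sleeps to simulate latency, then returns
    a placeholder draft labelled with the model that would have been used."""
    model = pick_model(complexity)
    time.sleep(1.0 if complexity == "complex" else 0.3)
    return f"{name}: drafted with {model}"

def generate_postmortem() -> dict[str, str]:
    # Launch all section drafts concurrently so total latency is bounded
    # by the slowest single section, not the sum of all of them.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {pool.submit(generate_section, n, c): n for n, c in SECTIONS.items()}
        return {futures[f]: f.result() for f in concurrent.futures.as_completed(futures)}

if __name__ == "__main__":
    start = time.time()
    for section, draft in generate_postmortem().items():
        print(draft)
    print(f"elapsed: {time.time() - start:.1f}s")
```

Because each section is drafted independently, overall latency is dominated by the slowest single section rather than the sum of all of them, which is the kind of parallelism behind the reported drop from 12 minutes to under a minute.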
Read at InfoQ