On February 5th, Cloudflare suffered a significant outage of its R2 Gateway service, caused by human error while handling a phishing report. An employee applied a remediation step incorrectly and inadvertently disabled the service, which disrupted several other Cloudflare services for more than an hour. No data was lost, but operations involving R2 buckets were severely impacted. Cloudflare acknowledged that it needs stronger validation safeguards and more robust operational protocols to prevent similar incidents.
The incident occurred due to human error and insufficient validation safeguards during a routine abuse remediation for a report about a phishing site hosted on R2.
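To make the idea of a "validation safeguard" concrete, here is a minimal sketch of a pre-flight check that an abuse-remediation tool could run before acting. It is not Cloudflare's implementation: the account names, action names, and the `validate_remediation` function are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical identifiers for accounts that run internal production services.
PROTECTED_ACCOUNTS = {"internal-r2-gateway"}

# Hypothetical actions whose blast radius is larger than one customer resource.
BROAD_ACTIONS = {"disable_product", "disable_service"}


class RemediationBlocked(Exception):
    """Raised when a remediation request fails a safety check."""


@dataclass(frozen=True)
class RemediationRequest:
    target_account: str      # account that owns the reported resource
    target_resource: str     # e.g. the bucket or endpoint named in the report
    action: str              # e.g. "disable_resource" or "disable_service"
    second_approver: str = ""  # empty means nobody has reviewed the request


def validate_remediation(req: RemediationRequest) -> None:
    """Raise RemediationBlocked unless the request is narrowly scoped and safe."""
    # Never let an abuse action touch an internal production account.
    if req.target_account in PROTECTED_ACCOUNTS:
        raise RemediationBlocked(
            f"{req.target_account} is a protected internal account"
        )
    # Broad actions need a second person to confirm the scope.
    if req.action in BROAD_ACTIONS and not req.second_approver:
        raise RemediationBlocked(
            f"action {req.action!r} requires a second approver"
        )
```

A narrowly scoped request such as `RemediationRequest("customer-123", "bucket-abc", "disable_resource")` passes, while a request that targets a protected account, or a broad action without a second approver, is rejected before anything is disabled.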
At the R2 service level, our internal Prometheus metrics showed R2's SLO drop to 0% almost immediately as R2's Gateway service stopped serving all requests.
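As a rough illustration of why the figure falls to 0% when a gateway stops serving requests, an availability indicator is commonly computed as the share of successful requests in a time window. The sketch below assumes that definition; the source does not say how R2's SLO is actually measured, and the function name and sample numbers are made up.

```python
from typing import Optional


def availability(successful: int, total: int) -> Optional[float]:
    """Return the fraction of successful requests, or None if there was no traffic."""
    if total < 0 or successful < 0 or successful > total:
        raise ValueError("counts must satisfy 0 <= successful <= total")
    if total == 0:
        return None  # no requests observed, so availability is undefined
    return successful / total


# Made-up request counts for two windows: normal operation, then a full outage.
normal_window = availability(successful=9990, total=10000)
outage_window = availability(successful=0, total=10000)
```

With every request failing, `outage_window` evaluates to `0.0`, which is what a 0% reading means under this assumed definition.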