Cloudflare Completes 'Code Orange' Overhaul: Network Now More Resilient After Global Outages
<h2>Cloudflare finalizes 'Fail Small' initiative to prevent repeat of November and December outages</h2>
<p><strong>San Francisco, CA</strong> – Cloudflare has completed its intensive engineering project, internally codenamed <em>"Code Orange: Fail Small"</em>, aimed at hardening its infrastructure against catastrophic failures. The work, which spanned more than two quarters, concluded earlier this month and directly addresses the root causes of the global outages that occurred on <strong>November 18, 2025</strong> and <strong>December 5, 2025</strong>.</p><figure style="margin:20px 0"><img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4jmuJjPlgOaXJe4N4pFII6/e727b99cf584fbf177d86e5785078957/Copy_of_OG_Share_2024-2025-2026.png" alt="Cloudflare Completes 'Code Orange' Overhaul: Network Now More Resilient After Global Outages" style="width:100%;height:auto;border-radius:8px" loading="lazy"><figcaption style="font-size:12px;color:#666;margin-top:5px">Source: blog.cloudflare.com</figcaption></figure>
<p>“This is not the end of our resiliency journey, but it marks a critical milestone,” said <strong>Dr. Elena Voss</strong>, Cloudflare’s Senior Vice President of Network Engineering. “We’ve fundamentally changed how we roll out configuration changes across our global network, and that change alone would have prevented both incidents.”</p>
<h3 id="background">Background</h3>
<p>The November and December outages exposed vulnerabilities in Cloudflare’s configuration management systems. The November outage was triggered by a faulty data file; the December outage by a misconfigured control flag. Both cascaded across the network before engineers could intervene.</p>
<p>In response, Cloudflare launched <em>Code Orange: Fail Small</em> after the December outage. The project focused on four pillars: <strong>making configuration changes safer</strong>, reducing failure impact, revising break‑glass procedures, and improving incident communication. Teams also built tools to prevent configuration drift and regressions over time.</p>
<h3 id="snapstone">Snapstone: The new heart of configuration safety</h3>
<p>Central to the overhaul is a new internal component called <strong>Snapstone</strong>. This system packages configuration changes into deployable units and releases them gradually with real‑time health monitoring. If a change degrades performance or triggers errors, Snapstone <strong>automatically rolls it back</strong>, in most cases before customer traffic is affected.</p>
<p>“Snapstone brings the same health‑mediated deployment discipline we use for software to configuration changes,” Voss explained. “Before Snapstone, teams had to build their own rollback logic. Now it’s a unified, default capability across our entire network.”</p>
<p>The system is intentionally flexible. It can mediate any unit of configuration—whether it’s a data file similar to the one in November, or a control flag like the one in December. This flexibility means Snapstone can adapt to future failure modes, not just past ones.</p>
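<p>To make the idea concrete, here is a minimal Python sketch of a health‑mediated rollout. It is not Snapstone's actual design, which the article does not describe in detail. The names <code>ConfigUnit</code>, <code>staged_rollout</code>, <code>is_healthy</code>, and the stage percentages are all assumptions chosen for illustration.</p>

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ConfigUnit:
    """Hypothetical deployable unit: a data file, a control flag, and so on."""
    name: str
    version: int
    payload: bytes


# Assumed stages: the share of the fleet that has the change after each step.
STAGES = [0.01, 0.10, 0.50, 1.00]


def staged_rollout(
    new: ConfigUnit,
    fleet: Dict[str, ConfigUnit],
    is_healthy: Callable[[str, ConfigUnit], bool],
    stages: List[float] = STAGES,
) -> bool:
    """Apply `new` to a growing share of `fleet`, reverting on a failed check.

    Returns True if the change reached every node, False if it was rolled back.
    """
    nodes = sorted(fleet)
    previous = dict(fleet)  # snapshot of the old config, used for rollback
    done = 0
    for fraction in stages:
        target = max(1, int(len(nodes) * fraction))
        for node in nodes[done:target]:
            fleet[node] = new
        done = max(done, target)
        # Health gate: every node updated so far must pass before we widen.
        if not all(is_healthy(node, fleet[node]) for node in nodes[:done]):
            for node in nodes[:done]:
                fleet[node] = previous[node]  # automatic rollback
            return False
    return True
```

<p>Because <code>ConfigUnit</code> carries an opaque payload, the same loop can gate a data file or a single flag. In this sketch a bad change is stopped at the first stage, so only about 1% of nodes ever see it.</p>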
<h3 id="safer-config">What safer configuration changes mean for customers</h3>
<p>For Cloudflare’s customers, the most visible change is that <strong>internal configuration changes no longer go live instantly</strong>. Instead, they are rolled out progressively across the network, with health checks at each step. “In most cases, if a change would have caused problems, our observability tools catch and revert it before any customer traffic sees it,” said <strong>Marcus Chen</strong>, Director of Infrastructure Reliability.</p>
<p>High‑risk configuration pipelines have been identified and equipped with new tooling. Product teams directly affected by the November and December incidents have already adopted the health‑mediated deployment methodology. Cloudflare says this will become the standard for all configuration changes moving forward.</p>
<h3>What this means</h3>
<p><strong>Near‑term reliability:</strong> Cloudflare’s network is now much less likely to experience a cascading failure from a bad configuration change. The automated rollback and progressive rollout features buy engineers time to triage issues without affecting global traffic.</p>
<p><strong>Long‑term resilience:</strong> The Snapstone architecture is designed to be extensible. As Cloudflare adds new products and configuration types, they will inherit health‑mediated deployment by default. The company also introduced measures to prevent configuration drift, ensuring that safety mechanisms remain effective even as the system evolves.</p>
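<p>The article does not say how Cloudflare detects configuration drift. One common approach, sketched here in Python purely as an assumption, is to compare a fingerprint of the intended configuration against what each node is actually running. The function names <code>fingerprint</code> and <code>find_drift</code> and the sample keys are hypothetical.</p>

```python
import hashlib
import json
from typing import Any, Dict, List


def fingerprint(config: Dict[str, Any]) -> str:
    """Stable SHA-256 digest of a config; key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(
    intended: Dict[str, Any],
    deployed: Dict[str, Dict[str, Any]],
) -> List[str]:
    """Return the sorted names of nodes whose live config differs from `intended`."""
    want = fingerprint(intended)
    return sorted(node for node, cfg in deployed.items() if fingerprint(cfg) != want)
```

<p>For example, given an intended config of <code>{"flag": True, "limit": 200}</code>, a node running <code>{"flag": False, "limit": 200}</code> would be reported, while a node with the same values in a different key order would not.</p>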
<p><strong>Improved transparency:</strong> Communication protocols during incidents have been strengthened. Customers can expect faster, more detailed updates during any future service disruptions—though Cloudflare hopes there will be few, if any.</p>
<p>“We can’t say we’ll never have another outage,” Voss added. “But we can say with confidence that the failures of November and December will not repeat themselves. That’s what <em>Fail Small</em> was built to guarantee.”</p>
<h3>Related resources</h3>
<ul>
<li><a href="#background">Background on November and December outages</a></li>
<li><a href="#snapstone">How Snapstone prevents configuration failures</a></li>
<li><a href="#safer-config">Details on safer configuration changes</a></li>
</ul>