At 8:32 a.m., a global bank’s mobile app froze. No alarms. No breach. Just cascading timeouts across hundreds of microservices on three continents.
Sarah couldn’t transfer lunch money to her 12-year-old daughter, Emma. At the cafeteria register, Emma watched friends tap to pay while she apologized to the cashier. Sarah’s phone said, “System maintenance.”
Two hours later, everything was back online. No headline. No blame. But something essential failed—trust. The systems recovered; the confidence didn’t. Uptime looked perfect on paper, yet a small promise between a mother and daughter went unkept.
The problem isn’t the stack—it’s ownership. When control fragments across clouds and teams, governance slips—and customers pay the price. Here’s how to fix it: build resilience that serves people, not percentages. Because in today’s distributed world, systems work perfectly in isolation—and still fail together.
Microservices, multi-cloud and edge computing promise agility but fracture accountability. Each team owns a fragment; no one owns the whole. We track latency to five decimals, yet miss the moment a transaction fails the customer.
When parts are reliable in isolation but fragile in combination, governance—not infrastructure—breaks first.
We still celebrate “five nines.” Reality argues back. British Airways, 2017: a power issue grounded 75,000 passengers. Cloudflare, 2020: a routing error disrupted the internet for 27 minutes. CrowdStrike, 2024: a faulty update disrupted operations at hospitals and airports.
Behind each statistic: a nurse who couldn’t access patient records, a father missing his flight home, a small business owner locked out during her busiest hour. These aren’t edge cases—they’re Tuesday mornings in someone’s life.
Industry experience shows many operational losses hide in reliability gaps invisible to traditional monitoring—quiet failures that look green on dashboards and red in people’s lives. That’s why the next KPI is “trusttime”—how quickly you reassure people when systems bend.
Think about your organization. Every day, your systems make hundreds of micro-promises: process or delay. Connect or drop. Each outcome reflects your architecture and your leadership—by design or by neglect.
The hard part isn’t “more monitoring.” It’s confronting real friction: fragmented ownership, fear of slowing releases and habits that optimize speed over trust. Customers don’t measure reliability by percentages. They measure how you respond when things break. Security isn’t IT’s job alone. Neither is resilience. It becomes everyone’s job only when leadership makes it real and resilience is embedded in culture, not bolted on.
A moment from the field
In a large, multi-site retail environment I led, a “minor” configuration change slowed a hidden dependency: price lookups and loyalty checks. Registers still took payments, but lines grew and so did tempers.
What saved the day wasn’t a heroic patch. It was a rehearsed play: switch to degraded mode, default sensible pricing, give staff one sentence—“Your purchase will go through; some features are slower today.” Post updates every 15 minutes. Sales held. Complaints stayed manageable.
I watched a cashier calm an anxious customer buying diabetes medication: “It’s processing, just a bit slower. You’ll have your prescription.” That taught me more about resilience than any SLA dashboard. Mean time to reassurance beats mean time to recovery.
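The difference between the two metrics is easy to make concrete. A minimal sketch, assuming a hypothetical incident log with three timestamps per incident (detection, first honest customer update, full recovery):

```python
from datetime import datetime

# Hypothetical incident records: when we detected the issue, when we first
# posted a reassuring customer-facing update, and when service recovered.
incidents = [
    {"detected": datetime(2024, 5, 1, 8, 32),
     "first_update": datetime(2024, 5, 1, 8, 40),
     "recovered": datetime(2024, 5, 1, 10, 30)},
    {"detected": datetime(2024, 6, 3, 14, 5),
     "first_update": datetime(2024, 6, 3, 14, 12),
     "recovered": datetime(2024, 6, 3, 14, 50)},
]

def mean_minutes(deltas):
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

# Mean time to reassurance: detection -> first honest customer update.
mttreassure = mean_minutes([i["first_update"] - i["detected"] for i in incidents])
# Mean time to recovery: detection -> full restoration.
mttr = mean_minutes([i["recovered"] - i["detected"] for i in incidents])

print(f"Mean time to reassurance: {mttreassure:.1f} min")  # 7.5 min
print(f"Mean time to recovery:    {mttr:.1f} min")         # 81.5 min
```

In this toy data, customers heard something honest within eight minutes even though full recovery took far longer; that gap is where trust is kept or lost.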
Here are five practical ways to build resilience when control fragments:
1: Design systems that speak human during failure
Resilience—per NIST—means anticipate, withstand, recover, and adapt. It’s more than redundancy. It’s grace under pressure.
- Communicate with empathy. “Almost there—your transaction is delayed, not lost.” The key metric isn’t MTTR; it’s mean time to reassurance. Be factual, avoid spin, don’t over-promise.
- Design dignified failure states. When Emma’s gift purchase fails, give her mother a next step: “We’ve saved your order and will process it within two hours. You’ll get a text when it’s ready.” That beats “Error 502” every time.
- Build trust reserves. A reputation for reliability earns patience—and it’s built promise by promise.
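A dignified failure state can be a few lines of code. A sketch, assuming a hypothetical ordering service (`place_order`, `Order`, and the pending queue are illustrative, not a real API): instead of surfacing “Error 502,” it preserves the user’s intent and promises a concrete next step.

```python
from dataclasses import dataclass

@dataclass
class Order:
    order_id: str
    phone: str

PENDING: list[Order] = []  # stand-in for a durable retry queue

def place_order(order: Order, backend_up: bool) -> dict:
    if backend_up:
        return {"status": "confirmed",
                "message": "Your order is confirmed."}
    # Failure with dignity: save the intent, set expectations,
    # promise a follow-up the customer can rely on.
    PENDING.append(order)
    return {"status": "queued",
            "message": ("We've saved your order and will process it "
                        "within two hours. You'll get a text when it's ready.")}

result = place_order(Order("A-1001", "+1-555-0100"), backend_up=False)
print(result["message"])
```

The design choice is that failure never discards the order; it converts it into a queued promise with a deadline the customer can plan around.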
2: Track what matters to people, not just what’s in your logs
When control scatters, measure human outcomes, not just machine states.
- Error budgets = people, not packets. Weight SLOs by who is impacted (e.g., breakfast-hour transfers), not uniform time buckets. Tie to business KPIs/KRIs and risk appetite. A pharmacy’s 99.9% means nothing if the 0.1% hits at 2 a.m. when a parent needs antibiotics.
- Impact-first retrospectives. Start with “Who was affected?”, then trace causes. Report customer impact alongside technical root cause so executives actually see reliability.
- Journey dashboards. Track “successful money transfers during breakfast hours,” not “API gateway response times.” When teams see real-time human impact, priorities snap into focus.
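Weighting an error budget by people rather than time can be shown in a few lines. A sketch under illustrative assumptions (the failure windows and customer counts are made up): it measures impact in customer-minutes instead of clock-minutes.

```python
# Each failure window: (minutes of degradation, customers affected per minute).
# Numbers are illustrative.
failure_windows = [
    (5, 1200),   # breakfast-hour transfer outage: brief but high impact
    (30, 15),    # 2 a.m. degradation: long but touches few people
]

def people_minutes(windows):
    """Impact measured in customer-minutes, not clock-minutes."""
    return sum(minutes * customers for minutes, customers in windows)

clock_minutes = sum(m for m, _ in failure_windows)
impact = people_minutes(failure_windows)

print(f"Clock downtime: {clock_minutes} min")             # 35 min
print(f"Customer-minutes of impact: {impact}")            # 6450
# The 5-minute breakfast outage (6,000 customer-minutes) dwarfs the
# 30-minute overnight one (450) — uniform time buckets would invert that.
```

This is the arithmetic behind “99.9% means nothing if the 0.1% hits at the wrong moment”: the shorter outage carries thirteen times the human impact.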
3: Architect for where people’s moments actually happen
Distributed systems fail through interdependence. Architect for proximity, grace, autonomy.
- Multi-region as proximity. Not a checkbox. Edge isn’t just latency; it’s empathy at scale—continuity at the point of impact.
- Design for disconnect. Assume intermittent links. Queue locally. Sync when possible. Never lose user intent to a blip. Circuit breakers should fail with grace, not silence.
- Instrument outcomes. Log “prescription filled,” “lunch money transferred,” “gift purchased.” Quiet failures surface where leaders can act.
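A circuit breaker that fails with grace, not silence, might look like this. A minimal sketch with illustrative thresholds and a simulated slow dependency (`lookup_price` and `default_price` are hypothetical): after repeated failures it stops hammering the dependency and returns a degraded but honest answer.

```python
class CircuitBreaker:
    def __init__(self, threshold=3):
        self.failures = 0
        self.threshold = threshold

    @property
    def open(self):
        return self.failures >= self.threshold

    def call(self, fn, fallback):
        if self.open:
            return fallback()      # degrade gracefully instead of going silent
        try:
            result = fn()
            self.failures = 0      # success closes the breaker again
            return result
        except Exception:
            self.failures += 1
            return fallback()

def lookup_price():
    raise TimeoutError("pricing service slow")   # simulated hidden dependency

def default_price():
    # Sensible default plus a human-readable note for staff and customers.
    return {"price": "shelf price", "note": "Some features are slower today."}

breaker = CircuitBreaker()
for _ in range(5):
    result = breaker.call(lookup_price, default_price)

print(breaker.open)        # True: the breaker has opened
print(result["note"])
```

The fallback carries the one-sentence script from the retail story above; the code path that fails is also the code path that reassures.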
4: Turn crisis response into executive muscle memory
Every outage is a leadership test. Technical resilience depends on cultural resilience.
- Lead with honesty. Communicate early. Resilience is governed, not improvised.
- Reward transparency over perfection. Make it safe to say, “I almost broke production.” Silence is a contagion.
- Decide before you’re tired. Pre-approve who speaks to customers, regulators and media. Keep offline contact trees and an IR retainer. Set business-function priorities. Run tabletops that force decision rights—not just test runbooks.
5: Build governance that connects the dots
When ownership fragments, unify accountability.
- One operating rhythm, many frameworks. Use COBIT for decision rights and oversight; ISO 22301 for continuity. Embed both in a digital-trust narrative that leaders can own.
- Make reliability auditable. Add reliability reviews to the audit cycle. Have the committee review “promises kept to customers” with the rigor of financial controls. Treat error-budget work as a strategic investment, not tech debt.
- Map authority before you need it. Document who can stop changes, notify customers and approve emergency fixes—so the next call isn’t a debate.
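An authority map can live as reviewable configuration rather than tribal knowledge. A sketch with hypothetical roles and actions (none of these names come from a real framework):

```python
# Pre-approved decision rights, documented before the incident.
# Roles and actions are illustrative.
AUTHORITY = {
    "halt_changes":      ["head_of_engineering", "incident_commander"],
    "notify_customers":  ["comms_lead"],
    "notify_regulators": ["chief_risk_officer"],
    "emergency_fix":     ["incident_commander", "on_call_sre"],
}

def may(role: str, action: str) -> bool:
    """True if this role is pre-approved for this action."""
    return role in AUTHORITY.get(action, [])

print(may("comms_lead", "notify_customers"))   # True
print(may("on_call_sre", "notify_customers"))  # False
```

Keeping the map in version control means it gets reviewed like any other change, and the tabletop exercises in the previous section have something concrete to test against.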
From uptime to trusttime
Leadership disengagement—not architecture—sits at the root of many failures. Start with one move: make error budgets count disappointed customers, or pre-approve who speaks when things break. Resilience isn’t about staying online—it’s about staying worthy of trust when you go offline.