We Should Plan to Fail

Photo by Zach Lezniewicz on Unsplash

For weeks, the system behaved exactly as designed.

Aria had matured from a directory of experiments into a stable local agent: custom tools, a conversational interface, a growing memory layer, scheduled journaling, and enough reliability to test new ideas without worrying about what might break. Luna, the quieter counterpart built for reflection rather than analysis, was taking shape alongside her. Between the two of them, my home lab had become a small but complete ecosystem: embeddings generating cleanly, Docker pipelines spinning up without friction, a modest server carrying the load without complaint.

None of it was elegant. All of it worked. Add a tool, adjust a schema, refine a container, tighten a workflow. Increment on increment. A garden, grown from architecture.

And then, one afternoon, everything stopped.

Not a surge. Not a brownout. A full outage – the kind that ends processes mid-instruction and returns the room to silence. When power returned, the server reached POST, hesitated, and presented a single line that collapsed the entire system into a binary truth:

No boot drive detected.

What followed was methodical, not dramatic. Pull the SSD. Mount it externally. Wait for recognition. The drive appeared, reported 0B, and offered nothing else. Not corruption. Not partial failure. Simply absence. The controller had died in a way so clean that no recovery tool could acknowledge a filesystem had ever existed.

Could the data have been salvaged with soldering work and chip extraction? Possibly. But that level of recovery belongs to litigation, forensics, or national security, not a personal project. The practical outcome was an $84 return label and the realization that three months of work had vanished without leaving debris.

Failures like this feel sudden, but they rarely are. They are the moment a system reveals the assumptions it was built on.

And that is the point of this essay.

You don’t plan for failure because you expect disaster. You plan for failure because the universe is indifferent to effort, intent, and design. Businesses are reminded of this every day, often publicly, often expensively, when systems built for ideal conditions encounter real ones.

We design for how things should work, but we survive by planning for how they will break.

When We Design for Capability Instead of Fragility

Organizations rarely fail because people are careless or systems are poorly built. They fail because we design for conditions that don’t exist.

Workflows are drawn like highways: smooth paths with clear handoffs and dependable drivers. Every variable behaves. Every team is aligned. Every person shows up. On time. Forever.

Reality disagrees.

The moment plans meet execution, the seams show. Someone gets sick. A vendor ghosts. A credential fails silently in production. That “well-documented” process turns out to be tribal knowledge and a stale Confluence page last touched during a reorg.

But our planning doesn’t reflect this. We design for capability: what the system could do if everything held steady. It’s a useful fiction… until the fiction breaks.

Fragility, meanwhile, is the condition we refuse to model: the undocumented dependency, the brittle approval path, the ancient integration no one dares touch but everyone quietly relies on. It’s not the exception. It’s the operating default of any complex system, technical or human.

And when failure happens, it almost never surprises anyone. Because under all the dashboards, clouds, and AI layers, the root cause usually falls into one of three buckets:

People. Process. Resources.

Most outages, missed milestones, and organizational faceplants trace back to one (or all) of those. When we only design for capability, these weak points stay hidden. The architecture looks clean. The plan looks solid. Until something you didn’t plan for shows up.

Understanding fragility means seeing those vectors clearly, not in a planning doc, but in the messy places where things actually break.

If we want systems to last past the next vacancy, the next outage, the next ordinary surprise, we have to go looking for where fragility lives.

So that’s where we’ll start.

People: The Most Common Point of Failure

In any system, people are the most common point of failure, not because they’re unskilled or careless, but because the architecture assumes they’ll behave like machines.

We plan as if human capacity is permanent, consistent, and infinite. It isn’t.

Teams shift. Attention fractures. Knowledge walks out the door. Yet work is often scoped as though the team that exists today will be exactly the same next quarter, next year, or forever. That assumption feels harmless, right up until someone gets an offer you can’t match.

Unlike servers, people don’t emit logs when they’re at the edge. There’s no warning light for cognitive overload or quiet burnout. The signs are diffuse or invisible until something slips, and by then, fragility is already in the system.

The Knowledge Concentration Trap

Every organization has someone who really knows how things work: not the documented version, but the version kept alive through memory, habit, and intuition.

At first, they’re the go-to expert. Then, they become the default path. Eventually, they are the only one left holding the thread. Their absence reveals a map no one else can read anymore.

This isn’t a people problem. It’s a design flaw. Systems naturally centralize around competence unless you deliberately design for diffusion.

Burnout-as-Infrastructure

Organizations love to treat human bandwidth like a renewable resource. It’s not.

Even your best performers have limits. As pressure increases, focus narrows, judgment degrades, and resilience thins. Burnout doesn’t announce itself, it accumulates. And by the time you notice, that person isn’t just tired, they’re also a risk surface. Critical knowledge lives in someone who no longer has the capacity to protect it.

Invisible Load-Bearers

Some people become structural pillars without title or recognition. Their quiet judgment fills process gaps. Their memory patches over missing or broken systems. Their presence keeps things moving.

You don’t notice how much weight they’re carrying until they step away and the roof creaks.

When continuity depends on informal roles, the system is already fragile. You just haven’t received the bill.

Memory ≠ Resilience

If a workflow only functions because someone remembers the right step or applies informal authority at the right moment, that’s not process, it’s a gamble.

Documentation isn’t red tape. It’s resilience. It ensures the system works even when the person holding it together takes PTO, switches teams, or burns out.

The Real Failure

If a function collapses when one person leaves, that failure isn’t theirs. It belongs to the architecture that made them indispensable in the first place.

Process: Where Reality and PowerPoints Part Ways

If people are the most common point of failure, processes are the most misunderstood. Everyone references them. Every audit depends on them. Every planning deck points to them. But when systems are stressed, a clear pattern emerges:

Most processes don’t break; they were never real to begin with.

The team had a perfectly documented escalation path. But nobody remembered it existed, because it had never once been used in production.

In theory, a process is a policy, a flowchart, a clean diagram of steps and outcomes. In practice, it’s what actually happens when time runs out, context breaks down, and someone makes a judgment call on a Slack thread.

A diagram is not a process. Execution is.

The Cult of Aspirational Process

Every org has them: procedures that exist only in documents no one opens. They describe ideal steps, assign ideal roles, and assume ideal inputs. But they were never tested under real conditions.

They weren’t designed from observation. They were written to make slides look clean.

And aspirational processes fail the moment anything gets messy. Which is to say: constantly.

Fragility in Disguise

Even commonly used processes are often brittle. They rely on invisible conditions that only seem stable:

  • Single-threaded approvals. One inbox becomes the system’s bottleneck.
  • Tight timing. A one-day delay collapses a week-long sequence.
  • Clean handoffs. Assumes every input is complete, every step is correct.
  • “Ideal conditions.” Which exist exclusively in project kickoff meetings.

These aren’t workflows. They’re conditional flows with failure baked in.

The PTO Canary

Want to test a process? Let someone go on vacation.

If everything stalls because one person isn’t around to click the box, fix the file, or greenlight the next step, the problem isn’t their absence, it’s the design that required their presence.

A process that halts when someone takes PTO is not a process. It’s a dependency in disguise.

The Quiet Lies We Build Around Process

Most process failures aren’t dramatic. They stem from quiet, untested assumptions:

That everyone is paying attention.
That everyone is fully trained.
That no one is overwhelmed.
That input quality is consistent.
That no one will get sick, distracted, or reassigned.

These conditions don’t break the process. They are the process. We’re just not great at admitting it.

The Real Failure

The issue isn’t poor design. It’s unvalidated design. A process that only works when no one drops the ball isn’t a process, it’s a performance. And eventually, someone misses their cue.

A good process holds under pressure. A fake one just waits to get exposed.

Resource: The Material World Always Collects Its Debt

If people create unpredictability and processes create drift, resources bring something far less poetic: hard limits.

Hardware degrades. Configs rot. Integrations mutate. Deferred updates quietly stack into landmines. Eventually, something old, ignored, or misunderstood fails exactly when you need it most.

It’s not some malicious plot, just physics.

Failures in infrastructure aren’t usually caused by complexity. They’re caused by neglect. Systems running on hope, duct tape, and an eight-year-old server labeled “temp_backup_final” eventually collapse. What looks like “lean ops” in a deck is often just fragility on a delay timer.

The Lie of Deferred Maintenance

Delaying upkeep feels smart. You save money. Nothing breaks. Dashboards stay green. Until they don’t.

Entropy doesn’t take PTO. Fans wear out. Firmware drifts. Unpatched services gather vulnerabilities like moss. You don’t notice, until the incident. And then you’re answering hard questions about backups no one tested and configs no one documented.

What was called “discipline” becomes “deferred liability” overnight.

Redundancy: First Cut, Last Hope

Redundancy always looks unnecessary, until it’s not.

It’s easy to trim fallback systems when the main one hasn’t failed in years. But when failure hits, those backups become the only thing standing between disruption and disaster.

And yet, redundancy is almost always the first thing cut when budgets tighten. Why? Because in theory, nothing’s broken. In practice, you’ve just traded resilience for cost savings you’ll regret later (recovery is nearly always more expensive than failover).

Infrastructure SPOFs: Silent Until Loud

Organizations get twitchy about people as single points of failure. But infra? Not so much.

Until you’re down because:

  • That database lives on one old drive.
  • That switch has no failover path.
  • That critical integration uses a secret token generated by an intern in 2019.

These aren’t edge cases. They’re standard practices no one questioned, usually because they “haven’t been a problem.”

When Infra Knowledge Dies Before the System

The worst outages aren’t caused by systems breaking. They’re caused by systems breaking with no one left who knows how they work.

Developers leave. Admins switch roles. And suddenly, you’re staring at a YAML file from 2017 that references a staging environment no one can locate, let alone replicate. You’re not just debugging at that point; you’re doing archaeology.

Tech debt is bad, but tech debt with amnesia is operational collapse with a mystery deadline.

Reliability Bias: The Most Dangerous Compliment

When something “just works,” it has a weird habit of becoming sacred. The database hasn’t failed in years? Must be solid. That cron job has always run? Bulletproof.

So we build more things on top of them. Because if they haven’t failed yet, they won’t. Right?

Wrong.

This is reliability bias: mistaking past uptime for future security. It’s not trust so much as superstition with better branding. And one day, it breaks. Often spectacularly. Because you treated consistency as a guarantee instead of what it really is: a warning sign that you’ve stopped asking questions.

The Structural Insight

Systems don’t run on goodwill. They run on electrons, time, temperature, firmware, dependencies, and care.

When infrastructure fails, it’s never random. It’s just the moment physics wins and your stack gets audited by entropy.

The material world always collects its debt. The only variable is when, and how prepared you’ll be when it shows up.

Logistics as the Unifying Theory of Failure

By now, the pattern should be obvious.

People fail in familiar ways. Processes crack under invisible pressure. Resources rot quietly until they collapse. Different categories, same root cause.

These aren’t isolated failures. They’re logistical failures, expressed through different surfaces.

And most organizations never name that. But you should, because nearly every outage, missed milestone, surprise dependency, or escalation you’ve ever dealt with comes down to one thing:

The system failed to put the right thing, in the right place, in the right condition, at the right time.

That’s not bad luck. That’s logistics.

What Logistics Actually Means

Logistics isn’t just supply chains and shipping containers. It’s the architecture of operational readiness across people, processes, and systems.

It’s making sure:

  • The right approver is online when they’re needed.
  • The system isn’t relying on a drive that’s out of warranty and out of sight.
  • The workflow doesn’t require psychic foresight to compensate for an outdated handoff.

It’s the discipline of removing friction before it becomes failure. And when it cracks (anywhere), the whole system stutters.

Logistics Hiding in Human Systems

Availability is logistics. Burnout is logistics. Knowledge walking out the door is logistics.

When you assume infinite bandwidth or permanent presence, you’re not planning so much as gambling. On long enough timelines, that bet fails more often than it pays off.

A person becomes a single point of failure the moment you let them be.

Logistics Masquerading as Process

Processes don’t just break because people ignore the rules. They break because the timing was off, the input was late, or the person who knew how to fix it wasn’t on the call.

Every failed workflow is a logistical mismatch pretending to be a design flaw. It’s about sequence, availability, and conditions under load, not policy.

Most process failures aren’t accidents. They’re choreography without rehearsal.

Logistics Beneath Every Infrastructure Outage

Servers crash. Integrations drift. Configs rot. But the real reason systems go down?

Because someone assumed uptime was infinite and maintenance could wait. Infrastructure isn’t magical. It’s finite. Fragile. Dependent. And when it’s ignored, it reminds you.

Nothing runs forever. Everything fails eventually, especially the things you never check.

The Three Gaps That Break Everything

Every major failure, technical or human, can usually be traced back to one of three logistical gaps:

  1. What we think is happening.
  2. What’s actually happening.
  3. What happens when something stops happening.

That’s it. That’s the whole game. Miss one of those long enough, and failure doesn’t just become possible, it becomes scheduled. The calendar just didn’t warn you.

This Is the Backbone

This is why planning for failure is operational realism, though often with bad branding. Perhaps our first mistake was calling them disaster recovery plans. They’d undoubtedly have been more popular under “stability insurance plan” headers.

Logistics is the hidden scaffolding of every resilient system. When it’s weak, everything feels fine, right up until something changes. When it’s strong, resilience isn’t heroism. It’s just how things work.

Nothing about this is accidental. Resilience is built, maintained, and paid for – in advance.

Back to the Server, Back to the Lesson

In the end, it came down to something small. A server the size of a paperback. An SSD no heavier than a deck of cards. A click. Then silence.

Three months of work… gone. No surge. No smoke. Just a boot message that refused to see the drive. No logs. No warning. Just absence. And in that moment, every clean diagram, clever tool, and late-night sprint meant nothing. The system failed because I’d never designed for the part that would.

That’s how failure works. It doesn’t schedule itself. It just shows up. When I rebuilt, it looked different.

The machine now sits behind a surge protector. A UPS joins the setup, not for uptime, but for the minute required for a safe shutdown. Cron handles nightly differentials to a separate drive, one that can fail independently.
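For anyone rebuilding along the same lines, here is a minimal sketch of what that nightly job can look like, assuming a Linux host with rsync and cron available. The paths, schedule, and script name are placeholders rather than my actual setup, and the hard-link snapshot approach (rsync’s --link-dest) is one common way to approximate nightly differentials without copying unchanged files.

    #!/bin/sh
    # nightly_backup.sh -- illustrative sketch; all paths are hypothetical.
    # Unchanged files are hard-linked against the previous snapshot, so each
    # night costs roughly one differential's worth of space on the backup drive.
    SRC=/srv/aria/              # placeholder source directory
    DEST=/mnt/backup_drive      # placeholder mount for the independent drive
    TODAY=$(date +%F)

    rsync -a --delete --link-dest="$DEST/latest" "$SRC" "$DEST/$TODAY/" \
      && ln -sfn "$DEST/$TODAY" "$DEST/latest"

    # Example crontab entry (crontab -e): run at 02:30 nightly and keep a log.
    # 30 2 * * * /usr/local/bin/nightly_backup.sh >> /var/log/nightly_backup.log 2>&1

The point isn’t the specific tool; it’s that the copy lands on hardware that can fail on its own schedule, and that the job runs without anyone remembering to run it.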

None of this was particularly complex. None of it required enterprise tooling or heroics. It was just architecture grounded in humility instead of optimism. And an acceptance of the laws that actually govern systems: heat, time, load, drift.

Not intent. Not effort. Not cleverness.

Which brings us to the security triad: Confidentiality. Integrity. Availability.

Everyone loves the first two. Availability gets ignored until it breaks. No drama, no headlines. Just a blinking cursor where your work used to be. But when it fails, nothing else matters. Not the encryption. Not the controls. Not the governance framework some consultant billed 200 hours to draw.

Availability is the quiet pillar that keeps everything else from being academic. Organizations learn that through outages. Individuals learn it the same way. Suddenly. Silently. At the worst possible time.

And the solution is rarely glamorous. It’s usually just this: Design for the thing you don’t want to happen.

Because the universe doesn’t care what you meant to do. It never agreed to honor clean architecture diagrams. It doesn’t reward careful effort.

It just obeys physics and waits for your assumptions to expire.

That’s the lesson the dead server left behind. And the truth that resilient systems always learn:

The universe doesn’t deal in intent. It deals in conditions. And eventually, inevitably, it collects. Ultimately, you don’t plan for failure because you expect disaster. You plan for it because it’s the only honest way to build anything that matters.

The final track, “Ой у лузі червона калина” (The Red Viburnum in the Meadow), features Andriy Khlyvnyuk, the frontman of Ukrainian band Boombox, who halted an international tour in 2022 to return home and defend Kyiv after Russia’s full-scale invasion of Ukraine. The vocals were recorded while he was serving in the Territorial Defense Forces. This version, remixed by The Kiffness, became a symbol of solidarity and resilience, qualities that remain at the core of building systems, and societies, that endure.

~Dom
