Failure as a Catalyst for Systemic Quality
My family and I moved to Austin, Texas about two months ago, and as soon as we were settled, my daughters and I set upon finding the most suitable swimming hole.
That is no simple task: the city is filthy with awesome swimming holes, and one afternoon, at Bull Creek, I witnessed a two-year-old being washed underwater and downstream into the rapids.
I ran in, retrieved the child, and returned him to his parents. It happens just that fast, and water is a silent killer.
I started thinking: maybe what this area needed was a safer swimming hole, and so I started piling stones.
What I had in mind was a safe little area for small children, contrived in such a way as to preclude a repeat of the above-described scenario.
That was about three weeks ago, and what began as a modest afternoon project has since evolved into a dam of considerable size, given the environmental constraints.
I’m looking at it now, and there could well be four to five tons of stone. I wish I could claim to have moved it all myself, but in truth this dam has been a great collaborative labor, with most of the participants between about seven and eleven years old.
Without fail, its structural integrity is compromised, either through sabotage on the part of children, who enjoy tipping over stones (no biggie: kids will be kids), or through nature (because water doesn’t give a shit about your dam; in the broad sweep of time, water is going to do what water’s going to do, and there’s nothing you can do about it).
The earth abides.
After each structural collapse, the effort to repair and rebuild has only produced a more resilient structure, which is probably how it’s grown so large in the first place.
In fact (and this is the point of this blog post), I contend that it could never have become so strong had it been built all at once, without all the intervening collapses and structural failures.
The failures, the various collapses, the sabotage: all have played a vital role in the creation and evolution of this collaborative effort.
Henry Petroski, the Aleksandar S. Vesic Professor of Civil Engineering and a professor of history at Duke University, wrote a book called “To Engineer Is Human,” and it’s an exquisite piece of work.
He contends that you can teach engineering to ordinary people because the fundamentals of engineering are inherent to human nature.
And he demonstrates this through failure. He’s one of my favorite authors, by far.
In his book, he describes one catastrophe after another, and in each example illustrates the failures through foundational components of the engineering domain.
For example, the Tacoma Narrows Bridge was literally torn apart by wind back in 1940.
In retrospect, the question was asked: “How in the world did engineers fail to account for the strength of wind?”
Professor Petroski does an exquisite job of explaining that humans have been building bridges for millennia, but until the Tacoma Narrows, building materials had never been light enough to expose the relative destructive power of wind.
Somewhere along the way, a threshold had been crossed. It literally took a failure to expose a vulnerability that needed to be addressed.
The weight of the materials had decreased, rendering the structure vulnerable to wind, but construction methodologies had not yet evolved to accommodate that vulnerability.
What’s almost more interesting is how the engineering world responded in the wake of the disaster.
Engineers turned around and reviewed all of the projects that had been conducted using lightweight materials, and found that a shocking number of recently-built structures were likewise vulnerable to destruction.
They rolled up their sleeves and got to work, augmenting the design and construction to address the vulnerability.
Build something. Wait for it to fail. Repair. Lather, rinse, repeat.
Failure is essential to a successful and sufficiently resilient system.
Failure must be embraced as a necessary part of the journey.
Frequently, failure is the catalyst for an acceleration of inspired creativity which results in a dramatically improved systemic design.
Which means: you could just sit and wait for something to fail, but wouldn’t it be more fun to blast through a system with a lot of wind, and watch what happens when things start to break?
This is frequently referred to as a “stress test,” and I’m a pretty big fan.
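If the idea of deliberately breaking things sounds abstract, here’s a minimal sketch of a stress test in Python. (The `TinyService` class, its hidden `capacity`, and the `stress_test` ramp are purely illustrative stand-ins, not a real API.) The point is simply: increase the load until something breaks, and let the failure tell you where the limit is.

```python
class TinyService:
    """A toy stand-in for any system under test, with a hidden capacity limit."""

    def __init__(self, capacity=100):
        self.capacity = capacity

    def handle(self, load):
        # The service fails once the offered load exceeds its capacity.
        if load > self.capacity:
            raise OverflowError(f"failed at load {load}")
        return "ok"


def stress_test(service, start=10, step=10, limit=1000):
    """Ramp up load until the service breaks; return the breaking point."""
    load = start
    while load <= limit:
        try:
            service.handle(load)
        except OverflowError:
            return load  # the failure tells us exactly where to reinforce
        load += step
    return None  # survived the whole ramp


print(stress_test(TinyService(capacity=75)))  # finds the first failing load
```

In a real system the “load” might be concurrent connections, request rate, or packet size, but the shape of the exercise is the same: you learn far more from the load that breaks the system than from all the loads that don’t.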
Separate but related aside: my stepfather was the CFO for the state of Oregon’s fire response, and until retirement had served in a variety of senior accounting or CFO roles.
The Chief Financial Officer’s primary consideration is the preservation of corporate assets, and my stepfather pointed out that most people erroneously consider CFOs to be inherently conservative.
Not so, he contends. He describes himself and his peers as “creative destructionists,” and explains how bureaucracy and infrastructure tend to expand entropically; left unchecked, they become so bloated in the aggregate that they consume corporate assets.
Note that in this context the word “corporate” does not limit this example to the private sector.
The example is also applicable to the tragedy of the commons, which results from irresponsible stewardship of public resources.
Therefore, because his role was to ensure the preservation of corporate assets, his acts of “creative destruction” kicked off when he refused to renew a service contract, or cut the budget of a particular group, or whatever.
The way he puts it: now they have to choose. It’s simple!
Either they learn to do more with less, or they figure out how to integrate the large number of siloed solutions to consolidate costs.
Of course, he says, those on the receiving end of this experience raise a real fuss, but that’s expected. Most people only know how to complain.
Performed responsibly, the act of “kicking out a tent pole or two” might seem scary, but it forces those responsible for the resources to rebuild, and ideally what results is a stronger structure.
Within an organization, choices have to be made: What’s vital? What’s optional? Are there opportunities to integrate two or more overlapping resources? Are we fully utilizing what we currently have to consolidate our tool-set and save costs?
Most people understand these things are possible, but until a crisis manifests, there’s no impetus to pursue the systemic evolution that addresses a vulnerability before it manifests as a catastrophe.
Back to the dam at Bull Creek.
In the evolution of this swimming hole and its crowdsourced “dam” of stones, there have been times when I have probed (or even purposely created) a structural vulnerability, resulting in collapse.
In every instance, the structure which evolves in response to the vulnerability is stronger than the one it replaced.
I’ve removed vital supports, and have even built a dam upstream to release a temporary rush of water and identify weaknesses. I’ve even encouraged kids to take the thing apart, because kids love that kind of purposeful destruction.
The water rushes through the dam when it is breached, and there’s a clamor. The roar of the water as it rips the structure apart sounds like tragedy.
But the sound also draws attention to that which needs to be repaired.
Sometimes I’ve noticed that the larger stones move in response to the rapids and assemble themselves in a manner not unlike an arch and its keystone, almost by accident creating something stronger than would’ve been possible under the original method.
So it is with organizations, and with software, as it turns out.
I’ve spent a lot of time in the security industry, and people frequently ask why there are so many ways for hackers to breach software or systems.
As an aside, not everybody is a hacker. A dramatic majority of those who consider themselves “hackers” merely use the tools created by others without a comprehension of how they actually work.
But a true hacker is someone who comprehends the system and therefore can discern its systemic vulnerabilities, which frequently manifest as defects (or bugs, using industry parlance).
Back in the ’90s I was responsible for a product that suffered an alarming systemic failure: the device’s ability to connect to the network would fail. That might not seem like a very big deal, but we had customers with literally hundreds of thousands of these devices not functioning, and they were livid.
I spent a lot of time trying to figure out what was going on, and the answer actually came to me in a dream: as it turns out, there were network “packets” larger than the “legal” size, and they were killing each device’s network stack.
It was as if the device itself were rendered both mute and deaf by a blast of network activity far larger and “louder” than what was legally possible (not literally “louder,” but we’re talking technology here, and I’m not going to force you to learn how network protocols function).
When I brought my hypothesis to my team, they dismissed me, patiently explaining to me that what I was describing was not possible.
I tried to find network test equipment to test my hypothesis, and found that even the most expensive gear could not generate this particular scenario, because it wasn’t “legal” under the agreed-upon network protocols. That’s a fairly profound statement, considering I was working for a manufacturer of industry-leading test equipment.
So. Long story short, I hacked a network card so it could send an oversized packet, deployed it, and it killed almost every device in the marketing group, which freaked me out and made me think maybe they were going to fire me.
Of course, they did not, because I had found the root cause of a failure which was costing the company one hell of a lot of money.
There was a company that provided little microchips for networking, and the logic within counted the “size” of networking traffic but did nothing to enforce the rules regarding the maximum size of a packet.
So when something too large came across the wire, the little microchip just kept reading it into the device, causing a crash.
And due to the low cost of this particular microchip, it had been integrated into a huge variety of devices, including laptops, desktops, printers, etc.
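The defect is easy to sketch. Here’s a toy model in Python (not the chip’s actual firmware; the 1,518-byte classic Ethernet maximum is my assumption about the “legal” size in play): the flawed receiver counts bytes but never enforces the limit, while the fixed one drops any illegal, oversized packet before it can do damage.

```python
MAX_LEGAL_PACKET = 1518  # classic Ethernet maximum frame size, in bytes


def receive_unchecked(packet, buffer_size=MAX_LEGAL_PACKET):
    """Mimics the flawed chip: counts size but never enforces the maximum."""
    buffer = bytearray(buffer_size)
    for i, byte in enumerate(packet):
        # An oversized packet runs right past the end of the buffer,
        # raising IndexError -- the toy equivalent of crashing the stack.
        buffer[i] = byte
    return len(packet)


def receive_checked(packet, buffer_size=MAX_LEGAL_PACKET):
    """Enforces the rule the standard assumes: drop illegal, oversized packets."""
    if len(packet) > buffer_size:
        return None  # drop the packet instead of crashing
    buffer = bytearray(buffer_size)
    buffer[:len(packet)] = packet
    return len(packet)


oversized = bytes(2000)  # larger than any "legal" packet
```

Calling `receive_unchecked(oversized)` blows up, while `receive_checked(oversized)` quietly drops the packet and keeps running, which is the one-line validation the chip’s logic was missing.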
It took the failure of our product to expose a defect that had actually impacted a number of other vendors, ultimately manifesting in millions of devices (and millions of dollars).
If you want to get technical about it, the vendor did nothing wrong. They had built logic that conformed to industry standards, and from that position they argued that everything was working as designed.
But as we all know, that’s not how the world works.
Expect the unexpected. Embrace a failure which should never, ever happen.
Prepare for this inevitable failure by orchestrating stress tests of your system.
And as you build your organization, be careful about adding to the team those who consider failure something to be avoided.
While interviewing, ask hard questions about how they have responded in the wake of unanticipated failure.
Ask them to describe a truly embarrassing failure in their career, and how they responded.
Look for examples of how they have learned to embrace failure as a method of creating a more resilient, lean, and capable solution.
Because failure is going to happen. There’s no getting around it.
Final point: when challenges manifest within your organization, watch your leaders.
They don’t get to make excuses. It may not be their fault, but it’s still their responsibility.
It’s that simple.