Enterprise reactions to cloud and internet outages
Let’s face it, the last couple of months haven’t been great for the cloud. In October, both Amazon’s AWS and Microsoft’s Azure had widely publicized, highly impactful failures. In November, a Cloudflare outage took down a big chunk of websites, effectively closing some businesses.
I even had problems getting a haircut because the salon website was down with Cloudflare-itis and I couldn’t join a waitlist. Of course for enterprises, haircuts were the least of their worries. The cloud was too much of a risk, some said. The internet itself was at risk from cloud problems, and from problems of its own.
Those in the C-suite, not surprisingly, “examined” or “explored” or “assessed” their companies’ vulnerability to cloud and internet problems after the news. So what did they find? Are enterprises fleeing the cloud they now see as risky instead of protective? A total of 147 enterprises have offered comments to me, of which 83 offered highly technical suggestions. Here’s how they shook out.
First, none of the enterprises said they planned to abandon the cloud, and everyone thought any notion of leaving the internet world was too silly to even assess. This, despite the fact that 129 said that they believed both the cloud and the internet to be less reliable than they had been, and 24 said they did plan to take measures to reduce their vulnerability to outages. All the enterprises thought the dire comments they’d read about cloud abandonment were exaggerations, or reflected an incomplete understanding of the cloud and alternatives to cloud dependence. And the internet? “What’s our alternative there?” one executive asked me. “Do we go back to direct mail or build a bunch of retail stores?”
What caused the reliability of the cloud and internet to decline, in the perception of those who thought it had? A majority said that they believed the providers, to improve their profit margins, were running their servers too close to the edge. The next top problem cited was “an imbalance between the quality of staff tools and their mission,” and the remainder just thought that infrastructure was getting more complex, without saying exactly how, or why that was impacting reliability.
Of those who offered technical comments, all but two said the issue really is complexity. Virtualization in any form is more operationally intensive because it’s a multi-layer concept, and a secure and reliable global internet depends on layers of technology and administration. More layers, more disconnected things to go wrong. A little issue with something so many depend on necessarily has a massive impact.
Virtualization, in a data center of an enterprise or a cloud provider, is a three-layer process. The bottom layer is the resource pool, the servers and platform software. There’s a set of management tools associated with them. The top layer is a “mapping” layer that creates the virtual elements from the resource pool and exposes them for applications and application management. Astride both, in parallel, is the network layer, which provides connectivity throughout. This layer is run by different people, and in fact different teams.
The enterprise experts pointed out that the network piece of this cake had special challenges. It’s critical to keep the two other layers separated, at least to ensure that nothing from the user-facing layer could see the resource layer, which of course would be supporting other applications and, in the case of the cloud, other companies. It’s also critical in exposing the features of the cloud to customers. The network layer, of course, includes the Domain Name System (DNS), which converts the domain names in our familiar URLs to actual IP addresses for traffic routing; it’s the system that played a key role in the AWS problem, and as I’ve noted, it’s run by a different team.
The internet is layered, too. We have ISPs who offer access, a mixture of commercial and government players who provide and manage global connectivity, physical and logical (URL) addressing, and security offerings at the content end, the consumer end, and (as Cloudflare shows) as an intermediary process. It’s like a global dance, and even if everyone has the steps right, they can still trip over each other.
Complexity is increasing in every one of our layers in both the internet and cloud, because the business case depends on efficient use of resources and reliable quality of experience. Operations is more likely to be a problem in each of the layers, and given their interdependence, cooperation in operations is essential. Yet we separate the people involved. All the layer-people dance, and sometimes trip up on the crowded floor.
Why not combine the groups? Enterprises ask: “What good is a ‘single pane of glass’ when three or four groups are all trying to see through it, looking for different things?”
Enterprises don’t see the notion of a combined team, or an overlay, every-layer team, as the solution. None of the enterprises had a view of what would be needed to fix the internet, and only a quarter of even the virtualization experts expressed an opinion on what the answer is for the cloud. That group agrees with the limited comments I’ve gotten from people I know in the cloud provider world. The answer is **templates, simulations, and world models**, and I think that could work for the internet too, since many of our major internet issues, including Cloudflare’s issue, really come down to software configuration and operations.
The idea here is to prevent issues from developing by modeling the entire system, using real-world machine learning to add in experience with usage, traffic, conditions, and QoE, then using the model to come up with a list of steps needed to implement a desired change or respond to a problem. This template of steps would be simulated on the model, and the template and simulation results reviewed by the operations team or teams involved. Then when the steps are executed, the result of each step would be checked against the simulation and any discrepancies would halt the process, reverse the step if needed, and alert all the teams to convene for a review. Simulation, used to forecast reaction to potential rather than real issues, might even help in software-error-driven problems like the recent Cloudflare outage by pointing out risks of cascade faults and identifying remedies.
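To make that loop concrete, here is a minimal Python sketch of the simulate-then-verify idea. It is an illustration under stated assumptions, not anything an enterprise or provider described to me: the `Step` class, the `simulate` and `execute` functions, and the flat dictionary of metrics standing in for the "world model" are all hypothetical names. The human review of the template and simulation results would sit between the two calls and isn't modeled here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a system model: a flat dict of metrics.
State = Dict[str, int]


@dataclass
class Step:
    """One entry in a change template (hypothetical structure)."""
    name: str
    apply: Callable[[State], State]    # makes the change
    revert: Callable[[State], State]   # undoes the change


def simulate(model: State, template: List[Step]) -> List[State]:
    """Run the template on a copy of the model; return the predicted state after each step."""
    predicted = []
    state = dict(model)
    for step in template:
        state = step.apply(dict(state))
        predicted.append(dict(state))
    return predicted


def execute(live: State, template: List[Step], predicted: List[State],
            alert: Callable[[str], None]) -> Tuple[State, bool]:
    """Apply steps to the live state, checking each result against the simulation.

    On the first discrepancy: reverse that step, alert the teams, and halt.
    Returns the resulting state and whether the whole template completed.
    """
    state = dict(live)
    for step, expected in zip(template, predicted):
        result = step.apply(dict(state))
        if result != expected:
            state = step.revert(result)
            alert(f"halted at {step.name}: expected {expected}, got {result}")
            return state, False
        state = result
    return state, True


# Assumed example: a two-step change, simulated against the model first.
template = [
    Step("add_nodes",
         lambda s: {**s, "nodes": s["nodes"] + 2},
         lambda s: {**s, "nodes": s["nodes"] - 2}),
    Step("raise_limit",
         lambda s: {**s, "limit": s["limit"] * 2},
         lambda s: {**s, "limit": s["limit"] // 2}),
]
model = {"nodes": 4, "limit": 100}
predicted = simulate(model, template)

alerts: List[str] = []
# Live system matches the model: every step checks out.
final, ok = execute({"nodes": 4, "limit": 100}, template, predicted, alerts.append)
# Live system has drifted from the model: the first step is reversed and the run halts.
drifted, ok2 = execute({"nodes": 3, "limit": 100}, template, predicted, alerts.append)
```

In the second run the live system has one node fewer than the model assumed, so the first step’s result doesn’t match the simulation; the step is rolled back, the process stops, and a single alert is raised. A real version would need a far richer model and a smarter comparison than strict equality, but the control flow is the point.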
It might seem that this approach keeps people, the people who perhaps made the mistakes in the first place, too involved. Everyone disagrees with that, though. One operations manager said, “It’s my *** if things go wrong, so I’m going to be sure I have the last word on anything.” In general, operations types in all the verticals and in the cloud provider and even telco worlds agree that a purely automated strategy, if it goes wrong, would likely create something even operations professionals might be unable to fix. The risk of a massive and persistent outage, an AI disaster created at superhuman speed, is simply too great. Overall, among enterprises, fewer than ten percent of operations specialists think we’ll ever get to the point where purely automated processes will run the network, a virtualized data center, or the cloud. Among cloud providers and telcos, it’s less than half that number.
So what are enterprises going to do about cloud and internet problems? I think the answer may be found in (of all places) my haircut. Did I resort to home hair-cutting, let my hair grow long, make a change in barbers to avoid the risk of having to wait? No, I went to the salon with no appointment, ready to stand in line. Instead I found that nobody else had come, and I walked in.
Does that mean that enterprises should go back to the old way, forget the internet? Or does it mean that they should just work through an outage, knowing that it will end and they’ll survive it? It means that sometimes there isn’t really any choice because there really aren’t any options. Businesses say that using the cloud, or the internet, may mean accepting some major headaches, but they survived them in the past, so they’ll bet they can survive them in the future. Business as usual is safer, easier to justify to your boss.
And, if you’re worried, my haircut looks fine, but I’ll be online to join the waitlist next time I need one.