Inside the meltdown: One thing CrowdStrike and Microsoft can’t fix


I only saw one Blue Screen of Death on Sunday, July 21, across 15 hours of travel via two of the country’s biggest airports, just two days after a botched software update crippled millions of corporate computers running the Windows operating system.

“Maybe things are OK,” I remember thinking as my family took the first steps into New York’s LaGuardia Airport around 9 a.m. Despite the headlines on day 3 of the Great Windows Outage of 2024, the ticketing and baggage area didn’t look too bad.

I should have known better. I’d taken literally two steps inside the building before getting the first of about 3,000 delay emails from Delta over the course of the day, to go along with even more notifications from the Flighty and Fly Delta apps. This wasn’t going to be an easy run home from New York to Florida, something I’ve done dozens of times over the years.

A notification from the Flighty app on an Apple Watch.
The usually excellent Flighty app simply wasn’t designed to keep up with so many airframe swaps; these notifications came in multiple times an hour. Phil Nickinson / Digital Trends

I’m no stranger to flight delays. (I spent 15 hours in the Sky Club at LAX in late January — not something I recommend, despite how good it is.) But this one was different. Weather happens. Mechanical issues happen. They suck, but those all come down to safety. This time? A third-party security vendor botched a file inside of Windows. CrowdStrike should have caught it. Microsoft should have caught it. Neither did until it was too late. While the fix was relatively simple — boot into Safe Mode, or keep restarting the machine until the bad file was replaced — the first-order effects were immense.

It’s the second- and third-order effects where things really went wrong for the airlines. Delta was hit particularly hard — CEO Ed Bastian on Sunday wrote that more than 3,500 flights were canceled through Saturday, and many more on Sunday. “Please come see me at the podium if you need a hug,” our gate agent said around 4:30 p.m. on Sunday as the board refreshed to read CANCELED.

The scene from Gate A7 at Atlanta Hartsfield-Jackson International Airport late in the evening of July 21, 2024.
For many of us at Atlanta’s Hartsfield-Jackson International Airport, there was nothing to do but wait, and hope that the next flight wouldn’t be canceled. Phil Nickinson / Digital Trends

The line for the rebooking desk in Atlanta’s A concourse, one of seven in the country’s busiest airport, was comically (or tragically) long. I sat with one earbud in, on hold with the airline’s reservations line for two hours before giving up. (My brother, who has much higher frequent flier status, at least managed to get a real person to tell him that there was no way I was getting out before midnight, and that the best thing to do was to stick to the assigned flight for now.)

Finally on board in the early hours of Monday, July 22, a flight attendant gave us an idea of what was really throwing a wrench into things: Delta didn’t know where its crews were. That was confirmed later in the day in another news post from Delta, which said that more than half of its IT systems run Windows, and that additional sync time was required even after the affected machines were rebooted.

“Delta’s crews are fully staffed and ready to serve our customers,” the post continued. “But one of Delta’s most critical systems — which ensures all flights have a full crew in the right place at the right time — is deeply complex and is requiring the most time and manual support to synchronize.”

An in-flight entertainment screen on a Delta 757-200, waiting to leave Atlanta.
It was past midnight, but those of us who managed to get onto a Boeing 757-200 were plenty excited about it. Phil Nickinson / Digital Trends

We ultimately made it home at nearly 2 a.m. Tired. A little frazzled. But only about eight hours late, all told. We were fortunate. My brother spent some 30 hours in the Atlanta airport two days earlier, just trying to get home to Pensacola after aborting a trip to the West Coast. No flights. No one-way car rentals. Save for waiting, no other real options beyond someone driving five hours each way for a rescue.

Our stories were just two of thousands — and ours were relatively low-stakes. We didn’t have any kids traveling on their own. We weren’t out a ton of money, beyond a couple of meals we didn’t plan on having in an airport. Our bags made it on the same plane.

The immediate fix for the CrowdStrike failure was pretty simple. CrowdStrike and Microsoft need to have policies in place to mitigate the possibility of this happening again. (It will, of course, happen again.) But as the saying goes — and this is the PG-13 version — poop flows downhill. None of this was the airlines’ fault. But it quickly became their mess to clean up.

And that’s something a simple reboot can’t fix. Even if you do it more than 8 million times.







