Human Infrastructure 362: Outages, Conferences, and a Book Review

THIS WEEK’S MUST-READ BLOGS 🤓

Andree Toonk reflects on the nationwide outage suffered by Canadian ISP Rogers. This isn’t an analysis of that outage, but rather a jumping-off point to consider what anyone in infrastructure operations should be doing to minimize misery when all the things break.

If the Rogers outage interests you, there’s an assessment available here. - Ethan

Intel Fellow Brendan Gregg leverages the CrowdStrike + MS Windows kernel disaster to explain how eBPF is going to make such a scenario far less likely in the future. - Ethan

Forrest Brazeal considers live events in the post-pandemic world. Are they worth it for attendees? How do employers with budgets think about these events? How can you tell a good event from a bad one? If you really, really want to go to an event, how can you justify it to the person who’ll authorize and fund the trip?

One of my takeaways from Forrest’s POV is to avoid tech events where the speakers are celebrities…and that’s it. The good events will “bring in people with battle scars who are sharing their failures as well as their successes.” This has been my experience with the NetworkAutomation.forum’s AutoCon events. Hands-on folks sharing what went well and not so well in their network automation…all of which you can watch here. - Ethan

Dmytro Shypovalov points out that MPLS got complicated to implement over time. State is a problem. Vendor interoperability is another problem. Challenges with RSVP are yet another. He cites more in this well-written and illustrated post. From there, he explains why segment routing should be an improvement over traditional MPLS implementations, but isn’t what it could be because of the way vendors have delivered controllers.

“An SR-TE controller is just a router with some extra functionality like processing BGP-LS and calculating policies with CSPF. It should be even easier than your average MPLS-TE implementation, since there is no need for LSP signaling.

Yet what you see in actual controller implementations is some disgusting bloatware that needs a supercomputer to run and does all kinds of things like network monitoring, automation, netflow collecting and OSS/BSS functionality. Which is great but who asked for any of those on a routing platform?”

He makes a few more points about the foibles of modern SR controllers, then explains the ideal SR controller from his perspective. It’s even got a name: Traffic Dictator. Traffic Dictator “is a routing platform, with configuration resembling a router so every network engineer familiar with SR and BGP can intuitively figure out how to use it.” The article has links to download it, read the docs, and grab a whitepaper. He’s got a pre-built Containerlab setup, too. - Ethan

This is a review of a recent book aimed at network and cloud engineers who need to get a handle on machine learning. The review says the book provides a detailed theoretical background on machine learning, but also offers practical applications relevant to network and cloud pros. It also explores essential issues such as data quality, models, and the ethics of ML and AI. The review concludes that the book is “a valuable resource for any network or cloud engineer looking to stay ahead of the curve. It provides a comprehensive yet practical introduction to ML models, equipping readers with the knowledge and skills to leverage its power for network automation.” - Drew

I love the premise of this blog series. Josh looks back at jobs he did in the early days of his career, explains the solution he came up with at the time, and then describes how he might have done it differently today. It’s a useful learning exercise. In this installment, Josh reviews a project to convert a management system for a wireless deployment in a retail environment that had to provide guest Wi-Fi access. - Drew

Palo Alto Networks: A Leader in single-vendor SASE for the second time.

Palo Alto Networks has been named a Leader for the second year in a row in the 2024 Gartner® Magic Quadrant™ for Single-Vendor SASE. Rated highest on Ability to Execute and furthest on Completeness of Vision. Get more details at https://start.paloaltonetworks.com/gartner-sase-mq-2024.html

TECH NEWS 📣

GitLab, a GitHub competitor that I’ve heard only positive things about, might be up for sale. Potential buyers need access to plenty of capital, as GitLab is reported to be worth about $8B. Datadog is rumored to be interested, but would not go on the record to confirm. Look for an announcement in coming weeks, and hold onto your butts if you’re a GitLab shareholder. - Ethan

Based on my email, I’d have assumed 70%…but anyway…6.8% of the Internet traffic Cloudflare is reporting on is nasty. Vulnerabilities are being exploited more quickly, often within minutes. There are more zero days in the wild. DDoS attacks continue to grow.

Perhaps most depressing of all? “Finally, about 38% of all HTTP requests processed by Cloudflare are classified as automated bot traffic. Some bots are good and perform a needed service, such as customer service chatbots, or are authorized search engine crawlers. However, as many as 93% of bots are potentially bad.” Sigh. Our boring dystopia marches on. - Ethan

402Tbps. Yup. Researchers think 600Tbps is the absolute maximum science would be able to squeeze out of the commercial grade fiber they were using for the test. But commercial grade fiber! That means the stuff already under the ocean might be able to carry these massive amounts of data. - Ethan

CrowdStrike released a preliminary review that says a bug in its testing software caused the testing software to miss a bug in a software update that would go on to crash Windows machines around the world. Apparently when it comes to software, you just can’t win. - Drew

FOR THE LULZ 🤣

RESEARCH & RESOURCES 📒

Networking instructor Ed Harmoush has been publishing a series about networking fundamentals to YouTube. There are 15 videos in the series so far, most about 10-15 minutes long. Well worth your time to watch or share with a colleague. - Ethan

Cake-autorate keeps CAKE up to date with real-time bandwidth availability. If you’re thinking that there’s no point because you have a constant bandwidth available, you’re right. “Cake-autorate is intended for variable bandwidth connections such as LTE, Starlink, and cable modems and is not generally required for use on connections that have a stable, fixed bandwidth.”

Don’t know what CAKE is? I recommend you start by reading about CoDel and going from there. - Ethan

AutoCon2 is coming up fast on November 18-22 in Denver, Colorado and we want to let you know some key dates:

Conference Registration is open NOW!

  • You can get super early bird pricing of only $299 until August 28

  • Hotel registration is open now - grab a room SOON!

Call for Speakers closes July 31

  • We already have the most proposals for talks that we've ever had

Workshop Registration opens August 8

  • We're going to have a great slate of workshop options covering a range of topics in network automation and orchestration

  • Note that it's a separate event conveniently preceding AC2

The Full AC2 Conference Agenda will be published by September 9

NAF is a watering hole - a place where we can have harmonious collaboration in network automation: the practice of network automation, orchestration, observability, AI tooling, education, process and standards, and more. Come hear what your peers are doing in their networks (on the stage and in the hallways), what solution providers are bringing to the table, what's happening with open source, and all things network automation.

AutoCon is THE Forum for Network Automation. See you in Denver!

INDUSTRY BLOGS & VENDOR ANNOUNCEMENTS 💬 

CrowdStrike begins the painful, public process of explaining why what should have been a routine update resulted in what’s being called the largest IT outage in history. There is a substantial amount of technical detail here, if such things interest you. - Ethan

TL;DR. Voltage issues. Fixable with microcode. - Ethan

SonicWall has released a threat report looking at the first half of 2024. Highlights? From the company blog: “Business email compromise (BEC) attacks are on the rise, supply chain attacks and the risks associated with them are increasing and IoT malware is becoming more and more of an issue.” 

I was interested to see that SonicWall has adjusted one of its metrics. Regarding firewalls, the company says it used to count every hit against a firewall, but given the volume of attacks it decided that wasn’t a very descriptive metric. Instead, the company is now counting “the number of hours a firewall is under attack rather than every single hit.” SonicWall compares it to weather reporting. Instead of counting and reporting every drop of rain, it’s telling you that it rained hard in the afternoon. SonicWall says this change “is more consistent, simplifies comparisons and data interpretations, and overall significantly improves the way we’re analyzing and reporting telemetry data.” You can download the full report here in exchange for contact details. - Drew
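To make the difference between the two counting methods concrete, here is a minimal Python sketch. It is not SonicWall's actual method, which the blog doesn't spell out. It assumes an "hour under attack" means any clock hour with at least one hit, and the `attack_hours` function and the sample timestamps are made up for illustration.

```python
from datetime import datetime


def attack_hours(hit_times):
    """Count distinct clock hours that contain at least one hit.

    Assumption for illustration: an "hour under attack" is any
    clock hour with one or more hits recorded against the firewall.
    """
    buckets = {
        t.replace(minute=0, second=0, microsecond=0) for t in hit_times
    }
    return len(buckets)


# Hypothetical sample data: three hits in the 14:00 hour, one in the 16:00 hour.
hits = [
    datetime(2024, 7, 1, 14, 5),
    datetime(2024, 7, 1, 14, 30),
    datetime(2024, 7, 1, 14, 59),
    datetime(2024, 7, 1, 16, 10),
]

print("raw hits:", len(hits))
print("hours under attack:", attack_hours(hits))
```

A burst of hits inside a single hour moves the raw count a lot but adds only one attack hour, which is the rain analogy in code form.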

TOO MANY LINKS WOULD NEVER BE ENOUGH 🐳

LAST LAUGH 😆