
Human Infrastructure 366: Running CML in the Cloud, Ethernet History Deepdive, IETF Big Ideas

THIS WEEK’S MUST-READ BLOGS 🤓

If you don’t have the hardware at hand for a project, there’s always the cloud, right? Maybe. Liam breaks down the cost of running CML in Azure for a project he was working on (pricey!). But he also didn’t want to buy a server because he only needed to run CML for a short time. In short, he found a bare metal provider that met both his technical requirements and price requirements. He also shares his lab setup if you’re curious. - Drew

A simple-sounding question, “Why are there different Ethernet encapsulations?”, sent Daniel on a journey into the history of Ethernet. This is a deeply researched post full of technical detail and interesting trivia. Daniel discusses the protocol’s creators, the formation of a consortium of DEC, Intel, and Xerox to promote Ethernet, the tangled standardization process, Ethernet competitors, and more. It’s an amazing piece of writing. (By the way, if you want even more Ethernet history, Ethan and I interviewed Alan Kirby, a DEC engineer who helped create the first Ethernet bridge.) - Drew

Geoff Huston summarized some key routing-related conversations that caught his ear at IETF 120. The one that piqued my interest the most was adding QUIC as a transport option to BGP. That is, BGP over QUIC as opposed to BGP over TCP. There are a number of problems this would help to solve. Geoff explains. - Ethan

I tend to think of Broadcom solely in regard to its switch ASIC business, but the company is also a big player in AI silicon thanks to a deal it has with Google to design and build Google’s TPU chips. This post gets into the corporate history of how Broadcom got to this position. It starts with Avago CEO Hock Tan, who has made a career out of acquiring companies (often with the help of private equity firms), stripping out what didn’t fit, and sometimes getting lucky when one of the assets he did own turned into a market winner. That includes Broadcom, which Avago acquired in 2016. Meanwhile, the TPU design business is thanks to Avago’s acquisition of LSI Logic, which has a custom silicon design shop. The upshot is that Tan’s portfolio, now under the Broadcom brand, has a major stake in the AI market. - Drew

Bryan looks back at the early days of the Covid pandemic. He works on a university campus, and as institutions of all types tried to figure out how to move forward, his university decided to permit students to return to campus in the fall of 2020. However, it was presumed that classes would mostly be online, which required a massive upgrade to the WLAN infrastructure in the dorms to support remote learning. Due to an effort to sanitize the dorms, Bryan and his team had only two days to install 2,000 APs across multiple buildings. Read the post to find out how Bryan got it done, and to remember the strange experience of living under a pandemic. - Drew

HashiCorp’s Vault is a popular tool for storing secrets. This post provides an overview of Vault’s capabilities, and then walks you step by step through setting up Vault in a lab if you want to take it for a test drive or get familiar with it before using it in production. - Drew

Building Self-Service Networks with NetOrca

With NetOrca you can allow your internal customers to consume all the great automation infrastructure you have created and manage it throughout its service lifecycle. See our demo with Ethan on YouTube.

Or visit our website for more information and contact details here.

TECH NEWS 📣

This piece intrigued me mostly for the tables that show AMD’s growth in server, desktop, and mobile CPUs from 2017 until now. They barely registered a blip in 2017. Now, AMD is at ~20% or better market share in all of these categories. Perhaps most intriguing of all? “AMD seems to lead in high-end crème-de-la-crème machines that require the most powerful and expensive processors.” Oh, Intel…what happened? - Ethan

Vehicle-to-everything (V2X) is an ambitious plan by the US Department of Transportation to enable vehicles to be aware of their surroundings. “V2X enables vehicles to stay in touch with each other as well as pedestrians, cyclists, other road users and roadside infrastructure. It lets them share information such as their position and speed, as well as road conditions.” Because safety, naturally. The article details some of the bureaucratic deployment challenges, but assuming a V2X rollout gets going, I see opportunities for networking folks over the next several years. - Ethan

Corning’s manufacturing process for fiber optic cables is described with some nifty pictures. In summary, glass is ultra-purified with heat. The result is a glass “icicle”. The icicle is put into a machine that’s over a hole dropping several floors deep into the factory. The machine melts the glass, and the glass falls into the hole, stretching as it goes. At the bottom, the end is placed into a machine that stretches it further, to the width of a human hair. That’s the core of the cable; from there it’s coated for purpose, spooled, and shipped. Neat! FYI, if your browser supports reader mode, you can read this piece without being challenged. - Ethan

FOR THE LULZ 🤣

Shared by Anton Lonnerbro

RESEARCH & RESOURCES 📒

“A full-featured NTP server and client implementation, including NTS support.” Written in Rust. Part of Project Pendulum. - Ethan

Long time networking instructor Wendell Odom teaches you how to sort out IPv4 subnets, explaining those pesky subnet masks. He maps this information to the CCNA program so that you know how to best apply what you’ve learned to the exam. - Ethan
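If you want to check your subnetting answers while you practice, here is a minimal sketch using Python's standard `ipaddress` module. It is not taken from Wendell's material; the `describe_subnet` helper and the sample address are hypothetical, chosen only for illustration, and the usable-host math assumes a prefix of /30 or shorter.

```python
import ipaddress


def describe_subnet(cidr):
    """Return the key facts about the IPv4 subnet that contains `cidr`.

    `cidr` is a host address with a prefix length, e.g. "192.168.10.77/26".
    Assumes a prefix of /30 or shorter, where the network and broadcast
    addresses are not usable by hosts.
    """
    iface = ipaddress.ip_interface(cidr)
    net = iface.network  # the subnet the host address falls into

    return {
        "netmask": str(net.netmask),
        "network": str(net.network_address),
        "broadcast": str(net.broadcast_address),
        # First and last usable addresses sit just inside the subnet edges.
        "first_host": str(net.network_address + 1),
        "last_host": str(net.broadcast_address - 1),
        # Subtract the network and broadcast addresses.
        "usable_hosts": net.num_addresses - 2,
    }


if __name__ == "__main__":
    for field, value in describe_subnet("192.168.10.77/26").items():
        print(f"{field:>12}: {value}")
```

Work the problem by hand first, then run the script to confirm. The exam won't let you bring Python along.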

AutoCon2 is coming up fast on November 18-22 in Denver, Colorado, and we want to let you know some key dates:

Conference Registration is open NOW!

  • You can get super early bird pricing of only $299 until August 28

  • Hotel registration is open now - grab a room SOON!

Call for Speakers closes July 31

  • We already have the most proposals for talks that we've ever had

Workshop Registration opens August 8

  • We're going to have a great slate of workshop options covering a range of topics in network automation and orchestration

  • Note that it's a separate event conveniently preceding AC2

The Full AC2 Conference Agenda will be published by September 9

NAF is a watering hole - a place where we can have harmonious collaboration in network automation: the practice of network automation, orchestration, observability, AI tooling, education, process and standards, and more. Come hear what your peers are doing in their networks (on the stage and in the hallways), what solution providers are bringing to the table, what's happening with open source, and all things network automation. AutoCon is THE Forum for Network Automation. See you in Denver!

INDUSTRY BLOGS & VENDOR ANNOUNCEMENTS 💬 

While the problems Meta has doing Internet-scale AI training won’t apply to many, their tweaking of the network to best support massive data throughput in a Clos architecture might. View this piece as a use-case from which you might gain some takeaways for your own data center environment, even if massive GenAI workloads aren’t part of your world. A few things caught my eye.

  • Meta separated their networks into front-end and back-end. The back-end is for the AI GPU-to-GPU stuff. The front-end is for everything else. They are not identical network designs. The back-end is where the focus of this piece is.

  • Equal Cost Multi-Path sucks for workloads where there’s low entropy—that is, you have elephant and/or long-lived flows and not all that many of them. To get an even load distribution with traditional ECMP, you need variety in the tuples you hash on. If you don’t have variety, some ECMP member links get slammed full while others are underutilized. Meta tweaked their ECMP algorithm (they run their own switches and NOS last I knew) to add RoCE packets’ queue pair (QP) field to the hash. Most of their traffic was RoCE, and they got better link distribution this way. Could you do that? Would depend on your chipset and NOS support.

  • Meta turned off DCQCN and relied solely on PFC for flow/congestion control once they moved from 200G to 400G Ethernet. DCQCN just wasn’t working right anymore in the context of the new speed, plus was buggy. Turns out this didn’t matter, as PFC was good enough to get the flow control behavior they needed.

In summary, not everything works like the books say it’s supposed to. Networking is more than typing commands and configuring protocols. Knowing both your traffic load and mix matters. Much more detail in the blog post, as well as the published academic paper. - Ethan
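To make the ECMP entropy point concrete, here is a toy Python sketch. It is not Meta's algorithm or any switch vendor's hash. It assumes 64 RoCEv2 flows between one pair of hosts that all share the same 5-tuple (including the UDP source port), uses SHA-256 as a stand-in for a hardware hash function, and picks a hypothetical four-link ECMP group. The only detail taken from the real world is UDP destination port 4791, which is the registered port for RoCEv2.

```python
import hashlib
from collections import Counter

NUM_LINKS = 4            # hypothetical ECMP group size
ROCE_V2_UDP_PORT = 4791  # registered UDP destination port for RoCEv2


def pick_link(fields, num_links=NUM_LINKS):
    """Hash the given header fields and map the result to a member link.

    SHA-256 is a stand-in here; real ASICs use much cheaper hash functions.
    """
    key = "|".join(str(field) for field in fields).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_links


def make_flows(num_qps=64):
    """Build low-entropy flows: identical 5-tuples, differing only in QP."""
    return [
        {
            "src": "10.0.0.1",
            "dst": "10.0.0.2",
            "proto": 17,          # UDP
            "sport": 49152,       # assumed to be the same for every flow
            "dport": ROCE_V2_UDP_PORT,
            "qp": qp,             # RoCE queue pair number
        }
        for qp in range(num_qps)
    ]


def link_load(flows, include_qp):
    """Count how many flows land on each ECMP member link."""
    load = Counter()
    for flow in flows:
        fields = [flow["src"], flow["dst"], flow["proto"],
                  flow["sport"], flow["dport"]]
        if include_qp:
            fields.append(flow["qp"])
        load[pick_link(fields)] += 1
    return load


flows = make_flows()
five_tuple_load = link_load(flows, include_qp=False)
qp_load = link_load(flows, include_qp=True)

if __name__ == "__main__":
    print("5-tuple only:", dict(five_tuple_load))
    print("5-tuple + QP:", dict(qp_load))
```

With only the 5-tuple in the hash, every flow produces the same key and lands on a single link. Adding the QP field gives the hash something to work with, so the flows spread across the group. How evenly they spread in a real network depends on the traffic mix and the hash your chipset actually implements.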

Post-quantum encryption is about encrypting things today so that a (not-yet extant) quantum computer won’t be able to crack them tomorrow. The work has been ongoing in this space for some time. NIST has announced three finalized standards, with more expected. The three finalized thus far are…

  • Federal Information Processing Standard (FIPS) 203. For general encryption. Based on the “CRYSTALS-Kyber algorithm, which has been renamed ML-KEM, short for Module-Lattice-Based Key-Encapsulation Mechanism.”

  • FIPS 204. For digital signatures. Based on “CRYSTALS-Dilithium algorithm, which has been renamed ML-DSA, short for Module-Lattice-Based Digital Signature Algorithm.”

  • FIPS 205. Also for digital signatures. Based on “Sphincs+ algorithm, which has been renamed SLH-DSA, short for Stateless Hash-Based Digital Signature Algorithm.”

Lots more detail and context in the NIST piece. - Ethan

Meter is a Network-as-a-Service (NaaS) company. That is, Meter develops, deploys, and operates your wireless and wired campus network for you. Meter builds its own networking equipment, develops its own NOS, and will even help you find and provision a service provider in your geographical area. The company’s newest offering is Command, which it describes as “generative UI.” In other words, Meter now has generative AI capabilities that let you interact with the infrastructure using natural language. Not only can you query the network to get things like status updates or performance reports, you can also use natural language to build dashboards on demand. You can even ask it to fix problems for you. The press release linked above gives an example: “Troubleshoot client @eng-laptop on @AP33. The user said it's slow.”

The company emphasizes the reliability of its generative AI capabilities. In a technical blog about Command (which is worth reading if you’re curious about Meter and this new capability), the company writes “Our models are built in-house to understand the nuances of both networking concepts and how we build backend and frontend software, ensuring that user intentions are correctly interpreted and acted upon.” Meter partners with a company called Cerebras to host its custom models. Meter says all customer data is encrypted in transit and at rest, and that it doesn’t share customer data with third-party LLM providers. - Drew

TOO MANY LINKS WOULD NEVER BE ENOUGH 🐳

LAST LAUGH 😆

Shared on X by @fightwithmemes