
Human Infrastructure 423: AI vs Network Engineers, Data Centers in Orbit and Underwater, Lab Notes, and More

TAKE THE PACKET PUSHERS SALARY SURVEY!

We’ve put together a salary survey to understand the current market for network engineering and IT skills. We’ve got more than 280 responses so far. We’d love to cross the 300 mark (or more), so if you haven’t taken the survey yet, we’d appreciate it if you’d take a few minutes. If you have taken the survey, maybe tell a colleague or two about it. We aren’t collecting any personal information, so all responses are anonymous.

THIS WEEK’S MUST-READ BLOGS 🤓

It’s clear that lots of executives are betting on AI to replace human workers (for example, see Amazon’s latest layoffs). Jason Gintert takes a careful look at what GenAI and LLMs can and can’t do when it comes to network troubleshooting. In his experience with these tools, he says GenAI isn’t good at discerning operator intent or environmental context, has trouble understanding root cause, and can be hamstrung by a lack of data or data of poor quality. And those downsides are good news for experienced network engineers.

That said, Jason also notes that some roles are going to be impacted by AI, and that engineers who are comfortable with AI tools and can use them effectively will become more valuable than those who can’t. This is a post you should read and ponder seriously as you evaluate your own approach to AI. - Drew

Orbital data centers…as in, data centers in space. Yes, that’s a thing, and the company making noise about them is Starcloud. Andrew posts thoughtfully about whether or not ODCs will work in the long run, considering the cost model, maintenance, communications, heat management, power generation, and several other factors.

TL;DR. He doesn’t think this can work. - Ethan

Multiple protocols have emerged that enable AI agents to discover each other, understand what services and capabilities they offer, and connect to third-party applications, data sources, and workflows. Google recently released the A2A protocol, and this post from Ryan talks about how it works, and walks through how he built a portfolio discoverable by other agents. Even if you don’t have plans to build your own AI agent, posts like these are useful for understanding how these protocols work because you may be running a network that will support such agents. - Drew
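
For a concrete flavor of how that discovery works, here’s a minimal Python sketch of parsing an A2A-style “agent card,” the JSON document an agent publishes so peers can find out what it can do. The card contents and field names below are illustrative placeholders, not copied from Google’s spec or from Ryan’s project.

```python
import json

# Hypothetical agent card: a JSON document an agent publishes so that
# other agents can discover its skills. Fields are illustrative.
AGENT_CARD = json.loads("""
{
  "name": "portfolio-agent",
  "description": "Answers questions about a personal portfolio",
  "url": "https://example.com/a2a",
  "skills": [
    {"id": "list-projects", "description": "List portfolio projects"},
    {"id": "contact", "description": "Return contact details"}
  ]
}
""")

def discover_skills(card: dict) -> list[str]:
    """Return the skill ids an agent advertises in its card."""
    return [skill["id"] for skill in card.get("skills", [])]

print(discover_skills(AGENT_CARD))  # ['list-projects', 'contact']
```

In a real deployment the card would be fetched over HTTPS from a well-known path on the agent’s host rather than embedded as a string.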

TL;DR. Wi-Fi adapters with the Intel Killer chipset can cause strange connectivity problems, especially with VPN software.

“Basically the Killer adapters come with a bunch of software add-ons that are intended to speed up your internet connection but can end up causing problems with VPNs. So far disabling these add-ons has solved the issue for all of my affected users. For detailed steps on how to do so you can check out the threads linked above or this summary page I made: https://wallpunch.net/windows-killer-troubleshooting/”

Sharing here as sometimes clues like this help solve bizarre problems in client hardware that seem otherwise inscrutable. - Ethan

This short LinkedIn update from Suresh talks about why and how he uses Proxmox and Ubuntu for networking labs. He also is looking to gauge people’s interest in a more detailed blog post about setting up a network lab, so if you’re interested, let him know in the comments. - Drew

MORE BLOGS

  1. 100 GbE Lab (Jan 2025, budget build with Mellanox NIC + Mikrotik switch) - Jordan Rife

  2. Quick thoughts on the recent AWS outage - Surfing Complexity

  3. Are Research and Education Networks Critical Infrastructure? (an argument for “yes”) - Internet Society Pulse

  4. I was once an AI true believer. Now I think the whole thing is rotting from the inside. - /u/ShallowPedantic via /r/ArtificialIntelligence

  5. The Hidden Risks of AI Notetakers: Precaution or Paranoia? (precaution, someone is likely monetizing meeting transcriptions) - CircleID

TECH NEWS 📣

Who needs resource-intensive cooling for a data center when you can use an entire ocean to take up the thermal load? That’s the thinking behind an experiment being run by a Chinese data center operator, which has encased a bunch of servers in corrosion-resistant metal containers and sunk them 35 meters below the surface off the coast of Shanghai. The article linked above has more details, as well as a reminder that Microsoft sunk many years of experimentation into this idea before eventually abandoning ship. - Drew

If dropping a data center to the bottom of the ocean isn’t wild enough, here’s an interesting story about a startup building an FPGA and software that works with what it calls “probabilistic bits” or “p-bits” instead of traditional ones and zeros. The idea is that p-bits can model uncertainty, which in theory would make them useful for tasks that deal with complex systems. The article says “Extropic calls its processors thermodynamic sampling units, or TSUs, as opposed to central processing units (CPUs) or graphics processing units (GPUs). TSUs use silicon components to harness thermodynamic electron fluctuations, shaping them to model probabilities of various complex systems, such as the weather, or AI models capable of generating images, text, or videos.”

If this technology is viable, its creators say it has the potential to be dramatically more energy-efficient than today’s power-hungry GPUs. Frankly, it all sounds quite fantastical to me, but who knows? Given all the billions being poured into GPUs and data centers, funding a few wild ideas that could lead to more efficient systems seems like a good idea. - Drew
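
To make the idea tangible, here’s a toy software stand-in for a p-bit (not Extropic’s hardware, which samples from physical electron fluctuations): a bit that reads 1 with a programmable probability rather than being deterministically 0 or 1. Sampling many such bits recovers the programmed probability, which is the basic trick behind using them to model uncertain systems.

```python
import random

def p_bit(p: float, rng: random.Random) -> int:
    """A software stand-in for a p-bit: returns 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0

rng = random.Random(42)
samples = [p_bit(0.3, rng) for _ in range(10_000)]
mean = sum(samples) / len(samples)

# The empirical mean converges toward the programmed bias (here, 0.3)
print(round(mean, 2))
```

The claimed hardware advantage is that a physical device can produce these samples natively, instead of burning GPU cycles on pseudo-random number generation.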

Specialized Ethernet transport, hardware, and more are coming to facilitate data centers supporting AI workloads. Thor Ultra is one such entrant.

“Thor Ultra represents a clean-sheet NIC design, not an evolution of Broadcom’s previous Thor 2 product. While Thor 2 was a 400G NIC serving multiple markets including enterprise, Thor Ultra is a new architecture focused exclusively on AI scale-out deployments. The NIC implements Ultra Ethernet Consortium (UEC) 1.0 specifications and introduces hardware-accelerated capabilities to modernize RDMA.”

Whether or not networking for AI is interesting to you, I recommend you pay some attention to the work of the UEC. I’m at NANOG95 this week, and I’ve heard that while AI might be driving the UEC’s work and products like Thor Ultra, there’s a long-term play for Ultra Ethernet after the AI bubble pops. - Ethan

MORE NEWS

FOR THE LULZ 🤣

Shared on the Packet Pushers Slack by Matthew

RESEARCH & RESOURCES 📒

I’m at NANOG95 right now, so I thought I’d highlight a few of the resources NANOG has to offer to the network engineering community.

  1. An event series held in various cities across North America. (Yes, there is a registration fee.) The current cadence is 3x a year. Attend in person or virtually. Several hundred people attend these events in person. Not too big, not too small. The focus is on an educational content track curated by the NANOG Program Committee and an excellent “hallway track.” You gain access to experienced networking people you’d otherwise be unlikely to chat with. Everyone here helps everyone else. https://nanog.org/events/future/

  2. A YouTube channel with a massive archive of technical talks about networking from years of NANOG gatherings. If you can’t attend the events, you can still view the talks. https://youtube.com/@TeamNANOG

  3. More at https://nanog.org.

While at NANOG95, I recorded a Heavy Networking podcast episode on the history of NANOG with board member Steve Feldman that is scheduled to publish on November 14, 2025. If you’re not sure whether or not NANOG is interesting to you, that might help. You might also search for #nanog95 on LinkedIn to get the general vibe. - Ethan

Considering all the noise about MCP and agentic AI as related to network automation, I thought this paid-but-really-cheap course from Michael was pretty timely. - Ethan

This is a readable primer on symmetric and asymmetric encryption and key exchange, the risks quantum computing poses, and mechanisms for generating post-quantum pre-shared keys that are quantum-resistant. It goes into some Cisco-specific implementation details, but it’s still a useful overview. Thanks to Tim McConnaughy who shared a link to this resource via LinkedIn, which I immediately grabbed to include here. - Drew 

MORE RESOURCES

  1. Entire Linux Network stack diagram (ow, my eyes) - zenodo

INDUSTRY BLOGS & VENDOR ANNOUNCEMENTS 💬 

Arista has released new switch and router platforms in its R4 family that target scale-out and scale-across data center and AI infrastructure use cases. Products range from 10GbE to 800GbE fixed and modular options, including a monster 7800R4 modular chassis that offers up to 576 ports of 800GbE. These R4 models are designed to run Arista’s EOS network operating system, and include support for multiple encryption options including MACsec, IPsec, and VXLANsec.

Of particular note is an option in the 7800R4 family that Arista calls HyperPort. Designed to interconnect AI workloads across two physically separate data centers (a.k.a. scale-across), this R4 model essentially serves as a single 3.2 Tbps interface. Arista claims HyperPort offers a 44% faster job completion time than can be achieved by load-balancing traffic across four individual 800GbE ports.

Arista says it achieves this 3.2 Tbps performance via updates to its EOS software that leverage capabilities in Broadcom’s Jericho 3+ ASIC. The HyperPort option is expected to be available in Q1 of 2026. - Drew

Lightyear makes a SaaS platform to automate the telecom and service provider provisioning lifecycle for network connectivity, from creating RFPs to getting quotes to tracking installation, and more. The company has released a new report on customers’ ISP experiences that draws from Lightyear’s data. Lightyear says the report looks at trends around quote speed and accuracy, service implementations, lifecycle management, and other aspects of the customer experience. You have to give up contact details if you want the report. - Drew 

Here’s the detailed post-mortem from AWS explaining the outage that the entire world noticed. Allow me to massively oversimplify what this document states.

  1. DynamoDB had a problem that affected DNS resolution. “The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint (dynamodb.us-east-1.amazonaws.com) that the automation failed to repair.”

  2. Droplets (servers that house EC2 instances) were unable to complete check-ins (because of the DynamoDB problem). And this had a cascading effect. Servers unable to complete check-ins were presumed unavailable, and EC2 capacity was reduced as leases timed out. As this was resolved and the backlog of EC2 instances started to fire up, the backlog of related network state changes also began firing up…but with so much to catch up on, network state propagation was sloooowww. The strain on the system caused NLB instances to fail health checks and get marked offline, even though they might have been okay. The health check system got so far behind that AZ swaps were triggered.

  3. A variety of other impacts to API calls, Lambda, and Fargate with a spider’s web of practical impacts.
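
Point #1 is also a reminder that an empty DNS answer is a failure mode worth guarding against in your own tooling. Here’s a minimal Python sketch of one defensive posture: treat an empty answer the same as a resolution failure and fall back to a cached address list rather than propagating emptiness downstream. The hostname and addresses are placeholders, not anything from the AWS post-mortem.

```python
import socket

def resolve_or_fallback(hostname: str, fallback: list[str]) -> list[str]:
    """Resolve a hostname; if resolution fails or the answer is empty,
    return a cached/fallback address list instead of an empty one."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        addrs = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        addrs = []
    return addrs if addrs else fallback

# ".invalid" is a reserved TLD that never resolves, so we hit the fallback.
print(resolve_or_fallback("no-such-host.invalid", ["192.0.2.10"]))
```

The real fix AWS describes is in the DNS management automation itself; this sketch only shows the client-side belt-and-suspenders version of the same instinct.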

I think there are at least two high-level takeaways for the rest of us mere mortals not operating the largest cloud in the known universe. Neither of these observations is new or revelatory. Just more poignant in this moment.

Unforeseen dependencies can tank your environment. See point #2 above. Anticipating those failure modes is nearly impossible without experience, but that’s experience you don’t actually want to get. So, bring in the grizzled veterans with the scars of “data center down” days to analyze your system and surface the possible failure modes you can’t predict.

A single cloud instance should be considered a single point of failure. An unlikely SPOF? Sure. Public clouds are incredibly robust, but a SPOF nonetheless. Applications and IT services must be architected with the assumption that the cloud can fail. Business resiliency requires that we have an answer to the question, “What if US-East-1 (choose your favorite) is completely down?” More interestingly, and more complicated to design for, is the question, “What if the cloud I rely on is only partially down or intermittently down…or just freaking slow?” Gray failures are the worst scenarios to react to. But anticipating this situation and having a playbook for it could save your business some dollars. - Ethan
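
That second takeaway can be sketched in a few lines: prefer a primary region, but always have an ordered fallback and a defined action when everything is down. The region names and health-check function here are illustrative, not any provider’s actual API, and a real implementation would have to wrestle with exactly the gray-failure problem described above.

```python
# A minimal sketch of the "assume the cloud can fail" posture:
# walk an ordered preference list and take the first healthy region.
def first_healthy(regions: list[str], is_healthy) -> str:
    for region in regions:
        if is_healthy(region):
            return region
    # No region passed the check: this is where the DR playbook kicks in.
    raise RuntimeError("all regions down -- invoke the DR playbook")

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]
down = {"us-east-1"}  # simulate the primary being offline

print(first_healthy(REGIONS, lambda r: r not in down))  # us-west-2
```

The hard part isn’t this loop; it’s writing an `is_healthy` that distinguishes “down” from “intermittently slow” without flapping, which is exactly why a pre-written playbook beats improvising mid-outage.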

Starcloud will be offering orbiting data centers soon, and Crusoe’s gonna have one of their Crusoe Cloud modules on one of them, scheduled to launch into orbit in late 2026, with services offered in early 2027. NVIDIA has also touted involvement with Starcloud.

I have questions, mostly about terrestrial comm links, but also orbital decay, radiation shielding, how to swap out a failed power supply, launch packaging and related capex, and what functions as remote hands in an orbiting data center (gotta be robots, right?). I know all of those questions have answers (because they have to or Starcloud couldn’t be a thing). I want to have the discussion, though. - Ethan

MORE INDUSTRY NOISES

DYSTOPIA IRL 🐙

TOO MANY LINKS WOULD NEVER BE ENOUGH 🐳

LAST LAUGH 😆

Shared on the Packet Pushers Slack by Kaj