CS Colloquium

Workload-Aware Networks for Machine Learning

Time and Location:

April 07, 2026 at 2PM; 60 Fifth Avenue, Room 150

Speaker:

Weiyang "Frank" Wang, MIT CSAIL

Abstract:

Today's ML workloads require networks that connect tens to hundreds of thousands of GPUs. Existing GPU clusters rely on network designs that offer any-to-any connectivity while remaining agnostic to the data they carry. These traits, carried over from legacy CPU datacenters, limit scalability and hinder GPU utilization.

I will present workload-aware networking, a systematic approach that exploits structures inherent to machine learning traffic to co-design networks with ML workloads. I start by showing that the network traffic of large language models (LLMs) exhibits a surprising property: it stays within the bottom layer of a switched network. This insight enables rail-only network designs that dramatically reduce cost and complexity. I then discuss TopoOpt, which uses reconfigurable networks to adapt to the repetitive, predictable traffic patterns of ML training, delivering performance improvements over today's network designs. Finally, I show that understanding traffic content inside the network unlocks new functionality. I introduce Checkmate, a system that embeds checkpointing into the network through gradient replication, enabling per-iteration checkpointing with zero GPU overhead. I conclude with future directions: extending these principles to emerging workloads such as agentic AI, and building orchestration frameworks that automate network-workload co-design.