CS Colloquium: Pareto-Efficient AI Systems: Expanding the Quality and Efficiency Frontier of AI

Speaker: Simran Arora, Stanford University

Location: 60 Fifth Avenue, Room 150

Date: Thursday, April 10, 2025

In this talk, we build up, piece by piece, to a simple language model architecture that expands the Pareto frontier between quality and throughput efficiency. The Transformer, AI’s workhorse architecture, is memory hungry, which limits its throughput. This has led to a Cambrian explosion of alternative architecture candidates across prior work. Prior work paints an exciting picture: there are architectures that are asymptotically faster than the Transformer while also matching its quality. However, I ask: if we’re using asymptotically faster building blocks, what, if anything, are we giving up in quality?

1. In part one, we map out the tradeoff space and show there’s no free lunch. I present my work to identify and explain the fundamental quality and efficiency tradeoffs between different classes of language model architectures.

2. In part two, we measure how existing architecture candidates fare along the tradeoff space. While many proposed architectures are asymptotically fast, they struggle to achieve wall-clock speedups over Transformers. I present my work on ThunderKittens, a GPU programming library to help address the bottleneck of mapping AI algorithms to AI hardware.

3. In part three, we expand the Pareto frontier of the tradeoff space. I present the BASED architecture, which is built from familiar and hardware-efficient components. As a culmination, I released a suite of 8B-405B parameter Transformer-free language models that are state-of-the-art per standard evaluations, all on an academic budget. Given the massive investment into AI models, this work blending AI and systems has had significant impact and adoption in research, open source, and industry.