Math and Data (MaD) Seminar

Speaker: Yury Polyanskiy

Location: 60 Fifth Avenue, 7th floor open space

Date: Thursday, December 5, 2024

The main building block of large language models is matrix multiplication, which is often bottlenecked by the speed of loading the matrices from memory. A possible solution is to trade accuracy for speed by storing the matrices in low precision ("quantizing" them). In recent years, a number of quantization algorithms with steadily improving performance have been proposed (e.g., SmoothQuant, Brain compression, GPTQ, QuIP, QuIP#, QuaRot, SpinQuant). In this work, we prove an information-theoretic lower bound on the achievable accuracy of computing a matrix product as a function of the compression rate (number of bits per matrix entry). We also construct a quantizer (based on nested lattices) that achieves this lower bound.
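
For intuition about the accuracy-vs-rate trade-off described above, the following minimal Python sketch quantizes both matrices with a plain uniform scalar quantizer at a given number of bits per entry and measures the relative error of the resulting product. It is purely illustrative: this is the baseline (cubic-lattice) quantizer, not the nested-lattice construction from the talk, and all names and parameters here are assumptions made for the example.

    import numpy as np

    def quantize_uniform(M, bits):
        """Uniformly quantize matrix entries to `bits` bits per entry.

        A plain scalar (integer-lattice) quantizer, shown only to illustrate
        the rate/accuracy trade-off; not the nested-lattice scheme from the paper.
        """
        levels = 2 ** bits
        lo, hi = M.min(), M.max()
        scale = (hi - lo) / (levels - 1)
        codes = np.round((M - lo) / scale)   # integer codes in [0, levels - 1]
        return codes * scale + lo            # dequantized matrix

    rng = np.random.default_rng(0)
    A = rng.standard_normal((256, 512))
    B = rng.standard_normal((512, 256))
    exact = A @ B

    for bits in (2, 4, 8):
        approx = quantize_uniform(A, bits) @ quantize_uniform(B, bits)
        err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
        print(f"{bits} bits/entry: relative error {err:.3e}")

Roughly speaking, the talk asks how small this relative error can be made at a given number of bits per entry, and shows that nested-lattice quantizers (rather than the scalar quantizer sketched here) can meet the information-theoretic lower bound.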


Based on joint work with Or Ordentlich (HUJI); arXiv:2410.13780.