CS Colloquium

Building Effective Unstructured Data Systems

Time and Location:

Feb. 23, 2026 at 11AM; 4 Washington Square North, Room 1201

Link:

Abstract:

Databases and other data systems have successfully democratized data-oriented computation across domains, thanks to decades of research in system internals and end-user interfaces. However, such systems center on structured (i.e., tabular data; unstructured data-the vast majority of data-has largely been ignored. Large language models (LLMs) now give us a building block for unstructured data analysis, and we face the same questions as in the early days of data systems-e.g., how should users author queries? How do we efficiently execute queries at scale?-but many well-established tenets from traditional data systems no longer hold. In my talk, I will present DocETL, a system I developed for unstructured data analysis. I will discuss how we had to rethink query optimization under these new assumptions, optimizing user-written pipelines for both accuracy and efficiency-as well as end-user interfaces for authoring, iterating on, and debugging pipelines. DocETL is open-source with 3.5k+ GitHub stars; our hosted interface has supported 4.1k+ pipelines across 30+ S&P-500 industries. Query optimization ideas from our work have been adopted in databases such as Snowflake and BigQuery, and our interface design principles have been adopted by companies like LangChain and OpenAI.