Zooming out: New tools for probing the historical record and the human genome
Erez Lieberman Aiden, Harvard Society of Fellows

Abstract:
New structures often emerge when we explore a known phenomenon from a more global vantage point. For instance, any given book can be read and comprehended. But what happens when we try to read all the books at once? Or: the local structure of DNA is a double helix. But if DNA did not fold further, the human genome - which is two meters long - could never fit inside the nucleus of a cell. How does it fold? This talk will focus on the extraordinary potential of technologies that enable us to zoom out, in the process transforming familiar concepts, like the contents of a book or the shape of DNA, into new research horizons.

First, I will describe efforts, together with my collaborator Jean-Baptiste Michel and Google, to create tools for the quantitative analysis of a significant portion of the historical record. We began by constructing a reliable corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of ‘culturomics,’ focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. Such analyses are intuitive and addictive: the Google Ngram Viewer, a simple web-based tool we released for the analysis of this corpus, was used over a million times in the first 24 hours. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.

In the second half of my talk, I will describe Hi-C, a novel technology for probing the three-dimensional architecture of whole genomes. Developed together with collaborators at the Broad Institute and UMass Medical School, Hi-C couples proximity-dependent DNA ligation and massively parallel sequencing. My lab employs Hi-C to construct spatial proximity maps of the human genome. Hi-C maps have revealed that active and inactive portions of the human genome are spatially segregated, i.e., that cells employ a sort of 'regulatory origami' as they turn genes on and off. At the megabase scale, the genomic fold is consistent with a fractal globule, a knot-free conformation that enables maximally dense packing while preserving the ability to easily fold and unfold any genomic locus.