Zooming out: New tools for
probing the historical record and the human genome
Erez Lieberman Aiden, Harvard
Society of Fellows
Abstract:
New structures often emerge when we explore a known phenomenon
from a more global vantage point. For instance, any given book can
be read and comprehended. But what happens when we try to read all
the books at once? Or: the local structure of DNA is a double
helix. But if DNA did not fold further, the human genome, which
is two meters long, could never fit inside the nucleus of a cell.
How does it fold? This talk will focus on the extraordinary
potential of technologies that enable us to zoom out, in the
process transforming familiar concepts, like the contents of a
book or the shape of DNA, into new research horizons.
First, I will describe efforts, together with my collaborator
Jean-Baptiste Michel and Google, to create tools for the
quantitative analysis of a significant portion of the historical
record. We began by constructing a reliable corpus of digitized
texts containing about 4% of all books ever printed. Analysis of
this corpus enables us to investigate cultural trends
quantitatively. We survey the vast terrain of ‘culturomics,’
focusing on linguistic and cultural phenomena that were reflected
in the English language between 1800 and 2000. We show how this
approach can provide insights about fields as diverse as
lexicography, the evolution of grammar, collective memory, the
adoption of technology, the pursuit of fame, censorship, and
historical epidemiology. Such analyses are intuitive and
addictive: the Google Ngram Viewer, a simple web-based tool we
released for the analysis of this corpus, was used over a million
times in the first 24 hours. Culturomics extends the boundaries of
rigorous quantitative inquiry to a wide array of new phenomena
spanning the social sciences and the humanities.
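As a rough illustration of the kind of measurement involved, the sketch below computes how often a word appears per year, relative to all words printed that year, over a toy corpus. It is only a minimal sketch: the function name `yearly_relative_frequency`, the `(year, text)` input format, and the whitespace tokenization are assumptions made for illustration, not a description of how the actual corpus or the Ngram Viewer is built.

```python
from collections import Counter


def yearly_relative_frequency(books, target):
    """Return {year: occurrences of target / total tokens} for a corpus.

    `books` is an iterable of (year, text) pairs (a hypothetical format).
    Tokenization is naive lowercase whitespace splitting.
    """
    hits = Counter()    # occurrences of the target word per year
    totals = Counter()  # total number of tokens per year
    target = target.lower()
    for year, text in books:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(target)
    # Skip years with no tokens to avoid dividing by zero.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


# Toy data, invented purely for illustration.
toy_corpus = [
    (1900, "the telephone rang and the telephone rang again"),
    (1900, "a quiet year"),
    (1950, "the television and the telephone"),
]
freq = yearly_relative_frequency(toy_corpus, "telephone")
for year, value in freq.items():
    print(year, round(value, 3))
```

Normalizing by the total number of words per year matters because the number of books printed changes over time; raw counts alone would mostly track the size of the corpus.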
In the second half of my talk, I will describe Hi-C, a novel
technology for probing the three-dimensional architecture of whole
genomes. Developed together with collaborators at the Broad
Institute and UMass Medical School, Hi-C couples
proximity-dependent DNA ligation and massively parallel
sequencing. My lab employs Hi-C to construct spatial proximity
maps of the human genome. Hi-C maps have revealed that
active and inactive portions of the human genome are spatially
segregated, i.e., that cells employ a sort of 'regulatory origami'
as they turn genes on and off. At the megabase scale, the
genomic fold is consistent with a fractal globule, a knot-free
conformation that enables maximally dense packing while preserving
the ability to easily fold and unfold any genomic locus.
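To make the idea of a spatial proximity map concrete, the sketch below bins pairs of genomic positions into a symmetric matrix of contact counts. It is a minimal sketch under stated assumptions: the function name `contact_matrix`, the single-chromosome integer coordinates, and the fixed bin size are hypothetical simplifications, not the Hi-C analysis pipeline itself.

```python
def contact_matrix(read_pairs, genome_length, bin_size):
    """Count contacts between fixed-size genomic bins.

    `read_pairs` is an iterable of (pos_a, pos_b) integer coordinates on a
    single hypothetical chromosome; each pair stands for two loci that were
    ligated together because they were close in space.
    """
    n_bins = -(-genome_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in read_pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the map symmetric
    return matrix


# Toy data, invented purely for illustration.
toy_pairs = [(100, 250), (120, 900), (880, 130), (400, 450)]
m = contact_matrix(toy_pairs, genome_length=1000, bin_size=250)
for row in m:
    print(row)
```

In such a map, a high count far from the diagonal indicates two loci that are distant along the sequence yet frequently close in space.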