Using evolutionary sequence variation to make inferences about protein structure and function
Lucy Colwell, Cambridge

Abstract:
The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. The explosive growth in the number of available protein sequences raises the possibility of using the natural variation present in homologous protein sequences to infer these constraints and thus identify residues that control different protein phenotypes. Because in many cases phenotypic changes are controlled by more than one amino acid, the mutations that separate one phenotype from another may not be independent, requiring us to understand the correlation structure of the data. To address this, we build a maximum entropy model of the protein sequence, constrained by the statistics of a large sequence alignment. Using this model, we infer residue pair interactions, which accurately predict residues in close structural proximity in protein tertiary structure. These predictions are used to generate all-atom structural models. We then apply our method to predict de novo the structure of 11 medically important transmembrane proteins of unknown structure. In addition, we are able to predict protein quaternary structure and alternative conformations. The next step requires development of a theoretical inference framework that enables the relationship between the amount of available input data and the reliability of structural predictions to be better understood.
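
As a rough illustration of how residue pair interactions might be inferred from alignment statistics, the sketch below fits a pairwise maximum entropy model in the mean-field approximation, where the couplings are taken to be the negative inverse of the covariance matrix of the one-hot encoded alignment. The abstract does not say which inference scheme is used, so the mean-field step is an assumption made here for brevity, as are the function names `one_hot` and `coupling_scores`, the pseudocount value of 0.5, the four-letter alphabet, and the toy alignment. Sequence reweighting and the corrections commonly applied to real alignments are omitted.

```python
import numpy as np


def one_hot(alignment, alphabet):
    """Encode equal-length aligned sequences as an (N, L, q) indicator array."""
    index = {letter: k for k, letter in enumerate(alphabet)}
    x = np.zeros((len(alignment), len(alignment[0]), len(alphabet)))
    for s, seq in enumerate(alignment):
        for i, letter in enumerate(seq):
            x[s, i, index[letter]] = 1.0
    return x


def coupling_scores(alignment, alphabet, pseudocount=0.5):
    """Return an (L, L) matrix of mean-field coupling strengths (hypothetical helper)."""
    x = one_hot(alignment, alphabet)
    n, length, q = x.shape
    lam = pseudocount  # assumed value; mixes the data with a uniform background

    # Single-site and pair frequencies, smoothed toward the uniform distribution.
    fi = (1 - lam) * x.mean(axis=0) + lam / q
    fij = (1 - lam) * np.einsum("sia,sjb->iajb", x, x) / n + lam / q**2
    for i in range(length):
        # A site paired with itself must reproduce its single-site frequencies.
        fij[i, :, i, :] = np.diag(fi[i])

    # Connected correlations; the last state is dropped so the matrix is invertible.
    cov = fij - fi[:, :, None, None] * fi[None, None, :, :]
    cov = cov[:, : q - 1, :, : q - 1].reshape(length * (q - 1), length * (q - 1))

    # Mean-field estimate of the couplings: negative inverse covariance.
    couplings = -np.linalg.inv(cov).reshape(length, q - 1, length, q - 1)

    # Summarise each residue pair by the Frobenius norm of its coupling block.
    scores = np.sqrt((couplings**2).sum(axis=(1, 3)))
    np.fill_diagonal(scores, 0.0)
    return scores


# Toy alignment (synthetic, not real data): columns 0 and 2 always match,
# columns 1 and 3 vary independently.
rng = np.random.default_rng(0)
alphabet = "ACDE"
states = rng.integers(0, len(alphabet), size=(200, 4))
states[:, 2] = states[:, 0]
toy = ["".join(alphabet[k] for k in row) for row in states]
scores = coupling_scores(toy, alphabet)
```

In this toy case the planted pair (columns 0 and 2) should receive the largest score. On a real alignment, the highest-scoring pairs would be the candidate structural contacts passed on to structure modelling.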