CDS Lunch Seminar - OutPredict: multiple datasets can improve prediction of expression and inference of causality

Speaker: Jacopo Cirrone

Location: On-Line

Date: Wednesday, October 20, 2021

The Systems Biology community has invested a great deal of effort in modeling gene regulatory networks that should be able to (i) accurately predict future states and (ii) identify regulatory hubs that can be manipulated to achieve desired phenotypes. Most computational tools for the problem embody linear models (e.g., 5*TF1 + 2*TF2 - 0.4*TF3 ...). However, it is well known that biological interactions are highly synergistic and non-linear. Further, those tools mostly try to predict networks directly, even when the discovered edges (which usually come from some assay such as ChIP-seq) may have little physiological significance (e.g., may not influence gene expression). This work considers an alternative approach to inferring gene causality. Specifically, we consider the problem of predicting the expression of genes at a future time point in a genomic time series. In this, we follow the philosophy that accurate prediction often corresponds to a good understanding of causality.
For a target time point t, the prediction may rest on several sources of data: the time point immediately preceding t, the entire target time series preceding t, and ancillary data. In biology, for example, the ancillary data may consist of a network based on binding data, data from different time series, steady state data, a community-blessed gold standard network, or some combination of those. We introduce OutPredict, a machine learning method for time series that incorporates ancillary steady state and network data to achieve a low error in gene expression prediction. We show that OutPredict outperforms several state-of-the-art methods for prediction. The predictive models of OutPredict in turn generate a causal network. Thus, this work presents an approach to the inference of causality that rests on predictions of out-of-sample time points from both steady state and time series data. Because the model for each gene identifies those transcription factors that have the most importance in prediction, those important transcription factors are the most likely causal elements for that gene. We validate those predictions for a set of well-documented transcription factors in Arabidopsis.
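To make the data setup concrete, the following is a minimal sketch of how training examples for one target gene might be assembled from a time series plus steady state samples, holding out the last time point for out-of-sample evaluation. It is an illustration under stated assumptions, not OutPredict's actual implementation: the gene names, expression values, function names, and the choice to treat each steady state sample as a same-sample (TFs, target) pair are all hypothetical.

```python
# Illustrative sketch only. All names and values below are hypothetical;
# the abstract does not specify how OutPredict builds its training data.

# Each dict maps a gene name to its expression level in one sample.
time_series = [
    {"TF1": 1.0, "TF2": 0.5, "G": 0.2},
    {"TF1": 1.2, "TF2": 0.4, "G": 0.6},
    {"TF1": 1.5, "TF2": 0.3, "G": 0.9},
    {"TF1": 1.7, "TF2": 0.2, "G": 1.3},
]
steady_state = [
    {"TF1": 2.0, "TF2": 0.1, "G": 1.8},
]


def time_series_pairs(series, tfs, target):
    """Pair TF expression at one time point with target expression at the next."""
    return [
        ([prev[tf] for tf in tfs], nxt[target])
        for prev, nxt in zip(series, series[1:])
    ]


def steady_state_pairs(samples, tfs, target):
    """Assumption: a steady state sample pairs TF levels with the target
    level measured in the same sample."""
    return [([s[tf] for tf in tfs], s[target]) for s in samples]


def build_datasets(series, steady, tfs, target):
    """Hold out the final time point; pool the rest with steady state data."""
    ts_pairs = time_series_pairs(series, tfs, target)
    train = ts_pairs[:-1] + steady_state_pairs(steady, tfs, target)
    test = ts_pairs[-1:]  # out-of-sample: predict the last time point
    return train, test


train, test = build_datasets(time_series, steady_state, ["TF1", "TF2"], "G")
```

Any non-linear regressor that reports per-feature importance could then be fit on `train` and scored on `test`; ranking the transcription factors by importance would give candidate causal edges for the target gene, in the spirit of the approach described above.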
Because our methods apply to any situation in which there is time series data, ancillary data, and the need for non-linear causal models, we believe that this work will have a broad appeal to the scientific community, specifically those studying causality networks in any biological system.