Keller Colloquium in Computing and Mathematical Sciences
As the pace and scale of data collection continue to increase across all areas of biology, there is a growing need for effective and principled statistical methods for analyzing the resulting data. In this talk, I'll describe two ongoing projects that help fill this gap. First, calcium imaging data is transforming the field of neuroscience by making it possible to assay the activities of large numbers of neurons simultaneously. For each neuron, the resulting "fluorescence trace" can be seen as a noisy surrogate for its spikes over time. In order to deconvolve a fluorescence trace into the underlying spike times, we consider an autoregressive model for calcium dynamics. This leads naturally to a seemingly intractable $\ell_0$ optimization problem. I will show that it is in fact possible to solve this optimization problem efficiently for the global optimum, leading to substantial improvements over competing approaches. Second, across many areas of biology, it is becoming increasingly common to collect "multi-view data": that is, data in which multiple data types (e.g., gene expression, DNA sequence, clinical measurements) have been measured on a single set of observations (e.g., patients). I will consider the following question: given a set of $n$ observations with measurements on $L$ data types, is there a single clustering of the $n$ observations that is shared across all $L$ data types, or does each data type have its own clustering of the observations? To answer this question, I will introduce a general framework for modeling multi-view data, as well as hypothesis tests that can be used to characterize the extent to which the clusterings on the $L$ data types are the same or different.
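
For concreteness, here is one way the deconvolution problem can be written down; the specific AR(1) parameterization below is an illustrative sketch rather than a formulation quoted from the talk. If $y_t$ denotes the observed fluorescence at time $t$ and $c_t$ the underlying calcium concentration, which decays as $c_t \approx \gamma c_{t-1}$ (for a decay parameter $0 < \gamma < 1$) except at spike times, then the $\ell_0$ problem takes the form
$$
\min_{c_1, \ldots, c_T} \; \frac{1}{2} \sum_{t=1}^{T} \left( y_t - c_t \right)^2 \; + \; \lambda \sum_{t=2}^{T} 1\!\left( c_t \neq \gamma c_{t-1} \right),
$$
where the indicator counts the estimated spike times and $\lambda > 0$ controls how many spikes are declared. Despite the nonconvex $\ell_0$ penalty, problems of this form have the structure of a changepoint-detection problem, which is what makes an exact, efficient solution plausible.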
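
For the multi-view question, one concrete way to phrase the test (again an illustrative sketch, not necessarily the parameterization used in the talk) is via a per-view mixture model. With $L = 2$ views, suppose each observation $i$ carries latent cluster labels $(Z_i^{(1)}, Z_i^{(2)})$ with joint probabilities $\Pi_{k k'} = P(Z_i^{(1)} = k, \, Z_i^{(2)} = k')$, and the data in view $l$ are drawn from the corresponding mixture component,
$$
X_i^{(l)} \mid Z_i^{(l)} = k \;\sim\; F_k^{(l)}, \qquad l = 1, 2.
$$
Whether the two views share clustering structure can then be cast as a hypothesis test on $\Pi$; for instance, the two clusterings are unrelated under the null of independence,
$$
H_0: \; \Pi_{k k'} = \pi_k^{(1)} \, \pi_{k'}^{(2)} \ \text{ for all } k, k'.
$$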