With a focus on how natural systems adapt to varying environmental conditions, my interests range from modeling microbial populations to learning and information processing in nervous systems. Below I briefly describe some of my current and past work; for the technical details, see the links on the publications page.
We are working on a theoretical model explaining the uneven distribution of different receptor types in the olfactory epithelium. We suggest that the large differences in abundance between receptor types stem from the affinities these receptors have for different odors, together with the natural statistics of odors. In particular, this means that in mammalian species, where the olfactory epithelium is regularly replaced, the distribution of receptors changes with olfactory experience, a phenomenon that has been observed experimentally.
Olfaction, or the sense of smell, is mediated by volatile molecules that drift through the air and enter the nose. These molecules, also called “odorants”, make their way to olfactory sensory neurons (OSNs) in the nasal epithelium. Each sensory neuron expresses a single type of receptor molecule, and each receptor molecule has a characteristic binding profile across a wide variety of odorants.
When an odorant binds to a receptor inside an OSN, there is a certain chance that the neuron will spike, and this probability depends on both the odorant and receptor type. All the neurons expressing a certain receptor type project their axons to one (or a few) spherical structures in the olfactory bulb called glomeruli. This means that all the information that the brain has regarding the olfactory environment is contained in the activations at the level of the glomeruli.
Experiments have shown that olfactory receptors have a wide range of abundances, with some receptors being thousands of times more common than others in the epithelium. To explain this, we propose a model based on the so-called “efficient coding” hypothesis, which suggests that the sensory periphery is structured so as to take advantage of statistical regularities in the environment.
In the case of olfaction, because of the different likelihood of encountering different odors in natural olfactory scenes, some receptor types yield more information about the environment than others. Since each receptor has a certain amount of noise, it is advantageous to average the activations from many of them in order to increase the signal-to-noise ratio. Given a fixed total number of neurons (related to the size of the nasal epithelium), this implies a balance between reducing noise for the most important receptors, while also maintaining an acceptable signal-to-noise level for the others. More formally, we are looking for an optimal receptor distribution that maximizes the information that glomerular activations contain about the olfactory environment.
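The trade-off described above can be sketched numerically. This is a toy version, not the paper's actual objective: we assume (hypothetically) that averaging the responses of n neurons of receptor type i yields information 0.5·log(1 + snr[i]·n), where snr[i] is an effective signal-to-noise ratio. Because the marginal gain of one more neuron shrinks as n grows, greedily assigning each neuron to the type with the largest marginal gain reaches the optimum.

```python
import numpy as np

def allocate_receptors(snr, total_neurons):
    """Greedily assign a fixed budget of neurons across receptor types.

    Hypothetical information model: n neurons of type i together carry
    0.5*log(1 + snr[i]*n) bits. Greedy allocation is optimal because the
    per-type information is concave in n.
    """
    snr = np.asarray(snr, dtype=float)
    counts = np.zeros_like(snr)

    def info(n):
        return 0.5 * np.log(1.0 + snr * n)

    for _ in range(total_neurons):
        gain = info(counts + 1) - info(counts)   # marginal gain of one neuron
        counts[np.argmax(gain)] += 1
    return counts

# Higher effective signal-to-noise ratio -> more abundant receptor type:
print(allocate_receptors([4.0, 1.0, 0.25], 100))
```

The diminishing returns built into the logarithm are what keep the low-SNR types from being dropped entirely: once the dominant types are well averaged, extra neurons are worth more elsewhere.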
Since the optimal receptor distribution depends on the statistics of natural odors, the model predicts that a change in the environment should lead to a change in the abundances of different receptor types, as observed in mammals. We show that this effect is more pronounced when the olfactory receptors are narrowly-tuned to detect a small number of odors. Our model also suggests that there is a monotonic relationship between the total number of olfactory sensory neurons and the number of receptor types expressed in the olfactory epithelium of individuals of related species.
Our basic model prescribes an outcome — maximum information transfer between the environment and the brain — but does not tell us how this can be implemented with the building blocks available in the nose. We therefore show that a simple population dynamical model based on logistic growth can be used to reach the predicted optimum, suggesting that a realistic implementation is indeed possible.
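A population-dynamics route to the optimum can be illustrated with a replicator-style toy model (the exact form and all parameter values here are assumptions, not the paper's equations): each receptor type grows at a rate set by its marginal information gain, with a shared competition term keeping the total neuron count fixed.

```python
import numpy as np

def simulate(snr, total=100.0, steps=20000, dt=0.01):
    """Replicator-style dynamics: types with higher marginal information grow."""
    snr = np.asarray(snr, dtype=float)
    n = np.full_like(snr, total / len(snr))      # start from a uniform mixture
    for _ in range(steps):
        fitness = 0.5 * snr / (1.0 + snr * n)    # marginal information per neuron
        mean_fit = np.sum(n * fitness) / total   # competition conserves sum(n)
        n += dt * n * (fitness - mean_fit)
    return n

# The fixed point equalizes marginal information across receptor types:
print(simulate([4.0, 1.0, 0.25]))
```

At the fixed point every surviving type has the same marginal information, which is exactly the first-order condition of the constrained optimization above.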
We built a novel model of song learning in zebra finches that emphasizes the two-stage nature of this process. Based on experimental evidence, our model assumes that a “tutor” circuit (corresponding to brain area LMAN in the bird) first learns a corrective bias for the song, which is later solidified in a “student” circuit (the pre-motor area RA). This requires a match between the tutor signal and the synaptic plasticity rule in the student; the structure of this match can be derived analytically in a firing-rate approximation. The resulting learning rules also work in spiking networks, and the tutor signal can be generated using a reinforcement rule.
Vocalizations in birds range from simple calls to more complex songs. Bird songs are typically described as sequences of stereotyped vocalizations called syllables. It is hypothesized that females use song quality as an indication of male fitness, making bird song important for sexual selection.
Song complexity varies greatly between species of songbirds: some, such as lyrebirds and mockingbirds, can imitate almost arbitrary sounds, while others produce more stereotyped songs.
Song learning is typically split into two parts: sensory learning, during which the bird memorizes the pattern it wants to emulate; and motor learning, during which the bird practices singing the pattern until it gets it right. These two periods can overlap, as is the case for the zebra finch -- the species that inspired our work.
The brain regions involved in song learning have been well characterized. Current evidence suggests that neurons in the HVC generate a time base by firing at different moments in the song, similar to a synfire chain. The signals from HVC are projected to RA, which provides a topographic map for the muscles involved in bird song. The synapses between HVC and RA are plastic, and experimental evidence suggests that they are the ones responsible for song learning.
While HVC and RA are necessary and sufficient for song production, successful learning also requires another brain area called LMAN, which also projects onto RA. Since the firing in LMAN is highly variable, it was thought that its only role was to add randomness to the song, providing the exploratory behavior necessary for reinforcement learning. More recently, it has been observed that LMAN provides a biased input to RA. Learning happens in two stages: first LMAN learns a corrective bias, and then, on a longer time scale, this is stored in the HVC-RA synapses.
Our study focuses mostly on the second stage in the learning process, the transfer of information between LMAN and RA. Using a firing-rate approximation, we showed that efficient learning requires LMAN to adapt its signal to the synaptic plasticity rule at work in RA. The particular structure of the LMAN signal can be derived from a gradient descent approach. In particular, the LMAN output can range from a signal that depends only on the current song output, to a signal that integrates the error in song production over a long period of time.
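A heavily simplified firing-rate sketch of this idea follows (the dimensions, rates, and linear readout are assumptions for illustration, not the model from the paper): HVC provides a time base h(t), the student output is a linear readout through plastic HVC-to-RA weights, and the tutor signal carries the corrective bias, here simply the current song error. When the tutor signal is matched to a gradient-descent plasticity rule, the student gradually consolidates the correction.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_hvc = 20, 50
h = rng.random((T, n_hvc))       # HVC activity at each time step of the song
target = rng.random(T)           # desired song output
w = np.zeros(n_hvc)              # plastic HVC->RA weights

eta = 0.1
for epoch in range(5000):
    y = h @ w                    # student output without the tutor's bias
    tutor = target - y           # tutor bias = current error in the song
    w += eta * h.T @ tutor / T   # plasticity rule matched to the tutor signal

print(np.abs(h @ w - target).max())  # residual song error after consolidation
```

The key point mirrored here is the matching: the weight update is exactly the gradient of the song error with respect to the student synapses, driven by the tutor's bias signal.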
Neurons in the brain do not have continuous outputs, but rather fire discrete spikes. Using computer simulations, we showed that our results also hold in networks that use spiking neurons. In addition, in these networks a simple reinforcement rule can be used to generate the LMAN signal.
Transferring information between two brain areas is likely to occur more broadly. A straightforward generalization is to mammalian motor control, where there is also some evidence that learning proceeds in two stages. Another process where information transfer is important is in long-term memory formation. Short-term memory is dependent on the hippocampus, while long-term memory is not, suggesting that memories get transferred outside the hippocampus. Hippocampal memory replay may play a role in this transfer, and it may parallel the way in which birds practice their song.
We modeled bacteria-phage interactions when bacteria are capable of CRISPR-mediated adaptive immunity. Our model exhibits a variety of behaviors, from long-term coexistence of bacteria and phage, to extinction of one of the populations. We characterized the way in which the immune repertoire of a bacterial population depends on various characteristics of the interaction, showing how the rate at which immunity is acquired can lead to more or less diverse immune repertoires.
Bacteria are under constant threat from invading viruses (bacteriophages). There are various ways in which bacteria can defend themselves against such infections, but one of the most intriguing is a recently discovered mechanism called CRISPR. CRISPR is a heritable, adaptive immune system, meaning that immunity to a particular phage persists after the infection ends, as with mammalian immunity. Furthermore, unlike in mammals, this immunity is passed on to daughter cells.
CRISPR works by incorporating small bits (30-70 base pairs) of viral DNA sequence, called “spacers”, into the bacterial genome. When a virus enters the cell, its DNA is compared against these templates, and if a match is found, the virus is chopped up and neutralized.
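The interference step amounts to a lookup of the invading DNA against the stored spacers. The sketch below illustrates this with made-up sequences and an exact-match rule (real CRISPR systems tolerate some mismatches):

```python
# Toy illustration of CRISPR interference: the cell stores short "spacers"
# copied from past invaders and checks new viral DNA against them.
def is_immune(spacers, viral_genome):
    """Return True if any stored spacer matches the invading DNA exactly."""
    return any(s in viral_genome for s in spacers)

spacers = ["ACGTTGCA", "TTGACCGT"]           # acquired from earlier infections
print(is_immune(spacers, "GGGTTGACCGTAAA"))  # → True: second spacer matches
print(is_immune(spacers, "GGGGAAAACCCC"))    # → False: infection proceeds
```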
The exact way in which spacers are acquired is not completely understood. It is also not known whether different spacers for the same virus can be more or less effective at defending against the infection. We have been working on a population dynamics model of the interaction between CRISPR-enabled bacteria and phage that can help approach these questions from a quantitative standpoint.
Our model shows that differences in the effectiveness of spacers can lead to highly-peaked spacer distributions, in which a few spacers dominate the population. This is what is observed in experiments. In contrast, if spacers differ mainly in the ease with which they are acquired, or if the overall acquisition rate is high, the steady-state spacer distribution is more homogeneous.
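The effect of the acquisition rate on repertoire diversity can be sketched with a stripped-down competition model (all parameter values are made up, and the paper's full model also tracks the phage population): strains carrying different spacers grow under constant phage pressure v, spacer i blocks a fraction eff[i] of infections, and acquisition seeds every spacer at rate acq.

```python
import numpy as np

def spacer_distribution(eff, acq, v=50.0, r=1.0, g=0.02, steps=2000, dt=0.01):
    """Fraction of the population carrying each spacer at the end of the run."""
    eff = np.asarray(eff, dtype=float)
    n = np.zeros_like(eff)
    for _ in range(steps):
        growth = r - (1.0 - eff) * g * v      # better spacers grow faster
        n += dt * (n * growth + acq)          # growth plus new acquisitions
        n /= 1.0 + dt * n.sum() / 1000.0      # soft cap on total population
    return n / n.sum()

eff = [0.95, 0.5, 0.1]
print(spacer_distribution(eff, acq=0.01))  # slow acquisition: sharply peaked
print(spacer_distribution(eff, acq=10.0))  # fast acquisition: more homogeneous
```

With slow acquisition, growth-rate differences dominate and the best spacer takes over; fast acquisition constantly reseeds weaker spacers, flattening the distribution.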
We investigated ways of extracting structural and functional information from statistical properties of protein alignments. We focused mainly on statistical coupling analysis (SCA) and direct coupling analysis (DCA). We showed that experimental evidence for many claims related to SCA is currently lacking, and suggested better ways to test it. We are also looking at how the global probability model used by DCA relates to protein function, and whether machine-learning methods applied directly to protein sequences can better predict fitness.
Proteins are chains of amino acids that fold into complex structures at equilibrium. Finding the relation between a protein's amino acid sequence and its structure or function is very important for many applications in biology. However, first-principles approaches based on fundamental physics have had limited success in understanding this relation for all but the smallest proteins.
An alternative approach, thought of many decades ago but made possible recently by large sequencing efforts, is to use the variability provided by evolution. By looking at different organisms, it is possible to find many variants of a protein that have different sequences but fulfill similar functions and fold in similar ways. These can be assembled into alignments that highlight the differences between these sequences. How much can we infer from these alignments?
Although publicly-available alignments have been steadily growing, having upwards of 50 000 sequences for some proteins, these numbers are still minuscule compared to the space of all possible protein sequences. The question arises whether we can use statistical methods to infer functional or structural information even for sequences that have not yet been encountered in nature. This can of course only be done given a set of assumptions that constrain the problem.
Two main approaches have gained some traction. One of them, statistical coupling analysis (SCA), posits the existence of subsets of amino acids in proteins called “sectors”. These are contiguous in space in the folded protein, though not in sequence, and are more conserved evolutionarily than other parts of the protein.
Another approach, direct coupling analysis (DCA), starts from the assumption that residues that are nearby in the folded structure are more strongly constrained during evolution, and thus a correlation should be visible between their mutations. A correction needs to be applied, however, because correlations, unlike spatial contacts, are transitive. If residues A and B, and residues B and C are close in structure, then mutations of A, B, and C are all likely to be pairwise correlated, but A and C may not be in contact. DCA uses inference on a probabilistic model to distinguish the causal correlations from the ones generated by transitivity. This approach leads to very accurate predictions of the contacts in a folded protein and, based on this, it is even able to provide an estimate for the full folded structure.
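The transitivity problem has a clean Gaussian analogue, sketched below: variables A-B and B-C are directly coupled, while A-C is linked only through B. Raw correlations connect all three pairs, but the inverse covariance (the "direct" couplings of a Gaussian model) correctly leaves the A-C entry near zero; DCA performs the analogous inference for discrete amino-acid sequences.

```python
import numpy as np

# Gaussian toy of transitive correlations: A <- B -> C, no direct A-C link.
rng = np.random.default_rng(1)
n = 200_000
b = rng.normal(size=n)
a = 0.8 * b + rng.normal(size=n)   # A directly coupled to B
c = 0.8 * b + rng.normal(size=n)   # C directly coupled to B, not to A

x = np.stack([a, b, c])
corr = np.corrcoef(x)              # raw correlations: all pairs look linked
prec = np.linalg.inv(np.cov(x))    # inverse covariance: direct couplings only

print(corr[0, 2])                  # substantial A-C correlation (~0.39)
print(prec[0, 2])                  # direct A-C coupling ≈ 0
```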
We have focused on investigating the assumptions behind these two methods. We have shown that it is useful to distinguish various different meanings of the protein sectors that have been conflated in the literature up to this point. For example, the property of being formed by a contiguous set of residues is a structural one, the property of being more conserved than other residues is an evolutionary one, and the property of having a distinct functional role is a functional one. While all of these have been assumed to apply to SCA sectors, there is no a priori reason why they should occur together. In fact, we showed that most experimental evidence for the functional role of SCA sectors could be an artifact of the overlap between SCA sectors and conserved residues. We also suggested new experiments that could clear up the confusion.
For DCA, various groups have proposed that the method can also be used to predict the fitness effect of mutations. This is because the method provides an estimate of the probability that any given sequence is part of a protein family. If a mutation leads to a sequence with a low probability, then we can assume its fitness to be low. However, each protein is optimized for several different functions, and the DCA probability can only respond to one of these (or a particular mixture). We are working on using machine learning techniques to combine alignment information with a small number of fitness measurements in order to predict the outcome of more general mutations.
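One simple form of the supervised idea can be sketched as follows (the encoding, model, and the tiny made-up dataset are assumptions for illustration, not our actual method): one-hot encode the sequences, then fit a ridge regression on a handful of measured fitness values and use it to score new mutants.

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids

def one_hot(seq):
    """Encode a sequence as a flat binary position-by-amino-acid vector."""
    x = np.zeros((len(seq), len(AAS)))
    for i, aa in enumerate(seq):
        x[i, AAS.index(aa)] = 1.0
    return x.ravel()

def fit_ridge(seqs, fitness, lam=1.0):
    X = np.stack([one_hot(s) for s in seqs])
    y = np.asarray(fitness)
    # closed-form ridge solution: w = (X^T X + lam*I)^{-1} X^T y
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return lambda s: float(one_hot(s) @ w)

# Made-up training data: fitness drops when position 1 mutates away from C.
predict = fit_ridge(["ACD", "AAD", "ACE", "AAE"], [1.0, 0.2, 1.1, 0.3])
print(predict("ACD"), predict("AAD"))
```

In practice the regression would be fit on features derived from the alignment (e.g. DCA scores) alongside the raw sequence, so that the few available fitness measurements pick out the relevant combination of constraints.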
We worked on building quantitative models for describing transcriptional regulation in prokaryotes and eukaryotes. The models assume that the interaction between a transcription factor and a promoter or enhancer region is mediated by a sequence-dependent binding energy. Using a high-throughput mutational assay, we were able to accurately model the transcriptional profile of a mammalian enhancer, and used this information to generate artificial enhancer sequences better suited for a given purpose.
Not all genes in an organism's DNA are active at the same time. This is most clear for multicellular organisms, where, for example, brain cells and muscle cells have very different expression patterns despite having identical DNA. Even in unicellular organisms, the genes expressed at any given time, and the levels at which they are expressed, are heavily regulated depending on internal and external conditions. A classic example is the lac operon in E. coli, which turns the genes necessary for metabolizing lactose on or off depending on whether lactose is present in the environment and whether a preferred sugar (such as glucose) is absent.
One way in which gene transcription is regulated is with the help of proteins called transcription factors. These bind to DNA regions near the gene, called promoters, and affect the ease with which RNA polymerase can bind to the DNA and transcribe the gene. The binding of transcription factors to promoters is highly sequence-specific and having quantitative models for this binding is important for understanding transcriptional regulation.
Our approach starts with a library of thousands of promoter sequences that have been randomly mutated from their wild type. These sequences are built and used in an assay capable of measuring the changes in transcription due to the mutations. Our algorithms start with this data, identify likely spots for transcription factor binding, and then attempt to fit the data by positing a particular form for how the interaction between transcription factors and DNA depends on the promoter sequence. We have demonstrated this technique by applying it to a widely-used mammalian enhancer (a DNA region similar to a promoter, but located farther away from the gene it controls). We used the model to search for enhancer sequences that improve the behavior of the system, and validated them in experiments.
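The core modeling assumption can be sketched with a minimal energy-matrix model (the matrix below is random for illustration; the fitted models are more elaborate): each base at each position contributes additively to the binding energy, and occupancy of the site follows a two-state Boltzmann form.

```python
import numpy as np

BASES = "ACGT"
rng = np.random.default_rng(0)
energy_matrix = rng.normal(size=(8, 4))   # made-up matrix, kBT units, one row per position

def binding_energy(site):
    """Additive sequence-dependent binding energy of a transcription factor."""
    return sum(energy_matrix[i, BASES.index(b)] for i, b in enumerate(site))

def occupancy(site, mu=0.0):
    """Probability the factor is bound, from a two-state Boltzmann model.

    mu plays the role of a chemical potential set by the factor's concentration.
    """
    e = binding_energy(site)
    return 1.0 / (1.0 + np.exp(e - mu))

print(occupancy("ACGTACGT"))
```

Fitting such a model to the mutant library amounts to finding the energy matrix (and the mapping from occupancy to transcription) that best explains the measured expression changes.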