My interests range from modeling microbial populations to learning and information processing in nervous systems.
We are working on a spiking-neuron model of song learning in zebra finches that emphasizes the two-stage nature of this process. This is based on empirical evidence that shows that the brain area LMAN learns a corrective bias that is later copied onto the pre-motor area RA. Also in agreement with experimental results, our models do not require a reward signal to reach RA directly, but rather rely on the biased variability received from LMAN. Our models also capture some of the qualitative details of the statistics of spiking in RA neurons.
Vocalizations in birds range from simple calls to more complex songs. Bird songs are typically described as sequences of stereotyped vocalizations called syllables. It is hypothesized that females use song quality as an indication of male fitness, making bird song important for sexual selection.
Song complexity can vary greatly between different species of songbirds, with some species such as lyrebirds and mockingbirds being capable of imitating almost arbitrary sounds. Other species produce more stereotyped songs.
Song learning is typically split into two parts: sensory learning, during which the bird memorizes the pattern it wants to emulate; and motor learning, during which the bird practices singing the pattern until it gets it right. These two periods of learning can overlap, as they do in the zebra finch, the bird from which our work draws its inspiration.
The brain regions involved in song learning have been well characterized. Current evidence suggests that neurons in the HVC generate a time base by firing at different moments in the song, similar to a synfire chain. The signals from HVC are projected to RA, which provides a topographic map for the muscles involved in bird song. The synapses between HVC and RA are plastic, and experimental evidence suggests that they are the ones responsible for song learning.
While HVC and RA are necessary and sufficient for song production, successful learning also requires another brain area called LMAN, which also projects onto RA. Since the firing in LMAN is highly variable, it was thought that its only role was to add randomness to the song, providing the exploratory behavior necessary for reinforcement learning. More recently, it has been observed that LMAN provides a biased input to RA. Learning happens in two stages: first LMAN learns a corrective bias, and then, on a longer time scale, this is stored in the HVC-RA synapses.
Our study focuses on the second stage in the learning process, the transfer of information between LMAN and RA. We built artificial neuron networks modeling the RA neurons, and we added synaptic plasticity rules that update the HVC-RA synaptic weights based on the LMAN input. We observed that the LMAN signal needs to be tuned to the specific plasticity rule used in RA. For a particular class of plasticity rules that we considered, we showed that the corresponding LMAN signal can range from a signal that depends only on the current song output, to a signal that integrates the error in song production over a long period of time.
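The idea that the LMAN signal has to match the plasticity rule can be illustrated with a deliberately simplified, rate-based sketch in Python. This is not the spiking model described above: the one-hot HVC time base, the linear RA output, the function name `simulate_transfer`, and all parameter values are assumptions made for illustration. The LMAN bias here depends only on the current song output (one end of the range mentioned above), and each HVC-RA weight changes in proportion to presynaptic HVC activity times the LMAN bias, with no reward signal reaching RA.

```python
def simulate_transfer(target, n_renditions=200, learning_rate=0.1, lman_gain=1.0):
    """Toy consolidation of an LMAN bias into HVC-RA weights.

    Assumption: HVC neuron t fires only at time step t (one-hot time base),
    so the HVC-driven RA output at step t is simply weights[t].
    """
    weights = [0.0] * len(target)
    errors = []  # mean absolute error of the HVC-driven output, per rendition
    for _ in range(n_renditions):
        hvc_output = list(weights)
        errors.append(
            sum(abs(t - o) for t, o in zip(target, hvc_output)) / len(target)
        )
        # Assumed LMAN signal: a bias proportional to the current song error.
        lman_bias = [lman_gain * (t - o) for t, o in zip(target, hvc_output)]
        # Assumed plasticity rule: presynaptic HVC activity (1 at its own
        # time step, 0 elsewhere) times the LMAN bias received by RA.
        weights = [w + learning_rate * b for w, b in zip(weights, lman_bias)]
    return weights, errors


target_song = [0.2, 0.8, 0.5, 0.1]  # hypothetical target motor pattern
final_weights, errors = simulate_transfer(target_song)
print(f"initial error: {errors[0]:.3f}, final error: {errors[-1]:.2e}")
```

In this toy setting the error of the HVC-driven output shrinks by a factor of `1 - learning_rate * lman_gain` per rendition, so the LMAN bias fades as the correction is stored in the weights. A different plasticity rule would generally call for a different LMAN signal, which is the point made above.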
Information transfer between brain areas is likely to occur more broadly. A straightforward generalization is to mammalian motor control, where there is also some evidence that learning proceeds in two stages. Another process in which information transfer is important is long-term memory formation. Short-term memory depends on the hippocampus, while long-term memory does not. This suggests that memories get transferred out of the hippocampus. Hippocampal memory replay may play a role in this, and it may parallel the way in which birds practice their song.
We model bacteria-phage interactions when bacteria are capable of CRISPR-mediated adaptive immunity. Our model can exhibit a variety of behaviors, from long-term coexistence of bacteria and phage, to extinction of one of them. We characterize the way in which the immune repertoire of a bacterial population depends on various characteristics of the interaction.
Bacteria are constantly under threat from invading viruses (bacteriophages). There are various ways in which bacteria can defend themselves against such infections, but one of the most intriguing is a recently discovered mechanism called CRISPR. CRISPR is a heritable, adaptive immune system: immunity to a particular phage persists after the infection ends, as with mammalian immunity, and this immunity is also passed on to daughter cells.
CRISPR works by incorporating small bits (30-70 base pairs) of viral DNA sequence called spacers into the bacterial genome. When another virus enters the cell, its DNA is compared against these templates, and if a match is found, the virus is chopped up and neutralized.
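The recognition step can be caricatured as substring matching. The Python sketch below is a toy under strong assumptions: it uses exact matches on synthetic random sequences, ignores the molecular details of acquisition and interference, and the names `random_genome`, `acquire_spacer`, and `is_recognized` are hypothetical.

```python
import random

SPACER_LENGTH = 30  # lower end of the 30-70 bp range quoted above


def random_genome(length, seed):
    """Synthetic stand-in for a phage genome (not real sequence data)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def acquire_spacer(spacers, phage_genome, rng):
    """Return a new spacer set that also holds a random piece of the phage."""
    start = rng.randrange(len(phage_genome) - SPACER_LENGTH + 1)
    return spacers | {phage_genome[start:start + SPACER_LENGTH]}


def is_recognized(spacers, phage_genome):
    """True if any stored spacer matches the invading DNA exactly."""
    return any(spacer in phage_genome for spacer in spacers)


rng = random.Random(1)
phage_a = random_genome(500, seed=10)
phage_b = random_genome(500, seed=11)

naive = frozenset()
survivor = acquire_spacer(naive, phage_a, rng)
daughter = frozenset(survivor)  # spacers are inherited by daughter cells

for name, cell in [("naive", naive), ("daughter", daughter)]:
    print(name, is_recognized(cell, phage_a), is_recognized(cell, phage_b))
```

The daughter cell recognizes the phage its parent survived without ever having encountered it, while an unrelated phage goes undetected. This is the sense in which the immunity is both adaptive and heritable.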
The exact way in which spacers are acquired is not completely understood. It is also not known whether different spacers for the same virus can be more or less effective at defending against the infection. We have been working on a population dynamics model of the interaction between CRISPR-enabled bacteria and phage that can help approach these questions from a quantitative standpoint.
Our model shows that differences in the effectiveness of spacers can lead to highly peaked spacer distributions, in which a few spacers dominate the population. This is what is observed in experiments. In contrast, if spacers differ mainly in the ease with which they are acquired, or if the overall acquisition rate is high, the steady-state spacer distribution is more homogeneous.
We investigated ways of extracting structural and functional information from statistical properties of protein alignments. We focused mainly on statistical coupling analysis (SCA) and direct coupling analysis (DCA). We showed that experimental evidence for many claims related to SCA is currently lacking and suggested focusing on proteins in which the method predicts the existence of several different sectors. We are also looking at how the global probability model used by DCA relates to protein function.
Proteins are chains of amino acids that fold into complex structures at equilibrium. Finding the relation between a protein's amino acid sequence and its structure or function is very important for many applications in biology. However, first-principles approaches based on fundamental physics have had limited success in understanding this relation for all but the smallest proteins.
An alternative approach, thought of many decades ago but made possible recently by large sequencing efforts, is to use the variability provided by evolution. By looking at different organisms, it is possible to find many variants of a protein that have different sequences but fulfill similar functions and fold in similar ways. These can be assembled into alignments that highlight the differences between these sequences. How much can we infer from these alignments?
Although publicly available alignments have been growing steadily, with upwards of 50 000 sequences for some proteins, these numbers are still minuscule compared to the space of all possible protein sequences. The question arises whether we can use statistical methods to infer functional or structural information even for sequences that have not yet been encountered in nature. This can of course only be done given a set of assumptions that constrain the problem.
Two main approaches have gained some traction. One of them, statistical coupling analysis (SCA), posits the existence of subsets of amino acids in proteins called sectors. These are contiguous in space in the folded protein, though not in sequence, and are more conserved evolutionarily than other parts of the protein.
Another approach, direct coupling analysis (DCA), starts from the assumption that residues that are nearby in the folded structure are more strongly constrained during evolution, and thus a correlation should be visible between their mutations. A correction needs to be applied, however, because correlations, unlike spatial contacts, are transitive. If residues A and B, and residues B and C are close in structure, then mutations of A, B, and C are all likely to be pairwise correlated, but A and C may not be in contact. DCA uses inference on a probabilistic model to distinguish the causal correlations from the ones generated by transitivity. This approach leads to very accurate predictions of the contacts in a folded protein and, based on this, it is even able to provide an estimate for the full folded structure.
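The transitivity problem can be made concrete with a toy calculation in Python. The sketch below is an illustration under stated assumptions, not the DCA procedure itself: it replaces amino acids with three binary (+1/-1) variables A, B, C, uses hypothetical coupling values, and recovers the couplings by exact enumeration, which is only feasible because the system has eight states. Real alignments require the approximate inference discussed below.

```python
import itertools
import math

# Hypothetical direct couplings for a chain A - B - C (indices 0, 1, 2).
# A and C are NOT directly coupled.
COUPLINGS = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 0.0}


def state_probabilities(couplings):
    """Exact Boltzmann distribution over all 2^3 configurations."""
    states = list(itertools.product([-1, 1], repeat=3))
    weights = [
        math.exp(sum(j * s[a] * s[b] for (a, b), j in couplings.items()))
        for s in states
    ]
    z = sum(weights)
    return {s: w / z for s, w in zip(states, weights)}


def correlation(probs, a, b):
    """<s_a s_b>; the means vanish by symmetry, so this is the covariance."""
    return sum(p * s[a] * s[b] for s, p in probs.items())


def inferred_coupling(probs, a, b):
    """Project log P onto s_a * s_b to read off the direct coupling."""
    return sum(s[a] * s[b] * math.log(p) for s, p in probs.items()) / len(probs)


probs = state_probabilities(COUPLINGS)
pairs = [(0, 1), (1, 2), (0, 2)]
correlations = {pair: correlation(probs, *pair) for pair in pairs}
inferred = {pair: inferred_coupling(probs, *pair) for pair in pairs}

for pair in pairs:
    print(pair, f"correlation={correlations[pair]:.3f}",
          f"inferred coupling={inferred[pair]:.3f}")
```

A and C come out clearly correlated even though their direct coupling is zero, while the inferred couplings single out only the A-B and B-C pairs. Fitting a global model rather than thresholding pairwise correlations is what separates direct from indirect effects.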
We have focused on investigating the assumptions behind these two methods. We have shown that it is useful to distinguish various different meanings of the protein sectors that have been conflated in the literature up to this point. For example, the property of being formed by a contiguous set of residues is a structural one, the property of being more conserved than other residues is an evolutionary one, and the property of having a distinct functional role is a functional one. While all of these have been assumed to apply to SCA sectors, there is no a priori reason why they should occur together. In fact, we showed that most experimental evidence for the functional role of SCA sectors could be an artifact of the overlap between SCA sectors and conserved residues. We also suggested new experiments that could clear up the confusion.
For DCA, solving for the parameters of the maximum entropy model can usually only be done using an approximation whose accuracy is hard to determine. Our preliminary investigations show that the approximate method is actually not very accurate, and thus it remains an interesting question why the method is in fact capable of predicting contacts. Moreover, since the method provides a generative model for sequences in a given family, it could be expected that it can be used to identify loss-of-function mutations. In our hands, however, the DCA model is not much better at this than much simpler methods based on sequence conservation.
We worked on generating quantitative models for describing transcriptional regulation in prokaryotes and eukaryotes. The models assume that the interaction between a transcription factor and a promoter or enhancer region is mediated by a sequence-dependent binding energy. Using a high-throughput mutational assay, we were able to accurately model the transcription profile of a mammalian enhancer, and use this information to generate artificial enhancer sequences better suited for a given purpose.
Not all genes in an organism's DNA are active at the same time. This is most clear for multicellular organisms, where, for example, brain cells and muscle cells have very different expression patterns despite having identical DNA. Even in unicellular organisms, the genes expressed at any given time, and the levels at which they are expressed, are heavily regulated depending on internal and external conditions. A classic example is the lac operon in E. coli, which turns on the genes necessary for metabolizing lactose only when lactose is present in the environment and a more desirable sugar (such as glucose) is absent.
One way in which gene transcription is regulated is with the help of proteins called transcription factors. These bind to DNA regions near the gene, called promoters, and affect the ease with which RNA polymerase can bind to the DNA and transcribe the gene. The binding of transcription factors to promoters is highly sequence-specific, and having quantitative models for this binding is important for understanding transcriptional regulation.
Our approach starts with a library of thousands of promoter sequences that have been randomly mutated from their wild type. These sequences are built and used in an assay capable of measuring the changes in transcription due to the mutations. Our algorithms start with these data, identify likely spots for transcription factor binding, and then attempt to fit the data by positing a particular form for how the interaction between transcription factors and DNA depends on the promoter sequence. We have demonstrated this technique by applying it to a widely used mammalian enhancer (a DNA region similar to a promoter, but located farther away from the gene it controls). We used the model to search for enhancer sequences that improve the behavior of the system, and validated them in experiments.
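One common form for a sequence-dependent binding energy is an additive energy matrix, in which each position of the binding site contributes independently. The Python sketch below illustrates that form only. The matrix values, the 4-bp site width, the `chemical_potential` parameter (standing in for transcription factor concentration), and the two-state occupancy formula are assumptions chosen for illustration, not the fitted model described above.

```python
import math

# Hypothetical energy matrix for a 4-bp binding site, in units of kT.
# One dict per position; 0.0 marks the preferred base at that position.
ENERGY_MATRIX = [
    {"A": 0.0, "C": 1.2, "G": 1.5, "T": 0.8},
    {"A": 2.0, "C": 0.0, "G": 1.0, "T": 1.7},
    {"A": 1.1, "C": 0.9, "G": 0.0, "T": 1.4},
    {"A": 0.6, "C": 1.8, "G": 1.3, "T": 0.0},
]


def binding_energy(site):
    """Additive model: sum the per-position contributions of each base."""
    if len(site) != len(ENERGY_MATRIX):
        raise ValueError("site length must match the energy matrix width")
    return sum(column[base] for column, base in zip(ENERGY_MATRIX, site))


def occupancy(site, chemical_potential=1.0):
    """Equilibrium probability that the site is bound (two-state model)."""
    return 1.0 / (1.0 + math.exp(binding_energy(site) - chemical_potential))


def strongest_site(sequence):
    """Scan all windows and return the one with the lowest binding energy."""
    width = len(ENERGY_MATRIX)
    windows = [sequence[i:i + width] for i in range(len(sequence) - width + 1)]
    return min(windows, key=binding_energy)


promoter = "TTGACGTCA"  # made-up sequence
site = strongest_site(promoter)
print(site, binding_energy(site), round(occupancy(site), 3))
```

Under these assumptions, fitting the model to assay data amounts to adjusting the matrix entries until predicted occupancies track measured transcription, and designing a better sequence amounts to searching for sites whose energies give the desired occupancy.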