code helix is the personal website of Anya Korsakova, currently a Computational Biology researcher at Calico Life Sciences (Alphabet). Calico aims to develop novel therapeutics for age-related diseases, and Anya works on machine learning approaches to predict the effects of noncoding DNA variation on the transcriptomic and epigenomic state of cells, focusing specifically on indels, structural variants and tandem repeats.
Previously, Anya was a postdoctoral researcher at Cancer Science Institute of Singapore, extending methods to detect mutational signatures and their relationship with cancer phenotypes.
Anya earned her doctorate in biophysics from Nanyang Technological University as a SINGA scholar, after completing her B.Sc. and M.Sc. in theoretical nuclear physics at NRNU MEPhI in Moscow.
Large ML models, such as the DNA sequence-based model Borzoi, are trained on large amounts of data. The training data contain a multitude of features, and the model's success at inference tasks suggests that it has learned to extract these features from sequence. We aim to use sparse autoencoders to decompose activations from the first few layers of the pre-trained model into monosemantic concepts that map to known and unknown transcriptional regulatory motifs.
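As a rough illustration of the idea, here is a minimal sketch of a sparse autoencoder forward pass and loss, assuming the common formulation with a ReLU encoder, a linear decoder, and an L1 sparsity penalty. The dimensions, random weights, and `l1_coeff` value are hypothetical, and plain Python lists stand in for real Borzoi activations to keep the example self-contained.

```python
import random

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode one activation vector into non-negative features, then decode it."""
    pre = [h + b for h, b in zip(matvec(W_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1

# Hypothetical sizes: a 4-dimensional activation expanded into 16 features.
rng = random.Random(0)
d_model, d_features = 4, 16
W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_features)]
W_dec = [[rng.gauss(0, 0.1) for _ in range(d_features)] for _ in range(d_model)]
b_enc = [0.0] * d_features
b_dec = [0.0] * d_model

x = [rng.gauss(0, 1) for _ in range(d_model)]  # stand-in for a model activation
features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon)
```

In practice the weights would be trained by minimizing this loss over many activation vectors, and the feature dictionary is made wider than the activation so that individual features can specialize.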
Predicting genetic variant effects is critical for medical genetics. DNA sequence-based deep learning models attain state-of-the-art performance, but generally focus on single-nucleotide polymorphisms. For another common type of mutation, insertions and deletions (indels), technical challenges such as misalignment of pooling blocks and output boundaries artificially inflate variant effect scores. We proposed boundary-aware stitching and demonstrated that it significantly improved scoring for indels, structural variants and tandem repeats.
ALPS is a probabilistic framework for assignment of single-base substitution (SBS) mutational signature enrichments to genomic and epigenomic features of interest. ALPS was developed to address the challenge of localizing mutational signatures in cancer genomes, where signature assignment does not take into account mutation coordinate information. However, signatures often colocalize with epigenomic features, and ALPS uses a simple probabilistic framework to assign SBS signatures back to genome regions.
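To make the idea of assigning signatures back to individual mutations concrete, here is a sketch of one common approach: a Bayes-rule posterior over signatures given a mutation's trinucleotide channel and the sample's signature exposures. This illustrates the general technique only and is not the ALPS implementation; the signature names, channel probabilities, and exposures are made up.

```python
def signature_posterior(channel, exposures, profiles):
    """Return P(signature | channel), proportional to exposure * P(channel | signature)."""
    weights = {
        sig: exposures[sig] * profiles[sig].get(channel, 0.0)
        for sig in exposures
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError(f"no signature can explain channel {channel!r}")
    return {sig: w / total for sig, w in weights.items()}

# Hypothetical signature profiles: P(channel | signature) for two channels only.
profiles = {
    "SigA": {"A[C>T]G": 0.30, "T[C>A]A": 0.05},
    "SigB": {"A[C>T]G": 0.10, "T[C>A]A": 0.40},
}
# Hypothetical per-sample exposures (fraction of mutations from each signature).
exposures = {"SigA": 0.6, "SigB": 0.4}

post = signature_posterior("A[C>T]G", exposures, profiles)
```

Under this kind of scheme, summing the posteriors of all mutations that overlap a genomic feature would give expected per-signature mutation counts for that feature.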
Collaborated with the Pitt Genomics team, consulting on best algorithmic practices for building ensemble approaches to mutational signature assignment.
G-quadruplexes (G4s) are secondary structures abundant in DNA that play regulatory roles in cells. Only a small fraction of putative G4 sequences actually form G4 structures in cells. I approached the prediction of G4 formation with a deep vanilla CNN, adding channels of normalized epigenetic and chromatin accessibility data on top of the one-hot-encoded DNA sequence input channels.
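The input construction can be sketched as follows: four one-hot sequence channels stacked with one extra channel per signal track. The track names and values are hypothetical, and min-max scaling is an assumption, since the normalization actually used is not stated here.

```python
BASES = "ACGT"

def one_hot(seq):
    """Return 4 channels (A, C, G, T); ambiguous bases such as N stay all-zero."""
    seq = seq.upper()
    return [[1.0 if base == b else 0.0 for base in seq] for b in BASES]

def min_max_normalize(track):
    """Scale a signal track to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(track), max(track)
    if hi == lo:
        return [0.0] * len(track)
    return [(v - lo) / (hi - lo) for v in track]

def build_input(seq, tracks):
    """Stack one-hot sequence channels with one normalized channel per track."""
    channels = one_hot(seq)
    for name, track in tracks.items():
        if len(track) != len(seq):
            raise ValueError(f"track {name!r} length does not match sequence")
        channels.append(min_max_normalize(track))
    return channels  # shape: (4 + number of tracks, sequence length)

# Hypothetical 10 bp window with two per-base signal tracks.
seq = "GGGTTAGGGN"
tracks = {
    "accessibility": [0, 2, 4, 6, 8, 10, 8, 6, 4, 2],
    "histone_mark": [1, 1, 2, 3, 5, 5, 3, 2, 1, 1],
}
x = build_input(seq, tracks)
```

The resulting channels-by-length array is the kind of input a 1D CNN can consume directly, with the first convolution mixing sequence and epigenetic channels.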
A single gene can encode different protein versions through alternative splicing. Alternative splicing is determined by the gene's primary sequence and other regulatory factors such as RNA-binding protein levels. With these as input, we formulated the prediction of RNA splicing as a regression task, and proposed a discrete compositional energy network (DCEN) which leverages the hierarchical relationships between splice sites, junctions and transcripts. We also built a new training and benchmarking dataset (CAPD), which is my main contribution here.
RNA G-quadruplex (rG4) structures are challenging to detect in long RNA transcripts. We used Oxford Nanopore sequencing to probe rG4 signatures in synthetic RNA constructs and native RNA transcripts. The nanopore current signature utility uses a z-score based method to detect G4 stalling events. The project did not mature enough to be published, but I learned a lot (and did quite a bit of wet lab work) along the way!
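A z-score based detector of this kind can be sketched as below, assuming that the quantity being scored is a per-position dwell time and that the baseline comes from control reads without G4s. Both assumptions, along with the threshold of 3 and the numbers used, are illustrative and not taken from the actual utility.

```python
from statistics import mean, stdev

def zscores(values, baseline):
    """Z-score each value against the mean and standard deviation of a baseline."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has zero variance")
    return [(v - mu) / sigma for v in values]

def detect_stalls(dwell_times, baseline, threshold=3.0):
    """Return indices whose dwell time is unusually long relative to the baseline."""
    return [
        i for i, z in enumerate(zscores(dwell_times, baseline))
        if z > threshold
    ]

# Hypothetical dwell times (arbitrary units) from control reads and a test read.
baseline = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.0]
dwell_times = [1.0, 1.1, 5.0, 0.9, 1.3]

stalls = detect_stalls(dwell_times, baseline)
```

Only positive deviations are flagged here, since a stall shows up as a longer-than-usual dwell rather than a shorter one.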
This little project was done for the Nonlinear Dynamics course at NTU and features aesthetic attractors forming in an MLP-turned-dynamic-map.
Art of Academia podcast: discussing computational biology and physics, biology, AI/ML pivots
Invited lecture at the Traektoriya school (Armenia, 2019): on DNA Oxford Nanopore sequencing (in Russian, auto-generated English subtitles available)