code helix is the personal website of Anya Korsakova, currently a Computational Biology researcher at Calico Life Sciences (Alphabet). Calico aims to develop novel therapeutics for age-related diseases, and Anya is working on machine learning approaches to predict the effects of noncoding DNA variation on the transcriptomic and epigenomic state of cells, focusing specifically on indels, structural variants and tandem repeats.

Previously, Anya was a postdoctoral researcher at the Cancer Science Institute of Singapore, extending methods to detect mutational signatures and their relationship with cancer phenotypes.

Anya earned her doctorate degree in biophysics from Nanyang Technological University as a SINGA scholar, and previously her B.Sc. and M.Sc. in theoretical nuclear physics from NRNU MEPhI in Moscow.

Blog

Projects

Sparse autoencoders for mechanistic interpretability of the DNA sequence-based model Borzoi @ f(DNA) Calico

Large ML models, such as the DNA sequence-based model Borzoi, are trained on large amounts of data. The training data contain a multitude of features, and because the model performs well at inference tasks, it has presumably learned to extract these features from sequence. We aim to use sparse autoencoders to decompose activations from the first few layers of the pre-trained model into monosemantic concepts that map to known and unknown transcriptional regulatory motifs.
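For readers unfamiliar with the technique, here is a minimal sketch of a sparse autoencoder forward pass and loss. It is a generic ReLU autoencoder with an L1 sparsity penalty, not the architecture used in this project. Random numbers stand in for Borzoi activations, and every name, shape and coefficient below is a hypothetical chosen for illustration.

```python
import numpy as np


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations x (n, d_model) into sparse codes and reconstruct them."""
    # Subtract the decoder bias, project into the wider dictionary, apply ReLU.
    codes = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Reconstruct the activations as a linear combination of dictionary rows.
    recon = codes @ W_dec + b_dec
    return codes, recon


def sae_loss(x, codes, recon, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes codes toward sparsity."""
    mse = np.mean(np.sum((x - recon) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(codes), axis=1))
    return mse + l1_coeff * sparsity


# Hypothetical sizes: 8-dimensional activations, a 32-entry dictionary.
rng = np.random.default_rng(0)
d_model, d_dict = 8, 32
x = rng.normal(size=(16, d_model))  # stand-in for real model activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_enc = np.zeros(d_dict)
b_dec = np.zeros(d_model)

codes, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, codes, recon)
```

In practice the weights would be trained by minimizing this loss over many activation vectors, and each dictionary entry would then be inspected for the sequence patterns that activate it.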

Shift augmentation for improved indel scoring in DNA sequence-based ML models @ f(DNA) Calico

Predicting genetic variant effects is critical for medical genetics. DNA sequence-based deep learning models attain state-of-the-art performance, but generally focus on single-nucleotide polymorphisms, and technical challenges (such as misalignment of pooling blocks and output boundaries) create artificially inflated variant effect scores for another common type of mutation: insertions and deletions (indels). We suggested and demonstrated that boundary-aware stitching significantly improved scoring for indels, structural variants and tandem repeats.

[manuscript to be submitted]

ALPS (Assignment of Local Probabilities for SBS Signatures) @ Pitt Genomics, CSI

ALPS is a probabilistic framework for assignment of single-base substitution (SBS) mutational signature enrichments to genomic and epigenomic features of interest. ALPS was developed to address the challenge of localizing mutational signatures in cancer genomes, where signature assignment does not take into account mutation coordinate information. However, signatures often colocalize with epigenomic features, and ALPS uses a simple probabilistic framework to assign SBS signatures back to genome regions.

Mutational signature assignment heterogeneity addressed by ensemble approaches @ Pitt Genomics, CSI

Collaborated with the Pitt Genomics team to consult on the best algorithmic practices for building ensemble approaches to mutational signature assignment.

Prediction of G4 formation in live cells with epigenetic data: a deep learning approach @ Phan Biophysics, NTU

G-quadruplexes (G4s) are secondary structures abundant in DNA that play regulatory roles in cells. Only a small fraction of the putative G4-forming sequences actually form G4 structures in cells. I approached the prediction of G4 formation by adding channels of normalized epigenetic and chromatin accessibility data on top of the one-hot-encoded DNA sequence input channels, and feeding the result to a deep vanilla CNN.
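The input layout described above can be sketched as follows: four one-hot channels for the sequence, plus one extra channel per epigenetic track. This is an illustration only. The min-max normalization, the track name and the helper names are assumptions, not the preprocessing used in the project.

```python
BASES = "ACGT"


def one_hot(seq):
    """One row per base with 4 channels; unknown bases such as N become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def min_max(track):
    """Scale a per-base signal track to [0, 1] (assumed normalization)."""
    lo, hi = min(track), max(track)
    if hi == lo:
        return [0.0] * len(track)
    return [(v - lo) / (hi - lo) for v in track]


def build_input(seq, tracks):
    """Stack one-hot sequence channels with one normalized channel per track."""
    rows = one_hot(seq)
    for name, track in tracks.items():
        if len(track) != len(seq):
            raise ValueError(f"track {name!r} does not match the sequence length")
        for row, value in zip(rows, min_max(track)):
            row.append(value)
    return rows


# Hypothetical 10-bp window with a single accessibility-like track.
seq = "GGGTTAGGGN"
tracks = {"accessibility": [0, 2, 4, 4, 2, 0, 0, 2, 4, 4]}
x = build_input(seq, tracks)  # shape: 10 positions x (4 + 1) channels
```

A convolutional network can then treat the extra tracks exactly like additional sequence channels, so the same filters see both the motif and its chromatin context.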

RNA alternative splicing prediction with discrete compositional energy network @ Phan Biophysics, NTU

A single gene can encode different protein versions through alternative splicing. Alternative splicing is determined by the gene's primary sequence and other regulatory factors such as RNA-binding protein levels. With these as input, we formulated the prediction of RNA splicing as a regression task, and proposed a discrete compositional energy network (DCEN) which leverages the hierarchical relationships between splice sites, junctions and transcripts. We built a new training and benchmarking dataset (CAPD), which is my main contribution here.

RNA G-quadruplex detection using Oxford nanopore sequencing @ Phan Biophysics, NTU

RNA G-quadruplex (rG4) structures are challenging to detect in long RNA transcripts. We used Oxford Nanopore sequencing to probe rG4 signatures in synthetic RNA constructs and native RNA transcripts. The nanopore current signature utility uses a z-score based method to detect G4 stalling events. The project did not mature enough to be published, but I learned a lot (and did quite a bit of wet lab experiments) along the way!
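As a rough illustration of z-score based event detection, the sketch below flags positions whose dwell time is unusually long relative to the rest of a read. Which signal the original utility scored, and with what threshold, is not stated here, so the dwell-time input, the threshold of 2.0 and the function names are all assumptions.

```python
from statistics import mean, stdev


def zscores(values):
    """Standardize values against their own mean and sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def flag_stalls(dwell_times, threshold=2.0):
    """Return indices whose dwell time lies more than `threshold` SDs above the mean."""
    return [i for i, z in enumerate(zscores(dwell_times)) if z > threshold]


# Hypothetical dwell times (arbitrary units) with one long pause at index 5.
dwell = [1.0, 1.2, 0.9, 1.1, 1.0, 9.0, 1.05, 0.95, 1.1, 1.0]
stalls = flag_stalls(dwell)
```

One caveat of this simple form is that a strong outlier inflates the standard deviation it is scored against, so robust alternatives (for example, median-based scores) may be preferable on real signal.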

Attractors in neural network maps

This little project was done for the Nonlinear Dynamics course at NTU and features aesthetic attractors forming in an MLP-turned-dynamic-map.
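The idea of turning an MLP into a dynamic map can be sketched by feeding the network's output back in as its next input and recording the visited points. The layer sizes, weight scale and seed below are arbitrary assumptions, not the settings used in the course project, and whether a given set of weights yields an interesting attractor depends on the draw.

```python
import math
import random


def make_mlp(rng, n_state=2, n_hidden=8, scale=2.0):
    """Draw random weights for a bias-free MLP with one hidden layer."""
    W1 = [[rng.gauss(0, scale) for _ in range(n_state)] for _ in range(n_hidden)]
    W2 = [[rng.gauss(0, scale) for _ in range(n_hidden)] for _ in range(n_state)]
    return W1, W2


def step(state, W1, W2):
    """Apply the MLP once: state -> tanh(W2 @ tanh(W1 @ state))."""
    hidden = [math.tanh(sum(w * s for w, s in zip(row, state))) for row in W1]
    return [math.tanh(sum(w * h for w, h in zip(row, hidden))) for row in W2]


def trajectory(state, W1, W2, n_steps=1000, burn_in=100):
    """Iterate the map, discard the transient, and collect the remaining points."""
    points = []
    for t in range(n_steps + burn_in):
        state = step(state, W1, W2)
        if t >= burn_in:
            points.append(tuple(state))
    return points


rng = random.Random(0)
W1, W2 = make_mlp(rng)
points = trajectory([0.1, -0.2], W1, W2)  # 1000 (x, y) pairs ready for a scatter plot
```

Because the output layer uses tanh, every point stays inside the square from -1 to 1, which makes the trajectories convenient to plot side by side for different random seeds.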

Media

Art of Academia podcast: discussing computational biology and physics, biology, AI/ML pivots

Invited lecture at the Traektoriya school (Armenia, 2019): on DNA Oxford Nanopore sequencing (in Russian, auto-generated English subtitles available)