Projects

Natural language and data science

I developed Project ESPAÑOL, an application that leverages high-dimensional analysis of classical Spanish-language poems to create custom lesson plans and guide discovery of new texts for practice and study. Using high-throughput web scraping, I generated a dataset of 10,000 Spanish-language poems in the public domain (totaling 2.5 million words), and then ran an unsupervised clustering algorithm in Python to group the poems into difficulty levels by analyzing verb frequencies and grammatical tenses. I built an interactive Plotly Dash application to view the full texts, metadata, and grammatical statistics.
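The grouping step can be illustrated with a minimal sketch. It is not the project's code: the choice of k-means, the four-tense feature set, and the toy counts below are all assumptions made for illustration, and the verb tenses are assumed to have been tagged and counted upstream.

```python
import math

# Hypothetical feature set: the tenses actually analyzed are not listed in the text.
TENSES = ["present", "preterite", "imperfect", "subjunctive"]


def tense_profile(tense_counts):
    """Convert raw per-tense verb counts into proportions so poem length does not dominate."""
    total = sum(tense_counts.get(t, 0) for t in TENSES)
    if total == 0:
        return [0.0] * len(TENSES)
    return [tense_counts.get(t, 0) / total for t in TENSES]


def kmeans(points, k, n_iter=20):
    """Plain k-means, seeded with the first k points to keep the sketch deterministic."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each poem goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Invented toy counts for four poems (not real data).
poems = [
    {"present": 18, "preterite": 2},
    {"present": 5, "preterite": 5, "imperfect": 4, "subjunctive": 6},
    {"present": 16, "preterite": 3, "imperfect": 1},
    {"present": 4, "preterite": 4, "imperfect": 6, "subjunctive": 6},
]
profiles = [tense_profile(p) for p in poems]
labels, centroids = kmeans(profiles, k=2)
```

In this toy input, the poems dominated by the present tense land in one cluster and the poems with a broader mix of tenses land in the other. Mapping clusters to ordered difficulty levels would be a separate step.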

Biomarkers and genomic medicine

at Parexel International

At a clinical research organization serving a major pharmaceutical client, I performed data ingestion and developed ETL pipelines for external genomic data integration (primarily GWAS and eQTL data). I wrote pipelines in R, Python, and Bash to streamline data ingestion according to the client’s specific data and metadata standards. With fluency in techniques from both bioinformatics and data science, I bridged the gap between the scientific analysts, who needed this data managed at scale, and the data engineers, who lacked familiarity with bioinformatics methods and resources.

Example of open-source tool created:

SPUR: SNP Position Update from rsIDs
- A pipeline to lift over genomic coordinates for datasets that are on older builds of the human genome and contain only rsIDs
- SPUR first separates the rsIDs in the original data into those that map to the current dbSNP build without issue and those that do not. It then splits the non-mapping rsIDs into those that have been merged into a different rsID in the current build and those that have been dropped completely. Finally, SPUR updates the SNP positions in the original dataset by linking the mapped and merged rsIDs to their current genome positions.
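
The three steps above can be sketched in a few lines of Python. This is an illustration of the logic only, not SPUR's actual code: the function names, the in-memory dictionaries standing in for dbSNP position and merge tables, and the toy rsIDs and positions are all hypothetical.

```python
def resolve_rsid(rsid, current_positions, merge_map):
    """Classify an rsID as 'mapped', 'merged', or 'dropped' against the current build."""
    if rsid in current_positions:
        return "mapped", rsid
    # Follow the merge history, guarding against cycles in the lookup table.
    seen = set()
    target = rsid
    while target in merge_map and target not in seen:
        seen.add(target)
        target = merge_map[target]
        if target in current_positions:
            return "merged", target
    return "dropped", None


def update_positions(records, current_positions, merge_map):
    """Attach current coordinates to mapped and merged rsIDs; collect dropped ones."""
    updated, dropped = [], []
    for record in records:
        status, current_rsid = resolve_rsid(
            record["rsid"], current_positions, merge_map
        )
        if status == "dropped":
            dropped.append(record["rsid"])
            continue
        chrom, pos = current_positions[current_rsid]
        updated.append(
            {**record, "rsid_current": current_rsid,
             "chrom": chrom, "pos": pos, "status": status}
        )
    return updated, dropped


# Invented toy inputs (not real dbSNP identifiers or coordinates).
records = [{"rsid": "rs1"}, {"rsid": "rs2"}, {"rsid": "rs3"}]
current_positions = {"rs1": ("1", 1000), "rs20": ("2", 2000)}
merge_map = {"rs2": "rs20"}

updated, dropped = update_positions(records, current_positions, merge_map)
```

Keeping the dropped rsIDs in a separate list, rather than discarding them silently, makes it easy to report how much of the original dataset could not be carried forward.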

Neural cell types and circuits for vocal learning

at University of Texas Southwestern Medical Center in the Roberts lab and Konopka lab

My post-doctoral research used comparative high-throughput transcriptomics to understand how the brain produces complex, learned behaviors like speech and language. These projects were the first to implement single-cell RNA sequencing in songbirds, and the results have broad implications for understanding the genetic toolkits that neurons and circuits use to perform advanced computations. I performed end-to-end analysis of large-scale, high-dimensional gene expression data encompassing molecular library preparation, computational pipelines, dataset integration, unsupervised clustering, and statistical modeling.

Open-access links to publications:

Neuromodulators of motivation and reward in vocal communication

at University of Wisconsin-Madison in the Riters lab

My doctoral dissertation examined the neural control of vocal communication across contexts, using songbirds as a model system. I identified neurotensin, a neuropeptide involved in motivation and reward that strongly interacts with dopamine, as a potential modulator of context-specific vocalizations. I applied multiple linear regression techniques to analyze individual, developmental, and physiological variation in gene expression, protein labeling, and behavior.

Open-access links to publications: