Spectral Clustering and Biological Data

Matt Mahoney

Genetics, Dartmouth Medical School


High-throughput gene expression data is rapidly becoming a standard tool in biology. It provides a large-scale snapshot of a vast number of molecular processes within cells and tissues, raising the hope that it will yield insights into basic biological processes as well as clues to dysfunction in disease. Several studies over the past decade have demonstrated that many intractable diseases, including multiple cancers and autoimmune disorders, have molecularly distinct sub-types, suggesting multiple disease mechanisms. Clustering the data to identify these sub-types is therefore a ubiquitous first step in data processing, and one that deserves careful attention. This talk will present the ongoing work of an erstwhile mathematician to implement Spectral Clustering for high-throughput gene expression data. I will focus on the perennial problem of identifying the number of clusters in a given data set and on how features of gene expression data can inform the choices made in accomplishing that task.

DISCLAIMER: All buzzwords above (including the mathematical ones) will be defined for non-biologists!