Scaling Persistent Homology to Large Biological Datasets
Vipul Periwal
Laboratory of Biological Modeling
NIDDK/NIH - Bethesda, MD
Wed, January 18, 2023 - 4:00 PM
Karl Herzfeld Auditorium of Hannan Hall - Rm 108
The structure and relative arrangement of the constituents of any biological system are crucial to its function because their interactions depend on proximity. Given data measuring the spatial embedding of such constituents, a pattern of interest is a region devoid of constituents, surrounded by a region of high density in which the constituents are close enough to interact. We colloquially refer to such a pattern as a hole. Holes have been shown to have functional significance: for example, chromatin loops in chromosomes enable long-range control of gene transcription, and three-dimensional voids in protein crystal structures are related to ligand interaction. An algorithm to compute loops and voids is therefore needed to analyze the deluge of experimental data, which often carry large experimental uncertainties. Furthermore, identifying voids in a 3D embedding by visual inspection is subjective and prone to inconsistencies, so an objective, mathematically sound method to detect holes and compute their statistical significance is required. Persistent homology (PH) is an approach to topological data analysis (TDA) that can detect holes in discrete data sets and assign each one a significance based on its robustness to experimental variability in the data. This information comes at a high computational cost in run time and memory, which has limited the applicability of PH to small data sets of a few thousand points. Further, PH is commonly restricted to computing only the existence and significance of holes, not their location, because locating them is more expensive computationally and less precise. We developed Dory, an efficient and scalable algorithm that computes PH for large data sets and also locates significant holes with improved precision.
We used Dory to find protein homologs with significantly different topology by analyzing 180k publicly available crystal structures from the Protein Data Bank (PDB), and to find chromatin loops in the human genome by analyzing high-resolution Hi-C contact maps, which yield point clouds with millions of points. In our benchmarks of different software, Dory was the only package able to analyze genome-wide Hi-C contact maps. To validate the results, we show that the computed loops in Hi-C data sets and the computed voids in proteins agree with known biology.
Refreshments served at 3:45 PM
If you have any questions about the Colloquium Series or would like to make a donation, please contact the Physics Department at cua-physics@cua.edu or (202) 319-5315.