To make real scientific discoveries possible from so many sources of data, the data had to be reanalyzed for consistency. To reduce the possibility of technical artifacts, scientists had to perform realignment, recalibration, and re-genotyping of the exomes. But there was a problem: none of the consortium members had enough local computational resources to process all 6,500 exomes.
The team decided to use Google Genomics, a fully managed service on Google Cloud Platform. Scientist Mike Nalls ran Broad Institute’s GATK Best Practices pipeline using Google Genomics, processing the full 200TB set of 6,500 exomes—starting with raw, unaligned sequence data and leading to a set of variant calls—in just three and a half weeks. The dataset was subsequently used to identify six new risk loci for Parkinson’s disease, helping scientists better understand genetic risks for the disease.
“Cloud computing allowed us to speed up discovery,” says Mike Nalls, PhD, Scientist at the National Institute on Aging. “We collaborated with Google Genomics to test varying implementations of the standard processing pipeline for exome sequence data on the cohort and population scale.”
Analyzing massive genetic datasets
Mike could have run the analysis even faster, but opted to limit the number of virtual machines and disks to take advantage of sustained use discounts and reduce costs. Even if hardware could have been procured, the effort would have taken months of compute time using local infrastructure. With Google Genomics on Google Cloud Platform, the National Institute on Aging can now analyze massive datasets, giving scientists access to virtually unlimited compute resources for large-scale projects.
Download to Learn More
To learn more about how cloud computing enables new discoveries in weeks rather than months, download the paper "National Institute on Aging: Accelerating the fight against Parkinson’s Disease."