December 13th, 2019
Topic: Containerized Attribute Indexing and Graph Genomes for Federated Data Access
Presenter: Dr. Ben Busby, Scientific Lead NCBI Hackathon Program, OpenCravat.org, Data Science Advisor
Abstract:
‘Federated Data’ is often cited as one of the primary benefits of cloud computing. Let’s think for a minute about what that actually means. Even if the technical hurdles are overcome, for example with Fusera in cloud buckets containing the NIH Sequence Read Archive (NCBI-SRA), one still cannot do adequate compute for large-scale applications, such as machine learning, without being able to search across datasets. Knowing which datasets to search across requires adequate, harmonized biological metadata, which any biological data scientist can assure you is in short supply in late 2019. While we continue to work to improve metadata by encouraging greater metadata exposure through more complete and ontology-harmonized contributions to public databases such as BioProject and BioSample, we have also taken a relatively novel approach to producing better calculated attributes: indexing and graphs. In 2019, we have been able to index viruses from thousands of metagenomes, propose millions of new exons and start sites for RefSeq to review, and build and annotate usable graph infrastructure for diploid and haploid organisms, not only from existing datasets in the SRA, but also from complementary datasets in other public databases. As a specific example, we have been able to load virological metadata (including virus-specific protein domains and contributions to bacterial pathogenicity), extracted from metagenomes by containerized pipelines, into a dynamic cloud-based indexing system. We hope to load graph information into this index soon as well. The goal for 2020 is to make these data indices and graph infrastructure more accessible and easier to use for cloud interoperability and for large-scale data analysis with modern tools such as machine learning.