Federation in genomics pipelines: techniques and challenges.

Chaterji, Somali; Koo, Jinkyu; Li, Ninghui; Meyer, Folker; Grama, Ananth; Bagchi, Saurabh

Affiliation

Chaterji S; Computer Science, Purdue University, Indiana, USA.
Koo J; Electrical and Computer Engineering, Purdue University, Indiana, USA.
Li N; Computer Science, Purdue University, Indiana, USA.
Meyer F; Argonne National Laboratory, Mathematics and Computer Science Division, Illinois, USA.
Grama A; University of Chicago Medical School, Illinois, USA.
Bagchi S; Computer Science, Purdue University, Indiana, USA.

Brief Bioinform; 20(1): 235-244, 2019-01-18.

Article in English | MEDLINE | ID: mdl-28968781

ABSTRACT

Federation is a popular concept in building distributed cyberinfrastructures, whereby computational resources are provided by multiple organizations through a unified portal, decreasing the complexity of moving data back and forth among those organizations. Federation has been used in bioinformatics only to a limited extent, namely, federation of datastores, e.g. SBGrid Consortium for structural biology and Gene Expression Omnibus (GEO) for functional genomics. Here, we posit that it is important to federate both computational resources (CPU, GPU, FPGA, etc.) and datastores to support popular bioinformatics portals, which face fast-increasing data volumes and processing requirements. A prime example, and one that we discuss here, is in genomics and metagenomics. It is critical that the data be processed without having to transport it across large network distances. We exemplify our design and development through our experience with metagenomics-RAST (MG-RAST), the most popular metagenomics analysis pipeline. Currently, it is hosted completely at Argonne National Laboratory. However, through a recently started collaborative National Institutes of Health project, we are taking steps toward federating this infrastructure. Because MG-RAST is a widely used resource, we have to move toward federation without disrupting its 50 K annual users. In this article, we describe the computational tools that will be useful for federating a bioinformatics infrastructure and the open research challenges that we see in federating such infrastructures. We hope that our manuscript can spur greater federation of bioinformatics infrastructures by showing the steps involved, and thus allow them to scale to support larger user bases.

Subject(s)

Genomics/statistics & numerical data; Information Dissemination/methods; Big Data; Computational Biology/methods; Confidentiality; Databases, Genetic/statistics & numerical data; Genetic Privacy; Humans; Metagenomics/statistics & numerical data; Software; United States

Full text: 1 Collection: 01-internacional Database: MEDLINE Main subject: Genomics / Information Dissemination Limits: Humans Country/Region as subject: North America Language: En Journal: Brief Bioinform Journal subject: BIOLOGY / MEDICAL INFORMATICS Year: 2019 Document type: Article Affiliation country: United States