Mango: Exploratory Data Analysis for Large-Scale Sequencing Datasets.

Morrow, Alyssa Kramer; He, George Zhixuan; Nothaft, Frank Austin; Tu, Eric Tongching; Paschall, Justin; Yosef, Nir; Joseph, Anthony Douglas

Affiliation

Morrow AK; Electrical Engineering and Computer Science Department, University of California Berkeley, 465 Soda Hall, Berkeley, CA 94720-1776, USA. Electronic address: akmorrow@berkeley.edu.
He GZ; Electrical Engineering and Computer Science Department, University of California Berkeley, 465 Soda Hall, Berkeley, CA 94720-1776, USA; Harvard Law School, 1563 Massachusetts Avenue, Cambridge, MA 02138, USA; Google, 355 Main St, Cambridge, MA 02142, USA.
Nothaft FA; Electrical Engineering and Computer Science Department, University of California Berkeley, 465 Soda Hall, Berkeley, CA 94720-1776, USA; Databricks, Inc., 160 Spear Street, 13th Floor, San Francisco, CA 94105, USA.
Tu ET; Electrical Engineering and Computer Science Department, University of California Berkeley, 465 Soda Hall, Berkeley, CA 94720-1776, USA; The Boeing Company, 1950 E Imperial Hwy, El Segundo, CA 90245-2701, USA.
Paschall J; Electrical Engineering and Computer Science Department, University of California Berkeley, 465 Soda Hall, Berkeley, CA 94720-1776, USA.
Yosef N; Electrical Engineering and Computer Science Department, University of California Berkeley, 465 Soda Hall, Berkeley, CA 94720-1776, USA; Center for Computational Biology, University of California Berkeley, 108 Stanley Hall, Berkeley, CA 94720-3220, USA.
Joseph AD; Electrical Engineering and Computer Science Department, University of California Berkeley, 465 Soda Hall, Berkeley, CA 94720-1776, USA; Center for Computational Biology, University of California Berkeley, 108 Stanley Hall, Berkeley, CA 94720-3220, USA; Unite Genomics, Inc., 1301 Marina Village Pkwy,

Cell Syst; 9(6): 609-613.e3, 2019 Dec 18.

Article in English | MEDLINE | ID: mdl-31812694

ABSTRACT

The decreasing cost of DNA sequencing over the past decade has led to an explosion of sequencing datasets, leaving us with petabytes of data to analyze. However, current sequencing visualization tools are designed to run on single machines, which limits their scalability and interactivity on modern genomic datasets. Here, we present Mango, a Jupyter notebook and genome browser built on Apache Spark, which removes these scalability and interactivity constraints by using multi-node compute clusters to allow interactive analysis over terabytes of sequencing data. We demonstrate the scalability of the Mango tools by performing quality control analyses on 10 terabytes of data from 100 high-coverage sequencing samples from the Simons Genome Diversity Project, enabling interactive exploration of multi-sample genomic datasets that exceed the computational limits of single-node visualization tools. Mango is freely available for download with full documentation at https://bdg-mango.readthedocs.io/en/latest/.

Subject(s)

Genomics/methods; Sequence Analysis, DNA/methods; Algorithms; Big Data; Data Analysis; Genome/genetics; High-Throughput Nucleotide Sequencing/methods; Software

Keywords

Apache Spark; genome browser; genome sequencing; genome visualization; interactive notebook

Full text: 1 Database: MEDLINE Main subject: Sequence Analysis, DNA / Genomics Language: English Journal: Cell Syst Year: 2019 Document type: Article
