Search | BVS (Virtual Health Library) Search Portal

Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit.

Blamey, Ben; Toor, Salman; Dahlö, Martin; Wieslander, Håkan; Harrison, Philip J; Sintorn, Ida-Maria; Sabirsh, Alan; Wählby, Carolina; Spjuth, Ola; Hellander, Andreas.

Gigascience ; 10(3)2021 03 19.

Article in English | MEDLINE | ID: mdl-33739401

ABSTRACT

BACKGROUND: Large streamed datasets, characteristic of life science applications, are often resource-intensive to process, transport and store. We propose a pipeline model, a design pattern for scientific pipelines, where an incoming stream of scientific data is organized into a tiered or ordered "data hierarchy". We introduce the HASTE Toolkit, a proof-of-concept cloud-native software toolkit based on this pipeline model, to partition and prioritize data streams to optimize use of limited computing resources. FINDINGS: In our pipeline model, an "interestingness function" assigns an interestingness score to data objects in the stream, inducing a data hierarchy. From this score, a "policy" guides decisions on how to prioritize computational resource use for a given object. The HASTE Toolkit is a collection of tools to adopt this approach. We evaluate with 2 microscopy imaging case studies. The first is a high content screening experiment, where images are analyzed in an on-premise container cloud to prioritize storage and subsequent computation. The second considers edge processing of images for upload into the public cloud for real-time control of a transmission electron microscope. CONCLUSIONS: Through our evaluation, we created smart data pipelines capable of effective use of storage, compute, and network resources, enabling more efficient data-intensive experiments. We note a beneficial separation between scientific concerns of data priority, and the implementation of this behaviour for different resources in different deployment contexts. The toolkit allows intelligent prioritization to be "bolted on" to new and existing systems and is intended for use with a range of technologies in different deployment scenarios.
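
The pipeline model described in this abstract (an interestingness function that scores each object, and a policy that maps the score to a resource decision) can be illustrated with a minimal Python sketch. This is not the HASTE Toolkit API: the names `DataObject`, `interestingness`, `tier_policy`, and `run_pipeline`, the `sharpness` feature, and the thresholds 0.7 and 0.3 are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable


@dataclass
class DataObject:
    """One object in the stream, e.g. a microscopy image (hypothetical type)."""
    name: str
    features: Dict[str, float] = field(default_factory=dict)


def interestingness(obj: DataObject) -> float:
    """Hypothetical interestingness function: returns a score clamped to [0, 1].

    Here the score is just an assumed precomputed 'sharpness' feature.
    """
    sharpness = obj.features.get("sharpness", 0.0)
    return max(0.0, min(1.0, sharpness))


def tier_policy(score: float) -> str:
    """Hypothetical policy: maps a score to a resource decision (a tier)."""
    if score >= 0.7:
        return "process_now"
    if score >= 0.3:
        return "store"
    return "discard"


def run_pipeline(
    stream: Iterable[DataObject],
    score_fn: Callable[[DataObject], float],
    policy: Callable[[float], str],
) -> Dict[str, str]:
    """Score every object in the stream and record the policy decision."""
    return {obj.name: policy(score_fn(obj)) for obj in stream}


stream = [
    DataObject("img_001", {"sharpness": 0.9}),
    DataObject("img_002", {"sharpness": 0.5}),
    DataObject("img_003", {"sharpness": 0.1}),
    DataObject("img_004"),  # no features available, so the score is 0.0
]
decisions = run_pipeline(stream, interestingness, tier_policy)
```

Keeping the scoring function and the policy as separate callables mirrors the separation the abstract notes between the scientific question of data priority and how that priority is acted on in a given deployment.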

Subject(s)

Biological Science Disciplines , Software , Diagnostic Imaging

MaRe: Processing Big Data with application containers on Apache Spark.

Capuccini, Marco; Dahlö, Martin; Toor, Salman; Spjuth, Ola.

Gigascience ; 9(5)2020 05 01.

Article in English | MEDLINE | ID: mdl-32369166

ABSTRACT

BACKGROUND: Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has been proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing processing tools in bioinformatics pipelines. Furthermore, these frameworks do not have native support for application containers, which are becoming popular in scientific data processing. RESULTS: Here we present MaRe, an open source programming library that introduces support for Docker containers in Apache Spark. Apache Spark and Docker are the MapReduce framework and container engine that have collected the largest open source community; thus, MaRe provides interoperability with the cutting-edge software ecosystem. We demonstrate MaRe on 2 data-intensive applications in life science, showing ease of use and scalability. CONCLUSIONS: MaRe enables scalable data-intensive processing in life science with Apache Spark and application containers. When compared with current best practices, which involve the use of workflow systems, MaRe has the advantage of providing data locality, ingestion from heterogeneous storage systems, and interactive processing. MaRe is generally applicable and available as open source software.

Subject(s)

Big Data , Computational Biology/methods , Databases, Factual , Software , Algorithms , Polymorphism, Single Nucleotide , Workflow

SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines.

Lampa, Samuel; Dahlö, Martin; Alvarsson, Jonathan; Spjuth, Ola.

Gigascience ; 8(5)2019 05 01.

Article in English | MEDLINE | ID: mdl-31029061

ABSTRACT

BACKGROUND: The complex nature of biological data has driven the development of specialized software tools. Scientific workflow management systems simplify the assembly of such tools into pipelines, assist with job automation, and aid reproducibility of analyses. Many contemporary workflow tools are specialized or not designed for highly complex workflows, such as with nested loops, dynamic scheduling, and parametrization, which is common in, e.g., machine learning. FINDINGS: SciPipe is a workflow programming library implemented in the programming language Go, for managing complex and dynamic pipelines in bioinformatics, cheminformatics, and other fields. SciPipe helps in particular with workflow constructs common in machine learning, such as extensive branching, parameter sweeps, and dynamic scheduling and parametrization of downstream tasks. SciPipe builds on flow-based programming principles to support agile development of workflows based on a library of self-contained, reusable components. It supports running subsets of workflows for improved iterative development and provides a data-centric audit logging feature that saves a full audit trace for every output file of a workflow, which can be converted to other formats such as HTML, TeX, and PDF on demand. The utility of SciPipe is demonstrated with a machine learning pipeline, a genomics, and a transcriptomics pipeline. CONCLUSIONS: SciPipe provides a solution for agile development of complex and dynamic pipelines, especially in machine learning, through a flexible application programming interface suitable for scientists used to programming or scripting.

Subject(s)

Computational Biology , Genomics , Software , Gene Library , Machine Learning , Programming Languages , Workflow

Tracking the NGS revolution: managing life science research on shared high-performance computing clusters.

Dahlö, Martin; Scofield, Douglas G; Schaal, Wesley; Spjuth, Ola.

Gigascience ; 7(5)2018 05 01.

Article in English | MEDLINE | ID: mdl-29659792

ABSTRACT

Background: Next-generation sequencing (NGS) has transformed the life sciences, and many research groups are newly dependent upon computer clusters to store and analyze large datasets. This creates challenges for e-infrastructures accustomed to hosting computationally mature research in other sciences. Using data gathered from our own clusters at UPPMAX computing center at Uppsala University, Sweden, where core hour usage of ~800 NGS and ~200 non-NGS projects is now similar, we compare and contrast the growth, administrative burden, and cluster usage of NGS projects with projects from other sciences. Results: The number of NGS projects has grown rapidly since 2010, with growth driven by entry of new research groups. Storage used by NGS projects has grown more rapidly since 2013 and is now limited by disk capacity. NGS users submit nearly twice as many support tickets per user, and 11 more tools are installed each month for NGS projects than for non-NGS projects. We developed usage and efficiency metrics and show that computing jobs for NGS projects use more RAM than non-NGS projects, are more variable in core usage, and rarely span multiple nodes. NGS jobs use booked resources less efficiently for a variety of reasons. Active monitoring can improve this somewhat. Conclusions: Hosting NGS projects imposes a large administrative burden at UPPMAX due to large numbers of inexperienced users and diverse and rapidly evolving research areas. We provide a set of recommendations for e-infrastructures that host NGS research projects. We provide anonymized versions of our storage, job, and efficiency databases.

Subject(s)

Biological Science Disciplines/methods , Computational Biology/methods , High-Throughput Nucleotide Sequencing/methods , Research , Software

Recommendations on e-infrastructures for next-generation sequencing.

Spjuth, Ola; Bongcam-Rudloff, Erik; Dahlberg, Johan; Dahlö, Martin; Kallio, Aleksi; Pireddu, Luca; Vezzi, Francesco; Korpelainen, Eija.

Gigascience ; 5: 26, 2016 06 07.

Article in English | MEDLINE | ID: mdl-27267963

ABSTRACT

With ever-increasing amounts of data being produced by next-generation sequencing (NGS) experiments, the requirements placed on supporting e-infrastructures have grown. In this work, we provide recommendations based on the collective experiences from participants in the EU COST Action SeqAhead for the tasks of data preprocessing, upstream processing, data delivery, and downstream analysis, as well as long-term storage and archiving. We cover demands on computational and storage resources, networks, software stacks, automation of analysis, education, and also discuss emerging trends in the field. E-infrastructures for NGS require substantial effort to set up and maintain over time, and with sequencing technologies and best practices for data analysis evolving rapidly it is important to prioritize both processing capacity and e-infrastructure flexibility when making strategic decisions to support the data analysis demands of tomorrow. Due to increasingly demanding technical requirements we recommend that e-infrastructure development and maintenance be handled by a professional service unit, be it internal or external to the organization, and emphasis should be placed on collaboration between researchers and IT professionals.

Subject(s)

High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Computational Biology/methods , Humans , Information Storage and Retrieval , Internet , Software

BioImg.org: A Catalog of Virtual Machine Images for the Life Sciences.

Dahlö, Martin; Haziza, Frédéric; Kallio, Aleksi; Korpelainen, Eija; Bongcam-Rudloff, Erik; Spjuth, Ola.

Bioinform Biol Insights ; 9: 125-8, 2015.

Article in English | MEDLINE | ID: mdl-26401099

ABSTRACT

Virtualization is becoming increasingly important in bioscience, enabling assembly and provisioning of complete computer setups, including operating system, data, software, and services packaged as virtual machine images (VMIs). We present an open catalog of VMIs for the life sciences, where scientists can share information about images and optionally upload them to a server equipped with a large file system and fast Internet connection. Other scientists can then search for and download images that can be run on the local computer or in a cloud computing environment, providing easy access to bioinformatics environments. We also describe applications where VMIs aid life science research, including distributing tools and data, supporting reproducible analysis, and facilitating education. BioImg.org is freely available at: https://bioimg.org.

An open science peer review oath.

Aleksic, Jelena; Alexa, Adrian; Attwood, Teresa K; Chue Hong, Neil; Dahlö, Martin; Davey, Robert; Dinkel, Holger; Förstner, Konrad U; Grigorov, Ivo; Hériché, Jean-Karim; Lahti, Leo; MacLean, Dan; Markie, Michael L; Molloy, Jenny; Schneider, Maria Victoria; Scott, Camille; Smith-Unna, Richard; Vieira, Bruno Miguel.

F1000Res ; 3: 271, 2014.

Article in English | MEDLINE | ID: mdl-25653839

ABSTRACT

One of the foundations of the scientific method is to be able to reproduce experiments and corroborate the results of research that has been done before. However, with the increasing complexities of new technologies and techniques, coupled with the specialisation of experiments, reproducing research findings has become a growing challenge. Clearly, scientific methods must be conveyed succinctly, and with clarity and rigour, in order for research to be reproducible. Here, we propose steps to help increase the transparency of the scientific method and the reproducibility of research results: specifically, we introduce a peer-review oath and accompanying manifesto. These have been designed to offer guidelines to enable reviewers (with the minimum friction or bias) to follow and apply open science principles, and support the ideas of transparency, reproducibility and ultimately greater societal impact. Introducing the oath and manifesto at the stage of peer review will help to check that the research being published includes everything that other researchers would need to successfully repeat the work. Peer review is the lynchpin of the publishing system: encouraging the community to consciously (and conscientiously) uphold these principles should help to improve published papers, increase confidence in the reproducibility of the work and, ultimately, provide strategic benefits to authors and their institutions.

Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data.

Lampa, Samuel; Dahlö, Martin; Olason, Pall I; Hagberg, Jonas; Spjuth, Ola.

Gigascience ; 2(1): 9, 2013 Jun 25.

Article in English | MEDLINE | ID: mdl-23800020

ABSTRACT

Analyzing and storing data and results from next-generation sequencing (NGS) experiments is a challenging task, hampered by ever-increasing data volumes and frequent updates of analysis methods and tools. Storage and computation have grown beyond the capacity of personal computers and there is a need for suitable e-infrastructures for processing. Here we describe UPPNEX, an implementation of such an infrastructure, tailored to the needs of data storage and analysis of NGS data in Sweden serving various labs and multiple instruments from the major sequencing technology platforms. UPPNEX comprises resources for high-performance computing, large-scale and high-availability storage, an extensive bioinformatics software suite, up-to-date reference genomes and annotations, a support function with system and application experts as well as a web portal and support ticket system. UPPNEX applications are numerous and diverse, and include whole genome-, de novo- and exome sequencing, targeted resequencing, SNP discovery, RNASeq, and methylation analysis. There are over 300 projects that utilize UPPNEX and include large undertakings such as the sequencing of the flycatcher and Norwegian spruce. We describe the strategic decisions made when investing in hardware, setting up maintenance and support, allocating resources, and illustrate major challenges such as managing data growth. We conclude with summarizing our experiences and observations with UPPNEX to date, providing insights into the successful and less successful decisions made.

The transcriptome of the adenovirus infected cell.

Zhao, Hongxing; Dahlö, Martin; Isaksson, Anders; Syvänen, Ann-Christine; Pettersson, Ulf.

Virology ; 424(2): 115-28, 2012 Mar 15.

Article in English | MEDLINE | ID: mdl-22236370

ABSTRACT

Alterations of cellular gene expression following an adenovirus type 2 infection of human primary cells were studied by using superior sensitive cDNA sequencing. In total, 3791 cellular genes were identified as differentially expressed more than 2-fold. Genes involved in DNA replication, RNA transcription and cell cycle regulation were very abundant among the up-regulated genes. On the other hand, genes involved in various signaling pathways including TGF-β, Rho, G-protein, Map kinase, STAT and NF-κB stood out among the down-regulated genes. Binding sites for E2F, ATF/CREB and AP2 were prevalent in the up-regulated genes, whereas binding sites for SRF and NF-κB were dominant among the down-regulated genes. It is evident that the adenovirus has gained control of the host cell cycle, growth, immune response and apoptosis at 24 h after infection. However, efforts by the host cell to block the cell cycle progression and activate an antiviral response were also observed.

Subject(s)

Adenovirus Infections, Human/genetics , Adenoviruses, Human/physiology , Transcriptome , Adenovirus Infections, Human/metabolism , Adenovirus Infections, Human/virology , Cells, Cultured , Fibroblasts/metabolism , Fibroblasts/virology , Gene Expression Profiling , Humans , Lung/cytology , Lung/metabolism , Lung/virology , Signal Transduction
