Búsqueda | Portal Regional de la BVS

A super SDM (species distribution model) 'in the cloud' for better habitat-association inference with a 'big data' application of the Great Gray Owl for Alaska.

Huettmann, Falk; Andrews, Phillip; Steiner, Moriz; Das, Arghya Kusum; Philip, Jacques; Mi, Chunrong; Bryans, Nathaniel; Barker, Bryan.

Sci Rep ; 14(1): 7213, 2024 Mar 27.

Artículo en Inglés | MEDLINE | ID: mdl-38531933

RESUMEN

The currently available distribution and range maps for the Great Grey Owl (GGOW; Strix nebulosa) are ambiguous, contradictory, imprecise, outdated, often hand-drawn and thus not quantified, not based on data or scientific. In this study, we present a proof of concept with a biological application for technical and biological workflow progress on latest global open access 'Big Data' sharing, Open-source methods of R and geographic information systems (OGIS and QGIS) assessed with six recent multi-evidence citizen-science sightings of the GGOW. This proposed workflow can be applied for quantified inference for any species-habitat model such as typically applied with species distribution models (SDMs). Using Random Forest-an ensemble-type model of Machine Learning following Leo Breiman's approach of inference from predictions-we present a Super SDM for GGOWs in Alaska running on Oracle Cloud Infrastructure (OCI). These Super SDMs were based on best publicly available data (410 occurrences + 1% new assessment sightings) and over 100 environmental GIS habitat predictors ('Big Data'). The compiled global open access data and the associated workflow overcome for the first time the limitations of traditionally used PC and laptops. It breaks new ground and has real-world implications for conservation and land management for GGOW, for Alaska, and for other species worldwide as a 'new' baseline. As this research field remains dynamic, Super SDMs can have limits, are not the ultimate and final statement on species-habitat associations yet, but they summarize all publicly available data and information on a topic in a quantified and testable fashion allowing fine-tuning and improvements as needed. At minimum, they allow for low-cost rapid assessment and a great leap forward to be more ecological and inclusive of all information at-hand. Using GGOWs, here we aim to correct the perception of this species towards a more inclusive, holistic, and scientifically correct assessment of this urban-adapted owl in the Anthropocene, rather than a mysterious wilderness-inhabiting species (aka 'Phantom of the North'). Such a Super SDM was never created for any bird species before and opens new perspectives for impact assessment policy and global sustainability.

A hybrid and scalable error correction algorithm for indel and substitution errors of long reads.

Das, Arghya Kusum; Goswami, Sayan; Lee, Kisung; Park, Seung-Jong.

BMC Genomics ; 20(Suppl 11): 948, 2019 Dec 20.

Artículo en Inglés | MEDLINE | ID: mdl-31856721

RESUMEN

BACKGROUND: Long-read sequencing has shown the promises to overcome the short length limitations of second-generation sequencing by providing more complete assembly. However, the computation of the long sequencing reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to the short reads. METHODS: In this paper, we present a new hybrid error correction tool, called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently utilizes the k-mer coverage information of high throughput Illumina short-read sequences to rectify the PacBio long-read sequences.ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short read-based de Bruijn graph. ParLECH then utilizes the k-mer coverage information of the short reads to divide each long read into a sequence of low and high coverage regions, followed by a majority voting to rectify each substituted error base. RESULTS: ParLECH outperforms latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation results demonstrate that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. ParLECH can align more than 92% bases of an E. coli PacBio dataset with the reference genome, proving its accuracy. CONCLUSION: ParLECH can scale to over terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors present in the original long reads or newly introduced by the short reads.

Asunto(s)

Algoritmos , Biología Computacional/métodos , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Análisis de Secuencia de ADN/métodos , Bases de Datos Genéticas , Genoma/genética , Mutación , Programas Informáticos

Large-scale parallel genome assembler over cloud computing environment.

Das, Arghya Kusum; Koppa, Praveen Kumar; Goswami, Sayan; Platania, Richard; Park, Seung-Jong.

J Bioinform Comput Biol ; 15(3): 1740003, 2017 Jun.

Artículo en Inglés | MEDLINE | ID: mdl-28610458

RESUMEN

The size of high throughput DNA sequencing data has already reached the terabyte scale. To manage this huge volume of data, many downstream sequencing applications started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay as you go) resources at a lower cost. However, the locality-based programming model (e.g. MapReduce) is relatively new. Consequently, developing scalable data-intensive bioinformatics applications using this model and understanding the hardware environment that these applications require for good performance, both require further research. In this paper, we present a de Bruijn graph oriented Parallel Giraph-based Genome Assembler (GiGA), as well as the hardware platform required for its optimal performance. GiGA uses the power of Hadoop (MapReduce) and Giraph (large-scale graph analysis) to achieve high scalability over hundreds of compute nodes by collocating the computation and data. GiGA achieves significantly higher scalability with competitive assembly quality compared to contemporary parallel assemblers (e.g. ABySS and Contrail) over traditional HPC cluster. Moreover, we show that the performance of GiGA is significantly improved by using an SSD-based private cloud infrastructure over traditional HPC cluster. We observe that the performance of GiGA on 256 cores of this SSD-based cloud infrastructure closely matches that of 512 cores of traditional HPC cluster.

Asunto(s)

Nube Computacional , Genoma , Programas Informáticos , Bases de Datos Genéticas , Escherichia coli/genética , Genoma Humano , Secuenciación de Nucleótidos de Alto Rendimiento , Humanos , Masculino , Staphylococcus aureus/genética

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

DETALLE DE LA BÚSQUEDA