Results 1 - 8 of 8
1.
Bioinformatics ; 32(12): i216-i224, 2016 06 15.
Article in English | MEDLINE | ID: mdl-27307620

ABSTRACT

MOTIVATION: Despite rapid progress in sequencing technology, assembling de novo the genomes of new species as well as reconstructing complex metagenomes remain major technological challenges. New synthetic long read (SLR) technologies promise significant advances towards these goals; however, their applicability is limited by high sequencing requirements and the inability of current assembly paradigms to cope with combinations of short and long reads. RESULTS: Here, we introduce Architect, a new de novo scaffolder aimed at SLR technologies. Unlike previous assembly strategies, Architect does not require a costly subassembly step; instead it assembles genomes directly from the SLR's underlying short reads, which we refer to as read clouds. This enables a 4- to 20-fold reduction in sequencing requirements and a 5-fold increase in assembly contiguity on both genomic and metagenomic datasets relative to state-of-the-art assembly strategies aimed directly at fully subassembled long reads. AVAILABILITY AND IMPLEMENTATION: Our source code is freely available at https://github.com/kuleshov/architect. CONTACT: kuleshov@stanford.edu.


Subject(s)
Genome , Genomics , High-Throughput Nucleotide Sequencing , DNA Sequence Analysis
2.
Bioinformatics ; 30(17): i379-85, 2014 Sep 01.
Article in English | MEDLINE | ID: mdl-25161223

ABSTRACT

MOTIVATION: Accurate haplotyping (determining from which parent particular portions of the genome are inherited) is still mostly an unresolved problem in genomics. This problem has only recently started to become tractable, thanks to the development of new long read sequencing technologies. Here, we introduce ProbHap, a haplotyping algorithm targeted at such technologies. The main algorithmic idea of ProbHap is a new dynamic programming algorithm that exactly optimizes a likelihood function that is specified by a probabilistic graphical model and that generalizes a popular objective called the minimum error correction. In addition to being accurate, ProbHap also provides confidence scores at phased positions. RESULTS: On a standard benchmark dataset, ProbHap makes 11% fewer errors than current state-of-the-art methods. This accuracy can be further increased by excluding low-confidence positions, at the cost of a small drop in haplotype completeness. AVAILABILITY: Our source code is freely available at: https://github.com/kuleshov/ProbHap.
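
The minimum error correction (MEC) objective mentioned in the abstract can be illustrated with a small sketch. This is not ProbHap's dynamic program or its likelihood model; it only shows, under simplifying assumptions (biallelic sites coded 0/1, error-free genotypes, a brute-force search that is feasible only for toy inputs), what the MEC cost measures. The function names and the example fragments are hypothetical.

```python
from itertools import product


def mec_score(haplotype, fragments):
    """MEC cost of a candidate haplotype (assumed 0/1 allele coding).

    haplotype: tuple of 0/1 alleles, one per heterozygous site.
    fragments: list of dicts mapping site index -> observed allele.
    """
    total = 0
    for frag in fragments:
        # Mismatches if the fragment was read from this haplotype...
        d_h = sum(1 for pos, allele in frag.items() if haplotype[pos] != allele)
        # ...or from its complement, i.e. the other parental copy.
        d_c = len(frag) - d_h
        # Each fragment is assigned to whichever copy needs fewer corrections.
        total += min(d_h, d_c)
    return total


def brute_force_phase(n_sites, fragments):
    """Exhaustively search all 2^n haplotypes (toy sizes only)."""
    return min(product((0, 1), repeat=n_sites),
               key=lambda h: mec_score(h, fragments))


# Hypothetical fragments covering four heterozygous sites.
fragments = [
    {0: 0, 1: 0, 2: 1},
    {1: 1, 2: 0, 3: 1},
    {0: 0, 2: 1, 3: 0},
    {1: 0, 2: 0},  # conflicts with the first fragment at sites 1 and 2
]
best = brute_force_phase(4, fragments)
print(best, mec_score(best, fragments))
```

A haplotype and its complement always receive the same MEC cost, which is why phasing is only defined up to swapping the two parental copies.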


Subject(s)
Algorithms , Haplotypes , Statistical Models , DNA Sequence Analysis/methods , Human Genome , Genomics , Humans , Likelihood Functions
3.
bioRxiv ; 2024 Jun 10.
Article in English | MEDLINE | ID: mdl-38895432

ABSTRACT

Understanding the function and fitness effects of diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and are thus expected to offer better cross-species prediction through fine-tuning on limited labeled data compared to supervised deep learning models. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a carefully curated dataset consisting of 16 diverse Angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks involving transcription and translation modeling demonstrated high transferability to maize that diverged 160 million years ago, outperforming the best baseline model by 1.45-fold to 7.23-fold. PlantCaduceus also enables genome-wide deleterious mutation identification without multiple sequence alignment (MSA). PlantCaduceus demonstrated a threefold enrichment of rare alleles in prioritized deleterious mutations compared to MSA-based methods and matched state-of-the-art protein LMs. PlantCaduceus is a versatile pre-trained DNA LM expected to accelerate plant genomics and crop breeding applications.

4.
Nat Commun ; 10(1): 3341, 2019 07 26.
Article in English | MEDLINE | ID: mdl-31350405

ABSTRACT

Tens of thousands of genotype-phenotype associations have been discovered to date, yet not all of them are easily accessible to scientists. Here, we describe GWASkb, a machine-compiled knowledge base of genetic associations collected from the scientific literature using automated information extraction algorithms. Our information extraction system helps curators by automatically collecting over 6,000 associations from open-access publications with an estimated recall of 60-80% and with an estimated precision of 78-94% (measured relative to existing manually curated knowledge bases). This system represents a fully automated GWAS curation effort and is made possible by a paradigm for constructing machine learning systems called data programming. Our work represents a step towards making the curation of scientific literature more efficient using automated systems.
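
The idea behind data programming, writing noisy heuristic "labeling functions" and combining their votes instead of hand-labeling training data, can be sketched as follows. This is not the GWASkb pipeline: the three labeling functions and the example sentences are hypothetical, and a simple majority vote stands in for the learned model that data programming normally uses to weight the functions.

```python
import re

# Vote values (an assumed convention for this sketch).
ABSTAIN, NEG, POS = 0, -1, 1


def lf_mentions_rsid(sentence):
    """Vote POS when the sentence names a variant such as rs12345."""
    return POS if re.search(r"\brs\d+\b", sentence) else ABSTAIN


def lf_association_phrase(sentence):
    """Vote on the phrase 'associated with', treating 'not associated' as NEG."""
    text = sentence.lower()
    if "not associated" in text:
        return NEG
    return POS if "associated with" in text else ABSTAIN


def lf_explicit_negation(sentence):
    """Vote NEG on explicit statements that no association was found."""
    text = sentence.lower()
    return NEG if ("no association" in text or "not associated" in text) else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_rsid, lf_association_phrase, lf_explicit_negation]


def majority_vote(sentence):
    """Combine labeling-function votes; ties and all-abstain give ABSTAIN."""
    score = sum(lf(sentence) for lf in LABELING_FUNCTIONS)
    if score > 0:
        return POS
    if score < 0:
        return NEG
    return ABSTAIN


for s in [
    "rs12345 was associated with type 2 diabetes.",
    "rs999 was not associated with height.",
    "The cohort included 500 participants.",
]:
    print(majority_vote(s), s)
```

Replacing the unweighted vote with per-function accuracy estimates is the step that distinguishes data programming from plain rule voting; that step is omitted here for brevity.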


Subject(s)
Genetic Databases , Genome-Wide Association Study , Computational Biology , Data Mining , Human Genome , Humans , Machine Learning
5.
Nat Med ; 25(1): 24-29, 2019 01.
Article in English | MEDLINE | ID: mdl-30617335

ABSTRACT

Here we present deep-learning techniques for healthcare, centering our discussion on deep learning in computer vision, natural language processing, reinforcement learning, and generalized methods. We describe how these computational techniques can impact a few key areas of medicine and explore how to build end-to-end systems. Our discussion of computer vision focuses largely on medical imaging, and we describe the application of natural language processing to domains such as electronic health record data. Similarly, reinforcement learning is discussed in the context of robotic-assisted surgery, and generalized deep-learning methods for genomics are reviewed.


Subject(s)
Deep Learning , Delivery of Health Care , Diagnostic Imaging , Electronic Health Records , Humans , Natural Language Processing
6.
J Comput Biol ; 25(7): 677-688, 2018 07.
Article in English | MEDLINE | ID: mdl-29658784

ABSTRACT

We introduce GATTACA, a framework for fast unsupervised binning of metagenomic contigs. Similar to recent approaches, GATTACA clusters contigs based on their coverage profiles across a large cohort of metagenomic samples; however, unlike previous methods that rely on read mapping, GATTACA quickly estimates these profiles from k-mer counts stored in a compact index. This approach can result in over an order of magnitude speedup, while matching the accuracy of earlier methods on synthetic and real data benchmarks. It also provides a way to index metagenomic samples (e.g., from public repositories such as the Human Microbiome Project) offline once and reuse them across experiments; furthermore, the small size of the sample indices allows them to be easily transferred and stored. Leveraging the MinHash technique, GATTACA also provides an efficient way to identify publicly available metagenomic data that can be incorporated into the set of reference metagenomes to further improve binning accuracy. Thus, by enabling easy indexing and reuse of publicly available metagenomic data sets, GATTACA makes accurate metagenomic analyses accessible to a much wider range of researchers.
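
The MinHash technique named in the abstract estimates the Jaccard similarity of two k-mer sets from short signatures instead of the full sets. The sketch below is a generic illustration, not GATTACA's implementation: the helper names, the choice of BLAKE2b as the hash, the signature length of 64, and the toy sequences are all assumptions.

```python
import hashlib


def kmers(seq, k):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """64-bit hash of a k-mer under a given seed (one seed per hash function)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(kmer_set, num_hashes=64):
    """Signature: the minimum hash value of the set under each seeded hash."""
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature slots that agree approximates |A & B| / |A | B|."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison on small inputs."""
    return len(a & b) / len(a | b)


# Toy sequences (hypothetical); real use would index whole samples.
a = kmers("ACGTACGTTGCAAGCTTAGC", 4)
b = kmers("ACGTACGTTGCAAGCTAAGC", 4)
print(exact_jaccard(a, b),
      estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

Because signatures have a fixed length regardless of sample size, they can be computed once and compared cheaply, which matches the abstract's point about small, reusable sample indices.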


Subject(s)
Bayes Theorem , Computational Biology/statistics & numerical data , Metagenomics/statistics & numerical data , Microbiota/genetics , Cluster Analysis , Humans , Metagenome/genetics
7.
Nat Biotechnol ; 34(1): 64-9, 2016 Jan.
Article in English | MEDLINE | ID: mdl-26655498

ABSTRACT

Identifying bacterial strains in metagenome and microbiome samples using computational analyses of short-read sequences remains a difficult problem. Here, we present an analysis of a human gut microbiome using TruSeq synthetic long reads combined with computational tools for metagenomic long-read assembly, variant calling and haplotyping (Nanoscope and Lens). Our analysis identifies 178 bacterial species, of which 51 were not found using shotgun reads alone. We recover bacterial contigs that comprise multiple operons, including 22 contigs of >1 Mbp. Furthermore, we observe extensive intraspecies variation within microbial strains in the form of haplotypes that span up to hundreds of Kbp. Incorporation of synthetic long-read sequencing technology with standard short-read approaches enables more precise and comprehensive analyses of metagenomic samples.


Subject(s)
High-Throughput Nucleotide Sequencing , Microbiota , Species Specificity , Humans
8.
Nat Biotechnol ; 32(3): 261-266, 2014 Mar.
Article in English | MEDLINE | ID: mdl-24561555

ABSTRACT

The rapid growth of sequencing technologies has greatly contributed to our understanding of human genetics. Yet, despite this growth, mainstream technologies have not been fully able to resolve the diploid nature of the human genome. Here we describe statistically aided, long-read haplotyping (SLRH), a rapid, accurate method that uses a statistical algorithm to take advantage of the partially phased information contained in long genomic fragments analyzed by short-read sequencing. For a human sample, as little as 30 Gbp of additional sequencing data are needed to phase genotypes identified by 50× coverage whole-genome sequencing. Using SLRH, we phase 99% of single-nucleotide variants in three human genomes into long haplotype blocks 0.2-1 Mbp in length. We apply our method to determine allele-specific methylation patterns in a human genome and identify hundreds of differentially methylated regions that were previously unknown. SLRH should facilitate population-scale haplotyping of human genomes.


Subject(s)
Genomics/methods , Haplotypes/genetics , DNA Sequence Analysis/methods , Algorithms , DNA Methylation/genetics , Human Genome/genetics , Humans , Polymerase Chain Reaction , Single Nucleotide Polymorphism/genetics