Results 1 - 20 of 92
1.
Nat Commun ; 15(1): 989, 2024 Feb 02.
Article in English | MEDLINE | ID: mdl-38307861

ABSTRACT

Proteogenomics studies generate hypotheses on protein function and provide genetic evidence for drug target prioritization. Most previous work has been conducted using affinity-based proteomics approaches. These technologies face challenges, such as uncertainty regarding target identity, non-specific binding, and handling of variants that affect epitope affinity binding. Mass spectrometry-based proteomics can overcome some of these challenges. Here we report a pQTL study using the Proteograph™ Product Suite workflow (Seer, Inc.) where we quantify over 18,000 unique peptides from nearly 3000 proteins in more than 320 blood samples from a multi-ethnic cohort in a bottom-up, peptide-centric, mass spectrometry-based proteomics approach. We identify 184 protein-altering variants in 137 genes that are significantly associated with their corresponding variant peptides, confirming target specificity of co-associated affinity binders, identifying putatively causal cis-encoded proteins and providing experimental evidence for their presence in blood, including proteins that may be inaccessible to affinity-based proteomics.


Subject(s)
Proteogenomics; Proteomics; Humans; Proteomics/methods; Mass Spectrometry/methods; Proteins/analysis; Peptides/analysis; Proteogenomics/methods; Mutant Proteins
2.
bioRxiv ; 2024 Jan 08.
Article in English | MEDLINE | ID: mdl-38260620

ABSTRACT

Alzheimer's disease (AD) and related dementias (ADRD) are complex diseases with multiple pathophysiological drivers that determine clinical symptomology and disease progression. These diseases develop insidiously over time, through many pathways and disease mechanisms, and continue to have a huge societal impact for affected individuals and their families. While emerging blood-based biomarkers, such as plasma p-tau181 and p-tau217, accurately detect Alzheimer neuropathology and are associated with faster cognitive decline, the full extent of plasma proteomic changes in ADRD remains unknown. Earlier detection and better classification of the different subtypes may provide opportunities for earlier, more targeted interventions, and perhaps a higher likelihood of successful therapeutic development. In this study, we aim to leverage unbiased mass spectrometry proteomics to identify novel, blood-based biomarkers associated with cognitive decline. 1,786 plasma samples from 1,005 patients were collected over 12 years from participants in the Massachusetts Alzheimer's Disease Research Center Longitudinal Cohort Study. Patient metadata includes demographics, final diagnoses, and clinical dementia rating (CDR) scores taken concurrently. The Proteograph™ Product Suite (Seer, Inc.) and liquid-chromatography mass-spectrometry (LC-MS) analysis were used to process the plasma samples in this cohort and generate unbiased proteomics data. Data-independent acquisition (DIA) mass spectrometry results yielded 36,259 peptides and 4,007 protein groups. Linear mixed effects models revealed 138 differentially abundant proteins between AD and healthy controls. Machine learning classification models for AD diagnosis identified potential candidate biomarkers including MBP, BGLAP, and APOD. Cox regression models were created to determine the association of proteins with disease progression and suggest CLNS1A, CRISPLD2, and GOLPH3 as targets of further investigation as potential biomarkers. The Proteograph workflow provided deep, unbiased coverage of the plasma proteome at a speed that enabled a cohort study of almost 1,800 samples, which is the largest deep, unbiased proteomics study of ADRD conducted to date.

3.
bioRxiv ; 2023 Aug 29.
Article in English | MEDLINE | ID: mdl-37693476

ABSTRACT

Background: The wide dynamic range of circulating proteins coupled with the diversity of proteoforms present in plasma has historically impeded comprehensive and quantitative characterization of the plasma proteome at scale. Automated nanoparticle (NP) protein corona-based proteomics workflows can efficiently compress the dynamic range of protein abundances into a mass spectrometry (MS)-accessible detection range. This enhances the depth and scalability of quantitative MS-based methods, which can elucidate the molecular mechanisms of biological processes, discover new protein biomarkers, and improve comprehensiveness of MS-based diagnostics. Methods: Using multi-species spike-in experiments and a cohort, we investigated fold-change accuracy, linearity, precision, and statistical power for the Proteograph™ Product Suite, a deep plasma proteomics workflow, in conjunction with multiple MS instruments. Results: We show that NP-based workflows enable accurate identification (false discovery rate of 1%) of more than 6,000 proteins from plasma (Orbitrap Astral) and, compared to a gold standard neat plasma workflow that is limited to the detection of hundreds of plasma proteins, facilitate quantification of more proteins with accurate fold-changes, high linearity, and precision. Furthermore, we demonstrate high statistical power for the discovery of biomarkers in small- and large-scale cohorts. Conclusions: The automated NP workflow enables high-throughput, deep, and quantitative plasma proteomics investigation with sufficient power to discover new biomarker signatures with peptide-level resolution.

4.
Science ; 380(6648): eabn8153, 2023 06 02.
Article in English | MEDLINE | ID: mdl-37262156

ABSTRACT

Personalized genome sequencing has revealed millions of genetic differences between individuals, but our understanding of their clinical relevance remains largely incomplete. To systematically decipher the effects of human genetic variants, we obtained whole-genome sequencing data for 809 individuals from 233 primate species and identified 4.3 million common protein-altering variants with orthologs in humans. We show that these variants can be inferred to have nondeleterious effects in humans based on their presence at high allele frequencies in other primate populations. We use this resource to classify 6% of all possible human protein-altering variants as likely benign and impute the pathogenicity of the remaining 94% of variants with deep learning, achieving state-of-the-art accuracy for diagnosing pathogenic variants in patients with genetic diseases.


Subject(s)
Genetic Variation; Primates; Animals; Humans; Base Sequence; Gene Frequency; Primates/genetics; Whole Genome Sequencing
5.
bioRxiv ; 2023 May 02.
Article in English | MEDLINE | ID: mdl-37205491

ABSTRACT

Personalized genome sequencing has revealed millions of genetic differences between individuals, but our understanding of their clinical relevance remains largely incomplete. To systematically decipher the effects of human genetic variants, we obtained whole-genome sequencing data for 809 individuals from 233 primate species, and identified 4.3 million common protein-altering variants with orthologs in humans. We show that these variants can be inferred to have non-deleterious effects in humans based on their presence at high allele frequencies in other primate populations. We use this resource to classify 6% of all possible human protein-altering variants as likely benign and impute the pathogenicity of the remaining 94% of variants with deep learning, achieving state-of-the-art accuracy for diagnosing pathogenic variants in patients with genetic diseases. One Sentence Summary: Deep learning classifier trained on 4.3 million common primate missense variants predicts variant pathogenicity in humans.

6.
PLoS One ; 18(3): e0282821, 2023.
Article in English | MEDLINE | ID: mdl-36989217

ABSTRACT

Advancements in deep plasma proteomics are enabling high-resolution measurement of plasma proteoforms, which may reveal a rich source of novel biomarkers previously concealed by aggregated protein methods. Here, we analyze 188 plasma proteomes from subjects with non-small cell lung cancer (NSCLC) and controls to identify NSCLC-associated protein isoforms by examining differentially abundant peptides as a proxy for isoform-specific exon usage. We find four proteins composed of peptides with opposite patterns of abundance between cancer and control subjects. One of these proteins, BMP1, has known isoforms that can explain this differential pattern, for which the abundance of the NSCLC-associated isoform increases with stage of NSCLC progression. The presence of cancer- and control-associated isoforms suggests differential regulation of BMP1 isoforms. The identified BMP1 isoforms have known functional differences, which may reveal insights into mechanisms impacting NSCLC disease progression.


Subject(s)
Carcinoma, Non-Small-Cell Lung; Lung Neoplasms; Humans; Carcinoma, Non-Small-Cell Lung/metabolism; Lung Neoplasms/metabolism; Biomarkers, Tumor/metabolism; Protein Isoforms/metabolism; Peptides; Bone Morphogenetic Protein 1
7.
Adv Mater ; 34(44): e2206008, 2022 Nov.
Article in English | MEDLINE | ID: mdl-35986672

ABSTRACT

Introducing engineered nanoparticles (NPs) into a biofluid such as blood plasma leads to the formation of a selective and reproducible protein corona at the particle-protein interface, driven by the relationship between protein-NP affinity and protein abundance. This enables scalable systems that leverage protein-nano interactions to overcome current limitations of deep plasma proteomics in large cohorts. Here the importance of the protein to NP-surface ratio (P/NP) is demonstrated and protein corona formation dynamics are modeled, which determine the competition between proteins for binding. Tuning the P/NP ratio significantly modulates the protein corona composition, enhancing depth and precision of a fully automated NP-based deep proteomic workflow (Proteograph). By increasing the binding competition on engineered NPs, 1.2-1.7× more proteins with 1% false discovery rate are identified on the surface of each NP, and up to 3× more proteins compared to a standard plasma proteomics workflow. Moreover, the data suggest P/NP plays a significant role in determining the in vivo fate of nanomaterials in biomedical applications. Together, the study showcases the importance of P/NP as a key design element for biomaterials and nanomedicine in vivo and as a powerful tuning strategy for accurate, large-scale NP-based deep proteomic studies.


Subject(s)
Nanoparticles; Protein Corona; Protein Corona/chemistry; Proteome; Proteomics; Nanoparticles/chemistry; Nanomedicine
8.
Proc Natl Acad Sci U S A ; 119(11): e2106053119, 2022 03 15.
Article in English | MEDLINE | ID: mdl-35275789

ABSTRACT

Significance: Deep profiling of the plasma proteome at scale has been a challenge for traditional approaches. We achieve superior performance across the dimensions of precision, depth, and throughput using a panel of surface-functionalized superparamagnetic nanoparticles in comparison to conventional workflows for deep proteomics interrogation. Our automated workflow leverages competitive nanoparticle-protein binding equilibria that quantitatively compress the large dynamic range of proteomes to an accessible scale. Using machine learning, we dissect the contribution of individual physicochemical properties of nanoparticles to the composition of protein coronas. Our results suggest that nanoparticle functionalization can be tailored to protein sets. This work demonstrates the feasibility of deep, precise, unbiased plasma proteomics at a scale compatible with large-scale genomics, enabling multiomic studies.


Subject(s)
Blood Proteins; Deep Learning; Nanoparticles; Proteomics; Blood Proteins/chemistry; Nanoparticles/chemistry; Protein Corona/chemistry; Proteome; Proteomics/methods
9.
Gigascience ; 9(7), 2020 07 01.
Article in English | MEDLINE | ID: mdl-32649757

ABSTRACT

BACKGROUND: Macaque species share >93% genome homology with humans and develop many disease phenotypes similar to those of humans, making them valuable animal models for the study of human diseases (e.g., HIV and neurodegenerative diseases). However, the quality of genome assembly and annotation for several macaque species lags behind the human genome effort. RESULTS: To close this gap and enhance functional genomics approaches, we used a combination of de novo linked-read assembly and scaffolding using proximity ligation assay (HiC) to assemble the pig-tailed macaque (Macaca nemestrina) genome. This combinatorial method yielded large scaffolds at chromosome level with a scaffold N50 of 127.5 Mb; the 23 largest scaffolds covered 90% of the entire genome. This assembly revealed large-scale rearrangements between pig-tailed macaque chromosomes 7, 12, and 13 and human chromosomes 2, 14, and 15. We subsequently annotated the genome using transcriptome and proteomics data from personalized induced pluripotent stem cells derived from the same animal. Reconstruction of the evolutionary tree using whole-genome annotation and orthologous comparisons among 3 macaque species, human, and mouse genomes revealed extensive homology between human and pig-tailed macaques with regard to both pluripotent stem cell genes and innate immune gene pathways. Our results confirm that rhesus and cynomolgus macaques exhibit a closer evolutionary distance to each other than either species exhibits to humans or pig-tailed macaques. CONCLUSIONS: These findings demonstrate that pig-tailed macaques can serve as an excellent animal model for the study of many human diseases, particularly with regard to pluripotency and innate immune pathways.


Subject(s)
Chromosomes; Genome; Genomics; Macaca nemestrina/genetics; Animals; Computational Biology/methods; Genomics/methods; Humans; Karyotyping/methods; Male; Molecular Sequence Annotation; Proteomics/methods; Repetitive Sequences, Nucleic Acid
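
Note on record 9: the abstract reports a scaffold N50 of 127.5 Mb. As a reminder of how that statistic is usually defined, the following is a minimal sketch, assuming the common convention that N50 is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. The function name and the toy lengths are illustrative and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    # Walk from the longest scaffold down, accumulating total length.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that pushes the running sum to half the
        # assembly size defines the N50.
        if running >= half_total:
            return length


# Hypothetical scaffold lengths in Mb (not data from the paper).
toy_scaffolds = [100, 80, 50, 30, 20, 10, 10]
print(n50(toy_scaffolds))  # prints 80
```

In the toy example the total is 300 Mb, and the two longest scaffolds (100 + 80) are the first to reach 150 Mb, so the N50 is 80.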
10.
Genome Med ; 12(1): 50, 2020 05 29.
Article in English | MEDLINE | ID: mdl-32471482

ABSTRACT

BACKGROUND: Populations of closely related microbial strains can be simultaneously present in bacterial communities such as the human gut microbiome. We recently developed a de novo genome assembly approach that uses read cloud sequencing to provide more complete microbial genome drafts, enabling precise differentiation and tracking of strain-level dynamics across metagenomic samples. In this case study, we present a proof-of-concept using read cloud sequencing to describe bacterial strain diversity in the gut microbiome of one hematopoietic cell transplantation patient over a 2-month time course and highlight temporal strain variation of gut microbes during therapy. The treatment was accompanied by diet changes and administration of multiple immunosuppressants and antimicrobials. METHODS: We conducted short-read and read cloud metagenomic sequencing of DNA extracted from four longitudinal stool samples collected during the course of treatment of one hematopoietic cell transplantation (HCT) patient. After applying read cloud metagenomic assembly to discover strain-level sequence variants in these complex microbiome samples, we performed metatranscriptomic analysis to investigate differential expression of antibiotic resistance genes. Finally, we validated predictions from the genomic and metatranscriptomic findings through in vitro antibiotic susceptibility testing and whole genome sequencing of isolates derived from the patient stool samples. RESULTS: During the 56-day longitudinal time course that was studied, the patient's microbiome was profoundly disrupted and eventually dominated by Bacteroides caccae. Comparative analysis of B. caccae genomes obtained using read cloud sequencing together with metagenomic RNA sequencing allowed us to identify differences in substrain populations over time. Based on this, we predicted that particular mobile element integrations likely resulted in increased antibiotic resistance, which we further supported using in vitro antibiotic susceptibility testing. CONCLUSIONS: We find read cloud assembly to be useful in identifying key structural genomic strain variants within a metagenomic sample. These strains have fluctuating relative abundance over relatively short time periods in human microbiomes. We also find specific structural genomic variations that are associated with increased antibiotic resistance over the course of clinical treatment.


Subject(s)
Bacteria/genetics; Gastrointestinal Microbiome/genetics; Anti-Infective Agents/pharmacology; Azacitidine/pharmacology; Azithromycin/pharmacology; Bacteria/classification; Bacteria/drug effects; Bacteria/isolation & purification; Ciprofloxacin/pharmacology; DNA, Bacterial; Diet; Feces/microbiology; Gastrointestinal Microbiome/drug effects; Genome, Bacterial; Hematopoietic Stem Cell Transplantation; Humans; Immunosuppressive Agents/pharmacology; Male; Metagenome; Middle Aged; Myelodysplastic Syndromes/microbiology; Myelodysplastic Syndromes/therapy; Primary Myelofibrosis/microbiology; Primary Myelofibrosis/therapy; RNA-Seq; Sequence Analysis, DNA
11.
Bioinformatics ; 36(4): 1082-1090, 2020 02 15.
Article in English | MEDLINE | ID: mdl-31584621

ABSTRACT

MOTIVATION: We propose Meltos, a novel computational framework to address the challenging problem of building tumor phylogeny trees using somatic structural variants (SVs) among multiple samples. Meltos leverages the tumor phylogeny tree built on somatic single nucleotide variants (SNVs) to identify high confidence SVs and produce a comprehensive tumor lineage tree, using a novel optimization formulation. While we do not assume the evolutionary progression of SVs is necessarily the same as SNVs, we show that a tumor phylogeny tree using high-quality somatic SNVs can act as a guide for calling and assigning somatic SVs on a tree. Meltos utilizes multiple genomic read signals for potential SV breakpoints in whole genome sequencing data and proposes a probabilistic formulation for estimating variant allele fractions (VAFs) of SV events. RESULTS: In order to assess the ability of Meltos to correctly refine SNV trees with SV information, we tested Meltos on two simulated datasets, each with five genomes. We also assessed Meltos on two real cancer datasets. We tested Meltos on multiple samples from a liposarcoma tumor and on a multi-sample breast cancer dataset (Yates et al., 2015), where the authors provide validated structural variation events together with deep, targeted sequencing for a collection of somatic SNVs. We show Meltos has the ability to place high confidence validated SV calls on a refined tumor phylogeny tree. We also showed the flexibility of Meltos to either estimate VAFs directly from genomic data or to use copy number corrected estimates. AVAILABILITY AND IMPLEMENTATION: Meltos is available at https://github.com/ih-lab/Meltos. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Neoplasms; Genome; Genomic Structural Variation; Genomics; High-Throughput Nucleotide Sequencing; Humans; Neoplasms/genetics; Phylogeny; Sequence Analysis; Software
12.
Nat Commun ; 10(1): 3341, 2019 07 26.
Article in English | MEDLINE | ID: mdl-31350405

ABSTRACT

Tens of thousands of genotype-phenotype associations have been discovered to date, yet not all of them are easily accessible to scientists. Here, we describe GWASkb, a machine-compiled knowledge base of genetic associations collected from the scientific literature using automated information extraction algorithms. Our information extraction system helps curators by automatically collecting over 6,000 associations from open-access publications with an estimated recall of 60-80% and with an estimated precision of 78-94% (measured relative to existing manually curated knowledge bases). This system represents a fully automated GWAS curation effort and is made possible by a paradigm for constructing machine learning systems called data programming. Our work represents a step towards making the curation of scientific literature more efficient using automated systems.


Subject(s)
Databases, Genetic; Genome-Wide Association Study; Computational Biology; Data Mining; Genome, Human; Humans; Machine Learning
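
Note on record 12: the abstract quotes recall and precision estimated relative to manually curated knowledge bases. The following is a minimal sketch of how such figures can be computed when extracted and curated associations are compared as sets. The function name, the (variant, phenotype) pair representation, and the toy records are assumptions for illustration; the paper's actual evaluation protocol may differ.

```python
def precision_recall(extracted, curated):
    """Compare extracted associations with a curated reference set.

    Both arguments are iterables of hashable records, here
    (variant, phenotype) pairs. Returns (precision, recall).
    """
    extracted, curated = set(extracted), set(curated)
    true_positives = extracted & curated
    # Precision: share of extracted records confirmed by the reference.
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    # Recall: share of reference records that the extractor recovered.
    recall = len(true_positives) / len(curated) if curated else 0.0
    return precision, recall


# Hypothetical records (not data from the paper).
extracted = [("rs1", "T2D"), ("rs2", "height"), ("rs3", "asthma"), ("rs4", "BMI")]
curated = [("rs1", "T2D"), ("rs2", "height"), ("rs3", "asthma"),
           ("rs5", "LDL"), ("rs6", "gout")]
print(precision_recall(extracted, curated))  # prints (0.75, 0.6)
```

One caveat of this set-based view: an extracted association that is correct but absent from the curated reference counts against precision, so figures measured this way are relative to the reference rather than absolute.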
13.
Cell ; 176(3): 535-548.e24, 2019 01 24.
Article in English | MEDLINE | ID: mdl-30661751

ABSTRACT

The splicing of pre-mRNAs into mature transcripts is remarkable for its precision, but the mechanisms by which the cellular machinery achieves such specificity are incompletely understood. Here, we describe a deep neural network that accurately predicts splice junctions from an arbitrary pre-mRNA transcript sequence, enabling precise prediction of noncoding genetic variants that cause cryptic splicing. Synonymous and intronic mutations with predicted splice-altering consequence validate at a high rate on RNA-seq and are strongly deleterious in the human population. De novo mutations with predicted splice-altering consequence are significantly enriched in patients with autism and intellectual disability compared to healthy controls and validate against RNA-seq in 21 out of 28 of these patients. We estimate that 9%-11% of pathogenic mutations in patients with rare genetic disorders are caused by this previously underappreciated class of disease variation.


Subject(s)
Forecasting/methods; RNA Precursors/genetics; RNA Splicing/genetics; Algorithms; Alternative Splicing/genetics; Autistic Disorder/genetics; Deep Learning; Exons/genetics; Humans; Intellectual Disability/genetics; Introns/genetics; Neural Networks, Computer; RNA Precursors/metabolism; RNA Splice Sites/genetics; RNA Splice Sites/physiology
14.
Nat Genet ; 51(2): 364, 2019 02.
Article in English | MEDLINE | ID: mdl-30559491

ABSTRACT

In the version of this article originally published, the name of author Serafim Batzoglou was misspelled. The error has been corrected in the HTML and PDF versions of the article.

15.
Nat Biotechnol ; 2018 Oct 15.
Article in English | MEDLINE | ID: mdl-30320765

ABSTRACT

Although shotgun metagenomic sequencing of microbiome samples enables partial reconstruction of strain-level community structure, obtaining high-quality microbial genome drafts without isolation and culture remains difficult. Here, we present an application of read clouds, short-read sequences tagged with long-range information, to microbiome samples. We present Athena, a de novo assembler that uses read clouds to improve metagenomic assemblies. We applied this approach to sequence stool samples from two healthy individuals and compared it with existing short-read and synthetic long-read metagenomic sequencing techniques. Read-cloud metagenomic sequencing and Athena assembly produced the most comprehensive individual genome drafts with high contiguity (>200-kb N50, fewer than ten contigs), even for bacteria with relatively low (20×) raw short-read-sequence coverage. We also sequenced a complex marine-sediment sample and generated 24 intermediate-quality genome drafts (>70% complete, <10% contaminated), nine of which were complete (>90% complete, <5% contaminated). Our approach allows for culture-free generation of high-quality microbial genome drafts by using a single shotgun experiment.

16.
Nat Commun ; 9(1): 4453, 2018 10 26.
Article in English | MEDLINE | ID: mdl-30367051

ABSTRACT

Outcomes for cancer patients vary greatly even within the same tumor type, and characterization of molecular subtypes of cancer holds important promise for improving prognosis and personalized treatment. This promise has motivated recent efforts to produce large amounts of multidimensional genomic (multi-omic) data, but current algorithms still face challenges in the integrated analysis of such data. Here we present Cancer Integration via Multikernel Learning (CIMLR), a new cancer subtyping method that integrates multi-omic data to reveal molecular subtypes of cancer. We apply CIMLR to multi-omic data from 36 cancer types and show significant improvements in both computational efficiency and ability to extract biologically meaningful cancer subtypes. The discovered subtypes exhibit significant differences in patient survival for 27 of 36 cancer types. Our analysis reveals integrated patterns of gene expression, methylation, point mutations, and copy number changes in multiple cancers and highlights patterns specifically associated with poor patient outcomes.


Subject(s)
Computational Biology; Genomics; Neoplasms/genetics; Neoplasms/mortality; Algorithms; Cluster Analysis; DNA Copy Number Variations; DNA Methylation; Gene Expression Profiling; Humans; Neoplasms/classification; Neoplasms/therapy; Point Mutation; Prognosis; Survival Analysis
17.
Nat Commun ; 9(1): 3108, 2018 08 06.
Article in English | MEDLINE | ID: mdl-30082777

ABSTRACT

Networks are ubiquitous in biology where they encode connectivity patterns at all scales of organization, from molecular to the biome. However, biological networks are noisy due to the limitations of measurement technology and inherent natural variation, which can hamper discovery of network patterns and dynamics. We propose Network Enhancement (NE), a method for improving the signal-to-noise ratio of undirected, weighted networks. NE uses a doubly stochastic matrix operator that induces sparsity and provides a closed-form solution that increases spectral eigengap of the input network. As a result, NE removes weak edges, enhances real connections, and leads to better downstream performance. Experiments show that NE improves gene-function prediction by denoising tissue-specific interaction networks, alleviates interpretation of noisy Hi-C contact maps from the human genome, and boosts fine-grained identification accuracy of species. Our results indicate that NE is widely applicable for denoising biological networks.


Subject(s)
Computational Biology/methods; Gene Expression Profiling; Genome, Human; Algorithms; Area Under Curve; Biological Products; Diffusion; Ecosystem; Humans; Models, Biological; Protein Domains; Signal-To-Noise Ratio; Stochastic Processes
18.
Nat Genet ; 50(8): 1161-1170, 2018 08.
Article in English | MEDLINE | ID: mdl-30038395

ABSTRACT

Millions of human genomes and exomes have been sequenced, but their clinical applications remain limited due to the difficulty of distinguishing disease-causing mutations from benign genetic variation. Here we demonstrate that common missense variants in other primate species are largely clinically benign in human, enabling pathogenic mutations to be systematically identified by the process of elimination. Using hundreds of thousands of common variants from population sequencing of six non-human primate species, we train a deep neural network that identifies pathogenic mutations in rare disease patients with 88% accuracy and enables the discovery of 14 new candidate genes in intellectual disability at genome-wide significance. Cataloging common variation from additional primate species would improve interpretation for millions of variants of uncertain significance, further advancing the clinical utility of human genome sequencing.


Subject(s)
Genome, Human; Mutation; Nerve Net/physiology; Animals; Exome; Genetic Predisposition to Disease; Humans; Intellectual Disability/genetics; Intellectual Disability/pathology; Primates
19.
BMC Genomics ; 19(1): 467, 2018 Jun 18.
Article in English | MEDLINE | ID: mdl-29914369

ABSTRACT

BACKGROUND: De novo mutations (DNMs) are associated with neurodevelopmental and congenital diseases, and their detection can contribute to understanding disease pathogenicity. However, accurate detection is challenging because of their small number relative to the genome-wide false positives in next generation sequencing (NGS) data. Software such as DeNovoGear and TrioDeNovo have been developed to detect DNMs, but at good sensitivity they still produce many false positive calls. RESULTS: To address this challenge, we develop HAPDeNovo, a program that leverages phasing information from linked read sequencing, to remove false positive DNMs from candidate lists generated by DNM-detection tools. Short reads from each phasing block are allocated to each of the two haplotypes followed by generating a haploid genotype for each putative DNM. HAPDeNovo removes variants that are called as heterozygous in one of the haplotypes because they are almost certainly false positives. Our experiments on 10X Chromium linked read sequencing trio data reveal that HAPDeNovo eliminates 80 to 99% of false positives regardless of how large the candidate DNM set is. CONCLUSIONS: HAPDeNovo leverages the haplotype information from linked read sequencing to remove spurious false positive DNMs effectively, and it increases accuracy of DNM detection dramatically without sacrificing sensitivity.


Subject(s)
Genome, Human; Haplotypes; Mutation; Software; Algorithms; Computational Biology; DNA Mutational Analysis; Genotype; High-Throughput Nucleotide Sequencing; Humans
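
Note on record 19: the abstract describes assigning reads to the two haplotypes of a phasing block and discarding candidate de novo mutations that appear heterozygous within a single haplotype. The following is a minimal sketch of that filtering idea only. The `Candidate` class, the read-count representation, and the thresholds `min_frac` and `min_reads` are hypothetical choices for illustration and do not reproduce HAPDeNovo's actual genotyping.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    hap1: tuple  # (ref_reads, alt_reads) assigned to haplotype 1
    hap2: tuple  # (ref_reads, alt_reads) assigned to haplotype 2


def looks_heterozygous(ref, alt, min_frac=0.2, min_reads=2):
    """Crude haploid check: both alleles are well supported on one haplotype."""
    total = ref + alt
    if total == 0:
        return False
    minor = min(ref, alt)
    return minor >= min_reads and minor / total >= min_frac


def filter_candidates(candidates):
    """Keep candidates whose two haplotypes each look like a single allele."""
    return [
        c for c in candidates
        if not (looks_heterozygous(*c.hap1) or looks_heterozygous(*c.hap2))
    ]


# Hypothetical read counts (not data from the paper).
toy_candidates = [
    Candidate("A", hap1=(12, 0), hap2=(0, 11)),  # clean: ref on one, alt on other
    Candidate("B", hap1=(6, 5), hap2=(10, 0)),   # mixed alleles within haplotype 1
    Candidate("C", hap1=(9, 1), hap2=(0, 8)),    # a single stray read is tolerated
]
print([c.name for c in filter_candidates(toy_candidates)])  # prints ['A', 'C']
```

A single chromosome copy carries one allele per site, so substantial support for both alleles among reads from one haplotype points to a mapping or sequencing artifact rather than a true variant.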
20.
J Comput Biol ; 25(7): 677-688, 2018 07.
Article in English | MEDLINE | ID: mdl-29658784

ABSTRACT

We introduce GATTACA, a framework for fast unsupervised binning of metagenomic contigs. Similar to recent approaches, GATTACA clusters contigs based on their coverage profiles across a large cohort of metagenomic samples; however, unlike previous methods that rely on read mapping, GATTACA quickly estimates these profiles from kmer counts stored in a compact index. This approach can result in over an order of magnitude speedup, while matching the accuracy of earlier methods on synthetic and real data benchmarks. It also provides a way to index metagenomic samples (e.g., from public repositories such as the Human Microbiome Project) offline once and reuse them across experiments; furthermore, the small size of the sample indices allows them to be easily transferred and stored. Leveraging the MinHash technique, GATTACA also provides an efficient way to identify publicly available metagenomic data that can be incorporated into the set of reference metagenomes to further improve binning accuracy. Thus, enabling easy indexing and reuse of publicly available metagenomic data sets, GATTACA makes accurate metagenomic analyses accessible to a much wider range of researchers.


Subject(s)
Bayes Theorem; Computational Biology/statistics & numerical data; Metagenomics/statistics & numerical data; Microbiota/genetics; Cluster Analysis; Humans; Metagenome/genetics
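
Note on record 20: the abstract mentions the MinHash technique for finding similar metagenomic data sets. The following is a minimal sketch of generic MinHash applied to k-mer sets, assuming one seeded hash function per signature position; it is not GATTACA's implementation. All function names, the value of k, and the toy sequences are illustrative.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(item, seed):
    # Seeded 64-bit hash; the seed is passed as the BLAKE2b salt.
    digest = hashlib.blake2b(
        item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over the set."""
    if not items:
        raise ValueError("items must be non-empty")
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical sequences (not data from the paper).
a = minhash_signature(kmers("ACGTACGTGACCTGA", 4))
b = minhash_signature(kmers("ACGTACGTGACCTGC", 4))
print(estimate_jaccard(a, a))  # prints 1.0
print(estimate_jaccard(a, b))  # an estimate between 0 and 1
```

The signatures are small and fixed in size regardless of how many k-mers a sample contains, which is what makes it practical to compare many samples without revisiting the raw reads.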