Search | VHL Regional Portal (Portal Regional da BVS)

1.

Complementation testing identifies genes mediating effects at quantitative trait loci underlying fear-related behavior.

Chen, Patrick B; Chen, Rachel; LaPierre, Nathan; Chen, Zeyuan; Mefford, Joel; Marcus, Emilie; Heffel, Matthew G; Soto, Daniela C; Ernst, Jason; Luo, Chongyuan; Flint, Jonathan.

Cell Genom ; 4(5): 100545, 2024 May 08.

Article in English | MEDLINE | ID: mdl-38697120

ABSTRACT

Knowing the genes involved in quantitative traits provides an entry point to understanding the biological bases of behavior, but there are very few examples where the pathway from genetic locus to behavioral change is known. To explore the role of specific genes in fear behavior, we mapped three fear-related traits, tested fourteen genes at six quantitative trait loci (QTLs) by quantitative complementation, and identified six genes. Four genes, Lsamp, Ptprd, Nptx2, and Sh3gl, have known roles in synapse function; the fifth, Psip1, was not previously implicated in behavior; and the sixth is a long non-coding RNA, 4933413L06Rik, of unknown function. Variation in transcriptome and epigenetic modalities occurred preferentially in excitatory neurons, suggesting that genetic variation is more permissible in excitatory than inhibitory neuronal circuits. Our results relieve a bottleneck in using genetic mapping of QTLs to uncover biology underlying behavior and prompt a reconsideration of expected relationships between genetic and functional variation.

Subjects

Fear , Quantitative Trait Loci , Animals , Female , Male , Mice , Behavior, Animal/physiology , Chromosome Mapping , Fear/physiology , Mice, Inbred C57BL , Genetic Complementation Test

2.

Packaging and containerization of computational methods.

Alser, Mohammed; Lawlor, Brendan; Abdill, Richard J; Waymost, Sharon; Ayyala, Ram; Rajkumar, Neha; LaPierre, Nathan; Brito, Jaqueline; Ribeiro-Dos-Santos, André M; Almadhoun, Nour; Sarwal, Varuni; Firtina, Can; Osinski, Tomasz; Eskin, Eleazar; Hu, Qiyang; Strong, Derek; Kim, Byoung-Do B D; Abedalthagafi, Malak S; Mutlu, Onur; Mangul, Serghei.

Nat Protoc ; 2024 Apr 02.

Article in English | MEDLINE | ID: mdl-38565959

ABSTRACT

Methods for analyzing the full complement of a biomolecule type, e.g., proteomics or metabolomics, generate large amounts of complex data. The software tools used to analyze omics data have reshaped the landscape of modern biology and become an essential component of biomedical research. These tools are themselves quite complex and often require the installation of other supporting software, libraries and/or databases. A researcher may also be using multiple different tools that require different versions of the same supporting materials. The increasing dependence of biomedical scientists on these powerful tools creates a need for easier installation and greater usability. Packaging and containerization are different approaches to satisfy this need by delivering omics tools already wrapped in additional software that makes the tools easier to install and use. In this systematic review, we describe and compare the features of prominent packaging and containerization platforms. We outline the challenges, advantages and limitations of each approach and some of the most widely used platforms from the perspectives of users, software developers and system administrators. We also propose principles to make the distribution of omics software more sustainable and robust to increase the reproducibility of biomedical and life science research.

3.

Accounting for isoform expression increases power to identify genetic regulation of gene expression.

LaPierre, Nathan; Pimentel, Harold.

PLoS Comput Biol ; 20(2): e1011857, 2024 Feb.

Article in English | MEDLINE | ID: mdl-38346082

ABSTRACT

A core problem in genetics is molecular quantitative trait locus (QTL) mapping, in which genetic variants associated with changes in the molecular phenotypes are identified. One of the most-studied molecular QTL mapping problems is expression QTL (eQTL) mapping, in which the molecular phenotype is gene expression. It is common in eQTL mapping to compute gene expression by aggregating the expression levels of individual isoforms from the same gene and then performing linear regression between SNPs and this aggregated gene expression level. However, SNPs may regulate isoforms from the same gene in different directions due to alternative splicing, or only regulate the expression level of one isoform, causing this approach to lose power. Here, we examine a broader question: which genes have at least one isoform whose expression level is regulated by genetic variants? In this study, we propose and evaluate several approaches to answering this question, demonstrating that "isoform-aware" methods (those that account for the expression levels of individual isoforms) have substantially greater power to answer this question than standard "gene-level" eQTL mapping methods. We identify settings in which different approaches yield an inflated number of false discoveries or lose power. In particular, we show that calling an eGene if there is a significant association between a SNP and any isoform fails to control False Discovery Rate, even when applying standard False Discovery Rate correction. We show that similar trends are observed in real data from the GEUVADIS and GTEx studies, suggesting the possibility that similar effects are present in these consortia.
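The loss of power described in this abstract, when a SNP regulates two isoforms in opposite directions, can be illustrated with a small toy calculation. The sketch below is not the authors' method or data: the genotype values, noise terms, and the `pearson` helper are all hypothetical, chosen only so that the opposite effects cancel in the summed gene-level value.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: genotype dosage (0, 1, 2) for six individuals.
genotype = [0, 0, 1, 1, 2, 2]
noise_a = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
noise_b = [-0.2, 0.2, -0.2, 0.2, -0.2, 0.2]

# The SNP raises isoform A and lowers isoform B by the same amount.
isoform_a = [5 + g + e for g, e in zip(genotype, noise_a)]
isoform_b = [5 - g + e for g, e in zip(genotype, noise_b)]

# "Gene-level" expression: the sum of the isoform levels.
gene_level = [a + b for a, b in zip(isoform_a, isoform_b)]

r_a = pearson(genotype, isoform_a)        # strongly positive
r_b = pearson(genotype, isoform_b)        # strongly negative
r_gene = pearson(genotype, gene_level)    # effects cancel, close to zero

print(f"isoform A: {r_a:.3f}, isoform B: {r_b:.3f}, gene level: {r_gene:.3f}")
```

In this constructed case each isoform is clearly associated with genotype while the aggregated value is not, which is the situation an isoform-aware test is meant to catch.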

Subjects

Gene Expression Regulation , Quantitative Trait Loci , Chromosome Mapping/methods , Gene Expression Regulation/genetics , Quantitative Trait Loci/genetics , Phenotype , Protein Isoforms/genetics , Polymorphism, Single Nucleotide/genetics , Genome-Wide Association Study

4.

Complementation testing identifies causal genes at quantitative trait loci underlying fear related behavior.

Chen, Patrick B; Chen, Rachel; LaPierre, Nathan; Chen, Zeyuan; Mefford, Joel; Marcus, Emilie; Heffel, Matthew G; Soto, Daniela C; Ernst, Jason; Luo, Chongyuan; Flint, Jonathan.

bioRxiv ; 2024 Jan 04.

Article in English | MEDLINE | ID: mdl-38260483

ABSTRACT

Knowing the genes involved in quantitative traits provides a critical entry point to understanding the biological bases of behavior, but there are very few examples where the pathway from genetic locus to behavioral change is known. Here we address a key step towards that goal by deploying a test that directly queries whether a gene mediates the effect of a quantitative trait locus (QTL). To explore the role of specific genes in fear behavior, we mapped three fear-related traits, tested fourteen genes at six QTLs, and identified six genes. Four genes, Lsamp, Ptprd, Nptx2 and Sh3gl, have known roles in synapse function; the fifth gene, Psip1, is a transcriptional co-activator not previously implicated in behavior; the sixth is a long non-coding RNA 4933413L06Rik with no known function. Single nucleus transcriptomic and epigenetic analyses implicated excitatory neurons as likely mediating the genetic effects. Surprisingly, variation in transcriptome and epigenetic modalities between inbred strains occurred preferentially in excitatory neurons, suggesting that genetic variation is more permissible in excitatory than inhibitory neuronal circuits. Our results open a bottleneck in using genetic mapping of QTLs to find novel biology underlying behavior and prompt a reconsideration of expected relationships between genetic and functional variation.

5.

Genetic pathways regulating the longitudinal acquisition of cocaine self-administration in a panel of inbred and recombinant inbred mice.

Khan, Arshad H; Bagley, Jared R; LaPierre, Nathan; Gonzalez-Figueroa, Carlos; Spencer, Tadeo C; Choudhury, Mudra; Xiao, Xinshu; Eskin, Eleazar; Jentsch, James D; Smith, Desmond J.

Cell Rep ; 42(8): 112856, 2023 Aug 29.

Article in English | MEDLINE | ID: mdl-37481717

ABSTRACT

To identify addiction genes, we evaluate intravenous self-administration of cocaine or saline in 84 inbred and recombinant inbred mouse strains over 10 days. We integrate the behavior data with brain RNA-seq data from 41 strains. The self-administration of cocaine and that of saline are genetically distinct. We maximize power to map loci for cocaine intake by using a linear mixed model to account for this longitudinal phenotype while correcting for population structure. A total of 15 unique significant loci are identified in the genome-wide association study. A transcriptome-wide association study highlights the Trpv2 ion channel as a key locus for cocaine self-administration as well as identifying 17 additional genes, including Arhgef26, Slc18b1, and Slco5a1. We find numerous instances where alternate splice site selection or RNA editing altered transcript abundance. Our work emphasizes the importance of Trpv2, an ionotropic cannabinoid receptor, for the response to cocaine.

Subjects

Cocaine-Related Disorders , Cocaine , Mice , Animals , Cocaine/pharmacology , Genome-Wide Association Study , Brain , Administration, Intravenous , Mice, Inbred C57BL

6.

Leveraging family data to design Mendelian randomization that is provably robust to population stratification.

LaPierre, Nathan; Fu, Boyang; Turnbull, Steven; Eskin, Eleazar; Sankararaman, Sriram.

Genome Res ; 33(7): 1032-1041, 2023 Jul.

Article in English | MEDLINE | ID: mdl-37197991

ABSTRACT

Mendelian randomization (MR) has emerged as a powerful approach to leverage genetic instruments to infer causality between pairs of traits in observational studies. However, the results of such studies are susceptible to biases owing to weak instruments, as well as the confounding effects of population stratification and horizontal pleiotropy. Here, we show that family data can be leveraged to design MR tests that are provably robust to confounding from population stratification, assortative mating, and dynastic effects. We show in simulations that our approach, MR-Twin, is robust to confounding from population stratification and is not affected by weak instrument bias, whereas standard MR methods yield inflated false positive rates. We then conduct an exploratory analysis of MR-Twin and other MR methods applied to 121 trait pairs in the UK Biobank data set. Our results suggest that confounding from population stratification can lead to false positives for existing MR methods, whereas MR-Twin is immune to this type of confounding, and that MR-Twin can help assess whether traditional approaches may be inflated owing to confounding from population stratification.

Subjects

Mendelian Randomization Analysis , Reproduction , Bias , Genome-Wide Association Study , Mendelian Randomization Analysis/methods , Phenotype , Humans

7.

Leveraging family data to design Mendelian Randomization that is provably robust to population stratification.

LaPierre, Nathan; Fu, Boyang; Turnbull, Steven; Eskin, Eleazar; Sankararaman, Sriram.

bioRxiv ; 2023 Jan 06.

Article in English | MEDLINE | ID: mdl-36711635

ABSTRACT

Mendelian Randomization (MR) has emerged as a powerful approach to leverage genetic instruments to infer causality between pairs of traits in observational studies. However, the results of such studies are susceptible to biases due to weak instruments as well as the confounding effects of population stratification and horizontal pleiotropy. Here, we show that family data can be leveraged to design MR tests that are provably robust to confounding from population stratification, assortative mating, and dynastic effects. We demonstrate in simulations that our approach, MR-Twin, is robust to confounding from population stratification and is not affected by weak instrument bias, while standard MR methods yield inflated false positive rates. We applied MR-Twin to 121 trait pairs in the UK Biobank dataset and found that MR-Twin identifies likely causal trait pairs and does not identify trait pairs that are unlikely to be causal. Our results suggest that confounding from population stratification can lead to false positives for existing MR methods, while MR-Twin is immune to this type of confounding.

8.

Ensemble neural network model for detecting thyroid eye disease using external photographs.

Karlin, Justin; Gai, Lisa; LaPierre, Nathan; Danesh, Kayla; Farajzadeh, Justin; Palileo, Bea; Taraszka, Kodi; Zheng, Jie; Wang, Wei; Eskin, Eleazar; Rootman, Daniel.

Br J Ophthalmol ; 107(11): 1722-1729, 2023 Nov.

Article in English | MEDLINE | ID: mdl-36126104

ABSTRACT

PURPOSE: To describe an artificial intelligence platform that detects thyroid eye disease (TED). DESIGN: Development of a deep learning model. METHODS: 1944 photographs from a clinical database were used to train a deep learning model. 344 additional images ('test set') were used to calculate performance metrics. Receiver operating characteristic, precision-recall curves and heatmaps were generated. From the test set, 50 images were randomly selected ('survey set') and used to compare model performance with ophthalmologist performance. 222 images obtained from a separate clinical database were used to assess model recall and to quantitate model performance with respect to disease stage and grade. RESULTS: The model achieved test set accuracy of 89.2%, specificity 86.9%, recall 93.4%, precision 79.7% and an F1 score of 86.0%. Heatmaps demonstrated that the model identified pixels corresponding to clinical features of TED. On the survey set, the ensemble model achieved accuracy, specificity, recall, precision and F1 score of 86%, 84%, 89%, 77% and 82%, respectively. 27 ophthalmologists achieved mean performance of 75%, 82%, 63%, 72% and 66%, respectively. On the second test set, the model achieved recall of 91.9%, with higher recall for moderate to severe (98.2%, n=55) and active disease (98.3%, n=60), as compared with mild (86.8%, n=68) or stable disease (85.7%, n=63). CONCLUSIONS: The deep learning classifier is a novel approach to identify TED and is a first step in the development of tools to improve diagnostic accuracy and lower barriers to specialist evaluation.
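For readers unfamiliar with the metrics reported in this abstract, the sketch below shows how accuracy, specificity, recall, precision and F1 follow from a binary confusion matrix. The function names and the example counts are hypothetical (the abstract does not give the underlying counts); only the precision and recall values passed to `f1_from` are taken from the reported test-set results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "specificity": tn / (tn + fp),
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }

def f1_from(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
print(metrics)

# Consistency check using the test-set precision and recall from the abstract.
reported_f1 = f1_from(precision=0.797, recall=0.934)
print(f"F1 from reported precision and recall: {reported_f1:.3f}")
```

The harmonic mean of the reported test-set precision (79.7%) and recall (93.4%) comes out at about 0.860, in line with the reported F1 score of 86.0%.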

9.

Critical Assessment of Metagenome Interpretation: the second round of challenges.

Meyer, Fernando; Fritz, Adrian; Deng, Zhi-Luo; Koslicki, David; Lesker, Till Robin; Gurevich, Alexey; Robertson, Gary; Alser, Mohammed; Antipov, Dmitry; Beghini, Francesco; Bertrand, Denis; Brito, Jaqueline J; Brown, C Titus; Buchmann, Jan; Buluç, Aydin; Chen, Bo; Chikhi, Rayan; Clausen, Philip T L C; Cristian, Alexandru; Dabrowski, Piotr Wojciech; Darling, Aaron E; Egan, Rob; Eskin, Eleazar; Georganas, Evangelos; Goltsman, Eugene; Gray, Melissa A; Hansen, Lars Hestbjerg; Hofmeyr, Steven; Huang, Pingqin; Irber, Luiz; Jia, Huijue; Jørgensen, Tue Sparholt; Kieser, Silas D; Klemetsen, Terje; Kola, Axel; Kolmogorov, Mikhail; Korobeynikov, Anton; Kwan, Jason; LaPierre, Nathan; Lemaitre, Claire; Li, Chenhao; Limasset, Antoine; Malcher-Miranda, Fabio; Mangul, Serghei; Marcelino, Vanessa R; Marchet, Camille; Marijon, Pierre; Meleshko, Dmitry; Mende, Daniel R; Milanese, Alessio.

Nat Methods ; 19(4): 429-440, 2022 Apr.

Article in English | MEDLINE | ID: mdl-35396482

ABSTRACT

Evaluating metagenomic software is key for optimizing metagenome interpretation and is the focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains were still challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses.

Subjects

Metagenome , Metagenomics , Archaea/genetics , Metagenomics/methods , Reproducibility of Results , Sequence Analysis, DNA , Software

10.

Robust Mendelian randomization in the presence of residual population stratification, batch effects and horizontal pleiotropy.

Cinelli, Carlos; LaPierre, Nathan; Hill, Brian L; Sankararaman, Sriram; Eskin, Eleazar.

Nat Commun ; 13(1): 1093, 2022 Mar 01.

Article in English | MEDLINE | ID: mdl-35232963

ABSTRACT

Mendelian Randomization (MR) studies are threatened by population stratification, batch effects, and horizontal pleiotropy. Although a variety of methods have been proposed to mitigate those problems, residual biases may still remain, leading to highly statistically significant false positives in large databases. Here we describe a suite of sensitivity analysis tools that enables investigators to quantify the robustness of their findings against such validity threats. Specifically, we propose the routine reporting of sensitivity statistics that reveal the minimal strength of violations necessary to explain away the MR results. We further provide intuitive displays of the robustness of the MR estimate to any degree of violation, and formal bounds on the worst-case bias caused by violations multiple times stronger than observed variables. We demonstrate how these tools can aid researchers in distinguishing robust from fragile findings by examining the effect of body mass index on diastolic blood pressure and Townsend deprivation index.

Subjects

Genetic Pleiotropy , Mendelian Randomization Analysis , Bias , Blood Pressure/genetics , Body Mass Index , Disease Progression , Genome-Wide Association Study , Humans , Mendelian Randomization Analysis/methods

11.

Identifying causal variants by fine mapping across multiple studies.

LaPierre, Nathan; Taraszka, Kodi; Huang, Helen; He, Rosemary; Hormozdiari, Farhad; Eskin, Eleazar.

PLoS Genet ; 17(9): e1009733, 2021 Sep.

Article in English | MEDLINE | ID: mdl-34543273

ABSTRACT

Increasingly large Genome-Wide Association Studies (GWAS) have yielded numerous variants associated with many complex traits, motivating the development of "fine mapping" methods to identify which of the associated variants are causal. Additionally, GWAS of the same trait for different populations are increasingly available, raising the possibility of refining fine mapping results further by leveraging different linkage disequilibrium (LD) structures across studies. Here, we introduce multiple study causal variants identification in associated regions (MsCAVIAR), a method that extends the popular CAVIAR fine mapping framework to a multiple study setting using a random effects model. MsCAVIAR only requires summary statistics and LD as input, accounts for uncertainty in association statistics using a multivariate normal model, allows for multiple causal variants at a locus, and explicitly models the possibility of different SNP effect sizes in different populations. We demonstrate the efficacy of MsCAVIAR in both a simulation study and a trans-ethnic, trans-biobank fine mapping analysis of High Density Lipoprotein (HDL).

Subjects

Genome-Wide Association Study , Causality , Chromosome Mapping/methods , Humans , Linkage Disequilibrium , Lipoproteins, HDL/genetics , Polymorphism, Single Nucleotide

12.

Massively scaled-up testing for SARS-CoV-2 RNA via next-generation sequencing of pooled and barcoded nasal and saliva samples.

Bloom, Joshua S; Sathe, Laila; Munugala, Chetan; Jones, Eric M; Gasperini, Molly; Lubock, Nathan B; Yarza, Fauna; Thompson, Erin M; Kovary, Kyle M; Park, Jimin; Marquette, Dawn; Kay, Stephania; Lucas, Mark; Love, TreQuan; Sina Booeshaghi, A; Brandenberg, Oliver F; Guo, Longhua; Boocock, James; Hochman, Myles; Simpkins, Scott W; Lin, Isabella; LaPierre, Nathan; Hong, Duke; Zhang, Yi; Oland, Gabriel; Choe, Bianca Judy; Chandrasekaran, Sukantha; Hilt, Evann E; Butte, Manish J; Damoiseaux, Robert; Kravit, Clifford; Cooper, Aaron R; Yin, Yi; Pachter, Lior; Garner, Omai B; Flint, Jonathan; Eskin, Eleazar; Luo, Chongyuan; Kosuri, Sriram; Kruglyak, Leonid; Arboleda, Valerie A.

Nat Biomed Eng ; 5(7): 657-665, 2021 Jul.

Article in English | MEDLINE | ID: mdl-34211145

ABSTRACT

Frequent and widespread testing of members of the population who are asymptomatic for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is essential for the mitigation of the transmission of the virus. Despite the recent increases in testing capacity, tests based on quantitative polymerase chain reaction (qPCR) assays cannot be easily deployed at the scale required for population-wide screening. Here, we show that next-generation sequencing of pooled samples tagged with sample-specific molecular barcodes enables the testing of thousands of nasal or saliva samples for SARS-CoV-2 RNA in a single run without the need for RNA extraction. The assay, which we named SwabSeq, incorporates a synthetic RNA standard that facilitates end-point quantification and the calling of true negatives, and that reduces the requirements for automation, purification and sample-to-sample normalization. We used SwabSeq to perform 80,000 tests, with an analytical sensitivity and specificity comparable to or better than traditional qPCR tests, in less than two months with turnaround times of less than 24 h. SwabSeq could be rapidly adapted for the detection of other pathogens.

Subjects

RNA, Viral/genetics , SARS-CoV-2/pathogenicity , Saliva/virology , High-Throughput Nucleotide Sequencing , Humans , SARS-CoV-2/genetics , Sensitivity and Specificity

13.

Swab-Seq: A high-throughput platform for massively scaled up SARS-CoV-2 testing.

Bloom, Joshua S; Sathe, Laila; Munugala, Chetan; Jones, Eric M; Gasperini, Molly; Lubock, Nathan B; Yarza, Fauna; Thompson, Erin M; Kovary, Kyle M; Park, Jimin; Marquette, Dawn; Kay, Stephania; Lucas, Mark; Love, TreQuan; Booeshaghi, A Sina; Brandenberg, Oliver F; Guo, Longhua; Boocock, James; Hochman, Myles; Simpkins, Scott W; Lin, Isabella; LaPierre, Nathan; Hong, Duke; Zhang, Yi; Oland, Gabriel; Choe, Bianca Judy; Chandrasekaran, Sukantha; Hilt, Evann E; Butte, Manish J; Damoiseaux, Robert; Kravit, Clifford; Cooper, Aaron R; Yin, Yi; Pachter, Lior; Garner, Omai B; Flint, Jonathan; Eskin, Eleazar; Luo, Chongyuan; Kosuri, Sriram; Kruglyak, Leonid; Arboleda, Valerie A.

medRxiv ; 2021 Mar 09.

Article in English | MEDLINE | ID: mdl-32909008

ABSTRACT

The rapid spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is due to the high rates of transmission by individuals who are asymptomatic at the time of transmission [1,2]. Frequent, widespread testing of the asymptomatic population for SARS-CoV-2 is essential to suppress viral transmission. Despite increases in testing capacity, multiple challenges remain in deploying traditional reverse transcription and quantitative PCR (RT-qPCR) tests at the scale required for population screening of asymptomatic individuals. We have developed SwabSeq, a high-throughput testing platform for SARS-CoV-2 that uses next-generation sequencing as a readout. SwabSeq employs sample-specific molecular barcodes to enable thousands of samples to be combined and simultaneously analyzed for the presence or absence of SARS-CoV-2 in a single run. Importantly, SwabSeq incorporates an in vitro RNA standard that mimics the viral amplicon, but can be distinguished by sequencing. This standard allows for end-point rather than quantitative PCR, improves quantitation, reduces requirements for automation and sample-to-sample normalization, enables purification-free detection, and gives better ability to call true negatives. After setting up SwabSeq in a high-complexity CLIA laboratory, we performed more than 80,000 tests for COVID-19 in less than two months, confirming in a real world setting that SwabSeq inexpensively delivers highly sensitive and specific results at scale, with a turn-around of less than 24 hours. Our clinical laboratory uses SwabSeq to test both nasal and saliva samples without RNA extraction, while maintaining analytical sensitivity comparable to or better than traditional RT-qPCR tests. Moving forward, SwabSeq can rapidly scale up testing to mitigate devastating spread of novel pathogens.

14.

Metalign: efficient alignment-based metagenomic profiling via containment min hash.

LaPierre, Nathan; Alser, Mohammed; Eskin, Eleazar; Koslicki, David; Mangul, Serghei.

Genome Biol ; 21(1): 242, 2020 Sep 10.

Article in English | MEDLINE | ID: mdl-32912225

ABSTRACT

Metagenomic profiling, predicting the presence and relative abundances of microbes in a sample, is a critical first step in microbiome analysis. Alignment-based approaches are often considered accurate yet computationally infeasible. Here, we present a novel method, Metalign, that performs efficient and accurate alignment-based metagenomic profiling. We use a novel containment min hash approach to pre-filter the reference database prior to alignment and then process both uniquely aligned and multi-aligned reads to produce accurate abundance estimates. In performance evaluations on both real and simulated datasets, Metalign is the only method evaluated that maintained high performance and competitive running time across all datasets.
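The pre-filtering idea mentioned in this abstract, estimating how much of a reference genome's k-mer content is contained in a sample before aligning against it, can be sketched roughly as follows. This is a simplified illustration and not Metalign's implementation: the function names, the choice of SHA-1 as the hash, the tiny k and sketch size, and the use of an exact set for the sample (where published containment MinHash descriptions use a Bloom filter) are all assumptions made for the example.

```python
import hashlib

def kmer_hashes(sequence, k):
    """Set of 64-bit hashes of every k-mer in the sequence."""
    return {
        int.from_bytes(hashlib.sha1(sequence[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(sequence) - k + 1)
    }

def containment_estimate(reference, sample, k=5, sketch_size=8):
    """Estimate the fraction of reference k-mers that occur in the sample.

    Only the `sketch_size` smallest reference hashes are checked, so the
    cost per reference depends on the sketch size, not the genome size.
    Assumes the reference is at least k characters long.
    """
    ref_sketch = sorted(kmer_hashes(reference, k))[:sketch_size]
    sample_hashes = kmer_hashes(sample, k)  # stand-in for a Bloom filter
    hits = sum(1 for h in ref_sketch if h in sample_hashes)
    return hits / len(ref_sketch)

# Hypothetical toy sequences.
reference = "ACGTACGTTGCAAGCTTAGC"
sample_with_ref = "GGG" + reference + "CCC"
sample_without_ref = "T" * 20

present = containment_estimate(reference, sample_with_ref)
absent = containment_estimate(reference, sample_without_ref)
print(present, absent)
```

A profiler could keep only references whose estimated containment exceeds some threshold and align reads against that reduced database; the threshold and downstream handling of multi-aligned reads are outside the scope of this sketch.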

Subjects

Metagenomics/methods , Sequence Alignment/methods , Microbiota

15.

Phenotype Prediction from Metagenomic Data Using Clustering and Assembly with Multiple Instance Learning (CAMIL).

Rahman, Mohammad Arifur; LaPierre, Nathan; Rangwala, Huzefa.

IEEE/ACM Trans Comput Biol Bioinform ; 17(3): 828-840, 2020.

Article in English | MEDLINE | ID: mdl-28981422

ABSTRACT

The recent advent of Metagenome Wide Association Studies (MGWAS) provides insight into the role of microbes on human health and disease. However, the studies present several computational challenges. In this paper, we demonstrate a novel, efficient, and effective Multiple Instance Learning (MIL) based computational pipeline to predict patient phenotype from metagenomic data. MIL methods have the advantage that besides predicting the clinical phenotype, we can infer the instance level label or role of microbial sequence reads in the specific disease. Specifically, we use a Bag of Words method, which has been shown to be one of the most effective and efficient MIL methods. This involves assembly of the metagenomic sequence data, clustering of the assembled contigs, extracting features from the contigs, and using an SVM classifier to predict patient labels and identify the most relevant sequence clusters. With the exception of the given labels for the patients, this entire process is de novo (unsupervised). We call our pipeline "CAMIL", which stands for Clustering and Assembly with Multiple Instance Learning. We use multiple state-of-the-art clustering methods for feature extraction, evaluation, and comparison of the performance of our proposed approach for each of these clustering methods. We also present a fast and scalable pre-clustering algorithm as a preprocessing step for our proposed pipeline. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using locality sensitive hashing (LSH). These canopies are then refined by using state-of-the-art sequence clustering algorithms. We use data from a well-known MGWAS study of patients with Type-2 Diabetes and show that our pipeline significantly outperforms the classifier used in that paper, as well as other common MIL methods.

Subjects

Machine Learning , Metagenome/genetics , Metagenomics/methods , Phenotype , Cluster Analysis , Humans

16.

De novo Nanopore read quality improvement using deep learning.

LaPierre, Nathan; Egan, Rob; Wang, Wei; Wang, Zhong.

BMC Bioinformatics ; 20(1): 552, 2019 Nov 06.

Article in English | MEDLINE | ID: mdl-31694525

ABSTRACT

BACKGROUND: Long read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. The limited sensitivity of existing read-based error correction methods can cause large-scale mis-assemblies in the assembled genomes, motivating further innovation in this area. RESULTS: Here we developed a Convolutional Neural Network (CNN) based method, called MiniScrub, for identification and subsequent "scrubbing" (removal) of low-quality Nanopore read segments to minimize their interference in downstream assembly process. MiniScrub first generates read-to-read overlaps via MiniMap2, then encodes the overlaps into images, and finally builds CNN models to predict low-quality segments. Applying MiniScrub to real world control datasets under several different parameters, we show that it robustly improves read quality, and improves read error correction in the metagenome setting. Compared to raw reads, de novo genome assembly with scrubbed reads produces many fewer mis-assemblies and large indel errors. CONCLUSIONS: MiniScrub is able to robustly improve read quality of Oxford Nanopore reads, especially in the metagenome setting, making it useful for downstream applications such as de novo assembly. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub .

Subjects

Deep Learning/standards , Nanopore Sequencing/methods , Databases, Genetic , Humans , Metagenome , Quality Improvement , Software

17.

MiCoP: microbial community profiling method for detecting viral and fungal organisms in metagenomic samples.

LaPierre, Nathan; Mangul, Serghei; Alser, Mohammed; Mandric, Igor; Wu, Nicholas C; Koslicki, David; Eskin, Eleazar.

BMC Genomics ; 20(Suppl 5): 423, 2019 Jun 06.

Article in English | MEDLINE | ID: mdl-31167634

ABSTRACT

BACKGROUND: High throughput sequencing has spurred the development of metagenomics, which involves the direct analysis of microbial communities in various environments such as soil, ocean water, and the human body. Many existing methods based on marker genes or k-mers have limited sensitivity or are too computationally demanding for many users. Additionally, most work in metagenomics has focused on bacteria and archaea, neglecting to study other key microbes such as viruses and eukaryotes. RESULTS: Here we present a method, MiCoP (Microbiome Community Profiling), that uses fast-mapping of reads to build a comprehensive reference database of full genomes from viruses and eukaryotes to achieve maximum read usage and enable the analysis of the virome and eukaryome in each sample. We demonstrate that mapping of metagenomic reads is feasible for the smaller viral and eukaryotic reference databases. We show that our method is accurate on simulated and mock community data and identifies many more viral and fungal species than previously-reported results on real data from the Human Microbiome Project. CONCLUSIONS: MiCoP is a mapping-based method that proves more effective than existing methods at abundance profiling of viruses and eukaryotes in metagenomic samples. MiCoP can be used to detect the full diversity of these communities. The code, data, and documentation are publicly available on GitHub at: https://github.com/smangul1/MiCoP .

Subjects

Computational Biology/methods , Fungi/genetics , Genetic Markers , Metagenomics/methods , Microbiota , Sequence Analysis, DNA/methods , Viruses/genetics , Algorithms , Fungi/classification , Genome, Fungal , Genome, Viral , High-Throughput Nucleotide Sequencing/methods , Humans , Viruses/classification

18.

MetaPheno: A critical evaluation of deep learning and machine learning in metagenome-based disease prediction.

LaPierre, Nathan; Ju, Chelsea J-T; Zhou, Guangyu; Wang, Wei.

Methods ; 166: 74-82, 2019 Aug 15.

Article in English | MEDLINE | ID: mdl-30885720

ABSTRACT

The human microbiome plays a number of critical roles, impacting almost every aspect of human health and well-being. Conditions in the microbiome have been linked to a number of significant diseases. Additionally, revolutions in sequencing technology have led to a rapid increase in publicly-available sequencing data. Consequently, there have been growing efforts to predict disease status from metagenomic sequencing data, with a proliferation of new approaches in the last few years. Some of these efforts have explored utilizing a powerful form of machine learning called deep learning, which has been applied successfully in several biological domains. Here, we review some of these methods and the algorithms that they are based on, with a particular focus on deep learning methods. We also perform a deeper analysis of Type 2 Diabetes and obesity datasets that have eluded improved results, using a variety of machine learning and feature extraction methods. We conclude by offering perspectives on study design considerations that may impact results and future directions the field can take to improve results and offer more valuable conclusions. The scripts and extracted features for the analyses conducted in this paper are available via GitHub: https://github.com/nlapier2/metapheno.

Subjects

Deep Learning , Diabetes Mellitus, Type 2/genetics , Metagenome/genetics , Obesity/genetics , Algorithms , Diabetes Mellitus, Type 2/microbiology , Humans , Machine Learning/statistics & numerical data , Metagenomics/methods , Microbiota/genetics , Obesity/microbiology

19.

Metagenome sequence clustering with hash-based canopies.

Rahman, Mohammad Arifur; LaPierre, Nathan; Rangwala, Huzefa; Barbara, Daniel.

J Bioinform Comput Biol ; 15(6): 1740006, 2017 Dec.

Article in English | MEDLINE | ID: mdl-29113561

ABSTRACT

Metagenomics is the collective sequencing of co-existing microbial communities which are ubiquitous across various clinical and ecological environments. Due to the large volume and random short sequences (reads) obtained from community sequences, analysis of the diversity, abundance and functions of different organisms within these communities is a challenging task. We present a fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using hashing. These canopies are then refined by using state-of-the-art sequence clustering algorithms. This canopy-clustering (CC) algorithm can be used as a pre-processing phase for computationally expensive clustering algorithms. We use and compare three hashing schemes for canopy construction with five popular and state-of-the-art sequence clustering methods. We evaluate our clustering algorithm on synthetic and real-world 16S and whole metagenome benchmarks. We demonstrate the ability of our proposed approach to determine meaningful Operational Taxonomic Units (OTUs) and observe significant speedup with regard to run time when compared to different clustering algorithms. We also make our source code publicly available on GitHub.

Subjects

Algorithms , Biodiversity , Metagenome , Metagenomics/methods , Cluster Analysis , Databases, Factual , Gastrointestinal Microbiome/genetics , Humans , Liver Cirrhosis/microbiology , Microbiota , Phylogeny , RNA, Ribosomal, 16S , RNA, Ribosomal, 18S , Sequence Analysis, RNA/methods , Soil Microbiology
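
The two-stage idea in the abstract of entry 19 (hash reads into canopies, then cluster within each canopy) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name its three hashing schemes or five clustering methods, so a single min-hash over k-mers and a greedy Jaccard-based clusterer are used here as stand-ins, and all function names and parameters (`min_hash`, `build_canopies`, `refine`, `canopy_cluster`, `k`, `threshold`) are hypothetical.

```python
import hashlib
from collections import defaultdict


def kmers(read, k):
    """Return the set of all length-k substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def min_hash(read, k=4):
    """Hash every k-mer and keep the smallest digest as the canopy key.

    Assumption: one min-hash stands in for the paper's hashing schemes.
    Reads shorter than k are keyed by the read itself.
    """
    grams = kmers(read, k) or {read}
    return min(hashlib.md5(g.encode()).hexdigest() for g in grams)


def build_canopies(reads, k=4):
    """Stage 1: partition reads into canopies keyed by their min-hash."""
    canopies = defaultdict(list)
    for read in reads:
        canopies[min_hash(read, k)].append(read)
    return dict(canopies)


def jaccard(a, b, k=4):
    """Jaccard similarity between the k-mer sets of two reads."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 1.0


def refine(canopy, threshold=0.5, k=4):
    """Stage 2: greedy clustering inside one canopy.

    Assumption: a stand-in for a full sequence clustering tool. The first
    read of each cluster acts as its representative.
    """
    clusters = []
    for read in canopy:
        for cluster in clusters:
            if jaccard(read, cluster[0], k) >= threshold:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters


def canopy_cluster(reads, threshold=0.5, k=4):
    """Run both stages and return a flat list of clusters."""
    clusters = []
    for canopy in build_canopies(reads, k).values():
        clusters.extend(refine(canopy, threshold, k))
    return clusters
```

Because the expensive pairwise comparison only runs inside each canopy, the cost of the second stage depends on canopy sizes rather than on the full read set, which is the source of the speedup the abstract reports.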
