Results 1 - 12 of 12
1.
Nat Commun ; 15(1): 6306, 2024 Jul 26.
Article in English | MEDLINE | ID: mdl-39060254

ABSTRACT

Tiled amplicon sequencing has served as an essential tool for tracking the spread and evolution of pathogens. Over 15 million complete SARS-CoV-2 genomes are now publicly available, most sequenced and assembled via tiled amplicon sequencing. While computational tools for tiled amplicon design exist, they require downstream manual optimization both computationally and experimentally, which is slow and costly. Here we present Olivar, a first step towards a fully automated, variant-aware design of tiled amplicons for pathogen genomes. Olivar converts each nucleotide of the target genome into a numeric risk score, capturing undesired sequence features that should be avoided. In a direct comparison with PrimalScheme, we show that Olivar has fewer mismatches overlapping with primers and predicted PCR byproducts. We also compare Olivar head-to-head with ARTIC v4.1, the most widely used primer set for SARS-CoV-2 sequencing, and show that Olivar yields read mapping rates (~90%) similar to, and coverage better than, the manually designed ARTIC v4.1 amplicons. We also evaluate Olivar on real wastewater samples and find that Olivar has up to 3-fold higher mapping rates while retaining similar coverage. In summary, Olivar automates and accelerates the generation of tiled amplicons, even in situations of high mutation frequency and/or density. Olivar is available online as a web application at https://olivar.rice.edu and can be installed locally as a command line tool with Bioconda. Source code, installation guide, and usage are available at https://github.com/treangenlab/Olivar.


Subject(s)
COVID-19 , DNA Primers , Genome, Viral , SARS-CoV-2 , SARS-CoV-2/genetics , COVID-19/virology , COVID-19/diagnosis , Humans , Genome, Viral/genetics , DNA Primers/genetics , High-Throughput Nucleotide Sequencing/methods , Wastewater/virology , Multiplex Polymerase Chain Reaction/methods , Software
2.
bioRxiv ; 2023 Sep 30.
Article in English | MEDLINE | ID: mdl-36824759

ABSTRACT

Tiled amplicon sequencing has served as an essential tool for tracking the spread and evolution of pathogens. Over 2 million complete SARS-CoV-2 genomes are now publicly available, most sequenced and assembled via tiled amplicon sequencing. While computational tools for tiled amplicon design exist, they require downstream manual optimization both computationally and experimentally, which is slow and costly. Here we present Olivar, a first step towards a fully automated, variant-aware design of tiled amplicons for pathogen genomes. Olivar converts each nucleotide of the target genome into a numeric risk score, capturing undesired sequence features that should be avoided. In a direct comparison with PrimalScheme, we show that Olivar has fewer SNPs overlapping with primers and predicted PCR byproducts. We also compared Olivar head-to-head with ARTIC v4.1, the most widely used primer set for SARS-CoV-2 sequencing, and showed that Olivar yields read mapping rates (~90%) similar to, and coverage better than, the manually designed ARTIC v4.1 amplicons. We also evaluated Olivar on real wastewater samples and found that Olivar had up to 3-fold higher mapping rates while retaining similar coverage. In summary, Olivar automates and accelerates the generation of tiled amplicons, even in situations of high mutation frequency and/or density. Olivar is available as a web application at https://olivar.rice.edu. Olivar can also be installed locally as a command line tool with Bioconda. Source code, installation guide and usage are available at https://github.com/treangenlab/Olivar.

3.
Nat Commun ; 13(1): 6799, 2022 Nov 10.
Article in English | MEDLINE | ID: mdl-36357382

ABSTRACT

Computational analysis of host-associated microbiomes has opened the door to numerous discoveries relevant to human health and disease. However, contaminant sequences in metagenomic samples can potentially impact the interpretation of findings reported in microbiome studies, especially in low-biomass environments. Contamination from DNA extraction kits or sampling lab environments leaves taxonomic "bread crumbs" across multiple distinct sample types. Here we describe Squeegee, a de novo contamination detection tool that is based upon this principle, allowing the detection of microbial contaminants when negative controls are unavailable. On the low-biomass samples, we compare Squeegee predictions to experimental negative control data and show that Squeegee accurately recovers putative contaminants. We analyze samples of varying biomass from the Human Microbiome Project and identify likely, previously unreported kit contamination. Collectively, our results highlight that Squeegee can identify microbial contaminants with high precision and thus represents a computational approach for contaminant detection when negative controls are unavailable.


Subject(s)
Microbiota , Humans , Biomass , Microbiota/genetics , Metagenomics/methods , Metagenome , Specimen Handling
4.
Genome Biol ; 23(1): 133, 2022 Jun 20.
Article in English | MEDLINE | ID: mdl-35725628

ABSTRACT

The COVID-19 pandemic has emphasized the importance of accurate detection of known and emerging pathogens. However, robust characterization of pathogenic sequences remains an open challenge. To address this need, we developed SeqScreen, which accurately characterizes short nucleotide sequences using taxonomic and functional labels and a customized set of curated Functions of Sequences of Concern (FunSoCs) specific to microbial pathogenesis. We show that our ensemble machine learning model can label protein-coding sequences with FunSoCs with high recall and precision. SeqScreen is a step towards a novel paradigm of functionally informed synthetic DNA screening and pathogen characterization, and is available for download at www.gitlab.com/treangenlab/seqscreen.


Subject(s)
Machine Learning , Bacteria/genetics , Bacteria/pathogenicity , COVID-19 , Humans , Leukocytes, Mononuclear/virology , Open Reading Frames
5.
Nat Commun ; 13(1): 1728, 2022 Apr 01.
Article in English | MEDLINE | ID: mdl-35365602

ABSTRACT

Deep Learning (DL) has recently enabled unprecedented advances in one of the grand challenges in computational biology: the half-century-old problem of protein structure prediction. In this paper we discuss recent advances, limitations, and future perspectives of DL on five broad areas: protein structure prediction, protein function prediction, genome engineering, systems biology and data integration, and phylogenetic inference. We discuss each application area and cover the main bottlenecks of DL approaches, such as training data, problem scope, and the ability to leverage existing DL architectures in new contexts. To conclude, we provide a summary of the subject-specific and general challenges for DL across the biosciences.


Subject(s)
Deep Learning , Computational Biology , Phylogeny , Proteins , Systems Biology
6.
Genome Res ; 31(4): 635-644, 2021 Apr.
Article in English | MEDLINE | ID: mdl-33602693

ABSTRACT

The COVID-19 pandemic has sparked an urgent need to uncover the underlying biology of this devastating disease. Though RNA viruses mutate more rapidly than DNA viruses, there are a relatively small number of single nucleotide polymorphisms (SNPs) that differentiate the main SARS-CoV-2 lineages that have spread throughout the world. In this study, we investigated 129 RNA-seq data sets and 6928 consensus genomes to contrast the intra-host and inter-host diversity of SARS-CoV-2. Our analyses yielded three major observations. First, the mutational profile of SARS-CoV-2 highlights intra-host single nucleotide variant (iSNV) and SNP similarity, albeit with differences in C > U changes. Second, iSNV and SNP patterns in SARS-CoV-2 are more similar to MERS-CoV than SARS-CoV-1. Third, a significant fraction of insertions and deletions contribute to the genetic diversity of SARS-CoV-2. Altogether, our findings provide insight into SARS-CoV-2 genomic diversity, inform the design of detection tests, and highlight the potential of iSNVs for tracking the transmission of SARS-CoV-2.


Subject(s)
COVID-19/diagnosis , COVID-19/transmission , Genetic Variation , Genome, Viral , Real-Time Polymerase Chain Reaction/methods , SARS-CoV-2/genetics , COVID-19/virology , Host-Pathogen Interactions , Humans , Polymorphism, Single Nucleotide
7.
Nat Commun ; 12(1): 1167, 2021 Feb 26.
Article in English | MEDLINE | ID: mdl-33637701

ABSTRACT

With advances in synthetic biology and genome engineering comes a heightened awareness of potential misuse related to biosafety concerns. A recent study employed machine learning to identify the lab-of-origin of DNA sequences to help mitigate some of these concerns. Despite its promising results, this deep learning based approach had limited accuracy, was computationally expensive to train, and could not provide the precise features used in its predictions. To address these shortcomings, we developed PlasmidHawk for lab-of-origin prediction. Compared to a machine learning approach, PlasmidHawk has higher prediction accuracy: it correctly predicts the depositing lab of an unknown sequence 76% of the time, and 85% of the time the correct lab is among the top 10 candidates. In addition, PlasmidHawk can precisely single out the signature sub-sequences that are responsible for the lab-of-origin detection. In summary, PlasmidHawk represents an explainable and accurate tool for lab-of-origin prediction of synthetic plasmid sequences. PlasmidHawk is available at https://gitlab.com/treangenlab/plasmidhawk.git.


Subject(s)
Plasmids/genetics , Sequence Alignment/methods , Software , Synthetic Biology/methods , DNA , Genetic Engineering/methods , Machine Learning , Neural Networks, Computer
8.
Cell ; 183(5): 1185-1201.e20, 2020 Nov 25.
Article in English | MEDLINE | ID: mdl-33242417

ABSTRACT

Spaceflight is known to impose changes on human physiology with unknown molecular etiologies. To reveal these causes, we used a multi-omics, systems biology analytical approach using biomedical profiles from fifty-nine astronauts and data from NASA's GeneLab derived from hundreds of samples flown in space to determine transcriptomic, proteomic, metabolomic, and epigenetic responses to spaceflight. Overall pathway analyses on the multi-omics datasets showed significant enrichment for mitochondrial processes, as well as innate immunity, chronic inflammation, cell cycle, circadian rhythm, and olfactory functions. Importantly, NASA's Twin Study provided a platform to confirm several of our principal findings. Evidence of altered mitochondrial function and DNA damage was also found in the urine and blood metabolic data compiled from the astronaut cohort and NASA Twin Study data, indicating mitochondrial stress as a consistent phenotype of spaceflight.


Subject(s)
Genomics , Mitochondria/pathology , Space Flight , Stress, Physiological , Animals , Circadian Rhythm , Extracellular Matrix/metabolism , Humans , Immunity, Innate , Lipid Metabolism , Metabolic Flux Analysis , Mice, Inbred BALB C , Mice, Inbred C57BL , Muscles/immunology , Organ Specificity , Smell/physiology
10.
bioRxiv ; 2020 Jul 02.
Article in English | MEDLINE | ID: mdl-32637955

ABSTRACT

The COVID-19 pandemic has sparked an urgent need to uncover the underlying biology of this devastating disease. Though RNA viruses mutate more rapidly than DNA viruses, there are a relatively small number of single nucleotide polymorphisms (SNPs) that differentiate the main SARS-CoV-2 clades that have spread throughout the world. In this study, we investigated over 7,000 SARS-CoV-2 datasets to unveil both intrahost and interhost diversity. Our intrahost and interhost diversity analyses yielded three major observations. First, the mutational profile of SARS-CoV-2 highlights iSNV and SNP similarity, albeit with high variability in C>T changes. Second, iSNV and SNP patterns in SARS-CoV-2 are more similar to MERS-CoV than SARS-CoV-1. Third, a significant fraction of small indels fuel the genetic diversity of SARS-CoV-2. Altogether, our findings provide insight into SARS-CoV-2 genomic diversity, inform the design of detection tests, and highlight the potential of iSNVs for tracking the transmission of SARS-CoV-2.

11.
Nucleic Acids Res ; 48(10): 5217-5234, 2020 Jun 04.
Article in English | MEDLINE | ID: mdl-32338745

ABSTRACT

As computational biologists continue to be inundated by ever increasing amounts of metagenomic data, the need for data analysis approaches that keep up with the pace of sequence archives has remained a challenge. In recent years, the accelerated pace of genomic data availability has been accompanied by the application of a wide array of highly efficient approaches from other fields to the field of metagenomics. For instance, sketching algorithms such as MinHash have seen a rapid and widespread adoption. These techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculations. Here, we briefly review the fundamentals of the most impactful probabilistic and signal processing algorithms. We also highlight more recent advances to augment previous reviews in these areas that have taken a broader approach. We then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions.
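As a rough illustration of the sketching idea mentioned in this abstract, the following minimal Python sketch estimates the Jaccard similarity of two sequences from bottom-s MinHash sketches of their k-mers. It is not taken from the reviewed article: the function names, the choice of BLAKE2b as the hash, and the default k-mer and sketch sizes are assumptions made only for this example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=4, size=8):
    """Bottom-s sketch: the `size` smallest distinct k-mer hash values."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate Jaccard similarity from two bottom-s sketches.

    Take the `size` smallest hashes of the merged sketches and count
    how many of them occur in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    a = sketch("ACGTACGTAC", k=4, size=16)
    b = sketch("ACGTTTTT", k=4, size=16)
    print(jaccard_estimate(a, b, size=16))
```

When the sketch size is at least the number of distinct k-mers in both sequences, the estimate equals the exact Jaccard index; smaller sketches trade accuracy for memory, which is the trade-off that makes the approach attractive for large archives.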


Subject(s)
Algorithms , Metagenomics/methods , Probability , Signal Processing, Computer-Assisted , Humans , Metagenome/genetics
12.
Bioinformatics ; 34(16): 2848-2850, 2018 Aug 15.
Article in English | MEDLINE | ID: mdl-29562324

ABSTRACT

Summary: The evolutionary histories of individual regions across a genomic alignment, called 'local genealogies', can differ from each other due to processes such as recombination. Elucidating and analyzing these local genealogies is important for a large number of inference tasks, including those pertaining to species phylogenies, evolutionary processes and trait mapping. In this paper, we present a toolkit for automated local phylogenomic analyses, or ALPHA. The purpose of this toolkit is to provide a wide array of functionalities for automated inference of local genealogies as well as analyses based on these local genealogies. The toolkit uses sliding windows to construct local genealogies and can compute a wide array of local phylogeny based statistics, such as the D-statistic. The toolkit comes with a graphical user interface and several import/export functionalities. Over the last few decades, much emphasis in phylogenomics has been put on developing tools for inferring species phylogenies. This toolkit complements those efforts by emphasizing the 'local' aspect of phylogenomics. Availability and implementation: ALPHA is freely available for installation and use, including source code, at https://github.com/chilleo/ALPHA.
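To make the sliding-window D-statistic mentioned in this abstract concrete, here is a minimal Python sketch of the standard ABBA-BABA calculation, D = (nABBA - nBABA) / (nABBA + nBABA), applied per window to four aligned sequences (P1, P2, P3, outgroup). It does not reproduce ALPHA's code or interface: all function names, the window and step parameters, and the rule of skipping sites that are not biallelic A/C/G/T are assumptions made for this illustration.

```python
def abba_baba_counts(p1, p2, p3, out):
    """Count ABBA and BABA site patterns in four aligned sequences.

    The outgroup allele is treated as ancestral (A); the other allele
    at a biallelic site is treated as derived (B).
    """
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, out):
        alleles = {a, b, c, o}
        # Assumption: use only biallelic sites with unambiguous bases.
        if len(alleles) != 2 or not alleles <= set("ACGT"):
            continue
        if a == o and b == c:      # P1 ancestral, P2 and P3 derived
            abba += 1
        elif b == o and a == c:    # P2 ancestral, P1 and P3 derived
            baba += 1
    return abba, baba


def d_statistic(p1, p2, p3, out):
    """Return D, or None if no informative sites exist."""
    abba, baba = abba_baba_counts(p1, p2, p3, out)
    total = abba + baba
    return None if total == 0 else (abba - baba) / total


def windowed_d(p1, p2, p3, out, window, step):
    """Compute D in sliding windows; returns (start, D) pairs."""
    n = min(len(p1), len(p2), len(p3), len(out))
    results = []
    for start in range(0, n - window + 1, step):
        end = start + window
        d = d_statistic(p1[start:end], p2[start:end],
                        p3[start:end], out[start:end])
        results.append((start, d))
    return results
```

A call such as `windowed_d(p1, p2, p3, out, window=1000, step=500)` would give one D value per window along the alignment; windows with D far from zero are the ones where the local genealogy departs from the assumed species tree.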


Subject(s)
Genomics/methods , Phylogeny , Genealogy and Heraldry , Genome , Software