Results 1 - 19 of 19
1.
Bioinformatics ; 2024 May 09.
Article in English | MEDLINE | ID: mdl-38724243

ABSTRACT

MOTIVATION: Since 2016, the number of microbial species with available reference genomes in NCBI has more than tripled. Multiple genome alignment, the process of identifying nucleotides across multiple genomes which share a common ancestor, is used as the input to numerous downstream comparative analysis methods. Parsnp is one of the few multiple genome alignment methods able to scale to the current era of genomic data; however, there has been no major release since its initial release in 2014. RESULTS: To address this gap, we developed Parsnp v2, which significantly improves on its original release. Parsnp v2 provides users with more control over executions of the program, allowing Parsnp to be better tailored for different use-cases. We introduce a partitioning option to Parsnp, which allows the input to be broken up into multiple parallel alignment processes which are then combined into a final alignment. The partitioning option can reduce memory usage by over 4x and reduce runtime by over 2x, all while maintaining a precise core-genome alignment. The partitioning workflow is also less susceptible to complications caused by assembly artifacts and minor variation, as alignment anchors only need to be conserved within their partition and not across the entire input set. We highlight the performance on datasets involving thousands of bacterial and viral genomes. AVAILABILITY: Parsnp v2 is available at https://github.com/marbl/parsnp.
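The remark that anchors only need to be conserved within their partition can be illustrated with a toy model. The sketch below is not Parsnp's algorithm: it assumes an "anchor" is simply a k-mer present in every genome of a set, and the sequences, the value of `k`, and all function names are hypothetical choices made for illustration.

```python
from typing import List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(genomes: List[str], k: int) -> Set[str]:
    """Toy stand-in for anchors: k-mers present in every genome of the set."""
    return set.intersection(*(kmers(g, k) for g in genomes))


def partition(items: List[str], size: int) -> List[List[str]]:
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical toy genomes; the last one carries a variant that breaks
# most of the k-mers shared by the other three.
genomes = ["ACGTACGGA", "ACGTACGGT", "TACGTACGG", "ACGAACGGA"]
k = 4

# Anchors required across the entire input set.
global_anchors = shared_kmers(genomes, k)

# Anchors required only within each partition of two genomes.
per_partition = [shared_kmers(p, k) for p in partition(genomes, 2)]
```

In this toy input, a single divergent genome removes most shared k-mers from the global set, while the partition that does not contain it keeps all of its own. Every globally shared k-mer is, by construction, also shared within each partition.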

2.
bioRxiv ; 2024 Jan 31.
Article in English | MEDLINE | ID: mdl-38352342

ABSTRACT

Motivation: Since 2016, the number of microbial species with available reference genomes in NCBI has more than tripled. Multiple genome alignment, the process of identifying nucleotides across multiple genomes which share a common ancestor, is used as the input to numerous downstream comparative analysis methods. Parsnp is one of the few multiple genome alignment methods able to scale to the current era of genomic data; however, there has been no major release since its initial release in 2014. Results: To address this gap, we developed Parsnp v2, which significantly improves on its original release. Parsnp v2 provides users with more control over executions of the program, allowing Parsnp to be better tailored for different use-cases. We introduce a partitioning option to Parsnp, which allows the input to be broken up into multiple parallel alignment processes which are then combined into a final alignment. The partitioning option can reduce memory usage by over 4x and reduce runtime by over 2x, all while maintaining a precise core-genome alignment. The partitioning workflow is also less susceptible to complications caused by assembly artifacts and minor variation, as alignment anchors only need to be conserved within their partition and not across the entire input set. We highlight the performance on datasets involving thousands of bacterial and viral genomes. Availability: Parsnp is available at https://github.com/marbl/parsnp.

3.
Bioinformatics ; 39(9)2023 09 02.
Article in English | MEDLINE | ID: mdl-37603771

ABSTRACT

MOTIVATION: The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. RESULTS: To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. AVAILABILITY AND IMPLEMENTATION: MashMap3 is available at https://github.com/marbl/MashMap.


Subject(s)
Computational Biology , Genomics
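To make the quantity discussed in the abstract above concrete, the following sketch computes the exact Jaccard similarity of two k-mer sets and a standard bottom-s MinHash estimate of it. This is a generic textbook construction, not MashMap code; the hash function (a truncated SHA-1), the example sequences, and the function names are assumptions chosen for illustration.

```python
import hashlib
from typing import Set


def kmer_set(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard similarity |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def h(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (assumed hash, for illustration)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def bottom_s_sketch(kmers: Set[str], s: int) -> Set[int]:
    """Keep the s smallest hash values of the k-mer set."""
    return set(sorted(h(x) for x in kmers)[:s])


def sketch_jaccard(a: Set[str], b: Set[str], s: int) -> float:
    """Bottom-s MinHash estimate of the Jaccard similarity of a and b."""
    sa, sb = bottom_s_sketch(a, s), bottom_s_sketch(b, s)
    # The s smallest hashes of the union can be found from the two sketches.
    union_sketch = set(sorted(sa | sb)[:s])
    if not union_sketch:
        return 1.0
    return len(union_sketch & sa & sb) / len(union_sketch)


x = kmer_set("ACGTACGT", 3)  # {ACG, CGT, GTA, TAC}
y = kmer_set("ACGTTCGT", 3)  # {ACG, CGT, GTT, TTC, TCG}
exact = jaccard(x, y)        # 2 shared k-mers out of 7 distinct ones
estimate = sketch_jaccard(x, y, s=100)
```

When `s` is at least the size of the union, the sketch keeps every hash and the estimate coincides with the exact value. Smaller sketches trade accuracy for speed and memory.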
4.
ACS Bio Med Chem Au ; 3(3): 240-251, 2023 Jun 21.
Article in English | MEDLINE | ID: mdl-37363077

ABSTRACT

The radical S-adenosylmethionine (rSAM) superfamily has become a wellspring for discovering new enzyme chemistry, especially regarding ribosomally synthesized and post-translationally modified peptides (RiPPs). Here, we report a compendium of nearly 15,000 rSAM proteins with high-confidence involvement in RiPP biosynthesis. While recent bioinformatics advances have unveiled the broad sequence space covered by rSAM proteins, the significant challenge of functional annotation remains unsolved. Through a combination of sequence analysis and protein structural predictions, we identified a set of catalytic site proximity residues with functional predictive power, especially among the diverse rSAM proteins that form sulfur-to-α carbon thioether (sactionine) linkages. As a case study, we report that an rSAM protein from Streptomyces sparsogenes (StsB) shares higher full-length similarity with MftC (mycofactocin biosynthesis) than any other characterized enzyme. However, a comparative analysis of StsB to known rSAM proteins using "catalytic site proximity" predicted that StsB would be distinct from MftC and instead form sactionine bonds. The prediction was confirmed by mass spectrometry, targeted mutagenesis, and chemical degradation. We further used "catalytic site proximity" analysis to identify six new sactipeptide groups undetectable by traditional genome-mining strategies. Additional catalytic site proximity profiling of cyclophane-forming rSAM proteins suggests that this approach will be more broadly applicable and enhance, if not outright correct, protein functional predictions based on traditional genomic enzymology principles.

5.
bioRxiv ; 2023 May 18.
Article in English | MEDLINE | ID: mdl-37325780

ABSTRACT

Motivation: The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. Results: To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications.
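As a rough illustration of window-based k-mer sampling, the sketch below selects, for every window of `w` consecutive k-mers, the `s` k-mers with the smallest hash values. With `s = 1` this is ordinary minimizer winnowing; larger `s` samples several k-mers per window, which is the general idea the abstract attributes to minmers. It is a naive simplification under assumed parameters, not the MashMap3 scheme, and the hash function, sequence, and names are hypothetical.

```python
import hashlib
from typing import List


def h(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (assumed hash, for illustration)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def sampled_positions(seq: str, k: int, w: int, s: int = 1) -> List[int]:
    """Start positions of k-mers that are among the s smallest hashes
    in at least one window of w consecutive k-mers.

    Ties between equal hashes are broken by the leftmost position.
    """
    hashes = [(h(seq[i:i + k]), i) for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(hashes) - w + 1):
        window = sorted(hashes[start:start + w])
        chosen.update(pos for _, pos in window[:s])
    return sorted(chosen)


seq = "ACGTTGCATGCCGTA"  # hypothetical example sequence
k, w = 4, 5

minimizer_positions = sampled_positions(seq, k, w, s=1)
multi_positions = sampled_positions(seq, k, w, s=2)
```

Re-sorting every window costs O(w log w) per position, which is fine for a demonstration. A practical implementation would maintain the window incrementally instead.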

6.
bioRxiv ; 2023 Jul 06.
Article in English | MEDLINE | ID: mdl-36993481

ABSTRACT

Massively parallel genetic screens have been used to map sequence-to-function relationships for a variety of genetic elements. However, because these approaches only interrogate short sequences, it remains challenging to perform high-throughput (HT) assays on constructs containing combinations of sequence elements arranged across multi-kb length scales. Overcoming this barrier could accelerate synthetic biology; by screening diverse gene circuit designs, "composition-to-function" mappings could be created that reveal genetic part composability rules and enable rapid identification of behavior-optimized variants. Here, we introduce CLASSIC, a generalizable genetic screening platform that combines long- and short-read next-generation sequencing (NGS) modalities to quantitatively assess pooled libraries of DNA constructs of arbitrary length. We show that CLASSIC can measure expression profiles of >10^5 drug-inducible gene circuit designs (ranging from 6-9 kb) in a single experiment in human cells. Using statistical inference and machine learning (ML) approaches, we demonstrate that data obtained with CLASSIC enables predictive modeling of an entire circuit design landscape, offering critical insight into underlying design principles. Our work shows that by expanding the throughput and understanding gained with each design-build-test-learn (DBTL) cycle, CLASSIC dramatically augments the pace and scale of synthetic biology and establishes an experimental basis for data-driven design of complex genetic systems.

7.
bioRxiv ; 2023 Sep 30.
Article in English | MEDLINE | ID: mdl-36824759

ABSTRACT

Tiled amplicon sequencing has served as an essential tool for tracking the spread and evolution of pathogens. Over 2 million complete SARS-CoV-2 genomes are now publicly available, most sequenced and assembled via tiled amplicon sequencing. While computational tools for tiled amplicon design exist, they require downstream manual optimization both computationally and experimentally, which is slow and costly. Here we present Olivar, a first step towards a fully automated, variant-aware design of tiled amplicons for pathogen genomes. Olivar converts each nucleotide of the target genome into a numeric risk score, capturing undesired sequence features that should be avoided. In a direct comparison with PrimalScheme, we show that Olivar has fewer SNPs overlapping with primers and fewer predicted PCR byproducts. We also compared Olivar head-to-head with ARTIC v4.1, the most widely used primer set for SARS-CoV-2 sequencing, and show that Olivar yields read mapping rates (~90%) similar to, and coverage better than, those of the manually designed ARTIC v4.1 amplicons. We also evaluated Olivar on real wastewater samples and found that Olivar had up to 3-fold higher mapping rates while retaining similar coverage. In summary, Olivar automates and accelerates the generation of tiled amplicons, even in situations of high mutation frequency and/or density. Olivar is available as a web application at https://olivar.rice.edu. Olivar can also be installed locally as a command line tool with Bioconda. Source code, installation guide and usage are available at https://github.com/treangenlab/Olivar.
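The idea of scoring every nucleotide and then avoiding high-risk stretches can be sketched as a sliding-window search. The abstract does not say how Olivar computes or combines its risk scores, so everything below is an assumption for illustration: the per-base risk values are made up, a candidate primer site is scored by the plain sum of its risks, and the function name is hypothetical.

```python
from typing import List, Tuple


def lowest_risk_window(risk: List[int], length: int) -> Tuple[int, int]:
    """Return (start, total_risk) of the window of `length` positions
    with the smallest summed risk; ties go to the leftmost window."""
    if length <= 0 or length > len(risk):
        raise ValueError("length must be between 1 and len(risk)")
    total = sum(risk[:length])
    best_start, best_total = 0, total
    for start in range(1, len(risk) - length + 1):
        # Slide the window one position: add the entering base, drop the leaving one.
        total += risk[start + length - 1] - risk[start - 1]
        if total < best_total:
            best_start, best_total = start, total
    return best_start, best_total


# Hypothetical per-nucleotide risk scores (higher means less desirable,
# for example a position overlapping a known SNP).
risk = [0, 0, 5, 1, 0, 0, 0, 2, 9, 0]

best = lowest_risk_window(risk, 3)
```

For this made-up array, the three-base window starting at index 4 has zero total risk, so it would be preferred over any window touching the high-risk positions at indices 2 and 8.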

8.
Biochemistry ; 62(4): 956-967, 2023 02 21.
Article in English | MEDLINE | ID: mdl-36734655

ABSTRACT

The RiPP precursor recognition element (RRE) is a conserved domain found in many prokaryotic ribosomally synthesized and post-translationally modified peptide (RiPP) biosynthetic gene clusters (BGCs). RREs bind with high specificity and affinity to a recognition sequence within the N-terminal leader region of RiPP precursor peptides. Lasso peptide biosynthesis involves an RRE-dependent leader peptidase, which is discretely encoded or fused to the RRE as a di-domain protein. Here we leveraged thousands of predicted BGCs to define the RRE:leader peptidase interaction through evolutionary covariance analysis. Each interacting domain contributes a three-stranded β-sheet to form a hydrophobic β-sandwich-like interface. The bioinformatics-guided predictions were experimentally confirmed using proteins from discrete and fused lasso peptide BGC architectures. Support for the domain-domain interface derived from chemical shift perturbation, paramagnetic relaxation enhancement experiments, and rapid variant activity screening using cell-free biosynthesis. Further validation of selected variants was performed with purified proteins. We developed a p-nitroanilide-based leader peptidase assay to illuminate the role of RRE domains. Our data show that RRE domains play a dual function. RRE domains deliver the precursor peptide to the leader peptidase, and the rate is saturable as expected for a substrate. RRE domains also partially compose the elusive S2 proteolytic pocket that binds the penultimate threonine of lasso leader peptides. Because the RRE domain is required to form the active site, leader peptidase activity is greatly diminished when the RRE domain is supplied at substoichiometric levels. Full proteolytic activation requires RRE engagement with the recognition sequence-containing portion of the leader peptide. Together, our observations define a new mechanism for protease activity regulation.


Subject(s)
Peptide Hydrolases , Protein Sorting Signals , Peptide Hydrolases/metabolism , Protein Processing, Post-Translational , Bacterial Proteins/chemistry , Peptides/chemistry
9.
Genome Biol ; 23(1): 182, 2022 08 29.
Article in English | MEDLINE | ID: mdl-36038949

ABSTRACT

With the arrival of telomere-to-telomere (T2T) assemblies of the human genome comes the computational challenge of efficiently and accurately constructing multiple genome alignments at an unprecedented scale. By identifying nucleotides across genomes which share a common ancestor, multiple genome alignments commonly serve as the bedrock for comparative genomics studies. In this review, we provide an overview of the algorithmic template that most multiple genome alignment methods follow. We also discuss prospective areas of improvement of multiple genome alignment for keeping up with continuously arriving high-quality T2T assembled genomes and for unlocking clinically-relevant insights.


Subject(s)
Genome, Human , Genomics , Genomics/methods , Humans , Nucleotides , Telomere/genetics
10.
Genome Biol ; 23(1): 133, 2022 06 20.
Article in English | MEDLINE | ID: mdl-35725628

ABSTRACT

The COVID-19 pandemic has emphasized the importance of accurate detection of known and emerging pathogens. However, robust characterization of pathogenic sequences remains an open challenge. To address this need we developed SeqScreen, which accurately characterizes short nucleotide sequences using taxonomic and functional labels and a customized set of curated Functions of Sequences of Concern (FunSoCs) specific to microbial pathogenesis. We show our ensemble machine learning model can label protein-coding sequences with FunSoCs with high recall and precision. SeqScreen is a step towards a novel paradigm of functionally informed synthetic DNA screening and pathogen characterization, available for download at www.gitlab.com/treangenlab/seqscreen.


Subject(s)
Machine Learning , Bacteria/genetics , Bacteria/pathogenicity , COVID-19 , Humans , Leukocytes, Mononuclear/virology , Open Reading Frames
11.
Nat Commun ; 13(1): 1728, 2022 04 01.
Article in English | MEDLINE | ID: mdl-35365602

ABSTRACT

Deep Learning (DL) has recently enabled unprecedented advances in one of the grand challenges in computational biology: the half-century-old problem of protein structure prediction. In this paper we discuss recent advances, limitations, and future perspectives of DL on five broad areas: protein structure prediction, protein function prediction, genome engineering, systems biology and data integration, and phylogenetic inference. We discuss each application area and cover the main bottlenecks of DL approaches, such as training data, problem scope, and the ability to leverage existing DL architectures in new contexts. To conclude, we provide a summary of the subject-specific and general challenges for DL across the biosciences.


Subject(s)
Deep Learning , Computational Biology , Phylogeny , Proteins , Systems Biology
12.
Nat Commun ; 13(1): 1321, 2022 03 14.
Article in English | MEDLINE | ID: mdl-35288552

ABSTRACT

Infectious disease monitoring on Oxford Nanopore Technologies (ONT) platforms offers rapid turnaround times and low cost. Tracking low-frequency intra-host variants provides important insights with respect to elucidating within-host viral population dynamics and transmission. However, given the higher error rate of ONT, accurate identification of intra-host variants with low allele frequencies remains an open challenge with no viable computational solutions available. In response to this need, we present Variabel, a novel approach and first method designed for rescuing low-frequency intra-host variants from ONT data alone. We evaluate Variabel on both synthetic data (SARS-CoV-2) and patient-derived datasets (Ebola virus, norovirus, SARS-CoV-2); our results show that Variabel can accurately identify low-frequency variants below 0.5 allele frequency, outperforming existing state-of-the-art ONT variant callers for this task. Variabel is open-source and available for download at: www.gitlab.com/treangenlab/variabel.


Subject(s)
COVID-19 , Nanopore Sequencing , Nanopores , High-Throughput Nucleotide Sequencing/methods , Humans , SARS-CoV-2/genetics
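For readers unfamiliar with the term, the sketch below shows what "allele frequency below 0.5" means at a single genome position: the fraction of reads supporting a non-reference base. It is only a definition made executable, not Variabel's method, and it does nothing to separate true variants from sequencing errors, which is the hard part the abstract refers to. The read counts, the `min_af` floor, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict


def allele_frequencies(bases: str) -> Dict[str, float]:
    """Fraction of reads supporting each base at one position."""
    counts = Counter(bases)
    depth = sum(counts.values())
    if depth == 0:
        return {}
    return {base: n / depth for base, n in counts.items()}


def low_frequency_alleles(
    bases: str, ref: str, min_af: float = 0.0, max_af: float = 0.5
) -> Dict[str, float]:
    """Non-reference alleles whose frequency lies strictly between
    min_af and max_af."""
    return {
        base: af
        for base, af in allele_frequencies(bases).items()
        if base != ref and min_af < af < max_af
    }


# Hypothetical pileup column: 100 reads covering one position.
column = "A" * 90 + "G" * 8 + "T" * 2

candidates = low_frequency_alleles(column, ref="A", min_af=0.05)
```

Here `G` is supported by 8 of 100 reads and passes the assumed 0.05 floor, while `T` (2 of 100) does not. On error-prone reads, a fixed floor like this is a crude filter at best.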
13.
ACS Chem Biol ; 16(12): 2787-2797, 2021 12 17.
Article in English | MEDLINE | ID: mdl-34766760

ABSTRACT

Graspetides are a class of ribosomally synthesized and post-translationally modified peptide natural products featuring ATP-grasp ligase-dependent formation of macrolactones/macrolactams. These modifications arise from serine, threonine, or lysine donor residues linked to aspartate or glutamate acceptor residues. Characterized graspetides include serine protease inhibitors such as the microviridins and plesiocin. Here, we report an update to Rapid ORF Description and Evaluation Online (RODEO) for the automated detection of graspetides, which identified 3,923 high-confidence graspetide biosynthetic gene clusters. Sequence and co-occurrence analyses doubled the number of graspetide groups from 12 to 24, defined based on core consensus sequence and putative secondary modification. Bioinformatic analyses of the ATP-grasp ligase superfamily suggest that extant graspetide synthetases diverged once from an ancestral ATP-grasp ligase and later evolved to introduce a variety of ring connectivities. Furthermore, we characterized thatisin and iso-thatisin, two graspetides related by conformational stereoisomerism from Lysobacter antibioticus. Derived from a newly identified graspetide group, thatisin and iso-thatisin feature two interlocking macrolactones with identical ring connectivity, as determined by a combination of tandem mass spectrometry (MS/MS), methanolytic, and mutational analyses. NMR spectroscopy of thatisin revealed a cis conformation for a key proline residue, while molecular dynamics simulations, solvent-accessible surface area calculations, and partial methanolytic analysis coupled with MS/MS support a trans conformation for iso-thatisin at the same position. Overall, this work provides a comprehensive overview of the graspetide landscape, and the improved RODEO algorithm will accelerate future graspetide discoveries by enabling open-access analysis of existing and emerging genomes.


Subject(s)
Biological Products/chemistry , Computational Biology/methods , Ligases/chemistry , Peptides/chemistry , Serine Proteinase Inhibitors/chemistry , Molecular Conformation , Multigene Family , Protein Processing, Post-Translational , Ribosomes , Tandem Mass Spectrometry
14.
bioRxiv ; 2021 Sep 06.
Article in English | MEDLINE | ID: mdl-34518837

ABSTRACT

Infectious disease monitoring on Oxford Nanopore Technologies (ONT) platforms offers rapid turnaround times and low cost, exemplified by well over half a million ONT SARS-CoV-2 datasets. Tracking low-frequency intra-host variants has provided important insights with respect to elucidating within-host viral population dynamics and transmission. However, given the higher error rate of ONT, accurate identification of intra-host variants with low allele frequencies remains an open challenge with no viable solutions available. In response to this need, we present Variabel, a novel approach and first method designed for rescuing low-frequency intra-host variants from ONT data alone. We evaluated Variabel on both within-patient and across-patient paired Illumina and ONT datasets; our results show that Variabel can accurately identify low-frequency variants below 0.5 allele frequency, outperforming existing state-of-the-art ONT variant callers for this task. Variabel is open-source and available for download at: www.gitlab.com/treangenlab/variabel.

16.
ArXiv ; 2021 May 07.
Article in English | MEDLINE | ID: mdl-33972927

ABSTRACT

With recent advances in sequencing technology, it has become affordable and practical to sequence genomes to very high depth of coverage, allowing researchers to discover low-frequency variants in the genome. However, because of sequencing errors, developing algorithms that can separate noise from true variants remains an active area of research. LoFreq is a state-of-the-art algorithm for low-frequency variant detection but has a relatively long runtime compared to other tools. In addition, the interface for running in parallel could be simplified, allowing for multithreading as well as distributing jobs to a cluster. In this work we describe some specific contributions to LoFreq that remedy these issues.

17.
Nat Commun ; 12(1): 1167, 2021 02 26.
Article in English | MEDLINE | ID: mdl-33637701

ABSTRACT

With advances in synthetic biology and genome engineering comes a heightened awareness of potential misuse related to biosafety concerns. A recent study employed machine learning to identify the lab-of-origin of DNA sequences to help mitigate some of these concerns. Despite its promising results, this deep learning based approach had limited accuracy, was computationally expensive to train, and wasn't able to provide the precise features that were used in its predictions. To address these shortcomings, we developed PlasmidHawk for lab-of-origin prediction. Compared to a machine learning approach, PlasmidHawk has higher prediction accuracy; PlasmidHawk can successfully predict unknown sequences' depositing labs 76% of the time and 85% of the time the correct lab is in the top 10 candidates. In addition, PlasmidHawk can precisely single out the signature sub-sequences that are responsible for the lab-of-origin detection. In summary, PlasmidHawk represents an explainable and accurate tool for lab-of-origin prediction of synthetic plasmid sequences. PlasmidHawk is available at https://gitlab.com/treangenlab/plasmidhawk.git.


Subject(s)
Plasmids/genetics , Sequence Alignment/methods , Software , Synthetic Biology/methods , DNA , Genetic Engineering/methods , Machine Learning , Neural Networks, Computer
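The two accuracy figures quoted in the abstract above (correct lab predicted outright, and correct lab within the top 10 candidates) are instances of top-k accuracy. The sketch below shows how such a metric is typically computed from ranked predictions; it is a generic definition, not PlasmidHawk code, and the lab names and rankings are invented for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int
) -> float:
    """Fraction of queries whose true label appears among the first k
    entries of that query's ranked candidate list."""
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same length")
    if not truth:
        raise ValueError("at least one query is required")
    hits = sum(1 for candidates, label in zip(ranked, truth) if label in candidates[:k])
    return hits / len(truth)


# Hypothetical ranked lab-of-origin predictions for four query plasmids.
ranked = [
    ["labA", "labB", "labC"],
    ["labB", "labA", "labC"],
    ["labC", "labB", "labA"],
    ["labB", "labC", "labA"],
]
truth = ["labA", "labA", "labA", "labC"]

top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)
```

Top-k accuracy can never decrease as `k` grows, which is why a top-10 figure is always at least as high as the corresponding top-1 figure.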
18.
J Am Chem Soc ; 141(20): 8228-8238, 2019 05 22.
Article in English | MEDLINE | ID: mdl-31059252

ABSTRACT

Recently developed bioinformatic tools have bolstered the discovery of ribosomally synthesized and post-translationally modified peptides (RiPPs). Using an improved version of Rapid ORF Description and Evaluation Online (RODEO 2.0), a biosynthetic gene cluster mining algorithm, we bioinformatically mapped the sactipeptide RiPP class via the radical S-adenosylmethionine (SAM) enzymes that form the characteristic sactionine (sulfur-to-α carbon) cross-links between cysteine and acceptor residues. Hundreds of new sactipeptide biosynthetic gene clusters were uncovered, and a novel sactipeptide "huazacin" with growth-suppressive activity against Listeria monocytogenes was characterized. Bioinformatic analysis further suggested that a group of sactipeptide-like peptides heretofore referred to as six cysteines in forty-five residues (SCIFFs) might not be sactipeptides as previously thought. Indeed, the bioinformatically identified SCIFF peptide "freyrasin" was demonstrated to contain six thioethers linking the β carbons of six aspartate residues. Another SCIFF, thermocellin, was shown to contain a thioether cross-linked to the γ carbon of threonine. SCIFFs feature a different paradigm of non-α carbon thioether linkages, and they are exclusively formed by radical SAM enzymes, as opposed to the polar chemistry employed during lanthipeptide biosynthesis. Therefore, we propose the renaming of the SCIFF family as radical non-α thioether peptides (ranthipeptides) to better distinguish them from the sactipeptide and lanthipeptide RiPP classes.


Subject(s)
Bacterial Proteins/metabolism , Peptides/metabolism , Sulfides/metabolism , Amino Acid Sequence , Bacillus thuringiensis/genetics , Bacterial Proteins/genetics , Computational Biology/methods , Enzymes/metabolism , Internet , Multigene Family , Peptides/genetics , Protein Processing, Post-Translational , S-Adenosylmethionine/metabolism , Terminology as Topic
19.
J Am Chem Soc ; 140(30): 9494-9501, 2018 08 01.
Article in English | MEDLINE | ID: mdl-29983054

ABSTRACT

Thiopeptides are members of the ribosomally synthesized and post-translationally modified peptide family of natural products. Most characterized thiopeptides display nanomolar potency toward Gram-positive bacteria by blocking protein translation with several being produced at the industrial scale for veterinary and livestock applications. Employing our custom bioinformatics program, RODEO, we expand the thiopeptide family of natural products by a factor of four. This effort revealed many new thiopeptide biosynthetic gene clusters with products predicted to be distinct from characterized thiopeptides and identified gene clusters for previously characterized molecules of unknown biosynthetic origin. To further validate our data set of predicted thiopeptide biosynthetic gene clusters, we isolated and characterized a structurally unique thiopeptide featuring a central piperidine and rare thioamide moiety. Termed saalfelduracin, this thiopeptide displayed potent antibiotic activity toward several drug-resistant Gram-positive pathogens. A combination of whole-genome sequencing, comparative genomics, and heterologous expression experiments confirmed that the thioamide moiety of saalfelduracin is installed post-translationally by the joint action of two proteins, TfuA and YcaO. These results reconcile the previously unknown origin of the thioamide in two long-known thiopeptides, thiopeptin and Sch 18640. Armed with these new insights into thiopeptide chemical-genomic space, we provide a roadmap for the discovery of additional members of this natural product family.


Subject(s)
Anti-Bacterial Agents/classification , Multigene Family , Peptides, Cyclic/classification , Peptides, Cyclic/genetics , Actinobacteria/chemistry , Actinobacteria/genetics , Algorithms , Amino Acid Sequence , Anti-Bacterial Agents/chemistry , Anti-Bacterial Agents/isolation & purification , Anti-Bacterial Agents/pharmacology , Bacillus subtilis/drug effects , Computational Biology , Databases, Genetic , Enterococcus faecium/drug effects , Lyases/genetics , Markov Chains , Methicillin-Resistant Staphylococcus aureus/drug effects , Peptides, Cyclic/isolation & purification , Peptides, Cyclic/pharmacology , Protein Processing, Post-Translational , Thioamides/chemistry , Whole Genome Sequencing