ABSTRACT
BACKGROUND: Sequence alignment lies at the heart of genome sequence annotation. While the BLAST suite of alignment tools has long held an important role in alignment-based sequence database search, greater sensitivity is achieved through the use of profile hidden Markov models (pHMMs). Here, we describe an FPGA hardware accelerator, called HAVAC, that targets a key bottleneck step (SSV) in the analysis pipeline of the popular pHMM alignment tool, HMMER. RESULTS: The HAVAC kernel calculates the SSV matrix at 1739 GCUPS on a ~$3000 Xilinx Alveo U50 FPGA accelerator card, ~227× faster than the optimized SSV implementation in nhmmer. Accounting for PCI-e data transfer and data processing, HAVAC is 65× faster than nhmmer's SSV with one thread and 35× faster than nhmmer with four threads, and uses ~31% of the energy of a traditional high-end Intel CPU. CONCLUSIONS: HAVAC demonstrates the potential offered by FPGA hardware accelerators to produce dramatic speed gains in sequence annotation and related bioinformatics applications. Because these computations are performed on a co-processor, the host CPU remains free to simultaneously compute other aspects of the analysis pipeline.
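As a rough illustration of what the throughput figure means: GCUPS counts billions of dynamic-programming cell updates per second, and filling the matrix for a model of length M against a target of length L takes M × L cell updates. The sketch below converts between matrix size, GCUPS, and kernel time. The model and target lengths are made-up illustrative values, not figures from the HAVAC benchmarks, and the baseline rate is only what follows from dividing the two rounded numbers quoted in the abstract.

```python
def seconds_to_fill(model_len: int, target_len: int, gcups: float) -> float:
    """Time to update every cell of a model_len x target_len matrix at a given GCUPS."""
    cells = model_len * target_len
    return cells / (gcups * 1e9)


def implied_rate(accelerated_gcups: float, speedup: float) -> float:
    """Baseline rate implied by an accelerated rate and a reported speedup factor."""
    return accelerated_gcups / speedup


if __name__ == "__main__":
    # Hypothetical workload (assumption): a 200-position model against
    # 3.1 billion target bases.
    model_len, target_len = 200, 3_100_000_000
    havac = 1739.0                          # GCUPS, kernel-only figure from the abstract
    baseline = implied_rate(havac, 227.0)   # approximate: both inputs are rounded
    t = seconds_to_fill(model_len, target_len, havac)
    print(f"kernel time at {havac:.0f} GCUPS: {t:.3f} s")
    print(f"implied baseline rate: {baseline:.2f} GCUPS")
```

This covers the kernel only; the end-to-end speedups reported in the abstract are lower because they include data transfer.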
Subject(s)
Markov Chains, Sequence Alignment, Sequence Alignment/methods, Computational Biology/methods, Sequence Homology, Algorithms, Software
ABSTRACT
Repetitive DNA, especially that due to transposable elements (TEs), makes up a large fraction of many genomes. Dfam is an open access database of families of repetitive DNA elements, in which each family is represented by a multiple sequence alignment and a profile hidden Markov model (HMM). The initial release of Dfam, featured in the 2013 NAR Database Issue, contained 1143 families of repetitive elements found in humans, and was used to produce more than 100 Mb of additional annotation of TE-derived regions in the human genome, with improved speed. Here, we describe recent advances, most notably expansion to 4150 total families including a comprehensive set of known repeat families from four new organisms (mouse, zebrafish, fly and nematode). We describe improvements to coverage, and to our methods for identifying and reducing false annotation. We also describe updates to the website interface. The Dfam website has moved to http://dfam.org. Seed alignments, profile HMMs, hit lists and other underlying data are available for download.
Subject(s)
DNA Transposable Elements, DNA/chemistry, Nucleic Acid Databases, Nucleic Acid Repetitive Sequences, Animals, DNA/classification, Genome, Humans, Internet, Markov Chains, Mice, Molecular Sequence Annotation, Sequence Alignment
ABSTRACT
The HMMER website, available at http://www.ebi.ac.uk/Tools/hmmer/, provides access to the protein homology search algorithms found in the HMMER software suite. Since the first release of the website in 2011, the search repertoire has been expanded to include the iterative search algorithm, jackhmmer. The continued growth of the target sequence databases means that traditional tabular representations of significant sequence hits can be overwhelming to the user. Consequently, additional ways of presenting homology search results have been developed, allowing them to be summarised according to taxonomic distribution or domain architecture. The taxonomy and domain architecture representations can be used in combination to filter the results according to the needs of a user. Searches can also be restricted prior to submission using a new taxonomic filter, which not only ensures that the results are specific to the requested taxonomic group, but also improves search performance. The repertoire of profile hidden Markov model libraries, which are used for annotation of query sequences with protein families and domains, has been expanded to include the libraries from CATH-Gene3D, PIRSF, Superfamily and TIGRFAMs. Finally, we discuss the relocation of the HMMER webserver to the European Bioinformatics Institute and the potential impact that this will have.
Subject(s)
Amino Acid Sequence Homology, Software, Algorithms, Protein Databases, Internet, Markov Chains, Protein Tertiary Structure, Sequence Alignment, Protein Sequence Analysis
ABSTRACT
We present a database of repetitive DNA elements, called Dfam (http://dfam.janelia.org). Many genomes contain a large fraction of repetitive DNA, much of which is made up of remnants of transposable elements (TEs). Accurate annotation of TEs enables research into their biology and can shed light on the evolutionary processes that shape genomes. Identification and masking of TEs can also greatly simplify many downstream genome annotation and sequence analysis tasks. The commonly used TE annotation tools RepeatMasker and Censor depend on sequence homology search tools such as cross_match and BLAST variants, as well as Repbase, a collection of known TE families each represented by a single consensus sequence. Dfam contains entries corresponding to all Repbase TE entries for which instances have been found in the human genome. Each Dfam entry is represented by a profile hidden Markov model, built from alignments generated using RepeatMasker and Repbase. When used in conjunction with the hidden Markov model search tool nhmmer, Dfam produces a 2.9% increase in coverage over consensus sequence search methods on a large human benchmark, while maintaining low false discovery rates, and coverage of the full human genome is 54.5%. The website provides a collection of tools and data views to support improved TE curation and annotation efforts. Dfam is also available for download in flat file format or in the form of MySQL table dumps.
Subject(s)
DNA Transposable Elements, Nucleic Acid Databases, Human Genome, Humans, Internet, Markov Chains, Statistical Models, Molecular Sequence Annotation
ABSTRACT
BACKGROUND: Logos are commonly used in molecular biology to provide a compact graphical representation of the conservation pattern of a set of sequences. They render the information contained in sequence alignments or profile hidden Markov models by drawing a stack of letters for each position, where the height of the stack corresponds to the conservation at that position, and the height of each letter within a stack depends on the frequency of that letter at that position. RESULTS: We present a new tool and web server, called Skylign, which provides a unified framework for creating logos for both sequence alignments and profile hidden Markov models. In addition to static image files, Skylign creates a novel interactive logo plot for inclusion in web pages. These interactive logos enable scrolling, zooming, and inspection of underlying values. Skylign can avoid sampling bias in sequence alignments by down-weighting redundant sequences and by combining observed counts with informed priors. It also simplifies the representation of gap parameters, and can optionally scale letter heights based on alternate calculations of the conservation of a position. CONCLUSION: Skylign is available as a website, a scriptable web service with a RESTful interface, and as a software package for download. Skylign's interactive logos are easily incorporated into a web page with just a few lines of HTML markup. Skylign may be found at http://skylign.org.
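The stack-and-letter-height idea described above can be sketched with the widely used information-content measure: the stack height is the maximum possible entropy minus the observed column entropy, and each letter gets a share of the stack proportional to its frequency. This is one common way to score conservation and is an assumption here; it is not claimed to be the calculation Skylign performs, and the function name `logo_column` is hypothetical.

```python
import math
from collections import Counter


def logo_column(column: str, alphabet: str = "ACGT"):
    """Return (stack_height_bits, {letter: letter_height_bits}) for one alignment column.

    Uses the information-content measure: log2(|alphabet|) minus Shannon entropy.
    Characters outside the alphabet (e.g. gaps) are ignored.
    """
    counts = Counter(ch for ch in column if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column contains no letters from the alphabet")
    freqs = {a: counts[a] / total for a in alphabet}
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    stack_height = math.log2(len(alphabet)) - entropy
    letter_heights = {a: p * stack_height for a, p in freqs.items()}
    return stack_height, letter_heights


if __name__ == "__main__":
    # Toy columns (made up): fully conserved, half-and-half, and uniform.
    for col in ("AAAA", "AACC", "ACGT"):
        height, letters = logo_column(col)
        print(col, round(height, 3), {a: round(h, 3) for a, h in letters.items()})
```

A fully conserved DNA column reaches the 2-bit maximum, while a column with all four letters in equal proportion has height zero. This sketch applies no sequence weighting or priors, both of which the abstract mentions as ways to reduce sampling bias.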
Subject(s)
Computational Biology/methods, Internet, Sequence Alignment/methods, Sequence Analysis/methods, Software, Amino Acid Sequence, Base Sequence, Computer Graphics, DNA/chemistry, Molecular Sequence Data
ABSTRACT
SUMMARY: Sequence database searches are an essential part of molecular biology, providing information about the function and evolutionary history of proteins, RNA molecules and DNA sequence elements. We present a tool for DNA/DNA sequence comparison that is built on the HMMER framework, which applies probabilistic inference methods based on hidden Markov models to the problem of homology search. This tool, called nhmmer, enables improved detection of remote DNA homologs, and has been used in combination with Dfam and RepeatMasker to improve annotation of transposable elements in the human genome. AVAILABILITY: nhmmer is a part of the new HMMER3.1 release. Source code and documentation can be downloaded from http://hmmer.org. HMMER3.1 is freely licensed under the GNU GPLv3 and should be portable to any POSIX-compliant operating system, including Linux and Mac OS/X.
Subject(s)
DNA/analysis, Nucleic Acid Sequence Homology, Software, Algorithms, DNA Transposable Elements, Human Genome, Humans, Markov Chains, Probability, Sequence Alignment
ABSTRACT
Background: Software for labeling biological sequences typically produces a theory-based statistic for each match (the E-value) that indicates the likelihood of seeing that match's score by chance. E-values accurately predict false match rate for comparisons of random (shuffled) sequences, and thus provide a reasoned mechanism for setting score thresholds that enable high sensitivity with low expected false match rate. This threshold-setting strategy is challenged by real biological sequences, which contain regions of local repetition and low sequence complexity that cause excess matches between non-homologous sequences. Knowing this, tool developers often develop benchmarks that use realistic-seeming decoy sequences to explore empirical tradeoffs between sensitivity and false match rate. A recent trend has been to employ reversed biological sequences as realistic decoys, because these preserve the distribution of letters and the existence of local repeats, while disrupting the original sequence's functional properties. However, we and others have observed that sequences appear to produce high scoring alignments to their reversals with surprising frequency, leading to overstatement of false match risk that may negatively affect downstream analysis. Results: We demonstrate that an alignment between a sequence S and its (possibly mutated) reversal tends to produce higher scores than alignment between truly unrelated sequences, even when S is a shuffled string with no notable repetitive or low-complexity regions. This phenomenon is due to the unintuitive fact that (even randomly shuffled) sequences contain palindromes that are on average longer than the longest common substrings (LCS) shared between permuted variants of the same sequence. 
Though the expected palindrome length is only slightly larger than the expected LCS, the distribution of alignment scores involving reversed sequences is strongly right-shifted, leading to greatly increased frequency of high-scoring alignments to reversed sequences. Impact: Overestimates of false match risk can motivate unnecessarily high score thresholds, leading to potentially reduced true match sensitivity. Also, when tool sensitivity is only reported up to the score of the first matched decoy sequence, a large decoy set consisting of reversed sequences can obscure sensitivity differences between tools. As a result of these observations, we advise that reversed biological sequences be used as decoys only when care is taken to remove positive matches in the original (un-reversed) sequences, or when overstatement of false labeling is not a concern. Though the primary focus of the analysis is on sequence annotation, we also demonstrate that the prevalence of internal palindromes may lead to an overstatement of the rate of false labels in protein identification with mass spectrometry.
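The palindrome-versus-LCS comparison can be explored with a small simulation. The sketch below is an illustrative setup, not the authors' experimental protocol: it draws random sequences, measures the longest palindromic substring of each, and compares it with the longest common substring shared between the sequence and a shuffled copy. The sequence length, alphabet, trial count, and function names are all assumptions chosen for illustration.

```python
import random


def longest_palindrome_len(s) -> int:
    """Length of the longest palindromic substring, by expanding around each center."""
    best, n = 0, len(s)
    for center in range(2 * n - 1):
        lo = center // 2
        hi = lo + center % 2          # odd centers sit between two letters
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            lo -= 1
            hi += 1
        best = max(best, hi - lo - 1)
    return best


def longest_common_substring_len(a, b) -> int:
    """Length of the longest contiguous run shared by a and b (row-wise DP)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0] * (len(b) + 1)
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def mean_lengths(length=300, trials=50, alphabet="ACDEFGHIKLMNPQRSTVWY", seed=0):
    """Mean longest-palindrome length and mean LCS length against a shuffled copy."""
    rng = random.Random(seed)
    pal_total = lcs_total = 0
    for _ in range(trials):
        s = [rng.choice(alphabet) for _ in range(length)]
        shuffled = s[:]
        rng.shuffle(shuffled)
        pal_total += longest_palindrome_len(s)
        lcs_total += longest_common_substring_len(s, shuffled)
    return pal_total / trials, lcs_total / trials


if __name__ == "__main__":
    pal, lcs = mean_lengths()
    print(f"mean longest palindrome: {pal:.2f}, mean LCS vs shuffle: {lcs:.2f}")
```

Exact string matching is only a proxy for alignment score, so this comparison indicates the direction of the effect rather than its magnitude in a real aligner.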
ABSTRACT
Background: Eukaryotic genes are often composed of multiple exons that are stitched together by splicing out the intervening introns. These exons may be conditionally joined in different combinations to produce a collection of related, but distinct, mRNA transcripts. For protein-coding genes, these products of alternative splicing lead to production of related protein variants (isoforms) of a gene. Complete labeling of the protein-coding content of a eukaryotic genome requires discovery of mRNA encoding all isoforms, but it is impractical to enumerate all possible combinations of tissue, developmental stage, and environmental context; as a result, many true exons go unlabeled in genome annotations. Results: One way to address the combinatoric challenge of finding all isoforms in a single organism A is to leverage sequencing efforts for other organisms - each time a new organism is sequenced, it may be under a new combination of conditions, so that a previously unobserved isoform may be sequenced. We present Diviner, a software tool that identifies previously undocumented exons in organisms by comparing isoforms across species. We demonstrate Diviner's utility by locating hundreds of novel exons in the genomes of human, mouse, and rat, as well as in the ferret genome. Further, we provide analyses supporting the notion that most of the new exons reported by Diviner are likely to be part of a true (but unobserved) isoform of the containing species.
ABSTRACT
In the age of long read sequencing, genomics researchers now have access to accurate repetitive DNA sequence (including satellites) that, due to the limitations of short read sequencing, could previously be observed only as unmappable fragments. Tools that annotate repetitive sequence are now more important than ever, so that we can better understand newly uncovered repetitive sequences, and also so that we can mitigate errors in bioinformatic software caused by those repetitive sequences. To that end, we introduce the 1.0 release of our tool for identifying and annotating locally-repetitive sequence, ULTRA (ULTRA Locates Tandemly Repetitive Areas). ULTRA is fast enough to use as part of an efficient annotation pipeline, produces state-of-the-art reliable coverage of repetitive regions containing many mutations, and provides interpretable statistics and labels for repetitive regions. It is released under an open license and is available for download at https://github.com/TravisWheelerLab/ULTRA.
ABSTRACT
Summary: We present BATH, a tool for highly sensitive annotation of protein-coding DNA based on direct alignment of that DNA to a database of protein sequences or profile hidden Markov models (pHMMs). BATH is built on top of the HMMER3 code base, and simplifies the annotation workflow for pHMM-based translated sequence annotation by providing a straightforward input interface and easy-to-interpret output. BATH also introduces novel frameshift-aware algorithms to detect frameshift-inducing nucleotide insertions and deletions (indels). BATH matches the accuracy of HMMER3 for annotation of sequences containing no errors, and produces superior accuracy to all tested tools for annotation of sequences containing nucleotide indels. These results suggest that BATH should be used when high annotation sensitivity is required, particularly when frameshift errors are expected to interrupt protein-coding regions, as is true with long-read sequencing data and in the context of pseudogenes. Availability and implementation: The software is available at https://github.com/TravisWheelerLab/BATH.
ABSTRACT
We present BATH, a tool for highly sensitive annotation of protein-coding DNA based on direct alignment of that DNA to a database of protein sequences or profile hidden Markov models (pHMMs). BATH is built on top of the HMMER3 code base, and simplifies the annotation workflow for pHMM-based annotation by providing a straightforward input interface and easy-to-interpret output. BATH also introduces novel frameshift-aware algorithms to detect frameshift-inducing nucleotide insertions and deletions (indels). BATH matches the accuracy of HMMER3 for annotation of sequences containing no errors, and produces superior accuracy to all tested tools for annotation of sequences containing nucleotide indels. These results suggest that BATH should be used when high annotation sensitivity is required, particularly when frameshift errors are expected to interrupt protein-coding regions, as is true with long read sequencing data and in the context of pseudogenes.
ABSTRACT
"Fast is fine, but accuracy is final." -- Wyatt Earp. Background: The extreme diversity of newly sequenced organisms and considerable scale of modern sequence databases lead to a tension between competing needs for sensitivity and speed in sequence annotation, with multiple tools displacing the venerable BLAST software suite on one axis or another. Alignment based on profile hidden Markov models (pHMMs) has demonstrated state-of-the-art sensitivity, while recent algorithmic advances have resulted in hyper-fast annotation tools with sensitivity close to that of BLAST. Results: Here, we introduce a new tool that bridges the gap between advances in these two directions, reaching speeds comparable to fast annotation methods such as MMseqs2 while retaining most of the sensitivity offered by pHMMs. The tool, called nail, implements a heuristic approximation of the pHMM Forward/Backward (FB) algorithm by identifying a sparse subset of the cells in the FB dynamic programming matrix that contains most of the probability mass. The method produces an accurate approximation of pHMM scores and E-values with high speed and small memory requirements. On a protein benchmark, nail recovers the majority of recall difference between MMseqs2 and HMMER, with run time ~26x faster than HMMER3 (only ~2.4x slower than MMseqs2's sensitive variant). nail is released under the open BSD-3-clause license and is available for download at https://github.com/TravisWheelerLab/nail.
ABSTRACT
Transposable elements are ubiquitous mobile DNA sequences generating insertion polymorphisms, contributing to genomic diversity. We present GraffiTE, a flexible pipeline to analyze polymorphic mobile element insertions. By integrating state-of-the-art structural variant detection algorithms and graph genomes, GraffiTE identifies polymorphic mobile elements from genomic assemblies or long-read sequencing data, and genotypes these variants using short or long read sets. Benchmarking on simulated and real datasets reports high precision and recall rates. GraffiTE is designed to allow non-expert users to perform comprehensive analyses, including in models with limited transposable element knowledge, and is compatible with various sequencing technologies. Here, we demonstrate the versatility of GraffiTE by analyzing human, Drosophila melanogaster, maize, and Cannabis sativa pangenome data. These analyses reveal the landscapes of polymorphic mobile elements and their frequency variations across individuals, strains, and cultivars.
Subject(s)
DNA Transposable Elements, Drosophila melanogaster, Genetic Polymorphism, Zea mays, DNA Transposable Elements/genetics, Humans, Drosophila melanogaster/genetics, Zea mays/genetics, Animals, Algorithms, Plant Genome/genetics, Genomics/methods, Software, DNA Sequence Analysis/methods
ABSTRACT
Computational approaches for small-molecule drug discovery now regularly scale to the consideration of libraries containing billions of candidate small molecules. One promising approach to increase the speed of evaluating billion-molecule libraries is to develop succinct representations of each molecule that enable the rapid identification of molecules with similar properties. Molecular fingerprints are thought to provide a mechanism for producing such representations. Here, we explore the utility of commonly used fingerprints in the context of predicting similar molecular activity. We show that fingerprint similarity provides little discriminative power between active and inactive molecules for a target protein based on a known active: while fingerprints may sometimes provide some enrichment for active molecules in a drug screen, a screened data set will still be dominated by inactive molecules. We also demonstrate that high-similarity actives appear to share a scaffold with the query active, meaning that they could more easily be identified by structural enumeration. Furthermore, even when limited to only active molecules, fingerprint similarity values do not correlate with compound potency. In sum, these results highlight the need for a new wave of molecular representations that will improve the capacity to detect biologically active molecules based on their similarity to other such molecules.
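For readers unfamiliar with how fingerprint similarity is typically scored, the sketch below computes the Tanimoto (Jaccard) coefficient between two fingerprints represented as sets of "on" bit positions, then ranks a small library against a query. The abstract does not name the similarity measure the authors used, so Tanimoto is an assumption here; the bit positions and molecule names are invented for illustration and do not come from any real fingerprinting scheme.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / union


def rank_by_similarity(query: set, library: dict) -> list:
    """Return (name, similarity) pairs sorted from most to least similar to the query."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical fingerprints: sets of on-bit indices (made-up values).
    query = {3, 17, 42, 101, 256}
    library = {
        "mol_a": {3, 17, 42, 101, 300},
        "mol_b": {5, 17, 99},
        "mol_c": {800, 801},
    }
    for name, score in rank_by_similarity(query, library):
        print(f"{name}: {score:.2f}")
```

A ranking like this is what a similarity-based screen would use to prioritize candidates. The abstract's point is that a high rank under such a score is a weak predictor of activity.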
ABSTRACT
Bacteriophages are viruses that infect bacteria. Many bacteriophages integrate their genomes into the bacterial chromosome and become prophages. Prophages may substantially burden or benefit host bacteria fitness, acting in some cases as parasites and in others as mutualists. Some prophages have been demonstrated to increase host virulence. The increasing ease of bacterial genome sequencing provides an opportunity to deeply explore prophage prevalence and insertion sites. Here we present VIBES (Viral Integrations in Bacterial genomES), a workflow intended to automate prophage annotation in complete bacterial genome sequences. VIBES provides additional context to prophage annotations by annotating bacterial genes and viral proteins in user-provided bacterial and viral genomes. The VIBES pipeline is implemented as a Nextflow-driven workflow, providing a simple, unified interface for execution on local, cluster and cloud computing environments. For each step of the pipeline, a container including all necessary software dependencies is provided. VIBES produces results in simple tab-separated format and generates intuitive and interactive visualizations for data exploration. Despite VIBES's primary emphasis on prophage annotation, its generic alignment-based design allows it to be deployed as a general-purpose sequence similarity search manager. We demonstrate the utility of the VIBES prophage annotation workflow by searching for 178 Pf phage genomes across 1072 Pseudomonas spp. genomes.
ABSTRACT
The reconstruction of complete microbial metabolic pathways using 'omics data from environmental samples remains challenging. Computational pipelines for pathway reconstruction that utilize machine learning methods to predict the presence or absence of KEGG modules in incomplete genomes are lacking. Here, we present MetaPathPredict, a software tool that incorporates machine learning models to predict the presence of complete KEGG modules within bacterial genomic datasets. Using gene annotation data and information from the KEGG module database, MetaPathPredict employs deep learning models to predict the presence of KEGG modules in a genome. MetaPathPredict can be used as a command line tool or as a Python module, and both options are designed to be run locally or on a compute cluster. Benchmarks show that MetaPathPredict makes robust predictions of KEGG module presence within highly incomplete genomes.
Subject(s)
Bacterial Genome, Metabolic Networks and Pathways, Software, Metabolic Networks and Pathways/genetics, Computational Biology/methods, Machine Learning, Bacteria/genetics, Bacteria/metabolism, Bacteria/classification
ABSTRACT
Background: Genetic variation in APOE is associated with altered lipid metabolism, as well as cardiovascular and neurodegenerative disease risk. However, prior studies are largely limited to European ancestry populations and differential risk by sex and ancestry has not been widely evaluated. We utilized a phenome-wide association study (PheWAS) approach to explore APOE-associated phenotypes in the All of Us Research Program. Methods: We determined APOE alleles for 181,880 All of Us participants with whole genome sequencing and electronic health record (EHR) data, representing seven gnomAD ancestry groups. We tested association of APOE variants, ordered based on Alzheimer's disease risk hierarchy (ε2/ε2<ε2/ε3<ε3/ε3<ε2/ε4<ε3/ε4<ε4/ε4), with 2,318 EHR-derived phenotypes. Bonferroni-adjusted analyses were performed overall, by ancestry, by sex, and with adjustment for social determinants of health (SDOH). Findings: In the overall cohort, PheWAS identified 17 significant associations, including an increased odds of hyperlipidemia (OR 1.15 [1.14-1.16] per APOE genotype group; P=1.8×10^-129), dementia, and Alzheimer's disease (OR 1.55 [1.40-1.70]; P=5×10^-19), and a reduced odds of fatty liver disease (OR 0.93 [0.90-0.95]; P=1.6×10^-9) and chronic liver disease. ORs were similar after SDOH adjustment and by sex, except for an increased number of cardiovascular associations in males, and decreased odds of noninflammatory disorders of vulva and perineum in females (OR 0.89 [0.84-0.94]; P=1.1×10^-5). Significant heterogeneity was observed for hyperlipidemia and mild cognitive impairment across ancestry. Unique associations by ancestry included transient retinal arterial occlusion in the European ancestry group, and first-degree atrioventricular block in the American Admixed/Latino ancestry group. Interpretation: We replicate extensive phenotypic associations with APOE alleles in a large, diverse cohort, despite limitations in accuracy for EHR-derived phenotypes. 
We provide a comprehensive catalog of APOE-associated phenotypes and present evidence of unique phenotypic associations by sex and ancestry, as well as heterogeneity in effect size across ancestry.
ABSTRACT
Immunotherapy has changed the treatment paradigm for many types of cancer, but immune checkpoint inhibitors (ICIs) have not shown benefit in prostate cancer (PCa). Chronic inflammation contributes to the immunosuppressive prostate tumor microenvironment (TME) and is associated with poor response to ICIs. The primary source of inflammatory cytokine production is the inflammasome. Here, we identify PIM kinases as important regulators of inflammasome activation in tumor-associated macrophages (TAMs). Analysis of clinical data from a cohort of treatment-naïve, hormone-responsive PCa patients revealed that tumors from patients with high PIM1/2/3 display an immunosuppressive TME characterized by high inflammation (IL-1β and TNFα) and a high density of repressive immune cells, most notably TAMs. Strikingly, macrophage-specific knockout of PIM reduced tumor growth in syngeneic models of prostate cancer. Transcriptional analyses indicate that eliminating PIM from macrophages enhanced the adaptive immune response and increased cytotoxic immune cells. Combined treatment with PIM inhibitors and ICIs synergistically reduced tumor growth. Immune profiling revealed that PIM inhibitors sensitized PCa tumors to ICIs by increasing tumor suppressive TAMs and increasing the activation of cytotoxic T cells. Collectively, our data implicate macrophage PIM as a driver of inflammation that limits the potency of ICIs and provide preclinical evidence that PIM inhibitors are an effective strategy to improve the efficacy of immunotherapy in prostate cancer.
ABSTRACT
The organization of homologous protein sequences into multiple sequence alignments (MSAs) is a cornerstone of modern analysis of proteins. Recent focus on the importance of alternatively-spliced isoforms in disease and cell biology has highlighted the need for MSA software that can appropriately account for isoforms and the exon-length insertions or deletions that isoforms may have relative to each other. We previously developed Mirage, a software package for generating MSAs for isoforms spanning multiple species. Here, we present Mirage2, which retains the fundamental algorithms of the original Mirage implementation while providing substantially improved translated mapping and improving several aspects of usability. We demonstrate that Mirage2 is highly effective at mapping proteins to their encoding exons, and that these protein-genome mappings lead to extremely accurate intron-aware alignments. Additionally, Mirage2 implements a number of engineering improvements that simplify installation and use.
Subject(s)
Algorithms, Software, Sequence Alignment, Protein Isoforms/genetics, Chromosome Mapping
ABSTRACT
Recordings of animal sounds enable a wide range of observational inquiries into animal communication, behavior, and diversity. Automated labeling of sound events in such recordings can improve both throughput and reproducibility of analysis. Here, we describe our software package for labeling sound elements in recordings of animal sounds and demonstrate its utility on recordings of beetle courtships and whale songs. The software, DISCO, computes sensible confidence estimates and produces labels with high precision and accuracy. In addition to the core labeling software, it provides a simple tool for labeling training data, and a visual system for analysis of resulting labels. DISCO is open-source and easy to install, works with standard file formats, and presents a low barrier to entry.