Results 1 - 20 of 47
1.
Nature ; 622(7983): 637-645, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37704730

ABSTRACT

Proteins are key to all cellular processes and their structure is important in understanding their function and evolution. Sequence-based predictions of protein structures have increased in accuracy [1], and over 214 million predicted structures are available in the AlphaFold database [2]. However, studying protein structures at this scale requires highly efficient methods. Here, we developed a structural-alignment-based clustering algorithm, Foldseek cluster, that can cluster hundreds of millions of structures. Using this method, we have clustered all of the structures in the AlphaFold database, identifying 2.30 million non-singleton structural clusters, of which 31% lack annotations, representing probable previously undescribed structures. Clusters without annotation tend to have few representatives, covering only 4% of all proteins in the AlphaFold database. Evolutionary analysis suggests that most clusters are ancient in origin but 4% seem to be species specific, representing lower-quality predictions or examples of de novo gene birth. We also show how structural comparisons can be used to predict domain families and their relationships, identifying examples of remote structural similarity. On the basis of these analyses, we identify several examples of human immune-related proteins with putative remote homology in prokaryotic species, illustrating the value of this resource for studying protein function and evolution across the tree of life.


Subject(s)
Algorithms , Cluster Analysis , Proteins , Structural Homology, Protein , Humans , Databases, Protein , Proteins/chemistry , Proteins/classification , Proteins/metabolism , Sequence Alignment , Molecular Sequence Annotation , Prokaryotic Cells/chemistry , Phylogeny , Species Specificity , Evolution, Molecular
2.
Trends Biochem Sci ; 48(4): 345-359, 2023 04.
Article in English | MEDLINE | ID: mdl-36504138

ABSTRACT

Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.


Subject(s)
Machine Learning , Proteins , Proteins/chemistry , Computational Biology/methods , Protein Conformation
3.
Nat Methods ; 21(6): 971-973, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38769467

ABSTRACT

Metagenomic taxonomic classifiers analyze either DNA or amino acid (AA) sequences. Metabuli (https://metabuli.steineggerlab.com), however, jointly analyzes both DNA and AA to leverage AA conservation for sensitive homology detection and DNA mutations for specific differentiation of closely related taxa. In the Critical Assessment of Metagenome Interpretation 2 plant-associated dataset, Metabuli covered 99% and 98% of classifications of state-of-the-art DNA- and AA-based classifiers, respectively.


Subject(s)
Amino Acids , Metagenome , Metagenomics , Metagenomics/methods , Amino Acids/genetics , DNA/genetics , Software , Plants/classification , Sequence Analysis, DNA/methods , Amino Acid Sequence
4.
Nature ; 596(7873): 583-589, 2021 08.
Article in English | MEDLINE | ID: mdl-34265844

ABSTRACT

Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort [1-4], the structures of around 100,000 unique proteins have been determined [5], but this represents a small fraction of the billions of known protein sequences [6,7]. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence (the structure prediction component of the 'protein folding problem' [8]) has been an important open research problem for more than 50 years [9]. Despite recent progress [10-14], existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14) [15], demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.


Subject(s)
Neural Networks, Computer , Protein Conformation , Protein Folding , Proteins/chemistry , Amino Acid Sequence , Computational Biology/methods , Computational Biology/standards , Databases, Protein , Deep Learning/standards , Models, Molecular , Reproducibility of Results , Sequence Alignment
5.
Nucleic Acids Res ; 52(D1): D426-D433, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37933852

ABSTRACT

The DescribePROT database of amino acid-level descriptors of protein structures and functions has been substantially expanded since its release in 2020. This expansion includes a substantial increase in the size, scope, and quality of the underlying data, the addition of experimental structural information, the inclusion of new data download options, and an upgraded graphical interface. DescribePROT currently covers 19 structural and functional descriptors for proteins in 273 reference proteomes generated by 11 accurate and complementary predictive tools. Users can search our resource in multiple ways, interact with the data using the graphical interface, and download data at various scales including individual proteins, entire proteomes, and the whole database. The annotations in DescribePROT are useful for a broad spectrum of studies, including investigations of protein structure and function, the development and validation of predictive tools, and efforts to understand the molecular underpinnings of diseases and to develop therapeutics. DescribePROT can be freely accessed at http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/.


Subject(s)
Amino Acids , Proteome , Proteome/chemistry , Databases, Factual
6.
Nucleic Acids Res ; 52(D1): D368-D375, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37933859

ABSTRACT

The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) has significantly impacted structural biology by amassing over 214 million predicted protein structures, expanding from the initial 300k structures released in 2021. Enabled by the groundbreaking AlphaFold2 artificial intelligence (AI) system, the predictions archived in AlphaFold DB have been integrated into primary data resources such as PDB, UniProt, Ensembl, InterPro and MobiDB. Our manuscript details subsequent enhancements in data archiving, covering successive releases encompassing model organisms, global health proteomes, Swiss-Prot integration, and a host of curated protein datasets. We detail the data access mechanisms of AlphaFold DB, from direct file access via FTP to advanced queries using Google Cloud Public Datasets and the programmatic access endpoints of the database. We also discuss the improvements and services added since its initial release, including enhancements to the Predicted Aligned Error viewer, customisation options for the 3D viewer, and improvements in the search engine of AlphaFold DB.


The AlphaFold Protein Structure Database (AlphaFold DB) is a massive digital library of predicted protein structures, with over 214 million entries, marking a 500-times expansion in size since its initial release in 2021. The structures are predicted using Google DeepMind's AlphaFold 2 artificial intelligence (AI) system. Our new report highlights the latest updates we have made to this database. We have added more data on specific organisms and proteins related to global health and expanded to cover almost the complete UniProt database, a primary data resource of protein sequences. We also made it easier for our users to access the data by directly downloading files or using advanced cloud-based tools. Finally, we have also improved how users view and search through these protein structures, making the user experience smoother and more informative. In short, AlphaFold DB has been growing rapidly and has become more user-friendly and robust to support the broader scientific community.


Subject(s)
Artificial Intelligence , Protein Structure, Secondary , Proteome , Amino Acid Sequence , Databases, Protein , Search Engine , Proteins/chemistry
7.
Proc Natl Acad Sci U S A ; 120(28): e2301007120, 2023 07 11.
Article in English | MEDLINE | ID: mdl-37399371

ABSTRACT

Wood-decaying fungi are the major decomposers of plant litter. Heavy sequencing efforts on genomes of wood-decaying fungi have recently been made due to the interest in their lignocellulolytic enzymes; however, most parts of their proteomes remain uncharted. We hypothesized that wood-decaying fungi would possess promiscuous enzymes for detoxifying antifungal phytochemicals remaining in the dead plant bodies, which can be useful biocatalysts. We designed a computational mass spectrometry-based untargeted metabolomics pipeline for the phenotyping of biotransformation and applied it to 264 fungal cultures supplemented with antifungal plant phenolics. The analysis identified the occurrence of diverse reactivities by the tested fungal species. Among those, we focused on O-xylosylation of multiple phenolics by one of the species tested, Lentinus brumalis. By integrating the metabolic phenotyping results with publicly available genome sequences and transcriptome analysis, a UDP-glycosyltransferase designated UGT66A1 was identified and validated as an enzyme catalyzing O-xylosylation with broad substrate specificity. We anticipate that our analytical workflow will accelerate the further characterization of fungal enzymes as promising biocatalysts.


Subject(s)
Glucosyltransferases , Lentinula , Metabolomics , Metabolomics/methods , Lentinula/enzymology , Glucosyltransferases/chemistry , Glucosyltransferases/isolation & purification , Glucosyltransferases/metabolism , Phytochemicals/metabolism , Xylose/metabolism , Genome, Fungal , Liquid Chromatography-Mass Spectrometry
8.
Nat Methods ; 19(6): 679-682, 2022 06.
Article in English | MEDLINE | ID: mdl-35637307

ABSTRACT

ColabFold offers accelerated prediction of protein structures and complexes by combining the fast homology search of MMseqs2 with AlphaFold2 or RoseTTAFold. ColabFold's 40-60-fold faster search and optimized model utilization enable prediction of close to 1,000 structures per day on a server with one graphics processing unit. Coupled with Google Colaboratory, ColabFold becomes a free and accessible platform for protein folding. ColabFold is open-source software available at https://github.com/sokrypton/ColabFold and its novel environmental databases are available at https://colabfold.mmseqs.com.


Subject(s)
Protein Folding , Software , Computers , Databases, Factual , Proteins
9.
Nucleic Acids Res ; 51(D1): D777-D784, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36271795

ABSTRACT

In phylogenomics, the evolutionary relationships of organisms are studied using their genomic information. A common approach to phylogenomics is to extract related genes from each organism, build a multiple sequence alignment and then reconstruct evolutionary relationships through a phylogenetic tree. Often a set of highly conserved genes occurring in single copy, called core genes, is used for this analysis, as they allow efficient automation within a taxonomic clade. Here we introduce the Universal Fungal Core Genes (UFCG) database and pipeline for genome-wide phylogenetic analysis of fungi. The UFCG database consists of 61 curated fungal marker genes, including a novel set of 41 computationally derived core genes and 20 canonical genes derived from literature, as well as marker gene sequences extracted from publicly available fungal genomes. Furthermore, we provide an easy-to-use, fully automated and open-source pipeline for marker gene extraction, training and phylogenetic tree reconstruction. The UFCG pipeline can identify marker genes from genomic, proteomic and transcriptomic data, while producing phylogenies consistent with those previously reported, and is publicly available together with the UFCG database at https://ufcg.steineggerlab.com.


Subject(s)
Databases, Genetic , Fungi , Fungi/classification , Fungi/genetics , Genes, Fungal , Genome, Fungal , Phylogeny , Proteomics
10.
Bioinformatics ; 39(8), 2023 08 01.
Article in English | MEDLINE | ID: mdl-37535681

ABSTRACT

MOTIVATION: Efficiently aligning sequences is a fundamental problem in bioinformatics. Many recent algorithms for computing alignments through Smith-Waterman-Gotoh dynamic programming (DP) exploit Single Instruction Multiple Data (SIMD) operations on modern CPUs for speed. However, these advances have largely ignored difficulties associated with efficiently handling complex scoring matrices or large gaps (insertions or deletions). RESULTS: We propose a new SIMD-accelerated algorithm called Block Aligner for aligning nucleotide and protein sequences against other sequences or position-specific scoring matrices. We introduce a new paradigm that uses blocks in the DP matrix that greedily shift, grow, and shrink. This approach allows regions of the DP matrix to be adaptively computed. Our algorithm is over 5-10 times faster than some previous methods while incurring an error rate of less than 3% on protein and long read datasets, despite large gaps and low sequence identities. AVAILABILITY AND IMPLEMENTATION: Our algorithm is implemented for global, local, and X-drop alignments. It is available as a Rust library (with C bindings) at https://github.com/Daniel-Liu-c0deb0t/block-aligner.
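
For readers unfamiliar with the Smith-Waterman-Gotoh recurrence mentioned in this abstract, the following is a minimal scalar sketch in Python. It is not Block Aligner's SIMD block-based method; it only shows the textbook local-alignment recurrence with affine gaps. The function name, the scoring parameters, and the convention that a gap of length k costs gap_open + (k - 1) * gap_extend are illustrative assumptions, not values taken from the paper.

```python
def gotoh_local_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score of a and b with affine gap penalties.

    Hypothetical reference implementation: a gap of length k costs
    gap_open + (k - 1) * gap_extend. Only two DP rows are kept in memory.
    """
    neg_inf = float("-inf")
    m = len(b)
    h_prev = [0] * (m + 1)        # H: best score of an alignment ending at (i-1, j)
    f_prev = [neg_inf] * (m + 1)  # F: best score ending in a vertical gap at (i-1, j)
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (m + 1)
        f_cur = [neg_inf] * (m + 1)
        e = neg_inf               # E: best score ending in a horizontal gap at (i, j)
        for j in range(1, m + 1):
            # Either open a new gap from H or extend an existing one.
            e = max(h_cur[j - 1] + gap_open, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: scores never drop below zero.
            h_cur[j] = max(0, h_prev[j - 1] + s, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


if __name__ == "__main__":
    # Two 4-residue matches joined by one gap of length 2.
    print(gotoh_local_score("AAAACCCC", "AAAAGGCCCC"))
```

This sketch fills every cell of the DP matrix, so its cost grows with the product of the two sequence lengths. The abstract's point is that adaptive block placement avoids computing most of those cells.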


Subject(s)
Algorithms , Proteins , Position-Specific Scoring Matrices , Sequence Alignment , Sequence Analysis , Software
11.
Bioinformatics ; 39(4), 2023 04 03.
Article in English | MEDLINE | ID: mdl-36961332

ABSTRACT

SUMMARY: Highly accurate protein structure predictors have generated hundreds of millions of protein structures; these pose a challenge in terms of storage and processing. Here, we present Foldcomp, a novel lossy structure compression algorithm and indexing system to address this challenge. By using a combination of internal and Cartesian coordinates and a bi-directional NeRF-based strategy, Foldcomp improves the compression ratio by a factor of three compared to the next best method. Its reconstruction error of 0.08 Å is comparable to the best lossy compressor. It is five times faster than the next fastest compressor and competes with the fastest decompressors. With its multi-threading implementation and a Python interface that allows for easy database downloads and efficient querying of protein structures by accession, Foldcomp is a powerful tool for managing and analysing large collections of protein structures. AVAILABILITY AND IMPLEMENTATION: Foldcomp is free open-source software (GPLv3) and available for Linux, macOS, and Windows at https://foldcomp.foldseek.com. Foldcomp provides the AlphaFold Swiss-Prot (2.9GB), TrEMBL (1.1TB), and ESMatlas HQ (114GB) databases ready for download.


Subject(s)
Data Compression , Software , Algorithms , Data Compression/methods , Proteins , Gene Library
12.
Nat Chem Biol ; 18(7): 713-723, 2022 07.
Article in English | MEDLINE | ID: mdl-35484435

ABSTRACT

Despite advances in resolving the structures of multi-pass membrane proteins, little is known about the native folding pathways of these complex structures. Using single-molecule magnetic tweezers, we here report a folding pathway of purified human glucose transporter 3 (GLUT3) reconstituted within synthetic lipid bilayers. The N-terminal major facilitator superfamily (MFS) fold strictly forms first, serving as a structural template for its C-terminal counterpart. We found polar residues comprising the conduit for glucose molecules present major folding challenges. The endoplasmic reticulum membrane protein complex facilitates insertion of these hydrophilic transmembrane helices, thrusting GLUT3's microstate sampling toward folded structures. Final assembly between the N- and C-terminal MFS folds depends on specific lipids that ease desolvation of the lipid shells surrounding the domain interfaces. Sequence analysis suggests that this asymmetric folding propensity across the N- and C-terminal MFS folds prevails for metazoan sugar porters, revealing evolutionary conflicts between foldability and functionality faced by many multi-pass membrane proteins.


Subject(s)
Glucose Transport Proteins, Facilitative , Lipid Bilayers , Animals , Glucose Transport Proteins, Facilitative/genetics , Glucose Transport Proteins, Facilitative/metabolism , Glucose Transporter Type 3/metabolism , Humans , Lipid Bilayers/chemistry , Membrane Proteins/metabolism , Protein Folding , Protein Structure, Secondary
13.
Bioinformatics ; 38(5): 1440-1442, 2022 02 07.
Article in English | MEDLINE | ID: mdl-34734986

ABSTRACT

SUMMARY: PhyloCSF++ is an efficient and parallelized C++ implementation of the popular PhyloCSF method to distinguish protein-coding and non-coding regions in a genome based on multiple sequence alignments (MSAs). It can score alignments or produce browser tracks for entire genomes in the wig file format. Additionally, PhyloCSF++ annotates coding sequences in GFF/GTF files using precomputed tracks or computes and scores MSAs on the fly with MMseqs2. AVAILABILITY AND IMPLEMENTATION: PhyloCSF++ is released under the AGPLv3 license. Binaries and source code are available at https://github.com/cpockrandt/PhyloCSFpp. The software can be installed through bioconda. A variety of tracks can be accessed through ftp://ftp.ccb.jhu.edu/pub/software/phylocsfpp/.


Subject(s)
Genome , Software , Sequence Alignment , Exons
14.
Nucleic Acids Res ; 49(D1): D298-D308, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33119734

ABSTRACT

We present DescribePROT, the database of predicted amino acid-level descriptors of structure and function of proteins. DescribePROT delivers a comprehensive collection of 13 complementary descriptors predicted using 10 popular and accurate algorithms for 83 complete proteomes that cover key model organisms. The current version includes 7.8 billion predictions for close to 600 million amino acids in 1.4 million proteins. The descriptors encompass sequence conservation, position specific scoring matrix, secondary structure, solvent accessibility, intrinsic disorder, disordered linkers, signal peptides, MoRFs and interactions with proteins, DNA and RNAs. Users can search DescribePROT by the amino acid sequence and the UniProt accession number and entry name. The pre-computed results are made available instantaneously. The predictions can be accessed via an interactive graphical interface that allows simultaneous analysis of multiple descriptors and can also be downloaded in structured formats at the protein, proteome and whole database scale. The putative annotations included in DescribePROT are useful for a broad range of studies, including investigations of protein function, applied projects focusing on therapeutics and diseases, and the development of predictors for other protein sequence descriptors. Future releases will expand the coverage of DescribePROT. DescribePROT can be accessed at http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/.


Subject(s)
Amino Acids/chemistry , Databases, Protein , Genome , Proteins/genetics , Proteome/genetics , Software , Amino Acid Sequence , Amino Acids/metabolism , Animals , Archaea/genetics , Archaea/metabolism , Bacteria/genetics , Bacteria/metabolism , Binding Sites , Conserved Sequence , Fungi/genetics , Fungi/metabolism , Humans , Internet , Plants/genetics , Plants/metabolism , Prokaryotic Cells/metabolism , Protein Binding , Protein Structure, Secondary , Proteins/chemistry , Proteins/classification , Proteins/metabolism , Proteome/chemistry , Proteome/metabolism , Sequence Analysis, Protein , Viruses/genetics , Viruses/metabolism
15.
Nucleic Acids Res ; 49(W1): W535-W540, 2021 07 02.
Article in English | MEDLINE | ID: mdl-33999203

ABSTRACT

Since 1992, PredictProtein (https://predictprotein.org) has been a one-stop online resource for protein sequence analysis, with its main site hosted at the Luxembourg Centre for Systems Biomedicine (LCSB) and queried monthly by over 3,000 users in 2020. PredictProtein was the first Internet server for protein predictions. It pioneered combining evolutionary information and machine learning. Given a protein sequence as input, the server outputs multiple sequence alignments, predictions of protein structure in 1D and 2D (secondary structure, solvent accessibility, transmembrane segments, disordered regions, protein flexibility, and disulfide bridges) and predictions of protein function (functional effects of sequence variation or point mutations, Gene Ontology (GO) terms, subcellular localization, and protein-, RNA-, and DNA binding). PredictProtein's infrastructure has moved to the LCSB, increasing throughput; the use of MMseqs2 sequence search reduced runtime five-fold (apparently without lowering performance of prediction methods); user interface elements improved usability; and new prediction methods were added. PredictProtein recently included predictions from deep learning embeddings (GO and secondary structure) and a method for the prediction of proteins and residues binding DNA, RNA, or other proteins. PredictProtein.org aspires to provide reliable predictions to computational and experimental biologists alike. All scripts and methods are freely available for offline execution in high-throughput settings.


Subject(s)
Protein Conformation , Software , Binding Sites , Coronavirus Nucleocapsid Proteins/chemistry , DNA-Binding Proteins/chemistry , Phosphoproteins/chemistry , Protein Structure, Secondary , Proteins/chemistry , Proteins/physiology , RNA-Binding Proteins/chemistry , Sequence Alignment , Sequence Analysis, Protein
16.
Nat Methods ; 16(7): 603-606, 2019 07.
Article in English | MEDLINE | ID: mdl-31235882

ABSTRACT

The open-source de novo protein-level assembler, Plass (https://plass.mmseqs.com), assembles six-frame-translated sequencing reads into protein sequences. It recovers 2-10 times more protein sequences from complex metagenomes and can assemble huge datasets. We assembled two redundancy-filtered reference protein catalogs, 2 billion sequences from 640 soil samples (soil reference protein catalog) and 292 million sequences from 775 marine eukaryotic metatranscriptomes (marine eukaryotic reference catalog), the largest free collections of protein sequences.
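
As an illustration of the "six-frame-translated" input this abstract refers to, here is a minimal Python sketch of six-frame translation. It is not Plass code. The function names are hypothetical, the standard genetic code is assumed, and codons containing ambiguous bases are simply mapped to "X".

```python
# Standard genetic code, laid out in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translate("ATGGCCTAA").items():
        print(label, protein)
```

Stop codons appear as "*" in the output. A real pipeline would typically split each frame at stop codons and discard very short fragments, which this sketch leaves out.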


Subject(s)
Metagenomics , Proteins/chemistry , Amino Acid Sequence , Codon , Open Reading Frames
17.
Proteins ; 89(12): 1711-1721, 2021 12.
Article in English | MEDLINE | ID: mdl-34599769

ABSTRACT

We describe the operation and improvement of AlphaFold, the system that was entered by the team AlphaFold2 to the "human" category in the 14th Critical Assessment of Protein Structure Prediction (CASP14). The AlphaFold system entered in CASP14 is entirely different to the one entered in CASP13. It used a novel end-to-end deep neural network trained to produce protein structures from amino acid sequence, multiple sequence alignments, and homologous proteins. In the assessors' ranking by summed z scores (>2.0), AlphaFold scored 244.0 compared to 90.8 by the next best group. The predictions made by AlphaFold had a median domain GDT_TS of 92.4; this is the first time that this level of average accuracy has been achieved during CASP, especially on the more difficult Free Modeling targets, and represents a significant improvement in the state of the art in protein structure prediction. We reported how AlphaFold was run as a human team during CASP14 and improved such that it now achieves an equivalent level of performance without intervention, opening the door to highly accurate large-scale structure prediction.


Subject(s)
Models, Molecular , Neural Networks, Computer , Protein Folding , Proteins , Software , Amino Acid Sequence , Computational Biology , Deep Learning , Protein Conformation , Proteins/chemistry , Proteins/metabolism , Sequence Analysis, Protein
18.
Mod Pathol ; 34(6): 1093-1103, 2021 06.
Article in English | MEDLINE | ID: mdl-33536572

ABSTRACT

There is an urgent and unprecedented need for sensitive and high-throughput molecular diagnostic tests to combat the SARS-CoV-2 pandemic. Here we present a generalized version of the RNA-mediated oligonucleotide Annealing Selection and Ligation with next generation DNA sequencing (RASL-seq) assay, called "capture RASL-seq" (cRASL-seq), which enables highly sensitive (down to ~1-100 pfu/ml or cfu/ml) and highly multiplexed (up to ~10,000 target sequences) detection of pathogens. Importantly, cRASL-seq analysis of COVID-19 patient nasopharyngeal (NP) swab specimens does not involve nucleic acid purification or reverse transcription, steps that have introduced supply bottlenecks into standard assay workflows. Our simplified protocol additionally enables the direct and efficient genotyping of selected, informative SARS-CoV-2 polymorphisms across the entire genome, which can be used for enhanced characterization of transmission chains at population scale and detection of viral clades with higher or lower virulence. Given its extremely low per-sample cost, simple and automatable protocol and analytics, probe panel modularity, and massive scalability, we propose that cRASL-seq testing is a powerful new technology with the potential to help mitigate the current pandemic and prevent similar public health crises.


Subject(s)
COVID-19 Testing/methods , COVID-19/diagnosis , COVID-19/virology , High-Throughput Nucleotide Sequencing/methods , SARS-CoV-2/genetics , Genotype , Humans , Oligonucleotide Probes , RNA, Viral/analysis
19.
Bioinformatics ; 35(16): 2856-2858, 2019 08 15.
Article in English | MEDLINE | ID: mdl-30615063

ABSTRACT

SUMMARY: The MMseqs2 desktop and web server app facilitates interactive sequence searches through custom protein sequence and profile databases on personal workstations. By eliminating MMseqs2's runtime overhead, we reduced response times to a few seconds at sensitivities close to BLAST. AVAILABILITY AND IMPLEMENTATION: The app is easy to install for non-experts. GPLv3-licensed code, pre-built desktop app packages for Windows, MacOS and Linux, Docker images for the web server application and a demo web server are available at https://search.mmseqs.com. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computers , Software , Amino Acid Sequence , Databases, Factual
20.
BMC Bioinformatics ; 20(1): 473, 2019 Sep 14.
Article in English | MEDLINE | ID: mdl-31521110

ABSTRACT

BACKGROUND: HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous proteins. RESULTS: We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. These accelerated the search methods HHsearch by a factor of 4 and HHblits by a factor of 2 over the previous version 2.0.16. HHblits3 is ∼10× faster than PSI-BLAST and ∼20× faster than HMMER3. Jobs to perform HHsearch and HHblits searches with many query profile HMMs can be parallelized over cores and over cluster servers using OpenMP and message passing interface (MPI). The free, open-source, GPLv3-licensed software is available at https://github.com/soedinglab/hh-suite. CONCLUSION: The added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.
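
The Viterbi algorithm named in this abstract can be sketched for a plain discrete HMM as follows. This is a generic log-space version in Python, not HH-suite's SIMD-vectorized pairwise profile-HMM alignment. The two-state toy model, its state names ("M", "I"), its symbols and all probabilities are made up for illustration.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (most probable state path, its log-probability) for a discrete HMM."""
    # best[s]: log-probability of the best path that ends in state s.
    best = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for s in states:
            # Pick the predecessor state that maximizes the path score into s.
            prev = max(states, key=lambda p: best[p] + log_trans[p][s])
            new_best[s] = best[prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


def to_log(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {x: math.log(p) for x, p in row.items()} for k, row in table.items()}


# Hypothetical toy model: "M" and "I" are arbitrary state labels.
STATES = ["M", "I"]
LOG_START = {"M": math.log(0.6), "I": math.log(0.4)}
LOG_TRANS = to_log({"M": {"M": 0.7, "I": 0.3}, "I": {"M": 0.4, "I": 0.6}})
LOG_EMIT = to_log({
    "M": {"a": 0.5, "b": 0.4, "c": 0.1},
    "I": {"a": 0.1, "b": 0.3, "c": 0.6},
})

if __name__ == "__main__":
    path, score = viterbi(["a", "b", "c"], STATES, LOG_START, LOG_TRANS, LOG_EMIT)
    print(path, math.exp(score))
```

Working in log space avoids numerical underflow on long sequences. Profile-HMM alignment as used in HH-suite applies the same max-and-traceback idea over a two-dimensional matrix of state pairs rather than a single chain.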


Subject(s)
Molecular Sequence Annotation/methods , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Software , Algorithms , Markov Chains