1.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38706320

ABSTRACT

The advent of rapid whole-genome sequencing has created new opportunities for computational prediction of antimicrobial resistance (AMR) phenotypes from genomic data. Both rule-based and machine learning (ML) approaches have been explored for this task, but systematic benchmarking is still needed. Here, we evaluated four state-of-the-art ML methods (Kover, PhenotypeSeeker, Seq2Geno2Pheno and Aytan-Aktug), an ML baseline and the rule-based ResFinder by training and testing each of them across 78 species-antibiotic datasets, using a rigorous benchmarking workflow that integrates three evaluation approaches, each paired with three distinct sample splitting methods. Our analysis revealed considerable variation in the performance across techniques and datasets. Whereas ML methods generally excelled for closely related strains, ResFinder excelled at handling divergent genomes. Overall, Kover most frequently ranked top among the ML approaches, followed by PhenotypeSeeker and Seq2Geno2Pheno. AMR phenotypes for antibiotic classes such as macrolides and sulfonamides were predicted with the highest accuracies. The quality of predictions varied substantially across species-antibiotic combinations, particularly for beta-lactams; across species, resistance phenotyping of the beta-lactam compounds aztreonam, amoxicillin/clavulanic acid, cefoxitin, ceftazidime and piperacillin/tazobactam, alongside tetracyclines, demonstrated more variable performance than the other benchmarked antibiotics. By organism, Campylobacter jejuni and Enterococcus faecium phenotypes were more robustly predicted than those of Escherichia coli, Staphylococcus aureus, Salmonella enterica, Neisseria gonorrhoeae, Klebsiella pneumoniae, Pseudomonas aeruginosa, Acinetobacter baumannii, Streptococcus pneumoniae and Mycobacterium tuberculosis. In addition, our study provides software recommendations for each species-antibiotic combination. It furthermore highlights the need for optimization for robust clinical applications, particularly for strains that diverge substantially from those used for training.


Subject(s)
Anti-Bacterial Agents , Phenotype , Anti-Bacterial Agents/pharmacology , Machine Learning , Drug Resistance, Bacterial/genetics , Computational Biology/methods , Genome, Bacterial , Genome, Microbial , Humans , Bacteria/genetics , Bacteria/drug effects
2.
Bioinformatics ; 39(2)2023 02 03.
Article in English | MEDLINE | ID: mdl-36786404

ABSTRACT

MOTIVATION: Gene annotation is the problem of mapping proteins to their functions represented as Gene Ontology (GO) terms, typically inferred based on the primary sequences. Gene annotation is a multi-label multi-class classification problem, which has generated growing interest for its uses in the characterization of millions of proteins with unknown functions. However, there is no standard GO dataset used for benchmarking the newly developed machine learning models within the bioinformatics community. Thus, the significance of improvements for these models remains unclear. RESULTS: The Gene Benchmarking database is the first effort to provide an easy-to-use and configurable hub for the learning and evaluation of gene annotation models. It provides easy access to pre-specified datasets and takes the non-trivial steps of preprocessing and filtering all data according to custom presets using a web interface. The GO bench web application can also be used to evaluate and display any trained model on leaderboards for annotation tasks. AVAILABILITY AND IMPLEMENTATION: The GO Benchmarking dataset is freely available at www.gobench.org. Code is hosted at github.com/mofradlab, with repositories for website code, core utilities and examples of usage (Supplementary Section S.7). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Benchmarking , Software , Molecular Sequence Annotation , Gene Ontology , Machine Learning , Proteins/metabolism
3.
Emerg Microbes Infect ; 11(1): 1037-1048, 2022 Dec.
Article in English | MEDLINE | ID: mdl-35320064

ABSTRACT

The coronavirus SARS-CoV-2 is the causative agent for the disease COVID-19. To capture the IgA, IgG, and IgM antibody response of patients infected with SARS-CoV-2 at individual epitope resolution, we constructed planar microarrays of 648 overlapping peptides that cover the four major structural proteins S(pike), N(ucleocapsid), M(embrane), and E(nvelope). The arrays were incubated with sera of 67 SARS-CoV-2 positive and 22 negative control samples. Specific responses to SARS-CoV-2 were detectable, and nine peptides were associated with a more severe course of the disease. A random forest model disclosed that antibody binding to 21 peptides, mostly localized in the S protein, was associated with higher neutralization values in cellular anti-SARS-CoV-2 assays. For antibodies addressing the N-terminus of M, or peptides close to the fusion region of S, protective effects were proven by antibody depletion and neutralization assays. The study pinpoints unusual viral binding epitopes that might be suited as vaccine candidates.


Subject(s)
COVID-19 , SARS-CoV-2 , Antibodies, Neutralizing , Antibodies, Viral , Antibody Formation , Epitopes , Humans , Machine Learning , Peptides , Spike Glycoprotein, Coronavirus
4.
IEEE/ACM Trans Comput Biol Bioinform ; 19(6): 3744-3753, 2022.
Article in English | MEDLINE | ID: mdl-34460382

ABSTRACT

Pretrained representations have recently gained attention in various machine learning applications. Nonetheless, the high computational costs associated with training these models have motivated alternative approaches for representation learning. Herein we introduce TripletProt, a new approach for protein representation learning based on the Siamese neural networks. Representation learning of biological entities which capture essential features can alleviate many of the challenges associated with supervised learning in bioinformatics. The most important distinction of our proposed method is relying on the protein-protein interaction (PPI) network. The computational cost of the generated representations for any potential application is significantly lower than comparable methods since the length of the representations is significantly smaller than that in other approaches. TripletProt offers great potential for protein informatics tasks and can be widely applied to similar tasks. We evaluate TripletProt comprehensively in protein functional annotation tasks including sub-cellular localization (14 categories) and gene ontology prediction (more than 2000 classes), which are both challenging multi-class, multi-label classification machine learning problems. We compare the performance of TripletProt with the state-of-the-art approaches including a recurrent language model-based approach (i.e., UniRep), as well as a protein-protein interaction (PPI) network and sequence-based method (i.e., DeepGO). Our TripletProt showed an overall improvement of F1 score in the above-mentioned comprehensive functional annotation tasks, solely relying on the PPI network. Availability: The source code and datasets are available at https://github.com/EsmaeilNourani/TripletProt.


Subject(s)
Neural Networks, Computer , Proteins , Proteins/metabolism , Software , Protein Interaction Maps , Language
5.
Bioinformatics ; 37(23): 4517-4525, 2021 12 07.
Article in English | MEDLINE | ID: mdl-34180989

ABSTRACT

MOTIVATION: B-cell epitopes (BCEs) play a pivotal role in the development of peptide vaccines, immuno-diagnostic reagents and antibody production, and thus in infectious disease prevention and diagnostics in general. Experimental methods used to determine BCEs are costly and time-consuming. Therefore, it is essential to develop computational methods for the rapid identification of BCEs. Although several computational methods have been developed for this task, generalizability is still a major concern, where cross-testing of the classifiers trained and tested on different datasets has revealed accuracies of 51-53%. RESULTS: We describe a new method called EpitopeVec, which uses a combination of residue properties, modified antigenicity scales, and protein language model-based representations (protein vectors) as features of peptides for linear BCE predictions. Extensive benchmarking of EpitopeVec and other state-of-the-art methods for linear BCE prediction on several large and small datasets, as well as cross-testing, demonstrated an improvement in the performance of EpitopeVec over other methods in terms of accuracy and area under the curve. As the predictive performance depended on the species origin of the respective antigens (viral, bacterial and eukaryotic), we also trained our method on a large viral dataset to create a dedicated linear viral BCE predictor with improved cross-testing performance. AVAILABILITY AND IMPLEMENTATION: The software is available at https://github.com/hzi-bifo/epitope-prediction. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Antigens , Peptides , Amino Acid Sequence , Peptides/chemistry , Antigens/chemistry , Software , Epitopes, B-Lymphocyte/chemistry
6.
EMBO Mol Med ; 12(3): e10264, 2020 03 06.
Article in English | MEDLINE | ID: mdl-32048461

ABSTRACT

Limited therapy options due to antibiotic resistance underscore the need for optimization of current diagnostics. In some bacterial species, antimicrobial resistance can be unambiguously predicted based on their genome sequence. In this study, we sequenced the genomes and transcriptomes of 414 drug-resistant clinical Pseudomonas aeruginosa isolates. By training machine learning classifiers on information about the presence or absence of genes, their sequence variation, and expression profiles, we generated predictive models and identified biomarkers of resistance to four commonly administered antimicrobial drugs. Using these data types alone or in combination resulted in high (0.8-0.9) or very high (> 0.9) sensitivity and predictive values. For all drugs except for ciprofloxacin, gene expression information improved diagnostic performance. Our results pave the way for the development of a molecular resistance profiling tool that reliably predicts antimicrobial susceptibility based on genomic and transcriptomic markers. The implementation of a molecular susceptibility test system in routine microbiology diagnostics holds promise to provide earlier and more detailed information on antibiotic resistance profiles of bacterial pathogens and thus could change how physicians treat bacterial infections.


Subject(s)
Drug Resistance, Bacterial , Machine Learning , Pseudomonas aeruginosa , Anti-Bacterial Agents/pharmacology , Genome, Bacterial , Microbial Sensitivity Tests , Pathology, Molecular , Pseudomonas aeruginosa/drug effects , Transcriptome
7.
Genome Biol ; 20(1): 244, 2019 11 19.
Article in English | MEDLINE | ID: mdl-31744546

ABSTRACT

BACKGROUND: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. RESULTS: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. CONCLUSION: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.


Subject(s)
Molecular Sequence Annotation/trends , Animals , Biofilms , Candida albicans/genetics , Drosophila melanogaster/genetics , Genome, Bacterial , Genome, Fungal , Humans , Locomotion , Memory, Long-Term , Molecular Sequence Annotation/methods , Pseudomonas aeruginosa/genetics
8.
Sci Rep ; 9(1): 3577, 2019 03 05.
Article in English | MEDLINE | ID: mdl-30837494

ABSTRACT

In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework allowing for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or even a domain-specific dataset and then applied to a set of unseen sequences. This representation can be widely used as the input to any downstream machine learning tasks in protein bioinformatics. In particular, here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate the method for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct motif discovery problems which are experimentally verified, (2) classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) in sequence pattern searching for nuclear localization signal. The DiMotif, in general, obtained high recall scores, while having a comparable F1 score with other methods in the discovery of experimentally verified motifs. Having high recall suggests that the DiMotif can be used for short-list creation for further experimental investigations on motifs. In the classification-based evaluation, the extracted motifs could reliably detect the integrins, integrin-binding, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores. (ii) ProtVecX: we extend k-mer based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences. We show that the new method of embedding can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw amino acids k-mer features.
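The byte-pair-encoding idea that PPE builds on can be illustrated with a minimal sketch. It covers only the deterministic part (learning the most frequent adjacent-symbol merges on a training set, then applying them to an unseen sequence) and omits the sampling framework described in the abstract. The function names and the toy sequences are hypothetical and are not taken from the authors' software.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn an ordered list of BPE-style merges from training sequences."""
    # Start from single amino-acid symbols.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties are broken lexicographically so the
        # result is deterministic (an assumption of this sketch).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(symbols, best) for symbols in corpus]
    return merges


def segment(sequence, merges):
    """Segment an unseen sequence by replaying the learned merges in order."""
    symbols = list(sequence)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Toy example: two short "training" sequences, two merge steps.
merges = learn_merges(["MKVLMKVL", "MKVA"], num_merges=2)
segments = segment("MKVLA", merges)
```

The resulting variable-length segments always concatenate back to the input sequence, so they can stand in for fixed-length k-mers as tokens in a downstream model.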


Subject(s)
Computational Biology/methods , Proteins/chemistry , Sequence Analysis , Amino Acid Motifs , Amino Acid Sequence
10.
Bioinformatics ; 35(14): 2498-2500, 2019 07 15.
Article in English | MEDLINE | ID: mdl-30500871

ABSTRACT

SUMMARY: Identifying distinctive taxa for microbiome-related diseases is considered key to the establishment of diagnosis and therapy options in precision medicine and imposes high demands on the accuracy of microbiome analysis techniques. We propose an alignment- and reference-free subsequence based 16S rRNA data analysis, as a new paradigm for microbiome phenotype and biomarker detection. Our method, called DiTaxa, substitutes standard operational taxonomic unit (OTU)-clustering by segmenting 16S rRNA reads into the most frequent variable-length subsequences. We compared the performance of DiTaxa to the state-of-the-art methods in phenotype and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa performed competitively to the k-mer based state-of-the-art approach in phenotype prediction while outperforming the OTU-based state-of-the-art approach in finding biomarkers in both resolution and coverage evaluated over known links from literature and synthetic benchmark datasets. AVAILABILITY AND IMPLEMENTATION: DiTaxa is available under the Apache 2 license at http://llp.berkeley.edu/ditaxa. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , RNA, Ribosomal, 16S/genetics , Biomarkers , Humans , Nucleotides , Phenotype , Sequence Analysis, DNA , Software
11.
Bioinformatics ; 34(13): i32-i42, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29950008

ABSTRACT

Motivation: Microbial communities play important roles in the function and maintenance of various biosystems, ranging from the human body to the environment. A major challenge in microbiome research is the classification of microbial communities of different environments or host phenotypes. The most common and cost-effective approach for such studies to date is 16S rRNA gene sequencing. Recent falls in sequencing costs have increased the demand for simple, efficient and accurate methods for rapid detection or diagnosis with proved applications in medicine, agriculture and forensic science. We describe a reference- and alignment-free approach for predicting environments and host phenotypes from 16S rRNA gene sequencing based on k-mer representations that benefits from a bootstrapping framework for investigating the sufficiency of shallow sub-samples. Deep learning methods as well as classical approaches were explored for predicting environments and host phenotypes. Results: A k-mer distribution of shallow sub-samples outperformed Operational Taxonomic Unit (OTU) features in the tasks of body-site identification and Crohn's disease prediction. Aside from being more accurate, using k-mer features in shallow sub-samples (i) allows skipping computationally costly sequence alignments required in OTU-picking and (ii) provides a proof of concept for the sufficiency of shallow and short-length 16S rRNA sequencing for phenotype prediction. In addition, k-mer features predicted representative 16S rRNA gene sequences of 18 ecological environments, and 5 organismal environments with high macro-F1 scores of 0.88 and 0.87. For large datasets, deep learning outperformed classical methods such as Random Forest and Support Vector Machine. Availability and implementation: The software and datasets are available at https://llp.berkeley.edu/micropheno. Supplementary information: Supplementary data are available at Bioinformatics online.
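A k-mer representation of a shallow sub-sample, as described in this abstract, might be computed along the following lines. This is a rough sketch under stated assumptions: the function names, the default k, the fixed random seed and the handling of ambiguous bases are illustrative choices and do not reproduce the published software.

```python
import random
from collections import Counter
from itertools import product


def kmer_profile(reads, k=3, alphabet="ACGT"):
    """Return the normalized k-mer frequency vector of a set of reads."""
    # Fixed vocabulary so every sample maps to a vector of the same length.
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    allowed = set(alphabet)
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers with ambiguous bases such as N (assumption).
            if set(kmer) <= allowed:
                counts[kmer] += 1
    total = sum(counts.values())
    return [counts[kmer] / total if total else 0.0 for kmer in vocab]


def shallow_profile(reads, sample_size, k=3, seed=0):
    """k-mer profile of a random sub-sample of reads (one resample)."""
    rng = random.Random(seed)
    subsample = rng.sample(reads, min(sample_size, len(reads)))
    return kmer_profile(subsample, k=k)


# Toy example with two short reads and k = 2 (16 possible k-mers).
profile = kmer_profile(["ACGT", "ACGA"], k=2)
```

Repeating `shallow_profile` with different seeds and sample sizes, then comparing the resulting vectors, gives one simple way to check whether a small number of reads already yields a stable representation; no alignment or reference database is involved.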


Subject(s)
Databases, Nucleic Acid , Microbiota/genetics , Phenotype , RNA, Ribosomal, 16S/genetics , Sequence Analysis, DNA/methods , Software , Genes, rRNA , Humans , Sequence Alignment/methods
12.
Biophys J ; 114(5): 1190-1203, 2018 03 13.
Article in English | MEDLINE | ID: mdl-29539404

ABSTRACT

The LINC complex is found in a wide variety of organisms and is formed by the transluminal interaction between outer- and inner-nuclear-membrane KASH and SUN proteins, respectively. Most extensively studied are SUN1 and SUN2 proteins, which are widely expressed in mammals. Although SUN1 and SUN2 play functionally redundant roles in several cellular processes, more recent studies have revealed diverse and distinct functions for SUN1. While several recent in vitro structural studies have revealed the molecular details of various fragments of SUN2, no such structural information is available for SUN1. Herein, we conduct a systematic analysis of the molecular relationships between SUN1 and SUN2, highlighting key similarities and differences that could lead to clues into their distinct functions. We use a wide range of computational tools, including multiple sequence alignments, homology modeling, molecular docking, and molecular dynamic simulations, to predict structural differences between SUN1 and SUN2, with the goal of understanding the molecular mechanisms underlying SUN1 oligomerization in the nuclear envelope. Our simulations suggest that the structural model of SUN1 is stable in a trimeric state and that SUN1 trimers can associate through their SUN domains to form lateral complexes. We also ask whether SUN1 could adopt an inactive monomeric conformation as seen in SUN2. Our results imply that the KASH binding domain of SUN1 is also inhibited in monomeric SUN1 but through weaker interactions than in monomeric SUN2.


Subject(s)
Membrane Proteins/chemistry , Membrane Proteins/metabolism , Microtubule-Associated Proteins/chemistry , Microtubule-Associated Proteins/metabolism , Nuclear Envelope/metabolism , Nuclear Proteins/chemistry , Nuclear Proteins/metabolism , Protein Multimerization , Amino Acid Sequence , Humans , Intracellular Signaling Peptides and Proteins/chemistry , Molecular Dynamics Simulation , Protein Domains , Protein Structure, Quaternary
13.
PLoS One ; 10(11): e0141287, 2015.
Article in English | MEDLINE | ID: mdl-26555596

ABSTRACT

We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data is available at Life Language Processing Website: http://llp.berkeley.edu and Harvard Dataverse: http://dx.doi.org/10.7910/DVN/JMFHTN.


Subject(s)
Computational Biology/methods , Genomics/methods , Proteomics/methods , Support Vector Machine , Amino Acid Sequence , Databases, Protein , Intrinsically Disordered Proteins/chemistry , Natural Language Processing , Nuclear Pore Complex Proteins/chemistry , Nuclear Pore Complex Proteins/classification , Protein Structure, Secondary , Proteins/classification