Results 1 - 20 of 61
1.
Nucleic Acids Res ; 51(19): 10162-10175, 2023 10 27.
Article in English | MEDLINE | ID: mdl-37739408

ABSTRACT

Determining the repertoire of a microbe's molecular functions is a central question in microbial biology. Modern techniques achieve this goal by comparing microbial genetic material against reference databases of functionally annotated genes/proteins or known taxonomic markers such as 16S rRNA. Here, we describe a novel approach to exploring bacterial functional repertoires without reference databases. Our Fusion scheme establishes functional relationships between bacteria and assigns organisms to Fusion-taxa that differ from otherwise defined taxonomic clades. Three key findings of our work stand out. First, bacterial functional comparisons outperform marker genes in assigning taxonomic clades. Fusion profiles are also better for this task than other functional annotation schemes. Second, Fusion-taxa are robust to addition of novel organisms and are, arguably, able to capture the environment-driven bacterial diversity. Finally, our alignment-free nucleic acid-based Siamese Neural Network model, created using Fusion functions, enables finding shared functionality of very distant, possibly structurally different, microbial homologs. Our work can thus help annotate functional repertoires of bacterial organisms and further guide our understanding of microbial communities.


Subject(s)
Bacteria , Bacteria/cytology , Bacteria/genetics , Databases, Factual , Microbiota , Phylogeny , RNA, Ribosomal, 16S/genetics , Bacterial Physiological Phenomena
2.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36688705

ABSTRACT

MOTIVATION: Advances in sequencing technologies have led to a surge in genomic data, although the functions of many gene products coded by these genes remain unknown. While in-depth, targeted experiments that determine the functions of these gene products are crucial and routinely performed, they fail to keep up with the inflow of novel genomic data. In an attempt to address this gap, high-throughput experiments are being conducted in which a large number of genes are investigated in a single study. The annotations generated as a result of these experiments are generally biased towards a small subset of less informative Gene Ontology (GO) terms. Identifying and removing biases from protein function annotation databases is important since biases impact our understanding of protein function by providing a poor picture of the annotation landscape. Additionally, as machine learning methods for predicting protein function are becoming increasingly prevalent, it is essential that they are trained on unbiased datasets. Therefore, it is not only crucial to be aware of biases, but also to judiciously remove them from annotation datasets. RESULTS: We introduce GOThresher, a Python tool that identifies and removes biases in function annotations from protein function annotation databases. AVAILABILITY AND IMPLEMENTATION: GOThresher is written in Python and released via PyPI https://pypi.org/project/gothresher/ and on the Bioconda Anaconda channel https://anaconda.org/bioconda/gothresher. The source code is hosted on GitHub https://github.com/FriedbergLab/GOThresher and distributed under the GPL 3.0 license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology , Genomics , Computational Biology/methods , Molecular Sequence Annotation , Software , Proteins/genetics , Proteins/metabolism , Databases, Protein
3.
Bioinformatics ; 38(Suppl 1): i19-i27, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758800

ABSTRACT

MOTIVATION: Wikipedia is one of the most important channels for the public communication of science and is frequently accessed as an educational resource in computational biology. Joint efforts between the International Society for Computational Biology (ISCB) and the Computational Biology taskforce of WikiProject Molecular Biology (a group of expert Wikipedia editors) have considerably improved computational biology representation on Wikipedia in recent years. However, there is still an urgent need for further improvement in quality, especially when compared to related scientific fields such as genetics and medicine. Facilitating involvement of members from ISCB Communities of Special Interest (COSIs) would improve a vital open education resource in computational biology, additionally allowing COSIs to provide a quality educational resource highly specific to their subfield. RESULTS: We generate a list of around 1500 English Wikipedia articles relating to computational biology and describe the development of a binary COSI-Article matrix, linking COSIs to relevant articles and thereby defining domain-specific open educational resources. Our analysis of the COSI-Article matrix data provides a quantitative assessment of computational biology representation on Wikipedia against other fields and at a COSI-specific level. Furthermore, we conducted similarity analysis and subsequent clustering of COSI-Article data to provide insight into potential relationships between COSIs. Finally, based on our analysis, we suggest courses of action to improve the quality of computational biology representation on Wikipedia.


Subject(s)
Computational Biology , Cluster Analysis
4.
Nucleic Acids Res ; 49(1): 67-78, 2021 01 11.
Article in English | MEDLINE | ID: mdl-33305328

ABSTRACT

Gene-editing experiments commonly elicit the error-prone non-homologous end joining for DNA double-strand break (DSB) repair. Microhomology-mediated end joining (MMEJ) can generate more predictable outcomes for functional genomic and somatic therapeutic applications. We compared three DSB repair prediction algorithms - MENTHU, inDelphi, and Lindel - in identifying MMEJ-repaired, homogeneous genotypes (PreMAs) in an independent dataset of 5,885 distinct Cas9-mediated mouse embryonic stem cell DSB repair events. MENTHU correctly identified 46% of all PreMAs available, a ∼2- and ∼60-fold sensitivity increase compared to inDelphi and Lindel, respectively. In contrast, only Lindel correctly predicted predominant single-base insertions. We report the new algorithm MENdel, a combination of MENTHU and Lindel, that achieves the most predictive coverage of homogeneous out-of-frame mutations in this large dataset. We then estimated the frequency of Cas9-targetable homogeneous frameshift-inducing DSBs in vertebrate coding regions for gene discovery using MENdel. 47 out of 54 genes (87%) contained at least one early frameshift-inducing DSB and 49 out of 54 (91%) did so when also considering Cas12a-mediated deletions. We suggest that the use of MENdel helps researchers use MMEJ at scale for reverse genetics screenings and with sufficient intra-gene density rates to be viable for nearly all loss-of-function based gene editing therapeutic applications.


Subject(s)
Algorithms , DNA Breaks, Double-Stranded , DNA End-Joining Repair , Frameshift Mutation , Gene Editing/methods , Genetic Therapy/methods , Genomics/methods , INDEL Mutation , Loss of Function Mutation , Reverse Genetics/methods , Animals , Bacterial Proteins/metabolism , Caspase 9/metabolism , Datasets as Topic , Embryonic Stem Cells/metabolism , Humans , Mice , ROC Curve , Streptococcus pyogenes/enzymology , Zebrafish/genetics
5.
PLoS Comput Biol ; 17(10): e1009463, 2021 10.
Article in English | MEDLINE | ID: mdl-34710081

ABSTRACT

Experimental data about gene functions curated from the primary literature have enormous value for research scientists in understanding biology. Using the Gene Ontology (GO), manual curation by experts has provided an important resource for studying gene function, especially within model organisms. Unprecedented expansion of the scientific literature and validation of the predicted proteins have increased both data value and the challenges of keeping pace. Capturing literature-based functional annotations is limited by the ability of biocurators to handle the massive and rapidly growing scientific literature. Within the community-oriented wiki framework for GO annotation called the Gene Ontology Normal Usage Tracking System (GONUTS), we describe an approach to expand biocuration through crowdsourcing with undergraduates. This multiplies the number of high-quality annotations in international databases, enriches our coverage of the literature on normal gene function, and pushes the field in new directions. From an intercollegiate competition judged by experienced biocurators, Community Assessment of Community Annotation with Ontologies (CACAO), we have contributed nearly 5,000 literature-based annotations. Many of those annotations are to organisms not currently well-represented within GO. Over a 10-year history, our community contributors have spurred changes to the ontology not traditionally covered by professional biocurators. The CACAO principle of relying on community members to participate in and shape the future of biocuration in GO is a powerful and scalable model used to promote the scientific enterprise. It also provides undergraduate students with a unique and enriching introduction to critical reading of primary literature and acquisition of marketable skills.


Subject(s)
Crowdsourcing/methods , Gene Ontology , Molecular Sequence Annotation/methods , Computational Biology , Databases, Genetic , Humans , Proteins/genetics , Proteins/physiology
6.
Bioinformatics ; 36(Suppl_2): i668-i674, 2020 12 30.
Article in English | MEDLINE | ID: mdl-33381825

ABSTRACT

MOTIVATION: The evolution of complexity is one of the most fascinating and challenging problems in modern biology, and tracing the evolution of complex traits is an open problem. In bacteria, operons and gene blocks provide a model of tractable evolutionary complexity at the genomic level. Gene blocks are structures of co-located genes with related functions, and operons are gene blocks whose genes are co-transcribed on a single mRNA molecule. The genes in operons and gene blocks typically work together in the same system or molecular complex. Previously, we proposed a method that explains the evolution of orthologous gene blocks (orthoblocks) as a combination of a small set of events that take place in vertical evolution from common ancestors. A heuristic method was proposed to solve this problem. However, no study was done to identify the complexity of the problem. RESULTS: Here, we establish that finding the homologous gene block problem is NP-hard and APX-hard. We have developed a greedy algorithm that runs in polynomial time and guarantees an O(ln n) approximation. In addition, we formalize our problem as an integer linear program problem and solve it using the PuLP package and the standard CPLEX algorithm. Our exploration of several candidate operons reveals that our new method provides more optimal results than the results from the heuristic approach, and is significantly faster. AVAILABILITY AND IMPLEMENTATION: The software and data accompanying this paper are available under the GPLv3 and CC0 license respectively on: https://github.com/nguyenngochuy91/Relevant-Operon.


Subject(s)
Genomics , Software , Algorithms , Bacteria , Computational Biology , Hardness
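
Illustrative note on record 6: the abstract mentions a polynomial-time greedy algorithm with an O(ln n) approximation guarantee but does not describe it. As a rough illustration of that class of algorithm only, the sketch below shows the classic greedy set-cover heuristic, which is a well-known example with a logarithmic approximation bound. It is not the authors' method; the function name and the toy data are assumptions made for this example.

```python
def greedy_set_cover(universe, subsets):
    """Pick subsets greedily until every element of `universe` is covered.

    `subsets` maps a (hypothetical) subset name to a set of elements.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Take the subset that covers the most still-uncovered elements.
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


# Toy data, invented for illustration.
toy_subsets = {"a": {1, 2, 3}, "b": {2, 4}, "c": {3, 4}, "d": {4, 5}}
cover = greedy_set_cover({1, 2, 3, 4, 5}, toy_subsets)
```

Each round costs one pass over the subsets, so the whole procedure stays polynomial in the input size.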
7.
Bioinformatics ; 35(12): 2009-2016, 2019 06 01.
Article in English | MEDLINE | ID: mdl-30418485

ABSTRACT

MOTIVATION: Antibiotic resistance constitutes a major public health crisis, and finding new sources of antimicrobial drugs is crucial to solving it. Bacteriocins, which are bacterially produced antimicrobial peptide products, are candidates for broadening the available choices of antimicrobials. However, the discovery of new bacteriocins by genomic mining is hampered by their sequences' low complexity and high variance, which frustrates sequence similarity-based searches. RESULTS: Here we use word embeddings of protein sequences to represent bacteriocins, and apply a word embedding method that accounts for amino acid order in protein sequences, to predict novel bacteriocins from protein sequences without using sequence similarity. Our method predicts, with a high probability, six yet unknown putative bacteriocins in Lactobacillus. Generalized, the representation of sequences with word embeddings preserving sequence order information can be applied to peptide and protein classification problems for which sequence similarity cannot be used. AVAILABILITY AND IMPLEMENTATION: Data and source code for this project are freely available at: https://github.com/nafizh/NeuBI. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Neural Networks, Computer , Anti-Infective Agents , Computational Biology , Peptides , Software
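
Illustrative note on record 7: word-embedding models need a sequence to be split into "words" before training. The abstract does not say how the sequences were tokenized, so the sketch below assumes overlapping k-mers with k = 3, a common choice for protein embeddings. The function name and the example sequence are hypothetical.

```python
def kmer_words(sequence, k=3):
    """Split a protein sequence into overlapping k-mers ("words")."""
    sequence = sequence.upper()
    # A sequence shorter than k yields no words at all.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Invented example sequence, not taken from the paper.
words = kmer_words("MKTAYI")
```

The resulting word lists could then be passed to any embedding trainer that accepts tokenized sentences.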
8.
Bioinformatics ; 35(17): 2998-3004, 2019 09 01.
Article in English | MEDLINE | ID: mdl-30689726

ABSTRACT

MOTIVATION: Complexity is a fundamental attribute of life. Complex systems are made of parts that together perform functions that a single component, or subsets of components, cannot. Examples of complex molecular systems include protein structures such as the F1Fo-ATPase, the ribosome, or the flagellar motor: each one of these structures requires most or all of its components to function properly. Given the ubiquity of complex systems in the biosphere, understanding the evolution of complexity is central to biology. At the molecular level, operons are classic examples of a complex system. An operon's genes are co-transcribed under the control of a single promoter to a polycistronic mRNA molecule, and the operon's gene products often form molecular complexes or metabolic pathways. With the large number of complete bacterial genomes available, we now have the opportunity to explore the evolution of these complex entities, by identifying possible intermediate states of operons. RESULTS: In this work, we developed a maximum parsimony algorithm to reconstruct ancestral operon states, and show a simple vertical evolution model of how operons may evolve from the individual component genes. We describe several ancestral states that are plausible functional intermediate forms leading to the full operon. We also offer Reconstruction of Ancestral Gene blocks Using Events or ROAGUE as a software tool for those interested in exploring gene block and operon evolution. AVAILABILITY AND IMPLEMENTATION: The software accompanying this paper is available under GPLv3 license on: https://github.com/nguyenngochuy91/Ancestral-Blocks-Reconstruction. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Evolution, Molecular , Genome, Bacterial , Operon , Bacteria , Software
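
Illustrative note on record 8: the abstract describes a maximum parsimony algorithm for reconstructing ancestral operon states without giving its details. As a minimal sketch of the general idea, the code below applies Fitch's small-parsimony rule to a single discrete character (for example, whether a gene is present in a block) on a binary tree. This is a textbook technique, not the ROAGUE implementation; the tree encoding and the taxon names are assumptions.

```python
def fitch(tree, leaf_states):
    """Return (candidate_states, min_changes) for the root of `tree`.

    `tree` is either a leaf name (str) or a (left, right) tuple of subtrees.
    `leaf_states` maps each leaf name to its observed state.
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left_set, left_cost = fitch(tree[0], leaf_states)
    right_set, right_cost = fitch(tree[1], leaf_states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical four-taxon tree and gene presence/absence states.
toy_tree = (("A", "B"), ("C", "D"))
toy_states = {"A": "present", "B": "present", "C": "absent", "D": "present"}
root_states, changes = fitch(toy_tree, toy_states)
```

Only the bottom-up pass is shown; assigning one concrete state to every internal node would need a second, top-down pass.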
9.
Drug Dev Res ; 81(1): 43-51, 2020 02.
Article in English | MEDLINE | ID: mdl-31483516

ABSTRACT

Bacteriocins, the ribosomally produced antimicrobial peptides of bacteria, represent an untapped source of promising antibiotic alternatives. However, bacteriocins display diverse mechanisms of action, a narrow spectrum of activity, and inherent challenges in natural product isolation making in vitro verification of putative bacteriocins difficult. A subset of bacteriocins exert their antimicrobial effects through favorable biophysical interactions with the bacterial membrane mediated by the charge, hydrophobicity, and conformation of the peptide. We have developed a pipeline for bacteriocin-derived compound design and testing that combines sequence-free prediction of bacteriocins using machine learning and a simple biophysical trait filter to generate 20 amino acid peptides that can be synthesized and evaluated for activity. We generated 28,895 total 20-mer candidate peptides and scored them for charge, α-helicity, and hydrophobic moment. Of those, we selected 16 sequences for synthesis and evaluated their antimicrobial, cytotoxicity, and hemolytic activities. Peptides with the overall highest scores for our biophysical parameters exhibited significant antimicrobial activity against Escherichia coli and Pseudomonas aeruginosa. Our combined method incorporates machine learning and biophysical-based minimal region determination to create an original approach to swiftly discover bacteriocin candidates amenable to rapid synthesis and evaluation for therapeutic use.


Subject(s)
Anti-Bacterial Agents/chemical synthesis , Antimicrobial Cationic Peptides/chemical synthesis , Bacteriocins/chemistry , Computational Biology/methods , Anti-Bacterial Agents/chemistry , Anti-Bacterial Agents/pharmacology , Antimicrobial Cationic Peptides/chemistry , Antimicrobial Cationic Peptides/pharmacology , Drug Design , Escherichia coli/drug effects , Escherichia coli/growth & development , Hydrophobic and Hydrophilic Interactions , Machine Learning , Microbial Sensitivity Tests , Protein Domains , Protein Structure, Secondary , Pseudomonas aeruginosa/drug effects , Pseudomonas aeruginosa/growth & development , Staphylococcus aureus/drug effects , Staphylococcus aureus/growth & development , Structure-Activity Relationship
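
Illustrative note on record 9: the abstract says candidate 20-mers were scored for charge, α-helicity, and hydrophobic moment, but gives no formulas. The sketch below shows one simple way two of those traits could be computed. It assumes the Kyte-Doolittle hydropathy scale, a crude charge count (+1 for K/R, -1 for D/E, ignoring histidine and the termini), and an ideal helix with 100 degrees per residue. None of these choices are confirmed by the source, and α-helicity is not covered.

```python
import math

# Kyte-Doolittle hydropathy values (an assumed choice of scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def net_charge(peptide):
    """Crude net charge near neutral pH: +1 for K/R, -1 for D/E."""
    return sum((aa in "KR") - (aa in "DE") for aa in peptide)


def hydrophobic_moment(peptide, angle_deg=100.0):
    """Mean hydrophobic moment, assuming a fixed rotation per residue."""
    angle = math.radians(angle_deg)
    # Sum the hydropathy vectors of all residues around the helix axis.
    x = sum(KYTE_DOOLITTLE[aa] * math.cos(i * angle) for i, aa in enumerate(peptide))
    y = sum(KYTE_DOOLITTLE[aa] * math.sin(i * angle) for i, aa in enumerate(peptide))
    return math.hypot(x, y) / len(peptide)
```

A peptide with hydrophobic residues grouped on one face of the helix gets a larger moment than one with the same composition spread evenly around the axis.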
10.
Hum Mutat ; 40(9): 1530-1545, 2019 09.
Article in English | MEDLINE | ID: mdl-31301157

ABSTRACT

Accurate prediction of the impact of genomic variation on phenotype is a major goal of computational biology and an important contributor to personalized medicine. Computational predictions can lead to a better understanding of the mechanisms underlying genetic diseases, including cancer, but their adoption requires thorough and unbiased assessment. Cystathionine-beta-synthase (CBS) is an enzyme that catalyzes the first step of the transsulfuration pathway, from homocysteine to cystathionine, and in which variations are associated with human hyperhomocysteinemia and homocystinuria. We have created a computational challenge under the CAGI framework to evaluate how well different methods can predict the phenotypic effect(s) of CBS single amino acid substitutions using a blinded experimental data set. CAGI participants were asked to predict yeast growth based on the identity of the mutations. The performance of the methods was evaluated using several metrics. The CBS challenge highlighted the difficulty of predicting the phenotype of an ex vivo system in a model organism when classification models were trained on human disease data. We also discuss the variations in difficulty of prediction for known benign and deleterious variants, as well as identify methodological and experimental constraints with lessons to be learned for future challenges.


Subject(s)
Amino Acid Substitution , Computational Biology/methods , Cystathionine beta-Synthase/genetics , Cystathionine/metabolism , Cystathionine beta-Synthase/metabolism , Homocysteine/metabolism , Humans , Phenotype , Precision Medicine
11.
PLoS Comput Biol ; 14(7): e1006337, 2018 07.
Article in English | MEDLINE | ID: mdl-30059508

ABSTRACT

The accuracy of machine learning tasks critically depends on high quality ground truth data. Therefore, in many cases, producing good ground truth data typically involves trained professionals; however, this can be costly in time, effort, and money. Here we explore the use of crowdsourcing to generate a large number of training data of good quality. We explore an image analysis task involving the segmentation of corn tassels from images taken in a field setting. We investigate the accuracy, speed and other quality metrics when this task is performed by students for academic credit, Amazon MTurk workers, and Master Amazon MTurk workers. We conclude that the Amazon MTurk and Master MTurk workers perform significantly better than the for-credit students, but with no significant difference between the two MTurk worker types. Furthermore, the quality of the segmentation produced by Amazon MTurk workers rivals that of an expert worker. We provide best practices to assess the quality of ground truth data, and to compare data quality produced by different sources. We conclude that properly managed crowdsourcing can be used to establish large volumes of viable ground truth data at a low cost and high quality, especially in the context of high throughput plant phenotyping. We also provide several metrics for assessing the quality of the generated datasets.


Subject(s)
Crops, Agricultural/physiology , Crowdsourcing/methods , Image Processing, Computer-Assisted/methods , Machine Learning , Algorithms , Data Accuracy , Food Supply , Humans , Internet , Phenotype , Pilot Projects
12.
Emerg Infect Dis ; 22(10): 1804-7, 2016 10.
Article in English | MEDLINE | ID: mdl-27347760

ABSTRACT

Chikungunya virus (CHIKV) was isolated from 12 febrile humans in Yucatan, Mexico, in 2015. One patient was co-infected with dengue virus type 1. Two additional CHIKV isolates were obtained from Aedes aegypti mosquitoes collected in the homes of patients. Phylogenetic analysis showed that the CHIKV isolates belong to the Asian lineage.


Subject(s)
Aedes/virology , Chikungunya Fever/virology , Chikungunya virus/isolation & purification , Fever/virology , Animals , Chikungunya Fever/complications , Chikungunya virus/classification , Chlorocebus aethiops , Coinfection/virology , Dengue/complications , Dengue/virology , Dengue Virus/isolation & purification , Mexico , Phylogeny , Vero Cells
13.
Bioinformatics ; 31(13): 2075-83, 2015 Jul 01.
Article in English | MEDLINE | ID: mdl-25717195

ABSTRACT

MOTIVATION: Gene blocks are genes co-located on the chromosome. In many cases, gene blocks are conserved between bacterial species, sometimes as operons, when genes are co-transcribed. The conservation is rarely absolute: gene loss, gain, duplication, block splitting and block fusion are frequently observed. An open question in bacterial molecular evolution is that of the formation and breakup of gene blocks, for which several models have been proposed. These models, however, are not generally applicable to all types of gene blocks, and consequently cannot be used to broadly compare and study gene block evolution. To address this problem, we introduce an event-based method for tracking gene block evolution in bacteria. RESULTS: We show here that the evolution of gene blocks in proteobacteria can be described by a small set of events. Those include the insertion of genes into, or the splitting of genes out of a gene block, gene loss, and gene duplication. We show how the event-based method of gene block evolution allows us to determine the evolutionary rate and may be used to trace the ancestral states of their formation. We conclude that the event-based method can be used to help us understand the formation of these important bacterial genomic structures. AVAILABILITY AND IMPLEMENTATION: The software is available under GPLv3 license on http://github.com/reamdc1/gene_block_evolution.git. Supplementary online material: http://iddo-friedberg.net/operon-evolution


Subject(s)
Bacteria/genetics , Computational Biology/methods , Evolution, Molecular , Genes, Bacterial , Genome, Bacterial , Software , Genomics/methods , Operon
14.
BMC Bioinformatics ; 16: 381, 2015 Nov 11.
Article in English | MEDLINE | ID: mdl-26558535

ABSTRACT

BACKGROUND: Bacteriocins are peptide-derived molecules produced by bacteria, whose recently-discovered functions include virulence factors and signaling molecules as well as their better known roles as antibiotics. To date, close to five hundred bacteriocins have been identified and classified. Recent discoveries have shown that bacteriocins are highly diverse and widely distributed among bacterial species. Given the heterogeneity of bacteriocin compounds, many tools struggle with identifying novel bacteriocins due to their vast sequence and structural diversity. Many bacteriocins undergo post-translational processing or modifications necessary for the biosynthesis of the final mature form. Enzymatic modification of bacteriocins as well as their export is achieved by proteins whose genes are often located in a discrete gene cluster proximal to the bacteriocin precursor gene, referred to as context genes in this study. Although bacteriocins themselves are structurally diverse, context genes have been shown to be largely conserved across unrelated species. METHODS: Using this knowledge, we set out to identify new candidates for context genes which may clarify how bacteriocins are synthesized, and identify new candidates for bacteriocins that bear no sequence similarity to known toxins. To achieve these goals, we have developed a software tool, Bacteriocin Operon and gene block Associator (BOA) that can identify homologous bacteriocin associated gene blocks and predict novel ones. BOA generates profile Hidden Markov Models from the clusters of bacteriocin context genes, and uses them to identify novel bacteriocin gene blocks and operons. RESULTS AND CONCLUSIONS: We provide a novel dataset of predicted bacteriocins and context genes. We also discover that several phyla have a strong preference for bacteriocin genes, suggesting distinct functions for this group of molecules. SOFTWARE AVAILABILITY: https://github.com/idoerg/BOA.


Subject(s)
Anti-Bacterial Agents/pharmacology , Bacteriocins/antagonists & inhibitors , Bacteriocins/metabolism , Genome, Archaeal , Genome, Bacterial , Operon/genetics , Software , Bacteriocins/genetics , Chromosome Mapping
15.
Bioinformatics ; 30(17): i609-16, 2014 Sep 01.
Article in English | MEDLINE | ID: mdl-25161254

ABSTRACT

MOTIVATION: The automated functional annotation of biological macromolecules is a problem of computational assignment of biological concepts or ontological terms to genes and gene products. A number of methods have been developed to computationally annotate genes using standardized nomenclature such as Gene Ontology (GO). However, questions remain about the possibility for development of accurate methods that can integrate disparate molecular data as well as about an unbiased evaluation of these methods. One important concern is that experimental annotations of proteins are incomplete. This raises questions as to whether and to what degree currently available data can be reliably used to train computational models and estimate their performance accuracy. RESULTS: We study the effect of incomplete experimental annotations on the reliability of performance evaluation in protein function prediction. Using the structured-output learning framework, we provide theoretical analyses and carry out simulations to characterize the effect of growing experimental annotations on the correctness and stability of performance estimates corresponding to different types of methods. We then analyze real biological data by simulating the prediction, evaluation and subsequent re-evaluation (after additional experimental annotations become available) of GO term predictions. Our results agree with previous observations that incomplete and accumulating experimental annotations have the potential to significantly impact accuracy assessments. We find that their influence reflects a complex interplay between the prediction algorithm, performance metric and underlying ontology. However, using the available experimental data and under realistic assumptions, our results also suggest that current large-scale evaluations are meaningful and almost surprisingly reliable. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Proteins/physiology , Algorithms , Computational Biology/methods , Gene Ontology , Molecular Sequence Annotation , Proteins/genetics , Sequence Alignment
16.
PLoS Comput Biol ; 9(5): e1003063, 2013.
Article in English | MEDLINE | ID: mdl-23737737

ABSTRACT

The ongoing functional annotation of proteins relies upon the work of curators to capture experimental findings from scientific literature and apply them to protein sequence and structure data. However, with the increasing use of high-throughput experimental assays, a small number of experimental studies dominate the functional protein annotations collected in databases. Here, we investigate just how prevalent is the "few articles - many proteins" phenomenon. We examine the experimentally validated annotation of proteins provided by several groups in the GO Consortium, and show that the distribution of proteins per published study is exponential, with 0.14% of articles providing the source of annotations for 25% of the proteins in the UniProt-GOA compilation. Since each of the dominant articles describes the use of an assay that can find only one function or a small group of functions, this leads to substantial biases in what we know about the function of many proteins. Mass-spectrometry, microscopy and RNAi experiments dominate high throughput experiments. Consequently, the functional information derived from these experiments is mostly of the subcellular location of proteins, and of the participation of proteins in embryonic developmental pathways. For some organisms, the information provided by different studies overlap by a large amount. We also show that the information provided by high throughput experiments is less specific than those provided by low throughput experiments. Given the experimental techniques available, certain biases in protein function annotation due to high-throughput experiments are unavoidable. Knowing that these biases exist and understanding their characteristics and extent is important for database curators, developers of function annotation programs, and anyone who uses protein function annotation data to plan experiments.


Subject(s)
Computational Biology/methods , Databases, Protein , Molecular Sequence Annotation/methods , Proteins/classification , Animals , High-Throughput Screening Assays , Humans , Proteins/chemistry , Proteins/metabolism
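
Illustrative note on record 16: the abstract reports that a very small share of articles accounts for a quarter of annotated proteins. The sketch below shows one way such a concentration figure could be computed from a mapping of articles to the proteins they annotate. The function name, the ranking by article size, and the toy identifiers are all assumptions, and the paper's own procedure may differ.

```python
def top_article_share(article_to_proteins, target=0.25):
    """Fraction of articles (largest first) needed to cover `target` of all proteins."""
    all_proteins = set().union(*article_to_proteins.values())
    # Rank articles by how many proteins each one annotates.
    ranked = sorted(article_to_proteins.values(), key=len, reverse=True)
    covered = set()
    for used, proteins in enumerate(ranked, start=1):
        covered |= proteins
        if len(covered) >= target * len(all_proteins):
            return used / len(ranked)
    return 1.0


# Invented toy data: one large study and three small ones.
toy_annotations = {
    "pmid1": {"P1", "P2", "P3", "P4", "P5", "P6"},
    "pmid2": {"P7"},
    "pmid3": {"P8"},
    "pmid4": {"P1"},
}
share = top_article_share(toy_annotations)
```

In the toy data a single article out of four already covers more than a quarter of the eight proteins, so the share is one quarter of the articles.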
17.
Bioinform Adv ; 4(1): vbae089, 2024.
Article in English | MEDLINE | ID: mdl-38911822

ABSTRACT

Motivation: Genomic islands (GEIs) are clusters of genes in bacterial genomes that are typically acquired by horizontal gene transfer. GEIs play a crucial role in the evolution of bacteria by rapidly introducing genetic diversity and thus helping them adapt to changing environments. Specifically of interest to human health, many GEIs contain pathogenicity and antimicrobial resistance genes. Detecting GEIs is, therefore, an important problem in biomedical and environmental research. There have been many previous studies for computationally identifying GEIs. Still, most of these studies rely on detecting anomalies in the unannotated nucleotide sequences or on a fixed set of known features on annotated nucleotide sequences. Results: Here, we present TreasureIsland, which uses a new unsupervised representation of DNA sequences to predict GEIs. We developed a high-precision boundary detection method featuring an incremental fine-tuning of GEI borders, and we evaluated the accuracy of this framework using a new comprehensive reference dataset, Benbow. We show that TreasureIsland's accuracy rivals other GEI predictors, enabling efficient and faster identification of GEIs in unannotated bacterial genomes. Availability and implementation: TreasureIsland is available under an MIT license at: https://github.com/FriedbergLab/GenomicIslandPrediction.

18.
bioRxiv ; 2024 Jul 03.
Article in English | MEDLINE | ID: mdl-39005379

ABSTRACT

Thirty to seventy percent of proteins in any given genome have no assigned function and have been labeled as the protein "unknome". This large knowledge gap prevents the biological community from fully leveraging the plethora of genomic data that is now available. Machine-learning approaches are showing some promise in propagating functional knowledge from experimentally characterized proteins to the correct set of isofunctional orthologs. However, they largely fail to predict enzymatic functions unseen in the training set, as shown by dissecting the predictions made for 450 enzymes of unknown function from the model bacteria Escherichia coli using the DeepECTransformer platform. Lessons from these failures can help the community develop machine-learning methods that assist domain experts in making testable functional predictions for more members of the uncharacterized proteome.

19.
bioRxiv ; 2024 Mar 11.
Article in English | MEDLINE | ID: mdl-38559275

ABSTRACT

Epitope tagging is an invaluable technique enabling the identification, tracking, and purification of proteins in vivo. We developed a tool, EpicTope, to facilitate this method by identifying amino acid positions suitable for epitope insertion. Our method uses a scoring function that considers multiple protein sequence and structural features to determine locations least disruptive to the protein's function. We validated our approach on the zebrafish Smad5 protein, showing that multiple predicted internally tagged Smad5 proteins rescue zebrafish smad5 mutant embryos, while the N- and C-terminal tagged variants do not, also as predicted. We further show that the internally tagged Smad5 proteins are accessible to antibodies in wholemount zebrafish embryo immunohistochemistry and by western blot. Our work demonstrates that EpicTope is an accessible and effective tool for designing epitope tag insertion sites. EpicTope is available under a GPL-3 license from: https://github.com/FriedbergLab/Epictope.

20.
Bioinform Adv ; 4(1): vbae043, 2024.
Article in English | MEDLINE | ID: mdl-38545087

ABSTRACT

We present CAFA-evaluator, a powerful Python program designed to evaluate the performance of prediction methods on targets with hierarchical concept dependencies. It generalizes multi-label evaluation to modern ontologies where the prediction targets are drawn from a directed acyclic graph and achieves high efficiency by leveraging matrix computation and topological sorting. The program requirements include a small number of standard Python libraries, making CAFA-evaluator easy to maintain. The code replicates the Critical Assessment of protein Function Annotation (CAFA) benchmarking, which evaluates predictions of the consistent subgraphs in Gene Ontology. Owing to its reliability and accuracy, the organizers have selected CAFA-evaluator as the official CAFA evaluation software. Availability and implementation: https://pypi.org/project/cafaeval.
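
Illustrative note on record 20: the abstract says predictions live on a directed acyclic graph and that evaluation relies on topological sorting. The sketch below illustrates that idea with the standard library only: scores are propagated from each term to its ancestors (taking the maximum, an assumed convention), and precision and recall are then computed at one threshold. This is not the CAFA-evaluator API; all function names, term names, and scores are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def propagate(scores, parents):
    """Give every ancestor at least the highest score of its descendants."""
    # TopologicalSorter expects node -> predecessors. Using each term's
    # children as its predecessors makes children come out before parents.
    children = {}
    for term, term_parents in parents.items():
        children.setdefault(term, set())
        for parent in term_parents:
            children.setdefault(parent, set()).add(term)
    result = dict(scores)
    for term in TopologicalSorter(children).static_order():
        for parent in parents.get(term, ()):
            result[parent] = max(result.get(parent, 0.0), result.get(term, 0.0))
    return result


def precision_recall(scores, truth, threshold):
    """Precision and recall of terms scored at or above `threshold`.

    Assumes `truth` is a non-empty set of terms.
    """
    predicted = {term for term, score in scores.items() if score >= threshold}
    if not predicted:
        return 0.0, 0.0
    hits = len(predicted & truth)
    return hits / len(predicted), hits / len(truth)


# Toy ontology: c -> b -> root, and d -> root.
toy_parents = {"root": [], "b": ["root"], "c": ["b"], "d": ["root"]}
toy_scores = propagate({"c": 0.9, "d": 0.2}, toy_parents)
precision, recall = precision_recall(toy_scores, {"root", "b"}, threshold=0.5)
```

Processing terms in topological order means each term is visited once, after all of its descendants have already pushed their scores upward.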
