Limitations of Current Machine-Learning Models in Predicting Enzymatic Functions for Uncharacterized Proteins.

de Crécy-Lagard, Valérie; Dias, Raquel; Friedberg, Iddo; Yuan, Yifeng; Swairjo, Manal A.

bioRxiv ; 2024 Jul 03.

Article in English | MEDLINE | ID: mdl-39005379

ABSTRACT

Thirty to seventy percent of proteins in any given genome have no assigned function and have been labeled as the protein "unknome". This large knowledge gap prevents the biological community from fully leveraging the plethora of genomic data that is now available. Machine-learning approaches are showing some promise in propagating functional knowledge from experimentally characterized proteins to the correct set of isofunctional orthologs. However, they largely fail to predict enzymatic functions unseen in the training set, as shown by dissecting the predictions made for 450 enzymes of unknown function from the model bacterium Escherichia coli using the DeepECTransformer platform. Lessons from these failures can help the community develop machine-learning methods that assist domain experts in making testable functional predictions for more members of the uncharacterized proteome.

Discovering genomic islands in unannotated bacterial genomes using sequence embedding.

Banerjee, Priyanka; Eulenstein, Oliver; Friedberg, Iddo.

Bioinform Adv ; 4(1): vbae089, 2024.

Article in English | MEDLINE | ID: mdl-38911822

ABSTRACT

Motivation: Genomic islands (GEIs) are clusters of genes in bacterial genomes that are typically acquired by horizontal gene transfer. GEIs play a crucial role in the evolution of bacteria by rapidly introducing genetic diversity and thus helping them adapt to changing environments. Specifically of interest to human health, many GEIs contain pathogenicity and antimicrobial resistance genes. Detecting GEIs is, therefore, an important problem in biomedical and environmental research. There have been many previous studies for computationally identifying GEIs. Still, most of these studies rely on detecting anomalies in the unannotated nucleotide sequences or on a fixed set of known features on annotated nucleotide sequences. Results: Here, we present TreasureIsland, which uses a new unsupervised representation of DNA sequences to predict GEIs. We developed a high-precision boundary detection method featuring an incremental fine-tuning of GEI borders, and we evaluated the accuracy of this framework using a new comprehensive reference dataset, Benbow. We show that TreasureIsland's accuracy rivals other GEI predictors, enabling efficient and faster identification of GEIs in unannotated bacterial genomes. Availability and implementation: TreasureIsland is available under an MIT license at: https://github.com/FriedbergLab/GenomicIslandPrediction.

EpicTope: narrating protein sequence features to identify non-disruptive epitope tagging sites.

Zinski, Joseph; Chung, Henri; Joshi, Parnal; Warrick, Finn; Berg, Brian D; Glova, Greg; McGrail, Maura; Balciunas, Darius; Friedberg, Iddo; Mullins, Mary.

bioRxiv ; 2024 Mar 11.

Article in English | MEDLINE | ID: mdl-38559275

ABSTRACT

Epitope tagging is an invaluable technique enabling the identification, tracking, and purification of proteins in vivo. We developed a tool, EpicTope, to facilitate this method by identifying amino acid positions suitable for epitope insertion. Our method uses a scoring function that considers multiple protein sequence and structural features to determine locations least disruptive to the protein's function. We validated our approach on the zebrafish Smad5 protein, showing that multiple predicted internally tagged Smad5 proteins rescue zebrafish smad5 mutant embryos, while the N- and C-terminal tagged variants do not, also as predicted. We further show that the internally tagged Smad5 proteins are accessible to antibodies in wholemount zebrafish embryo immunohistochemistry and by western blot. Our work demonstrates that EpicTope is an accessible and effective tool for designing epitope tag insertion sites. EpicTope is available under a GPL-3 license from: https://github.com/FriedbergLab/Epictope.

CAFA-evaluator: a Python tool for benchmarking ontological classification methods.

Piovesan, Damiano; Zago, Davide; Joshi, Parnal; De Paolis Kaluza, M Clara; Mehdiabadi, Mahta; Ramola, Rashika; Monzon, Alexander Miguel; Reade, Walter; Friedberg, Iddo; Radivojac, Predrag; Tosatto, Silvio C E.

Bioinform Adv ; 4(1): vbae043, 2024.

Article in English | MEDLINE | ID: mdl-38545087

ABSTRACT

We present CAFA-evaluator, a powerful Python program designed to evaluate the performance of prediction methods on targets with hierarchical concept dependencies. It generalizes multi-label evaluation to modern ontologies where the prediction targets are drawn from a directed acyclic graph and achieves high efficiency by leveraging matrix computation and topological sorting. The program requirements include a small number of standard Python libraries, making CAFA-evaluator easy to maintain. The code replicates the Critical Assessment of protein Function Annotation (CAFA) benchmarking, which evaluates predictions of the consistent subgraphs in Gene Ontology. Owing to its reliability and accuracy, the organizers have selected CAFA-evaluator as the official CAFA evaluation software. Availability and implementation: https://pypi.org/project/cafaeval.
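The abstract above mentions topological sorting over a directed acyclic graph of ontology terms. The following is a minimal sketch of that general idea, not the cafaeval API: it propagates prediction scores from child terms to their ancestors so that a parent term never scores lower than any of its descendants. The function name `propagate_scores`, the toy ontology, and the max-propagation rule are assumptions made for illustration, and the matrix-based computation the abstract refers to is not reproduced here.

```python
from graphlib import TopologicalSorter


def propagate_scores(parents, scores):
    """Return scores propagated up a DAG (hypothetical helper, not cafaeval).

    parents: dict mapping each term to a list of its parent terms.
    scores:  dict mapping some terms to a predicted score in [0, 1].
    """
    # Invert the parent map into a term -> children map.
    children = {term: set() for term in parents}
    for term, term_parents in parents.items():
        for parent in term_parents:
            children.setdefault(parent, set()).add(term)

    # Terms without a prediction start at 0.0 (assumption).
    propagated = {term: scores.get(term, 0.0) for term in children}

    # TopologicalSorter expects node -> predecessors. Passing the children
    # map makes every child appear before its parents in the order.
    for term in TopologicalSorter(children).static_order():
        for child in children[term]:
            propagated[term] = max(propagated[term], propagated[child])
    return propagated


# Toy ontology: C is a child of A; D is a child of both A and B.
toy_parents = {"root": [], "A": ["root"], "B": ["root"], "C": ["A"], "D": ["A", "B"]}
toy_scores = {"C": 0.8, "B": 0.3, "D": 0.5}
result = propagate_scores(toy_parents, toy_scores)
```

Because each term is visited only after all of its children, a single pass over the sorted order is enough to make the scores consistent with the hierarchy. Taking the maximum over children is one common convention for this step and is assumed here rather than taken from the source.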
