1.
Nucleic Acids Res ; 51(19): 10162-10175, 2023 10 27.
Article in English | MEDLINE | ID: mdl-37739408

ABSTRACT

Determining the repertoire of a microbe's molecular functions is a central question in microbial biology. Modern techniques achieve this goal by comparing microbial genetic material against reference databases of functionally annotated genes/proteins or known taxonomic markers such as 16S rRNA. Here, we describe a novel approach to exploring bacterial functional repertoires without reference databases. Our Fusion scheme establishes functional relationships between bacteria and assigns organisms to Fusion-taxa that differ from otherwise defined taxonomic clades. Three key findings of our work stand out. First, bacterial functional comparisons outperform marker genes in assigning taxonomic clades. Fusion profiles are also better for this task than other functional annotation schemes. Second, Fusion-taxa are robust to addition of novel organisms and are, arguably, able to capture the environment-driven bacterial diversity. Finally, our alignment-free nucleic acid-based Siamese Neural Network model, created using Fusion functions, enables finding shared functionality of very distant, possibly structurally different, microbial homologs. Our work can thus help annotate functional repertoires of bacterial organisms and further guide our understanding of microbial communities.


Subjects
Bacteria , Bacteria/cytology , Bacteria/genetics , Databases, Factual , Microbiota , Phylogeny , RNA, Ribosomal, 16S/genetics , Bacterial Physiological Phenomena
2.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36688705

ABSTRACT

MOTIVATION: Advances in sequencing technologies have led to a surge in genomic data, although the functions of many gene products coded by these genes remain unknown. While in-depth, targeted experiments that determine the functions of these gene products are crucial and routinely performed, they fail to keep up with the inflow of novel genomic data. In an attempt to address this gap, high-throughput experiments are being conducted in which a large number of genes are investigated in a single study. The annotations generated as a result of these experiments are generally biased towards a small subset of less informative Gene Ontology (GO) terms. Identifying and removing biases from protein function annotation databases is important since biases impact our understanding of protein function by providing a poor picture of the annotation landscape. Additionally, as machine learning methods for predicting protein function are becoming increasingly prevalent, it is essential that they are trained on unbiased datasets. Therefore, it is not only crucial to be aware of biases, but also to judiciously remove them from annotation datasets. RESULTS: We introduce GOThresher, a Python tool that identifies and removes biases in function annotations from protein function annotation databases. AVAILABILITY AND IMPLEMENTATION: GOThresher is written in Python and released via PyPI https://pypi.org/project/gothresher/ and on the Bioconda Anaconda channel https://anaconda.org/bioconda/gothresher. The source code is hosted on GitHub https://github.com/FriedbergLab/GOThresher and distributed under the GPL 3.0 license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology , Genomics , Computational Biology/methods , Molecular Sequence Annotation , Software , Proteins/genetics , Proteins/metabolism , Databases, Protein
3.
Bioinformatics ; 38(Suppl 1): i19-i27, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758800

ABSTRACT

MOTIVATION: Wikipedia is one of the most important channels for the public communication of science and is frequently accessed as an educational resource in computational biology. Joint efforts between the International Society for Computational Biology (ISCB) and the Computational Biology taskforce of WikiProject Molecular Biology (a group of expert Wikipedia editors) have considerably improved computational biology representation on Wikipedia in recent years. However, there is still an urgent need for further improvement in quality, especially when compared to related scientific fields such as genetics and medicine. Facilitating involvement of members from ISCB Communities of Special Interest (COSIs) would improve a vital open education resource in computational biology, additionally allowing COSIs to provide a quality educational resource highly specific to their subfield. RESULTS: We generate a list of around 1500 English Wikipedia articles relating to computational biology and describe the development of a binary COSI-Article matrix, linking COSIs to relevant articles and thereby defining domain-specific open educational resources. Our analysis of the COSI-Article matrix data provides a quantitative assessment of computational biology representation on Wikipedia against other fields and at a COSI-specific level. Furthermore, we conducted similarity analysis and subsequent clustering of COSI-Article data to provide insight into potential relationships between COSIs. Finally, based on our analysis, we suggest courses of action to improve the quality of computational biology representation on Wikipedia.


Subjects
Computational Biology , Cluster Analysis
4.
Nucleic Acids Res ; 49(1): 67-78, 2021 01 11.
Article in English | MEDLINE | ID: mdl-33305328

ABSTRACT

Gene-editing experiments commonly elicit the error-prone non-homologous end joining for DNA double-strand break (DSB) repair. Microhomology-mediated end joining (MMEJ) can generate more predictable outcomes for functional genomic and somatic therapeutic applications. We compared three DSB repair prediction algorithms - MENTHU, inDelphi, and Lindel - in identifying MMEJ-repaired, homogeneous genotypes (PreMAs) in an independent dataset of 5,885 distinct Cas9-mediated mouse embryonic stem cell DSB repair events. MENTHU correctly identified 46% of all PreMAs available, a ∼2- and ∼60-fold sensitivity increase compared to inDelphi and Lindel, respectively. In contrast, only Lindel correctly predicted predominant single-base insertions. We report the new algorithm MENdel, a combination of MENTHU and Lindel, that achieves the most predictive coverage of homogeneous out-of-frame mutations in this large dataset. We then estimated the frequency of Cas9-targetable homogeneous frameshift-inducing DSBs in vertebrate coding regions for gene discovery using MENdel. 47 out of 54 genes (87%) contained at least one early frameshift-inducing DSB and 49 out of 54 (91%) did so when also considering Cas12a-mediated deletions. We suggest that the use of MENdel helps researchers use MMEJ at scale for reverse genetics screenings and with sufficient intra-gene density rates to be viable for nearly all loss-of-function based gene editing therapeutic applications.


Subjects
Algorithms , DNA Breaks, Double-Stranded , DNA End-Joining Repair , Frameshift Mutation , Gene Editing/methods , Genetic Therapy/methods , Genomics/methods , INDEL Mutation , Loss of Function Mutation , Reverse Genetics/methods , Animals , Bacterial Proteins/metabolism , Caspase 9/metabolism , Datasets as Topic , Embryonic Stem Cells/metabolism , Humans , Mice , ROC Curve , Streptococcus pyogenes/enzymology , Zebrafish/genetics
5.
PLoS Comput Biol ; 17(10): e1009463, 2021 10.
Article in English | MEDLINE | ID: mdl-34710081

ABSTRACT

Experimental data about gene functions curated from the primary literature have enormous value for research scientists in understanding biology. Using the Gene Ontology (GO), manual curation by experts has provided an important resource for studying gene function, especially within model organisms. Unprecedented expansion of the scientific literature and validation of the predicted proteins have increased both data value and the challenges of keeping pace. Capturing literature-based functional annotations is limited by the ability of biocurators to handle the massive and rapidly growing scientific literature. Within the community-oriented wiki framework for GO annotation called the Gene Ontology Normal Usage Tracking System (GONUTS), we describe an approach to expand biocuration through crowdsourcing with undergraduates. This multiplies the number of high-quality annotations in international databases, enriches our coverage of the literature on normal gene function, and pushes the field in new directions. From an intercollegiate competition judged by experienced biocurators, Community Assessment of Community Annotation with Ontologies (CACAO), we have contributed nearly 5,000 literature-based annotations. Many of those annotations are to organisms not currently well-represented within GO. Over a 10-year history, our community contributors have spurred changes to the ontology not traditionally covered by professional biocurators. The CACAO principle of relying on community members to participate in and shape the future of biocuration in GO is a powerful and scalable model used to promote the scientific enterprise. It also provides undergraduate students with a unique and enriching introduction to critical reading of primary literature and acquisition of marketable skills.


Subjects
Crowdsourcing/methods , Gene Ontology , Molecular Sequence Annotation/methods , Computational Biology , Databases, Genetic , Humans , Proteins/genetics , Proteins/physiology
6.
Bioinformatics ; 36(Suppl_2): i668-i674, 2020 12 30.
Article in English | MEDLINE | ID: mdl-33381825

ABSTRACT

MOTIVATION: The evolution of complexity is one of the most fascinating and challenging problems in modern biology, and tracing the evolution of complex traits is an open problem. In bacteria, operons and gene blocks provide a model of tractable evolutionary complexity at the genomic level. Gene blocks are structures of co-located genes with related functions, and operons are gene blocks whose genes are co-transcribed on a single mRNA molecule. The genes in operons and gene blocks typically work together in the same system or molecular complex. Previously, we proposed a method that explains the evolution of orthologous gene blocks (orthoblocks) as a combination of a small set of events that take place in vertical evolution from common ancestors. A heuristic method was proposed to solve this problem. However, no study was done to identify the complexity of the problem. RESULTS: Here, we establish that finding the homologous gene block problem is NP-hard and APX-hard. We have developed a greedy algorithm that runs in polynomial time and guarantees an O(ln n) approximation. In addition, we formalize our problem as an integer linear programming problem and solve it using the PuLP package and the standard CPLEX algorithm. Our exploration of several candidate operons reveals that our new method provides results closer to optimal than those from the heuristic approach, and is significantly faster. AVAILABILITY AND IMPLEMENTATION: The software and data accompanying this paper are available under the GPLv3 and CC0 license respectively on: https://github.com/nguyenngochuy91/Relevant-Operon.
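
An O(ln n) guarantee of the kind mentioned in this abstract is the bound classically associated with greedy set cover. As a minimal sketch, assuming a generic set-cover instance (the abstract does not give the authors' gene block formulation, so this is not their algorithm), the greedy rule can be written in Python as follows. The function name `greedy_set_cover` and its inputs are hypothetical.

```python
def greedy_set_cover(universe, subsets):
    """Pick subsets greedily until every element of `universe` is covered.

    Generic illustration only; not the algorithm from the paper.
    `subsets` maps a (hypothetical) subset name to a set of elements.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Choose the subset that covers the most still-uncovered elements.
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        gained = subsets[best] & uncovered
        if not gained:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= gained
    return chosen
```

Each round removes the largest remaining chunk of uncovered elements, which is what yields the logarithmic approximation factor for set cover.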


Subjects
Genomics , Software , Algorithms , Bacteria , Computational Biology , Hardness
7.
Bioinformatics ; 35(12): 2009-2016, 2019 06 01.
Article in English | MEDLINE | ID: mdl-30418485

ABSTRACT

MOTIVATION: Antibiotic resistance constitutes a major public health crisis, and finding new sources of antimicrobial drugs is crucial to solving it. Bacteriocins, which are bacterially produced antimicrobial peptide products, are candidates for broadening the available choices of antimicrobials. However, the discovery of new bacteriocins by genomic mining is hampered by their sequences' low complexity and high variance, which frustrates sequence similarity-based searches. RESULTS: Here we use word embeddings of protein sequences to represent bacteriocins, and apply a word embedding method that accounts for amino acid order in protein sequences, to predict novel bacteriocins from protein sequences without using sequence similarity. Our method predicts, with a high probability, six yet unknown putative bacteriocins in Lactobacillus. Generalized, the representation of sequences with word embeddings preserving sequence order information can be applied to peptide and protein classification problems for which sequence similarity cannot be used. AVAILABILITY AND IMPLEMENTATION: Data and source code for this project are freely available at: https://github.com/nafizh/NeuBI. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
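
Word-embedding methods need a protein sequence to be split into "words" first. The abstract does not say how this is done, so the following Python sketch assumes overlapping k-mers with k = 3, a common choice; the function name `kmer_tokens` is hypothetical and this is not taken from the NeuBI code.

```python
def kmer_tokens(sequence, k=3):
    """Split a protein sequence into overlapping k-mers ("words").

    Assumption for illustration: overlapping windows with step 1.
    Residue order is preserved in the token order.
    """
    seq = sequence.strip().upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]
```

The resulting token list can be passed to any embedding model that accepts sentences as lists of words; sequences shorter than k yield an empty list.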


Subjects
Neural Networks, Computer , Anti-Infective Agents , Computational Biology , Peptides , Software
8.
Bioinformatics ; 35(17): 2998-3004, 2019 09 01.
Article in English | MEDLINE | ID: mdl-30689726

ABSTRACT

MOTIVATION: Complexity is a fundamental attribute of life. Complex systems are made of parts that together perform functions that a single component, or subsets of components, cannot. Examples of complex molecular systems include protein structures such as the F1Fo-ATPase, the ribosome, or the flagellar motor: each one of these structures requires most or all of its components to function properly. Given the ubiquity of complex systems in the biosphere, understanding the evolution of complexity is central to biology. At the molecular level, operons are classic examples of a complex system. An operon's genes are co-transcribed under the control of a single promoter to a polycistronic mRNA molecule, and the operon's gene products often form molecular complexes or metabolic pathways. With the large number of complete bacterial genomes available, we now have the opportunity to explore the evolution of these complex entities, by identifying possible intermediate states of operons. RESULTS: In this work, we developed a maximum parsimony algorithm to reconstruct ancestral operon states, and show a simple vertical evolution model of how operons may evolve from the individual component genes. We describe several ancestral states that are plausible functional intermediate forms leading to the full operon. We also offer Reconstruction of Ancestral Gene blocks Using Events or ROAGUE as a software tool for those interested in exploring gene block and operon evolution. AVAILABILITY AND IMPLEMENTATION: The software accompanying this paper is available under GPLv3 license on: https://github.com/nguyenngochuy91/Ancestral-Blocks-Reconstruction. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
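
To illustrate what maximum parsimony reconstruction involves, here is a minimal Python sketch of the bottom-up pass of the classic Fitch algorithm for a single character (for example, presence or absence of one gene) on a binary tree. This is a textbook technique swapped in for illustration, not the event-based ROAGUE algorithm; the tree encoding and the name `fitch_sets` are assumptions.

```python
def fitch_sets(tree, leaf_states):
    """Bottom-up Fitch pass for one character on a rooted binary tree.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    leaf_states: leaf name -> observed state (e.g. 1 = gene present).
    Returns (candidate state set at this node, minimum number of changes).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left_set, left_cost = fitch_sets(tree[0], leaf_states)
    right_set, right_cost = fitch_sets(tree[1], leaf_states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # Children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1
```

For example, `fitch_sets((("A", "B"), ("C", "D")), {"A": 1, "B": 1, "C": 0, "D": 1})` evaluates the four-leaf tree from the leaves upward and reports the states that are most parsimonious at the root.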


Subjects
Evolution, Molecular , Genome, Bacterial , Operon , Bacteria , Software
9.
Drug Dev Res ; 81(1): 43-51, 2020 02.
Article in English | MEDLINE | ID: mdl-31483516

ABSTRACT

Bacteriocins, the ribosomally produced antimicrobial peptides of bacteria, represent an untapped source of promising antibiotic alternatives. However, bacteriocins display diverse mechanisms of action, a narrow spectrum of activity, and inherent challenges in natural product isolation making in vitro verification of putative bacteriocins difficult. A subset of bacteriocins exert their antimicrobial effects through favorable biophysical interactions with the bacterial membrane mediated by the charge, hydrophobicity, and conformation of the peptide. We have developed a pipeline for bacteriocin-derived compound design and testing that combines sequence-free prediction of bacteriocins using machine learning and a simple biophysical trait filter to generate 20 amino acid peptides that can be synthesized and evaluated for activity. We generated 28,895 total 20-mer candidate peptides and scored them for charge, α-helicity, and hydrophobic moment. Of those, we selected 16 sequences for synthesis and evaluated their antimicrobial, cytotoxicity, and hemolytic activities. Peptides with the overall highest scores for our biophysical parameters exhibited significant antimicrobial activity against Escherichia coli and Pseudomonas aeruginosa. Our combined method incorporates machine learning and biophysical-based minimal region determination to create an original approach to swiftly discover bacteriocin candidates amenable to rapid synthesis and evaluation for therapeutic use.
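
The candidate-generation step described here (cut 20-residue peptides out of a longer sequence, then score each for charge) can be sketched in Python as below. The charge model is a deliberately crude assumption (+1 for K and R, -1 for D and E, ignoring histidine, termini, and pH), and the names `windows` and `net_charge` are hypothetical; the abstract does not specify the authors' scoring functions.

```python
def windows(sequence, length=20):
    """Return every contiguous sub-peptide of the given length."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


def net_charge(peptide):
    """Crude net charge: count basic residues minus acidic residues.

    Assumption for illustration only; real scoring would account for
    pH, histidine, and terminal groups.
    """
    positive = sum(peptide.count(aa) for aa in "KR")
    negative = sum(peptide.count(aa) for aa in "DE")
    return positive - negative
```

Candidates could then be ranked with `sorted(windows(seq), key=net_charge, reverse=True)`, alongside whatever helicity and hydrophobic-moment scores are available.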


Subjects
Anti-Bacterial Agents/chemical synthesis , Antimicrobial Cationic Peptides/chemical synthesis , Bacteriocins/chemistry , Computational Biology/methods , Anti-Bacterial Agents/chemistry , Anti-Bacterial Agents/pharmacology , Antimicrobial Cationic Peptides/chemistry , Antimicrobial Cationic Peptides/pharmacology , Drug Design , Escherichia coli/drug effects , Escherichia coli/growth & development , Hydrophobic and Hydrophilic Interactions , Machine Learning , Microbial Sensitivity Tests , Protein Domains , Protein Structure, Secondary , Pseudomonas aeruginosa/drug effects , Pseudomonas aeruginosa/growth & development , Staphylococcus aureus/drug effects , Staphylococcus aureus/growth & development , Structure-Activity Relationship
10.
Hum Mutat ; 40(9): 1530-1545, 2019 09.
Article in English | MEDLINE | ID: mdl-31301157

ABSTRACT

Accurate prediction of the impact of genomic variation on phenotype is a major goal of computational biology and an important contributor to personalized medicine. Computational predictions can lead to a better understanding of the mechanisms underlying genetic diseases, including cancer, but their adoption requires thorough and unbiased assessment. Cystathionine-beta-synthase (CBS) is an enzyme that catalyzes the first step of the transsulfuration pathway, from homocysteine to cystathionine, and in which variations are associated with human hyperhomocysteinemia and homocystinuria. We have created a computational challenge under the CAGI framework to evaluate how well different methods can predict the phenotypic effect(s) of CBS single amino acid substitutions using a blinded experimental data set. CAGI participants were asked to predict yeast growth based on the identity of the mutations. The performance of the methods was evaluated using several metrics. The CBS challenge highlighted the difficulty of predicting the phenotype of an ex vivo system in a model organism when classification models were trained on human disease data. We also discuss the variations in difficulty of prediction for known benign and deleterious variants, as well as identify methodological and experimental constraints with lessons to be learned for future challenges.


Subjects
Amino Acid Substitution , Computational Biology/methods , Cystathionine beta-Synthase/genetics , Cystathionine/metabolism , Cystathionine beta-Synthase/metabolism , Homocysteine/metabolism , Humans , Phenotype , Precision Medicine
11.
PLoS Comput Biol ; 14(7): e1006337, 2018 07.
Article in English | MEDLINE | ID: mdl-30059508

ABSTRACT

The accuracy of machine learning tasks critically depends on high quality ground truth data. Producing good ground truth data typically involves trained professionals; however, this can be costly in time, effort, and money. Here we explore the use of crowdsourcing to generate a large volume of good-quality training data. We explore an image analysis task involving the segmentation of corn tassels from images taken in a field setting. We investigate the accuracy, speed and other quality metrics when this task is performed by students for academic credit, Amazon MTurk workers, and Master Amazon MTurk workers. We conclude that the Amazon MTurk and Master MTurk workers perform significantly better than the for-credit students, but with no significant difference between the two MTurk worker types. Furthermore, the quality of the segmentation produced by Amazon MTurk workers rivals that of an expert worker. We provide best practices to assess the quality of ground truth data, and to compare data quality produced by different sources. We conclude that properly managed crowdsourcing can be used to establish large volumes of viable ground truth data at a low cost and high quality, especially in the context of high throughput plant phenotyping. We also provide several metrics for assessing the quality of the generated datasets.


Subjects
Crops, Agricultural/physiology , Crowdsourcing/methods , Image Processing, Computer-Assisted/methods , Machine Learning , Algorithms , Data Accuracy , Food Supply , Humans , Internet , Phenotype , Pilot Projects
13.
Bioinformatics ; 31(13): 2075-83, 2015 Jul 01.
Article in English | MEDLINE | ID: mdl-25717195

ABSTRACT

MOTIVATION: Gene blocks are genes co-located on the chromosome. In many cases, gene blocks are conserved between bacterial species, sometimes as operons, when genes are co-transcribed. The conservation is rarely absolute: gene loss, gain, duplication, block splitting and block fusion are frequently observed. An open question in bacterial molecular evolution is that of the formation and breakup of gene blocks, for which several models have been proposed. These models, however, are not generally applicable to all types of gene blocks, and consequently cannot be used to broadly compare and study gene block evolution. To address this problem, we introduce an event-based method for tracking gene block evolution in bacteria. RESULTS: We show here that the evolution of gene blocks in proteobacteria can be described by a small set of events. Those include the insertion of genes into, or the splitting of genes out of, a gene block, gene loss, and gene duplication. We show how the event-based method of gene block evolution allows us to determine the evolutionary rate and may be used to trace the ancestral states of their formation. We conclude that the event-based method can be used to help us understand the formation of these important bacterial genomic structures. AVAILABILITY AND IMPLEMENTATION: The software is available under GPLv3 license on http://github.com/reamdc1/gene_block_evolution.git. Supplementary online material: http://iddo-friedberg.net/operon-evolution


Subjects
Bacteria/genetics , Computational Biology/methods , Evolution, Molecular , Genes, Bacterial , Genome, Bacterial , Software , Genomics/methods , Operon
14.
BMC Bioinformatics ; 16: 381, 2015 Nov 11.
Article in English | MEDLINE | ID: mdl-26558535

ABSTRACT

BACKGROUND: Bacteriocins are peptide-derived molecules produced by bacteria, whose recently-discovered functions include virulence factors and signaling molecules as well as their better known roles as antibiotics. To date, close to five hundred bacteriocins have been identified and classified. Recent discoveries have shown that bacteriocins are highly diverse and widely distributed among bacterial species. Given the heterogeneity of bacteriocin compounds, many tools struggle with identifying novel bacteriocins due to their vast sequence and structural diversity. Many bacteriocins undergo post-translational processing or modifications necessary for the biosynthesis of the final mature form. Enzymatic modification of bacteriocins as well as their export is achieved by proteins whose genes are often located in a discrete gene cluster proximal to the bacteriocin precursor gene, referred to as context genes in this study. Although bacteriocins themselves are structurally diverse, context genes have been shown to be largely conserved across unrelated species. METHODS: Using this knowledge, we set out to identify new candidates for context genes which may clarify how bacteriocins are synthesized, and identify new candidates for bacteriocins that bear no sequence similarity to known toxins. To achieve these goals, we have developed a software tool, Bacteriocin Operon and gene block Associator (BOA) that can identify homologous bacteriocin associated gene blocks and predict novel ones. BOA generates profile Hidden Markov Models from the clusters of bacteriocin context genes, and uses them to identify novel bacteriocin gene blocks and operons. RESULTS AND CONCLUSIONS: We provide a novel dataset of predicted bacteriocins and context genes. We also discover that several phyla have a strong preference for bacteriocin genes, suggesting distinct functions for this group of molecules. SOFTWARE AVAILABILITY: https://github.com/idoerg/BOA.


Subjects
Anti-Bacterial Agents/pharmacology , Bacteriocins/antagonists & inhibitors , Bacteriocins/metabolism , Genome, Archaeal , Genome, Bacterial , Operon/genetics , Software , Bacteriocins/genetics , Chromosome Mapping
15.
Bioinformatics ; 30(17): i609-16, 2014 Sep 01.
Article in English | MEDLINE | ID: mdl-25161254

ABSTRACT

MOTIVATION: The automated functional annotation of biological macromolecules is a problem of computational assignment of biological concepts or ontological terms to genes and gene products. A number of methods have been developed to computationally annotate genes using standardized nomenclature such as Gene Ontology (GO). However, questions remain about the possibility for development of accurate methods that can integrate disparate molecular data as well as about an unbiased evaluation of these methods. One important concern is that experimental annotations of proteins are incomplete. This raises questions as to whether and to what degree currently available data can be reliably used to train computational models and estimate their performance accuracy. RESULTS: We study the effect of incomplete experimental annotations on the reliability of performance evaluation in protein function prediction. Using the structured-output learning framework, we provide theoretical analyses and carry out simulations to characterize the effect of growing experimental annotations on the correctness and stability of performance estimates corresponding to different types of methods. We then analyze real biological data by simulating the prediction, evaluation and subsequent re-evaluation (after additional experimental annotations become available) of GO term predictions. Our results agree with previous observations that incomplete and accumulating experimental annotations have the potential to significantly impact accuracy assessments. We find that their influence reflects a complex interplay between the prediction algorithm, performance metric and underlying ontology. However, using the available experimental data and under realistic assumptions, our results also suggest that current large-scale evaluations are meaningful and almost surprisingly reliable. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Proteins/physiology , Algorithms , Computational Biology/methods , Gene Ontology , Molecular Sequence Annotation , Proteins/genetics , Sequence Alignment
16.
PLoS Comput Biol ; 9(5): e1003063, 2013.
Article in English | MEDLINE | ID: mdl-23737737

ABSTRACT

The ongoing functional annotation of proteins relies upon the work of curators to capture experimental findings from scientific literature and apply them to protein sequence and structure data. However, with the increasing use of high-throughput experimental assays, a small number of experimental studies dominate the functional protein annotations collected in databases. Here, we investigate just how prevalent the "few articles - many proteins" phenomenon is. We examine the experimentally validated annotation of proteins provided by several groups in the GO Consortium, and show that the distribution of proteins per published study is exponential, with 0.14% of articles providing the source of annotations for 25% of the proteins in the UniProt-GOA compilation. Since each of the dominant articles describes the use of an assay that can find only one function or a small group of functions, this leads to substantial biases in what we know about the function of many proteins. Mass-spectrometry, microscopy and RNAi experiments dominate high throughput experiments. Consequently, the functional information derived from these experiments is mostly of the subcellular location of proteins, and of the participation of proteins in embryonic developmental pathways. For some organisms, the information provided by different studies overlaps by a large amount. We also show that the information provided by high throughput experiments is less specific than that provided by low throughput experiments. Given the experimental techniques available, certain biases in protein function annotation due to high-throughput experiments are unavoidable. Knowing that these biases exist and understanding their characteristics and extent is important for database curators, developers of function annotation programs, and anyone who uses protein function annotation data to plan experiments.
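
A figure such as "0.14% of articles account for 25% of proteins" can be derived from a mapping of articles to the proteins they annotate. The Python sketch below shows one plausible way to compute it, assuming articles are ranked by how many proteins they annotate and added until the target share of proteins is reached. The function name, the input format, and the ranking rule are assumptions, not the authors' published procedure.

```python
def article_share_for_coverage(article_to_proteins, target=0.25):
    """Fraction of articles needed to cover `target` of all proteins.

    article_to_proteins: article ID -> set of annotated protein IDs.
    Articles are taken largest-first (an assumption for illustration).
    """
    all_proteins = set().union(*article_to_proteins.values())
    needed = target * len(all_proteins)
    covered = set()
    ranked = sorted(article_to_proteins.values(), key=len, reverse=True)
    for used, proteins in enumerate(ranked, start=1):
        covered |= proteins
        if len(covered) >= needed:
            return used / len(ranked)
    return 1.0
```

A very small return value on real annotation data would indicate the "few articles, many proteins" skew the abstract describes.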


Subjects
Computational Biology/methods , Databases, Protein , Molecular Sequence Annotation/methods , Proteins/classification , Animals , High-Throughput Screening Assays , Humans , Proteins/chemistry , Proteins/metabolism
17.
Bioinform Adv ; 4(1): vbae089, 2024.
Article in English | MEDLINE | ID: mdl-38911822

ABSTRACT

Motivation: Genomic islands (GEIs) are clusters of genes in bacterial genomes that are typically acquired by horizontal gene transfer. GEIs play a crucial role in the evolution of bacteria by rapidly introducing genetic diversity and thus helping them adapt to changing environments. Specifically of interest to human health, many GEIs contain pathogenicity and antimicrobial resistance genes. Detecting GEIs is, therefore, an important problem in biomedical and environmental research. There have been many previous studies for computationally identifying GEIs. Still, most of these studies rely on detecting anomalies in the unannotated nucleotide sequences or on a fixed set of known features on annotated nucleotide sequences. Results: Here, we present TreasureIsland, which uses a new unsupervised representation of DNA sequences to predict GEIs. We developed a high-precision boundary detection method featuring an incremental fine-tuning of GEI borders, and we evaluated the accuracy of this framework using a new comprehensive reference dataset, Benbow. We show that TreasureIsland's accuracy rivals other GEI predictors, enabling efficient and faster identification of GEIs in unannotated bacterial genomes. Availability and implementation: TreasureIsland is available under an MIT license at: https://github.com/FriedbergLab/GenomicIslandPrediction.

18.
bioRxiv ; 2024 Mar 11.
Article in English | MEDLINE | ID: mdl-38559275

ABSTRACT

Epitope tagging is an invaluable technique enabling the identification, tracking, and purification of proteins in vivo. We developed a tool, EpicTope, to facilitate this method by identifying amino acid positions suitable for epitope insertion. Our method uses a scoring function that considers multiple protein sequence and structural features to determine locations least disruptive to the protein's function. We validated our approach on the zebrafish Smad5 protein, showing that multiple predicted internally tagged Smad5 proteins rescue zebrafish smad5 mutant embryos, while the N- and C-terminal tagged variants do not, also as predicted. We further show that the internally tagged Smad5 proteins are accessible to antibodies in wholemount zebrafish embryo immunohistochemistry and by western blot. Our work demonstrates that EpicTope is an accessible and effective tool for designing epitope tag insertion sites. EpicTope is available under a GPL-3 license from: https://github.com/FriedbergLab/Epictope.

19.
Bioinform Adv ; 4(1): vbae043, 2024.
Article in English | MEDLINE | ID: mdl-38545087

ABSTRACT

We present CAFA-evaluator, a powerful Python program designed to evaluate the performance of prediction methods on targets with hierarchical concept dependencies. It generalizes multi-label evaluation to modern ontologies where the prediction targets are drawn from a directed acyclic graph and achieves high efficiency by leveraging matrix computation and topological sorting. The program requirements include a small number of standard Python libraries, making CAFA-evaluator easy to maintain. The code replicates the Critical Assessment of protein Function Annotation (CAFA) benchmarking, which evaluates predictions of the consistent subgraphs in Gene Ontology. Owing to its reliability and accuracy, the organizers have selected CAFA-evaluator as the official CAFA evaluation software. Availability and implementation: https://pypi.org/project/cafaeval.
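
Evaluating predictions over a directed acyclic graph usually requires that a score assigned to a term is also credited to all of its ancestors. The sketch below shows that propagation step in plain Python using a topological order from the standard library's `graphlib` (Python 3.9+). It is an illustration under assumed data structures, not CAFA-evaluator's actual code, which the abstract says relies on matrix computation; the name `propagate_scores` is hypothetical.

```python
from graphlib import TopologicalSorter


def propagate_scores(parents, scores):
    """Give every term at least the score of each of its descendants.

    parents: term -> set of parent terms (a DAG).
    scores: term -> predicted score; missing terms default to 0.0.
    """
    # TopologicalSorter expects node -> predecessors. Treating children
    # as predecessors makes static_order() yield children before parents.
    children_of = {term: set() for term in scores}
    for child, term_parents in parents.items():
        children_of.setdefault(child, set())
        for parent in term_parents:
            children_of.setdefault(parent, set()).add(child)

    propagated = {term: scores.get(term, 0.0) for term in children_of}
    for term in TopologicalSorter(children_of).static_order():
        for parent in parents.get(term, ()):
            propagated[parent] = max(propagated[parent], propagated[term])
    return propagated
```

Because each term is visited only after all of its children, a single pass is enough to make the scores consistent with the hierarchy.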

20.
PLoS One ; 18(8): e0290473, 2023.
Article in English | MEDLINE | ID: mdl-37616210

ABSTRACT

Understanding the microbial genomic contributors to antimicrobial resistance (AMR) is essential for early detection of emerging AMR infections, a pressing global health threat in human and veterinary medicine. Here we used whole genome sequencing and antibiotic susceptibility test data from 980 disease-causing Escherichia coli isolated from companion and farm animals to model AMR genotypes and phenotypes for 24 antibiotics. We determined the strength of genotype-to-phenotype relationships for 197 AMR genes with elastic net logistic regression. Model predictors were designed to evaluate different potential modes of AMR genotype translation into resistance phenotypes. Our results show a model that considers the presence of individual AMR genes and total number of AMR genes present from a set of genes known to confer resistance was able to accurately predict isolate resistance on average (mean F1 score = 98.0%, SD = 2.3%, mean accuracy = 98.2%, SD = 2.7%). However, fitted models sometimes varied for antibiotics in the same class and for the same antibiotic across animal hosts, suggesting heterogeneity in the genetic determinants of AMR. We conclude that an interpretable AMR prediction model can be used to accurately predict resistance phenotypes across multiple host species and reveal testable hypotheses about how the mechanism of resistance may vary across antibiotics within the same class and across animal hosts for the same antibiotic.
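
The F1 score and accuracy reported in this abstract are standard binary classification metrics. As a small reference sketch in Python (the function name and the encoding of resistant = 1, susceptible = 0 are assumptions for illustration), they can be computed from true and predicted labels like this:

```python
def f1_and_accuracy(y_true, y_pred):
    """Return (F1 score, accuracy) for binary labels (1 = resistant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # F1 is the harmonic mean of precision and recall,
    # which simplifies to 2*TP / (2*TP + FP + FN).
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return f1, accuracy
```

F1 ignores true negatives, so it is the more informative of the two when resistant isolates are rare for a given antibiotic.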


Subjects
Anti-Bacterial Agents , Livestock , Animals , Humans , Anti-Bacterial Agents/pharmacology , Pets , Drug Resistance, Bacterial/genetics , Escherichia coli/genetics