ABSTRACT
BACKGROUND: Identifying human protein-phenotype relationships has attracted researchers in bioinformatics and biomedical natural language processing due to its importance in uncovering rare and complex diseases. Since experimental validation of protein-phenotype associations is prohibitively costly, automated tools capable of accurately extracting these associations from biomedical text are in high demand. However, while the manual annotation of protein-phenotype co-mentions required for training such models is highly resource-consuming, extracting millions of unlabeled co-mentions is straightforward. RESULTS: In this study, we propose a novel deep semi-supervised ensemble framework that combines deep neural networks, semi-supervised learning, and ensemble learning for classifying human protein-phenotype co-mentions with the help of unlabeled data. This framework can incorporate an extensive collection of unlabeled sentence-level co-mentions of human proteins and phenotypes alongside a small labeled dataset to enhance overall performance. We develop PPPredSS, a prototype of our proposed semi-supervised framework that combines sophisticated language models, convolutional networks, and recurrent networks. Our experimental results demonstrate that the proposed approach sets a new state of the art in classifying human protein-phenotype co-mentions by outperforming other supervised and semi-supervised counterparts. Furthermore, we highlight the utility of PPPredSS in powering a curation assistant system through case studies involving a group of biologists. CONCLUSIONS: This article presents a novel approach for human protein-phenotype co-mention classification based on deep, semi-supervised, and ensemble learning. The insights and findings from this work have implications for biomedical researchers, biocurators, and the text mining community working on biomedical relationship extraction.
Subjects
Neural Networks, Computer , Supervised Machine Learning , Data Mining , Humans , Phenotype
ABSTRACT
This study provides a snapshot of the current vaccine business ecosystem, including practices, challenges, beliefs, and expectations of vaccine providers. Our team focused on providers' firsthand experience with administering vaccines to determine whether an oral vaccine (e.g., a pill or oral drops) would be well received. We interviewed 135 healthcare providers and vaccine specialists across the US, focusing questions on routine vaccinations, not COVID-19 vaccines. Improving workflow efficiency is a top concern among vaccine providers due to shrinking reimbursement rates, which are determined by pharmacy benefit managers (PBMs), and the time-intensiveness of injectable vaccines. Administering injectable vaccines takes 23 minutes/patient on average, while dispensing pills takes only 5 minutes/patient. An average of 24% of patients express fear of needles, which further lengthens the processing time. Misaligned incentives between providers and PBMs could reduce the quality and availability of vaccine-related care. The unavailability of single-dose orders prevents some rural providers from offering certain vaccines. Most interviewees (74%) believe an oral vaccine would improve patient-provider experience, patient compliance, and workflow efficiency, while detractors (26%) worry about the taste, vaccine absorption, and efficacy. Additional research could investigate whether currently non-vaccinating pharmacies would be willing to offer oral vaccines, and the impact of oral vaccines on vaccine acceptance.
Subjects
Ecosystem , Vaccines , United States , Humans , Vaccination , Health Personnel , Technology
ABSTRACT
MicroRNAs (miRNAs) are small non-coding RNA molecules that control the expression of genes at the RNA level, while some operate at the DNA level. They typically range from 20 to 24 nucleotides in length and can be found in the plant and animal kingdoms as well as in some viruses. Computational approaches have overcome the limitations of the experimental methods and have performed well in identifying miRNAs. Compared to mature miRNAs, precursor miRNAs (pre-miRNAs) are longer and have a hairpin-loop structure with characteristic structural features. Therefore, most in-silico tools are implemented for pre-miRNA identification. This study presents a multilayer perceptron (MLP) based classifier implemented using 180 features under sequential, structural, and thermodynamic feature categories for plant pre-miRNA identification. The classifier achieves 92% accuracy, 94% specificity, and 90% sensitivity. We have further tested this model with other small non-coding RNA types and obtained 78% accuracy. Furthermore, we introduce a novel dataset to train and test machine learning models, addressing the overlapping data issue in the positive training and testing datasets presented in PlantMiRNAPred for the classification of real and pseudo-plant pre-miRNAs. The new dataset and the classifier, which can be used with any plant species, are deployed on a web server freely accessible at http://mirnafinder.shyaman.me/.
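The three figures reported above (accuracy, specificity, sensitivity) all derive from the same confusion matrix. The following is a minimal sketch of that arithmetic. The function name `binary_metrics` and the counts are hypothetical, chosen only so that the numbers line up with the reported percentages. They are not the study's actual test data.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total       # fraction of all calls that are correct
    sensitivity = tp / (tp + fn)       # true positive rate: real pre-miRNAs recovered
    specificity = tn / (tn + fp)       # true negative rate: pseudo pre-miRNAs rejected
    return accuracy, sensitivity, specificity


# Hypothetical counts (assumption): 100 real and 100 pseudo pre-miRNAs.
acc, sens, spec = binary_metrics(tp=90, tn=94, fp=6, fn=10)
print(f"accuracy={acc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

With equal numbers of positives and negatives, accuracy is simply the mean of sensitivity and specificity, which is why these illustrative counts are consistent with all three reported values at once.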
Subjects
MicroRNAs , RNA Precursors , Animals , Computational Biology/methods , Machine Learning , MicroRNAs/chemistry , MicroRNAs/genetics , Plants/genetics , RNA Precursors/chemistry , RNA Precursors/genetics
ABSTRACT
Camelina [Camelina sativa (L.) Crantz] is an oilseed crop in the Brassicaceae family that is currently being developed as a source of bioenergy and healthy fatty acids. To facilitate modern breeding efforts through marker-assisted selection and biotechnology, we evaluated genetic variation among a worldwide collection of 222 camelina accessions. We performed whole-genome resequencing to obtain single nucleotide polymorphism (SNP) markers and to analyze genomic diversity. We also conducted phenotypic field evaluations in two consecutive seasons for variations in key agronomic traits related to oilseed production, such as seed size, oil content (OC), fatty acid composition, and flowering time. We determined the population structure of the camelina accessions using 161,301 SNPs. Further, we identified quantitative trait loci (QTL) and candidate genes controlling the above field-evaluated traits by genome-wide association studies (GWAS) complemented with linkage mapping using a recombinant inbred line (RIL) population. Characterization of the natural variation at the genome and phenotypic levels provides valuable resources for camelina genetic studies and crop improvement. The QTL and candidate genes should assist in the breeding of advanced camelina varieties that can be integrated into cropping systems to produce high yields of oils with the desired fatty acid composition.
Subjects
Brassicaceae , Quantitative Trait Loci , Brassicaceae/genetics , Dissection , Genome-Wide Association Study , Plant Breeding
ABSTRACT
Research Domain Criteria (RDoC), which is a recently introduced framework for mental illness, utilizes various units of analysis from genetics, neural circuits, etc., for accurate multi-dimensional classification of mental illnesses. Due to the large amount of relevant biomedical research available, automating the process of extracting evidence from the literature to assist with the curation of the RDoC matrix is essential for processing the full breadth of data in an accurate and cost-effective manner. In this work, we formulate the task of information retrieval of brain research literature from general PubMed abstracts. We develop BRret (Brain Research retriever), a novel algorithm for brain research related article retrieval. We use a large dataset of PubMed abstracts annotated with RDoC concepts to demonstrate the effectiveness of BRret. To the best of our knowledge, this is the first study aimed at automated retrieval of brain research related literature.
ABSTRACT
BACKGROUND: MicroRNAs (miRNAs) play a vital role as post-transcriptional regulators in gene expression. Experimental determination of miRNA sequence and structure is both expensive and time-consuming. The next-generation sequencing revolution, which facilitated the rapid accumulation of biological data, has brought biology into the "big data" domain. As such, developing computational methods to predict miRNAs has become an active area of interdisciplinary research. OBJECTIVE: The objective of this systematic review is to focus on the developments of ab initio plant miRNA identification methods over the last decade. DATA SOURCES: Five databases were searched for relevant articles, according to a well-defined review protocol. STUDY SELECTION: The search results were further filtered using selection criteria that included only studies on novel plant miRNA identification using machine learning. DATA EXTRACTION: Relevant data from each study were extracted in order to analyze their methodologies and findings. RESULTS: In the last decade, 20 articles were published on novel miRNA identification methods in plants, of which only 11 focused primarily on plant microRNA identification. Our findings suggest a need for more stringent plant-focused miRNA identification studies. CONCLUSION: Overall, the study accuracies are of a satisfactory level, although the methods may generate a considerable number of false negatives. In the future, attention must be paid to the biological plausibility of computationally identified miRNAs to prevent further propagation of biologically questionable miRNA sequences.
ABSTRACT
We consider the problem of identifying regions within a pan-genome De Bruijn graph that are traversed by many sequence paths. We define such regions and the subpaths that traverse them as frequented regions (FRs). In this work, we formalize the FR problem and describe an efficient algorithm for finding FRs. Subsequently, we propose some applications of FRs based on machine-learning and pan-genome graph simplification. We demonstrate the effectiveness of these applications using data sets for the organisms Staphylococcus aureus (bacterium) and Saccharomyces cerevisiae (yeast). We corroborate the biological relevance of FRs such as identifying introgressions in yeast that aid in alcohol tolerance, and show that FRs are useful for classification of yeast strains by industrial use and visualizing pan-genomic space.
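To make the idea of "traversed by many sequence paths" concrete, here is a deliberately simplified sketch. It is not the authors' FR algorithm, which works on regions and subpaths. It only treats each k-mer as a De Bruijn graph node and reports single nodes supported by at least a chosen number of input sequences. The function names, the `min_support` parameter, and the toy sequences are all assumptions made for illustration.

```python
from collections import defaultdict


def node_support(sequences, k):
    """Map each k-mer (one De Bruijn graph node) to the set of sequences traversing it."""
    support = defaultdict(set)
    for name, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            support[seq[i:i + k]].add(name)
    return support


def frequent_nodes(sequences, k, min_support):
    """Return the k-mers traversed by at least `min_support` distinct sequences."""
    support = node_support(sequences, k)
    return {kmer for kmer, names in support.items() if len(names) >= min_support}


# Toy input (assumption): three short sequences standing in for genome paths.
seqs = {"s1": "ACGTAC", "s2": "ACGTTT", "s3": "GGGTAC"}
shared = sorted(frequent_nodes(seqs, k=3, min_support=2))
print(shared)
```

A full treatment would group adjacent well-supported nodes into regions and tolerate small gaps along each path. This sketch stops at per-node support to keep the counting step visible.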
Subjects
Genome/genetics , Genomics/methods , Sequence Analysis, DNA/methods , Algorithms , Computer Graphics , Databases, Genetic , Saccharomyces cerevisiae/genetics , Staphylococcus aureus/genetics
ABSTRACT
BACKGROUND: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. RESULTS: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. CONCLUSION: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.
Subjects
Molecular Sequence Annotation/trends , Animals , Biofilms , Candida albicans/genetics , Drosophila melanogaster/genetics , Genome, Bacterial , Genome, Fungal , Humans , Locomotion , Memory, Long-Term , Molecular Sequence Annotation/methods , Pseudomonas aeruginosa/genetics
ABSTRACT
The text-mining services for kinome curation track, part of BioCreative VI, proposed a competition to assess the effectiveness of text mining for literature triage. The track exploited an unpublished curated data set from the neXtProt database. This data set contained comprehensive annotations for 300 human protein kinases. For a given protein and a given curation axis [diseases or gene ontology (GO) biological processes], participants' systems had to identify and rank relevant articles in a collection of 5.2 million MEDLINE citations (task 1) or 530 000 full-text articles (task 2). Explored strategies comprised named-entity recognition and machine-learning frameworks. For the latter approach, participants developed methods to derive a set of negative instances, as the databases typically do not store articles that were judged irrelevant by curators. The supervised approaches proposed by the participating groups achieved significant improvements compared to the baseline established in a previous study and compared to a basic PubMed search.
Subjects
Data Mining , Protein Kinases/metabolism , Databases, Factual , Humans , Periodicals as Topic
ABSTRACT
BACKGROUND: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. RESULTS: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. CONCLUSIONS: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and that predictions in the biological process and human phenotype ontologies are relatively diverse. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.
Subjects
Computational Biology , Proteins/chemistry , Software , Structure-Activity Relationship , Algorithms , Databases, Protein , Gene Ontology , Humans , Molecular Sequence Annotation , Proteins/genetics
ABSTRACT
Most computational methods that predict protein function do not take advantage of the large amount of information contained in the biomedical literature. In this work we evaluate both ontology term co-mention and bag-of-words features mined from the biomedical literature and analyze their impact in the context of a structured output support vector machine model, GOstruct. We find that even simple literature-based features are useful for predicting human protein function (F-max: Molecular Function = 0.408, Biological Process = 0.461, Cellular Component = 0.608). One advantage of using literature features is their ability to offer easy verification of automated predictions. We find through manual inspection of misclassifications that some false positive predictions could be biologically valid predictions based upon support extracted from the literature. Additionally, we present a "medium-throughput" pipeline that was used to annotate a large subset of co-mentions; we suggest that this strategy could help to speed up the rate at which proteins are curated.
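The F-max values quoted above summarize a whole precision-recall curve in one number. The sketch below assumes the protein-centric definition commonly used in function-prediction assessments such as CAFA: at each score threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all proteins, and F-max is the best harmonic mean across thresholds. Whether the abstract's F-max was computed in exactly this way is an assumption. The function name, the data layout, and the toy scores are hypothetical, and propagation of terms up the ontology is omitted.

```python
def f_max(predictions, truth, thresholds=None):
    """Protein-centric F-max.

    predictions: dict protein -> dict term -> score in [0, 1]
    truth:       dict protein -> non-empty set of true terms
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with a prediction at t
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(true_terms))
        if not precisions:
            continue  # nothing predicted at this threshold
        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Toy data (assumption): two proteins with placeholder term identifiers.
truth = {"P1": {"GO:A", "GO:B"}, "P2": {"GO:C"}}
predictions = {
    "P1": {"GO:A": 0.9, "GO:B": 0.4, "GO:C": 0.2},
    "P2": {"GO:C": 0.7, "GO:A": 0.6},
}
score = f_max(predictions, truth)
print(round(score, 3))
```

Because the maximum is taken over thresholds, F-max rewards a method for having some operating point with a good precision-recall balance, without requiring the user to pick that threshold in advance.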
ABSTRACT
The human phenotype ontology (HPO) was recently developed as a standardized vocabulary for describing the phenotype abnormalities associated with human diseases. At present, only a small fraction of human protein-coding genes have HPO annotations. However, researchers believe that a large portion of currently unannotated genes are related to disease phenotypes. Therefore, it is important to predict gene-HPO term associations using accurate computational methods. In this work we demonstrate the performance advantage of the structured SVM approach, which was shown to be highly effective for Gene Ontology term prediction, in comparison to several baseline methods. Furthermore, we highlight a collection of informative data sources suitable for the problem of predicting gene-HPO associations, including large-scale literature mining data.
ABSTRACT
BACKGROUND: The recently held Critical Assessment of Functional Annotation challenge (CAFA2) required its participants to submit predictions for a large number of target proteins regardless of whether they have previous annotations or not. This is in contrast to the original CAFA challenge, in which participants were asked to submit predictions for proteins with no existing annotations. The CAFA2 task is more realistic, in that it more closely mimics the accumulation of annotations over time. In this study we compare these tasks in terms of their difficulty, and determine whether cross-validation provides a good estimate of performance. RESULTS: The CAFA2 task is a combination of two subtasks: making predictions on annotated proteins and making predictions on previously unannotated proteins. In this study we analyze the performance of several function prediction methods in these two scenarios. Our results show that several methods (structured support vector machine, binary support vector machines and guilt-by-association methods) do not usually achieve the same level of accuracy on these two tasks as that achieved by cross-validation, and that predicting novel annotations for previously annotated proteins is a harder problem than predicting annotations for uncharacterized proteins. We also find that different methods have different performance characteristics in these tasks, and that cross-validation is not adequate for estimating performance or ranking methods. CONCLUSIONS: These results have implications for the design of computational experiments in the area of automated function prediction and can provide useful insight for the understanding and design of future CAFA competitions.