ABSTRACT
Over 10,000 rare genetic diseases have been identified, and millions of newborns are affected by severe rare genetic diseases each year. A variety of Human Phenotype Ontology (HPO)-based clinical decision support systems (CDSS) and patient repositories have been developed to support clinicians in diagnosing patients with suspected rare genetic diseases. In September 2017, we released PubCaseFinder (https://pubcasefinder.dbcls.jp), a web-based CDSS that provides ranked lists of genetic and rare diseases using HPO-based phenotypic similarities, where top-listed diseases represent the most likely differential diagnosis. We also developed a Matchmaker Exchange (MME) application programming interface (API) to query PubCaseFinder, which has been adopted by several patient repositories. In this paper, we describe notable updates regarding PubCaseFinder, the GeneYenta matching algorithm implemented in PubCaseFinder, and the PubCaseFinder API. The updated GeneYenta matching algorithm improves the performance of the CDSS automated differential diagnosis function. Moreover, the updated PubCaseFinder and new API empower patient repositories participating in MME and medical professionals to actively use HPO-based resources.
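The ranking idea described in this abstract can be illustrated with a small sketch. This is not the GeneYenta algorithm or the PubCaseFinder API; it assumes a simple Jaccard overlap of ancestor-closed term sets, and the ontology, term names, and disease profiles below are hypothetical toy data rather than real HPO content.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy ontology: term -> parent terms (not real HPO content).
PARENTS: Dict[str, Set[str]] = {
    "seizure": {"neuro"},
    "ataxia": {"neuro"},
    "neuro": {"root"},
    "cataract": {"eye"},
    "eye": {"root"},
    "root": set(),
}

# Hypothetical disease profiles, each a set of phenotype terms.
DISEASES: Dict[str, Set[str]] = {
    "disease_A": {"seizure", "ataxia"},
    "disease_B": {"cataract"},
    "disease_C": {"seizure", "cataract"},
}


def with_ancestors(terms: Set[str]) -> Set[str]:
    """Return the given terms plus all of their ancestors in the toy ontology."""
    closed: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(PARENTS.get(term, set()))
    return closed


def similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of ancestor-closed term sets (an assumed, simplified measure)."""
    ca, cb = with_ancestors(a), with_ancestors(b)
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0


def rank_diseases(patient: Set[str], diseases: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Score every disease against the patient and sort best match first."""
    scored = [(name, similarity(patient, terms)) for name, terms in diseases.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for name, score in rank_diseases({"seizure", "cataract"}, DISEASES):
        print(f"{name}\t{score:.2f}")
```

Closing each term set over its ancestors lets two different but related terms contribute a partial match through their shared parent, which is the general motivation for ontology-based rather than exact-term comparison.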
Subject(s)
Databases, Genetic; Software; Algorithms; Humans; Infant, Newborn; Phenotype; Rare Diseases/genetics
ABSTRACT
Recently, to speed up the differential-diagnosis process based on symptoms and signs observed from an affected individual in the diagnosis of rare diseases, researchers have developed and implemented phenotype-driven differential-diagnosis systems. The performance of those systems relies on the quantity and quality of underlying databases of disease-phenotype associations (DPAs). Although such databases are often developed by manual curation, they inherently suffer from limited coverage. To address this problem, we propose a text-mining approach to increase the coverage of DPA databases and consequently improve the performance of differential-diagnosis systems. Our analysis showed that a text-mining approach using one million case reports obtained from PubMed could increase the coverage of manually curated DPAs in Orphanet by 125.6%. We also present PubCaseFinder (see Web Resources), a new phenotype-driven differential-diagnosis system in a freely available web application. By utilizing automatically extracted DPAs from case reports in addition to manually curated DPAs, PubCaseFinder improves the performance of automated differential diagnosis. Moreover, PubCaseFinder helps clinicians search for relevant case reports by using phenotype-based comparisons and confirm the results with detailed contextual information.
Subject(s)
Rare Diseases/diagnosis; Rare Diseases/genetics; Data Mining/methods; Databases, Genetic; Diagnosis, Differential; Humans; Phenotype
ABSTRACT
MOTIVATION: Most currently available text mining tools share two characteristics that make them less than optimal for use by biomedical researchers: they require extensive specialist skills in natural language processing and they were built on the assumption that they should optimize global performance metrics on representative datasets. This is a problem because most end-users are not natural language processing specialists and because biomedical researchers often care less about global metrics like F-measure or representative datasets than they do about more granular metrics such as precision and recall on their own specialized datasets. Thus, there are fundamental mismatches between the assumptions of much text mining work and the preferences of potential end-users. RESULTS: This article introduces the concept of Agile text mining, and presents the PubAnnotation ecosystem as an example implementation. The system approaches the problems from two perspectives: it allows the reformulation of text mining by biomedical researchers from the task of assembling a complete system to the task of retrieving warehoused annotations, and it makes it possible to do very targeted customization of the pre-existing system to address specific end-user requirements. Two use cases are presented: assisted curation of the GlycoEpitope database, and assessing coverage in the literature of pre-eclampsia-associated genes. AVAILABILITY AND IMPLEMENTATION: The three tools that make up the ecosystem, PubAnnotation, PubDictionaries and TextAE are publicly available as web services, and also as open source projects. The dictionaries and the annotation datasets associated with the use cases are all publicly available through PubDictionaries and PubAnnotation, respectively.
Subject(s)
Computational Biology; Ecosystem; Data Mining; Female; Humans; Natural Language Processing; Pregnancy; PubMed
ABSTRACT
BACKGROUND: microRNAs (miRNAs) are tiny endogenous RNAs that have been discovered in animals and plants, and direct the post-transcriptional regulation of target mRNAs for degradation or translational repression via binding to the 3'UTRs and the coding exons. To gain insight into the biological role of miRNAs, it is essential to identify the full repertoire of mRNA targets (target genes). A number of computer programs have been developed for miRNA-target prediction. These programs essentially focus on potential binding sites in 3'UTRs, which are recognized by miRNAs according to specific base-pairing rules. RESULTS: Here, we introduce a novel method for miRNA-target prediction that is entirely independent of existing approaches. The method is based on the hypothesis that transcription of a miRNA and its target genes tends to be co-regulated by common transcription factors. This hypothesis predicts the frequent occurrence of common cis-elements between promoters of a miRNA and its target genes. Accordingly, our proposed method first identifies putative cis-elements in the promoter of a given miRNA, and then identifies genes that contain common putative cis-elements in their promoters. In this paper, we show that a significant number of common cis-elements occur in ~28% of experimentally supported human miRNA-target data. Moreover, we show that the prediction of human miRNA targets based on our method is statistically significant. Further, we discuss the random incidence of common cis-elements, their consensus sequences, and the advantages and disadvantages of our method. CONCLUSIONS: This is the first report indicating the prevalence of transcriptional regulation of a miRNA and its target genes by common transcription factors, and the ability to predict miRNA targets based on this property.
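The two-step procedure described in this abstract (collect putative cis-elements in the miRNA promoter, then find genes whose promoters share them) can be sketched as a set intersection. This is only an illustration: the element labels, gene names, and the `min_shared` cutoff are hypothetical, and a fixed count threshold stands in for the statistical significance test the abstract refers to.

```python
from typing import Dict, List, Set, Tuple


def predict_targets(
    mirna_elements: Set[str],
    gene_elements: Dict[str, Set[str]],
    min_shared: int = 2,  # assumed cutoff, not taken from the paper
) -> List[Tuple[str, int]]:
    """Return genes sharing at least `min_shared` cis-elements with the miRNA promoter."""
    hits: List[Tuple[str, int]] = []
    for gene, elements in gene_elements.items():
        shared = len(mirna_elements & elements)
        if shared >= min_shared:
            hits.append((gene, shared))
    # Most shared elements first; ties broken alphabetically for stable output.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Hypothetical toy inputs: putative cis-elements found in each promoter.
MIRNA_PROMOTER: Set[str] = {"E1", "E2", "E3"}
GENE_PROMOTERS: Dict[str, Set[str]] = {
    "geneX": {"E1", "E2", "E4"},
    "geneY": {"E3"},
    "geneZ": {"E1", "E2", "E3"},
}

if __name__ == "__main__":
    print(predict_targets(MIRNA_PROMOTER, GENE_PROMOTERS))
```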
Subject(s)
Computational Biology/methods; Gene Expression Regulation; MicroRNAs/genetics; Promoter Regions, Genetic/genetics; Animals; Consensus Sequence; Databases, Nucleic Acid; Humans; MicroRNAs/classification; Transcription Factors/metabolism
ABSTRACT
For researchers, writing a paper is an essential task, and an environment that facilitates the writing process is crucial. Writing in English poses an additional difficulty for many non-native English speakers. The Database Center for Life Science (DBCLS) provides researchers in the life sciences with several text-mining related services, such as Allie and inMeXes, which were developed to facilitate paper writing. Allie is an abbreviation database that shows researchers the expanded forms of abbreviations together with related data, such as the papers that contain the abbreviations and their corresponding expanded forms. Because a large number of abbreviations are coined, remembering their meanings is difficult, even within one's own research field. Allie therefore helps researchers look up abbreviations. inMeXes is an incremental search service for English phrases appearing in PubMed. Researchers can learn English phrases used in life science papers, such as the use of prepositions or widely used phrases that contain a specific word. Allie and inMeXes are updated monthly and yearly, respectively, to provide the latest information.
ABSTRACT
AIMS: Monogenic diabetes is clinically heterogeneous and differs from common forms of diabetes (type 1 and 2). We aimed to investigate the clinical usefulness of a comprehensive genetic testing system in patients with monogenic diabetes. The system combines targeted next-generation sequencing (NGS) with phenotype-driven bioinformatics analysis, which uses patient genotypic and phenotypic data to prioritize potentially causal variants. METHODS: We performed targeted NGS of 383 genes associated with monogenic diabetes or common forms of diabetes in 13 Japanese patients with suspected (n = 10) or previously diagnosed (n = 3) monogenic diabetes or severe insulin resistance. We performed in silico structural analysis and phenotype-driven bioinformatics analysis of candidate variants from NGS data. RESULTS: Among the patients suspected of having monogenic diabetes or insulin resistance, we diagnosed 3 patients with subtypes of monogenic diabetes due to disease-associated variants of INSR, LMNA, and HNF1B. Additionally, in 3 other patients, we detected rare variants with potential phenotypic effects. Notably, we identified a novel missense variant in TBC1D4 and an MC4R variant, which together may cause a mixed phenotype of severe insulin resistance. CONCLUSIONS: This comprehensive approach could assist in the early diagnosis of patients with monogenic diabetes and facilitate the provision of tailored therapy.
Subject(s)
Diabetes Mellitus/diagnosis; Diabetes Mellitus/genetics; Genetic Testing/methods; Insulin Resistance/genetics; Adolescent; Adult; Aged; Computational Biology; Female; GTPase-Activating Proteins/genetics; Genotype; High-Throughput Nucleotide Sequencing; Humans; Infant; Japan; Male; Mass Screening/methods; Middle Aged; Mutation, Missense; Phenotype; Young Adult
ABSTRACT
BACKGROUND: To promote research activities in a particular research area, it is important to efficiently identify current research trends, advances, and issues in that area. Although review papers generally suffice for this purpose, researchers cannot always find a review that covers the aspects they are interested in at the time they need it. Therefore, the utilization of the citation contexts of papers in a research area has been considered as another approach. However, there are few search services to retrieve citation contexts in the life sciences domain; furthermore, efficiently obtaining citation contexts is becoming difficult due to the large volume and rapid growth of life sciences papers. RESULTS: Here, we introduce the Colil (Comments on Literature in Literature) database to store citation contexts in the life sciences domain. By using the Resource Description Framework (RDF) and a newly compiled vocabulary, we built the Colil database and made it available through a SPARQL endpoint. In addition, we developed a web-based search service called Colil that searches for a cited paper in the Colil database and then returns a list of citation contexts for it along with papers relevant to it based on co-citations. The citation contexts in the Colil database were extracted from full-text papers of the PubMed Central Open Access Subset (PMC-OAS), which includes 545,147 papers indexed in PubMed. These papers are distributed across 3,171 journals and cite 5,136,741 unique papers that correspond to approximately 25% of total PubMed entries. CONCLUSIONS: By utilizing Colil, researchers can easily refer to a set of citation contexts and relevant papers based on co-citations for a target paper. Colil helps researchers to comprehend life sciences papers in a research area more efficiently and makes their biological research more efficient.
ABSTRACT
BACKGROUND: Biological databases vary enormously in size and data complexity, from small databases that contain a few million Resource Description Framework (RDF) triples to large databases that contain billions of triples. In this paper, we evaluate whether RDF native stores can be used to meet the needs of a biological database provider. Prior evaluations have used synthetic data with a limited database size. For example, the largest BSBM benchmark uses 1 billion synthetic e-commerce knowledge RDF triples on a single node. However, real-world biological data differs considerably from such simple synthetic data, and it is difficult to determine whether synthetic e-commerce data is adequate to represent biological databases. Therefore, for this evaluation, we used five real data sets from biological databases. RESULTS: We evaluated five triple stores, 4store, Bigdata, Mulgara, Virtuoso, and OWLIM-SE, with five biological data sets, Cell Cycle Ontology, Allie, PDBj, UniProt, and DDBJ, ranging in size from approximately 10 million to 8 billion triples. For each database, we loaded all the data into our single node and prepared the database for use in a classical data warehouse scenario. Then, we ran a series of SPARQL queries against each endpoint and recorded the execution time and the accuracy of the query response. CONCLUSIONS: Our paper shows that, with appropriate configuration, Virtuoso and OWLIM-SE can satisfy the basic requirements to load and query biological data of fewer than roughly 8 billion triples on a single node, with simultaneous access by 64 clients. OWLIM-SE performs best for databases with approximately 11 million triples. For data sets that contain 94 million and 590 million triples, OWLIM-SE and Virtuoso perform best, and neither shows an overwhelming advantage over the other. For data over 4 billion triples, Virtuoso works best. 4store performs well on small data sets with limited features when the number of triples is less than 100 million, but our test shows that its scalability is poor. Bigdata demonstrates average performance and is a good open source triple store for middle-sized (500 million or so) data sets. Mulgara shows some fragility.
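The measurement step described in this abstract (run a series of queries against an endpoint and record execution time) can be sketched as a small timing harness. This is an illustrative assumption, not the authors' benchmark code: `run_query` is a hypothetical callable that would wrap whatever SPARQL client is in use, `fake_endpoint` is a stand-in that performs no network access, and the sketch records time only, not response accuracy.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def benchmark(
    run_query: Callable[[str], object],  # hypothetical wrapper around a SPARQL client
    queries: Dict[str, str],
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the mean wall-clock time in seconds for each named query."""
    results: Dict[str, float] = {}
    for name, text in queries.items():
        timings: List[float] = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(text)
            timings.append(time.perf_counter() - start)
        results[name] = mean(timings)
    return results


def fake_endpoint(query: str) -> List[dict]:
    """Stand-in for a real endpoint call; returns no rows and does no I/O."""
    return []


if __name__ == "__main__":
    sample = {"q1": "SELECT * WHERE { ?s ?p ?o } LIMIT 1"}
    for name, seconds in benchmark(fake_endpoint, sample).items():
        print(f"{name}: {seconds:.6f} s")
```

Passing the query runner in as a callable keeps the harness independent of any particular triple store, so the same loop could be pointed at each store under comparison.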
ABSTRACT
The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.