Búsqueda | Portal Regional de la BVS

1.

OMD Curation Toolkit: a workflow for in-house curation of public omics datasets.

Piquer-Esteban, Samuel; Arnau, Vicente; Diaz, Wladimiro; Moya, Andrés.

BMC Bioinformatics ; 25(1): 184, 2024 May 09.

Artículo en Inglés | MEDLINE | ID: mdl-38724907

RESUMEN

BACKGROUND: Major advances in sequencing technologies and the sharing of data and metadata in science have resulted in a wealth of publicly available datasets. However, working with and especially curating public omics datasets remains challenging despite these efforts. While a growing number of initiatives aim to re-use previous results, these present limitations that often lead to the need for further in-house curation and processing. RESULTS: Here, we present the Omics Dataset Curation Toolkit (OMD Curation Toolkit), a python3 package designed to accompany and guide the researcher during the curation process of metadata and fastq files of public omics datasets. This workflow provides a standardized framework with multiple capabilities (collection, control check, treatment and integration) to facilitate the arduous task of curating public sequencing data projects. While centered on the European Nucleotide Archive (ENA), the majority of the provided tools are generic and can be used to curate datasets from different sources. CONCLUSIONS: Thus, it offers valuable tools for the in-house curation previously needed to re-use public omics data. Due to its workflow structure and capabilities, it can be easily used and benefit investigators in developing novel omics meta-analyses based on sequencing data.

Asunto(s)

Curaduría de Datos , Programas Informáticos , Flujo de Trabajo , Curaduría de Datos/métodos , Metadatos , Bases de Datos Genéticas , Genómica/métodos , Biología Computacional/métodos

2.

PMBC: a manually curated database for prognostic markers of breast cancer.

Liu, Jiabei; Yu, Yiyi; Li, Mingyue; Wu, Yixuan; Chen, Weijun; Liu, Guanru; Liu, Lingxian; Lin, Jiechun; Peng, Chujun; Sun, Weijun; Wu, Xiaoli; Chen, Xin.

Database (Oxford) ; 20242024 May 15.

Artículo en Inglés | MEDLINE | ID: mdl-38748636

RESUMEN

Breast cancer is notorious for its high mortality and heterogeneity, resulting in different therapeutic responses. Classical biomarkers have been identified and successfully commercially applied to predict the outcome of breast cancer patients. Accumulating biomarkers, including non-coding RNAs, have been reported as prognostic markers for breast cancer with the development of sequencing techniques. However, there are currently no databases dedicated to the curation and characterization of prognostic markers for breast cancer. Therefore, we constructed a curated database for prognostic markers of breast cancer (PMBC). PMBC consists of 1070 markers covering mRNAs, lncRNAs, miRNAs and circRNAs. These markers are enriched in various cancer- and epithelial-related functions including mitogen-activated protein kinases signaling. We mapped the prognostic markers into the ceRNA network from starBase. The lncRNA NEAT1 competes with 11 RNAs, including lncRNAs and mRNAs. The majority of the ceRNAs in ABAT belong to pseudogenes. The topology analysis of the ceRNA network reveals that known prognostic RNAs have higher closeness than random. Among all the biomarkers, prognostic lncRNAs have a higher degree, while prognostic mRNAs have significantly higher closeness than random RNAs. These results indicate that the lncRNAs play important roles in maintaining the interactions between lncRNAs and their ceRNAs, which might be used as a characteristic to prioritize prognostic lncRNAs based on the ceRNA network. PMBC renders a user-friendly interface and provides detailed information about individual prognostic markers, which will facilitate the precision treatment of breast cancer. PMBC is available at the following URL: http://www.pmbreastcancer.com/.

Asunto(s)

Biomarcadores de Tumor , Neoplasias de la Mama , Bases de Datos Genéticas , Humanos , Neoplasias de la Mama/genética , Neoplasias de la Mama/metabolismo , Femenino , Biomarcadores de Tumor/genética , Pronóstico , ARN Largo no Codificante/genética , Redes Reguladoras de Genes , Curaduría de Datos/métodos , ARN Mensajero/genética , ARN Mensajero/metabolismo , Regulación Neoplásica de la Expresión Génica

3.

A study on formalizing the knowledge of data curation activities across different fields.

Minamiyama, Yasuyuki; Takeda, Hideaki; Hayashi, Masaharu; Asaoka, Makoto; Yamaji, Kazutsuna.

PLoS One ; 19(4): e0301772, 2024.

Artículo en Inglés | MEDLINE | ID: mdl-38662657

RESUMEN

In recent years, with the trend of open science, there have been many efforts to share research data on the internet. To promote research data sharing, data curation is essential to make the data interpretable and reusable. In research fields such as life sciences, earth sciences, and social sciences, tasks and procedures have been already developed to implement efficient data curation to meet the needs and customs of individual research fields. However, not only data sharing within research fields but also interdisciplinary data sharing is required to promote open science. For this purpose, knowledge of data curation across the research fields is surveyed, analyzed, and organized as an ontology in this paper. As the survey, existing vocabularies and procedures are collected and compared as well as interviews with the data curators in research institutes in different fields are conducted to clarify commonalities and differences in data curation across the research fields. It turned out that the granularity of tasks and procedures that constitute the building blocks of data curation is not formalized. Without a method to overcome this gap, it will be challenging to promote interdisciplinary reuse of research data. Based on the analysis above, the ontology for the data curation process is proposed to describe data curation processes in different fields universally. It is described by OWL and shown as valid and consistent from the logical viewpoint. The ontology successfully represents data curation activities as the processes in the different fields acquired by the interviews. It is also helpful to identify the functions of the systems to support the data curation process. This study contributes to building a knowledge framework for an interdisciplinary understanding of data curation activities in different fields.

Asunto(s)

Curaduría de Datos , Difusión de la Información , Curaduría de Datos/métodos , Difusión de la Información/métodos , Humanos , Conocimiento , Internet

4.

Prediction and curation of missing biomedical identifier mappings with Biomappings.

Hoyt, Charles Tapley; Hoyt, Amelia L; Gyori, Benjamin M.

Bioinformatics ; 39(4)2023 04 03.

Artículo en Inglés | MEDLINE | ID: mdl-36916735

RESUMEN

MOTIVATION: Biomedical identifier resources (such as ontologies, taxonomies, and controlled vocabularies) commonly overlap in scope and contain equivalent entries under different identifiers. Maintaining mappings between these entries is crucial for interoperability and the integration of data and knowledge. However, there are substantial gaps in available mappings motivating their semi-automated curation. RESULTS: Biomappings implements a curation workflow for missing mappings which combines automated prediction with human-in-the-loop curation. It supports multiple prediction approaches and provides a web-based user interface for reviewing predicted mappings for correctness, combined with automated consistency checking. Predicted and curated mappings are made available in public, version-controlled resource files on GitHub. Biomappings currently makes available 9274 curated mappings and 40 691 predicted ones, providing previously missing mappings between widely used identifier resources covering small molecules, cell lines, diseases, and other concepts. We demonstrate the value of Biomappings on case studies involving predicting and curating missing mappings among cancer cell lines as well as small molecules tested in clinical trials. We also present how previously missing mappings curated using Biomappings were contributed back to multiple widely used community ontologies. AVAILABILITY AND IMPLEMENTATION: The data and code are available under the CC0 and MIT licenses at https://github.com/biopragmatics/biomappings.

Asunto(s)

Curaduría de Datos , Vocabulario Controlado , Humanos , Curaduría de Datos/métodos , Programas Informáticos , Interfaz Usuario-Computador

5.

An Efficient Semi-Supervised Framework with Multi-Task and Curriculum Learning for Medical Image Segmentation.

Wang, Kaiping; Wang, Yan; Zhan, Bo; Yang, Yujie; Zu, Chen; Wu, Xi; Zhou, Jiliu; Nie, Dong; Zhou, Luping.

Int J Neural Syst ; 32(9): 2250043, 2022 Sep.

Artículo en Inglés | MEDLINE | ID: mdl-35912583

RESUMEN

A practical problem in supervised deep learning for medical image segmentation is the lack of labeled data which is expensive and time-consuming to acquire. In contrast, there is a considerable amount of unlabeled data available in the clinic. To make better use of the unlabeled data and improve the generalization on limited labeled data, in this paper, a novel semi-supervised segmentation method via multi-task curriculum learning is presented. Here, curriculum learning means that when training the network, simpler knowledge is preferentially learned to assist the learning of more difficult knowledge. Concretely, our framework consists of a main segmentation task and two auxiliary tasks, i.e. the feature regression task and target detection task. The two auxiliary tasks predict some relatively simpler image-level attributes and bounding boxes as the pseudo labels for the main segmentation task, enforcing the pixel-level segmentation result to match the distribution of these pseudo labels. In addition, to solve the problem of class imbalance in the images, a bounding-box-based attention (BBA) module is embedded, enabling the segmentation network to concern more about the target region rather than the background. Furthermore, to alleviate the adverse effects caused by the possible deviation of pseudo labels, error tolerance mechanisms are also adopted in the auxiliary tasks, including inequality constraint and bounding-box amplification. Our method is validated on ACDC2017 and PROMISE12 datasets. Experimental results demonstrate that compared with the full supervision method and state-of-the-art semi-supervised methods, our method yields a much better segmentation performance on a small labeled dataset. Code is available at https://github.com/DeepMedLab/MTCL.

Asunto(s)

Curriculum , Aprendizaje Automático Supervisado , Curaduría de Datos/métodos , Curaduría de Datos/normas , Conjuntos de Datos como Asunto/normas , Conjuntos de Datos como Asunto/provisión & distribución , Procesamiento de Imagen Asistido por Computador/métodos , Aprendizaje Automático Supervisado/clasificación , Aprendizaje Automático Supervisado/estadística & datos numéricos , Aprendizaje Automático Supervisado/tendencias

6.

PeakForest: a multi-platform digital infrastructure for interoperable metabolite spectral data and metadata management.

Paulhe, Nils; Canlet, Cécile; Damont, Annelaure; Peyriga, Lindsay; Durand, Stéphanie; Deborde, Catherine; Alves, Sandra; Bernillon, Stephane; Berton, Thierry; Bir, Raphael; Bouville, Alyssa; Cahoreau, Edern; Centeno, Delphine; Costantino, Robin; Debrauwer, Laurent; Delabrière, Alexis; Duperier, Christophe; Emery, Sylvain; Flandin, Amelie; Hohenester, Ulli; Jacob, Daniel; Joly, Charlotte; Jousse, Cyril; Lagree, Marie; Lamari, Nadia; Lefebvre, Marie; Lopez-Piffet, Claire; Lyan, Bernard; Maucourt, Mickael; Migne, Carole; Olivier, Marie-Francoise; Rathahao-Paris, Estelle; Petriacq, Pierre; Pinelli, Julie; Roch, Léa; Roger, Pierrick; Roques, Simon; Tabet, Jean-Claude; Tremblay-Franco, Marie; Traïkia, Mounir; Warnet, Anna; Zhendre, Vanessa; Rolin, Dominique; Jourdan, Fabien; Thévenot, Etienne; Moing, Annick; Jamin, Emilien; Fenaille, François; Junot, Christophe; Pujos-Guillot, Estelle.

Metabolomics ; 18(6): 40, 2022 06 14.

Artículo en Inglés | MEDLINE | ID: mdl-35699774

RESUMEN

INTRODUCTION: Accuracy of feature annotation and metabolite identification in biological samples is a key element in metabolomics research. However, the annotation process is often hampered by the lack of spectral reference data in experimental conditions, as well as logistical difficulties in the spectral data management and exchange of annotations between laboratories. OBJECTIVES: To design an open-source infrastructure allowing hosting both nuclear magnetic resonance (NMR) and mass spectra (MS), with an ergonomic Web interface and Web services to support metabolite annotation and laboratory data management. METHODS: We developed the PeakForest infrastructure, an open-source Java tool with automatic programming interfaces that can be deployed locally to organize spectral data for metabolome annotation in laboratories. Standardized operating procedures and formats were included to ensure data quality and interoperability, in line with international recommendations and FAIR principles. RESULTS: PeakForest is able to capture and store experimental spectral MS and NMR metadata as well as collect and display signal annotations. This modular system provides a structured database with inbuilt tools to curate information, browse and reuse spectral information in data treatment. PeakForest offers data formalization and centralization at the laboratory level, facilitating shared spectral data across laboratories and integration into public databases. CONCLUSION: PeakForest is a comprehensive resource which addresses a technical bottleneck, namely large-scale spectral data annotation and metabolite identification for metabolomics laboratories with multiple instruments. PeakForest databases can be used in conjunction with bespoke data analysis pipelines in the Galaxy environment, offering the opportunity to meet the evolving needs of metabolomics research. Developed and tested by the French metabolomics community, PeakForest is freely-available at https://github.com/peakforest .

Asunto(s)

Metabolómica , Metadatos , Curaduría de Datos/métodos , Espectrometría de Masas/métodos , Metaboloma , Metabolómica/métodos

7.

Raspberry Pi-Based Data Archival System for Electroencephalogram Signals From the SedLine Root Device.

Suresha, Pradyumna B; Robichaux, Chad J; Cassim, Tuan Z; García, Paul S; Clifford, Gari D.

Anesth Analg ; 134(2): 380-388, 2022 02 01.

Artículo en Inglés | MEDLINE | ID: mdl-34673658

RESUMEN

BACKGROUND: The retrospective analysis of electroencephalogram (EEG) signals acquired from patients under general anesthesia is crucial in understanding the patient's unconscious brain's state. However, the creation of such database is often tedious and cumbersome and involves human labor. Hence, we developed a Raspberry Pi-based system for archiving EEG signals recorded from patients under anesthesia in operating rooms (ORs) with minimal human involvement. METHODS: Using this system, we archived patient EEG signals from over 500 unique surgeries at the Emory University Orthopaedics and Spine Hospital, Atlanta, for about 18 months. For this, we developed a software package that runs on a Raspberry Pi and archives patient EEG signals from a SedLine Root EEG Monitor (Masimo) to a secure Health Insurance Portability and Accountability Act (HIPAA) compliant cloud storage. The OR number corresponding to each surgery was archived along with the EEG signal to facilitate retrospective EEG analysis. We retrospectively processed the archived EEG signals and performed signal quality checks. We also proposed a formula to compute the proportion of true EEG signal and calculated the corresponding statistics. Further, we curated and interleaved patient medical record information with the corresponding EEG signals. RESULTS: We retrospectively processed the EEG signals to demonstrate a statistically significant negative correlation between the relative alpha power (8-12 Hz) of the EEG signal captured under anesthesia and the patient's age. CONCLUSIONS: Our system is a standalone EEG archiver developed using low cost and readily available hardware. We demonstrated that one could create a large-scale EEG database with minimal human involvement. Moreover, we showed that the captured EEG signal is of good quality for retrospective analysis and combined the EEG signal with the patient medical records. This project's software has been released under an open-source license to enable others to use and contribute.

Asunto(s)

Curaduría de Datos/métodos , Electroencefalografía/instrumentación , Electroencefalografía/métodos , Monitoreo Intraoperatorio/instrumentación , Monitoreo Intraoperatorio/métodos , Adulto , Anciano , Anciano de 80 o más Años , Manejo de Datos/instrumentación , Manejo de Datos/métodos , Femenino , Humanos , Masculino , Persona de Mediana Edad , Estudios Retrospectivos , Adulto Joven

8.

Self-reporting data assets and their representation in the pharmaceutical industry.

Della Corte, Dennis; Colsman, Wolfgang; Fessenmayr, Heiko; Sawczuk da Silva, Alexandre; Vanderwall, Dana E.

Drug Discov Today ; 27(1): 207-214, 2022 01.

Artículo en Inglés | MEDLINE | ID: mdl-34332096

RESUMEN

Standardizing data is crucial for preserving and exchanging scientific information. In particular, recording the context in which data were created ensures that information remains findable, accessible, interoperable, and reusable. Here, we introduce the concept of self-reporting data assets (SRDAs), which preserve data and contextual information. SRDAs are an abstract concept, which requires a suitable data format for implementation. Four promising data formats or languages are popularly used to represent data in pharma: JCAMP-DX, JSON, AnIML, and, more recently, the Allotrope Data Format (ADF). Here, we evaluate these four options in common use cases within the pharmaceutical industry using multiple criteria. The evaluation shows that ADF is the most suitable format for the implementation of SRDAs.

Asunto(s)

Exactitud de los Datos , Curaduría de Datos , Industria Farmacéutica , Difusión de la Información/métodos , Proyectos de Investigación/normas , Curaduría de Datos/métodos , Curaduría de Datos/normas , Difusión de Innovaciones , Industria Farmacéutica/métodos , Industria Farmacéutica/organización & administración , Humanos , Prueba de Estudio Conceptual , Estándares de Referencia , Tecnología Farmacéutica/métodos

9.

Curating the Evidence About COVID-19 for Frontline Public Health and Clinical Care: The Novel Coronavirus Research Compendium.

Redd, Andrew D; Peetluk, Lauren S; Jarrett, Brooke A; Hanrahan, Colleen; Schwartz, Sheree; Rao, Amrita; Jaffe, Andrew E; Peer, Austin D; Jones, Carli B; Lutz, Chelsea S; McKee, Clifton D; Patel, Eshan U; Rosen, Joseph G; Garrison Desany, Henri; McKay, Heather S; Muschelli, John; Andersen, Kathleen M; Link, Malen A; Wada, Nikolas; Baral, Prativa; Young, Ruth; Boon, Denali; Grabowski, M Kate; Gurley, Emily S.

Public Health Rep ; 137(2): 197-202, 2022.

Artículo en Inglés | MEDLINE | ID: mdl-34969294

RESUMEN

The public health crisis created by the COVID-19 pandemic has spurred a deluge of scientific research aimed at informing the public health and medical response to the pandemic. However, early in the pandemic, those working in frontline public health and clinical care had insufficient time to parse the rapidly evolving evidence and use it for decision-making. Academics in public health and medicine were well-placed to translate the evidence for use by frontline clinicians and public health practitioners. The Novel Coronavirus Research Compendium (NCRC), a group of >60 faculty and trainees across the United States, formed in March 2020 with the goal to quickly triage and review the large volume of preprints and peer-reviewed publications on SARS-CoV-2 and COVID-19 and summarize the most important, novel evidence to inform pandemic response. From April 6 through December 31, 2020, NCRC teams screened 54 192 peer-reviewed articles and preprints, of which 527 were selected for review and uploaded to the NCRC website for public consumption. Most articles were peer-reviewed publications (n = 395, 75.0%), published in 102 journals; 25.1% (n = 132) of articles reviewed were preprints. The NCRC is a successful model of how academics translate scientific knowledge for practitioners and help build capacity for this work among students. This approach could be used for health problems beyond COVID-19, but the effort is resource intensive and may not be sustainable in the long term.

Asunto(s)

COVID-19 , Curaduría de Datos/métodos , Difusión de la Información/métodos , Investigación Interdisciplinaria/organización & administración , Revisión de la Investigación por Pares , Preimpresos como Asunto , SARS-CoV-2 , Humanos , Salud Pública , Estados Unidos

10.

Complex Portal 2022: new curation frontiers.

Meldal, Birgit H M; Perfetto, Livia; Combe, Colin; Lubiana, Tiago; Ferreira Cavalcante, João Vitor; Bye-A-Jee, Hema; Waagmeester, Andra; Del-Toro, Noemi; Shrivastava, Anjali; Barrera, Elisabeth; Wong, Edith; Mlecnik, Bernhard; Bindea, Gabriela; Panneerselvam, Kalpana; Willighagen, Egon; Rappsilber, Juri; Porras, Pablo; Hermjakob, Henning; Orchard, Sandra.

Nucleic Acids Res ; 50(D1): D578-D586, 2022 01 07.

Artículo en Inglés | MEDLINE | ID: mdl-34718729

RESUMEN

The Complex Portal (www.ebi.ac.uk/complexportal) is a manually curated, encyclopaedic database of macromolecular complexes with known function from a range of model organisms. It summarizes complex composition, topology and function along with links to a large range of domain-specific resources (i.e. wwPDB, EMDB and Reactome). Since the last update in 2019, we have produced a first draft complexome for Escherichia coli, maintained and updated that of Saccharomyces cerevisiae, added over 40 coronavirus complexes and increased the human complexome to over 1100 complexes that include approximately 200 complexes that act as targets for viral proteins or are part of the immune system. The display of protein features in ComplexViewer has been improved and the participant table is now colour-coordinated with the nodes in ComplexViewer. Community collaboration has expanded, for example by contributing to an analysis of putative transcription cofactors and providing data accessible to semantic web tools through Wikidata which is now populated with manually curated Complex Portal content through a new bot. Our data license is now CC0 to encourage data reuse. Users are encouraged to get in touch, provide us with feedback and send curation requests through the 'Support' link.

Asunto(s)

Curaduría de Datos/métodos , Bases de Datos de Proteínas , Complejos Multiproteicos/química , Coronavirus/química , Visualización de Datos , Bases de Datos de Compuestos Químicos , Enzimas/química , Enzimas/metabolismo , Escherichia coli/química , Humanos , Cooperación Internacional , Anotación de Secuencia Molecular , Complejos Multiproteicos/metabolismo , Interfaz Usuario-Computador

11.

Semi-Automated Data Curation from Biomedical Literature.

Rahman, Protiva; Fabbri, Daniel.

AMIA Annu Symp Proc ; 2022: 884-891, 2022.

Artículo en Inglés | MEDLINE | ID: mdl-37128469

RESUMEN

Data curation is a bottleneck for many informatics pipelines. A specific example of this is aggregating data from preclinical studies to identify novel genetic pathways for atherosclerosis in humans. This requires extracting data from published mouse studies such as the perturbed gene and its impact on lesion sizes and plaque inflammation, which is non-trivial. Curation efforts are resource-heavy, with curators manually extracting data from hundreds of publications. In this work, we describe the development of a semi-automated curation tool to accelerate data extraction. We use natural language processing (NLP) methods to auto-populate a web-based form which is then reviewed by a curator. We conducted a controlled user study to evaluate the curation tool. Our NLP model has a 70% accuracy on categorical fields and our curation tool accelerates task completion time by 49% compared to manual curation.

Asunto(s)

Curaduría de Datos , Procesamiento de Lenguaje Natural , Humanos , Animales , Ratones , Curaduría de Datos/métodos , Publicaciones

12.

ProCanBio: A Database of Manually Curated Biomarkers for Prostate Cancer.

Sapra, Dikscha; Kaur, Harpreet; Dhall, Anjali; Raghava, Gajendra P S.

J Comput Biol ; 28(12): 1248-1257, 2021 12.

Artículo en Inglés | MEDLINE | ID: mdl-34898255

RESUMEN

Prostate cancer (PCa) is the second lethal malignancy in men worldwide. In the past, numerous research groups investigated the omics profiles of patients and scrutinized biomarkers for the diagnosis and prognosis of PCa. However, information related to the biomarkers is widely scattered across numerous resources in complex textual format, which poses hindrance to understand the tumorigenesis of this malignancy and scrutinization of robust signature. To create a comprehensive resource, we collected all the relevant literature on PCa biomarkers from the PubMed. We scrutinize the extensive information about each biomarker from a total of 412 unique research articles. Each entry of the database incorporates PubMed ID, biomarker name, biomarker type, biomolecule, source, subjects, validation status, and performance measures such as sensitivity, specificity, and hazard ratio (HR). In this study, we present ProCanBio, a manually curated database that maintains detailed data on 2053 entries of potential PCa biomarkers obtained from 412 publications in user-friendly tabular format. Among them are 766 protein-based, 507 RNA-based, 157 genomic mutations, 260 miRNA-based, and 122 metabolites-based biomarkers. To explore the information in the resource, a web-based interactive platform was developed with searching and browsing facilities. To the best of the authors' knowledge, there is no resource that can consolidate the information contained in all the published literature. Besides this, ProCanBio is freely available and is compatible with most web browsers and devices. Eventually, we anticipate this resource will be highly useful for the research community involved in the area of prostate malignancy.

Asunto(s)

Biomarcadores de Tumor/genética , Biomarcadores de Tumor/metabolismo , Curaduría de Datos/métodos , Neoplasias de la Próstata/genética , Neoplasias de la Próstata/metabolismo , Bases de Datos Factuales , Redes Reguladoras de Genes , Humanos , Masculino , Metabolómica , MicroARNs/genética , Mutación , Pronóstico , Mapas de Interacción de Proteínas , Interfaz Usuario-Computador , Navegador Web

13.

A crowdsourcing open platform for literature curation in UniProt.

Wang, Yuqi; Wang, Qinghua; Huang, Hongzhan; Huang, Wei; Chen, Yongxing; McGarvey, Peter B; Wu, Cathy H; Arighi, Cecilia N.

PLoS Biol ; 19(12): e3001464, 2021 12.

Artículo en Inglés | MEDLINE | ID: mdl-34871295

RESUMEN

The UniProt knowledgebase is a public database for protein sequence and function, covering the tree of life and over 220 million protein entries. Now, the whole community can use a new crowdsourcing annotation system to help scale up UniProt curation and receive proper attribution for their biocuration work.

Asunto(s)

Colaboración de las Masas/métodos , Curaduría de Datos/métodos , Anotación de Secuencia Molecular/métodos , Secuencia de Aminoácidos/genética , Biología Computacional/métodos , Bases de Datos de Proteínas/tendencias , Humanos , Literatura , Proteínas/metabolismo , Participación de los Interesados

14.

A localization strategy combined with transfer learning for image annotation.

Chen, Zhiqiang; Rajamanickam, Leelavathi; Cao, Jianfang; Zhao, Aidi; Hu, Xiaohui.

PLoS One ; 16(12): e0260758, 2021.

Artículo en Inglés | MEDLINE | ID: mdl-34879097

RESUMEN

This study aims to solve the overfitting problem caused by insufficient labeled images in the automatic image annotation field. We propose a transfer learning model called CNN-2L that incorporates the label localization strategy described in this study. The model consists of an InceptionV3 network pretrained on the ImageNet dataset and a label localization algorithm. First, the pretrained InceptionV3 network extracts features from the target dataset that are used to train a specific classifier and fine-tune the entire network to obtain an optimal model. Then, the obtained model is used to derive the probabilities of the predicted labels. For this purpose, we introduce a squeeze and excitation (SE) module into the network architecture that augments the useful feature information, inhibits useless feature information, and conducts feature reweighting. Next, we perform label localization to obtain the label probabilities and determine the final label set for each image. During this process, the number of labels must be determined. The optimal K value is obtained experimentally and used to determine the number of predicted labels, thereby solving the empty label set problem that occurs when the predicted label values of images are below a fixed threshold. Experiments on the Corel5k multilabel image dataset verify that CNN-2L improves the labeling precision by 18% and 15% compared with the traditional multiple-Bernoulli relevance model (MBRM) and joint equal contribution (JEC) algorithms, respectively, and it improves the recall by 6% compared with JEC. Additionally, it improves the precision by 20% and 11% compared with the deep learning methods Weight-KNN and adaptive hypergraph learning (AHL), respectively. Although CNN-2L fails to improve the recall compared with the semantic extension model (SEM), it improves the comprehensive index of the F1 value by 1%. The experimental results reveal that the proposed transfer learning model based on a label localization strategy is effective for automatic image annotation and substantially boosts the multilabel image annotation performance.

Asunto(s)

Algoritmos , Curaduría de Datos/métodos , Aprendizaje Profundo , Procesamiento de Imagen Asistido por Computador/métodos , Redes Neurales de la Computación , Tomografía Computarizada por Rayos X/métodos , Humanos

15.

Metabolite discovery through global annotation of untargeted metabolomics data.

Chen, Li; Lu, Wenyun; Wang, Lin; Xing, Xi; Chen, Ziyang; Teng, Xin; Zeng, Xianfeng; Muscarella, Antonio D; Shen, Yihui; Cowan, Alexis; McReynolds, Melanie R; Kennedy, Brandon J; Lato, Ashley M; Campagna, Shawn R; Singh, Mona; Rabinowitz, Joshua D.

Nat Methods ; 18(11): 1377-1385, 2021 11.

Artículo en Inglés | MEDLINE | ID: mdl-34711973

RESUMEN

Liquid chromatography-high-resolution mass spectrometry (LC-MS)-based metabolomics aims to identify and quantify all metabolites, but most LC-MS peaks remain unidentified. Here we present a global network optimization approach, NetID, to annotate untargeted LC-MS metabolomics data. The approach aims to generate, for all experimentally observed ion peaks, annotations that match the measured masses, retention times and (when available) tandem mass spectrometry fragmentation patterns. Peaks are connected based on mass differences reflecting adduction, fragmentation, isotopes, or feasible biochemical transformations. Global optimization generates a single network linking most observed ion peaks, enhances peak assignment accuracy, and produces chemically informative peak-peak relationships, including for peaks lacking tandem mass spectrometry spectra. Applying this approach to yeast and mouse data, we identified five previously unrecognized metabolites (thiamine derivatives and N-glucosyl-taurine). Isotope tracer studies indicate active flux through these metabolites. Thus, NetID applies existing metabolomic knowledge and global optimization to substantially improve annotation coverage and accuracy in untargeted metabolomics datasets, facilitating metabolite discovery.

Asunto(s)

Algoritmos , Curaduría de Datos/normas , Hígado/metabolismo , Metaboloma , Metabolómica/normas , Saccharomyces cerevisiae/metabolismo , Animales , Cromatografía Liquida/métodos , Curaduría de Datos/métodos , Metabolómica/métodos , Ratones , Espectrometría de Masas en Tándem/métodos

16.

OGT Protein Interaction Network (OGT-PIN): A Curated Database of Experimentally Identified Interaction Proteins of OGT.

Ma, Junfeng; Hou, Chunyan; Li, Yaoxiang; Chen, Shufu; Wu, Ci.

Int J Mol Sci ; 22(17)2021 Sep 06.

Artículo en Inglés | MEDLINE | ID: mdl-34502531

RESUMEN

Interactions between proteins are essential to any cellular process and constitute the basis for molecular networks that determine the functional state of a cell. With the technical advances in recent years, an astonishingly high number of protein-protein interactions has been revealed. However, the interactome of O-linked N-acetylglucosamine transferase (OGT), the sole enzyme adding the O-linked ß-N-acetylglucosamine (O-GlcNAc) onto its target proteins, has been largely undefined. To that end, we collated OGT interaction proteins experimentally identified in the past several decades. Rigorous curation of datasets from public repositories and O-GlcNAc-focused publications led to the identification of up to 929 high-stringency OGT interactors from multiple species studied (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, and others). Among them, 784 human proteins were found to be interactors of human OGT. Moreover, these proteins spanned a very diverse range of functional classes (e.g., DNA repair, RNA metabolism, translational regulation, and cell cycle), with significant enrichment in regulating transcription and (co)translation. Our dataset demonstrates that OGT is likely a hub protein in cells. A webserver OGT-Protein Interaction Network (OGT-PIN) has also been created, which is freely accessible.

Asunto(s)

Acetilglucosamina/metabolismo , Curaduría de Datos/métodos , Bases de Datos de Proteínas/estadística & datos numéricos , N-Acetilglucosaminiltransferasas/metabolismo , Mapas de Interacción de Proteínas , Procesamiento Proteico-Postraduccional , Animales , Proteínas de Arabidopsis/metabolismo , Proteínas de Drosophila/metabolismo , Humanos , Ratones , Ratas

17.

Annotating cell types in human single-cell RNA-seq data with CellO.

Bernstein, Matthew N; Dewey, Colin N.

STAR Protoc ; 2(3): 100705, 2021 09 17.

Artículo en Inglés | MEDLINE | ID: mdl-34458864

RESUMEN

Cell type annotation is important in the analysis of single-cell RNA-seq data. CellO is a machine-learning-based tool for annotating cells using the Cell Ontology, a rich hierarchy of known cell types. We provide a protocol for using the CellO Python package to annotate human cells. We demonstrate how to use CellO in conjunction with Scanpy, a Python library for performing single-cell analysis, annotate a lung tissue data set, interpret its hierarchically structured cell type annotations, and create publication-ready figures. For complete details on the use and execution of this protocol, please refer to Bernstein et al. (2021).

Asunto(s)

Curaduría de Datos/métodos , RNA-Seq/métodos , Análisis de Secuencia de ARN/métodos , Ontologías Biológicas , Biología Computacional/métodos , Humanos , Aprendizaje Automático , Análisis de la Célula Individual/métodos , Programas Informáticos , Transcriptoma/genética , Secuenciación del Exoma/métodos

18.

Lisen&Curate: A platform to facilitate gathering textual evidence for curation of regulation of transcription initiation in bacteria.

Díaz-Rodríguez, Martín; Lithgow-Serrano, Oscar; Guadarrama-García, Francisco; Tierrafría, Víctor H; Gama-Castro, Socorro; Solano-Lira, Hilda; Salgado, Heladia; Rinaldi, Fabio; Méndez-Cruz, Carlos-Francisco; Collado-Vides, Julio.

Biochim Biophys Acta Gene Regul Mech ; 1864(11-12): 194753, 2021.

Artículo en Inglés | MEDLINE | ID: mdl-34461312

RESUMEN

The number of published papers in biomedical research makes it rather impossible for a researcher to keep up to date. This is where manually curated databases contribute facilitating the access to knowledge. However, the structure required by databases strongly limits the type of valuable information that can be incorporated. Here, we present Lisen&Curate, a curation system that facilitates linking sentences or part of sentences (both considered sources) in articles with their corresponding curated objects, so that rich additional information of these objects is easily available to users. These sources are going to be offered both within RegulonDB and a new database, L-Regulon. To show the relevance of our work, two senior curators performed a curation of 31 articles on the regulation of transcription initiation of E. coli using Lisen&Curate. As a result, 194 objects were curated and 781 sources were recorded. We also found that these sources are useful to develop automatic approaches to detect objects in articles by observing word frequency patterns and by carrying out an open information extraction task. Sources may help to elaborate a controlled vocabulary of experimental methods. Finally, we discuss our ecosystem of interconnected applications, RegulonDB, L-Regulon, and Lisen&Curate, to facilitate the access to knowledge on regulation of transcription initiation in bacteria. We see our proposal as the starting point to change the way experimentalists connect a piece of knowledge with its evidence using RegulonDB.

Asunto(s)

Curaduría de Datos/métodos , Bases de Datos Genéticas , Regulación Bacteriana de la Expresión Génica , Iniciación de la Transcripción Genética , Escherichia coli/genética

19.

Scaling national and international improvement in virtual gene panel curation via a collaborative approach to discordance resolution.

Stark, Zornitza; Foulger, Rebecca E; Williams, Eleanor; Thompson, Bryony A; Patel, Chirag; Lunke, Sebastian; Snow, Catherine; Leong, Ivone U S; Puzriakova, Arina; Daugherty, Louise C; Leigh, Sarah; Boustred, Christopher; Niblock, Olivia; Rueda-Martin, Antonio; Gerasimenko, Oleg; Savage, Kevin; Bellamy, William; Lin, Victor San Kho; Valls, Roman; Gordon, Lavinia; Brittain, Helen K; Thomas, Ellen R A; Taylor Tavares, Ana Lisa; McEntagart, Meriel; White, Susan M; Tan, Tiong Y; Yeung, Alison; Downie, Lilian; Macciocca, Ivan; Savva, Elena; Lee, Crystle; Roesley, Ain; De Fazio, Paul; Deller, Jane; Deans, Zandra C; Hill, Sue L; Caulfield, Mark J; North, Kathryn N; Scott, Richard H; Rendon, Augusto; Hofmann, Oliver; McDonagh, Ellen M.

Am J Hum Genet ; 108(9): 1551-1557, 2021 09 02.

Artículo en Inglés | MEDLINE | ID: mdl-34329581

RESUMEN

Clinical validity assessments of gene-disease associations underpin analysis and reporting in diagnostic genomics, and yet wide variability exists in practice, particularly in use of these assessments for virtual gene panel design and maintenance. Harmonization efforts are hampered by the lack of agreed terminology, agreed gene curation standards, and platforms that can be used to identify and resolve discrepancies at scale. We undertook a systematic comparison of the content of 80 virtual gene panels used in two healthcare systems by multiple diagnostic providers in the United Kingdom and Australia. The process was enabled by a shared curation platform, PanelApp, and resulted in the identification and review of 2,144 discordant gene ratings, demonstrating the utility of sharing structured gene-disease validity assessments and collaborative discordance resolution in establishing national and international consensus.

Asunto(s)

Consenso , Curaduría de Datos/normas , Enfermedades Genéticas Congénitas/genética , Genómica/normas , Anotación de Secuencia Molecular/normas , Australia , Biomarcadores/metabolismo , Curaduría de Datos/métodos , Atención a la Salud , Expresión Génica , Ontología de Genes , Enfermedades Genéticas Congénitas/diagnóstico , Enfermedades Genéticas Congénitas/patología , Genómica/métodos , Humanos , Aplicaciones Móviles/provisión & distribución , Terminología como Asunto , Reino Unido

20.

Interpretable prioritization of splice variants in diagnostic next-generation sequencing.

Danis, Daniel; Jacobsen, Julius O B; Carmody, Leigh C; Gargano, Michael A; McMurry, Julie A; Hegde, Ayushi; Haendel, Melissa A; Valentini, Giorgio; Smedley, Damian; Robinson, Peter N.

Am J Hum Genet ; 108(9): 1564-1577, 2021 09 02.

Artículo en Inglés | MEDLINE | ID: mdl-34289339

RESUMEN

A critical challenge in genetic diagnostics is the computational assessment of candidate splice variants, specifically the interpretation of nucleotide changes located outside of the highly conserved dinucleotide sequences at the 5' and 3' ends of introns. To address this gap, we developed the Super Quick Information-content Random-forest Learning of Splice variants (SQUIRLS) algorithm. SQUIRLS generates a small set of interpretable features for machine learning by calculating the information-content of wild-type and variant sequences of canonical and cryptic splice sites, assessing changes in candidate splicing regulatory sequences, and incorporating characteristics of the sequence such as exon length, disruptions of the AG exclusion zone, and conservation. We curated a comprehensive collection of disease-associated splice-altering variants at positions outside of the highly conserved AG/GT dinucleotides at the termini of introns. SQUIRLS trains two random-forest classifiers for the donor and for the acceptor and combines their outputs by logistic regression to yield a final score. We show that SQUIRLS transcends previous state-of-the-art accuracy in classifying splice variants as assessed by rank analysis in simulated exomes, and is significantly faster than competing methods. SQUIRLS provides tabular output files for incorporation into diagnostic pipelines for exome and genome analysis, as well as visualizations that contextualize predicted effects of variants on splicing to make it easier to interpret splice variants in diagnostic settings.

Asunto(s)

Algoritmos , Curaduría de Datos/métodos , Enfermedades Genéticas Congénitas/genética , Sitios de Empalme de ARN , Empalme del ARN , Programas Informáticos , Secuencia de Bases , Biología Computacional/métodos , Exoma , Exones , Enfermedades Genéticas Congénitas/diagnóstico , Enfermedades Genéticas Congénitas/patología , Secuenciación de Nucleótidos de Alto Rendimiento , Humanos , Intrones , Mutación , Secuenciación del Exoma

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

DETALLE DE LA BÚSQUEDA