Results 1 - 20 of 59
1.
Int J Neural Syst ; 32(9): 2250043, 2022 Sep.
Article in English | MEDLINE | ID: mdl-35912583

ABSTRACT

A practical problem in supervised deep learning for medical image segmentation is the lack of labeled data, which is expensive and time-consuming to acquire. In contrast, a considerable amount of unlabeled data is available in the clinic. To make better use of the unlabeled data and improve generalization from limited labeled data, this paper presents a novel semi-supervised segmentation method based on multi-task curriculum learning. Here, curriculum learning means that when training the network, simpler knowledge is learned first to assist the learning of more difficult knowledge. Concretely, our framework consists of a main segmentation task and two auxiliary tasks: a feature regression task and a target detection task. The two auxiliary tasks predict relatively simple image-level attributes and bounding boxes as pseudo labels for the main segmentation task, enforcing that the pixel-level segmentation result matches the distribution of these pseudo labels. In addition, to address class imbalance in the images, a bounding-box-based attention (BBA) module is embedded, enabling the segmentation network to focus on the target region rather than the background. Furthermore, to alleviate the adverse effects of possible deviations in the pseudo labels, error tolerance mechanisms are adopted in the auxiliary tasks, including an inequality constraint and bounding-box amplification. Our method is validated on the ACDC2017 and PROMISE12 datasets. Experimental results demonstrate that, compared with fully supervised and state-of-the-art semi-supervised methods, our method yields much better segmentation performance on a small labeled dataset. Code is available at https://github.com/DeepMedLab/MTCL.
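A minimal sketch (not the authors' released code, which is at the GitHub link above) of the bounding-box-based attention idea: pixels inside a detected, slightly amplified box are weighted more heavily than background when computing a pixel-wise loss. The box coordinates, weights, and amplification margin are illustrative assumptions.

```python
import numpy as np

def box_mask(shape, box, amplify=4):
    """Binary mask for a (possibly amplified) bounding box. box = (y0, x0, y1, x1)."""
    y0, x0, y1, x1 = box
    y0, x0 = max(0, y0 - amplify), max(0, x0 - amplify)                 # enlarge the box to
    y1, x1 = min(shape[0], y1 + amplify), min(shape[1], x1 + amplify)   # tolerate box error
    m = np.zeros(shape, dtype=np.float32)
    m[y0:y1, x0:x1] = 1.0
    return m

def attended_bce(pred, target, box, fg_weight=1.0, bg_weight=0.1, eps=1e-7):
    """Pixel-wise binary cross-entropy, down-weighting pixels outside the box."""
    m = box_mask(pred.shape, box)
    w = bg_weight + (fg_weight - bg_weight) * m
    bce = -(target * np.log(pred + eps) + (1 - target) * np.log(1 - pred + eps))
    return float((w * bce).sum() / w.sum())
```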


Subject(s)
Curriculum , Supervised Machine Learning , Data Curation/methods , Data Curation/standards , Datasets as Topic/standards , Datasets as Topic/supply & distribution , Image Processing, Computer-Assisted/methods , Supervised Machine Learning/classification , Supervised Machine Learning/statistics & numerical data , Supervised Machine Learning/trends
2.
Drug Discov Today ; 27(1): 207-214, 2022 01.
Article in English | MEDLINE | ID: mdl-34332096

ABSTRACT

Standardizing data is crucial for preserving and exchanging scientific information. In particular, recording the context in which data were created ensures that information remains findable, accessible, interoperable, and reusable. Here, we introduce the concept of self-reporting data assets (SRDAs), which preserve data together with their contextual information. SRDAs are an abstract concept that requires a suitable data format for implementation. Four promising data formats or languages are widely used to represent data in the pharmaceutical industry: JCAMP-DX, JSON, AnIML, and, more recently, the Allotrope Data Format (ADF). Here, we evaluate these four options against multiple criteria in common use cases within the pharmaceutical industry. The evaluation shows that ADF is the most suitable format for the implementation of SRDAs.
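Illustrative only: a toy "self-reporting data asset" serialized as JSON, bundling measured values with the context needed to reuse them. The field names are assumptions for this sketch; the ADF format evaluated in the paper is a richer, HDF5-based container and is not reproduced here.

```python
import json

srda = {
    "data": {"concentration_mg_per_ml": [0.98, 1.02, 1.01]},
    "context": {
        "analyte": "compound-X",            # hypothetical sample identifier
        "instrument": "HPLC-UV",
        "method_id": "SOP-1234",            # hypothetical procedure reference
        "operator": "lab-robot-07",
        "acquired_utc": "2021-05-04T09:30:00Z",
        "units": "mg/mL",
    },
}
print(json.dumps(srda, indent=2))
```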


Subject(s)
Data Accuracy , Data Curation , Drug Industry , Information Dissemination/methods , Research Design/standards , Data Curation/methods , Data Curation/standards , Diffusion of Innovation , Drug Industry/methods , Drug Industry/organization & administration , Humans , Proof of Concept Study , Reference Standards , Technology, Pharmaceutical/methods
3.
Neuroinformatics ; 20(2): 463-481, 2022 04.
Article in English | MEDLINE | ID: mdl-34970709

ABSTRACT

Human electrophysiological and related time series data are often acquired in complex, event-rich environments. However, the resulting recorded brain or other dynamics are often interpreted in relation to more sparsely recorded or subsequently noted events. Currently, a substantial gap exists between the level of event description required by digital data archiving standards and the level of annotation required for successful analysis of event-related data across studies, environments, and laboratories. Several challenges must be addressed, most prominently ontological clarity, vocabulary extensibility, annotation tool availability, and overall usability, to allow and promote sharing of data with an effective level of descriptive detail for labeled events. Motivating data authors to perform the work needed to adequately annotate their data is a key challenge. This paper describes new developments in the Hierarchical Event Descriptor (HED) system for addressing these issues. We recap the evolution of HED and its acceptance by the Brain Imaging Data Structure (BIDS) movement, describe the recent release of HED-3G, a third-generation HED tools and design framework, and discuss directions for future development. Given consistent, sufficiently detailed, tool-enabled, field-relevant annotation of the nature of recorded events, prospects are bright for large-scale analysis and modeling of aggregated time series data, in both behavioral and brain imaging sciences and beyond.
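A minimal sketch of attaching HED-style annotations to a BIDS events table. The tag strings below are illustrative and not validated against the HED schema; real projects would use the official HED tools and validator.

```python
import csv, io

events = [
    {"onset": 1.250, "duration": 0.5, "event_type": "show_face",
     "HED": "Sensory-event, Visual-presentation, (Image, Face)"},
    {"onset": 2.000, "duration": 0.2, "event_type": "button_press",
     "HED": "Agent-action, Participant-response, (Press, Mouse-button)"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["onset", "duration", "event_type", "HED"],
                        delimiter="\t")
writer.writeheader()
writer.writerows(events)
print(buf.getvalue())   # tab-separated, as in a BIDS *_events.tsv file
```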


Subject(s)
Data Curation , Time Factors , Humans , Data Curation/standards , Electronic Data Processing
4.
Plant Physiol ; 188(2): 955-970, 2022 02 04.
Article in English | MEDLINE | ID: mdl-34792587

ABSTRACT

Short interspersed nuclear elements (SINEs) are a widespread type of small transposable element (TE). With increasing evidence for their impact on gene function and genome evolution in plants, accurate genome-scale SINE annotation has become a fundamental step for studying the regulatory roles of SINEs and their relationship with other components of the genome. Despite the overall promising progress made in TE annotation, SINE annotation remains a major challenge. Unlike some other TEs, SINEs are short and heterogeneous, and they usually lack well-conserved sequence or structural features. Thus, current SINE annotation tools have either low sensitivity or high false discovery rates. Given the demand and challenges, we aimed to provide a more accurate and efficient SINE annotation tool for plant genomes. The pipeline starts by maximizing the pool of SINE candidates via a profile hidden Markov model-based homology search and a de novo SINE search using structural features. It then excludes false positives by integrating all known features of SINEs and the features of other types of TEs that are often misannotated as SINEs. As a result, the pipeline substantially improves the tradeoff between sensitivity and accuracy, with both values close to or over 90%. We tested our tool in Arabidopsis thaliana and rice (Oryza sativa), and the results show that it compares favorably with existing SINE annotation tools. The simplicity and effectiveness of this tool should make it useful for generating more accurate SINE annotations for other plant species. The pipeline is freely available at https://github.com/yangli557/AnnoSINE.
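A schematic sketch of the two-stage idea described above (not the AnnoSINE code itself): first collect a generous candidate pool from homology-based and structure-based searches, then discard candidates that violate basic SINE expectations. The thresholds and helper predicates are illustrative assumptions.

```python
def plausible_sine(cand):
    ok_length = 80 <= cand["length"] <= 600              # SINEs are short elements
    has_tsd = cand.get("target_site_duplication", False)
    not_ltr_like = not cand.get("ltr_signature", False)  # avoid misannotated LTR/TIR TEs
    return ok_length and has_tsd and not_ltr_like

def annotate_sines(hmm_hits, structural_hits):
    pool = {c["id"]: c for c in hmm_hits + structural_hits}  # maximize the candidate pool
    return [c for c in pool.values() if plausible_sine(c)]   # then prune false positives
```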


Subject(s)
Arabidopsis/genetics , Data Curation/standards , Genome, Plant , Guidelines as Topic , Oryza/genetics , Short Interspersed Nucleotide Elements , Reproducibility of Results
5.
Nat Methods ; 18(11): 1377-1385, 2021 11.
Article in English | MEDLINE | ID: mdl-34711973

ABSTRACT

Liquid chromatography-high-resolution mass spectrometry (LC-MS)-based metabolomics aims to identify and quantify all metabolites, but most LC-MS peaks remain unidentified. Here we present a global network optimization approach, NetID, to annotate untargeted LC-MS metabolomics data. The approach aims to generate, for all experimentally observed ion peaks, annotations that match the measured masses, retention times and (when available) tandem mass spectrometry fragmentation patterns. Peaks are connected based on mass differences reflecting adduct formation, fragmentation, isotopes, or feasible biochemical transformations. Global optimization generates a single network linking most observed ion peaks, enhances peak assignment accuracy, and produces chemically informative peak-peak relationships, including for peaks lacking tandem mass spectra. Applying this approach to yeast and mouse data, we identified five previously unrecognized metabolites (thiamine derivatives and N-glucosyl-taurine). Isotope tracer studies indicate active flux through these metabolites. Thus, NetID applies existing metabolomic knowledge and global optimization to substantially improve annotation coverage and accuracy in untargeted metabolomics datasets, facilitating metabolite discovery.
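A toy sketch of the peak-connection step only: two LC-MS peaks are linked when their mass difference matches a known adduct, isotope, or biochemical transformation within a ppm tolerance. The transformation table and tolerance are illustrative assumptions; the actual NetID method additionally solves a global optimization over the resulting network, which is not shown here.

```python
MASS_DIFFS = {            # assumed example values, in Da
    "proton": 1.00728,
    "13C isotope": 1.00336,
    "+CH2": 14.01565,
    "+glucosyl": 162.05282,
}

def connect_peaks(peaks, ppm_tol=5.0):
    """peaks: list of (peak_id, neutral_mass). Returns candidate edges."""
    edges = []
    for i, (pid_a, m_a) in enumerate(peaks):
        for pid_b, m_b in peaks[i + 1:]:
            diff = abs(m_a - m_b)
            for name, delta in MASS_DIFFS.items():
                tol = ppm_tol * 1e-6 * max(m_a, m_b)
                if abs(diff - delta) <= tol:
                    edges.append((pid_a, pid_b, name))
    return edges
```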


Subject(s)
Algorithms , Data Curation/standards , Liver/metabolism , Metabolome , Metabolomics/standards , Saccharomyces cerevisiae/metabolism , Animals , Chromatography, Liquid/methods , Data Curation/methods , Metabolomics/methods , Mice , Tandem Mass Spectrometry/methods
6.
Am J Hum Genet ; 108(9): 1551-1557, 2021 09 02.
Article in English | MEDLINE | ID: mdl-34329581

ABSTRACT

Clinical validity assessments of gene-disease associations underpin analysis and reporting in diagnostic genomics, and yet wide variability exists in practice, particularly in use of these assessments for virtual gene panel design and maintenance. Harmonization efforts are hampered by the lack of agreed terminology, agreed gene curation standards, and platforms that can be used to identify and resolve discrepancies at scale. We undertook a systematic comparison of the content of 80 virtual gene panels used in two healthcare systems by multiple diagnostic providers in the United Kingdom and Australia. The process was enabled by a shared curation platform, PanelApp, and resulted in the identification and review of 2,144 discordant gene ratings, demonstrating the utility of sharing structured gene-disease validity assessments and collaborative discordance resolution in establishing national and international consensus.
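A minimal sketch of discordance detection between virtual gene panels: for each gene, collect the clinical-validity ratings assigned by different panels and flag genes whose ratings disagree. The green/amber/red rating scale (as used in PanelApp) is the only assumption beyond the abstract.

```python
from collections import defaultdict

def discordant_genes(panels):
    """panels: dict panel_name -> dict gene -> rating ('green'|'amber'|'red')."""
    ratings = defaultdict(dict)
    for panel, genes in panels.items():
        for gene, rating in genes.items():
            ratings[gene][panel] = rating
    return {g: r for g, r in ratings.items()
            if len(r) > 1 and len(set(r.values())) > 1}   # rated by >1 panel, ratings differ

# Example: discordant_genes({"UK": {"BRCA1": "green"}, "AUS": {"BRCA1": "amber"}})
```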


Subject(s)
Consensus , Data Curation/standards , Genetic Diseases, Inborn/genetics , Genomics/standards , Molecular Sequence Annotation/standards , Australia , Biomarkers/metabolism , Data Curation/methods , Delivery of Health Care , Gene Expression , Gene Ontology , Genetic Diseases, Inborn/diagnosis , Genetic Diseases, Inborn/pathology , Genomics/methods , Humans , Mobile Applications/supply & distribution , Terminology as Topic , United Kingdom
7.
Brief Bioinform ; 22(1): 146-163, 2021 01 18.
Article in English | MEDLINE | ID: mdl-31838514

ABSTRACT

MOTIVATION: Annotation tools are applied to build training and test corpora, which are essential for the development and evaluation of new natural language processing algorithms. Annotation tools are also used to extract new information for particular use cases. However, owing to the high number of existing annotation tools, finding the one that best fits particular needs is a demanding task that requires searching the scientific literature and then installing and trying various tools. METHODS: We searched for annotation tools and selected a subset of them according to five requirements with which they should comply, such as being Web-based or supporting the definition of an annotation schema. We installed the selected tools (when necessary), carried out hands-on experiments, and evaluated them using 26 criteria covering functional and technical aspects. We defined three levels of compliance for each criterion and an overall score for the final evaluation of each tool. RESULTS: We evaluated 78 tools and selected the following 15 for detailed evaluation: BioQRator, brat, Catma, Djangology, ezTag, FLAT, LightTag, MAT, MyMiner, PDFAnno, prodigy, tagtog, TextAE, WAT-SL and WebAnno. The number of fully met criteria ranged from only 9 to 20 of the 26, demonstrating that some tools are comprehensive and mature enough to be used on most annotation projects. WebAnno obtained the highest score, 0.81 out of a maximum of 1.0.
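A sketch of the scoring scheme implied above: each of the 26 criteria is judged on three levels of compliance and a tool's score is the normalized sum. The numeric values assigned to the levels are an assumption; the paper defines its own weighting.

```python
LEVEL_VALUE = {"full": 1.0, "partial": 0.5, "none": 0.0}

def tool_score(criteria_levels):
    """criteria_levels: dict criterion -> 'full' | 'partial' | 'none'."""
    total = sum(LEVEL_VALUE[v] for v in criteria_levels.values())
    return round(total / len(criteria_levels), 2)   # 1.0 = meets every criterion fully
```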


Subject(s)
Computational Biology/standards , Data Curation/standards , Computational Biology/methods , Data Curation/methods , Software/standards
9.
Int J Med Inform ; 136: 104095, 2020 04.
Article in English | MEDLINE | ID: mdl-32058265

ABSTRACT

Clinicians write a billion free-text notes per year. These notes are typically replete with errors of all types. No established automated method can extract data from this treasure trove. The practice of medicine therefore remains haphazard and chaotic, resulting in vast economic waste. The lexeme hypotheses are based on our analysis of how records are created. They enable a computer system to predict what issue a clinician will need to address next, based on the environment in which the clinician is working and the responses the clinician has selected so far. The system uses a lexicon storing the issues (queries) and a range of responses to those issues. When the clinician selects a response, a text fragment is added to the output file. In the first phase of this work, the notes of 69 returning hemophilia patients were scrutinized, and the lexicon was expanded to 847 lexeme queries and 7,995 responses to enable the construction of completed notes. The quality of lexeme-generated notes from 20 consecutive subjects was then compared with the clinicians' conventional clinic notes. The system generated grammatically correct notes. Compared with the traditional clinic notes, the lexeme-generated notes were more complete (88% versus 62%) and had fewer typographical and grammatical errors (0.8 versus 3.5 errors per note). The system-generated and traditional notes both averaged about 800 words, but the traditional notes had a much wider distribution of lengths. The note-creation rate, from marshalling the data to completion, averaged 80 words per minute with the system, twice as fast as the typical clinician can type. The lexeme method generates more complete, grammatical, and organized notes faster than traditional methods. The notes are fully computerized at inception, and they incorporate prompts for clinicians to address otherwise overlooked items. This pilot justifies further exploration of the methodology.
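A toy sketch of the lexeme idea: the system walks a lexicon of issues (queries), the clinician picks a response, and each selection appends a canned, grammatical text fragment to the note. The lexicon content below is invented for illustration and is far smaller than the 847-query lexicon described above.

```python
LEXICON = {
    "bleeding_since_last_visit": {
        "none": "The patient reports no bleeding episodes since the last visit.",
        "joint": "The patient reports a joint bleed since the last visit.",
    },
    "factor_adherence": {
        "good": "Prophylactic factor replacement has been taken as prescribed.",
        "missed": "Several prophylactic factor doses were missed.",
    },
}

def build_note(selections):
    """selections: dict query -> chosen response key; returns the assembled note."""
    return " ".join(LEXICON[q][r] for q, r in selections.items() if r in LEXICON.get(q, {}))

note = build_note({"bleeding_since_last_visit": "none", "factor_adherence": "good"})
```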


Subject(s)
Data Curation/standards , Documentation/methods , Information Storage and Retrieval/methods , Medical History Taking/methods , Practice Patterns, Physicians'/standards , Word Processing/statistics & numerical data , Writing/standards , Adult , Automation , Clinical Competence , Hemophilia A/diagnosis , Hemophilia A/therapy , Humans , Medical Records , Pilot Projects , Young Adult
10.
JCO Clin Cancer Inform ; 3: 1-11, 2019 11.
Article in English | MEDLINE | ID: mdl-31834820

ABSTRACT

PURPOSE: Data sharing creates potential cost savings, supports data aggregation, and facilitates reproducibility to ensure quality research; however, data from heterogeneous systems require retrospective harmonization. This is a major hurdle for researchers who seek to leverage existing data. Efforts focused on strategies for data interoperability largely center on the use of standards but ignore the problems of competing standards and the value of existing data. Interoperability therefore remains reliant on retrospective harmonization, and approaches to reduce this burden are needed. METHODS: The Cancer Imaging Archive (TCIA) is an example of an imaging repository that accepts data from a diversity of sources. It contains medical images from investigators worldwide along with substantial nonimage data. Digital Imaging and Communications in Medicine (DICOM) standards enable querying across images, but TCIA does not enforce other standards for describing nonimage supporting data, such as treatment details and patient outcomes. In this study, we used 9 TCIA lung and brain nonimage files containing 659 fields to explore retrospective harmonization for cross-study query and aggregation. Identifying 41 overlapping fields present in 3 or more files and transforming 31 of them took 329.5 hours (about 2.3 person-months) spread over 6 months. We used the Genomic Data Commons (GDC) data elements as the target standard for harmonization. RESULTS: We characterized the issues and developed recommendations for reducing the burden of retrospective harmonization. Once we had harmonized the data, we also developed a Web tool to easily explore the harmonized collections. CONCLUSION: While prospective use of standards can support interoperability, several issues complicate this goal. Our work recognizes and reveals the retrospective harmonization issues encountered when trying to reuse existing data and recommends national infrastructure to address them.
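An illustrative sketch of retrospective harmonization: source-specific column names and coded values are mapped onto a shared target element, loosely modeled on a GDC-style data element. The mappings are invented examples, not the study's actual 31 transformed fields.

```python
FIELD_MAP = {"pt_sex": "gender", "smoker_status": "tobacco_smoking_status"}
VALUE_MAP = {"gender": {"M": "male", "F": "female"}}

def harmonize(record):
    out = {}
    for src_field, value in record.items():
        target = FIELD_MAP.get(src_field)
        if target is None:
            continue                                   # field has no overlap across studies
        out[target] = VALUE_MAP.get(target, {}).get(value, value)
    return out

# harmonize({"pt_sex": "F", "smoker_status": "never"}) -> {"gender": "female", ...}
```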


Subject(s)
Brain Neoplasms/diagnostic imaging , Data Curation/standards , Health Information Interoperability/standards , Lung Neoplasms/diagnostic imaging , Brain Neoplasms/diagnosis , Data Curation/methods , Databases, Factual , Guidelines as Topic , Humans , Lung Neoplasms/diagnosis , Reproducibility of Results , Retrospective Studies
12.
Genet Epidemiol ; 43(4): 356-364, 2019 06.
Article in English | MEDLINE | ID: mdl-30657194

ABSTRACT

When interpreting genome-wide association peaks, it is common to annotate each peak by searching for genes with plausible relationships to the trait. However, "all that glitters is not gold": one might interpret apparent patterns in the data as plausible even when the peak is a false positive. Accordingly, we sought to see how human annotators interpreted association results containing a mixture of peaks from both the original trait and a genetically uncorrelated "synthetic" trait. Two of us prepared a mix of original and synthetic peaks of three significance categories from five different scans, along with relevant literature search results, and then we all annotated these regions. Three annotators also scored the strength of evidence connecting each peak to the scanned trait and the likelihood of further studying that region. While annotators found original peaks to have stronger evidence (Bonferroni-corrected p = 0.017) and a higher likelihood of further study (Bonferroni-corrected p = 0.006) than synthetic peaks, annotators often made convincing connections between the synthetic peaks and the original trait, finding such connections 55% of the time. These results show that it is not difficult for annotators to make convincing connections between synthetic association signals and genes found in those regions.


Subject(s)
Data Curation , Data Interpretation, Statistical , False Positive Reactions , Genome-Wide Association Study/statistics & numerical data , Data Curation/methods , Data Curation/standards , Data Curation/statistics & numerical data , Deception , Genome-Wide Association Study/standards , Humans , Phenotype , Polymorphism, Single Nucleotide
13.
PLoS One ; 14(12): e0218904, 2019.
Article in English | MEDLINE | ID: mdl-31891586

ABSTRACT

Video and image data are regularly used in the field of benthic ecology to document biodiversity. However, their use is subject to a number of challenges, principally the identification of taxa within the images without associated physical specimens. The challenge of applying traditional taxonomic keys to the identification of fauna from images has led to the development of personal, group, or institution-level reference image catalogues of operational taxonomic units (OTUs) or morphospecies. Lack of standardisation among these reference catalogues has led to problems with observer bias and the inability to combine datasets across studies. In addition, the lack of a common reference standard is stifling efforts to apply artificial intelligence to taxon identification. Using the North Atlantic deep sea as a case study, we propose a database structure to facilitate standardisation of morphospecies image catalogues between research groups and to support future use in multiple front-end applications. We also propose a framework for coordinating international efforts to develop reference guides for the identification of marine species from images. The proposed structure maps to the Darwin Core standard to allow integration with existing databases. We suggest a management framework in which high-level taxonomic groups are curated by a regional team consisting of both end users and taxonomic experts. We identify a mechanism by which the overall quality of data within a common reference guide could be raised over the next decade. Finally, we discuss the role of a common reference standard in advancing marine ecology and supporting sustainable use of this ecosystem.
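A sketch of a morphospecies (OTU) catalogue entry expressed with Darwin Core term names so it can be merged with other occurrence datasets. The field values and URL are invented; the database structure proposed in the paper is richer than this flat record.

```python
otu_record = {
    "scientificName": "Porifera",          # lowest confident identification
    "identificationQualifier": "msp-012",  # morphospecies label within the catalogue
    "taxonRank": "phylum",
    "identifiedBy": "regional curation team",
    "associatedMedia": "https://example.org/images/msp-012.jpg",  # placeholder URL
    "identificationRemarks": "Encrusting yellow sponge; identified from imagery only.",
}
```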


Subject(s)
Classification/methods , Image Processing, Computer-Assisted/standards , Marine Biology/standards , Animals , Artificial Intelligence , Biodiversity , Data Curation/methods , Data Curation/standards , Databases, Factual , Ecology , Ecosystem , Image Processing, Computer-Assisted/methods , Marine Biology/classification
14.
Sci Data ; 5: 180258, 2018 11 20.
Article in English | MEDLINE | ID: mdl-30457569

ABSTRACT

Clinical case reports (CCRs) provide an important means of sharing clinical experiences about atypical disease phenotypes and new therapies. However, published case reports contain largely unstructured and heterogeneous clinical data, posing a challenge to mining relevant information. Current indexing approaches generally concern document-level features and have not been specifically designed for CCRs. To address this gap, we developed a standardized metadata template and identified text corresponding to medical concepts within 3,100 curated CCRs spanning 15 disease groups and more than 750 reports of rare diseases. We also prepared a subset of metadata for reports on selected mitochondrial diseases and assigned ICD-10 diagnostic codes to each. The resulting resource, Metadata Acquired from Clinical Case Reports (MACCRs), contains text associated with high-level clinical concepts, including demographics, disease presentation, treatments, and outcomes for each report. Our template and the MACCR set render CCRs more findable, accessible, interoperable, and reusable (FAIR) while serving as valuable resources for key user groups, including researchers, physician investigators, clinicians, data scientists, and those shaping government policies for clinical trials.
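An illustrative, partial record in the spirit of the MACCR template: high-level clinical concepts extracted from one case report, keyed by concept. Field names and values are invented for the sketch and do not reproduce the published template.

```python
maccr_record = {
    "pmid": "00000000",                  # placeholder identifier
    "demographics": {"age": 34, "sex": "female"},
    "disease_group": "mitochondrial disease",
    "icd10": "E88.4",                    # mitochondrial metabolism disorders
    "presentation": "progressive muscle weakness and lactic acidosis",
    "treatments": ["coenzyme Q10", "riboflavin"],
    "outcome": "partial improvement at 6-month follow-up",
}
```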


Subject(s)
Clinical Studies as Topic , Data Curation , Metadata , Computational Biology , Data Analysis , Data Curation/methods , Data Curation/standards , Humans , Metadata/standards
15.
Sci Data ; 5: 180259, 2018 11 20.
Article in English | MEDLINE | ID: mdl-30457573

ABSTRACT

This article presents a practical roadmap for scholarly publishers to implement data citation in accordance with the Joint Declaration of Data Citation Principles (JDDCP), a synopsis and harmonization of the recommendations of major science policy bodies. It was developed by the Publishers Early Adopters Expert Group as part of the Data Citation Implementation Pilot (DCIP) project, an initiative of FORCE11.org and the NIH BioCADDIE program. The structure of the roadmap presented here follows the "life of a paper" workflow and includes the categories Pre-submission, Submission, Production, and Publication. The roadmap is intended to be publisher-agnostic so that all publishers can use it as a starting point when implementing JDDCP-compliant data citation. Authors reading this roadmap will also better understand what to expect from publishers and how to enable their own data citations to gain maximum impact, as well as how to comply with increasingly common funder mandates on data transparency.


Subject(s)
Publishing/standards , Data Curation/standards
16.
Anim Genet ; 49(6): 520-526, 2018 Dec.
Article in English | MEDLINE | ID: mdl-30311252

ABSTRACT

The Functional Annotation of ANimal Genomes (FAANG) project aims, through a coordinated international effort, to provide high-quality functional annotation of animal genomes, with an initial focus on farmed and companion animals. A key goal of the initiative is to ensure high-quality, rich supporting metadata describing the project's animals, specimens, cell cultures, and experimental assays. By defining rich sample and experimental metadata standards and promoting best practices in data description, deposition, and openness, FAANG champions higher quality and reusability of published datasets. FAANG has established a Data Coordination Centre, which sits at the heart of its Metadata and Data Sharing Committee. It continues to evolve the metadata standards, support submissions and, crucially, create powerful and accessible tools to support deposition and validation of metadata. FAANG conforms to the findable, accessible, interoperable, and reusable (FAIR) data principles, with high-quality, open-access, and functionally interlinked data. In addition to data generated by FAANG members and specific FAANG projects, existing datasets that meet the main standards, or the more permissive legacy standards, are incorporated into a central, focused, functional data resource portal for the entire farmed and companion animal community. Through clear and effective metadata standards, validation and conversion software, and promotion of best practices in metadata implementation, FAANG aims to maximise the effectiveness and inter-comparability of assay data. This supports the community in creating a rich genome-to-phenotype resource and promotes continuing improvements in animal data standards as a whole.
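A minimal sketch of metadata validation in the spirit described above: required attributes and allowed values are checked before deposition. The ruleset is an invented example, not the FAANG sample rulesets themselves.

```python
REQUIRED = {"species", "sex", "tissue", "assay_type"}
ALLOWED = {"sex": {"male", "female", "unknown"}}

def validate_sample(meta):
    errors = [f"missing field: {f}" for f in REQUIRED - meta.keys()]
    errors += [f"invalid {f}: {meta[f]!r}" for f, allowed in ALLOWED.items()
               if f in meta and meta[f] not in allowed]
    return errors   # empty list means the record passes
```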


Subject(s)
Data Curation/standards , Genomics , Metadata/standards , Animals , Livestock , Pets , Software
20.
Nucleic Acids Res ; 46(D1): D221-D228, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29126148

ABSTRACT

The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, the HUGO Gene Nomenclature Committee, Mouse Genome Informatics, and the University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps maintain the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth, and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community.
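A toy sketch of the core CCDS idea: coding regions whose genomic coordinates agree exactly between two independent annotation sets receive a stable identifier. Real CCDS processing involves many additional quality assurance checks not modeled here, and the identifier format is illustrative.

```python
def identical_cds(ncbi, ensembl):
    """Each input: dict gene -> tuple of exon CDS intervals ((start, end), ...)."""
    agreed = {}
    next_id = 1
    for gene, coords in sorted(ncbi.items()):
        if ensembl.get(gene) == coords:          # exact coordinate match required
            agreed[gene] = f"CCDS{next_id}.1"    # illustrative identifier format
            next_id += 1
    return agreed
```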


Subject(s)
Consensus Sequence , Databases, Genetic , Open Reading Frames , Animals , Data Curation/methods , Data Curation/standards , Databases, Genetic/standards , Guidelines as Topic , Humans , Mice , Molecular Sequence Annotation , National Library of Medicine (U.S.) , United States , User-Computer Interface