1.
Int J Neural Syst; 32(9): 2250043, 2022 Sep.
Article in English | MEDLINE | ID: mdl-35912583

ABSTRACT

A practical problem in supervised deep learning for medical image segmentation is the lack of labeled data, which is expensive and time-consuming to acquire. In contrast, a considerable amount of unlabeled data is available in the clinic. To make better use of the unlabeled data and improve generalization from limited labeled data, this paper presents a novel semi-supervised segmentation method based on multi-task curriculum learning. Here, curriculum learning means that during training, simpler knowledge is learned first to assist the learning of more difficult knowledge. Concretely, the framework consists of a main segmentation task and two auxiliary tasks: a feature regression task and a target detection task. The two auxiliary tasks predict relatively simple image-level attributes and bounding boxes as pseudo labels for the main segmentation task, enforcing that the pixel-level segmentation result match the distribution of these pseudo labels. In addition, to address class imbalance in the images, a bounding-box-based attention (BBA) module is embedded, enabling the segmentation network to focus on the target region rather than the background. Furthermore, to alleviate the adverse effects of possible deviations in the pseudo labels, error tolerance mechanisms are adopted in the auxiliary tasks, including an inequality constraint and bounding-box amplification. The method is validated on the ACDC2017 and PROMISE12 datasets. Experimental results demonstrate that, compared with fully supervised and state-of-the-art semi-supervised methods, the proposed method yields much better segmentation performance on a small labeled dataset. Code is available at https://github.com/DeepMedLab/MTCL.
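To make the error-tolerance idea concrete, here is a minimal sketch of a multi-task loss with a hinged (inequality-constraint) auxiliary term, assuming a PyTorch setup; the `margin` and `aux_weight` values are illustrative assumptions, not the authors' released implementation (see the repository above for that).

```python
# Sketch only: supervised segmentation loss plus a tolerant auxiliary loss.
# `margin` and `aux_weight` are assumed values, not taken from the paper.
import torch
import torch.nn.functional as F

def multitask_loss(seg_logits, seg_labels, pred_attr, pseudo_attr,
                   margin=0.1, aux_weight=0.5):
    sup = F.cross_entropy(seg_logits, seg_labels)  # labeled data
    gap = (pred_attr - pseudo_attr).abs()          # auxiliary pseudo labels
    # Inequality constraint: deviations smaller than `margin` incur no
    # penalty, which tolerates noise in the pseudo labels.
    aux = F.relu(gap - margin).mean()
    return sup + aux_weight * aux

# Toy tensors standing in for network outputs: (batch, classes, H, W).
logits = torch.randn(2, 4, 8, 8)
labels = torch.randint(0, 4, (2, 8, 8))
print(multitask_loss(logits, labels, torch.rand(2), torch.rand(2)).item())
```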


Subjects
Curriculum , Supervised Machine Learning , Data Curation/methods , Data Curation/standards , Datasets as Topic/standards , Datasets as Topic/supply & distribution , Image Processing, Computer-Assisted/methods , Supervised Machine Learning/classification , Supervised Machine Learning/statistics & numerical data , Supervised Machine Learning/trends
2.
Drug Discov Today; 27(1): 207-214, 2022 Jan.
Article in English | MEDLINE | ID: mdl-34332096

ABSTRACT

Standardizing data is crucial for preserving and exchanging scientific information. In particular, recording the context in which data were created ensures that information remains findable, accessible, interoperable, and reusable. Here, we introduce the concept of self-reporting data assets (SRDAs), which preserve data together with their contextual information. SRDAs are an abstract concept and require a suitable data format for implementation. Four promising data formats or languages are commonly used to represent data in the pharmaceutical industry: JCAMP-DX, JSON, AnIML, and, more recently, the Allotrope Data Format (ADF). Here, we evaluate these four options against multiple criteria in common use cases within the pharmaceutical industry. The evaluation shows that ADF is the most suitable format for implementing SRDAs.
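As an illustration of the SRDA concept, the sketch below bundles a measurement with the context of its creation in plain JSON; all field names are invented for the example and are not part of JCAMP-DX, AnIML, or ADF.

```python
# Hypothetical self-reporting data asset: data and provenance in one record.
import json
from datetime import datetime, timezone

srda = {
    "data": {"wavelength_nm": [400, 500, 600], "absorbance": [0.12, 0.45, 0.88]},
    "context": {                       # the "self-reporting" part
        "instrument": "UV-Vis spectrometer",
        "operator": "analyst-042",
        "acquired_at": datetime.now(timezone.utc).isoformat(),
        "protocol": "SOP-17 rev 3",
    },
}
print(json.dumps(srda, indent=2))
```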


Subjects
Data Accuracy , Data Curation , Drug Industry , Information Dissemination/methods , Research Design/standards , Data Curation/methods , Data Curation/standards , Diffusion of Innovation , Drug Industry/methods , Drug Industry/organization & administration , Humans , Proof of Concept Study , Reference Standards , Technology, Pharmaceutical/methods
3.
Neuroinformatics; 20(2): 463-481, 2022 Apr.
Article in English | MEDLINE | ID: mdl-34970709

ABSTRACT

Human electrophysiological and related time series data are often acquired in complex, event-rich environments. However, the resulting recorded brain or other dynamics are often interpreted in relation to more sparsely recorded or subsequently noted events. Currently, a substantial gap exists between the level of event description required by digital data archiving standards and the level of annotation required for successful analysis of event-related data across studies, environments, and laboratories. Manifold challenges must be addressed, most prominently ontological clarity, vocabulary extensibility, annotation tool availability, and overall usability, to allow and promote sharing of data with an effective level of descriptive detail for labeled events. Motivating data authors to perform the work needed to adequately annotate their data is a key challenge. This paper describes new developments in the Hierarchical Event Descriptor (HED) system for addressing these issues. We recap the evolution of HED and its acceptance by the Brain Imaging Data Structure (BIDS) movement, describe the recent release of HED-3G, the third-generation HED tools and design framework, and discuss directions for future development. Given consistent, sufficiently detailed, tool-enabled, field-relevant annotation of recorded events, prospects are bright for large-scale analysis and modeling of aggregated time series data, both in the behavioral and brain imaging sciences and beyond.
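The sketch below shows, in schematic form, what machine-actionable event annotation looks like: each event marker carries a comma-separated hierarchy of descriptor tags. The tag strings are illustrative only and are not validated against the actual HED-3G schema.

```python
# Toy event table with hierarchical descriptor strings attached to onsets.
events = [
    {"onset_s": 1.25, "hed": "Sensory-event, Visual-presentation"},
    {"onset_s": 3.80, "hed": "Agent-action, Participant-response"},
]

def split_tags(descriptor):
    """Split a comma-separated descriptor string into individual tags."""
    return [tag.strip() for tag in descriptor.split(",")]

for event in events:
    print(f'{event["onset_s"]:6.2f}s -> {split_tags(event["hed"])}')
```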


Subjects
Data Curation , Time Factors , Humans , Data Curation/standards , Electronic Data Processing
4.
Plant Physiol; 188(2): 955-970, 2022 Feb 4.
Article in English | MEDLINE | ID: mdl-34792587

ABSTRACT

Short interspersed nuclear elements (SINEs) are a widespread type of small transposable element (TE). With increasing evidence for their impact on gene function and genome evolution in plants, accurate genome-scale SINE annotation becomes a fundamental step for studying the regulatory roles of SINEs and their relationships with other components in the genome. Despite the overall promising progress made in TE annotation, SINE annotation remains a major challenge. Unlike some other TEs, SINEs are short and heterogeneous, and they usually lack well-conserved sequence or structural features. Thus, current SINE annotation tools suffer from either low sensitivity or high false discovery rates. Given the demand and the challenges, we aimed to provide a more accurate and efficient SINE annotation tool for plant genomes. The pipeline starts by maximizing the pool of SINE candidates via a profile hidden Markov model-based homology search and a de novo SINE search using structural features. It then excludes false positives by integrating all known features of SINEs and the features of other types of TEs that are often misannotated as SINEs. As a result, the pipeline substantially improves the tradeoff between sensitivity and accuracy, with both values close to or above 90%. We tested our tool on Arabidopsis thaliana and rice (Oryza sativa), and the results show that it compares favorably with existing SINE annotation tools. Its simplicity and effectiveness should make it useful for generating more accurate SINE annotations for other plant species. The pipeline is freely available at https://github.com/yangli557/AnnoSINE.
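The two-stage logic (maximize the candidate pool, then filter false positives) can be sketched as follows; the feature checks are crude placeholders for the pipeline's integrated filters, and the data are toy sequences.

```python
# Stage 1: union of homology (HMM) hits and structure-based hits, to
# maximize recall before filtering.
def merge_candidates(homology_hits, structural_hits):
    return list({c["seq"]: c for c in homology_hits + structural_hits}.values())

# Stage 2: placeholder feature filter. The real pipeline integrates many
# known SINE and non-SINE features; here we only check length and a
# poly(A)-like tail.
def passes_filters(candidate, max_len=500):
    seq = candidate["seq"]
    return len(seq) <= max_len and seq.endswith("AAAA")

candidates = merge_candidates(
    [{"seq": "ACGT" * 20 + "AAAA"}],   # short candidate with a tail
    [{"seq": "ACGT" * 300}],           # too long: filtered as a false positive
)
print(len(candidates), len([c for c in candidates if passes_filters(c)]))  # 2 1
```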


Subjects
Arabidopsis/genetics , Data Curation/standards , Genome, Plant , Guidelines as Topic , Oryza/genetics , Short Interspersed Nucleotide Elements , Reproducibility of Results
5.
Nat Methods; 18(11): 1377-1385, 2021 Nov.
Article in English | MEDLINE | ID: mdl-34711973

ABSTRACT

Liquid chromatography-high-resolution mass spectrometry (LC-MS)-based metabolomics aims to identify and quantify all metabolites, but most LC-MS peaks remain unidentified. Here we present a global network optimization approach, NetID, for annotating untargeted LC-MS metabolomics data. The approach aims to generate, for all experimentally observed ion peaks, annotations that match the measured masses, retention times, and (when available) tandem mass spectrometry fragmentation patterns. Peaks are connected based on mass differences reflecting adduct formation, fragmentation, isotopes, or feasible biochemical transformations. Global optimization generates a single network linking most observed ion peaks, enhances peak assignment accuracy, and produces chemically informative peak-peak relationships, including for peaks lacking tandem mass spectrometry spectra. Applying this approach to yeast and mouse data, we identified five previously unrecognized metabolites (thiamine derivatives and N-glucosyl-taurine). Isotope tracer studies indicate active flux through these metabolites. Thus, NetID combines existing metabolomic knowledge with global optimization to substantially improve annotation coverage and accuracy in untargeted metabolomics datasets, facilitating metabolite discovery.
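As a toy version of the peak-connection step, the sketch below links peak pairs whose mass difference matches a known transformation within a small tolerance; the delta table and peak masses are illustrative, and the real method additionally performs global network optimization over all such candidate edges.

```python
# Candidate mass differences: isotope spacing and biochemical
# transformations (values in Da; a small illustrative subset).
KNOWN_DELTAS = {
    "13C isotope": 1.00336,
    "+CH2 (methylation)": 14.01565,
    "+H2O (hydration)": 18.01056,
}

def connect_peaks(masses, tol_da=0.002):
    """Return (i, j, label) edges for peak pairs matching a known delta."""
    edges = []
    for i, m1 in enumerate(masses):
        for j in range(i + 1, len(masses)):
            delta = abs(masses[j] - m1)
            for name, d in KNOWN_DELTAS.items():
                if abs(delta - d) <= tol_da:
                    edges.append((i, j, name))
    return edges

# Toy peak list: glucose, its 13C isotope peak, and a methylated analog.
print(connect_peaks([180.06339, 181.06675, 194.07904]))
# [(0, 1, '13C isotope'), (0, 2, '+CH2 (methylation)')]
```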


Subjects
Algorithms , Data Curation/standards , Liver/metabolism , Metabolome , Metabolomics/standards , Saccharomyces cerevisiae/metabolism , Animals , Chromatography, Liquid/methods , Data Curation/methods , Metabolomics/methods , Mice , Tandem Mass Spectrometry/methods
6.
Am J Hum Genet; 108(9): 1551-1557, 2021 Sep 2.
Article in English | MEDLINE | ID: mdl-34329581

ABSTRACT

Clinical validity assessments of gene-disease associations underpin analysis and reporting in diagnostic genomics, yet wide variability exists in practice, particularly in the use of these assessments for virtual gene panel design and maintenance. Harmonization efforts are hampered by the lack of agreed terminology, agreed gene curation standards, and platforms that can be used to identify and resolve discrepancies at scale. We undertook a systematic comparison of the content of 80 virtual gene panels used by multiple diagnostic providers in two healthcare systems, in the United Kingdom and Australia. The process was enabled by a shared curation platform, PanelApp, and resulted in the identification and review of 2,144 discordant gene ratings, demonstrating the utility of sharing structured gene-disease validity assessments and of collaborative discordance resolution in establishing national and international consensus.
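The core of the comparison reduces to flagging genes whose clinical-validity rating differs between panels. A minimal sketch, using PanelApp's green/amber/red rating scheme with invented genes and ratings:

```python
# Toy gene-rating dictionaries for the "same" panel from two providers.
panel_uk = {"BRCA1": "green", "GENE_X": "amber"}   # GENE_X is hypothetical
panel_au = {"BRCA1": "green", "GENE_X": "red"}

# Discordant ratings are candidates for collaborative review and resolution.
discordant = {
    gene: (panel_uk[gene], panel_au[gene])
    for gene in panel_uk.keys() & panel_au.keys()
    if panel_uk[gene] != panel_au[gene]
}
print(discordant)  # {'GENE_X': ('amber', 'red')}
```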


Subjects
Consensus , Data Curation/standards , Genetic Diseases, Inborn/genetics , Genomics/standards , Molecular Sequence Annotation/standards , Australia , Biomarkers/metabolism , Data Curation/methods , Delivery of Health Care , Gene Expression , Gene Ontology , Genetic Diseases, Inborn/diagnosis , Genetic Diseases, Inborn/pathology , Genomics/methods , Humans , Mobile Applications/supply & distribution , Terminology as Topic , United Kingdom
7.
Brief Bioinform; 22(1): 146-163, 2021 Jan 18.
Article in English | MEDLINE | ID: mdl-31838514

ABSTRACT

MOTIVATION: Annotation tools are used to build the training and test corpora that are essential for developing and evaluating new natural language processing algorithms, and to extract new information for particular use cases. However, owing to the large number of existing annotation tools, finding the one that best fits particular needs is a demanding task that requires searching the scientific literature and then installing and trying various tools. METHODS: We searched for annotation tools and selected a subset of them according to five requirements with which they should comply, such as being Web-based or supporting the definition of a schema. We installed the selected tools (when necessary), carried out hands-on experiments, and evaluated them using 26 criteria covering functional and technical aspects. We defined three match levels for each criterion and a score for the final evaluation of the tools. RESULTS: We evaluated 78 tools and selected the following 15 for detailed evaluation: BioQRator, brat, Catma, Djangology, ezTag, FLAT, LightTag, MAT, MyMiner, PDFAnno, prodigy, tagtog, TextAE, WAT-SL, and WebAnno. The number of criteria fully met ranged from only 9 to 20 of the 26, demonstrating that some tools are comprehensive and mature enough to be used in most annotation projects. The highest score, 0.81 out of a maximum of 1.0, was obtained by WebAnno.
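A plausible reading of the scoring scheme: each criterion is judged on one of three match levels and the final score is normalized to [0, 1]. The level weights below are assumptions for illustration; the paper defines its own scoring.

```python
# Assumed weights for the three match levels (not taken from the paper).
LEVEL_WEIGHT = {"full": 1.0, "partial": 0.5, "none": 0.0}

def tool_score(levels):
    """Normalized score over one match level per criterion (e.g. 26 values)."""
    return sum(LEVEL_WEIGHT[level] for level in levels) / len(levels)

# Under these assumed weights, 20 full and 2 partial matches out of 26
# criteria would yield a score near the reported 0.81.
print(round(tool_score(["full"] * 20 + ["partial"] * 2 + ["none"] * 4), 2))
```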


Subjects
Computational Biology/standards , Data Curation/standards , Computational Biology/methods , Data Curation/methods , Software/standards
9.
Int J Med Inform; 136: 104095, 2020 Apr.
Article in English | MEDLINE | ID: mdl-32058265

ABSTRACT

Clinicians write a billion free-text notes per year, and these notes are typically replete with errors of all types. No established automated method can extract data from this treasure trove, so the practice of medicine remains haphazard and chaotic, resulting in vast economic waste. The lexeme hypotheses are based on our analysis of how records are created. They enable a computer system to predict which issue a clinician will need to address next, based on the environment in which the clinician is working and the responses the clinician has selected so far. The system uses a lexicon storing the issues (queries) and a range of responses to each issue. When the clinician selects a response, a text fragment is added to the output file. In the first phase of this work, the notes of 69 returning hemophilia patients were scrutinized, and the lexicon was expanded to 847 lexeme queries and 7,995 responses, enough to construct complete notes. The quality of lexeme-generated notes from 20 consecutive subjects was then compared with the clinicians' conventional clinic notes. The system generated grammatically correct notes. Compared with the traditional clinic notes, the lexeme-generated notes were more complete (88% versus 62%) and had fewer typographical and grammatical errors (0.8 versus 3.5 errors per note). Both the system notes and the traditional notes averaged about 800 words, but the traditional notes had a much wider distribution of lengths. The note-creation rate from marshalling the data to completion averaged 80 words per minute with the system, roughly twice as fast as a typical clinician can type. The lexeme method thus generates more complete, grammatical, and organized notes faster than traditional methods. The notes are fully computerized at inception, and they incorporate prompts for clinicians to address otherwise overlooked items. This pilot justifies further exploration of the methodology.
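The lexeme mechanism itself is simple to sketch: a lexicon maps each predicted query to candidate responses, and each selected response emits a text fragment into the note. The queries, responses, and fragments below are invented, far smaller than the 847-query lexicon described.

```python
# Toy lexicon: query -> {response: text fragment}.
LEXICON = {
    "bleeding since last visit?": {
        "none": "No bleeding episodes since the last visit.",
        "minor": "Minor bleeding episodes, self-managed at home.",
    },
    "factor adherence?": {
        "good": "Adherent to the prophylactic factor regimen.",
    },
}

def build_note(selections):
    """Assemble a note from (query, response) picks made by the clinician."""
    return " ".join(LEXICON[query][response] for query, response in selections)

print(build_note([("bleeding since last visit?", "none"),
                  ("factor adherence?", "good")]))
```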


Subjects
Data Curation/standards , Documentation/methods , Information Storage and Retrieval/methods , Medical History Taking/methods , Practice Patterns, Physicians'/standards , Word Processing/statistics & numerical data , Writing/standards , Adult , Automation , Clinical Competence , Hemophilia A/diagnosis , Hemophilia A/therapy , Humans , Medical Records , Pilot Projects , Young Adult
10.
JCO Clin Cancer Inform; 3: 1-11, 2019 Nov.
Article in English | MEDLINE | ID: mdl-31834820

ABSTRACT

PURPOSE: Data sharing creates potential cost savings, supports data aggregation, and facilitates reproducibility to ensure quality research; however, data from heterogeneous systems require retrospective harmonization. This is a major hurdle for researchers who seek to leverage existing data. Efforts focused on data interoperability largely center on the use of standards but ignore the problems of competing standards and the value of existing data. Interoperability thus remains reliant on retrospective harmonization, and approaches to reduce this burden are needed. METHODS: The Cancer Imaging Archive (TCIA) is an example of an imaging repository that accepts data from a diversity of sources. It contains medical images from investigators worldwide and substantial nonimage data. Digital Imaging and Communications in Medicine (DICOM) standards enable querying across images, but TCIA does not enforce other standards for describing nonimage supporting data, such as treatment details and patient outcomes. In this study, we used 9 TCIA lung and brain nonimage files containing 659 fields to explore retrospective harmonization for cross-study query and aggregation, using the Genomic Data Commons (GDC) data elements as the target standard. Identifying 41 overlapping fields in 3 or more files and transforming 31 of them took 329.5 hours of effort (roughly 2.3 working months) spread over 6 months. RESULTS: We characterized the issues encountered and developed recommendations for reducing the burden of retrospective harmonization. Once the data were harmonized, we also developed a Web tool to easily explore the harmonized collections. CONCLUSION: While prospective use of standards can support interoperability, several issues complicate this goal. Our work recognizes and reveals the retrospective harmonization issues that arise when reusing existing data and recommends national infrastructure to address them.
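At its core, retrospective harmonization maps heterogeneous source fields onto one target data element, transforming values along the way. A minimal sketch with invented field names (the real targets were GDC data elements):

```python
# source field -> (target element, value transform); illustrative only.
FIELD_MAP = {
    "pt_sex": ("gender", str.lower),
    "SexCode": ("gender", lambda v: {"M": "male", "F": "female"}[v]),
}

def harmonize(record):
    """Rename and transform known fields; unknown fields need manual review."""
    out = {}
    for source, value in record.items():
        if source in FIELD_MAP:
            target, transform = FIELD_MAP[source]
            out[target] = transform(value)
    return out

print(harmonize({"pt_sex": "MALE"}))  # {'gender': 'male'}
print(harmonize({"SexCode": "F"}))    # {'gender': 'female'}
```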


Subjects
Brain Neoplasms/diagnostic imaging , Data Curation/standards , Health Information Interoperability/standards , Lung Neoplasms/diagnostic imaging , Brain Neoplasms/diagnosis , Data Curation/methods , Databases, Factual , Guidelines as Topic , Humans , Lung Neoplasms/diagnosis , Reproducibility of Results , Retrospective Studies
12.
Genet Epidemiol; 43(4): 356-364, 2019 Jun.
Article in English | MEDLINE | ID: mdl-30657194

ABSTRACT

When interpreting genome-wide association peaks, it is common to annotate each peak by searching for genes with plausible relationships to the trait. However, "all that glitters is not gold": one might interpret apparent patterns in the data as plausible even when the peak is a false positive. Accordingly, we sought to see how human annotators interpreted association results containing a mixture of peaks from both the original trait and a genetically uncorrelated "synthetic" trait. Two of us prepared a mix of original and synthetic peaks of three significance categories from five different scans, along with relevant literature search results, and then we all annotated these regions. Three annotators also scored the strength of evidence connecting each peak to the scanned trait and the likelihood of further studying that region. While annotators found original peaks to have stronger evidence (Bonferroni-corrected p = 0.017) and a higher likelihood of further study (Bonferroni-corrected p = 0.006) than synthetic peaks, annotators often made convincing connections between the synthetic peaks and the original trait, finding such connections 55% of the time. These results show that it is not difficult for annotators to make convincing connections between synthetic association signals and genes found in those regions.
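The blinded-mixture design can be sketched in a few lines: real and synthetic peaks are shuffled together and provenance is stripped before annotators see them. The loci below are invented.

```python
import random

real = [{"locus": f"chr{i}:1000000", "source": "original"} for i in (1, 2, 3)]
fake = [{"locus": f"chr{i}:2000000", "source": "synthetic"} for i in (4, 5)]

mixed = real + fake
random.shuffle(mixed)
# Annotators receive only the loci; the source labels stay with the key.
blinded = [{"locus": peak["locus"]} for peak in mixed]
answer_key = {peak["locus"]: peak["source"] for peak in mixed}
```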


Subjects
Data Curation , Data Interpretation, Statistical , False Positive Reactions , Genome-Wide Association Study/statistics & numerical data , Data Curation/methods , Data Curation/standards , Data Curation/statistics & numerical data , Deception , Genome-Wide Association Study/standards , Humans , Phenotype , Polymorphism, Single Nucleotide
13.
PLoS One; 14(12): e0218904, 2019.
Article in English | MEDLINE | ID: mdl-31891586

ABSTRACT

Video and image data are regularly used in benthic ecology to document biodiversity. However, their use is subject to a number of challenges, principally the identification of taxa within images without associated physical specimens. The difficulty of applying traditional taxonomic keys to identifying fauna from images has led to the development of personal, group, or institution-level reference image catalogues of operational taxonomic units (OTUs) or morphospecies. Lack of standardisation among these reference catalogues has led to problems with observer bias and the inability to combine datasets across studies. In addition, the lack of a common reference standard is stifling efforts to apply artificial intelligence to taxon identification. Using the North Atlantic deep sea as a case study, we propose a database structure to facilitate standardisation of morphospecies image catalogues between research groups and to support future use in multiple front-end applications. We also propose a framework for coordinating international efforts to develop reference guides for the identification of marine species from images. The proposed structure maps to the Darwin Core standard to allow integration with existing databases. We suggest a management framework in which high-level taxonomic groups are curated by a regional team consisting of both end users and taxonomic experts, and we identify a mechanism by which the overall quality of data within a common reference guide could be raised over the next decade. Finally, we discuss the role of a common reference standard in advancing marine ecology and supporting sustainable use of this ecosystem.
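The sketch below shows what a catalogue record mapped to Darwin Core might look like; the term names (scientificName, identificationQualifier, associatedMedia, identifiedBy, waterBody) are genuine Darwin Core terms, while the values and URL are invented.

```python
# Hypothetical morphospecies record expressed with Darwin Core terms.
record = {
    "scientificName": "Porifera",                     # highest confident rank
    "identificationQualifier": "morphospecies msp-017",
    "associatedMedia": "https://example.org/images/msp-017.jpg",
    "identifiedBy": "Regional curation team",
    "waterBody": "North Atlantic",
}
```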


Subjects
Classification/methods , Image Processing, Computer-Assisted/standards , Marine Biology/standards , Animals , Artificial Intelligence , Biodiversity , Data Curation/methods , Data Curation/standards , Databases, Factual , Ecology , Ecosystem , Image Processing, Computer-Assisted/methods , Marine Biology/classification
14.
Sci Data; 5: 180258, 2018 Nov 20.
Article in English | MEDLINE | ID: mdl-30457569

ABSTRACT

Clinical case reports (CCRs) provide an important means of sharing clinical experience about atypical disease phenotypes and new therapies. However, published case reports contain largely unstructured and heterogeneous clinical data, posing a challenge to mining relevant information. Current indexing approaches generally concern document-level features and have not been specifically designed for CCRs. To address this gap, we developed a standardized metadata template and identified text corresponding to medical concepts within 3,100 curated CCRs spanning 15 disease groups and more than 750 reports of rare diseases. We also prepared a subset of metadata on reports of selected mitochondrial diseases and assigned ICD-10 diagnostic codes to each. The resulting resource, Metadata Acquired from Clinical Case Reports (MACCRs), contains text associated with high-level clinical concepts, including demographics, disease presentation, treatments, and outcomes for each report. Our template and the MACCR set render CCRs more findable, accessible, interoperable, and reusable (FAIR) while serving as valuable resources for key user groups, including researchers, physician investigators, clinicians, data scientists, and those shaping government policies for clinical trials.
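As an illustration of the kind of structure such a template imposes, here is a hypothetical record in the spirit of MACCR; the field names and values are invented, and G71.3 is simply a real ICD-10 code (mitochondrial myopathy) used for the example.

```python
# Invented structured metadata for a single case report.
maccr_record = {
    "demographics": {"age": 34, "sex": "female"},
    "disease_presentation": "progressive muscle weakness",
    "diagnosis_icd10": "G71.3",  # mitochondrial myopathy
    "treatments": ["coenzyme Q10 supplementation"],
    "outcome": "symptom stabilization at 12 months",
}
```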


Subjects
Clinical Studies as Topic , Data Curation , Metadata , Computational Biology , Data Analysis , Data Curation/methods , Data Curation/standards , Humans , Metadata/standards
15.
Sci Data; 5: 180259, 2018 Nov 20.
Article in English | MEDLINE | ID: mdl-30457573

ABSTRACT

This article presents a practical roadmap for scholarly publishers to implement data citation in accordance with the Joint Declaration of Data Citation Principles (JDDCP), a synopsis and harmonization of the recommendations of major science policy bodies. It was developed by the Publishers Early Adopters Expert Group as part of the Data Citation Implementation Pilot (DCIP) project, an initiative of FORCE11.org and the NIH BioCADDIE program. The structure of the roadmap follows the "life of a paper" workflow and includes the categories Pre-submission, Submission, Production, and Publication. The roadmap is intended to be publisher-agnostic, so that any publisher can use it as a starting point when implementing JDDCP-compliant data citation. Authors reading this roadmap will also better know what to expect from publishers and how to enable their own data citations to gain maximum impact, as well as how to comply with increasingly common funder mandates on data transparency.


Subjects
Publishing/standards , Data Curation/standards
16.
Anim Genet; 49(6): 520-526, 2018 Dec.
Article in English | MEDLINE | ID: mdl-30311252

ABSTRACT

The Functional Annotation of ANimal Genomes (FAANG) project aims, through a coordinated international effort, to provide high quality functional annotation of animal genomes with an initial focus on farmed and companion animals. A key goal of the initiative is to ensure high quality and rich supporting metadata to describe the project's animals, specimens, cell cultures and experimental assays. By defining rich sample and experimental metadata standards and promoting best practices in data descriptions, deposition and openness, FAANG champions higher quality and reusability of published datasets. FAANG has established a Data Coordination Centre, which sits at the heart of the Metadata and Data Sharing Committee. It continues to evolve the metadata standards, support submissions and, crucially, create powerful and accessible tools to support deposition and validation of metadata. FAANG conforms to the findable, accessible, interoperable, and reusable (FAIR) data principles, with high quality, open access and functionally interlinked data. In addition to data generated by FAANG members and specific FAANG projects, existing datasets that meet the main (or the more permissive legacy) standards are incorporated into a central, focused, functional data resource portal for the entire farmed and companion animal community. Through clear and effective metadata standards, validation and conversion software, combined with promotion of best practices in metadata implementation, FAANG aims to maximise effectiveness and inter-comparability of assay data. This supports the community to create a rich genome-to-phenotype resource and promotes continuing improvements in animal data standards as a whole.
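Deposition-time metadata validation of the kind described might look like the sketch below; the required fields and rule are assumptions for illustration, not the actual FAANG sample standard.

```python
# Assumed required fields; the real FAANG standards are richer than this.
REQUIRED = {"species", "tissue", "collection_date"}

def validate(sample):
    """Return (is_valid, missing_fields) for one sample-metadata record."""
    missing = REQUIRED - sample.keys()
    return (not missing, sorted(missing))

print(validate({"species": "Bos taurus", "tissue": "liver"}))
# (False, ['collection_date'])
```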


Subjects
Data Curation/standards , Genomics , Metadata/standards , Animals , Livestock , Pets , Software
20.
Nucleic Acids Res; 46(D1): D221-D228, 2018 Jan 4.
Article in English | MEDLINE | ID: mdl-29126148

ABSTRACT

The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community.
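The consensus rule at the heart of CCDS can be sketched as an exact-agreement check between two independent annotation sources; the gene names, coordinates, and assignment of IDs below are toy data (only the CCDS-style identifier format is real).

```python
# Coding-region annotations from two independent sources (toy coordinates).
ncbi = {"GENE_A": (1000, 1999), "GENE_B": (5000, 5899)}
ensembl = {"GENE_A": (1000, 1999), "GENE_B": (5000, 5900)}  # GENE_B disagrees

# Only identically annotated regions enter the consensus set.
consensus = {g: c for g, c in ncbi.items() if ensembl.get(g) == c}
ccds_ids = {g: f"CCDS{i + 1}.1" for i, g in enumerate(sorted(consensus))}
print(ccds_ids)  # {'GENE_A': 'CCDS1.1'}
```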


Subjects
Consensus Sequence , Databases, Genetic , Open Reading Frames , Animals , Data Curation/methods , Data Curation/standards , Databases, Genetic/standards , Guidelines as Topic , Humans , Mice , Molecular Sequence Annotation , National Library of Medicine (U.S.) , United States , User-Computer Interface