Results 1 - 20 of 44
1.
Nucleic Acids Res ; 49(D1): D924-D931, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33104772

ABSTRACT

The Gene Expression Database (GXD; www.informatics.jax.org/expression.shtml) is an extensive and well-curated community resource of mouse developmental gene expression information. For many years, GXD has collected and integrated data from RNA in situ hybridization, immunohistochemistry, RT-PCR, northern blot, and western blot experiments through curation of the scientific literature and by collaborations with large-scale expression projects. Since our last report in 2019, we have continued to acquire these classical types of expression data; developed a searchable index of RNA-Seq and microarray experiments that allows users to quickly and reliably find specific mouse expression studies in ArrayExpress (https://www.ebi.ac.uk/arrayexpress/) and GEO (https://www.ncbi.nlm.nih.gov/geo/); and expanded GXD to include RNA-Seq data. Uniformly processed RNA-Seq data are imported from the EBI Expression Atlas and then integrated with the other types of expression data in GXD, and with the genetic, functional, phenotypic and disease-related information in Mouse Genome Informatics (MGI). This integration has made the RNA-Seq data accessible via GXD's enhanced searching and filtering capabilities. Further, we have embedded the Morpheus heat map utility into the GXD user interface to provide additional tools for display and analysis of RNA-Seq data, including heat map visualization, sorting, filtering, hierarchical clustering, nearest neighbors analysis and visual enrichment.


Subject(s)
Databases, Genetic , Gene Expression Profiling/methods , Gene Expression , High-Throughput Nucleotide Sequencing/methods , Animals , Cluster Analysis , Internet , Mice , Proteins/genetics , Proteins/metabolism , User-Computer Interface
2.
Mamm Genome ; 33(1): 55-65, 2022 03.
Article in English | MEDLINE | ID: mdl-34482425

ABSTRACT

Recombinase alleles and transgenes can be used to facilitate spatio-temporal specificity of gene disruption or transgene expression. However, the versatility of this in vivo recombination system relies on having detailed and accurate characterization of recombinase expression and activity to enable selection of the appropriate allele or transgene. The CrePortal ( http://www.informatics.jax.org/home/recombinase ) leverages the informatics infrastructure of Mouse Genome Informatics to integrate data from the scientific literature, direct data submissions from the scientific community at-large, and from major projects developing new recombinase lines and characterizing recombinase expression and specificity patterns. Searching the CrePortal by recombinase activity or specific recombinase gene driver provides users with a recombinase alleles and transgenes activity tissue summary and matrix comparison of gene expression and recombinase activity with links to generation details, a recombinase activity grid, and associated phenotype annotations. Future improvements will add cell type-based activity annotations. The CrePortal provides a comprehensive presentation of recombinase allele and transgene data to assist researchers in selection of the recombinase allele or transgene based on where and when recombination is desired.


Subject(s)
Integrases , Recombinases , Alleles , Animals , Integrases/genetics , Integrases/metabolism , Mice , Mice, Transgenic , Recombinases/genetics , Transgenes
3.
Mamm Genome ; 33(1): 4-18, 2022 03.
Article in English | MEDLINE | ID: mdl-34698891

ABSTRACT

The Mouse Genome Informatics (MGI) database system combines multiple expertly curated community data resources into a shared knowledge management ecosystem united by common metadata annotation standards. MGI's mission is to facilitate the use of the mouse as an experimental model for understanding the genetic and genomic basis of human health and disease. MGI is the authoritative source for mouse gene, allele, and strain nomenclature and is the primary source of mouse phenotype annotations, functional annotations, developmental gene expression information, and annotations of mouse models with human diseases. MGI maintains mouse anatomy and phenotype ontologies and contributes to the development of the Gene Ontology and Disease Ontology and uses these ontologies as standard terminologies for annotation. The Mouse Genome Database (MGD) and the Gene Expression Database (GXD) are MGI's two major knowledgebases. Here, we highlight some of the recent changes and enhancements to MGD and GXD that have been implemented in response to changing needs of the biomedical research community and to improve the efficiency of expert curation. MGI can be accessed freely at http://www.informatics.jax.org .


Subject(s)
Databases, Genetic , Ecosystem , Alleles , Animals , Gene Ontology , Genomics , Mice
4.
Bioinformatics ; 37(Suppl_1): i468-i476, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252939

ABSTRACT

MOTIVATION: Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature, a labor-intensive process. The first step toward biocuration requires identifying articles relevant to the specific area on which the database focuses. Thus, automatically identifying publications relevant to a specific topic within a large volume of publications is an important task toward expediting the biocuration process and, in turn, biomedical research. Current methods focus on textual contents, typically extracted from the title-and-abstract. Notably, images and captions are often used in publications to convey pivotal evidence about processes, experiments and results.
RESULTS: We present a new document classification scheme, using both image and caption information, in addition to titles-and-abstracts. To use the image information, we introduce a new image representation, namely Figure-word, based on class labels of subfigures. We use word embeddings for representing captions and titles-and-abstracts. To utilize all three types of information, we introduce two information integration methods. The first combines Figure-words and textual features obtained from captions and titles-and-abstracts into a single larger vector for document representation; the second employs a meta-classification scheme. Our experiments and results demonstrate the usefulness of the newly proposed Figure-words for representing images. Moreover, the results showcase the value of Figure-words, captions and titles-and-abstracts in providing complementary information for document classification; these three sources of information, when combined, lead to an overall improved classification performance.
AVAILABILITY AND IMPLEMENTATION: Source code and the list of PMIDs of the publications in our datasets are available upon request.
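
The two integration strategies named in the abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not specify the underlying classifiers, so a hand-weighted logistic score stands in for them, and every function name, vector, and weight below is hypothetical.

```python
import math


def concatenate_features(figure_words, caption_vec, title_abstract_vec):
    """Strategy 1 (assumed form): join the three views into one larger vector."""
    return list(figure_words) + list(caption_vec) + list(title_abstract_vec)


def logistic_score(weights, bias, features):
    """Hypothetical stand-in classifier: returns a score in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def meta_classify(base_models, views, meta_weights, meta_bias):
    """Strategy 2 (assumed form): one base classifier per view, then a
    second-stage classifier that takes the base scores as its features."""
    base_scores = [
        logistic_score(weights, bias, view)
        for (weights, bias), view in zip(base_models, views)
    ]
    return logistic_score(meta_weights, meta_bias, base_scores), base_scores


# Toy document: all values are made up for illustration.
figure_words = [2, 0, 1]          # counts of subfigure class labels
caption_vec = [0.4, -0.1]         # caption embedding
title_abstract_vec = [0.3, 0.2]   # title-and-abstract embedding

# Strategy 1: a single 7-dimensional document vector.
combined = concatenate_features(figure_words, caption_vec, title_abstract_vec)

# Strategy 2: three base scores fed to a meta-classifier.
views = [figure_words, caption_vec, title_abstract_vec]
base_models = [([0.5, 0.1, 0.3], -0.5), ([1.0, 1.0], 0.0), ([0.8, -0.2], 0.1)]
meta_score, base_scores = meta_classify(base_models, views, [1.0, 1.0, 1.0], -1.5)
```

In the first strategy a single classifier would be trained on `combined`; in the second, each view keeps its own classifier and only the scores are merged, which lets the views differ in dimensionality and scale.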


Subject(s)
Biomedical Research , Databases, Factual
5.
Nucleic Acids Res ; 47(D1): D774-D779, 2019 01 08.
Article in English | MEDLINE | ID: mdl-30335138

ABSTRACT

The mouse Gene Expression Database (GXD) is an extensive, well-curated community resource freely available at www.informatics.jax.org/expression.shtml. Covering all developmental stages, GXD includes data from RNA in situ hybridization, immunohistochemistry, RT-PCR, northern blot and western blot experiments in wild-type and mutant mice. GXD's gene expression information is integrated with the other data in Mouse Genome Informatics and interconnected with other databases, placing these data in the larger biological and biomedical context. Since the last report, the ability of GXD to provide insights into the molecular mechanisms of development and disease has been greatly enhanced by the addition of new data and by the implementation of new web features. These include: improvements to the Differential Gene Expression Data Search, facilitating searches for genes that have been shown to be exclusively expressed in a specified structure and/or developmental stage; an enhanced anatomy browser that now provides access to expression data and phenotype data for a given anatomical structure; direct access to the wild-type gene expression data for the tissues affected in a specific mutant; and a comparison matrix that juxtaposes tissues where a gene is normally expressed against tissues where mutations in that gene cause abnormalities.


Subject(s)
Databases, Genetic , Genome/genetics , Transcriptome/genetics , Animals , Internet , Mice , User-Computer Interface
6.
Nucleic Acids Res ; 45(D1): D730-D736, 2017 01 04.
Article in English | MEDLINE | ID: mdl-27899677

ABSTRACT

The Gene Expression Database (GXD; www.informatics.jax.org/expression.shtml) is an extensive and well-curated community resource of mouse developmental expression information. Through curation of the scientific literature and by collaborations with large-scale expression projects, GXD collects and integrates data from RNA in situ hybridization, immunohistochemistry, RT-PCR, northern blot and western blot experiments. Expression data from both wild-type and mutant mice are included. The expression data are combined with genetic and phenotypic data in Mouse Genome Informatics (MGI) and made readily accessible to many types of database searches. At present, GXD includes over 1.5 million expression results and more than 300 000 images, all annotated with detailed and standardized metadata. Since our last report in 2014, we have added a large amount of data, we have enhanced data and database infrastructure, and we have implemented many new search and display features. Interface enhancements include: a new Mouse Developmental Anatomy Browser; interactive tissue-by-developmental stage and tissue-by-gene matrix views; capabilities to filter and sort expression data summaries; a batch search utility; gene-based expression overviews; and links to expression data from other species.


Subject(s)
Computational Biology/methods , Databases, Genetic , Gene Expression Profiling/methods , Gene Expression , Genomics/methods , Animals , Gene Ontology , Mice , Organ Specificity , Search Engine , User-Computer Interface , Web Browser
8.
Nucleic Acids Res ; 42(Database issue): D818-24, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24163257

ABSTRACT

The Gene Expression Database (GXD; http://www.informatics.jax.org/expression.shtml) is an extensive and well-curated community resource of mouse developmental expression information. GXD collects different types of expression data from studies of wild-type and mutant mice, covering all developmental stages and including data from RNA in situ hybridization, immunohistochemistry, RT-PCR, northern blot and western blot experiments. The data are acquired from the scientific literature and from researchers, including groups doing large-scale expression studies. Integration with the other data in Mouse Genome Informatics (MGI) and interconnections with other databases place GXD's gene expression information in the larger biological and biomedical context. Since the last report, the utility of GXD has been greatly enhanced by the addition of new data and by the implementation of more powerful and versatile search and display features. Web interface enhancements include the capability to search for expression data for genes associated with specific phenotypes and/or human diseases; new, more interactive data summaries; easy downloading of data; direct searches of expression images via associated metadata; and new displays that combine image data and their associated annotations. At present, GXD includes >1.4 million expression results and 250,000 images that are accessible to our search tools.


Subject(s)
Databases, Genetic , Gene Expression , Mice/genetics , Animals , Internet , User-Computer Interface
9.
Genesis ; 53(8): 510-22, 2015 Aug.
Article in English | MEDLINE | ID: mdl-26045019

ABSTRACT

The Gene Expression Database (GXD) is an extensive and freely available community resource of mouse developmental expression data. GXD curates and integrates expression data from the literature, via electronic data submissions, and by collaborations with large-scale projects. As an integral component of the Mouse Genome Informatics Resource, GXD combines expression data with genetic, functional, phenotypic, and disease-related data, and provides tools for the research community to search for and analyze expression data in this larger context. Recent enhancements include: an interactive browser to navigate the mouse developmental anatomy and find expression data for specific anatomical structures; the capability to search for expression data of genes located in specific genomic regions, supporting the identification of disease candidate genes; a summary displaying all the expression images that meet specified search criteria; interactive matrix views that provide overviews of spatio-temporal expression patterns (Tissue × Stage Matrix) and enable the comparison of expression patterns between genes (Tissue × Gene Matrix); data zoom and filter utilities to iteratively refine summary displays and data sets; and gene-based links to expression data from other model organisms, such as chicken, Xenopus, and zebrafish, fostering comparative expression analysis for species that are highly relevant for developmental research.


Subject(s)
Databases, Genetic , Gene Expression Profiling/methods , Mice/genetics , Animals , Data Curation , Genomics/methods , Internet , Models, Animal
10.
Mamm Genome ; 26(9-10): 422-30, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26208972

ABSTRACT

Mouse anatomy ontologies provide standard nomenclature for describing normal and mutant mouse anatomy, and are essential for the description and integration of data directly related to anatomy such as gene expression patterns. Building on our previous work on anatomical ontologies for the embryonic and adult mouse, we have recently developed a new and substantially revised anatomical ontology covering all life stages of the mouse. Anatomical terms are organized in complex hierarchies enabling multiple relationships between terms. Tissue classification as well as partonomic, developmental, and other types of relationships can be represented. Hierarchies for specific developmental stages can also be derived. The ontology forms the core of the eMouse Atlas Project (EMAP) and is used extensively for annotating and integrating gene expression patterns and other data by the Gene Expression Database (GXD), the eMouse Atlas of Gene Expression (EMAGE) and other database resources. Here we illustrate the evolution of the developmental and adult mouse anatomical ontologies toward one combined system. We report on recent ontology enhancements, describe the current status, and discuss future plans for mouse anatomy ontology development and application in integrating data resources.


Subject(s)
Computational Biology , Organ Specificity/genetics , Software , Animals , Databases, Genetic , Gene Expression Regulation, Developmental , Mice
11.
Mamm Genome ; 26(7-8): 272-84, 2015 Aug.
Article in English | MEDLINE | ID: mdl-26238262

ABSTRACT

From its inception in 1989, the mission of the Mouse Genome Informatics (MGI) resource remains to integrate genetic, genomic, and biological data about the laboratory mouse to facilitate the study of human health and disease. This mission is ever more feasible as the revolution in genetics knowledge, the ability to sequence genomes, and the ability to specifically manipulate mammalian genomes are now at our fingertips. Through major paradigm shifts in biological research and computer technologies, MGI has adapted and evolved to become an integral part of the larger global bioinformatics infrastructure and honed its ability to provide authoritative reference datasets used and incorporated by many other established bioinformatics resources. Here, we review some of the major changes in research approaches over the last quarter century, how these changes are reflected in the MGI resource you use today, and what may be around the next corner.


Subject(s)
Databases, Genetic/history , Genome , Genomics/history , Software , Animals , Databases, Genetic/supply & distribution , Disease Models, Animal , Genomics/methods , Genomics/trends , Genotype , History, 20th Century , History, 21st Century , Humans , Mice , Mutagenesis, Site-Directed , Phenotype , Reverse Genetics
12.
Mamm Genome ; 26(7-8): 314-24, 2015 Aug.
Article in English | MEDLINE | ID: mdl-25939429

ABSTRACT

The Gene Expression Database (GXD) is an extensive, easily searchable, and freely available database of mouse gene expression information (www.informatics.jax.org/expression.shtml). GXD was developed to foster progress toward understanding the molecular basis of human development and disease. GXD contains information about when and where genes are expressed in different tissues in the mouse, especially during the embryonic period. GXD collects different types of expression data from wild-type and mutant mice, including RNA in situ hybridization, immunohistochemistry, RT-PCR, and northern and western blot results. The GXD curators read the scientific literature and enter the expression data from those papers into the database. GXD also acquires expression data directly from researchers, including groups doing large-scale expression studies. GXD currently contains nearly 1.5 million expression results for over 13,900 genes. In addition, it has over 265,000 images of expression data, allowing users to retrieve the primary data and interpret it themselves. By being an integral part of the larger Mouse Genome Informatics (MGI) resource, GXD's expression data are combined with other genetic, functional, phenotypic, and disease-oriented data. This allows GXD to provide tools for researchers to evaluate expression data in the larger context, search by a wide variety of biologically and biomedically relevant parameters, and discover new data connections to help in the design of new experiments. Thus, GXD can provide researchers with critical insights into the functions of genes and the molecular mechanisms of development, differentiation, and disease.


Subject(s)
Data Mining/methods , Databases, Genetic , Genome , User-Computer Interface , Animals , Embryo, Mammalian , Gene Expression , Genetic Markers , Humans , Information Dissemination , Mice , Organ Specificity
13.
Mamm Genome ; 26(7-8): 305-13, 2015 Aug.
Article in English | MEDLINE | ID: mdl-26223881

ABSTRACT

The mouse genome database (MGD) is the model organism database component of the mouse genome informatics system at The Jackson Laboratory. MGD is the international data resource for the laboratory mouse and facilitates the use of mice in the study of human health and disease. Since its beginnings, MGD has included comparative genomics data with a particular focus on human-mouse orthology, an essential component of the use of mouse as a model organism. Over the past 25 years, novel algorithms and addition of orthologs from other model organisms have enriched comparative genomics in MGD data, extending the use of orthology data to support the laboratory mouse as a model of human biology. Here, we describe current comparative data in MGD and review the history and refinement of orthology representation in this resource.


Subject(s)
Databases, Genetic/history , Genome , Genomics/methods , Sequence Homology, Amino Acid , Alleles , Animals , Disease Models, Animal , Genomics/history , Genotype , History, 20th Century , History, 21st Century , Humans , Mice , Molecular Sequence Annotation , Phenotype , Phylogeny
14.
Dev Dyn ; 243(10): 1176-86, 2014 Oct.
Article in English | MEDLINE | ID: mdl-24958384

ABSTRACT

Because molecular mechanisms of development are extraordinarily complex, the understanding of these processes requires the integration of pertinent research data. Using the Gene Expression Database for Mouse Development (GXD) as an example, we illustrate the progress made toward this goal, and discuss relevant issues that apply to developmental databases and developmental research in general. Since its first release in 1998, GXD has served the scientific community by integrating multiple types of expression data from publications and electronic submissions and by making these data freely and widely available. Focusing on endogenous gene expression in wild-type and mutant mice and covering data from RNA in situ hybridization, in situ reporter (knock-in), immunohistochemistry, reverse transcriptase-polymerase chain reaction, Northern blot, and Western blot experiments, the database has grown tremendously over the years in terms of data content and search utilities. Currently, GXD includes over 1.4 million annotated expression results and over 260,000 images. All these data and images are readily accessible to many types of database searches. Here we describe the data and search tools of GXD; explain how to use the database most effectively; discuss how we acquire, curate, and integrate developmental expression information; and describe how the research community can help in this process.


Subject(s)
Databases, Genetic , Gene Expression Regulation, Developmental , Gene Expression , Mice/embryology , Access to Information , Animals , Humans , Information Storage and Retrieval , Mice/genetics , User-Computer Interface
15.
Genetics ; 227(1), 2024 05 07.
Article in English | MEDLINE | ID: mdl-38531069

ABSTRACT

Mouse Genome Informatics (MGI) is a federation of expertly curated information resources designed to support experimental and computational investigations into genetic and genomic aspects of human biology and disease using the laboratory mouse as a model system. The Mouse Genome Database (MGD) and the Gene Expression Database (GXD) are core MGI databases that share data and system architecture. MGI serves as the central community resource of integrated information about mouse genome features, variation, expression, gene function, phenotype, and human disease models acquired from peer-reviewed publications, author submissions, and major bioinformatics resources. To facilitate integration and standardization of data, biocuration scientists annotate using terms from controlled metadata vocabularies and biological ontologies (e.g. Mammalian Phenotype Ontology, Mouse Developmental Anatomy, Disease Ontology, Gene Ontology, etc.), and by applying international community standards for gene, allele, and mouse strain nomenclature. MGI serves basic scientists, translational researchers, and data scientists by providing access to FAIR-compliant data in both human-readable and compute-ready formats. The MGI resource is accessible at https://informatics.jax.org. Here, we present an overview of the core data types represented in MGI and highlight recent enhancements to the resource with a focus on new data and functionality for MGD and GXD.


Subject(s)
Databases, Genetic , Genome , Animals , Mice , Knowledge Bases , Genomics/methods , Computational Biology/methods , Humans
16.
Nucleic Acids Res ; 39(Database issue): D835-41, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21062809

ABSTRACT

The Gene Expression Database (GXD) is a community resource of mouse developmental expression information. GXD integrates different types of expression data at the transcript and protein level and captures expression information from many different mouse strains and mutants. GXD places these data in the larger biological context through integration with other Mouse Genome Informatics (MGI) resources and interconnections with many other databases. Web-based query forms support simple or complex searches that take advantage of all these integrated data. The data in GXD are obtained from the literature, from individual laboratories, and from large-scale data providers. All data are annotated and reviewed by GXD curators. Since the last report, the GXD data content has increased significantly, the interface and data displays have been improved, new querying capabilities were implemented, and links to other expression resources were added. GXD is available through the MGI web site (www.informatics.jax.org), or directly at www.informatics.jax.org/expression.shtml.


Subject(s)
Databases, Genetic , Gene Expression , Mice/genetics , Animals , Computer Graphics , Mice/embryology , Mice/growth & development , User-Computer Interface
17.
Nucleic Acids Res ; 39(Database issue): D849-55, 2011 Jan.
Article in English | MEDLINE | ID: mdl-20929875

ABSTRACT

The International Knockout Mouse Consortium (IKMC) aims to mutate all protein-coding genes in the mouse using a combination of gene targeting and gene trapping in mouse embryonic stem (ES) cells and to make the generated resources readily available to the research community. The IKMC database and web portal (www.knockoutmouse.org) serves as the central public web site for IKMC data and facilitates the coordination and prioritization of work within the consortium. Researchers can access up-to-date information on IKMC knockout vectors, ES cells and mice for specific genes, and follow links to the respective repositories from which corresponding IKMC products can be ordered. Researchers can also use the web site to nominate genes for targeting, or to indicate that targeting of a gene should receive high priority. The IKMC database provides data to, and features extensive interconnections with, other community databases.


Subject(s)
Databases, Genetic , Mice, Knockout , Alleles , Animals , Gene Targeting , Genetic Vectors , Genomics , Internet , Mice , Molecular Sequence Annotation , User-Computer Interface
18.
Mamm Genome ; 23(9-10): 550-8, 2012 Oct.
Article in English | MEDLINE | ID: mdl-22847375

ABSTRACT

Mouse gene expression data are complex and voluminous. To maximize the utility of these data, they must be made readily accessible through databases, and those resources need to place the expression data in the larger biological context. Here we describe two community resources that approach these problems in different but complementary ways: BioGPS and the Mouse Gene Expression Database (GXD). BioGPS connects its large and homogeneous microarray gene expression reference data sets via plugins with a heterogeneous collection of external gene centric resources, thus casting a wide but loose net. GXD acquires different types of expression data from many sources and integrates these data tightly with other types of data in the Mouse Genome Informatics (MGI) resource, with a strong emphasis on consistency checks and manual curation. We describe and contrast the "loose" and "tight" data integration strategies employed by BioGPS and GXD, respectively, and discuss the challenges and benefits of data integration. BioGPS is freely available at http://biogps.org . GXD is freely available through the MGI web site ( www.informatics.jax.org ) or directly at www.informatics.jax.org/expression.shtml .


Subject(s)
Databases, Genetic , Gene Expression , Mice/genetics , Animals , Information Storage and Retrieval
19.
Methods ; 53(4): 405-10, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21185380

ABSTRACT

Recent advances in high-throughput gene targeting and conditional mutagenesis are creating new and powerful resources to study the in vivo function of mammalian genes using the mouse as an experimental model. Mutant ES cells and mice are being generated at a rapid rate to study the molecular and phenotypic consequences of genetic mutations, and to correlate these study results with human disease conditions. Likewise, classical genetics approaches to identify mutations in the mouse genome that cause specific phenotypes have become more effective. Here, we describe methods to quickly obtain information on what mutant ES cells and mice are available, including recombinase driver lines for the generation of conditional mutants. Further, we describe means to access genetic and phenotypic data that identify mouse models for specific human diseases.


Subject(s)
Genes , Mice, Mutant Strains/genetics , Mutation , Phenotype , Animals , Cell Line , Databases, Genetic , Disease Models, Animal , Embryonic Stem Cells/cytology , Genetic Diseases, Inborn/genetics , Humans , Mice , Online Systems , Organ Specificity , Promoter Regions, Genetic , Recombinases/metabolism
20.
Database (Oxford) ; 2020, 2020 01 01.
Article in English | MEDLINE | ID: mdl-32294192

ABSTRACT

Gathering information from the scientific literature is essential for biomedical research, as much knowledge is conveyed through publications. However, the large and rapidly increasing publication rate makes it impractical for researchers to quickly identify all and only those documents related to their interest. As such, automated biomedical document classification attracts much interest. Such classification is critical in the curation of biological databases, because biocurators must scan through a vast number of articles to identify pertinent information within documents most relevant to the database. This is a slow, labor-intensive process that can benefit from effective automation. We present a document classification scheme aiming to identify papers containing information relevant to a specific topic, among a large collection of articles, for supporting the biocuration classification task. Our framework is based on a meta-classification scheme we have introduced before; here we incorporate into it features gathered from figure captions, in addition to those obtained from titles and abstracts. We trained and tested our classifier over a large imbalanced dataset, originally curated by the Gene Expression Database (GXD). GXD collects all the gene expression information in the Mouse Genome Informatics (MGI) resource. As part of the MGI literature classification pipeline, GXD curators identify MGI-selected papers that are relevant for GXD. The dataset consists of ~60 000 documents (5469 labeled as relevant; 52 866 as irrelevant), gathered throughout 2012-2016, in which each document is represented by the text of its title, abstract and figure captions. Our classifier attains precision 0.698, recall 0.784, f-measure 0.738 and Matthews correlation coefficient 0.711, demonstrating that the proposed framework effectively addresses the high imbalance in the GXD classification task.
Moreover, our classifier's performance is significantly improved by utilizing information from image captions compared to using titles and abstracts alone; this observation clearly demonstrates that image captions provide substantial information for supporting biomedical document classification and curation. Database URL.
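
As a rough consistency check on the figures reported in the abstract above, the sketch below recomputes the f-measure from the stated precision and recall, then reconstructs an approximate confusion matrix to recompute the Matthews correlation coefficient. It assumes the standard F1 and MCC definitions and, as a labeled assumption not confirmed by the abstract, that the reported metrics reflect the class ratio of the full dataset (5469 relevant, 52 866 irrelevant). The function names are illustrative only.

```python
import math


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (standard F1)."""
    return 2 * precision * recall / (precision + recall)


def confusion_from_rates(precision, recall, n_pos, n_neg):
    """Approximate (fractional) confusion-matrix cells implied by the rates.

    Assumption: precision and recall were measured at the class ratio
    given by n_pos and n_neg.
    """
    tp = recall * n_pos
    fn = n_pos - tp
    fp = tp / precision - tp
    tn = n_neg - fp
    return tp, fp, fn, tn


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix cells."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator


precision, recall = 0.698, 0.784   # values reported in the abstract
n_pos, n_neg = 5469, 52866         # relevant / irrelevant documents

f1 = f_measure(precision, recall)
tp, fp, fn, tn = confusion_from_rates(precision, recall, n_pos, n_neg)
mcc_value = mcc(tp, fp, fn, tn)
positive_fraction = n_pos / (n_pos + n_neg)

print(f"f-measure ~ {f1:.3f}, MCC ~ {mcc_value:.3f}, "
      f"relevant fraction ~ {positive_fraction:.3f}")
```

Under these assumptions both recomputed values should land close to the reported 0.738 and 0.711. Fewer than one document in ten is labeled relevant, which is why the abstract reports MCC alongside the f-measure.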


Subject(s)
Biomedical Research/statistics & numerical data , Computational Biology/methods , Data Curation/methods , Databases, Factual , Animals , Biomedical Research/classification , Biomedical Research/methods , Computational Biology/classification , Data Mining/methods , Humans , Internet