RegEl corpus: identifying DNA regulatory elements in the scientific literature.

Garda, Samuele; Lenihan-Geels, Freyda; Proft, Sebastian; Hochmuth, Stefanie; Schülke, Markus; Seelow, Dominik; Leser, Ulf

Affiliation

Garda S; Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, 12489, Berlin, Germany.
Lenihan-Geels F; Klinik für Pädiatrie m.S. Neurologie, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353, Berlin, Germany.
Proft S; Bioinformatics and Translational Genetics, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Anna-Louisa-Karsch-Straße 2, 10178, Berlin, Germany.
Hochmuth S; Institut für Medizinische Genetik und Humangenetik, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353, Berlin, Germany.
Schülke M; Klinik für Pädiatrie m.S. Neurologie, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353, Berlin, Germany.
Seelow D; Klinik für Pädiatrie m.S. Neurologie, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353, Berlin, Germany.
Leser U; Bioinformatics and Translational Genetics, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Anna-Louisa-Karsch-Straße 2, 10178, Berlin, Germany.

Database (Oxford); 2022; 2022 06 27.

Article in English | MEDLINE | ID: mdl-35758881

ABSTRACT

High-throughput technologies led to the generation of a wealth of data on regulatory DNA elements in the human genome. However, results from disease-driven studies are primarily shared in textual form as scientific articles. Information extraction (IE) algorithms allow this information to be (semi-)automatically accessed. Their development, however, is dependent on the availability of annotated corpora. Therefore, we introduce RegEl (Regulatory Elements), the first freely available corpus annotated with regulatory DNA elements comprising 305 PubMed abstracts for a total of 2690 sentences. We focus on enhancers, promoters and transcription factor binding sites. Three annotators worked in two stages, achieving an overall 0.73 F1 inter-annotator agreement and 0.46 for regulatory elements. Depending on the entity type, IE baselines reach F1-scores of 0.48-0.91 for entity detection and 0.71-0.88 for entity normalization. Next, we apply our entity detection models to the entire PubMed collection and extract co-occurrences of genes or diseases with regulatory elements. This generates large collections of regulatory elements associated with 137 870 unique genes and 7420 diseases, which we make openly available. Database URL: https://zenodo.org/record/6418451#.YqcLHvexVqg.
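
The co-occurrence step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that entity detection has already produced, for each sentence, a list of (mention, label) pairs, and that co-occurrence is counted at sentence level; the abstract does not specify the granularity. The function name `cooccurrences`, the label strings and the toy mentions are all hypothetical.

```python
from collections import Counter
from itertools import product


def cooccurrences(sentences):
    """Count (gene or disease, regulatory element) pairs sharing a sentence.

    `sentences` is a list of sentences, each given as a list of
    (mention_text, label) tuples. The label names are assumptions
    made for this sketch, not the corpus annotation scheme.
    """
    counts = Counter()
    for mentions in sentences:
        # Sets ensure each pair is counted at most once per sentence.
        partners = {text for text, label in mentions if label in {"gene", "disease"}}
        elements = {text for text, label in mentions if label == "regulatory_element"}
        counts.update(product(sorted(partners), sorted(elements)))
    return counts


# Toy input with placeholder names (not taken from the corpus).
example = [
    [("GENE_A", "gene"), ("enhancer_1", "regulatory_element")],
    [("GENE_A", "gene"), ("disease_X", "disease"), ("enhancer_1", "regulatory_element")],
    [("GENE_B", "gene")],  # no regulatory element, so no pair is produced
]
pair_counts = cooccurrences(example)
```

Here `pair_counts` maps each pair to the number of sentences in which both mentions appear. A real system would first normalize mentions to database identifiers so that spelling variants of the same gene or disease are merged.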

Subjects

Algorithms; Data Mining; DNA/genetics; Data Mining/methods; Databases, Factual; Humans; PubMed

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Algorithms / Data Mining Study type: Prognostic_studies Limits: Humans Language: En Journal: Database (Oxford) Publication year: 2022 Document type: Article Country of affiliation: Germany
