Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 9 de 9
Filtrar
1.
Nucleic Acids Res ; 52(D1): D1668-D1676, 2024 Jan 05.
Artigo em Inglês | MEDLINE | ID: mdl-37994696

RESUMO

Europe PMC (https://europepmc.org/) is an open access database of life science journal articles and preprints, which contains over 42 million abstracts and over 9 million full text articles accessible via the website, APIs and bulk download. This publication outlines new developments to the Europe PMC platform since the last database update in 2020 (1) and focuses on five main areas. (i) Improving discoverability, reproducibility and trust in preprints by indexing new preprint content, enriching preprint metadata and identifying withdrawn and removed preprints. (ii) Enhancing support for text and data mining by expanding the types of annotations provided and developing the Europe PMC Annotations Corpus, which can be used to train machine learning models to increase their accuracy and precision. (iii) Developing the Article Status Monitor tool and email alerts, to notify users about new articles and updates to existing records. (iv) Positioning Europe PMC as an open scholarly infrastructure through increasing the portion of open source core software, improving sustainability and accessibility of the service.


Assuntos
Disciplinas das Ciências Biológicas , Bases de Dados Bibliográficas , Mineração de Dados , Europa (Continente) , Software , Bases de Dados Bibliográficas/normas , Internet
2.
Sci Data ; 10(1): 722, 2023 10 19.
Artigo em Inglês | MEDLINE | ID: mdl-37857688

RESUMO

Named entity recognition (NER) is a widely used text-mining and natural language processing (NLP) subtask. In recent years, deep learning methods have superseded traditional dictionary- and rule-based NER approaches. A high-quality dataset is essential to fully leverage recent deep learning advancements. While several gold-standard corpora for biomedical entities in abstracts exist, only a few are based on full-text research articles. The Europe PMC literature database routinely annotates Gene/Proteins, Diseases, and Organisms entities. To transition this pipeline from a dictionary-based to a machine learning-based approach, we have developed a human-annotated full-text corpus for these entities, comprising 300 full-text open-access research articles. Over 72,000 mentions of biomedical concepts have been identified within approximately 114,000 sentences. This article describes the corpus and details how to access and reuse this open community resource.


Assuntos
Mineração de Dados , Processamento de Linguagem Natural , Humanos , Mineração de Dados/métodos , Bases de Dados Factuais , Europa (Continente) , Aprendizado de Máquina
3.
Nucleic Acids Res ; 51(D1): D1353-D1359, 2023 Jan 06.
Artigo em Inglês | MEDLINE | ID: mdl-36399499

RESUMO

The Open Targets Platform (https://platform.opentargets.org/) is an open source resource to systematically assist drug target identification and prioritisation using publicly available data. Since our last update, we have reimagined, redesigned, and rebuilt the Platform in order to streamline data integration and harmonisation, expand the ways in which users can explore the data, and improve the user experience. The gene-disease causal evidence has been enhanced and expanded to better capture disease causality across rare, common, and somatic diseases. For target and drug annotations, we have incorporated new features that help assess target safety and tractability, including genetic constraint, PROTACtability assessments, and AlphaFold structure predictions. We have also introduced new machine learning applications for knowledge extraction from the published literature, clinical trial information, and drug labels. The new technologies and frameworks introduced since the last update will ease the introduction of new features and the creation of separate instances of the Platform adapted to user requirements. Our new Community forum, expanded training materials, and outreach programme support our users in a range of use cases.

4.
Nucleic Acids Res ; 49(D1): D1507-D1514, 2021 01 08.
Artigo em Inglês | MEDLINE | ID: mdl-33180112

RESUMO

Europe PMC (https://europepmc.org) is a database of research articles, including peer reviewed full text articles and abstracts, and preprints - all freely available for use via website, APIs and bulk download. This article outlines new developments since 2017 where work has focussed on three key areas: (i) Europe PMC has added to its core content to include life science preprint abstracts and a special collection of full text of COVID-19-related preprints. Europe PMC is unique as an aggregator of biomedical preprints alongside peer-reviewed articles, with over 180 000 preprints available to search. (ii) Europe PMC has significantly expanded its links to content related to the publications, such as links to Unpaywall, providing wider access to full text, preprint peer-review platforms, all major curated data resources in the life sciences, and experimental protocols. The redesigned Europe PMC website features the PubMed abstract and corresponding PMC full text merged into one article page; there is more evident and user-friendly navigation within articles and to related content, plus a figure browse feature. (iii) The expanded annotations platform offers ∼1.3 billion text mined biological terms and concepts sourced from 10 providers and over 40 global data resources.


Assuntos
Disciplinas das Ciências Biológicas/estatística & dados numéricos , COVID-19/prevenção & controle , Curadoria de Dados/estatística & dados numéricos , Mineração de Dados/estatística & dados numéricos , Bases de Dados Factuais/estatística & dados numéricos , PubMed , SARS-CoV-2/isolamento & purificação , Disciplinas das Ciências Biológicas/métodos , Pesquisa Biomédica/métodos , Pesquisa Biomédica/estatística & dados numéricos , COVID-19/epidemiologia , COVID-19/virologia , Curadoria de Dados/métodos , Mineração de Dados/métodos , Epidemias , Europa (Continente) , Humanos , Internet , SARS-CoV-2/fisiologia
5.
Nucleic Acids Res ; 46(D1): D1223-D1228, 2018 01 04.
Artigo em Inglês | MEDLINE | ID: mdl-30053269

RESUMO

PITDB is a freely available database of translated genomic elements (TGEs) that have been observed in PIT (proteomics informed by transcriptomics) experiments. In PIT, a sample is analyzed using both RNA-seq transcriptomics and proteomic mass spectrometry. Transcripts assembled from RNA-seq reads are used to create a library of sample-specific amino acid sequences against which the acquired mass spectra are searched, permitting detection of any TGE, not just those in canonical proteome databases. At the time of writing, PITDB contains over 74 000 distinct TGEs from four species, supported by more than 600 000 peptide spectrum matches. The database, accessible via http://pitdb.org, provides supporting evidence for each TGE, often from multiple experiments and an indication of the confidence in the TGE's observation and its type, ranging from known protein (exact match to a UniProt protein sequence), through multiple types of protein variant including various splice isoforms, to a putative novel molecule. PITDB's modern web interface allows TGEs to be viewed individually or by species or experiment, and downloaded for further analysis. PITDB is for bench scientists seeking to share their PIT results, for researchers investigating novel genome products in model organisms and for those wishing to construct proteomes for lesser studied species.


Assuntos
Bases de Dados Factuais , Proteínas/química , Proteínas/genética , Análise de Sequência de RNA , Algoritmos , Sequência de Aminoácidos , Animais , Apresentação de Dados , Humanos , Internet , Fases de Leitura Aberta , Biossíntese de Proteínas , Isoformas de Proteínas/genética , Proteômica/métodos , Espectrometria de Massas em Tandem , Interface Usuário-Computador
6.
Nucleic Acids Res ; 46(10): 4893-4902, 2018 06 01.
Artigo em Inglês | MEDLINE | ID: mdl-29718325

RESUMO

Proteomics informed by transcriptomics (PIT), in which proteomic MS/MS spectra are searched against open reading frames derived from de novo assembled transcripts, can reveal previously unknown translated genomic elements (TGEs). However, determining which TGEs are truly novel, which are variants of known proteins, and which are simply artefacts of poor sequence assembly, is challenging. We have designed and implemented an automated solution that classifies putative TGEs by comparing to reference proteome sequences. This allows large-scale identification of sequence polymorphisms, splice isoforms and novel TGEs supported by presence or absence of variant-specific peptide evidence. Unlike previously reported methods, ours does not require a catalogue of known variants, making it more applicable to non-model organisms. The method was validated on human PIT data, then applied to Mus musculus, Pteropus alecto and Aedes aegypti. Novel discoveries included 60 human protein isoforms, 32 392 polymorphisms in P. alecto, and TGEs with non-methionine start sites including tyrosine.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala/métodos , Isoformas de Proteínas/genética , Proteômica/métodos , Aedes/genética , Aedes/metabolismo , Animais , Linhagem Celular , Quirópteros/genética , Quirópteros/metabolismo , Códon de Iniciação , Humanos , Proteínas de Insetos/genética , Proteínas de Insetos/metabolismo , Camundongos , Fases de Leitura Aberta , Polimorfismo Genético , Reprodutibilidade dos Testes , Espectrometria de Massas em Tandem , Tirosina/genética
7.
Mol Cell Proteomics ; 14(11): 3087-93, 2015 Nov.
Artigo em Inglês | MEDLINE | ID: mdl-26269333

RESUMO

With the recent advent of RNA-seq technology the proteomics community has begun to generate sample-specific protein databases for peptide and protein identification, an approach we call proteomics informed by transcriptomics (PIT). This approach has gained a lot of interest, particularly among researchers who work with nonmodel organisms or with particularly dynamic proteomes such as those observed in developmental biology and host-pathogen studies. PIT has been shown to improve coverage of known proteins, and to reveal potential novel gene products. However, many groups are impeded in their use of PIT by the complexity of the required data analysis. Necessarily, this analysis requires complex integration of a number of different software tools from at least two different communities, and because PIT has a range of biological applications a single software pipeline is not suitable for all use cases. To overcome these problems, we have created GIO, a software system that uses the well-established Galaxy platform to make PIT analysis available to the typical bench scientist via a simple web interface. Within GIO we provide workflows for four common use cases: a standard search against a reference proteome; PIT protein identification without a reference genome; PIT protein identification using a genome guide; and PIT genome annotation. These workflows comprise individual tools that can be reconfigured and rearranged within the web interface to create new workflows to support additional use cases.


Assuntos
Proteômica/métodos , Software , Transcriptoma , Algoritmos , Mineração de Dados , Bases de Dados de Proteínas , Humanos , Espectrometria de Massas/estatística & dados numéricos , Fluxo de Trabalho
8.
Biomed Inform Insights ; 5(Suppl. 1): 175-84, 2012.
Artigo em Inglês | MEDLINE | ID: mdl-22879774

RESUMO

We describe our approach for creating a system able to detect emotions in suicide notes. Motivated by the sparse and imbalanced data as well as the complex annotation scheme, we have considered three hybrid approaches for distinguishing between the different categories. Each of the three approaches combines machine learning with manually derived rules, where the latter target very sparse emotion categories. The first approach considers the task as single label multi-class classification, where an SVM and a CRF classifier are trained to recognise fifteen different categories and their results are combined. Our second approach trains individual binary classifiers (SVM and CRF) for each of the fifteen sentence categories and returns the union of the classifiers as the final result. Finally, our third approach is a combination of binary and multi-class classifiers (SVM and CRF) trained on different subsets of the training data. We considered a number of different feature configurations. All three systems were tested on 300 unseen messages. Our second system had the best performance of the three, yielding an F1 score of 45.6% and a Precision of 60.1% whereas our best Recall (43.6%) was obtained using the third system.

9.
Bioinformatics ; 28(7): 991-1000, 2012 Apr 01.
Artigo em Inglês | MEDLINE | ID: mdl-22321698

RESUMO

MOTIVATION: Scholarly biomedical publications report on the findings of a research investigation. Scientists use a well-established discourse structure to relate their work to the state of the art, express their own motivation and hypotheses and report on their methods, results and conclusions. In previous work, we have proposed ways to explicitly annotate the structure of scientific investigations in scholarly publications. Here we present the means to facilitate automatic access to the scientific discourse of articles by automating the recognition of 11 categories at the sentence level, which we call Core Scientific Concepts (CoreSCs). These include: Hypothesis, Motivation, Goal, Object, Background, Method, Experiment, Model, Observation, Result and Conclusion. CoreSCs provide the structure and context to all statements and relations within an article and their automatic recognition can greatly facilitate biomedical information extraction by characterizing the different types of facts, hypotheses and evidence available in a scientific publication. RESULTS: We have trained and compared machine learning classifiers (support vector machines and conditional random fields) on a corpus of 265 full articles in biochemistry and chemistry to automatically recognize CoreSCs. We have evaluated our automatic classifications against a manually annotated gold standard, and have achieved promising accuracies with 'Experiment', 'Background' and 'Model' being the categories with the highest F1-scores (76%, 62% and 53%, respectively). We have analysed the task of CoreSC annotation both from a sentence classification as well as sequence labelling perspective and we present a detailed feature evaluation. The most discriminative features are local sentence features such as unigrams, bigrams and grammatical dependencies while features encoding the document structure, such as section headings, also play an important role for some of the categories. We discuss the usefulness of automatically generated CoreSCs in two biomedical applications as well as work in progress. AVAILABILITY: A web-based tool for the automatic annotation of articles with CoreSCs and corresponding documentation is available online at http://www.sapientaproject.com/software http://www.sapientaproject.com also contains detailed information pertaining to CoreSC annotation and links to annotation guidelines as well as a corpus of manually annotated articles, which served as our training data. CONTACT: liakata@ebi.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Inteligência Artificial , Reconhecimento Automatizado de Padrão/métodos , Publicações Periódicas como Assunto/classificação , Máquina de Vetores de Suporte , Algoritmos , Internet , Software
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA