Results 1 - 20 of 62
1.
Bioinformatics ; 40(1)2024 01 02.
Article in English | MEDLINE | ID: mdl-38258418

ABSTRACT

MOTIVATION: Scientific advances build on the findings of existing research. The 2001 publication of the human genome has led to the production of huge volumes of literature exploring the context-specific functions and interactions of genes. Technology is needed to perform large-scale text mining of research papers to extract the reported actions of genes in specific experimental contexts and cell states, such as cancer, thereby facilitating the design of new therapeutic strategies. RESULTS: We present a new corpus and Text Mining methodology that can accurately identify and extract the most important details of cancer genomics experiments from biomedical texts. We build a Named Entity Recognition model that accurately extracts relevant experiment details from PubMed abstract text, and a second model that identifies the relationships between them. This system outperforms earlier models and enables the analysis of gene function in diverse and dynamically evolving experimental contexts. AVAILABILITY AND IMPLEMENTATION: Code and data are available here: https://github.com/cambridgeltl/functional-genomics-ie.


Subject(s)
Genomics , Neoplasms , Humans , Neoplasms/genetics , Data Mining/methods , PubMed , Phenotype
2.
Digit Biomark ; 7(1): 157-166, 2023.
Article in English | MEDLINE | ID: mdl-38029002

ABSTRACT

Introduction: Changes in speech can act as biomarkers of cognitive decline in Alzheimer's disease (AD). While shorter speech samples would promote data collection and analysis, the minimum length of informative speech samples remains debated. This study aims to provide insight into the effect of sample length in analyzing longitudinal recordings of spontaneous speech in AD by comparing samples of the original random length with 5- and 1-minute-long samples. We hope to understand whether capping the audio improves the accuracy of the analysis, and whether an extra 4 min conveys necessary information. Methods: 110 spontaneous speech samples were collected from decades of YouTube videos of 17 public figures, 9 of whom eventually developed AD. 456 language features were extracted and their text-length-sensitivity, comparability, and ability to capture change over time were analyzed across three different sample lengths. Results: Capped audio files had advantages over the random length ones. While most extracted features were statistically comparable or highly correlated across the datasets, potential effects of sample length should be acknowledged for some features. The 5-min dataset presented the highest reliability in tracking the evolution of the disease, suggesting that the 4 extra minutes do convey informative data. Conclusion: Sample length seems to play an important role in extracting the language feature values from speech and tracking disease progress over time. We highlight the importance of further research into optimal sample length and standardization of methods when studying speech in AD.

3.
J Alzheimers Dis ; 92(2): 547-564, 2023.
Article in English | MEDLINE | ID: mdl-36776053

ABSTRACT

BACKGROUND: Language impairment in Alzheimer's disease (AD) has been widely studied but due to limited data availability, relatively few studies have focused on the longitudinal change in language in the individuals who later develop AD. Significant differences in speech have previously been found by comparing the press conference transcripts of President Bush and President Reagan, who was later diagnosed with AD. OBJECTIVE: In the current study, we explored whether the patterns previously established in the single AD-healthy control (HC) participant pair apply to a larger group of individuals who later receive AD diagnosis. METHODS: We replicated previous methods on two larger corpora of longitudinal spontaneous speech samples of public figures, consisting of 10 and 9 AD-HC participant pairs. As we failed to find generalizable patterns of language change using previous methodology, we proposed alternative methods for data analysis, investigating the benefits of using different language features and their change with age, and compiling the single features into aggregate scores. RESULTS: The single features that showed the strongest results were moving average type:token ratio (MATTR) and pronoun-related features. The aggregate scores performed better than the single features, with lexical diversity capturing a similar change in two-thirds of the participants. CONCLUSION: Capturing universal patterns of language change prior to AD can be challenging, but the decline in lexical diversity and changes in MATTR and pronoun-related features act as promising measures that reflect the cognitive changes in many participants.


Subject(s)
Alzheimer Disease , Cognitive Dysfunction , Language Disorders , Humans , Speech , Alzheimer Disease/diagnosis , Alzheimer Disease/psychology , Language , Language Disorders/diagnosis , Language Disorders/etiology , Cognitive Dysfunction/psychology
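
The moving-average type:token ratio (MATTR) named in the abstract above can be illustrated with a short sketch. This is a minimal illustration, not the study's implementation: the window size of 50 tokens and the whitespace tokenization in the usage lines are assumptions, since the abstract does not state either.

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio.

    Slides a fixed-size window over the token sequence, computes the
    type-token ratio (distinct tokens / window size) for each position,
    and returns the mean. The window size is an assumed default.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    if window < 1:
        raise ValueError("window must be at least 1")
    # For texts shorter than the window, fall back to the plain ratio.
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = []
    for start in range(len(tokens) - window + 1):
        chunk = tokens[start:start + window]
        ratios.append(len(set(chunk)) / window)
    return sum(ratios) / len(ratios)


# Hypothetical usage on a toy transcript (lowercased, split on whitespace).
sample = "well I think that I think we should well we should go".split()
score = mattr(sample, window=5)
```

Because every window has the same length, the measure is less sensitive to overall text length than a plain type-token ratio, which is one reason it suits comparisons across speech samples of different durations.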
4.
J Biomed Semantics ; 12(1): 12, 2021 07 15.
Article in English | MEDLINE | ID: mdl-34266499

ABSTRACT

BACKGROUND: Recent advances in representation learning have enabled large strides in natural language understanding; however, verbal reasoning remains a challenge for state-of-the-art systems. External sources of structured, expert-curated verb-related knowledge have been shown to boost model performance in different Natural Language Processing (NLP) tasks where accurate handling of verb meaning and behaviour is critical. The costliness and time required for manual lexicon construction have been a major obstacle to porting the benefits of such resources to NLP in specialised domains, such as biomedicine. To address this issue, we combine a neural classification method with expert annotation to create BioVerbNet. This new resource comprises 693 verbs assigned to 22 top-level and 117 fine-grained semantic-syntactic verb classes. We make this resource available complete with semantic roles and VerbNet-style syntactic frames. RESULTS: We demonstrate the utility of the new resource in boosting model performance in document- and sentence-level classification in biomedicine. We apply an established retrofitting method to harness the verb class membership knowledge from BioVerbNet and transform a pretrained word embedding space by pulling together verbs belonging to the same semantic-syntactic class. The BioVerbNet knowledge-aware embeddings surpass the non-specialised baseline by a significant margin on both tasks. CONCLUSION: This work introduces the first large, annotated semantic-syntactic classification of biomedical verbs, providing a detailed account of the annotation process, the key differences in verb behaviour between the general and biomedical domain, and the design choices made to accurately capture the meaning and properties of verbs used in biomedical texts. The demonstrated benefits of leveraging BioVerbNet in text classification suggest the resource could help systems better tackle challenging NLP tasks in biomedicine.


Subject(s)
Natural Language Processing , Semantics , Language
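
The idea of "pulling together verbs belonging to the same semantic-syntactic class" can be sketched as an iterative averaging update. The abstract does not say which retrofitting method was used, so the update rule below (each vector becomes a weighted average of its original position and its class neighbours), the `alpha` weight, the toy vectors, and the class list are all assumptions for illustration.

```python
def retrofit(vectors, classes, iterations=10, alpha=1.0):
    """Pull vectors of same-class words toward each other.

    vectors: dict mapping word -> list of floats (pretrained embedding).
    classes: list of word lists; words in one list share a class.
    alpha:   weight on the original vector (assumed value).
    Words outside every class keep their original vector.
    """
    # Build the neighbour sets implied by class membership.
    neighbours = {w: set() for w in vectors}
    for members in classes:
        present = [w for w in members if w in vectors]
        for w in present:
            neighbours[w].update(v for v in present if v != w)

    new = {w: list(v) for w, v in vectors.items()}
    for _ in range(iterations):
        for w, nbrs in neighbours.items():
            if not nbrs:
                continue
            beta = 1.0 / len(nbrs)  # neighbour weights sum to 1
            updated = []
            for d in range(len(vectors[w])):
                nbr_sum = sum(beta * new[n][d] for n in nbrs)
                updated.append((alpha * vectors[w][d] + nbr_sum) / (alpha + 1.0))
            new[w] = updated
    return new


def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


# Hypothetical toy embedding space and a single verb class.
toy_vectors = {"inhibit": [1.0, 0.0], "block": [0.0, 1.0], "grow": [5.0, 5.0]}
toy_classes = [["inhibit", "block"]]
retrofitted = retrofit(toy_vectors, toy_classes)
```

In this toy setting the two same-class verbs end up closer together than they started, while the unclassified word is left where it was, which is the qualitative behaviour the abstract describes.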
5.
Environ Health Perspect ; 129(6): 67008, 2021 06.
Article in English | MEDLINE | ID: mdl-34165340

ABSTRACT

BACKGROUND: Cancer risk assessment of complex exposures, such as exposure to mixtures of polycyclic aromatic hydrocarbons (PAHs), is challenging due to the diverse biological activities of these compounds. With the help of text mining (TM), we have developed TM tools, namely the latest iteration of the Cancer Risk Assessment using Biomedical literature tool (CRAB3) and a Cancer Hallmarks Analytics Tool (CHAT), that could be useful for automatic literature analyses in cancer risk assessment and research. Whereas CRAB3 analyses are based on carcinogenic modes of action (MOAs) and cover almost all the key characteristics of carcinogens, CHAT evaluates literature according to the hallmarks of cancer, which refer to the alterations in cellular behavior that characterize the cancer cell. OBJECTIVES: The objective was to evaluate the usefulness of these tools to support cancer risk assessment by performing a case study of 22 European Union and U.S. Environmental Protection Agency priority PAHs and diesel exhaust and a case study of PAH interactions with silica. METHODS: We analyzed PubMed literature, comprising 57,498 references concerning priority PAHs and complex PAH mixtures, using CRAB3 and CHAT. RESULTS: CRAB3 analyses correctly identified similarities and differences in genotoxic and nongenotoxic MOAs of the 22 priority PAHs and grouped them according to their known carcinogenic potential. CHAT had the same capacity and complemented the CRAB output when comparing, for example, benzo[a]pyrene and dibenzo[a,l]pyrene. Both CRAB3 and CHAT analyses highlighted potentially interacting mechanisms within and across complex PAH mixtures and mechanisms of possible importance for interactions with silica. CONCLUSION: These data suggest that our TM approach can be useful in the hazard identification of PAHs and mixtures including PAHs. The tools can assist in grouping chemicals and identifying similarities and differences in carcinogenic MOAs and their interactions.
https://doi.org/10.1289/EHP6702.


Subject(s)
Polycyclic Aromatic Hydrocarbons , Carcinogens/toxicity , Data Mining , Polycyclic Aromatic Hydrocarbons/toxicity , Risk Assessment , Vehicle Emissions
6.
Acta Anaesthesiol Scand ; 65(8): 1073-1078, 2021 Sep.
Article in English | MEDLINE | ID: mdl-33840090

ABSTRACT

BACKGROUND: Reports of the prevalence and impact of hazardous alcohol use among intensive care unit (ICU) patients are contradictory. We aimed to study the prevalence of hazardous alcohol use among ICU patients and its association with ICU length of stay (LOS) and mortality. METHODS: Finnish ICUs have been using the Alcohol Use Disorder Identification Test-Consumption (AUDIT-C) to evaluate and record patients' alcohol use into the Finnish Intensive Care Consortium's Database (FICC). We retrieved data from the FICC from a 3-month period. We excluded data from centers with an AUDIT-C recording rate of less than 70% of admissions. We defined hazardous alcohol use as a score of 5 or more for women and 6 or more for men from a maximum score of 12 points. RESULTS: Two thousand forty-five patients were treated in the 10 centers with an AUDIT-C recording rate of 70% or higher. AUDIT-C was available for 1576 (77%) patients and indicated hazardous alcohol use for 334 (21%) patients, who were more often younger (median age 55 [interquartile range 42-65] vs 67 [57-74] [P < .001]) and male (78.1% vs 61.3% [P < .001]) compared to other patients. We found no difference in LOS or hospital mortality between hazardous and non-hazardous alcohol users. Among the non-abstinent, risk of death within a year increased with increasing AUDIT-C scores (adjusted odds ratio 1.077 per point; 95% confidence interval, 1.006-1.152). CONCLUSION: The prevalence of hazardous alcohol use in Finnish ICUs was 21%. Patients with hazardous alcohol use were more often younger and male compared with non-hazardous alcohol users.


Subject(s)
Critical Care , Intensive Care Units , Female , Hospital Mortality , Humans , Length of Stay , Male , Middle Aged , Prevalence
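
The hazardous-use definition in the abstract above (a score of 5 or more for women and 6 or more for men, out of 12) and the per-point odds ratio can be written out as a small sketch. The function names and the `"female"`/`"male"` labels are hypothetical. The compounding in `odds_multiplier` assumes the reported odds ratio comes from a model that is linear in the log-odds, which the abstract does not state explicitly.

```python
# Thresholds as defined in the abstract; the dictionary keys are assumed labels.
HAZARDOUS_THRESHOLD = {"female": 5, "male": 6}
MAX_AUDIT_C = 12


def is_hazardous(score, sex):
    """Return True if an AUDIT-C score meets the study's hazardous-use cutoff."""
    if not 0 <= score <= MAX_AUDIT_C:
        raise ValueError("AUDIT-C score must be between 0 and 12")
    if sex not in HAZARDOUS_THRESHOLD:
        raise ValueError("sex must be 'female' or 'male'")
    return score >= HAZARDOUS_THRESHOLD[sex]


def odds_multiplier(points, odds_ratio_per_point=1.077):
    """Relative odds of one-year death for a difference of `points` in score.

    Assumes the per-point odds ratio compounds multiplicatively, as it
    would under a logistic regression with the score as a linear term.
    """
    return odds_ratio_per_point ** points


# Hypothetical usage: a 3-point difference in score among non-abstinent patients.
three_point_odds = odds_multiplier(3)
```

Under that assumption, a 3-point difference corresponds to odds roughly a quarter higher, which puts the seemingly small per-point estimate into perspective.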
7.
J Am Med Inform Assoc ; 27(11): 1784-1797, 2020 11 01.
Article in English | MEDLINE | ID: mdl-32929494

ABSTRACT

OBJECTIVE: In recent years numerous studies have achieved promising results in Alzheimer's Disease (AD) detection using automatic language processing. We systematically review these articles to understand the effectiveness of this approach, identify any issues and report the main findings that can guide further research. MATERIALS AND METHODS: We searched PubMed, Ovid, and Web of Science for articles published in English between 2013 and 2019. We performed a systematic literature review to answer 5 key questions: (1) What were the characteristics of participant groups? (2) What language data were collected? (3) What features of speech and language were the most informative? (4) What methods were used to classify between groups? (5) What classification performance was achieved? RESULTS AND DISCUSSION: We identified 33 eligible studies and 5 main findings: participants' demographic variables (especially age) were often unbalanced between AD and control groups; spontaneous speech data were collected most often; informative language features were related to word retrieval and semantic, syntactic, and acoustic impairment; neural nets, support vector machines, and decision trees performed well in AD detection, and support vector machines and decision trees performed well in decline detection; and average classification accuracy was 89% in AD and 82% in mild cognitive impairment detection versus healthy control groups. CONCLUSION: The systematic literature review supported the argument that language and speech could successfully be used to detect dementia automatically. Future studies should aim for larger and more balanced datasets, combine data collection methods and the type of information analyzed, focus on the early stages of the disease, and report performance using standardized metrics.


Subject(s)
Alzheimer Disease/diagnosis , Artificial Intelligence , Language Disorders/diagnosis , Language Tests , Speech , Alzheimer Disease/complications , Decision Trees , Humans , Language Disorders/etiology , Machine Learning , Natural Language Processing
8.
PLoS One ; 15(5): e0232891, 2020.
Article in English | MEDLINE | ID: mdl-32413059

ABSTRACT

Literature-based Discovery (LBD) aims to discover new knowledge automatically from large collections of literature. Scientific literature is growing at an exponential rate, making it difficult for researchers to stay current in their discipline and easy to miss knowledge necessary to advance their research. LBD can facilitate hypothesis testing and generation and thus accelerate scientific progress. Neural networks have demonstrated improved performance on LBD-related tasks but are yet to be applied to it. We propose four graph-based, neural network methods to perform open and closed LBD. We compared our methods with those used by the state-of-the-art LION LBD system on the same evaluations to replicate recently published findings in cancer biology. We also applied them to a time-sliced dataset of human-curated peer-reviewed biological interactions. These evaluations and the metrics they employ represent performance on real-world knowledge advances and are thus robust indicators of approach efficacy. In the first experiments, our best methods performed 2-4 times better than the baselines in closed discovery and 2-3 times better in open discovery. In the second, our best methods performed almost 2 times better than the baselines in open discovery. These results are strong indications that neural LBD is potentially a very effective approach for generating new scientific discoveries from existing literature. The code for our models and other information can be found at: https://github.com/cambridgeltl/nn_for_LBD.


Subject(s)
Knowledge Discovery/methods , Neural Networks, Computer , Data Mining/methods , Humans , Neoplasms/metabolism , Pattern Recognition, Automated/methods , Peer Review , Scholarly Communication
10.
J Biomed Semantics ; 10(1): 2, 2019 01 18.
Article in English | MEDLINE | ID: mdl-30658707

ABSTRACT

BACKGROUND: VerbNet, an extensive computational verb lexicon for English, has proved useful for supporting a wide range of Natural Language Processing tasks requiring information about the behaviour and meaning of verbs. Biomedical text processing and mining could benefit from a similar resource. We take the first step towards the development of BioVerbNet: a VerbNet specifically aimed at describing verbs in the area of biomedicine. Because VerbNet-style classification is extremely time consuming, we start from a small manual classification of biomedical verbs and apply a state-of-the-art neural representation model, specifically developed for class-based optimization, to expand the classification with new verbs, using all the PubMed abstracts and the full articles in the PubMed Central Open Access subset as data. RESULTS: Direct evaluation of the resulting classification against BioSimVerb (verb similarity judgement data in biomedicine) shows promising results when representation learning is performed using verb class-based contexts. Human validation by linguists and biologists reveals that the automatically expanded classification is highly accurate. Including novel, valid member verbs and classes, our method can be used to facilitate cost-effective development of BioVerbNet. CONCLUSION: This work constitutes the first effort to apply a state-of-the-art architecture for neural representation learning to biomedical verb classification. While we discuss future optimization of the method, our promising results suggest that the automatic classification released with this article can be used to readily support application tasks in biomedicine.


Subject(s)
Data Mining , Natural Language Processing , Biomedical Research , Machine Learning , PubMed
11.
Bioinformatics ; 35(9): 1553-1561, 2019 05 01.
Article in English | MEDLINE | ID: mdl-30304355

ABSTRACT

MOTIVATION: The overwhelming size and rapid growth of the biomedical literature make it impossible for scientists to read all studies related to their work, potentially leading to missed connections and wasted time and resources. Literature-based discovery (LBD) aims to alleviate these issues by identifying implicit links between disjoint parts of the literature. While LBD has been studied in depth since its introduction three decades ago, there has been limited work making use of recent advances in biomedical text processing methods in LBD. RESULTS: We present LION LBD, a literature-based discovery system that enables researchers to navigate published information and supports hypothesis generation and testing. The system is built with a particular focus on the molecular biology of cancer using state-of-the-art machine learning and natural language processing methods, including named entity recognition and grounding to domain ontologies covering a wide range of entity types and a novel approach to detecting references to the hallmarks of cancer in text. LION LBD implements a broad selection of co-occurrence based metrics for analyzing the strength of entity associations, and its design allows real-time search to discover indirect associations between entities in a database of tens of millions of publications while preserving the ability of users to explore each mention in its original context in the literature. Evaluations of the system demonstrate its ability to identify undiscovered links and rank relevant concepts highly among potential connections. AVAILABILITY AND IMPLEMENTATION: The LION LBD system is available via a web-based user interface and a programmable API, and all components of the system are made available under open licenses from the project home page http://lbd.lionproject.net. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Neoplasms , Algorithms , Databases, Factual , Humans , Natural Language Processing , Publications
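
The notion of discovering indirect associations from co-occurrence statistics can be sketched with a toy open-discovery search: starting from entity A, find intermediate entities B that co-occur with A, then rank entities C that co-occur with some B but never directly with A. This is an illustration only and not the LION LBD implementation or API. The scoring rule (summing the weaker of the two link counts) and the toy documents are assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how many documents mention each unordered pair of entities."""
    counts = Counter()
    for entities in documents:
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts


def open_discovery(documents, start):
    """Rank entities linked to `start` only through an intermediate entity."""
    neighbours = defaultdict(dict)
    for (a, b), n in cooccurrence_counts(documents).items():
        neighbours[a][b] = n
        neighbours[b][a] = n

    direct = neighbours.get(start, {})
    scores = Counter()
    for b, n_ab in direct.items():
        for c, n_bc in neighbours[b].items():
            # Keep only candidates with no direct co-occurrence with start.
            if c != start and c not in direct:
                scores[c] += min(n_ab, n_bc)  # assumed scoring rule
    return scores.most_common()


# Hypothetical toy corpus: each set holds the entities mentioned in one abstract.
toy_documents = [
    {"fish oil", "blood viscosity"},
    {"blood viscosity", "Raynaud"},
    {"fish oil", "platelet aggregation"},
    {"platelet aggregation", "Raynaud"},
    {"Raynaud", "aspirin"},
]
candidates = open_discovery(toy_documents, "fish oil")
```

Closed discovery would instead fix both A and C and list the intermediate B entities connecting them; the same neighbour table supports that variant.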
12.
Tree Physiol ; 39(4): 526-535, 2019 04 01.
Article in English | MEDLINE | ID: mdl-30371901

ABSTRACT

Ectomycorrhizal trees are common in the cold regions of the world, yet the role of the mycorrhizal symbiosis in plant cold tolerance is poorly known. Moreover, the standard methods for testing plant frost hardiness may not be adequate for roots and mycorrhizas. The aims of this study were to compare the frost hardiness of mycorrhizal and non-mycorrhizal Scots pine (Pinus sylvestris L.) seedlings and to test the use of reverse-flow root hydraulic conductance (Kr) measurement for root frost hardiness determination. Mycorrhizal (Hebeloma sp. or Suillus luteus) and non-mycorrhizal seedlings were grown in controlled-environment chambers for 13 weeks. After this, half of the plants were allotted to a non-hardening treatment (long day and high temperature, same as during the preceding growing season) and the other half to a hardening (short day and low temperature) 'autumn' treatment for 4 weeks. The intact seedlings were exposed to whole-plant freezing tests and the needle frost hardiness was measured by the relative electrolyte leakage (REL) method. The seedlings were grown for three more weeks for visual damage assessment and Kr measurements using a high-pressure flow meter (HPFM). Mycorrhizas did not affect the frost hardiness of seedlings in either hardening treatment. The effect of the hardening treatment on frost hardiness was shown by REL and visual assessment of the aboveground parts as well as Kr of roots. Non-mycorrhizal plants were larger than mycorrhizal ones while nitrogen and phosphorus contents (per unit dry mass) were similar in all mycorrhiza treatments. In plants with no frost exposure, the non-mycorrhizal treatment had higher Kr. There was no mycorrhizal effect on plant frost hardiness when nutritional effects were excluded. Further studies are needed on the role of mycorrhizas, especially in the recovery of growth and nutrient uptake in cold soils in the spring. The HPFM is a useful novel method for assessment of root damage.


Subject(s)
Hebeloma/physiology , Mycorrhizae/physiology , Pinus sylvestris/physiology , Cold Temperature , Freezing , Pinus sylvestris/microbiology , Plant Roots/microbiology , Plant Roots/physiology , Plant Transpiration , Seasons , Seedlings/microbiology , Seedlings/physiology , Symbiosis , Trees
13.
Acta Anaesthesiol Scand ; 62(10): 1452-1459, 2018 11.
Article in English | MEDLINE | ID: mdl-29978569

ABSTRACT

BACKGROUND: Studies reporting renal and overall survival after acute kidney injury (AKI) treated exclusively with intermittent modalities of renal replacement therapy (IRRT) are rare. This study focused on outcomes of AKI patients treated with IRRT both in intensive care units (ICUs) and non-ICU dialysis units. METHODS: This prospective observational study was carried out during a 5-month period in 17 ICUs and 17 non-ICUs. ICU and non-ICU patients (total n = 138; 65 ICU, 73 non-ICU) requiring RRT for AKI and chosen to receive IRRT were included. Patient and RRT characteristics as well as outcomes at 90 days, 1 year, and 3 years were registered. RESULTS: Characteristics of ICU and non-ICU patients differed markedly. Pre-existing chronic kidney disease (CKD) and chronic heart failure were significantly more common among non-ICU patients. At 1 year, RRT dependence was significantly more common in the non-ICU group. At 3 years, there was no significant difference between the groups either in RRT dependence or mortality. CONCLUSION: Outcome of AKI patients treated with IRRT is dismal with regard to 3-year kidney function and mortality. Although pre-existing CKD emerged as a major risk factor for end-stage renal disease after AKI, the poor kidney survival was also seen in patients without prior CKD.


Subject(s)
Acute Kidney Injury/therapy , Renal Replacement Therapy , Acute Kidney Injury/etiology , Acute Kidney Injury/mortality , Aged , Cross-Sectional Studies , Female , Humans , Intensive Care Units , Male , Middle Aged , Prospective Studies , Renal Dialysis , Renal Insufficiency, Chronic/complications
14.
BMC Bioinformatics ; 19(1): 176, 2018 05 21.
Article in English | MEDLINE | ID: mdl-29783926

ABSTRACT

BACKGROUND: Link prediction in biomedical graphs has several important applications including predicting Drug-Target Interactions (DTI), Protein-Protein Interaction (PPI) prediction and Literature-Based Discovery (LBD). It can be done using a classifier to output the probability of link formation between nodes. Recently several works have used neural networks to create node representations which allow rich inputs to neural classifiers. Preliminary works were done on this and report promising results. However, they did not use realistic settings like time-slicing, evaluate performance with comprehensive metrics or explain when or why neural network methods outperform baselines. We investigated how inputs from four node representation algorithms affect performance of a neural link predictor on random- and time-sliced biomedical graphs of real-world sizes (∼ 6 million edges) containing information relevant to DTI, PPI and LBD. We compared the performance of the neural link predictor to that of established baselines and report performance across five metrics. RESULTS: In random- and time-sliced experiments when the neural network methods were able to learn good node representations and there was a negligible number of disconnected nodes, those approaches outperformed the baselines. In the smallest graph (∼ 15,000 edges) and in larger graphs with approximately 14% disconnected nodes, baselines such as Common Neighbours proved a justifiable choice for link prediction. At low recall levels (∼ 0.3) the approaches were mostly equal, but at higher recall levels across all nodes and average performance at individual nodes, neural network approaches were superior. Analysis showed that neural network methods performed well on links between nodes with no previous common neighbours, potentially the most interesting links. Additionally, while neural network methods benefit from large amounts of data, they require considerable amounts of computational resources to utilise them. CONCLUSIONS: Our results indicate that when there is enough data for the neural network methods to use and there is a negligible number of disconnected nodes, those approaches outperform the baselines. At low recall levels the approaches are mostly equal but at higher recall levels and average performance at individual nodes, neural network approaches are superior. Performance at nodes without common neighbours, which indicate more unexpected and perhaps more useful links, accounts for this.


Subject(s)
Neural Networks, Computer , Algorithms , Drug Discovery , Knowledge Discovery , Protein Interaction Mapping
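
The Common Neighbours baseline mentioned in the abstract above can be sketched in a few lines: each currently unlinked node pair is scored by the number of neighbours the two nodes share. The edge list is toy data and the function name is hypothetical. The study's graphs, time-slicing, and evaluation metrics are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def common_neighbours_scores(edges):
    """Score unlinked node pairs by their number of shared neighbours.

    edges: iterable of (u, v) pairs from an undirected graph.
    Returns (pair, score) tuples sorted by descending score, then pair.
    Pairs with no shared neighbour are omitted (their score is zero).
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # already linked, nothing to predict
        shared = len(adjacency[u] & adjacency[v])
        if shared:
            scores[(u, v)] = shared
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy graph.
toy_edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
ranked = common_neighbours_scores(toy_edges)
```

The sketch also shows the limitation the abstract points to: a pair with no shared neighbour can never be ranked by this baseline, which is where learned node representations have room to help.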
15.
BMC Bioinformatics ; 19(1): 33, 2018 02 05.
Article in English | MEDLINE | ID: mdl-29402212

ABSTRACT

BACKGROUND: Word representations support a variety of Natural Language Processing (NLP) tasks. The quality of these representations is typically assessed by comparing the distances in the induced vector spaces against human similarity judgements. Whereas comprehensive evaluation resources have recently been developed for the general domain, similar resources for biomedicine currently suffer from a lack of coverage, both in terms of word types included and with respect to the semantic distinctions. Notably, verbs have been excluded, although they are essential for the interpretation of biomedical language. Further, current resources do not discern between semantic similarity and semantic relatedness, although this has been proven as an important predictor of the usefulness of word representations and their performance in downstream applications. RESULTS: We present two novel comprehensive resources targeting the evaluation of word representations in biomedicine. These resources, Bio-SimVerb and Bio-SimLex, address the previously mentioned problems, and can be used for evaluations of verb and noun representations respectively. In our experiments, we have computed Pearson's correlation between performances on intrinsic and extrinsic tasks using twelve popular state-of-the-art representation models (e.g. word2vec models). The intrinsic-extrinsic correlations using our datasets are notably higher than with previous intrinsic evaluation benchmarks such as UMNSRS and MayoSRS. In addition, when evaluating representation models for their abilities to capture verb and noun semantics individually, we show a considerable variation between performances across all models. CONCLUSION: Bio-SimVerb and Bio-SimLex enable intrinsic evaluation of word representations. This evaluation can serve as a predictor of performance on various downstream tasks in the biomedical domain. The results on Bio-SimVerb and Bio-SimLex using standard word representation models highlight the importance of developing dedicated evaluation resources for NLP in biomedicine for particular word classes (e.g. verbs). These are needed to identify the most accurate methods for learning class-specific representations. Bio-SimVerb and Bio-SimLex are publicly available.


Subject(s)
Biomedical Technology , Semantics , Software , Databases as Topic , Humans , Language , Natural Language Processing
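
Intrinsic evaluation as described in the abstract above, comparing distances in a vector space against human similarity judgements, can be sketched as follows. The toy embeddings and ratings are invented for illustration and are not drawn from Bio-SimVerb or Bio-SimLex. Pearson's correlation is used here for simplicity; the abstract does not say which coefficient the authors use for the intrinsic score itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def intrinsic_score(embeddings, judgements):
    """Correlate model similarities with human ratings for covered word pairs."""
    model, human = [], []
    for (w1, w2), rating in judgements.items():
        if w1 in embeddings and w2 in embeddings:
            model.append(cosine(embeddings[w1], embeddings[w2]))
            human.append(rating)
    return pearson(model, human)


# Hypothetical toy data: 2-d embeddings and human ratings on a 0-10 scale.
toy_embeddings = {"inhibit": [1.0, 0.0], "block": [0.9, 0.1], "grow": [0.0, 1.0]}
toy_judgements = {
    ("inhibit", "block"): 9.0,
    ("inhibit", "grow"): 1.0,
    ("block", "grow"): 2.0,
}
toy_score = intrinsic_score(toy_embeddings, toy_judgements)
```

The same `pearson` helper could be applied one level up, correlating such intrinsic scores with downstream task scores across several models, which is the intrinsic-extrinsic comparison the abstract reports.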
16.
Lang Resour Eval ; 52(3): 771-799, 2018.
Article in English | MEDLINE | ID: mdl-30956632

ABSTRACT

VerbNet, the most extensive online verb lexicon currently available for English, has proved useful in supporting a variety of NLP tasks. However, its exploitation in multilingual NLP has been limited by the fact that such classifications are available for few languages only. Since manual development of VerbNet is a major undertaking, researchers have recently translated VerbNet classes from English to other languages. However, no systematic investigation has been conducted into the applicability and accuracy of such a translation approach across different, typologically diverse languages. Our study is aimed at filling this gap. We develop a systematic method for translation of VerbNet classes from English to other languages which we first apply to Polish and subsequently to Croatian, Mandarin, Japanese, Italian, and Finnish. Our results on Polish demonstrate high translatability with all the classes (96% of English member verbs successfully translated into Polish) and strong inter-annotator agreement, revealing a promising degree of overlap in the resultant classifications. The results on other languages are equally promising. This demonstrates that VerbNet classes have strong cross-lingual potential and the proposed method could be applied to obtain gold standards for automatic verb classification in different languages. We make our annotation guidelines and the six language-specific verb classifications available with this paper.

17.
Bioinformatics ; 33(24): 3973-3981, 2017 Dec 15.
Article in English | MEDLINE | ID: mdl-29036271

ABSTRACT

MOTIVATION: To understand the molecular mechanisms involved in cancer development, significant efforts are being invested in cancer research. This has resulted in millions of scientific articles. An efficient and thorough review of the existing literature is crucially important to drive new research. This time-demanding task can be supported by emerging computational approaches based on text mining which offer a great opportunity to organize and retrieve the desired information efficiently from sizable databases. One way to organize existing knowledge on cancer is to utilize the widely accepted framework of the Hallmarks of Cancer. These hallmarks refer to the alterations in cell behaviour that characterize the cancer cell. RESULTS: We created an extensive Hallmarks of Cancer taxonomy and developed automatic text mining methodology and a tool (CHAT) capable of retrieving and organizing millions of cancer-related references from PubMed into the taxonomy. The efficiency and accuracy of the tool were evaluated intrinsically as well as extrinsically by case studies. The correlations identified by the tool show that it offers a great potential to organize and correctly classify cancer-related literature. Furthermore, the tool can be useful, for example, in identifying hallmarks associated with extrinsic factors, biomarkers and therapeutic targets. AVAILABILITY AND IMPLEMENTATION: CHAT can be accessed at: http://chat.lionproject.net. The corpus of hallmark-annotated PubMed abstracts and the software are available at: http://chat.lionproject.net/about. CONTACT: simon.baker@cl.cam.ac.uk. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology/methods , Data Mining/methods , Neoplasms/classification , Publications/classification , Software , Biomarkers , Databases, Factual , Humans , Reproducibility of Results , Review Literature as Topic
18.
BMC Bioinformatics ; 18(1): 368, 2017 Aug 15.
Article in English | MEDLINE | ID: mdl-28810903

ABSTRACT

BACKGROUND: Named Entity Recognition (NER) is a key task in biomedical text mining. Accurate NER systems require task-specific, manually-annotated datasets, which are expensive to develop and thus limited in size. Since such datasets contain related but different information, an interesting question is whether it might be possible to use them together to improve NER performance. To investigate this, we develop supervised, multi-task, convolutional neural network models and apply them to a large number of varied existing biomedical named entity datasets. Additionally, we investigate the effect of dataset size on performance in both single- and multi-task settings. RESULTS: We present a single-task model for NER, a Multi-output multi-task model and a Dependent multi-task model. We apply the three models to 15 biomedical datasets containing multiple named entities including Anatomy, Chemical, Disease, Gene/Protein and Species. Each dataset represents a task. The results from the single-task model and the multi-task models are then compared for evidence of benefits from Multi-task Learning. With the Multi-output multi-task model we observed an average F-score improvement of 0.8% when compared to the single-task model from an average baseline of 78.4%. Although there was a significant drop in performance on one dataset, performance improves significantly for five datasets by up to 6.3%. For the Dependent multi-task model we observed an average improvement of 0.4% when compared to the single-task model. There were no significant drops in performance on any dataset, and performance improves significantly for six datasets by up to 1.1%. The dataset size experiments found that as dataset size decreased, the multi-output model's performance increased compared to the single-task model's. Using 50, 25 and 10% of the training data resulted in an average drop of approximately 3.4, 8 and 16.7% respectively for the single-task model but approximately 0.2, 3.0 and 9.8% for the multi-task model. CONCLUSIONS: Our results show that, on average, the multi-task models produced better NER results than the single-task models trained on a single NER dataset. We also found that Multi-task Learning is beneficial for small datasets. Across the various settings the improvements are significant, demonstrating the benefit of Multi-task Learning for this task.
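
The dataset-size figures quoted in this abstract can be lined up to show how much of the performance drop the multi-task model avoids at each training-set fraction. The following is only an arithmetic sketch over the approximate numbers reported above; the variable names are illustrative and nothing here reproduces the authors' experiments.

```python
# Approximate average drops reported in the abstract, keyed by the
# percentage of training data used. Variable names are illustrative.
training_fractions = [50, 25, 10]
single_task_drop = [3.4, 8.0, 16.7]
multi_task_drop = [0.2, 3.0, 9.8]

# Difference between the two reported drops at each fraction,
# rounded to one decimal to match the precision of the source figures.
drop_gap = {
    fraction: round(single - multi, 1)
    for fraction, single, multi in zip(
        training_fractions, single_task_drop, multi_task_drop
    )
}

for fraction, gap in drop_gap.items():
    print(f"{fraction}% of training data: multi-task drop is smaller by {gap}")
```

On these reported figures the gap widens as the training set shrinks, which is consistent with the abstract's statement that Multi-task Learning is beneficial for small datasets.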


Subject(s)
Neural Networks, Computer , Data Mining , Databases, Factual , Machine Learning , Models, Theoretical
19.
PLoS One ; 12(3): e0173132, 2017.
Article in English | MEDLINE | ID: mdl-28257498

ABSTRACT

Chemical exposure assessments are based on information collected via different methods, such as biomonitoring, personal monitoring, environmental monitoring and questionnaires. The vast amount of chemical-specific exposure information available from web-based databases, such as PubMed, is undoubtedly a great asset to the scientific community. However, manual retrieval of relevant published information is an extremely time-consuming task, and obtaining an overview of the data is nearly impossible. Here, we present the development of an automatic classifier for chemical exposure information. First, nearly 3700 abstracts were manually annotated by an expert in exposure sciences according to a taxonomy exclusively created for exposure information. Natural Language Processing (NLP) techniques were used to extract semantic and syntactic features relevant to chemical exposure text. Using these features, we trained a supervised machine learning algorithm to automatically classify PubMed abstracts according to the exposure taxonomy. The resulting classifier demonstrates good performance in the intrinsic evaluation. We also show that the classifier improves information retrieval of chemical exposure data compared to keyword-based PubMed searches. Case studies demonstrate that the classifier can be used to assist researchers by facilitating information retrieval and classification, enabling data gap recognition and providing an overview of the available scientific literature using chemical-specific publication profiles. Finally, we identify challenges to be addressed in future development of the system.


Subject(s)
Data Mining/methods , Environmental Monitoring/methods , Environmental Pollution , Information Storage and Retrieval , Algorithms , Computational Biology , Databases, Factual , Humans , Natural Language Processing , PubMed , Semantics
20.
BMC Bioinformatics ; 18(1): 39, 2017 Jan 17.
Article in English | MEDLINE | ID: mdl-28095781

ABSTRACT

BACKGROUND: In silico drug-target interaction (DTI) prediction plays an integral role in drug repositioning: the discovery of new uses for existing drugs. One popular method of drug repositioning is network-based DTI prediction, which uses complex network theory to predict DTIs from a drug-target network. Currently, most network-based DTI prediction is based on machine learning methods such as Restricted Boltzmann Machines (RBM) or Support Vector Machines (SVM). These methods require additional information about the characteristics of drugs, targets and DTIs, such as chemical structure, genome sequence, binding types, causes of interactions, etc., and do not perform satisfactorily when such information is unavailable. We propose a new, alternative method for DTI prediction that attempts to solve this problem by using only network topology information. RESULTS: We compare our method for DTI prediction against the well-known RBM approach. We show that when applied to the MATADOR database, our approach based on node neighborhoods yields higher precision for high-ranking predictions than RBM when no information regarding DTI types is available. CONCLUSION: This demonstrates that approaches purely based on network topology provide a more suitable approach to DTI prediction in the many real-life situations where little or no prior knowledge is available about the characteristics of drugs, targets, or their interactions.
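
The abstract does not spell out how node neighborhoods are scored, so the following is only a minimal sketch of one generic topology-only heuristic, not the authors' method. It assumes a bipartite drug-target network given as known pairs, and scores an unobserved pair by summing the Jaccard overlap between the candidate drug's targets and the targets of each drug already linked to the candidate target. The function name, the scoring rule and the toy data are all hypothetical.

```python
from collections import defaultdict


def rank_candidate_interactions(known_pairs):
    """Rank unobserved (drug, target) pairs using network topology only.

    Illustrative heuristic (an assumption, not taken from the paper):
    a drug is likely to hit a target if drugs already known to hit that
    target share many targets with it.
    """
    drug_targets = defaultdict(set)
    target_drugs = defaultdict(set)
    for drug, target in known_pairs:
        drug_targets[drug].add(target)
        target_drugs[target].add(drug)

    scores = {}
    for drug, targets in drug_targets.items():
        for target, drugs in target_drugs.items():
            if target in targets:
                continue  # interaction already known, nothing to predict
            score = 0.0
            for other in drugs:
                other_targets = drug_targets[other]
                # Jaccard overlap of the two drugs' target neighborhoods
                score += len(targets & other_targets) / len(targets | other_targets)
            if score > 0:
                scores[(drug, target)] = score

    # Highest score first; ties broken alphabetically for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy network with made-up identifiers
example_pairs = [
    ("d1", "t1"), ("d1", "t2"),
    ("d2", "t1"), ("d2", "t2"), ("d2", "t3"),
    ("d3", "t3"), ("d3", "t4"),
]
ranked = rank_candidate_interactions(example_pairs)
for (drug, target), score in ranked:
    print(f"{drug} -> {target}: {score:.3f}")
```

In the toy network, d1 and d2 share two targets, so the unobserved pair (d1, t3) ranks first. No chemical structure, sequence or interaction-type data is consulted, which is the kind of setting the abstract describes.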


Subject(s)
Computer Simulation , Drug Repositioning/methods , Support Vector Machine , Databases, Factual , Drug Repositioning/instrumentation