Results 1 - 20 of 22
1.
PLoS One ; 15(5): e0232525, 2020.
Article in English | MEDLINE | ID: mdl-32357164

ABSTRACT

Text classification (TC) is the task of automatically assigning documents to a fixed number of categories. TC is an important component in many text applications. Many of these applications perform preprocessing. There are different types of text preprocessing, e.g., conversion of uppercase letters into lowercase letters, HTML tag removal, stopword removal, punctuation mark removal, lemmatization, correction of common misspelled words, and reduction of replicated characters. We hypothesize that the application of different combinations of preprocessing methods can improve TC results. Therefore, we performed an extensive and systematic set of TC experiments (and this is our main research contribution) to explore the impact of all possible combinations of five/six basic preprocessing methods on four benchmark text corpora (and not samples of them) using three ML methods and training and test sets. The general conclusion (at least for the datasets verified) is that it is always advisable to test an extensive and systematic variety of preprocessing combinations in TC experiments, because doing so helps improve TC accuracy. For all the tested datasets, there was always at least one combination of basic preprocessing methods that could be recommended to significantly improve TC using a bag-of-words (BOW) representation. For three datasets, stopword removal was the only single preprocessing method that enabled a significant improvement compared to the baseline result using a bag of 1,000-word unigrams. For some of the datasets, there was minimal improvement when we removed HTML tags, performed spelling correction or removed punctuation marks, and reduced replicated characters. However, for the fourth dataset, stopword removal was not beneficial. Instead, the conversion of uppercase letters into lowercase letters was the only single preprocessing method that demonstrated a significant improvement compared to the baseline result. The best result for this dataset was obtained when we performed spelling correction and conversion into lowercase letters. In general, for all the datasets processed, there was always at least one combination of basic preprocessing methods that could be recommended to improve the accuracy results when using a bag-of-words representation.
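
The idea of testing every combination of basic preprocessing methods can be sketched as follows. This is only an illustration, not the authors' code: the three preprocessing functions, the tiny stopword list, and the sample sentence are assumptions chosen for brevity, and the study's classifiers and corpora are not reproduced.

```python
import re
from itertools import combinations

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "is", "of", "and"}

def lowercase(text):
    return text.lower()

def remove_punctuation(text):
    return re.sub(r"[^\w\s]", "", text)

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

# Illustrative subset of the preprocessing methods named in the abstract.
STEPS = [lowercase, remove_punctuation, remove_stopwords]

def all_combinations(steps):
    """Yield every subset of steps (including the empty baseline), keeping list order."""
    for r in range(len(steps) + 1):
        yield from combinations(steps, r)

def apply_steps(text, steps):
    """Apply the chosen preprocessing steps one after another."""
    for step in steps:
        text = step(text)
    return text

sample = "The cat is on the mat, and the DOG barks!"
variants = {
    tuple(s.__name__ for s in combo): apply_steps(sample, combo)
    for combo in all_combinations(STEPS)
}
for names, result in variants.items():
    print(names or ("baseline",), "->", result)
```

With n on/off methods there are 2^n combinations including the untouched baseline, so five methods give 32 configurations and six give 64; in an experiment like the one described, each variant would be fed to a classifier and scored.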


Subjects
Natural Language Processing , Supervised Machine Learning , Word Processing , Algorithms , Data Mining/classification , Databases, Factual , Humans , Language , Supervised Machine Learning/classification , Text Messaging/classification , Word Processing/classification
2.
Plant Physiol ; 180(3): 1261-1276, 2019 Jul.
Article in English | MEDLINE | ID: mdl-31061104

ABSTRACT

Modern phenotyping techniques yield vast amounts of data that are challenging to manage and analyze. When thoroughly examined, this type of data can reveal genotype-to-phenotype relationships and meaningful connections among individual traits. However, efficient data mining is challenging for experimental biologists with limited training in curating, integrating, and exploring complex datasets. Additionally, data transparency, accessibility, and reproducibility are important considerations for scientific publication. The need for a streamlined, user-friendly pipeline for advanced phenotypic data analysis is pressing. In this article we present an open-source, online platform for multivariate analysis (MVApp), which serves as an interactive pipeline for data curation, in-depth analysis, and customized visualization. MVApp builds on the available R-packages and adds extra functionalities to enhance the interpretability of the results. The modular design of the MVApp allows for flexible analysis of various data structures and includes tools underexplored in phenotypic data analysis, such as clustering and quantile regression. MVApp aims to enhance findable, accessible, interoperable, and reproducible data transparency, streamline data curation and analysis, and increase statistical literacy among the scientific community.


Subjects
Computational Biology/methods , Data Analysis , Data Mining/methods , Multivariate Analysis , Cluster Analysis , Data Mining/classification , Reproducibility of Results , Software
3.
Neural Netw ; 110: 243-255, 2019 Feb.
Article in English | MEDLINE | ID: mdl-30616096

ABSTRACT

Complex networks provide a powerful tool for data representation due to their ability to describe the interplay between topological, functional, and dynamical properties of the input data. A fundamental process in network-based (graph-based) data analysis techniques is the construction of a network from original data, usually in vector form. Here, a natural question is: how can an "optimal" network be constructed for a given processing goal? This paper investigates structural optimization in the context of network-based data classification tasks. To be specific, we propose a particle swarm optimization framework which is responsible for building a network from a vector-based data set while optimizing a quality function driven by the classification accuracy. The classification process considers both topological and physical features of the training and test data and employs the PageRank measure to classify a test instance according to its importance to each class. Results on artificial and real-world problems reveal that data networks generated using structural optimization provide better results in general than those generated by classical network formation methods. Moreover, this investigation suggests that other kinds of network-based machine learning and data mining tasks, such as dimensionality reduction and data clustering, can benefit from the proposed structural optimization method.
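
For readers unfamiliar with the PageRank measure mentioned above, a minimal power-iteration sketch is given below. It shows only the generic importance score on a small directed graph; the toy graph, the damping factor of 0.85, and the function name are assumptions, and the paper's particle swarm optimization and per-class classification scheme are not reproduced.

```python
def pagerank(adjacency, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict {node: [out-neighbours]}.

    Every node must appear as a key, even if it has no outgoing links.
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Teleportation term shared equally by all nodes.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A node without out-links spreads its rank over all nodes.
            targets = adjacency[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical toy network: node "c" receives links from every other node.
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, round(score, 4))
```

In a network-based classifier of the kind described, a score like this could be computed with respect to each class's subnetwork and the test instance assigned to the class where it is most important; that mapping is an interpretation of the abstract, not a documented detail.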


Subjects
Data Mining/classification , Neural Networks, Computer , Algorithms , Databases, Factual/classification , Humans , Machine Learning
4.
São Paulo; s.n; s.n; Dec. 2015. 115 p. tab, graph, illus.
Thesis in Portuguese | LILACS | ID: biblio-834070

ABSTRACT


Chitosan is a functional biopolymer with great development potential that can yield different types of materials with varied functions. Depending on modifications to its structure, chitosan has found applications in a wide range of areas. Despite the increasing use of chitosan and the growth in research on new applications, sources of chitosan other than crustaceans have not been consistently explored. The goal of this project is to conduct a quantitative and qualitative exploration of a new renewable source of chitosan. Blattaria, commonly known as cockroaches, are an alternative source for the production of chitosan. They are terrestrial organisms that reproduce considerably fast, adapt to many different environments, and are very cheap to rear because they adapt easily to their environment and food. Moreover, cockroaches show no seasonality and undergo ecdysis, so the exuviae can be used to produce chitosan. The process for obtaining chitosan from the cockroach Phoetalia pallida, and its yield, were determined: the cockroaches were treated with a 50% (w/v) sodium hydroxide solution at 120 °C for seven different time periods (1, 2, 3, 6, 10 and 20 hours). The chitosans obtained were characterized by infrared spectroscopy (FTIR), thermal analysis (TG/DTG and DSC), X-ray diffraction, viscosimetry, and a solubility test. Obtaining chitosan from cockroaches showed advantages over production from crustaceans: fewer process steps and no need for treatment with HCl, which is a pollutant. The process had a yield of approximately 15%, varying with the reaction time. In general, the cockroach chitosans showed characteristics similar to those of shrimp chitosan.


Subjects
Animals , Technology, Pharmaceutical/classification , Chitosan/adverse effects , Data Mining/classification , Biopolymers , Cockroaches/physiology
5.
Genomics ; 106(6): 355-9, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26551295

ABSTRACT

Mining patterns of histone modification interplay from epigenomic profiles is currently one of the leading research areas. Various methods based on clustering approaches and hidden Markov models have been presented so far, each with some limitations. Here we present ChromClust, a semi-supervised clustering tool for mining commonly occurring histone modifications at various locations of the genome. Applying our method to 11 chromatin marks in nine human cell types recovered 11 clusters based on distinct chromatin signatures mapping to various elements of the genome. Our approach is efficient with respect to time and space usage and has the added facility of maintaining a database at the backend. It outperforms the existing methods with respect to mining patterns in a semi-supervised fashion mapping to various functional elements of the genome. It will help in the future by saving time and space resources while efficiently retrieving the hidden interplay of histone combinations.


Subjects
Chromatin/genetics , Computational Biology/methods , Data Mining/methods , Histone Code , Chromatin/metabolism , Cluster Analysis , Data Mining/classification , Genome, Human/genetics , Humans , Reproducibility of Results
6.
Stud Health Technol Inform ; 216: 1099, 2015.
Article in English | MEDLINE | ID: mdl-26262398

ABSTRACT

Depression in adolescence is associated with significant suicidality. Therefore, it is important to detect the risk for depression and provide timely care to adolescents. This study aims to develop an ontology for collecting and analyzing social media data about adolescent depression. This ontology was developed using the 'ontology development 101'. The important terms were extracted from several clinical practice guidelines and postings on Social Network Service. We extracted 777 terms, which were categorized into 'risk factors', 'sign and symptoms', 'screening', 'diagnosis', 'treatment', and 'prevention'. An ontology developed in this study can be used as a framework to understand adolescent depression using unstructured data from social media.


Subjects
Data Mining/classification , Depression/classification , Depression/psychology , Natural Language Processing , Social Media/classification , Vocabulary, Controlled , Adolescent , Adolescent Health/classification , Data Mining/methods , Female , Humans , Male , Psychology, Adolescent/classification
7.
Article in English | MEDLINE | ID: mdl-26261998

ABSTRACT

Electronic Health Records (EHRs) have made patient information widely available, allowing health professionals to provide better care. However, information confidentiality is an issue that continually needs to be taken into account. The objective of this study is to describe the implementation of rule-based access permissions to an EHR system. The rules that were implemented were based on a qualitative study. Every time users did not meet the specified requirements, they had to justify access through a pop-up window with predetermined options, including a free-text option ("other justification"). A secondary analysis of a deidentified database was performed. From a total of 20,540,708 hits on the electronic medical record database, 85% of accesses to the EHR system did not require justification. Content analysis of the "other justification" option allowed the identification of new types of access. When asked to justify access, however, users may choose the fastest option or the one requiring the fewest clicks, treating the justification step as a barrier to accessing the EHR.


Subjects
Access to Information , Computer Security , Confidentiality , Data Mining/classification , Data Mining/methods , Electronic Health Records/statistics & numerical data , Argentina , Health Records, Personal , Meaningful Use/organization & administration , Meaningful Use/statistics & numerical data , Natural Language Processing , Software , Utilization Review
8.
J Biomed Inform ; 55: 1-10, 2015 Jun.
Article in English | MEDLINE | ID: mdl-25733166

ABSTRACT

OBJECTIVE: To compare the performance of the Concurrent (CTA) and Retrospective (RTA) Think Aloud method and to assess their value in a formative usability evaluation of an Intensive Care Registry-physician data query tool designed to support ICU quality improvement processes. METHODS: Sixteen representative intensive care physicians participated in the usability evaluation study. Subjects were allocated to either the CTA or RTA method by a matched randomized design. Each subject performed six usability-testing tasks of varying complexity in the query tool in a real-working context. Methods were compared with regard to number and type of problems detected. Verbal protocols of CTA and RTA were analyzed in depth to assess differences in verbal output. Standardized measures were applied to assess thoroughness in usability problem detection weighted per problem severity level and method overall effectiveness in detecting usability problems with regard to the time subjects spent per method. RESULTS: The usability evaluation of the data query tool revealed a total of 43 unique usability problems that the intensive care physicians encountered. CTA detected unique usability problems with regard to graphics/symbols, navigation issues, error messages, and the organization of information on the query tool's screens. RTA detected unique issues concerning system match with subjects' language and applied terminology. The in-depth verbal protocol analysis of CTA provided information on intensive care physicians' query design strategies. Overall, CTA performed significantly better than RTA in detecting usability problems. CTA usability problem detection effectiveness was 0.80 vs. 0.62 (p<0.05) respectively, with an average difference of 42% less time spent per subject compared to RTA. In addition, CTA was more thorough in detecting usability problems of a moderate (0.85 vs. 0.7) and severe nature (0.71 vs. 0.57). 
CONCLUSION: In this study, the CTA is more effective in usability-problem detection and provided clarification of intensive care physician query design strategies to inform redesign of the query tool. However, CTA does not outperform RTA. The RTA additionally elucidated unique usability problems and new user requirements. Based on the results of this study, we recommend the use of CTA in formative usability evaluation studies of health information technology. However, we recommend further research on the application of RTA in usability studies with regard to user expertise and experience when focusing on user profile customized (re)design.


Subjects
Consumer Behavior/statistics & numerical data , Data Mining/classification , Electronic Health Records/statistics & numerical data , Meaningful Use/statistics & numerical data , Practice Patterns, Physicians'/statistics & numerical data , Software , Attitude of Health Personnel , Data Mining/methods , Data Mining/statistics & numerical data , Physicians , Practice Patterns, Physicians'/classification , Retrospective Studies , Software Validation , Utilization Review/methods
9.
BMC Med Res Methodol ; 15: 11, 2015 Feb 03.
Article in English | MEDLINE | ID: mdl-25649372

ABSTRACT

BACKGROUND: Clinical data gathered for administrative purposes often lack sufficient information to separate the records of radiotherapy given for palliation from those given for cure. An absence, incompleteness, or inaccuracy of such information could hinder or bias the study of the utilization and outcome of radiotherapy. This study has three specific purposes: 1) develop a method to determine the therapeutic role of radiotherapy (TRR); 2) assess the accuracy of the method; 3) report the quality of the information on treatment "intent" recorded in the clinical data in Ontario, Canada. A general purpose is to use this study as a prototype to demonstrate and test a method to assess the quality of administrative data. METHODS: This is a population based retrospective study. A random sample was drawn from the treatment records with "intent" assigned in treating hospitals. A decision tree is grown using treatment parameters as predictors and "intent" as outcome variable to classify the treatments into curative or palliative. The tree classifier was applied to the entire dataset, and the classification results were compared with those identified by "intent". A manual audit was conducted to assess the accuracy of the classification. RESULTS: The following parameters predicted the TRR, from the strongest to the weakest: radiation dose per fraction, treated body-region, disease site, and time of treatment. When applied to the records of treatments given between 1990 and 2008 in Ontario, Canada, the classification rules correctly classified 96.1% of the records. The quality of the "intent" variable was as follows: 77.5% correctly classified, 3.7% misclassified, and 18.8% did not have an "intent" assigned. CONCLUSIONS: The classification rules derived in this study can be used to determine the TRR when such information is unavailable, incomplete, or inaccurate in administrative data. 
The study demonstrates that a data mining approach can be used to effectively assess and improve the quality of large administrative datasets.
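
The core step in growing a decision tree, choosing the split on a predictor that best separates the classes, can be sketched as below. This is a single-feature illustration using Gini impurity; the splitting criterion, the function names, and the dose values are assumptions made up for the example, not the study's method or data.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    best = (None, float("inf"))
    points = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Made-up records: dose per fraction (Gy) and recorded intent. Not study data.
dose = [1.8, 2.0, 2.0, 2.2, 3.0, 4.0, 5.0, 8.0]
intent = ["curative"] * 4 + ["palliative"] * 4
threshold, impurity = best_split(dose, intent)
print(threshold, impurity)
```

A full tree repeats this search over all predictors (for example body region, disease site, and time of treatment) and recurses into each resulting group until a stopping rule is met.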


Subjects
Data Mining/statistics & numerical data , Hospital Records/statistics & numerical data , Medical Records/statistics & numerical data , Neoplasms/radiotherapy , Radiotherapy/statistics & numerical data , Data Mining/classification , Data Mining/methods , Decision Trees , Hospital Records/classification , Hospital Records/standards , Humans , Medical Records/classification , Medical Records/standards , Ontario , Outcome Assessment, Health Care/methods , Outcome Assessment, Health Care/statistics & numerical data , Radiation Oncology/methods , Radiation Oncology/statistics & numerical data , Radiotherapy/methods , Reproducibility of Results , Retrospective Studies
10.
ScientificWorldJournal ; 2014: 179105, 2014.
Article in English | MEDLINE | ID: mdl-25276846

ABSTRACT

This paper analyses the effect of the effort distribution along the software development lifecycle on the prevalence of software defects. This analysis is based on data that was collected by the International Software Benchmarking Standards Group (ISBSG) on the development of 4,106 software projects. Data mining techniques have been applied to gain a better understanding of the behaviour of the project activities and to identify a link between the effort distribution and the prevalence of software defects. This analysis has been complemented with the use of a hierarchical clustering algorithm with a dissimilarity based on the likelihood ratio statistic, for exploratory purposes. As a result, different behaviours have been identified for this collection of software development projects, allowing for the definition of risk control strategies to diminish the number and impact of the software defects. It is expected that the use of similar estimations might greatly improve the awareness of project managers on the risks at hand.


Subjects
Algorithms , Software , Cluster Analysis , Computational Biology/classification , Computational Biology/methods , Data Mining/classification , Data Mining/methods , Discriminant Analysis , Reproducibility of Results , Software Design , Software Validation
11.
Stud Health Technol Inform ; 205: 201-5, 2014.
Article in English | MEDLINE | ID: mdl-25160174

ABSTRACT

This paper presents the results of a blind comparison of the top ten search results retrieved by Google.ch (French) and Khresmoi for everyone, a health-specialized search engine. Participants, students of the Faculty of Medicine of the University of Geneva, had to complete three tasks and select their preferred results. The majority of the participants largely preferred Google results, while Khresmoi results showed potential to compete in specific topics. The coverage of the results seems to be one of the reasons. The second is that participants do not know how to select quality and transparent health web pages. More awareness, tools, and education on the matter are required for students of medicine to be able to efficiently distinguish trustworthy online health information.


Subjects
Consumer Health Information/classification , Consumer Health Information/statistics & numerical data , Data Mining/classification , Data Mining/statistics & numerical data , Search Engine/classification , Search Engine/statistics & numerical data , Students, Medical/statistics & numerical data , Single-Blind Method
12.
J. health inform ; 5(2): 44-51, Apr.-Jun. 2013. graph, tab
Article in Portuguese | LILACS | ID: lil-696498

ABSTRACT

Objectives: To compare data mining algorithms for classification and association tasks on medical datasets about dermatology, vertebral column, and breast cancer patients, analyzing which algorithm performs best on each dataset. Methods: The classification algorithms were run on these datasets and compared using the precision, F-measure, ROC curve, and Kappa performance metrics. For the association task, the Apriori algorithm was run to obtain a significant number of rules with confidence above 90%. Results: For diagnostic prediction on the breast cancer and dermatology datasets, the best classification algorithm was BayesNet; for the vertebral column dataset, it was the Logistic Model Tree. For the association task, 100 knowledge rules with confidence higher than 90% were extracted for the breast cancer and dermatology datasets, while 18 rules with the same confidence were found for the vertebral column dataset. Conclusion: The comparison was useful to demonstrate that data mining algorithms can support medical decision-making with good precision.
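
The confidence threshold used for the association rules can be illustrated with a small sketch. It is a simplified stand-in, not the Apriori algorithm itself: it only scores single-item rules A -> B by confidence = support(A and B) / support(A), and the transactions and function names are made up for the example.

```python
from itertools import permutations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confident_rules(transactions, min_confidence=0.9):
    """Return {(antecedent, consequent): confidence} for single-item rules above the threshold."""
    items = sorted(set().union(*transactions))
    rules = {}
    for a, b in permutations(items, 2):
        confidence = support({a, b}, transactions) / support({a}, transactions)
        if confidence > min_confidence:
            rules[(a, b)] = confidence
    return rules

# Made-up transactions for illustration; not taken from the study's datasets.
records = [
    {"fever", "rash", "itching"},
    {"rash", "itching"},
    {"rash", "itching", "scaling"},
    {"fever", "scaling"},
]
rules = confident_rules(records)
for (antecedent, consequent), confidence in rules.items():
    print(f"{antecedent} -> {consequent}: {confidence:.2f}")
```

Apriori adds a minimum-support pruning step and extends the search to larger itemsets, which is what makes it practical on real datasets.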



Subjects
Algorithms , Association , Classification , Spine/pathology , Dermatology , Data Mining/classification , Breast Neoplasms
13.
Invest Ophthalmol Vis Sci ; 53(13): 8310-8, 2012 Dec 17.
Article in English | MEDLINE | ID: mdl-23150624

ABSTRACT

PURPOSE: To describe and evaluate an automated grading system for age-related macular degeneration (AMD) by color fundus photography. METHODS: An automated "disease/no disease" grading system for AMD was developed based on image-mining techniques. First, image preprocessing was performed to normalize color and nonuniform illumination of the fundus images, to define a region of interest, and to identify and remove pixels belonging to retinal vessels. To represent images for the prediction task, a graph-based image representation using quadtrees was then adopted. Next, a graph-mining technique was applied to the generated graphs to extract relevant features (in the form of frequent subgraphs) from images of both AMD patients and healthy volunteers. Features of the training data were then fed into a classifier generator for training purposes before employing the trained classifiers to classify new "unseen" images. RESULTS: The algorithm was evaluated on two publicly available fundus-image datasets comprising 258 images (160 AMD and 98 normal). Ten-fold cross validation was used. The experiments produced a best specificity of 100% and a best sensitivity of 99.4% with an overall accuracy of 99.6%. Our approach outperformed previous approaches reported in the literature. CONCLUSIONS: This study has demonstrated a proof-of-concept image-mining technique for automated AMD grading. This technique has the potential to be further developed as an automated grading tool for future whole-scale AMD screening programs.
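
The reported metrics follow from a confusion matrix, as the sketch below shows. The class sizes (160 AMD, 98 normal) come from the abstract; the individual counts, one missed AMD image and no false positives, are an assumption chosen because they yield about 99.4% sensitivity, 100% specificity, and 99.6% accuracy. The abstract reports "best" values, which may come from different classifier runs, so this single matrix is illustrative only.

```python
def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of diseased images detected
    specificity = tn / (tn + fp)   # share of healthy images cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Assumed counts: 160 AMD and 98 normal images, one AMD image missed,
# no normal image flagged. The per-cell counts are not reported in the abstract.
sens, spec, acc = binary_metrics(tp=159, fn=1, tn=98, fp=0)
print(sens, spec, acc)
```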


Subjects
Data Mining/classification , Diagnostic Techniques, Ophthalmological , Image Interpretation, Computer-Assisted/methods , Macular Degeneration/classification , Algorithms , Bayes Theorem , Feasibility Studies , Geographic Atrophy/classification , Humans , Reproducibility of Results , Retinal Drusen/classification , Retinal Vessels/pathology , Sensitivity and Specificity
14.
J Am Med Inform Assoc ; 18(5): 594-600, 2011.
Article in English | MEDLINE | ID: mdl-21846787

ABSTRACT

OBJECTIVE: A supervised machine learning approach to discover relations between medical problems, treatments, and tests mentioned in electronic medical records. MATERIALS AND METHODS: A single support vector machine classifier was used to identify relations between concepts and to assign their semantic type. Several resources such as Wikipedia, WordNet, General Inquirer, and a relation similarity metric inform the classifier. RESULTS: The techniques reported in this paper were evaluated in the 2010 i2b2 Challenge and obtained the highest F1 score for the relation extraction task. When gold standard data for concepts and assertions were available, F1 was 73.7, precision was 72.0, and recall was 75.3. F1 is defined as 2*Precision*Recall/(Precision+Recall). Alternatively, when concepts and assertions were discovered automatically, F1 was 48.4, precision was 57.6, and recall was 41.7. DISCUSSION: Although a rich set of features was developed for the classifiers presented in this paper, little knowledge mining was performed from medical ontologies such as those found in UMLS. Future studies should incorporate features extracted from such knowledge sources, which we expect to further improve the results. Moreover, each relation discovery was treated independently. Joint classification of relations may further improve the quality of results. Also, joint learning of the discovery of concepts, assertions, and relations may also improve the results of automatic relation extraction. CONCLUSION: Lexical and contextual features proved to be very important in relation extraction from medical texts. When they are not available to the classifier, the F1 score decreases by 3.7%. In addition, features based on similarity contribute to a decrease of 1.1% when they are not available.
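
The F1 definition quoted in the abstract can be checked directly against the reported precision and recall. The sketch below uses the rounded figures from the abstract; the function and variable names are illustrative. The results come out at roughly 73.6 and 48.4, so the small gap from the reported 73.7 presumably reflects rounding of the published inputs.

```python
def f1_score(precision, recall):
    """F1 = 2 * Precision * Recall / (Precision + Recall), as defined in the abstract."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = f1_score(72.0, 75.3)       # gold-standard concepts and assertions
automatic = f1_score(57.6, 41.7)  # concepts and assertions discovered automatically
print(gold, automatic)
```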


Subjects
Data Mining , Decision Support Systems, Clinical , Electronic Health Records , Natural Language Processing , Support Vector Machine , Data Mining/classification , Decision Support Systems, Clinical/classification , Electronic Health Records/classification , Humans , Internet
15.
J Am Med Inform Assoc ; 18(5): 574-9, 2011.
Article in English | MEDLINE | ID: mdl-21737844

ABSTRACT

OBJECTIVE: Information extraction and classification of clinical data are current challenges in natural language processing. This paper presents a cascaded method to deal with three different extractions and classifications in clinical data: concept annotation, assertion classification and relation classification. MATERIALS AND METHODS: A pipeline system was developed for clinical natural language processing that includes a proofreading process, with gold-standard reflexive validation and correction. The information extraction system is a combination of a machine learning approach and a rule-based approach. The outputs of this system are used for evaluation in all three tiers of the fourth i2b2/VA shared-task and workshop challenge. RESULTS: Overall concept classification attained an F-score of 83.3% against a baseline of 77.0%, the optimal F-score for assertions about the concepts was 92.4% and relation classifier attained 72.6% for relationships between clinical concepts against a baseline of 71.0%. Micro-average results for the challenge test set were 81.79%, 91.90% and 70.18%, respectively. DISCUSSION: The challenge in the multi-task test requires a distribution of time and work load for each individual task so that the overall performance evaluation on all three tasks would be more informative rather than treating each task assessment as independent. The simplicity of the model developed in this work should be contrasted with the very large feature space of other participants in the challenge who only achieved slightly better performance. There is a need to charge a penalty against the complexity of a model as defined in message minimalisation theory when comparing results. CONCLUSION: A complete pipeline system for constructing language processing models that can be used to process multiple practical detection tasks of language structures of clinical records is presented.


Subjects
Data Mining , Decision Support Systems, Clinical , Electronic Health Records , Natural Language Processing , Pattern Recognition, Automated , Data Mining/classification , Decision Support Systems, Clinical/classification , Electronic Health Records/classification , Humans , Models, Theoretical , Semantics , Vocabulary, Controlled
16.
J Am Med Inform Assoc ; 18(5): 568-73, 2011.
Article in English | MEDLINE | ID: mdl-21724741

ABSTRACT

OBJECTIVE: This paper describes natural-language-processing techniques for two tasks: identification of medical concepts in clinical text, and classification of assertions, which indicate the existence, absence, or uncertainty of a medical problem. Because so many resources are available for processing clinical texts, there is interest in developing a framework in which features derived from these resources can be optimally selected for the two tasks of interest. MATERIALS AND METHODS: The authors used two machine-learning (ML) classifiers: support vector machines (SVMs) and conditional random fields (CRFs). Because SVMs and CRFs can operate on a large set of features extracted from both clinical texts and external resources, the authors address the following research question: Which features need to be selected for obtaining optimal results? To this end, the authors devise feature-selection techniques which greatly reduce the amount of manual experimentation and improve performance. RESULTS: The authors evaluated their approaches on the 2010 i2b2/VA challenge data. Concept extraction achieves 79.59 micro F-measure. Assertion classification achieves 93.94 micro F-measure. DISCUSSION: Approaching medical concept extraction and assertion classification through ML-based techniques has the advantage of easily adapting to new data sets and new medical informatics tasks. However, ML-based techniques perform best when optimal features are selected. By devising promising feature-selection techniques, the authors obtain results that outperform the current state of the art. CONCLUSION: This paper presents two ML-based approaches for processing language in the clinical texts evaluated in the 2010 i2b2/VA challenge. By using novel feature-selection methods, the techniques presented in this paper are unique among the i2b2 participants.


Subjects
Data Mining , Decision Support Systems, Clinical , Electronic Health Records , Natural Language Processing , Support Vector Machine , Data Mining/classification , Decision Support Systems, Clinical/classification , Electronic Health Records/classification , Humans , Semantics , Uncertainty
17.
J Am Med Inform Assoc ; 18(5): 552-6, 2011.
Article in English | MEDLINE | ID: mdl-21685143

ABSTRACT

The 2010 i2b2/VA Workshop on Natural Language Processing Challenges for Clinical Records presented three tasks: a concept extraction task focused on the extraction of medical concepts from patient reports; an assertion classification task focused on assigning assertion types for medical problem concepts; and a relation classification task focused on assigning relation types that hold between medical problems, tests, and treatments. i2b2 and the VA provided an annotated reference standard corpus for the three tasks. Using this reference standard, 22 systems were developed for concept extraction, 21 for assertion classification, and 16 for relation classification. These systems showed that machine learning approaches could be augmented with rule-based systems to determine concepts, assertions, and relations. Depending on the task, the rule-based systems can either provide input for machine learning or post-process the output of machine learning. Ensembles of classifiers, information from unlabeled data, and external knowledge sources can help when the training data are inadequate.


Subjects
Data Mining , Decision Support Systems, Clinical , Electronic Health Records , Natural Language Processing , Data Mining/classification , Decision Support Systems, Clinical/classification , Electronic Health Records/classification , Humans
18.
J Am Med Inform Assoc ; 18(5): 607-13, 2011.
Article in English | MEDLINE | ID: mdl-21697292

ABSTRACT

OBJECTIVE: Despite at least 40 years of promising empirical performance, very few clinical natural language processing (NLP) or information extraction systems currently contribute to medical science or care. The authors address this gap by reducing the need for custom software and rules development with a graphical user interface-driven, highly generalizable approach to concept-level retrieval. MATERIALS AND METHODS: A 'learn by example' approach combines features derived from open-source NLP pipelines with open-source machine learning classifiers to automatically and iteratively evaluate top-performing configurations. The Fourth i2b2/VA Shared Task Challenge's concept extraction task provided the data sets and metrics used to evaluate performance. RESULTS: Top F-measure scores for each of the tasks were: medical problems (0.83), treatments (0.82), and tests (0.83). Recall lagged precision in all experiments. Precision was near or above 0.90 in all tasks. DISCUSSION: With no customization for the tasks and less than 5 min of end-user time to configure and launch each experiment, the average F-measure was 0.83, one point behind the mean F-measure of the 22 entrants in the competition. Strong precision scores indicate the potential of applying the approach for more specific clinical information extraction tasks. There was not one best configuration, supporting an iterative approach to model creation. CONCLUSION: Acceptable levels of performance can be achieved using fully automated and generalizable approaches to concept-level information extraction. The described implementation and related documentation are available for download.


Subjects
Data Mining , Decision Support Systems, Clinical , Electronic Health Records , Natural Language Processing , User-Computer Interface , Algorithms , Data Mining/classification , Decision Support Systems, Clinical/classification , Electronic Health Records/classification , Humans
19.
J Am Med Inform Assoc ; 18(5): 614-20, 2011.
Article in English | MEDLINE | ID: mdl-21622934

ABSTRACT

BACKGROUND: Open-source clinical natural-language-processing (NLP) systems have lowered the barrier to the development of effective clinical document classification systems. Clinical natural-language-processing systems annotate the syntax and semantics of clinical text; however, feature extraction and representation for document classification pose technical challenges. METHODS: The authors developed extensions to the clinical Text Analysis and Knowledge Extraction System (cTAKES) that simplify feature extraction, experimentation with various feature representations, and the development of both rule-based and machine-learning-based document classifiers. The authors describe and evaluate their system, the Yale cTAKES Extensions (YTEX), on the classification of radiology reports that contain findings suggestive of hepatic decompensation. RESULTS AND DISCUSSION: The F1-score of the system for the retrieval of abdominal radiology reports was 96%, and was 79%, 91%, and 95% for the presence of liver masses, ascites, and varices, respectively. The authors released YTEX as open source, available at http://code.google.com/p/ytex.


Subjects
Data Mining , Decision Support Systems, Clinical , Electronic Health Records , Natural Language Processing , Pattern Recognition, Automated , Connecticut , Data Mining/classification , Decision Support Systems, Clinical/classification , Electronic Health Records/classification , Humans , Liver Failure/diagnostic imaging , Pattern Recognition, Automated/classification , Radiography , Radiology Information Systems/classification
20.
J Am Med Inform Assoc ; 18(5): 557-62, 2011.
Article in English | MEDLINE | ID: mdl-21565856

ABSTRACT

OBJECTIVE: As clinical text mining continues to mature, its potential as an enabling technology for innovations in patient care and clinical research is becoming a reality. A critical part of that process is rigid benchmark testing of natural language processing methods on realistic clinical narrative. In this paper, the authors describe the design and performance of three state-of-the-art text-mining applications from the National Research Council of Canada on evaluations within the 2010 i2b2 challenge. DESIGN: The three systems perform three key steps in clinical information extraction: (1) extraction of medical problems, tests, and treatments, from discharge summaries and progress notes; (2) classification of assertions made on the medical problems; (3) classification of relations between medical concepts. Machine learning systems performed these tasks using large-dimensional bags of features, as derived from both the text itself and from external sources: UMLS, cTAKES, and Medline. MEASUREMENTS: Performance was measured per subtask, using micro-averaged F-scores, as calculated by comparing system annotations with ground-truth annotations on a test set. RESULTS: The systems ranked high among all submitted systems in the competition, with the following F-scores: concept extraction 0.8523 (ranked first); assertion detection 0.9362 (ranked first); relationship detection 0.7313 (ranked second). CONCLUSION: For all tasks, we found that the introduction of a wide range of features was crucial to success. Importantly, our choice of machine learning algorithms allowed us to be versatile in our feature design, and to introduce a large number of features without overfitting and without encountering computing-resource bottlenecks.


Subjects
Benchmarking , Data Mining , Electronic Health Records , Natural Language Processing , Algorithms , Artificial Intelligence , Canada , Data Mining/classification , Electronic Health Records/classification , Humans
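
Several of the abstracts above report micro-averaged F-scores computed by comparing system annotations with ground-truth annotations. The following is a minimal sketch of how such a score can be computed, assuming exact-match comparison of annotations. The function name `micro_f1`, the `(start, end, label)` tuple layout, and the toy data are hypothetical illustrations and are not taken from any of the listed papers or from the i2b2/VA evaluation scripts.

```python
def micro_f1(gold_by_doc, pred_by_doc):
    """Return (precision, recall, F1), micro-averaged over all documents.

    Each argument maps a document id to a set of annotations. Counts of
    true positives, false positives and false negatives are pooled across
    documents before the ratios are taken; that pooling is what makes the
    score "micro"-averaged.
    """
    tp = fp = fn = 0
    for doc_id in gold_by_doc.keys() | pred_by_doc.keys():
        gold = gold_by_doc.get(doc_id, set())
        pred = pred_by_doc.get(doc_id, set())
        tp += len(gold & pred)   # annotations the system got exactly right
        fp += len(pred - gold)   # system annotations absent from the gold set
        fn += len(gold - pred)   # gold annotations the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy data: (start_token, end_token, concept_type) per annotation.
gold = {
    "note1": {(0, 2, "problem"), (5, 6, "test")},
    "note2": {(3, 4, "treatment")},
}
pred = {
    "note1": {(0, 2, "problem"), (7, 8, "test")},
    "note2": {(3, 4, "treatment")},
}

precision, recall, f1 = micro_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Because the counts are pooled before dividing, documents with many annotations weigh more heavily than short ones; a macro average would instead compute one score per document (or per class) and then take the mean.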