Results 1 - 20 of 52
1.
Bioinformatics ; 37(Suppl_1): i468-i476, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252939

ABSTRACT

MOTIVATION: Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature, a labor-intensive process. The first step toward biocuration requires identifying articles relevant to the specific area on which the database focuses. Thus, automatically identifying publications relevant to a specific topic within a large volume of publications is an important task toward expediting the biocuration process and, in turn, biomedical research. Current methods focus on textual contents, typically extracted from the title-and-abstract. Notably, images and captions are often used in publications to convey pivotal evidence about processes, experiments and results. RESULTS: We present a new document classification scheme, using both image and caption information, in addition to titles-and-abstracts. To use the image information, we introduce a new image representation, namely Figure-word, based on class labels of subfigures. We use word embeddings for representing captions and titles-and-abstracts. To utilize all three types of information, we introduce two information integration methods. The first combines Figure-words and textual features obtained from captions and titles-and-abstracts into a single larger vector for document representation; the second employs a meta-classification scheme. Our experiments and results demonstrate the usefulness of the newly proposed Figure-words for representing images. Moreover, the results showcase the value of Figure-words, captions and titles-and-abstracts in providing complementary information for document classification; these three sources of information, when combined, lead to an overall improved classification performance.
AVAILABILITY AND IMPLEMENTATION: Source code and the list of PMIDs of the publications in our datasets are available upon request.
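
The two integration methods named in this abstract can be illustrated with a minimal sketch. The function names, the toy vectors and the linear form of the meta-classifier are all assumptions made for illustration; the abstract does not specify how the authors' meta-classification scheme is built.

```python
from typing import List, Sequence


def concatenate_features(figure_words: Sequence[float],
                         caption_vec: Sequence[float],
                         abstract_vec: Sequence[float]) -> List[float]:
    """Method 1 (feature-level integration): join the three per-document
    vectors into a single larger document representation."""
    return list(figure_words) + list(caption_vec) + list(abstract_vec)


def meta_classify(base_scores: Sequence[float],
                  weights: Sequence[float],
                  bias: float = 0.0) -> int:
    """Method 2 (decision-level integration): combine the scores of one base
    classifier per information source. A weighted sum is an assumption here,
    standing in for whatever meta-classifier is actually trained."""
    total = bias + sum(w * s for w, s in zip(weights, base_scores))
    return 1 if total > 0 else 0


# Hypothetical toy document: 3 Figure-word counts, 2 caption and 2 abstract dimensions.
doc_vector = concatenate_features([2.0, 0.0, 1.0], [0.1, 0.4], [0.3, 0.2])

# Hypothetical base-classifier scores for images, captions and title-and-abstract.
label = meta_classify([0.8, -0.2, 0.5], weights=[1.0, 1.0, 1.0])
```

In the first function a single classifier would be trained on `doc_vector`; in the second, each source keeps its own classifier and only the scores are combined.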


Subjects
Biomedical Research , Factual Databases
2.
Bioinformatics ; 35(21): 4381-4388, 2019 11 01.
Article in English | MEDLINE | ID: mdl-30949681

ABSTRACT

MOTIVATION: Figures and captions convey essential information in biomedical documents. As such, there is a growing interest in mining published biomedical figures and in utilizing their respective captions as a source of knowledge. Notably, an essential step underlying such mining is the extraction of figures and captions from publications. While several PDF parsing tools that extract information from such documents are publicly available, they attempt to identify images by analyzing the PDF encoding and structure and the complex graphical objects embedded within. As such, they often incorrectly identify figures and captions in scientific publications, whose structure is often non-trivial. The extraction of figures, captions and figure-caption pairs from biomedical publications is thus neither well-studied nor yet well-addressed. RESULTS: We introduce a new and effective system for figure and caption extraction, PDFigCapX. Unlike existing methods, we first separate text from graphical content, and then utilize layout information to effectively detect and extract figures and captions. We generate files containing the figures and their associated captions and provide those as output to the end-user. We test our system both over a public dataset of computer science documents previously used by others, and over two newly collected sets of publications focusing on the biomedical domain. Our experiments and results comparing PDFigCapX to other state-of-the-art systems show a significant improvement in performance, and demonstrate the effectiveness and robustness of our approach. AVAILABILITY AND IMPLEMENTATION: Our system is publicly available for use at: https://www.eecis.udel.edu/~compbio/PDFigCapX. The two new datasets are available at: https://www.eecis.udel.edu/~compbio/PDFigCapX/Downloads.


Subjects
Publications , Data Mining
3.
Bioinformatics ; 34(7): 1192-1199, 2018 04 01.
Article in English | MEDLINE | ID: mdl-29040394

ABSTRACT

Motivation: Images convey essential information in biomedical publications. As such, there is a growing interest within the bio-curation and the bio-databases communities, to store images within publications as evidence for biomedical processes and for experimental results. However, many of the images in biomedical publications are compound images consisting of multiple panels, where each individual panel potentially conveys a different type of information. Segmenting such images into constituent panels is an essential first step toward utilizing images. Results: In this article, we develop a new compound image segmentation system, FigSplit, which is based on Connected Component Analysis. To overcome shortcomings typically manifested by existing methods, we develop a quality assessment step for evaluating and modifying segmentations. Two methods are proposed to re-segment the images if the initial segmentation is inaccurate. Experimental results show the effectiveness of our method compared with other methods. Availability and implementation: The system is publicly available for use at: https://www.eecis.udel.edu/~compbio/FigSplit. The code is available upon request. Contact: shatkay@udel.edu. Supplementary information: Supplementary data are available online at Bioinformatics.
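
As a rough illustration of Connected Component Analysis, the technique this abstract names, the sketch below labels 4-connected foreground regions in a small binary grid and returns one bounding box per region. It is a generic textbook version under assumed inputs (a list-of-lists binary image); it is not FigSplit's implementation and omits the quality assessment and re-segmentation steps.

```python
from collections import deque


def connected_components(grid):
    """Return bounding boxes (top, left, bottom, right) of the 4-connected
    regions of 1-valued cells, in row-major order of discovery."""
    rows = len(grid)
    cols = len(grid[0]) if grid else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Hypothetical compound image: 1 = panel content, 0 = background gap.
image = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
]
panels = connected_components(image)
```

Each bounding box is a candidate panel; a real system would still need to merge or split boxes when panels touch or contain internal gaps.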


Subjects
Computational Biology/methods , Automated Pattern Recognition , Software , Algorithms , Computer Graphics
4.
New Phytol ; 220(3): 851-864, 2018 11.
Article in English | MEDLINE | ID: mdl-30020552

ABSTRACT

Little is known about the characteristics and function of reproductive phased, secondary, small interfering RNAs (phasiRNAs) in the Poaceae, despite the availability of significant genomic resources, experimental data, and a growing number of computational tools. We utilized machine-learning methods to identify sequence-based and positional features that distinguish phasiRNAs in rice and maize from other small RNAs (sRNAs). We developed Random Forest classifiers that can distinguish reproductive phasiRNAs from other sRNAs in complex sets of sequencing data, utilizing sequence-based features (k-mers) and features describing position-specific sequence biases. The classification performance attained is > 80% in accuracy, sensitivity, specificity, and positive predictive value. Feature selection identified important features in both ends of phasiRNAs. We demonstrated that phasiRNAs have strand specificity and position-specific nucleotide biases potentially influencing AGO sorting; we also predicted targets to infer functions of phasiRNAs, and computationally assessed their sequence characteristics relative to other sRNAs. Our results demonstrate that machine-learning methods effectively identify phasiRNAs despite the lack of characteristic features typically present in precursor loci of other small RNAs, such as sequence conservation or structural motifs. The 5'-end features we identified provide insights into AGO-phasiRNA interactions. We describe a hypothetical model of competition for AGO loading between phasiRNAs of different nucleotide compositions.
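
The two feature families mentioned in this abstract, k-mer counts and position-specific nucleotide features, can be sketched as follows. The value of k, the chosen positions, the feature naming and the example sequence are assumptions for illustration; the classifier training itself is not shown.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def positional_features(seq: str, positions) -> dict:
    """One indicator feature per (position, nucleotide) pair, 0-based."""
    return {f"pos{p}_{seq[p]}": 1 for p in positions if 0 <= p < len(seq)}


# Hypothetical short RNA fragment, used only to show the output shape.
example = "UGAUGA"
kmers = kmer_counts(example, 3)
ends = positional_features(example, [0, 5])
```

Both dictionaries would be merged into one feature vector per small RNA before being passed to a classifier such as a Random Forest.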


Subjects
Poaceae/genetics , Plant RNA/metabolism , Small Interfering RNA/metabolism , Base Composition/genetics , Nucleotides/genetics , Reproduction
5.
J Biomed Inform ; 82: 31-40, 2018 06.
Article in English | MEDLINE | ID: mdl-29655947

ABSTRACT

Patients with multiple co-occurring health conditions often face aggravated complications and less favorable outcomes. Co-occurring conditions are especially prevalent among individuals suffering from kidney disease, an increasingly widespread condition affecting 13% of the general population in the US. This study aims to identify and characterize patterns of co-occurring medical conditions in patients, employing a probabilistic framework. Specifically, we apply topic modeling in a non-traditional way to find associations across SNOMED-CT codes assigned and recorded in the EHRs of >13,000 patients diagnosed with kidney disease. Unlike most prior work on topic modeling, we apply the method to codes rather than to natural language. Moreover, we quantitatively evaluate the topics, assessing their tightness and distinctiveness, and also assess the medical validity of our results. Our experiments show that each topic is succinctly characterized by a few highly probable and unique disease codes, indicating that the topics are tight. Furthermore, inter-topic distance between each pair of topics is typically high, illustrating distinctiveness. Last, most coded conditions grouped together within a topic are indeed reported to co-occur in the medical literature. Notably, our results uncover a few indirect associations among conditions that have hitherto not been reported as correlated in the medical literature.


Subjects
Comorbidity , Kidney Diseases/complications , Medical Informatics/methods , Systematized Nomenclature of Medicine , Aged , Aged 80 and over , Electronic Health Records , Female , Humans , International Classification of Diseases , Kidney Diseases/epidemiology , Male , Middle Aged , Statistical Models , Probability , Reproducibility of Results , United States
6.
BMC Med Inform Decis Mak ; 18(Suppl 4): 125, 2018 12 12.
Article in English | MEDLINE | ID: mdl-30537962

ABSTRACT

BACKGROUND: Chronic Kidney Disease (CKD) is one of several conditions that affect a growing percentage of the US population; the disease is accompanied by multiple co-morbidities, and is hard to diagnose in and of itself. In its advanced forms it carries severe outcomes and can lead to death. It is thus important to detect the disease as early as possible, which can help devise effective intervention and treatment plans. Here we investigate ways to utilize information available in electronic health records (EHRs) from regular office visits of more than 13,000 patients, in order to distinguish among several stages of the disease. While clinical data stored in EHRs provide valuable information for risk-stratification, one of the major challenges in using them arises from data imbalance. That is, records associated with a more severe condition are typically under-represented compared to those associated with a milder manifestation of the disease. To address imbalance, we propose and develop a sampling-based ensemble approach, hierarchical meta-classification, aiming to stratify CKD patients into severity stages, using simple quantitative non-text features gathered from standard office visit records. METHODS: The proposed hierarchical meta-classification method frames the multiclass classification task as a hierarchy of two subtasks. The first is binary classification, separating records associated with the majority class from those associated with all minority classes combined, using meta-classification. The second subtask separates the records assigned to the combined minority classes into the individual constituent classes. RESULTS: The proposed method identifies a significant proportion of patients suffering from the more advanced stages of the condition, while also correctly identifying most of the less severe cases, maintaining high sensitivity, specificity and F-measure (≥ 93%).
Our results show that the high level of performance attained by our method is preserved even when the size of the training set is significantly reduced, demonstrating the stability and generalizability of our approach. CONCLUSION: We present a new approach to perform classification while addressing data imbalance, which is inherent in the biomedical domain. Our model effectively identifies severity stages of CKD patients, using information readily available in office visit records within the realistic context of high data imbalance.
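
The two-subtask hierarchy described under METHODS can be sketched as a simple routing function. Everything below is a hypothetical stand-in: the class labels, the single numeric feature, the thresholds, and the use of a majority vote in place of the trained meta-classifier are assumptions, since the abstract gives no such details.

```python
def majority_vote(classifiers):
    """Combine several binary base classifiers (e.g. each trained on a
    different balanced sample) by simple voting. Voting is an assumption
    standing in for the meta-classifier of the first subtask."""
    def vote(record):
        yes = sum(1 for clf in classifiers if clf(record))
        return yes * 2 > len(classifiers)
    return vote


def hierarchical_predict(record, is_majority, minority_classifier, majority_label):
    """Subtask 1: majority class vs. all minority classes combined.
    Subtask 2: split combined-minority records into individual classes."""
    if is_majority(record):
        return majority_label
    return minority_classifier(record)


# Toy base classifiers: hypothetical thresholds on one numeric feature.
stage_one = majority_vote([
    lambda rec: rec[0] >= 55.0,
    lambda rec: rec[0] >= 60.0,
    lambda rec: rec[0] >= 65.0,
])


def toy_minority(rec):
    # Hypothetical second-level classifier over the minority classes.
    return "minority_A" if rec[0] >= 30.0 else "minority_B"


predictions = [
    hierarchical_predict(rec, stage_one, toy_minority, "majority")
    for rec in ([75.0], [62.0], [45.0], [10.0])
]
```

The point of the hierarchy is that the second-level classifier only ever sees records routed to the combined minority group, so it is trained and applied on a much less imbalanced problem.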


Subjects
Electronic Health Records , Machine Learning , Office Visits , Chronic Renal Insufficiency/classification , Aged , Aged 80 and over , Female , Humans , Male , Middle Aged , Sensitivity and Specificity , Severity of Illness Index
8.
Bioinformatics ; 31(12): i365-74, 2015 Jun 15.
Article in English | MEDLINE | ID: mdl-26072505

ABSTRACT

MOTIVATION: Proteins are responsible for a multitude of vital tasks in all living organisms. Given that a protein's function and role are strongly related to its subcellular location, protein location prediction is an important research area. While proteins move from one location to another and can localize to multiple locations, most existing location prediction systems assign only a single location per protein. A few recent systems attempt to predict multiple locations for proteins; however, their performance leaves much room for improvement. Moreover, such systems do not capture dependencies among locations and usually consider locations as independent. We hypothesize that a multi-location predictor that captures location inter-dependencies can improve location predictions for proteins. RESULTS: We introduce a probabilistic generative model for protein localization, and develop a system based on it, which we call MDLoc, that utilizes inter-dependencies among locations to predict multiple locations for proteins. The model captures location inter-dependencies using Bayesian networks and represents dependency between features and locations using a mixture model. We use iterative processes for learning model parameters and for estimating protein locations. We evaluate our classifier, MDLoc, on a dataset of single- and multi-localized proteins derived from the DBMLoc dataset, which is the most comprehensive protein multi-localization dataset currently available. Our results, obtained by using MDLoc, significantly improve upon results obtained by an initial simpler classifier, as well as on results reported by other top systems. AVAILABILITY AND IMPLEMENTATION: MDLoc is available at: http://www.eecis.udel.edu/~compbio/mdloc.


Subjects
Protein Databases , Theoretical Models , Proteins/metabolism , Bayes Theorem , Humans , Protein Transport , Subcellular Fractions
9.
Bioinformatics ; 31(2): 297-8, 2015 Jan 15.
Article in English | MEDLINE | ID: mdl-25024289

ABSTRACT

UNLABELLED: The ISMB Special Interest Group on Linking Literature, Information and Knowledge for Biology (BioLINK) organized a one-day workshop at ISMB/ECCB 2013 in Berlin, Germany. The theme of the workshop was 'Roles for text mining in biomedical knowledge discovery and translational medicine'. This summary reviews the outcomes of the workshop. Meeting themes included concept annotation methods and applications, extraction of biological relationships and the use of text-mined data for biological data analysis. AVAILABILITY AND IMPLEMENTATION: All articles are available at http://biolinksig.org/proceedings-online/.


Subjects
Computational Biology/methods , Data Mining , Periodicals as Topic , Research Report , Computational Biology/standards , Congresses as Topic , Humans
10.
Methods ; 74: 54-64, 2015 03.
Article in English | MEDLINE | ID: mdl-25448299

ABSTRACT

The current era of large-scale biology is characterized by a fast-paced growth in the number of sequenced genomes and, consequently, by a multitude of identified proteins whose function has yet to be determined. Simultaneously, any known or postulated information concerning genes and proteins is part of the ever-growing published scientific literature, which is expanding at a rate of over a million new publications per year. Computational tools that attempt to automatically predict and annotate protein characteristics, such as function and localization patterns, are being developed along with systems that aim to support the process via text mining. Most work on protein characterization focuses on features derived directly from protein sequence data. Protein-related work that does aim to utilize the literature typically concentrates on extracting specific facts (e.g., protein interactions) from text. In the past few years we have taken a different route, treating the literature as a source of text-based features, which can be employed just as sequence-based protein-features were used in earlier work, for predicting protein subcellular location and possibly also function. We discuss here in detail the overall approach, along with results from work we have done in this area demonstrating the value of this method and its potential use.


Subjects
Computational Biology/methods , Data Mining/methods , Animals , Computational Biology/trends , Data Mining/trends , Protein Databases/trends , Humans , Proteins/genetics
11.
BMC Bioinformatics ; 14 Suppl 3: S14, 2013.
Article in English | MEDLINE | ID: mdl-23514326

ABSTRACT

BACKGROUND: Advances in sequencing technology over the past decade have resulted in an abundance of sequenced proteins whose function is yet unknown. As such, computational systems that can automatically predict and annotate protein function are in demand. Most computational systems use features derived from protein sequence or protein structure to predict function. In an earlier work, we demonstrated the utility of biomedical literature as a source of text features for predicting protein subcellular location. We have also shown that the combination of text-based and sequence-based prediction improves the performance of location predictors. Following up on this work, for the Critical Assessment of Function Annotations (CAFA) Challenge, we developed a text-based system that aims to predict molecular function and biological process (using Gene Ontology terms) for unannotated proteins. In this paper, we present the preliminary work and evaluation that we performed for our system, as part of the CAFA challenge. RESULTS: We have developed a preliminary system that represents proteins using text-based features and predicts protein function using a k-nearest neighbour classifier (Text-KNN). We selected text features for our classifier by extracting key terms from biomedical abstracts based on their statistical properties. The system was trained and tested using 5-fold cross-validation over a dataset of 36,536 proteins. System performance was measured using the standard measures of precision, recall, F-measure and overall accuracy. The performance of our system was compared to two baseline classifiers: one that assigns function based solely on the prior distribution of protein function (Base-Prior) and one that assigns function based on sequence similarity (Base-Seq). 
The overall prediction accuracy of Text-KNN, Base-Prior, and Base-Seq for molecular function classes is 62%, 43%, and 58%, respectively, while the overall accuracy for biological process classes is 17%, 11%, and 28%, respectively. Results obtained as part of the CAFA evaluation itself on the CAFA dataset are reported as well. CONCLUSIONS: Our evaluation shows that the text-based classifier consistently outperforms the baseline classifier that is based on prior distribution, and typically has comparable performance to the baseline classifier that uses sequence similarity. Moreover, the results suggest that combining text features with other types of features can potentially lead to improved prediction performance. The preliminary results also suggest that while our text-based classifier can be used to predict both molecular function and biological process in which a protein is involved, the classifier performs significantly better for predicting molecular function than for predicting biological process. A similar trend was observed for other classifiers participating in the CAFA challenge.
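
A k-nearest-neighbour classifier over text-based features, the general technique behind Text-KNN, can be sketched as below. The cosine similarity, the term-count representation, the value of k and the toy terms and labels are assumptions; the abstract does not state which similarity measure or feature weighting the authors used.

```python
import math
from collections import Counter


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query: dict, training, k: int = 3) -> str:
    """Return the most common label among the k training items
    most similar to the query."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training proteins: (term counts from abstracts, function label).
training = [
    ({"kinase": 2, "phosphorylation": 1}, "kinase activity"),
    ({"kinase": 1, "atp": 1}, "kinase activity"),
    ({"transport": 2, "membrane": 1}, "transporter activity"),
]

prediction_a = knn_predict({"kinase": 1, "atp": 1}, training, k=3)
prediction_b = knn_predict({"membrane": 1, "transport": 1}, training, k=1)
```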


Subjects
Data Mining , Proteins/physiology , Cluster Analysis , Computational Biology/methods , Protein Databases , Molecular Sequence Annotation , Proteins/chemistry , Protein Sequence Analysis , Software , Controlled Vocabulary
12.
Bioinform Adv ; 3(1): vbad095, 2023.
Article in English | MEDLINE | ID: mdl-37485423

ABSTRACT

Motivation: Figures in biomedical papers communicate essential information with the potential to identify relevant documents in biomedical and clinical settings. However, academic search interfaces mainly search over text fields. Results: We describe a search system for biomedical documents that leverages image modalities and an existing index server. We integrate a problem-specific taxonomy of image modalities and image-based data into a custom search system. Our solution features a front-end interface to enhance classical document search results with image-related data, including page thumbnails, figures, captions and image-modality information. We demonstrate the system on a subset of the CORD-19 document collection. A quantitative evaluation demonstrates higher precision and recall for biomedical document retrieval. A qualitative evaluation with domain experts further highlights our solution's benefits to biomedical search. Availability and implementation: A demonstration is available at https://runachay.evl.uic.edu/scholar. Our code and image models can be accessed via github.com/uic-evl/bio-search. The dataset is continuously expanded.

14.
Database (Oxford) ; 2022, 2022 05 18.
Article in English | MEDLINE | ID: mdl-35616099

ABSTRACT

The discovery of drug-drug interactions (DDIs) that have a translational impact among in vitro pharmacokinetics (PK), in vivo PK and clinical outcomes depends largely on the quality of the annotated corpus available for text mining. We have developed a new DDI corpus based on an annotation scheme that builds upon and extends previous ones, where an abstract is fragmented and each fragment is then annotated along eight dimensions, namely, focus, polarity, certainty, evidence, directionality, study type, interaction type and mechanism. The guideline for defining these dimensions has undergone refinement during the annotation process. Our DDI corpus comprises 900 positive DDI abstracts and 750 that are not directly relevant to DDI. The abstracts in the corpus are separated into eight categories of DDI or non-DDI evidence: DDI with pharmacokinetic (PK) mechanism, in vivo DDI PK, DDI clinical, drug-nutrition interaction, single drug, not drug related, in vitro pharmacodynamic (PD) and case report. Seven annotators, three with drug-interaction research experience and four with less such experience, annotated the DDI corpus, with two researchers independently annotating each abstract. After two rounds of annotations with additional training in between, agreement improved from (0.79, 0.96, 0.86, 0.70, 0.91, 0.65, 0.78, 0.90) to (0.93, 0.99, 0.96, 0.94, 0.95, 0.93, 0.96, 0.97) for focus, certainty, evidence, study type, interaction type, mechanisms, polarity and direction, respectively. The novice-level annotators improved from 0.83 to 0.96, while the expert-level annotators maintained high performance with some improvement, from 0.90 to 0.96. In summary, we achieved 96% agreement among each pair of annotators with regard to the eight dimensions. The annotated corpus is now available to the community for inclusion in their text-mining pipelines.
Database URL: https://github.com/zha204/DDI-Corpus-Database/tree/master/DDI%20corpus.
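
Pairwise agreement on one annotation dimension can be computed as in the sketch below. It assumes raw (observed) agreement, i.e. the fraction of fragments on which two annotators chose the same label; the abstract does not say whether its figures are raw proportions or a chance-corrected statistic, and the labels shown are invented for illustration.

```python
def percent_agreement(labels_a, labels_b) -> float:
    """Fraction of items on which two annotators assigned the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("at least one labeled item is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical polarity labels for four fragments of one abstract.
annotator_1 = ["positive", "negative", "positive", "positive"]
annotator_2 = ["positive", "negative", "negative", "positive"]
agreement = percent_agreement(annotator_1, annotator_2)
```

Repeating this per dimension and per annotator pair gives a table comparable in shape to the eight values listed in the abstract.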


Subjects
Data Mining , Data Mining/methods , Factual Databases , Drug Interactions , Humans
15.
Front Artif Intell ; 5: 832909, 2022.
Article in English | MEDLINE | ID: mdl-35757296

ABSTRACT

This work proposes a domain-informed neural network architecture for experimental particle physics, using particle interaction localization with the time-projection chamber (TPC) technology for dark matter research as an example application. A key feature of the signals generated within the TPC is that they allow localization of particle interactions through a process called reconstruction (i.e., inverse-problem regression). While multilayer perceptrons (MLPs) have emerged as a leading contender for reconstruction in TPCs, such a black-box approach does not reflect prior knowledge of the underlying scientific processes. This paper looks anew at neural network-based interaction localization and encodes prior detector knowledge, in terms of both signal characteristics and detector geometry, into the feature encoding and the output layers of a multilayer (deep) neural network. The resulting neural network, termed Domain-informed Neural Network (DiNN), limits the receptive fields of the neurons in the initial feature encoding layers in order to account for the spatially localized nature of the signals produced within the TPC. This aspect of the DiNN, which has similarities with the emerging area of graph neural networks in that the neurons in the initial layers only connect to a handful of neurons in their succeeding layer, significantly reduces the number of parameters in the network in comparison to an MLP. In addition, in order to account for the detector geometry, the output layers of the network are modified using two geometric transformations to ensure the DiNN produces localizations within the interior of the detector. The end result is a neural network architecture that has 60% fewer parameters than an MLP, but that still achieves similar localization performance and provides a path to future architectural developments with improved performance because of their ability to encode additional domain knowledge into the architecture.

16.
J Am Med Inform Assoc ; 29(11): 1879-1889, 2022 10 07.
Article in English | MEDLINE | ID: mdl-35923089

ABSTRACT

OBJECTIVE: Abnormalities in impulse propagation and cardiac repolarization are frequent in hypertrophic cardiomyopathy (HCM), leading to abnormalities in 12-lead electrocardiograms (ECGs). Computational ECG analysis can identify electrophysiological and structural remodeling and predict arrhythmias. This requires accurate ECG segmentation. It is unknown whether current segmentation methods developed using datasets containing annotations for mostly normal heartbeats perform well in HCM. Here, we present a segmentation method to effectively identify ECG waves across 12-lead HCM ECGs. METHODS: We develop (1) a web-based tool that permits manual annotations of P, P', QRS, R', S', T, T', U, J, epsilon waves, QRS complex slurring, and atrial fibrillation by 3 experts and (2) an easy-to-implement segmentation method that effectively identifies ECG waves in normal and abnormal heartbeats. Our method was tested on 131 12-lead HCM ECGs and 2 public ECG sets to evaluate its performance in non-HCM ECGs. RESULTS: Over the HCM dataset, our method obtained a sensitivity of 99.2% and 98.1% and a positive predictive value of 92% and 95.3% when detecting QRS complex and T-offset, respectively, significantly outperforming a state-of-the-art segmentation method previously employed for HCM analysis. Over public ECG sets, it significantly outperformed 3 state-of-the-art methods when detecting P-onset and peak, T-offset, and QRS-onset and peak regarding the positive predictive value and segmentation error. It performed at a level similar to other methods in other tasks. CONCLUSION: Our method accurately identified ECG waves in the HCM dataset, outperforming a state-of-the-art method, and performed comparably to other methods on normal/non-HCM ECG sets.
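
Sensitivity and positive predictive value for wave detection are usually computed by matching detected positions to reference annotations within a tolerance window. The sketch below shows one such computation; the greedy matching rule, the tolerance value and the sample indices are assumptions for illustration and are not taken from the paper.

```python
def detection_metrics(reference, detected, tolerance):
    """Match each reference position to at most one detection within
    `tolerance` samples, then return (sensitivity, positive predictive value)."""
    unused = sorted(detected)
    true_positives = 0
    for ref in sorted(reference):
        # Greedy matching: take the first unused detection close enough to ref.
        match = next((d for d in unused if abs(d - ref) <= tolerance), None)
        if match is not None:
            unused.remove(match)
            true_positives += 1
    false_negatives = len(reference) - true_positives
    false_positives = len(detected) - true_positives
    sensitivity = (true_positives / (true_positives + false_negatives)
                   if reference else 0.0)
    ppv = (true_positives / (true_positives + false_positives)
           if detected else 0.0)
    return sensitivity, ppv


# Hypothetical sample indices of annotated and detected QRS complexes.
annotated = [100, 300, 500, 700]
found = [102, 298, 510, 900]
sens, ppv = detection_metrics(annotated, found, tolerance=5)
```

Sensitivity penalizes missed waves, while positive predictive value penalizes spurious detections, which is why the abstract reports both.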


Subjects
Hypertrophic Cardiomyopathy , Hypertrophic Cardiomyopathy/diagnosis , Electrocardiography/methods , Humans , Predictive Value of Tests
17.
BMC Bioinformatics ; 12 Suppl 8: S12, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151823

ABSTRACT

BACKGROUND: We participated, as Team 81, in the Article Classification and the Interaction Method subtasks (ACT and IMT, respectively) of the Protein-Protein Interaction task of the BioCreative III Challenge. For the ACT, we pursued an extensive testing of available Named Entity Recognition and dictionary tools, and used the most promising ones to extend our Variable Trigonometric Threshold linear classifier. Our main goal was to exploit the power of available named entity recognition and dictionary tools to aid in the classification of documents relevant to Protein-Protein Interaction (PPI). For the IMT, we focused on obtaining evidence in support of the interaction methods used, rather than on tagging the document with the method identifiers. We experimented with a primarily statistical approach, as opposed to employing a deeper natural language processing strategy. In a nutshell, we exploited classifiers, simple pattern matching for potential PPI methods within sentences, and ranking of candidate matches using statistical considerations. Finally, we also studied the benefits of integrating the method extraction approach that we have used for the IMT into the ACT pipeline. RESULTS: For the ACT, our linear article classifier leads to a ranking and classification performance significantly higher than all the reported submissions to the challenge in terms of Area Under the Interpolated Precision and Recall Curve, Matthews Correlation Coefficient, and F-Score. We observe that the most useful Named Entity Recognition and Dictionary tools for classification of articles relevant to protein-protein interaction are: ABNER, NLPROT, OSCAR 3 and the PSI-MI ontology. For the IMT, our results are comparable to those of other systems, which took very different approaches. While the performance is not very high, we focus on providing evidence for potential interaction detection methods.
A significant majority of the evidence sentences, as evaluated by independent annotators, are relevant to PPI detection methods. CONCLUSIONS: For the ACT, we show that the use of named entity recognition tools leads to a substantial improvement in the ranking and classification of articles relevant to protein-protein interaction. Thus, we show that our substantially expanded linear classifier is a very competitive classifier in this domain. Moreover, this classifier produces interpretable surfaces that can be understood as "rules" for human understanding of the classification. We also provide evidence supporting certain named entity recognition tools as beneficial for protein-interaction article classification, or demonstrating that some of the tools are not beneficial for the task. In terms of the IMT task, in contrast to other participants, our approach focused on identifying sentences that are likely to bear evidence for the application of a PPI detection method, rather than on classifying a document as relevant to a method. As BioCreative III did not perform an evaluation of the evidence provided by the system, we have conducted a separate assessment, where multiple independent annotators manually evaluated the evidence produced by one of our runs. Preliminary results from this experiment are reported here and suggest that the majority of the evaluators agree that our tool is indeed effective in detecting relevant evidence for PPI detection methods. Regarding the integration of both tasks, we note that the time required for running each pipeline is realistic within a curation effort, and that we can, without compromising the quality of the output, reduce the time necessary to extract entities from text for the ACT pipeline by pre-selecting candidate relevant text using the IMT pipeline.


Subjects
Data Mining , Proteins/metabolism , Humans , Natural Language Processing , Periodicals as Topic
18.
BMC Bioinformatics ; 12 Suppl 8: S3, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151929

ABSTRACT

BACKGROUND: Determining the usefulness of biomedical text mining systems requires realistic task definitions and data selection criteria without artificial constraints, and the measurement of performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations and tried to address how the end user would oversee the generated output, for instance by providing ranked results, providing textual evidence for human interpretation, or measuring the time saved by using automated systems. Detecting articles that describe complex biological events like PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. For this purpose the BCIII-ACT corpus was provided; it comprises training, development and test sets totaling over 12,000 PPI-relevant and non-relevant PubMed abstracts, labeled manually by domain experts, with the human classification times also recorded. The Interaction Method Task (IMT) went beyond abstracts and required mining for associations between more than 3,500 full text articles and interaction detection method ontology concepts that had been applied to detect the PPIs reported in them. RESULTS: A total of 11 teams participated in at least one of the two PPI tasks (10 in ACT and 8 in the IMT) and a total of 62 persons were involved either as participants or in preparing data sets/evaluating these tasks. Per task, each team was allowed to submit five runs offline and another five online via the BioCreative Meta-Server. From the 52 runs submitted for the ACT, the highest Matthews correlation coefficient (MCC) score measured was 0.55 at an accuracy of 89% and the best AUC iP/R was 68%. Most ACT teams explored machine learning methods; some of them also used lexical resources like MeSH terms, PSI-MI concepts or particular lists of verbs and nouns, and some integrated NER approaches.
For the IMT, a total of 42 runs were evaluated by comparing systems against annotations generated manually by curators from the BioGRID and MINT databases. The highest AUC iP/R achieved by any run was 53%, and the best MCC score was 0.55. For competitive systems with an acceptable recall (above 35%), the macro-averaged precision ranged between 50% and 80%, with a maximum F-Score of 55%. CONCLUSIONS: The results of the ACT task of BioCreative III indicate that classification of large unbalanced article collections reflecting the real class imbalance is still challenging. Nevertheless, text-mining tools that report ranked lists of relevant articles for manual selection can potentially reduce the time needed to identify half of the relevant articles to less than 1/4 of the time needed with unranked results. Detecting associations between full text articles and interaction detection method PSI-MI terms (IMT) is more difficult than might be anticipated. This is due to the variability of method term mentions, errors resulting from pre-processing of articles provided as PDF files, and the heterogeneity and different granularity of method term concepts encountered in the ontology. However, combining the sophisticated techniques developed by the participants with supporting evidence strings derived from the articles for human interpretation could result in practical modules for biological annotation workflows.
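The abstract above reports both accuracy and the Matthews correlation coefficient (MCC) for an unbalanced collection. As a rough illustration of why the two can diverge, the following minimal sketch computes both from confusion-matrix counts. The function names and the example counts are hypothetical and are not taken from the BioCreative III evaluation.

```python
import math


def accuracy(tp, fp, fn, tn):
    """Fraction of all items that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def matthews_corrcoef(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal is empty (the usual convention for
    the otherwise undefined 0/0 case).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for an unbalanced test set (100 relevant, 900 not).
counts = dict(tp=50, fp=30, fn=50, tn=870)
print(f"accuracy = {accuracy(**counts):.3f}")
print(f"MCC      = {matthews_corrcoef(**counts):.3f}")
```

With these made-up counts the accuracy is high because the majority class dominates, while the MCC stays moderate because half of the relevant items are missed.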


Subjects
Algorithms , Data Mining , Proteins/metabolism , Animals , Databases, Protein , Humans , Periodicals as Topic , PubMed
19.
Ann Biomed Eng ; 49(2): 573-584, 2021 Feb.
Article in English | MEDLINE | ID: mdl-32779056

ABSTRACT

Prostate cancer (PCa) is a common, serious form of cancer in men that is still prevalent despite ongoing developments in diagnostic oncology. Current detection methods lead to high rates of inaccurate diagnosis. We present a method to directly model and exploit temporal aspects of temporal enhanced ultrasound (TeUS) for tissue characterization, which improves malignancy prediction. We employ a probabilistic-temporal framework, namely, hidden Markov models (HMMs), for modeling TeUS data obtained from PCa patients. We distinguish malignant from benign tissue by comparing the respective log-likelihood estimates generated by the HMMs. We analyze 1100 TeUS signals acquired from 12 patients. Our results show improved malignancy identification compared to previous results, demonstrating over 85% accuracy and AUC of 0.95. Incorporating temporal information directly into the models leads to improved tissue differentiation in PCa. We expect our method to generalize and be applied to other types of cancer in which temporal-ultrasound can be recorded.
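The decision rule described above, comparing the log-likelihoods that two HMMs assign to the same signal, can be sketched as follows. This is only an illustration under strong assumptions: the signal is taken to be already quantized into discrete symbols, and the two toy models, their parameters, and the function names are all hypothetical rather than the models used in the study.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the scaled forward algorithm.

    start[i]    : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability of emitting symbol o in state i
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    scale = sum(alpha)
    loglik = math.log(scale)
    alpha = [a / scale for a in alpha]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
            for j in range(n)
        ]
        scale = sum(alpha)          # rescale to avoid numerical underflow
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik


def classify(obs, benign_hmm, malignant_hmm):
    """Label a sequence by the model that gives it the higher log-likelihood."""
    ll_benign = forward_log_likelihood(obs, *benign_hmm)
    ll_malignant = forward_log_likelihood(obs, *malignant_hmm)
    return "malignant" if ll_malignant > ll_benign else "benign"


# Hypothetical two-state models over a two-symbol alphabet {0, 1}.
START = [0.5, 0.5]
TRANS = [[0.9, 0.1], [0.1, 0.9]]
BENIGN = (START, TRANS, [[0.9, 0.1], [0.7, 0.3]])     # mostly emits 0
MALIGNANT = (START, TRANS, [[0.3, 0.7], [0.1, 0.9]])  # mostly emits 1

print(classify([1, 1, 1, 0, 1], BENIGN, MALIGNANT))
print(classify([0, 0, 0, 0, 1], BENIGN, MALIGNANT))
```

In practice each model would be trained on labeled signals of its class before being used for comparison; the hand-set parameters here only make the comparison step visible.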


Subjects
Models, Theoretical , Prostate/diagnostic imaging , Prostatic Neoplasms/diagnosis , Humans , Male , Markov Chains , Ultrasonography
20.
CJC Open ; 3(6): 801-813, 2021 Jun.
Article in English | MEDLINE | ID: mdl-34169259

ABSTRACT

BACKGROUND: Hypertrophic cardiomyopathy (HCM) patients have a high incidence of atrial fibrillation (AF) and increased stroke risk, even with low CHA2DS2-VASc (congestive heart failure, hypertension, age, diabetes, previous stroke/transient ischemic attack) scores. Hence, there is a need to understand the pathophysiology of AF/stroke in HCM. In this retrospective study, we develop and apply a data-driven, machine learning-based method to identify AF cases, and clinical/imaging features associated with AF, using electronic health record data. METHODS: HCM patients with documented paroxysmal/persistent/permanent AF (n = 191) were considered AF cases, and the remaining patients in sinus rhythm (n = 640) were tagged as No-AF. We evaluated 93 clinical variables; the most informative variables useful for distinguishing AF from No-AF cases were selected based on the 2-sample t test and the information gain criterion. RESULTS: We identified 18 highly informative variables that are positively (n = 11) and negatively (n = 7) correlated with AF in HCM. Next, patient records were represented via these 18 variables. Data imbalance resulting from the relatively low number of AF cases was addressed via a combination of oversampling and undersampling strategies. We trained and tested multiple classifiers under this sampling approach, showing effective classification. Specifically, an ensemble of logistic regression and naïve Bayes classifiers, trained based on the 18 variables and corrected for data imbalance, proved most effective for separating AF from No-AF cases (sensitivity = 0.74, specificity = 0.70, C-index = 0.80). CONCLUSIONS: Our model (HCM-AF-Risk Model) is the first machine learning-based method for identification of AF cases in HCM. This model demonstrates good performance, addresses data imbalance, and suggests that AF is associated with a more severe cardiac HCM phenotype.
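The information gain criterion mentioned in the METHODS can be illustrated with a short sketch. It scores one categorical variable by how much knowing its value reduces the entropy of the AF / No-AF label. The class totals (191 AF, 640 No-AF) come from the abstract; the way the hypothetical binary variable splits those patients is invented for illustration, as are the function names.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(split):
    """Information gain of a categorical variable with respect to the label.

    split: one (af_count, no_af_count) pair per value of the variable.
    """
    af = sum(a for a, _ in split)
    no_af = sum(n for _, n in split)
    total = af + no_af
    conditional = sum(
        ((a + n) / total) * entropy([a, n]) for a, n in split if a + n > 0
    )
    return entropy([af, no_af]) - conditional


# Hypothetical binary variable: (AF, No-AF) counts when present / absent.
# The column totals match the abstract (191 AF, 640 No-AF); the split does not
# come from the study.
hypothetical_split = [(120, 200), (71, 440)]
print(f"class entropy    = {entropy([191, 640]):.3f} bits")
print(f"information gain = {information_gain(hypothetical_split):.3f} bits")
```

Variables would then be ranked by this score, with the highest-scoring ones retained; the abstract states that the 2-sample t test was used alongside this criterion but does not say how the two were combined.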

