Results 1 - 8 of 8
1.
BMC Bioinformatics ; 20(1): 216, 2019 Apr 29.
Article in English | MEDLINE | ID: mdl-31035936

ABSTRACT

BACKGROUND: Large biological databases such as GenBank contain vast numbers of records whose content is substantively based on external resources, including the published literature. Manual curation is used to establish whether the literature and the records are indeed consistent. In this paper we explore an automated method for assessing the consistency of biological assertions, to assist biocurators, which we call BARC (Biocuration tool for Assessment of Relation Consistency). In this method, a biological assertion is represented as a relation between two objects (for example, a gene and a disease); we then use our novel set-based relevance algorithm SaBRA to retrieve pertinent literature, and apply a classifier to estimate the likelihood that the relation (assertion) is correct. RESULTS: Our experiments on assessing gene-disease relations and protein-protein interactions using the PubMed Central collection show that BARC can effectively assist curators in performing data cleansing. Specifically, BARC substantially outperforms the best baselines, with F-measure improvements of 3.5% and 13% on gene-disease relations and protein-protein interactions, respectively. A feature analysis additionally showed that all feature types are informative, as are all fields of the documents. CONCLUSIONS: BARC provides a clear benefit for the biocuration community, as there are no prior automated tools for identifying inconsistent assertions in large-scale biological databases.
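The final classification step described above can be sketched as follows. This is a minimal illustration only: the feature set (a retrieval score and co-mention statistics) and the logistic-regression classifier are assumptions for the sketch, not the paper's actual SaBRA retrieval features or chosen model.

```python
# Hedged sketch: estimating the likelihood that a relation (e.g. gene-disease)
# is consistent with retrieved literature, from hypothetical IR-based features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [top-document retrieval score, co-mention count of the two objects,
# fraction of retrieved abstracts mentioning both] -- illustrative values.
X_train = np.array([
    [0.9, 12, 0.8],   # assertions known to be consistent
    [0.8,  9, 0.7],
    [0.2,  1, 0.1],   # assertions known to be inconsistent
    [0.1,  0, 0.0],
])
y_train = np.array([1, 1, 0, 0])

clf = LogisticRegression().fit(X_train, y_train)
# Likelihood that a new gene-disease assertion is supported by the literature.
prob_consistent = clf.predict_proba([[0.85, 10, 0.75]])[0, 1]
print(prob_consistent)
```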


Subjects
Algorithms , Data Mining/methods , Databases, Factual , Databases, Nucleic Acid , Humans , Protein Interaction Maps , Publishing
2.
J Biomed Inform ; 71: 229-240, 2017 07.
Article in English | MEDLINE | ID: mdl-28624643

ABSTRACT

We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatically detecting data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record against the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). We then use these quality indicators to train an anomaly detection algorithm to classify records as "confident" or "suspicious". Our experiments on the PubMed Central collection show that assessing the coherence between the literature and database records, through our algorithms, is an effective mechanism for assisting curators to perform data cleansing. Although fewer than 0.25% of the records in our data set are known to be faulty, we expect that there are many more in GenBank that have not yet been identified. By automated comparison with the literature, they can be identified with a precision of up to 10% and a recall of up to 30%, strongly outperforming several baselines. While these results leave substantial room for improvement, they reflect both the very imbalanced nature of the data and the limited explicitly labelled data available. Overall, the results show promise for a new kind of approach to detecting low-quality and suspicious sequence records based on literature analysis and consistency. From a practical point of view, this will greatly help curators to identify inconsistent records in large-scale sequence databases by highlighting records that are likely to be inconsistent with the literature.
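The "confident"/"suspicious" classification can be sketched with a generic anomaly detector over quality-indicator vectors. The indicator values below are synthetic and the choice of isolation forest is an illustrative stand-in, not necessarily the algorithm used in the paper.

```python
# Hedged sketch: flagging suspicious records from IR-based quality indicators.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical 3-dimensional quality-indicator vectors: most records agree
# with their cited literature and cluster together...
confident = rng.normal(loc=0.8, scale=0.05, size=(200, 3))
# ...while a few records disagree strongly (synthetic outliers).
suspicious = rng.normal(loc=0.1, scale=0.05, size=(5, 3))
X = np.vstack([confident, suspicious])

detector = IsolationForest(contamination=0.05, random_state=0).fit(X)
labels = detector.predict(X)  # +1 = "confident", -1 = "suspicious"
n_flagged = int((labels == -1).sum())
print(n_flagged)
```

In practice the flagged records would be handed to a curator for manual review rather than removed automatically.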


Subjects
Algorithms , Data Curation , Databases, Bibliographic , Databases, Nucleic Acid , Information Storage and Retrieval , Humans , PubMed , Publications , Quality Control
3.
Article in English | MEDLINE | ID: mdl-37527325

ABSTRACT

Traditional support vector machines (SVMs) are fragile in the presence of outliers; even a single corrupt data point can arbitrarily alter the quality of the approximation, and if even a small fraction of columns is corrupted, classification performance will inevitably deteriorate. This article considers the problem of high-dimensional data classification where a number of the columns are arbitrarily corrupted. We propose an efficient Support Matrix Machine that simultaneously performs matrix Recovery (SSMRe), i.e. feature selection and classification through joint minimization of the l2,1 norm and the nuclear norm of L. The data are assumed to consist of a low-rank clean matrix plus a sparse noise matrix. SSMRe works under incoherence and ambiguity conditions and is able to recover an intrinsic matrix of higher rank even when the data are densely corrupted. The objective function is a spectral extension of the conventional elastic net; it combines matrix recovery with low rank and joint sparsity to deal with complex high-dimensional noisy data. Furthermore, SSMRe leverages structural information as well as the intrinsic structure of the data, avoiding the inevitable upper bound. Experimental results on different real-time applications, supported by theoretical analysis and statistical testing, show significant gains on BCI, face recognition, and person identification datasets, especially in the presence of outliers, while preserving a reasonable number of support vectors.
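The low-rank-plus-sparse data model assumed above can be sketched with a plain alternating shrinkage scheme: singular-value thresholding for the low-rank part and entrywise soft thresholding for the sparse corruption. This is a generic decomposition sketch with arbitrary thresholds, not the paper's actual SSMRe optimizer.

```python
# Hedged sketch: decomposing an observed matrix M into L (low rank) + S (sparse).
import numpy as np

def soft_threshold(x, tau):
    """Entrywise soft thresholding: shrink magnitudes by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def decompose(M, lam=0.1, n_iter=100):
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    for _ in range(n_iter):
        # Singular-value thresholding on M - S gives the low-rank estimate.
        U, sig, Vt = np.linalg.svd(M - S, full_matrices=False)
        L = U @ np.diag(soft_threshold(sig, 1.0)) @ Vt
        # Soft thresholding on the residual gives the sparse estimate.
        S = soft_threshold(M - L, lam)
    return L, S

rng = np.random.default_rng(1)
low_rank = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 20))  # rank 3
sparse = np.zeros((20, 20))
sparse[rng.integers(0, 20, 10), rng.integers(0, 20, 10)] = 10.0  # gross outliers
L, S = decompose(low_rank + sparse)
print(np.linalg.norm(L - low_rank) / np.linalg.norm(low_rank))
```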

4.
PeerJ Comput Sci ; 8: e991, 2022.
Article in English | MEDLINE | ID: mdl-35721404

ABSTRACT

Twitter represents a massively distributed information source over topics ranging from social and political events to entertainment and sports news. While recent work has suggested this content can be narrowed down to the personalized interests of individual users by training topic filters using standard classifiers, there remain many open questions about the efficacy of such classification-based filtering approaches. For example, over a year or more after training, how well do such classifiers generalize to future novel topical content, and are such results stable across a range of topics? In addition, how robust is a topic classifier over the time horizon, e.g., can a model trained in one year be used for making predictions in the subsequent year? Furthermore, what features, feature classes, and feature attributes are most critical for long-term classifier performance? To answer these questions, we collected a corpus of over 800 million English Tweets via the Twitter streaming API during 2013 and 2014 and learned topic classifiers for 10 diverse themes ranging from social issues to celebrity deaths to the "Iran nuclear deal". The results of this long-term study of topic classifier performance provide a number of important insights, among them that: (i) such classifiers can indeed generalize to novel topical content with high precision over a year or more after training, though performance degrades with time, (ii) the classes of hashtags and simple terms contain the most informative feature instances, (iii) removing tweets containing training hashtags from the validation set allows better generalization, and (iv) the simple volume of tweets by a user correlates more with their informativeness than their follower or friend count. In summary, this work provides a long-term study of topic classifiers on Twitter that further justifies classification-based topical filtering approaches while providing detailed insight into the feature properties most critical for topic classifier performance.
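A hashtag-plus-terms topic filter of the kind studied above can be sketched as a standard text-classification pipeline. The toy tweets, topic labels, and pipeline choices here are illustrative assumptions; the study's actual features and models are not reproduced.

```python
# Hedged sketch: topic filtering where hashtags and plain terms are features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = [
    "#IranDeal talks resume in Vienna",
    "negotiators push for the nuclear agreement #IranDeal",
    "great goal in the cup final tonight #football",
    "what a save by the keeper #football",
]
topics = ["iran_deal", "iran_deal", "sports", "sports"]

# The token pattern keeps the leading '#', so hashtags become feature
# instances distinct from the same word used as a plain term.
clf = make_pipeline(
    CountVectorizer(token_pattern=r"#?\w+", lowercase=True),
    LogisticRegression(),
).fit(tweets, topics)

pred = clf.predict(["the nuclear agreement with iran moves forward"])[0]
print(pred)
```

Note the test tweet contains no training hashtag, mirroring the study's observation that generalization is better measured on tweets without the training hashtags.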

5.
Neural Comput Appl ; : 1-13, 2021 Oct 26.
Article in English | MEDLINE | ID: mdl-34720443

ABSTRACT

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus that causes COVID-19, spread aggressively all over the world in just a few months. Since then, multiple variants have emerged that are far more contagious than the parent strain. Rapid and accurate diagnosis of COVID-19 and its variants is crucial for treatment, analysis of lung damage, and quarantine management. Deep learning-based solutions for efficient and accurate diagnosis of COVID-19 and its variants using chest X-ray and computed tomography images could help counter its outbreak. This work presents a novel depth-wise residual network with an atrous mechanism for accurate segmentation and lesion localization of COVID-19-affected areas using volumetric CT images. The proposed framework consists of 3D depth-wise and 3D residual squeeze-and-excitation blocks, arranged in cascade and in parallel, to uniformly capture multi-scale context (low-level detailed, mid-level comprehensive, and high-level rich semantic features). The squeeze-and-excitation block adaptively recalibrates channel-wise feature responses by explicitly modelling inter-dependencies between channels. We further introduce an atrous mechanism with different atrous rates at the bottom layer. Extensive experiments on benchmark CT datasets showed a considerable gain (5%) in accurate segmentation and lesion localization of COVID-19-affected areas.
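The channel recalibration performed by a squeeze-and-excitation block can be sketched as follows. This is a 2D NumPy illustration of the general mechanism, with random weights standing in for learned parameters; the paper uses 3D blocks inside a trained network.

```python
# Hedged sketch: squeeze-and-excitation channel recalibration on a feature map.
import numpy as np

def squeeze_excite(feature_map, w1, w2):
    # feature_map: (channels, H, W). Squeeze: global average pool per channel.
    z = feature_map.mean(axis=(1, 2))                # (C,)
    # Excite: bottleneck MLP (ReLU then sigmoid) yields per-channel gates.
    h = np.maximum(w1 @ z, 0.0)                      # (C // r,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ h)))          # (C,), each in (0, 1)
    # Recalibrate: rescale each channel by its gate.
    return feature_map * gates[:, None, None]

rng = np.random.default_rng(0)
C, r = 8, 2                                          # channels, reduction ratio
x = rng.normal(size=(C, 16, 16))
w1 = rng.normal(size=(C // r, C))                    # stand-ins for learned weights
w2 = rng.normal(size=(C, C // r))
y = squeeze_excite(x, w1, w2)
print(y.shape)
```

Because every gate lies in (0, 1), the block can only attenuate channels; which channels are attenuated depends on the (here random, in practice learned) weights.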

6.
Can J Cardiol ; 36(6): 878-885, 2020 06.
Article in English | MEDLINE | ID: mdl-32204950

ABSTRACT

BACKGROUND: The ability to predict readmission accurately after hospitalization for acute myocardial infarction (AMI) is limited in current statistical models. Machine-learning (ML) methods have shown improved predictive ability in various clinical contexts, but their utility in predicting readmission after hospitalization for AMI is unknown. METHODS: Using detailed clinical information collected from patients hospitalized with AMI, we evaluated 6 ML algorithms (logistic regression, naïve Bayes, support vector machines, random forest, gradient boosting, and deep neural networks) to predict readmission within 30 days and 1 year of discharge. A nested cross-validation approach was used to develop and test models. We used C-statistics to compare discriminatory capacity, whereas the Brier score was used to indicate overall model performance. Model calibration was assessed using calibration plots. RESULTS: The 30-day readmission rate was 16.3%, whereas the 1-year readmission rate was 45.1%. For 30-day readmission, the discriminative ability for the ML models was modest (C-statistic 0.641; 95% confidence interval (CI), 0.621-0.662 for gradient boosting) and did not outperform previously reported methods. For 1-year readmission, different ML models showed moderate performance, with C-statistics around 0.72. Despite modest discriminatory capabilities, the observed readmission rates were markedly higher in the tenth decile of predicted risk compared with the first decile of predicted risk for both 30-day and 1-year readmission. CONCLUSIONS: Despite including detailed clinical information and evaluating various ML methods, these models did not have better discriminatory ability to predict readmission outcomes compared with previously reported methods.
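The nested cross-validation scheme with the C-statistic (equivalent to ROC AUC) as the discrimination metric can be sketched as below. The synthetic data and the single logistic-regression model are illustrative; the study tuned and compared six model families on detailed clinical features.

```python
# Hedged sketch: nested CV -- inner loop tunes hyperparameters, outer loop
# gives an unbiased estimate of the C-statistic (ROC AUC).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

inner = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0]},  # regularization strengths to tune
    scoring="roc_auc",
    cv=3,
)
outer_auc = cross_val_score(inner, X, y, scoring="roc_auc", cv=5)
print(outer_auc.mean())
```

Calibration (the Brier score and calibration plots mentioned above) would be assessed separately, since a high C-statistic alone does not imply well-calibrated predicted risks.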


Subjects
Algorithms , Hospitalization/statistics & numerical data , Machine Learning , Myocardial Infarction , Patient Readmission/statistics & numerical data , Canada/epidemiology , Female , Humans , Male , Middle Aged , Myocardial Infarction/epidemiology , Myocardial Infarction/therapy , Predictive Value of Tests , Prognosis , Risk Assessment/methods , Risk Factors , Time Factors
7.
Database (Oxford) ; 2017, 2017 01 01.
Article in English | MEDLINE | ID: mdl-29220457

ABSTRACT

In the context of the bioCADDIE challenge addressing information retrieval of biomedical datasets, we propose a method for retrieval of biomedical datasets with heterogeneous schemas through query reformulation. In particular, the proposed method transforms the initial query into a multi-field query that is then enriched with terms that are likely to occur in the relevant datasets. We compare and evaluate two query expansion strategies, one based on the Rocchio method and another based on a biomedical lexicon. We then perform a comprehensive comparative evaluation of our method on the bioCADDIE dataset collection for biomedical retrieval. We demonstrate the effectiveness of our multi-field query method compared to two baselines, with MAP improved from 0.2171 and 0.2669 to 0.2996. We also show the benefits of query expansion, where the Rocchio expansion method improves the MAP for our two baselines from 0.2171 and 0.2669 to 0.335. We show that the Rocchio query expansion method slightly outperforms the one based on the biomedical lexicon as a source of terms, with an improvement of roughly 3% in MAP. However, the query expansion method based on the biomedical lexicon is much less resource intensive, since it requires neither computation of a relevance feedback set nor an initial execution of the query. Hence, in terms of the trade-off between efficiency, execution time, and retrieval accuracy, we argue that the query expansion method based on the biomedical lexicon offers the best performance for a prototype biomedical data search engine intended to be used at large scale. In the official bioCADDIE challenge results, although our approach is ranked seventh on the infNDCG evaluation metric, it ranks second on P@10 and NDCG. Hence, the method proposed here provides overall good retrieval performance relative to the approaches of other competitors. Consequently, the observations made in this paper should benefit the development of a Data Discovery Index prototype or the improvement of the existing one.


Subjects
Biomedical Research , Databases, Factual , Information Storage and Retrieval/methods
8.
Database (Oxford) ; 2017(1), 2017 01 01.
Article in English | MEDLINE | ID: mdl-28365737

ABSTRACT

Bioinformatics sequence databases such as GenBank or UniProt contain hundreds of millions of records of genomic data. These records are derived from direct submissions from individual laboratories as well as from bulk submissions from large-scale sequencing centres; their diversity and scale mean that they suffer from a range of data quality issues, including errors, discrepancies, redundancies, ambiguities, incompleteness, and inconsistencies with the published literature. In this work, we investigate and analyze the data quality of sequence databases from the perspective of a curator, who must detect anomalous and suspicious records. Specifically, we emphasize the detection of records that are inconsistent with the literature. Focusing on GenBank, we propose a set of 24 quality indicators, which are based on treating a record as a query into the published literature and then applying query quality predictors. We then carry out an analysis showing that the proposed quality indicators and the quality of the records are mutually related, each depending on the other. We propose to represent record-literature consistency as a vector of these quality indicators. By reducing the dimensionality of this representation for visualization purposes using principal component analysis, we show that records which have been reported as inconsistent with the literature fall roughly in the same area, and therefore share similar characteristics. By manually analyzing records not previously known to be erroneous that fall in the same area as records known to be inconsistent, we find that one record out of four is inconsistent with respect to the literature. This high density of inconsistent records opens the way towards the development of automatic methods for the detection of faulty records. We conclude that literature inconsistency is a meaningful strategy for identifying suspicious records. Database URL: https://github.com/rbouadjenek/DQBioinformatics.
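The PCA-based visualization step can be sketched as below. The 24-dimensional indicator values are synthetic stand-ins generated so that inconsistent records differ systematically from consistent ones; the real indicators come from query quality predictors over GenBank records.

```python
# Hedged sketch: projecting quality-indicator vectors to 2D with PCA so that
# records inconsistent with the literature fall in the same area of the plot.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
consistent = rng.normal(loc=0.7, scale=0.1, size=(100, 24))    # synthetic
inconsistent = rng.normal(loc=0.2, scale=0.1, size=(10, 24))   # synthetic
X = np.vstack([consistent, inconsistent])

coords = PCA(n_components=2).fit_transform(X)

# With systematically different indicator values, the two groups separate
# along the first principal component.
gap = abs(coords[:100, 0].mean() - coords[100:, 0].mean())
print(coords.shape, gap)
```

In the paper's setting the separation is what motivates manually reviewing unlabeled records that land near the known-inconsistent region.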


Subjects
Computational Biology/methods , Data Curation/methods , Data Mining/methods , Databases, Nucleic Acid , Databases, Protein