Search | Fiocruz Virtual Health Library (Biblioteca Virtual em Saúde Fiocruz)

Statistical representation models for mutation information within genomic data.

Özcan Simsek, N Özlem; Özgür, Arzucan; Gürgen, Fikret.

BMC Bioinformatics ; 20(1): 324, 2019 Jun 13.

Article in English | MEDLINE | ID: mdl-31195961

ABSTRACT

BACKGROUND: As DNA sequencing technologies improve and become cheaper, genomic data can be used to diagnose many diseases, such as cancer. Raw human genome data is very large for computational systems to process, so a compact and accurate representation of the valuable information in DNA is needed. Complex genetic disorders often result from multiple gene mutations, and not every mutation contributes equally to the development of a disease. Inspired by the field of information retrieval, we propose using the term frequency (tf) and BM25 term weighting measures with the inverse document frequency (idf) and relevance frequency (rf) measures to weight genes based on their mutations. The underlying assumption is that the more mutations a gene has in patients with a certain disease and the fewer mutations it has in other patients, the more discriminative that gene is.

RESULTS: We evaluated the proposed representations on the task of cancer type classification. We applied various machine learning techniques using the tf-idf and tf-rf schemes and their BM25 versions. Our results show that the BM25-tf-rf representation leads to improved classification accuracy and f-score values compared to the other representations. The highest accuracy (76.44%) and f-score (76.95%) are achieved with the BM25-tf-rf based data representation.

CONCLUSIONS: In our experiments, the BM25-tf-rf scheme combined with the proposed neural network model was the best-performing classification system for our case study of cancer type classification. This system is further used for causal gene analysis. Examples of the most influential genes used for decision making are reported in the literature as target or causal genes.

Subjects

Genomics/methods, Genetic Models, Statistical Models, Mutation/genetics, Genetic Databases, Exons/genetics, Humans, Introns/genetics, Machine Learning, Neoplasms/genetics, Computer Neural Networks
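
The gene-weighting idea in the abstract above can be illustrated with a minimal sketch. It treats each patient as a "document" and each gene as a "term" whose frequency is its mutation count. The formulas are the standard textbook forms of Okapi BM25 term-frequency saturation and of relevance frequency, rf = log2(2 + a / max(1, c)); whether the paper uses exactly these variants or these parameter values (k1, b) is an assumption, and the function names and toy counts below are hypothetical.

```python
import math


def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Standard Okapi BM25 saturation of a raw term frequency."""
    if tf == 0:
        return 0.0
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * norm)


def relevance_frequency(pos_with_gene, neg_with_gene):
    """rf = log2(2 + a / max(1, c)).

    a: patients of the target class with at least one mutation in the gene
    c: patients of other classes with at least one mutation in the gene
    """
    return math.log2(2.0 + pos_with_gene / max(1, neg_with_gene))


def tf_rf_weights(patients, labels, target_label, use_bm25=True):
    """Weight each (patient, gene) pair by tf * rf or BM25-tf * rf.

    patients: list of dicts mapping gene name -> mutation count
    labels:   list of class labels, one per patient
    Returns (weighted rows, rf value per gene).
    """
    if not patients:
        return [], {}
    genes = {g for p in patients for g in p}
    rf = {}
    for g in genes:
        a = sum(1 for p, y in zip(patients, labels)
                if y == target_label and p.get(g, 0) > 0)
        c = sum(1 for p, y in zip(patients, labels)
                if y != target_label and p.get(g, 0) > 0)
        rf[g] = relevance_frequency(a, c)

    # "Document length" is taken to be the patient's total mutation count.
    lengths = [sum(p.values()) for p in patients]
    avg_len = sum(lengths) / len(lengths)

    weighted = []
    for p, doc_len in zip(patients, lengths):
        row = {}
        for g, tf in p.items():
            t = bm25_tf(tf, doc_len, avg_len) if use_bm25 else float(tf)
            row[g] = t * rf[g]
        weighted.append(row)
    return weighted, rf


# Toy data with made-up mutation counts, for illustration only.
patients = [
    {"TP53": 3, "KRAS": 1},
    {"TP53": 2},
    {"BRCA1": 1},
    {"KRAS": 1, "BRCA1": 2},
]
labels = ["lung", "lung", "breast", "breast"]
weighted, rf = tf_rf_weights(patients, labels, target_label="lung")
```

In this toy setup the gene mutated only in the target class receives the largest rf, and the gene mutated only in other classes receives the smallest, which matches the stated assumption about discriminative genes.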

A novel gene selection method for gene expression data for the task of cancer type classification.

Özcan Simsek, N Özlem; Özgür, Arzucan; Gürgen, Fikret.

Biol Direct ; 16(1): 7, 2021 Feb 08.

Article in English | MEDLINE | ID: mdl-33557857

ABSTRACT

Cancer is a polygenic disease, with each cancer type having a different mutation profile. Genomic data can be used to detect these profiles and to diagnose and differentiate cancer types. Variant calling provides mutation information, and gene expression data reveal altered cell behaviour. Combining the mutation and expression information can lead to accurate discrimination of different cancer types. In this study, we transferred the information from existing mutations into a novel gene selection method for gene expression data. We tested the proposed method on diagnosing and differentiating cancer types. It is a disease-specific method, as both the mutations and the expressions are filtered according to the selected cancer types. Our experimental results show that the proposed gene selection method leads to similar or improved performance metrics compared to classical feature selection methods and curated gene sets.

Subjects

Gene Expression Profiling/methods, Genomics/statistics & numerical data, Machine Learning, Neoplasms/classification, Algorithms, Neoplasms/genetics

Estimation of Parkinson's disease severity using speech features and extreme gradient boosting.

Tunc, Hunkar C; Sakar, C Okan; Apaydin, Hulya; Serbes, Gorkem; Gunduz, Aysegul; Tutuncu, Melih; Gurgen, Fikret.

Med Biol Eng Comput ; 58(11): 2757-2773, 2020 Nov.

Article in English | MEDLINE | ID: mdl-32910301

ABSTRACT

In recent years, there has been increasing interest in building e-health systems. These systems, which deliver health services through internet and communication technologies, aim to reduce the costs arising from outpatient visits. Some recent studies propose machine learning-based telediagnosis and telemonitoring systems for Parkinson's disease (PD). Motivated by studies showing the potential of speech disorders in PD telemonitoring systems, we aim to estimate the severity of PD from patients' voice recordings, using the motor Unified Parkinson's Disease Rating Scale (UPDRS) as the evaluation metric. For this purpose, we apply various speech processing algorithms to the patients' voice signals and then use the resulting features as input to a two-stage estimation model. The first step applies a wrapper-based feature selection algorithm, Boruta, to select the most informative speech features. The second step feeds the selected features to a decision tree-based boosting algorithm, extreme gradient boosting, which has recently been applied successfully in many machine learning tasks due to its generalization ability and speed. The feature selection analysis showed that the vibration pattern of the vocal fold is an important indicator of PD severity. We also investigate the effectiveness of using age and years since diagnosis as covariates together with the speech features. The lowest mean absolute error, 3.87, was obtained by combining these covariates and the speech features with prediction-level fusion.

Graphical abstract: framework for the proposed UPDRS estimation model.

Subjects

Algorithms, Computer-Assisted Diagnosis, Parkinson Disease/diagnosis, Speech, Age Factors, Aged, Female, Humans, Machine Learning, Male, Middle Aged, Self-Assessment (Psychology), Severity of Illness Index, Computer-Assisted Signal Processing, Tape Recording, Telemedicine/methods

Intelligent data analysis to interpret major risk factors for diabetic patients with and without ischemic stroke in a small population.

Gürgen, Fikret; Gürgen, Nurgül.

Biomed Eng Online ; 2: 5, 2003 Mar 04.

Article in English | MEDLINE | ID: mdl-12685939

ABSTRACT

This study proposes an intelligent data analysis approach to investigate and interpret the distinctive factors of diabetes mellitus patients with and without ischemic (non-embolic type) stroke in a small population. The database consists of a total of 16 features collected from 44 diabetic patients. Features include age, gender, duration of diabetes, cholesterol, high density lipoprotein, triglyceride levels, neuropathy, nephropathy, retinopathy, peripheral vascular disease, myocardial infarction rate, glucose level, medication and blood pressure. Metric and non-metric features are distinguished. First, the mean and covariance of the data are estimated and the correlated components are observed. Second, major components are extracted by principal component analysis. Finally, as common examples of local and global classification approaches, a k-nearest neighbor classifier and a high-degree polynomial classifier such as a multilayer perceptron are employed, both with all components and with the major components only. Macrovascular changes emerged as the principal distinctive factors of ischemic stroke in diabetes mellitus. Microvascular changes were generally ineffective discriminators. Recommendations were made according to the rules of evidence-based medicine. Briefly, this case study, based on a small population, supports theories of stroke in diabetes mellitus patients and also concludes that the use of intelligent data analysis improves personalized preventive intervention.

Subjects

Brain Infarction/epidemiology, Diabetes Mellitus/epidemiology, Statistical Models, Brain Ischemia/epidemiology, Comorbidity, Factor Analysis, Humans, Risk Factors

Combining multiple clusterings for protein structure prediction.

Sakar, C Okan; Kursun, Olcay; Seker, Huseyin; Gurgen, Fikret.

Int J Data Min Bioinform ; 10(2): 162-74, 2014.

Article in English | MEDLINE | ID: mdl-25796736

ABSTRACT

Computational annotation and prediction of protein structure is very important in the post-genome era due to the existence of many different proteins, most of which are yet to be verified. Mutual information based feature selection methods can be used to select minimal yet predictive subsets of features. However, protein features are organised into natural partitions (views). Individual feature selection that ignores these views, dismantles them, and intermixes their variables with those of other views results, at best, in a complex, uninterpretable predictive system for such multi-view datasets. In this paper, instead of selecting a subset of individual features, each feature subset is passed through a clustering step so that it is represented in discrete form using the cluster indices; this makes mutual information based methods applicable to view selection. We present our experimental results on a multi-view protein dataset that is used to predict protein structure.

Subjects

Algorithms, Protein Databases, Chemical Models, Proteins/chemistry, Proteins/ultrastructure, Sequence Alignment/methods, Protein Sequence Analysis/methods, Amino Acid Sequence, Computer Simulation, Data Mining/methods, Molecular Models, Molecular Sequence Data, Automated Pattern Recognition/methods, Protein Conformation
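
The view-selection step described in the abstract above can be sketched as follows, assuming each view has already been clustered (for example with k-means, not shown) so that every protein carries one cluster index per view. The sketch scores each view by the mutual information between its cluster indices and the class labels and ranks the views by that score. The view names, labels, and function names are hypothetical, and the abstract does not state which mutual information based criterion the authors actually use.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information, in bits, between two discrete sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # p(x, y) / (p(x) * p(y)) simplifies to n_xy * n / (n_x * n_y).
        mi += (n_xy / n) * math.log2(n_xy * n / (px[x] * py[y]))
    return mi


def rank_views(view_clusters, labels):
    """Rank views by the mutual information of their cluster indices with the labels.

    view_clusters: dict mapping view name -> list of cluster indices,
                   one index per sample, in the same order as labels.
    """
    scores = {name: mutual_information(idx, labels)
              for name, idx in view_clusters.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy example: one view whose clusters match the labels, one that does not.
view_clusters = {
    "composition": [0, 0, 1, 1],
    "hydrophobicity": [0, 1, 0, 1],
}
labels = ["alpha", "alpha", "beta", "beta"]
ranked = rank_views(view_clusters, labels)
```

Because each view is reduced to a single discrete variable, a whole view is kept or dropped as a unit, which is what keeps the resulting system interpretable.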

Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings.

Sakar, Betul Erdogdu; Isenkul, M Erdem; Sakar, C Okan; Sertbas, Ahmet; Gurgen, Fikret; Delil, Sakir; Apaydin, Hulya; Kursun, Olcay.

IEEE J Biomed Health Inform ; 17(4): 828-34, 2013 Jul.

Article in English | MEDLINE | ID: mdl-25055311

ABSTRACT

There has been increased interest in speech pattern analysis applications for Parkinsonism, with the aim of building predictive telediagnosis and telemonitoring models. For this purpose, we have collected a wide variety of voice samples, including sustained vowels, words, and sentences compiled from a set of speaking exercises for people with Parkinson's disease. There are two main issues in learning from such a dataset, which consists of multiple speech recordings per subject: 1) How predictive are these various types of voice samples, e.g., sustained vowels versus words, in Parkinson's disease (PD) diagnosis? 2) How well do central tendency and dispersion metrics serve as representatives of all sample recordings of a subject? Investigating our Parkinson dataset with well-known machine learning tools, we find that, as reported in the literature, sustained vowels carry more PD-discriminative information. We have also found that, rather than using each voice recording of each subject as an independent data sample, representing the samples of a subject with central tendency and dispersion metrics improves the generalization of the predictive model.

Subjects

Parkinson Disease/physiopathology, Automated Pattern Recognition/methods, Sound Spectrography/methods, Speech/physiology, Voice/physiology, Adult, Aged, Factual Databases, Female, Humans, Male, Middle Aged, Support Vector Machine
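
The second finding in the abstract above, representing each subject by summary statistics instead of by individual recordings, can be sketched as below. The choice of mean and median for central tendency and population standard deviation for dispersion is an assumption, since the abstract does not name the exact metrics; the function name and the feature values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def summarize_by_subject(recordings):
    """Collapse several recordings per subject into one feature vector.

    recordings: iterable of (subject_id, feature_vector) pairs, where all
                feature vectors have the same length.
    Returns a dict mapping subject_id -> [mean, median, std] per feature,
    concatenated in feature order.
    """
    grouped = defaultdict(list)
    for subject_id, features in recordings:
        grouped[subject_id].append(features)

    summaries = {}
    for subject_id, rows in grouped.items():
        summary = []
        # zip(*rows) turns the list of recordings into per-feature columns.
        for column in zip(*rows):
            summary.extend([mean(column), median(column), pstdev(column)])
        summaries[subject_id] = summary
    return summaries


# Toy data: two recordings for subject "s1", one for subject "s2".
recordings = [
    ("s1", [1.0, 10.0]),
    ("s1", [3.0, 10.0]),
    ("s2", [5.0, 2.0]),
]
summaries = summarize_by_subject(recordings)
```

With one row per subject, recordings from the same person can no longer end up in both the training and the test split, which is one plausible reason this representation generalizes better.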
