Search | BVS Regional Portal

1.

Detection of overdose and underdose prescriptions-An unsupervised machine learning approach.

Nagata, Kenichiro; Tsuji, Toshikazu; Suetsugu, Kimitaka; Muraoka, Kayoko; Watanabe, Hiroyuki; Kanaya, Akiko; Egashira, Nobuaki; Ieiri, Ichiro.

PLoS One ; 16(11): e0260315, 2021.

Article in English | MEDLINE | ID: mdl-34797894

ABSTRACT

Overdose prescription errors sometimes cause serious life-threatening adverse drug events, while underdose errors lead to diminished therapeutic effects. Therefore, it is important to detect and prevent these errors. In the present study, we used the one-class support vector machine (OCSVM), one of the most common unsupervised machine learning algorithms for anomaly detection, to identify overdose and underdose prescriptions. We extracted prescription data from electronic health records in Kyushu University Hospital between January 1, 2014 and December 31, 2019. We constructed an OCSVM model for each of the 21 candidate drugs using three features: age, weight, and dose. Clinical overdose and underdose prescriptions, which were identified and rectified by pharmacists before administration, were collected. Synthetic overdose and underdose prescriptions were created using the maximum and minimum doses, defined by drug labels or the UpToDate database. We applied these prescription data to the OCSVM model and evaluated its detection performance. We also performed comparative analysis with other unsupervised outlier detection algorithms (local outlier factor, isolation forest, and robust covariance). Twenty-seven out of 31 clinical overdose and underdose prescriptions (87.1%) were detected as abnormal by the model. The constructed OCSVM models showed high performance for detecting synthetic overdose prescriptions (precision 0.986, recall 0.964, and F-measure 0.973) and synthetic underdose prescriptions (precision 0.980, recall 0.794, and F-measure 0.839). In comparative analysis, OCSVM showed the best performance. Our models detected the majority of clinical overdose and underdose prescriptions and demonstrated high performance in synthetic data analysis. OCSVM models, constructed using features such as age, weight, and dose, are useful for detecting overdose and underdose prescriptions.
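The detection figures quoted in this abstract (precision, recall, F-measure, and the 27-of-31 clinical detection rate) follow standard definitions. A minimal sketch of how such figures are derived from confusion counts is given here; the function name `detection_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall and F-measure (F1) from confusion counts.

    tp: abnormal prescriptions correctly flagged
    fp: normal prescriptions wrongly flagged
    fn: abnormal prescriptions that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# Detection rate stated in the abstract: 27 of 31 clinical errors flagged.
clinical_detection_rate = 27 / 31

# Hypothetical counts for a single drug model (illustration only).
example = detection_metrics(tp=96, fp=2, fn=4)

print(f"clinical detection rate: {clinical_detection_rate:.1%}")
print({name: round(value, 3) for name, value in example.items()})
```

If the reported values are averages over the 21 per-drug models (an assumption; the abstract does not say), the averaged F-measure need not equal the harmonic mean of the averaged precision and recall.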

Subjects

Drug Overdose/diagnosis , Prescription Drugs/adverse effects , Prescriptions/statistics & numerical data , Adolescent , Adult , Aged , Aged, 80 and over , Algorithms , Child, Preschool , Data Analysis , Data Collection/statistics & numerical data , Data Management/statistics & numerical data , Databases, Factual/statistics & numerical data , Electronic Health Records/statistics & numerical data , Humans , Infant , Mental Recall , Middle Aged , Support Vector Machine/statistics & numerical data , Unsupervised Machine Learning/statistics & numerical data , Young Adult

2.

Partitioning variability in animal behavioral videos using semi-supervised variational autoencoders.

Whiteway, Matthew R; Biderman, Dan; Friedman, Yoni; Dipoppa, Mario; Buchanan, E Kelly; Wu, Anqi; Zhou, John; Bonacchi, Niccolò; Miska, Nathaniel J; Noel, Jean-Paul; Rodriguez, Erica; Schartner, Michael; Socha, Karolina; Urai, Anne E; Salzman, C Daniel; Cunningham, John P; Paninski, Liam.

PLoS Comput Biol ; 17(9): e1009439, 2021 09.

Article in English | MEDLINE | ID: mdl-34550974

ABSTRACT

Recent neuroscience studies demonstrate that a deeper understanding of brain function requires a deeper understanding of behavior. Detailed behavioral measurements are now often collected using video cameras, resulting in an increased need for computer vision algorithms that extract useful information from video data. Here we introduce a new video analysis tool that combines the output of supervised pose estimation algorithms (e.g. DeepLabCut) with unsupervised dimensionality reduction methods to produce interpretable, low-dimensional representations of behavioral videos that extract more information than pose estimates alone. We demonstrate this tool by extracting interpretable behavioral features from videos of three different head-fixed mouse preparations, as well as a freely moving mouse in an open field arena, and show how these interpretable features can facilitate downstream behavioral and neural analyses. We also show how the behavioral features produced by our model improve the precision and interpretation of these downstream analyses compared to using the outputs of either fully supervised or fully unsupervised methods alone.

Subjects

Algorithms , Artificial Intelligence/statistics & numerical data , Behavior, Animal , Video Recording , Animals , Computational Biology , Computer Simulation , Markov Chains , Mice , Models, Statistical , Neural Networks, Computer , Supervised Machine Learning/statistics & numerical data , Unsupervised Machine Learning/statistics & numerical data , Video Recording/statistics & numerical data

3.

Fast and precise single-cell data analysis using a hierarchical autoencoder.

Tran, Duc; Nguyen, Hung; Tran, Bang; La Vecchia, Carlo; Luu, Hung N; Nguyen, Tin.

Nat Commun ; 12(1): 1029, 2021 02 15.

Article in English | MEDLINE | ID: mdl-33589635

ABSTRACT

A primary challenge in single-cell RNA sequencing (scRNA-seq) studies comes from the massive amount of data and the excess noise level. To address this challenge, we introduce an analysis framework, named single-cell Decomposition using Hierarchical Autoencoder (scDHA), that reliably extracts representative information of each cell. The scDHA pipeline consists of two core modules. The first module is a non-negative kernel autoencoder able to remove genes or components that have insignificant contributions to the part-based representation of the data. The second module is a stacked Bayesian autoencoder that projects the data onto a low-dimensional (compressed) space. To reduce the tendency of neural networks to overfit, we repeatedly perturb the compressed space to learn a more generalized representation of the data. In an extensive analysis, we demonstrate that scDHA outperforms state-of-the-art techniques in many research sub-fields of scRNA-seq analysis, including cell segregation through unsupervised learning, visualization of transcriptome landscape, cell classification, and pseudo-time inference.

Subjects

Neural Networks, Computer , Sequence Analysis, RNA/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Unsupervised Machine Learning/statistics & numerical data , Animals , Bayes Theorem , Benchmarking , Cell Separation/methods , Cerebellum/chemistry , Cerebellum/cytology , Embryo, Mammalian , Humans , Liver/chemistry , Liver/cytology , Lung/chemistry , Lung/cytology , Mice , Mouse Embryonic Stem Cells/chemistry , Mouse Embryonic Stem Cells/cytology , Pancreas/chemistry , Pancreas/cytology , Retina/chemistry , Retina/cytology , Single-Cell Analysis/methods , Visual Cortex/chemistry , Visual Cortex/cytology , Zygote/chemistry , Zygote/cytology

4.

A robust unsupervised machine-learning method to quantify the morphological heterogeneity of cells and nuclei.

Phillip, Jude M; Han, Kyu-Sang; Chen, Wei-Chiang; Wirtz, Denis; Wu, Pei-Hsun.

Nat Protoc ; 16(2): 754-774, 2021 02.

Article in English | MEDLINE | ID: mdl-33424024

ABSTRACT

Cell morphology encodes essential information on many underlying biological processes. It is commonly used by clinicians and researchers in the study, diagnosis, prognosis, and treatment of human diseases. Quantification of cell morphology has seen tremendous advances in recent years. However, effectively defining morphological shapes and evaluating the extent of morphological heterogeneity within cell populations remain challenging. Here we present a protocol and software for the analysis of cell and nuclear morphology from fluorescence or bright-field images using the VAMPIRE algorithm ( https://github.com/kukionfr/VAMPIRE_open ). This algorithm enables the profiling and classification of cells into shape modes based on equidistant points along cell and nuclear contours. Examining the distributions of cell morphologies across automatically identified shape modes provides an effective visualization scheme that relates cell shapes to cellular subtypes based on endogenous and exogenous cellular conditions. In addition, these shape mode distributions offer a direct and quantitative way to measure the extent of morphological heterogeneity within cell populations. This protocol is highly automated and fast, with the ability to quantify the morphologies from 2D projections of cells either seeded on 2D substrates or embedded within 3D microenvironments, such as hydrogels and tissues. The complete analysis pipeline can be completed within 60 minutes for a dataset of ~20,000 cells/2,400 images.

Subjects

Cell Shape/physiology , Imaging, Three-Dimensional/methods , Microscopy, Confocal/methods , Algorithms , Cell Nucleus/physiology , Humans , Software , Unsupervised Machine Learning/statistics & numerical data

5.

Integrating Deep Supervised, Self-Supervised and Unsupervised Learning for Single-Cell RNA-seq Clustering and Annotation.

Chen, Liang; Zhai, Yuyao; He, Qiuyan; Wang, Weinan; Deng, Minghua.

Genes (Basel) ; 11(7), 2020 07 14.

Article in English | MEDLINE | ID: mdl-32674393

ABSTRACT

As single-cell RNA sequencing technologies mature, massive gene expression profiles can be obtained. Consequently, cell clustering and annotation become two crucial and fundamental procedures affecting other specific downstream analyses. Most existing single-cell RNA-seq (scRNA-seq) data clustering algorithms do not take into account the available cell annotation results on the same tissues or organisms from other laboratories. Nonetheless, such data could assist and guide the clustering process on the target dataset. Identifying marker genes through differential expression analysis to manually annotate large amounts of cells also costs labor and resources. Therefore, in this paper, we propose a novel end-to-end cell supervised clustering and annotation framework called scAnCluster, which fully utilizes the cell type labels available from reference data to facilitate the cell clustering and annotation on the unlabeled target data. Our algorithm integrates deep supervised learning, self-supervised learning and unsupervised learning techniques together, and it outperforms other customized scRNA-seq supervised clustering methods in both simulation and real data. It is particularly worth noting that our method performs well on the challenging task of discovering novel cell types that are absent in the reference data.

Subjects

Molecular Sequence Annotation , RNA-Seq/methods , Single-Cell Analysis/methods , Transcriptome/genetics , Cluster Analysis , Computer Simulation , Gene Expression Profiling , Genetic Markers/genetics , RNA-Seq/statistics & numerical data , Sequence Analysis, RNA/methods , Sequence Analysis, RNA/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Unsupervised Machine Learning/statistics & numerical data , Exome Sequencing/methods , Exome Sequencing/statistics & numerical data

6.

R/PY-SUMMA: An R/Python Package for Unsupervised Ensemble Learning for Binary Classification Problems in Bioinformatics.

Ahsen, Mehmet Eren; Vogel, Robert; Stolovitzky, Gustavo A.

J Comput Biol ; 27(9): 1337-1340, 2020 09.

Article in English | MEDLINE | ID: mdl-31905016

ABSTRACT

The increasing availability of complex data in biology and medicine has promoted the use of machine learning in classification tasks to address important problems in translational and fundamental science. Two important obstacles, however, may limit the unraveling of the full potential of machine learning in these fields: the lack of generalization of the resulting models and the limited number of labeled data sets in some applications. To address these important problems, we developed an unsupervised ensemble algorithm called strategy for unsupervised multiple method aggregation (SUMMA). By virtue of being an ensemble method, SUMMA is more robust to generalization than the predictions it combines. By virtue of being unsupervised, SUMMA does not require labeled data. SUMMA receives as input predictions from a diversity of models and estimates their classification performance even when labeled data are unavailable. It then uses these performance estimates to combine these different predictions into an ensemble model. SUMMA can be applied to a variety of binary classification problems in bioinformatics including but not limited to gene network inference, cancer diagnostics, drug response prediction, somatic mutation, and differential expression calling. In this application note, we introduce the R/PY-SUMMA packages, available in R or Python, that implement the SUMMA algorithm.

Subjects

Computational Biology/statistics & numerical data , Gene Regulatory Networks/genetics , Unsupervised Machine Learning/statistics & numerical data , Algorithms , Models, Statistical

7.

Exon level machine learning analyses elucidate novel candidate miRNA targets in an avian model of fetal alcohol spectrum disorder.

Al-Shaer, Abrar E; Flentke, George R; Berres, Mark E; Garic, Ana; Smith, Susan M.

PLoS Comput Biol ; 15(4): e1006937, 2019 04.

Article in English | MEDLINE | ID: mdl-30973878

ABSTRACT

Gestational alcohol exposure causes fetal alcohol spectrum disorder (FASD) and is a prominent cause of neurodevelopmental disability. Whole transcriptome sequencing (RNA-Seq) offers insights into mechanisms underlying FASD, but gene-level analysis provides limited information regarding complex transcriptional processes such as alternative splicing and non-coding RNAs. Moreover, traditional analytical approaches that use multiple hypothesis testing with a false discovery rate adjustment prioritize genes based on an adjusted p-value, which is not always biologically relevant. We address these limitations with a novel approach, implementing an unsupervised machine learning model, which we applied to an exon-level analysis to reduce data complexity to the most likely functionally relevant exons, without loss of novel information. This was performed on an RNA-Seq paired-end dataset derived from alcohol-exposed neural fold-stage chick crania, wherein alcohol causes facial deficits recapitulating those of FASD. A principal component analysis along with k-means clustering was utilized to extract exons that deviated from baseline expression. This identified 6857 differentially expressed exons representing 1251 geneIDs; 391 of these genes were identified in a prior gene-level analysis of this dataset. It also identified exons encoding 23 microRNAs (miRNAs) having significantly differential expression profiles in response to alcohol. We developed an RDAVID pipeline to identify KEGG pathways represented by these exons, and separately identified predicted KEGG pathways targeted by these miRNAs. Several of these (ribosome biogenesis, oxidative phosphorylation) were identified in our prior gene-level analysis. Other pathways are crucial to facial morphogenesis and represent both novel (focal adhesion, FoxO signaling, insulin signaling) and known (Wnt signaling) alcohol targets. Importantly, there was substantial overlap between the exomes themselves and the predicted miRNA targets, suggesting these miRNAs contribute to the gene-level expression changes. Our novel application of unsupervised machine learning in conjunction with statistical analyses facilitated the discovery of signaling pathways and miRNAs that inform mechanisms underlying FASD.
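The clustering step mentioned in this abstract (k-means applied after a principal component analysis) can be illustrated with a small, dependency-free sketch. It is not the authors' pipeline: the input is assumed to be already projected onto two principal components, the `scores` values are invented, the deterministic farthest-point seeding is a convenience chosen for reproducibility, and treating the cluster farthest from the origin as "deviating from baseline" is an assumption made only for this example.

```python
import math


def farthest_point_seeds(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from every seed chosen so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return [list(s) for s in seeds]


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    centroids = farthest_point_seeds(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stable, so converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels, centroids


# Hypothetical per-exon scores on the first two principal components.
scores = [
    (0.1, 0.0), (-0.2, 0.1), (0.0, -0.1), (0.2, 0.2), (-0.1, -0.2),  # near baseline
    (3.0, 2.8), (3.2, 3.1), (2.9, 3.3),                              # shifted
]

labels, centroids = kmeans(scores, k=2)

# Assumption: the cluster whose centroid lies farthest from the origin
# holds the exons that deviate from baseline expression.
deviating = max(range(2), key=lambda c: math.hypot(*centroids[c]))
flagged = [i for i, lab in enumerate(labels) if lab == deviating]
print(flagged)
```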

Subjects

Exons/genetics , Fetal Alcohol Spectrum Disorders/genetics , MicroRNAs/genetics , Unsupervised Machine Learning , Animals , Big Data , Chick Embryo , Cluster Analysis , Computational Biology , Databases, Nucleic Acid/statistics & numerical data , Disease Models, Animal , Ethanol/toxicity , Female , Gene Expression Profiling/statistics & numerical data , Humans , Pregnancy , Principal Component Analysis , Unsupervised Machine Learning/statistics & numerical data

8.

Machine learning in suicide science: Applications and ethics.

Linthicum, Kathryn P; Schafer, Katherine Musacchio; Ribeiro, Jessica D.

Behav Sci Law ; 37(3): 214-222, 2019 May.

Article in English | MEDLINE | ID: mdl-30609102

ABSTRACT

For decades, our ability to predict suicide has remained at near-chance levels. Machine learning has recently emerged as a promising tool for advancing suicide science, particularly in the domain of suicide prediction. The present review provides an introduction to machine learning and its potential application to open questions in suicide research. Although only a few studies have implemented machine learning for suicide prediction, results to date indicate considerable improvement in accuracy and positive predictive value. Potential barriers to algorithm integration into clinical practice are discussed, as well as attendant ethical issues. Overall, machine learning approaches hold promise for accurate, scalable, and effective suicide risk detection; however, many critical questions and issues remain unexplored.

Subjects

Ethics, Medical , Machine Learning/legislation & jurisprudence , Suicide/ethics , Suicide/legislation & jurisprudence , Algorithms , Cluster Analysis , Decision Support Techniques , Humans , Longitudinal Studies , Machine Learning/ethics , Probability , Research , Risk Assessment/legislation & jurisprudence , Unsupervised Machine Learning/ethics , Unsupervised Machine Learning/legislation & jurisprudence , Unsupervised Machine Learning/statistics & numerical data , Suicide Prevention

9.

Multi-omics integration-a comparison of unsupervised clustering methodologies.

Tini, Giulia; Marchetti, Luca; Priami, Corrado; Scott-Boyer, Marie-Pier.

Brief Bioinform ; 20(4): 1269-1279, 2019 07 19.

Article in English | MEDLINE | ID: mdl-29272335

ABSTRACT

With the recent developments in the field of multi-omics integration, the interest in factors such as data preprocessing, choice of the integration method and the number of different omics considered has increased. In this work, the impact of these factors is explored when solving the problem of sample classification, by comparing the performances of five unsupervised algorithms: Multiple Canonical Correlation Analysis, Multiple Co-Inertia Analysis, Multiple Factor Analysis, Joint and Individual Variation Explained and Similarity Network Fusion. These methods were applied to three real data sets taken from literature and several ad hoc simulated scenarios to discuss classification performance in different conditions of noise and signal strength across the data types. The impact of experimental design, feature selection and parameter training has also been evaluated to unravel important conditions that can affect the accuracy of the result.

Subjects

Computational Biology/methods , Systems Integration , Unsupervised Machine Learning , Algorithms , Animals , Cluster Analysis , Computer Simulation , Databases, Factual , Factor Analysis, Statistical , Genomics/statistics & numerical data , Humans , Metabolomics/statistics & numerical data , Mice , Models, Biological , Multivariate Analysis , Proteomics/statistics & numerical data , Systems Biology , Unsupervised Machine Learning/statistics & numerical data

10.

Using Unsupervised Machine Learning to Identify Subgroups Among Home Health Patients With Heart Failure Using Telehealth.

Bose, Eliezer; Radhakrishnan, Kavita.

Comput Inform Nurs ; 36(5): 242-248, 2018 May.

Article in English | MEDLINE | ID: mdl-29494361

ABSTRACT

This study explored the use of unsupervised machine learning to identify subgroups of patients with heart failure who used telehealth services in the home health setting, and examined intercluster differences for patient characteristics related to medical history, symptoms, medications, psychosocial assessments, and healthcare utilization. Using a feature selection algorithm, we selected seven variables from 557 patients for clustering. We tested three clustering techniques: hierarchical, k-means, and partitioning around medoids. Hierarchical clustering was identified as the best technique using internal validation methods. Intercluster differences among patient characteristics and outcomes were assessed with either χ² test or one-way analysis of variance. Ranging in size from 153 to 233 patients, three clusters displayed patterns that differed significantly (P < .05) in patient characteristics of age, sex, medical history of comorbid conditions, use of beta blockers, and quality of life assessment. Significant (P < .001) intercluster differences in number of medications, comorbidities, and healthcare utilization were also revealed. The study identified patterns of association between (1) mental health status, pulmonary disorders, and obesity, and (2) healthcare utilization for patients with heart failure who used telehealth in the home health setting. Study results also revealed a lack of prescription of guideline-recommended heart failure medications for the subgroup with the highest proportion of older female adults.
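The χ² test used in this abstract to compare categorical characteristics across clusters can be sketched as follows. The contingency table is hypothetical (rows for two categories of a characteristic, columns for three clusters) and does not reproduce the study's data. The closed-form p-value used here holds only for 2 degrees of freedom, which is what a 2 × 3 table gives.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows = two categories, columns = three clusters.
table = [
    [90, 60, 30],
    [30, 60, 90],
]

stat, dof = chi_square_statistic(table)

# With exactly 2 degrees of freedom the chi-square survival function
# is exp(-x / 2), so no statistics library is needed for this case.
p_value = math.exp(-stat / 2) if dof == 2 else None

print(f"chi2 = {stat:.1f}, dof = {dof}, p = {p_value:.3g}")
```

For tables with other dimensions, a statistics library would normally supply the p-value; the shortcut above is specific to the 2 × 3 case.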

Subjects

Heart Failure/classification , Home Care Services/statistics & numerical data , Patient Acceptance of Health Care , Telemedicine , Unsupervised Machine Learning/statistics & numerical data , Aged , Aged, 80 and over , Comorbidity , Female , Humans , Male , Models, Statistical , Retrospective Studies

11.

Mapping Patient Trajectories using Longitudinal Extraction and Deep Learning in the MIMIC-III Critical Care Database.

Beaulieu-Jones, Brett K; Orzechowski, Patryk; Moore, Jason H.

Pac Symp Biocomput ; 23: 123-132, 2018.

Article in English | MEDLINE | ID: mdl-29218875

ABSTRACT

Electronic Health Records (EHRs) contain a wealth of patient data useful to biomedical researchers. At present, both the extraction of data and methods for analyses are frequently designed to work with a single snapshot of a patient's record. Health care providers often perform and record actions in small batches over time. By extracting these care events, a sequence can be formed providing a trajectory for a patient's interactions with the health care system. These care events also offer a basic heuristic for the level of attention a patient receives from health care providers. We show that it is possible to learn meaningful embeddings from these care events using two deep learning techniques, unsupervised autoencoders and long short-term memory networks. We compare these methods to traditional machine learning methods, which require a point-in-time snapshot to be extracted from an EHR.

Subjects

Critical Care/statistics & numerical data , Machine Learning/statistics & numerical data , Computational Biology/methods , Databases, Factual/statistics & numerical data , Electronic Health Records/statistics & numerical data , Female , Humans , Male , Supervised Machine Learning/statistics & numerical data , Unsupervised Machine Learning/statistics & numerical data

12.

Exploring trends of nonmedical use of prescription drugs and polydrug abuse in the Twittersphere using unsupervised machine learning.

Kalyanam, Janani; Katsuki, Takeo; Lanckriet, Gert R G; Mackey, Tim K.

Addict Behav ; 65: 289-295, 2017 02.

Article in English | MEDLINE | ID: mdl-27568339

ABSTRACT

INTRODUCTION: Nonmedical use of prescription medications/drugs (NMUPD) is a serious public health threat, particularly in relation to the prescription opioid analgesics abuse epidemic. While attention to this problem has been growing, there remains an urgent need to develop novel strategies in the field of "digital epidemiology" to better identify, analyze and understand trends in NMUPD behavior. METHODS: We conducted surveillance of the popular microblogging site Twitter by collecting 11 million tweets filtered for three commonly abused prescription opioid analgesic drugs: Percocet® (acetaminophen/oxycodone), OxyContin® (oxycodone), and Oxycodone. Unsupervised machine learning was applied on the subset of tweets for each analgesic drug to discover underlying latent themes regarding risk behavior. A two-step process of obtaining themes and filtering out unwanted tweets was carried out in three subsequent rounds of machine learning. RESULTS: Using this methodology, 2.3M tweets were identified that contained content relevant to analgesic NMUPD. The underlying themes were identified for each drug and the most representative tweets of each theme were annotated for NMUPD behavioral risk factors. The primary themes identified evidence high levels of social media discussion about polydrug abuse on Twitter. This included specific mention of various polydrug combinations including use of other classes of prescription drugs, and illicit drug abuse. CONCLUSIONS: This study presents a methodology to filter Twitter content for NMUPD behavior, while also identifying underlying themes with minimal human intervention. Results from the study track accurately with the inclusion/exclusion criteria used to isolate NMUPD-related risk behaviors of interest and also provide insight on NMUPD behavior that has a high level of social media engagement. Results suggest that this could be a viable methodology for use in big data substance abuse surveillance, data collection, and analysis in comparison to other studies that rely upon content analysis and human coding schemes.

Subjects

Opioid-Related Disorders/epidemiology , Prescription Drug Misuse/statistics & numerical data , Social Media/statistics & numerical data , Unsupervised Machine Learning/statistics & numerical data , Humans , Risk Factors

13.

TOWARDS EARLY DISCOVERY OF SALIENT HEALTH THREATS: A SOCIAL MEDIA EMOTION CLASSIFICATION TECHNIQUE.

Ofoghi, Bahadorreza; Mann, Meghan; Verspoor, Karin.

Pac Symp Biocomput ; 21: 504-15, 2016.

Article in English | MEDLINE | ID: mdl-26776213

ABSTRACT

Online social media microblogs may be a valuable resource for timely identification of critical ad hoc health-related incidents or serious epidemic outbreaks. In this paper, we explore emotion classification of Twitter microblogs related to localized public health threats, and study whether the public mood can be effectively utilized in early discovery or alarming of such events. We analyse user tweets around recent incidents of Ebola, finding differences in the expression of emotions in tweets posted prior to and after the incidents have emerged. We also analyse differences in the nature of the tweets in the immediately affected area as compared to areas remote to the events. The results of this analysis suggest that emotions in social media microblogging data (from Twitter in particular) may be utilized effectively as a source of evidence for disease outbreak detection and monitoring.

Subjects

Emotions/classification , Public Health Surveillance/methods , Social Media/statistics & numerical data , Bayes Theorem , Computational Biology/methods , Computational Biology/statistics & numerical data , Disease Outbreaks/statistics & numerical data , Hemorrhagic Fever, Ebola/epidemiology , Hemorrhagic Fever, Ebola/psychology , Humans , Time Factors , Unsupervised Machine Learning/statistics & numerical data

14.

Integrating biological knowledge based on functional annotations for biclustering of gene expression data.

Nepomuceno, Juan A; Troncoso, Alicia; Nepomuceno-Chamorro, Isabel A; Aguilar-Ruiz, Jesús S.

Comput Methods Programs Biomed ; 119(3): 163-80, 2015 May.

Article in English | MEDLINE | ID: mdl-25843807

ABSTRACT

Gene expression data analysis is based on the assumption that co-expressed genes imply co-regulated genes. This assumption is being reformulated because the co-expression of a group of genes may be the result of an independent activation with respect to the same experimental condition and not due to the same regulatory regime. For this reason, traditional techniques have recently been improved with the use of prior biological knowledge from open-access repositories together with gene expression data. Biclustering is an unsupervised machine learning technique that searches for patterns in gene expression data matrices. A scatter search-based biclustering algorithm that integrates biological information is proposed in this paper. In addition to the gene expression data matrix, the input of the algorithm is only a direct annotation file that relates each gene to a set of terms from a biological repository where genes are annotated. Two different biological measures, FracGO and SimNTO, are proposed to integrate this information by means of its addition to the to-be-optimized fitness function in the scatter search scheme. The measure FracGO is based on the biological enrichment and SimNTO is based on the overlapping among GO annotations of pairs of genes. Experimental results evaluate the proposed algorithm for two datasets and show the algorithm performs better when biological knowledge is integrated. Moreover, the analysis and comparison between the two different biological measures is presented and it is concluded that the differences depend on both the data source and how the annotation file has been built when GO is used. It is also shown that the proposed algorithm obtains a greater number of enriched biclusters than other classical biclustering algorithms typically used as benchmark, and an analysis of the overlapping among biclusters reveals that the biclusters obtained show little overlap. The proposed methodology is a general-purpose algorithm which allows the integration of biological information from several sources and can be extended to other biclustering algorithms based on the optimization of a merit function.

Subjects

Algorithms , Gene Expression Profiling/statistics & numerical data , Molecular Sequence Annotation/statistics & numerical data , Unsupervised Machine Learning/statistics & numerical data , Cluster Analysis , Data Mining , Databases, Genetic/statistics & numerical data , Gene Ontology/statistics & numerical data , Genes, Fungal , Knowledge Bases , Yeasts/genetics
