Results 1 - 20 of 63
1.
Methods Mol Biol ; 2284: 457-466, 2021.
Article in English | MEDLINE | ID: mdl-33835457

ABSTRACT

Circular RNA (circRNA) is a type of single-stranded, covalently closed RNA molecule that plays important roles in diverse biological pathways. A comprehensive, functionally annotated circRNA database will help researchers understand circRNAs and their functions. CircFunBase is such a web-accessible database; it aims to provide a high-quality functional circRNA resource that includes both experimentally validated and computationally predicted functions. CircFunBase provides visualized circRNA-miRNA interaction networks. In addition, a genome browser is provided to visualize the genomic context of each circRNA. In this chapter, we illustrate examples of searching for circRNAs and retrieving detailed information about them. Moreover, other circRNA-related databases are outlined.


Subject(s)
Computational Biology/methods , Databases, Genetic/supply & distribution , RNA, Circular/analysis , Data Analysis , Databases, Genetic/classification , Disease/genetics , Gene Regulatory Networks , Humans , RNA, Circular/genetics , RNA, Circular/physiology , Software
2.
IEEE J Biomed Health Inform ; 23(4): 1805-1815, 2019 07.
Article in English | MEDLINE | ID: mdl-31283472

ABSTRACT

The discovery of disease-causing genes is a critical step towards understanding the nature of a disease and determining a possible cure for it. In recent years, many computational methods to identify disease genes have been proposed. However, making full use of disease-related (e.g., symptoms) and gene-related (e.g., gene ontology and protein-protein interactions) information to improve the performance of disease gene prediction is still an open issue. Here, we develop a heterogeneous disease-gene-related network (HDGN) embedding representation framework for disease gene prediction (called HerGePred). Based on this framework, a low-dimensional vector representation (LVR) of the nodes in the HDGN can be obtained. Then, we propose two specific algorithms, namely, an LVR-based similarity prediction and a random walk with restart on a reconstructed heterogeneous disease-gene network (RW-RDGN), to predict disease genes with high performance. First, to validate the rationality of the framework, we analyze the similarity-based overlap distribution of disease pairs and design an experiment for disease-gene association recovery, the results of which reveal that the LVR of nodes performs well at preserving the local and global network structure of the HDGN. Then, we apply tenfold cross-validation and external validation to compare our methods with other well-known disease gene prediction algorithms. The experimental results show that the RW-RDGN performs better than the state-of-the-art algorithm. The prediction results of disease candidate genes are essential for molecular mechanism investigation and experimental validation. The source code of HerGePred and the experimental data are available at https://github.com/yangkuoone/HerGePred.


Subject(s)
Computational Biology/methods , Databases, Genetic/classification , Disease/genetics , Machine Learning , Algorithms , Humans , Models, Statistical
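The random walk with restart named in the abstract of entry 2 can be illustrated with a minimal, generic sketch. This is not the RW-RDGN algorithm itself: the toy graph, the node names (`D`, `g1`, `g2`, `g3`) and the restart probability of 0.3 are assumptions chosen only for illustration.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- restart * p0 + (1 - restart) * W p until p stops changing."""
    nodes = sorted(adjacency)
    # Restart distribution: uniform over the seed nodes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            # Each node passes its remaining mass equally to its neighbours.
            share = (1.0 - restart) * p[u] / len(adjacency[u])
            for v in adjacency[u]:
                nxt[v] += share
        diff = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if diff < tol:
            break
    return p


# Hypothetical toy network: disease D is linked to genes g1 and g2,
# and g1 interacts with g3. Every node must have at least one neighbour.
graph = {
    "D": ["g1", "g2"],
    "g1": ["D", "g3"],
    "g2": ["D"],
    "g3": ["g1"],
}
scores = random_walk_with_restart(graph, seeds={"D"})
ranking = sorted((n for n in graph if n != "D"), key=scores.get, reverse=True)
```

Genes that end up with a higher steady-state probability are closer to the seed disease in the network, which is the intuition behind ranking candidate genes this way.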
3.
J Med Syst ; 43(8): 235, 2019 Jun 17.
Article in English | MEDLINE | ID: mdl-31209677

ABSTRACT

Cancer is a deadly disease which requires a very complex and costly treatment. Microarray data classification plays an important role in cancer treatment. An efficient gene selection technique to select the more promising genes is necessary for cancer classification. Here, we propose a Two-stage MI-GA Gene Selection algorithm for selecting informative genes in cancer data classification. In the first stage, Mutual Information based gene selection is applied which selects only the genes that have high information related to the cancer. The genes which have high mutual information value are given as input to the second stage. The Genetic Algorithm based gene selection is applied in the second stage to identify and select the optimal set of genes required for accurate classification. For classification, Support Vector Machine (SVM) is used. The proposed MI-GA gene selection approach is applied to Colon, Lung and Ovarian cancer datasets and the results show that the proposed gene selection approach results in higher classification accuracy compared to the existing methods.


Subject(s)
Algorithms , Databases, Genetic/classification , Gene Expression Profiling , Neoplasms/genetics , Data Mining , Humans , Microarray Analysis , Support Vector Machine
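The first stage described in the abstract of entry 3, a mutual-information filter, might look roughly like the sketch below. It assumes expression values have already been discretized (here into `"lo"`/`"hi"`); the gene names and data are hypothetical, and the genetic-algorithm second stage and the SVM classifier are not shown.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def top_genes_by_mi(expression, labels, k):
    """Keep the k genes whose discretized expression shares most information with the labels."""
    scores = {gene: mutual_information(values, labels) for gene, values in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: four samples, two classes.
labels = [0, 0, 1, 1]
expression = {
    "geneA": ["lo", "lo", "hi", "hi"],  # tracks the class label perfectly
    "geneB": ["lo", "hi", "lo", "hi"],  # independent of the class label
}
selected = top_genes_by_mi(expression, labels, k=1)
```

In a two-stage design of this kind, only the genes returned by the filter would be passed to the more expensive search in the second stage.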
4.
Article in English | MEDLINE | ID: mdl-29990202

ABSTRACT

Analyzing disease data from the viewpoint of combinatorial features may better characterize the disease phenotype. In this study, a novel method is proposed to construct feature combinations and a classification model (CFC-CM) by mining key feature relationships. CFC-CM iteratively tests for differences in the feature relationship between different groups. To do this, it uses a modified k-top-scoring pair (M-k-TSP) algorithm and then selects the most discriminative feature pairs in the current feature set to infer the combinatorial features and build the classification model. Compared with support vector machines, random forests, least absolute shrinkage and selection operator, elastic net, and M-k-TSP, the superior performance of CFC-CM on nine public gene expression datasets validates its potential for more precise identification of complex diseases. Subsequently, CFC-CM was applied to two metabolomics datasets, where it obtained accuracy rates of 88.73 ± 2.06% and 79.11 ± 2.70% in distinguishing between hepatocellular carcinoma and hepatic cirrhosis groups and between acute kidney injury (AKI) and non-AKI samples, respectively; these results were superior to those of the other five methods. In summary, the better results of CFC-CM show that, in contrast to single molecules and combinations constituted by just two features, combinations inferred from an appropriate number of features can better identify complex diseases.


Subject(s)
Computational Biology/methods , Diagnosis, Computer-Assisted/methods , Metabolome , Metabolomics/methods , Algorithms , Databases, Genetic/classification , Humans , Kidney Diseases/diagnosis , Liver Diseases/diagnosis , Metabolome/genetics , Metabolome/physiology , Support Vector Machine
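The abstract of entry 4 builds on top-scoring pairs. As a rough illustration, the sketch below computes the classic top-scoring-pair score, the absolute difference between two classes in how often one feature is smaller than another. It is not the modified M-k-TSP or CFC-CM procedure, and the feature names and values are hypothetical.

```python
def tsp_score(xi, xj, labels):
    """|P(xi < xj | class 0) - P(xi < xj | class 1)| for binary labels 0/1."""
    def fraction(cls):
        idx = [s for s, y in enumerate(labels) if y == cls]
        return sum(xi[s] < xj[s] for s in idx) / len(idx)
    return abs(fraction(0) - fraction(1))


def best_pair(features, labels):
    """Return the feature pair whose relative ordering differs most between the classes."""
    names = sorted(features)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return max(pairs, key=lambda p: tsp_score(features[p[0]], features[p[1]], labels))


# Hypothetical toy data: four samples, two classes.
labels = [0, 0, 1, 1]
features = {
    "f1": [1, 2, 5, 6],
    "f2": [3, 4, 1, 2],  # f1 < f2 in class 0, f1 > f2 in class 1
    "f3": [2, 2, 2, 2],
}
pair = best_pair(features, labels)
```

Because the score depends only on the ordering of two features within each sample, it is insensitive to monotone rescaling of the measurements, which is one reason pair-based rules are attractive for expression data.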
6.
IEEE J Biomed Health Inform ; 22(5): 1619-1629, 2018 09.
Article in English | MEDLINE | ID: mdl-29990162

ABSTRACT

One of the main challenges in modern medicine is to stratify patients for personalized care. Many different clustering methods have been proposed to solve the problem in both quantitative and biologically meaningful manners. However, existing clustering algorithms suffer from numerous restrictions such as experimental noise, high dimensionality, and poor interpretability. To overcome those limitations altogether, we propose and formulate a multiobjective framework based on evolutionary multiobjective optimization to balance the feature relevance and redundancy for patient stratification. To demonstrate the effectiveness of our proposed algorithms, we benchmark them across 55 synthetic datasets based on a real human transcription regulation network model, 35 real cancer gene expression datasets, and two case studies. Experimental results suggest that the proposed algorithms perform better than recent state-of-the-art methods. In addition, time complexity analysis, convergence analysis, and parameter analysis are conducted to demonstrate the robustness of the proposed methods from different perspectives. Finally, t-Distributed Stochastic Neighbor Embedding (t-SNE) is applied to project the selected feature subsets onto two or three dimensions to visualize the high-dimensional patient stratification data.


Subject(s)
Databases, Genetic/classification , Electronic Health Records/classification , Medical Informatics/methods , Precision Medicine/methods , Algorithms , Cluster Analysis , Humans , Transcriptome
7.
Gene ; 614: 37-48, 2017 May 30.
Article in English | MEDLINE | ID: mdl-28257835

ABSTRACT

Transcription factors that activate both proliferation- and apoptosis-effector genes, along with a number of related observations, have led to a proposal for a feed-forward mechanism of activating the two gene classes, whereby a certain concentration of a transcription factor activates the proliferation-effector genes and a higher concentration of the transcription factor activates the apoptosis-effector genes. We reasoned that, in the cancer setting, this paradigm of regulation could lead to selection for relatively reduced copy numbers of apoptosis-effector gene transcription factor binding sites (TFBS). Thus, the aim of this investigation was to examine the DNA sequencing read depths of TFBS for a set of proliferation- and apoptosis-effector genes, normalized to the read depths found in matching blood samples, as provided by The Cancer Genome Atlas (TCGA), and thereby document copy number differences among these TFBS. We determined that the melanoma and breast cancer TCGA datasets could be divided into three categories: (i) no detectable copy number variation for the shared proliferation- and apoptosis-effector TFBS; (ii) a relative increase in the copy number of proliferation-effector gene TFBS, compared with the copy number of the apoptosis-effector gene TFBS; and (iii) a relative decrease in the number of proliferation-effector gene TFBS. Thus, we conclude that changes in the relative copies of the shared TFBS for proliferation- and apoptosis-effector genes have the potential to impact tumor cell proliferative and apoptotic capacities.


Subject(s)
Apoptosis/genetics , Breast Neoplasms/genetics , Cell Proliferation/genetics , DNA Copy Number Variations , Melanoma/genetics , Transcription Factors/metabolism , Base Sequence/genetics , Binding Sites/genetics , Breast Neoplasms/pathology , Computational Biology/methods , Databases, Genetic/classification , Female , Genetic Predisposition to Disease/genetics , Genome, Human/genetics , Humans , Melanoma/pathology , Mutation , Polymorphism, Single Nucleotide , Protein Binding
8.
Trends Microbiol ; 23(11): 671-679, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26439295

ABSTRACT

The names and lineages of microorganisms are critical to our understanding of the microbiome. However, microbial taxonomy and phylogeny are in perpetual flux, with emerging criteria being used to rename and reshape our views of the microbial world. Different candidate molecular and nonmolecular criteria are often broadly consistent with one another, which underpins the pluralistic approach to taxonomy. However, the taxonomic picture is clouded when underlying criteria are not in agreement, or when reference datasets contain erroneously named organisms. How does the shifting taxonomic landscape impact our interpretation of microbial communities, especially in the face of inconsistencies and errors? How can taxonomy be applied in a consistent way when different users have different requirements of the classifications that emerge? The key path forward involves finding ways to integrate conflicting taxonomic criteria, choosing the right units of analysis for microbiomic studies, and making molecular taxonomy transparent and accessible in a way that complements current genomic resources.


Subject(s)
Bacteria/classification , Bacteria/genetics , Classification/methods , Microbiota/genetics , Acidithiobacillus/classification , Acidithiobacillus/genetics , Clostridium/classification , Clostridium/genetics , Databases, Genetic/classification , Databases, Genetic/trends , Genetics, Microbial , Genomics/methods , Phylogeny , RNA, Ribosomal, 16S/genetics
9.
Stud Health Technol Inform ; 216: 1037, 2015.
Article in English | MEDLINE | ID: mdl-26262336

ABSTRACT

Karyotyping, or visually examining and recording chromosomal abnormalities, is commonly used to diagnose and treat disease. Karyotypes are written in the International System for Human Cytogenetic Nomenclature (ISCN), a language that is not computationally readable and therefore precludes full analysis of these genomic data. In response, we developed a cytogenetic platform that converts ISCN karyotypes into a machine-readable model available for computational analysis. Here we use cytogenetic data from the National Cancer Institute (NCI)-curated Mitelman database to create a structured karyotype language. Then, drug-gene-disease triplets are generated via a computational pipeline connecting public drug-gene interaction data sources to identify potential drug repurposing opportunities.


Subject(s)
Antineoplastic Agents/therapeutic use , Data Mining/methods , Drug Repositioning/methods , Karyotype , Neoplasms/drug therapy , Neoplasms/genetics , Antineoplastic Agents/classification , Databases, Genetic/classification , Databases, Pharmaceutical/classification , Humans , Natural Language Processing , Pharmacogenomic Testing/methods , PubMed
10.
Brief Bioinform ; 14(1): 13-26, 2013 Jan.
Article in English | MEDLINE | ID: mdl-22408190

ABSTRACT

A class-imbalanced classifier is a decision rule to predict the class membership of new samples from an available data set where the class sizes differ considerably. When the class sizes are very different, most standard classification algorithms may favor the larger (majority) class, resulting in poor accuracy in the minority class prediction. A class-imbalanced classifier typically modifies a standard classifier by a correction strategy or by incorporating a new strategy in the training phase to account for differential class sizes. This article reviews and evaluates some of the most important methods for class prediction of high-dimensional imbalanced data. The evaluation addresses the fundamental issues of the class-imbalanced classification problem: imbalance ratio, small disjuncts and overlap complexity, lack of data and feature selection. Four class-imbalanced classifiers are considered. The four classifiers include three standard classification algorithms, each coupled with an ensemble correction strategy, and one support vector machine (SVM)-based correction classifier. The three algorithms are (i) diagonal linear discriminant analysis (DLDA), (ii) random forests (RFs) and (iii) SVMs. The SVM-based correction classifier is SVM threshold adjustment (SVM-THR). A Monte-Carlo simulation and five genomic data sets were used to illustrate the analysis and address the issues. The SVM-ensemble classifier appears to perform the best when the class imbalance is not too severe. The SVM-THR performs well if the imbalance is severe and predictors are highly correlated. The DLDA with feature selection can perform well without using the ensemble correction.


Subject(s)
Computational Biology/methods , Databases, Genetic/classification , Algorithms , Discriminant Analysis , Female , Humans , Support Vector Machine
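The idea of threshold adjustment mentioned in the abstract of entry 10 can be sketched generically: keep the classifier's decision scores and move the cut-off away from 0 so that the minority class is no longer ignored. The selection rule used here (maximizing balanced accuracy on labelled scores) is an assumption for illustration and may differ from the SVM-THR rule evaluated in the article; the scores are made-up values.

```python
def balanced_accuracy(scores, labels, threshold):
    """Mean of sensitivity and specificity when predicting class 1 for score >= threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    tn = sum(s < threshold and y == 0 for s, y in zip(scores, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    return 0.5 * (tp / pos + tn / neg)


def adjust_threshold(scores, labels):
    """Pick the observed score that maximizes balanced accuracy when used as the cut-off."""
    return max(sorted(set(scores)), key=lambda t: balanced_accuracy(scores, labels, t))


# Hypothetical decision values from a classifier trained on imbalanced data:
# every sample falls below the default cut-off of 0, so the minority class is never predicted.
scores = [-2.0, -1.5, -1.2, -1.0, -0.8, -0.6, -0.4, -0.3]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

default_quality = balanced_accuracy(scores, labels, 0.0)
new_threshold = adjust_threshold(scores, labels)
adjusted_quality = balanced_accuracy(scores, labels, new_threshold)
```

In practice the threshold would be chosen on held-out data or by cross-validation; tuning it on the same samples used for evaluation gives an optimistic estimate.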
11.
Database (Oxford) ; 2012: bas050, 2012.
Article in English | MEDLINE | ID: mdl-23221176

ABSTRACT

We report on the original integration of an automatic text categorization pipeline, called ToxiCat (Toxicogenomic Categorizer), that we developed to perform biomedical document classification and prioritization in order to speed up the curation of the Comparative Toxicogenomics Database (CTD). The task can essentially be described as a binary classification task, where a scoring function is used to rank a selected set of articles. Components of a question-answering system are then used to extract CTD-specific annotations from the ranked list of articles. The ranking function is generated using a Support Vector Machine, which combines three main modules: an information retrieval engine for MEDLINE (EAGLi), a gene normalization service (NormaGene) developed for a previous BioCreative campaign and, finally, a set of answering components and entity recognizers for diseases and chemicals. The main components of the pipeline are publicly available both as a web application and as web services. The specific integration performed for the BioCreative competition is available via a web user interface at http://pingu.unige.ch:8080/Toxicat.


Subject(s)
Data Mining/methods , Databases, Genetic/classification , Periodicals as Topic , Toxicogenetics , Internet , Molecular Sequence Annotation , Semantics , Support Vector Machine , Workflow
12.
Soc Stud Sci ; 42(2): 214-36, 2012 Apr.
Article in English | MEDLINE | ID: mdl-22848998

ABSTRACT

Cross-species comparison has long been regarded as a stepping-stone for medical research, enabling the discovery and testing of prospective treatments before they undergo clinical trial on humans. Post-genomic medicine has made cross-species comparison crucial in another respect: the 'community databases' developed to collect and disseminate data on model organisms are now often used as a template for the dissemination of data on humans and as a tool for comparing results of medical significance across the human-animal boundary. This paper identifies and discusses four key problems encountered by database curators when integrating human and non-human data within the same database: (1) picking criteria for what counts as reliable evidence, (2) selecting metadata, (3) standardising and describing research materials and (4) choosing nomenclature to classify data. An analysis of these hurdles reveals epistemic disagreement and controversies underlying cross-species comparisons, which in turn highlight important differences in the experimental cultures of biologists and clinicians trying to make sense of these data. By considering database development through the eyes of curators, this study casts new light on the complex conjunctions of biological and clinical practice, model organisms and human subjects, and material and virtual sources of evidence--thus emphasizing the fragmented, localized and inherently translational nature of biomedicine.


Subject(s)
Biomedical Research , Databases, Genetic , Information Storage and Retrieval , Models, Animal , Species Specificity , Animals , Databases, Genetic/classification , Databases, Genetic/standards , Genomics , Humans , Internet , Quality Control , Reference Standards , Terminology as Topic
13.
J Integr Bioinform ; 9(3): 199, 2012 Jul 24.
Article in English | MEDLINE | ID: mdl-22829570

ABSTRACT

DNA microarrays have contributed to the exponential growth of genomic and experimental data in the last decade. This large amount of gene expression data has been used by researchers seeking to diagnose diseases such as cancer using machine learning methods. In turn, explicit biological knowledge about gene functions has also grown tremendously over the last decade. This work integrates explicit biological knowledge, provided as gene sets, into the classification process by means of Variable Precision Rough Set Theory (VPRS). The proposed model is able to highlight which part of the provided biological knowledge has been important for classification. This paper presents a novel model for microarray data classification which is able to incorporate prior biological knowledge in the form of gene sets. Based on this knowledge, we transform the input microarray data into supergenes, and then we apply rough set theory to select the most promising supergenes and to derive a set of easily interpretable classification rules. The proposed model is evaluated on three breast cancer microarray datasets, obtaining successful results compared to classical classification techniques. The experimental results show that there are no significant differences between our model and classical techniques, but that our model is able to provide a biologically interpretable explanation of how it classifies new samples.


Subject(s)
Algorithms , Computational Biology/methods , DNA/genetics , Databases, Genetic/classification , Gene Expression Regulation , Knowledge , Breast Neoplasms/genetics , Female , Gene Expression Regulation, Neoplastic , Humans , Oligonucleotide Array Sequence Analysis
14.
PLoS One ; 7(4): e34796, 2012.
Article in English | MEDLINE | ID: mdl-22558100

ABSTRACT

Recently, several classifiers that combine primary tumor data, like gene expression data, and secondary data sources, such as protein-protein interaction networks, have been proposed for predicting outcome in breast cancer. In these approaches, new composite features are typically constructed by aggregating the expression levels of several genes. The secondary data sources are employed to guide this aggregation. Although many studies claim that these approaches improve classification performance over single genes classifiers, the gain in performance is difficult to assess. This stems mainly from the fact that different breast cancer data sets and validation procedures are employed to assess the performance. Here we address these issues by employing a large cohort of six breast cancer data sets as benchmark set and by performing an unbiased evaluation of the classification accuracies of the different approaches. Contrary to previous claims, we find that composite feature classifiers do not outperform simple single genes classifiers. We investigate the effect of (1) the number of selected features; (2) the specific gene set from which features are selected; (3) the size of the training set and (4) the heterogeneity of the data set on the performance of composite feature and single genes classifiers. Strikingly, we find that randomization of secondary data sources, which destroys all biological information in these sources, does not result in a deterioration in performance of composite feature classifiers. Finally, we show that when a proper correction for gene set size is performed, the stability of single genes sets is similar to the stability of composite feature sets. Based on these results there is currently no reason to prefer prognostic classifiers based on composite features over single genes classifiers for predicting outcome in breast cancer.


Subject(s)
Biomarkers, Tumor/genetics , Breast Neoplasms/diagnosis , Breast Neoplasms/genetics , Classification/methods , Computational Biology/methods , Gene Expression Regulation, Neoplastic/genetics , Databases, Genetic/classification , Female , Gene Expression Profiling/methods , Humans , Predictive Value of Tests , Prognosis
16.
J Bioinform Comput Biol ; 9(2): 251-67, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21523931

ABSTRACT

Random forest is an ensemble classification algorithm. It performs well when most predictive variables are noisy and can be used when the number of variables is much larger than the number of observations. The use of bootstrap samples and restricted subsets of attributes makes it more powerful than simple ensembles of trees. The main advantage of a random forest classifier is its explanatory power: it measures variable importance or impact of each factor on a predicted class label. These characteristics make the algorithm ideal for microarray data. It was shown to build models with high accuracy when tested on high-dimensional microarray datasets. Current implementations of random forest in the machine learning and statistics community, however, limit its usability for mining over large datasets, as they require that the entire dataset remains permanently in memory. We propose a new framework, an optimized implementation of a random forest classifier, which addresses specific properties of microarray data, takes computational complexity of a decision tree algorithm into consideration, and shows excellent computing performance while preserving predictive accuracy. The implementation is based on reducing overlapping computations and eliminating dependency on the size of main memory. The implementation's excellent computational performance makes the algorithm useful for interactive data analyses and data mining.


Subject(s)
Algorithms , Databases, Genetic/classification , Databases, Genetic/statistics & numerical data , Microarray Analysis/statistics & numerical data , Artificial Intelligence , Computational Biology , Data Mining/statistics & numerical data , Decision Trees , Humans , Kaplan-Meier Estimate , Lymphoma, Large B-Cell, Diffuse/classification , Lymphoma, Large B-Cell, Diffuse/genetics , Lymphoma, Large B-Cell, Diffuse/mortality
17.
BMC Bioinformatics ; 11: 530, 2010 Oct 25.
Article in English | MEDLINE | ID: mdl-20973947

ABSTRACT

BACKGROUND: The Gene Ontology project supports categorization of gene products according to their location of action, the molecular functions that they carry out, and the processes that they are involved in. Although the ontologies are intentionally developed to be taxon neutral, and to cover all species, there are inherent taxon specificities in some branches. For example, the process 'lactation' is specific to mammals and the location 'mitochondrion' is specific to eukaryotes. The lack of an explicit formalization of these constraints can lead to errors and inconsistencies in automated and manual annotation. RESULTS: We have formalized the taxonomic constraints implicit in some GO classes, and specified these at various levels in the ontology. We have also developed an inference system that can be used to check for violations of these constraints in annotations. Using the constraints in conjunction with the inference system, we have detected and removed errors in annotations and improved the structure of the ontology. CONCLUSIONS: Detection of inconsistencies in taxon-specificity enables gradual improvement of the ontologies, the annotations, and the formalized constraints. This is progressively improving the quality of our data. The full system is available for download, and new constraints or proposed changes to constraints can be submitted online at https://sourceforge.net/tracker/?atid=605890&group_id=36855.


Subject(s)
Classification/methods , Molecular Sequence Annotation/methods , Databases, Genetic/classification , Databases, Protein/classification , Terminology as Topic , Vocabulary, Controlled
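The kind of consistency check described in the abstract of entry 17 can be sketched as follows. The two constraints ('lactation' only in mammals, 'mitochondrion' only in eukaryotes) come from the abstract; the tiny taxonomy, the gene identifiers and the data layout are hypothetical, and the actual inference system is not reproduced here.

```python
# Hypothetical miniature taxonomy: each taxon maps to its parent (None at the root).
PARENT = {
    "Homo sapiens": "Mammalia",
    "Mammalia": "Eukaryota",
    "Saccharomyces cerevisiae": "Eukaryota",
    "Escherichia coli": "Bacteria",
    "Eukaryota": None,
    "Bacteria": None,
}

# "Only in taxon" constraints taken from the examples in the abstract.
ONLY_IN = {"lactation": "Mammalia", "mitochondrion": "Eukaryota"}


def lineage(taxon):
    """Yield the taxon and all of its ancestors."""
    while taxon is not None:
        yield taxon
        taxon = PARENT[taxon]


def violations(annotations):
    """Return annotations whose species lies outside the taxon required by the term."""
    return [
        (gene, term, taxon)
        for gene, term, taxon in annotations
        if term in ONLY_IN and ONLY_IN[term] not in lineage(taxon)
    ]


# Hypothetical annotations: (gene, ontology term, species).
annotations = [
    ("geneA", "lactation", "Homo sapiens"),
    ("geneB", "lactation", "Saccharomyces cerevisiae"),
    ("geneC", "mitochondrion", "Escherichia coli"),
    ("geneD", "mitochondrion", "Saccharomyces cerevisiae"),
]
flagged = violations(annotations)
```

Here `geneB` and `geneC` are flagged, since yeast is not a mammal and E. coli is not a eukaryote; such flagged annotations would then be reviewed by a curator.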
18.
Article in English | MEDLINE | ID: mdl-20671315

ABSTRACT

This paper describes the Biological Literature Miner (BioLMiner) system and its implementation. BioLMiner is a text mining system for biological literature, whose purpose is to extract useful information from biological literature, including gene and protein names, normalized gene and protein names, and protein-protein interaction pairs. BioLMiner has three main subsystems in a pipeline structure: a gene mention recognizer (GMRer), a gene normalizer (GNer), and a protein-protein interaction pair extractor (PPIEor). All these subsystems are based on machine learning techniques, including support vector machines (SVMs) and conditional random fields (CRFs), together with carefully designed informative features. At the same time, BioLMiner makes use of some biology-specific resources and existing natural language processing tools. In order to evaluate and compare BioLMiner, it is adapted to participate in two tasks of the BioCreative II.5 challenge: the interaction normalization task (INT) using GNer and the interaction pair task (IPT) using PPIEor. Our system is among the highest-performing systems on the two tasks. This suggests that GMRer provides good support for the INT and IPT, although its performance is not evaluated directly, and that the methods developed in GNer and PPIEor extend well to the BioCreative II.5 tasks.


Subject(s)
Computational Biology/methods , Data Mining/methods , Databases, Genetic/classification , Protein Interaction Mapping/methods , Natural Language Processing , Pattern Recognition, Automated/methods , Periodicals as Topic
19.
Curr Pharm Des ; 16(20): 2224-31, 2010.
Article in English | MEDLINE | ID: mdl-20459387

ABSTRACT

The completion of the human genome sequencing project and the establishment of new methods for the detection of point mutations have led to a remarkable increase in the identification of sequence variants in a growing number of genes. As a result, a new field of research has emerged, pharmacogenomics, which deals with the influence of genetic variation on drug response by correlating gene expression or single-nucleotide polymorphisms with a drug's efficacy or toxicity. Genetic databases are continuously updated online repositories of mutation data, described for one or more genes or specifically for a population or ethnic group. Genetic databases can also fulfil the mission of pharmacogenomics by helping to elucidate gene function, estimate the prevalence of genes in populations, differentiate among subtypes of diseases, trace how genes may predispose to or protect against illnesses, and improve medical intervention. Therefore, it is expected that genetic databases will gradually assume increasing importance in all aspects of genome medicine. This article aims to provide an update on the current and emerging types of genetic databases relevant to the field of pharmacogenomics. Moreover, the key elements that are holding back the field, as well as the challenges that should be addressed in the near future, are also discussed.


Subject(s)
Databases, Genetic , Pharmacogenetics/methods , Precision Medicine/methods , Databases, Genetic/classification , Databases, Genetic/statistics & numerical data , Databases, Genetic/trends , Drug Therapy/methods , Humans , Mutation/drug effects , Polymorphism, Genetic/drug effects
20.
J Biomed Inform ; 43(1): 81-7, 2010 Feb.
Article in English | MEDLINE | ID: mdl-19699316

ABSTRACT

Selecting relevant and discriminative genes for sample classification is a common and critical task in gene expression analysis (e.g. disease diagnosis). It is desirable that gene selection effectively improve the classification performance of the learning algorithm. In most widely used gene selection methods, an individual gene subset is chosen according to its discriminative power. One deficiency of an individual gene subset is that its contribution to classification is limited. This issue can be alleviated to some extent by ensemble gene selection based on random selection. However, the random approach requires an unnecessarily large number of candidate gene subsets, and its reliability is a problem. In this study, we propose a new ensemble method, called ensemble gene selection by grouping (EGSG), to select multiple gene subsets for classification. Rather than selecting randomly, our method chooses salient gene subsets from microarray data by virtue of information theory and the approximate Markov blanket. The effectiveness and accuracy of our method are validated by experiments on five publicly available microarray data sets. The experimental results show that our ensemble gene selection method has classification performance comparable to that of other gene selection methods, and is more stable than the random one.


Subject(s)
Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Algorithms , Artificial Intelligence , Cell Line, Tumor , Computational Biology/methods , Computer Simulation , Databases, Genetic/classification , Electronic Data Processing , Gene Expression Regulation, Neoplastic , Humans , Markov Chains , Models, Statistical , Pattern Recognition, Automated/classification , Pattern Recognition, Automated/methods , Reproducibility of Results
...