Results 1 - 20 of 60
1.
Comput Math Methods Med ; 2021: 5556992, 2021.
Article in English | MEDLINE | ID: mdl-33986823

ABSTRACT

Ensemble learning combines multiple learners to perform combinatorial learning, which offers good flexibility and better generalization performance. To achieve higher quality cancer classification, in this study, the fast correlation-based feature selection (FCBF) method was used to preprocess the data to eliminate irrelevant and redundant features. Then, the classification was carried out in the stacking ensemble learner. A library for support vector machine (LIBSVM), K-nearest neighbor (KNN), decision tree C4.5 (C4.5), and random forest (RF) were used as the primary learners of the stacking ensemble. Given the imbalanced characteristics of cancer gene expression data, a cost-sensitive naive Bayes (CSNB) classifier was embedded as the metalearner of the stacking ensemble, giving a method denoted CSNB stacking. The proposed CSNB stacking method was applied to nine cancer datasets to further verify the classification performance of the model. Compared with other classification methods, such as single classifier algorithms and ensemble algorithms, the experimental results showed the effectiveness and robustness of the proposed method in processing different types of cancer data. This method may therefore help guide cancer diagnosis and research.


Subjects
Algorithms , Machine Learning , Neoplasms/classification , Bayes Theorem , Computational Biology , Databases, Genetic/statistics & numerical data , Decision Trees , Female , Gene Expression Regulation, Neoplastic , Humans , Male , Neoplasms/genetics , Neural Networks, Computer , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Oncogenes , ROC Curve , Support Vector Machine
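The meta-level of a stacking ensemble like the one in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' CSNB implementation: the base learners' class predictions are assumed to be available already as rows of labels, a Laplace-smoothed naive Bayes is fitted on those rows, and the final label is the one with the lowest expected cost. The function names, the smoothing constant `alpha`, and the cost values are hypothetical.

```python
import math


def fit_meta_nb(base_preds, labels, alpha=1.0):
    """Fit a categorical naive Bayes on base-learner outputs (one column per learner)."""
    classes = sorted(set(labels))
    n_learners = len(base_preds[0])
    prior = {c: labels.count(c) / len(labels) for c in classes}
    # cond[c][j][v] = P(learner j predicts v | true class c), Laplace-smoothed
    cond = {c: [dict() for _ in range(n_learners)] for c in classes}
    for c in classes:
        rows = [p for p, y in zip(base_preds, labels) if y == c]
        for j in range(n_learners):
            for v in classes:
                count = sum(1 for r in rows if r[j] == v)
                cond[c][j][v] = (count + alpha) / (len(rows) + alpha * len(classes))
    return classes, prior, cond


def posterior(model, pred_row):
    """Posterior class probabilities given one row of base-learner predictions."""
    classes, prior, cond = model
    joint = {
        c: prior[c] * math.prod(cond[c][j][v] for j, v in enumerate(pred_row))
        for c in classes
    }
    total = sum(joint.values())
    return {c: joint[c] / total for c in classes}


def cost_sensitive_predict(model, pred_row, cost):
    """Return the label with minimum expected cost; cost is keyed by (true, predicted)."""
    post = posterior(model, pred_row)
    classes = model[0]
    return min(classes, key=lambda k: sum(post[c] * cost[(c, k)] for c in classes))


# Hypothetical, imbalanced toy data: 8 samples of class 0, 2 samples of class 1,
# each row holding the predictions of four base learners.
BASE_PREDS = [[0, 0, 0, 0]] * 6 + [[0, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 1], [1, 0, 1, 1]]
LABELS = [0] * 8 + [1] * 2
MODEL = fit_meta_nb(BASE_PREDS, LABELS)
```

With equal misclassification costs the majority class tends to win on ambiguous rows; raising the cost of missing the minority class shifts those decisions toward it, which is the usual motivation for a cost-sensitive metalearner on imbalanced data.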
2.
Math Biosci ; 305: 96-101, 2018 Nov.
Article in English | MEDLINE | ID: mdl-30194959

ABSTRACT

Background and Objective: Bayesian state space models are a recent advancement in stochastic modeling that capture the randomness of a hidden background process by scrutinizing the prior knowledge and the likelihood of the observed data. This article elucidates the scope of Bayesian state space modeling for predicting the future expression values of longitudinal microarray data. Methods: The study makes use of longitudinally collected clinical trial data (GSE30531) from the NCBI Gene Expression Omnibus (GEO) data repository. A multiple testing methodology based on the t-test is used to select differentially expressed genes between groups for fitting the model. The parameter values of the predictive model and future expression levels are estimated by drawing samples from the joint posterior distribution using a stochastic Markov Chain Monte Carlo (MCMC) algorithm that relies on Gibbs sampling. The study also obtained estimates and their 95% credible intervals under different covariance structures (variance components, first-order autoregressive, and unstructured variance-covariance) to showcase the flexibility of the algorithm. Results: 72 distinct genes with significantly different expression levels were selected for model fitting. Parameter estimates showed similar trends under the different covariance structure assumptions. Cross tabulation of gene frequencies having the minimum credible interval under each covariance structure and study group showed a significant P value of 0.02. Conclusions: The present study reveals that Bayesian state space models can be effectively used to explain and predict complex data such as gene expression data.


Subjects
Bayes Theorem , Gene Expression Profiling/statistics & numerical data , Models, Genetic , Algorithms , Genetic Markers , Humans , Markov Chains , Mathematical Concepts , Monte Carlo Method , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Stochastic Processes
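To make the idea of predicting a future value from a hidden state concrete, here is a minimal sketch of a local-level state space model. It deliberately swaps in a simpler technique than the abstract describes: instead of Gibbs sampling over unknown parameters, it runs a Kalman filter with the variances assumed known. The function name and the default values for `a0`, `p0`, `q`, and `r` are hypothetical.

```python
def local_level_forecast(ys, a0=0.0, p0=1e6, q=0.1, r=1.0):
    """One-step-ahead forecast for a local-level model.

    Assumed model (variances taken as known, unlike a full Bayesian fit):
        state:        level[t] = level[t-1] + noise with variance q
        observation:  y[t]     = level[t]   + noise with variance r
    Returns the forecast mean and forecast variance for the next observation.
    """
    a, p = a0, p0  # current estimate of the hidden level and its variance
    for y in ys:
        p_pred = p + q                  # predict: the level may have drifted
        gain = p_pred / (p_pred + r)    # Kalman gain
        a = a + gain * (y - a)          # update with the new observation
        p = (1.0 - gain) * p_pred
    return a, p + q + r
```

For example, `local_level_forecast(expression_series)` would return a point forecast and its variance for the next time point of one gene, where `expression_series` stands for a hypothetical list of longitudinal measurements.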
3.
Stat Methods Med Res ; 27(2): 364-383, 2018 Feb.
Article in English | MEDLINE | ID: mdl-26984908

ABSTRACT

The problem of multiple hypothesis testing can be represented as a Markov process where a new alternative hypothesis is accepted in accordance with its relative evidence to the currently accepted one. This virtual and not formally observed process provides the most probable set of non-null hypotheses given the data; it plays the same role as Markov Chain Monte Carlo in approximating a posterior distribution. To apply this representation and obtain the posterior probabilities over all alternative hypotheses, it is enough to have, for each test, barely defined Bayes Factors, e.g. Bayes Factors obtained up to an unknown constant. Such Bayes Factors may either arise from using default and improper priors or from calibrating p-values with respect to their corresponding Bayes Factor lower bound. Both sources of evidence are used to form a Markov transition kernel on the space of hypotheses. The approach leads to easily interpretable results and involves very simple formulas suitable for analyzing large datasets such as those arising from gene expression data (microarray or RNA-seq experiments).


Subjects
Markov Chains , Animals , Bayes Theorem , Biostatistics , Cattle , Computer Simulation , Female , Gene Expression Profiling/statistics & numerical data , Humans , Male , Models, Statistical , Monte Carlo Method , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Prostatic Neoplasms/genetics , Sequence Analysis, RNA/statistics & numerical data , Tuberculosis, Bovine/genetics
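The abstract above mentions calibrating p-values against a Bayes factor lower bound. One well-known calibration, assumed here because the abstract does not name the one it uses, is the bound -e p ln(p) for p < 1/e. The sketch below computes that bound and the smallest posterior probability of the null it implies; the function names and the default prior are hypothetical, and the Markov transition kernel itself is not reproduced.

```python
import math


def bayes_factor_lower_bound(p):
    """Lower bound on the Bayes factor of H0 against H1 for a p-value p.

    Uses the -e * p * ln(p) calibration, valid for p < 1/e; the bound is 1 otherwise.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if p >= 1.0 / math.e:
        return 1.0
    return -math.e * p * math.log(p)


def min_posterior_null(p, prior_null=0.5):
    """Smallest posterior probability of H0 compatible with the bound above."""
    prior_odds = prior_null / (1.0 - prior_null)
    post_odds = prior_odds * bayes_factor_lower_bound(p)
    return post_odds / (1.0 + post_odds)
```

Under this calibration a p-value of 0.05 with even prior odds leaves the null with a posterior probability of roughly 0.29 at minimum, which is why calibrated Bayes factors give more conservative evidence than raw p-values.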
4.
PLoS One ; 11(12): e0167504, 2016.
Article in English | MEDLINE | ID: mdl-27936033

ABSTRACT

This study describes a novel approach to reducing the challenges of highly nonlinear multiclass gene expression values for cancer diagnosis. To build an effective system for cancer diagnosis, we introduced two levels of gene selection, filtering and embedding, to select potential genes and the most relevant genes associated with cancer, respectively. The filter procedure was implemented by developing a fuzzy rough set (FR)-based method for redefining the criterion function of f-information (FI) to identify the potential genes without discretizing the continuous gene expression values. The embedded procedure was implemented by means of a water swirl algorithm (WSA), which attempts to optimize the rule set and membership function required to classify samples using a fuzzy-rule-based multiclassification system (FRBMS). Two novel update equations are proposed in WSA, which have better exploration and exploitation abilities while designing a self-learning FRBMS. The efficiency of our new approach was evaluated on 13 multicategory and 9 binary datasets of cancer gene expression. Additionally, the performance of the proposed FRFI-WSA method in designing an FRBMS was compared with existing methods for gene selection and optimization such as genetic algorithm (GA), particle swarm optimization (PSO), and artificial bee colony algorithm (ABC) on all the datasets. In the global cancer map with repeated measurements (GCM_RM) dataset, the FRFI-WSA selected the smallest number (16) of most relevant genes associated with cancer, using a minimal number of 26 compact rules, with the highest classification accuracy (96.45%). In addition, the statistical validation used in this study revealed that the biological relevance of the most relevant genes associated with cancer and their linguistics detected by the proposed FRFI-WSA approach are better than those in the other methods. 
The simple interpretable rules with most relevant genes and effectively classified samples suggest that the proposed FRFI-WSA approach is reliable for classification of an individual's cancer gene expression data with high precision and therefore it could be helpful for clinicians as a clinical decision support system.


Subjects
Algorithms , Fuzzy Logic , Gene Expression Regulation, Neoplastic/genetics , Genetic Predisposition to Disease/genetics , Genetic Testing/methods , Neoplasms/genetics , Gene Expression Profiling/methods , Gene Expression Profiling/statistics & numerical data , Humans , Monte Carlo Method , Neoplasms/diagnosis , Oligonucleotide Array Sequence Analysis/methods , Oligonucleotide Array Sequence Analysis/statistics & numerical data , ROC Curve , Reproducibility of Results
5.
Pac Symp Biocomput ; 21: 69-80, 2016.
Article in English | MEDLINE | ID: mdl-26776174

ABSTRACT

Understanding community structure in networks has received considerable attention in recent years. Detecting and leveraging community structure holds promise for understanding and potentially intervening with the spread of influence. Network features of this type have important implications in a number of research areas, including marketing, social networks, and biology. However, an overwhelming majority of traditional approaches to community detection cannot readily incorporate information of node attributes. Integrating structural and attribute information is a major challenge. We propose a flexible iterative method, inverse regularized Markov Clustering (irMCL), for network clustering via the manipulation of the transition probability matrix (aka stochastic flow) corresponding to a graph. Similar to traditional Markov Clustering, irMCL iterates between "expand" and "inflate" operations, which aim to strengthen the intra-cluster flow while weakening the inter-cluster flow. Attribute information is directly incorporated into the iterative method through a sigmoid (logistic function) that naturally dampens attribute influence that is contradictory to the stochastic flow through the network. We demonstrate advantages and the flexibility of our approach using simulations and real data. We highlight an application that integrates a breast cancer gene expression data set with a functional network defined via KEGG pathways to reveal significant modules for survival.


Subjects
Computational Biology/methods , Genomics/methods , Algorithms , Breast Neoplasms/genetics , Cluster Analysis , Computational Biology/statistics & numerical data , Computer Simulation , Female , Gene Expression Profiling/statistics & numerical data , Gene Regulatory Networks , Genomics/statistics & numerical data , Humans , Logistic Models , Markov Chains , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Signal Transduction/genetics , Stochastic Processes , Systems Integration
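The "expand" and "inflate" operations named in the abstract above can be sketched with plain Markov Clustering. This is a minimal illustration of the traditional algorithm only: the attribute-dependent sigmoid regularization that defines irMCL is not reproduced, and the function names, the inflation exponent `r`, the iteration count, and the cluster read-out rule are assumptions.

```python
def normalize_columns(m):
    """Rescale each column to sum to 1 (column-stochastic flow matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion: square the matrix, letting flow spread along longer paths."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def inflate(m, r):
    """Inflation: raise entries to the power r and renormalize, favoring strong flow."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(adj, r=2.0, iters=50, eps=1e-6):
    """Plain Markov Clustering on a symmetric adjacency matrix (self-loops are added)."""
    n = len(adj)
    m = normalize_columns(
        [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    )
    for _ in range(iters):
        m = inflate(expand(m), r)
    # Each row that still carries flow marks an attractor; its nonzero columns form a cluster.
    clusters = {
        frozenset(j for j in range(n) if m[i][j] > eps)
        for i in range(n)
        if max(m[i]) > eps
    }
    return sorted(sorted(c) for c in clusters)
```

Larger values of `r` sharpen the contrast between strong and weak flow and generally produce more, smaller clusters.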
6.
PLoS One ; 10(11): e0141874, 2015.
Article in English | MEDLINE | ID: mdl-26562156

ABSTRACT

One goal of cluster analysis is to sort characteristics into groups (clusters) so that those in the same group are more highly correlated with each other than they are with those in other groups. An example is the search for groups of genes whose expression of RNA is correlated in a population of patients. These genes would be of greater interest if their common level of RNA expression were additionally predictive of the clinical outcome. This issue arose in the context of a study of trauma patients on whom RNA samples were available. The question of interest was whether there were groups of genes that were behaving similarly, and whether each gene in the cluster would have a similar effect on who would recover. For this, we develop an algorithm to simultaneously assign characteristics (genes) into groups of highly correlated genes that have the same effect on the outcome (recovery). We propose a random effects model where the genes within each group (cluster) equal the sum of a random effect, specific to the observation and cluster, and an independent error term. The outcome variable is a linear combination of the random effects of each cluster. To fit the model, we implement a Markov chain Monte Carlo algorithm based on the likelihood of the observed data. We evaluate the effect of including outcome in the model through simulation studies and describe a strategy for prediction. These methods are applied to trauma data from the Inflammation and Host Response to Injury research program, revealing a clustering of the genes that is informed by the recovery outcome.


Subjects
Algorithms , Gene Expression Profiling/statistics & numerical data , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Wounds and Injuries/genetics , Cluster Analysis , Computer Simulation , Gene Expression Profiling/classification , Gene Expression Profiling/methods , Humans , Markov Chains , Models, Genetic , Models, Statistical , Monte Carlo Method , Oligonucleotide Array Sequence Analysis/classification , Oligonucleotide Array Sequence Analysis/methods , Outcome Assessment, Health Care/methods , Outcome Assessment, Health Care/statistics & numerical data
7.
Pharm Stat ; 14(4): 284-93, 2015.
Article in English | MEDLINE | ID: mdl-25914330

ABSTRACT

Drug-induced organ toxicity (DIOT) that leads to the removal of marketed drugs or termination of candidate drugs has been a leading concern for regulatory agencies and pharmaceutical companies. In safety studies, the genomic assays are conducted after the treatment so that drug-induced adverse effects can occur. Two types of biomarkers are observed: biomarkers of susceptibility and biomarkers of response. This paper presents a statistical model to distinguish the two types of biomarkers and procedures to identify susceptible subpopulations. The biomarkers identified are used to develop a classification model to identify the susceptible subpopulation. Two methods to identify susceptibility biomarkers were evaluated in terms of predictive performance in subpopulation identification, including sensitivity, specificity, and accuracy. Method 1 considered the traditional linear model with a variable-by-treatment interaction term, and Method 2 considered fitting a single predictor variable model using only treatment data. Monte Carlo simulation studies were conducted to evaluate the performance of the two methods and the impact of the subpopulation prevalence, probability of DIOT, and sample size on the predictive performance. Method 2 appeared to outperform Method 1, which was due to the lack of power for testing the interaction effect. Important statistical issues and challenges regarding identification of preclinical DIOT biomarkers were discussed. In summary, identification of predictive biomarkers for treatment determination highly depends on the subpopulation prevalence. When the proportion of the susceptible subpopulation is 1% or less, a very large sample size is needed to ensure observing a sufficient number of DIOT responses for biomarker and/or subpopulation identification.


Subjects
Drug-Related Side Effects and Adverse Reactions/genetics , Gene Expression Regulation/drug effects , Genetic Markers , Research Design/statistics & numerical data , Animals , Computer Simulation , Data Interpretation, Statistical , Drug-Related Side Effects and Adverse Reactions/classification , Drug-Related Side Effects and Adverse Reactions/diagnosis , Drug-Related Side Effects and Adverse Reactions/epidemiology , Gene Expression Profiling/methods , Gene Expression Profiling/statistics & numerical data , Humans , Linear Models , Logistic Models , Models, Statistical , Monte Carlo Method , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Prevalence , Risk Assessment , Sample Size
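The closing remark of the abstract above, that a prevalence of 1% or less calls for a very large sample, can be illustrated with a short binomial calculation. This is a sketch under simplifying assumptions that are not taken from the paper: every subject is treated, only susceptible subjects can show DIOT, and responses are independent. The function names and all parameter values are hypothetical.

```python
import math


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    below = sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


def min_sample_size(prevalence, p_diot, k=1, target=0.9, n_max=1_000_000):
    """Smallest n giving at least `target` probability of observing k or more DIOT responses.

    prevalence: assumed fraction of susceptible subjects
    p_diot:     assumed probability that a treated susceptible subject shows DIOT
    """
    p = prevalence * p_diot  # per-subject chance of an observed DIOT response
    for n in range(k, n_max + 1):
        if prob_at_least(k, n, p) >= target:
            return n
    raise ValueError("target not reachable within n_max subjects")
```

With a 1% prevalence and a 50% chance of DIOT among susceptible subjects, seeing even a single response with 90% probability already takes several hundred subjects, and requiring several responses pushes the number higher still.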
8.
Pac Symp Biocomput ; : 383-94, 2015.
Article in English | MEDLINE | ID: mdl-25592598

ABSTRACT

Gene expression and disease-associated variants are often used to prioritize candidate genes for target validation. However, the success of these gene features alone or in combination in the discovery of therapeutic targets is uncertain. Here we evaluated the effectiveness of the differential expression (DE), the disease-associated single nucleotide polymorphisms (SNPs) and the combination of the two in recovering and predicting known therapeutic targets across 56 human diseases. We demonstrate that the performance of each feature varies across diseases and generally the features have more recovery power than predictive power. The combination of the two features, however, has significantly higher predictive power than each feature alone. Our study provides a systematic evaluation of two common gene features, DE and SNPs, for prioritization of candidate targets and identified an improved predictive power of coupling these two features.


Subjects
Gene Expression , Genetic Variation , Computational Biology , Databases, Genetic/statistics & numerical data , Disease/genetics , Female , Humans , Male , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Polymorphism, Single Nucleotide
9.
Diagn Microbiol Infect Dis ; 81(1): 4-8, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25445120

ABSTRACT

The Verigene Gram-positive blood culture assay (BC-GP) is a microarray-based rapid diagnostic test, which includes targets for 12 bacterial species and 3 resistance determinants. We prospectively compared the diagnostic accuracy of the BC-GP to routine microbiologic methods and evaluated the potential of the BC-GP for antimicrobial stewardship programs. A total of 143 consecutive patients with Gram-positive bacteremia were included in the analysis. BC-GP correctly identified 127/128 (99.2%) of organisms from monomicrobial blood cultures and 9/14 (64.3%) from polymicrobial, including all methicillin-resistant Staphylococcus aureus and vancomycin-resistant enterococci. Stewardship interventions were possible in 51.0% of patients, most commonly stopping or preventing unnecessary vancomycin or starting a targeted therapy. In Monte Carlo simulations, unnecessary antibiotics could be stopped at least 24 hours earlier in 65.6% of cases, and targeted therapy could be started at least 24 hours earlier in 81.2%. BC-GP is a potentially useful test for antibiotic stewardship in patients with Gram-positive bacteremia.


Subjects
Anti-Bacterial Agents/therapeutic use , Gram-Positive Bacteria/genetics , Gram-Positive Bacterial Infections/microbiology , Oligonucleotide Array Sequence Analysis/methods , Bacteremia/drug therapy , Bacteremia/microbiology , Blood/microbiology , Cohort Studies , Drug Resistance, Bacterial , Enterococcus/drug effects , Enterococcus/genetics , Gram-Positive Bacteria/drug effects , Gram-Positive Bacteria/pathogenicity , Gram-Positive Bacterial Infections/drug therapy , Humans , Methicillin-Resistant Staphylococcus aureus/genetics , Molecular Diagnostic Techniques/methods , Molecular Diagnostic Techniques/statistics & numerical data , Monte Carlo Method , Nanospheres , Oligonucleotide Array Sequence Analysis/instrumentation , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Prospective Studies , Vancomycin/pharmacology
10.
Pac Symp Biocomput ; : 241-52, 2014.
Article in English | MEDLINE | ID: mdl-24297551

ABSTRACT

A striking finding from recent large-scale sequencing efforts is that the vast majority of variants in the human genome are rare and found within single populations or lineages. These observations hold important implications for the design of the next round of disease variant discovery efforts: if genetic variants that influence disease risk follow the same trend, then we expect to see population-specific disease associations that require large sample sizes for detection. To address this challenge, and due to the still prohibitive cost of sequencing large cohorts, researchers have developed a new generation of low-cost genotyping arrays that assay rare variation previously identified from large exome sequencing studies. Genotyping approaches rely not only on directly observing variants, but also on phasing and imputation methods that use publicly available reference panels to infer unobserved variants in a study cohort. Rare variant exome arrays are intentionally enriched for variants likely to be disease causing, and here we assay the ability of the first commercially available rare exome variant array (the Illumina Infinium HumanExome BeadChip) to also tag other potentially damaging variants not molecularly assayed. Using full sequence data from chromosome 22 from the phase I 1000 Genomes Project, we evaluate three methods for imputation (BEAGLE, MaCH-Admix, and SHAPEIT2/IMPUTE2) with the rare exome variant array under varied study panel sizes, reference panel sizes, and LD structures via population differences. We find that imputation is more accurate across both the genome and exome for common variant arrays than the next generation array for all allele frequencies, including rare alleles. We also find that imputation is the least accurate in African populations, and accuracy is substantially improved for rare variants when the same population is included in the reference panel. 
Depending on the goals of GWAS researchers, our results will aid budget decisions by helping determine whether money is best spent sequencing the genomes of smaller sample sizes, genotyping larger sample sizes with rare and/or common variant arrays and imputing SNPs, or some combination of the two.


Subjects
Exome , Genetic Variation , High-Throughput Nucleotide Sequencing/statistics & numerical data , Algorithms , Computational Biology , Genetics, Population/statistics & numerical data , Genome, Human , Genome-Wide Association Study/statistics & numerical data , Genotype , Human Genome Project , Humans , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Polymorphism, Single Nucleotide , Precision Medicine/statistics & numerical data , Sample Size
11.
Epigenetics ; 9(2): 318-29, 2014 Feb.
Article in English | MEDLINE | ID: mdl-24241353

ABSTRACT

The Illumina Infinium HumanMethylation450 BeadChip has emerged as one of the most popular platforms for genome-wide profiling of DNA methylation. While the technology is widespread, systematic technical biases are believed to be present in the data. For example, this array incorporates two different chemical assays, i.e., Type I and Type II probes, which exhibit different technical characteristics and potentially complicate the computational and statistical analysis. Several normalization methods have been introduced recently to adjust for possible biases. However, there is considerable debate within the field on which normalization procedure should be used and indeed whether normalization is even necessary. Yet despite the importance of the question, there has been little comprehensive comparison of normalization methods. We sought to systematically compare several popular normalization approaches using the Norwegian Mother and Child Cohort Study (MoBa) methylation data set and the technical replicates analyzed with it as a case study. We assessed both the reproducibility between technical replicates following normalization and the effect of normalization on association analysis. Results indicate that the raw data are already highly reproducible, some normalization approaches can slightly improve reproducibility, but other normalization approaches may introduce more variability into the data. Results also suggest that differences in association analysis after applying different normalizations are not large when the signal is strong, but when the signal is more modest, different normalizations can yield very different numbers of findings that meet a weaker statistical significance threshold. Overall, our work provides a useful, objective assessment of the effectiveness of key normalization methods.


Subjects
DNA Methylation , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Adult , CpG Islands , Humans , Infant, Newborn , Oligonucleotide Array Sequence Analysis/methods , Reproducibility of Results , Software
12.
Biometrics ; 69(3): 614-23, 2013 Sep.
Article in English | MEDLINE | ID: mdl-23909925

ABSTRACT

A key problem in high-dimensional significance analysis is to find pre-defined sets that show enrichment for a statistical signal of interest; the classic example is the enrichment of gene sets for differentially expressed genes. Here, we propose a new decision-theory approach to the analysis of gene sets which focuses on estimating the fraction of non-null variables in a set. We introduce the idea of "atoms," non-overlapping sets based on the original pre-defined set annotations. Our approach focuses on finding the union of atoms that minimizes a weighted average of the number of false discoveries and missed discoveries. We introduce a new false discovery rate for sets, called the atomic false discovery rate (afdr), and prove that the optimal estimator in our decision-theory framework is to threshold the afdr. These results provide a coherent and interpretable framework for the analysis of sets that addresses the key issues of overlapping annotations and difficulty in interpreting p values in both competitive and self-contained tests. We illustrate our method and compare it to a popular existing method using simulated examples, as well as gene-set and brain ROI data analyses.


Subjects
Biometry/methods , Data Interpretation, Statistical , Decision Theory , Algorithms , Bayes Theorem , Brain/anatomy & histology , Brain/physiology , Computer Simulation , Functional Neuroimaging/statistics & numerical data , Gene Expression Profiling/statistics & numerical data , Genomics/statistics & numerical data , Humans , Magnetic Resonance Imaging/statistics & numerical data , Models, Statistical , Oligonucleotide Array Sequence Analysis/statistics & numerical data
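The "atoms" in the abstract above are described only as non-overlapping sets derived from the original annotations. One natural reading, assumed here and not confirmed by the abstract, is that variables sharing exactly the same combination of set memberships form one atom. The function name and the toy gene sets are hypothetical.

```python
from collections import defaultdict


def atoms(gene_sets):
    """Split overlapping gene sets into non-overlapping atoms.

    gene_sets maps a set name to the genes annotated to it. Genes that belong to
    exactly the same combination of sets are grouped into one atom (an assumed
    reading of "non-overlapping sets based on the original annotations").
    """
    membership = defaultdict(set)
    for name, genes in gene_sets.items():
        for gene in genes:
            membership[gene].add(name)
    grouped = defaultdict(set)
    for gene, names in membership.items():
        grouped[frozenset(names)].add(gene)
    return dict(grouped)
```

Because every gene lands in exactly one atom, each original set is a union of atoms, which is what allows set-level conclusions to be stated as unions of atoms.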
13.
Biometrics ; 68(3): 774-83, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22260651

ABSTRACT

DNA methylation has emerged as an important hallmark of epigenetics. Numerous platforms, including tiling arrays and next generation sequencing, and experimental protocols are available for profiling DNA methylation. Similar to other tiling array data, DNA methylation data shares the characteristics of inherent correlation structure among nearby probes. However, unlike gene expression or protein DNA binding data, the varying CpG density which gives rise to CpG island, shore and shelf definitions provides exogenous information in detecting differential methylation. This article aims to introduce a robust testing and probe ranking procedure based on a nonhomogeneous hidden Markov model that incorporates the above-mentioned features for detecting differential methylation. We revisit the seminal work of Sun and Cai (2009, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71, 393-424) and propose modeling the nonnull using a nonparametric symmetric distribution in two-sided hypothesis testing. We show that this model improves probe ranking and is robust to model misspecification based on extensive simulation studies. We further illustrate that our proposed framework achieves good operating characteristics as compared to commonly used methods in real DNA methylation data aimed at detecting differential methylation sites.


Subjects
Biometry/methods , DNA Methylation , Models, Statistical , CpG Islands , Databases, Nucleic Acid/statistics & numerical data , Epigenesis, Genetic , Humans , Markov Chains , Models, Genetic , Mutation , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Probability
14.
Biometrics ; 68(2): 437-45, 2012 Jun.
Article in English | MEDLINE | ID: mdl-21950383

ABSTRACT

The additive model is a semiparametric class of models that has become extremely popular because it is more flexible than the linear model and can be fitted to high-dimensional data when fully nonparametric models become infeasible. We consider the problem of simultaneous variable selection and parametric component identification using spline approximation aided by two smoothly clipped absolute deviation (SCAD) penalties. The advantage of our approach is that one can automatically choose between additive models, partially linear additive models and linear models, in a single estimation step. Simulation studies are used to illustrate our method, and we also present its applications to motif regression.


Subjects
Biometry/methods , Gene Expression Profiling/statistics & numerical data , Models, Statistical , Databases, Genetic/statistics & numerical data , Linear Models , Monte Carlo Method , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Regression Analysis , Saccharomyces cerevisiae/genetics , Statistics, Nonparametric
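The smoothly clipped absolute deviation (SCAD) penalty named in the abstract above has a standard closed form, sketched below for a single coefficient. Only the penalty function is shown; the spline approximation and the way the paper combines two such penalties are not reproduced. The function name is hypothetical, and `a=3.7` is the value conventionally suggested for SCAD rather than one taken from the abstract.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty for one coefficient.

    Behaves like the lasso (lam * |theta|) near zero, blends quadratically for
    lam < |theta| <= a * lam, and is constant beyond a * lam so that large
    coefficients are not shrunk further.
    """
    if lam <= 0 or a <= 2:
        raise ValueError("requires lam > 0 and a > 2")
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2
```

The three pieces join continuously at `lam` and `a * lam`, which is what makes the penalty "smoothly clipped".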
15.
Biostatistics ; 12(4): 682-94, 2011 Oct.
Article in English | MEDLINE | ID: mdl-21551122

ABSTRACT

We propose a semiparametric Bayesian model, based on penalized splines, for the recovery of the time-invariant topology of a causal interaction network from longitudinal data. Our motivation is inference of gene regulatory networks from low-resolution microarray time series, where existence of nonlinear interactions is well known. Parenthood relations are mapped by augmenting the model with kinship indicators and providing these with either an overall or gene-wise hierarchical structure. Appropriate specification of the prior is crucial to control the flexibility of the splines, especially under circumstances of scarce data; thus, we provide an informative, proper prior. Substantive improvement in network inference over a linear model is demonstrated using synthetic data drawn from ordinary differential equation models and gene expression from an experimental data set of the Arabidopsis thaliana circadian rhythm.


Subjects
Bayes Theorem , Gene Regulatory Networks , Models, Genetic , Models, Statistical , Algorithms , Arabidopsis/genetics , Biostatistics , Circadian Rhythm/genetics , Genome, Plant , Linear Models , Markov Chains , Nonlinear Dynamics , Oligonucleotide Array Sequence Analysis/statistics & numerical data
16.
Methods Mol Biol ; 722: 61-77, 2011.
Article in English | MEDLINE | ID: mdl-21590413

ABSTRACT

Advances in genome sequencing technologies have facilitated production of a wealth of fungal data; within the last 5 years, experimental costs and labor have diminished, shifting the production bottleneck from genomic data generation to data analysis. Genome sequences and microarrays now exist for many fungi, and transcriptional profiling has been shown to be an efficient way to examine how the entire genome changes in response to many different environments or treatments. Multiple platforms, programs, and protocols exist for analyzing such data, making this task daunting for the bench-based scientist. Furthermore, many existing programs are expensive and require license renewals on a yearly basis for each user in the laboratory. Costs may be prohibitively high for bench-based scientists in academia. Our combined experiences with this kind of analysis have favored two programs, depending upon whether the scientist is working with single- or dual-channel hybridization data. Our protocols are aimed toward helping the bench-based PI get the most possible information from their data, without the need for expensive software or an experienced bioinformaticist.


Subjects
Statistical Data Interpretation , Fungal Proteins/metabolism , Gene Expression Profiling/statistics & numerical data , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Software , Ascomycota/genetics , Ascomycota/metabolism , Bayes Theorem , Computational Biology/methods , Fungal Proteins/genetics , Gene Expression Profiling/methods , Statistical Models , Oligonucleotide Array Sequence Analysis/methods , Software/economics , Software/trends , Time Factors
17.
Adv Exp Med Biol ; 696: 27-35, 2011.
Article in English | MEDLINE | ID: mdl-21431543

ABSTRACT

Microarray classification problems can be handled successfully with approaches that are easy for human experts to understand and interpret, such as decision trees or Top Scoring Pairs (TSP) algorithms. In this chapter, we propose a hybrid solution that combines these two methods. The resulting decision trees, which split instances based on pairwise comparisons of gene expression values, may have considerable potential for genomic research and scientific modeling of underlying processes. We compared the proposed solution with TSP-family methods and decision trees on 11 public-domain microarray datasets, and the results are promising.
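The pairwise-comparison split described in this abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: it assumes two classes labeled 0 and 1, uses the standard TSP score (the difference between classes in how often gene i is expressed below gene j), and the function names `pair_score`, `top_scoring_pair`, and `split_on_pair` are hypothetical.

```python
from itertools import combinations


def pair_score(X, y, i, j):
    """Return |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|."""
    freqs = []
    for label in (0, 1):
        rows = [row for row, lab in zip(X, y) if lab == label]
        # Fraction of samples in this class where gene i is below gene j.
        freqs.append(sum(row[i] < row[j] for row in rows) / len(rows))
    return abs(freqs[0] - freqs[1])


def top_scoring_pair(X, y):
    """Return the gene pair (i, j) whose ordering best separates the classes."""
    n_genes = len(X[0])
    return max(combinations(range(n_genes), 2),
               key=lambda pair: pair_score(X, y, *pair))


def split_on_pair(X, pair):
    """Apply the node test 'x_i < x_j' to every sample (True = left branch)."""
    i, j = pair
    return [row[i] < row[j] for row in X]


# Toy data (illustrative only): 4 samples, 3 genes, two classes.
X = [[1, 2, 9], [2, 3, 9], [3, 1, 9], [5, 2, 9]]
y = [0, 0, 1, 1]
best_pair = top_scoring_pair(X, y)
branches = split_on_pair(X, best_pair)
```

Because the test depends only on the relative order of two expression values, it is unaffected by any monotone per-sample normalization, which is one reason such splits are considered easy to interpret.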


Subjects
Algorithms , Decision Trees , Gene Expression Profiling/statistics & numerical data , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Breast Neoplasms/genetics , Neoplasm DNA/genetics , Statistical Data Interpretation , Nucleic Acid Databases , Female , Humans
18.
Adv Exp Med Biol ; 696: 191-9, 2011.
Article in English | MEDLINE | ID: mdl-21431559

ABSTRACT

Machine learning approaches have wide applications in bioinformatics, and the decision tree is one of the successful approaches applied in this field. In this chapter, we briefly review decision trees and related ensemble algorithms and show successful applications of such approaches to biological problems. We hope that by learning the algorithms of decision trees and ensemble classifiers, biologists can grasp the basic ideas of how machine learning algorithms work. Conversely, by being exposed to the applications of decision trees and ensemble algorithms in bioinformatics, computer scientists can get a better idea of which bioinformatics topics they may work on in future research. We aim to provide a platform to bridge the gap between biologists and computer scientists.


Subjects
Algorithms , Artificial Intelligence , Computational Biology/methods , Decision Trees , Female , Gene Expression Profiling/statistics & numerical data , Genomics/statistics & numerical data , Humans , Male , Mass Spectrometry/statistics & numerical data , Neoplasms/chemistry , Neoplasms/classification , Neoplasms/genetics , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Regression Analysis , Software
19.
J Bioinform Comput Biol ; 9(1): 131-48, 2011 Feb.
Article in English | MEDLINE | ID: mdl-21328710

ABSTRACT

DNA copy number (DCN) is the number of copies of DNA at a region of a genome. Alterations of DCN are highly associated with the development of different tumors. Recently, microarray technologies have been employed to detect DCN changes at many loci simultaneously in tumor samples. The resulting DCN data are often very noisy, and the tumor sample is often contaminated by normal cells. The goal of computational analysis of array-based DCN data is to infer the underlying DCNs from raw DCN data. Previous methods for this task do not model the tumor/normal cell mixture ratio explicitly, and they cannot output segments with DCN annotations. We developed a novel model-based method using the minimum description length (MDL) principle for DCN data segmentation. Our new method can output the underlying DCN for each chromosomal segment and, at the same time, infer the underlying tumor proportion in the test samples. Empirical results show that our method achieves better accuracy on average than three previous methods, namely Circular Binary Segmentation, Hidden Markov Model and Ultrasome.
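To make the MDL idea concrete, the following is a generic sketch of description-length scoring for a piecewise-constant signal. It is not the authors' model: it ignores the tumor/normal mixture ratio, assumes Gaussian noise with a shared variance, searches for at most one breakpoint, and uses a simple two-part code whose exact penalty terms are an assumption. All function names are hypothetical.

```python
import math


def rss(values):
    """Residual sum of squares around the segment mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def description_length(segments, n):
    """Two-part code length in bits for a piecewise-constant fit (assumed form)."""
    k = len(segments)
    total_rss = sum(rss(seg) for seg in segments)
    # Cost of the residuals under Gaussian noise; epsilon guards log2(0).
    data_bits = 0.5 * n * math.log2(total_rss / n + 1e-12)
    # Cost of k segment means plus k - 1 breakpoint positions.
    model_bits = 0.5 * k * math.log2(n) + (k - 1) * math.log2(n)
    return data_bits + model_bits


def best_single_split(signal):
    """Return the breakpoint index that shortens the code, or None if none does."""
    n = len(signal)
    best_bits, best_index = description_length([signal], n), None
    for b in range(2, n - 1):  # keep at least two points per segment
        bits = description_length([signal[:b], signal[b:]], n)
        if bits < best_bits:
            best_bits, best_index = bits, b
    return best_index


# Toy log-ratio signals (illustrative only).
stepped = [0.1, -0.1, 0.0, 0.1, -0.1, 1.0, 1.1, 0.9, 1.0, 1.1]
flat = [0.1, -0.1, 0.0, 0.1, -0.1, 0.0, 0.1, -0.1, 0.0, 0.1]
step_break = best_single_split(stepped)
flat_break = best_single_split(flat)
```

A breakpoint is accepted only when the bits saved on the residuals exceed the bits spent describing the extra segment, so the noisy but flat signal is left unsegmented.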


Subjects
DNA Copy Number Variations , Algorithms , Computational Biology , Computer Simulation , Neoplasm DNA/genetics , Statistical Data Interpretation , Nucleic Acid Databases/statistics & numerical data , Humans , Markov Chains , Statistical Models , Neoplasms/genetics , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Software
20.
J Biopharm Stat ; 20(2): 209-22, 2010 Mar.
Article in English | MEDLINE | ID: mdl-20309755

ABSTRACT

Empirical Bayes methods are widely used in the analysis of microarray gene expression data in order to identify the differentially expressed genes or genes that are associated with other general phenotypes. Available methods often assume that genes are independent. However, genes are expected to function interactively and to form molecular modules to affect the phenotypes. In order to account for regulatory dependency among genes, we propose in this paper a network-based empirical Bayes method for analyzing genomic data in the framework of linear models, where the dependency of genes is modeled by a discrete Markov random field defined on a predefined biological network. This method provides a statistical framework for integrating the known biological network information into the analysis of genomic data. We present an iterated conditional mode algorithm for parameter estimation and for estimating the posterior probabilities using Gibbs sampling. We demonstrate the application of the proposed methods using simulations and analysis of a human brain aging microarray gene expression data set.
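The iterated conditional modes step mentioned in this abstract can be sketched on a small binary Markov random field. This is a generic illustration, not the paper's parameterization: it assumes each gene has a per-gene log-odds of differential expression, an Ising-style coupling `beta` between network neighbors, and a hypothetical function name `icm`.

```python
def icm(log_odds, neighbors, beta=1.0, max_sweeps=50):
    """Iterated conditional modes for binary labels on a network.

    log_odds[g]  : evidence that gene g is differentially expressed (assumed input)
    neighbors[g] : indices of the genes adjacent to g in the network
    beta         : strength of the neighbor-agreement term (assumed Ising-style)
    """
    # Start from the labels each gene would get with no network information.
    labels = [1 if lo > 0 else 0 for lo in log_odds]
    for _ in range(max_sweeps):
        changed = False
        for g, lo in enumerate(log_odds):
            # Neighbors labeled 1 vote +1 and neighbors labeled 0 vote -1.
            vote = sum(1 if labels[h] == 1 else -1 for h in neighbors[g])
            new_label = 1 if lo + beta * vote > 0 else 0
            if new_label != labels[g]:
                labels[g] = new_label
                changed = True
        if not changed:  # a full sweep with no updates is a local mode
            break
    return labels


# Toy chain network 0 - 1 - 2 (illustrative only).
log_odds = [2.0, -0.5, 2.0]
neighbors = [[1], [0, 2], [1]]
with_network = icm(log_odds, neighbors, beta=1.0)
without_network = icm(log_odds, neighbors, beta=0.0)
```

In the toy chain, the middle gene has weak evidence against differential expression, but its two strongly supported neighbors pull its label to 1 once the coupling is switched on.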


Subjects
Bayes Theorem , Gene Expression Profiling/statistics & numerical data , Gene Regulatory Networks , Genomics/statistics & numerical data , Linear Models , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Adult , Age Factors , Aged , Aged 80 and over , Aging/genetics , Brain/physiology , Computer Simulation , Statistical Data Interpretation , Empirical Research , Gene Expression Regulation , Humans , Markov Chains , Middle Aged , Systems Biology/statistics & numerical data