Results 1 - 20 of 176
1.
Bioinformatics ; 37(19): 3212-3219, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-33822889

ABSTRACT

MOTIVATION: When learning to subtype complex disease based on next-generation sequencing data, the amount of available data is often limited. Recent works have tried to leverage data from other domains to design better predictors in the target domain of interest with varying degrees of success. But they are either limited to the cases requiring the outcome label correspondence across domains or cannot leverage the label information at all. Moreover, the existing methods cannot usually benefit from other information available a priori such as gene interaction networks. RESULTS: In this article, we develop a generative optimal Bayesian supervised domain adaptation (OBSDA) model that can integrate RNA sequencing (RNA-Seq) data from different domains along with their labels for improving prediction accuracy in the target domain. Our model can be applied in cases where different domains share the same labels or have different ones. OBSDA is based on a hierarchical Bayesian negative binomial model with parameter factorization, for which the optimal predictor can be derived by marginalization of likelihood over the posterior of the parameters. We first provide an efficient Gibbs sampler for parameter inference in OBSDA. Then, we leverage the gene-gene network prior information and construct an informed and flexible variational family to infer the posterior distributions of model parameters. Comprehensive experiments on real-world RNA-Seq data demonstrate the superior performance of OBSDA, in terms of accuracy in identifying cancer subtypes by utilizing data from different domains. Moreover, we show that by taking advantage of the prior network information we can further improve the performance. AVAILABILITY AND IMPLEMENTATION: The source code for the implementations of OBSDA and SI-OBSDA is available at https://github.com/SHBLK/BSDA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

2.
Bioinformatics ; 35(4): 643-649, 2019 02 15.
Article in English | MEDLINE | ID: mdl-30052771

ABSTRACT

MOTIVATION: Canalizing genes enforce broad corrective actions on cellular processes for the purpose of biological robustness, maintaining a constant phenotype in spite of genetic mutations or environmental perturbations. Despite their central role in biological systems, the observation/detection of canalizing genes is often impeded because the behavior of affected genes is highly varied relative to the inactive canalizer. Therefore, the activity of canalizing genes is difficult to predict to any significant degree by their subject genes under normal cell conditions. RESULTS: We investigate this question and present a quantitative framework that allows for the estimation of the power of canalizing genes in the context of Boolean Networks (BNs) with perturbation. This framework borrows tools from pattern recognition theory and uses the coefficient of determination (CoD) to capture the capacity of the canalizing genes. The canalizing power (CP) of a gene is quantitatively characterized by two terms: regulation power (RP) and incapacitating power (IP). We base this assumption on the idea that canalizing power of a gene should be quantified by the extent of its regulation on the overall network and the extent of control that the gene takes over from other master genes when it is activated, which is equivalent to reduction of the control of other master genes upon its activation. Following this, the CP concept is illustrated with examples in which the goal is to provide preliminary evidence that CP can be used to characterize the ability of canalizing genes. AVAILABILITY AND IMPLEMENTATION: A library of functions written in MATLAB for computing CP is available at http://github.com/eunjikim-angie/CanalizingPower.
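
The coefficient of determination named in the abstract above can be illustrated with a minimal sketch. It assumes the usual definition CoD = (e0 - e) / e0, where e0 is the error of the best constant guess of a binary target gene and e is the error of the best predictor built from the observed predictor genes, both estimated by resubstitution. The function name `cod` and the convention of returning 0 for a constant target are illustrative assumptions; this is not the authors' MATLAB library, and it does not compute RP, IP, or CP.

```python
from collections import Counter, defaultdict


def cod(predictors, target):
    """Empirical coefficient of determination for binary data (illustrative sketch)."""
    n = len(target)
    # Error of the best constant guess of the target (no predictors used).
    ones = sum(target)
    eps0 = min(ones, n - ones) / n
    if eps0 == 0:
        # Assumed convention: a constant target leaves nothing to explain.
        return 0.0
    # Error of the best predictor: majority vote within each predictor state.
    counts = defaultdict(Counter)
    for x, y in zip(predictors, target):
        counts[tuple(x)][y] += 1
    errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
    eps = errors / n
    return (eps0 - eps) / eps0
```

A CoD near 1 means the predictor genes almost fully determine the target; a CoD near 0 means they do no better than a constant guess.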


Subjects
Gene Regulatory Networks , Genetic Models , Computational Biology
3.
BMC Bioinformatics ; 20(Suppl 12): 321, 2019 Jun 20.
Article in English | MEDLINE | ID: mdl-31216989

ABSTRACT

BACKGROUND: Missing values frequently arise in modern biomedical studies due to various reasons, including missing tests or complex profiling technologies for different omics measurements. Missing values can complicate the application of clustering algorithms, whose goals are to group points based on some similarity criterion. A common practice for dealing with missing values in the context of clustering is to first impute the missing values, and then apply the clustering algorithm on the completed data. RESULTS: We consider missing values in the context of optimal clustering, which finds an optimal clustering operator with reference to an underlying random labeled point process (RLPP). We show how the missing-value problem fits neatly into the overall framework of optimal clustering by incorporating the missing value mechanism into the random labeled point process and then marginalizing out the missing-value process. In particular, we demonstrate the proposed framework for the Gaussian model with arbitrary covariance structures. Comprehensive experimental studies on both synthetic and real-world RNA-seq data show the superior performance of the proposed optimal clustering with missing values when compared to various clustering approaches. CONCLUSION: Optimal clustering with missing values obviates the need for imputation-based pre-processing of the data, while at the same time possessing smaller clustering errors.


Subjects
Algorithms , Breast Neoplasms/genetics , Cluster Analysis , Computer Simulation , Female , Gene Expression Profiling , Humans , Theoretical Models , Normal Distribution , Probability
4.
BMC Genomics ; 20(Suppl 6): 435, 2019 Jun 13.
Article in English | MEDLINE | ID: mdl-31189480

ABSTRACT

BACKGROUND: Single-cell gene expression measurements offer opportunities in deriving mechanistic understanding of complex diseases, including cancer. However, due to the complex regulatory machinery of the cell, gene regulatory network (GRN) model inference based on such data still manifests significant uncertainty. RESULTS: The goal of this paper is to develop optimal classification of single-cell trajectories accounting for potential model uncertainty. Partially-observed Boolean dynamical systems (POBDS) are used for modeling gene regulatory networks observed through noisy gene-expression data. We derive the exact optimal Bayesian classifier (OBC) for binary classification of single-cell trajectories. The application of the OBC becomes impractical for large GRNs, due to computational and memory requirements. To address this, we introduce a particle-based single-cell classification method that is highly scalable for large GRNs with much lower complexity than the optimal solution. CONCLUSION: The performance of the proposed particle-based method is demonstrated through numerical experiments using a POBDS model of the well-known T-cell large granular lymphocyte (T-LGL) leukemia network with noisy time-series gene-expression data.


Subjects
Algorithms , Bayes Theorem , Computational Biology/methods , Gene Regulatory Networks , Large Granular Lymphocytic Leukemia/genetics , Single-Cell Analysis/methods , Gene Expression Profiling , Humans , Biological Models , Genetic Models , Uncertainty
5.
Curr Genomics ; 20(1): 16-23, 2019 Jan.
Article in English | MEDLINE | ID: mdl-31015788

ABSTRACT

INTRODUCTION: The most basic aspect of modern engineering is the design of operators to act on physical systems in an optimal manner relative to a desired objective - for instance, designing a control policy to autonomously direct a system or designing a classifier to make decisions regarding the system. These kinds of problems appear in biomedical science, where physical models are created with the intention of using them to design tools for diagnosis, prognosis, and therapy. METHODS: In the classical paradigm, our knowledge regarding the model is certain; however, in practice, especially with complex systems, our knowledge is uncertain and operators must be designed while taking this uncertainty into account. The related concepts of intrinsically Bayesian robust operators and optimal Bayesian operators treat operator design under uncertainty. An objective-based experimental design procedure is naturally related to operator design: We would like to perform an experiment that maximally reduces our uncertainty as it pertains to our objective. RESULTS & DISCUSSION: This paper provides a nonmathematical review of optimal Bayesian operators directed at biomedical scientists. It considers two applications important to genomics, structural intervention in gene regulatory networks and classification. CONCLUSION: The salient point regarding intrinsically Bayesian operators is that uncertainty is quantified relative to the scientific model, and the prior distribution is on the parameters of this model. Optimization has direct physical (biological) meaning. This is opposed to the common method of placing prior distributions on the parameters of the operator, in which case there is a scientific gap between operator design and the phenomena.

6.
Proc Natl Acad Sci U S A ; 113(47): 13301-13306, 2016 11 22.
Article in English | MEDLINE | ID: mdl-27821777

ABSTRACT

An outstanding challenge in the nascent field of materials informatics is to incorporate materials knowledge in a robust Bayesian approach to guide the discovery of new materials. Utilizing inputs from known phase diagrams, features or material descriptors that are known to affect the ferroelectric response, and Landau-Devonshire theory, we demonstrate our approach for BaTiO3-based piezoelectrics with the desired target of a vertical morphotropic phase boundary. We predict, synthesize, and characterize a solid solution, (Ba0.5Ca0.5)TiO3-Ba(Ti0.7Zr0.3)O3, with piezoelectric properties that show better temperature reliability than other BaTiO3-based piezoelectrics in our initial training data.

7.
BMC Bioinformatics ; 18(Suppl 14): 552, 2017 12 28.
Article in English | MEDLINE | ID: mdl-29297278

ABSTRACT

BACKGROUND: Phenotypic classification is problematic because small samples are ubiquitous; and, for these, use of prior knowledge is critical. If knowledge concerning the feature-label distribution - for instance, genetic pathways - is available, then it can be used in learning. Optimal Bayesian classification provides optimal classification under model uncertainty. It differs from classical Bayesian methods in which a classification model is assumed and prior distributions are placed on model parameters. With optimal Bayesian classification, uncertainty is treated directly on the feature-label distribution, which assures full utilization of prior knowledge and is guaranteed to outperform classical methods. RESULTS: The salient problem confronting optimal Bayesian classification is prior construction. In this paper, we propose a new prior construction methodology based on a general framework of constraints in the form of conditional probability statements. We call this prior the maximal knowledge-driven information prior (MKDIP). The new constraint framework is more flexible than our previous methods as it naturally handles the potential inconsistency in archived regulatory relationships and conditioning can be augmented by other knowledge, such as population statistics. We also extend the application of prior construction to a multinomial mixture model when labels are unknown, which often occurs in practice. The performance of the proposed methods is examined on two important pathway families, the mammalian cell-cycle and a set of p53-related pathways, and also on a publicly available gene expression dataset of non-small cell lung cancer when combined with the existing prior knowledge on relevant signaling pathways. CONCLUSION: The new proposed general prior construction framework extends the prior construction methodology to a more flexible framework that results in better inference when proper prior knowledge exists. 
Moreover, the extension of optimal Bayesian classification to multinomial mixtures, where data sets are both small and unlabeled, enables superior classifier design using small, unstructured data sets. We have demonstrated the effectiveness of our approach using pathway information and available knowledge of gene regulating functions; however, the underlying theory can be applied to a wide variety of knowledge types, and other applications when there are small samples.


Subjects
Algorithms , Animals , Bayes Theorem , Non-Small-Cell Lung Carcinoma/genetics , Cell Cycle , Entropy , Humans , Information Theory , Lung Neoplasms/genetics , Mammals/metabolism , Probability , Signal Transduction , Tumor Suppressor Protein p53/metabolism
8.
IEEE Trans Signal Process ; 64(23): 6243-6253, 2016 Dec 01.
Article in English | MEDLINE | ID: mdl-28824268

ABSTRACT

The recently introduced intrinsically Bayesian robust filter (IBRF) provides fully optimal filtering relative to a prior distribution over an uncertainty class of joint random process models, whereas formerly the theory was limited to model-constrained Bayesian robust filters, for which optimization was limited to the filters that are optimal for models in the uncertainty class. This paper extends the IBRF theory to the situation where there are both a prior on the uncertainty class and sample data. The result is optimal Bayesian filtering (OBF), where optimality is relative to the posterior distribution derived from the prior and the data. The IBRF theories for effective characteristics and canonical expansions extend to the OBF setting. A salient focus of the present work is to demonstrate the advantages of Bayesian regression within the OBF setting over the classical Bayesian approach in the context of linear Gaussian models.

9.
BMC Bioinformatics ; 16 Suppl 13: S2, 2015.
Article in English | MEDLINE | ID: mdl-26423515

ABSTRACT

BACKGROUND: An accurate understanding of interactions among genes plays a major role in developing therapeutic intervention methods. Gene regulatory networks often contain a significant amount of uncertainty. The process of prioritizing biological experiments to reduce the uncertainty of gene regulatory networks is called experimental design. Under such a strategy, the experiments with high priority are suggested to be conducted first. RESULTS: The authors have already proposed an optimal experimental design method based upon the objective for modeling gene regulatory networks, such as deriving therapeutic interventions. The experimental design method utilizes the concept of mean objective cost of uncertainty (MOCU). MOCU quantifies the expected increase of cost resulting from uncertainty. The optimal experiment to be conducted first is the one which leads to the minimum expected remaining MOCU subsequent to the experiment. In the process, one must find the optimal intervention for every gene regulatory network compatible with the prior knowledge, which can be prohibitively expensive when the size of the network is large. In this paper, we propose a computationally efficient experimental design method. This method incorporates a network reduction scheme by introducing a novel cost function that takes into account the disruption in the ranking of potential experiments. We then estimate the approximate expected remaining MOCU at a lower computational cost using the reduced networks. CONCLUSIONS: Simulation results based on synthetic and real gene regulatory networks show that the proposed approximate method performs close to the optimal method but at lower computational cost. The proposed approximate method also outperforms the random selection policy significantly. MATLAB software implementing the proposed experimental design method is available at http://gsp.tamu.edu/Publications/supplementary/roozbeh15a/.
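
The MOCU criterion described in the abstract above can be sketched for a small, finite uncertainty class. The sketch assumes a table `cost[i][a]` giving the cost of intervention `a` when model `i` is the true network, a prior over models, and an experiment whose outcome is fully determined by the true model; these names and the toy setup are illustrative assumptions. It shows only the exact criterion on a tiny example, not the network reduction scheme or the authors' MATLAB software.

```python
def mocu(prior, cost):
    """Mean objective cost of uncertainty for a finite model class (sketch)."""
    actions = range(len(cost[0]))
    # Expected cost of the single intervention that is best on average.
    robust = min(sum(p * row[a] for p, row in zip(prior, cost)) for a in actions)
    # Expected cost if the true model were known and its own optimum applied.
    oracle = sum(p * min(row) for p, row in zip(prior, cost))
    return robust - oracle


def expected_remaining_mocu(prior, cost, outcome_of_model):
    """Expected MOCU after an experiment; outcome_of_model[i] is the outcome under model i."""
    total = 0.0
    for outcome in set(outcome_of_model):
        idx = [i for i, o in enumerate(outcome_of_model) if o == outcome]
        mass = sum(prior[i] for i in idx)
        if mass == 0:
            continue  # outcome has zero prior probability
        posterior = [prior[i] / mass for i in idx]
        total += mass * mocu(posterior, [cost[i] for i in idx])
    return total
```

Under this criterion, the experiment with the smallest `expected_remaining_mocu` would be run first; an experiment whose outcome is the same under every model leaves the MOCU unchanged.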


Subjects
Gene Regulatory Networks/physiology , Genomics/methods , Humans , Research Design , Uncertainty
10.
BMC Bioinformatics ; 16 Suppl 13: S3, 2015.
Article in English | MEDLINE | ID: mdl-26423606

ABSTRACT

BACKGROUND: Most dynamical models for genomic networks are built upon two current methodologies, one process-based and the other based on Boolean-type networks. Both are problematic when it comes to experimental design purposes in the laboratory. The first approach requires a comprehensive knowledge of the parameters involved in all biological processes a priori, whereas the results from the second method may not have a biological correspondence and thus cannot be tested in the laboratory. Moreover, the current methods cannot readily utilize existing curated knowledge databases and do not consider uncertainty in the knowledge. Therefore, a new methodology is needed that can generate a dynamical model based on available biological data, assuming uncertainty, while the results from experimental design can be examined in the laboratory. RESULTS: We propose a new methodology for dynamical modeling of genomic networks that can utilize the interaction knowledge provided in public databases. The model assigns discrete states for physical entities, sets priorities among interactions based on information provided in the database, and updates each interaction based on associated node states. Whenever uncertainty in dynamics arises, it explores all possible outcomes. By using the proposed model, biologists can study regulation networks that are too complex for manual analysis. CONCLUSIONS: The proposed approach can be effectively used for constructing dynamical models of interaction-based genomic networks without requiring a complete knowledge of all parameters affecting the network dynamics, and thus based on a small set of available data.


Subjects
Genomics/methods , Molecular Models , Molecular Dynamics Simulation , Uncertainty
11.
Bioinformatics ; 30(2): 242-50, 2014 Jan 15.
Article in English | MEDLINE | ID: mdl-24257187

ABSTRACT

MOTIVATION: Measurements are commonly taken from two phenotypes to build a classifier, where the number of data points from each class is predetermined, not random. In this 'separate sampling' scenario, the data cannot be used to estimate the class prior probabilities. Moreover, predetermined class sizes can severely degrade classifier performance, even for large samples. RESULTS: We employ simulations using both synthetic and real data to show the detrimental effect of separate sampling on a variety of classification rules. We establish propositions related to the effect on the expected classifier error owing to a sampling ratio different from the population class ratio. From these we derive a sample-based minimax sampling ratio and provide an algorithm for approximating it from the data. We also extend to arbitrary distributions the classical population-based Anderson linear discriminant analysis minimax sampling ratio derived from the discriminant form of the Bayes classifier. AVAILABILITY: All the codes for synthetic data and real data examples are written in MATLAB. A function called mmratio, whose output is an approximation of the minimax sampling ratio of a given dataset, is also written in MATLAB. All the codes are available at: http://gsp.tamu.edu/Publications/supplementary/shahrokh13b.


Subjects
Algorithms , Breast Neoplasms/classification , Acute Myeloid Leukemia/classification , Multiple Myeloma/classification , Precursor Cell Lymphoblastic Leukemia-Lymphoma/classification , Selection Bias , Bayes Theorem , Breast Neoplasms/genetics , Child , Discriminant Analysis , Female , Gene Expression Profiling , Humans , Acute Myeloid Leukemia/genetics , Multiple Myeloma/genetics , Precursor Cell Lymphoblastic Leukemia-Lymphoma/genetics , Sample Size
12.
Bioinformatics ; 30(23): 3349-55, 2014 Dec 01.
Article in English | MEDLINE | ID: mdl-25123902

ABSTRACT

MOTIVATION: It is commonly assumed in pattern recognition that cross-validation error estimation is 'almost unbiased' as long as the number of folds is not too small. While this is true for random sampling, it is not true with separate sampling, where the populations are independently sampled, which is a common situation in bioinformatics. RESULTS: We demonstrate, via analytical and numerical methods, that classical cross-validation can have strong bias under separate sampling, depending on the difference between the sampling ratios and the true population probabilities. We propose a new separate-sampling cross-validation error estimator, and prove that it satisfies an 'almost unbiased' theorem similar to that of random-sampling cross-validation. We present two case studies with previously published data, which show that the results can change drastically if the correct form of cross-validation is used. AVAILABILITY AND IMPLEMENTATION: The source code in C++, along with the Supplementary Materials, is available at: http://gsp.tamu.edu/Publications/supplementary/zollanvari13/.
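
One way to read the separate-sampling idea in the abstract above is sketched below: estimate each class-conditional error by leaving out points from that class only, then combine the two with the true class prior rather than the sampling ratio. The one-dimensional nearest-mean rule, the function names, and the prior `c0` supplied by the caller are illustrative assumptions; this is not the authors' C++ code and may differ from their exact estimator.

```python
def nearest_mean_rule(x0, x1):
    """Train a 1-D nearest-mean classifier; returns a function mapping x to label 0 or 1."""
    m0 = sum(x0) / len(x0)
    m1 = sum(x1) / len(x1)
    return lambda x: 0 if abs(x - m0) <= abs(x - m1) else 1


def separate_sampling_loo(x0, x1, c0):
    """Leave-one-out error with class errors weighted by the known prior c0 = P(class 0)."""
    miss0 = 0
    for i, x in enumerate(x0):
        clf = nearest_mean_rule(x0[:i] + x0[i + 1:], x1)  # hold out one class-0 point
        miss0 += clf(x) != 0
    miss1 = 0
    for i, x in enumerate(x1):
        clf = nearest_mean_rule(x0, x1[:i] + x1[i + 1:])  # hold out one class-1 point
        miss1 += clf(x) != 1
    err0 = miss0 / len(x0)
    err1 = miss1 / len(x1)
    # Weight by population priors, not by the sample proportions n0/n and n1/n.
    return c0 * err0 + (1 - c0) * err1
```

A pooled leave-one-out estimate implicitly weights the classes by their sample proportions, so it would return the same value whatever the population prior is, whereas the sketch changes with `c0`.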


Subjects
Selection Bias , Humans , Neoplasms/genetics , Oligonucleotide Array Sequence Analysis , Parkinson Disease/genetics , Probability , Transcriptome
13.
BMC Bioinformatics ; 15: 401, 2014 Dec 10.
Article in English | MEDLINE | ID: mdl-25491122

ABSTRACT

BACKGROUND: Sequencing datasets consist of a finite number of reads which map to specific regions of a reference genome. Most effort in modeling these datasets focuses on the detection of univariate differentially expressed genes. However, for classification, we must consider multiple genes and their interactions. RESULTS: Thus, we introduce a hierarchical multivariate Poisson model (MP) and the associated optimal Bayesian classifier (OBC) for classifying samples using sequencing data. Lacking closed-form solutions, we employ a Markov chain Monte Carlo (MCMC) approach to perform classification. We demonstrate superior or equivalent classification performance compared to typical classifiers for two synthetic datasets and over a range of classification problem difficulties. We also introduce the Bayesian minimum mean squared error (MMSE) conditional error estimator and demonstrate its computation over the feature space. In addition, we demonstrate superior or leading class performance over an RNA-Seq dataset containing two lung cancer tumor types from The Cancer Genome Atlas (TCGA). CONCLUSIONS: Through model-based, optimal Bayesian classification, we demonstrate superior classification performance for both synthetic and real RNA-Seq datasets. A tutorial video and Python source code are available under an open source license at http://bit.ly/1gimnss.


Subjects
Bayes Theorem , Computational Biology/methods , High-Throughput Nucleotide Sequencing/methods , Lung Neoplasms/genetics , Statistical Models , RNA/genetics , Adenocarcinoma/genetics , Squamous Cell Carcinoma/genetics , Humans , Markov Chains , Monte Carlo Method
14.
Bioinformatics ; 29(14): 1758-67, 2013 Jul 15.
Article in English | MEDLINE | ID: mdl-23630177

ABSTRACT

MOTIVATION: A basic issue for translational genomics is to model gene interaction via gene regulatory networks (GRNs) and thereby provide an informatics environment to study the effects of intervention (say, via drugs) and to derive effective intervention strategies. Taking the view that the phenotype is characterized by the long-run behavior (steady-state distribution) of the network, we desire interventions to optimally move the probability mass from undesirable to desirable states. Heretofore, two external control approaches have been taken to shift the steady-state mass of a GRN: (i) use a user-defined cost function for which desirable shift of the steady-state mass is a by-product and (ii) use heuristics to design a greedy algorithm. Neither approach provides an optimal control policy relative to long-run behavior. RESULTS: We use a linear programming approach to optimally shift the steady-state mass from undesirable to desirable states, i.e. optimization is directly based on the amount of shift and therefore must outperform previously proposed methods. Moreover, the same basic linear programming structure is used for both unconstrained and constrained optimization, where in the latter case, constraints on the optimization limit the amount of mass that may be shifted to 'ambiguous' states, these being states that are not directly undesirable relative to the pathology of interest but which bear some perceived risk. We apply the method to probabilistic Boolean networks, but the theory applies to any Markovian GRN. AVAILABILITY: Supplementary materials, including the simulation results, MATLAB source code and description of suboptimal methods are available at http://gsp.tamu.edu/Publications/supplementary/yousefi13b. CONTACT: edward@ece.tamu.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Gene Regulatory Networks , Phenotype , Algorithms , Cell Cycle/genetics , Humans , Melanoma/genetics , Melanoma/metabolism , Melanoma/pathology , Neoplasm Metastasis , Probability , Linear Programming
15.
Bioessays ; 34(4): 277-9, 2012 Apr.
Article in English | MEDLINE | ID: mdl-22337590

ABSTRACT

Is a two-fold approach - preliminary studies based on small samples followed by a large-sample study to check reproducibility - in the search for biomarkers really prudent?


Subjects
Biomarkers , Humans , Reproducibility of Results
16.
Pattern Recognit ; 47(6): 2178-2192, 2014 Jun 01.
Article in English | MEDLINE | ID: mdl-24729636

ABSTRACT

The most important aspect of any classifier is its error rate, because this quantifies its predictive capacity. Thus, the accuracy of error estimation is critical. Error estimation is problematic in small-sample classifier design because the error must be estimated using the same data from which the classifier has been designed. Use of prior knowledge, in the form of a prior distribution on an uncertainty class of feature-label distributions to which the true, but unknown, feature-label distribution belongs, can facilitate accurate error estimation (in the mean-square sense) in circumstances where accurate completely model-free error estimation is impossible. This paper provides analytic asymptotically exact finite-sample approximations for various performance metrics of the resulting Bayesian Minimum Mean-Square-Error (MMSE) error estimator in the case of linear discriminant analysis (LDA) in the multivariate Gaussian model. These performance metrics include the first, second, and cross moments of the Bayesian MMSE error estimator with the true error of LDA, and therefore, the Root-Mean-Square (RMS) error of the estimator. We lay down the theoretical groundwork for Kolmogorov double-asymptotics in a Bayesian setting, which enables us to derive asymptotic expressions of the desired performance metrics. From these we produce analytic finite-sample approximations and demonstrate their accuracy via numerical examples. Various examples illustrate the behavior of these approximations and their use in determining the necessary sample size to achieve a desired RMS. The Supplementary Material contains derivations for some equations and added figures.

17.
Front Bioinform ; 4: 1280971, 2024.
Article in English | MEDLINE | ID: mdl-38812660

ABSTRACT

Radiation exposure poses a significant threat to human health. Emerging research indicates that even low-dose radiation, once believed to be safe, may have harmful effects. This perception has spurred a growing interest in investigating the potential risks associated with low-dose radiation exposure across various scenarios. To comprehensively explore the health consequences of low-dose radiation, our study employs a robust statistical framework that examines whether specific groups of genes, belonging to known pathways, exhibit coordinated expression patterns that align with the radiation levels. Notably, our findings reveal the existence of intricate yet consistent signatures that reflect the molecular response to radiation exposure, distinguishing between low-dose and high-dose radiation. Moreover, we leverage a pathway-constrained variational autoencoder to capture the nonlinear interactions within gene expression data. By comparing these two analytical approaches, our study aims to gain valuable insights into the impact of low-dose radiation on gene expression patterns, identify pathways that are differentially affected, and harness the potential of machine learning to uncover hidden activity within biological networks. This comparative analysis contributes to a deeper understanding of the molecular consequences of low-dose radiation exposure.

18.
BMC Bioinformatics ; 14: 307, 2013 Oct 11.
Article in English | MEDLINE | ID: mdl-24118904

ABSTRACT

BACKGROUND: A key goal of systems biology and translational genomics is to utilize high-throughput measurements of cellular states to develop expression-based classifiers for discriminating among different phenotypes. Recent developments of Next Generation Sequencing (NGS) technologies can facilitate classifier design by providing expression measurements for tens of thousands of genes simultaneously via the abundance of their mRNA transcripts. Because NGS technologies result in a nonlinear transformation of the actual expression distributions, their application can result in data that are less discriminative than would be the actual expression levels themselves, were they directly observable. RESULTS: Using state-of-the-art distributional modeling for the NGS processing pipeline, this paper studies how that pipeline, via the resulting nonlinear transformation, affects classification and feature selection. The effects of different factors are considered and NGS-based classification is compared to SAGE-based classification and classification directly on the raw expression data, which is represented by a very high-dimensional model previously developed for gene expression. As expected, the nonlinear transformation resulting from NGS processing diminishes classification accuracy; however, owing to a larger number of reads, NGS-based classification outperforms SAGE-based classification. CONCLUSIONS: Having high numbers of reads can mitigate the degradation in classification performance resulting from the effects of NGS technologies. Hence, when performing an RNA-Seq analysis, using the highest possible coverage of the genome is recommended for the purposes of classification.


Subjects
Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Systems Biology/methods , Gene Expression Profiling , Genome/genetics , Genetic Models , Messenger RNA/analysis , Messenger RNA/genetics , Messenger RNA/metabolism
19.
Brief Bioinform ; 12(3): 245-52, 2011 May.
Article in English | MEDLINE | ID: mdl-21183477

ABSTRACT

Gene regulatory network models are a major area of study in systems and computational biology, and the construction of network models is among the most important problems in these disciplines. The critical epistemological issue concerns validation. Validity can be approached from two different perspectives: (i) given a hypothesized network model, its scientific validity relates to the ability to make predictions from the model that can be checked against experimental observations; and (ii) the validity of a network inference procedure must be evaluated relative to its ability to infer a network from sample points generated by the network. This article examines both perspectives in the framework of a distance function between two networks. It considers some of the obstacles to validation and provides examples of both validation paradigms.


Subjects
Computational Biology/methods , Gene Regulatory Networks , Algorithms , Systems Biology/methods
20.
Bioinformatics ; 28(21): 2824-33, 2012 Nov 01.
Article in English | MEDLINE | ID: mdl-22954625

ABSTRACT

MOTIVATION: A common practice in biomarker discovery is to decide whether a large laboratory experiment should be carried out based on the results of a preliminary study on a small set of specimens. Consideration of the efficacy of this approach motivates the introduction of a probabilistic measure, for whether a classifier showing promising results in a small-sample preliminary study will perform similarly on a large independent sample. Given the error estimate from the preliminary study, if the probability of reproducible error is low, then there is really no purpose in substantially allocating more resources to a large follow-on study. Indeed, if the probability of the preliminary study providing likely reproducible results is small, then why even perform the preliminary study? RESULTS: This article introduces a reproducibility index for classification, measuring the probability that a sufficiently small error estimate on a small sample will motivate a large follow-on study. We provide a simulation study based on synthetic distribution models that possess known intrinsic classification difficulties and emulate real-world scenarios. We also set up similar simulations on four real datasets to show the consistency of results. The reproducibility indices for different distributional models, real datasets and classification schemes are empirically calculated. The effects of reporting and multiple-rule biases on the reproducibility index are also analyzed. AVAILABILITY: We have implemented in C code the synthetic data distribution model, classification rules, feature selection routine and error estimation methods. The source code is available at http://gsp.tamu.edu/Publications/supplementary/yousefi12a/.


Subjects
Algorithms , Biomarkers/analysis , Gene Expression Profiling/methods , Statistical Models , Software , Bias , Follow-Up Studies , Genetic Markers , Humans , Automated Pattern Recognition/methods , Precision Medicine/methods , Probability , Regression Analysis , Reproducibility of Results , Sample Size