Results 1 - 11 of 11
1.
BMC Bioinformatics; 21(1): 156, 2020 Apr 25.
Article in English | MEDLINE | ID: mdl-32334509

ABSTRACT

BACKGROUND: Binary classification rules based on a small sample of high-dimensional data (for instance, gene expression data) are ubiquitous in modern bioinformatics. Constructing such classifiers is challenging due to (a) the complex nature of underlying biological traits, such as gene interactions, and (b) the need for highly interpretable glass-box models. We use the theory of high-dimensional model representation (HDMR) to build interpretable low-dimensional approximations of the log-likelihood ratio accounting for the effects of each individual gene as well as gene-gene interactions. We propose two algorithms approximating the second-order HDMR expansion, and a hypothesis test based on the HDMR formulation to identify significantly dysregulated pairwise interactions. The theory is flexible and requires only a mild set of assumptions. RESULTS: We apply our approach to gene expression data from both synthetic and real (breast and lung cancer) datasets, comparing it against several popular state-of-the-art methods. The analyses suggest the proposed algorithms can be used to obtain interpretable prediction rules with high prediction accuracy and to successfully extract significantly dysregulated gene-gene interactions from the data. They also compare favorably against their competitors across multiple synthetic data scenarios. CONCLUSION: The proposed HDMR-based approach appears to produce a reliable classifier that additionally allows one to describe how individual genes or gene-gene interactions affect classification decisions. Both real and synthetic data analyses suggest that our methods can be used to identify gene networks with dysregulated pairwise interactions, and are therefore appropriate for differential network analysis.
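To make the HDMR construction concrete, the display below sketches a generic second-order expansion of the log-likelihood ratio; the notation is illustrative and not reproduced from the paper.

\[
\log \frac{p(\mathbf{x} \mid y = 1)}{p(\mathbf{x} \mid y = 0)} \;\approx\; g_0 + \sum_{i} g_i(x_i) + \sum_{i < j} g_{ij}(x_i, x_j)
\]

Here \(g_0\) is a constant, each \(g_i\) captures the marginal effect of gene \(i\), and each \(g_{ij}\) captures the residual pairwise interaction between genes \(i\) and \(j\) after the first-order terms are accounted for. A classifier thresholds this approximate log-likelihood ratio, and a test for a significantly dysregulated pair asks whether \(g_{ij}\) differs from zero.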


Subjects
Models, Theoretical; Algorithms; Area Under Curve; Breast Neoplasms/metabolism; Breast Neoplasms/pathology; Databases, Genetic; Female; Gene Expression Regulation, Neoplastic; Humans; Lung Neoplasms/metabolism; Lung Neoplasms/pathology; ROC Curve
2.
BMC Bioinformatics; 19(Suppl 3): 70, 2018 Mar 21.
Article in English | MEDLINE | ID: mdl-29589558

ABSTRACT

BACKGROUND: Many bioinformatics studies aim to identify markers, or features, that can be used to discriminate between distinct groups. In problems where strong individual markers are not available, or where interactions between gene products are of primary interest, it may be necessary to consider combinations of features as a marker family. To this end, recent work proposes a hierarchical Bayesian framework for feature selection that places a prior on the set of features we wish to select and on the label-conditioned feature distribution. While an analytical posterior under Gaussian models with block covariance structures is available, the optimal feature selection algorithm for this model remains intractable, since it requires evaluating the posterior over the space of all possible covariance block structures and feature-block assignments. To address this computational barrier, in prior work we proposed a simple suboptimal algorithm, 2MNC-Robust, with robust performance across the space of block structures. Here, we present three new heuristic feature selection algorithms. RESULTS: The proposed algorithms outperform 2MNC-Robust and many other popular feature selection algorithms on synthetic data. In addition, enrichment analysis on real breast cancer, colon cancer, and leukemia data indicates that they also recover many of the genes and pathways linked to the cancers under study. CONCLUSIONS: Bayesian feature selection is a promising framework for small-sample high-dimensional data, in particular for biomarker discovery applications. When applied to cancer data, these algorithms identified many genes already shown to be involved in cancer as well as potentially new biomarkers. Furthermore, one of the proposed algorithms, SPM, outputs blocks of heavily correlated genes, which is particularly useful for studying gene interactions and gene networks.
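As a rough schematic of the framework being approximated (generic notation, not taken from the paper), the posterior over a candidate feature set \(F\) combines a prior on \(F\) with the marginal likelihood of the labeled sample \(S\) under the label-conditioned, block-structured Gaussian model:

\[
\pi(F \mid S) \;\propto\; \pi(F) \int p(S \mid F, \theta)\, \pi(\theta \mid F)\, d\theta
\]

Exact selection requires evaluating this posterior over every covariance block structure and feature-block assignment, which is the computational barrier that motivates heuristics such as 2MNC-Robust and the three algorithms proposed here.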


Subjects
Algorithms; Heuristics; Bayes Theorem; Humans; Neoplasms/genetics; Normal Distribution
3.
Bioinformatics; 27(13): 1822-31, 2011 Jul 01.
Article in English | MEDLINE | ID: mdl-21551140

ABSTRACT

MOTIVATION: With the development of high-throughput genomic and proteomic technologies, coupled with the inherent difficulties in obtaining large samples, biomedicine faces difficult small-sample classification issues, in particular error estimation. Most popular error estimation methods are motivated by intuition rather than mathematical inference. A recently proposed error estimator based on Bayesian minimum mean square error estimation places error estimation in an optimal filtering framework. In this work, we examine the application of this error estimator to gene expression microarray data, including the suitability of the Gaussian model with normal-inverse-Wishart priors and how to find prior probabilities. RESULTS: We provide an implementation for non-linear classification, where closed-form solutions are not available. We propose a methodology for calibrating normal-inverse-Wishart priors based on discarded microarray data and examine the performance on synthetic high-dimensional data and a real dataset from a breast cancer study. The calibrated Bayesian error estimator has superior root mean square performance, especially with moderate to high expected true errors and small feature sizes. AVAILABILITY: We have implemented the Bayesian error estimator in C for Gaussian distributions with normal-inverse-Wishart priors, both for linear classifiers, where exact closed-form representations are available, and for arbitrary classifiers, where we use a Monte Carlo approximation. Our code for the Bayesian error estimator and a toolbox of related utilities are available at http://gsp.tamu.edu/Publications/supplementary/dalton11a. Several supporting simulations are also included. CONTACT: ldalton@tamu.edu
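The Monte Carlo branch of such an estimator is easy to sketch in code. The snippet below is only an illustration of the idea under stated assumptions (equal class priors, user-supplied posterior samplers standing in for the normal-inverse-Wishart posteriors, and a classifier returning 0/1 labels); it is not the authors' C implementation, and the function names are hypothetical.

```python
import numpy as np
from scipy.stats import invwishart

def mc_bayesian_error(classifier, sample_post0, sample_post1,
                      n_mc=2000, n_test=500, seed=0):
    """Monte Carlo sketch of a Bayesian error estimate for an arbitrary classifier.

    sample_post0/sample_post1 are callables returning one posterior draw
    (mean, covariance) of each class-conditional Gaussian; they are hypothetical
    stand-ins for normal-inverse-Wishart posteriors.
    """
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_mc):
        m0, c0 = sample_post0(rng)
        m1, c1 = sample_post1(rng)
        x0 = rng.multivariate_normal(m0, c0, size=n_test)
        x1 = rng.multivariate_normal(m1, c1, size=n_test)
        err0 = np.mean(classifier(x0) != 0)      # class-0 points mislabeled
        err1 = np.mean(classifier(x1) != 1)      # class-1 points mislabeled
        errors.append(0.5 * (err0 + err1))       # equal class priors assumed
    return float(np.mean(errors))

def niw_posterior_sampler(m, kappa, Psi, nu):
    """One way to draw (mean, covariance) from a normal-inverse-Wishart posterior."""
    def draw(rng):
        cov = np.atleast_2d(invwishart.rvs(df=nu, scale=Psi, random_state=rng))
        mean = rng.multivariate_normal(np.atleast_1d(m), cov / kappa)
        return mean, cov
    return draw
```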


Subjects
Bayes Theorem; Computational Biology/methods; Gene Expression Profiling/methods; Oligonucleotide Array Sequence Analysis/methods; Breast Neoplasms/genetics; Female; Humans; Normal Distribution; Probability
4.
Article in English | MEDLINE | ID: mdl-30040658

ABSTRACT

Optimal Bayesian feature filtering (OBF) is a fast and memory-efficient algorithm that optimally identifies markers with distributional differences between treatment groups under Gaussian models. Here, we study the performance and robustness of OBF for biomarker discovery. Our contributions are twofold: (1) we examine how OBF performs on data that violates its modeling assumptions, and (2) we provide guidelines on how to set input parameters for robust performance. Contribution (1) addresses an important, relevant, and commonplace problem in computational biology, where it is often impossible to validate an algorithm's core assumptions. To accomplish both tasks, we present a battery of simulations that implement OBF with different inputs and challenge each assumption made by OBF. In particular, we examine the robustness of OBF with respect to incorrect input parameters, false assumptions of independence, and imbalanced sample sizes, and we address the Gaussianity assumption by considering performance on an extensive family of non-Gaussian distributions. We discuss the advantages and disadvantages of different priors and optimization criteria throughout. Finally, we evaluate the utility of OBF in biomarker discovery using acute myeloid leukemia (AML) and colon cancer microarray datasets, and show that OBF successfully identifies well-known biomarkers for these diseases that rank low under a moderated t-test.
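As a simplified illustration of the kind of per-feature score a Bayesian filter computes (not OBF's exact criterion), the sketch below assumes each feature is Gaussian with known variance and a Gaussian prior on the group means, and returns the posterior probability that the two groups have different means. The function and parameter names are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import expit

def posterior_prob_different(x0, x1, sigma2=1.0, s2=10.0, prior_diff=0.5):
    """Posterior probability that one feature differs between two groups.

    Simplified model: observations are N(mu, sigma2) with known variance sigma2,
    and each unknown mean has a N(0, s2) prior. Under H0 the groups share one
    mean; under H1 each group has its own mean.
    """
    def log_marginal(y):
        n = len(y)
        cov = sigma2 * np.eye(n) + s2 * np.ones((n, n))   # mean integrated out
        return multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)

    log_m1 = log_marginal(np.asarray(x0)) + log_marginal(np.asarray(x1))  # separate means
    log_m0 = log_marginal(np.concatenate([x0, x1]))                       # shared mean
    log_bayes_factor = log_m1 - log_m0
    log_prior_odds = np.log(prior_diff) - np.log1p(-prior_diff)
    return float(expit(log_bayes_factor + log_prior_odds))

# Features would be ranked by this score (or thresholded) to select biomarkers.
```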


Subjects
Bayes Theorem; Biomarkers; Computational Biology/methods; Algorithms; Databases, Factual; Humans; Neoplasms/diagnosis; Neoplasms/metabolism
5.
BMC Med Genomics; 13(Suppl 9): 133, 2020 Sep 21.
Article in English | MEDLINE | ID: mdl-32957998

ABSTRACT

BACKGROUND: Developing binary classification rules based on SNP observations has been a major challenge for many modern bioinformatics applications, e.g., predicting risk of future disease events in complex conditions such as cancer. The small-sample, high-dimensional nature of SNP data, the weak effect of each SNP on the outcome, and highly non-linear SNP interactions are several key factors complicating the analysis. Additionally, SNPs take a finite number of values, which may be best understood as ordinal or categorical variables but are treated as continuous by many algorithms. METHODS: We use the theory of high-dimensional model representation (HDMR) to build appropriate low-dimensional glass-box models, allowing us to account for the effects of feature interactions. We compute the second-order HDMR expansion of the log-likelihood ratio to account for the effects of single SNPs and their pairwise interactions. We propose a regression-based approach, called linear approximation for block second order HDMR expansion of categorical observations (LABS-HDMR-CO), to approximate the HDMR coefficients. We show how HDMR can be used to detect pairwise SNP interactions, and propose the fixed pattern test (FPT) to identify statistically significant pairwise interactions. RESULTS: We apply LABS-HDMR-CO and FPT to synthetically generated HAPGEN2 data as well as to two GWAS cancer datasets. In these examples LABS-HDMR-CO enjoys superior accuracy compared with several algorithms used for SNP classification, while also taking pairwise interactions into account. FPT declares very few significant interactions in the small-sample GWAS datasets when bounding the false discovery rate (FDR) at 5%, due to the large number of tests performed. On the other hand, LABS-HDMR-CO utilizes a large number of SNP pairs to improve its prediction accuracy. In the larger HAPGEN2 dataset, FPT declares a larger portion of the SNP pairs used by LABS-HDMR-CO to be significant. CONCLUSION: LABS-HDMR-CO and FPT are interesting methods to design prediction rules and detect pairwise feature interactions for SNP data. Reliably detecting pairwise SNP interactions and taking advantage of potential interactions to improve prediction accuracy are two different objectives addressed by these methods. While the large number of potential SNP interactions may result in low power of detection, potentially interacting SNP pairs, of which many might be false alarms, can still be used to improve prediction accuracy.
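The FPT statistic itself is not reproduced here, but the FDR-controlled selection step that any such pairwise test feeds into is standard. The sketch below is a generic Benjamini-Hochberg procedure applied to a vector of pairwise-interaction p-values; variable names are illustrative, and the paper does not state which FDR procedure it uses.

```python
import numpy as np

def benjamini_hochberg(pvalues, fdr=0.05):
    """Return a boolean mask of hypotheses rejected at the given FDR level."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                          # sort p-values ascending
    thresholds = fdr * np.arange(1, m + 1) / m     # BH step-up thresholds
    passed = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])          # largest index meeting its threshold
        rejected[order[:k + 1]] = True             # reject all smaller p-values too
    return rejected

# pvalues would hold one entry per tested SNP pair; rejected marks significant pairs.
```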


Subjects
Computational Biology/methods; Polymorphism, Single Nucleotide; Algorithms; Genome-Wide Association Study; Likelihood Functions
6.
PLoS One; 13(10): e0204627, 2018.
Article in English | MEDLINE | ID: mdl-30278063

ABSTRACT

Classical clustering algorithms typically either lack an underlying probability framework to make them predictive or focus on parameter estimation rather than defining and minimizing a notion of error. Recent work addresses these issues by developing a probabilistic framework based on the theory of random labeled point processes and characterizing a Bayes clusterer that minimizes the number of misclustered points. The Bayes clusterer is analogous to the Bayes classifier. Whereas determining a Bayes classifier requires full knowledge of the feature-label distribution, deriving a Bayes clusterer requires full knowledge of the point process. When uncertain of the point process, one would like to find a robust clusterer that is optimal over the uncertainty, just as one may find optimal robust classifiers under uncertain feature-label distributions. Herein, we derive an optimal robust clusterer by first finding an effective random point process that incorporates all randomness within its own probabilistic structure; the Bayes clusterer derived from this effective process is then optimal relative to the uncertainty. This is analogous to the use of effective class-conditional distributions in robust classification. After evaluating the performance of robust clusterers on synthetic Gaussian mixture models, we apply the framework to granular imaging, where we use the asymptotic granulometric moment theory for granular images to relate robust clustering theory to the application.
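The analogy with robust classification can be written compactly; the notation below is generic and only meant to illustrate the construction.

\[
P_{\mathrm{eff}}(\cdot) \;=\; \int_{\Theta} P_{\theta}(\cdot)\, \pi(\theta)\, d\theta,
\qquad
\psi_{\mathrm{robust}} \;=\; \arg\min_{\psi}\; \mathbb{E}_{P_{\mathrm{eff}}}\big[\,\text{number of points misclustered by } \psi\,\big]
\]

That is, the uncertainty class \(\{P_\theta\}\) of labeled point processes is averaged under the prior \(\pi\) into a single effective process, and the robust clusterer is the Bayes clusterer of that effective process, just as a robust classifier is the Bayes classifier of the effective class-conditional densities.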


Subjects
Cluster Analysis; Pattern Recognition, Automated/methods; Algorithms; Bayes Theorem; Computer Simulation; Probability; Uncertainty
7.
Cancer Inform; 17: 1176935118760944, 2018.
Article in English | MEDLINE | ID: mdl-29531471

ABSTRACT

Cancer is a systems disease involving mutations and altered regulation. This supplement treats cancer research as it pertains to 3 systems issues of an inherently statistical nature: regulatory modeling and information processing, diagnostic classification, and therapeutic intervention and control. Topics of interest include (but are not limited to) multiscale modeling, gene/protein transcriptional regulation, dynamical systems, pharmacokinetic/pharmacodynamic modeling, compensatory regulation, feedback, apoptotic and proliferative control, copy number-expression interaction, integration of different feature types, error estimation, and reproducibility. We are especially interested in how the above issues relate to the extremely high-dimensional data sets and small- to moderate-sized data sets typically involved in cancer research, for instance, their effect on statistical power, inference accuracy, and multiple comparisons.

8.
Cancer Inform; 14(Suppl 5): 123-38, 2015.
Article in English | MEDLINE | ID: mdl-27127404

ABSTRACT

Cancer prognosis prediction is typically carried out without integrating scientific knowledge available on genomic pathways, the effect of drugs on cell dynamics, or modeling mutations in the population. Recent work addresses some of these problems by formulating an uncertainty class of Boolean regulatory models for abnormal gene regulation, assigning prognosis scores to each network based on intervention outcomes, and partitioning networks in the uncertainty class into prognosis classes based on these scores. For a new patient, the probability distribution of the prognosis class was evaluated using optimal Bayesian classification, given patient data. It was assumed that (1) disease is the result of several mutations of a known healthy network and that these mutations and their probability distribution in the population are known, and (2) only a single snapshot of the patient's gene activity profile is observed. It was shown that, even in ideal settings where cancer in the population and the effect of a drug are fully modeled, a single static measurement is typically not sufficient. Here, we study what measurements are sufficient to predict prognosis. In particular, we relax assumption (1) by addressing how population data may be used to estimate network probabilities, and extend assumption (2) to include static and time-series measurements of both population and patient data. Furthermore, we extend the prediction of prognosis classes to optimal Bayesian regression of prognosis metrics. Although time-series data is preferable for inferring a stochastic dynamical network, we show that static data can be superior for prognosis prediction when constrained to small samples. Furthermore, although population data is helpful, performance is not sensitive to inaccuracies in the estimated network probabilities.
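Schematically (generic notation, not the paper's formulation), the optimal Bayesian classification step assigns a new patient to prognosis classes according to

\[
P(c \mid \mathbf{x}) \;\propto\; \sum_{\theta \in \Theta_c} P(\mathbf{x} \mid \theta)\, P(\theta)
\]

where \(\Theta_c\) is the set of networks in the uncertainty class assigned to prognosis class \(c\), \(P(\theta)\) is the (possibly estimated) population probability of network \(\theta\), and \(P(\mathbf{x} \mid \theta)\) is the likelihood of the static or time-series measurement under that network; optimal Bayesian regression replaces the class posterior with the posterior expectation of the prognosis metric itself.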

9.
EURASIP J Bioinform Syst Biol; 2015(1): 8, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26523135

ABSTRACT

A recently proposed optimal Bayesian classification paradigm addresses optimal error rate analysis for small-sample discrimination, including optimal classifiers, optimal error estimators, and error estimation analysis tools with respect to the probability of misclassification under binary classes. Here, we address multi-class problems and optimal expected risk with respect to a given risk function, which are common settings in bioinformatics. We present Bayesian risk estimators (BRE) under arbitrary classifiers, the mean-square error (MSE) of arbitrary risk estimators under arbitrary classifiers, and optimal Bayesian risk classifiers (OBRC). We provide analytic expressions for these tools under several discrete and Gaussian models and present a new methodology to approximate the BRE and MSE when analytic expressions are not available. Of particular note, we present analytic forms for the MSE under Gaussian models with homoscedastic covariances, which are new even in binary classification.
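In schematic form (generic notation), the quantities introduced here are

\[
\widehat{R}(\psi) \;=\; \mathbb{E}\big[\, R_{\theta}(\psi) \mid S \,\big],
\qquad
\psi_{\mathrm{OBRC}}(\mathbf{x}) \;=\; \arg\min_{y}\; \mathbb{E}\big[\, c(y, Y) \mid \mathbf{x}, S \,\big]
\]

where \(R_{\theta}(\psi)\) is the true risk of classifier \(\psi\) under the feature-label distribution indexed by \(\theta\), \(S\) is the observed sample, and \(c(y, Y)\) is the cost of predicting class \(y\) when the true class is \(Y\). The MSE analysis then concerns \(\mathbb{E}[(\widehat{R} - R_{\theta})^2 \mid S]\) for an arbitrary risk estimator \(\widehat{R}\) and classifier.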

10.
EURASIP J Bioinform Syst Biol; 2015: 1, 2015 Dec.
Article in English | MEDLINE | ID: mdl-28194170

ABSTRACT

Typically, a vast amount of experience and data is needed to successfully determine cancer prognosis in the face of (1) the inherent stochasticity of cell dynamics, (2) incomplete knowledge of healthy cell regulation, and (3) the inherently uncertain and evolving nature of cancer progression. There is hope that models of cell regulation could be used to predict disease progression and successful treatment strategies, but there has been little work focusing on the third source of uncertainty above. In this work, we investigate the impact of this kind of network uncertainty on predicting cancer prognosis. In particular, we focus on a scenario in which the precise aberrant regulatory relationships between genes in a patient are unknown, but the patient gene regulatory network is contained in an uncertainty class of possible mutations of some known healthy network. We optimistically assume that the probabilities of these abnormal networks are available, along with the best treatment for each network. Then, given a snapshot of the patient gene activity profile at a single moment in time, we study what can be said regarding the patient's treatability and prognosis. Our methodology is based on recent developments in optimal control strategies for probabilistic Boolean networks and optimal Bayesian classification. We show that in some circumstances, prognosis prediction may be highly unreliable, even in this optimistic setting with perfect knowledge of healthy biological processes and ideal treatment decisions.

11.
EURASIP J Bioinform Syst Biol; 2013(1): 10, 2013 Aug 20.
Article in English | MEDLINE | ID: mdl-23958425

ABSTRACT

A typical small-sample biomarker classification paper discriminates between types of pathology based on, say, 30,000 genes and a small labeled sample of fewer than 100 points. Some classification rule is used to design the classifier from this data, but we are given no good reason or conditions under which this algorithm should perform well. An error estimation rule is used to estimate the classification error on the population using the same data, but once again we are given no good reason or conditions under which this error estimator should produce a good estimate, and thus we do not know how well the classifier should be expected to perform. In fact, in virtually all such papers the error estimate is expected to be highly inaccurate. In short, we are given no justification for any claims. Given the ubiquity of vacuous small-sample classification papers in the literature, one could easily conclude that scientific knowledge is impossible in small-sample settings. It is not that thousands of papers overtly claim that scientific knowledge is impossible in regard to their content; rather, it is that they utilize methods that preclude scientific knowledge. In this paper, we argue to the contrary that scientific knowledge in small-sample classification is possible provided there is sufficient prior knowledge. A natural way to proceed, discussed herein, is via a paradigm for pattern recognition in which we incorporate prior knowledge into the whole classification procedure (classifier design and error estimation), optimize each step of the procedure given available information, and obtain theoretical measures of performance for both classifiers and error estimators, the latter being the critical epistemological issue. In sum, we can achieve scientific validation for a proposed small-sample classifier and its error estimate.
