Results 1 - 7 of 7
1.
J Biomed Inform ; 113: 103625, 2021 01.
Article in English | MEDLINE | ID: mdl-33221467

ABSTRACT

OBJECTIVE: To develop and evaluate methods for assessing the impact of single and grouped variables on the measurement of intervention severity, and to support the search for the most expressive variables. METHODS: Datasets from cohort studies are analyzed automatically by algorithms. To this end, a metric is developed that compares measured variables across cohorts in a data-mining process. Variables are evaluated in all possible combinations to detect synergies within particular variable constellations and to rank the combinations by expressiveness. This ranking serves as a basis for a wide range of algorithmic data analyses. In an exemplary application, in addition to the total expressiveness of each variable group, every group member's impact on the total result is determined according to the principles of cooperative game theory. RESULTS: For different types of interventions, the method is applied to experimental data containing multiple recorded medical lab values. The expressiveness of variable combinations as indicators of severity is ranked by means of the metric. Within each combination, each variable's contribution to the total effect is determined and accumulated over whole datasets to yield local and global variable importance measures. The computed results were successfully matched with clinical expectations, supporting their plausibility. CONCLUSION: Algorithmic evaluation appears to be a promising approach to the automated quantification of variable expressiveness. It can assess the descriptive power of measurements, help improve future study designs, and expose worthwhile research questions.
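The per-variable attribution described above follows the idea of the Shapley value from cooperative game theory. The following is a minimal sketch of that idea, not the authors' implementation: the function name `shapley_values`, the variable names `a`, `b`, `c`, and the scores in `toy_scores` are all hypothetical, and the expressiveness metric itself is assumed to be supplied as a function of a variable group.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values.

    `value` maps a frozenset of players (a variable group) to a score.
    Enumerates every coalition, so this is only feasible for small groups.
    """
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Probability weight of a coalition of size k preceding player p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                # Marginal contribution of p when joining coalition s.
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result


# Hypothetical expressiveness scores: "a" and "b" are synergistic, "c" adds nothing.
toy_scores = {
    frozenset(): 0.0,
    frozenset("a"): 1.0,
    frozenset("b"): 2.0,
    frozenset("c"): 0.0,
    frozenset("ab"): 4.0,
    frozenset("ac"): 1.0,
    frozenset("bc"): 2.0,
    frozenset("abc"): 4.0,
}

phi = shapley_values(["a", "b", "c"], lambda group: toy_scores[group])
```

In this toy game the synergy of 1.0 between `a` and `b` is split evenly, so the attributions come out at roughly 1.5, 2.5, and 0.0, and they sum to the score of the full group.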


Subjects
Data Mining , Game Theory , Algorithms , Humans
2.
Biom J ; 62(3): 670-687, 2020 05.
Article in English | MEDLINE | ID: mdl-31099917

ABSTRACT

Uncertainty is a crucial issue in statistics which can be considered from different points of view. One type of uncertainty, typically referred to as sampling uncertainty, arises through the variability of results obtained when the same analysis strategy is applied to different samples. Another type of uncertainty arises through the variability of results obtained when using the same sample but different analysis strategies addressing the same research question. We denote this latter type of uncertainty as method uncertainty. It results from all the choices to be made for an analysis, for example, decisions related to data preparation, method choice, or model selection. In medical sciences, a large part of omics research is focused on the identification of molecular biomarkers, which can either be performed through ranking or by selection from among a large number of candidates. In this paper, we introduce a general resampling-based framework to quantify and compare sampling and method uncertainty. For illustration, we apply this framework to different scenarios related to the selection and ranking of omics biomarkers in the context of acute myeloid leukemia: variable selection in multivariable regression using different types of omics markers, the ranking of biomarkers according to their predictive performance, and the identification of differentially expressed genes from RNA-seq data. For all three scenarios, our findings suggest highly unstable results when the same analysis strategy is applied to two independent samples, indicating high sampling uncertainty and a comparatively smaller, but non-negligible method uncertainty, which strongly depends on the methods being compared.
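The contrast between the two kinds of uncertainty can be illustrated with a small resampling loop. This is a rough sketch under stated assumptions, not the framework from the paper: the two ranking rules, the Jaccard-based instability score, and all function names are hypothetical stand-ins, and a binary 0/1 outcome `y` is assumed.

```python
import numpy as np


def top_k_by_correlation(X, y, k):
    """Indices of the k features with the largest absolute Pearson correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    return set(np.argsort(-np.abs(Xc.T @ yc / denom))[:k].tolist())


def top_k_by_mean_difference(X, y, k):
    """Indices of the k features with the largest absolute difference in class means."""
    diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return set(np.argsort(-np.abs(diff))[:k].tolist())


def jaccard(a, b):
    return len(a & b) / len(a | b)


def instability_sketch(X, y, k=10, n_rep=50, seed=0):
    """Return (sampling instability, method instability), each in [0, 1]."""
    rng = np.random.default_rng(seed)
    n = len(y)
    sampling, method = [], []
    for _ in range(n_rep):
        perm = rng.permutation(n)
        half_a, half_b = perm[: n // 2], perm[n // 2:]
        sel_a = top_k_by_correlation(X[half_a], y[half_a], k)
        # Same method, two disjoint samples: sampling uncertainty.
        sel_b = top_k_by_correlation(X[half_b], y[half_b], k)
        sampling.append(jaccard(sel_a, sel_b))
        # Same sample, two methods: method uncertainty.
        sel_alt = top_k_by_mean_difference(X[half_a], y[half_a], k)
        method.append(jaccard(sel_a, sel_alt))
    return 1.0 - float(np.mean(sampling)), 1.0 - float(np.mean(method))
```

Higher values mean less agreement between the selected sets. Splitting into disjoint halves mimics independent samples; other resampling schemes would be equally defensible here.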


Subjects
Biometry/methods , Computational Biology , Uncertainty , Biomarkers/metabolism , Gene Expression Profiling , Humans , Acute Myeloid Leukemia/genetics , Acute Myeloid Leukemia/metabolism
3.
Proteomics ; 16(11-12): 1731-5, 2016 06.
Article in English | MEDLINE | ID: mdl-27028088

ABSTRACT

Applying MALDI-MS imaging to tissue microarrays (TMAs) provides access to proteomics data from large cohorts of patients in a cost- and time-efficient way, and opens the potential for applying this technology in clinical diagnosis. The complexity of these TMA data (high dimension, low sample size) poses challenges for statistical analysis, as classical methods typically require a nonsingular covariance matrix, a requirement that cannot be satisfied if the dimension is greater than the sample size. We use TMAs to collect data from endometrial primary carcinomas from 43 patients. Each patient has a lymph node metastasis (LNM) status of positive or negative, which we predict on the basis of the MALDI-MS imaging TMA data. We propose a variable selection approach based on canonical correlation analysis that explicitly uses the LNM information. We apply LDA to the selected variables only. Our method misclassifies 2.3-20.9% of patients by leave-one-out cross-validation and strongly outperforms LDA after reduction of the original data with principal component analysis.
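The select-then-classify pipeline with leave-one-out cross-validation can be sketched as follows. This is a simplified illustration, not the method from the paper: variables are ranked by absolute Pearson correlation with the binary label as a stand-in for the CCA-based criterion, the two-class LDA assumes equal priors, and all function names are hypothetical.

```python
import numpy as np


def select_by_correlation(X, y, k):
    """Stand-in selection rule: top-k features by |correlation| with the label."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    return np.argsort(-np.abs(Xc.T @ yc / denom))[:k]


def fit_lda(X, y):
    """Two-class LDA with a pooled covariance estimate and equal priors."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    centered = np.vstack([X[y == 0] - mu0, X[y == 1] - mu1])
    pooled = centered.T @ centered / (len(y) - 2)
    # The pseudo-inverse guards against a (near-)singular covariance estimate.
    w = np.linalg.pinv(pooled) @ (mu1 - mu0)
    threshold = w @ (mu0 + mu1) / 2.0
    return w, threshold


def loocv_error(X, y, k=5):
    """Leave-one-out misclassification rate, with selection redone in every fold."""
    errors = 0
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        cols = select_by_correlation(X[train], y[train], k)
        w, threshold = fit_lda(X[train][:, cols], y[train])
        predicted = int(X[i, cols] @ w > threshold)
        errors += int(predicted != y[i])
    return errors / len(y)
```

Repeating the variable selection inside each fold matters: selecting on all patients first and cross-validating afterwards would leak the held-out label into the selection step and understate the error.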


Subjects
Endometrial Neoplasms/diagnostic imaging , Proteomics/methods , Matrix-Assisted Laser Desorption-Ionization Mass Spectrometry/methods , Tissue Array Analysis/methods , Endometrial Neoplasms/diagnosis , Endometrial Neoplasms/pathology , Female , Humans , Lymphatic Metastasis , Neoplasm Staging , Principal Component Analysis
4.
Commun Stat Theory Methods ; 47(21): 5163-5195, 2018.
Article in English | MEDLINE | ID: mdl-30237653

ABSTRACT

We derive explicit formulas for Sobol's sensitivity indices (SSIs) under the generalized linear models (GLMs) with independent or multivariate normal inputs. We argue that the main-effect SSIs provide a powerful tool for variable selection under GLMs with identity links under polynomial regressions. We also show via examples that the SSI-based variable selection results are similar to the ones obtained by the random forest algorithm but without the computational burden of data permutation. Finally, applying our results to the problem of gene network discovery, we identify through the SSI analysis of a public microarray dataset several novel higher-order gene-gene interactions missed by the more standard inference methods. The relevant functions for SSI analysis derived here under GLMs with identity, log, and logit links are implemented and made available in the R package SobolSensitivity.
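For intuition, the simplest special case can be written down directly: for a noise-free linear model with independent inputs, the main-effect Sobol index of input i is beta_i^2 sigma_i^2 divided by the sum of beta_j^2 sigma_j^2 over all inputs. The sketch below covers only this textbook case, not the general GLM formulas derived in the paper, and it does not use the SobolSensitivity package; the function names are hypothetical, and the Monte Carlo check uses the standard pick-freeze estimator with zero-mean normal inputs as an assumption.

```python
import numpy as np


def linear_main_effect_indices(beta, sigma):
    """Closed-form main-effect indices for y = sum_i beta_i * x_i, independent x_i."""
    beta, sigma = np.asarray(beta, float), np.asarray(sigma, float)
    contribution = (beta * sigma) ** 2
    return contribution / contribution.sum()


def pick_freeze_indices(beta, sigma, n=100_000, seed=0):
    """Monte Carlo check: S_i = Cov(y, y_i') / Var(y), where y_i' reuses only x_i."""
    rng = np.random.default_rng(seed)
    beta, sigma = np.asarray(beta, float), np.asarray(sigma, float)
    X = rng.normal(0.0, sigma, size=(n, len(beta)))
    X_new = rng.normal(0.0, sigma, size=(n, len(beta)))
    y = X @ beta
    estimates = []
    for i in range(len(beta)):
        X_mixed = X_new.copy()
        X_mixed[:, i] = X[:, i]  # freeze input i, resample all others
        y_mixed = X_mixed @ beta
        estimates.append(np.cov(y, y_mixed)[0, 1] / y.var(ddof=1))
    return np.array(estimates)


analytic = linear_main_effect_indices([2.0, 1.0, 0.0], [1.0, 2.0, 3.0])
estimated = pick_freeze_indices([2.0, 1.0, 0.0], [1.0, 2.0, 3.0])
```

Here the first two inputs each explain half of the output variance and the third explains none, so an index near zero flags a variable as a candidate for removal. The closed form avoids the sampling loop entirely.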

5.
Anal Chim Acta ; 911: 27-34, 2016 Mar 10.
Article in English | MEDLINE | ID: mdl-26893083

ABSTRACT

Biomarker discovery is one important goal in metabolomics, which is typically modeled as selecting the most discriminating metabolites for classification and often referred to as variable importance analysis or variable selection. Until now, a number of variable importance analysis methods to discover biomarkers in metabolomics studies have been proposed. However, different methods are most likely to generate different variable ranking results because of their different principles. Each method generates a variable ranking list just as an expert presents an opinion. The problem of inconsistency between different variable ranking methods is often ignored. To address this problem, a simple and ideal solution is to take every ranking into account. In this study, a strategy called rank aggregation was employed. It is an indispensable tool for merging individual ranking lists into a single "super"-list reflective of the overall preference or importance within the population. This "super"-list is regarded as the final ranking for biomarker discovery. Finally, it was used for biomarker discovery and for selecting the best variable subset with the highest predictive classification accuracy. Nine methods were used, including three univariate filtering methods and six multivariate methods. When applied to two metabolic datasets (Childhood overweight dataset and Tubulointerstitial lesions dataset), rank aggregation greatly improved prediction accuracy compared with using all variables. Moreover, it also outperforms the penalized method least absolute shrinkage and selection operator (LASSO), achieving higher prediction accuracy or selecting fewer variables, which makes the result more interpretable.
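Merging several ranking lists into one "super"-list can be sketched with a Borda-style rule that orders variables by their summed positions. The abstract does not state which aggregation algorithm the study used, so this is only one simple possibility; the function name `aggregate_ranks` and the metabolite labels `m1` to `m4` are hypothetical, and every list is assumed to rank the same set of variables.

```python
def aggregate_ranks(rankings):
    """Borda-style aggregation: lower summed position means higher final rank.

    Each ranking is a list of the same items, ordered from most to least important.
    Ties are broken alphabetically so the result is deterministic.
    """
    totals = {item: 0 for item in rankings[0]}
    for ranking in rankings:
        for position, item in enumerate(ranking):
            totals[item] += position
    return sorted(totals, key=lambda item: (totals[item], item))


# Three hypothetical methods that disagree on the order of four metabolites.
rankings = [
    ["m1", "m2", "m3", "m4"],
    ["m2", "m1", "m3", "m4"],
    ["m1", "m3", "m2", "m4"],
]
super_list = aggregate_ranks(rankings)
# super_list == ["m1", "m2", "m3", "m4"]
```

The summed positions are 1, 3, 5, and 9, so `m1` leads even though one method ranked it second. A subset for classification could then be chosen by taking the top entries of the aggregated list and comparing prediction accuracy across subset sizes.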


Subjects
Biomarkers/metabolism , Metabolomics , Case-Control Studies , Child , Gas Chromatography-Mass Spectrometry , Humans , Theoretical Models , Overweight/blood
6.
Ann Appl Stat ; 10(1): 418-450, 2016 Mar.
Article in English | MEDLINE | ID: mdl-27213023

ABSTRACT

We propose a Multiple Imputation Random Lasso (MIRL) method to select important variables and to predict the outcome for an epidemiological study of Eating and Activity in Teens. In this study 80% of individuals have at least one variable missing. Therefore, using variable selection methods developed for complete data after listwise deletion substantially reduces prediction power. Recent work on prediction models in the presence of incomplete data cannot adequately account for large numbers of variables with arbitrary missing patterns. We propose MIRL to combine penalized regression techniques with multiple imputation and stability selection. Extensive simulation studies are conducted to compare MIRL with several alternatives. MIRL outperforms other methods in high-dimensional scenarios in terms of both reduced prediction error and improved variable selection performance, and its advantage is greater when the correlation among variables and the proportion of missing data are both high. MIRL is shown to have improved performance compared with other applicable methods when applied to the study of Eating and Activity in Teens for the boys and girls separately, and to a subgroup of low socioeconomic status (SES) Asian boys who are at high risk of developing obesity.
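The combination of imputation, penalized regression, and stability selection can be illustrated with a selection-frequency loop. This is a heavily simplified sketch, not the MIRL algorithm: missing values are filled by random draws from the observed values of the same column (a crude stand-in for proper multiple imputation), only rows are bootstrapped (the random subsampling of variables in random lasso is omitted), and the function names and default parameters are hypothetical.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=100):
    """Lasso via coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    residual = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            residual += X[:, j] * beta[j]  # remove feature j's current effect
            rho = X[:, j] @ residual / n
            # Soft-thresholding update.
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            residual -= X[:, j] * beta[j]
    return beta


def selection_frequencies(X_missing, y, lam=0.1, n_imputations=5,
                          n_bootstrap=20, seed=0):
    """Fraction of (imputation, bootstrap) fits in which each variable is selected."""
    rng = np.random.default_rng(seed)
    n, p = X_missing.shape
    counts = np.zeros(p)
    for _ in range(n_imputations):
        X = X_missing.copy()
        for j in range(p):
            missing = np.isnan(X[:, j])
            # Assumes every column has at least one observed value.
            X[missing, j] = rng.choice(X[~missing, j], size=int(missing.sum()))
        for _ in range(n_bootstrap):
            rows = rng.integers(0, n, size=n)
            Xb = X[rows] - X[rows].mean(axis=0)
            yb = y[rows] - y[rows].mean()
            counts += lasso_cd(Xb, yb, lam) != 0
    return counts / (n_imputations * n_bootstrap)
```

Variables whose frequency exceeds a chosen threshold would be kept. Unlike listwise deletion, every individual contributes to every fit, which is the point the abstract makes about the 80% of individuals with at least one missing variable.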

7.
Anal Chim Acta ; 876: 39-48, 2015 May 30.
Article in English | MEDLINE | ID: mdl-25998456

ABSTRACT

Variable responses are fundamental for all experiments, and they can consist of information-rich, redundant, and low signal intensities. A dataset can consist of a collection of variable responses over multiple classes or groups. Usually, variables that contain very little information are removed from a dataset. Sometimes all the variables are used in the data analysis phase. It is common practice to discriminate between two distributions of data; however, there is no formal algorithm to arrive at a degree of separation (DS) between two distributions of data. The DS is defined herein as the average of the sum of the areas from the probability density functions (PDFs) of A and B that contain ≥ a given percentage of A and/or B. Thus, DS90 is the average of the sum of the PDF areas of A and B that contain ≥90% of A and/or B. To arrive at a DS value, two synthesized PDFs or very large experimental datasets are required. Experimentally, it is common practice to generate relatively small datasets. Therefore, the challenge was to find a statistical parameter that can be used on small datasets to estimate, and correlate highly with, the DS90 parameter. Established statistical methods include the overlap area of the two data distribution profiles, Welch's t-test, the Kolmogorov-Smirnov (K-S) test, the Mann-Whitney-Wilcoxon test, and the area under the receiver operating characteristic (ROC) curve (AUC). The area between the ROC curve and diagonal (ACD) and the length of the ROC curve (LROC) are introduced. The established, ACD, and LROC methods were correlated to the DS90 when applied to many pairs of synthesized PDFs. The LROC method provided the best linear correlation with, and estimation of, the DS90. The estimated DS90 from the LROC (DS90-LROC) is applied, as an example, to a database of three Italian wines consisting of thirteen variable responses for variable ranking consideration. An important highlight of the DS90-LROC method is its use of the LROC methodology to test all variables one at a time with all pairs of classes in a dataset.
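The three ROC-based quantities named above can be computed from two small samples as sketched below. This is an illustration under assumptions, not the published procedure: the empirical ROC is built by sweeping a threshold with sample `b` treated as the "positive" class, ACD is approximated as the trapezoidal integral of |TPR - FPR|, and the mapping from LROC to an estimated DS90 is not given in the abstract and is therefore left out. The function names are hypothetical.

```python
import numpy as np


def roc_points(a, b):
    """Empirical ROC curve from (0, 0) to (1, 1), sweeping the threshold downward."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    thresholds = np.unique(np.concatenate([a, b]))[::-1]
    fpr = np.array([0.0] + [np.mean(a >= t) for t in thresholds])
    tpr = np.array([0.0] + [np.mean(b >= t) for t in thresholds])
    return fpr, tpr


def roc_summaries(a, b):
    """Return (AUC, ACD, LROC) for two samples of one variable."""
    fpr, tpr = roc_points(a, b)
    dx, dy = np.diff(fpr), np.diff(tpr)
    # Area under the curve by the trapezoidal rule.
    auc = float(np.sum(dx * (tpr[1:] + tpr[:-1]) / 2.0))
    # Area between the curve and the diagonal (approximate if the curve crosses it).
    gap = np.abs(tpr - fpr)
    acd = float(np.sum(dx * (gap[1:] + gap[:-1]) / 2.0))
    # Curve length: sqrt(2) for identical samples, 2 for perfect separation.
    lroc = float(np.sum(np.hypot(dx, dy)))
    return auc, acd, lroc
```

Ranking variables one at a time then amounts to calling `roc_summaries` for each variable and each pair of classes and sorting by the resulting curve length.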
