Results 1 - 20 of 84
1.
J Bus Econ Stat ; 41(4): 1157-1172, 2023.
Article in English | MEDLINE | ID: mdl-38046827

ABSTRACT

Modeling and inference for heterogeneous data have gained great interest recently due to rapid developments in personalized marketing. Most existing regression approaches are based on the conditional mean and may require additional cluster information to accommodate data heterogeneity. In this paper, we propose a novel nonparametric resolution-wise regression procedure to provide an estimated distribution of the response instead of one single value. We achieve this by decomposing the information of the response and the predictors into resolutions and patterns respectively based on marginal binary expansions. The relationships between resolutions and patterns are modeled by penalized logistic regressions. Combining the resolution-wise prediction, we deliver a histogram of the conditional response to approximate the distribution. Moreover, we show a sure independence screening property and the consistency of the proposed method for growing dimensions. Simulations and a real estate valuation dataset further illustrate the effectiveness of the proposed method.
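
The binary expansion mentioned in this abstract can be illustrated with a minimal sketch. It assumes the response has already been rescaled to [0, 1), for example by its empirical distribution function; the function names are hypothetical and this is not the authors' implementation. Each bit is one "resolution", and the first few bits together pick out one histogram bin.

```python
def binary_expansion(u, depth):
    """Return the first `depth` binary digits of u, where 0 <= u < 1.

    Hypothetical helper: bit k says which half of the current
    sub-interval u falls into at resolution k.
    """
    if not 0.0 <= u < 1.0:
        raise ValueError("u must lie in [0, 1)")
    bits = []
    for _ in range(depth):
        u *= 2
        bit = int(u)      # 0 if u was in the lower half, 1 otherwise
        bits.append(bit)
        u -= bit
    return bits


def bin_index(bits):
    """Map a list of bits to the index of the dyadic histogram bin."""
    index = 0
    for bit in bits:
        index = index * 2 + bit
    return index


bits = binary_expansion(0.625, 3)   # 0.625 = 0.101 in binary
print(bits, bin_index(bits))
```

With depth 3 the unit interval is split into 8 equal bins, so predicting the three bits separately and recombining them yields a histogram over those bins.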

2.
Stat Med ; 42(25): 4644-4663, 2023 11 10.
Article in English | MEDLINE | ID: mdl-37649243

ABSTRACT

Identifying the existence and locations of change points is a common task in many areas of statistical application. Existing change point detection methods may produce unsatisfactory results for high-dimensional data because they make distributional assumptions that are hard to verify in practice. Moreover, some methods require parameters (such as the number of change points) to be estimated beforehand, making their power sensitive to these values. Here, we propose a kernel-based U-statistic to identify change points (KUCP) for high-dimensional data, which is free of distributional assumptions and sup-parameter estimations. Specifically, we employ a kernel function to describe similarities among the subjects and construct a U-statistic to test for the existence of a change point at a given location. The asymptotic properties of the U-statistic are derived. We also develop a procedure to locate the change points sequentially via a dichotomy algorithm. Extensive simulations demonstrate that KUCP has higher sensitivity in identifying the existence of change points and higher accuracy in locating them than its counterparts. We further illustrate its practical utility by analyzing gene expression data from the human brain to detect the time point at which gene expression profiles begin to change, which has been reported to be closely related to the aging brain.


Subjects
Algorithms; Brain; Humans
3.
J Comput Graph Stat ; 32(1): 263-274, 2023.
Article in English | MEDLINE | ID: mdl-37274355

ABSTRACT

Modern high-dimensional statistical inference often faces the problem of missing data. In recent decades, many studies have focused on this topic and provided strategies including complete-sample analysis and imputation procedures. However, complete-sample analysis discards information of incomplete samples, while imputation procedures have accumulative errors from each single imputation. In this paper, we propose a new method, Sample-wise COmbined missing effect Model with penalization (SCOM), to deal with missing data occurring in predictors. Instead of imputing the predictors, SCOM estimates the combined effect caused by all missing data for each incomplete sample. SCOM makes full use of all available data. It is robust with respect to various missing mechanisms. Theoretical studies show the oracle inequality for the proposed estimator, and the consistency of variable selection and combined missing effect selection. Simulation studies and an application to the Residential Building Data also illustrate the effectiveness of the proposed SCOM.

4.
Bioinformatics ; 39(4)2023 04 03.
Article in English | MEDLINE | ID: mdl-37027223

ABSTRACT

MOTIVATION: Traditional genome-wide association studies focus on testing one-to-one relationships between genetic variants and complex human diseases or traits. Despite its success over the past decade, this one-to-one paradigm lacks efficiency because it does not utilize information on intrinsic genetic structure and pleiotropic effects. For privacy reasons, only summary statistics of current genome-wide association study data are publicly available. Existing summary statistics-based association tests do not consider covariates in the regression model, although adjusting for covariates, including population stratification factors, is routine. RESULTS: In this work, we first derive the correlation coefficients between summary Wald statistics obtained from a linear regression model with covariates. Then, a new test is proposed by integrating three-level information including the intrinsic genetic structure, pleiotropy, and the potential information combinations. Extensive simulations demonstrate that the proposed test outperforms three other existing methods under most of the considered scenarios. Real data analysis of polyunsaturated fatty acids further shows that the proposed test can identify more genes than the compared existing methods. AVAILABILITY AND IMPLEMENTATION: Code is available at https://github.com/bschilder/ThreeWayTest.


Subjects
Genome-Wide Association Study; Polymorphism, Single Nucleotide; Humans; Genome-Wide Association Study/methods; Phenotype; Linear Models
5.
Bioinformatics ; 39(5)2023 05 04.
Article in English | MEDLINE | ID: mdl-37104737

ABSTRACT

MOTIVATION: Testing the association between multiple phenotypes with a set of genetic variants simultaneously, rather than analyzing one trait at a time, is receiving increasing attention for its high statistical power and easy explanation on pleiotropic effects. The kernel-based association test (KAT), being free of data dimensions and structures, has proven to be a good alternative method for genetic association analysis with multiple phenotypes. However, KAT suffers from substantial power loss when multiple phenotypes have moderate to strong correlations. To handle this issue, we propose a maximum KAT (MaxKAT) and suggest using the generalized extreme value distribution to calculate its statistical significance under the null hypothesis. RESULTS: We show that MaxKAT reduces computational intensity greatly while maintaining high accuracy. Extensive simulations demonstrate that MaxKAT can properly control type I error rates and obtain remarkably higher power than KAT under most of the considered scenarios. Application to a porcine dataset used in biomedical experiments of human disease further illustrates its practical utility. AVAILABILITY AND IMPLEMENTATION: The R package MaxKAT that implements the proposed method is available on Github https://github.com/WangJJ-xrk/MaxKAT.


Subjects
Genome-Wide Association Study; Models, Genetic; Humans; Animals; Swine; Phenotype; Computer Simulation
6.
J Appl Stat ; 50(3): 631-658, 2023.
Article in English | MEDLINE | ID: mdl-36819071

ABSTRACT

The National Heart, Lung and Blood Institute Growth and Health Study (NGHS) is a large longitudinal study of childhood health. A main objective of the study is to estimate the joint distributions of cardiovascular risk outcomes at any two time points conditioning on a large number of covariates. Existing multivariate longitudinal methods are not suitable for outcomes at multiple time points. We present a dynamic copula approach for estimating an outcome's joint distributions at two time points given a large number of time-varying covariates. Our models depend on the outcome's time-varying distributions at one time point, the bivariate copula densities and the functional copula parameters. We develop a three-step procedure for variable selection and estimation, which selects the influential covariates using a machine learning procedure based on spline Lasso-regularized least squares, computes the outcome's single-time distribution using splines, and estimates the functional copula parameter of the dynamic copula models. Pointwise confidence intervals are constructed through the resampling-subject bootstrap. We apply our procedure to the NGHS cardiovascular risk data and illustrate the clinical interpretations of the conditional distributions of a set of risk outcomes. We demonstrate the statistical properties of the dynamic models and estimation procedure through a simulation study.

7.
Sci Adv ; 9(1): eabq5506, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36608134

ABSTRACT

Abnormal temperatures caused by global climate change threaten rice production. A defense signaling network for chilling has been uncovered in plants. However, less is known about the repair of DNA damage produced by overwhelmed defenses and its evolution during domestication. Here, we genetically identified a major QTL, COLD11, using a data-merging genome-wide association study based on an algorithm that combines polarized data from two subspecies, indica and japonica, into one system. Loss-of-function mutations of COLD11 in rice caused reduced chilling tolerance. Genome evolution analysis of representative rice germplasms suggested that the number of GCG sequence repeats in the first exon of COLD11 was subjected to strong domestication selection during the northern expansion of rice planting. The repeat number affected the biochemical activity of the DNA repair protein COLD11/RAD51A1 in repairing DNA damage under chilling stress. Our findings highlight a potential way to finely manipulate key genes in the rice genome and effectively improve chilling tolerance through molecular design.


Subjects
Oryza; Oryza/genetics; Oryza/metabolism; Genome-Wide Association Study; Codon/metabolism; Cold Temperature
8.
J Appl Stat ; 49(16): 4278-4293, 2022.
Article in English | MEDLINE | ID: mdl-36353301

ABSTRACT

In disease screening, a biomarker combination developed by combining multiple markers tends to have a higher sensitivity than an individual marker. Parametric methods for marker combination rely on inverting covariance matrices, which is often non-trivial for high-dimensional data generated by modern high-throughput technologies. Another common problem in disease diagnosis is the existence of a limit of detection (LOD) for an instrument: when a biomarker's value falls below the limit, it cannot be observed and is assigned an NA value. To handle these two challenges in combining high-dimensional biomarkers in the presence of an LOD, we propose a resample-replace lasso procedure. We first impute the values below the LOD and then use the graphical lasso method to estimate the means and precision matrices for the high-dimensional biomarkers. The simulation results show that our method outperforms alternative methods such as substituting NA values with LOD values or removing observations that have NA values. A real-data analysis of a protein profiling study of survival status in glioblastoma patients indicates that the biomarker combination obtained through the proposed method is more accurate in distinguishing between the two groups.
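
The two baseline strategies that this abstract compares against (substituting the LOD for NA values, and dropping observations with NA values) can be sketched as follows. This is only an illustration of the baselines, not of the proposed resample-replace lasso procedure; the function names and the use of `None` to encode NA are assumptions.

```python
def substitute_lod(rows, lod):
    """Baseline 1: replace every unobserved value (None) with the LOD."""
    return [[lod if value is None else value for value in row] for row in rows]


def complete_cases(rows):
    """Baseline 2: keep only observations with no unobserved values."""
    return [row for row in rows if all(value is not None for value in row)]


# Two subjects, two biomarkers; None marks a value below the LOD.
measurements = [[1.2, None], [0.8, 2.0]]
print(substitute_lod(measurements, lod=0.5))
print(complete_cases(measurements))
```

Substitution keeps every subject but distorts the lower tail of the marker distribution, while complete-case analysis discards subjects, which is costly when many markers are measured.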

9.
Commun Math Stat ; : 1-31, 2022 Oct 01.
Article in English | MEDLINE | ID: mdl-36213843

ABSTRACT

We investigate the false-negative, true-negative, false-positive, and true-positive predictive values from a general group testing procedure for a heterogeneous population. We show that the false (true)-negative predictive value of a specimen is larger (smaller), and the false (true)-positive predictive value is smaller (larger), than that from an individual testing procedure, the former being undesirable. We then propose a nested group testing procedure and show that it retains these favorable characteristics while also improving the false-negative predictive value of a specimen, so that it is no larger than that from individual testing. These characteristics are studied from both theoretical and numerical points of view. The nested group testing procedure is better than individual testing on both false-positive and false-negative predictive values, while retaining the efficiency that is a basic characteristic of a group testing procedure. Applications to Dorfman's, Halving and Sterrett procedures are discussed. Results from extensive simulation studies and an application to malaria infection in microscopy-negative Malawian women exemplify the findings.
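
The efficiency of group testing referred to in this abstract can be illustrated with the classical expected-cost calculation for Dorfman's two-stage procedure. The sketch below assumes a homogeneous prevalence and an error-free assay, which is simpler than the heterogeneous, misclassification-prone setting the article studies; the function name is hypothetical.

```python
def dorfman_tests_per_specimen(prevalence, group_size):
    """Expected number of tests per specimen under Dorfman's procedure.

    One pooled test is run per group; if it is positive, every member
    is retested individually. Assumes independent specimens with a
    common prevalence and a perfect assay.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie in [0, 1]")
    prob_group_negative = (1.0 - prevalence) ** group_size
    # 1/k for the pooled test, plus one retest per member if the pool is positive.
    return 1.0 / group_size + (1.0 - prob_group_negative)


for k in (5, 10, 20):
    print(k, round(dorfman_tests_per_specimen(0.01, k), 4))
```

Any value below 1 means fewer tests than individual testing, which requires exactly one test per specimen.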

10.
Bioinformatics ; 38(14): 3493-3500, 2022 07 11.
Article in English | MEDLINE | ID: mdl-35640978

ABSTRACT

MOTIVATION: Microbial communities have been shown to be associated with many complex diseases, such as cancers and cardiovascular diseases. The identification of differentially abundant taxa is clinically important. It can help understand the pathology of complex diseases, and potentially provide preventive and therapeutic strategies. Appropriate differential analyses for microbiome data are challenging due to its unique data characteristics including compositional constraint, excessive zeros and high dimensionality. Most existing approaches either ignore these data characteristics or only account for the compositional constraint by using log-ratio transformations with zero observations replaced by a pseudocount. However, there is no consensus on how to choose a pseudocount. More importantly, ignoring the characteristic of excessive zeros may result in poorly powered analyses and therefore yield misleading findings. RESULTS: We develop a novel microbiome-based direction-assisted test for the detection of overall difference in microbial relative abundances between two health conditions, which simultaneously incorporates the characteristics of relative abundance data. The proposed test (i) divides the taxa into two clusters by the directions of mean differences of relative abundances and then combines them at cluster level, in light of the compositional characteristic; and (ii) contains a burden type test, which collapses multiple taxa into a single one to account for excessive zeros. Moreover, the proposed test is an adaptive procedure, which can accommodate high-dimensional settings and yield high power against various alternative hypotheses. We perform extensive simulation studies across a wide range of scenarios to evaluate the proposed test and show its substantial power gain over some existing tests. The superiority of the proposed approach is further demonstrated with real datasets from two microbiome studies. 
AVAILABILITY AND IMPLEMENTATION: An R package for MiDAT is available at https://github.com/zhangwei0125/MiDAT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
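
The pseudocount issue raised in this abstract can be made concrete with a minimal sketch of a centered log-ratio transformation in which zero counts are replaced by a pseudocount. This illustrates the conventional approach that the abstract criticizes, not MiDAT itself; the function name and the default pseudocount are assumptions.

```python
import math


def clr_with_pseudocount(counts, pseudocount=0.5):
    """Centered log-ratio transform with zeros replaced by a pseudocount."""
    if pseudocount <= 0:
        raise ValueError("pseudocount must be positive")
    replaced = [count if count > 0 else pseudocount for count in counts]
    logs = [math.log(value) for value in replaced]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log removes the arbitrary total (compositional constraint).
    return [value - mean_log for value in logs]


taxa_counts = [0, 10, 90]
print(clr_with_pseudocount(taxa_counts, pseudocount=0.5))
print(clr_with_pseudocount(taxa_counts, pseudocount=1.0))
```

The two printed vectors differ in every coordinate, because changing the value substituted for a single zero shifts the mean log and therefore all transformed values; this is why the choice of pseudocount matters.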


Subjects
Microbiota; Computer Simulation
11.
Stat Med ; 40(25): 5534-5546, 2021 11 10.
Article in English | MEDLINE | ID: mdl-34258785

ABSTRACT

Balancing the allocation of units to two treatment groups so as to minimize allocation differences is important in biomedical research. The complete randomization, rerandomization, and pairwise sequential randomization (PSR) procedures can be employed to balance the allocation. However, the first two do not allow a large number of covariates. In this article, we generalize the PSR procedure and propose a k-resolution sequential randomization (k-RSR) procedure that minimizes the Mahalanobis distance between the two groups with equal group sizes. The proposed method can be used to achieve adequate balance and obtain a reasonable estimate of the treatment effect. Compared to PSR, k-RSR is more likely to achieve the optimal value theoretically. Extensive simulation studies are conducted to show the superiority of k-RSR, and applications to the clinical synthetic data and GAW16 data further illustrate the methods.


Subjects
Research Design; Computer Simulation; Humans; Random Allocation
12.
Stat Methods Med Res ; 30(7): 1640-1653, 2021 07.
Article in English | MEDLINE | ID: mdl-34134561

ABSTRACT

For a nonparametric Behrens-Fisher problem, a directional-sum test is proposed based on a division-combination strategy. A one-layer wild bootstrap procedure is given to calculate its statistical significance. We conduct simulation studies with data generated from lognormal, t and Laplace distributions to show that the proposed test can control the type I error rates properly and is more powerful than the existing rank-sum and maximum-type tests under most of the considered scenarios. An application to a dietary intervention trial further shows the performance of the proposed test.


Subjects
Diet; Research Design; Computer Simulation; Models, Statistical
13.
Stat Med ; 40(21): 4597-4608, 2021 09 20.
Article in English | MEDLINE | ID: mdl-34050680

ABSTRACT

This article proposes a powerful method to compare two samples. The proposed method handles comparison of data by drawing inference from ROC curve model parameters. The method estimates parameters from a linear model framework on the empirical sensitivities and specificities. The consistent ROC parameters are then used to give a more powerful test than existing methods in several situations. In addition, we present a comprehensive statistic based on the Cauchy combination, which works well in all scenarios considered in this article. We also offer an efficient one-layer wild permutation procedure to calculate the P-value of our statistic. The method is particularly useful when the underlying continuous biomarker results are non-normal. We illustrate the proposed methods in a neonatal audiology diagnostic example.
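
The Cauchy combination mentioned in this abstract is commonly written as a weighted sum of tangent-transformed P-values, whose null distribution is approximately standard Cauchy. The sketch below shows that standard form only; the article's exact statistic and weights may differ, and the function name is hypothetical.

```python
import math


def cauchy_combination(pvalues, weights=None):
    """Combine P-values with the standard Cauchy combination rule.

    Each P-value is mapped to tan((0.5 - p) * pi); the weighted sum is
    referred to a standard Cauchy distribution to obtain one P-value.
    """
    if not pvalues:
        raise ValueError("at least one P-value is required")
    if any(not 0.0 < p < 1.0 for p in pvalues):
        raise ValueError("P-values must lie strictly between 0 and 1")
    if weights is None:
        weights = [1.0 / len(pvalues)] * len(pvalues)
    if len(weights) != len(pvalues) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must match the P-values and sum to 1")
    statistic = sum(w * math.tan((0.5 - p) * math.pi)
                    for w, p in zip(weights, pvalues))
    # Upper tail probability of a standard Cauchy variable.
    return 0.5 - math.atan(statistic) / math.pi


print(cauchy_combination([0.01, 0.9]))
```

Because very small P-values map to very large tangent values, the combined result is driven mainly by the strongest individual signals.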


Subjects
Audiology; Humans; Infant, Newborn; ROC Curve; Sensitivity and Specificity
14.
Stat Med ; 40(10): 2422-2434, 2021 05 10.
Article in English | MEDLINE | ID: mdl-33665825

ABSTRACT

In this article, we propose a novel test that combines the maximum and minimum values among a large number of dependent Z-scores for testing hypotheses with sparse signals. The proposed test exploits the information in the different signs of the maximum and minimum Z-scores and thereby gains power. Its asymptotic null distribution is derived under some regularity conditions. Extensive simulation studies show the advantages of the proposed test over two existing ones. An application to a lipids genome-wide association study further demonstrates its performance.


Subjects
Genome-Wide Association Study; Computer Simulation; Humans
15.
Genet Epidemiol ; 44(6): 620-628, 2020 09.
Article in English | MEDLINE | ID: mdl-32567118

ABSTRACT

The distance-based regression model has become a powerful approach to identifying phenotypic associations in many fields. It is particularly useful for high-dimensional biological and genetic data when proper distance or similarity measures are available. The pseudo F statistic used in this model accumulates information and is effective when the signals, that is, the variations represented by the eigenvalues of the similarity matrix, scatter evenly along the eigenvectors of the similarity matrix. However, it might lose power when the signals are uneven. To deal with this issue, we propose a group analysis on the variations of signals along the eigenvalues of the similarity matrix and take the maximum among them. The new procedure can automatically choose an optimal grouping point from some given thresholds and can thus improve power. Extensive computer simulations and applications to prostate cancer data and aging human brain data illustrate the effectiveness of the proposed method.


Subjects
Models, Genetic; Adult; Aged; Aged, 80 and over; Algorithms; Brain/physiology; Computer Simulation; Female; Gene Expression Regulation, Neoplastic; Humans; Male; Middle Aged; Models, Statistical; Prostatic Neoplasms/genetics; Regression Analysis; Time Factors
16.
Genet Epidemiol ; 44(7): 687-701, 2020 10.
Article in English | MEDLINE | ID: mdl-32583530

ABSTRACT

To date, thousands of genetic variants associated with numerous human traits and diseases have been identified by genome-wide association studies (GWASs). GWASs focus on testing the association between a single trait and genetic variants. However, the joint analysis of multiple traits and single nucleotide polymorphisms (SNPs) might reflect the physiological processes of complex diseases; the corresponding study is called pleiotropy association analysis. Modern GWASs report only summary statistics instead of individual-level phenotype and genotype data to avoid logistical and privacy issues. Existing methods for combining GWAS summary statistics across multiple phenotypes mainly focus on low-dimensional phenotypes and lose power in high-dimensional cases. To overcome this defect, we propose two kinds of truncated tests to combine summary statistics from multiple phenotypes. Extensive simulations show that the proposed methods are robust and powerful when the dimension of the phenotypes is high and only some of the phenotypes are associated with the SNPs. We apply the proposed methods to blood cytokine data collected from a Finnish population. Results show that the proposed tests can identify additional genetic markers that are missed by single-trait analysis.


Subjects
Cytokines/blood; Cytokines/genetics; Genome-Wide Association Study/statistics & numerical data; Models, Genetic; Polymorphism, Single Nucleotide/genetics; Computer Simulation; Finland; Genetic Markers/genetics; Genotype; Humans; Phenotype
17.
Biometrics ; 76(4): 1147-1156, 2020 12.
Article in English | MEDLINE | ID: mdl-32083733

ABSTRACT

This article concerns the problem of estimating a continuous distribution in a diseased or nondiseased population when only group-based test results on the disease status are available. The problem is challenging in that individual disease statuses are not observed and testing results are often subject to misclassification, with further complication that the misclassification may be differential as the group size and the number of the diseased individuals in the group vary. We propose a method to construct nonparametric estimation of the distribution and obtain its asymptotic properties. The performance of the distribution estimator is evaluated under various design considerations concerning group sizes and classification errors. The method is exemplified with data from the National Health and Nutrition Examination Survey study to estimate the distribution and diagnostic accuracy of C-reactive protein in blood samples in predicting chlamydia incidence.


Subjects
Models, Statistical; Research Design; Bias; Humans; Nutrition Surveys; Statistical Distributions
18.
Stat Med ; 39(6): 687-697, 2020 03 15.
Article in English | MEDLINE | ID: mdl-31758594

ABSTRACT

Group testing has been widely used as a cost-effective strategy to screen for and estimate the prevalence of a rare disease. While it is well recognized that retesting is necessary for identifying infected subjects, it is not required for estimating the prevalence. For a test without misclassification, gains in statistical efficiency are expected from incorporating retesting results in the estimation of the prevalence. However, when the test is subject to misclassification, it is not clear how much gain should be expected. There are a number of theoretical challenges in addressing this issue, including (1) enumerating the potential test results from retesting individual subjects in a group, (2) the dependence among these test results and the test result from testing at the group level, and (3) differential misclassification due to pooling of biospecimens. Overcoming some of these challenges, we show that retesting subjects in either positive or negative groups can substantially improve the efficiency of the estimation and that retesting positive groups yields higher efficiency than retesting the same number or proportion of negative groups.


Subjects
Prevalence; Cost-Benefit Analysis; Humans
19.
Biometrics ; 76(3): 863-873, 2020 09.
Article in English | MEDLINE | ID: mdl-31725175

ABSTRACT

Receiver operating characteristic (ROC) curve is commonly used to evaluate and compare the accuracy of classification methods or markers. Estimating ROC curves has been an important problem in various fields including biometric recognition and diagnostic medicine. In real applications, classification markers are often developed under two or more ordered conditions, such that a natural stochastic ordering exists among the observations. Incorporating such a stochastic ordering into estimation can improve statistical efficiency (Davidov and Herman, 2012). In addition, clustered and correlated data arise when multiple measurements are gleaned from the same subject, making estimation of ROC curves complicated due to within-cluster correlations. In this article, we propose to model the ROC curve using a weighted empirical process to jointly account for the order constraint and within-cluster correlation structure. The algebraic properties of resulting summary statistics of the ROC curve such as its area and partial area are also studied. The algebraic expressions reduce to the ones by Davidov and Herman (2012) for independent observations. We derive asymptotic properties of the proposed order-restricted estimators and show that they have smaller mean-squared errors than the existing estimators. Simulation studies also demonstrate better performance of the newly proposed estimators over existing methods for finite samples. The proposed method is further exemplified with the fingerprint matching data from the National Institute of Standards and Technology Special Database 4.
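
As background for the area summary discussed in this abstract, the usual unweighted empirical estimate of the area under the ROC curve can be sketched as below. It treats all observations as independent and ignores both the order constraint and the within-cluster correlation that the article's weighted estimator accounts for; the function name is hypothetical.

```python
def empirical_auc(negatives, positives):
    """Empirical area under the ROC curve (Mann-Whitney form).

    Counts the proportion of (positive, negative) pairs in which the
    positive score is larger, with ties counted as one half.
    """
    if not negatives or not positives:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(empirical_auc([1, 2, 3], [2, 3, 4]))
```

A value of 0.5 corresponds to a marker with no discriminating ability and 1.0 to perfect separation of the two groups.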


Subjects
Biometry; Models, Statistical; Area Under Curve; Biomarkers; Computer Simulation; ROC Curve
20.
G3 (Bethesda) ; 9(8): 2573-2579, 2019 08 08.
Article in English | MEDLINE | ID: mdl-31167832

ABSTRACT

The methods commonly used to test the associations between ordinal phenotypes and genotypes often treat either the ordinal phenotype or the genotype as continuous variables. To address limitations of these approaches, we propose a model where both the ordinal phenotype and the genotype are viewed as manifestations of an underlying multivariate normal random variable. The proposed method allows modeling the ordinal phenotype, the genotype and covariates jointly. We employ the generalized estimating equation technique and M-estimation theory to estimate the model parameters and deduce the corresponding asymptotic distribution. Numerical simulations and real data applications are also conducted to compare the performance of the proposed method with those of methods based on the logit and probit models. Even though there may be potential limitations in Type I error rate control for our method, the gains in power can prove its practical value in case of exactly ordinal phenotypes.


Subjects
Genetic Association Studies; Genotype; Models, Genetic; Phenotype; Algorithms; Autoantibodies/immunology; Disease Susceptibility; Genetic Association Studies/methods; Humans; Polymorphism, Single Nucleotide; Quantitative Trait Loci; Quantitative Trait, Heritable
...