Results 1 - 20 of 51
1.
Article in English | MEDLINE | ID: mdl-37502671

ABSTRACT

Technological developments are making it possible to gather large amounts of data in several research fields. Learning analytics (LA)/educational data mining has access to big, unstructured observational data captured from educational settings and relies mostly on unsupervised machine learning (ML) algorithms to make sense of such data. Generalized additive models for location, scale, and shape (GAMLSS) are a supervised statistical learning framework that allows all the parameters of the response variable's distribution to be modeled as functions of the explanatory variables. This article gives an overview of the power and flexibility of GAMLSS in relation to some ML techniques. GAMLSS' capability to be tailored toward causality via causal regularization is also briefly discussed. This overview is illustrated via a data set from the field of LA. This article is categorized under: Application Areas > Education and Learning; Algorithmic Development > Statistics; Technologies > Machine Learning.
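
To make the idea of modeling every distribution parameter concrete, the following is a minimal sketch, not the GAMLSS implementation itself. It assumes a Gaussian response whose mean and log standard deviation are both linear in a single covariate; the function name, the coefficient values, and the toy data are hypothetical.

```python
import math

def gaussian_nll(y, x, beta_mu, beta_sigma):
    """Negative log-likelihood of a Gaussian location-scale model.

    beta_mu    = (intercept, slope) for the mean (location).
    beta_sigma = (intercept, slope) for log(sd) (scale); the log link keeps sd > 0.
    """
    total = 0.0
    for yi, xi in zip(y, x):
        mu = beta_mu[0] + beta_mu[1] * xi
        sigma = math.exp(beta_sigma[0] + beta_sigma[1] * xi)
        total += (0.5 * math.log(2 * math.pi) + math.log(sigma)
                  + 0.5 * ((yi - mu) / sigma) ** 2)
    return total

# Made-up toy data in which the spread of y grows with x.
x = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
y = [0.9, 1.1, 1.6, 2.4, 2.1, 3.9]

# Same mean model in both cases; only the scale model differs.
nll_constant_sd = gaussian_nll(y, x, (0.0, 1.0), (math.log(0.5), 0.0))
nll_varying_sd = gaussian_nll(y, x, (0.0, 1.0), (-3.3, 1.1))
```

On these toy data the model whose standard deviation depends on x attains the lower negative log-likelihood, which is the kind of gain a mean-only model cannot capture.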

2.
Biom J ; 65(8): e2200229, 2023 12.
Article in English | MEDLINE | ID: mdl-37357560

ABSTRACT

The reference interval is the most widely used tool in medical decision-making and is central to determining whether an individual is healthy or not. When the results of several continuous diagnostic tests are available for the same patient, their clinical interpretation is more reliable if a multivariate reference region (MVR) is available rather than multiple univariate reference intervals. MVRs, defined as regions containing 95% of the results of healthy subjects, extend the concept of the reference interval to the multivariate setting. However, they are rarely used in clinical practice owing to difficulties associated with their interpretability and the restrictions inherent to the assumption of a Gaussian distribution. Further statistical research is thus needed to make MVRs more applicable and easier for physicians to interpret. Since the joint distribution of diagnostic test results may well change with patient characteristics independent of disease status, MVRs adjusted for covariates are desirable. The present work introduces a novel formulation for MVRs based on multivariate conditional transformation models (MCTMs). Additionally, we take into account the estimation uncertainty of such MVRs by means of tolerance regions. These conditional MVRs imply no parametric restriction on the response, and potentially nonlinear continuous covariate effects can be estimated. MCTMs allow the estimation of the effects of covariates on the joint distribution of multivariate response variables and on these variables' marginal distributions, via the use of most likely transformation estimation. Our contributions proved reliable when tested with simulated data and for a real data application with two glycemic markers.
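
For orientation, here is a sketch of the classical Gaussian bivariate reference region, that is, the restrictive baseline the abstract mentions and not the MCTM-based method it proposes. It assumes known mean and covariance; the function name and the standardized example values are hypothetical.

```python
import math

def in_gaussian_reference_region(point, mean, cov, coverage=0.95):
    """Check whether a bivariate result lies inside the Gaussian reference ellipse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # Squared Mahalanobis distance, using the explicit inverse of a 2x2 matrix.
    d2 = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    # For 2 degrees of freedom the chi-square quantile has a closed form.
    threshold = -2.0 * math.log(1.0 - coverage)
    return d2 <= threshold

# Two standardized, positively correlated markers (made-up values).
mean = (0.0, 0.0)
cov = ((1.0, 0.8), (0.8, 1.0))

concordant = in_gaussian_reference_region((1.5, 1.5), mean, cov)
discordant = in_gaussian_reference_region((1.5, -1.5), mean, cov)
```

Both example points have coordinates within plus or minus 1.96, so two univariate intervals would accept each of them, yet only the concordant point falls inside the joint region. This is one way to see why a multivariate region can be more informative than separate intervals.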


Subjects
Clinical Decision-Making; Humans; Normal Distribution; Uncertainty
3.
Comput Stat ; 38(2): 647-674, 2023.
Article in English | MEDLINE | ID: mdl-37223721

ABSTRACT

Topic models are a useful and popular method to find latent topics of documents. However, the short and sparse texts in social media micro-blogs such as Twitter are challenging for the most commonly used Latent Dirichlet Allocation (LDA) topic model. We compare the performance of the standard LDA topic model with the Gibbs Sampler Dirichlet Multinomial Model (GSDMM) and the Gamma Poisson Mixture Model (GPM), which are specifically designed for sparse data. To compare the performance of the three models, we propose the simulation of pseudo-documents as a novel evaluation method. In a case study with short and sparse text, the models are evaluated on tweets filtered by keywords relating to the Covid-19 pandemic. We find that standard coherence scores that are often used for the evaluation of topic models perform poorly as an evaluation metric. The results of our simulation-based approach suggest that the GSDMM and GPM topic models may generate better topics than the standard LDA model.

4.
Cogn Neurodyn ; 17(1): 221-237, 2023 Feb.
Article in English | MEDLINE | ID: mdl-36704631

ABSTRACT

Reaction times (RTs) are an essential metric for understanding the link between brain and behaviour. As research is reaffirming the tight coupling between neuronal and behavioural RTs, thorough statistical modelling of RT data is essential to enrich current theories and motivate novel findings. A statistical distribution is proposed herein that is able to model the complete RT distribution, including location, scale and shape: the generalised-exponential-Gaussian (GEG) distribution. The GEG distribution enables shifting the attention from traditional means and standard deviations to the entire RT distribution. The mathematical properties of the GEG distribution are presented and investigated via simulations. Additionally, the GEG distribution is illustrated with four real-life data sets. Finally, we discuss how the proposed distribution can be used for regression analyses via generalised additive models for location, scale and shape (GAMLSS).
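
The abstract does not give the GEG density, so the sketch below uses the classical exponentially modified Gaussian (ex-Gaussian) as a stand-in to show why location, scale and shape all matter for RT data. The function names and the parameter values (in milliseconds) are hypothetical.

```python
import random

def ex_gaussian_moments(mu, sigma, tau):
    """Mean, variance and skewness of Normal(mu, sigma) + Exponential(mean tau)."""
    mean = mu + tau
    variance = sigma ** 2 + tau ** 2
    skewness = 2 * tau ** 3 / variance ** 1.5
    return mean, variance, skewness

def sample_ex_gaussian(mu, sigma, tau, n, seed=0):
    """Simulate n reaction times as a Gaussian part plus an exponential tail."""
    rng = random.Random(seed)
    # expovariate takes the rate, so a mean of tau corresponds to rate 1 / tau.
    return [rng.gauss(mu, sigma) + rng.expovariate(1.0 / tau) for _ in range(n)]

mean_rt, var_rt, skew_rt = ex_gaussian_moments(300.0, 30.0, 40.0)
simulated = sample_ex_gaussian(300.0, 30.0, 40.0, 20000, seed=1)
```

The exponential component shifts the mean, inflates the variance and produces the right skew typical of RT data, so summarising such data by mean and standard deviation alone discards the shape information.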

5.
Stat Methods Med Res ; 32(2): 425-440, 2023 02.
Article in English | MEDLINE | ID: mdl-36384320

ABSTRACT

A range of regularization approaches have been proposed in the data sciences to overcome overfitting, to exploit sparsity or to improve prediction. Using a broad definition of regularization, namely controlling model complexity by adding information in order to solve ill-posed problems or to prevent overfitting, we review a range of approaches within this framework, including penalization, early stopping, ensembling and model averaging. Aspects of their practical implementation are discussed, including available R packages, and examples are provided. To assess the extent to which these approaches are used in medicine, we conducted a review of three general medical journals. It revealed that regularization approaches are rarely applied in practical clinical applications, with the exception of random effects models. Hence, we suggest a more frequent use of regularization approaches in medical research. In situations where other approaches also work well, the only downside of the regularization approaches is the increased complexity in the conduct of the analyses, which can pose challenges in terms of computational resources and expertise on the side of the data analyst. In our view, both can and should be overcome by investments in appropriate computing facilities and educational resources.
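
As a small illustration of penalization, one of the approaches listed above, the following sketch shows ridge shrinkage in the simplest possible setting. It assumes a single predictor and no intercept; the function name and data are hypothetical, and the review itself points to R packages rather than to this code.

```python
def ridge_slope(x, y, lam):
    """Ridge estimate of beta in y ~ beta * x (no intercept, one predictor).

    Minimizes sum((y - beta * x)^2) + lam * beta^2, whose solution is
    sum(x * y) / (sum(x^2) + lam).
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

unpenalized = ridge_slope(x, y, 0.0)    # ordinary least squares
penalized = ridge_slope(x, y, 14.0)     # penalty equal to sum(x^2)
```

With a penalty of zero the estimate equals the least-squares slope, and increasing the penalty shrinks it toward zero. That trade of a little bias for lower variance is the mechanism behind most penalization methods.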

6.
Article in English | MEDLINE | ID: mdl-35942194

ABSTRACT

A rapid response to global infectious disease outbreaks is crucial to protect public health. Ex ante information on the spatial probability distribution of early infections can guide governments to better target protection efforts. We propose a two-stage statistical approach to spatially map the ex ante importation risk of COVID-19 and its uncertainty across Indonesia based on a minimal set of routinely available input data related to the Indonesian flight network, traffic and population data, and geographical information. In a first step, we use a generalised additive model to predict the ex ante COVID-19 risk for 78 domestic Indonesian airports based on data from a global model on the disease spread and covariates associated with Indonesian airport network flight data prior to the global COVID-19 outbreak. In a second step, we apply a Bayesian geostatistical model to propagate the estimated COVID-19 risk from the airports to all of Indonesia using freely available spatial covariates including traffic density, population and two spatial distance metrics. The results of our analysis are illustrated using exceedance probability surface maps, which provide policy-relevant information accounting for the uncertainty of the estimates on the location of areas at risk and those that might require further data collection.

7.
BMC Med Res Methodol ; 22(1): 187, 2022 07 11.
Article in English | MEDLINE | ID: mdl-35818026

ABSTRACT

BACKGROUND: Due to contradictory results in current research, whether age at menopause is increasing or decreasing in Western countries remains an open question, yet it is worth studying, as later ages at menopause are likely to be related to an increased risk of breast cancer. Using data from breast cancer screening programs to study the temporal trend of age at menopause is difficult since especially younger women in the same generational cohort have often not yet reached menopause. Excluding these younger women from a breast cancer risk analysis may bias the results. The aim of this study is therefore to recover missing menopause ages as a covariate by comparing methods for handling missing data. Additionally, the study contributes to understanding the evolution of age at menopause for several generations born in Portugal between 1920 and 1970. METHODS: Data from a breast cancer screening program in Portugal including 278,282 women aged 45-69 and collected between 1990 and 2010 are used to compare two approaches to imputing age at menopause: (i) a multiple imputation methodology based on a truncated distribution but ignoring the mechanism of missingness; (ii) a copula-based multiple imputation method that simultaneously handles the age at menopause and the missing mechanism. The linear predictors considered in both cases have a semiparametric additive structure accommodating linear and non-linear effects defined via splines, or Markov random field smoothers in the case of spatial variables. RESULTS: Both imputation methods unveiled an increasing trend of age at menopause when viewed as a function of the birth year for the youngest generation. This trend is hidden if we model only women with an observed age at menopause. CONCLUSION: When studying age at menopause, missing ages must be recovered with an adequate procedure for incomplete data. Imputing these missing ages avoids excluding the younger generation cohort of the screening program in breast cancer risk analyses and hence reduces the bias stemming from this exclusion. In addition, imputing the not yet observed ages at menopause for mostly younger women is also crucial when studying the time trend of age at menopause; otherwise the analysis will be biased.


Subjects
Breast Neoplasms; Menopause; Bias; Breast Neoplasms/epidemiology; Cohort Studies; Female; Humans; Risk Assessment
9.
Front Plant Sci ; 12: 635440, 2021.
Article in English | MEDLINE | ID: mdl-33643364

ABSTRACT

Automated species classification from 3D point clouds is still a challenge. It is, however, an important task for laser scanning-based forest inventory, for ecosystem models, and for supporting forest management. Here, we tested the performance of an image classification approach based on convolutional neural networks (CNNs) with the aim of classifying 3D point clouds of seven tree species based on 2D representations in a computationally efficient way. We were particularly interested in how the approach would perform with artificially increased training data size based on image augmentation techniques. Our approach yielded a high classification accuracy (86%), and the confusion matrix revealed that, despite rather small sample sizes of the training data for some tree species, classification accuracy was high. We could partly relate this to the successful application of the image augmentation technique, which improved our result by 6% in total and by 13, 14, and 24% for ash, oak and pine, respectively. The introduced approach is hence not only applicable to small-sized datasets but also computationally effective, since it relies on 2D instead of 3D data to be processed in the CNN. Our approach was faster and more accurate when compared to the point cloud-based "PointNet" approach.
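
The abstract does not say which augmentation operations were used, so the following is only a generic sketch of how a training set of 2D images can be enlarged with flips and rotations. The function names are hypothetical, and images are represented as plain lists of rows.

```python
def flip_horizontal(img):
    """Mirror an image left to right."""
    return [row[::-1] for row in img]

def rotate_90(img):
    """Rotate an image clockwise by 90 degrees."""
    return [list(row) for row in zip(*img[::-1])]

def augment(img):
    """Return the original image plus one flipped and three rotated variants."""
    variants = [img, flip_horizontal(img)]
    rotated = img
    for _ in range(3):
        rotated = rotate_90(rotated)
        variants.append(rotated)
    return variants

tiny_image = [[1, 2],
              [3, 4]]
augmented = augment(tiny_image)
```

Each labeled image yields five training examples here. Whether rotations are appropriate depends on the data, for instance side views of trees have a natural vertical orientation, so the choice of operations is itself an assumption to validate.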

10.
Biom J ; 63(5): 1028-1051, 2021 06.
Article in English | MEDLINE | ID: mdl-33734453

ABSTRACT

Expectile regression, in contrast to classical linear regression, allows for heteroscedasticity and omits a parametric specification of the underlying distribution. This model class can be seen as a quantile-like generalization of least squares regression. As in quantile regression, the whole distribution can be modeled with expectiles, while still offering the same flexibility in the use of semiparametric predictors as modern mean regression. However, even with no parametric assumption for the distribution of the response in expectile regression, the model is still constructed with a linear relationship between the fitted value and the predictor. If the true underlying relationship is nonlinear, then severe biases can be observed in the parameter estimates as well as in quantities derived from them, such as model predictions. We observed this problem during the analysis of the distribution of a self-reported hearing score with limited range. Classical expectile regression should in theory adhere to these constraints; however, we observed predictions that exceeded the maximum score. We propose to include a response function between the fitted value and the predictor, as in generalized linear models. However, including a fixed response function would imply an assumption on the shape of the underlying distribution function. Such assumptions would be counterintuitive in expectile regression. Therefore, we propose to estimate the response function jointly with the covariate effects. We design the response function as a monotonically increasing P-spline, which may also contain constraints on the target set. This results in valid estimates for a self-reported listening effort score through nonlinear estimates of the response function. We observed strong associations with the speech reception threshold.
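
To show what an expectile is before any regression structure is added, the sketch below computes a sample expectile by asymmetric least squares. It covers only the covariate-free case and does not implement the response-function estimation proposed in the abstract; the function name and data are hypothetical.

```python
def expectile(values, tau, tol=1e-10, max_iter=1000):
    """Sample expectile for 0 < tau < 1 via iteratively reweighted averaging.

    Minimizes sum(w_i * (v_i - e)^2) with w_i = tau when v_i > e and
    w_i = 1 - tau otherwise. For tau = 0.5 this is the ordinary mean.
    """
    e = sum(values) / len(values)
    for _ in range(max_iter):
        weights = [tau if v > e else 1.0 - tau for v in values]
        new_e = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_e - e) < tol:
            return new_e
        e = new_e
    return e

scores = [1.0, 2.0, 3.0, 4.0, 10.0]
lower = expectile(scores, 0.1)
middle = expectile(scores, 0.5)
upper = expectile(scores, 0.9)
```

Each expectile is a weighted mean of the observations, so it always stays within the observed range in this intercept-only case. The out-of-range predictions described in the abstract arise once a linear predictor is introduced, which motivates the estimated response function.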


Subjects
Linear Models; Bias; Humans
11.
Econ Hum Biol ; 40: 100950, 2021 01.
Article in English | MEDLINE | ID: mdl-33321408

ABSTRACT

A history of insufficient nutritional intake is reflected by low anthropometric measures and can lead to growth failures, limited mental development, poor health outcomes and a higher risk of dying. Children below five years are among those most vulnerable and, while improvements in the share of children affected by insufficient nutritional intake have been observed, both sub-Saharan Africa and South Asia have a disproportionately high share of growth failures and large disparities at national and sub-national levels. In this study, we use a Bayesian distributional regression approach to develop models for the standard anthropometric measures, stunting and wasting. This approach allows us to model both the mean and the standard deviation of the underlying response distribution. Accordingly, the whole distribution of the anthropometric measures can be evaluated. This is of particular importance, considering that (severe) growth failures of children are defined as having a z-score below -2 (-3), emphasising the need to extend the analysis beyond the conditional mean. In addition, we merge individual data taken from the Demographic and Health Surveys with remotely sensed data for a large sample of 38 countries located in sub-Saharan Africa and South Asia for the period 1990-2016, in order to combine individual and household-specific characteristics with geophysical and environmental characteristics, and to allow for a comparison over time. Besides gender differences across space and strong non-linear effects of the included socio-economic characteristics, in particular maternal education and household wealth, our results show that, surprisingly, in the presence of socio-economic characteristics, remotely sensed data do not contribute to variations in growth failures, and that including a purely spatial effect while excluding remotely sensed data leads to even better results. Further, while all regions showed improvements towards the target of the Sustainable Development Goals (SDGs), our analysis identifies hotspots of growth failures at sub-national levels within India, Nigeria, Niger, and Madagascar, emphasising the need to accelerate progress to reach the target set by the SDGs.
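
The point about looking beyond the conditional mean can be made with a few lines of arithmetic. This sketch assumes a Gaussian z-score distribution with a given mean and standard deviation; the function name and the example parameter values are hypothetical and not taken from the study.

```python
import math

def prob_below(threshold, mean, sd):
    """P(Z < threshold) when Z is Gaussian with the given mean and sd."""
    return 0.5 * (1.0 + math.erf((threshold - mean) / (sd * math.sqrt(2.0))))

# Two hypothetical groups with the same mean z-score but different spread.
stunting_narrow = prob_below(-2.0, mean=-1.0, sd=1.0)
stunting_wide = prob_below(-2.0, mean=-1.0, sd=1.5)
severe_wide = prob_below(-3.0, mean=-1.0, sd=1.5)
```

Although both groups share the same mean, the group with the larger standard deviation has a clearly higher probability of falling below -2. A model of the mean alone would treat the two groups as identical.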


Subjects
Growth Disorders; Africa South of the Sahara/epidemiology; Asia; Bayes Theorem; Child; Growth Disorders/epidemiology; Humans; Socioeconomic Factors
12.
Br J Math Stat Psychol ; 74(1): 99-117, 2021 02.
Article in English | MEDLINE | ID: mdl-33128469

ABSTRACT

A test score on a psychological test is usually expressed as a normed score, representing its position relative to test scores in a reference population. These typically depend on predictor(s) such as age. The test score distribution conditional on predictors is estimated using regression, which may need large normative samples to estimate the relationships between the predictor(s) and the distribution characteristics properly. In this study, we examine to what extent this burden can be alleviated by using prior information in the estimation of new norms with Bayesian Gaussian distributional regression. In a simulation study, we investigate to what extent this norm estimation is more efficient and how robust it is to prior model deviations. We varied the prior type, prior misspecification and sample size. In our simulated conditions, using a fixed effects prior resulted in more efficient norm estimation than a weakly informative prior as long as the prior misspecification was not age dependent. With the proposed method and reasonable prior information, the same norm precision can be achieved with a smaller normative sample, at least in empirical problems similar to our simulated conditions. This may help test developers to achieve cost-efficient high-quality norms. The method is illustrated using empirical normative data from the IDS-2 intelligence test.


Subjects
Psychological Tests; Bayes Theorem; Computer Simulation; Normal Distribution; Sample Size
13.
PLoS One ; 15(2): e0226514, 2020.
Article in English | MEDLINE | ID: mdl-32058999

ABSTRACT

This paper introduces distributional regression, also known as generalized additive models for location, scale and shape (GAMLSS), as a modeling framework for analyzing treatment effects beyond the mean. In contrast to mean regression models, GAMLSS relate each distributional parameter to covariates. Therefore, they can be used to model the treatment effect not only on the mean but on the whole conditional distribution. Since they encompass a wide range of different distributions, GAMLSS provide a flexible framework for modeling non-normal outcomes in which nonlinear and spatial effects can also easily be incorporated. We elaborate on the combination of GAMLSS with program evaluation methods including randomized controlled trials, panel data techniques, difference in differences, instrumental variables, and regression discontinuity design. We provide practical guidance on the usage of GAMLSS by reanalyzing data from the Mexican Progresa program. Contrary to expectations, no significant effects of a cash transfer on the conditional consumption inequality level between treatment and control group are found.


Subjects
Data Interpretation, Statistical; Economic Status/statistics & numerical data; Poverty/statistics & numerical data; Databases, Factual; Humans; Mexico; Regression Analysis; Statistical Distributions
14.
J Appl Stat ; 47(11): 2066-2080, 2020.
Article in English | MEDLINE | ID: mdl-35707573

ABSTRACT

In this paper, we propose the class of generalized additive models for location, scale and shape in a test for the association of genetic markers with non-normally distributed phenotypes comprising a spike at zero. The resulting statistical test is a generalization of the quantitative transmission disequilibrium test with mating type indicator, which was originally designed for normally distributed quantitative traits and parent-offspring data. As a motivational example, we consider coronary artery calcification (CAC), which can accurately be identified by electron beam tomography. In the investigated regions, individuals will have a continuous measure of the extent of calcium found or they will be calcium-free. Hence, the resulting distribution is a mixed discrete-continuous distribution with spike at zero. We carry out parent-offspring simulations motivated by such CAC measurement values in a screening population to study statistical properties of the proposed test for genetic association. Furthermore, we apply the approach to data of the Genetic Analysis Workshop 16 that are based on real genotype and family data of the Framingham Heart Study, and test the association of selected genetic markers with simulated coronary artery calcification.

15.
G3 (Bethesda) ; 9(4): 1117-1129, 2019 04 09.
Article in English | MEDLINE | ID: mdl-30760541

ABSTRACT

Mixed models can be considered as a type of penalized regression and are everyday tools in statistical genetics. The standard mixed model for whole genome regression (WGR) is ridge regression best linear unbiased prediction (RRBLUP), which is based on an additive marker effect model. Many publications have extended the additive WGR approach by incorporating interactions between loci or between genes and environment. In this context of penalized regressions with interactions, it has been reported that translating the coding of single nucleotide polymorphisms (for instance from -1, 0, 1 to 0, 1, 2) has an impact on the prediction of genetic values and interaction effects. In this work, we identify the reason for the relevance of variable coding in the general context of penalized polynomial regression. We show that in many cases, predictions of the genetic values are not invariant to translations of the variable coding, with an exception when only the sizes of the coefficients of monomials of highest total degree are penalized. The invariance of RRBLUP can be considered as a special case of this setting, with a polynomial of total degree 1, penalizing additive effects (total degree 1) but not the fixed effect (total degree 0). The extended RRBLUP (eRRBLUP), which includes interactions, is not invariant to translations because it penalizes not only interactions (total degree 2) but also additive effects (total degree 1). This observation implies that translation-invariance can be maintained in a pair-wise epistatic WGR if only interaction effects are penalized, but not the additive effects. In this regard, approaches of pre-selecting loci may not only reduce computation time, but can also help to avoid the variable coding issue. To illustrate the practical relevance, we compare different regressions on a publicly available wheat data set. We show that for an eRRBLUP, the relevance of the marker coding for interaction effect estimates increases with the number of variables included in the model. A biological interpretation of estimated interaction effects may therefore become more difficult. Consequently, comparing reproducing kernel Hilbert space (RKHS) approaches to WGR approaches modeling effects explicitly, the supposed advantage of an increased interpretability of the latter may not be real. Our theoretical results are generally valid for penalized regressions, for instance also for the least absolute shrinkage and selection operator (LASSO). Moreover, they apply to any type of interaction modeled by products of predictor variables in a penalized regression approach or by Hadamard products of covariance matrices in a mixed model.
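
The degree-1 case described in the abstract can be checked numerically. The sketch below assumes a single marker and made-up phenotypes: it compares a ridge fit that leaves the intercept unpenalized with one that also penalizes the intercept, under the two marker codings. The function names are hypothetical, and the second function is a contrast case added for illustration, not a model from the paper.

```python
def ridge_unpenalized_intercept(x, y, lam):
    """Fitted values when only the slope (total degree 1) is penalized."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / (sxx + lam)
    intercept = y_bar - slope * x_bar
    return [intercept + slope * xi for xi in x]

def ridge_penalized_intercept(x, y, lam):
    """Fitted values when intercept and slope are both penalized."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    # Solve (X'X + lam * I) beta = X'y for the 2x2 case by Cramer's rule.
    a11, a12, a22 = n + lam, sx, sxx + lam
    det = a11 * a22 - a12 * a12
    intercept = (sy * a22 - a12 * sxy) / det
    slope = (a11 * sxy - a12 * sy) / det
    return [intercept + slope * xi for xi in x]

coding_a = [-1.0, 0.0, 1.0]          # marker coded as -1, 0, 1
coding_b = [0.0, 1.0, 2.0]           # same marker coded as 0, 1, 2
phenotype = [1.0, 2.0, 4.0]          # made-up values

invariant_a = ridge_unpenalized_intercept(coding_a, phenotype, 1.0)
invariant_b = ridge_unpenalized_intercept(coding_b, phenotype, 1.0)
shifted_a = ridge_penalized_intercept(coding_a, phenotype, 1.0)
shifted_b = ridge_penalized_intercept(coding_b, phenotype, 1.0)
```

With the intercept left unpenalized, the fitted values agree under both codings, because the shift in the marker is absorbed by the free intercept. Once the lower-degree term is penalized as well, the fitted values depend on the coding, which mirrors the mechanism the abstract describes for eRRBLUP one degree higher.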


Subjects
Genomics/methods; Regression Analysis; Polymorphism, Single Nucleotide; Triticum/genetics; Triticum/growth & development
16.
Stat Med ; 38(3): 413-436, 2019 02 10.
Article in English | MEDLINE | ID: mdl-30334275

ABSTRACT

Bivariate copula regression allows for the flexible combination of two arbitrary, continuous marginal distributions with regression effects being placed on potentially all parameters of the resulting bivariate joint response distribution. Motivated by the risk factors for adverse birth outcomes, many of which are dichotomous, we consider mixed binary-continuous responses that extend the bivariate continuous framework to the situation where one response variable is discrete (more precisely, binary) whereas the other response remains continuous. Utilizing the latent continuous representation of binary regression models, we implement a penalized likelihood-based approach for the resulting class of copula regression models and employ it in the context of modeling gestational age and the presence/absence of low birth weight. The analysis demonstrates the advantage of the flexible specification of regression impacts including nonlinear effects of continuous covariates and spatial effects. Our results imply that racial and spatial inequalities in the risk factors for infant mortality are even greater than previously suggested.


Subjects
Infant, Premature; Models, Statistical; Pregnancy Outcome/epidemiology; Regression Analysis; Female; Gestational Age; Humans; Infant; Infant Mortality; Infant, Low Birth Weight; Infant, Newborn; Likelihood Functions; Pregnancy
17.
Health Econ ; 27(7): 1074-1088, 2018 07.
Article in English | MEDLINE | ID: mdl-29676015

ABSTRACT

We reconsider the relationship between income and health, taking a distributional perspective rather than one centered on conditional expectation. Using structured additive distributional regression, we find that the association between income and health is larger than generally estimated because aspects of the conditional health distribution that go beyond the expectation imply worse outcomes for those with lower incomes. Looking at German data from the Socio-Economic Panel, we find that the risk of bad health is roughly halved when doubling the net equivalent income from €15,000 to €30,000. This is more than ten times the magnitude of change found when considering expected health measures. A distributional perspective thus highlights another dimension of the income-health relation: the poor in particular face greater health risk at the lower end of the health distribution. We therefore argue that when studying health outcomes, a distributional approach that considers stochastic variation among observationally equivalent individuals is warranted.


Subjects
Economics, Medical; Health Status; Healthcare Disparities; Income/statistics & numerical data; Models, Statistical; Germany; Humans; Socioeconomic Factors
19.
J Cancer ; 8(14): 2692-2698, 2017.
Article in English | MEDLINE | ID: mdl-28928857

ABSTRACT

Objectives: To update the first sentinel nomogram predicting the presence of lymph node invasion (LNI) in prostate cancer patients undergoing sentinel lymph node dissection (sPLND), taking into account the percentage of positive cores. Patients and Methods: Analysis included 1,870 prostate cancer patients who underwent radioisotope-guided sPLND and retropubic radical prostatectomy. Prostate-specific antigen (PSA), clinical T category, primary and secondary biopsy Gleason grade, and percentage of positive cores were included in univariate and multivariate logistic regression models predicting LNI, and constituted the basis for the regression coefficient-based nomogram. Bootstrapping was applied to generate 95% confidence intervals for predicted probabilities. The area under the receiver operating characteristic curve (AUC) was obtained to quantify accuracy. Results: Median PSA was 7.68 ng/ml (interquartile range (IQR) 5.5-12.3). The number of lymph nodes removed was 10 (IQR 7-13). Overall, 352 patients (18.8%) had LNI. All preoperative prostate cancer characteristics differed significantly between LNI-positive and LNI-negative patients (P<0.001). In univariate accuracy analyses, the proportion of positive cores was the foremost predictor of LNI (AUC, 77%) followed by PSA (71.1%), clinical T category (69.9%), and primary and secondary Gleason grade (66.6% and 61.3%, respectively). For multivariate logistic regression models, all parameters were independent predictors of LNI (P<0.001). The nomogram exhibited a high predictive accuracy (AUC, 83.5%). Conclusion: The first update of the only available sentinel nomogram predicting LNI in prostate cancer patients demonstrates even better predictive accuracy and improved calibration. As an additional factor, the percentage of positive cores represents the leading predictor of LNI. This updated sentinel model should be externally validated and compared with results of extended PLND-based nomograms.
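
Since the abstract reports accuracy as AUC, a short sketch of what that number measures may help. It uses the rank-based (Mann-Whitney) definition on made-up predicted probabilities; the function name and scores are hypothetical and unrelated to the study data.

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a random positive case outscores a random negative one.

    Ties count as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Made-up predicted probabilities of LNI for patients with and without LNI.
predicted_lni_positive = [0.9, 0.8, 0.4]
predicted_lni_negative = [0.5, 0.3, 0.2]
example_auc = auc(predicted_lni_positive, predicted_lni_negative)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so a value of 83.5% means that in about five out of six patient pairs the model ranks the LNI-positive patient higher. The double loop is quadratic in the sample size and is meant for illustration only.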

20.
Comput Math Methods Med ; 2017: 6742763, 2017.
Article in English | MEDLINE | ID: mdl-28785300

ABSTRACT

The analysis of genome-wide association studies (GWAS) benefits from the investigation of biologically meaningful gene sets, such as gene-interaction networks (pathways). We propose an extension to a successful kernel-based pathway analysis approach by integrating kernel functions into a powerful algorithmic framework for variable selection, to enable investigation of multiple pathways simultaneously. We employ genetic similarity kernels from the logistic kernel machine test (LKMT) as base-learners in a boosting algorithm. A model to explain case-control status is created iteratively by selecting pathways that improve its prediction ability. We evaluated our method in simulation studies adopting 50 pathways for different sample sizes and genetic effect strengths. Additionally, we included an exemplary application of kernel boosting to a rheumatoid arthritis and a lung cancer dataset. Simulations indicate that kernel boosting outperforms the LKMT in certain genetic scenarios. Applications to GWAS data on rheumatoid arthritis and lung cancer resulted in sparse models which were based on pathways interpretable in a clinical sense. Kernel boosting is highly flexible in terms of considered variables and overcomes the problem of multiple testing. Additionally, it enables the prediction of clinical outcomes. Thus, kernel boosting constitutes a new, powerful tool in the analysis of GWAS data and towards the understanding of biological processes involved in disease susceptibility.


Subjects
Algorithms; Genetic Predisposition to Disease; Genome-Wide Association Study; Models, Genetic; Computer Simulation; Gene Regulatory Networks; Humans; Sample Size