Results 1 - 20 of 22
1.
J Appl Stat ; 51(1): 153-167, 2024.
Article in English | MEDLINE | ID: mdl-38179162

ABSTRACT

A quick count seeks to estimate the voting trends of an election and communicate them to the population on the evening of the same day of the election. In quick counts, the sampling is based on a stratified design of polling stations. Voting information is gathered gradually, often with no guarantee of obtaining the complete sample or even information in all the strata. However, accurate interval estimates with partial information must be obtained. Furthermore, this becomes more challenging if the strata are additionally study domains. To produce partial estimates, two strategies are proposed: (1) a Bayesian model using a dynamic post-stratification strategy and a single imputation process defined after a thorough analysis of historic voting information; additionally, a credibility level correction is included to solve the underestimation of the variance and (2) a frequentist alternative that combines standard multiple imputation ideas with classic sampling techniques to obtain estimates under a missing information framework. Both solutions are illustrated and compared using information from the 2021 quick count. The aim was to estimate the composition of the Chamber of Deputies in Mexico.
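The stratified logic behind such partial estimates can be illustrated with a minimal sketch: reported stations are averaged within each stratum, and strata with no reports yet are imputed from historical shares. All numbers here are hypothetical, and the sketch only shows the frequentist flavor; the actual quick-count models (dynamic post-stratification, multiple imputation, variance corrections) are far richer.

```python
# Sketch of a stratified vote-share estimate with partial information.
# All figures are hypothetical; real quick-count models are far richer.

def partial_estimate(strata):
    """Each stratum: dict with station count N, observed sample shares
    (possibly empty), and a historical share used for imputation."""
    total_N = sum(s["N"] for s in strata)
    est = 0.0
    for s in strata:
        if s["samples"]:                      # stratum has reported stations
            share = sum(s["samples"]) / len(s["samples"])
        else:                                 # empty stratum: impute
            share = s["historical"]
        est += s["N"] / total_N * share       # weight by stratum size
    return est

strata = [
    {"N": 400, "samples": [0.42, 0.45, 0.40], "historical": 0.44},
    {"N": 350, "samples": [0.38],             "historical": 0.37},
    {"N": 250, "samples": [],                 "historical": 0.51},  # no data yet
]
estimate = partial_estimate(strata)
```

As more stations report during the evening, the imputed strata are progressively replaced by observed averages, which is the essence of producing interval estimates "with partial information".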

2.
Environ Sci Pollut Res Int ; 30(28): 72319-72335, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37165270

ABSTRACT

Weather-station time series are a key source of information for flood studies. Analyzing past winter series reveals the behavior of the variables, and the results feed analysis and simulation models of quantities such as flow and water level in a study area. One of the most common problems is the acquisition and transmission of data from weather stations: atypical values and lost data create difficulties in the simulation process. Consequently, a numerical strategy is needed to address this problem. The data source for this study is a real database in which these problems occur across several weather variables. The study compares three time series analysis methods to evaluate a multivariable process offline. We applied a method based on the discrete Fourier transform (DFT) and contrasted it with methods such as averaging and linear regression without uncertainty parameters to complete missing data. The proposed methodology entails computing summary statistics, detecting outliers, and applying the DFT. The DFT allows the completion of the time series, based on its ability to handle various gap sizes and replace missing values. In sum, the DFT led to low error percentages for all the time series (1% on average). This percentage reflects what the shape or pattern of the time series would likely have been in the absence of misleading outliers and missing data.
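One common way to use the DFT for gap filling, sketched below under assumptions of our own (the paper does not publish its exact algorithm), is to iterate between smoothing the series with its strongest frequency components and re-imposing the observed values:

```python
import cmath, math

def dft(x):
    """Plain O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT, returning real parts."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def fill_gaps_dft(series, n_freq=3, n_iter=30):
    """Iteratively fill None gaps: reconstruct from the strongest DFT
    components, re-impose the observed values, repeat."""
    obs = [v for v in series if v is not None]
    guess = sum(obs) / len(obs)                     # start gaps at the mean
    x = [guess if v is None else v for v in series]
    for _ in range(n_iter):
        X = dft(x)
        # keep the DC term plus the 2*n_freq largest-magnitude bins
        order = sorted(range(1, len(X)), key=lambda k: -abs(X[k]))
        keep = {0} | set(order[:2 * n_freq])
        smooth = idft([X[k] if k in keep else 0.0 for k in range(len(X))])
        x = [smooth[t] if series[t] is None else series[t]
             for t in range(len(series))]
    return x
```

On a strongly periodic series (the typical case for seasonal hydroclimatic variables), the gap values converge quickly to the underlying cycle while observed points are left untouched.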


Sujet(s)
Temps (météorologie) , Facteurs temps , Colombie , Modèles linéaires , Simulation numérique
3.
Int. j. morphol ; 40(1): 148-156, feb. 2022. ilus, tab
Article in English | LILACS | ID: biblio-1385580

ABSTRACT

SUMMARY: Missing data may occur in any scientific study. Statistical shape analysis comprises methods that use geometric information obtained from objects; its most important input is landmarks. Missing data in shape analysis occur when information about landmark Cartesian coordinates is lost. The aim of this study is to propose the F-approach algorithm for estimating missing landmark coordinates and to compare its performance with generally accepted missing-data estimation methods: the EM algorithm; PCA-based methods such as Bayesian PCA, Nonlinear Estimation by Iterative Partial Least Squares PCA, inverse non-linear PCA, and probabilistic PCA; and regression imputation. In the simulation study, landmark counts were taken as 3, 6, and 9, and sample sizes as 5, 10, 30, 50, and 100. The data were generated from a multivariate normal distribution with positive definite variance-covariance matrices from isotropic models. Three different simulation scenarios and simulation-based real data were considered, with 1,000 repetitions. Across all sample sizes, the best and most distinctive performance was obtained by the Min(F) criterion of the F-approach algorithm proposed in the study. In the case of three landmarks, where only the proposed F approach and the regression imputation method can be applied, the Min(F) criterion gives the best results.


Subject(s)
Algorithms , Anatomic Landmarks , Data Interpretation, Statistical , Principal Component Analysis
4.
Data Brief ; 39: 107592, 2021 Dec.
Article in English | MEDLINE | ID: mdl-34869806

ABSTRACT

Changes observed in the current climate and projected for the future significantly concern researchers, decision-makers, and the general public. Climate indices of extreme rainfall events are a trend-assessment tool for detecting climate variability and change signals, and they are reasonably reliable at least in the short term, given climatic inertia. This paper presents 12 climate indices of extreme rainfall events at annual and seasonal scales for 12 climate stations between 1969 and 2019 in the Metropolitan Area of Cali (southwestern Colombia). The indices are built from daily rainfall time series which, although they contain only 0.5% to 5.4% missing data, can still bias the estimation of the indices. Here, we propose a methodology to complete missing data of the extreme-event indices that model the peaks in the time series. The methodology uses an artificial neural network approach known as Non-Linear Principal Component Analysis (NLPCA). The approach reconstructs the time series by modulating the extreme values of the indices, a fundamental feature when evaluating extreme rainfall events in a region. The accuracy of the index estimation shows values close to 1 for Pearson's correlation coefficient and the bi-weight correlation, and values close to 0 for percent bias and the RMSE-to-observations standard deviation ratio. The database provided here is an essential input for future studies of extreme rainfall events in the Metropolitan Area of Cali, the third largest urban agglomeration in Colombia, with more than 3.9 million inhabitants.

5.
BioData Min ; 14(1): 44, 2021 Sep 03.
Article in English | MEDLINE | ID: mdl-34479616

ABSTRACT

BACKGROUND: Missing data is a common issue in different fields, such as electronics, image processing, medical records and genomics. They can limit or even bias the posterior analysis. The data collection process can lead to different distribution, frequency, and structure of missing data points. They can be classified into four categories: Structurally Missing Data (SMD), Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR). For the three later, and in the context of genomic data (especially non-coding data), we will discuss six imputation approaches using 31,245 variants collected from ClinVar and annotated with 13 genome-wide features. RESULTS: Random Forest and kNN algorithms showed the best performance in the evaluated dataset. Additionally, some features show robust imputation regardless of the algorithm (e.g. conservation scores phyloP7 and phyloP20), while other features show poor imputation across algorithms (e.g. PhasCons). We also developed an R package that helps to test which imputation method is the best for a particular data set. CONCLUSIONS: We found that Random Forest and kNN are the best imputation method for genomics data, including non-coding variants. Since Random Forest is computationally more challenging, kNN remains a more realistic approach. Future work on variant prioritization thru genomic screening tests could largely profit from this methodology.
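The kNN imputation idea can be sketched in a few lines: a missing feature value is replaced by the average of that feature over the k rows closest in the features observed in common. This is a toy version with hypothetical data, not the R package's implementation:

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest
    rows (Euclidean distance on jointly observed features)."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            cands = []
            for i2, other in enumerate(rows):
                if i2 == i or other[j] is None:
                    continue
                common = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not common:
                    continue
                d = math.sqrt(sum((a - b) ** 2 for a, b in common) / len(common))
                cands.append((d, other[j]))
            cands.sort(key=lambda t: t[0])
            neighbours = cands[:k]
            if neighbours:
                filled[i][j] = sum(v2 for _, v2 in neighbours) / len(neighbours)
    return filled

# Hypothetical variants (rows) annotated with three features (columns)
data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],   # missing third feature
    [0.9, 1.9, 2.9],
    [5.0, 6.0, 7.0],
]
```

The distance is averaged over the number of jointly observed features so that rows with different missingness patterns remain comparable.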

6.
Environ Monit Assess ; 193(8): 531, 2021 Jul 28.
Article in English | MEDLINE | ID: mdl-34322768

ABSTRACT

Multivariate calibration based on partial least squares, random forest, and support vector machine methods, combined with the MissForest imputation algorithm, was used to understand the interaction between ozone and nitrogen oxides, carbon monoxide, wind speed, solar radiation, temperature, relative humidity, and other variables, measured by air quality monitoring stations at four distinct sites in the metropolitan area of Rio de Janeiro between 2014 and 2018. These techniques provide an easy and feasible way of modeling and analyzing air pollutants and can be coupled with other methods. The results showed that the random forest and support vector machine chemometric techniques can be used to model and predict tropospheric ozone concentrations, with a coefficient of determination for predictions of up to 0.92, a root-mean-square error of calibration between 4.66 and 27.15 µg m-3, and a root-mean-square error of prediction between 4.17 and 22.45 µg m-3, depending on the air quality monitoring station and season.


Subject(s)
Air Pollutants , Air Pollution , Ozone , Air Pollutants/analysis , Air Pollution/analysis , Brazil , Calibration , Environmental Monitoring , Ozone/analysis
7.
Rev. bras. estud. popul ; 38: e0139, 2021. tab, graf
Article in Portuguese | LILACS | ID: biblio-1280030

ABSTRACT

In this article, we estimate adult mortality by education level in São Paulo. We compare estimates based on deaths from the 2010 Census and the 2013 Mortality Information System (Sistema de Informação de Mortalidade - SIM) - DATASUS, using three different ways of measuring education level: as recorded in the SIM, as reported in the census for the household head, and as imputed statistically in the census for individuals who died. For the statistical imputation, we use the method of Dempster et al. (1977), which proposes the expectation-maximization (EM) algorithm to deal with missing data. We consider three education levels (low, medium, and high) and estimate mortality rates based on Poisson models. The results indicate that between ages 25 and 59, more years of schooling are associated with mortality rates up to 77% lower. Secondary (medium) education provides most of the mortality gains at adult ages (about 50%). The mortality differentials calculated with death records from the SIM and with census deaths whose education was imputed statistically are similar. However, estimates based on the assumption that the deceased's education equals the household head's resulted in atypical mortality patterns, probably distorted by the education transition in Brazil. We hope that the imputation model proposed here can be used in future mortality analyses by socioeconomic status using census deaths.
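The rate comparison itself is simple: under a Poisson model with a log exposure offset, the maximum-likelihood rate for each education level is just deaths divided by person-years. A sketch with invented counts (the study's actual data come from the census and the SIM):

```python
# Hypothetical death counts and person-years by education level.
deaths       = {"low": 900, "medium": 450, "high": 120}
person_years = {"low": 100_000, "medium": 90_000, "high": 60_000}

# Poisson MLE of each level's mortality rate = deaths / exposure
rate = {lvl: deaths[lvl] / person_years[lvl] for lvl in deaths}

# Relative reduction in the mortality rate, high vs. low education
reduction = 1.0 - rate["high"] / rate["low"]
```

With regression covariates (age, sex) the same quantity becomes the exponentiated coefficient of the education term in a Poisson (log-linear) model; the toy version above conveys the interpretation of a "77% lower" rate.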


Subject(s)
Humans , Mortality , Censuses , Educational Status , Survival (demography) , Reference Standards , Algorithms , Brazil , Information Systems , Primary and Secondary Education
8.
Mol Phylogenet Evol ; 151: 106896, 2020 10.
Article in English | MEDLINE | ID: mdl-32562821

ABSTRACT

The reconstruction of relationships within recently radiated groups is challenging even when massive amounts of sequencing data are available. The use of restriction site-associated DNA sequencing (RAD-Seq) to this end is promising. Here, we assessed the performance of RAD-Seq in inferring the species-level phylogeny of the rapidly radiating genus Cereus (Cactaceae). To examine how the amount of genomic data affects resolution in this group, we assembled several datasets and implemented different analyses. We sampled 52 individuals of Cereus, representing 18 of the 25 currently recognized species, plus members of the closely allied genera Cipocereus and Praecereus, with 11 other Cactaceae genera as outgroups. Three levels of permissiveness to missing data were applied in iPyRAD, yielding datasets with 30% (333 loci), 45% (1,440 loci), and 70% (6,141 loci) missing data. For each dataset, Maximum Likelihood (ML) trees were generated using two supermatrices: SNPs only, and SNPs plus invariant sites. Accuracy and resolution improved when the dataset with the highest number of loci was used (6,141 loci), despite its high percentage of missing data (70%). Coalescent trees estimated using SVDQuartets and ASTRAL are similar to those obtained by the ML reconstructions. Overall, we reconstruct a well-supported phylogeny of Cereus, which is resolved as monophyletic and composed of four main clades with high support for their internal relationships. Our findings also provide insights into the impact of missing data on phylogeny reconstruction using RAD loci.
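The trade-off the abstract describes, where stricter missing-data thresholds keep fewer loci, comes from a filtering step that can be sketched schematically (this is an illustration of the idea, not iPyRAD's implementation):

```python
def filter_loci(matrix, max_missing=0.30):
    """matrix[locus][sample] with None for missing genotypes; keep loci whose
    fraction of missing samples does not exceed max_missing."""
    kept = []
    for locus in matrix:
        frac = sum(g is None for g in locus) / len(locus)
        if frac <= max_missing:
            kept.append(locus)
    return kept

# Toy alignment: 3 loci scored for 4 samples
loci = [
    ["A", "A", "T", None],      # 25% missing -> kept at the 30% threshold
    ["G", None, None, None],    # 75% missing -> dropped
    ["C", "C", "C", "C"],       # complete    -> kept
]
```

Raising `max_missing` from 0.30 to 0.70 mirrors the paper's move from the 333-locus to the 6,141-locus matrix: more loci are retained, at the cost of a sparser supermatrix.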


Subject(s)
Biological Evolution , Cactaceae/genetics , Genome, Plant , Sequence Analysis, DNA , Base Sequence , Databases, Genetic , Genetic Loci , Genetic Speciation , Likelihood Functions , Phylogeny , Polymorphism, Single Nucleotide/genetics , Principal Component Analysis
9.
Front Public Health ; 8: 536174, 2020.
Article in English | MEDLINE | ID: mdl-33585375

ABSTRACT

Assessment of air quality in metropolitan areas is a major challenge in the environmental sciences. Related issues include the distribution of monitoring stations, their spatial range, and missing information. In Mexico City, stations measuring pollutants such as CO, NO2, O3, SO2, PM2.5, PM10, NO, NOx, and PMCO have been located across the entire metropolitan zone. A fundamental question is whether the number and location of these stations are adequate to optimally cover the city. By analyzing spatio-temporal correlations of pollutant measurements, we evaluated the distribution and performance of monitoring stations in Mexico City from 2009 to 2018. Based on our analysis, air quality monitoring of these contaminants adequately covers the 16 boroughs of Mexico City, with the exception of SO2, whose spatial range is shorter than that needed to cover the whole surface of the city. We observed that NO and NOx concentrations must be taken into account, since their long-range dispersion may have relevant consequences for public health. With this approach, we can propose systematic, criteria-based policy for locating new monitoring stations.


Subject(s)
Air Pollutants , Air Pollution , Air Pollutants/adverse effects , Air Pollution/analysis , Cities , Environmental Monitoring , Mexico , Public Health
10.
Data Brief ; 26: 104517, 2019 Oct.
Article in English | MEDLINE | ID: mdl-31667280

ABSTRACT

The success of many projects linked to the management and planning of water resources depends mainly on the quality of the climatic and hydrological data provided. Nevertheless, missing data are frequently found in hydroclimatic variables due to measuring instrument failures, observation recording errors, meteorological extremes, and the challenges associated with accessing measurement areas. Hence, it is necessary to appropriately fill missing data before any analysis. This paper presents the filling of missing monthly rainfall data for 45 gauge stations located in southwestern Colombia. The series analyzed cover 34 years of observations between 1983 and 2016, available from the Instituto de Hidrología, Meteorología y Estudios Ambientales (IDEAM). Missing data were estimated using Non-Linear Principal Component Analysis (NLPCA), a non-linear generalization of standard Principal Component Analysis via an artificial neural network (ANN) approach. The best result was obtained using a network with a [45-44-45] architecture. The estimated mean squared error in the imputation of missing data was approximately 9.8 mm/month, showing that the NLPCA approach constitutes a powerful methodology for the imputation of missing rainfall data. The estimated rainfall dataset helps reduce uncertainty for further studies related to homogeneity analyses, clusters, trends, multivariate statistics, and meteorological forecasts in regions with information deficits such as southwestern Colombia.

11.
Stat Methods Med Res ; 28(9): 2583-2594, 2019 09.
Article in English | MEDLINE | ID: mdl-29629629

ABSTRACT

Extreme learning machines have gained a lot of attention in the machine learning community because of their interesting properties and computational advantages. With the ever-increasing collection of information, many data sources have missing information, making statistical analysis harder or unfeasible. In this paper, we present a new model, coined the spatial extreme learning machine, that combines spatial modeling with extreme learning machines, keeping the nice properties of both methodologies and making it very flexible and robust. As explained throughout the text, spatial extreme learning machines have many advantages over traditional extreme learning machines. Through a simulation study and a real data analysis, we show how the spatial extreme learning machine can be used to improve the imputation of missing data and the estimation of prediction uncertainty.
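The computational advantage of extreme learning machines comes from their structure: hidden-layer weights are drawn at random and only the output weights are solved, in closed form, by (ridge) least squares. A minimal one-dimensional sketch with hypothetical data; the paper's spatial extension, which this does not reproduce, adds spatial structure on top of this construction:

```python
import math, random

def solve(A, b):
    """Gaussian elimination with partial pivoting for A x = b."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def elm_fit(xs, ys, hidden=20, ridge=1e-4, seed=0):
    """ELM for 1-D regression: random tanh hidden layer, output weights
    solved by ridge-regularised least squares (normal equations)."""
    rng = random.Random(seed)
    w = [rng.uniform(-2, 2) for _ in range(hidden)]
    b = [rng.uniform(-2, 2) for _ in range(hidden)]
    H = [[math.tanh(w[j] * x + b[j]) for j in range(hidden)] for x in xs]
    # (H^T H + ridge * I) beta = H^T y
    A = [[sum(H[i][p] * H[i][q] for i in range(len(xs)))
          + (ridge if p == q else 0.0) for q in range(hidden)]
         for p in range(hidden)]
    rhs = [sum(H[i][p] * ys[i] for i in range(len(xs))) for p in range(hidden)]
    beta = solve(A, rhs)
    return lambda x: sum(beta[j] * math.tanh(w[j] * x + b[j]) for j in range(hidden))

# Hypothetical training data: one period of a sine curve
xs = [i * 0.1 for i in range(63)]
ys = [math.sin(x) for x in xs]
model = elm_fit(xs, ys)
```

Because training reduces to one linear solve, ELMs are orders of magnitude faster to fit than backpropagation-trained networks, which is the property the spatial variant inherits.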


Subject(s)
Bayes Theorem , HIV Infections/epidemiology , Lung Neoplasms/epidemiology , Machine Learning , Brazil/epidemiology , Computer Simulation , Demography , Humans , Incidence , Monte Carlo Method
12.
Stat Med ; 37(24): 3503-3518, 2018 10 30.
Article in English | MEDLINE | ID: mdl-29873100

ABSTRACT

Generalized linear models are often used to fit propensity scores, which in turn are used to compute inverse probability weighted (IPW) estimators. To derive the asymptotic properties of IPW estimators, the propensity score is assumed to be bounded away from zero. This condition is known in the literature as strict positivity (or the positivity assumption), and when it fails in practice, IPW estimators are very unstable and highly variable. Although strict positivity is often assumed, it does not hold when some of the covariates are unbounded. In real data sets, a data-generating process that violates the positivity assumption may lead to wrong inference because of inaccurate estimation. In this work, we attempt to reconcile the strict positivity condition with the theory of generalized linear models by incorporating an extra parameter, which yields an explicit lower bound for the propensity score; the additional parameter also fulfils the overlap assumption in the causal framework.
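Why a lower bound on the propensity score stabilizes IPW estimators can be seen in a small sketch. Here a normalized (Hájek-type) IPW estimator simply truncates the estimated scores below at a constant `delta`; this illustrates the effect of enforcing strict positivity, not the paper's specific parametrization, and all data are simulated:

```python
def ipw_mean(outcomes, treated, propensity, delta=0.05):
    """Normalized IPW estimate of E[Y(1)] from treated units only,
    truncating each estimated propensity score below at delta."""
    num = sum(y / max(e, delta)
              for y, t, e in zip(outcomes, treated, propensity) if t)
    den = sum(1.0 / max(e, delta)
              for t, e in zip(treated, propensity) if t)
    return num / den

# Simulated data: one treated unit has a near-zero propensity score,
# which would receive weight 100 without the bound.
y = [3.0, 5.0, 4.0, 6.0, 2.0]
t = [1, 1, 0, 1, 0]
e = [0.8, 0.5, 0.4, 0.01, 0.3]
bounded = ipw_mean(y, t, e)            # delta = 0.05 caps the weight at 20
unbounded = ipw_mean(y, t, e, delta=0.0)
```

Without the bound, the single unit with score 0.01 dominates the weighted sum; the bound trades a little bias for a large reduction in variance, which is the practical motivation for the explicit lower bound the paper builds into the model.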


Subject(s)
Models, Statistical , Propensity Score , Biostatistics , Causality , Computer Simulation , Forced Expiratory Volume , Hospitalization/economics , Humans , Least-Squares Analysis , Likelihood Functions , Linear Models , Monte Carlo Method
13.
J Anat ; 232(1): 3-14, 2018 Jan.
Article in English | MEDLINE | ID: mdl-29071711

ABSTRACT

Bone size and shape arise throughout ontogeny as a result of the coordinated activity of osteoblasts and osteoclasts, responsible for bone deposition and resorption, and growth displacements. The modelling processes leave specific microstructural features on the bone surface, which can be used to infer the mechanisms shaping craniofacial traits in extinct and extant species. However, the analysis of bone surfaces from fossils and archaeological samples faces some difficulties related to the bone loss caused by taphonomic factors, and the lack of formal methods for estimating missing information and comparing the patterns of bone modelling among several specimens and samples. The present study provides a new approach for the quantitative analysis of bone formation and resorption patterns obtained from craniofacial surfaces. First, interpolation techniques were used to estimate missing data on high-resolution replicas of the left maxilla in a sample of sub-adult and adult modern humans and sub-adult fossil hominins. The performance of this approach was assessed by simulating variable amounts of missing data. Then, we applied measures of dispersion and central tendency to represent the variation and average pattern of bone modelling within samples. The spatial interpolation resulted in reliable estimations of the type of cell activity (deposition or resorption) in the missing areas, even when large extensions of the bone surface were lost. The quantification of the histological data allowed us to integrate the information of different specimens and depict the areas with higher and lower variation in the bone modelling pattern of the maxilla among specimens. Overall, the main advantages of the quantitative approach used here for generating bone modelling patterns are the high replicability and the possibility of incorporating variation among specimens into the comparisons among samples.
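The paper interpolates deposition/resorption labels across missing regions of 3-D surface replicas; the idea can be illustrated with a toy 2-D analogue in which a lost cell takes the dominant sign of its observed neighbours (a nearest-neighbour stand-in for the actual spatial interpolation, with made-up data):

```python
def fill_surface(grid):
    """grid: 2-D list with +1 (deposition), -1 (resorption), or None (lost
    surface). Missing cells get the dominant sign of observed neighbours."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(rows):
        for j in range(cols):
            if grid[i][j] is not None:
                continue
            neigh = [grid[i + di][j + dj]
                     for di in (-1, 0, 1) for dj in (-1, 0, 1)
                     if (di or dj)
                     and 0 <= i + di < rows and 0 <= j + dj < cols
                     and grid[i + di][j + dj] is not None]
            if neigh:
                out[i][j] = 1 if sum(neigh) >= 0 else -1
    return out

# Toy maxilla patch: mostly deposition, one taphonomically lost cell
patch = [
    [1, 1,  1],
    [1, None, -1],
    [1, 1, -1],
]
```

Averaging many such filled maps across specimens is what allows variation and central tendency in the bone-modelling pattern to be quantified, as the abstract describes.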


Subject(s)
Fossils/anatomy & histology , Hominidae/anatomy & histology , Image Processing, Computer-Assisted/methods , Models, Anatomic , Skull/anatomy & histology , Animals , Humans
14.
PeerJ ; 5: e2890, 2017.
Article in English | MEDLINE | ID: mdl-28413719

ABSTRACT

BACKGROUND: Previous quantitative studies on shell variation in Bauruemys elegans (Suárez, 1969), as well as the taphonomic interpretation of its type locality, have suggested that all specimens collected in this locality may have belonged to the same population. We rely on this hypothesis in a morphometric study of the skull. We also tentatively assessed differentiation in feeding preferences that might be explained by ontogenetic changes. METHODS: We carried out an ANOVA on 29 linear measurements from 21 skulls of B. elegans, taken both with a caliper and from images using the ImageJ software. First, a Principal Component Analysis (PCA) was performed with 27 measurements (excluding the total length and width characters; hereafter "raw data") to visualize the scatter plots based on form variance only. Then, a second PCA was carried out using ratios of the length and width of each original measurement to assess shape variation among individuals. Finally, the original measurements were log-transformed to describe allometries over ontogeny. RESULTS: No statistical differences were found between caliper and ImageJ measurements. The first three PCs of the PCA on raw data comprised 70.2% of the variance; PC1 was related to size variation and the others to shape variation. Two specimens plotted outside the 95% ellipse in the PC1∼PC2 axes. The first three PCs of the PCA on ratios comprised 64% of the variance; considering PC1∼PC2, all specimens plotted inside the 95% ellipse. In the allometric analysis, five measurements were positively allometric, 19 were negatively allometric, and three showed enantiometric allometry. Many bones of the posterior and lateral emarginations lengthen with increasing size, while the jugal and quadratojugal decrease in width. DISCUSSION: ImageJ can replace the caliper, since there were no statistical differences between the two; iterative imputation, moreover, is more appropriate for dealing with missing data in PCA. Some specimens show small differences in form and shape. Form differences were interpreted as occurring due to ontogeny, whereas shape differences are related to feeding changes during growth. Moreover, all outlier specimens are crushed and/or distorted, so the form/shape differences may be partially due to taphonomy. The allometric lengthening of the parietal, quadrate, squamosal, and maxilla, associated with the narrowing of the jugal and quadratojugal, may be related to changes in feeding habit between different stages of development. This change in shape might represent a progressive stretching of the skull and enlargement of the posterior and lateral emarginations during ontogeny, and consequently an increase in the feeding-apparatus musculature. Smaller individuals may have fed on a softer diet, whereas larger ones probably had a harder diet, as seen in some living species of Podocnemis. We conclude that the skull variation might be related to differences in feeding habits over ontogeny in B. elegans.

15.
Biometrics ; 73(1): 220-231, 2017 03.
Article in English | MEDLINE | ID: mdl-27506481

ABSTRACT

Motivated by a study conducted to evaluate the associations of 51 inflammatory markers with lung cancer risk, we propose several approaches of varying computational complexity for analyzing multiple correlated markers that are also censored due to lower and/or upper limits of detection, using likelihood-based sufficient dimension reduction (SDR) methods. We extend the theory and the likelihood-based SDR framework in two ways: (i) we accommodate censored predictors directly in the likelihood, and (ii) we incorporate variable selection. We find linear combinations that contain all the information that the correlated markers have on an outcome variable (i.e., that are sufficient for modeling and prediction of the outcome) while accounting for censoring of the markers. These methods yield efficient estimators and can be applied to any type of outcome, including continuous and categorical. We illustrate and compare all methods using data from the motivating study and in simulations. We find that explicitly accounting for censoring in the likelihood of the SDR methods can lead to appreciable gains in efficiency and prediction accuracy, and also outperforms multiple imputation combined with standard SDR.


Subject(s)
Biometry/methods , Data Interpretation, Statistical , Probability , Biomarkers , Computer Simulation , Humans , Inflammation , Likelihood Functions , Limit of Detection , Lung Neoplasms/diagnosis , Lung Neoplasms/pathology , Models, Statistical , Regression Analysis , Risk
16.
Stat Methods Med Res ; 25(4): 1579-95, 2016 08.
Article in English | MEDLINE | ID: mdl-23804968

ABSTRACT

Competing risks arise in medical research when subjects are exposed to various types or causes of death. Data from large cohort studies usually exhibit subsets of regressors that are missing for some study subjects, and such studies often give rise to censored data. In this article, a carefully formulated likelihood-based technique is developed for the regression analysis of right-censored competing risks data when two of the covariates are discrete and partially missing. The approach envisaged here comprises two models: one describes the covariate effects on both long-term incidence and the conditional latencies for each cause of death, whilst the other deals with the observation process by which the covariates are missing. The former is formulated with a well-established mixture model, and the latter is characterised by copula-based bivariate probability functions for both the missing covariates and the missing-data mechanism. The resulting formulation lends itself to empirical assessment of non-ignorability through sensitivity analyses using models with and without a non-ignorable component. The methods are illustrated on a 20-year follow-up of a prostate cancer cohort from the National Cancer Institute's Surveillance, Epidemiology, and End Results program.


Sujet(s)
Tumeurs de la prostate/diagnostic , Triage , Sujet âgé , Études de cohortes , Humains , Fonctions de vraisemblance , Mâle , Adulte d'âge moyen , National Cancer Institute (USA) , Pronostic , Analyse de régression , Risque , États-Unis
17.
Actual. psicol. (Impr.) ; 29(119)dic. 2015.
Article de Espagnol | LILACS-Express | LILACS | ID: biblio-1505549

RÉSUMÉ



Most social and educational data sets contain missing observations due to attrition or nonresponse. Missing-data methodology has improved dramatically in recent years, and popular statistical software now offers a variety of sophisticated options. Despite the widespread availability of theoretically justified methods, many researchers still rely on outdated imputation techniques that can produce biased analyses. This article provides a conceptual introduction to the patterns of missing data. It then introduces how to handle and analyze missing data with the modern mechanisms of full-information maximum likelihood (FIML) and multiple imputation (MI). An introduction to planned missing-data designs is also included, along with new computational tools such as the Quark function and the semTools package. The authors hope that this paper encourages researchers to adopt modern methods for analyzing missing data.
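As a rough sketch of the multiple-imputation workflow this article describes (not its FIML machinery, and not the Quark or semTools tools, which are R-based), scikit-learn's IterativeImputer with sample_posterior=True can generate several completed data sets whose estimates are then pooled. All data below are simulated under a MAR mechanism:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
data = np.column_stack([x1, x2])

# MAR mechanism: x2 is more likely to be missing when x1 is low,
# so the complete cases over-represent high values of x2.
miss = rng.random(n) < 1.0 / (1.0 + np.exp(2.0 * x1))
data[miss, 1] = np.nan
complete_case_mean = np.nanmean(data[:, 1])

# Multiple imputation: draw several completed data sets from the
# predictive distribution (sample_posterior=True), then pool.
estimates = []
for seed in range(5):
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imp.fit_transform(data)
    estimates.append(completed[:, 1].mean())
pooled_mean = float(np.mean(estimates))
```

Because the imputation model exploits the observed x1, the pooled estimate sits near the true mean of zero, whereas the complete-case mean is biased upward, which is the "biased analyses" the abstract warns about.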

18.
An. acad. bras. ciênc ; 83(1): 61-72, Mar. 2011. ilus, graf, tab
Article de Anglais | LILACS | ID: lil-578282

RÉSUMÉ

Missing data is a common problem in paleontology. It hampers accurate reconstruction of extinct taxa and restricts the inclusion of some taxa in comparative and biomechanical studies. In particular, estimating the position of vertebrae in incomplete series is often done non-empirically and does not allow precise estimation of missing parts. In this work we present a method for calculating the position of preserved middle sequences of caudal vertebrae in the saurischian dinosaur Staurikosaurus pricei, based on the length and height of preserved anterior and posterior caudal vertebral centra. Regression equations were used to estimate these dimensions for middle vertebrae and, consequently, to assess the position of the preserved middle sequences. This also allowed estimating these dimensions for non-preserved vertebrae. Results indicate that the preserved caudal vertebrae of Staurikosaurus may correspond to positions 1-3, 5, 7, 14-19/15-20, 24-25/25-26, and 29-47, and that at least 25 vertebrae had transverse processes. The total length of the tail was estimated at 134 cm and total body length at 220-225 cm.


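The paper's regression approach can be sketched in a few lines: fit centrum length against known positions for the preserved anterior and posterior vertebrae, then place a vertebra of unknown position where its measured length best matches the fitted trend. All measurements below are invented for illustration; they are not the Staurikosaurus data.

```python
import numpy as np

# Hypothetical centrum lengths (cm) for preserved anterior and posterior
# caudal vertebrae with known positions (NOT the published measurements).
positions = np.array([1, 2, 3, 5, 7, 40, 42, 44, 46])
lengths = np.array([3.4, 3.3, 3.2, 3.0, 2.8, 1.1, 1.0, 0.9, 0.8])

# Centrum length shrinks roughly linearly along the tail.
slope, intercept = np.polyfit(positions, lengths, 1)

def predicted_length(pos):
    """Expected centrum length at a given caudal position."""
    return slope * pos + intercept

def estimate_position(measured_length, candidates=range(1, 48)):
    """Position whose predicted length best matches a measured centrum."""
    return min(candidates, key=lambda p: abs(predicted_length(p) - measured_length))
```

The same fitted line also yields size estimates for entirely missing vertebrae, which is how the tail-length total can be built up from an incomplete series.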


Sujet(s)
Animaux , Dinosaures/anatomie et histologie , Paléontologie/méthodes , Rachis/anatomie et histologie , Queue/anatomie et histologie , Dinosaures/classification , Fossiles
19.
Rev. bras. epidemiol ; Rev. bras. epidemiol;13(4): 596-606, Dec. 2010. ilus, graf, tab
Article de Portugais | LILACS | ID: lil-569101

RÉSUMÉ



INTRODUCTION: Health studies commonly face problems with missing data. Through imputation, artificially complete data sets are built that can be analyzed with traditional statistical techniques. The objective of this paper is to compare three types of imputation using real data. METHODS: The data came from a study on the development of risk models for surgical mortality; the sample size was 450 patients. The imputation methods applied were two single imputations and one multiple imputation, under the MAR (Missing at Random) assumption. RESULTS: The variable with missing data was serum albumin, with a missing rate of 27.1 percent. The logistic models fitted after single imputation were similar to each other but differed from the models obtained by multiple imputation with respect to the variables included. CONCLUSIONS: The results indicate that it is important to take into account the relationship of albumin to the other observed variables, because single and multiple imputation yielded different models. Single imputation underestimates variability, generating narrower confidence intervals. Imputation methods should be considered whenever data are missing, especially multiple imputation, which accounts for between-imputation variability in the model estimates.
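The conclusion that single imputation underestimates variability can be illustrated with Rubin's pooling rules. The numbers below are simulated (only the 27.1% missing rate and n = 450 echo the study); the point is that the multiple-imputation variance adds a between-imputation component that mean imputation ignores:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 450
albumin = rng.normal(3.5, 0.5, size=n)   # hypothetical serum albumin (g/dL)
missing = rng.random(n) < 0.271          # ~27.1% missing, as in the study
obs = albumin[~missing]

# Single imputation: fill every gap with the observed mean and then
# treat the data as if they had been complete all along.
single = albumin.copy()
single[missing] = obs.mean()
var_single = single.var(ddof=1) / n      # estimated variance of the mean

# Multiple imputation: draw each gap from the observed distribution,
# repeat m times, and pool with Rubin's rules.
m = 20
estimates, within_vars = [], []
for _ in range(m):
    filled = albumin.copy()
    filled[missing] = rng.normal(obs.mean(), obs.std(ddof=1), size=missing.sum())
    estimates.append(filled.mean())
    within_vars.append(filled.var(ddof=1) / n)
within = np.mean(within_vars)
between = np.var(estimates, ddof=1)
var_mi = within + (1 + 1 / m) * between  # Rubin's total variance
```

The single-imputation variance is deflated twice: the constants shrink the sample variance, and the uncertainty about the imputed values is never propagated; Rubin's total variance restores both pieces, giving the wider (more honest) confidence intervals the abstract describes.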


Sujet(s)
Humains , Méthodes épidémiologiques , Modèles statistiques , Procédures de chirurgie opératoire/mortalité , Risque
20.
Rev. bras. educ. fís. esp ; 24(3): 413-431, jul.-set. 2010. graf, tab
Article de Portugais | LILACS | ID: lil-604579

RÉSUMÉ



The main aim of this study is to present a tutorial for Sport Sciences and Physical Education researchers facing the challenges that emerge from longitudinal data analysis. Based on a real data set from the Muzambinho mixed-longitudinal study, we deal with three main concerns: 1) building a developmental view based on hierarchical (multilevel) modeling; 2) presenting two solutions to the missing-data problem; 3) examining the stability of interindividual differences in intraindividual change (i.e., tracking). For each of these issues, questions are posed and their answers are presented alongside the main results from the different statistical software packages used.
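Of the three concerns, tracking is the simplest to sketch: the stability of interindividual differences can be summarized as the rank-order correlation of scores across two measurement occasions. The data below are simulated, not from the Muzambinho study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 100
baseline = rng.normal(size=n)                              # score at occasion 1
followup = 0.7 * baseline + rng.normal(scale=0.7, size=n)  # score at occasion 2

# Tracking: how stable are interindividual differences over time?
# A common summary is the rank-order (Spearman) correlation.
rho, p_value = stats.spearmanr(baseline, followup)
```

A high rho means subjects keep roughly the same relative position over time (strong tracking); a rho near zero means early ranks say little about later ones.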


Sujet(s)
Interprétation statistique de données , /statistiques et données numériques , Recherche