1.
Bioinformatics ; 39(12), 2023 12 01.
Article in English | MEDLINE | ID: mdl-38039146

ABSTRACT

SUMMARY: Due to their flexibility and superior performance, machine learning models frequently complement and outperform traditional statistical survival models. However, their widespread adoption is hindered by a lack of user-friendly tools to explain their internal operations and prediction rationales. To tackle this issue, we introduce the survex R package, which provides a cohesive framework for explaining any survival model by applying explainable artificial intelligence techniques. The capabilities of the proposed software encompass understanding and diagnosing survival models, which can lead to their improvement. By revealing insights into the decision-making process, such as variable effects and importances, survex enables the assessment of model reliability and the detection of biases. Thus, transparency and responsibility may be promoted in sensitive areas, such as biomedical research and healthcare applications. AVAILABILITY AND IMPLEMENTATION: survex is available under the GPL3 public license at https://github.com/modeloriented/survex and on CRAN with documentation available at https://modeloriented.github.io/survex.


Subjects
Artificial Intelligence , Biomedical Research , Reproducibility of Results , Software , Machine Learning
2.
Diabetologia ; 66(10): 1914-1924, 2023 10.
Article in English | MEDLINE | ID: mdl-37420130

ABSTRACT

AIMS/HYPOTHESIS: There is increasing evidence for the existence of shared genetic predictors of metabolic traits and neurodegenerative disease. We previously observed a U-shaped association between fasting insulin in middle-aged women and dementia up to 34 years later. In the present study, we performed genome-wide association (GWA) analyses for fasting serum insulin in European children with a focus on variants associated with the tails of the insulin distribution. METHODS: Genotyping was successful in 2825 children aged 2-14 years at the time of insulin measurement. Because insulin levels vary during childhood, GWA analyses were based on age- and sex-specific z scores. Five percentile ranks of z-insulin were selected and modelled using logistic regression, i.e. the 15th, 25th, 50th, 75th and 85th percentile ranks (P15-P85). Additive genetic models were adjusted for age, sex, BMI, survey year, survey country and principal components derived from genetic data to account for ethnic heterogeneity. Quantile regression was used to determine whether associations with variants identified by GWA analyses differed across quantiles of log-insulin. RESULTS: A variant in the SLC28A1 gene (rs2122859) was associated with the 85th percentile rank of the insulin z score (P85, p value = 3×10⁻⁸). Two variants associated with low z-insulin (P15, p value < 5×10⁻⁶) were located on the RBFOX1 and SH3RF3 genes. These genes have previously been associated with both metabolic traits and dementia phenotypes. While variants associated with P50 showed stable associations across the insulin spectrum, we found that associations with variants identified through GWA analyses of P15 and P85 varied across quantiles of log-insulin. CONCLUSIONS/INTERPRETATION: The above results support the notion of a shared genetic architecture for dementia and metabolic traits. Our approach identified genetic variants that were associated with the tails of the insulin spectrum only.
Because traditional heritability estimates assume that genetic effects are constant throughout the phenotype distribution, the new findings may have implications for understanding the discrepancy in heritability estimates from GWA and family studies and for the study of U-shaped biomarker-disease associations.


Subjects
Dementia , Neurodegenerative Diseases , Male , Female , Humans , Genome-Wide Association Study , Insulin , Fasting , Single Nucleotide Polymorphism , Ubiquitin-Protein Ligases
3.
Genet Epidemiol ; 45(5): 485-536, 2021 07.
Article in English | MEDLINE | ID: mdl-33942369

ABSTRACT

The Translational Machine (TM) is a machine learning (ML)-based analytic pipeline that translates genotypic/variant call data into biologically contextualized features that richly characterize complex variant architectures and permit greater interpretability and biological replication. It also reduces potentially confounding effects of population substructure on outcome prediction. The TM consists of three main components. First, replicable but flexible feature engineering procedures translate genome-scale data into biologically informative features that appropriately contextualize simple variant calls/genotypes within biological and functional contexts. Second, model-free, nonparametric ML-based feature filtering procedures empirically reduce dimensionality and noise of both original genotype calls and engineered features. Third, a powerful ML algorithm for feature selection is used to differentiate risk variant contributions across variant frequency and functional prediction spectra. The TM simultaneously evaluates potential contributions of variants operative under polygenic and heterogeneous models of genetic architecture. Our TM enables integration of biological information (e.g., genomic annotations) within conceptual frameworks akin to geneset-/pathways-based and collapsing methods, but overcomes some of these methods' limitations. The full TM pipeline is executed in R. Our approach and initial findings from its application to a whole-exome schizophrenia case-control data set are presented. These TM procedures extend the findings of the primary investigation and yield novel results.


Subjects
Machine Learning , Genetic Models , Algorithms , Genomics , Genotype , Humans
4.
Int J Obes (Lond) ; 45(6): 1321-1330, 2021 06.
Article in English | MEDLINE | ID: mdl-33753884

ABSTRACT

BACKGROUND: Childhood obesity is a complex multifaceted condition, which is influenced by genetics, environmental factors, and their interaction. However, these interactions have mainly been studied in twin studies and evidence from population-based cohorts is limited. Here, we analyze the interaction of an obesity-related genome-wide polygenic risk score (PRS) with sociodemographic and lifestyle factors for BMI and waist circumference (WC) in European children and adolescents. METHODS: The analyses are based on 8609 repeated observations from 3098 participants aged 2-16 years from the IDEFICS/I.Family cohort. A genome-wide polygenic risk score (PRS) was calculated using summary statistics from independent genome-wide association studies of BMI. Associations were estimated using generalized linear mixed models adjusted for sex, age, region of residence, parental education, dietary intake, relatedness, and population stratification. RESULTS: The PRS was associated with BMI (beta estimate [95% confidence interval (95%-CI)] = 0.33 [0.30, 0.37], r² = 0.11, p value = 7.9 × 10⁻⁸¹) and WC (beta [95%-CI] = 0.36 [0.32, 0.40], r² = 0.09, p value = 1.8 × 10⁻⁷¹). We observed significant interactions with demographic and lifestyle factors for BMI as well as WC. Children from Southern Europe showed increased genetic liability to obesity (BMI: beta [95%-CI] = 0.40 [0.34, 0.45]) in comparison to children from Central Europe (beta [95%-CI] = 0.29 [0.23, 0.34]; p-interaction = 0.0066). Children of parents with a low level of education showed an increased genetic liability to obesity (BMI: beta [95%-CI] = 0.48 [0.38, 0.59]) in comparison to children of parents with a high level of education (beta [95%-CI] = 0.30 [0.26, 0.34]; p-interaction = 0.0012). Furthermore, the genetic liability to obesity was attenuated by a higher intake of fiber (BMI: beta [95%-CI] interaction = -0.02 [-0.04, -0.01]) and shorter screen times (beta [95%-CI] interaction = 0.02 [0.00, 0.03]).
CONCLUSIONS: Our results highlight that a healthy childhood environment might partly offset a genetic predisposition to obesity during childhood and adolescence.
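
A polygenic risk score of the kind described above is commonly computed as a weighted sum of effect-allele dosages, with the weights taken from GWAS summary statistics. The sketch below illustrates only that arithmetic. The variant identifiers, effect sizes and dosages are invented, and the function name `polygenic_risk_score` is hypothetical; it is not the pipeline used in the study.

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of allele dosages over variants present in both inputs."""
    shared = sorted(dosages.keys() & weights.keys())
    return sum(dosages[v] * weights[v] for v in shared)


# Hypothetical per-variant effect sizes (betas) from summary statistics.
weights = {"rsA": 0.10, "rsB": -0.05, "rsC": 0.20}
# Hypothetical dosages (0, 1 or 2 copies of the effect allele) for one child.
child = {"rsA": 2, "rsB": 1, "rsC": 0}

score = polygenic_risk_score(child, weights)
print(f"PRS = {score:.2f}")
```

In practice the raw score is usually standardized across the cohort before it enters a regression model, which is why betas such as those reported above are interpretable per standard deviation of the score.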


Subjects
Life Style , Pediatric Obesity/epidemiology , Pediatric Obesity/genetics , Adolescent , Child , Preschool Child , Cohort Studies , Europe/epidemiology , Female , Genome-Wide Association Study , Humans , Male , Social Factors
5.
Hum Genet ; 139(1): 73-84, 2020 Jan.
Article in English | MEDLINE | ID: mdl-31049651

ABSTRACT

In this paper, we give an overview of methodological issues related to the use of statistical learning approaches when analyzing high-dimensional genetic data. The focus is set on regression models and machine learning algorithms taking genetic variables as input and returning a classification or a prediction for the target variable of interest; for example, the present or future disease status, or the future course of a disease. After briefly explaining the basic motivation and principle of these methods, we review different procedures that can be used to evaluate the accuracy of the obtained models and discuss common flaws that may lead to over-optimistic conclusions with respect to their prediction performance and usefulness.


Subjects
Algorithms , Disease/genetics , Machine Learning , Statistical Models , Molecular Epidemiology , Artificial Intelligence , Humans
6.
BMC Bioinformatics ; 20(1): 358, 2019 Jun 27.
Article in English | MEDLINE | ID: mdl-31248362

ABSTRACT

BACKGROUND: In recent years, more and more multi-omics data have become available, that is, data featuring measurements of several types of omics data for each patient. Using multi-omics data as covariate data in outcome prediction is both promising and challenging due to the complex structure of such data. Random forest is a prediction method known for its ability to render complex dependency patterns between the outcome and the covariates. Against this background we developed five candidate random forest variants tailored to multi-omics covariate data. These variants modify the split point selection of random forest to incorporate the block structure of multi-omics data and can be applied to any outcome type for which a random forest variant exists, such as categorical, continuous and survival outcomes. Using 20 publicly available multi-omics data sets with survival outcome we compared the prediction performances of the block forest variants with alternatives. We also considered the common special case of having clinical covariates and measurements of a single omics data type available. RESULTS: We identify one variant termed "block forest" that outperformed all other approaches in the comparison study. In particular, it performed significantly better than standard random survival forest (adjusted p-value: 0.027). The two best performing variants have in common that the block choice is randomized in the split point selection procedure. In the case of having clinical covariates and a single omics data type available, the improvements of the variants over random survival forest were larger than in the case of the multi-omics data. The degrees of improvements over random survival forest varied strongly across data sets. Moreover, considering all clinical covariates mandatorily improved the performance.
This result should however be interpreted with caution, because the level of predictive information contained in clinical covariates depends on the specific application. CONCLUSIONS: The new prediction method block forest for multi-omics data can significantly improve the prediction performance of random forest and outperformed alternatives in the comparison. Block forest is particularly effective for the special case of using clinical covariates in combination with measurements of a single omics data type.


Subjects
Machine Learning , Genomics , Humans , Survival Analysis
7.
Bioinformatics ; 34(21): 3711-3718, 2018 11 01.
Article in English | MEDLINE | ID: mdl-29757357

ABSTRACT

Motivation: Random forests are fast, flexible and represent a robust approach to analyzing high-dimensional data. A key advantage over alternative machine learning algorithms is the availability of variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based on the impurity reduction of splits, such as the Gini importance, are popular because they are simple and fast to compute. However, they are biased in favor of variables with many possible split points and high minor allele frequency. Results: We set up a fast approach to debias impurity-based variable importance measures for classification, regression and survival forests. We show that it creates a variable importance measure which is unbiased with regard to the number of categories and minor allele frequency and almost as fast as the standard impurity importance. As a result, it is now possible to compute reliable importance estimates without the extra computing cost of permutations. Further, we combine the importance measure with a fast testing procedure, producing p-values for variable importance with almost no computational overhead to the creation of the random forest. Applications to gene expression and genome-wide association data show that the proposed method is powerful and computationally efficient. Availability and implementation: The procedure is included in the ranger package, available at https://cran.r-project.org/package=ranger and https://github.com/imbs-hl/ranger. Supplementary information: Supplementary data are available at Bioinformatics online.
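
For readers unfamiliar with impurity-based importance, the following sketch computes the Gini impurity decrease of a single split, which is the quantity accumulated per variable over all trees to give the Gini importance. It is a didactic illustration only and not the debiased procedure implemented in ranger; the function names and toy labels are assumptions made for this example.

```python
from collections import Counter
from typing import Sequence


def gini(labels: Sequence[int]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent: Sequence[int], left: Sequence[int], right: Sequence[int]) -> float:
    """Parent impurity minus the size-weighted impurity of the two child nodes."""
    n = len(parent)
    weighted_children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted_children


# Toy node with two balanced classes that one split separates perfectly.
parent = [0, 0, 0, 1, 1, 1]
left, right = [0, 0, 0], [1, 1, 1]

decrease = impurity_decrease(parent, left, right)
print(decrease)
```

Because the best of many candidate split points is always chosen, a variable with more possible split points has more chances to produce a large decrease by chance alone, which is the source of the bias the abstract refers to.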


Subjects
Algorithms , Genome-Wide Association Study , Gene Frequency , Genome , Machine Learning , Software
8.
Article in German | MEDLINE | ID: mdl-30027343

ABSTRACT

Adverse drug reactions are among the leading causes of death. Pharmacovigilance aims to monitor drugs after they have been released to the market in order to detect potential risks. Data sources commonly used to this end are spontaneous reports sent in by doctors or pharmaceutical companies. Reports alone are rather limited when it comes to detecting potential health risks. Routine statutory health insurance data, however, are a richer source since they not only provide a detailed picture of the patients' wellbeing over time, but also contain information on concomitant medication and comorbidities. To take advantage of their potential and to increase drug safety, we will further develop statistical methods that have shown their merit in other fields as a source of inspiration. A plethora of methods have been proposed over the years for spontaneous reporting data: a comprehensive comparison of these methods and their potential use for longitudinal data should be explored. In addition, we show how methods from machine learning could aid in identifying rare risks. We discuss these so-called enrichment analyses and how utilizing pharmaceutical similarities between drugs and similarities between comorbidities could help to construct risk profiles of the patients prone to experience an adverse drug event. Summarizing these methods will further push drug safety research based on healthcare claim data from German health insurances which form, due to their size, longitudinal coverage, and timeliness, an excellent basis for investigating adverse effects of drugs.


Subjects
Adverse Drug Reaction Reporting Systems , Drug-Related Side Effects and Adverse Reactions , Health Insurance , Pharmacovigilance , Germany , Humans , Health Insurance/statistics & numerical data
9.
Stat Med ; 36(8): 1272-1284, 2017 04 15.
Article in English | MEDLINE | ID: mdl-28088842

ABSTRACT

The most popular approach for analyzing survival data is the Cox regression model. The Cox model may, however, be misspecified, and its proportionality assumption may not always be fulfilled. An alternative approach for survival prediction is random forests for survival outcomes. The standard split criterion for random survival forests is the log-rank test statistic, which favors splitting variables with many possible split points. Conditional inference forests avoid this split variable selection bias. However, linear rank statistics are utilized by default in conditional inference forests to select the optimal splitting variable, which cannot detect non-linear effects in the independent variables. An alternative is to use maximally selected rank statistics for the split point selection. As in conditional inference forests, splitting variables are compared on the p-value scale. However, instead of the conditional Monte-Carlo approach used in conditional inference forests, p-value approximations are employed. We describe several p-value approximations and the implementation of the proposed random forest approach. A simulation study demonstrates that unbiased split variable selection is possible. However, there is a trade-off between unbiased split variable selection and runtime. In benchmark studies of prediction performance on simulated and real datasets, the new method performs better than random survival forests if informative dichotomous variables are combined with uninformative variables with more categories and better than conditional inference forests if non-linear covariate effects are included. In a runtime comparison, the method proves to be computationally faster than both alternatives, if a simple p-value approximation is used. Copyright © 2017 John Wiley & Sons, Ltd.


Subjects
Statistical Data Interpretation , Statistical Models , Survival Analysis , Epidemiologic Effect Modifier , Humans , Monte Carlo Method , Proportional Hazards Models
11.
BMC Bioinformatics ; 17: 145, 2016 Mar 31.
Article in English | MEDLINE | ID: mdl-27029549

ABSTRACT

BACKGROUND: Random forests have often been claimed to uncover interaction effects. However, if and how interaction effects can be differentiated from marginal effects remains unclear. In extensive simulation studies, we investigate whether random forest variable importance measures capture or detect gene-gene interactions. By capturing interactions, we mean the ability to identify a variable that acts through an interaction with another one, while detection is the ability to identify an interaction effect as such. RESULTS: Of the single importance measures, the Gini importance captured interaction effects in most of the simulated scenarios; however, they were masked by marginal effects in other variables. With the permutation importance, the proportion of captured interactions was lower in all cases. Pairwise importance measures performed about equally, with a slight advantage for the joint variable importance method. However, the overall fraction of detected interactions was low. In almost all scenarios the detection fraction in a model with only marginal effects was larger than in a model with an interaction effect only. CONCLUSIONS: Random forests are generally capable of capturing gene-gene interactions, but current variable importance measures are unable to detect them as interactions. In most of the cases, interactions are masked by marginal effects and interactions cannot be differentiated from marginal effects. Consequently, caution is warranted when claiming that random forests uncover interactions.


Subjects
Genetic Models , Genetic Epistasis , Linkage Disequilibrium , Single Nucleotide Polymorphism
12.
Rheumatology (Oxford) ; 55(1): 71-9, 2016 Jan.
Article in English | MEDLINE | ID: mdl-26297628

ABSTRACT

OBJECTIVE: To evaluate the clinical presentation and long-term outcome of a vasculitis centre cohort of patients with microscopic polyangiitis (MPA) with respect to organ manifestations, treatment, chronic damage and mortality. METHODS: We performed a retrospective chart review at our vasculitis referral centre. MPA patients admitted between 1991 and 2013 classified by a modified European Medicines Agency algorithm were diagnosed and treated according to a standardized interdisciplinary approach. RESULTS: Comprehensive data from standardized interdisciplinary workups was available for 144 patients (median follow-up 72 months). The overall standardized mortality ratio was 1.40 (95% CI 0.91, 2.07; P = 0.13). We observed a higher mortality [hazard ratio (HR) 4.04 (95% CI 1.21, 13.45), P = 0.02] in 17 patients with MPA-associated fibrosing interstitial lung disease (ILD) and 56 patients with peripheral nervous system involvement [HR 5.26 (95% CI 1.10, 25.14), P = 0.04] at disease onset. One hundred and fifteen patients (79.9%) responded to the initial treatment. Sixty-one (42.3%) achieved complete remission and 54 (37.5%) achieved partial remission. Twenty (13.9%) showed a refractory disease course. CONCLUSION: MPA patients at our tertiary rheumatology referral centre seemed to have a less severe phenotype resulting in a less severe disease course and better outcome than reported in other cohorts. Fibrosing ILD was significantly associated with mortality in this cohort.


Subjects
Microscopic Polyangiitis/diagnosis , Adolescent , Adult , Aged , Aged 80 and over , Disease Progression , Female , Follow-Up Studies , Germany/epidemiology , Glucocorticoids/therapeutic use , Humans , Immunosuppressive Agents/therapeutic use , Incidence , Male , Microscopic Polyangiitis/drug therapy , Microscopic Polyangiitis/epidemiology , Middle Aged , Remission Induction/methods , Retrospective Studies , Survival Rate/trends , Time Factors , Young Adult
13.
Biom J ; 57(3): 384-94, 2015 May.
Article in English | MEDLINE | ID: mdl-25824320

ABSTRACT

Caries infiltration is a novel treatment option for proximal caries lesions. The idea is to build a diffusion barrier inside the lesion to slow down or stop the caries progression. If a lesion still reaches a critical size, restorative treatment is required. Clinical trials investigating caries infiltration thus produce multiple censored ordinal data. Standard statistical models do not take into account this censoring, and we therefore propose the Multiple Ordered Tobit (MOT) model. The model is implemented in R and compared with standard approaches. Simulation studies demonstrate that for all sample sizes and scenarios the MOT model has the largest statistical power among all methods compared, and it is robust against heteroscedasticity to some extent. Finally, a comparison with dichotomous and ordinal scaled models shows that the use of metric data for the lesion size reduces the required sample size considerably.


Subjects
Artifacts , Dental Caries/diagnosis , Statistical Models , Randomized Controlled Trials as Topic/methods , Sample Size , Computer Simulation , Dental Caries/epidemiology , Computer-Assisted Diagnosis , Humans , Reproducibility of Results , Sensitivity and Specificity
14.
PeerJ ; 10: e13728, 2022.
Article in English | MEDLINE | ID: mdl-35910765

ABSTRACT

This article describes a data-driven framework based on spatiotemporal machine learning to produce distribution maps for 16 tree species (Abies alba Mill., Castanea sativa Mill., Corylus avellana L., Fagus sylvatica L., Olea europaea L., Picea abies L. H. Karst., Pinus halepensis Mill., Pinus nigra J. F. Arnold, Pinus pinea L., Pinus sylvestris L., Prunus avium L., Quercus cerris L., Quercus ilex L., Quercus robur L., Quercus suber L. and Salix caprea L.) at high spatial resolution (30 m). Tree occurrence data for a total of three million points was used to train different algorithms: random forest, gradient-boosted trees, generalized linear models, k-nearest neighbors, CART and an artificial neural network. A stack of 305 coarse and high resolution covariates representing spectral reflectance, different biophysical conditions and biotic competition was used as predictors for realized distributions, while potential distribution was modelled with environmental predictors only. Logloss and computing time were used to select the three best algorithms to tune and train an ensemble model based on stacking with a logistic regressor as a meta-learner. An ensemble model was trained for each species: probability and model uncertainty maps of realized distribution were produced for each species using a time window of 4 years for a total of six distribution maps per species, while for potential distributions only one map per species was produced. Results of spatial cross validation show that the ensemble model consistently outperformed or performed as well as the best individual model in both potential and realized distribution tasks, with potential distribution models achieving higher predictive performances (TSS = 0.898, R² logloss = 0.857) than realized distribution ones on average (TSS = 0.874, R² logloss = 0.839). Ensemble models for Q. suber achieved the best performances in both potential (TSS = 0.968, R² logloss = 0.952) and realized (TSS = 0.959, R² logloss = 0.949) distribution, while P. sylvestris (TSS = 0.731, 0.785, R² logloss = 0.585, 0.670, respectively, for potential and realized distribution) and P. nigra (TSS = 0.658, 0.686, R² logloss = 0.623, 0.664) achieved the worst. Importance of predictor variables differed across species and models, with the green band for summer and the Normalized Difference Vegetation Index (NDVI) for fall for realized distribution and the diffuse irradiation and precipitation of the driest quarter (BIO17) being the most frequent and important for potential distribution. On average, fine-resolution models outperformed coarse resolution models (250 m) for realized distribution (TSS = +6.5%, R² logloss = +7.5%). The framework shows how combining continuous and consistent Earth Observation time series data with state of the art machine learning can be used to derive dynamic distribution maps. The produced predictions can be used to quantify temporal trends of potential forest degradation and species composition change.
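
Two of the metrics reported above have simple closed forms: the true skill statistic (TSS) is sensitivity plus specificity minus one, and logloss is the mean negative log-likelihood of the observed labels under the predicted probabilities. The sketch below shows both under their standard definitions. The toy labels and function names are assumptions for illustration, and the article's "R² logloss" variant is deliberately not reproduced because its exact definition is not given in the abstract.

```python
import math
from typing import Sequence


def true_skill_statistic(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """TSS = sensitivity + specificity - 1 for binary presence/absence labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def log_loss(y_true: Sequence[int], probs: Sequence[float], eps: float = 1e-15) -> float:
    """Mean negative log-likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# Toy example: 4 presences and 4 absences, one error in each class.
observed = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 1]

tss = true_skill_statistic(observed, predicted)
uninformative = log_loss([1, 0], [0.5, 0.5])
print(tss, uninformative)
```

TSS ranges from -1 to 1, with 0 corresponding to no skill, so the values near 0.9 quoted above indicate that both error types are rare.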


Subjects
Abies , Fagus , Pinus , Quercus , Europe
15.
J Phys Act Health ; 17(10): 1025-1033, 2020 08 28.
Article in English | MEDLINE | ID: mdl-32858522

ABSTRACT

BACKGROUND: To evaluate a multicomponent health promotion program targeting preschoolers' physical activity (PA). METHODS: PA of children from 23 German daycare facilities (DFs; 13 intervention DFs, 10 control DFs) was measured via accelerometry at baseline and after 12 months. Children's sedentary time, light PA, and moderate to vigorous PA were estimated. Adherence was tracked with paper-and-pencil calendars. Mixed-model regression analyses were used to assess intervention effects. RESULTS: PA data were analyzed from 183 (4.2 [0.8] y, 48.1% boys) children. At follow-up, children in DF groups with more than 50% adherence to PA intervention components showed an increase of 9 minutes of moderate to vigorous PA per day (β = 9.28; 95% confidence interval [CI], -0.16 to 18.72) and a 19-minute decrease in sedentary time (β = -19.25; 95% CI, -43.66 to 5.16) compared with the control group, whereas children's PA of those who were exposed to no or less than 50% adherence remained unchanged (moderate to vigorous PA: β = 0.34; 95% CI, -13.73 to 14.41; sedentary time: β = 1.78; 95% CI, -26.54 to 30.09). Notable effects were found in children with migration background. CONCLUSIONS: Only small benefits in PA outcomes were observed after 1 year. A minimum of 50% adherence to the intervention seems to be crucial for facilitating intervention effects.


Subjects
Child Day Care Centers , Exercise , Accelerometry , Child , Female , Health Promotion , Humans , Male , Sedentary Behavior
16.
J Comput Graph Stat ; 29(3): 639-658, 2020.
Article in English | MEDLINE | ID: mdl-34121830

ABSTRACT

Random forests have become an established tool for classification and regression, in particular in high-dimensional settings and in the presence of non-additive predictor-response relationships. For bounded outcome variables restricted to the unit interval, however, classical modeling approaches based on mean squared error loss may severely suffer as they do not account for heteroscedasticity in the data. To address this issue, we propose a random forest approach for relating a beta distributed outcome to a set of explanatory variables. Our approach explicitly makes use of the likelihood function of the beta distribution for the selection of splits during the tree-building procedure. In each iteration of the tree-building algorithm it chooses one explanatory variable in combination with a split point that maximizes the log-likelihood function of the beta distribution with the parameter estimates derived from the nodes of the currently built tree. Results of several simulation studies and an application using data from the U.S.A. National Lakes Assessment Survey demonstrate the properties and usefulness of the method, in particular when compared to random forest approaches based on mean squared error loss and parametric regression models.
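
The split rule described above can be sketched for a single explanatory variable: for each candidate split point, fit a beta distribution to the outcomes in each child node and keep the split with the largest summed log-likelihood. The code below is a simplified illustration under stated assumptions. It uses method-of-moments parameter estimates for brevity, which may differ from the estimator used in the paper, and the function names and toy data are hypothetical.

```python
import math
from statistics import mean, pvariance


def beta_params_mom(y):
    """Method-of-moments estimates (a, b) for values strictly inside (0, 1)."""
    m = mean(y)
    upper = m * (1.0 - m)
    # Clamp the variance so that both shape parameters stay positive.
    v = min(max(pvariance(y), 1e-6), 0.99 * upper)
    common = upper / v - 1.0
    return m * common, (1.0 - m) * common


def beta_loglik(y):
    """Beta log-likelihood of y under parameters estimated from y itself."""
    a, b = beta_params_mom(y)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return sum(
        log_norm + (a - 1.0) * math.log(v) + (b - 1.0) * math.log(1.0 - v) for v in y
    )


def best_split(x, y, min_node=2):
    """Return (threshold, log-likelihood) of the best split on one variable."""
    best_t, best_ll = None, -math.inf
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0  # candidate split point between neighbouring values
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        if len(left) < min_node or len(right) < min_node:
            continue
        ll = beta_loglik(left) + beta_loglik(right)
        if ll > best_ll:
            best_t, best_ll = t, ll
    return best_t, best_ll


# Toy data: the outcome jumps from about 0.1 to about 0.85 between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [0.10, 0.12, 0.15, 0.80, 0.85, 0.90]

threshold, loglik = best_split(x, y)
print(threshold)
```

In a full forest the same search would run over a random subset of variables at every node; only the scoring function differs from a standard regression tree, which would minimize squared error instead.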

17.
PeerJ ; 7: e6339, 2019.
Article in English | MEDLINE | ID: mdl-30746306

ABSTRACT

One reason for the widespread success of random forests (RFs) is their ability to analyze most datasets without preprocessing. For example, in contrast to many other statistical methods and machine learning approaches, no recoding such as dummy coding is required to handle ordinal and nominal predictors. The standard approach for nominal predictors is to consider all 2^(k-1) - 1 2-partitions of the k predictor categories. However, this exponential relationship produces a large number of potential splits to be evaluated, increasing computational complexity and restricting the possible number of categories in most implementations. For binary classification and regression, it was shown that ordering the predictor categories in each split leads to exactly the same splits as the standard approach. This reduces computational complexity because only k - 1 splits have to be considered for a nominal predictor with k categories. For multiclass classification and survival prediction no ordering method producing equivalent splits exists. We therefore propose to use a heuristic which orders the categories according to the first principal component of the weighted covariance matrix in multiclass classification and by log-rank scores in survival prediction. This ordering of categories can be done either in every split or a priori, that is, just once before growing the forest. With this approach, the nominal predictor can be treated as ordinal in the entire RF procedure, speeding up the computation and avoiding category limits. We compare the proposed methods with the standard approach, dummy coding and simply ignoring the nominal nature of the predictors in several simulation settings and on real data in terms of prediction performance and computational efficiency. We show that ordering the categories a priori is at least as good as the standard approach of considering all 2-partitions in all datasets considered, while being computationally faster.
We recommend using this approach as the default in RFs.
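
The saving described above is easy to quantify. The sketch below counts the candidate splits under both schemes and orders the categories of a nominal predictor by their mean response, which is the usual ordering for the regression case. It is an illustration only: the function names and toy data are assumptions, and the multiclass and survival heuristics proposed in the paper are not shown.

```python
from collections import defaultdict


def n_two_partitions(k: int) -> int:
    """Number of ways to split k categories into two non-empty groups."""
    return 2 ** (k - 1) - 1


def n_ordered_splits(k: int) -> int:
    """Number of split points once the k categories are treated as ordered."""
    return k - 1


def order_categories_by_mean(categories, response):
    """Return the category levels sorted by their mean response value."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, response):
        sums[c] += y
        counts[c] += 1
    return sorted(sums, key=lambda c: sums[c] / counts[c])


# Toy nominal predictor with three levels and a continuous response.
categories = ["a", "b", "c", "a", "b", "c"]
response = [3.0, 1.0, 2.0, 5.0, 1.0, 2.0]

ordering = order_categories_by_mean(categories, response)
print(ordering, n_two_partitions(10), n_ordered_splits(10))
```

With 10 categories the search shrinks from 511 candidate partitions to 9 ordered split points, and the gap widens exponentially as k grows.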

18.
PeerJ ; 6: e5518, 2018.
Article in English | MEDLINE | ID: mdl-30186691

ABSTRACT

Random forest and similar machine learning techniques are already used to generate spatial predictions, but the spatial location of points (geography) is often ignored in the modeling process. Spatial autocorrelation, especially if it persists in the cross-validation residuals, indicates that the predictions may be biased, which is suboptimal. This paper presents a random forest for spatial predictions framework (RFsp) where buffer distances from observation points are used as explanatory variables, thus incorporating geographical proximity effects into the prediction process. The RFsp framework is illustrated with examples that use textbook datasets and apply spatial and spatio-temporal prediction to numeric, binary, categorical, multivariate and spatiotemporal variables. Performance of the RFsp framework is compared with state-of-the-art kriging techniques using fivefold cross-validation with refitting. The results show that RFsp can obtain predictions that are as accurate and unbiased as those of different versions of kriging. Advantages of using RFsp over kriging are that it needs no rigid statistical assumptions about the distribution and stationarity of the target variable, it is more flexible towards incorporating, combining and extending covariates of different types, and it possibly yields more informative maps characterizing the prediction error. RFsp appears to be especially attractive for building multivariate spatial prediction models that can be used as "knowledge engines" in various geoscience fields. Some disadvantages of RFsp are the computational intensity, which grows exponentially as calibration data and covariates increase, and the high sensitivity of predictions to input data quality. The key to the success of the RFsp framework might be the training data quality, especially the quality of spatial sampling (to minimize extrapolation problems and any type of bias in data) and the quality of model validation (to ensure that accuracy is not affected by overfitting). For many data sets, especially those with fewer points and covariates and close-to-linear relationships, model-based geostatistics can still lead to more accurate predictions than RFsp.
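
The idea of using buffer distances as explanatory variables can be sketched in a few lines of Python. This is only an illustration under simple assumptions (planar coordinates and Euclidean distance); the function name and the toy coordinates are hypothetical and do not come from the paper's own code.

```python
import math


def buffer_distance_features(obs_points, query_points):
    """For each query location, return its distance to every observation point.

    Each row is one query location; column j holds the distance to
    observation point j, so n observation points give n covariates.
    """
    return [
        [math.hypot(qx - ox, qy - oy) for (ox, oy) in obs_points]
        for (qx, qy) in query_points
    ]


# Two observation points and two locations at which to predict (toy values).
obs = [(0.0, 0.0), (3.0, 4.0)]
grid = [(0.0, 0.0), (3.0, 0.0)]

features = buffer_distance_features(obs, grid)
print(features)
```

The resulting distance columns could then be passed, together with any other covariates, to a random forest implementation of choice. Because one column is created per observation point, the number of covariates grows with the size of the calibration data, which is consistent with the computational cost noted in the abstract.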

19.
Sci Rep ; 8(1): 5872, 2018 04 12.
Article in English | MEDLINE | ID: mdl-29651131

ABSTRACT

Mutations in mitochondrial DNA (mtDNA) lead to heteroplasmy, i.e., the intracellular coexistence of wild-type and mutant mtDNA strands, which affects a wide spectrum of diseases but also physiological processes, including endurance exercise performance in athletes. However, the phenotypic consequences of limited levels of naturally arising heteroplasmy have not been experimentally studied to date. We hence generated a conplastic mouse strain carrying the mitochondrial genome of an AKR/J mouse strain (B6-mtAKR) in a C57BL/6J nuclear genomic background, leading to >20% heteroplasmy in the origin of light-strand DNA replication (OriL). These conplastic mice demonstrate a shorter lifespan as well as dysregulation of multiple metabolic pathways, culminating in impaired glucose metabolism, compared with wild-type C57BL/6J mice carrying lower levels of heteroplasmy. Our results indicate that physiologically relevant differences in mtDNA heteroplasmy levels at a single, functionally important site impair metabolic health and lifespan in mice.


Subjects
DNA Replication/genetics , Mitochondrial DNA/genetics , Longevity/genetics , Mitochondria/genetics , Animals , Glucose/genetics , Glucose/metabolism , Humans , Metabolic Networks and Pathways/genetics , Mice , Mitochondria/pathology , Mutation

20.
Methods Mol Biol ; 1666: 629-647, 2017.
Article in English | MEDLINE | ID: mdl-28980267

ABSTRACT

The advancement of high-throughput sequencing technologies enables sequencing of human genomes at steadily decreasing costs and increasing quality. Before variants can be analyzed, e.g., in association studies, the raw data obtained from the sequencer need to be preprocessed. These preprocessing steps include the removal of adapters, duplicates, and contaminations, alignment to a reference genome and the postprocessing of the alignment. All later steps, such as variant discovery, rely on high data quality and proper preprocessing, emphasizing the great importance of quality control. This chapter presents a workflow for preprocessing Illumina HiSeq X sequencing data. Code snippets are provided for illustrating all necessary steps, along with a brief description of the tools and underlying methods.
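
Two of the preprocessing steps named above, adapter removal and duplicate removal, can be illustrated with a toy Python sketch. It is not the chapter's workflow: the function names are hypothetical, the adapter string is an assumed example value, and duplicates are identified here by identical sequence, whereas production tools typically mark duplicates using alignment positions.

```python
def trim_adapter(read, adapter):
    """Cut the read at the first full occurrence of the adapter sequence."""
    idx = read.find(adapter)
    return read if idx == -1 else read[:idx]


def remove_duplicates(reads):
    """Keep the first occurrence of each distinct read sequence, preserving order."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


# Assumed adapter sequence, used only for illustration.
ADAPTER = "AGATCGGAAGAGC"

raw_reads = [
    "ACGTACGT" + ADAPTER + "TTT",  # read that runs into the adapter
    "ACGTACGT",                    # same insert without adapter
    "GGGTTTAA",                    # unrelated read
]

trimmed = [trim_adapter(read, ADAPTER) for read in raw_reads]
unique_reads = remove_duplicates(trimmed)
print(unique_reads)
```

Trimming before deduplication matters in this toy case: the first two reads only become identical once the adapter has been removed.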


Subjects
High-Throughput Nucleotide Sequencing/methods , Whole Genome Sequencing/methods , Human Genome , Humans , INDEL Mutation , Quality Control , Software , Workflow