Results 1 - 20 of 107
1.
PLoS Comput Biol ; 20(3): e1011936, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38547084

ABSTRACT

Throughout their education and when reading the scientific literature, students may get the impression that there is a unique, correct analysis strategy for every data analysis task and that this strategy will always yield a significant and noteworthy result. This expectation conflicts with a growing realization that empirical research admits a multiplicity of possible analysis strategies, which leads to overoptimism and nonreplicable research findings when combined with result-dependent selective reporting. Here, we argue that students are often ill-equipped for real-world data analysis tasks and unprepared for the dangers of selectively reporting the most promising results. We present a seminar course, intended for advanced undergraduates and beginning graduate students in data analysis fields such as statistics, data science, or bioinformatics, that aims to raise awareness of uncertain choices in the analysis of empirical data and, through theoretical modules and practical hands-on sessions, to present ways of dealing with these choices.


Subject(s)
Students, Teaching, Humans, Curriculum
2.
Am J Epidemiol ; 2024 May 06.
Article in English | MEDLINE | ID: mdl-38717330

ABSTRACT

Quantitative bias analysis (QBA) permits assessment of the expected impact of various imperfections of the available data on the results and conclusions of a particular real-world study. This article extends QBA methodology to multivariable time-to-event analyses with right-censored endpoints, possibly including time-varying exposures or covariates. The proposed approach employs data-driven simulations, which preserve important features of the data at hand while offering flexibility in controlling the parameters and assumptions that may affect the results. First, the steps required to perform data-driven simulations are described, and then two examples of real-world time-to-event analyses illustrate their implementation and the insights they may offer. The first example focuses on the omission of an important time-invariant predictor of the outcome in a prognostic study of cancer mortality, and permits separating the expected impact of confounding bias from non-collapsibility. The second example assesses how imprecise timing of an interval-censored event - ascertained only at sparse times of clinic visits - affects its estimated association with a time-varying drug exposure. The simulation results also provide a basis for comparing the performance of two alternative strategies for imputing the unknown event times in this setting. The R scripts that permit the reproduction of our examples are provided.
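The data-driven simulation loop described above can be illustrated with a deliberately simplified Python sketch (not the authors' implementation; all names, distributions, and parameter values are invented for illustration): simulate exponential event times whose rate depends on an exposure and on a prognostic covariate, then analyze each simulated cohort with the covariate omitted to see how the marginal estimate relates to the conditional effect.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_cohort(n=5000, beta_exposure=0.5, beta_u=1.0):
    """Simulate exponential event times driven by an exposure and a
    prognostic covariate U that the analyst may omit."""
    x = rng.binomial(1, 0.5, n)          # exposure
    u = rng.binomial(1, 0.3, n)          # important predictor, independent of x
    rate = 0.01 * np.exp(beta_exposure * x + beta_u * u)
    t = rng.exponential(1 / rate)
    c = np.full(n, 10.0)                  # administrative censoring at 10 years
    return x, u, np.minimum(t, c), (t <= c).astype(int)

def log_rate_ratio(x, time, event):
    """Crude log rate ratio (events / person-time), ignoring U."""
    r1 = event[x == 1].sum() / time[x == 1].sum()
    r0 = event[x == 0].sum() / time[x == 0].sum()
    return np.log(r1 / r0)

# Repeat the simulation many times and summarize how the exposure effect
# behaves when U is omitted from the analysis.
estimates = []
for _ in range(200):
    x, u, time, event = simulate_cohort()
    estimates.append(log_rate_ratio(x, time, event))
print(np.mean(estimates))  # marginal estimate; the conditional effect is 0.5
```

With these settings the marginal estimate sits close to, but slightly below, the conditional effect; longer follow-up or a stronger omitted predictor widens the gap, which is the kind of question a data-driven QBA simulation is designed to quantify.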

3.
PLoS Comput Biol ; 19(1): e1010820, 2023 01.
Article in English | MEDLINE | ID: mdl-36608142

ABSTRACT

In recent years, unsupervised analysis of microbiome data, such as microbial network analysis and clustering, has increased in popularity. Many new statistical and computational methods have been proposed for these tasks. This multiplicity of analysis strategies poses a challenge for researchers, who are often unsure which method(s) to use and might be tempted to try different methods on their dataset to look for the "best" ones. However, if only the best results are selectively reported, this may cause over-optimism: the "best" method is overly fitted to the specific dataset, and the results might be non-replicable on validation data. Such effects will ultimately hinder research progress. Yet so far, these topics have been given little attention in the context of unsupervised microbiome analysis. In our illustrative study, we aim to quantify over-optimism effects in this context. We model the approach of a hypothetical microbiome researcher who undertakes four unsupervised research tasks: clustering of bacterial genera, hub detection in microbial networks, differential microbial network analysis, and clustering of samples. While these tasks are unsupervised, the researcher might still have certain expectations as to what constitutes interesting results. We translate these expectations into concrete evaluation criteria that the hypothetical researcher might want to optimize. We then randomly split an exemplary dataset from the American Gut Project into discovery and validation sets multiple times. For each research task, multiple method combinations (e.g., methods for data normalization, network generation, and/or clustering) are tried on the discovery data, and the combination that yields the best result according to the evaluation criterion is chosen. While the hypothetical researcher might only report this result, we also apply the "best" method combination to the validation dataset. The results are then compared between discovery and validation data. 
In all four research tasks, there are notable over-optimism effects: averaged over multiple random splits into discovery/validation data, the results on the validation data are worse than on the discovery data. Our study thus highlights the importance of validation and replication in microbiome analysis to obtain reliable results and demonstrates that the issue of over-optimism goes beyond the context of statistical testing and fishing for significance.
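The over-optimism mechanism itself (picking the winner among many comparably good method combinations on the discovery data) can be demonstrated in a few lines. This is a generic sketch with synthetic scores, not the study's actual pipeline: every "method" has the same true quality, and only evaluation noise differs between the discovery and validation halves.

```python
import numpy as np

rng = np.random.default_rng(0)
n_splits, n_methods = 50, 20

gaps = []
for _ in range(n_splits):
    true_quality = 0.60
    # Observed scores = true quality + evaluation noise on each half.
    discovery = true_quality + rng.normal(0, 0.05, n_methods)
    validation = true_quality + rng.normal(0, 0.05, n_methods)
    best = np.argmax(discovery)          # pick the "best" method on discovery
    gaps.append(discovery[best] - validation[best])

print(np.mean(gaps))  # systematically > 0: the winner looks better than it is
```

Selecting the maximum of noisy scores guarantees a positive expected gap even though no method is genuinely better, which is exactly why the validation results fall short of the discovery results.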


Subject(s)
Microbiota, Machine Learning, Microbial Consortia, Bacteria, Cluster Analysis
4.
Stat Med ; 43(6): 1119-1134, 2024 Mar 15.
Article in English | MEDLINE | ID: mdl-38189632

ABSTRACT

Tuning hyperparameters, such as the regularization parameter in Ridge or Lasso regression, is often aimed at improving the predictive performance of risk prediction models. In this study, various hyperparameter tuning procedures for clinical prediction models were systematically compared and evaluated in low-dimensional data. The focus was on out-of-sample predictive performance (discrimination, calibration, and overall prediction error) of risk prediction models developed using Ridge, Lasso, Elastic Net, or Random Forest. The influence of sample size, number of predictors and events fraction on performance of the hyperparameter tuning procedures was studied using extensive simulations. The results indicate important differences between tuning procedures in calibration performance, while generally showing similar discriminative performance. The one-standard-error rule for tuning applied to cross-validation (1SE CV) often resulted in severe miscalibration. Standard non-repeated and repeated cross-validation (both 5-fold and 10-fold) performed similarly well and outperformed the other tuning procedures. Bootstrap showed a slight tendency to more severe miscalibration than standard cross-validation-based tuning procedures. Differences between tuning procedures were larger for smaller sample sizes, lower events fractions and fewer predictors. These results imply that the choice of tuning procedure can have a profound influence on the predictive performance of prediction models. The results support the application of standard 5-fold or 10-fold cross-validation that minimizes out-of-sample prediction error. Despite an increased computational burden, we found no clear benefit of repeated over non-repeated cross-validation for hyperparameter tuning. We warn against the potentially detrimental effects on model calibration of the popular 1SE CV rule for tuning prediction models in low-dimensional settings.


Subject(s)
Research Design, Humans, Computer Simulation, Sample Size
5.
BMC Med Res Methodol ; 24(1): 152, 2024 Jul 17.
Article in English | MEDLINE | ID: mdl-39020325

ABSTRACT

When different researchers study the same research question using the same dataset, they may obtain different and potentially even conflicting results. This is because there is often substantial flexibility in researchers' analytical choices, an issue also referred to as "researcher degrees of freedom". Combined with selective reporting of the smallest p-value or largest effect, researcher degrees of freedom may lead to an increased rate of false positive and overoptimistic results. In this paper, we address this issue by formalizing the multiplicity of analysis strategies as a multiple testing problem. As the test statistics of different analysis strategies are usually highly dependent, a naive approach such as the Bonferroni correction is inappropriate because it leads to an unacceptable loss of power. Instead, we propose using the "minP" adjustment method, which takes potential test dependencies into account and approximates the underlying null distribution of the minimal p-value through a permutation-based procedure. This procedure is known to achieve more power than simpler approaches while ensuring weak control of the family-wise error rate. We illustrate our approach for addressing researcher degrees of freedom by applying it to a study on the impact of perioperative PaO2 on post-operative complications after neurosurgery. A total of 48 analysis strategies are considered and adjusted using the minP procedure. This approach makes it possible to selectively report the result of the analysis strategy yielding the most convincing evidence while controlling the type 1 error, and thus the risk of publishing false positive results that may not be replicable.
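A minimal Westfall-Young-style minP sketch follows (a toy dataset and four outcome transformations standing in for the 48 analysis strategies; this is a generic illustration of the permutation idea, not the paper's implementation). The same permutation matrix is used both for the per-strategy p-values and for the null distribution of the minimal p-value, so the dependence between strategies is preserved.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy data: binary group, one outcome; "strategies" = outcome transforms.
n = 60
group = np.repeat([0, 1], n // 2)
y = rng.normal(size=n) + 0.5 * group

strategies = [lambda v: v, np.square, np.abs, lambda v: np.log1p(np.abs(v))]

def stat(values, g):
    """Absolute mean difference between groups."""
    return abs(values[g == 1].mean() - values[g == 0].mean())

obs = np.array([stat(f(y), group) for f in strategies])

B = 2000
null = np.empty((B, len(strategies)))
for b in range(B):
    g = rng.permutation(group)            # permute group labels
    null[b] = [stat(f(y), g) for f in strategies]

# Per-strategy permutation p-values for the observed statistics ...
p_obs = (null >= obs).mean(axis=0)
# ... and for each permuted statistic, via its rank in its own null column.
p_null = 1.0 - (null.argsort(axis=0).argsort(axis=0) / B)
# minP adjustment: compare the observed minimal p-value against the
# permutation distribution of minimal p-values.
p_adj = (p_null.min(axis=1) <= p_obs.min()).mean()
print(p_obs.min(), p_adj)  # the adjusted p-value is never optimistic
```

Reporting `p_adj` instead of the raw minimum is what allows the most convincing strategy to be reported selectively while still controlling the family-wise error rate.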


Subject(s)
Research Personnel, Humans, Research Personnel/statistics & numerical data, Research Design, Data Interpretation, Statistical, Biomedical Research/methods, Models, Statistical, Postoperative Complications/prevention & control
6.
Biom J ; 66(1): e2200238, 2024 Jan.
Article in English | MEDLINE | ID: mdl-36999395

ABSTRACT

The constant development of new data analysis methods in many fields of research is accompanied by an increasing awareness that these new methods often perform better in their introductory paper than in subsequent comparison studies conducted by other researchers. We attempt to explain this discrepancy by conducting a systematic experiment that we call "cross-design validation of methods". In the experiment, we select two methods designed for the same data analysis task, reproduce the results shown in each paper, and then reevaluate each method based on the study design (i.e., datasets, competing methods, and evaluation criteria) that was used to show the abilities of the other method. We conduct the experiment for two data analysis tasks, namely cancer subtyping using multiomic data and differential gene expression analysis. Three of the four methods included in the experiment indeed perform worse when they are evaluated on the new study design, which is mainly caused by the different datasets. Apart from illustrating the many degrees of freedom existing in the assessment of a method and their effect on its performance, our experiment suggests that the performance discrepancies between original and subsequent papers may not only be caused by the nonneutrality of the authors proposing the new method but also by differences regarding the level of expertise and field of application. Authors of new methods should thus focus not only on a transparent and extensive evaluation but also on comprehensive method documentation that enables the correct use of their methods in subsequent studies.


Subject(s)
Research Design
7.
Biom J ; 66(1): e2200222, 2024 Jan.
Article in English | MEDLINE | ID: mdl-36737675

ABSTRACT

Although new biostatistical methods are published at a very high rate, many of these developments are not trustworthy enough to be adopted by the scientific community. We propose a framework for thinking about how a piece of methodological work contributes to the evidence base for a method. Similar to the well-known phases of clinical research in drug development, we propose to define four phases of methodological research. These four phases cover (I) proposing a new methodological idea while providing, for example, logical reasoning or proofs, (II) providing empirical evidence, first in a narrow target setting, then (III) in an extended range of settings and for various outcomes, accompanied by appropriate application examples, and (IV) investigations that establish a method as sufficiently well understood to know when it is preferred over others and when it is not, that is, its pitfalls. We suggest basic definitions of the four phases to provoke thought and discussion rather than to devise an unambiguous classification of studies into phases. Too many methodological developments stop before phases III and IV, but we give two examples, with references. Our concept shifts the emphasis toward studies in phases III and IV, that is, carefully planned method comparison studies and studies that explore the empirical properties of existing methods in a wider range of problems.


Subject(s)
Biostatistics, Research Design
8.
BMC Med ; 21(1): 182, 2023 05 15.
Article in English | MEDLINE | ID: mdl-37189125

ABSTRACT

BACKGROUND: In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions. METHODS: Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 "High-dimensional data" of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD. RESULTS: The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided. 
CONCLUSIONS: This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.
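As one concrete example from the multiple-testing subtopic, the Benjamini-Hochberg step-up procedure, a standard false-discovery-rate control in high-dimensional settings, fits in a few lines (an illustrative sketch with made-up p-values):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of rejected hypotheses (BH step-up)."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresh = alpha * (np.arange(1, m + 1) / m)   # alpha * i / m
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Largest i with p_(i) <= alpha * i / m; reject everything up to it.
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.8]
print(benjamini_hochberg(pvals).sum())  # → 2
```

Note the step-up logic: 0.039 exceeds its own threshold (0.015), so only the two smallest p-values are rejected even though several others are below the unadjusted 0.05.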


Subject(s)
Biomedical Research, Goals, Humans, Research Design
9.
Brief Bioinform ; 22(3)2021 05 20.
Article in English | MEDLINE | ID: mdl-32823283

ABSTRACT

Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly often generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate to derive such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. Different prediction methods from machine learning and statistics were applied to 18 multi-omics cancer datasets (35 to 1000 observations, up to 100 000 variables) from the database 'The Cancer Genome Atlas' (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and methods that do not take the group structure of the omics variables into account. The Kaplan-Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. Uno's C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking the multi-omics structure into account have a slightly better prediction performance. Taking this structure into account can prevent the predictive information in low-dimensional groups, especially clinical variables, from being overlooked during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited. Contact: moritz.herrmann@stat.uni-muenchen.de, +49 89 2180 3198. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
All analyses are reproducible using R code freely available on GitHub.


Subject(s)
Benchmarking, Female, Humans, Machine Learning, Male, Neoplasms/genetics, Neoplasms/pathology, Proportional Hazards Models, Survival Analysis
10.
Brief Bioinform ; 22(4)2021 07 20.
Article in English | MEDLINE | ID: mdl-33264391

ABSTRACT

MOTIVATION: Estimating microbial association networks from high-throughput sequencing data is a common exploratory data analysis approach aimed at understanding the complex interplay of microbial communities in their natural habitat. Statistical network estimation workflows comprise several analysis steps, including methods for zero handling, data normalization and computing microbial associations. Since microbial interactions are likely to change between conditions, e.g. between healthy individuals and patients, identifying network differences between groups is often an integral secondary analysis step. Thus far, however, no unifying computational tool is available that facilitates the whole analysis workflow of constructing, analysing and comparing microbial association networks from high-throughput sequencing data. RESULTS: Here, we introduce NetCoMi (Network Construction and comparison for Microbiome data), an R package that integrates existing methods for each analysis step in a single reproducible computational workflow. The package offers functionality for constructing and analysing single microbial association networks as well as quantifying network differences. This enables insights into whether single taxa, groups of taxa or the overall network structure change between groups. NetCoMi also contains functionality for constructing differential networks, thus allowing users to assess whether single pairs of taxa are differentially associated between two groups. Furthermore, NetCoMi facilitates the construction and analysis of dissimilarity networks of microbiome samples, enabling a high-level graphical summary of the heterogeneity of an entire microbiome sample collection. We illustrate NetCoMi's wide applicability using data sets from the GABRIELA study to compare microbial associations in settled dust from children's rooms between samples from two study centers (Ulm and Munich).
AVAILABILITY: R scripts used for producing the examples shown in this manuscript are provided as supplementary data. The NetCoMi package, together with a tutorial, is available at https://github.com/stefpeschel/NetCoMi. CONTACT: Tel: +49 89 3187 43258; stefanie.peschel@mail.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Briefings in Bioinformatics online.
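NetCoMi itself is an R package; the workflow steps it integrates (zero handling, normalization, association estimation, sparsification, network analysis) can nevertheless be sketched in Python on toy counts to make the pipeline concrete. All data, thresholds, and choices below are arbitrary stand-ins, not NetCoMi defaults:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy count matrix: samples x taxa (e.g. genus-level counts).
n_samples, n_taxa = 40, 8
counts = rng.poisson(50, size=(n_samples, n_taxa)) + 1  # pseudocount avoids log(0)

# Step 1 (zero handling / normalization): centered log-ratio transform.
log_c = np.log(counts)
clr = log_c - log_c.mean(axis=1, keepdims=True)

# Step 2 (association estimation): correlation between taxa.
assoc = np.corrcoef(clr, rowvar=False)

# Step 3 (sparsification): keep only strong associations.
adj = (np.abs(assoc) >= 0.3) & ~np.eye(n_taxa, dtype=bool)

# Step 4 (network analysis): taxa with the highest degree are candidate hubs.
degree = adj.sum(axis=1)
hubs = np.argsort(degree)[::-1][:2]
print(degree, hubs)
```

Running the same pipeline per group and comparing hub sets or adjacency matrices is the essence of the differential-network step; NetCoMi wraps these choices, plus statistically grounded alternatives for each step, in one reproducible workflow.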


Subject(s)
Databases, Nucleic Acid, High-Throughput Nucleotide Sequencing, Microbiota/genetics, Software, Humans
11.
Eur Heart J ; 43(31): 2921-2930, 2022 08 14.
Article in English | MEDLINE | ID: mdl-35639667

ABSTRACT

The medical field has seen a rapid increase in the development of artificial intelligence (AI)-based prediction models. With the introduction of such AI-based prediction model tools and software in cardiovascular patient care, cardiovascular researchers and healthcare professionals are challenged to understand both the opportunities and the limitations of AI-based predictions. In this article, we present 12 critical questions for cardiovascular health professionals to ask when confronted with an AI-based prediction model. We aim to support medical professionals in distinguishing the AI-based prediction models that can add value to patient care from those that do not.


Subject(s)
Artificial Intelligence, Cardiovascular Diseases, Health Personnel, Humans, Software
12.
Brief Bioinform ; 21(6): 1904-1919, 2020 12 01.
Article in English | MEDLINE | ID: mdl-31750518

ABSTRACT

Data integration, i.e. the use of different sources of information for data analysis, is becoming one of the most important topics in modern statistics. Especially in, but not limited to, biomedical applications, a relevant issue is the combination of low-dimensional (e.g. clinical data) and high-dimensional (e.g. molecular data such as gene expressions) data sources in a prediction model. Not only the different characteristics of the data, but also the complex correlation structure within and between the two data sources, pose challenging issues. In this paper, we investigate these issues via simulations, providing some useful insight into strategies to combine low- and high-dimensional data in a regression prediction model. In particular, we focus on the effect of the correlation structure on the results, while accounting for the influence of our specific choices in the design of the simulation study.
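One standard strategy in this setting is to leave the low-dimensional clinical block unpenalized while shrinking only the high-dimensional molecular block, which amounts to a ridge penalty with a block-diagonal penalty matrix. A minimal sketch on synthetic data (variable names, dimensions, and the penalty value are invented for illustration; the study itself compares richer strategies):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p_clin, p_omics = 150, 3, 100

X_clin = rng.normal(size=(n, p_clin))    # low-dimensional clinical block
X_omics = rng.normal(size=(n, p_omics))  # high-dimensional molecular block
y = X_clin @ np.array([1.0, -0.5, 0.25]) + 0.5 * X_omics[:, 0] + rng.normal(size=n)

X = np.hstack([X_clin, X_omics])
lam = 10.0
# Penalty matrix: zero for the clinical coefficients, lam for the omics
# coefficients, so only the high-dimensional block is shrunk.
D = np.diag([0.0] * p_clin + [lam] * p_omics)
beta = np.linalg.solve(X.T @ X + D, X.T @ y)

print(beta[:p_clin])  # clinical effects, estimated without shrinkage
```

Because the clinical coefficients are unpenalized, their predictive information is not diluted by the many omics variables; how well this works in practice depends strongly on the correlation structure within and between the blocks, which is what the simulations above investigate.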


Subject(s)
Computational Biology, Computer Simulation, Models, Statistical
13.
Mol Genet Metab ; 136(4): 268-273, 2022 08.
Article in English | MEDLINE | ID: mdl-35835062

ABSTRACT

Infantile nephropathic cystinosis, due to impaired transport of cystine out of lysosomes, occurs with an incidence of 1 in 100,000-200,000 live births. It is characterized by renal Fanconi syndrome in the first year of life and progression of glomerular dysfunction to end-stage kidney disease by approximately 10 years of age. Treatment with oral cysteamine therapy helps preserve glomerular function, but affected individuals eventually require kidney replacement therapy. This is because glomerular damage has already occurred by the time a child is diagnosed with cystinosis, typically in the second year of life. We performed a retrospective multicenter study to investigate the impact of initiating cysteamine treatment within the first 2 months of life in some infants, comparing them with patients diagnosed at the typical age at two different levels of adherence. We collected 3983 data points from 55 patients born between 1997 and 2020; 52 patients with 1592 data points could be further evaluated. These data were first analyzed by dividing the patient cohort into three groups: (i) standard treatment start with good adherence, (ii) standard treatment start with less good adherence, and (iii) early treatment start. At every age, mean estimated glomerular filtration rate (eGFR) was higher in early-treated patients than in later-treated patients. Second, a generalized additive mixed model (GAMM) was applied, showing that patients whose treatment was initiated before 2 months of age are expected to have a 34 ml/min/1.73 m2 higher eGFR than patients with a later treatment start, while controlling for adherence and patients' age. These data strongly suggest that oral cysteamine treatment initiated within 2 months of birth preserves kidney function in infantile nephropathic cystinosis and provide evidence of the utility of newborn screening for this disease.


Subject(s)
Cystinosis, Fanconi Syndrome, Child, Cysteamine/therapeutic use, Cystinosis/complications, Cystinosis/drug therapy, Fanconi Syndrome/chemically induced, Fanconi Syndrome/diagnosis, Fanconi Syndrome/drug therapy, Humans, Infant, Infant, Newborn, Kidney
14.
BMC Palliat Care ; 21(1): 18, 2022 Feb 04.
Article in English | MEDLINE | ID: mdl-35120502

ABSTRACT

BACKGROUND: A casemix classification based on patients' needs can serve to better describe the patient group in palliative care and thus help to develop adequate future care structures and enable national benchmarking and quality control. However, in Germany, there is no such evidence-based system to differentiate the complexity of patients' needs in palliative care. Therefore, this study aims to develop a patient-oriented, nationally applicable complexity and casemix classification for adult palliative care patients in Germany. METHODS: COMPANION is a mixed-methods study with data derived from three subprojects. Subproject 1: Prospective, cross-sectional multi-centre study collecting data on patients' needs which reflect the complexity of the respective patient situation, as well as data on resources that are required to meet these needs in specialist palliative care units, palliative care advisory teams, and specialist palliative home care. Subproject 2: Qualitative study including the development of a literature-based preliminary list of characteristics, expert interviews, and a focus group to develop a taxonomy for specialist palliative care models. Subproject 3: Multi-centre costing study based on resource data from subproject 1 and data of study centres. Data and results from the three subprojects will inform each other and form the basis for the development of the casemix classification. Ultimately, the casemix classification will be developed by applying Classification and Regression Tree (CART) analyses using patient and complexity data from subproject 1 and patient-related cost data from subproject 3. DISCUSSION: This is the first multi-centre costing study that integrates the structure and process characteristics of different palliative care settings in Germany with individual patient care. The mixed-methods design and the variety of included data allow for the development of a casemix classification that reflects the complexity of the research subject.
The consecutive inclusion of all patients cared for in participating study centres within the period of data collection allows for a comprehensive description of palliative care patients and their needs. A limiting factor is that data will be collected at least partly during the COVID-19 pandemic, and a potential impact of the pandemic on health care and on the research topic cannot be excluded. TRIAL REGISTRATION: German Register for Clinical Studies trial registration number: DRKS00020517.


Subject(s)
Palliative Care, Adult, COVID-19, Cross-Sectional Studies, Humans, Multicenter Studies as Topic, Pandemics, Prospective Studies
15.
BMC Med Educ ; 22(1): 417, 2022 Jun 01.
Article in English | MEDLINE | ID: mdl-35650577

ABSTRACT

BACKGROUND: Guideline-based therapy of cardiac arrhythmias is important for many physicians from the beginning of their training. Practical training of the required skills to treat cardiac arrhythmias is useful for acquiring these skills but does not seem sufficient for skill retention. The aim of this study was to compare different retention methods for skills required to treat cardiac arrhythmias with respect to the performance of these skills in an assessment. METHODS: Seventy-one final-year medical students participated in a newly designed workshop to train synchronized cardioversion (SC) and transcutaneous cardiac pacing (TCP) skills in 2020. All participants completed an objective structured clinical examination (OSCE 1) one week after the training. Afterwards, the participants were stratified and randomized into three groups. Nine weeks later, one group received a standard operating procedure (SOP) for the skills, one group participated in a second workshop (SW), and one group received no further intervention (control). Ten weeks after the first training, all groups participated in OSCE 2. RESULTS: The average score of all students in OSCE 1 was 15.6 ± 0.8 points with no significant differences between the three groups. Students in the control group reached a significantly (p < 0.001) lower score in OSCE 2 (-2.0 points, CI: [-2.9;-1.1]) than in OSCE 1. Students in the SOP-group achieved on average the same result in OSCE 2 as in OSCE 1 (0 points, CI: [-0.63;+0.63]). Students who completed a second skills training (SW-group) did not score significantly higher in OSCE 2 compared to OSCE 1 (+0.4 points, CI: [-0.29;+1.12]). The OSCE 2 scores in groups SOP and SW were neither significantly different nor statistically equivalent. CONCLUSIONS: Partial loss of SC and TCP skills acquired in a workshop can be prevented after 10 weeks by reading an SOP as well as by a second workshop one week before the second assessment.
Refreshing practical skills with an SOP could provide an effective and inexpensive method for skills retention compared to repeating the training. Further studies are needed to determine whether this effect also exists for other skills and how frequently an SOP should be re-read for appropriate long-term retention of complex skills.


Subject(s)
Students, Medical, Clinical Competence, Educational Measurement/methods, Electric Countershock, Humans, Prospective Studies
16.
Hum Genet ; 139(1): 73-84, 2020 Jan.
Article in English | MEDLINE | ID: mdl-31049651

ABSTRACT

In this paper, we give an overview of methodological issues related to the use of statistical learning approaches when analyzing high-dimensional genetic data. The focus is set on regression models and machine learning algorithms taking genetic variables as input and returning a classification or a prediction for the target variable of interest; for example, the present or future disease status, or the future course of a disease. After briefly explaining the basic motivation and principle of these methods, we review different procedures that can be used to evaluate the accuracy of the obtained models and discuss common flaws that may lead to over-optimistic conclusions with respect to their prediction performance and usefulness.
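The most notorious of these flaws, performing feature selection on the full dataset before cross-validation, can be reproduced directly. With pure-noise data and a simple nearest-centroid classifier (a generic sketch, not this paper's setup), selection outside the CV loop yields drastically over-optimistic accuracy, while selection repeated inside each fold stays near chance:

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 40, 1000
X = rng.normal(size=(n, p))          # pure noise
y = np.repeat([0, 1], n // 2)        # labels carry no signal

def top_features(X, y, k=10):
    """Top-k features by absolute mean difference between classes."""
    score = np.abs(X[y == 1].mean(0) - X[y == 0].mean(0))
    return np.argsort(score)[::-1][:k]

def nearest_centroid_cv(X, y, select_inside):
    folds = np.array_split(rng.permutation(n), 5)
    correct = 0
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        feats = top_features(X[train], y[train]) if select_inside \
            else top_features(X, y)  # WRONG: selection has seen the test data
        mu0 = X[np.ix_(train[y[train] == 0], feats)].mean(0)
        mu1 = X[np.ix_(train[y[train] == 1], feats)].mean(0)
        d0 = ((X[np.ix_(test, feats)] - mu0) ** 2).sum(1)
        d1 = ((X[np.ix_(test, feats)] - mu1) ** 2).sum(1)
        correct += ((d1 < d0).astype(int) == y[test]).sum()
    return correct / n

acc_leaky = nearest_centroid_cv(X, y, select_inside=False)
acc_honest = nearest_centroid_cv(X, y, select_inside=True)
print(acc_leaky, acc_honest)  # leaky selection looks far better on pure noise
```

Since the data contain no signal at all, any apparent accuracy of the leaky pipeline is pure selection bias; this is the kind of over-optimistic conclusion the evaluation procedures reviewed here are designed to prevent.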


Subject(s)
Algorithms, Disease/genetics, Machine Learning, Models, Statistical, Molecular Epidemiology, Artificial Intelligence, Humans
17.
Biom J ; 62(3): 670-687, 2020 05.
Article in English | MEDLINE | ID: mdl-31099917

ABSTRACT

Uncertainty is a crucial issue in statistics and can be considered from different points of view. One type of uncertainty, typically referred to as sampling uncertainty, arises through the variability of results obtained when the same analysis strategy is applied to different samples. Another type of uncertainty arises through the variability of results obtained when using the same sample but different analysis strategies addressing the same research question. We denote this latter type of uncertainty as method uncertainty. It results from all the choices to be made for an analysis, for example, decisions related to data preparation, method choice, or model selection. In medical sciences, a large part of omics research is focused on the identification of molecular biomarkers, which can either be performed through ranking or by selection from among a large number of candidates. In this paper, we introduce a general resampling-based framework to quantify and compare sampling and method uncertainty. For illustration, we apply this framework to different scenarios related to the selection and ranking of omics biomarkers in the context of acute myeloid leukemia: variable selection in multivariable regression using different types of omics markers, the ranking of biomarkers according to their predictive performance, and the identification of differentially expressed genes from RNA-seq data. For all three scenarios, our findings suggest highly unstable results when the same analysis strategy is applied to two independent samples, indicating high sampling uncertainty and a comparatively smaller, but non-negligible, method uncertainty, which strongly depends on the methods being compared.
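The resampling idea can be sketched generically: compare the overlap (here Jaccard similarity) of selected variable sets across subsamples for one method (sampling uncertainty) and across methods on the same data (method uncertainty). The data, selection methods, and sizes below are invented for illustration and are much smaller than real omics settings:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 300, 50, 10

# Toy "omics" data with a handful of truly associated variables.
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = 0.8
y = X @ beta + rng.normal(size=n)

def select_corr(X, y, k):
    """Method A: top-k variables by absolute marginal correlation."""
    r = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return set(np.argsort(r)[::-1][:k])

def select_ols(X, y, k):
    """Method B: top-k variables by absolute OLS coefficient."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return set(np.argsort(np.abs(b))[::-1][:k])

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Sampling uncertainty: same method, two disjoint subsamples.
i = rng.permutation(n)
s1, s2 = i[: n // 2], i[n // 2:]
sampling = jaccard(select_corr(X[s1], y[s1], k), select_corr(X[s2], y[s2], k))

# Method uncertainty: two methods, same sample.
method = jaccard(select_corr(X, y, k), select_ols(X, y, k))
print(sampling, method)
```

Repeating the subsampling many times, as the paper's framework does, turns these single overlap values into distributions that can be compared directly between the two sources of uncertainty.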


Subject(s)
Biometry/methods, Computational Biology, Uncertainty, Biomarkers/metabolism, Gene Expression Profiling, Humans, Acute Myeloid Leukemia/genetics, Acute Myeloid Leukemia/metabolism
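A minimal sketch of a resampling scheme in the spirit of this framework (the two toy ranking "methods", the data-generating model, and all parameter values are my own, not the authors'): top-k biomarker sets are compared across disjoint subsamples (sampling uncertainty) and across two ranking methods on the same subsample (method uncertainty), using set overlap:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k, B = 100, 500, 20, 30
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:10] = 0.8           # 10 weakly informative markers
y = X @ beta + rng.normal(scale=3.0, size=n)  # low signal-to-noise outcome

def top_by_corr(Xs, ys):   # toy method A: rank by absolute correlation
    r = np.abs([np.corrcoef(Xs[:, j], ys)[0, 1] for j in range(Xs.shape[1])])
    return set(np.argsort(r)[-k:])

def top_by_slope(Xs, ys):  # toy method B: rank by absolute univariate slope
    b = np.abs(Xs.T @ (ys - ys.mean()) / ((Xs ** 2).sum(axis=0) + 1e-12))
    return set(np.argsort(b)[-k:])

def jaccard(a, b):
    return len(a & b) / len(a | b)

sampling, method = [], []
for _ in range(B):
    i1 = rng.choice(n, size=n // 2, replace=False)
    i2 = np.setdiff1d(np.arange(n), i1)
    a1, a2 = top_by_corr(X[i1], y[i1]), top_by_corr(X[i2], y[i2])
    b1 = top_by_slope(X[i1], y[i1])
    sampling.append(jaccard(a1, a2))  # same method, disjoint half-samples
    method.append(jaccard(a1, b1))    # same half-sample, different methods

print(f"mean overlap across samples (same method): {np.mean(sampling):.2f}")
print(f"mean overlap across methods (same sample): {np.mean(method):.2f}")
```

Consistent with the abstract's finding, the across-sample overlap is typically much lower than the across-method overlap, i.e., sampling uncertainty dominates method uncertainty in this toy setting.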
18.
BMC Med Res Methodol ; 19(1): 162, 2019 07 24.
Article in English | MEDLINE | ID: mdl-31340753

ABSTRACT

BACKGROUND: Omics data can be very informative in survival analysis and may improve the prognostic ability of classical models based on clinical risk factors for various diseases, for example breast cancer. Recent research has focused on integrating omics and clinical data, yet has often ignored the need for appropriate model building for clinical variables. Medical literature on classical prognostic scores, as well as biostatistical literature on appropriate model selection strategies for low-dimensional (clinical) data, are often ignored in the context of omics research. The goal of this paper is to fill this methodological gap by investigating the added predictive value of gene expression data for models using varying amounts of clinical information. METHODS: We analyze two data sets from the field of survival prognosis of breast cancer patients. First, we construct several proportional hazards prediction models using varying amounts of clinical information based on established medical knowledge. These models are then used as a starting point (i.e., included as a clinical offset) for identifying informative gene expression variables using resampling procedures and penalized regression approaches (model-based boosting and the LASSO). In order to assess the added predictive value of the gene signatures, measures of prediction accuracy and separation are examined on a validation data set for the clinical models and the models that combine the two sources of information. RESULTS: For one data set, we do not find any substantial added predictive value of the omics data when compared to clinical models. On the second data set, we identify a noticeable added predictive value, however only for scenarios where little or no clinical information is included in the modeling process. We find that including more clinical information can lead to a smaller number of selected omics predictors.
CONCLUSIONS: New research using omics data should include all available established medical knowledge in order to allow an adequate evaluation of the added predictive value of omics data. Including all relevant clinical information in the analysis might also lead to more parsimonious models. The developed procedure to assess the predictive value of the omics data can be readily applied to other scenarios.


Subject(s)
Breast Neoplasms/genetics, Breast Neoplasms/mortality, Genomics/statistics & numerical data, Statistical Models, Survival Analysis, Datasets as Topic, Female, Gene Expression, Humans, Risk Factors
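The "clinical offset" idea can be sketched in a few lines. This simplifies the paper's setting considerably: a linear outcome stands in for the Cox proportional hazards model, and all variable names, coefficients, and the penalty value are illustrative choices of mine, not the authors'. The key point survives the simplification: the clinical fit is held fixed, so the penalized omics model can only pick up signal beyond the clinical variables:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(2)
n, p_clin, p_omics = 200, 3, 1000
Xc = rng.normal(size=(n, p_clin))    # established clinical risk factors
Xo = rng.normal(size=(n, p_omics))   # gene expression matrix
# Outcome driven mainly by clinical factors plus one true omics marker:
y = Xc @ np.array([1.0, -0.5, 0.8]) + 0.6 * Xo[:, 0] + rng.normal(size=n)

# Step 1: the clinical model is built first, from established knowledge.
clin = LinearRegression().fit(Xc, y)
offset = clin.predict(Xc)

# Step 2: the penalized omics fit starts from the clinical model (offset),
# so it searches only for predictive signal the clinical model misses.
lasso = Lasso(alpha=0.3).fit(Xo, y - offset)
selected = np.flatnonzero(lasso.coef_)
print("omics variables selected beyond the clinical model:", selected)
```

Because the clinical signal is already absorbed by the offset, the LASSO tends to retain only the genuinely complementary omics marker, mirroring the paper's observation that richer clinical models yield more parsimonious omics signatures.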
19.
Biom J ; 61(5): 1314-1328, 2019 09.
Article in English | MEDLINE | ID: mdl-30069934

ABSTRACT

Ideally, prediction rules should be published in such a way that readers may apply them, for example, to make predictions for their own data. While this is straightforward for simple prediction rules, such as those based on the logistic regression model, it is much more difficult for complex prediction rules derived by machine learning tools. We conducted a survey of articles reporting prediction rules that were constructed using the random forest algorithm and published in PLOS ONE in 2014-2015 in the field "medical and health sciences", with the aim of identifying issues related to their applicability. Making a prediction rule reproducible is a possible way to ensure that it is applicable; thus reproducibility is also examined in our survey. The presented prediction rules were applicable in only 2 of 30 identified papers, while for a further eight prediction rules it was possible to obtain the necessary information by contacting the authors. Various problems, such as nonresponse of the authors, hampered the applicability of prediction rules in the other cases. Based on our experiences from this illustrative survey, we formulate a set of recommendations for authors who aim to make complex prediction rules applicable for readers. All data, including the description of the considered studies and analysis codes, are available as supplementary materials.


Subject(s)
Biometry/methods, Medicine, Science, Software
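A sketch of what makes a random forest prediction rule applicable by readers, in the sense the survey examines. This is my own illustration, not taken from the surveyed papers or the authors' recommendations; the feature names and data are hypothetical. The essentials are a fixed random seed, a persisted fitted model, and a documented feature order:

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy outcome for illustration

# Hypothetical feature names; documenting their order is part of the rule.
FEATURES = ["age", "biomarker_a", "biomarker_b", "biomarker_c"]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Persist the fitted rule so readers can apply it without retraining ...
blob = pickle.dumps({"model": rf, "features": FEATURES})

# ... and a reader later reloads it and predicts for a new observation.
rule = pickle.loads(blob)
new_patient = np.array([[0.5, 1.2, -0.3, 0.0]])  # values in FEATURES order
print("predicted class:", rule["model"].predict(new_patient)[0])
```

Publishing only variable importances or tuning parameters, as several surveyed papers did, is not enough: without the fitted ensemble itself (or code and data to rebuild it deterministically), the rule cannot be applied.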