ABSTRACT
Metabolism during pregnancy is a dynamic and precisely programmed process, the failure of which can bring devastating consequences to mother and fetus. To define a high-resolution temporal profile of metabolites during healthy pregnancy, we analyzed the untargeted metabolome of 784 weekly blood samples from 30 pregnant women. Broad changes and a highly choreographed profile were revealed: 4,995 metabolic features (of 9,651 total), 460 annotated compounds (of 687 total), and 34 human metabolic pathways (of 48 total) changed significantly during pregnancy. Using linear models, we built a metabolic clock of five metabolites that times gestational age in high accordance with ultrasound (R = 0.92). Furthermore, two to three metabolites can identify when labor occurs (time to delivery within two, four, and eight weeks; AUROC ≥ 0.85). Our study represents a weekly characterization of the human pregnancy metabolome, providing a high-resolution landscape for understanding pregnancy with potential clinical utility.
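The sparse linear-model idea behind such a metabolic clock can be sketched in a few lines: a lasso regression selects a handful of metabolites whose levels jointly track gestational age. This is a toy simulation, not the study's data; the metabolite levels, coefficients, and penalty `alpha` are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_metabolites = 200, 50
X = rng.normal(size=(n_samples, n_metabolites))            # metabolite levels
true_coef = np.zeros(n_metabolites)
true_coef[:5] = [2.0, -1.5, 1.0, 0.8, -0.5]                # 5 informative metabolites
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)  # "gestational age" (arbitrary units)

clock = Lasso(alpha=0.3).fit(X, y)                         # sparse linear "clock"
selected = np.flatnonzero(clock.coef_)                     # metabolites retained by the lasso
r = np.corrcoef(clock.predict(X), y)[0, 1]
print(len(selected), round(r, 2))
```

With a penalty this strong, the fit keeps the informative metabolites plus at most a few noise features while still tracking the target closely.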
Subject(s)
Gestational Age, Metabolomics/methods, Pregnancy/metabolism, Adult, Biomarkers/blood, Female, Fetus/metabolism, Humans, Metabolic Networks and Pathways/physiology, Metabolome/physiology, Pregnant Women
ABSTRACT
Accurate prediction of long-term outcomes remains a challenge in the care of cancer patients. Due to the difficulty of serial tumor sampling, previous prediction tools have focused on pretreatment factors. However, emerging non-invasive diagnostics have increased opportunities for serial tumor assessments. We describe the Continuous Individualized Risk Index (CIRI), a method to dynamically determine outcome probabilities for individual patients utilizing risk predictors acquired over time. Similar to "win probability" models in other fields, CIRI provides a real-time probability by integrating risk assessments throughout a patient's course. Applying CIRI to patients with diffuse large B cell lymphoma, we demonstrate improved outcome prediction compared to conventional risk models. We demonstrate CIRI's broader utility in analogous models of chronic lymphocytic leukemia and breast adenocarcinoma and perform a proof-of-concept analysis demonstrating how CIRI could be used to develop predictive biomarkers for therapy selection. We envision that dynamic risk assessment will facilitate personalized medicine and enable innovative therapeutic paradigms.
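The kind of sequential risk updating described above can be illustrated with a simple odds-update rule, where each serial risk assessment contributes a likelihood ratio. This is a generic naive-Bayes-style sketch, not CIRI's actual model; the prior probability and likelihood ratios below are invented.

```python
def update_odds(prior_prob, likelihood_ratios):
    """Update an outcome probability as serial risk assessments arrive.

    Each assessment contributes a likelihood ratio that multiplies the
    patient's current odds of the outcome (naive-Bayes-style updating).
    """
    odds = prior_prob / (1 - prior_prob)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

# baseline risk 30%; an adverse interim result (LR 4.0) followed by a
# favorable one (LR 0.25) cancels out and returns the risk to baseline
p = update_odds(0.30, [4.0, 0.25])
print(round(p, 2))
```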
Subject(s)
Tumor Biomarkers/metabolism, Breast Neoplasms/pathology, Diffuse Large B-Cell Lymphoma/pathology, Precision Medicine, Algorithms, Antineoplastic Agents/therapeutic use, Tumor Biomarkers/blood, Breast Neoplasms/drug therapy, Breast Neoplasms/mortality, Circulating Tumor DNA/blood, Female, Humans, Kaplan-Meier Estimate, Diffuse Large B-Cell Lymphoma/drug therapy, Diffuse Large B-Cell Lymphoma/mortality, Neoadjuvant Therapy, Prognosis, Progression-Free Survival, Proportional Hazards Models, Risk Assessment, Treatment Outcome
ABSTRACT
Radiologic screening of high-risk adults reduces lung-cancer-related mortality [1,2]; however, only a small minority of eligible individuals undergo such screening in the United States [3,4]. The availability of blood-based tests could increase screening uptake. Here we introduce improvements to cancer personalized profiling by deep sequencing (CAPP-Seq) [5], a method for the analysis of circulating tumour DNA (ctDNA), to better facilitate screening applications. We show that, although levels are very low in early-stage lung cancers, ctDNA is present prior to treatment in most patients and its presence is strongly prognostic. We also find that the majority of somatic mutations in the cell-free DNA (cfDNA) of patients with lung cancer and of risk-matched controls reflect clonal haematopoiesis and are non-recurrent. Compared with tumour-derived mutations, clonal haematopoiesis mutations occur on longer cfDNA fragments and lack mutational signatures that are associated with tobacco smoking. Integrating these findings with other molecular features, we develop and prospectively validate a machine-learning method termed 'lung cancer likelihood in plasma' (Lung-CLiP), which can robustly discriminate early-stage lung cancer patients from risk-matched controls. This approach achieves performance similar to that of tumour-informed ctDNA detection and enables tuning of assay specificity to facilitate distinct clinical applications. Our findings establish the potential of cfDNA for lung cancer screening and highlight the importance of risk-matching cases and controls in cfDNA-based screening studies.
Subject(s)
Circulating Tumor DNA/analysis, Circulating Tumor DNA/genetics, Early Detection of Cancer/methods, Human Genome/genetics, Lung Neoplasms/diagnosis, Lung Neoplasms/genetics, Mutation, Cohort Studies, Female, Hematopoiesis/genetics, Humans, Lung/metabolism, Lung/pathology, Lung Neoplasms/blood, Lung Neoplasms/pathology, Male, Middle Aged, Reproducibility of Results
ABSTRACT
We propose a method for supervised learning with multiple sets of features ("views"). The multiview problem is especially important in biology and medicine, where "-omics" data, such as genomics, proteomics, and radiomics, are measured on a common set of samples. "Cooperative learning" combines the usual squared-error loss of predictions with an "agreement" penalty to encourage the predictions from different data views to agree. By varying the weight of the agreement penalty, we get a continuum of solutions that include the well-known early and late fusion approaches. Cooperative learning chooses the degree of agreement (or fusion) in an adaptive manner, using a validation set or cross-validation to estimate test set prediction error. One version of our fitting procedure is modular, where one can choose different fitting mechanisms (e.g., lasso, random forests, boosting, or neural networks) appropriate for different data views. In the setting of cooperative regularized linear regression, the method combines the lasso penalty with the agreement penalty, yielding feature sparsity. The method can be especially powerful when the different data views share some underlying relationship in their signals that can be exploited to boost the signals. We show that cooperative learning achieves higher predictive accuracy on simulated data and real multiomics examples of labor-onset prediction. By leveraging aligned signals and allowing flexible fitting mechanisms for different modalities, cooperative learning offers a powerful approach to multiomics data fusion.
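For two views and squared-error loss, the cooperative objective can be solved as an ordinary lasso on augmented data, which is a convenient way to see how the agreement penalty enters. This is a minimal sketch on simulated data, assuming the two-view form of the objective; the values of `rho` and `alpha` are arbitrary, and setting `rho = 0` recovers early fusion.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, q = 150, 20, 20
latent = rng.normal(size=(n, 1))                    # shared signal across views
X = latent + rng.normal(scale=0.5, size=(n, p))     # view 1 (e.g., one omic)
Z = latent + rng.normal(scale=0.5, size=(n, q))     # view 2 (another omic)
y = 3 * latent.ravel() + rng.normal(scale=0.5, size=n)

def cooperative_lasso(X, Z, y, rho, alpha):
    """Solve the cooperative objective as a lasso on augmented data.

    The agreement penalty (rho/2) * ||X b_x - Z b_z||^2 is absorbed by
    stacking extra rows [-sqrt(rho) X, sqrt(rho) Z] with zero responses.
    """
    s = np.sqrt(rho)
    X_aug = np.vstack([np.hstack([X, Z]),
                       np.hstack([-s * X, s * Z])])
    y_aug = np.concatenate([y, np.zeros(len(y))])
    fit = Lasso(alpha=alpha, fit_intercept=False).fit(X_aug, y_aug)
    return fit.coef_[:X.shape[1]], fit.coef_[X.shape[1]:]

bx, bz = cooperative_lasso(X, Z, y, rho=0.5, alpha=0.05)
pred = X @ bx + Z @ bz
print(round(np.corrcoef(pred, y)[0, 1], 2))
```

In practice one would tune `rho` (the degree of fusion) and `alpha` on a validation set, as the abstract describes.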
Subject(s)
Genomics, Neural Networks (Computer), Supervised Machine Learning, Genomics/methods
ABSTRACT
We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,500 traits using genetic and phenotype data in the UK Biobank. We report 813 sparse PRS models with significant (p < 2.5 × 10^-5) incremental predictive performance when compared against the covariate-only model that considers age, sex, types of genotyping arrays, and the principal component loadings of genotypes. We report a significant correlation between the number of genetic variants selected in the sparse PRS model and the incremental predictive performance (Spearman's ρ = 0.61, p = 2.2 × 10^-59 for quantitative traits; ρ = 0.21, p = 9.6 × 10^-4 for binary traits). The sparse PRS model trained on European individuals showed limited transferability when evaluated on non-European individuals in the UK Biobank. We provide the PRS model weights on the Global Biobank Engine (https://biobankengine.stanford.edu/prs).
Subject(s)
Genome-Wide Association Study, Multifactorial Inheritance, Biological Specimen Banks, Genetic Predisposition to Disease, Humans, Multifactorial Inheritance/genetics, Phenotype, Risk Factors, United Kingdom
ABSTRACT
The main objective of most clinical trials is to estimate the effect of some treatment compared with a control condition. We define the signal-to-noise ratio (SNR) as the ratio of the true treatment effect to the standard error of its estimate. In a previous publication in this journal, we estimated the distribution of the SNR among the clinical trials in the Cochrane Database of Systematic Reviews (CDSR). We found that the SNR is often low, which implies that the power against the true effect is also low in many trials. Here we use the fact that the CDSR is a collection of meta-analyses to quantitatively assess the consequences. Among trials that have reached statistical significance, we find considerable overoptimism of the usual unbiased estimator and under-coverage of the associated confidence interval. We have previously proposed a shrinkage estimator to address this "winner's curse." We compare the performance of our shrinkage estimator to that of the usual unbiased estimator in terms of root mean squared error, coverage, and bias of the magnitude. We find superior performance of the shrinkage estimator both conditionally and unconditionally on statistical significance.
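The overoptimism conditional on significance is easy to reproduce in simulation. The sketch below quantifies the winner's curse for a single invented SNR value; it does not implement the paper's shrinkage estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
snr = 1.0                              # true effect divided by the SE of its estimate
z = snr + rng.normal(size=100_000)     # standardized effect estimates across many "trials"
significant = np.abs(z) > 1.96         # trials reaching two-sided p < 0.05

bias_all = z.mean() - snr              # unconditional: essentially unbiased
bias_sig = z[significant].mean() - snr # conditional on significance: badly inflated
print(round(bias_all, 2), round(bias_sig, 2))
```

The estimator is unbiased over all trials, yet among the "winners" it overstates the true effect by well over one standard error at this SNR.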
Subject(s)
Clinical Trials as Topic, Humans, Bias, Systematic Reviews as Topic, Meta-Analysis as Topic
ABSTRACT
Short-term forecasts of traditional streams from public health reporting (such as cases, hospitalizations, and deaths) are a key input to public health decision-making during a pandemic. Since early 2020, our research group has worked with data partners to collect, curate, and make publicly available numerous real-time COVID-19 indicators, providing multiple views of pandemic activity in the United States. This paper studies the utility of five such indicators (derived from deidentified medical insurance claims, self-reported symptoms from online surveys, and COVID-related Google search activity) from a forecasting perspective. For each indicator, we ask whether its inclusion in an autoregressive (AR) model leads to improved predictive accuracy relative to the same model excluding it. Such an AR model, without external features, is already competitive with many top COVID-19 forecasting models in use today. Our analysis reveals that (1) inclusion of each of these five indicators improves the overall predictive accuracy of the AR model; (2) predictive gains are in general most pronounced during times in which COVID cases are trending "flat" or "down"; and (3) one indicator, based on Google searches, seems to be particularly helpful during "up" trends.
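The indicator comparison can be mimicked on synthetic data: fit an AR model with and without one lagged external feature and compare held-out errors. Everything here (the AR(2) form, the coefficients, and both series) is simulated for illustration and has no connection to the real COVID indicators.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 400
indicator = rng.normal(size=T)             # external indicator series
target = np.zeros(T)
for t in range(2, T):                      # AR(2) target partly driven by the lagged indicator
    target[t] = (0.5 * target[t - 1] + 0.2 * target[t - 2]
                 + 0.8 * indicator[t - 1] + 0.3 * rng.normal())

def ar_mse(use_indicator):
    """Fit by least squares on the first half; return test MSE on the second half."""
    rows, ys = [], []
    for t in range(2, T):
        row = [1.0, target[t - 1], target[t - 2]]
        if use_indicator:
            row.append(indicator[t - 1])
        rows.append(row)
        ys.append(target[t])
    A, b = np.array(rows), np.array(ys)
    train = len(b) // 2
    coef, *_ = np.linalg.lstsq(A[:train], b[:train], rcond=None)
    resid = A[train:] @ coef - b[train:]
    return np.mean(resid ** 2)

print(ar_mse(False) > ar_mse(True))   # including the indicator should reduce test MSE
```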
Subject(s)
COVID-19/epidemiology, Health Status Indicators, Statistical Models, Epidemiologic Methods, Forecasting, Humans, Internet/statistics & numerical data, Surveys and Questionnaires, United States/epidemiology
ABSTRACT
We develop a scalable and highly efficient algorithm to fit a Cox proportional hazards model by maximizing the $L^1$-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian and others (2019). Our algorithm is particularly suitable for large-scale, high-dimensional data that do not fit in memory. The output of our algorithm is the full Lasso path, i.e., the parameter estimates at all predefined regularization parameters, as well as their validation accuracy measured using the concordance index (C-index) or the validation deviance. To demonstrate the effectiveness of our algorithm, we analyze a large genotype-survival time dataset across 306 disease outcomes from the UK Biobank (Sudlow and others, 2015). We provide a publicly available implementation of the proposed approach for genetics data on top of the PLINK2 package and name it snpnet-Cox.
Subject(s)
Algorithms, Biological Specimen Banks, Humans, Likelihood Functions, Proportional Hazards Models, United Kingdom
ABSTRACT
Cross-validation (CV) is one of the most widely used techniques in statistical learning for estimating the test error of a model, but its behavior is not yet fully understood. It has been shown that standard confidence intervals for test error based on CV estimates may have coverage below nominal levels. This phenomenon occurs because each sample is used in both the training and testing procedures during CV; as a result, the CV estimates of the errors become correlated. Without accounting for this correlation, the estimate of the variance is smaller than it should be. One way to mitigate this issue is to instead estimate the mean squared error of the prediction-error estimate using nested CV. This approach has been shown to achieve superior coverage compared with intervals derived from standard CV. In this work, we generalize the nested CV idea to the Cox proportional hazards model and explore various choices of test error for this setting.
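The loop structure of nested CV can be sketched as follows: an inner CV runs entirely inside each outer training fold. This shows only the nesting on simulated regression data; it does not reproduce the specific variance estimator, nor its Cox-model extension.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X[:, 0] + rng.normal(scale=0.5, size=120)

outer = KFold(n_splits=5, shuffle=True, random_state=0)
outer_errors, inner_errors = [], []
for fit_idx, hold_idx in outer.split(X):
    # inner CV confined to the outer training set
    inner = KFold(n_splits=4, shuffle=True, random_state=1)
    for tr, te in inner.split(fit_idx):
        tr_idx, te_idx = fit_idx[tr], fit_idx[te]
        m = LinearRegression().fit(X[tr_idx], y[tr_idx])
        inner_errors.append(np.mean((m.predict(X[te_idx]) - y[te_idx]) ** 2))
    # error of a model trained on the full outer training set, on the held-out fold
    m = LinearRegression().fit(X[fit_idx], y[fit_idx])
    outer_errors.append(np.mean((m.predict(X[hold_idx]) - y[hold_idx]) ** 2))

print(round(np.mean(inner_errors), 2), round(np.mean(outer_errors), 2))
```

Comparing the spread of the inner-loop errors to the outer-loop errors is the raw material from which the nested-CV variance correction is built.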
Subject(s)
Research Design, Humans, Proportional Hazards Models, Confidence Intervals
ABSTRACT
The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple-regression methods have already been shown to greatly improve prediction performance for a variety of phenotypes compared with genome-wide association studies (GWAS). In high-dimensional settings, the lasso, since its first proposal in statistics, has proven to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimensionality of the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations do not scale to such large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including data larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and is optimized for single nucleotide polymorphism (SNP) datasets. It currently supports ℓ1-penalized linear models, logistic regression, and Cox models, and also extends to the elastic net with the ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants used by other established polygenic risk score methods.
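The batch-screening idea can be sketched with an in-memory toy: fit the lasso on a screened subset of features, then use the KKT optimality condition to check whether any left-out feature should enter the model. All sizes and the penalty below are invented, and the real snpnet streams variants that do not fit in memory; this sketch only shows the screen-fit-check loop.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:10] = 1.0                                     # 10 truly associated "variants"
y = X @ beta + rng.normal(size=n)
alpha = 0.2

active = np.argsort(-np.abs(X.T @ y))[:50]          # initial screening batch
for _ in range(20):
    fit = Lasso(alpha=alpha, fit_intercept=False).fit(X[:, active], y)
    resid = y - X[:, active] @ fit.coef_
    grad = np.abs(X.T @ resid) / n                  # KKT check over all p features
    violators = np.setdiff1d(np.flatnonzero(grad > alpha * 1.001), active)
    if violators.size == 0:                         # no violations: solved on full data
        break
    active = np.union1d(active, violators)          # add violators and refit

support = active[np.flatnonzero(fit.coef_)]
print(support.size)
```

Because most features never violate the KKT condition, the solver only ever fits on a small active set, which is what makes the approach scale.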
Subject(s)
Asthma/epidemiology, Biological Specimen Banks, Population Genetics, Genome-Wide Association Study, Algorithms, Asthma/blood, Asthma/genetics, Body Height/genetics, Body Mass Index, Cholesterol/blood, Cohort Studies, Genotype, Humans, Logistic Models, Phenotype, Single Nucleotide Polymorphism/genetics, Proportional Hazards Models, United Kingdom/epidemiology
ABSTRACT
In some supervised learning settings, the practitioner might have additional information on the features used for prediction. We propose a new method which leverages this additional information for better prediction. The method, which we call the feature-weighted elastic net ("fwelnet"), uses these "features of features" to adapt the relative penalties on the feature coefficients in the elastic net penalty. In our simulations, fwelnet outperforms the lasso in terms of test mean squared error and usually gives an improvement in true positive rate or false positive rate for feature selection. We also apply this method to early prediction of preeclampsia, where fwelnet outperforms the lasso in terms of 10-fold cross-validated area under the curve (0.86 vs. 0.80). We also provide a connection between fwelnet and the group lasso and suggest how fwelnet might be used for multi-task learning.
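Per-feature penalty weights driven by side information can be emulated with the standard column-rescaling trick: a lasso with penalty weight w_j on coefficient j is an ordinary lasso after dividing column j by w_j. This is a rough stand-in for fwelnet, which additionally learns the weighting from the "features of features"; the weight function and all data below are invented.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 150, 30
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0
y = X @ beta + rng.normal(size=n)

informative = np.zeros(p)
informative[:5] = 1.0                    # side information ("feature of features")
w = np.exp(-1.5 * informative)           # smaller penalty weight for flagged features

fit = Lasso(alpha=0.2).fit(X / w, y)     # weighted lasso via column rescaling
coef = fit.coef_ / w                     # map coefficients back to the original scale
print(np.flatnonzero(coef).size)
```

Features flagged by the side information face a much lighter penalty and are therefore retained more easily, which is the qualitative behavior fwelnet exploits.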
ABSTRACT
Summary: In the last few years, desorption electrospray ionization mass spectrometry imaging (DESI-MSI) has been increasingly used for simultaneous detection of thousands of metabolites and lipids from human tissues and biofluids. Successfully finding the most significant differences between two sets of DESI-MSI data (e.g., healthy vs. disease) requires accurate computational and statistical methods that can pre-process the data under various normalization settings and help identify these changes among thousands of detected metabolites. Here, we report MassExplorer, a novel computational tool that helps pre-process DESI-MSI data, visualize raw data, build predictive models using the statistical lasso approach to select a sparse set of significant molecular changes, and interpret the selected metabolites. This tool, which is available for both online and offline use, is designed for chemists, biologists, and statisticians alike, as it helps in visualizing the structure of DESI-MSI data and in analyzing the statistically significant metabolites that are differentially expressed across both sample types. Based on the modules in MassExplorer, we expect it to be immediately useful for various biological and chemical applications in mass spectrometry. Availability and implementation: MassExplorer is available as an online R-Shiny application or as a Mac OS X-compatible standalone application. The application, sample performance, source code, and corresponding guide can be found at: https://zarelab.com/research/massexplorer-a-tool-to-help-guide-analysis-of-mass-spectrometry-samples/. Supplementary information: Supplementary data are available at Bioinformatics online.
ABSTRACT
MOTIVATION: Large-scale, high-dimensional genome sequencing data pose computational challenges. General-purpose optimization tools are usually not optimal in computational and memory performance for genetic data. RESULTS: We develop two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals. These genetic variants are encoded by values in the set {0, 1, 2, NA}. We take advantage of this fact and use two bits to represent each entry in a genetic matrix, which reduces the memory requirement by a factor of 32 compared with a double-precision floating-point representation. Using this representation, we implemented an iteratively reweighted least squares algorithm to solve Lasso regressions on genetic matrices, which we name snpnet-2.0. When the dataset contains many rare variants, the predictors can be encoded in a sparse matrix. We utilize the sparsity of the predictor matrix to further reduce the memory requirement and to improve computational speed. Our sparse genetic matrix implementation combines the compact two-bit representation with a simplified version of the compressed sparse block format, so that matrix-vector multiplications can be effectively parallelized across multiple CPU cores. To demonstrate the effectiveness of this representation, we implement an accelerated proximal gradient method to solve the group Lasso on these sparse genetic matrices. This solver is named sparse-snpnet and will also be included in the snpnet R package. Our implementation is able to solve Lasso and group Lasso problems, for linear, logistic, and Cox regression, on sparse genetic matrices that contain 1,000,000 variants and almost 100,000 individuals within 10 min, using less than 32 GB of memory. AVAILABILITY AND IMPLEMENTATION: https://github.com/rivas-lab/snpnet/tree/compact.
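The two-bit encoding is simple to illustrate in pure Python: genotypes in {0, 1, 2, NA} need only two bits each, so four genotypes pack into one byte, a 32-fold saving over 64-bit doubles, matching the factor quoted above. The bit layout here is invented for illustration and need not match snpnet-2.0's actual on-disk format.

```python
CODE = {0: 0b00, 1: 0b01, 2: 0b10, None: 0b11}   # NA represented as None, encoded 0b11
DECODE = {v: k for k, v in CODE.items()}

def pack(genotypes):
    """Pack a list of genotypes, four per byte, low bits first."""
    out = bytearray()
    for i in range(0, len(genotypes), 4):
        byte = 0
        for j, g in enumerate(genotypes[i:i + 4]):
            byte |= CODE[g] << (2 * j)
        out.append(byte)
    return bytes(out)

def unpack(packed, n):
    """Recover the first n genotypes from a packed byte string."""
    return [DECODE[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]

genos = [0, 1, 2, None, 2, 2, 0, 1, 1]
packed = pack(genos)
print(len(packed), unpack(packed, len(genos)) == genos)  # 3 bytes for 9 genotypes
```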
Subject(s)
Biological Specimen Banks, Genome, Humans, Algorithms, Chromosome Mapping, Least-Squares Analysis
ABSTRACT
MOTIVATION: The prediction performance of the Cox proportional hazards model suffers when there are only a few uncensored events in the training data. RESULTS: We propose a sparse-group regularized Cox regression method to improve the prediction performance for large-scale, high-dimensional survival data with few observed events. Our approach is applicable when there are one or more other survival responses that (1) have a large number of observed events and (2) share a common set of associated predictors with the rare-event response. This scenario is common in the UK Biobank dataset, where records for a large number of common and less prevalent diseases are available for the same set of individuals. By analyzing these responses together, we hope to achieve higher prediction performance than when they are analyzed individually. To make this approach practical for large-scale data, we developed an accelerated proximal gradient optimization algorithm as well as a screening procedure inspired by Qian et al. AVAILABILITY AND IMPLEMENTATION: https://github.com/rivas-lab/multisnpnet-Cox.
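The core update inside an accelerated proximal gradient method for a group penalty is the group-soft-thresholding proximal operator, sketched below on a toy vector. The Cox partial-likelihood gradient step and the sparse-group mixing of ℓ1 and group terms are omitted; this shows the group prox alone.

```python
import numpy as np

def group_prox(beta, groups, step, lam):
    """Proximal operator of the group-lasso penalty (block soft-thresholding).

    Each group is shrunk toward zero and set exactly to zero when its
    Euclidean norm falls below step * lam.
    """
    out = np.zeros_like(beta)
    for g in groups:
        v = beta[g]
        norm = np.linalg.norm(v)
        if norm > step * lam:
            out[g] = (1 - step * lam / norm) * v
    return out

beta = np.array([3.0, 4.0, 0.1, -0.1])
groups = [np.array([0, 1]), np.array([2, 3])]
shrunk = group_prox(beta, groups, step=1.0, lam=1.0)
print(shrunk)   # first group shrunk toward zero, second group zeroed out entirely
```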
Subject(s)
Algorithms, Humans, Survival Analysis, Proportional Hazards Models, Regression Analysis
ABSTRACT
Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or "mutational signatures". Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates a user-specified background signature, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using a variety of standard metrics. We then apply SparseSignatures to whole genome sequences of pancreatic and breast tumors, discovering well-differentiated signatures that are linked to known mutagenic mechanisms and are strongly associated with patient clinical features.
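As a baseline for the signature-extraction task described above, plain non-negative matrix factorization on a simulated mutation-count matrix looks as follows. SparseSignatures adds a fixed background signature, a sparsity-inducing regularizer, and cross-validation over the number of signatures, none of which are reproduced in this sketch; the simulation parameters are arbitrary.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_samples, n_contexts, k = 60, 96, 3
H_true = rng.dirichlet(np.ones(n_contexts), size=k)   # signatures: rows sum to 1
W_true = rng.gamma(2.0, 50.0, size=(n_samples, k))    # per-tumor exposures
V = rng.poisson(W_true @ H_true)                      # observed mutation counts

model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(V)                            # estimated exposures
H = model.components_                                 # estimated signatures
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(round(rel_err, 2))
```

The rank-3 factorization explains most of the count matrix, with the remaining relative error dominated by Poisson sampling noise.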
Subject(s)
DNA Mutational Analysis/statistics & numerical data, Neoplasms/genetics, Point Mutation, Algorithms, Tumor Biomarkers/genetics, Breast Neoplasms/classification, Breast Neoplasms/genetics, Computational Biology, Computer Simulation, Genetic Databases/statistics & numerical data, Female, BRCA1 Genes, BRCA2 Genes, Human Genome, Humans, Pancreatic Neoplasms/classification, Pancreatic Neoplasms/genetics, Software
ABSTRACT
Thyroid neoplasia is common and requires appropriate clinical workup with imaging and fine-needle aspiration (FNA) biopsy to evaluate for cancer. Yet, up to 20% of thyroid nodule FNA biopsies will be indeterminate in diagnosis based on cytological evaluation. Genomic approaches to characterize the malignant potential of nodules showed initial promise but have provided only modest improvement in diagnosis. Here, we describe a method using metabolic analysis by desorption electrospray ionization mass spectrometry (DESI-MS) imaging for direct analysis and diagnosis of follicular cell-derived neoplasia tissues and FNA biopsies. DESI-MS was used to analyze 178 tissue samples to determine the molecular signatures of normal, benign follicular adenoma (FTA), and malignant follicular carcinoma (FTC) and papillary carcinoma (PTC) thyroid tissues. Statistical classifiers, including benign thyroid versus PTC and benign thyroid versus FTC, were built and validated with 114,125 mass spectra, with accuracy assessed in correlation with clinical pathology. Clinical FNA smears were prospectively collected and analyzed using DESI-MS imaging, and the performance of the statistical classifiers was tested with 69 prospectively collected clinical FNA smears. High performance was achieved for both models when predicting on the FNA test set, which included 24 nodules with indeterminate preoperative cytology, with accuracies of 93% and 89%. Our results strongly suggest that DESI-MS imaging is a valuable technology for identification of malignant potential of thyroid nodules.
Subject(s)
Electrospray Ionization Mass Spectrometry/methods, Thyroid Neoplasms/diagnostic imaging, Thyroid Neoplasms/pathology, Thyroid Nodule/metabolism, Fine-Needle Biopsy, Female, Humans, Male, Prospective Studies, Thyroid Neoplasms/metabolism, Thyroid Nodule/chemistry, Thyroid Nodule/diagnostic imaging
ABSTRACT
The outbreak of COVID-19 has created an unprecedented global crisis. While the polymerase chain reaction (PCR) is the gold-standard method for detecting active SARS-CoV-2 infection, alternative high-throughput diagnostic tests are of significant value to meet universal testing demands. Here, we describe a new design of the MasSpec Pen technology integrated with electrospray ionization (ESI) for direct analysis of clinical swabs and investigate its use for COVID-19 screening. The redesigned MasSpec Pen system incorporates a disposable sampling device refined for uniform and efficient analysis of swab tips via liquid extraction directly coupled to an ESI source. Using this system, we analyzed nasopharyngeal swabs from 244 individuals, including symptomatic COVID-19-positive, symptomatic negative, and asymptomatic negative individuals, enabling rapid detection of rich lipid profiles. Two statistical classifiers were generated based on the lipid information acquired. Classifier 1 was built to distinguish symptomatic PCR-positive from asymptomatic PCR-negative individuals, yielding a cross-validation accuracy of 83.5%, sensitivity of 76.6%, and specificity of 86.6%, and a validation-set accuracy of 89.6%, sensitivity of 100%, and specificity of 85.3%. Classifier 2 was built to distinguish symptomatic PCR-positive patients from negative individuals, including symptomatic PCR-negative patients with moderate to severe symptoms and asymptomatic individuals, yielding a cross-validation accuracy of 78.4%, specificity of 77.21%, and sensitivity of 81.8%. Collectively, this study suggests that the lipid profiles detected directly from nasopharyngeal swabs using MasSpec Pen-ESI mass spectrometry (MS) allow fast (under a minute) screening for COVID-19 using minimal operating steps and no specialized reagents, representing a promising alternative high-throughput method for COVID-19 screening.
Subject(s)
COVID-19, Routine Diagnostic Tests, Humans, Nasopharynx, SARS-CoV-2, Sensitivity and Specificity, Specimen Handling
ABSTRACT
Cross-species differences form barriers to translational research that ultimately hinder the success of clinical trials, yet knowledge of species differences has not been systematically incorporated into the interpretation of animal models. Here we present Found In Translation (FIT; http://www.mouse2man.org), a statistical methodology that leverages public gene expression data to extrapolate the results of a new mouse experiment to expression changes in the equivalent human condition. We applied FIT to data from mouse models of 28 different human diseases and identified experimental conditions in which FIT predictions outperformed direct cross-species extrapolation from mouse results, increasing the overlap of differentially expressed genes by 20-50%. FIT predicted novel disease-associated genes, an example of which we validated experimentally. FIT highlights signals that may otherwise be missed and reduces false leads, with no experimental cost.
Subject(s)
Gene Expression Profiling, Genomics/methods, Inflammatory Bowel Diseases/genetics, Machine Learning, Transcriptome, Translational Biomedical Research, Algorithms, Animals, Case-Control Studies, Female, Humans, Male, Mice, Middle Aged, Signal Transduction
ABSTRACT
We study the assessment of the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard computation of prediction errors is not applicable. To tackle the difficulty, we propose an assessment approach by constructing pseudo-observations of the HTE based on matching. Our contributions are three-fold: first, we introduce a novel matching distance derived from proximity scores in random forests; second, we formulate the matching problem as an average minimum-cost flow problem and provide an efficient algorithm; third, we propose a match-then-split principle for the assessment with cross-validation. We demonstrate the efficacy of the assessment approach using simulations and a real dataset.
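The pseudo-observation idea can be illustrated with a simplified stand-in: match treated to control units one-to-one and take matched outcome differences as pseudo-observations of the effect. Euclidean distance and the Hungarian algorithm replace the paper's forest-derived proximity distance and min-cost-flow formulation, and all data are simulated.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(2 * n, 2))
treated, control = X[:n], X[n:]
# outcomes: baseline depends on x0; treatment adds a heterogeneous effect tau(x) = 1 + x0
y_t = treated[:, 0] + (1 + treated[:, 0]) + rng.normal(scale=0.1, size=n)
y_c = control[:, 0] + rng.normal(scale=0.1, size=n)

# pairwise covariate distances, then an optimal one-to-one matching
cost = np.linalg.norm(treated[:, None, :] - control[None, :, :], axis=2)
rows, cols = linear_sum_assignment(cost)
pseudo_hte = y_t[rows] - y_c[cols]        # pseudo-observations of the treatment effect
print(round(pseudo_hte.mean(), 1))
```

Because matched pairs have similar covariates, the matched difference isolates the treatment effect, and the pseudo-observations can then be compared against any HTE estimator's predictions.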
Subject(s)
Algorithms, Humans
ABSTRACT
High-dimensional data are becoming increasingly common in the medical field as large volumes of patient information are collected and processed by high-throughput screening, electronic health records, and comprehensive genomic testing. Statistical models that attempt to study the effects of many predictors on survival typically implement feature selection or penalized methods to mitigate the undesirable consequences of overfitting. In some cases, survival data are also left-truncated, which can give rise to an immortal time bias, yet penalized survival methods that adjust for left truncation are not commonly implemented. To address these challenges, we apply a penalized Cox proportional hazards model for left-truncated and right-censored survival data and assess the implications of left-truncation adjustment for bias and interpretation. We use simulation studies and a high-dimensional, real-world clinico-genomic database to highlight the pitfalls of failing to account for left truncation in survival modeling.
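In the Cox partial likelihood, the left-truncation adjustment amounts to a modified risk set: a subject with delayed entry is at risk only after its entry (truncation) time, which removes the immortal time. A minimal illustration on toy data with invented entry and event times:

```python
import numpy as np

rng = np.random.default_rng(0)
entry = np.array([0.0, 0.5, 1.0, 0.0, 2.0, 0.3])   # delayed-entry (truncation) times
time  = np.array([1.5, 2.5, 1.8, 0.9, 3.0, 2.2])   # event or censoring times
event = np.array([1,   1,   0,   1,   1,   0])     # 1 = event observed

def risk_set(t, adjust_for_entry):
    """Indices of subjects at risk at an event time t."""
    at_risk = time >= t
    if adjust_for_entry:
        at_risk &= entry < t          # must have entered follow-up before t
    return np.flatnonzero(at_risk)

t = 1.5                                # event time of subject 0
naive = risk_set(t, adjust_for_entry=False)
adjusted = risk_set(t, adjust_for_entry=True)
print(list(naive), list(adjusted))     # subject 4 (entry at 2.0) drops out when adjusted
```

Subject 4 has not yet entered the study at t = 1.5, so counting it as "at risk" (the naive risk set) is exactly the immortal time bias the adjustment removes.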