ABSTRACT
Rationale: Emphysema is a chronic obstructive pulmonary disease phenotype with important prognostic implications. Identifying blood-based biomarkers of emphysema will facilitate early diagnosis and development of targeted therapies. Objectives: To discover blood omics biomarkers for chest computed tomography-quantified emphysema and develop predictive biomarker panels. Methods: Emphysema blood biomarker discovery was performed using differential gene expression, alternative splicing, and protein association analyses in a training sample of 2,370 COPDGene participants with available blood RNA sequencing, plasma proteomics, and clinical data. Internal validation was conducted in a COPDGene testing sample (n = 1,016), and external validation was done in the ECLIPSE study (n = 526). Because low body mass index (BMI) and emphysema often co-occur, we performed a mediation analysis to quantify the effect of BMI on gene and protein associations with emphysema. Elastic net models with bootstrapping were also developed in the training sample sequentially using clinical, blood cell proportions, RNA-sequencing, and proteomic biomarkers to predict quantitative emphysema. Model accuracy was assessed by the area under the receiver operating characteristic curves for subjects stratified into tertiles of emphysema severity. Measurements and Main Results: Totals of 3,829 genes, 942 isoforms, 260 exons, and 714 proteins were significantly associated with emphysema (false discovery rate, 5%) and yielded 11 biological pathways. Seventy-four percent of these genes and 62% of these proteins showed mediation by BMI. Our prediction models demonstrated reasonable predictive performance in both COPDGene and ECLIPSE. The highest-performing model used clinical, blood cell, and protein data (area under the receiver operating characteristic curve in COPDGene testing, 0.90; 95% confidence interval, 0.85-0.90). Conclusions: Blood transcriptome and proteome-wide analyses revealed key biological pathways of emphysema and enhanced the prediction of emphysema.
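A minimal sketch of the kind of modeling described above, using simulated data: an elastic net fit to stand-in clinical, blood cell, and protein features, with bootstrap confidence intervals for the AUC of the highest emphysema tertile. All variable names, dimensions, and the bootstrap scheme are illustrative assumptions, not the COPDGene pipeline.

```python
# Illustrative sketch (not the study's pipeline): elastic net prediction of
# quantitative emphysema, then AUC for the highest-severity tertile with a bootstrap CI.
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical design matrix: clinical + blood cell + protein features.
n, p = 500, 40
X = rng.normal(size=(n, p))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=n)  # toy emphysema measure

train, test = np.arange(0, 350), np.arange(350, n)  # training / testing samples

model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X[train], y[train])
pred = model.predict(X[test])

# Discrimination for the highest emphysema tertile, with bootstrap CI.
top_tertile = (y[test] >= np.quantile(y[test], 2 / 3)).astype(int)
aucs = []
for _ in range(200):
    idx = rng.integers(0, len(test), len(test))
    if top_tertile[idx].min() == top_tertile[idx].max():
        continue  # skip degenerate resamples containing a single class
    aucs.append(roc_auc_score(top_tertile[idx], pred[idx]))
print("AUC %.2f (95%% CI %.2f-%.2f)"
      % (roc_auc_score(top_tertile, pred),
         np.percentile(aucs, 2.5), np.percentile(aucs, 97.5)))
```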
Subject(s)
Emphysema, Chronic Obstructive Pulmonary Disease, Pulmonary Emphysema, Humans, Transcriptome, Proteomics, Pulmonary Emphysema/genetics, Pulmonary Emphysema/complications, Biomarkers, Gene Expression Profiling
ABSTRACT
Most predictive models based on gene expression data do not leverage information related to gene splicing, despite the fact that splicing is a fundamental feature of eukaryotic gene expression. Cigarette smoking is an important environmental risk factor for many diseases, and it has profound effects on gene expression. Using smoking status as a prediction target, we developed deep neural network predictive models using gene, exon, and isoform level quantifications from RNA sequencing data in 2,557 subjects in the COPDGene Study. We observed that models using exon and isoform quantifications clearly outperformed gene-level models when using data from 5 genes from a previously published prediction model. Whereas the test set performance of the previously published model was 0.82 in the original publication, our exon-based models including an exon-to-isoform mapping layer achieved a test set AUC (area under the receiver operating characteristic) of 0.88, which improved to an AUC of 0.94 using exon quantifications from a larger set of genes. Isoform variability is an important source of latent information in RNA-seq data that can be used to improve clinical prediction models.
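The exon-to-isoform mapping layer could plausibly be implemented as a linear layer masked by a fixed exon-to-isoform membership matrix; the sketch below shows one such construction in PyTorch. The architecture, layer sizes, and the random mask are assumptions for illustration, not the published model.

```python
# Minimal sketch (assumed architecture): aggregate exon-level quantifications into
# isoform-level features through a fixed exon-to-isoform membership mask, then
# classify smoking status from the isoform features.
import torch
import torch.nn as nn

class ExonToIsoformNet(nn.Module):
    def __init__(self, exon_isoform_mask: torch.Tensor, hidden: int = 64):
        super().__init__()
        n_exons, n_isoforms = exon_isoform_mask.shape
        # Learnable exon weights, constrained to the annotated exon->isoform map.
        self.weight = nn.Parameter(torch.randn(n_exons, n_isoforms) * 0.01)
        self.register_buffer("mask", exon_isoform_mask.float())
        self.classifier = nn.Sequential(
            nn.Linear(n_isoforms, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, exon_counts):
        isoform_features = exon_counts @ (self.weight * self.mask)
        return self.classifier(isoform_features).squeeze(-1)  # logit for smoking status

# Toy usage with a hypothetical annotation of 30 exons mapping to 8 isoforms.
mask = (torch.rand(30, 8) > 0.7).float()
model = ExonToIsoformNet(mask)
logits = model(torch.randn(16, 30))          # batch of 16 subjects
loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (16,)).float())
loss.backward()
```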
Subject(s)
Deep Learning, Statistical Models, RNA-Seq/methods, Smoking, Aged, Computational Biology, Exons/genetics, Female, Gene Expression Profiling, Humans, Male, Middle Aged, Protein Isoforms/genetics, ROC Curve, Smoking/epidemiology, Smoking/genetics
ABSTRACT
Rapid progress in advanced analytical methods, such as single-cell technologies, enables unprecedented and deeper understanding of microbial ecology beyond the resolution of conventional approaches. A major application challenge lies in determining a sufficient sample size without prior knowledge of the community complexity, and in balancing statistical power against limited time or resources. This hinders the desired standardization and wider application of these technologies. Here, we proposed, tested, and validated a computational sampling size assessment protocol taking advantage of a metric named kernel divergence. This metric has two advantages: First, it directly compares data set-wise distributional differences with no requirement for human intervention or prior knowledge-based preclassification. Second, it makes minimal assumptions about the distribution and sample space, broadening its application domain. This enables test-verified appropriate handling of data sets with both linear and nonlinear relationships. The model was then validated in a case study with Single-cell Raman Spectroscopy (SCRS) phenotyping data sets from eight different enhanced biological phosphorus removal (EBPR) activated sludge communities located across North America. The model allows the determination of sufficient sampling size for any targeted or customized information capture capacity or resolution level. Owing to its flexibility and minimal restriction of input data types, the proposed method is expected to become a standardized approach for sampling size optimization, enabling more comparable and reproducible experiments and analyses on complex environmental samples. Finally, these advantages enable the extension of the capability to other single-cell technologies or environmental applications with data sets exhibiting continuous features.
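The published kernel divergence is not reproduced here; as a rough stand-in, the sketch below uses an RBF-kernel maximum mean discrepancy between a subsample and the full data set, and watches the divergence shrink as the subsample grows, which mirrors the protocol's logic of finding a sufficient sampling size.

```python
# Conceptual sketch only: an RBF-kernel maximum mean discrepancy (MMD) serves as a
# stand-in data set-wise distributional distance; a sampling size is deemed
# sufficient once the distance falls below a chosen tolerance.
import numpy as np

def rbf_mmd(x, y, gamma=1.0):
    """Squared MMD between samples x and y under an RBF kernel."""
    def k(a, b):
        d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(1)
full = rng.normal(size=(1000, 5))            # toy single-cell spectral features

for n in (25, 50, 100, 200, 400, 800):
    sub = full[rng.choice(len(full), n, replace=False)]
    print(f"n={n:4d}  divergence={rbf_mmd(sub, full):.4f}")
```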
Subject(s)
Biological Products, Phosphorus, Humans, Machine Learning, Phosphorus/chemistry, Polyphosphates, Sewage, Raman Spectrometry
ABSTRACT
Chronic obstructive pulmonary disease (COPD) is an umbrella definition encompassing multiple disease processes. COPD heterogeneity has been described as distinct subgroups of individuals (subtypes) or as continuous measures of COPD variability (disease axes). There is little consensus on whether subtypes or disease axes are preferred, and the relative value of disease axes and subtypes for predicting COPD progression is unknown. Using a propensity score approach to learn disease axes from pairs of subtypes, we demonstrate that these disease axes predict prospective forced expiratory volume in 1 s decline and emphysema progression more accurately than the subtype pairs from which they were derived.
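A hedged sketch of the general idea, not the authors' exact procedure: a logistic model separating two previously defined subtypes yields a continuous score, and that score is applied to all subjects as a disease axis.

```python
# Toy propensity-score-style disease axis learned from a subtype pair.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 6))                     # hypothetical clinical/imaging features
subtype = np.full(1000, -1)                        # -1 = not in either subtype
subtype[X[:, 0] > 1.0] = 1                         # toy "emphysema-predominant" subtype
subtype[X[:, 0] < -1.0] = 0                        # toy "airway-predominant" subtype

pair = subtype >= 0
axis_model = LogisticRegression().fit(X[pair], subtype[pair])

# Continuous disease axis = linear predictor evaluated for every subject,
# including those belonging to neither subtype.
disease_axis = axis_model.decision_function(X)
print("axis range:", disease_axis.min().round(2), "to", disease_axis.max().round(2))
```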
Subject(s)
Chronic Obstructive Pulmonary Disease/classification, Chronic Obstructive Pulmonary Disease/physiopathology, Disease Progression, Humans, Predictive Value of Tests, Proof of Concept Study, Propensity Score, Respiratory Function Tests, Risk Factors
ABSTRACT
RATIONALE: The relationship between longitudinal lung function trajectories, chest computed tomography (CT) imaging, and genetic predisposition to chronic obstructive pulmonary disease (COPD) has not been explored. OBJECTIVES: 1) To model trajectories using a data-driven approach applied to longitudinal data spanning adulthood in the Normative Aging Study (NAS), and 2) to apply these models to demographically similar subjects in the COPDGene (Genetic Epidemiology of COPD) Study with detailed phenotypic characterization including chest CT. METHODS: We modeled lung function trajectories in 1,060 subjects in NAS with a median follow-up time of 29 years. We assigned 3,546 non-Hispanic white males in COPDGene to these trajectories for further analysis. We assessed phenotypic and genetic differences between trajectories and across age strata. MEASUREMENTS AND MAIN RESULTS: We identified four trajectories in NAS with differing levels of maximum lung function and rate of decline. In COPDGene, 617 subjects (17%) were assigned to the lowest trajectory and had the greatest radiologic burden of disease (P < 0.01); 1,283 subjects (36%) were assigned to a low trajectory with evidence of airway disease preceding emphysema on CT; 1,411 subjects (40%) and 237 subjects (7%) were assigned to the remaining two trajectories and tended to have preserved lung function and negligible emphysema. The genetic contribution to these trajectories was as high as 83% (P = 0.02), and membership in lower lung function trajectories was associated with greater parental histories of COPD, decreased exercise capacity, greater dyspnea, and more frequent COPD exacerbations. CONCLUSIONS: Data-driven analysis identifies four lung function trajectories. Trajectory membership has a genetic basis and is associated with distinct lung structural abnormalities.
Subject(s)
Lung/physiopathology, Chronic Obstructive Pulmonary Disease/complications, Smoking/adverse effects, Adult, Aged, Aged 80 and over, Case-Control Studies, Disease Progression, Forced Expiratory Volume, Humans, Longitudinal Studies, Male, Middle Aged, Chronic Obstructive Pulmonary Disease/physiopathology, Respiratory Function Tests, Young Adult
ABSTRACT
Chronic obstructive pulmonary disease (COPD) is a syndrome caused by damage to the lungs that results in decreased pulmonary function and reduced structural integrity. Pulmonary function testing (PFT) is used to diagnose and stratify COPD into severity groups, and computed tomography (CT) imaging of the chest is often used to assess structural changes in the lungs. We hypothesized that the combination of PFT and CT phenotypes would provide a more powerful tool for assessing underlying morphologic differences associated with pulmonary function in COPD than does PFT alone. We used factor analysis of 26 variables to classify 8,157 participants recruited into the COPDGene cohort between January 2008 and June 2011 from 21 clinical centers across the United States. These factors were used as predictors of all-cause mortality using Cox proportional hazards modeling. Five factors explained 80% of the covariance and represented the following domains: factor 1, increased emphysema and decreased pulmonary function; factor 2, airway disease and decreased pulmonary function; factor 3, gas trapping; factor 4, CT variability; and factor 5, hyperinflation. After more than 46,079 person-years of follow-up, factors 1 through 4 were associated with mortality and there was a significant synergistic interaction between factors 1 and 2 on death. Considering CT measures along with PFT in the assessment of COPD can identify patients at particularly high risk for death.
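An illustrative sketch, on simulated data, of the two analysis steps described above: factor analysis of a block of PFT/CT variables, followed by a Cox proportional hazards model on the factor scores (using lifelines). The variables, number of factors, and survival data are placeholders.

```python
# Sketch: latent factors from PFT/CT variables, then Cox regression on factor scores.
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n = 800
pft_ct = rng.normal(size=(n, 26))                  # 26 hypothetical PFT/CT variables

factors = FactorAnalysis(n_components=5, random_state=0).fit_transform(pft_ct)

df = pd.DataFrame(factors, columns=[f"factor{i+1}" for i in range(5)])
df["time"] = rng.exponential(scale=10, size=n)     # toy follow-up time (years)
df["death"] = rng.integers(0, 2, size=n)           # toy event indicator

cph = CoxPHFitter().fit(df, duration_col="time", event_col="death")
print(cph.summary[["coef", "exp(coef)", "p"]])
```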
Subject(s)
Chronic Obstructive Pulmonary Disease/genetics, Chronic Obstructive Pulmonary Disease/mortality, Respiratory Function Tests/statistics & numerical data, Risk Assessment/methods, X-Ray Computed Tomography/statistics & numerical data, Adult, Aged, Aged 80 and over, Causes of Death, Factor Analysis, Female, Humans, Lung/diagnostic imaging, Lung/physiopathology, Male, Middle Aged, Phenotype, Predictive Value of Tests, Proportional Hazards Models, Risk Factors
ABSTRACT
BACKGROUND: COPD is a heterogeneous disease, but there is little consensus on specific definitions for COPD subtypes. Unsupervised clustering offers the promise of 'unbiased' data-driven assessment of COPD heterogeneity. Multiple groups have identified COPD subtypes using cluster analysis, but there has been no systematic assessment of the reproducibility of these subtypes. OBJECTIVE: We performed clustering analyses across 10 cohorts in North America and Europe in order to assess the reproducibility of (1) correlation patterns of key COPD-related clinical characteristics and (2) clustering results. METHODS: We studied 17 146 individuals with COPD using identical methods and common COPD-related characteristics across cohorts (FEV1, FEV1/FVC, FVC, body mass index, Modified Medical Research Council score, asthma and cardiovascular comorbid disease). Correlation patterns between these clinical characteristics were assessed by principal components analysis (PCA). Cluster analysis was performed using k-medoids and hierarchical clustering, and concordance of clustering solutions was quantified with normalised mutual information (NMI), a metric that ranges from 0 to 1 with higher values indicating greater concordance. RESULTS: The reproducibility of COPD clustering subtypes across studies was modest (median NMI range 0.17-0.43). For methods that excluded individuals that did not clearly belong to any cluster, agreement was better but still suboptimal (median NMI range 0.32-0.60). Continuous representations of COPD clinical characteristics derived from PCA were much more consistent across studies. CONCLUSIONS: Identical clustering analyses across multiple COPD cohorts showed modest reproducibility. COPD heterogeneity is better characterised by continuous disease traits coexisting in varying degrees within the same individual, rather than by mutually exclusive COPD subtypes.
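A minimal sketch of the concordance calculation: cluster the same subjects with two methods and score agreement with normalised mutual information. k-medoids is replaced here by k-means purely because it ships with scikit-learn; the feature list is a toy stand-in.

```python
# Agreement between two clustering solutions, quantified with NMI (0 to 1).
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 7))   # toy stand-ins for FEV1, FEV1/FVC, FVC, BMI, mMRC, asthma, CVD

labels_a = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
labels_b = AgglomerativeClustering(n_clusters=4).fit_predict(X)

print("NMI between solutions: %.2f" % normalized_mutual_info_score(labels_a, labels_b))
```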
Subject(s)
Cluster Analysis, Forced Expiratory Volume, Chronic Obstructive Pulmonary Disease/classification, Chronic Obstructive Pulmonary Disease/physiopathology, Body Mass Index, Europe/epidemiology, Humans, Phenotype, Chronic Obstructive Pulmonary Disease/epidemiology, Reproducibility of Results, United States/epidemiology
ABSTRACT
One of the most common smoking-related diseases, chronic obstructive pulmonary disease (COPD), results from a dysregulated, multi-tissue inflammatory response to cigarette smoke. We hypothesized that systemic inflammatory signals in genome-wide blood gene expression can identify clinically important COPD-related disease subtypes, and we leveraged pre-existing gene interaction networks to guide unsupervised clustering of blood microarray expression data. Using network-informed non-negative matrix factorization, we analyzed genome-wide blood gene expression from 229 former smokers in the ECLIPSE Study, and we identified novel, clinically relevant molecular subtypes of COPD. These network-informed clusters were more stable and more strongly associated with measures of lung structure and function than clusters derived from a network-naïve approach, and they were associated with subtype-specific enrichment for inflammatory and protein catabolic pathways. These clusters were successfully reproduced in an independent sample of 135 smokers from the COPDGene Study.
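A generic graph-regularised NMF sketch in the spirit of the network-informed factorization (multiplicative updates in the style of Cai et al.); the paper's exact formulation, gene network, and initialization may differ.

```python
# Graph-regularised NMF: X (genes x samples) ~ U @ V.T, with a gene-gene adjacency
# matrix W penalising gene factors that differ across interacting genes.
import numpy as np

def graph_nmf(X, W, k=3, lam=1.0, n_iter=200, eps=1e-9):
    rng = np.random.default_rng(0)
    m, n = X.shape
    D = np.diag(W.sum(axis=1))                  # degree matrix of the gene network
    U, V = rng.random((m, k)), rng.random((n, k))
    for _ in range(n_iter):
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
        U *= (X @ V + lam * (W @ U)) / (U @ (V.T @ V) + lam * (D @ U) + eps)
    return U, V

rng = np.random.default_rng(5)
X = np.abs(rng.normal(size=(100, 40)))          # toy blood expression matrix
W = (rng.random((100, 100)) > 0.95).astype(float)
W = np.triu(W, 1); W = W + W.T                  # symmetric toy gene network
U, V = graph_nmf(X, W)
clusters = V.argmax(axis=1)                     # subject cluster assignments
print(np.bincount(clusters))
```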
Subject(s)
Computational Biology/methods, Gene Expression, Gene Regulatory Networks, Chronic Obstructive Pulmonary Disease/genetics, Smoking/genetics, Aged, Aged 80 and over, Cluster Analysis, Female, Gene Expression Profiling/methods, Genetic Predisposition to Disease, Genome-Wide Association Study, Humans, Male, Middle Aged, Chronic Obstructive Pulmonary Disease/blood, Smoking/blood
ABSTRACT
BACKGROUND: There is notable heterogeneity in the clinical presentation of patients with COPD. To characterise this heterogeneity, we sought to identify subgroups of smokers by applying cluster analysis to data from the COPDGene study. METHODS: We applied a clustering method, k-means, to data from 10 192 smokers in the COPDGene study. After splitting the sample into a training and validation set, we evaluated three sets of input features across a range of k (user-specified number of clusters). Stable solutions were tested for association with four COPD-related measures and five genetic variants previously associated with COPD at genome-wide significance. The results were confirmed in the validation set. FINDINGS: We identified four clusters that can be characterised as (1) relatively resistant smokers (ie, no/mild obstruction and minimal emphysema despite heavy smoking), (2) mild upper zone emphysema-predominant, (3) airway disease-predominant and (4) severe emphysema. All clusters are strongly associated with COPD-related clinical characteristics, including exacerbations and dyspnoea (p<0.001). We found strong genetic associations between the mild upper zone emphysema group and rs1980057 near HHIP, and between the severe emphysema group and rs8034191 in the chromosome 15q region (p<0.001). All significant associations were replicated at p<0.05 in the validation sample (12/12 associations with clinical measures and 2/2 genetic associations). INTERPRETATION: Cluster analysis identifies four subgroups of smokers that show robust associations with clinical characteristics of COPD and known COPD-associated genetic variants.
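A brief sketch of the train/validate scheme on toy data: fit k-means on a training split and assign validation subjects to the learned clusters before testing associations; the feature set and number of clusters are placeholders.

```python
# Fit clusters on a training split, then assign held-out subjects to them.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(2000, 5))                      # toy CT/spirometry features
X_train, X_valid = train_test_split(X, test_size=0.5, random_state=0)

km = KMeans(n_clusters=4, n_init=20, random_state=0).fit(X_train)
valid_clusters = km.predict(X_valid)                # replication-sample assignment
print(np.bincount(valid_clusters))
```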
Subject(s)
Genetic Predisposition to Disease, Genome-Wide Association Study/methods, Chronic Obstructive Pulmonary Disease/genetics, Pulmonary Emphysema/genetics, Smoking/adverse effects, Cluster Analysis, Female, Follow-Up Studies, Humans, Male, Middle Aged, Phenotype, Chronic Obstructive Pulmonary Disease/diagnosis, Chronic Obstructive Pulmonary Disease/physiopathology, Pulmonary Emphysema/diagnosis, Pulmonary Emphysema/physiopathology, Retrospective Studies, Severity of Illness Index, Spirometry, X-Ray Computed Tomography
ABSTRACT
Monte Carlo simulations of physics processes at particle colliders like the Large Hadron Collider at CERN take up a major fraction of the computational budget. For some simulations, a single data point takes seconds, minutes, or even hours to compute from first principles. Since the necessary number of data points per simulation is on the order of 10⁹-10¹², machine learning regressors can be used in place of physics simulators to significantly reduce this computational burden. However, this task requires high-precision regressors that can deliver data with relative errors of less than 1% or even 0.1% over the entire domain of the function. In this paper, we develop optimal training strategies and tune various machine learning regressors to satisfy the high-precision requirement. We leverage symmetry arguments from particle physics to optimize the performance of the regressors. Inspired by ResNets, we design a Deep Neural Network with skip connections that outperforms fully connected Deep Neural Networks. We find that at lower dimensions, boosted decision trees far outperform neural networks, while at higher dimensions neural networks perform significantly better. We show that these regressors can speed up simulations by a factor of 10³-10⁶ over the first-principles computations currently used in Monte Carlo simulations. Additionally, using symmetry arguments derived from particle physics, we reduce the number of regressors necessary for each simulation by an order of magnitude. Our work can significantly reduce the training and storage burden of Monte Carlo simulations at current and future collider experiments.
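A sketch of a residual fully connected regressor of the kind described above; the layer widths, depth, and training loop are placeholders rather than the tuned models from the paper.

```python
# Residual MLP regressor with skip connections, trained on a smooth toy target.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, width: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(width, width), nn.ReLU(),
                                 nn.Linear(width, width))

    def forward(self, x):
        return torch.relu(x + self.net(x))          # skip connection

class SkipRegressor(nn.Module):
    def __init__(self, n_inputs: int, width: int = 128, depth: int = 4):
        super().__init__()
        self.inp = nn.Linear(n_inputs, width)
        self.blocks = nn.Sequential(*[ResidualBlock(width) for _ in range(depth)])
        self.out = nn.Linear(width, 1)

    def forward(self, x):
        return self.out(self.blocks(torch.relu(self.inp(x)))).squeeze(-1)

# Toy fit; real use would train on simulator outputs over the physics domain.
model = SkipRegressor(n_inputs=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(4096, 3)
y = torch.sin(x.sum(dim=1))
for _ in range(200):
    opt.zero_grad()
    loss = torch.mean((model(x) - y) ** 2)
    loss.backward()
    opt.step()
print(f"relative RMSE: {loss.sqrt().item() / y.abs().mean().item():.3e}")
```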
ABSTRACT
Rationale: Genetic variants and gene expression predict risk of chronic obstructive pulmonary disease (COPD), but their effect on COPD heterogeneity is unclear. Objectives: Define high-risk COPD subtypes using both genetics (polygenic risk score, PRS) and blood gene expression (transcriptional risk score, TRS) and assess differences in clinical and molecular characteristics. Methods: We defined high-risk groups based on PRS and TRS quantiles by maximizing differences in protein biomarkers in a COPDGene training set and identified these groups in COPDGene and ECLIPSE test sets. We tested multivariable associations of subgroups with clinical outcomes and compared protein-protein interaction networks and drug repurposing analyses between high-risk groups. Measurements and Main Results: We examined two high-risk omics-defined groups in non-overlapping test sets (n=1,133 non-Hispanic white (NHW) COPDGene, n=299 African American (AA) COPDGene, n=468 ECLIPSE). We defined "high activity" (low PRS/high TRS) and "severe risk" (high PRS/high TRS) subgroups. Participants in both subgroups had lower body mass index (BMI), lower lung function, and alterations in metabolic, growth, and immune signaling processes compared to a low-risk (low PRS, low TRS) reference subgroup. "High activity" but not "severe risk" participants had greater prospective FEV1 decline (COPDGene: -51 mL/year; ECLIPSE: -40 mL/year), and their proteomic profiles were enriched in gene sets perturbed by treatment with 5-lipoxygenase inhibitors and angiotensin-converting enzyme (ACE) inhibitors. Conclusions: Concomitant use of polygenic and transcriptional risk scores identified clinical and molecular heterogeneity among high-risk individuals. Proteomic and drug repurposing analyses identified subtype-specific enrichment for therapies and suggest that prior drug repurposing failures may be explained by patient selection.
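A toy sketch of the subgroup definition: quantile thresholds on PRS and TRS partition subjects into the named groups. The 80th-percentile cut-points here are arbitrary, not the thresholds optimised in the training set.

```python
# Quantile-based definition of the PRS/TRS risk subgroups on toy data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({"prs": rng.normal(size=1000), "trs": rng.normal(size=1000)})

prs_hi = df["prs"] >= df["prs"].quantile(0.8)
trs_hi = df["trs"] >= df["trs"].quantile(0.8)

df["subgroup"] = "other"
df.loc[~prs_hi & ~trs_hi, "subgroup"] = "low risk (reference)"
df.loc[~prs_hi & trs_hi, "subgroup"] = "high activity (low PRS / high TRS)"
df.loc[prs_hi & trs_hi, "subgroup"] = "severe risk (high PRS / high TRS)"
print(df["subgroup"].value_counts())
```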
ABSTRACT
Myelodysplastic syndromes have increased in frequency and incidence in the American population, but patient prognosis has not significantly improved over the last decade. Such improvements could be realized if biomarkers for accurate diagnosis and prognostic stratification were successfully identified. In this study, we propose a method that associates two state-of-the-art array technologies, single nucleotide polymorphism (SNP) array and gene expression array, with gene motifs considered transcription factor-binding sites (TFBS). We are particularly interested in SNP-containing motifs introduced by genetic variation and mutation as TFBS. The potential regulatory effect of SNP-containing motifs arises only when certain mutations occur. These motifs can be identified from a group of co-expressed genes with copy number variation. We then used a sliding window to identify motif candidates near SNPs on gene sequences. The candidates were filtered by coarse thresholding and fine statistical testing. Using the regression-based LARS-EN algorithm and a level-wise sequence combination procedure, we identified 28 SNP-containing motifs as candidate TFBS. We confirmed 21 of the 28 motifs with ChIP-chip fragments in the TRANSFAC database. Another six motifs were validated by TRANSFAC via searching binding fragments on co-regulated genes. The identified motifs and the genes in which they are located can be considered potential biomarkers for myelodysplastic syndromes. Thus, our proposed method, a novel strategy for associating two data categories, is capable of integrating information from different sources to identify reliable candidate regulatory SNP-containing motifs introduced by genetic variation and mutation.
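A toy sketch of the sliding-window step only: enumerate candidate fixed-length motifs overlapping a SNP position on a gene sequence. The sequence, SNP position, and window length are hypothetical.

```python
# Enumerate all k-mer windows on a toy sequence that overlap a hypothetical SNP.
seq = "ATGCGTACGTTAGCCATGAC"
snp_pos, window = 9, 6   # SNP at index 9, 6-mer candidate motifs

candidates = {seq[start:start + window]
              for start in range(max(0, snp_pos - window + 1), snp_pos + 1)
              if start + window <= len(seq)}
print(sorted(candidates))
```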
Subject(s)
Gene Expression Profiling, Regulatory Genes, Myelodysplastic Syndromes/genetics, Single Nucleotide Polymorphism/genetics, Transcription Factors/genetics, Algorithms, Binding Sites, DNA Copy Number Variations, Genetic Databases, Genotype, Humans, Oligonucleotide Array Sequence Analysis/methods
ABSTRACT
K-means is a fundamental clustering algorithm widely used in both academic and industrial applications. Its popularity can be attributed to its simplicity and efficiency. Studies show the equivalence of K-means to principal component analysis, non-negative matrix factorization, and spectral clustering. However, these studies focus on standard K-means with squared Euclidean distance. In this review paper, we unify the available approaches in generalizing K-means to solve challenging and complex problems. We show that these generalizations can be seen from four aspects: data representation, distance measure, label assignment, and centroid updating. As concrete applications of transforming problems into modified K-means formulation, we review the following applications: iterative subspace projection and clustering, consensus clustering, constrained clustering, domain adaptation, and outlier detection.
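For reference, a bare-bones K-means written so that the two alternating steps most often generalized, label assignment and centroid updating, are explicit.

```python
# Plain K-means with squared Euclidean distance, showing the two alternating steps.
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Label assignment: nearest centroid under squared Euclidean distance.
        labels = ((X[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
        # Centroid updating: mean of the points assigned to each cluster.
        centroids = np.array([X[labels == j].mean(0) if (labels == j).any()
                              else centroids[j] for j in range(k)])
    return labels, centroids

X = np.random.default_rng(1).normal(size=(300, 2))
labels, _ = kmeans(X, 3)
print(np.bincount(labels))
```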
ABSTRACT
The El Niño Southern Oscillation (ENSO) is a semi-periodic fluctuation in sea surface temperature (SST) over the tropical central and eastern Pacific Ocean that influences interannual variability in regional hydrology across the world through long-range dependence or teleconnections. Recent research has demonstrated the value of Deep Learning (DL) methods for improving ENSO prediction as well as Complex Networks (CN) for understanding teleconnections. However, gaps in predictive understanding of ENSO-driven river flows include the black box nature of DL, the use of simple ENSO indices to describe a complex phenomenon and translating DL-based ENSO predictions to river flow predictions. Here we show that eXplainable DL (XDL) methods, based on saliency maps, can extract interpretable predictive information contained in global SST and discover SST information regions and dependence structures relevant for river flows which, in tandem with climate network constructions, enable improved predictive understanding. Our results reveal additional information content in global SST beyond ENSO indices, develop understanding of how SSTs influence river flows, and generate improved river flow prediction, including uncertainty estimation. Observations, reanalysis data, and earth system model simulations are used to demonstrate the value of the XDL-CN based methods for future interannual and decadal scale climate projections.
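A minimal gradient-saliency sketch: the derivative of a river-flow prediction with respect to each SST grid cell highlights influential ocean regions. The small convolutional network and random SST field are placeholders, not the study's model or data.

```python
# Saliency map = |d prediction / d SST| for every grid cell of the input field.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))

sst = torch.randn(1, 1, 64, 128, requires_grad=True)  # toy global SST grid
river_flow_pred = model(sst).sum()
river_flow_pred.backward()

saliency = sst.grad.abs().squeeze()      # sensitivity of the prediction to each cell
print("most influential grid cell:", divmod(saliency.argmax().item(), 128))
```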
Subject(s)
Deep Learning, El Niño Southern Oscillation, Rivers, Temperature, Pacific Ocean
ABSTRACT
Purpose: Deep learning has demonstrated excellent performance enhancing noisy or degraded biomedical images. However, many of these models require access to a noise-free version of the images to provide supervision during training, which limits their utility. Here, we develop an algorithm (noise2Nyquist) that leverages the fact that Nyquist sampling provides guarantees about the maximum difference between adjacent slices in a volumetric image, which allows denoising to be performed without access to clean images. We aim to show that our method is more broadly applicable and more effective than other self-supervised denoising algorithms on real biomedical images, and provides comparable performance to algorithms that need clean images during training. Approach: We first provide a theoretical analysis of noise2Nyquist and an upper bound for denoising error based on sampling rate. We go on to demonstrate its effectiveness in denoising in a simulated example as well as real fluorescence confocal microscopy, computed tomography, and optical coherence tomography images. Results: We find that our method has better denoising performance than existing self-supervised methods and is applicable to datasets where clean versions are not available. Our method resulted in peak signal-to-noise ratio (PSNR) within 1 dB and structural similarity (SSIM) index within 0.02 of supervised methods. On medical images, it outperforms existing self-supervised methods by an average of 3 dB in PSNR and 0.1 in SSIM. Conclusion: noise2Nyquist can be used to denoise any volumetric dataset sampled at at least the Nyquist rate, making it useful for a wide variety of existing datasets.
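A sketch of the core training idea, under the assumption that adjacent slices of a finely sampled volume can serve as each other's targets so that no clean images are needed; the tiny CNN and synthetic volume are placeholders, not the published noise2Nyquist architecture.

```python
# Train a denoiser to predict the next slice from the current one: the noise is
# independent across slices, so the network learns the shared signal instead.
import torch
import torch.nn as nn

denoiser = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 1, 3, padding=1))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

# Synthetic noisy volume sampled finely along z (adjacent slices are similar).
z = torch.linspace(0, 1, 64).view(-1, 1, 1, 1)
clean = torch.sin(8 * z + torch.rand(1, 1, 32, 32))
noisy = clean + 0.3 * torch.randn_like(clean)

for _ in range(100):
    opt.zero_grad()
    loss = torch.mean((denoiser(noisy[:-1]) - noisy[1:]) ** 2)  # adjacent-slice target
    loss.backward()
    opt.step()
print("train loss:", loss.item())
```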
ABSTRACT
Background: Spirometry measures lung function by selecting the best of multiple efforts meeting pre-specified quality control (QC) and reporting two key metrics: forced expiratory volume in 1 second (FEV1) and forced vital capacity (FVC). We hypothesize that discarded submaximal and QC-failing data meaningfully contribute to the prediction of airflow obstruction and all-cause mortality. Methods: We evaluated volume-time spirometry data from the UK Biobank. We identified "best" spirometry efforts as those passing QC with the maximum FVC. "Discarded" efforts were either submaximal or failed QC. To create a combined representation of lung function we implemented a contrastive learning approach, the Spirogram-based Contrastive Learning Framework (Spiro-CLF), which utilized all recorded volume-time curves per participant and applied different transformations (e.g., flow-volume, flow-time). In a held-out 20% testing subset we applied the Spiro-CLF representation of a participant's overall lung function to 1) binary predictions of FEV1/FVC < 0.7 and FEV1 percent predicted (FEV1PP) < 80%, indicative of airflow obstruction, and 2) Cox regression for all-cause mortality. Findings: We included 940,705 volume-time curves from 352,684 UK Biobank participants with 2-3 spirometry efforts per individual (66.7% with 3 efforts) and at least one QC-passing spirometry effort. Of all spirometry efforts, 24.1% failed QC and 37.5% were submaximal. Spiro-CLF prediction of FEV1/FVC < 0.7 utilizing discarded spirometry efforts had an area under the receiver operating characteristic curve (AUROC) of 0.981 (0.863 for FEV1PP prediction). Incorporating discarded spirometry efforts in all-cause mortality prediction was associated with a concordance index (c-index) of 0.654, which exceeded the c-indices from FEV1 (0.590), FVC (0.559), or FEV1/FVC (0.599) from each participant's single best effort. Interpretation: A contrastive learning model using raw spirometry curves can accurately predict lung function using submaximal and QC-failing efforts. This model also has superior prediction of all-cause mortality compared to standard lung function measurements. Funding: MHC is supported by NIH R01HL137927, R01HL135142, HL147148, and HL089856. BDH is supported by NIH K08HL136928, U01 HL089856, and an Alpha-1 Foundation Research Grant. DH is supported by NIH 2T32HL007427-41. EKS is supported by NIH R01 HL152728, R01 HL147148, U01 HL089856, R01 HL133135, P01 HL132825, and P01 HL114501. PJC is supported by NIH R01HL124233 and R01HL147326. SPB is supported by NIH R01HL151421 and UH3HL155806. TY, FH, and CYM are employees of Google LLC.
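A conceptual sketch of the contrastive setup: two transformed views of each effort (volume-time and a flow-like derivative) are embedded by a shared encoder and pulled together by a simplified InfoNCE objective. The encoder, transformations, and temperature are assumptions, not the Spiro-CLF implementation.

```python
# Simplified contrastive objective over two views of each spirometry curve.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Conv1d(1, 16, 7, padding=3), nn.ReLU(),
                        nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 32))

def embed(x):                       # x: (batch, 1, time)
    return F.normalize(encoder(x), dim=1)

batch = torch.cumsum(torch.rand(8, 1, 200), dim=-1)   # toy volume-time curves
view_volume = batch
view_flow = torch.diff(batch, dim=-1)                 # flow-like transformation

z1, z2 = embed(view_volume), embed(view_flow)
logits = z1 @ z2.T / 0.1                               # similarity / temperature
targets = torch.arange(z1.size(0))                     # matched views are positives
loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
loss.backward()
print("contrastive loss:", loss.item())
```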
ABSTRACT
BACKGROUND: RNA interference (RNAi) has become an increasingly important and effective genetic tool to study the function of target genes by suppressing specific genes of interest. This systems approach helps identify signaling pathways and cellular phase types by tracking intensity and/or morphological changes of cells. The traditional RNAi screening scheme, in which one siRNA is designed to knock down one specific mRNA target, requires a large library of siRNAs and turns out to be time-consuming and expensive. RESULTS: In this paper, we propose a conceptual model, called compressed sensing RNAi (csRNAi), which employs unique combinations of small interfering RNAs (siRNAs) to knock down a much larger set of genes. This strategy is based on the fact that one gene can be partially bound by several siRNAs and, conversely, one siRNA can bind to a few genes with distinct binding affinities. This model constructs a many-to-many correspondence between siRNAs and their targets, with far fewer siRNAs than mRNA targets compared with the conventional scheme. Mathematically, this problem involves an underdetermined system of equations (linear or nonlinear), which is ill-posed in general. However, the recently developed compressed sensing (CS) theory can solve this problem. We present a mathematical model to describe the csRNAi system based on both CS theory and biological concerns. To build this model, we first search for nucleotide motifs in a target gene set. We then propose a machine learning-based method to find effective siRNAs using novel features, such as image features and speech features, to describe an siRNA sequence. Numerical simulations show that we can reduce the siRNA library to one third of that required by the conventional scheme. In addition, the features used to describe siRNAs substantially outperform existing ones. CONCLUSIONS: The csRNAi system is very promising for saving both time and cost in large-scale RNAi screening experiments, which may benefit biological research on cellular processes and pathways.
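A toy illustration of the compressed-sensing idea behind csRNAi: with far fewer siRNA pools (measurements) than genes, a sparse vector of gene effects can be recovered by L1-regularised regression. The binding-affinity matrix and sparsity level are invented for illustration.

```python
# Sparse recovery from an underdetermined system via the Lasso.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(8)
n_sirnas, n_genes = 40, 120                 # underdetermined: 40 pools, 120 genes
A = rng.normal(size=(n_sirnas, n_genes))    # toy siRNA-to-gene binding affinities

x_true = np.zeros(n_genes)                  # sparse vector of true gene effects
x_true[rng.choice(n_genes, 5, replace=False)] = rng.normal(size=5)

y = A @ x_true + 0.01 * rng.normal(size=n_sirnas)   # observed phenotype readouts

x_hat = Lasso(alpha=0.01, max_iter=50000).fit(A, y).coef_
print("recovered support:", np.flatnonzero(np.abs(x_hat) > 0.05))
print("true support:     ", np.sort(np.flatnonzero(x_true)))
```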
Subject(s)
Computer Simulation, RNA Interference, Algorithms, Artificial Intelligence, Gene Library, Humans, Nucleotide Motifs, Messenger RNA/metabolism, Small Interfering RNA/genetics, Small Interfering RNA/metabolism
ABSTRACT
Synaptic vesicle dynamics play an important role in the study of neuronal and synaptic activities in neurodegenerative diseases ranging from the highly prevalent Alzheimer's disease to the rare Rett syndrome. A high-throughput assay with a large population of neurons would be useful and efficient for characterizing neuronal activity based on the dynamics of synaptic vesicles, whether for studying mechanisms or for discovering drug candidates for neurodegenerative and neurodevelopmental disorders. However, the massive amounts of image data generated by high-throughput screening require enormous manual processing time and effort, restricting the practical use of such an assay. This paper presents an automated analytic system to process and interpret the huge data sets generated by such assays. Our system enables the automated detection, segmentation, quantification, and measurement of neuron activities based on the synaptic vesicle assay. To overcome challenges such as noisy background, inhomogeneity, and tiny object size, we first employ the Multi-Scale Variance Stabilizing Transform (MSVST) to obtain a denoised and enhanced map of the original image data. We then propose an adaptive thresholding strategy, based on local information, to address the inhomogeneity issue and accurately segment synaptic vesicles. We design algorithms to handle overlap among the tiny objects of interest. Several post-processing criteria are defined to filter false positives. A total of 152 features are extracted for each detected vesicle. A score is defined for each synaptic vesicle image to quantify the neuron activity. We also compare the unsupervised strategy with a supervised method. Our experiments on hippocampal neuron assays showed that the proposed system can automatically detect vesicles and quantify their dynamics for evaluating neuron activities. The availability of such an automated system will open opportunities for investigation of synaptic neuropathology and identification of candidate therapeutics for neurodegeneration.
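A rough sketch of the adaptive-thresholding step only, with scikit-image's local threshold standing in for the paper's locally adaptive segmentation; MSVST denoising, the 152 features, and the scoring are not reproduced.

```python
# Locally adaptive thresholding plus a size filter to pick out small bright spots.
import numpy as np
from skimage.filters import threshold_local
from skimage.measure import label, regionprops

rng = np.random.default_rng(10)
img = rng.poisson(2.0, size=(128, 128)).astype(float)
for r, c in rng.integers(10, 118, size=(30, 2)):    # plant 30 faint toy "vesicles"
    img[r - 1:r + 2, c - 1:c + 2] += 6.0

# Local threshold = neighborhood mean + 2 (offset is subtracted by the function).
mask = img > threshold_local(img, block_size=15, offset=-2.0)
vesicles = [p for p in regionprops(label(mask)) if 3 <= p.area <= 50]
print("detected vesicle candidates:", len(vesicles))
```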
Subject(s)
Diagnostic Imaging/methods, High-Throughput Screening Assays/methods, Computer-Assisted Image Processing/methods, Neurons/physiology, Algorithms, Animals, Brain/physiology, Cultured Cells, Mice, Inbred C57BL Mice, Synaptic Vesicles
ABSTRACT
Lifelong Learning (LL) refers to the ability to continually learn and solve new problems with incrementally available information over time while retaining previous knowledge. Much attention has been given lately to Supervised Lifelong Learning (SLL) with a stream of labelled data. In contrast, we focus on resolving challenges in Unsupervised Lifelong Learning (ULL) with streaming unlabelled data when the data distribution and the unknown class labels evolve over time. A Bayesian framework is natural for incorporating past knowledge and sequentially updating the belief with new data. We develop a fully Bayesian inference framework for ULL with a novel end-to-end Deep Bayesian Unsupervised Lifelong Learning (DBULL) algorithm, which can progressively discover new clusters from unlabelled data without forgetting the past while learning latent representations. To efficiently maintain past knowledge, we develop a novel knowledge preservation mechanism via sufficient statistics of the latent representation of the raw data. To detect potential new clusters on the fly, we develop an automatic cluster discovery and redundancy removal strategy in our inference, inspired by nonparametric Bayesian statistics techniques. We demonstrate the effectiveness of our approach using image and text corpora benchmark datasets in both LL and batch settings.
Subject(s)
Algorithms, Continuing Education, Bayes Theorem
ABSTRACT
One of the major challenges in the realization and implementation of the Tox21 vision is the urgent need to establish a quantitative link between in-vitro assay molecular endpoints and in-vivo regulatory-relevant phenotypic toxicity endpoints. Current toxicomics approaches still mostly rely on large numbers of redundant markers without pre-selection or ranking; therefore, selecting relevant biomarkers with minimal redundancy would reduce the number of markers to be monitored and reduce the cost, time, and complexity of toxicity screening and risk monitoring. Here, we demonstrated that, using a time-series toxicomics in-vitro assay along with machine learning-based feature selection (maximum relevance and minimum redundancy, MRMR) and classification (support vector machine, SVM), an "optimal" number of biomarkers with minimum redundancy can be identified for the prediction of phenotypic toxicity endpoints with good accuracy. We included two case studies for in-vivo carcinogenicity and Ames genotoxicity prediction, using 20 selected chemicals including model genotoxic chemicals and negative controls. The results suggested that, employing the adverse outcome pathway (AOP) concept, molecular endpoints based on a relatively small number of properly selected biomarkers involved in the conserved DNA-damage and repair pathways among eukaryotes were able to predict both Ames genotoxicity endpoints and in-vivo carcinogenicity in rats. A prediction accuracy of 76% with AUC = 0.81 was achieved when predicting in-vivo carcinogenicity with the top-ranked five biomarkers. For Ames genotoxicity prediction, the top-ranked five biomarkers achieved a prediction accuracy of 70% with AUC = 0.75. However, the specific biomarkers identified as the top-ranked five differ between the two phenotypic genotoxicity assays. The top-ranked biomarkers for in-vivo carcinogenicity prediction mainly relate to double-strand break repair and DNA recombination, whereas the selected top-ranked biomarkers for Ames genotoxicity prediction are associated with base- and nucleotide-excision repair. The method developed in this study will help fill the knowledge gap in phenotypic anchoring and predictive toxicology and contribute to progress in the implementation of the Tox21 vision for environmental and health applications.
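A simplified sketch of the MRMR-plus-SVM pipeline: mutual information supplies relevance and absolute correlation serves as a redundancy proxy (a common MRMR approximation, not necessarily the study's implementation), and the selected five features feed an SVM classifier.

```python
# Greedy relevance-minus-redundancy feature selection, then SVM cross-validation.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(9)
X = rng.normal(size=(60, 200))                    # toy biomarker time-series features
y = (X[:, 0] + X[:, 3] - X[:, 7] > 0).astype(int) # toy genotoxicity label

relevance = mutual_info_classif(X, y, random_state=0)
corr = np.abs(np.corrcoef(X, rowvar=False))       # redundancy proxy between features

selected = [int(np.argmax(relevance))]
while len(selected) < 5:                          # keep the top-ranked five biomarkers
    redundancy = corr[:, selected].mean(axis=1)
    score = relevance - redundancy
    score[selected] = -np.inf                     # never re-select a chosen feature
    selected.append(int(np.argmax(score)))

acc = cross_val_score(SVC(kernel="rbf"), X[:, selected], y, cv=5).mean()
print("selected features:", selected, " CV accuracy: %.2f" % acc)
```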