ABSTRACT
Understanding population health disparities is an essential component of equitable precision health efforts. Epidemiology research often relies on definitions of race and ethnicity, but these population labels may not adequately capture disease burdens and environmental factors impacting specific sub-populations. Here, we propose a framework for repurposing data from electronic health records (EHRs) in concert with genomic data to explore the demographic ties that can impact disease burdens. Using data from a diverse biobank in New York City, we identified 17 communities sharing recent genetic ancestry. We observed 1,177 health outcomes that were statistically associated with a specific group and demonstrated significant differences in the segregation of genetic variants contributing to Mendelian diseases. We also demonstrated that fine-scale population structure can impact the prediction of complex disease risk within groups. This work reinforces the utility of linking genomic data to EHRs and provides a framework toward fine-scale monitoring of population health.
Subjects
Ethnicity/genetics; Population Health; Databases, Genetic; Electronic Health Records; Genomics; Humans; Self Report
ABSTRACT
Reproducibility of results obtained using ribonucleic acid (RNA) data across labs remains a major hurdle in cancer research. Often, molecular predictors trained on one dataset cannot be applied to another due to differences in RNA library preparation and quantification, which inhibits the validation of predictors across labs. While current RNA correction algorithms reduce these differences, they require simultaneous access to patient-level data from all datasets, which necessitates the sharing of training data for predictors when sharing predictors. Here, we describe SpinAdapt, an unsupervised RNA correction algorithm that enables the transfer of molecular models without requiring access to patient-level data. It computes data corrections only via aggregate statistics of each dataset, thereby maintaining patient data privacy. Despite an inherent trade-off between privacy and performance, SpinAdapt outperforms current correction methods, like Seurat and ComBat, on publicly available cancer studies, including TCGA and ICGC. Furthermore, SpinAdapt can correct new samples, thereby enabling unbiased evaluation on validation cohorts. We expect this novel correction paradigm to enhance research reproducibility and to preserve patient privacy.
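The idea of correcting one dataset against another using only aggregate statistics can be illustrated with a much simpler stand-in. The sketch below is not the SpinAdapt algorithm: it assumes each lab shares only per-gene means and standard deviations, and it rescales the source data to match the target. The function names `summarize` and `correct` and the toy values are hypothetical.

```python
from statistics import mean, pstdev

def summarize(samples):
    """Per-gene (mean, std) for a list of samples; each sample is a list of expression values."""
    genes = list(zip(*samples))
    return [(mean(g), pstdev(g)) for g in genes]

def correct(samples, source_stats, target_stats):
    """Shift and scale each gene so the source matches the target's mean and std.

    Only the aggregate statistics of the target are needed, never its patient-level rows.
    """
    corrected = []
    for sample in samples:
        row = []
        for x, (ms, ss), (mt, st) in zip(sample, source_stats, target_stats):
            scale = st / ss if ss > 0 else 1.0  # leave constant genes unscaled
            row.append((x - ms) * scale + mt)
        corrected.append(row)
    return corrected

# Toy data: two genes measured in two labs with different scales (illustrative values).
lab_a = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]          # source
lab_b = [[11.0, 100.0], [13.0, 110.0], [15.0, 120.0]]    # target

stats_a = summarize(lab_a)
stats_b = summarize(lab_b)   # the only information lab B would need to share
lab_a_corrected = correct(lab_a, stats_a, stats_b)
```

After correction, the per-gene mean and standard deviation of `lab_a_corrected` match those of `lab_b`, so a predictor trained on lab B sees inputs on a comparable scale. Matching only the first two moments is a deliberate simplification.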
Subjects
Confidentiality; Privacy; Algorithms; Humans; RNA; Reproducibility of Results
ABSTRACT
BACKGROUND: Endocrine-resistant HR+/HER2- breast cancer (BC) and triple-negative BC (TNBC) are of interest for molecularly informed treatment due to their aggressive natures and limited treatment profiles. Patients of African Ancestry (AA) experience higher rates of TNBC and mortality than European Ancestry (EA) patients, despite lower overall BC incidence. Here, we compare the molecular landscapes of AA and EA patients with HR+/HER2- BC and TNBC in a real-world cohort to promote equity in precision oncology by illuminating the heterogeneity of potentially druggable genomic and transcriptomic pathways. METHODS: De-identified records from patients with TNBC or HR+/HER2- BC in the Tempus Database were randomly selected (N = 5000), with most having stage IV disease. Mutations, gene expression, and transcriptional signatures were evaluated from next-generation sequencing data. Genetic ancestry was estimated from DNA-seq. Differences in mutational prevalence, gene expression, and transcriptional signatures between AA and EA were compared. EA patients were used as the reference population for log fold-changes (logFC) in expression. RESULTS: After applying inclusion criteria, 3433 samples were evaluated (n = 623 AA and n = 2810 EA). Observed patterns of dysregulated pathways demonstrated significant heterogeneity between the two groups. Notably, PIK3CA mutations were significantly less frequent in AA HR+/HER2- tumors (AA = 34% vs. EA = 42%, P < 0.05) and in the overall cohort (AA = 28% vs. EA = 37%, P = 2.08e-05). Conversely, KMT2C mutation was significantly more frequent in AA than EA TNBC (23% vs. 12%, P < 0.05) and HR+/HER2- (24% vs. 15%, P = 3e-03) tumors. Across all subtypes and stages, over 8000 genes were differentially expressed between the two ancestral groups, including RPL10 (logFC = 2.26, P = 1.70e-162), HSPA1A (logFC = -2.73, P = 2.43e-49), ATRX (logFC = -1.93, P = 5.89e-83), and NUTM2F (logFC = 2.28, P = 3.22e-196). Ten differentially expressed gene sets were identified among stage IV HR+/HER2- tumors, of which four were considered relevant to BC treatment and were significantly enriched in EA: ERBB2_UP.V1_UP (P = 3.95e-06), LTE2_UP.V1_UP (P = 2.90e-05), HALLMARK_FATTY_ACID_METABOLISM (P = 0.0073), and HALLMARK_ANDROGEN_RESPONSE (P = 0.0074). CONCLUSIONS: We observed significant differences in mutational spectra, gene expression, and relevant transcriptional signatures between patients with genetically determined African and European ancestries, particularly within the HR+/HER2- BC and TNBC subtypes. These findings could guide future development of treatment strategies by providing opportunities for biomarker-informed research and, ultimately, clinical decisions for precision oncology care in diverse populations.
Subjects
Breast Neoplasms; Triple Negative Breast Neoplasms; Female; Humans; Black People/genetics; Breast Neoplasms/ethnology; Breast Neoplasms/pathology; Mutation; Precision Medicine; Triple Negative Breast Neoplasms/ethnology; Triple Negative Breast Neoplasms/pathology; White People
ABSTRACT
SUMMARY: Finding informative predictive features in high-dimensional biological case-control datasets is challenging. The Extreme Pseudo-Sampling (EPS) algorithm offers a solution to the challenge of feature selection via a combination of deep learning and linear regression models. First, using a variational autoencoder, it generates complex latent representations for the samples. Second, it classifies the latent representations of cases and controls via logistic regression. Third, it generates new samples (pseudo-samples) around the extreme cases and controls in the regression model. Finally, it trains a new regression model over the upsampled space. The most significant variables in this regression are selected. We present an open-source implementation of the algorithm that is easy to set up, use and customize. Our package enhances the original algorithm by providing new features and customizability for data preparation, model training and classification functionalities. We believe the new features will enable the adoption of the algorithm for a diverse range of datasets. AVAILABILITY AND IMPLEMENTATION: The software package for Python is available online at https://github.com/roohy/eps. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
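The four steps described above can be sketched on toy data. This is a minimal illustration and not the package's implementation: the variational autoencoder is omitted, so the "latent" space is simply the raw feature space; both regressions are plain gradient-descent logistic regressions; and all names (`fit_logistic`, `pseudo_samples`, the noise level, the number of extremes) are hypothetical choices made for the example.

```python
import math
import random

rng = random.Random(0)

def fit_logistic(X, y, lr=0.1, epochs=300):
    """Plain batch gradient-descent logistic regression; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            z = max(-30.0, min(30.0, z))          # guard against overflow in exp
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def decision(w, b, x):
    """Signed distance-like score of a sample under the fitted model."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def make_sample(label):
    """Toy data: feature 0 separates cases (1) from controls (0); features 1-2 are noise."""
    shift = 2.0 if label == 1 else -2.0
    return [rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]

y = [1] * 50 + [0] * 50
X = [make_sample(label) for label in y]

# Step 1 (assumption): no VAE here, the latent representation is the input itself.
latent = X

# Step 2: classify the latent representations of cases and controls.
w, b = fit_logistic(latent, y)

# Step 3: take the most extreme cases and controls and sample around them.
cases = sorted((x for x, lab in zip(latent, y) if lab == 1),
               key=lambda x: decision(w, b, x), reverse=True)[:10]
controls = sorted((x for x, lab in zip(latent, y) if lab == 0),
                  key=lambda x: decision(w, b, x))[:10]

def pseudo_samples(anchors, per_anchor=20, sigma=0.1):
    """Gaussian jitter around each anchor sample."""
    return [[v + rng.gauss(0.0, sigma) for v in a] for a in anchors for _ in range(per_anchor)]

pseudo_cases = pseudo_samples(cases)
pseudo_controls = pseudo_samples(controls)
pseudo_X = pseudo_cases + pseudo_controls
pseudo_y = [1] * len(pseudo_cases) + [0] * len(pseudo_controls)

# Step 4: fit a new regression on the upsampled space and rank features by |coefficient|.
w2, _ = fit_logistic(pseudo_X, pseudo_y)
ranking = sorted(range(len(w2)), key=lambda j: abs(w2[j]), reverse=True)
```

In this toy setting `ranking` lists feature indices from most to least influential. Ranking by absolute coefficient is only meaningful here because all features share a similar scale; with a real VAE, pseudo-samples would be generated in latent space and decoded before the final regression.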
ABSTRACT
The emergence of genomic data in biobanks and health systems offers new ways to derive medically important phenotypes, including acute phenotypes occurring during inpatient clinical care. Here we study the genetic underpinnings of the rapid response to phenylephrine, an α1-adrenergic receptor agonist commonly used to treat hypotension during anesthesia and surgery. We quantified this response by extracting blood pressure (BP) measurements 5 min before and after the administration of phenylephrine. Based on this derived phenotype, we show that systematic differences exist between self-reported ancestry groups: European-Americans (EA; n = 1387) have a significantly higher systolic response to phenylephrine than African-Americans (AA; n = 1217) and Hispanic/Latinos (HA; n = 1713) (31.3% increase, p value < 6e-08 and 22.9% increase, p value < 5e-05, respectively), after adjusting for genetic ancestry, demographics, and relevant clinical covariates. We performed a genome-wide association study to investigate genetic factors underlying individual differences in this derived phenotype. We discovered genome-wide significant association signals in loci and genes previously associated with BP measured in ambulatory settings, and a general enrichment of association in these genes. Finally, we discovered two low-frequency variants, present at ~1% in EAs and AAs, respectively, where patients carrying one copy of these variants show no phenylephrine response. This work demonstrates our ability to derive a quantitative phenotype suited for comparative statistics and genome-wide association studies from dense clinical and physiological measures captured for managing patients during surgery. We identify genetic variants underlying nonresponse to phenylephrine, with implications for preemptive pharmacogenomic screening to improve safety during surgery.
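A derived phenotype of this kind can be sketched from timestamped readings. The abstract states only that BP was extracted 5 min before and after administration; averaging within each window and expressing the response as a percent change from baseline are assumptions made here for illustration, and the function name and data are hypothetical.

```python
from statistics import mean

def phenylephrine_response(readings, dose_time, window=300):
    """Percent change in mean systolic BP from the window before to the window after a dose.

    readings: list of (time_in_seconds, systolic_mmHg) tuples.
    Returns None when either window contains no reading.
    """
    before = [bp for t, bp in readings if dose_time - window <= t < dose_time]
    after = [bp for t, bp in readings if dose_time < t <= dose_time + window]
    if not before or not after:
        return None
    baseline = mean(before)
    return 100.0 * (mean(after) - baseline) / baseline

# Toy record: dose given at t = 300 s; the reading at t = 900 s falls outside the window.
readings = [(0, 90), (120, 88), (240, 92), (360, 110), (480, 112), (900, 95)]
response = phenylephrine_response(readings, dose_time=300)
```

Here the baseline mean is 90 mmHg and the post-dose mean is 111 mmHg, giving a response of about 23.3%. Returning `None` for empty windows keeps patients without usable measurements out of downstream association tests.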
Subjects
Adrenergic Agents/therapeutic use; Phenylephrine/therapeutic use; Black or African American/genetics; Blood Pressure/drug effects; Blood Pressure/genetics; Female; Genome-Wide Association Study/methods; Genomics/methods; Humans; Male; Middle Aged; Perioperative Period/methods; Phenotype; Polymorphism, Single Nucleotide/genetics; White People/genetics
ABSTRACT
An increasing number of bioinformatic tools designed to detect copy number variants (CNVs) in tumor samples from paired exome data, where a matched healthy tissue constitutes the reference, have been published in recent years. The idea of using a pool of unrelated healthy DNA as reference has previously been formulated but not thoroughly validated. As of today, the gold standard for CNV calling is still array comparative genomic hybridization (aCGH), but there is increasing interest in detecting CNVs by exome sequencing. We propose a metric allowing the comparison of two CNV profiles independently of the technique used, and we assess the validity of using a pool of unrelated healthy DNA instead of a matched healthy tissue as reference in exome-based CNV detection. We compared the CNV profiles obtained with three different approaches (aCGH, exome sequencing with a matched healthy tissue as reference, and exome sequencing with a pool of eight unrelated healthy tissues as reference) on three multiple myeloma samples. We show that the usual analyses performed to compare CNV profiles (deletion/amplification ratios and CNV size distribution) lack precision when confronted with low log R ratio (LRR) values, as they only consider the binary status of each CNV. We show that the metric-based distance constitutes a more accurate comparison of two CNV profiles. Based on these analyses, we conclude that a reliable picture of CNV alterations in multiple myeloma samples can be obtained from whole-exome sequencing in the absence of a matched healthy sample.
Subjects
Bone Marrow/metabolism; Computational Biology; DNA Copy Number Variations/genetics; Exome/genetics; Multiple Myeloma/genetics; Algorithms; Case-Control Studies; Humans; Reference Standards
ABSTRACT
BACKGROUND: Over time, the chance of cure after the diagnosis of breast cancer has been increasing, as a consequence of earlier diagnosis, improved diagnostic procedures and more effective treatment options. However, oncologists are concerned by the risk of long-term treatment side effects, including congestive heart failure (CHF). METHODS: In this study, we evaluated innovative circulating cardiac biomarkers during and after anthracycline-based neoadjuvant chemotherapy (NAC) in breast cancer patients. Levels of cardiac-specific troponin T (cTnT), N-terminal pro-B-type natriuretic peptide (NT-proBNP), soluble ST2 (sST2) and 10 circulating microRNAs (miRNAs) were measured. RESULTS: Under chemotherapy, we observed an elevation of cTnT and NT-proBNP levels, as well as the upregulation of sST2 and of 4 CHF-related miRNAs (miR-126-3p, miR-199a-3p, miR-423-5p, miR-34a-5p). The elevations of cTnT, NT-proBNP, sST2 and CHF-related miRNAs were poorly correlated, suggesting that these molecules could provide different information. CONCLUSIONS: Circulating miRNAs and sST2 are potential biomarkers of chemotherapy-related cardiac dysfunction (CRCD). Nevertheless, further studies and long-term follow-up are needed to evaluate whether these new markers may help to predict CRCD and to identify patients at risk of later developing CHF.
Subjects
Anthracyclines/adverse effects; Breast Neoplasms/drug therapy; Heart Failure/blood; Interleukin-1 Receptor-Like 1 Protein/blood; Adult; Aged; Anthracyclines/administration & dosage; Biomarkers, Pharmacological/blood; Biomarkers, Tumor/blood; Breast Neoplasms/blood; Breast Neoplasms/pathology; Female; Heart Failure/chemically induced; Heart Failure/pathology; Humans; Male; MicroRNAs/blood; Middle Aged; Natriuretic Peptide, Brain/blood; Neoplastic Cells, Circulating/metabolism; Peptide Fragments/blood; Troponin T/blood
ABSTRACT
The genomic profile of multiple myeloma (MM) has prognostic value by dividing patients into a good-prognosis hyperdiploid group and a poor-prognosis nonhyperdiploid group with a higher incidence of IGH translocations. This classification, however, is inadequate, and many other parameters such as mutations, epigenetic modifications, and genomic heterogeneity may influence the prognosis. We performed a genomic study by array-based comparative genomic hybridization on a cohort of 162 patients to evaluate the frequency of genomic gains and losses. We identified a high frequency of X chromosome alterations leading to partial Xq duplication, often associated with inactive X (Xi) deletion in female patients. This partial X duplication could be a cytogenetic marker of aneuploidy, as it is correlated with a high number of chromosomal breakages. Patients with a high level of chromosomal breakage had reduced survival regardless of the region implicated. A higher transcriptional level was shown for genes with potential implication in cancer and located in this altered region. Among these genes, IKBKG and IRAK1 are members of the NFKB pathway, which plays an important role in MM and is a target for specific treatments. © 2016 Wiley Periodicals, Inc.
Subjects
Biomarkers, Tumor/genetics; Chromosome Aberrations; Chromosomes, Human, X/genetics; Genomics/methods; Multiple Myeloma/genetics; Adult; Aged; Aged, 80 and over; Comparative Genomic Hybridization; Female; Follow-Up Studies; Humans; In Situ Hybridization, Fluorescence; Male; Middle Aged; Neoplasm Staging; Prognosis; Prospective Studies; Survival Rate
ABSTRACT
BACKGROUND: The BRCA1 gene plays a key role in triple negative breast cancers (TNBCs), in which its expression can be lost by multiple mechanisms: germline mutation followed by deletion of the second allele; negative regulation by promoter methylation; or miRNA-mediated silencing. This study aimed to establish a correlation among the BRCA1-related molecular parameters, tumor characteristics and clinical follow-up of patients to find new prognostic factors. METHODS: BRCA1 protein and mRNA expression was quantified in situ in the TNBCs of 69 patients. BRCA1 promoter methylation status was checked, as well as cytokeratin 5/6 expression. Maintenance of expressed BRCA1 protein interaction with BARD1 was quantified, as a marker of BRCA1 functionality, and the tumor expression profiles of 27 microRNAs were determined. RESULTS: miR-548c-5p emerged as a new independent prognostic factor in TNBC. A combination of the tumoral expression of miR-548c and three other known prognostic parameters (tumor size, lymph node invasion and CK 5/6 expression status) allowed for relapse prediction by logistic regression with an area under the curve (AUC) = 0.96. BRCA1 mRNA and protein in situ expression, as well as the amount of BRCA1 ligated to BARD1 in the tumor, lacked any association with patient outcomes, likely due to high intratumoral heterogeneity, and thus could not be used for clinical purposes. CONCLUSIONS: In situ BRCA1-related expression parameters could not be used for clinical purposes at the time of diagnosis. In contrast, miR-548c-5p showed promising potential as a prognostic factor in TNBC.
Subjects
BRCA1 Protein/biosynthesis; MicroRNAs/biosynthesis; Neoplasm Recurrence, Local/genetics; Triple Negative Breast Neoplasms/genetics; Adult; Aged; Aged, 80 and over; BRCA1 Protein/genetics; DNA Methylation; Female; Gene Expression Regulation, Neoplastic; Humans; Lymph Nodes/pathology; Lymphatic Metastasis; MicroRNAs/genetics; Middle Aged; Mutation; Neoplasm Recurrence, Local/pathology; Prognosis; Promoter Regions, Genetic; RNA, Messenger/biosynthesis; Triple Negative Breast Neoplasms/pathology
ABSTRACT
Whole exome sequencing was undertaken in two siblings with delayed psychomotor development, absent speech, severe intellectual disability and postnatal microcephaly, with brain malformations consisting of cerebellar atrophy in the elder affected sibling and hypoplastic corpus callosum in the younger sister. It revealed a homozygous intragenic deletion in VPS51, which encodes vacuolar protein sorting-associated protein 51, one of the four subunits of the Golgi-associated retrograde protein (GARP) and endosome-associated recycling protein (EARP) complexes, which promote the fusion of endosome-derived vesicles with the trans-Golgi network (GARP) and recycling endosomes (EARP). This observation supports a pathogenic effect of VPS51 variants, which has been reported only once previously, in a single child with microcephaly. It confirms the key role of membrane trafficking in normal brain development and homeostasis.
Subjects
Brain/physiopathology; Microcephaly/genetics; Nervous System Malformations/genetics; Vesicular Transport Proteins/genetics; Brain/diagnostic imaging; Child; Endosomes/genetics; Female; Humans; Male; Microcephaly/diagnostic imaging; Microcephaly/physiopathology; Nervous System Malformations/diagnostic imaging; Nervous System Malformations/physiopathology; Protein Transport/genetics; trans-Golgi Network/genetics
ABSTRACT
Whole transcriptome studies typically yield large amounts of data, with expression values for all genes or transcripts of the genome. The search for genes of interest in a particular study setting can thus be a daunting task, usually relying on automated computational methods. Moreover, most biological questions imply that such a search should be performed in a multivariate setting, to take into account inter-gene relationships. Differential expression analysis commonly yields large lists of genes deemed significant, even after adjustment for multiple testing, making the subsequent study possibilities extensive. Here, we explore the use of supervised learning methods to rank large ensembles of genes defined by their expression values measured with RNA-Seq in a typical two-class sample set. First, we use one of the variable importance measures generated by the random forests classification algorithm as a metric to rank genes. Second, we define the EPS (extreme pseudo-samples) pipeline, making use of VAEs (Variational Autoencoders) and regressors to extract a ranking of genes while leveraging the feature space of both virtual and comparable samples. We show that, on 12 cancer RNA-Seq datasets ranging from 323 to 1,210 samples, using either a random forests-based gene selection method or the EPS pipeline outperforms differential expression analysis for 9 and 8 out of the 12 datasets, respectively, in terms of identifying subsets of genes associated with survival. These results demonstrate the potential of supervised learning-based gene selection methods in RNA-Seq studies and highlight the need to use such multivariate gene selection methods alongside the widely used differential expression analysis.
ABSTRACT
Non-coding RNAs (ncRNAs) represent one fifth of mammalian transcripts, and 90% of the genome length is transcribed. Many ncRNAs play a role in cancer. Among them, non-coding natural antisense transcripts (ncNATs) are RNA sequences that are complementary to and overlap those of either protein-coding transcripts (PCTs) or non-coding transcripts. Several ncNATs have been described as regulating protein-coding gene expression at the same loci, and they are expected to act more frequently in cis than other ncRNAs, which commonly function in trans. In this work, 22 breast cancers expressing estrogen receptors and their paired adjacent non-malignant tissues were analyzed by strand-specific RNA sequencing. To highlight ncNATs potentially playing a role in the protein-coding gene regulation that occurs in breast cancer, three different data analysis methods were used: differential expression analysis of ncNATs between tumor and non-malignant tissues, differential correlation analysis of paired ncNAT/PCT between tumor and non-malignant tissues, and ncNAT/PCT read count ratio variation between tumor and non-malignant tissues. Each of these methods yielded lists of ncNAT/PCT pairs that were enriched in survival-associated genes. This work highlights ncNAT lists that display potential to affect the expression of protein-coding genes involved in breast cancer pathology.
Subjects
Breast Neoplasms/metabolism; RNA, Antisense/metabolism; RNA, Untranslated/metabolism; Transcriptome; Adult; Aged; Aged, 80 and over; Biomarkers, Tumor/genetics; Biomarkers, Tumor/metabolism; Breast Neoplasms/genetics; Breast Neoplasms/mortality; Female; Gene Expression Profiling/methods; Humans; Middle Aged; Receptor, ErbB-2/genetics; Receptor, ErbB-2/metabolism; Receptors, Estrogen/genetics; Receptors, Estrogen/metabolism; Retrospective Studies; Sequence Analysis, RNA; Survival Analysis
ABSTRACT
Circulating microRNAs (miRNAs) are increasingly recognized as powerful biomarkers in several pathologies, including breast cancer. Here, their plasma levels were measured for use as an alternative screening procedure to mammography for breast cancer diagnosis. A plasma miRNA profile was determined by RT-qPCR in a cohort of 378 women. A diagnostic model was designed based on the expression of 8 miRNAs measured first in a profiling cohort composed of 41 primary breast cancers and 45 controls, and further validated in diverse cohorts composed of 108 primary breast cancers, 88 controls, 35 breast cancers in remission, 31 metastatic breast cancers and 30 gynecologic tumors. A receiver operating characteristic curve derived from the 8-miRNA random forest-based diagnostic tool exhibited an area under the curve of 0.81. The accuracy of the diagnostic tool remained unchanged when age and tumor stage were considered. The miRNA signature correctly identified patients with metastatic breast cancer. The use of the classification model on cohorts of patients with breast cancers in remission and with gynecologic cancers yielded prediction distributions similar to that of the control group. Using a multivariate supervised learning method and a set of 8 circulating miRNAs, we designed an accurate, minimally invasive screening tool for breast cancer.
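For readers unfamiliar with the area under the ROC curve reported above, it can be computed directly as the probability that a randomly chosen patient receives a higher classifier score than a randomly chosen control. The sketch below uses made-up labels and scores, not data from the study, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative scores: 1 = cancer, 0 = control.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

In this toy example 8 of the 9 patient/control pairs are ordered correctly, so the AUC is 8/9. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.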