ABSTRACT
MOTIVATION: Domain adaptation allows for the development of predictive models even in cases with limited sample data. Weighted elastic net domain adaptation specifically leverages features of genomic data to maximize transferability, but the method is too computationally demanding to apply to many genome-sized datasets. RESULTS: We developed wenda_gpu, which uses GPyTorch to train models on genomic data within hours on a single GPU-enabled machine. We show that wenda_gpu returns results comparable to those of the original wenda implementation, and that it predicts cancer mutation status on small sample sizes better than regular elastic net. AVAILABILITY AND IMPLEMENTATION: wenda_gpu is available on GitHub at https://github.com/greenelab/wenda_gpu/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Neoplasms, Software, Humans, Genomics/methods, Neoplasms/genetics, Sample Size
ABSTRACT
Single-cell RNA-sequencing (scRNA-seq) has made it possible to profile gene expression in tissues at high resolution. An important preprocessing step prior to performing downstream analyses is to identify and remove cells with poor or degraded sample quality using quality control (QC) metrics. Two widely used QC metrics for identifying a 'low-quality' cell are (i) the proportion of reads that map to mitochondrial DNA (mtDNA)-encoded genes, where a high proportion indicates low quality, and (ii) the number of genes detected, where a small number indicates low quality. Current best practices use these QC metrics independently with either arbitrary, uniform thresholds (e.g. 5%) or biological context-dependent (e.g. species) thresholds, and fail to jointly model these metrics in a data-driven manner. Current practices are often overly stringent and especially untenable on certain types of tissues, such as archived tumor tissues, or tissues associated with mitochondrial function, such as kidney tissue [1]. We propose a data-driven QC metric (miQC) that jointly models both the proportion of reads mapping to mtDNA genes and the number of detected genes with mixture models in a probabilistic framework to predict the low-quality cells in a given dataset. We demonstrate how our QC metric easily adapts to different types of single-cell datasets to remove low-quality cells while preserving high-quality cells that can be used for downstream analyses. Our software package is available at https://bioconductor.org/packages/miQC.
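The idea of modeling both QC metrics jointly can be illustrated with a minimal sketch. It is a simplification and not the miQC implementation: it assumes a two-component mixture of independent Gaussians fit by expectation-maximization, synthetic cells, and an illustrative posterior cutoff of 0.75. The function name `fit_qc_mixture` and all numeric settings are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_qc_mixture(cells, n_iter=100):
    """Fit a two-component mixture by EM (hypothetical helper, not the miQC API).

    cells: list of (mito_fraction, log10_genes_detected) tuples.
    Returns, for each cell, the posterior probability of belonging to the
    high-mitochondrial ("compromised") component.
    """
    mito = [c[0] for c in cells]
    # Initialise by splitting cells at the midpoint of the observed mito range.
    threshold = (min(mito) + max(mito)) / 2.0
    post = [1.0 if m > threshold else 0.0 for m in mito]
    for _ in range(n_iter):
        # M-step: weighted mixing proportion, means and variances per component.
        comps = []
        for k in (0, 1):
            w = [p if k == 1 else 1.0 - p for p in post]
            total = sum(w) + 1e-12
            means = [sum(wi * c[d] for wi, c in zip(w, cells)) / total for d in (0, 1)]
            varis = [
                max(sum(wi * (c[d] - means[d]) ** 2 for wi, c in zip(w, cells)) / total, 1e-6)
                for d in (0, 1)
            ]
            comps.append((total / len(cells), means, varis))
        # E-step: posterior probability of the compromised component.
        post = []
        for c in cells:
            dens = [
                wt * normal_pdf(c[0], mu[0], var[0]) * normal_pdf(c[1], mu[1], var[1])
                for wt, mu, var in comps
            ]
            post.append(dens[1] / (dens[0] + dens[1] + 1e-300))
    return post


# Synthetic data (assumed values, for illustration only).
random.seed(0)
intact = [(max(random.gauss(0.05, 0.02), 0.0), random.gauss(3.5, 0.15)) for _ in range(200)]
compromised = [(max(random.gauss(0.40, 0.08), 0.0), random.gauss(2.8, 0.20)) for _ in range(50)]
all_cells = intact + compromised

posterior = fit_qc_mixture(all_cells)
# Keep cells whose posterior of being compromised is at most the assumed cutoff.
kept = [cell for cell, p in zip(all_cells, posterior) if p <= 0.75]
print(f"kept {len(kept)} of {len(all_cells)} cells")
```

Because the cutoff is applied to a posterior probability rather than to a fixed mitochondrial percentage, the effective threshold moves with each dataset, which is the behavior the abstract describes.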
Subjects
High-Throughput Nucleotide Sequencing/methods, Probability, Quality Control, RNA Sequence Analysis/methods, Single-Cell Analysis/methods, Mitochondrial DNA/genetics, Humans
ABSTRACT
Although many studies have documented codon usage bias in different species, the importance of codon usage in a phylogenetic framework remains largely unknown. We demonstrate that a phylogenetic signal is present in the codon usage and non-usage biases of 17 717 orthologues evaluated across 72 tetrapod species using a simple parsimony analysis of a binary matrix of codon characters. Phylogenies estimated using stop codons were more congruent with previous hypotheses than phylogenies based on any other single codon or a combination of codons. Although each codon is present in every species, specific genes have different codon preferences and may or may not use every possible codon. This observation allowed us to map the pattern of codon usage and non-usage across the topology. These results suggest that codon usage is phylogenetically conserved across shallow and deep levels within tetrapods.
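A binary matrix of codon characters can be sketched as follows. The coding scheme here is an assumption made for illustration (one character per gene and codon, scored 1 if the species uses that codon in that gene, 0 if not, and "?" if the orthologue is missing); the study's exact encoding is not given in the abstract. The function names and toy sequences are hypothetical.

```python
from itertools import product

# All 64 codons in a fixed order.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codons_used(cds):
    """Return the set of in-frame codons present in a coding sequence."""
    usable = len(cds) - len(cds) % 3
    return {cds[i:i + 3] for i in range(0, usable, 3)}


def codon_character_matrix(orthologues):
    """Build a presence/absence matrix of codon usage.

    orthologues: {gene: {species: coding_sequence}}
    Returns (species, labels, matrix) where labels[i] is the (gene, codon)
    pair for column i and matrix[species] is a string of '1', '0' or '?'.
    """
    genes = sorted(orthologues)
    species = sorted({sp for seqs in orthologues.values() for sp in seqs})
    labels = [(gene, codon) for gene in genes for codon in CODONS]
    matrix = {}
    for sp in species:
        row = []
        for gene in genes:
            cds = orthologues[gene].get(sp)
            if cds is None:
                row.extend("?" * len(CODONS))  # orthologue not available
                continue
            used = codons_used(cds.upper())
            row.extend("1" if codon in used else "0" for codon in CODONS)
        matrix[sp] = "".join(row)
    return species, labels, matrix


# Toy input (made-up sequences, for illustration only).
toy = {
    "g1": {"human": "ATGGCCTAA", "mouse": "ATGGCTTGA"},
    "g2": {"human": "ATGAAATAG"},
}
species, labels, matrix = codon_character_matrix(toy)
taa = labels.index(("g1", "TAA"))
print(matrix["human"][taa], matrix["mouse"][taa])
```

A matrix in this form can be exported to a parsimony program as a standard binary character matrix, with each species as a taxon.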
ABSTRACT
BACKGROUND: Alzheimer's disease (AD) is the leading cause of dementia in the elderly and the third most common cause of death in the United States. A vast number of genes regulate Alzheimer's disease, including Presenilin 1 (PSEN1). Multiple studies have attempted to locate novel variants in the PSEN1 gene that affect Alzheimer's disease status. A recent study suggested that one of these variants, PSEN1 E318G (rs17125721), significantly affects Alzheimer's disease status in a large case-control dataset, particularly in connection with the APOEε4 allele. METHODS: Our study looks at the same variant in the Cache County Study on Memory and Aging, a large population-based dataset. We tested for association between E318G genotype and Alzheimer's disease status by running a series of Fisher's exact tests. We also performed logistic regression to test for an additive effect of E318G genotype on Alzheimer's disease status and for the existence of an interaction between E318G and APOEε4. RESULTS: In our Fisher's exact test, it appeared that APOEε4 carriers with an E318G allele have slightly higher risk for AD than those without the allele (3.3 vs. 3.8); however, the 95% confidence intervals of those estimates overlapped completely, indicating non-significance. Our logistic regression model found a positive but non-significant main effect for E318G (p = 0.895). The interaction term between E318G and APOEε4 was also non-significant (p = 0.689). CONCLUSIONS: Our findings do not provide significant support for E318G as a risk factor for AD in APOEε4 carriers. Our calculations indicated that the overall sample used in the logistic regression models was adequately powered to detect the sort of effect sizes observed previously. However, the power analyses of our Fisher's exact tests indicate that our partitioned data were underpowered, particularly with regard to the low number of E318G carriers, both AD cases and controls, in the Cache County dataset. Thus, the differences in types of datasets used may help to explain the difference in effect magnitudes seen. Analyses in additional case-control datasets will be required to understand fully the effect of E318G on Alzheimer's disease status.
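For readers unfamiliar with the test used above, a two-sided Fisher's exact test on a 2x2 table can be sketched with the standard library alone. The counts in the example are hypothetical and are not taken from the Cache County data; the function names are illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def odds_ratio(a, b, c, d):
    """Sample odds ratio of the 2x2 table (undefined if b or c is zero)."""
    return (a * d) / (b * c)


# Hypothetical table: rows = variant carrier / non-carrier,
# columns = cases / controls. These counts are made up.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_value, 4), odds_ratio(3, 1, 1, 3))
```

With very few carriers in a stratum, as the conclusions note, only a handful of tables are possible under the fixed margins, so the attainable p-values are coarse and power is low.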
Subjects
Alzheimer Disease/genetics, Genetic Predisposition to Disease/genetics, Missense Mutation, Single Nucleotide Polymorphism, Presenilin-1/genetics, Aged, Aged 80 and Over, Alleles, Apolipoprotein E4/genetics, Case-Control Studies, Genetic Epistasis, Gene Frequency, Genotype, Humans, Logistic Models, Odds Ratio, Risk Factors, Utah
ABSTRACT
BACKGROUND: High-grade serous carcinoma (HGSC) gene expression subtypes are associated with differential survival. We characterized HGSC gene expression in Black individuals and considered whether gene expression differences by self-identified race may contribute to poorer HGSC survival among Black versus White individuals. METHODS: We included newly generated RNA sequencing data from Black and White individuals and array-based genotyping data from four existing studies of White and Japanese individuals. We used K-means clustering, a method with no predefined number of clusters or dataset-specific features, to assign subtypes. Cluster- and dataset-specific gene expression patterns were summarized by moderated t-scores. We compared cluster-specific gene expression patterns across datasets by calculating the correlation between the summarized vectors of moderated t-scores. After mapping to The Cancer Genome Atlas-derived HGSC subtypes, we used Cox proportional hazards models to estimate subtype-specific survival by dataset. RESULTS: Cluster-specific gene expression was similar across gene expression platforms and racial groups. Comparing the Black population with the White and Japanese populations, the immunoreactive subtype was more common (39% vs. 23%-28%) and the differentiated subtype was less common (7% vs. 22%-31%). Patterns of subtype-specific survival were similar between the Black and White populations with RNA sequencing data; compared with mesenchymal cases, the risk of death was similar for proliferative and differentiated cases and suggestively lower for immunoreactive cases [Black population HR = 0.79 (0.55, 1.13); White population HR = 0.86 (0.62, 1.19)]. CONCLUSIONS: Although the prevalence of HGSC subtypes varied by race, subtype-specific survival was similar. IMPACT: HGSC subtypes can be consistently assigned across platforms and self-identified racial groups.
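The cross-dataset comparison step described in the methods can be sketched as a correlation between per-cluster score vectors. This is an illustration under assumptions: the score vectors below are made up, and matching each cluster to its most correlated counterpart is one plausible way to use the correlations, not necessarily the authors' procedure.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def match_clusters(scores_a, scores_b):
    """Map each cluster in dataset A to the most correlated cluster in B.

    scores_a, scores_b: {cluster_name: [score per gene]}, with genes in the
    same order in both datasets.
    """
    return {
        name_a: max(scores_b, key=lambda name_b: pearson(vec_a, scores_b[name_b]))
        for name_a, vec_a in scores_a.items()
    }


# Made-up per-gene summary scores for two clusters in each of two datasets.
dataset_a = {"A1": [2.0, -1.0, 0.5, -2.0], "A2": [-1.5, 2.0, -0.5, 1.0]}
dataset_b = {"B1": [-1.0, 1.8, -0.2, 0.9], "B2": [1.9, -0.8, 0.4, -1.7]}
matches = match_clusters(dataset_a, dataset_b)
print(matches)
```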
Subjects
Serous Cystadenocarcinoma, Ovarian Neoplasms, Humans, Female, Ovarian Neoplasms/genetics, Ovarian Neoplasms/ethnology, Ovarian Neoplasms/pathology, Ovarian Neoplasms/mortality, Serous Cystadenocarcinoma/genetics, Serous Cystadenocarcinoma/pathology, Serous Cystadenocarcinoma/ethnology, Serous Cystadenocarcinoma/mortality, Middle Aged, White People/genetics, White People/statistics & numerical data, Neoplasm Grading, Aged, Black or African American/genetics, Black or African American/statistics & numerical data
ABSTRACT
BACKGROUND: Single-cell gene expression profiling provides unique opportunities to understand tumor heterogeneity and the tumor microenvironment. Because of cost and feasibility, profiling bulk tumors remains the primary population-scale analytical strategy. Many algorithms can deconvolve these tumors using single-cell profiles to infer their composition. While experimental choices do not change the true underlying composition of the tumor, they can affect the measurements produced by the assay. RESULTS: We generated a dataset of high-grade serous ovarian tumors with paired expression profiles generated using multiple strategies to examine the extent to which experimental factors impact the results of downstream tumor deconvolution methods. We find that pooling samples for single-cell sequencing and subsequent demultiplexing has a minimal effect. We identify dissociation-induced differences that affect cell composition, leading to changes that may compromise the assumptions underlying some deconvolution algorithms. We also observe differences across mRNA enrichment methods that introduce additional discrepancies between the two data types. Finally, we find that experimental factors change cell composition estimates and that the impact differs by method. CONCLUSIONS: Previous benchmarks of deconvolution methods have largely ignored experimental factors. We find that methods vary in their robustness to experimental factors. We provide recommendations for methods developers seeking to produce the next generation of deconvolution approaches and for scientists designing experiments using deconvolution to study tumor heterogeneity.
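Why a measurement artifact shifts composition estimates can be seen in a minimal deconvolution sketch. It assumes the simplest linear model (bulk expression is a proportion-weighted sum of cell-type signatures) solved by ordinary least squares, followed by clipping and renormalization; published deconvolution methods are more elaborate. The signature matrix, the proportions and the 1.5-fold gene bias are all made-up values.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def estimate_proportions(signature, bulk):
    """Least-squares cell-type proportions, clipped at zero and rescaled to sum to 1.

    signature: genes x cell types (list of rows); bulk: one value per gene.
    """
    k = len(signature[0])
    sts = [[sum(row[i] * row[j] for row in signature) for j in range(k)] for i in range(k)]
    stb = [sum(row[i] * value for row, value in zip(signature, bulk)) for i in range(k)]
    raw = solve_linear(sts, stb)
    clipped = [max(x, 0.0) for x in raw]
    total = sum(clipped)
    return [x / total for x in clipped]


# Made-up signature (4 genes x 3 cell types) and true proportions.
signature = [[10.0, 1.0, 0.0], [1.0, 8.0, 1.0], [0.0, 2.0, 9.0], [5.0, 5.0, 5.0]]
truth = [0.6, 0.3, 0.1]
bulk = [sum(s * p for s, p in zip(row, truth)) for row in signature]

clean_estimate = estimate_proportions(signature, bulk)

# Assumed artifact: the assay over-reports the first gene by 50%.
biased_bulk = [bulk[0] * 1.5] + bulk[1:]
biased_estimate = estimate_proportions(signature, biased_bulk)
print(clean_estimate, biased_estimate)
```

The tumor's true composition is identical in both calls; only the measurement differs, yet the estimated proportions move, which mirrors the concern raised in the abstract.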
Subjects
Gene Expression Profiling, Ovarian Neoplasms, Humans, Female, Gene Expression Profiling/methods, Algorithms, RNA Sequence Analysis/methods, Ovarian Neoplasms/genetics, Transcriptome, Tumor Microenvironment
ABSTRACT
Introduction: High-grade serous carcinoma (HGSC) gene expression subtypes are associated with differential survival. We characterized HGSC gene expression in Black individuals and considered whether gene expression differences by race may contribute to poorer HGSC survival among Black versus non-Hispanic White individuals. Methods: We included newly generated RNA-Seq data from Black and White individuals, and array-based genotyping data from four existing studies of White and Japanese individuals. We assigned subtypes using K-means clustering. Cluster- and dataset-specific gene expression patterns were summarized by moderated t-scores. We compared cluster-specific gene expression patterns across datasets by calculating the correlation between the summarized vectors of moderated t-scores. Following mapping to The Cancer Genome Atlas (TCGA)-derived HGSC subtypes, we used Cox proportional hazards models to estimate subtype-specific survival by dataset. Results: Cluster-specific gene expression was similar across gene expression platforms. Comparing the Black study population to the White and Japanese study populations, the immunoreactive subtype was more common (39% versus 23%-28%) and the differentiated subtype less common (7% versus 22%-31%). Patterns of subtype-specific survival were similar between the Black and White populations with RNA-Seq data; compared to mesenchymal cases, the risk of death was similar for proliferative and differentiated cases and suggestively lower for immunoreactive cases (Black population HR=0.79 [0.55, 1.13], White population HR=0.86 [0.62, 1.19]). Conclusions: A single, platform-agnostic pipeline can be used to assign HGSC gene expression subtypes. While the observed prevalence of HGSC subtypes varied by race, subtype-specific survival was similar.
ABSTRACT
Genomic data sharing accelerates research. Data are most valuable when they are accompanied by detailed metadata. To date, metadata are often human-annotated descriptions of samples and their handling. We discuss how machine learning-derived elements complement such descriptions to enhance the research ecosystem around genomic data.
Subjects
Genomics, Metadata, Humans, Machine Learning, Neoplasms/genetics
ABSTRACT
Delivering a keynote talk at a conference organized by a scientific society or being named as a fellow by such a society indicates that a scientist is held in high regard by their colleagues. To explore if the distribution of such indicators of esteem in the field of bioinformatics reflects the composition of this field, we compared the gender, name origin, and country of affiliation of 412 honorees from the "International Society for Computational Biology" (75 fellows and 337 keynote speakers) with over 170,000 last authorships on computational biology papers between 1993 and 2019. The proportion of honors bestowed on women was similar to that of the field's overall last authorship rate. However, names of East Asian origin have been persistently underrepresented among honorees. Moreover, there were roughly twice as many honors bestowed on scientists with an affiliation in the United States as expected based on literature authorship. A record of this paper's transparent peer review process is included in the supplemental information.
Subjects
Computational Biology, Scientific Societies, Female, Humans, United States
ABSTRACT
BACKGROUND: Pooling cells from multiple biological samples prior to library preparation within the same single-cell RNA sequencing experiment provides several advantages, including lower library preparation costs and reduced unwanted technological variation, such as batch effects. Computational demultiplexing tools based on natural genetic variation between individuals provide a simple approach to demultiplex samples, which does not require complex additional experimental procedures. However, to our knowledge these tools have not been evaluated in cancer, where somatic variants, which could differ between cells from the same sample, may obscure the signal in natural genetic variation. RESULTS: Here, we performed in silico benchmark evaluations by combining raw sequencing reads from multiple single-cell samples in high-grade serous ovarian cancer, which has a high copy number burden, and lung adenocarcinoma, which has a high tumor mutational burden. Our results confirm that genetic demultiplexing tools can be effectively deployed on cancer tissue using a pooled experimental design, although high proportions of ambient RNA from cell debris reduce performance. CONCLUSIONS: This strategy provides significant cost savings through pooled library preparation. To facilitate similar analyses at the experimental design phase, we provide freely accessible code and a reproducible Snakemake workflow built around the best-performing tools found in our in silico benchmark evaluations, available at https://github.com/lmweber/snp-dmx-cancer.
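The principle behind genotype-based demultiplexing can be sketched in a few lines. This is a simplified illustration, not the workflow in the repository above: it assumes donor genotypes are already known, scores each cell by a binomial log-likelihood of its allele counts, and uses made-up SNP names, genotypes and read counts.

```python
import math


def allele_log_likelihood(ref_reads, alt_reads, alt_dosage, error=0.01):
    """Log-likelihood of allele counts given a genotype.

    alt_dosage is the number of alternate alleles (0, 1 or 2); `error` is an
    assumed per-read error rate that keeps probabilities away from 0 and 1.
    """
    p_alt = min(max(alt_dosage / 2, error), 1 - error)
    return alt_reads * math.log(p_alt) + ref_reads * math.log(1 - p_alt)


def assign_cell(cell_counts, donor_genotypes):
    """Assign a cell to the donor whose genotypes best explain its reads.

    cell_counts: {snp: (ref_reads, alt_reads)} observed in one cell.
    donor_genotypes: {donor: {snp: alt_dosage}}.
    Returns (best_donor, {donor: log_likelihood}).
    """
    scores = {
        donor: sum(
            allele_log_likelihood(ref, alt, genotypes[snp])
            for snp, (ref, alt) in cell_counts.items()
            if snp in genotypes
        )
        for donor, genotypes in donor_genotypes.items()
    }
    return max(scores, key=scores.get), scores


# Made-up genotypes for two donors at three SNPs.
donors = {
    "donor_A": {"snp1": 0, "snp2": 2, "snp3": 1},
    "donor_B": {"snp1": 2, "snp2": 0, "snp3": 1},
}
# A cell from donor_A with one discordant read at snp1, standing in for
# a somatic variant or sequencing error.
cell = {"snp1": (2, 1), "snp2": (0, 3)}
best, scores = assign_cell(cell, donors)
print(best)
```

In this toy case the single discordant read lowers the score for the correct donor but does not change the assignment, because the remaining germline sites still agree; this is the kind of robustness to somatic variation that the benchmark evaluates.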